# Machine learning engineering exercise
## The goal of this exercise is to apply a machine learning model to classify a DNS domain as genuine (more precisely, user-typed) or auto-generated

## Content:

- Dataset
- Pre-processing
- Build Model
- Training
- Plots
- Reference

## Dataset

- <b>Genuine dataset:</b> Alexa top 0.5 million domains
- <b>Malware dataset:</b> Four different families taken from https://github.com/baderj/domain_generation_algorithms
- <b>Malware families considered:</b> banjori, chinad, corebot and fobber


In [2]:
from data import get_data, genuine_data, malware_data, gather_data, domain_name_dictionary
import pandas as pd

In [3]:
genuine_df = genuine_data()
genuine_df.head()

Unnamed: 0,domain,flag
0,google.com,1
1,youtube.com,1
2,tmall.com,1
3,baidu.com,1
4,sohu.com,1


In [4]:
genuine_df.tail()

Unnamed: 0,domain,flag
408750,websitedeveloperzone.com,1
408751,wewritecontentforyou.com,1
408752,yourcodercamp.com,1
408753,zapsend.co,1
408754,zirconia.co.il,1


In [5]:
malware_df = malware_data()
malware_df.head()

Unnamed: 0,domain,flag
0,vhkintjtksyxgjrzz.net,0
1,btpnxlsfdqbhzazyx.net,0
2,ukfmknjdenthvktgc.net,0
3,qupxsrhrmuoinqrit.net,0
4,gjsbydmrpfzsmnfiu.net,0


In [6]:
malware_df.tail()

Unnamed: 0,domain,flag
995,txszestnessbiophysicalohax.com,0
996,lrvxestnessbiophysicalohax.com,0
997,bvosestnessbiophysicalohax.com,0
998,moskestnessbiophysicalohax.com,0
999,pdzyestnessbiophysicalohax.com,0


In [7]:
df = pd.concat([genuine_df, malware_df])
print(genuine_df.shape)
print(malware_df.shape)
print(df.shape)

(408755, 2)
(4256, 2)
(413011, 2)


In [8]:
df = gather_data()
df.shape

(413011, 2)
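
The shapes printed above imply a heavily imbalanced dataset, which matters when reading the accuracy figures later on. A minimal sketch of that arithmetic, using only the row counts shown above (the variable names are illustrative and not part of the `data` module):

```python
# Row counts taken from the shapes printed above.
n_genuine = 408755
n_malware = 4256
n_total = n_genuine + n_malware

# Share of each class in the combined DataFrame.
malware_fraction = n_malware / n_total
genuine_fraction = n_genuine / n_total

# Accuracy of a trivial classifier that always predicts "genuine" (flag 1).
majority_baseline_accuracy = genuine_fraction

print(f"total rows: {n_total}")
print(f"malware share: {malware_fraction:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Since only about 1% of the rows are malware, an accuracy close to 0.99 is reachable by always predicting "genuine", so per-class metrics such as precision and recall on the malware class may be more informative than accuracy alone.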

### TensorFlow tf.data.Dataset (https://www.tensorflow.org/api_docs/python/tf/data/Dataset)

Advantage: Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.


In [9]:
EPOCHS = 10
MAX_LENGTH = 100
BATCH_SIZE = 64
SHUFFLE_BUFFER = 50

MODEL_PATH = "/tmp/models/"

In [10]:
train_dataset, valid_dataset = get_data(
        max_length=MAX_LENGTH,
        batch_size=BATCH_SIZE,
        shuffle_buffer=SHUFFLE_BUFFER,
)
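
To make the `shuffle_buffer` and `batch_size` arguments above more concrete, here is a conceptual sketch in plain Python of buffered shuffling followed by batching over a stream. It is not the `tf.data` implementation and not the code inside `get_data`; the function names `shuffle_with_buffer` and `batch` are hypothetical. It only illustrates the fill-then-sample idea that a shuffle buffer is based on.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_with_buffer(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and store the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


def batch(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream into lists of batch_size items (the last may be smaller)."""
    current: List[T] = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


# The generator is consumed lazily, so the full stream is never held in memory.
stream = (i for i in range(10))
batches = list(batch(shuffle_with_buffer(stream, buffer_size=4), batch_size=3))
print(batches)
```

With a buffer much smaller than the dataset (50 against roughly 413,000 rows here), an element can only move a limited distance from its original position, so the order of the underlying data still influences what each batch contains.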

In [11]:
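# Helper function to convert a numerical sequence back to a domain string
inv_map = {v: k for k, v in domain_name_dictionary.items()}

def get_domain_from_sequence(sequence):
    domain = ""
    for num in sequence:
        # 0 is treated as padding and skipped. The digit '0' also maps to 0
        # in domain_name_dictionary, so it is dropped from the decoded string too.
        if num != 0:
            domain += inv_map[num]
    return domain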

In [13]:
for feature_batch in train_dataset.take(1).as_numpy_iterator():
    
    print("Shape of the Text data: batch_size x MAX_LENGTH")
    print(list(feature_batch[0].shape))
    print()
    
    print("Shape of Label: batch_size")
    print(list(feature_batch[1].shape))
    print()
    
    print("First Data in the first batch of dataset")
    print("Sequence: ", feature_batch[0][0])
    print("Domain for Sequence: ", get_domain_from_sequence(feature_batch[0][0]))
    print("Genuine(1)/Malware(0): ", feature_batch[1][0])
    
    

Shape of the Text data: batch_size x MAX_LENGTH
[64, 100]

Shape of Label: batch_size
[64]

First Data in the first batch of dataset
Sequence:  [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 15 15 32
 23 12 32 35]
Domain for Sequence:  aari.ru
Genuine(1)/Malware(0):  1


## Pre-processing

- Every domain is padded so that the model receives inputs of constant length.
- MAX_LENGTH sets the padded length.
- The default padding method is used, that is, padding is applied at the beginning of the sequence.
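
The padding step described above can be sketched with NumPy as follows. This assumes the "default" method behaves like Keras `pad_sequences` with `padding='pre'` and `truncating='pre'`. The source does not show the actual implementation, and `pad_pre` is a hypothetical name.

```python
import numpy as np


def pad_pre(sequence, max_length, pad_value=0):
    """Left-pad (or left-truncate) a sequence of ints to exactly max_length."""
    # If the sequence is too long, keep only its last max_length items.
    trimmed = list(sequence)[-max_length:]
    padded = np.full(max_length, pad_value, dtype=np.int32)
    if trimmed:
        # Write the values at the end so the padding sits at the beginning.
        padded[max_length - len(trimmed):] = trimmed
    return padded


print(pad_pre([5, 6, 7], max_length=6))         # short input: zeros are prepended
print(pad_pre([1, 2, 3, 4, 5], max_length=3))   # long input: leading items are dropped
```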

In [14]:
from data import domain_to_ints, prep_data, prep_dataframe

In [14]:
domain_name_dictionary

{'0': 0,
 '1': 1,
 '2': 2,
 '3': 3,
 '4': 4,
 '5': 5,
 '6': 6,
 '7': 7,
 '8': 8,
 '9': 9,
 ':': 10,
 '-': 11,
 '.': 12,
 '/': 13,
 '_': 14,
 'a': 15,
 'b': 16,
 'c': 17,
 'd': 18,
 'e': 19,
 'f': 20,
 'g': 21,
 'h': 22,
 'i': 23,
 'j': 24,
 'k': 25,
 'l': 26,
 'm': 27,
 'n': 28,
 'o': 29,
 'p': 30,
 'q': 31,
 'r': 32,
 's': 33,
 't': 34,
 'u': 35,
 'v': 36,
 'w': 37,
 'x': 38,
 'y': 39,
 'z': 40,
 nan: 41}

<b>Helper function to convert the characters to ints</b>

In [15]:
print(domain_to_ints("q"))
print(domain_to_ints("qwerty"))
print(domain_to_ints("qwerty$a"))

[31]
[31, 37, 19, 32, 34, 39]
[31, 37, 19, 32, 34, 39, 41, 15]
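
For readers without access to the `data` module, the mapping above can be approximated as follows. This is a sketch that rebuilds the table from the dictionary printed earlier. The names `char_to_int`, `UNKNOWN` and `encode` are hypothetical, and lowercasing the input is an assumption rather than something the source confirms.

```python
import string

# Rebuild the character table shown above: digits, five symbols, then a-z.
alphabet = string.digits + ":-./_" + string.ascii_lowercase
char_to_int = {ch: idx for idx, ch in enumerate(alphabet)}

# Any character outside the table falls into slot 41 (the nan entry above).
UNKNOWN = len(alphabet)


def encode(domain):
    """Map each character of a domain to its integer code."""
    # Lowercasing is an assumption made for this sketch.
    return [char_to_int.get(ch, UNKNOWN) for ch in domain.lower()]


print(encode("q"))
print(encode("qwerty"))
print(encode("qwerty$a"))
```

One property of this table worth keeping in mind: the digit `'0'` is encoded as 0, the same value used for padding, so a padded sequence cannot distinguish a literal `'0'` from padding.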


<b>Helper function to convert an entire DataFrame to sequences and labels</b>

In [16]:
sample_dict = {
    "domain": ["tigera.io", "github.com", "zfdzktxu0h2dhcul.net"],
    "flag": [1, 1, 0]
}
sample_df = pd.DataFrame(sample_dict)

In [17]:
X, y = prep_dataframe(sample_df, max_length=20)
print(f"Sequence: {X[0]}, Domain: {get_domain_from_sequence(X[0])}, Flag: {y[0]}")

Sequence: [ 0  0  0  0  0  0  0  0  0  0  0 34 23 21 19 32 15 12 23 29], Domain: tigera.io, Flag: 1


<b>Helper function to convert a list of domains to sequences</b>

In [18]:
domains = ["tigera.io", "github.com", "zfdzktxu0h2dhcul.net"]
prep_data(domains, max_length=20)

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 34, 23, 21, 19, 32,
        15, 12, 23, 29],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 21, 23, 34, 22, 35, 16,
        12, 17, 29, 27],
       [40, 20, 18, 40, 25, 34, 38, 35,  0, 22,  2, 18, 22, 17, 35, 26,
        12, 28, 19, 34]], dtype=int32)

## Building Model

In [19]:
from main import build_model

In [20]:
model = build_model()
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input (InputLayer)           [(None, 100)]             0         
_________________________________________________________________
Embedding (Embedding)        (None, 100, 128)          16384     
_________________________________________________________________
Conv1 (Conv1D)               (None, 100, 128)          49280     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 50, 128)           0         
_________________________________________________________________
Conv2 (Conv1D)               (None, 50, 128)           32896     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 25, 128)           0         
_________________________________________________________________
flatten (Flatten)            (None, 3200)              0     

### The above model is a CNN-based architecture using 1D convolution layers

- Input layer: takes an integer matrix of size (batch_size, max_length). Each character position corresponds to one feature column.
- Embedding layer: learns to represent each character that can occur in a domain name as a 128-dimensional numerical vector. These vectors are learned as the model trains.
- Two convolution layers with 128 filters each are used. Each is followed by a max-pooling layer that keeps the maximum over each window and so halves the sequence length (100 to 50, then 50 to 25). The result is flattened into a fixed-length vector of 3200 values per example.
- Dropout is used to reduce overfitting by randomly dropping units during training.
- Finally, a fully connected Dense layer with a single output node and a sigmoid activation produces the prediction.
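
The parameter counts and output shapes in the summary can be checked with a little arithmetic. The sketch below assumes a vocabulary size of 128, kernel sizes of 3 and 2, length-preserving convolutions and a pool size of 2. None of these values is stated in the source; they are inferred because they reproduce the numbers printed by `model.summary()`.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter.
    return kernel_size * in_channels * filters + filters


def pooled_length(length, pool_size):
    # Non-overlapping pooling with stride equal to pool_size and no padding.
    return length // pool_size


# Values inferred from the summary (assumptions, see the text above).
print(embedding_params(128, 128))    # Embedding row of the summary
print(conv1d_params(3, 128, 128))    # Conv1 row of the summary
print(conv1d_params(2, 128, 128))    # Conv2 row of the summary

# Sequence length after the two pooling layers, then the Flatten size.
final_length = pooled_length(pooled_length(100, 2), 2)
flatten_size = final_length * 128
print(final_length, flatten_size)
```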

## Training Model

In [21]:
from main import train

In [22]:
history = train(
        train_dataset,
        valid_dataset,
    )

Epoch 1/10
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: unsupported operand type(s) for -: 'NoneType' and 'int'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: unsupported operand type(s) for -: 'NoneType' and 'int'
   1/5486 [..............................] - ETA: 44:32 - loss: 0.7062 - mae: 0.5062 - mean_squared_error: 0.2565 - acc: 0.3594

2021-07-19 18:35:45.473452: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-07-19 18:35:45.474904: W tensorflow/core/platform/profile_utils/cpu_utils.cc:126] Failed to get CPU frequency: 0 Hz


  97/5486 [..............................] - ETA: 1:15 - loss: 0.0525 - mae: 0.0402 - mean_squared_error: 0.0165 - acc: 0.9659

KeyboardInterrupt: 
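
The progress bar above reports `loss`, `mae`, `mean_squared_error` and `acc`. As a rough guide to how such values relate to the sigmoid outputs, here is a NumPy sketch. It assumes the loss is binary cross-entropy, which the source does not show since the body of `train` is not included, and `binary_metrics` is a hypothetical helper.

```python
import numpy as np


def binary_metrics(y_true, y_prob, eps=1e-7):
    """Compute loss and metrics for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip probabilities so that log() never receives exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1 - eps)

    # Binary cross-entropy (assumed loss, see the text above).
    bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    mae = np.mean(np.abs(y_true - y_prob))
    mse = np.mean((y_true - y_prob) ** 2)
    # A prediction counts as "genuine" (1) when the probability exceeds 0.5.
    acc = np.mean((y_prob > 0.5) == (y_true == 1))
    return {"loss": float(bce), "mae": float(mae), "mean_squared_error": float(mse), "acc": float(acc)}


print(binary_metrics([1, 1, 0, 1], [0.9, 0.6, 0.2, 0.4]))
```

Under this assumption, a model that outputs 0.5 for every example would score a loss of ln 2 (about 0.693), an MAE of 0.5 and an MSE of 0.25, which is close to the values logged at the first step above.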

## Plots

In [None]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  

In [None]:
from data import plot_curve_v2

In [None]:
plot_curve_v2(history, 'acc')

## Reference

- http://faculty.washington.edu/mdecock/papers/byu2018a.pdf
- https://www.tensorflow.org/tutorials/keras/text_classification
- https://www.tensorflow.org/text/guide/word_embeddings#using_the_embedding_layer
- https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool1D
- https://www.tensorflow.org/responsible_ai/fairness_indicators/tutorials/Fairness_Indicators_TFCO_Wiki_Case_Study