# Domain Generation Algorithm (DGA) Detection

## Authors
 - Gorkem Batmaz (NVIDIA) [gbatmaz@nvidia.com]
 - Bhargav Suryadevara (NVIDIA) [bsuryadevara@nvidia.com]

## Development Notes
* Developed using: RAPIDS v0.12.0 and CLX v0.12
* Last tested using: RAPIDS v0.12.0 and CLX v0.12 on Jan 28, 2020

## Table of Contents
* Introduction
* Data Importing
* Data Preprocessing
* Training and Evaluation
* Inference
* Conclusion

## Introduction
[Domain Generation Algorithms](https://en.wikipedia.org/wiki/Domain_generation_algorithm) (DGAs) generate domain names that malware can use to communicate with its command and control servers. IP addresses and static domain names can be easily blocked, so a DGA gives malware an easy way to generate a large number of domain names and rotate through them to circumvent traditional block lists. We will use a type of recurrent neural network called the [Gated Recurrent Unit](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21) (GRU) for this example. The [CLX](https://github.com/rapidsai/clx) and [RAPIDS](https://rapids.ai) libraries enable users to train their models with up-to-date domain names representative of both benign and DGA-generated strings. Using a CLX workflow, this capability could also be used in production. This notebook walks through the data science workflow for building a DGA detection implementation.
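To make the idea of a DGA concrete, here is a minimal toy sketch. It is not taken from any real malware family; the function name `toy_dga`, the seed string, and the hashing scheme are all illustrative assumptions. It only shows how a shared seed and the current date are enough for two parties to derive the same list of candidate domains.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, length=12, tld=".com"):
    """Generate `count` pseudo-random domains from a shared seed and a date.

    Hypothetical illustration only; real DGAs differ widely in design.
    """
    domains = []
    for i in range(count):
        # Both sides can recompute this digest independently, so they
        # agree on the same candidate domains without communicating.
        material = "{}-{}-{}".format(seed, day.isoformat(), i).encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) onto a lowercase letter (a-p).
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        domains.append(label + tld)
    return domains


print(toy_dga("example-seed", date(2020, 1, 28)))
```

Because the output changes with the date, a static block list built yesterday does not cover today's domains, which is why a classifier that looks at the character patterns of the name itself is attractive.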

In [1]:
import os
import time
import cudf
import torch
import s3fs
import numpy as np
from datetime import datetime
from sklearn.metrics import accuracy_score, average_precision_score
from clx.utils.data.dga_dataset import DGADataset
from clx.utils.data.dataloader import DataLoader
from clx.analytics.dga_detector import DGADetector
from cuml.preprocessing.model_selection import train_test_split

#### Download Input Dataset from S3

In [2]:
INPUT_CSV = "benign_and_dga_domains.csv"

S3_BASE_PATH = "rapidsai-data/cyber/clx"

In [3]:
# Read Benign and DGA dataset
if not os.path.exists(INPUT_CSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + INPUT_CSV, INPUT_CSV)

#### Load Input Dataset to GPU Dataframe

In [4]:
input_df = cudf.read_csv(INPUT_CSV)

#### Create Train and Test Dataset
We utilize the [`train_test_split` function](https://docs.rapids.ai/api/cuml/0.10/api.html#model-selection-and-data-splitting) from [cuML](https://github.com/rapidsai/cuml) and create a shuffled dataset for training and testing.

In [5]:
domain_train, domain_test, type_train, type_test = train_test_split(input_df, 'type', train_size=0.7)
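For readers without a GPU at hand, the effect of a shuffled 70/30 split can be sketched in plain Python. This is only an illustration of the idea, not how cuML implements `train_test_split`; the helper name `shuffled_split` and the toy rows are assumptions.

```python
import random


def shuffled_split(rows, train_size=0.7, seed=0):
    """Shuffle `rows` and split them into train and test lists."""
    indices = list(range(len(rows)))
    # A seeded generator keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Toy (domain, type) rows standing in for the real dataset.
rows = [("domain{}.com".format(i), i % 2) for i in range(10)]
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two lists, so the test set stays unseen during training.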

In [6]:
def create_df(domain_df, type_series):
    df = cudf.DataFrame()
    df['domain'] = domain_df['domain'].reset_index(drop=True)
    df['type'] = type_series.reset_index(drop=True)
    return df

In [7]:
train_df = create_df(domain_train, type_train)
test_df = create_df(domain_test, type_test)

In [8]:
train_dataset = DGADataset(train_df)
test_dataset = DGADataset(test_df)

#### Create Batches using DataLoader
To train and test the model, we partition each dataset into one or more smaller dataframes of the given batch size.

In [9]:
BATCH_SIZE = 10000
train_dataloader = DataLoader(train_dataset, batchsize=BATCH_SIZE)
test_dataloader = DataLoader(test_dataset, batchsize=BATCH_SIZE)
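The batching step can be pictured with a short plain-Python sketch. It does not reproduce the CLX `DataLoader`; the generator name `iter_batches` is an assumption, and lists stand in for GPU dataframes.

```python
import math


def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


domains = ["domain{}.com".format(i) for i in range(25)]
batches = list(iter_batches(domains, 10))
print([len(b) for b in batches])  # prints [10, 10, 5]

# With the 614,179 test rows reported in the evaluation output later in
# this notebook, a batch size of 10,000 would give this many chunks:
print(math.ceil(614179 / 10000))  # prints 62
```

Only the last batch may be smaller than the batch size; larger batches mean fewer, bigger transfers of work to the GPU.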

Because we have only benign and DGA (malicious) categories, the number of domain types needs to be set to 2 (`N_DOMAIN_TYPE=2`). The vocabulary size (`CHAR_VOCAB`) is set to 128, covering the ASCII character set. The values set below for `HIDDEN_SIZE`, `N_LAYERS`, and `LR` (learning rate) gave a good balance between network size and performance on this dataset. They may need to be tuned experimentally when working with other datasets.

In [10]:
LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2
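A vocabulary of 128 entries is enough when each character of a domain is represented by its ASCII code point. The sketch below shows one way such an encoding could look. It is an assumption for illustration, not the CLX implementation: the function name `encode_domain`, the lowercasing, and the zero padding are all hypothetical choices.

```python
def encode_domain(domain, max_len, char_vocab=128):
    """Turn a domain string into a fixed-length list of ASCII code points."""
    codes = [ord(ch) for ch in domain.lower()]
    # Every index must fall inside the embedding table of size char_vocab.
    if any(code >= char_vocab for code in codes):
        raise ValueError("domain contains characters outside the vocabulary")
    # Truncate long domains and pad short ones with 0 (assumed padding id).
    return codes[:max_len] + [0] * (max_len - len(codes))


print(encode_domain("nvidia.com", 12))
# prints [110, 118, 105, 100, 105, 97, 46, 99, 111, 109, 0, 0]
```

Each integer would then index a row of an embedding table with `CHAR_VOCAB` rows before the sequence is passed through the recurrent layers.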

#### Instantiate DGA Detector
Now that the data is ready, the datasets are created, and we've set the parameters for the model, we can use the `DGADetector` class built into CLX to create and train the model.

In [11]:
dd = DGADetector(lr=LR)
dd.init_model(n_layers=N_LAYERS, char_vocab=CHAR_VOCAB, hidden_size=HIDDEN_SIZE, n_domain_type=N_DOMAIN_TYPE)

In [12]:
def create_dir(dir_path):
    print("Verify if directory `%s` is already exists." % (dir_path))
    if not os.path.exists(dir_path):
        print("Directory `%s` does not exists." % (dir_path))
        print("Creating directory `%s` to store trained models." % (dir_path))
        os.makedirs(dir_path)

In [13]:
def cleanup_cache():
    # release memory.
    torch.cuda.empty_cache()

In [14]:
def train_and_eval(dd, train_dataloader, test_dataloader, epoch, model_dir):
    print("Initiating model training")
    create_dir(model_dir)
    max_accuracy = 0
    prev_model_file_path = ""
    for i in range(1, epoch + 1):
        print("---------")
        print("Epoch: %s" % (i))
        print("---------")
        dd.train_model(train_dataloader)
        accuracy = dd.evaluate_model(test_dataloader)
        now = datetime.now()
        output_filepath = (
            model_dir
            + "/"
            + "rnn_classifier_{}.pth".format(now.strftime("%Y-%m-%d_%H_%M_%S"))
        )
        if accuracy > max_accuracy:
            dd.save_model(output_filepath)
            max_accuracy = accuracy
            if prev_model_file_path:
                os.remove(prev_model_file_path)
            prev_model_file_path = output_filepath
    print("Model with highest accuracy (%s) is stored to location %s" % (max_accuracy, prev_model_file_path))
    return prev_model_file_path

### Training and Evaluation
Using the function we created above, we now train and evaluate the model.
*NOTE: You may see warnings when you run the training due to a [bug in PyTorch](https://github.com/pytorch/pytorch/issues/27972) which is being actively investigated.*

In [15]:
%%time
epoch = 30
model_dir='/trained_models'
model_filepath = train_and_eval(dd, train_dataloader, test_dataloader, epoch, model_dir)
cleanup_cache()

Initiating model training
Verify if directory `/trained_models` is already exists.
---------
Epoch: 1
---------


  return libdlpack.to_dlpack(gdf_cols)


Test set: Accuracy: 395451/614179 (0.643869295433416)

---------
Epoch: 2
---------
Test set: Accuracy: 539359/614179 (0.8781788371142615)

---------
Epoch: 3
---------
Test set: Accuracy: 582241/614179 (0.9479988732926394)

---------
Epoch: 4
---------
Test set: Accuracy: 593620/614179 (0.9665260453385739)

---------
Epoch: 5
---------
Test set: Accuracy: 594871/614179 (0.9685629108126458)

---------
Epoch: 6
---------
Test set: Accuracy: 595502/614179 (0.9695902985937325)

---------
Epoch: 7
---------
Test set: Accuracy: 597311/614179 (0.9725356939914911)

---------
Epoch: 8
---------
Test set: Accuracy: 599865/614179 (0.9766940908106594)

---------
Epoch: 9
---------
Test set: Accuracy: 598551/614179 (0.9745546493774616)

---------
Epoch: 10
---------
Test set: Accuracy: 602861/614179 (0.9815721475335366)

---------
Epoch: 11
---------
Test set: Accuracy: 603683/614179 (0.9829105195716559)

---------
Epoch: 12
---------
Test set: Accuracy: 604999/614179 (0.9850532173845085)

-------

### Inference

Using the model generated above, we now score the test dataset to determine whether each domain is benign or likely generated by a DGA.

In [16]:
dd = DGADetector()
dd.load_model(model_filepath)

pred_results = []
true_results = []
for chunk in test_dataloader.get_chunks():
    pred_results.append(list(dd.predict(chunk['domain']).values_host))
    true_results.append(list(chunk['type'].values_host))
pred_results = np.concatenate(pred_results)
true_results = np.concatenate(true_results)
# Use a separate name so the imported accuracy_score function is not shadowed.
accuracy = accuracy_score(true_results, pred_results)
print('Model accuracy: %s'%(accuracy))
cleanup_cache()

Model accuracy: 0.9916994882599373


In [17]:
average_precision = average_precision_score(true_results, pred_results)

print('Average precision score: {0:0.3f}'.format(average_precision))

Average precision score: 0.978
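Since the same predictions are passed to `accuracy_score`, they appear to be hard class labels rather than probabilities. In that case average precision, which scikit-learn defines as the sum over thresholds of (R_n - R_{n-1}) * P_n, reduces to a two-threshold expression. The sketch below computes both metrics in plain Python under that assumption; the function name `binary_metrics` and the toy labels are illustrative, and the positive class is assumed to be labeled 1.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy and average precision for hard 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Share of positives; this is the precision when everything is flagged.
    prevalence = (tp + fn) / len(pairs)

    # Two operating points: "flag predicted positives" and "flag everything".
    average_precision = recall * precision + (1.0 - recall) * prevalence
    return accuracy, average_precision


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))  # prints (0.75, 0.6875)
```

Passing predicted probabilities instead of hard labels would let average precision summarize the full precision-recall curve rather than a single operating point.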


## Conclusion

The DGA detector in CLX enables users to train their own detection models and to use existing ones. This capability can also be used in conjunction with log parsing when the logs contain domain names. DGA detection with CLX and RAPIDS keeps data in GPU memory, removing unnecessary copies and conversions and providing a 4X speed advantage over CPU-only implementations. This is especially true with large batch sizes.