# Domain Generation Algorithm (DGA) Detection

## Authors
 - Gorkem Batmaz (NVIDIA) [gbatmaz@nvidia.com]
 - Bhargav Suryadevara (NVIDIA) [bsuryadevara@nvidia.com]

## Development Notes
* Developed using: RAPIDS v0.12.0 and CLX v0.12
* Last tested using: RAPIDS v0.12.0 and CLX v0.12 on Jan 28, 2020

## Table of Contents
* Introduction
* Data Importing
* Data Preprocessing
* Training and Evaluation
* Inference
* Conclusion

## Introduction
[Domain Generation Algorithms](https://en.wikipedia.org/wiki/Domain_generation_algorithm) (DGAs) generate domain names that malware uses to communicate with its command and control servers. IP addresses and static domain names are easy to block, whereas a DGA can generate a large number of domain names and rotate through them to circumvent traditional block lists. We will use a type of recurrent neural network called the [Gated Recurrent Unit](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21) (GRU) for this example. The [CLX](https://github.com/rapidsai/clx) and [RAPIDS](https://rapids.ai) libraries enable users to train their models with up-to-date domain names representative of both benign and DGA-generated strings. Using a CLX workflow, this capability could also be used in production. This notebook walks through the data science workflow for building a DGA detection implementation.
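To make the idea concrete, here is a minimal sketch of how a DGA can derive many domain names from a shared seed and the current date. It is a toy illustration only and is not taken from any real malware family or from the dataset used in this notebook. The function name `toy_dga`, the seed string, and the reserved `.example` TLD are all hypothetical.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, tld=".example"):
    """Derive `count` pseudo-random domain names from a seed and a date.

    Hypothetical helper for illustration; real DGAs vary widely.
    """
    domains = []
    for i in range(count):
        material = "{}-{}-{}".format(seed, day.isoformat(), i).encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits onto the letters a-p to get an alphabetic label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains


domains = toy_dga("hypothetical-seed", date(2020, 1, 28))
print(domains)
```

Because both sides can recompute the same list from the seed and the date, blocking any single name does little, which is why a classifier that looks at the character patterns of the names themselves is useful.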

In [1]:
import os
import cudf
import torch
import s3fs
import numpy as np
from datetime import datetime
from sklearn.metrics import accuracy_score, average_precision_score
from clx.analytics.dga_detector import DGADetector
from clx.utils.data.dataloader import DataLoader
from clx.analytics.dga_dataset import DGADataset
from cuml.preprocessing.model_selection import train_test_split

#### Download Input Dataset from S3

In [2]:
INPUT_CSV = "benign_and_dga_domains.csv"

S3_BASE_PATH = "rapidsai-data/cyber/clx"

In [3]:
# Read Benign and DGA dataset
if not os.path.exists(INPUT_CSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + INPUT_CSV, INPUT_CSV)

#### Load Input Dataset to GPU Dataframe

In [4]:
gdf = cudf.read_csv(INPUT_CSV)

In [5]:
train_data = gdf['domain']
labels = gdf['type']

Because we have only benign and DGA (malicious) categories, the number of domain types needs to be set to 2 (`N_DOMAIN_TYPE=2`). The vocabulary size (`CHAR_VOCAB`) is set to 128 to cover the ASCII characters. The values set below for `HIDDEN_SIZE`, `N_LAYERS`, and `LR` (learning rate) gave a good balance between network size and performance on this dataset. They might need to be tuned experimentally when working with other datasets.
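The vocabulary size of 128 follows from treating each character of a domain as its ASCII code. The sketch below shows one way to turn domain strings into padded integer sequences of that kind. It is an assumption about the general approach, not CLX's internal implementation, and the helper name `encode_domains` and the padding value are hypothetical.

```python
CHAR_VOCAB = 128  # same value as used in this notebook


def encode_domains(domains, pad_value=0):
    """Return (padded_codes, lengths) for a list of ASCII domain strings.

    Hypothetical helper for illustration only.
    """
    encoded = []
    for d in domains:
        codes = [ord(ch) for ch in d.lower()]
        # Every code must index into a vocabulary of CHAR_VOCAB entries.
        if any(c >= CHAR_VOCAB for c in codes):
            raise ValueError("non-ASCII character in {!r}".format(d))
        encoded.append(codes)
    lengths = [len(c) for c in encoded]
    max_len = max(lengths)
    # Right-pad shorter sequences so the batch is rectangular.
    padded = [c + [pad_value] * (max_len - len(c)) for c in encoded]
    return padded, lengths


padded, lengths = encode_domains(["nvidia.com", "ab.io"])
print(padded)
print(lengths)
```

Keeping the original lengths alongside the padded codes lets a recurrent model ignore the padding positions.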

In [6]:
LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2

#### Instantiate DGA Detector
Now that the data is loaded and the model parameters are set, we can use the `DGADetector` class built into CLX to create and train the model.

In [7]:
dd = DGADetector(lr=LR)
dd.init_model(n_layers=N_LAYERS, char_vocab=CHAR_VOCAB, hidden_size=HIDDEN_SIZE, n_domain_type=N_DOMAIN_TYPE)
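For readers unfamiliar with GRUs, the following is a minimal pure-Python sketch of a single GRU update step. It uses the gate convention documented for PyTorch's `nn.GRU`; whether CLX's model matches this layout exactly is an assumption. The names `gru_step`, `matvec`, and the parameter dictionary keys are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def gru_step(x, h, p):
    """One GRU update. W_* act on the input x, U_* on the hidden state h."""
    # Update gate z and reset gate r.
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_z"], x), matvec(p["U_z"], h), p["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], x), matvec(p["U_r"], h), p["b_r"])]
    # Candidate state n: the reset gate scales the contribution of h.
    uh = matvec(p["U_n"], h)
    n = [math.tanh(a + ri * b + c) for a, ri, b, c in
         zip(matvec(p["W_n"], x), r, uh, p["b_n"])]
    # Blend the candidate state with the previous hidden state.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new hidden state is half of the previous one.
params = {
    "W_z": zeros(2, 3), "U_z": zeros(2, 2), "b_z": [0.0, 0.0],
    "W_r": zeros(2, 3), "U_r": zeros(2, 2), "b_r": [0.0, 0.0],
    "W_n": zeros(2, 3), "U_n": zeros(2, 2), "b_n": [0.0, 0.0],
}
h1 = gru_step([1.0, 0.0, 0.0], [0.4, -0.2], params)
print(h1)
```

In a character-level classifier, a step like this would run once per character of the domain, and the final hidden state would feed a small output layer with `N_DOMAIN_TYPE` outputs.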

In [8]:
EPOCHS = 25
TRAIN_SIZE = 0.7
BATCH_SIZE = 10000
MODELS_DIR = 'models'

### Training and Evaluation
Now we train and evaluate the model.

In [9]:
%%time
dd.train_model(train_data, labels, batch_size=BATCH_SIZE, epochs=EPOCHS, train_size=TRAIN_SIZE)

Epoch:   4%|▍         | 1/25 [00:24<09:49, 24.55s/it]

Test set: Accuracy: 421091/614179 (0.6856160826078391)



Epoch:   8%|▊         | 2/25 [00:44<08:50, 23.06s/it]

Test set: Accuracy: 534782/614179 (0.8707266122742718)



Epoch:  12%|█▏        | 3/25 [01:04<08:06, 22.12s/it]

Test set: Accuracy: 579823/614179 (0.9440619102899969)



Epoch:  16%|█▌        | 4/25 [01:23<07:30, 21.45s/it]

Test set: Accuracy: 592368/614179 (0.9644875516746746)



Epoch:  20%|██        | 5/25 [01:43<06:59, 21.00s/it]

Test set: Accuracy: 594845/614179 (0.9685205778771335)



Epoch:  24%|██▍       | 6/25 [02:04<06:37, 20.94s/it]

Test set: Accuracy: 596958/614179 (0.9719609429824204)



Epoch:  28%|██▊       | 7/25 [02:24<06:10, 20.59s/it]

Test set: Accuracy: 598526/614179 (0.9745139446317768)



Epoch:  32%|███▏      | 8/25 [02:44<05:47, 20.44s/it]

Test set: Accuracy: 600067/614179 (0.9770229851557933)



Epoch:  36%|███▌      | 9/25 [03:04<05:24, 20.29s/it]

Test set: Accuracy: 602612/614179 (0.9811667282665152)



Epoch:  40%|████      | 10/25 [03:23<05:00, 20.04s/it]

Test set: Accuracy: 603439/614179 (0.9825132412537713)



Epoch:  44%|████▍     | 11/25 [03:43<04:40, 20.01s/it]

Test set: Accuracy: 604280/614179 (0.983882548898611)



Epoch:  48%|████▊     | 12/25 [04:03<04:20, 20.01s/it]

Test set: Accuracy: 604919/614179 (0.9849229621983168)



Epoch:  52%|█████▏    | 13/25 [04:23<03:58, 19.90s/it]

Test set: Accuracy: 605363/614179 (0.9856458784816804)



Epoch:  56%|█████▌    | 14/25 [04:43<03:40, 20.01s/it]

Test set: Accuracy: 605379/614179 (0.9856719295189188)



Epoch:  60%|██████    | 15/25 [05:04<03:20, 20.09s/it]

Test set: Accuracy: 604896/614179 (0.9848855138322867)



Epoch:  64%|██████▍   | 16/25 [05:23<03:00, 20.04s/it]

Test set: Accuracy: 606386/614179 (0.9873115166751061)



Epoch:  68%|██████▊   | 17/25 [05:44<02:42, 20.26s/it]

Test set: Accuracy: 607270/614179 (0.9887508364825238)



Epoch:  72%|███████▏  | 18/25 [06:05<02:23, 20.50s/it]

Test set: Accuracy: 607590/614179 (0.9892718572272904)



Epoch:  76%|███████▌  | 19/25 [06:25<02:02, 20.35s/it]

Test set: Accuracy: 607761/614179 (0.989550277687775)



Epoch:  80%|████████  | 20/25 [06:45<01:41, 20.29s/it]

Test set: Accuracy: 607699/614179 (0.9894493299184766)



Epoch:  84%|████████▍ | 21/25 [07:05<01:20, 20.17s/it]

Test set: Accuracy: 607497/614179 (0.9891204355733426)



Epoch:  88%|████████▊ | 22/25 [07:25<01:00, 20.16s/it]

Test set: Accuracy: 607795/614179 (0.9896056361419066)



Epoch:  92%|█████████▏| 23/25 [07:46<00:40, 20.17s/it]

Test set: Accuracy: 608339/614179 (0.9904913714080097)



Epoch:  96%|█████████▌| 24/25 [08:06<00:20, 20.21s/it]

Test set: Accuracy: 608413/614179 (0.9906118574552369)



Epoch: 100%|██████████| 25/25 [08:26<00:00, 20.27s/it]

Test set: Accuracy: 608604/614179 (0.9909228417122695)

CPU times: user 9h 8min 3s, sys: 1h 37min 11s, total: 10h 45min 14s
Wall time: 8min 28s
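Each "Test set" line above reports the number of correctly classified domains over the size of the held-out portion. The sketch below reproduces that arithmetic and shows a plain-Python version of a 70/30 hold-out split. It is an illustration of the idea and not the code that `train_model` runs; `holdout_split` and `accuracy` are hypothetical helper names.

```python
import random


def holdout_split(items, labels, train_size=0.7, seed=0):
    """Shuffle indices and split items and labels into train and test parts."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    cut = round(len(idx) * train_size)
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Toy data: ten items with alternating labels.
toy_items = ["d{}".format(i) for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = holdout_split(toy_items, toy_labels)
print(len(x_train), len(x_test))

# The final-epoch figure from the log above: correct / total.
final_epoch_accuracy = 608604 / 614179
print(final_epoch_accuracy)
```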





### Save Model
Save the trained model to the given output location.

In [10]:
if not os.path.exists(MODELS_DIR):
    print("Creating directory '{}'".format(MODELS_DIR))
    os.makedirs(MODELS_DIR)

now = datetime.now()
model_filename = "rnn_classifier_{}.bin".format(now.strftime("%Y-%m-%d_%H_%M_%S"))
model_filepath = os.path.join(MODELS_DIR, model_filename)
dd.save_model(model_filepath)
print("Pretrained model saved to location: '{}'".format(model_filepath))

Pretrained model saved to location: 'models/rnn_classifier_2021-01-06_20_23_15.bin'


### Inference

Using the model saved above, we now score a test split of the dataset to determine whether each domain is benign or likely generated by a DGA.

In [11]:
dga_detector = DGADetector()
dga_detector.load_model(model_filepath)

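# Note: this is a fresh random split, so it may not match the split used inside train_model()
domain_train, domain_test, type_train, type_test = train_test_split(gdf, "type", train_size=TRAIN_SIZE)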
test_df = cudf.DataFrame()
test_df["type"] = type_test.reset_index(drop=True)
test_df["domain"] = domain_test.reset_index(drop=True)

test_dataset = DGADataset(test_df)
test_dataloader = DataLoader(test_dataset, batchsize=BATCH_SIZE)

# Score the test set chunk by chunk and collect predictions and labels on the host
pred_results = []
true_results = []
for chunk in test_dataloader.get_chunks():
    pred_results.append(list(dga_detector.predict(chunk['domain']).values_host))
    true_results.append(list(chunk['type'].values_host))
pred_results = np.concatenate(pred_results)
true_results = np.concatenate(true_results)
# Use a new variable name so the imported accuracy_score function is not overwritten
accuracy = accuracy_score(true_results, pred_results)

print('Model accuracy: %s'%(accuracy))

Model accuracy: 0.9922400472826326


In [12]:
average_precision = average_precision_score(true_results, pred_results)

print('Average precision score: {0:0.3f}'.format(average_precision))

Average precision score: 0.979
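Note that `average_precision_score` is given hard 0/1 predictions here rather than continuous scores. In that case the precision-recall curve has only two operating points (predict positive where the model says 1, or predict positive everywhere), so the result should reduce to the closed form in the sketch below. This is a pure-Python illustration on toy labels, assuming label 1 is the positive class; the helper names are hypothetical and the toy data is not from this notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, fn


def average_precision_from_labels(y_true, y_pred, positive=1):
    """Average precision when predictions are hard labels, not scores."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    n_pos = tp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / n_pos
    # Second operating point: everything predicted positive,
    # where recall is 1 and precision equals the positive-class prevalence.
    prevalence = n_pos / len(y_true)
    return recall * precision + (1.0 - recall) * prevalence


toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
toy_ap = average_precision_from_labels(toy_true, toy_pred)
print(round(toy_ap, 4))
```

If the detector can return class probabilities, passing those instead of hard labels would give an average precision that summarizes the full precision-recall curve.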


## Conclusion

The DGA detector in CLX enables users to train their own detection models and to use existing ones. This capability could also be used in conjunction with log parsing efforts if the logs contain domain names. DGA detection done with CLX and RAPIDS keeps data in GPU memory, removing unnecessary copies and conversions and providing a 4x speed advantage over CPU-only implementations. This is especially true with large batch sizes.