# Domain Generation Algorithm (DGA) Detection

## Authors
 - Gorkem Batmaz (NVIDIA) [gbatmaz@nvidia.com]
 - Bhargav Suryadevara (NVIDIA) [bsuryadevara@nvidia.com]

## Development Notes
* Developed using: RAPIDS v0.12.0 and CLX v0.12
* Last tested using: RAPIDS v21.08 and CLX v21.08 on Aug 18, 2021

## Table of Contents
* Introduction
* Data Importing
* Data Preprocessing
* Training and Evaluation
* Inference
* Conclusion

## Introduction
[Domain Generation Algorithms](https://en.wikipedia.org/wiki/Domain_generation_algorithm) (DGAs) generate domain names that malware uses to communicate with its command and control servers. IP addresses and static domain names are easy to block, so a DGA gives malware a simple way to generate a large number of domain names and rotate through them to circumvent traditional block lists. This example uses a type of recurrent neural network called the [Gated Recurrent Unit](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21) (GRU). The [CLX](https://github.com/rapidsai/clx) and [RAPIDS](https://rapids.ai) libraries enable users to train their models with up-to-date domain names representative of both benign and DGA-generated strings. Using a CLX workflow, this capability could also be used in production. This notebook walks through the data science workflow for creating a DGA detection implementation.
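To make the idea of rotating through generated names concrete, here is a minimal toy sketch of a date-seeded generator. It is purely illustrative and does not correspond to any real malware family or to anything in CLX; the function name `toy_dga`, the seed string, and the hashing scheme are all assumptions made for this example.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, tld: str = ".com") -> list:
    """Derive `count` pseudo-random domain names from a shared seed and a date.

    Hypothetical helper for illustration only; not part of CLX.
    """
    domains = []
    for i in range(count):
        # Hash seed + date + counter so anyone holding the seed can
        # reproduce the same list of names for a given day.
        material = f"{seed}-{day.isoformat()}-{i}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits onto the lowercase letters a-p.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains


example_domains = toy_dga("example-seed", date(2021, 8, 19))
print(example_domains)
```

Names produced this way look random to a human reader, and the list changes every day, which is why static block lists struggle and a character-level classifier is a natural fit.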

In [1]:
import os
import cudf
import torch
import s3fs
import logging
import numpy as np
from datetime import datetime
from sklearn.metrics import accuracy_score, average_precision_score
from clx.analytics.dga_detector import DGADetector
from clx.utils.data.dataloader import DataLoader
from clx.analytics.dga_dataset import DGADataset
from cuml.model_selection import train_test_split

#### Enable console logging

In [2]:
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler()
    ]
)

#### Download Input Dataset from S3

In [3]:
INPUT_CSV = "benign_and_dga_domains.csv"

S3_BASE_PATH = "rapidsai-data/cyber/clx"

In [4]:
# Download the benign and DGA dataset from S3 if it is not already present locally
if not os.path.exists(INPUT_CSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + INPUT_CSV, INPUT_CSV)

#### Load Input Dataset to GPU Dataframe

In [5]:
gdf = cudf.read_csv(INPUT_CSV)

In [6]:
train_data = gdf['domain']
labels = gdf['type']

Because we have only benign and DGA (malicious) categories, the number of domain types needs to be set to 2 (`N_DOMAIN_TYPE=2`). The vocabulary size (`CHAR_VOCAB`) is set to 128, covering the ASCII characters. The values set below for `HIDDEN_SIZE`, `N_LAYERS`, and `LR` (learning rate) give a good balance between network size and performance on this dataset. They might need to be tuned experimentally when working with other datasets.

In [7]:
LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2
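A vocabulary of 128 follows from treating each character of a domain as its ASCII code point. The sketch below shows one way such an encoding could look. It is an assumption for illustration, not CLX's actual preprocessing: `encode_domain` is a hypothetical helper, and right-padding with zeros is an assumed convention. The default width of 100 mirrors the truncation width reported in the training log.

```python
def encode_domain(domain: str, width: int = 100, vocab_size: int = 128) -> list:
    """Turn a domain into `width` integer code points, zero-padded on the right.

    Hypothetical helper for illustration only; not part of CLX.
    """
    # Truncate to the fixed width, then map each character to its code point.
    codes = [ord(ch) for ch in domain[:width]]
    if any(code >= vocab_size for code in codes):
        raise ValueError("domain contains characters outside the ASCII range")
    # Pad short domains so every input has the same length.
    return codes + [0] * (width - len(codes))


encoded = encode_domain("nvidia.com", width=12)
print(encoded)  # [110, 118, 105, 100, 105, 97, 46, 99, 111, 109, 0, 0]
```

Each integer can then index an embedding table with `CHAR_VOCAB` rows, which is why the vocabulary size must cover every code point that can appear in the input.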

#### Instantiate DGA Detector
Now that the data is loaded and the model parameters are set, we can use the `DGADetector` class built into CLX to create and train the model.

In [8]:
dd = DGADetector(lr=LR)
dd.init_model(n_layers=N_LAYERS, char_vocab=CHAR_VOCAB, hidden_size=HIDDEN_SIZE, n_domain_type=N_DOMAIN_TYPE)

2021-08-19 22:58:33,371 [INFO] CUDA device count: 2
2021-08-19 22:58:33,374 [INFO] Found GPU's now setting up cuda for the model


In [9]:
EPOCHS = 25
TRAIN_SIZE = 0.7
BATCH_SIZE = 10000
MODELS_DIR = 'models'

### Training and Evaluation
Now we train and evaluate the model.

In [10]:
%%time
dd.train_model(train_data, labels, batch_size=BATCH_SIZE, epochs=EPOCHS, train_size=TRAIN_SIZE)

2021-08-19 22:58:35,840 [INFO] Initiating model training ...
2021-08-19 22:58:35,841 [INFO] Truncate domains to width: 100
Epoch:   0%|          | 0/25 [00:00<?, ?it/s]2021-08-19 22:58:36,849 [INFO] init
2021-08-19 22:58:52,002 [INFO] Evaluating trained model ...
2021-08-19 22:58:54,910 [INFO] Test set accuracy: 337339/614179 (0.5492519281838031)

2021-08-19 22:59:07,223 [INFO] Evaluating trained model ...
2021-08-19 22:59:10,042 [INFO] Test set accuracy: 506200/614179 (0.8241896906276509)

2021-08-19 22:59:22,719 [INFO] Evaluating trained model ...
2021-08-19 22:59:25,589 [INFO] Test set accuracy: 585468/614179 (0.9532530418656451)

2021-08-19 22:59:38,054 [INFO] Evaluating trained model ...
2021-08-19 22:59:40,850 [INFO] Test set accuracy: 595090/614179 (0.9689194843848454)

2021-08-19 22:59:53,284 [INFO] Evaluating trained model ...
2021-08-19 22:59:56,118 [INFO] Test set accuracy: 595581/614179 (0.9697189255900966)

2021-08-19 23:00:08,545 [INFO] Evaluating trained model ...
2021-0

CPU times: user 19min 58s, sys: 26.2 s, total: 20min 24s
Wall time: 6min 27s





### Save Model
Save the trained model to the given output location.

In [11]:
if not os.path.exists(MODELS_DIR):
    print("Creating directory '{}'".format(MODELS_DIR))
    os.makedirs(MODELS_DIR)

now = datetime.now()
model_filename = "rnn_classifier_{}.bin".format(now.strftime("%Y-%m-%d_%H_%M_%S"))
model_filepath = os.path.join(MODELS_DIR, model_filename)
dd.save_checkpoint(model_filepath)

2021-08-19 23:05:03,062 [INFO] Pretrained model checkpoint saved to location: 'models/rnn_classifier_2021-08-19_23_05_03.bin'


### Inference

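Using the model saved above, we now score a test split of the dataset to determine whether each domain is benign or likely generated by a DGA. Because this split is drawn separately from the one used inside `train_model`, it may include domains the model saw during training.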

In [12]:
dga_detector = DGADetector()
dga_detector.load_checkpoint(model_filepath)

domain_train, domain_test, type_train, type_test = train_test_split(gdf, "type", train_size=TRAIN_SIZE)
test_df = cudf.DataFrame()
test_df["type"] = type_test.reset_index(drop=True)
test_df["domain"] = domain_test.reset_index(drop=True)

test_dataset = DGADataset(test_df, 100)
test_dataloader = DataLoader(test_dataset, batchsize=BATCH_SIZE)

pred_results = []
true_results = []
for chunk in test_dataloader.get_chunks():
    pred_results.append(list(dga_detector.predict(chunk['domain']).values_host))
    true_results.append(list(chunk['type'].values_host))
pred_results = np.concatenate(pred_results)
true_results = np.concatenate(true_results)
accuracy_score_result = accuracy_score(true_results, pred_results)

print('Model accuracy: %s'%(accuracy_score_result))

2021-08-19 23:05:03,074 [INFO] CUDA device count: 2
2021-08-19 23:05:03,075 [INFO] Found GPU's now setting up cuda for the model


Model accuracy: 0.9919144093171535
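The inference loop above processes the test set in chunks of `BATCH_SIZE` so that each prediction call sees a bounded amount of data. The chunking idea can be sketched in plain Python as follows. `iter_chunks` is a hypothetical stand-in written for this example and is not the CLX `DataLoader` implementation.

```python
def iter_chunks(items: list, batch_size: int):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements.

    Hypothetical helper for illustration only; not part of CLX.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


domains = [f"domain{i}.com" for i in range(25)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(domains, batch_size=10)]
print(chunk_sizes)  # [10, 10, 5]
```

The last chunk is simply shorter when the dataset size is not a multiple of the batch size, so per-chunk results can be concatenated without special handling.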


In [13]:
average_precision = average_precision_score(true_results, pred_results)

print('Average precision score: {0:0.3f}'.format(average_precision))

Average precision score: 0.977
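Accuracy counts all correct predictions equally, while precision-based scores focus on how trustworthy the positive (DGA) predictions are. The following minimal sketch computes accuracy, precision, and recall from hard labels in plain Python. It assumes that label `1` marks a DGA domain and `0` a benign one, which is an assumption about the dataset encoding. It is not a reimplementation of scikit-learn's `average_precision_score`, which summarizes precision across recall levels.

```python
def binary_metrics(y_true: list, y_pred: list) -> dict:
    """Compute accuracy, precision, and recall for binary labels (1 = positive).

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, not taken from the notebook's dataset.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

In this toy case six of eight predictions are correct, but only two of the three predicted positives are real, so precision is lower than accuracy. Reporting both gives a fuller picture when the classes are imbalanced.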


## Conclusion

The DGA detector in CLX enables users to train their own detection models and to use existing ones. This capability could also be used alongside log parsing when the logs contain domain names. DGA detection with CLX and RAPIDS keeps data in GPU memory, removing unnecessary copies and conversions and providing a 4x speed advantage over CPU-only implementations. The advantage is especially pronounced with large batch sizes.