

DGA Intel

This deep learning model uses a CNN-LSTM architecture to predict whether a given domain name is genuine or was artificially generated by a DGA.

The Problem

Many forms of malware use domain generation algorithms (DGAs) to locate and connect to a command-and-control (C&C) server, from which they receive instructions to carry out malicious activities. There have been many attempts to detect whether a given domain name is genuine or was generated by a DGA. Some machine learning methods have clustered domains using features such as WHOIS data. This model builds on that past work by using a deep learning architecture to achieve increased accuracy over other methods.

The Model

This model is based on an architecture from [2] and implemented in TensorFlow [1]. It embeds the characters of a domain name, passes the embeddings through a convolutional network and then an LSTM, and feeds the result to a dense layer for classification. This approach captures the local similarity inherent in genuine domains, as well as spatial connections between characters.
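The layer stack described above can be sketched with the Keras API roughly as follows. This is a minimal sketch, not the repository's actual model: the function name `build_cnn_lstm` and every hyperparameter (vocabulary size, input length, embedding width, filter count, kernel size, pooling, LSTM units) are assumptions chosen for illustration.

```python
def build_cnn_lstm(vocab_size=40, max_len=75, embed_dim=128,
                   filters=64, kernel_size=3, lstm_units=64):
    """Return an uncompiled Keras model: Embedding -> Conv1D -> LSTM -> Dense.

    All default values are illustrative assumptions, not the trained
    model's real settings.
    """
    # Imported lazily so this module can be imported without TensorFlow.
    import tensorflow as tf

    return tf.keras.Sequential([
        # Input: one integer index per character, padded to max_len.
        tf.keras.Input(shape=(max_len,)),
        # Learn a dense vector for each character index.
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        # Convolution over neighbouring characters (local patterns).
        tf.keras.layers.Conv1D(filters, kernel_size,
                               padding="same", activation="relu"),
        # Assumed pooling step to shorten the sequence before the LSTM.
        tf.keras.layers.MaxPooling1D(pool_size=2),
        # Summarise the sequence of convolutional features.
        tf.keras.layers.LSTM(lstm_units),
        # Single sigmoid unit: probability that the domain is DGA-generated.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
```

With TensorFlow installed, `build_cnn_lstm().summary()` prints the resulting layers and their output shapes.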

The Data

The training data consisted of 1.5 million domain names, each labelled 0 (genuine) or 1 (fake), drawn from the Splunk DGA app, Alexa's top 1 million domains, and the Bambenek DGA feed. 10% of the domains were stripped of their TLD and subdomain before being fed to the model. The test data was a set of 100,000 domains from a different slice of the same data.
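As an illustration of this kind of preprocessing, the sketch below strips a domain down to the label left of its TLD and converts the characters to integer indices suitable for an embedding layer. The alphabet, the index scheme, `MAX_LEN`, and both function names are hypothetical; the repository's notebooks may encode domains differently.

```python
import string
from typing import List

# Hypothetical vocabulary: characters commonly found in hostnames.
# Index 0 is reserved for padding, index 1 for out-of-vocabulary characters.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
PAD, UNKNOWN = 0, 1
MAX_LEN = 75  # assumed fixed input length, not taken from the repository


def strip_tld_and_subdomain(domain: str) -> str:
    """Keep only the label just left of the TLD.

    Naive on purpose: the last label is treated as the TLD, so a
    multi-part suffix such as 'example.co.uk' would yield 'co'.
    """
    labels = domain.lower().strip(".").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def encode_domain(domain: str, max_len: int = MAX_LEN) -> List[int]:
    """Map each character to an integer, truncating or zero-padding to max_len."""
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in domain.lower()[:max_len]]
    return indices + [PAD] * (max_len - len(indices))


print(strip_tld_and_subdomain("mail.google.com"))  # google
print(encode_domain("ab.c", max_len=6))            # [2, 3, 39, 4, 0, 0]
```

A production pipeline would more likely rely on the Public Suffix List to find the registrable part of a domain; the simple split here only keeps the example short.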

Results

The model was trained for twenty epochs with the Adam optimizer. It was evaluated by its predictive accuracy on 100,000 domains from the shuffled test set, where it achieved 98.8% accuracy.
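For reference, accuracy here means the fraction of test domains whose predicted label matches the true label, so 98.8% of 100,000 domains corresponds to 98,800 correct predictions. The sketch below shows the calculation on made-up values; the 0.5 decision threshold and the `accuracy` helper are assumptions, not taken from the repository.

```python
from typing import Sequence


def accuracy(probabilities: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded prediction equals the label."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    # Label 1 means DGA-generated, 0 means genuine (as in the training data).
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(pred == y for pred, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up model outputs and labels, purely for illustration.
probs = [0.02, 0.97, 0.40, 0.85]
truth = [0, 1, 1, 1]
print(accuracy(probs, truth))  # 3 of 4 correct -> 0.75
```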

Website Usage

You can check whether a given domain is genuine or DGA-generated with this model at http://dgaintel.com/.

Development

The model can be loaded through TensorFlow's Keras API from the domain_classifier_model.h5 file. To further experiment with the code:

  1. Go to Google Colab
  2. Go to File > Open Notebook... > Github
  3. Search for https://github.com/sudo-rushil/dga-intel
  4. Open domain_data.ipynb or domain_model.ipynb
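To work with the saved model locally instead of in Colab, loading it through the Keras API might look like the sketch below. It assumes TensorFlow is installed and that domain_classifier_model.h5 sits in the current directory; the `load_classifier` helper is hypothetical and not part of the repository.

```python
def load_classifier(path="domain_classifier_model.h5"):
    """Load the saved Keras model from an HDF5 file.

    The default path assumes the file is in the working directory.
    """
    # Imported lazily so this helper can be defined without TensorFlow present.
    import tensorflow as tf

    return tf.keras.models.load_model(path)
```

Calling `load_classifier().summary()` then prints the layer stack, which is a quick way to check the input shape the model expects before feeding it encoded domains.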

Code Usage

$ git clone https://github.com/sudo-rushil/dga-intel
$ cd dga-intel
$ python predict_domain.py [domain name]

Example

$ python predict_domain.py wikipedia.com

The domain wikipedia.com is genuine with probability 1.0

Contact

If you run into any problems, file an issue at https://github.com/sudo-rushil/dga-intel/issues.

My LinkedIn page can be found here.

References

[1] Abadi, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Yu, Bin; Pan, Jie; Hu, Jiaming; Nascimento, Anderson; De Cock, Martine. "Character Level based Detection of DGA Domain Names". 2018 International Joint Conference on Neural Networks (IJCNN).
