This deep learning model uses a CNN-LSTM architecture to predict whether a given domain name is genuine or was artificially generated by a DGA.
Many forms of malware use domain generation algorithms (DGAs) to connect with a command-and-control (C&C) server, which lets them receive instructions and perform malicious activities. There have been many attempts to detect whether a given domain name is genuine or was generated by a DGA. Some machine learning methods have used clustering based on WHOIS data, among other features, to this end. This model builds on past work by using a deep learning architecture to achieve increased accuracy over other methods.
This model is based on the architecture from Yu et al. and is implemented in TensorFlow. It embeds the characters of each domain name, passes the embeddings through a convolutional network, then through an LSTM, and finally through a dense layer for classification. This approach captures the local similarity inherent in genuine domains, as well as spatial connections between characters.
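The pipeline described above can be sketched roughly as follows. This is an illustration, not the code from the repository: the character vocabulary, the maximum length of 75, the layer sizes, and the names `encode_domain` and `build_model` are all assumptions. The sigmoid output is read here as the probability of label 1 (fake).

```python
import string

# Assumed vocabulary: lowercase letters, digits, hyphen, underscore, and dot.
# Index 0 is reserved for padding and for characters outside the vocabulary.
ALPHABET = string.ascii_lowercase + string.digits + "-_."
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 75  # assumed maximum domain length; longer inputs are truncated


def encode_domain(domain, max_len=MAX_LEN):
    """Map a domain to a fixed-length list of integer character indices."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in domain.lower()[:max_len]]
    return indices + [0] * (max_len - len(indices))


def build_model(vocab_size=len(ALPHABET) + 1):
    """Embedding -> Conv1D -> LSTM -> Dense, with illustrative layer sizes."""
    # Imported lazily so encode_domain can be used without TensorFlow installed.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 128),  # one vector per character
        tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # assumed: P(fake)
    ])
    # Binary cross-entropy is an assumed loss for the 0/1 labels.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Calling `encode_domain("wikipedia.com")` yields a list of 75 integers that could be batched into an array and passed to the model returned by `build_model()`.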
The training data was a set of 1.5 million domain names, each labelled 0 (genuine) or 1 (fake), drawn from the Splunk DGA app, Alexa's top 1 million domains, and the Bambenek DGA feed. 10% of the domains were stripped of their TLD and subdomain before being fed through the model. The test data was a set of 100,000 domains from a different slice of this data.
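The stripping step could look something like the sketch below. It is only a guess at the preprocessing: the helper names `strip_to_core` and `strip_fraction` are hypothetical, and the stripping rule is deliberately naive (it keeps the second-to-last label, so it mishandles multi-part suffixes such as `co.uk`).

```python
import random


def strip_to_core(domain):
    """Drop the subdomain and TLD, keeping only the second-to-last label."""
    labels = domain.lower().strip(".").split(".")
    if len(labels) < 2:
        return labels[0]  # nothing to strip from a single-label name
    return labels[-2]


def strip_fraction(domains, fraction=0.1, seed=0):
    """Strip each domain independently with probability `fraction`."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return [strip_to_core(d) if rng.random() < fraction else d for d in domains]
```

With `fraction=0.1`, roughly one domain in ten is reduced to its core label, so the model sees both full names and bare labels during training.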
The model was trained for twenty epochs with the Adam optimizer. It was tested by evaluating its predictive accuracy on 100,000 domains from the shuffled test datasets. It achieved 98.8% accuracy on the test data.
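For reference, accuracy here means the fraction of test domains whose predicted label matches the true one. A minimal sketch, assuming the model outputs the probability of label 1 (fake) and that a 0.5 threshold is used; both are assumptions, as is the function name `accuracy`:

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of predictions that match the 0/1 labels."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    # Assumed convention: probability >= threshold means label 1 (fake).
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


# Three of these four toy predictions match their labels.
print(accuracy([0.1, 0.9, 0.6, 0.4], [0, 1, 0, 0]))
```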
You can query whether a given domain is legit or fake through this model at http://dgaintel.com/.
The model can be loaded through Tensorflow's Keras API from the
To further experiment with the code:
- Go to Google Colab
- Go to File > Open Notebook... > GitHub
- Search for https://github.com/sudo-rushil/dga-intel
$ git clone https://github.com/sudo-rushil/dga-intel
$ cd dga-intel
$ python predict_domain.py [domain name]

$ python predict_domain.py wikipedia.com
The domain wikipedia.com is genuine with probability 1.0
If you run across any issues, file an issue at https://github.com/sudo-rushil/dga-intel/issues.
My LinkedIn page can be found here.
 Abadi, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 Yu, Bin; Pan, Jie; Hu, Jiaming; Nascimento, Anderson; De Cock, Martine. "Character Level based Detection of DGA Domain Names". 2018 International Joint Conference on Neural Networks (IJCNN).