# model2regex Prototyping

As part of my master thesis with the chair MLSEC at TU Berlin, I am currently working on a machine learning approach to generate a regular expression of Domain Generating Algorithms (DGA) from a Language model (LM). In this prototype we learn an LM with pytorch using the GRU module, we then combine this with a Feed Forward Network (FFN) [[1]] to solve the classification problem of classifying a given domain as generated by the DGA. 

[1]: https://web.stanford.edu/~jurafsky/slp3/9.pdf#page=9

## Settings

In [1]:
from dga import banjori, simple_dga # import the algorithms from the dga.py file
from typing import Callable
import logging
import sys
# Paths
real_domains_path = 'data/top-1m.csv' # write the full or relative path to a csv file containing a list of real domains in with rows in the form [#rank, domain]

# DGA settings
dga_algorithm : Callable[[str], str] = banjori # the algorithm to use can also be any callable that returns a list of domains
dga_seed: str = 'earnestnessbiophysicalohax.com' # the initial seed of our dga algorithm

# Model settings
hidden_size = 128
num_layers = 1
embed_dim = 64
device ='cuda:0' # the GPU device to use

#logging
level = logging.DEBUG
format = '{message}'
logging.basicConfig(level=level, format=format, stream=sys.stdout, style='{')

## Classifier Network

First we are going to define our network module with the help of Pytorch.
The module contains an embedding layer, an rnn layer for the language model part and ends in a simple 1 layer network which will output a single class property. The whole forward function in the end will output both the class and the hidden state of the rnn.

In [2]:
from torch import nn

class DGAClassifier(nn.Module):
    def __init__(self, vocabSize: int, emb: int, size: int, nlayers: int):
        super(DGAClassifier, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=vocabSize,
                                      embedding_dim=emb)
        self.rnn = nn.GRU(input_size=emb, hidden_size=size,
                          num_layers=nlayers)
        self.out = nn.Linear(in_features=size, out_features=1)
        self.drop = nn.Dropout(0.3)
        self.sig = nn.Sigmoid()
        
    def forward(self, input_seq, hidden_state):
        embedding = self.embedding(input_seq)
        output, hidden_state = self.rnn(embedding, hidden_state)
        x = hidden_state[-1, :]
        x = self.drop(x)
        x = self.out(x)
        x = self.sig(x)
        return x, hidden_state.detach()


## Dataset Preparation

To learn the network we've build we are going to generate a [pytorch Dataset][1]. The dataset contains both legitimate Domains from a dataset of the top [1 million domains][2]. 
The Dataset will contain a few helper functions to make working with it inside the training code simpler.

[1]: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files
[2]: https://github.com/PeterDaveHello/top-1m-domains

In [3]:
import torch
from typing import Tuple, Sequence
from torch.utils.data import Dataset
from functools import singledispatchmethod

class BiMapping:
    def __init__(self, characters: str):
        characters = set(characters)
        self._dict = {ch: i for i,ch in enumerate(characters, start=1)}
        self._reverse = {i: ch for i,ch in enumerate(characters, start=1)}
        
    @singledispatchmethod 
    def __getitem__(self, item):
        raise NotImplementedError("item must be of type int or str")
    @__getitem__.register
    def _(self, item: str):
        return self._dict[item]
    @__getitem__.register
    def _(self, item: int):
        return self._reverse[item]

class DomainsAndDGA(Dataset):
    def __init__(self, domains: Sequence[Tuple[str, int]]):
        self.data = domains
        self.max_size = len(max(self.data, key=lambda d: len(d[0]))[0])
        self.chars = sorted(list(set(chain(*[d[0] for d in self.data]))))
        self.vocabSize = len(self.chars) + 1
        self.char_bimap = BiMapping(self.chars)

    def __len__(self) -> int:
        return len(self.data)

    def isEndChar(self, ind) -> bool:
        """helper function that checks if  the current index is the 'end' character (index 0)"""
        return ind == 0
    def charTensor(self, _input: str) -> torch.Tensor:
        """Helper function to turn an input string into a tensor of indices"""
        return torch.tensor([[self.char_bimap[c] for c in _input]]).permute(1,0)
        
    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        item, label = self.data[idx]
        item = torch.tensor([self.char_bimap[c] for c in item])
        # we need tensors of same size, so if any domain has a different size we then pad it with 0 which will be our "end char"
        item = F.pad(item, (0,self.max_size - len(item)), "constant", 0)
        return (item, torch.tensor(label, dtype=torch.float))

In [4]:
import pandas as pd
from itertools import chain, repeat
from dga import generate_dataset
from pathlib import Path
#########################
# Preparing the Dataset #
#########################
top1m = pd.read_csv(Path(real_domains_path))
real_domains = top1m.values[:,1]
real_domains = list(
    tuple(
        zip(real_domains, repeat(1))
    )
)
dga_domains = generate_dataset(algorithm=dga_algorithm, seed=dga_seed, size=len(real_domains))
dga_domains = list(
    tuple(
        zip(dga_domains, repeat(0))
    )
)

dataset = list(chain(real_domains, dga_domains))
dataset = DomainsAndDGA(dataset)

## Model training

Now that everything is set up, it is time to run the training. We create a class for the Training to easily modify our training settings in a few lines of code.


In [5]:
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader, SubsetRandomSampler
from torch import optim
from sklearn.model_selection import KFold
import random
import torch.nn.functional as F
from random import shuffle

class ModelTrain:
    def __init__(self, dataset: DomainsAndDGA, model: DGAClassifier, **kwargs):
        self.dataset = dataset
        self.model = model
        self.untrained_model_path = './untrained_model.pth'
        torch.save(model.state_dict(), self.untrained_model_path)
        self.criterion = kwargs.get('criterion', nn.CrossEntropyLoss(reduction='mean'))
        self.optimizer = kwargs.get('optimizer', optim.Adam(self.model.parameters(), lr=kwargs.get('optim_lr', 0.001)))
        self.device = kwargs.get('device', 'cuda:0')
        self.save_path = kwargs.get('save_path', './model-fold-{}-path')

    def train(self, *, k=5, save_model=True):
        kfold = KFold(n_splits=k, shuffle=True)
        accuracies = []
        for fold, (train_dataset, test_dataset) in enumerate(kfold.split(dataset)):
            writer = SummaryWriter(f'./runs/Classifier-prototype-fold-{fold}')
            checkpoint = torch.load(self.untrained_model_path)
            self.model.load_state_dict(checkpoint)
            train_sampler = SubsetRandomSampler(train_dataset)
            test_sampler = SubsetRandomSampler(test_dataset)
            trainloader = DataLoader(dataset, batch_size=500, sampler=train_sampler)
            testloader = DataLoader(dataset, batch_size=500, sampler=test_sampler)
            
            self.train_fold(dataloader=trainloader, writer=writer)
            if save_model:
                logging.info("saving model to %s", self.save_path.format(fold))
                torch.save({
                    'model_state_dict': model.state_dict(),
                }, self.save_path.format(fold))    
            total, correct = self.validate_fold(dataloader=testloader)
            accuracy = correct / total
            accuracies.append(accuracy)
            logging.info(f'Accuracy for fold {fold}: {accuracy:%}')
            writer.close()
        logging.info("Showing summary...")
        for idx, accuracy in enumerate(accuracies):
            logging.info(f"accuracy of fold {idx}: {accuracy:%}")
    def validate_fold(self, *, dataloader: DataLoader):
        total, correct = 0, 0
        with torch.no_grad():
            self.model.eval()
            for batch, (x,y) in enumerate(dataloader):
                output, _ = self.model(x.permute(1,0).to(self.device), None)
                total += y.size(0)
                correct += (output.permute(1,0).round() == y.to(self.device)).sum().item()
        return total , correct

        
    def train_fold(self, *, dataloader: DataLoader, epochs=10, writer: SummaryWriter):
        criterion = self.criterion
        device = self.device
        model = self.model
        model.to(device)
        model.train()
        optimizer = self.optimizer
        h0 = None
        for epoch in range(epochs):
            logging.info("epoch: %s\n\n", epoch)
            for batch, (x,y) in enumerate(dataloader):
                optimizer.zero_grad()
                if h0 is not None:
                    h0 = h0.to(device)
                output, _ = model(x.permute(1,0).to(device), h0)
                if batch % 500 == 0:
                    idx = random.randint(0, len(x)-1)
                    logging.info("----------------------------------------------------")
                    logging.info(f"showing one prediction for random sample of batch: {batch:,}")
                    logging.info("inputstr: %s", ''.join(dataset.char_bimap[c] for c in x[idx].tolist() if c != 0))
                    logging.info("real class:\t%-d", y[idx].item())
                    logging.info("output:\t\t%-d", output[idx].round().item())
                    logging.info("----------------------------------------------------")
                loss = criterion(output.squeeze(), y.to(device))
                if writer:
                    writer.add_scalar('Loss/train', loss, epoch)
                loss.backward()
                optimizer.step()
                if batch % 500 == 0:
                    logging.info(f"loss at batch {batch:,}: {loss.item()}")
            model.train()
        if writer:
            writer.flush()
        return model, h0

Here we are initializing the model and the trainer class and run the training.

In [6]:
model = DGAClassifier(dataset.vocabSize, embed_dim, hidden_size, num_layers)
trainer = ModelTrain(dataset, model)

trainer.train(k=3)

epoch: 0


----------------------------------------------------
showing one prediction for random sample of batch: 0
inputstr: xhcnestnessbiophysicalohax.com
real class:	0
output:		0
----------------------------------------------------
loss at batch 0: 1529.505615234375
----------------------------------------------------
showing one prediction for random sample of batch: 500
inputstr: bprxestnessbiophysicalohax.com
real class:	0
output:		1
----------------------------------------------------
loss at batch 500: 1401.2078857421875
----------------------------------------------------
showing one prediction for random sample of batch: 1,000
inputstr: thavestnessbiophysicalohax.com
real class:	0
output:		0
----------------------------------------------------
loss at batch 1,000: 1377.0853271484375
----------------------------------------------------
showing one prediction for random sample of batch: 1,500
inputstr: s1.lunacorgie.xyz
real class:	1
output:		1
--------------------------------