# DNS Log Parsing with FLAIR

## Authors
 - Gorkem Batmaz (NVIDIA) [gbatmaz@nvidia.com]

## Development Notes
* Developed using: RAPIDS v0.10.0 
* Last tested using: RAPIDS v0.10.0 on Nov 4, 2019

## Table of Contents
* Introduction
* Log Parsing on a Clean Dataset from a Single Source
* Log Parsing on a Corrupted Dataset from a Single Source
* Conclusion

## Introduction
Log parsing is a complex and highly manual process. In addition to cyBERT, we present an alternative technique that combines character and word embeddings from [FLAIR](https://github.com/zalandoresearch/flair) with [RAPIDS](https://rapids.ai). The long-term goal of this work is to parse unstructured and nonstandard log types with a probabilistic method so that it also performs on previously unseen log types. Phase 1 demonstrates that the approach works on a single dataset. Tags in the raw logs might vary depending on the sensor, so we use only the values to predict their tags. For Phase 1, we assume that we know where each log ends and which values belong to its key/value pairs. This allows us to reframe the log parsing problem as an entity recognition challenge. Phase 2 shows that the approach also works with corrupted and messy data from the same dataset.
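
To make the entity recognition framing concrete, the sketch below turns one log record into a one-token-per-line layout for sequence labeling, where each value is a token and its field name is the tag. The helper name `log_to_column_format` and the sample record are hypothetical; the `.` terminator and the blank separator line mirror the preprocessing used later in this notebook.

```python
def log_to_column_format(record):
    """Render one log (field name -> value) as tab-separated token/tag lines."""
    lines = []
    for field, value in record.items():
        token = str(value).replace(" ", "")  # one token per field value
        lines.append(f"{token}\t{field}\t{field}")
    lines.append(".\tid\tid")  # terminator token marking the end of the log
    lines.append("")           # blank line separates consecutive logs
    return "\n".join(lines)


# Hypothetical sample record, shortened to three fields for illustration
sample = {"time": 1545142695971, "bytes": 556, "dest_port": 53}
print(log_to_column_format(sample))
```

Because only the values are used as tokens, the tagger never sees the original key names, which is what allows the tags to be predicted when sensors label fields differently.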

## Phase 1: Using a Clean Dataset from a Single Source

### Preprocessing of the DNS Logs

In [20]:
import cudf, io, requests
from io import StringIO

### Read the Data

In [21]:
data1 = cudf.read_csv('query_output1545120200000_1545163200000.tab', sep='\t',nrows=500000, quoting=3)

In [22]:
list(data1) # Listing the column names of the dataframe

['time',
 'uuid',
 'hostname',
 'flow_id',
 'bytes',
 'bytes_in',
 'bytes_out',
 'dest_ip',
 'dest_mac',
 'dest_port',
 'endtime',
 'message_type',
 'name',
 'protocol_stack',
 'query',
 'query_type',
 'reply_code',
 'reply_code_id',
 'response_time',
 'src_ip',
 'src_mac',
 'src_port',
 'ttl',
 'time_taken',
 'transaction_id',
 'transport',
 'insert_date',
 'id']

In [23]:
data1['hostname'] = data1['hostname'].str.replace(' ', '') # remove spaces inside each value so that each field can be treated as a single word
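
Whole-value replacement and substring replacement behave differently on dataframe columns, which matters when stripping spaces inside a field. The sketch below uses pandas as a stand-in (an assumption: the cuDF string accessor is meant to mirror it, but the behavior should be checked on the RAPIDS version in use). The sample hostnames are made up.

```python
import pandas as pd

hosts = pd.Series(["a.example.com; b.example.com", " ", "c.example.com"])

# Series.replace matches entire values: only the element equal to ' ' changes
whole_value = hosts.replace(" ", "")

# Series.str.replace works on substrings: every space inside each value is removed
substring = hosts.str.replace(" ", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```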

### Check for and Eliminate `null` Values

In [24]:
for i in list(data1):
    if len(data1)==data1[i].isna().sum():
        print(i)

id


The `id` column is the only column that is entirely null, and it happens to be the last column. We will change it to a dot to indicate the end of each log so that it works seamlessly with the NLP framework.
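
The same check can be written without an explicit counting loop. The following is a minimal sketch in pandas (an assumption made for portability; the notebook itself uses cuDF) on a tiny made-up frame, showing how an entirely empty column is found and then reused as the end-of-log marker.

```python
import numpy as np
import pandas as pd

# Made-up frame: `id` is entirely null, `ttl` is only partly null
df = pd.DataFrame({"bytes": [556, 120], "ttl": [64.0, np.nan], "id": [np.nan, np.nan]})

# Columns where every value is missing
all_null = [col for col in df.columns if df[col].isna().all()]
print(all_null)

# Reuse the empty column as the sentence terminator
df["id"] = "."
print(df)
```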

In [25]:
data1.drop_column('id')
data1['id']='.'

In [26]:
data1[""]="" # add an empty column

### Change the Data Type

We change the data type to `string` so that the values can be manipulated later. The columns are then pivoted into a single column so that the output can be used directly for training in the next section.

In [27]:
for i in list(data1):
    data1[i]=data1[i].astype(str)

In [29]:
#pivoting the columns
data1=data1.reset_index().melt(id_vars=['index']).sort_values('index').drop(columns=['index'])

In [30]:
data1.head()

| | variable | value |
|---|---|---|
| 0 | time | 1545142695971 |
| 500000 | uuid | 8aca9eb6-b16e-463a-b188-3bf029a669fc |
| 1000000 | hostname | elb-agent.agent.datadoghq.com;e.gtld-servers.n... |
| 1500000 | flow_id | 958d6f44-902d-4dcc-aad7-9373679832a9 |
| 2000000 | bytes | 556 |
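
To see what the pivot does on a small scale, here is a sketch in pandas (assumed to behave like the cuDF call for this purpose) with two made-up logs and three columns. A stable sort is requested explicitly so that the fields of each log keep their original column order, which the default sort algorithm does not guarantee.

```python
import pandas as pd

wide = pd.DataFrame({"time": ["t0", "t1"], "bytes": ["556", "120"], "id": [".", "."]})

tall = (
    wide.reset_index()
        .melt(id_vars=["index"])                 # one row per (log, field)
        .sort_values("index", kind="mergesort")  # stable: keeps column order within a log
        .drop(columns=["index"])
)
print(tall)
```

After the sort, the rows of each log are contiguous and end with the `.` value, which is the layout a sequence tagger expects.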


We create the columns that the FLAIR framework will use for training.

In [31]:
# Create two tag columns for the POS and NER options; both carry the field name
data1['ner']=data1['variable']
data1['pos']=data1['ner']
data1.drop_column('variable')

for i in list(data1):
    data1[i]=data1[i].astype(str) # change type so categories become strings too

dftrain=data1[0:1000000] # remember to increase the size back.
dftest=data1[1000000:1100000]
dfdev=data1[1100000:1200000]

In [32]:
data1.head()

| | value | ner | pos |
|---|---|---|---|
| 0 | 1545142695971 | time | time |
| 500000 | 8aca9eb6-b16e-463a-b188-3bf029a669fc | uuid | uuid |
| 1000000 | elb-agent.agent.datadoghq.com;e.gtld-servers.n... | hostname | hostname |
| 1500000 | 958d6f44-902d-4dcc-aad7-9373679832a9 | flow_id | flow_id |
| 2000000 | 556 | bytes | bytes |


### Create Training, Validation, and Test Datasets

The FLAIR framework already has an ingest method for NER problems. To make the data readable for that step, the dataframe is written to tab-separated text files, and the quotes added by the `to_csv` function are then removed.

In [33]:
dftrain.to_csv("rapids_train.txt",sep='\t',header=False,index=False)
dftest.to_csv("rapids_test.txt",sep='\t',header=False,index=False)
dfdev.to_csv("rapids_dev.txt",sep='\t',header=False,index=False)

In [34]:
with open('rapids_dev.txt', 'r') as f, open('rapids_dev_.txt', 'w') as fo:
    for line in f:
        fo.write(line.replace('"', ''))

In [35]:
with open('rapids_test.txt', 'r') as f, open('rapids_test_.txt', 'w') as fo:
    for line in f:
        fo.write(line.replace('"', ''))

In [36]:
with open('rapids_train.txt', 'r') as f, open('rapids_train_.txt', 'w') as fo:
    for line in f:
        fo.write(line.replace('"', ''))
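
The three cells above repeat the same logic, so they could be folded into one helper. This is a sketch only: the function name `strip_quotes` is hypothetical, and the demo writes to a temporary directory rather than to the notebook's files.

```python
import os
import tempfile


def strip_quotes(src_path, dst_path):
    """Copy src_path to dst_path, dropping every double-quote character."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.replace('"', ""))


# Demo on a throwaway file
tmp_dir = tempfile.mkdtemp()
raw = os.path.join(tmp_dir, "demo.txt")
clean = os.path.join(tmp_dir, "demo_.txt")
with open(raw, "w") as f:
    f.write('"556"\tbytes\tbytes\n')

strip_quotes(raw, clean)
with open(clean) as f:
    cleaned_text = f.read()
print(repr(cleaned_text))
```

In the notebook, the helper could be called once per split, for example `strip_quotes('rapids_train.txt', 'rapids_train_.txt')`.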

### Training of the Model and Inference Against the Model

In [37]:
#Original of this part is at https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md#training-a-sequence-labeling-model
import torch
import os
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharacterEmbeddings, FlairEmbeddings
from typing import List

# define columns

columns = {0: 'text', 1: 'pos', 2: 'ner'}

# this is the folder in which train, test and dev files reside
data_folder = '.'

# 1. init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='rapids_train_.txt',
                              test_file='rapids_test_.txt',
                              dev_file='rapids_dev_.txt')
# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [

    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

# 6. initialize trainer
from flair.trainers import ModelTrainer


trainer: ModelTrainer = ModelTrainer(tagger, corpus)
print("lentgh of train",len(corpus.train))
#print(corpus.dev[0])
# 7. start training and run test after each epoch
trainer.train('resources/taggers/example-ner',
              learning_rate=0.3,
              mini_batch_size=128,
              max_epochs=4)

2019-11-04 10:36:09,995 Reading data from .
2019-11-04 10:36:09,995 Train: rapids_train_.txt
2019-11-04 10:36:09,996 Dev: rapids_dev_.txt
2019-11-04 10:36:09,996 Test: rapids_test_.txt
[b'<unk>', b'O', b'time', b'uuid', b'hostname', b'flow_id', b'bytes', b'bytes_in', b'bytes_out', b'dest_ip', b'dest_mac', b'dest_port', b'endtime', b'message_type', b'name', b'protocol_stack', b'query', b'query_type', b'reply_code', b'reply_code_id', b'response_time', b'src_ip', b'src_mac', b'src_port', b'ttl', b'time_taken', b'transaction_id', b'transport', b'insert_date', b'id', b'mini._sftp-ssh._tcp.local', b'[14:10:9f:dd:22:9d]._workstation._tcp.local;kshook-mlt.local;kshook-mlt.local;kshook-mlt._ssh._tcp.local;kshook-mlt._sftp-ssh._tcp.local;kshook-mlt._companion-link._tcp.local;kshook-mlt', b'<START>', b'<STOP>']
lentgh of train 34483
2019-11-04 10:36:36,437 ----------------------------------------------------------------------------------------------------
2019-11-04 10:36:36,438 Model: "SequenceT

{'test_score': 1.0,
 'dev_score_history': [1.0, 1.0, 1.0, 1.0],
 'train_loss_history': [8.815754337360461,
  0.02166381318949991,
  0.012212355707392649,
  0.007855682107792408],
 'dev_loss_history': [tensor(0.0025, device='cuda:0'),
  tensor(0.0009, device='cuda:0'),
  tensor(0.0002, device='cuda:0'),
  tensor(8.4052e-05, device='cuda:0')]}
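
Once every value carries a predicted tag, the tags can be folded back into key/value records, which is the end goal of parsing. The sketch below is independent of FLAIR: it assumes the predictions are already available as `(token, tag)` pairs, and the function name and sample pairs are hypothetical.

```python
def pairs_to_records(tagged_tokens, terminator="."):
    """Group (token, tag) pairs into one dict per log, splitting on the terminator token."""
    records, current = [], {}
    for token, tag in tagged_tokens:
        if token == terminator:
            records.append(current)  # end of the current log
            current = {}
        else:
            current[tag] = token     # predicted tag becomes the key
    if current:                      # keep a trailing log that lacks a terminator
        records.append(current)
    return records


predicted = [
    ("1545142695971", "time"), ("556", "bytes"), (".", "id"),
    ("1545142696002", "time"), ("120", "bytes"), (".", "id"),
]
print(pairs_to_records(predicted))
```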

## Phase 2: Data Perturbation to Test Against Corrupted and Missing Values

In this phase, we investigate whether performance deteriorates when the values in the fields are corrupted. The corruption step in preprocessing can be repeated to increase the level of corruption in the test dataset.

### Prepare the New Test Dataset

In [29]:
import string
import random
import pandas as pd
import numpy as np

#load the data into a dataframe
df1 = pd.read_csv('query_output1545120200000_1545163200000.tab', sep='\t',low_memory=False,nrows=100000)

### Create Function for Character Insertion Corruption

Here we create a function that inserts characters into a string, breaking the original format of each field.

In [25]:
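def insert_str(text, str_to_insert, totalrandom, index):
    # Insert `totalrandom` copies of `str_to_insert` into `text` at position `index`.
    # The first parameter is named `text` to avoid shadowing the `string` module.
    return text[:index] + (totalrandom)*(str_to_insert) + text[index:]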
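
As a quick illustration of the effect, the sketch below repeats the function so that the example is self-contained and applies it at fixed positions, which keeps the result reproducible. The IP address is the one that appears in the sample output later in this notebook.

```python
def insert_str(text, str_to_insert, totalrandom, index):
    # Insert `totalrandom` copies of `str_to_insert` into `text` at position `index`
    return text[:index] + totalrandom * str_to_insert + text[index:]


one_char = insert_str("172.16.136.26", "x", 1, 3)     # one character after the first octet
three_chars = insert_str("172.16.136.26", "x", 3, 0)  # three characters at the start
print(one_char)
print(three_chars)
```

Even a single inserted letter is enough to break the dotted-quad format of an IP address.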

Next we set how many random characters to insert into each field (`totalrandom`) and what percentage of the rows should be excluded from the test set (`missing_proportion`).

In [26]:
missing_proportion = 1
totalrandom = 1
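
`missing_proportion` is read as a percentage of rows: later in the notebook the number of rows to drop is computed as `missing_proportion * round(len(df2) / 100)`. Below is a minimal sketch of that arithmetic on a made-up frame of 2,900 rows (the frame, its size, and the local random generator are assumptions for illustration).

```python
import numpy as np
import pandas as pd

missing_proportion = 1  # percent of rows to drop

demo = pd.DataFrame({"text": ["v"] * 2900})

rng = np.random.RandomState(7)  # fixed seed for reproducibility
n_drop = missing_proportion * round(len(demo) / 100)
drop_indices = rng.choice(demo.index, n_drop, replace=False)

remaining = demo.drop(drop_indices)
print(n_drop, len(remaining))
```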

In [27]:
df1["id"].fillna( '.', inplace = True)
df1.rename(columns={'id':'.'}, inplace=True)
df2=df1.dropna()
df2=df2.astype(str)
for i in list(df2):
    df2[i] = df2[i].str.replace(' ', '') #remove spaces to be able to treat each field as one word. 
df2[""]="" # add an empty column
df2=df2.stack()
df2 = df2.to_frame().reset_index() #change the dataframe from multilevel to single level
df2=df2.drop(['level_0'], axis=1)
df2.columns=['pos','text']
df2 = df2[['text','pos']]
df2['ner']=df2['pos'] #Create two tag columns for POS and NER options

Each log ends with a `.` followed by an empty line. The commands below skip both of these rows so that random characters are inserted only into the data fields.

In [33]:
np.random.seed(7)

n_drop = missing_proportion * round((len(df2)) / 100)

drop_indices = np.random.choice(df2.index, n_drop, replace=False)

dataindex=range(1,28)
counter = 0
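for i in range(1000007, 1050032):
    counter += 1
    if counter in dataindex:
        # data field: insert `totalrandom` copies of one random letter at a random position
        value = df2.loc[i, 'text']
        df2.loc[i, 'text'] = insert_str(value, random.choice(string.ascii_letters), totalrandom,
                                        random.randint(0, len(value)))
    elif counter == 29:
        # empty separator row: the next row starts a new log
        counter = 0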
df2 = df2.drop(drop_indices)
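
The counter in the loop above relies on each log occupying 29 consecutive rows after stacking: 27 data fields, the `.` terminator, and the empty separator. The same rule can be expressed with modular arithmetic, as in this sketch on plain Python lists. The function name `corrupt_rows` and the sample rows are hypothetical, and the rows are assumed to start on a log boundary.

```python
import random
import string

ROWS_PER_LOG = 29  # 27 data fields + '.' terminator + empty separator
DATA_FIELDS = 27


def corrupt_rows(rows, n_chars=1, seed=7):
    """Insert n_chars copies of a random letter into data fields only."""
    rng = random.Random(seed)
    corrupted = []
    for position, value in enumerate(rows):
        slot = position % ROWS_PER_LOG  # 0..28 within the current log
        if slot < DATA_FIELDS:
            index = rng.randint(0, len(value))
            letter = rng.choice(string.ascii_letters)
            value = value[:index] + n_chars * letter + value[index:]
        corrupted.append(value)  # terminator and separator pass through unchanged
    return corrupted


one_log = [f"v{k}" for k in range(DATA_FIELDS)] + [".", ""]
result = corrupt_rows(one_log * 2)
print(result[26:30])
```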

In [34]:
dftest=df2[1000007:1050032]
dftest.to_csv("perturbed_test_data.txt",index=False,sep='\t',header=False)

The content of the test dataset has been changed, and standard formats (e.g., IP addresses) are broken. Categorical columns now contain spurious categories.

### Training of the Model and Inference Against the Model Using Corrupted Data

In [35]:
#Original of this part is at https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md#training-a-sequence-labeling-model
import torch
import os
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharacterEmbeddings, FlairEmbeddings
from typing import List

# define columns

columns = {0: 'text', 1: 'pos', 2: 'ner'}

# this is the folder in which train, test and dev files reside
data_folder = '.'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='_train.txt',
                              test_file='perturbed_test_data.txt',
                              dev_file='_val.txt')
# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [

    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

# 6. initialize trainer
from flair.trainers import ModelTrainer


trainer: ModelTrainer = ModelTrainer(tagger, corpus)
print("lentgh of train",len(corpus.train))
print(corpus.dev[0])
# 7. start training and run test after each epoch
trainer.train('resources/taggers/example-ner',
              learning_rate=0.3,
              mini_batch_size=128,
              max_epochs=3)

2019-10-11 20:41:52,335 Reading data from .
2019-10-11 20:41:52,336 Train: train.txt
2019-10-11 20:41:52,346 Dev: val.txt
2019-10-11 20:41:52,347 Test: perturbed_test_data.txt
[b'<unk>', b'O', b'time', b'uuid', b'hostname', b'flow_id', b'bytes', b'bytes_in', b'bytes_out', b'dest_ip', b'dest_mac', b'dest_port', b'endtime', b'message_type', b'name', b'protocol_stack', b'query', b'query_type', b'reply_code', b'reply_code_id', b'response_time', b'src_ip', b'src_mac', b'src_port', b'ttl', b'time_taken', b'transaction_id', b'transport', b'insert_date', b'.', b'<START>', b'<STOP>']
lentgh of train 34483
Sentence: "544 34 510 172.16.136.26 AC:16:2D:88:D8:38 53 2018-12-18T12:48:25.298790Z QUERY;RESPONSE dns.msftncsi.com;com;com;com;com;com;com;com;com;com;com;com;com;com;a.gtld-servers.net;b.gtld-servers.net;c.gtld-servers.net;d.gtld-servers.net;e.gtld-servers.net;f.gtld-servers.net;g.gtld-servers.net;h.gtld-servers.net;i.gtld-servers.net;j.gtld-servers.net;k.gtld-servers.net;l.gtld-servers.net

{'test_score': 0.9986,
 'dev_score_history': [1.0, 1.0, 1.0],
 'train_loss_history': [6.186323038999129,
  0.012852874613815436,
  0.008184892177599034],
 'dev_loss_history': [tensor(0.0020, device='cuda:0'),
  tensor(0.0008, device='cuda:0'),
  tensor(0.0004, device='cuda:0')]}

After 3 epochs, the F1 score on the corrupted test set is 0.9986. This test dataset has one random character inserted into each field, and 1% of the fields in the dataset are missing.
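
For intuition about the reported score, the sketch below computes a micro-averaged F1 over per-token tags for a made-up gold/predicted pair. It is a simplified stand-in and not FLAIR's evaluation code, which may differ in detail; the function name `micro_f1` is hypothetical.

```python
def micro_f1(gold, predicted, ignore="O"):
    """Micro-averaged F1 over aligned tag lists, ignoring the `ignore` tag."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g == p:
            if g != ignore:
                tp += 1  # correct tag
            continue
        if p != ignore:
            fp += 1      # predicted a tag that is not the gold tag
        if g != ignore:
            fn += 1      # missed the gold tag
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["time", "bytes", "dest_ip", "src_ip"]
pred = ["time", "bytes", "src_ip", "src_ip"]
score = micro_f1(gold, pred)
print(score)
```

In this made-up example one of four values is mislabeled, so precision and recall are both 0.75.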

## Conclusion

The first part of the notebook shows that this set of DNS logs can be parsed probabilistically with high accuracy. By perturbing the data, we show that the impact of corrupted data on model performance is minimal. In typical real-life DNS logs, less than 1% of the logs might be corrupted. The experiment shown here goes well beyond 1% corruption to show the effects at extreme levels.