# Deep Learning Approach for Network Intrusion Detection in Software Defined Networking

This is a practical implementation and adaptation of the paper by Tuan A Tang et al.: [10.1109/WINCOM.2016.7777224](https://doi.org/10.1109/WINCOM.2016.7777224).  
Tang et al. built a deep neural network for anomaly-based intrusion detection in a software defined networking environment and achieved impressive results.  
For their evaluation they used the NSL-KDD dataset.  
As I am using the CICIDS2017 dataset, some of the input features have to be adapted, mainly the *count* and *srv_count* variables.  
These variables, which serve as two of the six inputs of the neural network, are defined as the number of connections to the same host (*count*) or the same service (*srv_count*) as the current connection __in the last two seconds__.  
As the CICIDS2017 dataset does not contain such connection counts, it remains to be defined how to deal with this.
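
One possible way to obtain *count* and *srv_count* is to recompute them from per-flow timestamps. The sketch below assumes that every flow carries a timestamp in seconds and a destination address. The column names `timestamp`, `destination_ip` and `destination_port`, the helper `window_count` and the toy data are hypothetical and not part of the preprocessed pickles used in this notebook. It also assumes that the current connection is included in its own count.

```python
import numpy as np
import pandas as pd

def window_count(flows, key, window_s=2.0):
    """For each row, count the rows sharing the same `key` value whose
    timestamp lies in (t - window_s, t], the current row included."""
    counts = pd.Series(0, index=flows.index, dtype="int64")
    for _, group in flows.groupby(key):
        group = group.sort_values("timestamp", kind="stable")
        t = group["timestamp"].to_numpy(dtype="float64")
        # First flow that is still inside the window of each row
        start = np.searchsorted(t, t - window_s, side="right")
        # One past the last flow with a timestamp <= t
        end = np.searchsorted(t, t, side="right")
        counts.loc[group.index] = end - start
    return counts

# Hypothetical toy flows; timestamps are seconds since capture start
flows = pd.DataFrame({
    "timestamp":        [0.0, 0.5, 1.9, 2.5, 10.0, 0.7],
    "destination_ip":   ["10.0.0.1"] * 5 + ["10.0.0.2"],
    "destination_port": [80, 80, 22, 80, 80, 80],
})
flows["count"] = window_count(flows, "destination_ip")
flows["srv_count"] = window_count(flows, "destination_port")
print(flows)
```

Grouping by `destination_port` treats the port as the service, which is only an approximation of the NSL-KDD notion of a service.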

Problem / ToDo summary:

- Is my Keras layer architecture right? 4 vs. 5 layers
    - Especially the last layer: the image shows 2 nodes, whereas I am using 1 node with binary cross-entropy
- Float issue / overflow: infinite values in the data are detected by numpy/pandas
- *count* and *srv_count* don't have an appropriate representation in CICIDS2017
- Training dropout not defined by the authors
- Activation functions not defined by the authors
- Normalization strategy not defined by the authors
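
For the infinite values and the open normalization question, a minimal sketch could look as follows. It assumes that rows with infinite values may simply be dropped and that min-max scaling with training-set statistics is acceptable. Neither choice comes from the paper, and the helper `clean_and_scale` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_and_scale(train, test):
    """Drop rows containing +/-inf or NaN, then min-max scale both frames
    using the minimum and maximum of the cleaned training data only."""
    train = train.replace([np.inf, -np.inf], np.nan).dropna()
    test = test.replace([np.inf, -np.inf], np.nan).dropna()
    lo, hi = train.min(), train.max()
    span = (hi - lo).replace(0, 1)  # constant columns: avoid division by zero
    return (train - lo) / span, (test - lo) / span

# Hypothetical toy data with one infinite rate value in each frame
train = pd.DataFrame({"flow_packets_per_s": [1.0, np.inf, 3.0, 5.0],
                      "flow_duration":      [10.0, 20.0, 30.0, 50.0]})
test = pd.DataFrame({"flow_packets_per_s": [2.0, -np.inf],
                     "flow_duration":      [20.0, 40.0]})

train_scaled, test_scaled = clean_and_scale(train, test)
print(train_scaled)
print(test_scaled)
```

Because rows are dropped, the label arrays would have to be filtered with the surviving index (`train_scaled.index`, `test_scaled.index`) before training.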

In [1]:
from datetime import datetime
import json
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Data loading and prep

In [2]:
def load_df(filename):
    filepath = os.path.join('CICIDS2017', filename+'.pkl')
    return pd.read_pickle(filepath)

In [3]:
cic_train_data = load_df('cic_train_data')
cic_test_data = load_df('cic_test_data')
cic_train_labels = load_df('cic_train_labels')
cic_test_labels = load_df('cic_test_labels')

We only need six features, so we create a new DataFrame that holds only those.  
The mapping is as follows:  

| NSL-KDD field | CICIDS2017 field |
|---------------|---------------------|
| duration | flow_duration |
| protocol_type | protocol |
| src_bytes | total_fwd_packets |
| dst_bytes | total_backward_packets |
| count | flow_packets_per_s |
| srv_count | destination_port |


In [4]:
fields = ['flow_duration', 'protocol', 'total_fwd_packets', 'total_backward_packets','flow_packets_per_s','destination_port']

cic_train_data = cic_train_data.filter(fields, axis=1) 
cic_test_data = cic_test_data.filter(fields, axis=1)

In [5]:
cic_train_data.head()

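|   | flow_duration | protocol | total_fwd_packets | total_backward_packets | flow_packets_per_s | destination_port |
|---|---------------|----------|-------------------|------------------------|--------------------|------------------|
| 0 | 6.666667e-08 | 0.352941 | 5e-06 | 0.0 | 0.5 | 0.750561 |
| 1 | 4.166667e-08 | 0.352941 | 5e-06 | 0.0 | 0.8 | 0.750561 |
| 2 | 4.166667e-08 | 0.352941 | 5e-06 | 0.0 | 0.8 | 0.750561 |
| 3 | 4.166667e-08 | 0.352941 | 5e-06 | 0.0 | 0.8 | 0.750561 |
| 4 | 5.833333e-08 | 0.352941 | 5e-06 | 0.0 | 0.533333 | 0.755108 |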


## Label Translation

In [6]:
# Load the translation data from the Keras Tokenizer
with open(os.path.join('NSL_KDD','kdd_label_wordindex.json')) as json_in:
    data = json.load(json_in)
    print(data)
    normal_index = data['normal']

{'normal': 1, 'neptune': 2, 'warezclient': 3, 'ipsweep': 4, 'portsweep': 5, 'teardrop': 6, 'nmap': 7, 'satan': 8, 'smurf': 9, 'pod': 10, 'back': 11, 'guess_passwd': 12, 'ftp_write': 13, 'multihop': 14, 'rootkit': 15, 'buffer_overflow': 16, 'imap': 17, 'warezmaster': 18, 'phf': 19, 'land': 20, 'loadmodule': 21, 'spy': 22, 'perl': 23, 'saint': 24, 'mscan': 25, 'apache2': 26, 'snmpgetattack': 27, 'processtable': 28, 'httptunnel': 29, 'ps': 30, 'snmpguess': 31, 'mailbomb': 32, 'named': 33, 'sendmail': 34, 'xterm': 35, 'worm': 36, 'xlock': 37, 'xsnoop': 38, 'sqlattack': 39, 'udpstorm': 40}


In [7]:
def f(x):
    return 0 if x == normal_index else 1
f = np.vectorize(f)

In [8]:
cic_train_labels.head()

|   | label | label_encoded |
|---|-------|---------------|
| 0 | BENIGN | 1 |
| 1 | BENIGN | 1 |
| 2 | BENIGN | 1 |
| 3 | BENIGN | 1 |
| 4 | BENIGN | 1 |


In [9]:
cic_train_labels = f(cic_train_labels['label_encoded'].values)
cic_test_labels = f(cic_test_labels['label_encoded'].values)
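
As a side note, the same binarization can be written without `np.vectorize`, which is essentially a Python-level loop and can be slow on millions of rows. The following is a minimal sketch on made-up values. The array `encoded` is hypothetical, and `normal_index` is set by hand to the value of `data['normal']` printed above.

```python
import numpy as np

normal_index = 1  # value of data['normal'] in the word index printed above
encoded = np.array([1, 2, 1, 5, 1])  # hypothetical label_encoded values

# 0 for normal/benign traffic, 1 for every attack class
binary = (encoded != normal_index).astype(np.int64)
print(binary)
```

Both variants rely on the CICIDS2017 `label_encoded` column using the same index for `BENIGN` as the NSL-KDD word index uses for `normal`.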

In [10]:
print("Training Set Size:\t",len(cic_train_data))
print("Training Label Size:\t",len(cic_train_labels))
print("Test Set Size:\t\t",len(cic_test_data))
print("Test Label Size:\t",len(cic_test_labels))

Training Set Size:	 1839982
Training Label Size:	 1839982
Test Set Size:		 990761
Test Label Size:	 990761


## Runtime prerequisites

In [11]:
# Define some semi-global stuff

batch_size = 10
epochs     = 100
learn_rate = 0.001

In [12]:
run_date = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
runtype_name = 'cicids2017-sdn-dnn'
log_folder_path = os.path.join('logs',runtype_name + '-{}'.format(run_date))

In [14]:
# https://github.com/keras-team/keras/blob/master/examples/tensorboard_embeddings_mnist.py

# save the class labels to disk to color data points in TensorBoard accordingly
filename = os.path.join(log_folder_path,'metadata.tsv')
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, 'w') as metadata_file:  # don't reuse the name f, it holds the label function
    np.savetxt(metadata_file, cic_test_labels)

## Building and training the model

In [15]:
# Set up TensorBoard visualization and pass it as a callback, then run:
# tensorboard --logdir=path/to/logdir
from keras.callbacks import EarlyStopping,ModelCheckpoint,TensorBoard

callbacks = [
    ModelCheckpoint(
        filepath='models/'+runtype_name+'-{}.h5'.format(run_date),
        monitor='val_loss',   
        save_best_only=True    # Only save one. Only overwrite this one if val_loss has improved
    ),
    TensorBoard(
        log_dir=log_folder_path,
        #histogram_freq=1,     # Record activation histograms every epoch
        #embeddings_freq=1,     # Record embedding data every epoch -> There's something wrong with the embeddings here. Keras crashed with them enabled
        #embeddings_layer_names=['LSTMnet'],
        #embeddings_metadata='metadata.tsv',
        #embeddings_data=data_test,
        #batch_size=batch_size
    )
]

Using TensorFlow backend.


In [16]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import RMSprop
from keras.utils import plot_model

# see implementation/sdn-dnn.py for details, alternatives and comments

model = Sequential()
model.add(Dense(12, activation='relu', input_dim=6))
model.add(Dense(6, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(RMSprop(lr=learn_rate), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

history = model.fit(cic_train_data, cic_train_labels, 
                    epochs=epochs,     # use the value defined in the runtime cell (100)
                    batch_size=batch_size,
                    verbose=1,
                    validation_data=(cic_test_data, cic_test_labels),
                    callbacks=callbacks)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 12)                84        
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 78        
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 21        
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 8         
Total params: 191
Trainable params: 191
Non-trainable params: 0
_________________________________________________________________
Train on 1839982 samples, validate on 990761 samples
Epoch 1/100
  95210/1839982 [>.............................] - ETA: 2:35 - loss: 0.1915 - acc: 0.9253
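
Regarding the open question about the last layer: the model above ends in two softmax nodes, while the ToDo list mentions a single node with binary cross-entropy. For two classes these two formulations express the same probability, since a softmax over two logits equals a sigmoid applied to their difference. The sketch below checks this numerically with plain numpy. The logits are made up and the helper functions are only illustrative, not part of the Keras model.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical outputs of a two-node layer before the activation
logits = np.array([[0.3, -1.2],
                   [2.0,  2.5],
                   [-4.0, 1.0]])

p_attack_softmax = softmax(logits)[:, 1]                 # class 1 = attack
p_attack_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])  # single-logit view
print(p_attack_softmax)
print(p_attack_sigmoid)
```

Both layouts can therefore represent the same decision function. They differ in the number of parameters in the last layer and in the label format that the respective loss expects.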