# XGBoost with CICIDS

This notebook trains an XGBoost classifier on the CICIDS2017 dataset to provide comparison statistics.  
The CICIDS2017 version used is [from the University of New Brunswick, Canada](http://www.unb.ca/cic/datasets/ids-2017.html).

In [1]:
import json
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Data loading and prep

In [2]:
def load_df(filename):
    filepath = os.path.join('CICIDS2017', filename+'.pkl')
    return pd.read_pickle(filepath)

In [3]:
cic_train_data = load_df('cic_train_data')
cic_test_data = load_df('cic_test_data')
cic_train_labels = load_df('cic_train_labels')
cic_test_labels = load_df('cic_test_labels')

We only need 6 features, so we filter accordingly.   
The mapping is as follows:  

| NSL-KDD field | CICIDS2017 field |
|---------------|---------------------|
| duration | flow_duration |
| protocol_type | protocol |
| src_bytes | total_fwd_packets |
| dst_bytes | total_backward_packets |
| count | flow_packets_per_s |
| srv_count | destination_port |
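
The mapping in the table above can also be kept in code, so that the list of CICIDS2017 columns is derived from it rather than typed a second time. This is a minimal sketch; the name `NSL_TO_CIC` is an assumption introduced here and is not part of the original notebook.

```python
# Hypothetical name; the pairs are copied from the mapping table.
NSL_TO_CIC = {
    'duration': 'flow_duration',
    'protocol_type': 'protocol',
    'src_bytes': 'total_fwd_packets',
    'dst_bytes': 'total_backward_packets',
    'count': 'flow_packets_per_s',
    'srv_count': 'destination_port',
}

# Dicts preserve insertion order (Python 3.7+), so the table order is kept.
fields = list(NSL_TO_CIC.values())
print(fields)
```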


In [4]:
fields = [
    'flow_duration', 'protocol', 'total_fwd_packets',
    'total_backward_packets', 'flow_packets_per_s', 'destination_port',
]

cic_train_data = cic_train_data.filter(fields, axis=1) 
cic_test_data = cic_test_data.filter(fields, axis=1)
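
Note that `DataFrame.filter` with a list of items keeps only the labels that exist and does not raise when a requested column is absent, so a misspelled field name would silently yield fewer than six features. A small check could guard against this. The sketch below uses a hypothetical helper, `missing_fields`, that works on any iterable of column names; the toy column list is made up for illustration.

```python
def missing_fields(columns, wanted):
    """Return the entries of `wanted` that are not present in `columns`."""
    available = set(columns)
    return [name for name in wanted if name not in available]

# Toy column list for illustration; in the notebook this would be
# cic_train_data.columns, checked before the filter is applied.
columns = ['flow_duration', 'protocol', 'destination_port']
wanted = ['flow_duration', 'protocol', 'total_fwd_packets']
print(missing_fields(columns, wanted))
```

An empty result means every requested field is available.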

## Label Translation

As we are doing binary classification, we only need to know if the entry is normal/benign (*0*) or malicious (*1*).  
For this, we load the Tokenizer word index to get the benign label representation and then define a function to transform our labels.

In [5]:
with open(os.path.join('CICIDS2017','cic_label_wordindex.json')) as json_in:
    data = json.load(json_in)
    print(data)
    normal_index = data['benign']

{'benign': 1, 'ftppatator': 2, 'sshpatator': 3, 'dosslowloris': 4, 'dosslowhttptest': 5, 'doshulk': 6, 'dosgoldeneye': 7, 'heartbleed': 8, 'bruteforce': 9, 'xss': 10, 'sqlinjection': 11, 'infiltration': 12, 'bot': 13, 'portscan': 14, 'ddos': 15}


In [6]:
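def f(x):
    # Map the encoded benign label to 0 and every attack label to 1.
    return 0 if x == normal_index else 1
f = np.vectorize(f)  # apply the mapping element-wise to NumPy arrays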

In [7]:
cic_train_labels.head()

|   | label | label_encoded |
|---|--------|---------------|
| 0 | BENIGN | 1 |
| 1 | BENIGN | 1 |
| 2 | BENIGN | 1 |
| 3 | BENIGN | 1 |
| 4 | BENIGN | 1 |


In [8]:
cic_train_labels = f(cic_train_labels['label_encoded'].values)
cic_test_labels = f(cic_test_labels['label_encoded'].values)

In [9]:
cic_train_labels[:5]

array([0, 0, 0, 0, 0])
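
`np.vectorize` is a convenience wrapper that still calls the Python function once per element. As an alternative sketch, assuming the encoded labels are held in a NumPy integer array, the same mapping can be written as a single array comparison. The sample values below are made up for illustration.

```python
import numpy as np

normal_index = 1  # value of data['benign'] in the word index printed earlier

# Made-up encoded labels: 1 is benign, any other value is an attack class.
encoded = np.array([1, 1, 6, 14, 1, 15])

# True where the label is not benign, then cast the booleans to 0/1.
binary = (encoded != normal_index).astype(int)
print(binary)
```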

## Building and training the model

In [10]:
print("Training Set Size:",len(cic_train_data))
print("Training Label Size:",len(cic_train_labels))
print("Test Set Size:",len(cic_test_data))
print("Test Label Size:",len(cic_test_labels))

Training Set Size: 1839982
Training Label Size: 1839982
Test Set Size: 990761
Test Label Size: 990761


In [11]:
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,
    n_jobs=-1,
    random_state=0
)

xgb.fit(cic_train_data, cic_train_labels)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=-1, nthread=None, objective='binary:logistic',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [12]:
from sklearn.metrics import accuracy_score

cic_test_predictions = xgb.predict(cic_test_data)
predicted = [round(value) for value in cic_test_predictions]

accuracy = accuracy_score(cic_test_labels, predicted)


In [13]:
print("Mean accuracy score: {0:1.3f}".format(accuracy))

Mean accuracy score: 0.795
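
A single accuracy figure is easier to interpret next to a majority-class baseline and per-class error counts. The sketch below uses made-up label arrays, not the notebook's results, and the helper name `binary_report` is hypothetical. It assumes labels are 0 (benign) and 1 (malicious), as defined in the label translation step.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Return confusion counts and derived metrics for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    n = len(y_true)
    return {
        'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
        'accuracy': (tp + tn) / n,
        # Accuracy obtained by always predicting the more frequent class.
        'baseline': max(tp + fn, tn + fp) / n,
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]
report = binary_report(y_true, y_pred)
print(report)
```

With the notebook's variables, this would be called as `binary_report(cic_test_labels, predicted)`.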
