# Use of Machine Learning Methods for Traffic Classification

Playing with supervised and unsupervised machine learning for classification purposes. 

The main tools used are Python, pandas, scikit-learn and Keras (with TensorFlow as backend). 

The [NSL-KDD](https://www.unb.ca/cic/datasets/nsl.html) dataset is used as a concise representation of a real-life network. The aim is to understand, mine and process the data, and to shape the best model for identifying incoming data packets on a flow in real time. 

After lots of attempts and approaches, I've included here some of the most relevant.

## Summary

1. [Project Description](#1.-Project-Description)
2. [Libraries and Versions](#2.-Libraries_and_Version)
3. [Dataset Loading](#3.-Dataset_Loading)
4. [Analytics and Visualization](#4.Analytics)
5. [Pre-processing](#5.Pre-Processing)
6. [Feature Selection and Dimensionality Reduction](#6.Feature_Selection_and_Dimensionality_Reduction)
7. [Classic Models](#7.Classic_Models)
8. [Ensemble Approaches](#8.Ensemble_Approaches)
9. [ANN Model](#9.ANN_Model)
10. [Results and Conclusion](#10.Results_and_Conclusion) 

## 1. Project Description

The internet is now something invisible and inherent to our civilization, like electricity. 
Almost every human being tends to spend part of their day on communication apps, devices or tools. 
We are constantly applying our social skills through the virtual realms.

Computer networks are the backbone of all digital communications being made around the world today. Smartphones, notebooks, smart homes and any other connected device exchange data through one of these networks. Although they may use 'different' protocols, they are all subject to the same networks and routers. Internet traffic is huge. IPv6 is already working and supporting this massive demand for information. There are somewhere between 22 and 26 billion devices connected to the internet [1]. These numbers tend to grow massively as the internet of things becomes the rule. More traffic means more threats: sophisticated viruses and malware, zero-day attacks, and even the now popular term “cyber war”.

Using machine learning techniques in offensive and defensive security is not new. In this project, various classifiers are used to identify unknown and / or unwanted traffic. Their results are compared with each other and with the more common approach, signature-based systems. We chose a well-known dataset in intrusion detection research, the NSL-KDD. 

An intrusion detection system (IDS) is used to monitor traffic in real time, classifying threats by level. It can also become an autonomous system that drops anomalous, inappropriate and / or unauthorized packets: an IPS (Intrusion Prevention System). An IPS can be seen as an extension of an IDS.

As a computer networks engineer working at a company that develops network defensive security devices based on dynamic, automated, over-the-air updates of signatures and rules, my personal and professional motivations for this project are clear.

## 2. Libraries and Versions

Versions for reproducibility
------------------- -------
- Package / Version
<br>
<br>Python 3.6.7 (needed for tensorflow)
<br>
<br>h5py                2.9.0
<br>pandas              0.23.4
<br>numpy               1.16.0
<br>scipy               1.2.0
<br>
<br>ipykernel           5.1.0
<br>ipython             7.2.0
<br>
<br>scikit-learn        0.20.2
<br>
<br>Keras               2.2.4
<br>Keras-Applications  1.0.6
<br>Keras-Preprocessing 1.0.5
<br>
<br>tensorboard         1.12.2
<br>tensorflow-gpu      1.12.0

In [1]:
# Libraries
import h5py
import pandas as pd
import numpy as np

from IPython.display import display, HTML

In [2]:
from sklearn.preprocessing import (Normalizer, 
                                   StandardScaler, 
                                   MinMaxScaler,
                                   OneHotEncoder)

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

from sklearn.metrics import (precision_score, 
                             recall_score,
                             f1_score, 
                             accuracy_score,
                             mean_squared_error,
                             mean_absolute_error)

In [3]:
from keras.utils import np_utils
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.layers import Dense, Dropout, Activation, Embedding
from keras.layers import LSTM, SimpleRNN, GRU
from keras import callbacks
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, CSVLogger

Using TensorFlow backend.


## 3. Dataset Loading

The [NSL-KDD](https://www.unb.ca/cic/datasets/nsl.html) dataset
is a prime tool for research and improvement of intrusion detection approaches. The good quality of this data presents an ideal scenario for offline work and benchmarks. The NSL-KDD dataset is an improved version of the KDD'99 dataset. 
Almost all of the inherent drawbacks of the KDD Cup'99 dataset have been handled in the NSL-KDD. Some of the corrections and refinements are:
<br>
- Redundant and / or duplicated records were removed from the train and test sets, so classifiers are not biased towards the more frequent records, and evaluation does not favor methods that simply detect those frequent records well. 
<br><br>
- There are enough records in the train and test sets, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion or use some technique like random cross validation. This enables evaluation results of different works to be consistent and comparable.
<br><br>
- The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD data set. As a result, the classification rates of distinct machine learning methods vary in a wider range, which makes it more efficient to have an accurate evaluation of different learning techniques.

The NSL-KDD is not perfect. It still has some problems, as it does not exactly match a real network. However, it is perhaps the most effective benchmark for researching and comparing different methods and techniques for flow detection. 
-- --

#### Train set features and types

Exploring the dataset, we can count forty-two attributes (plus the difficulty level), which we will treat as features of the flow. Each record has an assigned label, either a type of attack or normal traffic. Besides what was considered above, we can access detailed information on all 42 attributes, the classes of network connection (1 normal class and 4 attack classes) and a description of each attack class.

In [4]:
### NSL-KDD DataSet ###
TRAIN = 'NSL-KDD/KDDTrain+.txt'
TEST = 'NSL-KDD/KDDTest+.txt'

# features / attributes / columns names
feat_names = [
     'duration',
     'protocol_type',
     'service',
     'flag',
     'src_bytes',
     'dst_bytes',
     'land',
     'wrong_fragment',
     'urgent',
     'hot',
     'num_failed_logins',
     'logged_in',
     'num_compromised',
     'root_shell',
     'su_attempted',
     'num_root',
     'num_file_creations',
     'num_shells',
     'num_access_files',
     'num_outbound_cmds',
     'is_host_login',
     'is_guest_login', 
     'count',
     'srv_count',    
     'serror_rate',
     'srv_serror_rate',
     'rerror_rate',
     'srv_rerror_rate',
     'same_srv_rate',
     'diff_srv_rate',
     'srv_diff_host_rate',
     'dst_host_count',
     'dst_host_srv_count',
     'dst_host_same_srv_rate',
     'dst_host_diff_srv_rate',
     'dst_host_same_src_port_rate',
     'dst_host_srv_diff_host_rate',
     'dst_host_serror_rate',
     'dst_host_srv_serror_rate',
     'dst_host_rerror_rate',
     'dst_host_srv_rerror_rate',
     'class',
     'difficult_level'
]

# import dataset adding column / feature names
df_train_raw = pd.read_csv(TRAIN, 
                       names=feat_names,
                       index_col=False)

df_test_raw = pd.read_csv(TEST, 
                      names=feat_names,
                      index_col=False)
df_train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Data columns (total 43 columns):
duration                       125973 non-null int64
protocol_type                  125973 non-null object
service                        125973 non-null object
flag                           125973 non-null object
src_bytes                      125973 non-null int64
dst_bytes                      125973 non-null int64
land                           125973 non-null int64
wrong_fragment                 125973 non-null int64
urgent                         125973 non-null int64
hot                            125973 non-null int64
num_failed_logins              125973 non-null int64
logged_in                      125973 non-null int64
num_compromised                125973 non-null int64
root_shell                     125973 non-null int64
su_attempted                   125973 non-null int64
num_root                       125973 non-null int64
num_file_creations             125973 

#### Type of data (numeric, nominal, categorical, binary)

In [5]:
# analyze type of data and segment columns
nominal = [1, 2, 3]
binary = [6, 11, 13, 14, 20, 21]
numeric = list(set(range(41)).difference(nominal).difference(binary))

feat_names = np.array(feat_names)

nominal_feats = feat_names[nominal].tolist()
binary_feats = feat_names[binary].tolist()
numeric_feats = feat_names[numeric].tolist()

#### Traffic classification for train and test sets

In [6]:
print('Train Set')
display(df_train_raw['class'].value_counts())

Train Set


normal             67343
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: class, dtype: int64

In [7]:
print('Test Set')
display(df_test_raw['class'].value_counts())

Test Set


normal             9711
neptune            4657
guess_passwd       1231
mscan               996
warezmaster         944
apache2             737
satan               735
processtable        685
smurf               665
back                359
snmpguess           331
saint               319
mailbomb            293
snmpgetattack       178
portsweep           157
ipsweep             141
httptunnel          133
nmap                 73
pod                  41
buffer_overflow      20
multihop             18
named                17
ps                   15
sendmail             14
xterm                13
rootkit              13
teardrop             12
xlock                 9
land                  7
xsnoop                4
ftp_write             3
loadmodule            2
perl                  2
sqlattack             2
udpstorm              2
worm                  2
phf                   2
imap                  1
Name: class, dtype: int64

Maps traffic type into four attack classes / normal class. 
<br>Maps traffic for a binary classification (attack / normal) 

In [8]:
# dict that maps attacks to four groups for multiclass classification
attack_types5 = {
    'normal': 'normal',
    
    'back': 'DoS',
    'land': 'DoS',
    'neptune': 'DoS',
    'pod': 'DoS',
    'smurf': 'DoS',
    'teardrop': 'DoS',
    'mailbomb': 'DoS',
    'apache2': 'DoS',
    'processtable': 'DoS',
    'udpstorm': 'DoS',
    
    'ipsweep': 'Probe',
    'nmap': 'Probe',
    'portsweep': 'Probe',
    'satan': 'Probe',
    'mscan': 'Probe',
    'saint': 'Probe',

    'ftp_write': 'R2L',
    'guess_passwd': 'R2L',
    'imap': 'R2L',
    'multihop': 'R2L',
    'phf': 'R2L',
    'spy': 'R2L',
    'warezclient': 'R2L',
    'warezmaster': 'R2L',
    'sendmail': 'R2L',
    'named': 'R2L',
    'snmpgetattack': 'R2L',
    'snmpguess': 'R2L',
    'xlock': 'R2L',
    'xsnoop': 'R2L',
    'worm': 'R2L',
    
    'buffer_overflow': 'U2R',
    'loadmodule': 'U2R',
    'perl': 'U2R',
    'rootkit': 'U2R',
    'httptunnel': 'U2R',
    'ps': 'U2R',    
    'sqlattack': 'U2R',
    'xterm': 'U2R'
}

# dict that maps attacks and normal traffic for binary classification
attack_types2 = dict()
for key, value in attack_types5.items():
    attack_types2[key] = 'normal' if value == 'normal' else 'attack'

In [9]:
# maps attack types (multiclass and binary) / train data
df_train_01 = df_train_raw.copy()
class_2 = df_train_01['class'].map(attack_types2).rename('class_2')
class_5 = df_train_01['class'].map(attack_types5).rename('class_5')

# drops features 'class' and 'difficult_level' as they will not be useful for this project
df_train_01.drop(['class','difficult_level'], axis=1, inplace=True)
df_train_01 = pd.concat([df_train_01, class_2, class_5], axis=1)

In [10]:
# maps attack types (multiclass and binary) / test data
df_test_01 = df_test_raw.copy()
class_2 = df_test_01['class'].map(attack_types2).rename('class_2')
class_5 = df_test_01['class'].map(attack_types5).rename('class_5')

# drops features 'class' and 'difficult_level' as they will not be useful for this project
df_test_01.drop(['class','difficult_level'], axis=1, inplace=True)
df_test_01 = pd.concat([df_test_01, class_2, class_5], axis=1)
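
`Series.map` with a dictionary returns NaN for any label that is missing from the dictionary, so a label that only shows up in the test set could slip through unnoticed. The sketch below shows one way to check for that. The shortened mapping, the toy labels and the name `newattack` are made up for this illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical, shortened mapping used only for this illustration
toy_mapping = {'normal': 'normal', 'neptune': 'DoS', 'satan': 'Probe'}

# Toy labels; 'newattack' is deliberately missing from the mapping
labels = pd.Series(['normal', 'neptune', 'satan', 'newattack'], name='class')

mapped = labels.map(toy_mapping)

# Series.map returns NaN for keys that are absent from the dict
unmapped = sorted(labels[mapped.isna()].unique())
print(unmapped)
```

An empty list here would mean that every label is covered by the mapping.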

## 4. Analytics


Some numbers about the features / columns.

In [11]:
# count_id to groupby and counts
count_id = range(df_train_01.shape[0])
df_train_01.insert(43,'counts', count_id)

count_id = range(df_test_01.shape[0])
df_test_01.insert(43,'counts', count_id)

#### - Attacks and Normal traffic

In [12]:
# Attacks vs normal traffic (Train Data)
display(df_train_01[['class_2', 'counts']].groupby(
    'class_2').count().transpose())

| class_2 | attack | normal |
|---------|--------|--------|
| counts  | 58630  | 67343  |


In [13]:
# Attacks vs normal traffic (Test Data)
display(df_test_01[['class_2', 'counts']].groupby(
    'class_2').count().transpose())

| class_2 | attack | normal |
|---------|--------|--------|
| counts  | 12833  | 9711   |


In [14]:
# Types of attacks vs normal traffic (Train Data)
display(df_train_01[['class_5', 'counts']].groupby(
    'class_5').count().transpose())

| class_5 | DoS   | Probe | R2L | U2R | normal |
|---------|-------|-------|-----|-----|--------|
| counts  | 45927 | 11656 | 995 | 52  | 67343  |


In [15]:
# Types of attacks vs normal traffic (Test Data)
display(df_test_01[['class_5', 'counts']].groupby(
    'class_5').count().transpose())

| class_5 | DoS  | Probe | R2L  | U2R | normal |
|---------|------|-------|------|-----|--------|
| counts  | 7458 | 2421  | 2754 | 200 | 9711   |
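
The train-set counts per class shown above are very uneven. A small sketch, using only those printed counts, turns them into shares of the whole train set so the imbalance is easier to see:

```python
# Train-set counts per class, copied from the output above
train_counts = {'DoS': 45927, 'Probe': 11656, 'R2L': 995, 'U2R': 52, 'normal': 67343}

total = sum(train_counts.values())
shares = {name: count / total for name, count in train_counts.items()}

# Largest class first
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {share:.2%}')
```

Classes as rare as U2R and R2L are worth keeping in mind when reading accuracy figures, since a classifier can score well overall while missing most of them.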


#### -  Nominal Features

In [16]:
# Nominal features ('protocol_type') /  3 unique items
display(df_train_01[['protocol_type','counts']].groupby(
    'protocol_type').count().transpose())

| protocol_type | icmp | tcp    | udp   |
|---------------|------|--------|-------|
| counts        | 8291 | 102689 | 14993 |


In [17]:
# Nominal features ('Services') / 70 unique items
display(df_train_01[['service','counts']].groupby(
    'service').count().transpose())

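| service | IRC | X11 | Z39_50 | aol | auth | bgp | courier | csnet_ns | ctf | daytime | ... | telnet | tftp_u | tim_i | time | urh_i | urp_i | uucp | uucp_path | vmnet | whois |
|---------|-----|-----|--------|-----|------|-----|---------|----------|-----|---------|-----|--------|--------|-------|------|-------|-------|------|-----------|-------|-------|
| counts  | 187 | 73  | 862    | 2   | 955  | 710 | 734     | 545      | 563 | 521     | ... | 2353   | 3      | 8     | 654  | 10    | 602   | 780  | 689       | 617   | 693   |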


In [18]:
# Nominal features ('flag') / 11 unique items
display(df_train_01[['flag','counts']].groupby(
    'flag').count().transpose())

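| flag   | OTH | REJ   | RSTO | RSTOS0 | RSTR | S0    | S1  | S2  | S3 | SF    | SH  |
|--------|-----|-------|------|--------|------|-------|-----|-----|----|-------|-----|
| counts | 46  | 11233 | 1562 | 103    | 2421 | 34851 | 365 | 127 | 49 | 74945 | 271 |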


#### -  Numeric Features

In [19]:
display(df_train_01[numeric_feats].describe().transpose()[['min','max']])

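| feature            | min | max          |
|--------------------|-----|--------------|
| duration           | 0.0 | 42908.0      |
| src_bytes          | 0.0 | 1379964000.0 |
| dst_bytes          | 0.0 | 1309937000.0 |
| wrong_fragment     | 0.0 | 3.0          |
| urgent             | 0.0 | 3.0          |
| hot                | 0.0 | 77.0         |
| num_failed_logins  | 0.0 | 5.0          |
| num_compromised    | 0.0 | 7479.0       |
| num_root           | 0.0 | 7468.0       |
| num_file_creations | 0.0 | 43.0         |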


The feature 'num_outbound_cmds' contains only 0.0 values (min: 0.0, max: 0.0), 
<br>so it is ok to drop it from both the train and test sets.

In [20]:
# drop from train and test sets
df_train_01.drop(['num_outbound_cmds'], axis=1, inplace=True)
df_test_01.drop(['num_outbound_cmds'], axis=1, inplace=True)

# remove from numeric features list
numeric_feats.remove('num_outbound_cmds')

#### -  Binary Features

In [21]:
display(df_train_01[binary_feats].describe().transpose()[['min','max']])

| feature        | min | max |
|----------------|-----|-----|
| land           | 0.0 | 1.0 |
| logged_in      | 0.0 | 1.0 |
| root_shell     | 0.0 | 1.0 |
| su_attempted   | 0.0 | 2.0 |
| is_host_login  | 0.0 | 1.0 |
| is_guest_login | 0.0 | 1.0 |


The feature 'su_attempted' has 3 values, but it should be binary. 

In [22]:
display(df_train_01['su_attempted'].unique())

array([0, 1, 2])

In [23]:
# replace the value 2 with 0 in the feature 'su_attempted' on both sets
df_train_01['su_attempted'] = df_train_01['su_attempted'].replace(2, 0)
df_test_01['su_attempted'] = df_test_01['su_attempted'].replace(2, 0)

display(df_train_01['su_attempted'].unique())

array([0, 1])

## 5. Pre-processing Data

First, we take all categorical features and one-hot encode them. 
<br>This can be done with sklearn's OneHotEncoder or any other OHE function. 
<br>Here, pandas get_dummies inside a custom function gets the job done.

In [24]:
# One hot encoding for categorical features
def ohe(df, cols_and_prefs):
    return [pd.get_dummies(df[col], prefix=prefix) for col,prefix in cols_and_prefs]

# Categorical features and prefixes for newly one hot encoded ones
cols_and_prefs = [('protocol_type','proto'), ('service','serv'), ('flag','flag')]
df_train_ohe = ohe(df_train_01[nominal_feats], cols_and_prefs)
df_test_ohe = ohe(df_test_01[nominal_feats], cols_and_prefs)
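
As a small illustration of what `pd.get_dummies` with a prefix produces, here is a sketch on a toy column. The values are illustrative, and `dtype=int` is only added here so the indicators print as 0 / 1.

```python
import pandas as pd

# Toy frame with one nominal column (values are illustrative)
toy = pd.DataFrame({'protocol_type': ['tcp', 'udp', 'tcp', 'icmp']})

# One indicator column per category, named <prefix>_<category>
encoded = pd.get_dummies(toy['protocol_type'], prefix='proto', dtype=int)

print(list(encoded.columns))
print(encoded.sum(axis=1).tolist())  # exactly one indicator is set per row
```

Only categories present in the data get a column, which is why the train and test sets can end up with different encoded columns.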

Drop the 'counts' column from both datasets (it was only used for analytics).

In [25]:
# Train data
df_train_01.drop(['counts'], axis=1, inplace=True)

# Test data
df_test_01.drop(['counts'], axis=1, inplace=True)

Custom function to remove categorical columns and add encoded ones

In [26]:
# remove old columns and add ohe / encoded ones    
def del_add_cols(df, cols_and_prefs, cols_list):
    temp_df = df.copy()
    del_list = [col[0] for col in cols_and_prefs]
    for col in del_list:
        temp_df.drop([col], axis=1, inplace=True)
    for new_col in cols_list:
        temp_df = pd.concat([temp_df, new_col], axis=1)
    return temp_df

# Train data
df_train_02 = del_add_cols(df_train_01, cols_and_prefs, df_train_ohe)
    
# Test data
df_test_02 = del_add_cols(df_test_01, cols_and_prefs, df_test_ohe)

After one-hot encoding, there are six features found only in the train data:

In [27]:
list(df_train_02.columns.difference(df_test_02.columns))

['serv_aol',
 'serv_harvest',
 'serv_http_2784',
 'serv_http_8001',
 'serv_red_i',
 'serv_urh_i']

Adding empty columns / zero value rows (for missing data), to the test data.

In [28]:
diff_cols = list(df_train_02.columns.difference(df_test_02.columns))
for col in diff_cols:
    df_test_02[col] = 0
    
df_train_02.columns.difference(df_test_02.columns)

Index([], dtype='object')
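
Assigning a new column with `df[col] = 0` appends it at the end of the frame, so the test columns can end up in a different order from the train columns even though the column sets match. Since later cells pick features by position with `iloc`, it may be worth aligning the order as well. Below is a minimal sketch on toy frames; the frames and their values are illustrative, and this is one possible alignment step, not the code used to produce the results in this notebook.

```python
import pandas as pd

# Toy stand-ins for the encoded train / test frames (illustrative values)
train = pd.DataFrame({'serv_aol': [1, 0], 'serv_http': [0, 1], 'flag_SF': [1, 1]})
test = pd.DataFrame({'serv_http': [1, 1], 'flag_SF': [0, 1]})

# reindex adds the columns missing from the test frame (filled with 0)
# and puts every column in the same position as in the train frame
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(list(test_aligned.columns))
```

With both frames in the same order, a positional index computed on the train data refers to the same feature in the test data.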

Segmenting the datasets by class of traffic: one segment per attack class, each paired with normal traffic

In [29]:
# train data segmented per attack class / normal traffic
df_train_probe = df_train_02[df_train_02['class_5'].isin(['normal', 'Probe'])].copy()
df_train_dos = df_train_02[df_train_02['class_5'].isin(['normal', 'DoS'])].copy()
df_train_u2r = df_train_02[df_train_02['class_5'].isin(['normal', 'U2R'])].copy()
df_train_r2l = df_train_02[df_train_02['class_5'].isin(['normal', 'R2L'])].copy()

# test data segmented per attack class / normal traffic
df_test_probe = df_test_02[df_test_02['class_5'].isin(['normal', 'Probe'])].copy()
df_test_dos = df_test_02[df_test_02['class_5'].isin(['normal', 'DoS'])].copy()
df_test_u2r = df_test_02[df_test_02['class_5'].isin(['normal', 'U2R'])].copy()
df_test_r2l = df_test_02[df_test_02['class_5'].isin(['normal', 'R2L'])].copy()

Targets for train and test data, for each class of traffic

In [30]:
# multiclass dict
attack_types_num = {'normal': 0, 'Probe': 1, 'DoS': 2, 'U2R': 3, 'R2L': 4}

# all data
df_train = df_train_02.copy()
target_train2 = df_train['class_2'].map({'normal': 0, 'attack': 1}).rename('label')
target_train5 = df_train['class_5'].map(attack_types_num).rename('label')
df_train.drop(['class_2', 'class_5'], axis=1, inplace=True)

df_test = df_test_02.copy()
target_test2 = df_test['class_2'].map({'normal': 0, 'attack': 1}).rename('label')
target_test5 = df_test['class_5'].map(attack_types_num).rename('label')
df_test.drop(['class_2', 'class_5'], axis=1, inplace=True)

# segmented train data
target_train_probe = df_train_probe['class_5'].map({'normal':0, 'Probe': 1}).rename('probe')
target_train_dos = df_train_dos['class_5'].map({'normal':0, 'DoS': 1}).rename('dos')
target_train_u2r = df_train_u2r['class_5'].map({'normal':0, 'U2R': 1}).rename('u2r')
target_train_r2l = df_train_r2l['class_5'].map({'normal':0, 'R2L': 1}).rename('r2l')

df_train_probe.drop(['class_2', 'class_5'], axis=1, inplace=True)
df_train_dos.drop(['class_2', 'class_5'], axis=1, inplace=True)
df_train_u2r.drop(['class_2', 'class_5'], axis=1, inplace=True)
df_train_r2l.drop(['class_2', 'class_5'], axis=1, inplace=True)

# segmented test data
target_test_probe = df_test_probe['class_5'].map({'normal':0, 'Probe': 1}).rename('probe')
target_test_dos = df_test_dos['class_5'].map({'normal':0, 'DoS': 1}).rename('dos')
target_test_u2r = df_test_u2r['class_5'].map({'normal':0, 'U2R': 1}).rename('u2r')
target_test_r2l = df_test_r2l['class_5'].map({'normal':0, 'R2L': 1}).rename('r2l')

df_test_probe.drop(['class_2', 'class_5'], axis=1, inplace=True)
df_test_dos.drop(['class_2', 'class_5'], axis=1, inplace=True)
df_test_u2r.drop(['class_2', 'class_5'], axis=1, inplace=True)
df_test_r2l.drop(['class_2', 'class_5'], axis=1, inplace=True)

Data normalization 

In [31]:
# all data
X = df_train
y2 = target_train2
y5 = target_train5
X_train = np.array(Normalizer().fit_transform(X))
y_train2 = np.array(y2)
y_train5 = np.array(y5)

X = df_test
y2 = target_test2
y5 = target_test5
X_test = np.array(Normalizer().fit_transform(X))
y_test2 = np.array(y2)
y_test5 = np.array(y5)


# Probe traffic
X = df_train_probe
y = target_train_probe
X_train_probe = np.array(Normalizer().fit_transform(X))
y_train_probe = np.array(y)

X = df_test_probe
y = target_test_probe
X_test_probe = np.array(Normalizer().fit_transform(X))
y_test_probe = np.array(y)


# DoS traffic
X = df_train_dos
y = target_train_dos
X_train_dos = np.array(Normalizer().fit_transform(X))
y_train_dos = np.array(y)

X = df_test_dos
y = target_test_dos
X_test_dos = np.array(Normalizer().fit_transform(X))
y_test_dos = np.array(y)


# U2R traffic
X = df_train_u2r
y = target_train_u2r
X_train_u2r = np.array(Normalizer().fit_transform(X))
y_train_u2r = np.array(y)

X = df_test_u2r
y = target_test_u2r
X_test_u2r = np.array(Normalizer().fit_transform(X))
y_test_u2r = np.array(y)


# R2L traffic
X = df_train_r2l
y = target_train_r2l
X_train_r2l = np.array(Normalizer().fit_transform(X))
y_train_r2l = np.array(y)

X = df_test_r2l
y = target_test_r2l
X_test_r2l = np.array(Normalizer().fit_transform(X))
y_test_r2l = np.array(y)
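
A note on what `Normalizer` does: it rescales each sample (row) independently to unit norm, rather than scaling each feature (column) as `StandardScaler` or `MinMaxScaler` do. It learns nothing from the data, so calling `fit_transform` separately on the train and test sets gives the same result as reusing one fitted instance. A small sketch on toy values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two toy samples (rows) with very different magnitudes
X_toy = np.array([[3.0, 4.0],
                  [30.0, 40.0]])

X_norm = Normalizer().fit_transform(X_toy)  # default norm='l2'

# Each row now has unit L2 norm; both rows become [0.6, 0.8]
print(X_norm)
print(np.linalg.norm(X_norm, axis=1))
```

After this step only the proportions between the features within a record are kept, not their absolute magnitudes.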

## 6. Feature Selection and Dimensionality Reduction

For each attack class, 20 features / columns are selected for dimensionality reduction purposes.
<br>That is around 16% of the total features.

#### Probe selected features

In [32]:
kb_probe = SelectKBest(chi2, k=20)
kb_probe.fit_transform(X_train_probe, target_train_probe);

In [33]:
true = kb_probe.get_support()
ix_probe = [i for i,x in enumerate(true) if x]

X_train_probe_kb = df_train_probe.iloc[:, ix_probe]
X_test_probe_kb = df_test_probe.iloc[:, ix_probe]
display(list(X_train_probe_kb.columns))

# normalize selected features
X_train_probe_kb = np.array(Normalizer().fit_transform(X_train_probe_kb))
X_test_probe_kb = np.array(Normalizer().fit_transform(X_test_probe_kb))

['duration',
 'src_bytes',
 'dst_bytes',
 'count',
 'srv_count',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_srv_rerror_rate',
 'proto_icmp',
 'serv_eco_i',
 'serv_private',
 'flag_SF']
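
For reference, here is a small sketch of how `SelectKBest` with `chi2` ranks features, using toy data made up for this illustration. It also uses `get_support(indices=True)`, which returns the integer positions of the kept columns directly and could stand in for the `enumerate` loop.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative data: column 0 tracks the label, columns 1 and 2 do not
X_toy = np.array([[5, 1, 2],
                  [6, 1, 2],
                  [0, 1, 2],
                  [1, 1, 2]], dtype=float)
y_toy = np.array([1, 1, 0, 0])

selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# Integer positions of the kept columns, without a manual enumerate loop
kept = selector.get_support(indices=True)
print(kept)
```

Note that `chi2` expects non-negative feature values, which the counts, rates and one-hot columns used here satisfy.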

#### DoS selected features

In [34]:
kb_dos = SelectKBest(chi2, k=20)
kb_dos.fit_transform(X_train_dos, target_train_dos);

In [35]:
true = kb_dos.get_support()
ix_dos = [i for i,x in enumerate(true) if x]

X_train_dos_kb = df_train_dos.iloc[:, ix_dos]
X_test_dos_kb = df_test_dos.iloc[:, ix_dos]
display(list(X_train_dos_kb.columns))

# normalize selected features
X_train_dos_kb = np.array(Normalizer().fit_transform(X_train_dos_kb))
X_test_dos_kb = np.array(Normalizer().fit_transform(X_test_dos_kb))

['duration',
 'src_bytes',
 'dst_bytes',
 'logged_in',
 'count',
 'serror_rate',
 'srv_serror_rate',
 'same_srv_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'proto_tcp',
 'proto_udp',
 'serv_domain_u',
 'serv_http',
 'serv_private',
 'flag_S0',
 'flag_SF']

#### U2R selected features

In [36]:
kb_u2r = SelectKBest(chi2, k=20)
kb_u2r.fit_transform(X_train_u2r, target_train_u2r);

In [37]:
true = kb_u2r.get_support()
ix_u2r = [i for i,x in enumerate(true) if x]

X_train_u2r_kb = df_train_u2r.iloc[:, ix_u2r]
X_test_u2r_kb = df_test_u2r.iloc[:, ix_u2r]
display(list(X_train_u2r_kb.columns))

# normalize selected features
X_train_u2r_kb = np.array(Normalizer().fit_transform(X_train_u2r_kb))
X_test_u2r_kb = np.array(Normalizer().fit_transform(X_test_u2r_kb))

['src_bytes',
 'dst_bytes',
 'hot',
 'num_failed_logins',
 'num_compromised',
 'root_shell',
 'num_file_creations',
 'num_shells',
 'count',
 'srv_count',
 'same_srv_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_same_src_port_rate',
 'proto_udp',
 'serv_http',
 'serv_other',
 'serv_telnet',
 'flag_SF']

#### R2L selected features

In [38]:
kb_r2l = SelectKBest(chi2, k=20)
kb_r2l.fit_transform(X_train_r2l, target_train_r2l);

In [39]:
true = kb_r2l.get_support()
ix_r2l = [i for i,x in enumerate(true) if x]

X_train_r2l_kb = df_train_r2l.iloc[:, ix_r2l]
X_test_r2l_kb = df_test_r2l.iloc[:, ix_r2l]
display(list(X_train_r2l_kb.columns))

# normalize selected features
X_train_r2l_kb = np.array(Normalizer().fit_transform(X_train_r2l_kb))
X_test_r2l_kb = np.array(Normalizer().fit_transform(X_test_r2l_kb))

['duration',
 'src_bytes',
 'dst_bytes',
 'hot',
 'num_failed_logins',
 'logged_in',
 'is_guest_login',
 'count',
 'srv_count',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_src_port_rate',
 'proto_tcp',
 'proto_udp',
 'serv_ftp',
 'serv_ftp_data',
 'serv_http',
 'serv_imap4',
 'serv_telnet',
 'flag_RSTO']

#### Selected features considering multi and binary classifications and for all kinds of traffic

In [40]:
# for binary classification
kb_train2 = SelectKBest(chi2, k=30)
kb_train2.fit_transform(X_train, target_train2);

true = kb_train2.get_support()
ix_kb2 = [i for i,x in enumerate(true) if x]

X_train_kb2 = df_train.iloc[:, ix_kb2]
X_test_kb2 = df_test.iloc[:, ix_kb2]
display(list(X_train_kb2.columns))

# normalize selected features
X_train_kb2 = np.array(Normalizer().fit_transform(X_train_kb2))
X_test_kb2 = np.array(Normalizer().fit_transform(X_test_kb2))

['duration',
 'src_bytes',
 'dst_bytes',
 'wrong_fragment',
 'logged_in',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate',
 'proto_icmp',
 'proto_tcp',
 'proto_udp',
 'serv_domain_u',
 'serv_eco_i',
 'serv_http',
 'serv_private',
 'flag_S0']

In [41]:
# for multi classification
kb_train5 = SelectKBest(chi2, k=30)
kb_train5.fit_transform(X_train, target_train5);

true = kb_train5.get_support()
ix_kb5 = [i for i,x in enumerate(true) if x]

X_train_kb5 = df_train.iloc[:, ix_kb5]
X_test_kb5 = df_test.iloc[:, ix_kb5]
display(list(X_train_kb5.columns))

# normalize selected features
X_train_kb5 = np.array(Normalizer().fit_transform(X_train_kb5))
X_test_kb5 = np.array(Normalizer().fit_transform(X_test_kb5))

['duration',
 'src_bytes',
 'dst_bytes',
 'hot',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate',
 'proto_icmp',
 'proto_tcp',
 'serv_eco_i',
 'serv_private',
 'flag_REJ',
 'flag_RSTR',
 'flag_S0',
 'flag_SF']

## 7. Classic Models

Classic classifiers used for binary classification
<br>between 'normal' and 'attack' for each group of traffic

In [42]:
classic_clfs = [('Naive Bayes', MultinomialNB()),
                ('Decision Trees', DecisionTreeClassifier(max_depth=18, 
                                                          min_samples_leaf=2, 
                                                          min_samples_split=10)),
                ('K-Nearest Neighbor', KNeighborsClassifier(n_neighbors=2,
                                                            weights='distance')),
                ('SVM', SVC(C=1000, kernel='linear'))]

The parameters used above were obtained through GridSearchCV.
<br>The search can be found in helper.py
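
Since helper.py is not shown here, the following is only a rough sketch of what a grid search of this kind might look like. The synthetic data and the candidate values in `param_grid` are assumptions made for this illustration, not the grid actually used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; a real search would use the NSL-KDD train split
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid: these candidate values are assumptions for illustration
param_grid = {'max_depth': [6, 12, 18],
              'min_samples_leaf': [1, 2],
              'min_samples_split': [2, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_toy, y_toy)

# Best combination found within the grid, by mean cross-validated accuracy
print(search.best_params_)
```

`GridSearchCV` tries every combination in the grid, so the cost grows quickly with the number of candidate values.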

In [43]:
scorers = [('accuracy', accuracy_score), ('precision', precision_score)]
x_val_scorers = ['accuracy', 'precision']


# function to apply different classifiers on segmented data 
def classifiers(X_train, y_train, X_test, y_test, clfs, scorers=scorers, x_val_scorers=x_val_scorers):
    results = dict()
    for name,clf in clfs:
        clf.fit(X_train, y_train)
        predict = clf.predict(X_test)
        results[name] = dict()
        for s_name,score in scorers:
            results[name][s_name] = score(y_test, predict) 
        # note: 10-fold cross-validation is run on the test split (X_test, y_test)
        for x_score in x_val_scorers:
            results[name]['cross validation '+x_score] = cross_val_score(clf, X_test, y_test, 
                                                                         cv=10, scoring=x_score).mean()
    return results


def print_results(results_dict):
    bold = "\033[1m"
    reset = "\033[0;0m"
    for key,value in results_dict.items():
        print(bold,f'\n{key}:\n',reset)
        for k,v in value.items():
            print(f'{k}: {v:.2%}')

#### Probe data classification

In [44]:
probe_results = classifiers(X_train_probe, y_train_probe,
                            X_test_probe, y_test_probe,
                            classic_clfs)

In [45]:
print_results(probe_results)

Naive Bayes:
accuracy: 85.24%
precision: 96.74%
cross validation accuracy: 87.26%
cross validation precision: 90.21%

Decision Trees:
accuracy: 88.26%
precision: 68.23%
cross validation accuracy: 99.37%
cross validation precision: 98.72%

K-Nearest Neighbor:
accuracy: 91.45%
precision: 87.57%
cross validation accuracy: 98.90%
cross validation precision: 96.91%

SVM:
accuracy: 86.88%
precision: 69.62%
cross validation accuracy: 97.77%
cross validation precision: 91.87%


--- -
- Results with best features

In [46]:
probe_results_kb = classifiers(X_train_probe_kb, y_train_probe,
                               X_test_probe_kb, y_test_probe,
                               classic_clfs)

In [47]:
print_results(probe_results_kb)

Naive Bayes:
accuracy: 85.14%
precision: 96.82%
cross validation accuracy: 87.00%
cross validation precision: 89.62%

Decision Trees:
accuracy: 88.17%
precision: 89.00%
cross validation accuracy: 99.07%
cross validation precision: 98.01%

K-Nearest Neighbor:
accuracy: 91.15%
precision: 87.31%
cross validation accuracy: 98.54%
cross validation precision: 96.20%

SVM:
accuracy: 86.42%
precision: 70.97%
cross validation accuracy: 97.32%
cross validation precision: 91.15%


#### DOS data classification

In [48]:
dos_results = classifiers(X_train_dos, y_train_dos,
                          X_test_dos, y_test_dos,
                          classic_clfs)

In [49]:
print_results(dos_results)

Naive Bayes:
accuracy: 80.19%
precision: 86.22%
cross validation accuracy: 84.75%
cross validation precision: 85.71%

Decision Trees:
accuracy: 84.87%
precision: 98.99%
cross validation accuracy: 99.62%
cross validation precision: 99.66%

K-Nearest Neighbor:
accuracy: 87.80%
precision: 96.29%
cross validation accuracy: 99.11%
cross validation precision: 98.68%

SVM:
accuracy: 87.44%
precision: 98.79%
cross validation accuracy: 96.83%
cross validation precision: 93.42%


--- -
- Results with best features

In [50]:
dos_results_kb = classifiers(X_train_dos_kb, y_train_dos,
                             X_test_dos_kb, y_test_dos,
                             classic_clfs)

In [51]:
print_results(dos_results_kb)

Naive Bayes:
accuracy: 80.14%
precision: 86.27%
cross validation accuracy: 84.57%
cross validation precision: 85.30%

Decision Trees:
accuracy: 90.82%
precision: 98.20%
cross validation accuracy: 99.60%
cross validation precision: 99.57%

K-Nearest Neighbor:
accuracy: 87.90%
precision: 96.78%
cross validation accuracy: 99.02%
cross validation precision: 98.68%

SVM:
accuracy: 80.60%
precision: 83.98%
cross validation accuracy: 96.21%
cross validation precision: 92.34%


#### U2R data classification

In [52]:
u2r_results = classifiers(X_train_u2r, y_train_u2r,
                          X_test_u2r, y_test_u2r,
                          classic_clfs)


In [53]:
print_results(u2r_results)

Naive Bayes:
accuracy: 97.98%
precision: 0.00%
cross validation accuracy: 97.98%
cross validation precision: 0.00%

Decision Trees:
accuracy: 98.07%
precision: 80.00%
cross validation accuracy: 99.65%
cross validation precision: 96.43%

K-Nearest Neighbor:
accuracy: 98.06%
precision: 78.57%
cross validation accuracy: 99.11%
cross validation precision: 79.43%

SVM:
accuracy: 97.98%
precision: 0.00%
cross validation accuracy: 99.07%
cross validation precision: 99.09%
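
A precision of 0.00% next to an accuracy near 98% is what one would expect when a classifier never predicts the attack class on a heavily imbalanced split: accuracy then equals the share of 'normal' samples, and precision has no predicted positives to divide by. The sketch below reproduces that effect in plain Python with made-up labels (98 normal, 2 attack), not the U2R data; the `zero_division` argument is modeled on the parameter of the same name in scikit-learn's `precision_score`.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1, zero_division=0.0):
    """True positives divided by predicted positives."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        # Nothing was predicted as positive, so precision is undefined.
        return zero_division
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


y_true = [0] * 98 + [1] * 2   # illustrative labels: 98 normal, 2 attack
y_pred = [0] * 100            # a classifier that always answers "normal"

acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
print(f"accuracy: {acc:.2%}")
print(f"precision: {prec:.2%}")
```

For a class this rare, recall or a confusion matrix would probably say more about the classifier than accuracy does.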


--- -
- Results with best features

In [54]:
u2r_results_kb = classifiers(X_train_u2r_kb, y_train_u2r,
                             X_test_u2r_kb, y_test_u2r,
                             classic_clfs)


In [55]:
print_results(u2r_results_kb)

Naive Bayes:
accuracy: 97.98%
precision: 0.00%
cross validation accuracy: 97.98%
cross validation precision: 0.00%

Decision Trees:
accuracy: 98.08%
precision: 81.25%
cross validation accuracy: 99.03%
cross validation precision: 77.36%

K-Nearest Neighbor:
accuracy: 98.02%
precision: 83.33%
cross validation accuracy: 98.63%
cross validation precision: 67.07%

SVM:
accuracy: 97.98%
precision: 0.00%
cross validation accuracy: 97.98%
cross validation precision: 0.00%


#### R2L data classification

In [56]:
r2l_results = classifiers(X_train_r2l, y_train_r2l,
                          X_test_r2l, y_test_r2l,
                          classic_clfs)


In [57]:
print_results(r2l_results)

Naive Bayes:
accuracy: 77.91%
precision: 0.00%
cross validation accuracy: 77.75%
cross validation precision: 28.22%

Decision Trees:
accuracy: 78.48%
precision: 95.00%
cross validation accuracy: 97.49%
cross validation precision: 95.20%

K-Nearest Neighbor:
accuracy: 81.65%
precision: 98.95%
cross validation accuracy: 97.27%
cross validation precision: 93.91%

SVM:
accuracy: 78.50%
precision: 94.05%
cross validation accuracy: 95.23%
cross validation precision: 89.73%


--- -
- Results with best features

In [58]:
r2l_results_kb = classifiers(X_train_r2l_kb, y_train_r2l,
                             X_test_r2l_kb, y_test_r2l,
                             classic_clfs)


In [59]:
print_results(r2l_results_kb)

Naive Bayes:
accuracy: 77.91%
precision: 0.00%
cross validation accuracy: 77.73%
cross validation precision: 25.56%

Decision Trees:
accuracy: 78.59%
precision: 96.70%
cross validation accuracy: 97.48%
cross validation precision: 94.72%

K-Nearest Neighbor:
accuracy: 81.57%
precision: 98.93%
cross validation accuracy: 97.15%
cross validation precision: 93.62%

SVM:
accuracy: 77.94%
precision: 83.33%
cross validation accuracy: 92.60%
cross validation precision: 84.36%


### All data binary classification

In [60]:
all_class_bin_results = classifiers(X_train, y_train2,
                                    X_test, y_test2,
                                    classic_clfs)

In [61]:
print_results(all_class_bin_results)

Naive Bayes:
accuracy: 70.88%
precision: 86.66%
cross validation accuracy: 82.41%
cross validation precision: 82.82%

Decision Trees:
accuracy: 60.48%
precision: 84.09%
cross validation accuracy: 98.01%
cross validation precision: 98.24%

K-Nearest Neighbor:
accuracy: 77.86%
precision: 96.83%
cross validation accuracy: 97.24%
cross validation precision: 97.42%

SVM:
accuracy: 70.10%
precision: 90.06%
cross validation accuracy: 92.81%
cross validation precision: 90.53%


--- -
- Results with best features

In [62]:
all_class_bin_results_kb = classifiers(X_train_kb2, y_train2,
                                       X_test_kb2, y_test2,
                                       classic_clfs)

In [63]:
print_results(all_class_bin_results_kb)

Naive Bayes:
accuracy: 70.83%
precision: 86.64%
cross validation accuracy: 82.34%
cross validation precision: 82.79%

Decision Trees:
accuracy: 75.02%
precision: 96.66%
cross validation accuracy: 97.91%
cross validation precision: 98.33%

K-Nearest Neighbor:
accuracy: 77.62%
precision: 96.76%
cross validation accuracy: 97.00%
cross validation precision: 97.23%

SVM:
accuracy: 67.83%
precision: 86.97%
cross validation accuracy: 91.63%
cross validation precision: 90.60%


### All data multi classification
<br> Precision averaging is changed to 'macro' because this is a multi-class problem.

In [44]:
def classifiers_macro(X_train, y_train, X_test, y_test, clfs, scorers=scorers, x_val_scorers=x_val_scorers):
    results = dict()
    for name,clf in clfs:
        clf.fit(X_train, y_train)
        predict = clf.predict(X_test)
        results[name] = dict()
        for s_name,score in scorers:
            if s_name == 'precision':
                results[name][s_name] = score(y_test, predict, average='macro') 
            else:
                results[name][s_name] = score(y_test, predict) 
            results[name]['cross validation accuracy'] = cross_val_score(clf, X_test, y_test, 
                                                                         cv=10, scoring='accuracy').mean()
    return results
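
With `average='macro'`, precision is computed per class and then averaged without weighting, so a rare class counts as much as a frequent one, and a class that is never predicted contributes 0. The sketch below shows that calculation in plain Python on made-up labels; `macro_precision` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision; never-predicted classes score 0."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        # True labels of the samples that were predicted as class c.
        predicted_c = [t for t, p in zip(y_true, y_pred) if p == c]
        if predicted_c:
            per_class.append(sum(t == c for t in predicted_c) / len(predicted_c))
        else:
            per_class.append(0.0)
    return sum(per_class) / len(per_class)


# Illustrative labels: class 2 is rare and is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0]
print(macro_precision(y_true, y_pred))
```

This is one way a model can reach a reasonable accuracy and still show a low macro precision, as in some of the multi-class results below.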

In [None]:
multi_class_results = classifiers_macro(X_train, y_train5,
                                        X_test, y_test5,
                                        classic_clfs)

In [89]:
print_results(multi_class_results)

Naive Bayes:
accuracy: 61.08%
cross validation accuracy: 64.56%
precision: 33.88%

Decision Trees:
accuracy: 71.89%
cross validation accuracy: 97.82%
precision: 77.24%

K-Nearest Neighbor:
accuracy: 75.18%
cross validation accuracy: 96.73%
precision: 77.58%

SVM:
accuracy: 71.88%
cross validation accuracy: 93.46%
precision: 49.77%


--- -
- Results with best features

In [90]:
multi_class_results_kb = classifiers_macro(X_train_kb5, y_train5,
                                           X_test_kb5, y_test5,
                                           classic_clfs)


In [91]:
print_results(multi_class_results_kb)

Naive Bayes:
accuracy: 61.05%
cross validation accuracy: 64.50%
precision: 32.48%

Decision Trees:
accuracy: 71.38%
cross validation accuracy: 97.20%
precision: 71.19%

K-Nearest Neighbor:
accuracy: 74.88%
cross validation accuracy: 96.32%
precision: 78.17%

SVM:
accuracy: 69.11%
cross validation accuracy: 91.40%
precision: 65.91%


## 8. Ensemble Approaches

Ensemble classifiers are used for binary classification
between 'normal' and 'attack' for each group of traffic.

In [45]:
ensemble_clfs = [('Adaboost', 
                 AdaBoostClassifier(base_estimator=DecisionTreeClassifier( 
                                    max_depth=18, 
                                    min_samples_leaf=2, 
                                    min_samples_split=10),
                                    n_estimators=100)),
                ('Random Forest', 
                 RandomForestClassifier(n_estimators=100, 
                                        max_depth=18, 
                                        min_samples_leaf=2, 
                                        min_samples_split=10))]
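
Both ensembles combine the predictions of many decision trees: a random forest pools trees trained on resampled data, while AdaBoost gives each successive tree its own weight. The sketch below illustrates only the combining step, as a plain and a weighted vote, which is a simplification of what scikit-learn does internally. The three threshold rules are hypothetical stand-ins for fitted trees, and the thresholds and weights are invented.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Hypothetical weak learners: each one is a threshold rule on a single feature.
weak_learners = [
    lambda x: "attack" if x["src_bytes"] > 1000 else "normal",
    lambda x: "attack" if x["count"] > 50 else "normal",
    lambda x: "attack" if x["duration"] > 10 else "normal",
]

sample = {"src_bytes": 5000, "count": 80, "duration": 2}
votes = [learner(sample) for learner in weak_learners]
print(votes)
print(majority_vote(votes))                           # every learner counts equally
print(weighted_vote(votes, weights=[0.2, 0.3, 0.9]))  # a heavily weighted learner can win
```

Depending on the installed scikit-learn version, the `base_estimator` argument of `AdaBoostClassifier` used above may need to be written as `estimator`, since it was renamed in later releases.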

#### Probe data classification

In [93]:
probe_results = classifiers(X_train_probe, y_train_probe,
                            X_test_probe, y_test_probe,
                            ensemble_clfs)

In [94]:
print_results(probe_results)

Adaboost:
accuracy: 89.06%
precision: 85.75%
cross validation accuracy: 99.69%
cross validation precision: 99.42%

Random Forest:
accuracy: 90.22%
precision: 87.67%
cross validation accuracy: 99.53%
cross validation precision: 98.85%


--- -
- Results with best features

In [95]:
probe_results_kb = classifiers(X_train_probe_kb, y_train_probe,
                               X_test_probe_kb, y_test_probe,
                               ensemble_clfs)

In [96]:
print_results(probe_results_kb)

Adaboost:
accuracy: 88.78%
precision: 88.02%
cross validation accuracy: 99.36%
cross validation precision: 98.59%

Random Forest:
accuracy: 88.25%
precision: 85.93%
cross validation accuracy: 99.23%
cross validation precision: 97.91%


#### DOS data classification

In [97]:
dos_results = classifiers(X_train_dos, y_train_dos,
                          X_test_dos, y_test_dos,
                          ensemble_clfs)

In [98]:
print_results(dos_results)

Adaboost:
accuracy: 87.77%
precision: 97.22%
cross validation accuracy: 99.84%
cross validation precision: 99.87%

Random Forest:
accuracy: 85.72%
precision: 99.39%
cross validation accuracy: 99.81%
cross validation precision: 99.89%


--- -
- Results with best features

In [99]:
dos_results_kb = classifiers(X_train_dos_kb, y_train_dos,
                             X_test_dos_kb, y_test_dos,
                             ensemble_clfs)

In [100]:
print_results(dos_results_kb)

Adaboost:
accuracy: 77.57%
precision: 99.34%
cross validation accuracy: 99.79%
cross validation precision: 99.80%

Random Forest:
accuracy: 88.21%
precision: 97.55%
cross validation accuracy: 99.79%
cross validation precision: 99.83%


#### U2R data classification

In [101]:
u2r_results = classifiers(X_train_u2r, y_train_u2r,
                          X_test_u2r, y_test_u2r,
                          ensemble_clfs)


In [102]:
print_results(u2r_results)

Adaboost:
accuracy: 97.98%
precision: 0.00%
cross validation accuracy: 99.71%
cross validation precision: 96.33%

Random Forest:
accuracy: 97.98%
precision: 0.00%
cross validation accuracy: 99.67%
cross validation precision: 97.21%


--- -
- Results with best features

In [103]:
u2r_results_kb = classifiers(X_train_u2r_kb, y_train_u2r,
                             X_test_u2r_kb, y_test_u2r,
                             ensemble_clfs)


In [104]:
print_results(u2r_results_kb)

Adaboost:
accuracy: 97.98%
precision: 0.00%
cross validation accuracy: 99.29%
cross validation precision: 86.13%

Random Forest:
accuracy: 98.02%
precision: 100.00%
cross validation accuracy: 99.20%
cross validation precision: 86.62%


#### R2L data classification

In [105]:
r2l_results = classifiers(X_train_r2l, y_train_r2l,
                          X_test_r2l, y_test_r2l,
                          ensemble_clfs)

In [106]:
print_results(r2l_results)

Adaboost:
accuracy: 78.56%
precision: 100.00%
cross validation accuracy: 98.06%
cross validation precision: 95.79%

Random Forest:
accuracy: 77.90%
precision: 0.00%
cross validation accuracy: 97.87%
cross validation precision: 96.40%


--- -
- Results with best features

In [107]:
r2l_results_kb = classifiers(X_train_r2l_kb, y_train_r2l,
                             X_test_r2l_kb, y_test_r2l,
                             ensemble_clfs)


In [108]:
print_results(r2l_results_kb)

Adaboost:
accuracy: 79.67%
precision: 100.00%
cross validation accuracy: 97.90%
cross validation precision: 95.83%

Random Forest:
accuracy: 77.91%
precision: 0.00%
cross validation accuracy: 98.03%
cross validation precision: 96.70%


### All data binary classification

In [109]:
all_class_bin_results = classifiers(X_train, y_train2,
                                    X_test, y_test2,
                                    ensemble_clfs)

In [110]:
print_results(all_class_bin_results)

Adaboost:
accuracy: 74.61%
precision: 97.15%
cross validation accuracy: 98.54%
cross validation precision: 98.77%

Random Forest:
accuracy: 75.18%
precision: 96.87%
cross validation accuracy: 98.54%
cross validation precision: 98.80%


--- -
- Results with best features

In [47]:
all_class_bin_results_kb = classifiers(X_train_kb2, y_train2,
                                       X_test_kb2, y_test2,
                                       ensemble_clfs)

In [48]:
print_results(all_class_bin_results_kb)

Adaboost:
accuracy: 76.80%
precision: 96.79%
cross validation accuracy: 98.35%
cross validation precision: 98.65%

Random Forest:
accuracy: 76.91%
precision: 97.03%
cross validation accuracy: 98.27%
cross validation precision: 98.69%


### All data multi classification

In [50]:
multi_class_results = classifiers_macro(X_train, y_train5,
                                        X_test, y_test5,
                                        ensemble_clfs)


In [51]:
print_results(multi_class_results)

Adaboost:
accuracy: 70.72%
cross validation accuracy: 98.42%
precision: 68.60%

Random Forest:
accuracy: 72.48%
cross validation accuracy: 98.24%
precision: 48.89%


--- -
- Results with best features

In [53]:
multi_class_results_kb = classifiers_macro(X_train_kb5, y_train5,
                                           X_test_kb5, y_test5,
                                           ensemble_clfs)


In [54]:
print_results(multi_class_results_kb)

Adaboost:
accuracy: 73.62%
cross validation accuracy: 98.23%
precision: 89.20%

Random Forest:
accuracy: 72.37%
cross validation accuracy: 98.06%
precision: 68.73%


## 9. ANN Model
<br> Deep neural networks are used here for classification. 
<br> Both scenarios are handled: 
- 'Normal' / 'Attack' binary classification; 
- The more complex multi-class classification over the 5 types of traffic. 
<br><br> The 'segmented data set' approach is skipped for the ANN model, 
<br>as it can handle all traffic together with very good accuracy scores,
<br>even with just 20 selected features.

Pre-processing data for neural net model

In [57]:
# Train data pre-processing
train_target2 = df_train_01['class_2'] 
train_target5 = df_train_01['class_5']
df_train_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
y_train2 = train_target2.astype('category').cat.codes
y_train5 = train_target5.astype('category').cat.codes

# Test data pre-processing
test_target2 = df_test_01['class_2'] 
test_target5 = df_test_01['class_5']
df_test_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
y_test2 = test_target2.astype('category').cat.codes
y_test5 = test_target5.astype('category').cat.codes

# convert labels to NN multiclass
y_train5 = np_utils.to_categorical(y_train5, 5)
y_test5 = np_utils.to_categorical(y_test5, 5)
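
The binary labels stay as integer codes, while the 5-class labels are turned into one-hot rows, which is the target format that a softmax output trained with `categorical_crossentropy` expects. The sketch below shows the transformation in plain Python; `to_one_hot` is a hypothetical helper for illustration and the codes are made up.

```python
def to_one_hot(codes, num_classes):
    """Turn integer class codes into one-hot rows."""
    rows = []
    for code in codes:
        row = [0.0] * num_classes
        row[code] = 1.0   # a single 1 at the position of the class code
        rows.append(row)
    return rows


one_hot = to_one_hot([0, 3, 4], num_classes=5)
for row in one_hot:
    print(row)
```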

Deep neural network definitions for **binary classification**
- Using all data

In [72]:
# dnn model for classification 
batch_size = 64
epochs = 100

# definitions
model_2 = Sequential()
model_2.add(Dense(1024, 
                  input_dim = X_train.shape[1], 
                  activation = 'relu'))
model_2.add(Dropout(0.1))
model_2.add(Dense(128, 
              activation = 'relu')) 
model_2.add(Dense(1))
model_2.add(Activation('sigmoid'))

# optimizers and configs
model_2.compile(loss = 'binary_crossentropy',
                optimizer = 'adam',
                metrics = ['accuracy'])

csv_logger = CSVLogger('dnn/dnn_2_analysis.csv',
                       separator = ',', 
                       append = False)

model_2.fit(X_train, y_train2, 
          validation_data = (X_test, y_test2),
          batch_size = batch_size, 
          epochs = epochs, 
          callbacks = [checkpointer, csv_logger])

model_2.save('dnn/dnn_model_2.hdf5')

Train on 125973 samples, validate on 22544 samples
Epoch 1/100

Epoch 00001: loss did not improve from 0.01047
Epoch 2/100

Epoch 00002: loss did not improve from 0.01047
Epoch 3/100

Epoch 00003: loss did not improve from 0.01047
Epoch 4/100

Epoch 00004: loss did not improve from 0.01047
Epoch 5/100

Epoch 00005: loss did not improve from 0.01047
Epoch 6/100

Epoch 00006: loss did not improve from 0.01047
Epoch 7/100

Epoch 00007: loss did not improve from 0.01047
Epoch 8/100

Epoch 00008: loss did not improve from 0.01047
Epoch 9/100

Epoch 00009: loss did not improve from 0.01047
Epoch 10/100

Epoch 00010: loss did not improve from 0.01047
Epoch 11/100

Epoch 00011: loss did not improve from 0.01047
Epoch 12/100

Epoch 00012: loss did not improve from 0.01047
Epoch 13/100

Epoch 00013: loss did not improve from 0.01047
Epoch 14/100

Epoch 00014: loss did not improve from 0.01047
Epoch 15/100

Epoch 00015: loss did not improve from 0.01047
Epoch 16/100

Epoch 00016: loss did not imp


Epoch 00044: loss improved from 0.01047 to 0.01039, saving model to dnn/checkpoints/checkpoint_2-44.hdf5
Epoch 45/100

Epoch 00045: loss did not improve from 0.01039
Epoch 46/100

Epoch 00046: loss did not improve from 0.01039
Epoch 47/100

Epoch 00047: loss did not improve from 0.01039
Epoch 48/100

Epoch 00048: loss improved from 0.01039 to 0.00992, saving model to dnn/checkpoints/checkpoint_2-48.hdf5
Epoch 49/100

Epoch 00049: loss did not improve from 0.00992
Epoch 50/100

Epoch 00050: loss improved from 0.00992 to 0.00959, saving model to dnn/checkpoints/checkpoint_2-50.hdf5
Epoch 51/100

Epoch 00051: loss did not improve from 0.00959
Epoch 52/100

Epoch 00052: loss improved from 0.00959 to 0.00941, saving model to dnn/checkpoints/checkpoint_2-52.hdf5
Epoch 53/100

Epoch 00053: loss did not improve from 0.00941
Epoch 54/100

Epoch 00054: loss improved from 0.00941 to 0.00937, saving model to dnn/checkpoints/checkpoint_2-54.hdf5
Epoch 55/100

Epoch 00055: loss did not improve from


Epoch 00083: loss did not improve from 0.00793
Epoch 84/100

Epoch 00084: loss did not improve from 0.00793
Epoch 85/100

Epoch 00085: loss did not improve from 0.00793
Epoch 86/100

Epoch 00086: loss did not improve from 0.00793
Epoch 87/100

Epoch 00087: loss did not improve from 0.00793
Epoch 88/100

Epoch 00088: loss improved from 0.00793 to 0.00778, saving model to dnn/checkpoints/checkpoint_2-88.hdf5
Epoch 89/100

Epoch 00089: loss did not improve from 0.00778
Epoch 90/100

Epoch 00090: loss improved from 0.00778 to 0.00748, saving model to dnn/checkpoints/checkpoint_2-90.hdf5
Epoch 91/100

Epoch 00091: loss did not improve from 0.00748
Epoch 92/100

Epoch 00092: loss did not improve from 0.00748
Epoch 93/100

Epoch 00093: loss did not improve from 0.00748
Epoch 94/100

Epoch 00094: loss did not improve from 0.00748
Epoch 95/100

Epoch 00095: loss did not improve from 0.00748
Epoch 96/100

Epoch 00096: loss did not improve from 0.00748
Epoch 97/100

Epoch 00097: loss did not imp
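
The loss reported in the log above is binary cross-entropy computed on the sigmoid output of the last layer. A minimal sketch of that calculation in plain Python follows; the logits and labels are invented, and the clipping constant `eps` is an assumption made to keep the logarithm finite.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -1.0, 0.0]   # illustrative raw outputs of the final Dense(1) layer
labels = [1, 0, 1]
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)
print(probs, loss)
```

A loss near 0.01, as in the log, corresponds to predicted probabilities that are on average very close to the training labels.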

Binary classification:
- Results with **best features**

In [73]:
# definitions
model_kb2 = Sequential()
model_kb2.add(Dense(512, 
                    input_dim = X_train_kb2.shape[1], 
                    activation = 'relu'))  
model_kb2.add(Dropout(0.1))
model_kb2.add(Dense(128, 
              activation = 'relu'))  
model_kb2.add(Dense(1))
model_kb2.add(Activation('sigmoid'))

# optimizers and configs
model_kb2.compile(loss = 'binary_crossentropy',
                  optimizer = 'adam',
                  metrics = ['accuracy'])

csv_logger = CSVLogger('dnn/dnn_kb2_analysis.csv',
                       separator = ',', 
                       append = False)

model_kb2.fit(X_train_kb2, y_train2, 
          validation_data = (X_test_kb2, y_test2),
          batch_size = batch_size, 
          epochs = epochs, 
          callbacks = [checkpointer, csv_logger])

model_kb2.save('dnn/dnn_model_kb2.hdf5')

Train on 125973 samples, validate on 22544 samples
Epoch 1/100

Epoch 00001: loss improved from inf to 0.09490, saving model to dnn/checkpoints/checkpoint_kb2-01.hdf5
Epoch 2/100

Epoch 00002: loss improved from 0.09490 to 0.06047, saving model to dnn/checkpoints/checkpoint_kb2-02.hdf5
Epoch 3/100

Epoch 00003: loss improved from 0.06047 to 0.05080, saving model to dnn/checkpoints/checkpoint_kb2-03.hdf5
Epoch 4/100

Epoch 00004: loss improved from 0.05080 to 0.04507, saving model to dnn/checkpoints/checkpoint_kb2-04.hdf5
Epoch 5/100

Epoch 00005: loss improved from 0.04507 to 0.03960, saving model to dnn/checkpoints/checkpoint_kb2-05.hdf5
Epoch 6/100

Epoch 00006: loss improved from 0.03960 to 0.03654, saving model to dnn/checkpoints/checkpoint_kb2-06.hdf5
Epoch 7/100

Epoch 00007: loss improved from 0.03654 to 0.03293, saving model to dnn/checkpoints/checkpoint_kb2-07.hdf5
Epoch 8/100

Epoch 00008: loss improved from 0.03293 to 0.03150, saving model to dnn/checkpoints/checkpoint_kb2-0


Epoch 00034: loss improved from 0.01593 to 0.01541, saving model to dnn/checkpoints/checkpoint_kb2-34.hdf5
Epoch 35/100

Epoch 00035: loss improved from 0.01541 to 0.01474, saving model to dnn/checkpoints/checkpoint_kb2-35.hdf5
Epoch 36/100

Epoch 00036: loss did not improve from 0.01474
Epoch 37/100

Epoch 00037: loss did not improve from 0.01474
Epoch 38/100

Epoch 00038: loss improved from 0.01474 to 0.01463, saving model to dnn/checkpoints/checkpoint_kb2-38.hdf5
Epoch 39/100

Epoch 00039: loss improved from 0.01463 to 0.01422, saving model to dnn/checkpoints/checkpoint_kb2-39.hdf5
Epoch 40/100

Epoch 00040: loss did not improve from 0.01422
Epoch 41/100

Epoch 00041: loss improved from 0.01422 to 0.01361, saving model to dnn/checkpoints/checkpoint_kb2-41.hdf5
Epoch 42/100

Epoch 00042: loss did not improve from 0.01361
Epoch 43/100

Epoch 00043: loss did not improve from 0.01361
Epoch 44/100

Epoch 00044: loss improved from 0.01361 to 0.01334, saving model to dnn/checkpoints/check


Epoch 00073: loss did not improve from 0.01098
Epoch 74/100

Epoch 00074: loss improved from 0.01098 to 0.01090, saving model to dnn/checkpoints/checkpoint_kb2-74.hdf5
Epoch 75/100

Epoch 00075: loss did not improve from 0.01090
Epoch 76/100

Epoch 00076: loss improved from 0.01090 to 0.01041, saving model to dnn/checkpoints/checkpoint_kb2-76.hdf5
Epoch 77/100

Epoch 00077: loss improved from 0.01041 to 0.01025, saving model to dnn/checkpoints/checkpoint_kb2-77.hdf5
Epoch 78/100

Epoch 00078: loss did not improve from 0.01025
Epoch 79/100

Epoch 00079: loss did not improve from 0.01025
Epoch 80/100

Epoch 00080: loss did not improve from 0.01025
Epoch 81/100

Epoch 00081: loss improved from 0.01025 to 0.01014, saving model to dnn/checkpoints/checkpoint_kb2-81.hdf5
Epoch 82/100

Epoch 00082: loss did not improve from 0.01014
Epoch 83/100

Epoch 00083: loss improved from 0.01014 to 0.01008, saving model to dnn/checkpoints/checkpoint_kb2-83.hdf5
Epoch 84/100

Epoch 00084: loss improved f

Deep neural network definitions for **multi classification**
- Using **all data**

In [67]:
# definitions
model_5 = Sequential()
model_5.add(Dense(1024, 
                  input_dim = X_train.shape[1], 
                  activation = 'relu'))  
model_5.add(Dropout(0.01))
model_5.add(Dense(5))
model_5.add(Activation('softmax'))

# optimizers and configs
model_5.compile(loss = 'categorical_crossentropy',
                optimizer = 'adam',
                metrics = ['accuracy'])

csv_logger = CSVLogger('dnn/dnn_5_analysis.csv',
                       separator = ',', 
                       append = False)

model_5.fit(X_train, y_train5, 
            validation_data = (X_test, y_test5),
            batch_size = batch_size, 
            epochs = epochs, 
            callbacks = [checkpointer, csv_logger])

model_5.save('dnn/dnn_model_5.hdf5')

Train on 125973 samples, validate on 22544 samples
Epoch 1/100

Epoch 00001: loss improved from inf to 0.19867, saving model to dnn/checkpoints/checkpoint_5-01.hdf5
Epoch 2/100

Epoch 00002: loss improved from 0.19867 to 0.08827, saving model to dnn/checkpoints/checkpoint_5-02.hdf5
Epoch 3/100

Epoch 00003: loss improved from 0.08827 to 0.06411, saving model to dnn/checkpoints/checkpoint_5-03.hdf5
Epoch 4/100

Epoch 00004: loss improved from 0.06411 to 0.05210, saving model to dnn/checkpoints/checkpoint_5-04.hdf5
Epoch 5/100

Epoch 00005: loss improved from 0.05210 to 0.04464, saving model to dnn/checkpoints/checkpoint_5-05.hdf5
Epoch 6/100

Epoch 00006: loss improved from 0.04464 to 0.03917, saving model to dnn/checkpoints/checkpoint_5-06.hdf5
Epoch 7/100

Epoch 00007: loss improved from 0.03917 to 0.03481, saving model to dnn/checkpoints/checkpoint_5-07.hdf5
Epoch 8/100

Epoch 00008: loss improved from 0.03481 to 0.03234, saving model to dnn/checkpoints/checkpoint_5-08.hdf5
Epoch 9/1


Epoch 00071: loss improved from 0.00955 to 0.00954, saving model to dnn/checkpoints/checkpoint_5-71.hdf5
Epoch 72/100

Epoch 00072: loss did not improve from 0.00954
Epoch 73/100

Epoch 00073: loss improved from 0.00954 to 0.00941, saving model to dnn/checkpoints/checkpoint_5-73.hdf5
Epoch 74/100

Epoch 00074: loss improved from 0.00941 to 0.00929, saving model to dnn/checkpoints/checkpoint_5-74.hdf5
Epoch 75/100

Epoch 00075: loss did not improve from 0.00929
Epoch 76/100

Epoch 00076: loss improved from 0.00929 to 0.00924, saving model to dnn/checkpoints/checkpoint_5-76.hdf5
Epoch 77/100

Epoch 00077: loss did not improve from 0.00924
Epoch 78/100

Epoch 00078: loss improved from 0.00924 to 0.00910, saving model to dnn/checkpoints/checkpoint_5-78.hdf5
Epoch 79/100

Epoch 00079: loss improved from 0.00910 to 0.00908, saving model to dnn/checkpoints/checkpoint_5-79.hdf5
Epoch 80/100

Epoch 00080: loss did not improve from 0.00908
Epoch 81/100

Epoch 00081: loss did not improve from 0.
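
For the 5-class model, the final layer produces one score per class, softmax turns the scores into probabilities, and categorical cross-entropy is the negative log of the probability given to the true class. The sketch below works through one sample in plain Python; the logits are invented and `eps` is an assumed guard against `log(0)`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


logits = [2.0, 0.5, 0.1, -1.0, 0.0]   # illustrative outputs of the Dense(5) layer
target = [1.0, 0.0, 0.0, 0.0, 0.0]    # one-hot row for class 0
probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
predicted_class = probs.index(max(probs))
print(probs, loss, predicted_class)
```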

**Multi classification:**
- Results using only the best features

In [74]:
# definitions for kb_5 (30 features)
model_kb5 = Sequential()
model_kb5.add(Dense(1024, 
                    input_dim = X_train_kb5.shape[1], 
                    activation = 'relu'))  
model_kb5.add(Dropout(0.1))
model_kb5.add(Dense(128, 
                    activation = 'relu'))  
model_kb5.add(Dropout(0.2))
model_kb5.add(Dense(5))
model_kb5.add(Activation('softmax'))

# optimizers and configs
model_kb5.compile(loss = 'categorical_crossentropy',
                    optimizer = 'adam',
                    metrics = ['accuracy'])

# checkpointer = callbacks.ModelCheckpoint(filepath = "dnn/checkpoints/checkpoint_kb5-{epoch:02d}.hdf5",
#                                          verbose = 1, 
#                                          save_best_only = True, 
#                                          monitor = 'loss')

csv_logger = CSVLogger('dnn/dnn_kb5_analysis.csv',
                       separator = ',', 
                       append = False)

model_kb5.fit(X_train_kb5, y_train5, 
              validation_data = (X_test_kb5, y_test5),
              batch_size = batch_size, 
              epochs = epochs, 
              callbacks = [checkpointer, csv_logger])

model_kb5.save('dnn/dnn_model_kb5.hdf5')

Train on 125973 samples, validate on 22544 samples
Epoch 1/100

Epoch 00001: loss did not improve from 0.00945
Epoch 2/100

Epoch 00002: loss did not improve from 0.00945
Epoch 3/100

Epoch 00003: loss did not improve from 0.00945
Epoch 4/100

Epoch 00004: loss did not improve from 0.00945
Epoch 5/100

Epoch 00005: loss did not improve from 0.00945
Epoch 6/100

Epoch 00006: loss did not improve from 0.00945
Epoch 7/100

Epoch 00007: loss did not improve from 0.00945
Epoch 8/100

Epoch 00008: loss did not improve from 0.00945
Epoch 9/100

Epoch 00009: loss did not improve from 0.00945
Epoch 10/100

Epoch 00010: loss did not improve from 0.00945
Epoch 11/100

Epoch 00011: loss did not improve from 0.00945
Epoch 12/100

Epoch 00012: loss did not improve from 0.00945
Epoch 13/100

Epoch 00013: loss did not improve from 0.00945
Epoch 14/100

Epoch 00014: loss did not improve from 0.00945
Epoch 15/100

Epoch 00015: loss did not improve from 0.00945
Epoch 16/100

Epoch 00016: loss did not imp


Epoch 00087: loss did not improve from 0.00945
Epoch 88/100

Epoch 00088: loss did not improve from 0.00945
Epoch 89/100

Epoch 00089: loss did not improve from 0.00945
Epoch 90/100

Epoch 00090: loss did not improve from 0.00945
Epoch 91/100

Epoch 00091: loss did not improve from 0.00945
Epoch 92/100

Epoch 00092: loss did not improve from 0.00945
Epoch 93/100

Epoch 00093: loss did not improve from 0.00945
Epoch 94/100

Epoch 00094: loss did not improve from 0.00945
Epoch 95/100

Epoch 00095: loss did not improve from 0.00945
Epoch 96/100

Epoch 00096: loss did not improve from 0.00945
Epoch 97/100

Epoch 00097: loss did not improve from 0.00945
Epoch 98/100

Epoch 00098: loss did not improve from 0.00945
Epoch 99/100

Epoch 00099: loss did not improve from 0.00945
Epoch 100/100

Epoch 00100: loss did not improve from 0.00945


## 10. Results and Conclusion

#### Various classifiers using ACCURACY score:

|  Classifier       | Probe    | DoS      | U2R      | R2L      |Binary Class|MultiClass|
|-------------      |-------   |------    |------    |------    |------------|----------| 
|**Decision Tree**  | 99.37%   |99.62%    |99.65%    |97.49%    |   98.01%   |  97.82%  |
|Best Features      | 99.07%   |99.60%    |99.03%    |97.48%    |   97.91%   |  97.20%  |
|                                                                                       |
|**KNN**            | 98.90%   |99.11%    |99.11%    |97.27%    |   97.24%   |  96.73%  |
|Best Features      | 98.54%   |99.02%    |98.63%    |97.15%    |   97.23%   |  96.32%  |
|                                                                                       |
|**SVM**            | 97.77%   |96.83%    |99.07%    |95.23%    |   92.81%   |  93.46%  |
|Best Features      | 97.32%   |96.21%    |97.98%    |92.60%    |   91.63%   |  91.40%  |
|                                                                                       |
|**Naive Bayes**    | 87.26%   |84.75%    |97.98%    |77.75%    |   82.41%   |  64.56%  |
|Best Features      | 87.00%   |84.57%    |97.98%    |77.73%    |   82.34%   |  64.50%  |
|                                                                                       |
|**AdaBoost**       | 99.69%   |**99.84%**|**99.79%**|**98.06%**|   98.54%   |  98.42%  |
|Best Features      |**99.84%**|99.79%    |99.29%    |97.90%    |   98.35%   |  98.23%  |
|                                                                                       |
|**Random Forest**  | 99.53%   |99.81%    |**99.79%**|97.87%    |   98.54%   |  98.24%  |
|Best Features      | 99.81%   |99.79%    |99.20%    |98.03%    |   98.27%   |  98.06%  |
|                                                                                       |
|**Deep Neural Net**|    -     |   -      |  -       |    -     | **99.77%** |**99.77%**|
|Best Features      |    -     |   -      |  -       |    -     |   99.72%   |  99.70%  |
|                                                                                       |

The best overall results were obtained by the **deep learning models**, which were evaluated on the binary and multiclass tasks only.
<br>
<br>The **ensemble methods** came second (AdaBoost first, then Random Forest).
<br>Among the classic classifiers, the **Decision Tree** was the best approach.<br>

As for the trade-off between accuracy and processing time: thanks to dimensionality reduction, we can afford to lose around 0.06% of accuracy (0.05 to 0.07 percentage points for the Deep Neural Net in the table above) in exchange for a less complex dataset and a very good gain in computation time.
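
The accuracy cost of feature selection can be checked directly from the table. The snippet below is only a small sketch: the numbers are the Binary Class and MultiClass columns copied from the table above, while the names `results`, `accuracy_drop` and `drops` are made up for this illustration.

```python
# Binary and multiclass accuracy (%) copied from the results table above,
# stored as (all features, best features) for each classifier.
results = {
    "Decision Tree":   {"binary": (98.01, 97.91), "multiclass": (97.82, 97.20)},
    "KNN":             {"binary": (97.24, 97.23), "multiclass": (96.73, 96.32)},
    "SVM":             {"binary": (92.81, 91.63), "multiclass": (93.46, 91.40)},
    "Naive Bayes":     {"binary": (82.41, 82.34), "multiclass": (64.56, 64.50)},
    "AdaBoost":        {"binary": (98.54, 98.35), "multiclass": (98.42, 98.23)},
    "Random Forest":   {"binary": (98.54, 98.27), "multiclass": (98.24, 98.06)},
    "Deep Neural Net": {"binary": (99.77, 99.72), "multiclass": (99.77, 99.70)},
}


def accuracy_drop(full, reduced):
    """Accuracy lost (in percentage points) when only the best features are used."""
    return round(full - reduced, 2)


# Hypothetical helper structure: drop per classifier and per task.
drops = {
    name: {task: accuracy_drop(*pair) for task, pair in tasks.items()}
    for name, tasks in results.items()
}

for name, d in drops.items():
    mean_drop = round(sum(d.values()) / len(d), 2)
    print(f"{name:<16} binary: {d['binary']:.2f}  "
          f"multiclass: {d['multiclass']:.2f}  mean: {mean_drop:.2f}")
```

For the Deep Neural Net this gives a drop of 0.05 (binary) and 0.07 (multiclass) percentage points. The table does not include timing figures, so the gain in computation time is not part of this sketch.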
--- -- 
For future work, we can consider other dimensionality reduction techniques, such as 'feature ratio', 'random projection' and 'principal component analysis'. Other neural network approaches, such as recurrent networks (RNN, LSTM and GRU), can also be considered.
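
To give an idea of how two of these techniques could be tried with scikit-learn, here is a minimal sketch. It is not part of the experiments above: the data is synthetic (41 columns, like the raw NSL-KDD feature count), and the names `X`, `latent`, `mixing` as well as the chosen component counts are assumptions made only for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)

# Synthetic stand-in for a scaled feature matrix (NOT the NSL-KDD data):
# 41 columns generated from 5 hidden factors plus a little noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 41))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 41))

# PCA: keep the smallest number of components explaining at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X)
print("PCA components kept:", pca.n_components_)
print("Explained variance:", pca.explained_variance_ratio_.sum())

# Random projection: project onto 10 random directions (10 is an arbitrary choice).
rp = GaussianRandomProjection(n_components=10, random_state=0)
X_rp = rp.fit_transform(X)
print("Random projection shape:", X_rp.shape)
```

On the real dataset the reduced matrices would replace the selected-feature inputs used for the 'Best Features' rows, and the classifiers would have to be retrained and compared again.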
--- --
Although training takes a large amount of time and data, in real-life problems these models (mainly the DNN), once loaded, can classify traffic in real time. They could be updated over the air, and the data collected at the many points of the network could be used to improve accuracy on all devices. Such a system could also work together with a log consolidator.
