# Report on Multi-Class Classifiers Using NSL-KDD Dataset

## Section 1 - Comparing Algorithms and Evaluation Metrics

I first compare the performance of three algorithms (Random Forest, Linear SVC, and Logistic Regression) trained on all attributes as features, and evaluate them against several metrics.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 100)

from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

from sklearn.model_selection import train_test_split

train_set = pd.read_csv('../../KDDTrain+.csv', header=None)

field_names = pd.read_csv('../../Field Names.csv', header=None)

field_names_extension = pd.DataFrame([['attack_type', 'symbolic'],['??','continuous']])
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
field_names = pd.concat([field_names, field_names_extension], ignore_index=True)

cont_feats = field_names.loc[field_names[1] == 'continuous']
cont_feats = cont_feats.drop(columns=[1])

field_names = field_names.drop(columns=[1])

dataframe_headers = field_names[0].tolist()
train_set.columns = dataframe_headers

cols = train_set.columns.tolist()
cols = cols[-2:-1] + cols[:-2] + cols[-1:]
train_set = train_set[cols]

att_types = train_set['attack_type']

# train_set.sort_values('attack_type')

Before the classifiers can be trained, the categorical attributes (service, protocol type, and flag) have to be converted to integer codes.

In [2]:
# ENUMERATING SERVICE TYPE
import copy

service_list = train_set['service'].unique()

try:
    service_list2
except NameError:
    service_list2 = copy.deepcopy(service_list)    
    svc_enum_data = {'service': service_list}
    service_enumerations = pd.DataFrame(data=svc_enum_data)
    
enumerated_service_list = [x for x in range(len(service_list))]


train_set = train_set.replace(service_list,enumerated_service_list)

# ENUMERATING PROTOCOL TYPE

prot_list = train_set['protocol_type'].unique()

try:
    prot_list2
except NameError:
    prot_list2 = copy.deepcopy(prot_list)    
    prot_enum_data = {'protocol_type': prot_list}
    prot_enumerations = pd.DataFrame(data=prot_enum_data)
    
enumerated_prot_list = [x for x in range(len(prot_list))]


train_set = train_set.replace(prot_list,enumerated_prot_list)

# ENUMERATING FLAG TYPE

flag_list = train_set['flag'].unique()

try:
    flag_list2
except NameError:
    flag_list2 = copy.deepcopy(flag_list)    
    flag_enum_data = {'flag': flag_list}
    flag_enumerations = pd.DataFrame(data=flag_enum_data)
    
enumerated_flag_list = [x for x in range(len(flag_list))]


train_set = train_set.replace(flag_list,enumerated_flag_list)
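
A more compact way to build the same kind of integer mapping is sketched below. It is a minimal illustration on made-up data, not the method used in this notebook: the toy `train` and `test` frames and the `service_code` column are hypothetical, and `Index.get_indexer` is used because it returns -1 for values that are absent from the training categories.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'service' column
train = pd.DataFrame({"service": ["http", "ftp", "http", "smtp"]})
test = pd.DataFrame({"service": ["ftp", "dns", "http"]})

# Learn the categories from the training data only, in order of appearance
categories = pd.Index(train["service"].unique())

# get_indexer maps each value to its position, or -1 if it was never seen
train["service_code"] = categories.get_indexer(train["service"].to_numpy())
test["service_code"] = categories.get_indexer(test["service"].to_numpy())

print(train["service_code"].tolist())
print(test["service_code"].tolist())
```

Restricting the mapping to one column at a time also avoids accidentally replacing matching strings in other columns.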


Importing Test Set

In [3]:
test_set = pd.read_csv('../../KDDTest+.csv', header=None)
test_set.columns = dataframe_headers

test_set = test_set[cols]

# Reuse the training-set mappings so both sets share the same integer codes
test_set = test_set.replace(service_list2,enumerated_service_list)
# Any value that is still a string was not seen in training; mark it with the integer -1
test_set['service'] = test_set['service'].apply(lambda x: -1 if isinstance(x, str) else x)

test_set = test_set.replace(prot_list2, enumerated_prot_list)
test_set['protocol_type'] = test_set['protocol_type'].apply(lambda x: -1 if isinstance(x, str) else x)

test_set = test_set.replace(flag_list2, enumerated_flag_list)
test_set['flag'] = test_set['flag'].apply(lambda x: -1 if isinstance(x, str) else x)


Combining Test and Train to perform Stratified Sampling

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
combined_table = pd.concat([train_set, test_set], ignore_index=True)
combined_table_features = combined_table.drop(columns=['attack_type'])
combined_table_target = combined_table['attack_type']


In [5]:
train_features, test_features, train_target, test_target = train_test_split(combined_table_features, combined_table_target, test_size=0.1, shuffle=True, stratify=combined_table_target)
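
The effect of `stratify` can be seen on a small example. The sketch below uses hypothetical toy labels (80 samples of one class, 20 of another) rather than the NSL-KDD data, and fixes `random_state` only to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, stratify=y, random_state=0
)

# Both subsets keep the original 80/20 class ratio
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```

Without `stratify`, a 10% split of a heavily skewed label column could leave rare classes out of the test set entirely.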

Setting up the classifiers:

In [6]:
%%time

svc_clf_lin = LinearSVC()

svc_clf_lin.fit(train_features, train_target)


CPU times: user 13min 32s, sys: 87.6 ms, total: 13min 32s
Wall time: 13min 32s


In [7]:
%%time

rf_clf = RandomForestClassifier()
rf_clf.fit(train_features, train_target)

CPU times: user 1.21 s, sys: 4 ms, total: 1.21 s
Wall time: 1.21 s


In [8]:
%%time

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_target)

CPU times: user 3min 37s, sys: 3.99 ms, total: 3min 37s
Wall time: 3min 37s
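
Long fit times for linear models are often linked to features on very different scales, although this was not verified for this dataset. The sketch below shows one common remedy, a scaling step in front of the classifier, on synthetic data. The data, the inflated first feature, and the `max_iter` value are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is inflated to mimic a byte-count column
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X[:, 0] *= 1e6

# The scaler is fitted on the training data inside the pipeline,
# so the same transformation is applied again at prediction time
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X, y)
print(scaled_lr.score(X, y))
```

The same pipeline pattern works with `LinearSVC` in place of `LogisticRegression`.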


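Testing the classifiers. Micro- and weighted-averaged scores are reported because the classes are imbalanced in this multi-class problem. Note that micro-averaging pools all samples, so for single-label multi-class data the micro-averaged precision, recall and F1 are all equal to accuracy.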

In [9]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score


result_df_columns = ['Classifier', 'Accuracy', 'Precision (Micro)', 'Precision (Weighted)', 'Recall (Micro)', 'Recall (Weighted)', 'F-1 (Micro)', 'F1 (Weighted)']
clf_cmp_dataframe = pd.DataFrame(columns=result_df_columns)


In [None]:
import warnings
warnings.filterwarnings('ignore')

In [10]:
%%capture --no-display

# SVC linear
svc_lin_pred_list = svc_clf_lin.predict(test_features)
svc_lin_acc = accuracy_score(test_target, svc_lin_pred_list)
# "_mic" holds micro-averaged scores, "_wtd" holds support-weighted scores
svc_lin_prec_mic = precision_score(test_target, svc_lin_pred_list, average='micro')
svc_lin_prec_wtd = precision_score(test_target, svc_lin_pred_list, average='weighted')
svc_lin_rec_mic = recall_score(test_target, svc_lin_pred_list, average='micro')
svc_lin_rec_wtd = recall_score(test_target, svc_lin_pred_list, average='weighted')
svc_lin_f1_mic = f1_score(test_target, svc_lin_pred_list, average='micro')
svc_lin_f1_wtd = f1_score(test_target, svc_lin_pred_list, average='weighted')

svc_lin_res = ["SVC Linear", svc_lin_acc, svc_lin_prec_mic, svc_lin_prec_wtd, svc_lin_rec_mic, svc_lin_rec_wtd, svc_lin_f1_mic, svc_lin_f1_wtd]
svc_lin_df = pd.DataFrame([svc_lin_res], columns=result_df_columns)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
clf_cmp_dataframe = pd.concat([clf_cmp_dataframe, svc_lin_df], ignore_index=True)



In [11]:
# RF

rf_pred_list = rf_clf.predict(test_features)
rf_acc = accuracy_score(test_target, rf_pred_list)
# "_mic" holds micro-averaged scores, "_wtd" holds support-weighted scores
rf_prec_mic = precision_score(test_target, rf_pred_list, average='micro')
rf_prec_wtd = precision_score(test_target, rf_pred_list, average='weighted')
rf_rec_mic = recall_score(test_target, rf_pred_list, average='micro')
rf_rec_wtd = recall_score(test_target, rf_pred_list, average='weighted')
rf_f1_mic = f1_score(test_target, rf_pred_list, average='micro')
rf_f1_wtd = f1_score(test_target, rf_pred_list, average='weighted')

rf_res = ["RF", rf_acc, rf_prec_mic, rf_prec_wtd, rf_rec_mic, rf_rec_wtd, rf_f1_mic, rf_f1_wtd]
rf_df = pd.DataFrame([rf_res], columns=result_df_columns)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
clf_cmp_dataframe = pd.concat([clf_cmp_dataframe, rf_df], ignore_index=True)

In [12]:
# LR

lr_pred_list = lr_clf.predict(test_features)
lr_acc = accuracy_score(test_target, lr_pred_list)
# "_mic" holds micro-averaged scores, "_wtd" holds support-weighted scores
lr_prec_mic = precision_score(test_target, lr_pred_list, average='micro')
lr_prec_wtd = precision_score(test_target, lr_pred_list, average='weighted')
lr_rec_mic = recall_score(test_target, lr_pred_list, average='micro')
lr_rec_wtd = recall_score(test_target, lr_pred_list, average='weighted')
lr_f1_mic = f1_score(test_target, lr_pred_list, average='micro')
lr_f1_wtd = f1_score(test_target, lr_pred_list, average='weighted')

lr_res = ["LR", lr_acc, lr_prec_mic, lr_prec_wtd, lr_rec_mic, lr_rec_wtd, lr_f1_mic, lr_f1_wtd]
lr_df = pd.DataFrame([lr_res], columns=result_df_columns)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
clf_cmp_dataframe = pd.concat([clf_cmp_dataframe, lr_df], ignore_index=True)
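
The three evaluation cells repeat the same steps for each classifier. One way to remove the repetition is a small helper that returns one result row per model, sketched here on made-up labels. The function name `summarise` and the toy data are hypothetical, and the `zero_division` argument assumes scikit-learn 0.22 or later, which is newer than the version used in this notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarise(name, y_true, y_pred):
    """Return one result row with accuracy plus micro and weighted scores."""
    row = {"Classifier": name, "Accuracy": accuracy_score(y_true, y_pred)}
    for avg in ("micro", "weighted"):
        label = avg.capitalize()
        # zero_division=0 scores a class with no predictions as 0 without a warning
        row[f"Precision ({label})"] = precision_score(y_true, y_pred, average=avg, zero_division=0)
        row[f"Recall ({label})"] = recall_score(y_true, y_pred, average=avg, zero_division=0)
        row[f"F1 ({label})"] = f1_score(y_true, y_pred, average=avg, zero_division=0)
    return row


# Hypothetical labels and predictions, used only to exercise the helper
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 2]
results = pd.DataFrame([summarise("toy", y_true, y_pred)])
print(results)
```

Building the table from a list of row dictionaries in one step also avoids growing a DataFrame row by row.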

In [13]:
clf_cmp_dataframe

|   | Classifier | Accuracy | Precision (Micro) | Precision (Weighted) | Recall (Micro) | Recall (Weighted) | F-1 (Micro) | F1 (Weighted) |
|---|---|---|---|---|---|---|---|---|
| 0 | SVC Linear | 0.853353 | 0.853353 | 0.867082 | 0.853353 | 0.853353 | 0.853353 | 0.850501 |
| 1 | RF | 0.997307 | 0.997307 | 0.99711 | 0.997307 | 0.997307 | 0.997307 | 0.997149 |
| 2 | LR | 0.865001 | 0.865001 | 0.806384 | 0.865001 | 0.865001 | 0.865001 | 0.82797 |
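
Several columns in the table are identical for each classifier. This is expected: in single-label multi-class classification, micro-averaged precision, recall and F1 always equal accuracy, and weighted recall equals accuracy as well. The sketch below checks this on hypothetical toy labels and also computes macro-averaged recall, which is not used in this report but weights every class equally and therefore reacts more strongly to a rare class being missed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: class 0 is common, class 2 is rare and always missed
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

acc = accuracy_score(y_true, y_pred)
micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
weighted_r = recall_score(y_true, y_pred, average="weighted")
macro_r = recall_score(y_true, y_pred, average="macro")

# The first five values are all equal
print(acc, micro_p, micro_r, micro_f1, weighted_r)
# Macro recall is lower because the missed rare class counts as much as the common one
print(macro_r)
```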


## Section 2 - Evaluating Results


## Model to Choose

As seen from the results above, the Random Forest model outperforms the other two models on every metric. One possible explanation is that this is a multi-class problem whose classes are not equally represented. Combining many decision trees into a Random Forest lets different trees capture different cases, which can produce a better predictive model.

## Metrics to Choose

As the data pertains to potential attacks on a network system, it is important to take precautions that limit any potential damage. In such a situation, it may be preferable to tolerate more False Positives than False Negatives: it is better to investigate an alert and find that nothing is wrong than to miss an attack and realise it only when it is too late. Hence, we want a high recall, where the proportion of False Negatives is as low as possible.
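
To make the False Negative argument concrete, the multi-class labels can be collapsed into "attack" versus "normal" before computing recall. The sketch below uses hypothetical labels and predictions (the attack names are only examples) and is one possible way to measure this, not a result from this report.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical true labels and predictions
y_true = np.array(["normal", "neptune", "smurf", "normal", "neptune", "normal"])
y_pred = np.array(["normal", "neptune", "normal", "neptune", "smurf", "normal"])

# Collapse to a binary problem: 1 = any attack, 0 = normal traffic
true_is_attack = (y_true != "normal").astype(int)
pred_is_attack = (y_pred != "normal").astype(int)

# Recall on the attack class: detected attacks / all real attacks
attack_recall = recall_score(true_is_attack, pred_is_attack)

# False Negatives: real attacks that were predicted as normal
false_negatives = int(((true_is_attack == 1) & (pred_is_attack == 0)).sum())

print(attack_recall, false_negatives)
```

Under this view, an attack that is assigned the wrong attack type still counts as detected, because only a "normal" prediction lets it through.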


## Cross-Validation Using K-Fold CV

To check that the results obtained are representative, I use K-fold cross-validation to see whether similar results are obtained on different folds of the data.

In [54]:
# k-fold cv

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=10, shuffle=True)
rf2_clf = RandomForestClassifier()

np.mean(cross_val_score(rf2_clf, combined_table_features, combined_table_target, cv=kf))


0.996929613494585

The mean cross-validated accuracy is very close to the hold-out accuracy above, which suggests that the RF result is not an artefact of one particular train/test split.
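
Because the classes are imbalanced, a stratified variant of K-fold may give more stable folds, and reporting the spread alongside the mean shows how much the score varies. The sketch below illustrates this on synthetic data. The data, the number of splits and the `random_state` values are assumptions for illustration, not settings used in this report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix and attack labels
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# StratifiedKFold keeps the class proportions roughly equal in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)

scores = cross_val_score(clf, X, y, cv=skf, scoring="accuracy")
print(scores.mean(), scores.std())
```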

## Section 3 - Iterations to Improve Model


In [None]:
# Precision and Recall Curve (set probability = true for svc?)
# Roc curve
# one-hot encoding
# scaling data/pre-processing

# feature engineering
# feature transformation
# tweak model param


## Using Binary Classification on a Multi-Class Model

In [None]:
rf_pred_list

## Using Grid Search CV to Find Best RF Hyper-parameters
While the default RF classifier already performed extremely well, I use Grid Search CV to see whether it can improve on the default model above. In the scikit-learn version used here, the defaults are n_estimators=10 and max_depth=None (from version 0.22 onwards, the default n_estimators is 100). These are the two hyper-parameters I vary in order to observe their effect on the results.

In [60]:
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators':[10,50,100,200], 'max_depth':[None,3,5,10,15,20,25,30,35,40]}

cv_clf = GridSearchCV(rf_clf, parameters)
cv_clf.fit(train_features, train_target)

cv_clf.best_params_

{'max_depth': 25, 'n_estimators': 200}

In [61]:
print(cv_clf.best_estimator_)
print(cv_clf.best_score_)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=25, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
0.9971944577447929
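
The search above uses the default scoring (accuracy) and no fixed random seed, so repeated runs may pick different parameters. Since Section 2 argues for recall, the sketch below shows how a recall-based scorer and a fixed seed could be passed to the search. The synthetic data, the reduced grid, the choice of `recall_macro` and `cv=3` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and labels
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# A deliberately small grid so the example runs quickly
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # fixed seed for repeatable results
    param_grid,
    scoring="recall_macro",  # rank candidates by recall instead of accuracy
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```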


In [None]:
rf_clf3 = RandomForestClassifier()
cv_clf_2 = GridSearchCV(rf_clf3, parameters)
cv_clf_2.fit(train_features, train_target)

print(cv_clf_2.best_params_)
print(cv_clf_2.best_score_)

## Section 4 - Limitations and Other Considerations

- Justification of choices
- Add in what I have learnt

1) Unrealistic Class Distribution :
    The data set does not reflect how network attacks occur in practice, as it is split close to 50/50 between "normal" traffic and attacks. This makes the task a binary or multi-class classification problem rather than anomaly detection. In reality, the vast majority of network traffic should be "normal", which makes it much harder to train a classifier to detect attacks.
    
2) Clean Dataset :
    The data set also 
    
- Academic dataset (already well researched)

- Further iterations on other models could have been run for comparison, but they are computationally expensive
- SVC with an RBF kernel was not used, as it is also too expensive to train
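
The class distribution discussed in point 1 can be checked directly from the label column. The sketch below uses a hypothetical toy label series in place of `attack_type`, so the printed fraction is not a property of NSL-KDD.

```python
import pandas as pd

# Hypothetical labels standing in for the 'attack_type' column
labels = pd.Series(["normal", "neptune", "normal", "smurf", "normal", "neptune"])

# Fraction of records that are attacks of any type
attack_fraction = (labels != "normal").mean()

# Share of each individual class, largest first
class_shares = labels.value_counts(normalize=True)

print(attack_fraction)
print(class_shares)
```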


## Section 5 - Conclusion