> **Essential ML process for Intrusion Detection**
<br>` python  3.7.13    scikit-learn  1.0.2 `
<br>`numpy   1.19.5          pandas  1.3.5`

**Import the main libraries**

In [1]:
import numpy
import pandas

from time import time

import os
data_path = '../datasets/NSL_KDD'

_import the local library_

In [2]:
# add parent folder path where lib folder is
import sys
if ".." not in sys.path:
    sys.path.insert(0, '..')

In [3]:
from mylib import show_labels_dist, show_metrics, bias_var_metrics

**Import the Dataset**

In [4]:
# Using boosted Train and preprocessed Test

data_file = os.path.join(data_path, 'NSL_boosted-2.csv') 
train_df = pandas.read_csv(data_file)
print('Train Dataset: {} rows, {} columns'.format(train_df.shape[0], train_df.shape[1]))

data_file = os.path.join(data_path, 'NSL_ppTest.csv') 
test_df = pandas.read_csv(data_file)
print('Test Dataset: {} rows, {} columns'.format(test_df.shape[0], test_df.shape[1]))

Train Dataset: 63280 rows, 43 columns
Test Dataset: 22544 rows, 43 columns


***
**Data Preparation and EDA** (unique to this dataset)

* _Quick visual check of unique values, deal with unique identifiers_

In [5]:
# Identify columns with only one value 
# or with number of unique values == number of rows
n_eq_one = []
n_eq_all = []

print('Unique value count: Train (',train_df.shape[0],'rows ) ~ Test(',test_df.shape[0],'rows )')
for col in train_df.columns:
    lctrn = len(train_df[col].unique())
    lctst = len(test_df[col].unique())

#    print(col, ' ::> ', lctrn, ' ~ ', lctst)
    
    if (lctrn == 1) and (lctrn == lctst): 
        n_eq_one.append(train_df[col].name)
    if lctrn == train_df.shape[0]:
        n_eq_all.append(train_df[col].name)

Unique value count: Train ( 63280 rows ) ~ Test( 22544 rows )


In [6]:
# Drop columns with only one value
if len(n_eq_one) > 0:
    print('Dropping single-valued features')
    print(n_eq_one)
    train_df.drop(n_eq_one, axis=1, inplace=True)
    test_df.drop(n_eq_one, axis=1, inplace=True)

# Drop or bin columns with number of unique values == number of rows
if len(n_eq_all) > 0:
    print('Dropping unique identifiers')
    print(n_eq_all)
    train_df.drop(n_eq_all, axis=1, inplace=True)
    test_df.drop(n_eq_all, axis=1, inplace=True)

# continue with feature selection / feature engineering

Dropping single-valued features
['num_outbound_cmds']
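
The same check can be written more compactly with `DataFrame.nunique()`. The following is a minimal sketch on a toy frame; the column names `flag_const`, `row_id`, and `value` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: one constant column, one unique-per-row column, one ordinary column
toy_df = pd.DataFrame({
    "flag_const": [0, 0, 0, 0],
    "row_id": [101, 102, 103, 104],
    "value": [1.5, 1.5, 2.0, 3.0],
})

counts = toy_df.nunique()                                   # unique values per column
single_valued = counts[counts == 1].index.tolist()          # carries no information
identifiers = counts[counts == len(toy_df)].index.tolist()  # one value per row

print("single-valued:", single_valued)
print("unique identifiers:", identifiers)

cleaned_df = toy_df.drop(columns=single_valued + identifiers)
```

Unlike the loop above, this sketch inspects a single frame. The notebook additionally requires a column to be single-valued in the test set before dropping it.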


* _Combine for processing classification target and text features_

In [7]:
combined_df = pandas.concat([train_df, test_df])
print('Combined Dataset: {} rows, {} columns'.format(
    combined_df.shape[0], combined_df.shape[1]))

Combined Dataset: 85824 rows, 42 columns


* _Classification Target feature:_
two columns of labels are available 
    * Two-class: Reduce the detailed attack labels to 'normal' or 'attack'
    * Multiclass: Use the category labels (atakcat)

In [8]:
# Set the classification target
twoclass = True     # True or False

In [9]:
if twoclass:
    # Two-class: reduce the detailed attack labels to 'normal' or 'attack'
    # (selecting a single column returns a Series)
    labels_df = combined_df['label'].copy()
    labels_df[labels_df != 'normal'] = 'attack'
else:
    # Multiclass: use the category labels (atakcat)
    # (double brackets return a DataFrame, so rename the column
    #  and squeeze it into a Series for later)
    labels_df = combined_df[['atakcat']].copy()
    labels_df.rename(columns={'atakcat':'label'}, inplace=True)
    labels_df = labels_df.squeeze('columns')

# drop target features 
combined_df.drop(['label'], axis=1, inplace=True)
combined_df.drop(['atakcat'], axis=1, inplace=True)

* _One-Hot Encoding the categorical (text) features_

In [10]:
# put the names into a python list - for pandas.get_dummies()
categori = combined_df.select_dtypes(include=['object']).columns
category_cols = categori.tolist()
#print(category_cols)

In [11]:
# Apply to the list of Categorical columns (text fields)
features_df = pandas.get_dummies(combined_df, columns=category_cols)
#features_df.info()
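
Encoding the combined frame guarantees that train and test end up with the same dummy columns. A minimal sketch of why this matters, using a made-up `proto` column and made-up values (assumptions for illustration only):

```python
import pandas as pd

train = pd.DataFrame({"proto": ["tcp", "udp"], "n_bytes": [10, 20]})
test = pd.DataFrame({"proto": ["tcp", "icmp"], "n_bytes": [5, 7]})

# Encoding separately: each frame only gets columns for the values it contains
sep_train = pd.get_dummies(train, columns=["proto"])
sep_test = pd.get_dummies(test, columns=["proto"])
mismatch = set(sep_train.columns) ^ set(sep_test.columns)
print("columns present in only one frame:", sorted(mismatch))

# Encoding the combined frame, then slicing it back apart by position
combined = pd.get_dummies(pd.concat([train, test]), columns=["proto"])
enc_train = combined.iloc[:len(train)].reset_index(drop=True)
enc_test = combined.iloc[len(train):].reset_index(drop=True)
print("same columns:", list(enc_train.columns) == list(enc_test.columns))
```

If the test set must stay untouched during preprocessing, one alternative is `sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')` fitted on the training frame only.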

In [12]:
# generate a list of numeric columns for scaling - After test // train split
numeri = combined_df.select_dtypes(include=['float64','int64']).columns
#print(numeri.to_list())

***
**<br>Create Test // Train Datasets**
> Normally we split the dataset into train 70 % // test 30 % like this
<br>`from sklearn.model_selection import train_test_split`
<br>`X_train, X_test, y_train, y_test = `
<br>`    train_test_split(features_df, labels_df, `
<br>`        test_size=0.3, stratify=labels_df, random_state=42)`
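
For reference, a runnable version of the split quoted above, as a sketch on synthetic data. The frame `toy_features` and the 60/40 label mix are assumptions chosen so the effect of `stratify` is easy to see.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60 % 'normal' and 40 % 'attack'
toy_features = pd.DataFrame({"x": range(100)})
toy_labels = pd.Series(["normal"] * 60 + ["attack"] * 40, name="label")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels,
    test_size=0.3, stratify=toy_labels, random_state=42)

# stratify keeps the class proportions the same in both parts
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

In this notebook the split is not random: the dataset ships with separate train and test files, so the next cell restores that split by position instead.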

In [13]:
# Restore the train // test split: slice 1 Dataframe into 2 
features_train = features_df.iloc[:len(train_df),:].copy()    # X_train
features_train.reset_index(inplace=True, drop=True)
# pandas has a lot of rules about returning a 'view' vs. a copy from slice
# so we force it to create a new dataframe [avoiding SettingWithCopy Warning]
features_test = features_df.iloc[len(train_df):,:].copy()     # X_test
features_test.reset_index(inplace=True, drop=True)

# Restore the train // test split: slice 1 Series into 2 
# reset_index(drop=True) returns a new Series, so no slice is modified in place
labels_train = labels_df.iloc[:len(train_df)].reset_index(drop=True)   # y_train

labels_test = labels_df.iloc[len(train_df):].reset_index(drop=True)    # y_test

***
Next are standard steps for all datasets: _scaling, classifiers, results_

**Scaling** comes _after_ test // train split

In [14]:
# scaling the Numeric columns 
# StandardScaler: zero mean and unit variance (values are not bounded)
# MinMaxScaler: range zero to 1 on the data it was fit on

# from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# sklearn docs say 
#   "Don't cheat - fit only on training data, then transform both"
#   fit() expects 2D array: reshape(-1, 1) for single col or (1, -1) single row

for i in numeri:
    arr = numpy.array(features_train[i])
    scale = MinMaxScaler().fit(arr.reshape(-1, 1))
    features_train[i] = scale.transform(arr.reshape(len(arr),1))

    arr = numpy.array(features_test[i])
    features_test[i] = scale.transform(arr.reshape(len(arr),1))
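
The fit-on-train, transform-both rule can also be applied to all numeric columns in one call. A minimal sketch on toy frames; the column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy_train = pd.DataFrame({"duration": [0.0, 5.0, 10.0],
                          "n_bytes": [100.0, 200.0, 300.0]})
toy_test = pd.DataFrame({"duration": [5.0, 20.0],
                         "n_bytes": [50.0, 300.0]})
num_cols = ["duration", "n_bytes"]

# Learn min and max from the training data only
scaler = MinMaxScaler().fit(toy_train[num_cols])

# Apply the same learned scaling to both frames
toy_train[num_cols] = scaler.transform(toy_train[num_cols])
toy_test[num_cols] = scaler.transform(toy_test[num_cols])

print(toy_train)
print(toy_test)   # test values can fall outside [0, 1]
```

Because the scaler never sees the test data, test values below the training minimum or above the training maximum map outside the zero to 1 range. That is expected and is the price of avoiding leakage.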

**<br>Classifier Selection**

In [15]:
# prepare list
models = []

##  --  Linear  --  ## 
#from sklearn.linear_model import LogisticRegression 
#models.append (("LogReg",LogisticRegression())) 
from sklearn.linear_model import SGDClassifier 
models.append (("StocGradDes",SGDClassifier())) 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
models.append(("LinearDA", LinearDiscriminantAnalysis())) 
#from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis 
#models.append(("QuadraticDA", QuadraticDiscriminantAnalysis())) 

##  --  Support Vector  --  ## 
#from sklearn.svm import SVC 
#models.append(("SupportVectorClf", SVC())) 
from sklearn.svm import LinearSVC 
models.append(("LinearSVC", LinearSVC())) 
# note: RidgeClassifier is a linear model, not a support vector method
from sklearn.linear_model import RidgeClassifier
models.append (("RidgeClf",RidgeClassifier())) 

##  --  Non-linear  --  ## 
from sklearn.tree import DecisionTreeClassifier 
models.append (("DecisionTree",DecisionTreeClassifier())) 
#from sklearn.naive_bayes import GaussianNB 
#models.append (("GaussianNB",GaussianNB())) 
#from sklearn.neighbors import KNeighborsClassifier 
#models.append(("K-NNeighbors", KNeighborsClassifier())) 

##  --  Ensemble: bagging  --  ## 
from sklearn.ensemble import RandomForestClassifier 
models.append(("RandomForest", RandomForestClassifier())) 
##  --  Ensemble: boosting  --  ## 
#from sklearn.ensemble import AdaBoostClassifier 
#models.append(("AdaBoost", AdaBoostClassifier())) 
#from sklearn.ensemble import GradientBoostingClassifier 
#models.append(("GradientBoost", GradientBoostingClassifier())) 

##  --  NeuralNet (simplest)  --  ## 
#from sklearn.linear_model import Perceptron 
#models.append (("SingleLayerPtron",Perceptron())) 
#from sklearn.neural_network import MLPClassifier 
#models.append(("MultiLayerPtron", MLPClassifier())) 

print(models)

[('StocGradDes', SGDClassifier()), ('LinearDA', LinearDiscriminantAnalysis()), ('LinearSVC', LinearSVC()), ('RidgeClf', RidgeClassifier()), ('DecisionTree', DecisionTreeClassifier()), ('RandomForest', RandomForestClassifier())]


<br>_compatibility block for pasting in from sample code_

In [16]:
# dataset names
X_train = features_train
y_train = labels_train
X_test = features_test
y_test = labels_test
labels_col = 'label'
# library names
pd = pandas
np = numpy

**<br>Target Label Distributions** (standard block)

In [17]:
# from our local library
show_labels_dist(X_train,X_test,y_train,y_test)

features_train: 63280 rows, 121 columns
features_test:  22544 rows, 121 columns

labels_train: 63280 rows, 1 column
labels_test:  22544 rows, 1 column

Frequency and Distribution of labels
        label  %_train  label  %_test
normal  33672    53.21   9711   43.08
attack  29608    46.79  12833   56.92


**<br>Fit and Predict** (standard block)

In [18]:
# evaluate each model in turn
results = []
from sklearn.metrics import confusion_matrix
#print('macro average: unweighted mean per label')
#print('weighted average: support-weighted mean per label')
#print('MCC: correlation between prediction and ground truth')
#print('     (+1 perfect, 0 random prediction, -1 inverse)\n')

for name, clf in models:
    trs = time()
    print('\nConfusion Matrix:', name)
    
    clf.fit(X_train, y_train)
    ygx = clf.predict(X_test)
    results.append((name, ygx))
    
    tre = time() - trs
    print ("Run Time {} seconds".format(round(tre,2)))
    
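# Easy way to ensure that the confusion matrix rows and columns
#   are labeled exactly as the classifier has coded the classes
#   [[note the _ at the end of clf.classes_ ]]
# Rows are the true labels from y_test (the 'train:' prefix is only
#   a display label); columns are the predicted labels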

    tptn_df = pd.DataFrame(confusion_matrix(y_test, ygx, labels=clf.classes_), 
                           index=['train:{:}'.format(x) for x in clf.classes_], 
                           columns=['pred:{:}'.format(x) for x in clf.classes_])
    print(tptn_df)  

#    show_metrics(y_test, ygx, clf.classes_)   # from our local library
#    print('\nParameters: ', clf.get_params(), '\n\n')


Confusion Matrix: StocGradDes
Run Time 0.9 seconds
              pred:attack  pred:normal
train:attack         8484         4349
train:normal          755         8956

Confusion Matrix: LinearDA
Run Time 4.34 seconds
              pred:attack  pred:normal
train:attack         8346         4487
train:normal          663         9048

Confusion Matrix: LinearSVC
Run Time 4.3 seconds
              pred:attack  pred:normal
train:attack         8317         4516
train:normal          735         8976

Confusion Matrix: RidgeClf
Run Time 1.12 seconds
              pred:attack  pred:normal
train:attack         8345         4488
train:normal          664         9047

Confusion Matrix: DecisionTree
Run Time 3.42 seconds
              pred:attack  pred:normal
train:attack        10995         1838
train:normal          916         8795

Confusion Matrix: RandomForest
Run Time 26.69 seconds
              pred:attack  pred:normal
train:attack        10611         2222
train:normal          875         8836

***

**<br>Baseline Model**
>Select this block - Go to the Run menu - Run all Above

***
**Statistical Comparison of Models**
<br>The hypotheses for each comparison:
>H0: Both models perform equally well on the dataset.
<br>H1: The two models do not perform equally well on the dataset.

The chosen significance threshold for rejecting the null hypothesis is `alpha = 0.05`.
***

* Cochran's Q omnibus test<br>
* McNemar post-hoc with multiple adjustments
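
Both tests work on a record of which samples each classifier got right. The sketch below shows the textbook statistics on made-up predictions. It is not the `P_HAKN` implementation, whose internals are not shown in this notebook; the function names and the continuity correction in McNemar's test are assumptions.

```python
import numpy as np
from scipy.stats import chi2

def correctness_matrix(y_true, predictions):
    """One row per sample, one 0/1 column per classifier (1 = correct)."""
    y_true = np.asarray(y_true)
    return np.column_stack(
        [(np.asarray(p) == y_true).astype(int) for p in predictions])

def cochrans_q(correct):
    """Cochran's Q statistic and p-value for an (n_samples, k) 0/1 matrix."""
    k = correct.shape[1]
    col_tot = correct.sum(axis=0)    # correct answers per classifier
    row_tot = correct.sum(axis=1)    # classifiers that got each sample right
    denom = (row_tot * (k - row_tot)).sum()
    if denom == 0:                   # all classifiers agree on every sample
        return 0.0, 1.0
    q = k * (k - 1) * ((col_tot - col_tot.mean()) ** 2).sum() / denom
    return float(q), float(chi2.sf(q, k - 1))

def mcnemar(correct_a, correct_b):
    """McNemar's test with continuity correction on two 0/1 vectors."""
    b = int(np.sum((correct_a == 1) & (correct_b == 0)))  # only A correct
    c = int(np.sum((correct_a == 0) & (correct_b == 1)))  # only B correct
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return float(stat), float(chi2.sf(stat, 1))

# Made-up example: 20 samples, two classifiers
y_demo = np.ones(20, dtype=int)
pred_a = np.array([1] * 15 + [0] * 5)
pred_b = np.array([1] * 5 + [0] * 10 + [1] * 2 + [0] * 3)

correct = correctness_matrix(y_demo, [pred_a, pred_b])
q_stat, q_p = cochrans_q(correct)
m_stat, m_p = mcnemar(correct[:, 0], correct[:, 1])
print("Cochran Q:", round(q_stat, 3), "p =", round(q_p, 4))
print("McNemar:  ", round(m_stat, 3), "p =", round(m_p, 4))
```

The omnibus test is run first over all classifiers. Only if it rejects H0 are the pairwise McNemar tests run, and their p-values then need adjusting for the number of comparisons.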

In [19]:
# nonparametric tests for multiple classifiers trained on one dataset
from P_HAKN import cq_mph, filtr_ap2h0, filtr_psig  

In [20]:
ph_pvals_df = cq_mph(y_test,results)
ph_pvals_df

Cochran Q Test: ['StocGradDes', 'LinearDA', 'LinearSVC', 'RidgeClf', 'DecisionTree', 'RandomForest']
	p_value = 0.0 and ChiSquare = 7432.861 

H0: there is no difference in performance at the 95.0% confidence level
	Reject - Continuing with post-hoc tests

Classifiers: 6    Tests: 15



| Pair | p_noadj | ap_BDun | ap_Sdak | ap_Holm | ap_Finr | ap_Hoch | ap_Li |
|---|---|---|---|---|---|---|---|
| StocGradDes // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| StocGradDes // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearDA // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearDA // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearSVC // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearSVC // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| RidgeClf // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| RidgeClf // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| DecisionTree // RandomForest | 6.515636e-21 | 9.773452999999999e-20 | 0.0 | 4.560945e-20 | 0.0 | 4.560945e-20 | 1.251803e-20 |
| StocGradDes // LinearSVC | 6.275246000000001e-17 | 9.412869e-16 | 1.665335e-15 | 3.765148e-16 | 1.110223e-16 | 3.765148e-16 | 1.205619e-16 |


In [21]:
zz=filtr_ap2h0(ph_pvals_df)
zz

| Pair | p_noadj | H0: BDun | H0: Sdak | H0: Holm | H0: Finr | H0: Hoch | H0: Li |
|---|---|---|---|---|---|---|---|
| StocGradDes // DecisionTree | False | False | False | False | False | False | False |
| StocGradDes // RandomForest | False | False | False | False | False | False | False |
| LinearDA // DecisionTree | False | False | False | False | False | False | False |
| LinearDA // RandomForest | False | False | False | False | False | False | False |
| LinearSVC // DecisionTree | False | False | False | False | False | False | False |
| LinearSVC // RandomForest | False | False | False | False | False | False | False |
| RidgeClf // DecisionTree | False | False | False | False | False | False | False |
| RidgeClf // RandomForest | False | False | False | False | False | False | False |
| DecisionTree // RandomForest | False | False | False | False | False | False | False |
| StocGradDes // LinearSVC | False | False | False | False | False | False | False |


For analysis, it may be useful to consider just the pairs with significant differences.

In [22]:
print("subset: unadjusted p_value is significant")
pv_sig_df = filtr_psig(ph_pvals_df)
pv_sig_df

subset: unadjusted p_value is significant
Significant: 14    Not: 1
Classifiers: 6    Tests: 14


| Pair | p_noadj | ap_BDun | ap_Sdak | ap_Holm | ap_Finr | ap_Hoch | ap_Li |
|---|---|---|---|---|---|---|---|
| StocGradDes // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| StocGradDes // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearDA // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearDA // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearSVC // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LinearSVC // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| RidgeClf // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| RidgeClf // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| DecisionTree // RandomForest | 6.515636e-21 | 9.12189e-20 | 0.0 | 3.9093809999999995e-20 | 0.0 | 3.9093809999999995e-20 | 6.683846e-21 |
| StocGradDes // LinearSVC | 6.275246000000001e-17 | 8.785344e-16 | 1.554312e-15 | 3.137623e-16 | 1.110223e-16 | 3.137623e-16 | 6.437251e-17 |


Essentially, the post-hoc test tells us only which pairs showed a significant difference. To find out which model of a pair performed better, we need to go back to the composite metrics from the confusion matrix: AUC, F1, gmean etc.
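
As a sketch of that last step, the metrics can be computed directly from the confusion matrix counts printed earlier. The helper name `binary_metrics` and the choice of 'attack' as the positive class are assumptions; the counts are the RidgeClf and DecisionTree matrices from the Fit and Predict output.

```python
from math import sqrt

def binary_metrics(tp, fn, fp, tn):
    """Composite metrics for the positive class from confusion matrix counts."""
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
        "gmean": sqrt(recall * specificity),
        "mcc": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Rows of each matrix are true labels, columns are predictions
ridge = binary_metrics(tp=8345, fn=4488, fp=664, tn=9047)
tree = binary_metrics(tp=10995, fn=1838, fp=916, tn=8795)

for key in ridge:
    print(f"{key:9s} RidgeClf={ridge[key]:.4f}  DecisionTree={tree[key]:.4f}")
```

Read together with the post-hoc result, this gives the direction of the difference: the test says the pair differs significantly, and the metrics say which model is ahead.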

#### One vs All 
In some cases, we do not care about all pairwise comparisons as we only propose a single method, or just need to compare to a baseline method. In this case we designate a control method, and compare all others to it.

In [23]:
ph_ctl_df = cq_mph(y_test,results,cq=False,control='RidgeClf')
ph_ctl_df

Classifiers: 6    Tests: 5


| Pair | p_noadj | ap_BDun | ap_Sdak | ap_Holm | ap_Finr | ap_Hoch | ap_Li |
|---|---|---|---|---|---|---|---|
| RidgeClf // DecisionTree | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| RidgeClf // RandomForest | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| RidgeClf // LinearSVC | 3e-06 | 1.3e-05 | 1.3e-05 | 8e-06 | 4e-06 | 8e-06 | 5e-06 |
| RidgeClf // StocGradDes | 0.01937 | 0.096849 | 0.093169 | 0.03874 | 0.024153 | 0.03874 | 0.035879 |
| RidgeClf // LinearDA | 0.4795 | 1.0 | 0.961796 | 0.4795 | 0.4795 | 0.4795 | 0.4795 |


In [24]:
za=filtr_ap2h0(ph_ctl_df)
za

| Pair | p_noadj | H0: BDun | H0: Sdak | H0: Holm | H0: Finr | H0: Hoch | H0: Li |
|---|---|---|---|---|---|---|---|
| RidgeClf // DecisionTree | False | False | False | False | False | False | False |
| RidgeClf // RandomForest | False | False | False | False | False | False | False |
| RidgeClf // LinearSVC | False | False | False | False | False | False | False |
| RidgeClf // StocGradDes | False | True | True | False | False | False | False |
| RidgeClf // LinearDA | True | True | True | True | True | True | True |
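
The `ap_BDun` and `ap_Holm` columns appear consistent with the standard Bonferroni and Holm step-down adjustments. A minimal sketch under that assumption follows; the `P_HAKN` internals are not shown here, and the function names are hypothetical. The input is the `p_noadj` column from the control comparison, as displayed (rounded).

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down: smallest p times m, next times m-1, and so on,
    keeping the adjusted values non-decreasing in sorted order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

alpha = 0.05
pairs = ["DecisionTree", "RandomForest", "LinearSVC", "StocGradDes", "LinearDA"]
p_noadj = [0.0, 0.0, 3e-06, 0.01937, 0.4795]   # rounded values from the table

p_bonf = bonferroni(p_noadj)
p_holm = holm(p_noadj)

# True means the adjusted p-value exceeds alpha, so H0 is not rejected
keep_h0_bonf = [p > alpha for p in p_bonf]
keep_h0_holm = [p > alpha for p in p_holm]

for name, pb, ph in zip(pairs, p_bonf, p_holm):
    print(f"RidgeClf // {name:13s} Bonferroni={pb:.6f}  Holm={ph:.6f}")
```

The RidgeClf // StocGradDes pair shows why the choice of adjustment matters: its raw p-value is below 0.05, the Bonferroni-adjusted value is above it, and the less conservative Holm value stays below it.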


 ***