# Supervised Machine Learning - Data Classification Demo
## Part 3 - MODEL TRAINING

In this notebook, we load the processed dataset file and use it to train several classification models.

> **INPUT:** the prepared dataset CSV file, as cleaned and processed in the previous phases.<br>
> **OUTPUT:** a comparison of the prediction accuracy and performance of multiple machine learning classification algorithms.  

***

### 1. INITIALIZATION

In [118]:
#importing required libraries and modules
import pandas as pd
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import precision_score, confusion_matrix, recall_score, accuracy_score, f1_score
from statistics import mean
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression


In [119]:
#set display options
pd.set_option('display.max_columns', None)

### 2. LOADING PROCESSED DATASET

#### Reading dataset file into pandas DataFrame

In [120]:
#initialize required variables to read the cleaned data file
data_file_location = "..\\data\\processed\\"
data_file_name = "conn.log.labeled_processed"
data_file_ext = ".csv"


#read the dataset
data_df = pd.read_csv(data_file_location + data_file_name + data_file_ext, index_col=0)
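
The path above uses Windows-style separators, so it may not resolve on Linux or macOS. A minimal sketch of a portable alternative using `pathlib` follows; the helper name `build_data_path` is hypothetical and not part of the original notebook, and the folder layout is assumed to match the variables above.

```python
from pathlib import Path

# Hypothetical helper (not in the original notebook): assemble the CSV path
# from its parts so the separator matches the current operating system.
def build_data_path(base_dir, *parts, name, ext):
    # The file name itself contains dots, so the extension is appended as text
    # instead of using Path.with_suffix, which would replace ".labeled_processed".
    return Path(base_dir).joinpath(*parts) / f"{name}{ext}"

data_path = build_data_path("..", "data", "processed",
                            name="conn.log.labeled_processed", ext=".csv")
print(data_path)

# The resulting Path object can be passed straight to pandas:
# data_df = pd.read_csv(data_path, index_col=0)
```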

#### Exploring dataset summary and statistics

In [121]:
#check dataset shape
data_df.shape

(23145, 33)

In [122]:
#check dataset head
data_df.head()

Unnamed: 0,id.orig_h,id.orig_p,id.resp_h,id.resp_p,duration,orig_bytes,resp_bytes,missed_bytes,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,label,proto_tcp,proto_udp,service_dhcp,service_dns,service_http,service_irc,conn_state_OTH,conn_state_RSTR,conn_state_S0,conn_state_S1,conn_state_S3,conn_state_SF,history_C,history_D,history_Dd,history_Other,history_S,history_ShAdDaf,history_ShAdDaft,history_ShAdfDr
0,1.0,0.628686,0.855795,0.001238,0.62005,0.0,0.0,0.0,0.000163,2.366458e-06,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.628686,0.855795,0.001238,0.623234,0.0,0.0,0.0,5.4e-05,7.888192e-07,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.0,0.628686,0.855795,0.001238,0.623234,0.0,0.0,0.0,5.4e-05,7.888192e-07,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.628686,0.855795,0.001238,0.221583,2e-06,0.780758,0.5,0.005097,7.26371e-05,0.08972,0.823184,0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.628717,0.855795,0.001238,0.621946,0.0,0.0,0.0,0.000163,2.366458e-06,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


### 3. MODEL TRAINING

In [123]:
#split data into independent and dependent variables
data_X = data_df.drop("label", axis=1)
data_y = data_df["label"]
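
Before training, it can help to confirm that the target column is absent from the features and to look at the class balance, since the later sections rely on the labels being imbalanced. The sketch below uses a small made-up DataFrame (`toy_df`, an assumption for illustration) rather than the real dataset.

```python
import pandas as pd

# Toy stand-in for data_df (values are illustrative, not taken from the dataset).
toy_df = pd.DataFrame({
    "duration": [0.1, 0.4, 0.2, 0.9, 0.3, 0.7],
    "proto_tcp": [1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    "label": [1, 1, 1, 1, 0, 1],
})

# Same split as in the cell above.
toy_X = toy_df.drop("label", axis=1)
toy_y = toy_df["label"]

# The target column must not leak into the feature matrix.
assert "label" not in toy_X.columns

# Absolute and relative class frequencies.
class_counts = toy_y.value_counts()
class_share = toy_y.value_counts(normalize=True)
print(class_counts.to_dict())
print(class_share.round(3).to_dict())
```

Running the same two `value_counts` calls on `data_y` would show how skewed the real labels are.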

#### Initializing classification models

To compare the performance of several models, we select a set of popular machine learning models for classification tasks. In the run recorded here, only Naive Bayes is enabled; the remaining entries are commented out.

In [124]:
#initialize classification models
classifiers = [
    ("Naive Bayes", ComplementNB()), #since we have unbalanced labels, we use the Complement version of Naive Bayes which is particularly suited for imbalanced data sets.
    # ("Decision Tree", DecisionTreeClassifier()),
    # ("Support Vector Machines", SVC()),
    # ("Random Forest", RandomForestClassifier()),
    # ("K-Nearest Neighbors", KNeighborsClassifier()),
    # ("Logistic Regression", LogisticRegression()),
    # ("AdaBoost", AdaBoostClassifier()),
    # ("XGBoost", ),
]

#### Initializing the cross-validation technique

- To obtain a more representative estimate of each model's performance, we use cross-validation instead of a single train/test split.
- Because the class distribution is imbalanced, we prefer the Stratified K-Folds cross-validator over KFold: it preserves the class proportions in every fold, so the minority class is represented in each one.

In [125]:
#initialize the cross-validator with sample shuffling activated
skf_cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
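
The effect of stratification can be illustrated on a toy label vector. The sketch below assumes 90 samples of class 1 and 10 of class 0 (made-up numbers, sorted by class) and counts how many minority samples end up in each test fold for `StratifiedKFold` and for an unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 positives followed by 10 negatives.
toy_y = np.array([1] * 90 + [0] * 10)
toy_X = np.zeros((toy_y.size, 1))  # feature values are irrelevant for the split

def minority_counts(cv):
    """Number of class-0 samples that land in each test fold."""
    return [int((toy_y[test_idx] == 0).sum())
            for _, test_idx in cv.split(toy_X, toy_y)]

stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
plain = minority_counts(KFold(n_splits=5, shuffle=False))

print("StratifiedKFold:", stratified)
print("KFold (no shuffle):", plain)
```

With stratification, every test fold receives the same number of minority samples. The unshuffled `KFold` places all of them in the last fold, so four of the five folds are evaluated without a single class-0 sample.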

#### Training classification models

In [128]:
print("Model Training Started!")
#initialize the results summary
classification_results = pd.DataFrame(index=[c[0] for c in classifiers], columns=["Accuracy", "TP", "FP", "TN", "FN", "Recall", "Precision", "F1"])

#iterate over the estimators
for est_name, est_object in classifiers:
    
    print(f"### [{est_name}]: Processing ...")
    
    #initialize the results for each classifier
    accuracy_scores = []
    confusion_matrices = []
    recall_scores = []
    precision_scores = []
    f1_scores = []
    
    #iterate over the obtained folds
    for train_index, test_index in skf_cv.split(data_X, data_y):

        #get train and test samples from the cross validation model
        X_train, X_test = data_X.iloc[train_index], data_X.iloc[test_index]
        y_train, y_test = data_y.iloc[train_index], data_y.iloc[test_index]
        
        #train the model
        est_object.fit(X_train.values, y_train.values)
        
        #predict the test samples
        y_pred = est_object.predict(X_test.values)
        
        #compute the evaluation metrics for this fold
        accuracy_scores.append(accuracy_score(y_test, y_pred))
        confusion_matrices.append(confusion_matrix(y_test, y_pred))
        recall_scores.append(recall_score(y_test, y_pred))
        precision_scores.append(precision_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))
    
    #summarize the results for all folds for each classifier:
    #confusion-matrix counts are summed over the folds, the other metrics are averaged
    tn, fp, fn, tp = sum(confusion_matrices).ravel()
    classification_results.loc[est_name] = [mean(accuracy_scores),tp,fp,tn,fn,mean(recall_scores),mean(precision_scores),mean(f1_scores)]
print("Model Training Finished!")


Model Training Started!
### [Naive Bayes]: Processing ...
Model Training Finished!
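
As a possible alternative to the manual fold loop, scikit-learn's `cross_validate` can compute several metrics per fold in one call. The sketch below is not the notebook's method; it runs on synthetic, non-negative data (`toy_X`, `toy_y` are assumptions for illustration) so that it is self-contained.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import ComplementNB

# Synthetic, non-negative stand-in data (ComplementNB rejects negative features).
rng = np.random.default_rng(0)
toy_y = np.array([1] * 160 + [0] * 40)
toy_X = rng.random((toy_y.size, 4))
toy_X[:, 0] += toy_y  # make the first feature informative about the label

metric_names = ["accuracy", "recall", "precision", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(ComplementNB(), toy_X, toy_y, cv=cv, scoring=metric_names)

# Per-fold scores are stored under the key "test_<metric>"; average them.
summary = {name: float(np.mean(scores[f"test_{name}"])) for name in metric_names}
print(summary)
```

`cross_validate` does not return confusion matrices, so the TP/FP/TN/FN columns would still need the explicit loop or a separate prediction step.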


In [129]:
#check the results
classification_results

Model,Accuracy,TP,FP,TN,FN,Recall,Precision,F1
Naive Bayes,0.995679,21220,98,1825,2,0.999906,0.995403,0.997649


### 4. RESULT ANALYSIS

1. Although our class labels are imbalanced, Naive Bayes achieved good results (accuracy 0.9957, F1 0.9976).
2. Most errors are False Positives (98, compared with only 2 False Negatives), which is common in cybersecurity problems.
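
The reported figures can be cross-checked from the pooled confusion-matrix counts in the results table. The sketch below recomputes the metrics from those four counts and adds two context values the table does not show: the accuracy of always predicting label 1, and the false positive rate. Note that the table averages recall, precision and F1 over folds, so pooled values are only expected to agree with it up to rounding.

```python
# Pooled confusion-matrix counts copied from the results table.
tp, fp, tn, fn = 21220, 98, 1825, 2
total = tp + fp + tn + fn

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

# Additional context that the table does not report directly.
majority_baseline = (tp + fn) / total   # accuracy of always predicting label 1
false_positive_rate = fp / (fp + tn)    # share of label-0 samples predicted as 1

print(f"accuracy={accuracy:.6f} recall={recall:.6f} "
      f"precision={precision:.6f} f1={f1:.6f}")
print(f"majority baseline={majority_baseline:.4f} "
      f"false positive rate={false_positive_rate:.4f}")
```

Because roughly 92% of the samples carry label 1, accuracy alone overstates the result: a model that always predicts label 1 would already reach that level. The false positive rate of about 5% on the minority class is the more informative number when judging the second observation above.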