# Supervised Machine Learning - Data Classification Demo
## Part 3 - MODEL TRAINING

In this notebook, we load the processed dataset file and use it to train several classification models.

> **INPUT:** the prepared dataset CSV file, cleaned and processed in the previous phases.<br>
> **OUTPUT:** a comparison of the prediction accuracy and performance of multiple machine learning classification algorithms.

***

### 1. INITIALIZATION

In [98]:
#importing required libraries and modules
import pandas as pd
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, confusion_matrix, recall_score, accuracy_score, f1_score
from statistics import mean
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression


### 2. LOADING PROCESSED DATASET

#### Reading dataset file into pandas DataFrame

In [84]:
#initialize required variables to read the cleaned data file
data_file_location = "..\\data\\processed\\"
data_file_name = "conn.log.labeled_processed"
data_file_ext = ".csv"


#read the dataset
data_df = pd.read_csv(data_file_location + data_file_name + data_file_ext, index_col=0)
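The path above is written with Windows-style separators. As a minimal sketch of a more portable alternative, assuming the same relative folder layout, the path can be built with `pathlib`. The name `data_file_path` is introduced here for illustration only and does not appear in the notebook.

```python
from pathlib import Path

# Same components as in the cell above, joined in an OS-independent way
data_file_location = Path("..") / "data" / "processed"
data_file_name = "conn.log.labeled_processed"
data_file_ext = ".csv"

# Join name and extension as strings: the file name itself contains dots,
# so Path.with_suffix() would replace ".labeled_processed" instead of appending ".csv"
data_file_path = data_file_location / (data_file_name + data_file_ext)

print(data_file_path.name)
```

The resulting object can be passed directly to `pd.read_csv(data_file_path, index_col=0)`.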

#### Exploring dataset summary and statistics

In [85]:
#check dataset shape
data_df.shape

(23145, 33)

In [86]:
#check dataset head
data_df.head()

| | id.orig_h | id.orig_p | id.resp_h | id.resp_p | duration | orig_bytes | resp_bytes | missed_bytes | orig_pkts | orig_ip_bytes | ... | conn_state_S3 | conn_state_SF | history_C | history_D | history_Dd | history_Other | history_S | history_ShAdDaf | history_ShAdDaft | history_ShAdfDr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 3.139211 | 0.0 | 0.0 | 0.0 | 3.0 | 180.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 3.152487 | 0.0 | 0.0 | 0.0 | 1.0 | 60.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 3.152487 | 0.0 | 0.0 | 0.0 | 1.0 | 60.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 1.477656 | 149.0 | 128252.0 | 2896.0 | 94.0 | 5525.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 3232236000.0 | 41042.0 | 3119782000.0 | 80.0 | 3.147116 | 0.0 | 0.0 | 0.0 | 3.0 | 180.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


### 3. MODEL TRAINING

In [87]:
#split data into independent and dependent variables
data_X = data_df.drop("label", axis=1)
data_y = data_df["label"]

#### Initializing classification models

To compare performance across algorithms, we initialize six widely used classifiers: Naive Bayes, Decision Tree, Support Vector Machine, Random Forest, K-Nearest Neighbors, and Logistic Regression.

In [101]:
#initialize classification models
classifiers = [
    ("Naive Bayes", GaussianNB()),
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Support Vector Machines", SVC(gamma='auto')),
    ("Random Forest", RandomForestClassifier(max_depth=2, random_state=0)),
    ("K-Nearest Neighbors", KNeighborsClassifier(n_neighbors=5)),
    ("Logistic Regression", LogisticRegression(random_state=0)),
    # ("XGBoost", ),
]

#### Initializing the cross-validation technique

- To obtain a more representative estimate of each model's performance, we use cross-validation instead of a single train/test split.
- Because the class distribution is imbalanced, we prefer the Stratified K-Folds cross-validator over KFold: it preserves the class proportions in every fold, so each fold contains samples of both labels.
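The effect of stratification can be illustrated on a small synthetic example. The sketch below uses made-up labels (not the notebook's dataset) and a hypothetical helper, `minority_counts`, to count how many minority-class samples land in each test fold under both splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic, imbalanced labels (illustrative only):
# 900 samples of class 1 followed by 100 samples of class 0
y = np.array([1] * 900 + [0] * 100)
X = np.zeros((len(y), 1))  # feature values do not matter for the split itself

def minority_counts(splitter):
    """Return the number of class-0 samples in each test fold."""
    return [int((y[test_idx] == 0).sum()) for _, test_idx in splitter.split(X, y)]

# Stratified folds keep the 9:1 class ratio in every fold
stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

# Plain KFold without shuffling cuts the data into consecutive blocks
plain = minority_counts(KFold(n_splits=5, shuffle=False))

print("StratifiedKFold:", stratified)
print("KFold (no shuffle):", plain)
```

With ordered labels like these, the unshuffled KFold puts all minority samples into the last fold, while the stratified splitter spreads them evenly.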

In [89]:
#initialize the cross-validator with sample shuffling activated
skf_cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

#### Training classification models

In [103]:
#initialize the results summary
results = pd.DataFrame(index=[c[0] for c in classifiers], columns=["Accuracy", "TP", "FP", "TN", "FN", "Recall", "Precision", "F1"])

#iterate over the estimators
for est_name, est_object in classifiers:
    
    #initialize the results for each classifier
    accuracy_scores = []
    confusion_matrices = []
    recall_scores = []
    precision_scores = []
    f1_scores = []
    
    #iterate over the obtained folds
    for train_index, test_index in skf_cv.split(data_X, data_y):
        #get train and test samples from the cross validation model
        X_train, X_test = data_X.iloc[train_index], data_X.iloc[test_index]
        y_train, y_test = data_y.iloc[train_index], data_y.iloc[test_index]
        
        #train the model
        est_object.fit(X_train.values, y_train.values)
        
        #predict the test samples
        y_pred = est_object.predict(X_test.values)
        
        #record the evaluation metrics for this fold
        accuracy_scores.append(accuracy_score(y_test, y_pred))
        confusion_matrices.append(confusion_matrix(y_test, y_pred))
        recall_scores.append(recall_score(y_test, y_pred))
        precision_scores.append(precision_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))
    
    #summarize the results for all folds for each classifier:
    #confusion-matrix counts are summed over the folds, the other metrics are averaged
    tn, fp, fn, tp = sum(confusion_matrices).ravel()
    results.loc[est_name] = [mean(accuracy_scores), tp, fp, tn, fn, mean(recall_scores), mean(precision_scores), mean(f1_scores)]
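The aggregation step relies on two details: summing a list of confusion matrices adds them element by element, and for binary labels `ravel()` flattens the 2x2 matrix in the order TN, FP, FN, TP. A minimal sketch with two hypothetical folds (the label arrays are invented for illustration) shows this. Passing `labels=[0, 1]` is an addition of this sketch, not of the notebook; it keeps every matrix 2x2 even if a fold happened to contain a single class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Two hypothetical folds with true and predicted binary labels
fold_truth = [np.array([0, 0, 1, 1, 1]), np.array([0, 1, 1, 1, 1])]
fold_pred = [np.array([0, 1, 1, 1, 0]), np.array([0, 1, 1, 0, 1])]

# One 2x2 matrix per fold: rows are true labels, columns are predicted labels
matrices = [
    confusion_matrix(truth, pred, labels=[0, 1])
    for truth, pred in zip(fold_truth, fold_pred)
]

# Built-in sum() adds the NumPy arrays element-wise
pooled = sum(matrices)

# Row-major flattening of [[TN, FP], [FN, TP]]
tn, fp, fn, tp = (int(v) for v in pooled.ravel())
print(tn, fp, fn, tp)
```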
        
    



In [104]:
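#show the results summary (head() returns only the first 5 rows, so the sixth classifier, Logistic Regression, is not displayed)
results.head()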

| | Accuracy | TP | FP | TN | FN | Recall | Precision | F1 |
|---|---|---|---|---|---|---|---|---|
| Naive Bayes | 0.263037 | 4225 | 60 | 1863 | 16997 | 0.199105 | 0.756846 | 0.199389 |
| Decision Tree | 0.999914 | 21220 | 0 | 1923 | 2 | 0.999906 | 1.0 | 0.999953 |
| Support Vector Machines | 0.98639 | 21222 | 315 | 1608 | 0 | 1.0 | 0.985376 | 0.992633 |
| Random Forest | 0.995334 | 21220 | 106 | 1817 | 2 | 0.999906 | 0.99503 | 0.997462 |
| K-Nearest Neighbors | 0.999827 | 21220 | 2 | 1921 | 2 | 0.999906 | 0.999906 | 0.999906 |
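In this summary, the Accuracy, Recall, Precision and F1 columns are averages of per-fold scores, whereas TP, FP, TN and FN are summed over the folds. As a cross-check, the same metrics can be recomputed from the pooled counts. The sketch below does this for the Naive Bayes row; `pooled_metrics` is a helper introduced here for illustration and is not part of the notebook.

```python
def pooled_metrics(tp, fp, tn, fn):
    """Compute metrics from confusion-matrix counts summed over all folds."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Pooled counts for Naive Bayes, copied from the results table
nb = pooled_metrics(tp=4225, fp=60, tn=1863, fn=16997)
for name, value in nb.items():
    print(f"{name}: {value:.6f}")
```

Pooled values need not equal the fold averages, because a mean of per-fold ratios generally differs from the ratio of the summed counts. For Naive Bayes the pooled accuracy matches the table, but the pooled precision (4225 / 4285, about 0.986) is well above the averaged 0.757, which suggests that precision varied considerably between folds.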
