# ML Classification - Network Traffic Analysis
## Part 3 - MODEL TRAINING

In this notebook, we load the processed dataset file and use it to train several classification models.

> **INPUT:** the ready dataset csv file as cleaned and processed in the previous phases.<br>
> **OUTPUT:** a comparison of the prediction accuracy and performance of multiple machine learning classification algorithms.  

***

### 1. INITIALIZATION

In [61]:
# Import necessary libraries and modules
import pandas as pd
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import precision_score, confusion_matrix, recall_score, accuracy_score, f1_score
from statistics import mean
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from joblib import dump


In [62]:
# Set display options
pd.set_option('display.max_columns', None)

### 2. LOADING PROCESSED DATASET

#### Reading dataset file into pandas DataFrame

In [63]:
# Initialize required variables to read the cleaned data file
data_file_location = "..\\data\\processed\\"
data_file_name = "conn.log.labeled_processed"
data_file_ext = ".csv"


# Read the dataset
data_df = pd.read_csv(data_file_location + data_file_name + data_file_ext, index_col=0)
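
The backslash-separated path above only resolves on Windows. A minimal sketch of a portable variant, assuming the same folder layout, builds the path with `pathlib`. The names `DATA_FILE` and `load_processed` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Same folder layout as above, expressed without OS-specific separators
DATA_DIR = Path("..") / "data" / "processed"
DATA_FILE = DATA_DIR / "conn.log.labeled_processed.csv"


def load_processed(csv_path: Path) -> pd.DataFrame:
    """Read the processed dataset, using the first CSV column as the index."""
    return pd.read_csv(csv_path, index_col=0)
```

With this helper, the read step would become `data_df = load_processed(DATA_FILE)`.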

#### Exploring dataset summary and statistics

In [64]:
# Check dataset shape
data_df.shape

(23145, 31)

In [65]:
# Check dataset head
data_df.head()

Unnamed: 0,id.orig_p,id.resp_p,duration,orig_bytes,resp_bytes,missed_bytes,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,label,proto_tcp,proto_udp,service_dhcp,service_dns,service_http,service_irc,conn_state_OTH,conn_state_RSTR,conn_state_S0,conn_state_S1,conn_state_S3,conn_state_SF,history_C,history_D,history_Dd,history_Other,history_S,history_ShAdDaf,history_ShAdDaft,history_ShAdfDr
0,0.628686,0.001238,0.62005,0.0,0.0,0.0,0.000163,2.366458e-06,0.0,0.0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.628686,0.001238,0.620022,3.097425e-07,4.7e-05,0.0,5.4e-05,7.888192e-07,0.0,0.0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.628686,0.001238,0.620022,3.097425e-07,4.7e-05,0.0,5.4e-05,7.888192e-07,0.0,0.0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.628686,0.001238,0.221583,1.972292e-06,0.780758,0.5,0.005097,7.26371e-05,0.08972,0.823184,0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.628717,0.001238,0.621946,0.0,0.0,0.0,0.000163,2.366458e-06,0.0,0.0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


### 3. MODEL TRAINING

In [66]:
# Split data into independent and dependent variables
data_X = data_df.drop("label", axis=1)
data_y = data_df["label"]
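
Since the label imbalance drives several choices later in this notebook, it can help to quantify it right after the split. The sketch below uses toy labels rather than the real dataset, and the helper name `class_distribution` is hypothetical.

```python
import pandas as pd


def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return the count and the share of each class label."""
    counts = labels.value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Toy labels for illustration only (not the real dataset)
toy_y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1], name="label")
print(class_distribution(toy_y))
```

On the real data, the same call would be `class_distribution(data_y)`.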

#### Initializing classification models

To compare the performance of several models, we choose a set of the most popular machine learning algorithms for classification tasks.

In [67]:
# Initialize classification models
classifiers = [
    # The labels are imbalanced, so we use the Complement variant of Naive Bayes, which is designed for imbalanced data sets.
    ("Naive Bayes", ComplementNB()),
    
    # Decision Tree with default parameters: Gini impurity as the split criterion and ccp_alpha=0 (no pruning).
    ("Decision Tree", DecisionTreeClassifier()),
    
    # Logistic Regression, to check how well the classes can be separated linearly.
    ("Logistic Regression", LogisticRegression()),
    
    # Random Forest with the default of 100 base estimators.
    ("Random Forest", RandomForestClassifier()),
    
    # Support Vector Classifier with default parameters (RBF kernel).
    ("Support Vector Classifier", SVC()),
    
    # Distance-based KNN classifier with the default n_neighbors=5.
    ("K-Nearest Neighbors", KNeighborsClassifier()),
  
    # XGBoost gradient-boosted trees with a binary logistic objective and L1 regularization (alpha=10).
    ("XGBoost", xgb.XGBClassifier(objective="binary:logistic", alpha=10)),
]
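
The comments in the list above rely on scikit-learn's default hyperparameters. As a quick sanity check, the defaults can be read back with `get_params()`. This is a sketch: the values reflect recent scikit-learn releases and may differ in older versions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Default hyperparameters referenced in the comments of the classifier list
default_params = {
    "Decision Tree": DecisionTreeClassifier().get_params(),
    "Random Forest": RandomForestClassifier().get_params(),
    "Support Vector Classifier": SVC().get_params(),
    "K-Nearest Neighbors": KNeighborsClassifier().get_params(),
}

print(default_params["Decision Tree"]["criterion"])
print(default_params["Decision Tree"]["ccp_alpha"])
print(default_params["Random Forest"]["n_estimators"])
print(default_params["Support Vector Classifier"]["kernel"])
print(default_params["K-Nearest Neighbors"]["n_neighbors"])
```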

#### Initializing the cross-validation technique

- To obtain more representative results of each model's performance across several train/test partitions, we use cross-validation instead of a single train/test split.
- Since we are dealing with imbalanced class distributions, we use a Stratified K-Folds cross-validator instead of random KFold sampling. This preserves the percentage of both labels in each fold.

In [68]:
# Initialize the cross-validator with 5 splits and sample shuffling activated
skf_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
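
To illustrate what stratification does, the sketch below splits toy labels (10% class 0, 90% class 1, not the real dataset) and reports the share of class 1 in each test fold. Each fold should show the same share as the full label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (10% class 0, 90% class 1), for illustration only
toy_y = np.array([0] * 20 + [1] * 180)
toy_X = np.zeros((len(toy_y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of class 1 in the test part of every fold
fold_positive_shares = [
    float(toy_y[test_idx].mean()) for _, test_idx in skf.split(toy_X, toy_y)
]
print(fold_positive_shares)
```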

#### Training classification models

In [69]:
# clone() gives each fold its own unfitted copy of the estimator
from sklearn.base import clone

print("Model Training Started!")
# Initialize the results summary
classification_results = pd.DataFrame(index=[c[0] for c in classifiers], columns=["Accuracy", "TN", "FP", "FN", "TP", "Recall", "Precision", "F1"])

# Directory where the best model of each classifier is saved
models_path = "..\\models"

# Iterate over the estimators
for est_name, est_object in classifiers:
    
    print(f"### [{est_name}]: Processing ...")
    
    # Initialize the results for each classifier
    accuracy_scores = []
    confusion_matrices = []
    recall_scores = []
    precision_scores = []
    f1_scores = []
    
    # Track the best-performing fold model to be saved
    best_model = None
    best_f1 = -1
    
    # Iterate over the obtained folds
    for train_index, test_index in skf_cv.split(data_X, data_y):

        # Get train and test samples from the cross-validation model
        X_train, X_test = data_X.iloc[train_index], data_X.iloc[test_index]
        y_train, y_test = data_y.iloc[train_index], data_y.iloc[test_index]
        
        # Train a fresh copy of the model; refitting est_object itself would
        # overwrite the model kept from an earlier fold
        fold_model = clone(est_object)
        fold_model.fit(X_train.values, y_train.values)
        
        # Predict the test samples
        y_pred = fold_model.predict(X_test.values)
        
        # Calculate and register the evaluation metrics
        accuracy_scores.append(accuracy_score(y_test, y_pred))
        confusion_matrices.append(confusion_matrix(y_test, y_pred))
        recall_scores.append(recall_score(y_test, y_pred))
        precision_scores.append(precision_score(y_test, y_pred))
        est_f1_score = f1_score(y_test, y_pred)
        f1_scores.append(est_f1_score)
        
        # Keep the model of the fold with the highest F1 score
        if best_f1 < est_f1_score:
            best_model = fold_model
            best_f1 = est_f1_score
    
    # Summarize the results for all folds for each classifier
    tn, fp, fn, tp = sum(confusion_matrices).ravel()
    classification_results.loc[est_name] = [mean(accuracy_scores), tn, fp, fn, tp, mean(recall_scores), mean(precision_scores), mean(f1_scores)]
    
    # Save the best performing model
    if best_model is not None:
        model_name = est_name.replace(' ', '_').replace('-', '_').lower()
        model_file = model_name + ".pkl"
        dump(best_model, models_path + "\\" + model_file)
    
print("Model Training Finished!")
    

Model Training Started!
### [Naive Bayes]: Processing ...
### [Decision Tree]: Processing ...
### [Logistic Regression]: Processing ...


### [Random Forest]: Processing ...
### [Support Vector Classifier]: Processing ...
### [K-Nearest Neighbors]: Processing ...
### [XGBoost]: Processing ...
Model Training Finished!
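
As a more compact alternative to the manual fold loop, scikit-learn's `cross_validate` can compute several metrics at once and return the fitted estimator of every fold. The sketch below is not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the real dataset and a single Decision Tree for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset (assumption)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.1, 0.9], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    cv=cv,
    scoring=["accuracy", "recall", "precision", "f1"],
    return_estimator=True,  # keep one fitted estimator per fold
)

# Pick the estimator from the fold with the highest F1 score
best_fold = int(cv_results["test_f1"].argmax())
best_model = cv_results["estimator"][best_fold]
print(f"Mean F1: {cv_results['test_f1'].mean():.4f}, best fold: {best_fold}")
```

Each entry of `cv_results["estimator"]` is a separate fitted object, so selecting the best fold afterwards does not risk picking up a model that was refit on a later fold.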


In [70]:
# Check the results
classification_results

Model,Accuracy,TN,FP,FN,TP,Recall,Precision,F1
Naive Bayes,0.994772,1838,85,36,21186,0.998304,0.996004,0.997152
Decision Tree,0.999914,1923,0,2,21220,0.999906,1.0,0.999953
Logistic Regression,0.994772,1830,93,28,21194,0.998681,0.995631,0.997154
Random Forest,0.999827,1923,0,4,21218,0.999812,1.0,0.999906
Support Vector Classifier,0.995636,1824,99,2,21220,0.999906,0.995356,0.997626
K-Nearest Neighbors,0.99771,1880,43,10,21212,0.999529,0.997977,0.998752
XGBoost,0.999914,1923,0,2,21220,0.999906,1.0,0.999953
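
In the table above, Accuracy, Recall, Precision, and F1 are means over the five folds, while TN, FP, FN, and TP are sums over the folds. As a cross-check, the metrics can be recomputed from the summed counts. Pooled values may differ slightly from the fold means. The helper name `pooled_metrics` is hypothetical.

```python
# (TN, FP, FN, TP) summed over the five folds, copied from the results table
pooled_counts = {
    "Naive Bayes": (1838, 85, 36, 21186),
    "Support Vector Classifier": (1824, 99, 2, 21220),
    "XGBoost": (1923, 0, 2, 21220),
}


def pooled_metrics(tn, fp, fn, tp):
    """Compute metrics from confusion-matrix counts pooled over all folds."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


for model_name, counts in pooled_counts.items():
    metrics = pooled_metrics(*counts)
    print(model_name, {name: round(value, 6) for name, value in metrics.items()})
```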


### 4. RESULT ANALYSIS

Overall, all models perform very well, with accuracy, precision, recall, and F1 scores above 0.99. The Decision Tree, Random Forest, and XGBoost models achieve near-perfect performance, with no False Positives and at most four False Negatives across all folds.

*Models evaluation:*
- **Naive Bayes** achieved good overall accuracy (0.9948) despite the imbalanced labels, with 121 misclassified samples (85 False Positives and 36 False Negatives).
- **Decision Tree** delivered one of the highest prediction accuracies (0.9999), with only 2 misclassified samples; the label imbalance did not appear to affect it.
- **Logistic Regression** also achieved good results, which suggests that the classes are largely linearly separable. Its 121 misclassified samples (93 False Positives and 28 False Negatives) indicate that the separation is not perfectly linear.
- **Random Forest** performed, as anticipated, among the best models. However, given the already strong performance of the single Decision Tree, it brought no improvement (4 False Negatives versus 2).
- **Support Vector Classifier** also produced good results (accuracy 0.9956), but with the highest number of False Positives (99).
- **KNN** likewise performed well, with 53 misclassified samples (43 False Positives and 10 False Negatives), which may be helped by the features being normalized between 0 and 1.
- **XGBoost** tied with the Decision Tree as the best estimator (2 misclassified samples), in line with its strong reputation on tabular data.

*Overall observations:*
- Most models produced remarkably accurate predictions, especially considering that the False Positive/Negative counts are summed over the five folds.
- Four of the seven estimators achieved relatively lower accuracy; these could potentially be improved with further hyperparameter tuning.
- Every model produced at least two False Negatives, which might be attributed to anomalies or outliers in the original dataset.
- Lower-accuracy models err mostly through False Positives, largely because the majority of the population is labeled as "Malicious", so these models lean toward predicting the majority class.
- Based on their performance, the models fall into two groups with similar behavior: DT, RF, and XGB show near-perfect accuracy, while NB, KNN, LogR, and SVC show good but lower accuracy.
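
To put these observations in numbers, the sketch below derives per-class error rates from the summed confusion-matrix counts, together with the accuracy of a trivial classifier that always predicts the majority class. It assumes that label 1 is the "Malicious" majority class. The helper name `error_rates` is hypothetical.

```python
# (TN, FP, FN, TP) summed over the five folds, copied from the results table
counts = {
    "Naive Bayes": (1838, 85, 36, 21186),
    "Decision Tree": (1923, 0, 2, 21220),
    "Logistic Regression": (1830, 93, 28, 21194),
    "Random Forest": (1923, 0, 4, 21218),
    "Support Vector Classifier": (1824, 99, 2, 21220),
    "K-Nearest Neighbors": (1880, 43, 10, 21212),
    "XGBoost": (1923, 0, 2, 21220),
}


def error_rates(tn, fp, fn, tp):
    """Return (false positive rate, false negative rate)."""
    fpr = fp / (fp + tn)  # share of class-0 samples predicted as class 1
    fnr = fn / (fn + tp)  # share of class-1 samples predicted as class 0
    return fpr, fnr


# Accuracy of always predicting the majority class (assumed to be label 1)
tn, fp, fn, tp = counts["Naive Bayes"]
majority_baseline = (fn + tp) / (tn + fp + fn + tp)
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")

for model_name, model_counts in counts.items():
    fpr, fnr = error_rates(*model_counts)
    print(f"{model_name}: FPR={fpr:.4f}, FNR={fnr:.4f}")
```

Because the minority class holds only 1,923 samples, a few dozen False Positives barely move the overall accuracy but translate into a visible error rate on that class, which is why the two groups of models differ more than their accuracy scores suggest.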