# Supervised Machine Learning - Data Classification Demo
## Part 3 - MODEL TRAINING

In this notebook, we load the processed dataset file and use it to train several classification models.

> **INPUT:** the prepared dataset CSV file, cleaned and processed in the previous phases.<br>
> **OUTPUT:** a comparison of the prediction accuracy and performance of multiple machine learning classification algorithms.  

***

### 1. INITIALIZATION

In [70]:
#importing required libraries and modules
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix


### 2. LOADING PROCESSED DATASET

#### Reading dataset file into pandas DataFrame

In [2]:
#initialize required variables to read the cleaned data file
data_file_location = "..\\data\\processed\\"
data_file_name = "conn.log.labeled_processed"
data_file_ext = ".csv"


#read the dataset
data_df = pd.read_csv(data_file_location + data_file_name + data_file_ext, index_col=0)
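The path in the cell above is built with Windows-style `\\` separators, so it only resolves on Windows. As a minimal sketch, assuming the same relative folder layout, the path could be assembled with `pathlib` so that it also works on Linux and macOS. The names `data_dir` and `data_path` are introduced here for illustration and are not part of the original notebook.

```python
from pathlib import Path

#same pieces as in the cell above, joined with pathlib instead of "\\" separators
data_dir = Path("..") / "data" / "processed"  #hypothetical variable name
data_file_name = "conn.log.labeled_processed"
data_file_ext = ".csv"

#append the extension by string concatenation; with_suffix() would treat
#".labeled_processed" as the existing suffix and replace it
data_path = data_dir / (data_file_name + data_file_ext)
print(data_path)
```

`pd.read_csv` accepts path objects, so `data_path` could be passed to it directly in place of the concatenated string.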

#### Exploring dataset summary and statistics

In [3]:
#check dataset shape
data_df.shape

(23145, 33)

In [4]:
#check dataset head
data_df.head()

| Unnamed: 0 | id.orig_h | id.orig_p | id.resp_h | id.resp_p | duration | orig_bytes | resp_bytes | missed_bytes | orig_pkts | orig_ip_bytes | ... | conn_state_S3 | conn_state_SF | history_C | history_D | history_Dd | history_Other | history_S | history_ShAdDaf | history_ShAdDaft | history_ShAdfDr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 3.139211 | 0.0 | 0.0 | 0.0 | 3.0 | 180.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 3.152487 | 0.0 | 0.0 | 0.0 | 1.0 | 60.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 3.152487 | 0.0 | 0.0 | 0.0 | 1.0 | 60.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3232236000.0 | 41040.0 | 3119782000.0 | 80.0 | 1.477656 | 149.0 | 128252.0 | 2896.0 | 94.0 | 5525.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 3232236000.0 | 41042.0 | 3119782000.0 | 80.0 | 3.147116 | 0.0 | 0.0 | 0.0 | 3.0 | 180.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


### 3. MODEL TRAINING

In [5]:
#split data into features (independent variables) and target (dependent variable)
data_X = data_df.drop("label", axis=1)
data_y = data_df["label"]

In [6]:
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=.2, random_state=101)

In [63]:
#an overview of the data sets
print(f"data_X: {data_X.shape}")
print(f"data_y: {data_y.shape}")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test: {y_test.shape}")

data_X: (23145, 32)
data_y: (23145,)
X_train: (18516, 32)
X_test: (4629, 32)
y_train: (18516,)
y_test: (4629,)
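The split above is purely random, so the class proportions in the train and test sets can drift from those of the full dataset, which matters more when one class is much rarer than the other. As a hedged sketch, not the notebook's own method, `train_test_split` takes a `stratify` argument that preserves the label proportions in both parts. The toy data below (`toy_X`, `toy_y`) is hypothetical and only stands in for `data_X` and `data_y`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

#toy imbalanced labels (hypothetical): 10 samples of class 0, 90 of class 1
toy_X = [[i] for i in range(100)]
toy_y = [0] * 10 + [1] * 90

#stratify=toy_y keeps the 10/90 class ratio in both the train and test parts
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=101, stratify=toy_y
)

print(Counter(ty_train))
print(Counter(ty_test))
```

In the notebook, the equivalent change would be to add `stratify=data_y` to the existing `train_test_split` call.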


In [71]:
#candidate classifiers; only Naive Bayes is enabled, the remaining entries are placeholders
classifiers = [
    ("Naive Bayes", GaussianNB()),
    # ("Decision Tree", ),
    # ("Support Vector Machines", ),
    # ("Random Forest", ),
    # ("K-Nearest Neighbors", ),
    # ("Logistic Regression", ),
    # ("XGBoost", ),
]

#fit each classifier on the train set and print its confusion matrix for the test set
for est_name, est_obj in classifiers:
    y_pred = est_obj.fit(X_train, y_train).predict(X_test)
    print(confusion_matrix(y_test, y_pred))

[[ 335   66]
 [  22 4206]]
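
To make the confusion matrix above easier to read, the following sketch derives accuracy and per-class precision and recall from its four counts using plain Python. It relies only on scikit-learn's convention that rows are true classes and columns are predicted classes. Which label corresponds to index 0 or 1 is not shown in the output, so the classes are referred to by index only.

```python
#confusion matrix as printed above: rows = true class, columns = predicted class
cm = [[335, 66],
      [22, 4206]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

#recall is computed along rows (true class), precision along columns (predicted class)
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(len(cm))]

print(f"accuracy: {accuracy:.4f}")
for i in range(len(cm)):
    print(f"class index {i}: precision={precision[i]:.4f}, recall={recall[i]:.4f}")
```

The diagonal holds 4541 of the 4629 test rows, which is an accuracy of roughly 98.1%. The same figures, labelled with the actual class names, can be obtained from the already imported `classification_report(y_test, y_pred)`.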

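The commented-out entries in the classifier list could be filled in along the following lines. This is a sketch under stated assumptions, not the author's final configuration: it uses scikit-learn estimators with mostly default hyperparameters, a synthetic dataset from `make_classification` as a stand-in for the real data, and hypothetical names (`demo_classifiers`, `demo_scores`, and the `Xd_`/`yd_` variables). XGBoost is left out because it lives in a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

#synthetic stand-in for the notebook's data (an assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101
)

#one (name, estimator) pair per algorithm, mirroring the notebook's list layout
demo_classifiers = [
    ("Naive Bayes", GaussianNB()),
    ("Decision Tree", DecisionTreeClassifier(random_state=101)),
    ("Support Vector Machines", SVC()),
    ("Random Forest", RandomForestClassifier(random_state=101)),
    ("K-Nearest Neighbors", KNeighborsClassifier()),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
]

#fit each estimator and record its accuracy on the held-out part
demo_scores = {}
for est_name, est_obj in demo_classifiers:
    y_pred_demo = est_obj.fit(Xd_train, yd_train).predict(Xd_test)
    demo_scores[est_name] = accuracy_score(yd_test, y_pred_demo)

#print the estimators from highest to lowest accuracy
for est_name, score in sorted(demo_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{est_name}: {score:.3f}")
```

Distance- and margin-based models such as K-Nearest Neighbors and Support Vector Machines are sensitive to feature scale, so on the real dataset they would likely benefit from a scaling step before fitting.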