# Network Intrusion Detection
This notebook walks through a simple **ML workflow** that uses classifiers for **intrusion detection**, modeled here as a **binary classification** problem. That is, the goal is to determine whether a piece of network traffic is **abnormal or not**.

## Dataset
* `Train_data.csv`: Training data containing 41 features and 1 column of groundtruths.
* `Test_data.csv`: Testing data containing 41 features and no labels.

<div class="alert alert-blocks alert-warning" style="font-size: 15px;">
   <p>In order to test the <b>generalization ability</b> of the classifiers, I'll use <code>Train_data.csv</code> only. That is, <code>Test_data.csv</code> won't be used.</p>
</div>

In [None]:
# Import packages
import os 
import warnings
import gc

import pandas as pd 
import numpy as np 
from plotly.subplots import make_subplots
import plotly.express as px 
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, classification_report

# Configuration
warnings.simplefilter('ignore')
pd.set_option('display.max_columns', 50)   # Full option name; the short form can be ambiguous in recent pandas

In [None]:
# Variable definitions
DATA_PATH = "../input/network-intrusion-detection"

In [None]:
# Utility functions 
def describe(df, stats):
    '''Describe the basic information of the raw dataframe.
    
    Parameters:
        df: pd.DataFrame, raw dataframe to be analyzed
        stats: boolean, whether to get descriptive statistics 
    
    Return:
        None
    '''
    df_ = df.copy(deep=True)   # Copy of the raw dataframe
    n_features = df_.shape[1]
    if n_features > pd.get_option("display.max_columns"):
        # The column count exceeds the maximum number of columns displayed
        warnings.warn("Please raise the display option display.max_columns "
                      "to enable the complete display.",
                      UserWarning)
    print("=====Basic information=====")
    df_.info()   # info() prints its summary directly and returns None
    get_nan_ratios(df_)
    if stats:
        print("=====Description=====")
        numeric_col_num = df_.select_dtypes(include=np.number).shape[1]   # Number of cols in numeric type
        if numeric_col_num != 0:
            display(df_.describe())
        else:
            print("There's no description of numeric data to display!")
    del df_
    gc.collect()

def get_nan_ratios(df):
    '''Get NaN ratios of columns with NaN values.
    
    Parameters:
        df: pd.DataFrame, raw dataframe to be analyzed
        
    Return:
        None
    '''
    df_ = df.copy()   # Copy of the raw dataframe
    nan_ratios = df_.isnull().sum() / df_.shape[0] * 100   # Ratios of value nan in each column
    nan_ratios = pd.DataFrame([df_.columns, nan_ratios]).T   # Take transpose 
    nan_ratios.columns = ["Columns", "NaN ratios"]
    nan_ratios = nan_ratios[nan_ratios["NaN ratios"] != 0.0]
    print("=====NaN ratios of columns with NaN values=====")
    if len(nan_ratios) == 0:
        print("There isn't any NaN value in the dataset!")
    else:
        display(nan_ratios)
    del df_
    gc.collect() 

# 1. Data Split
As noted in the orange block above, I'll first apply **train-test splitting** to the data in `Train_data.csv` to prevent one specific type of data leakage, **train-test contamination**. That is, I first split off a **hold-out dataset** to serve as the **final testing data**. **KFold cross validation** is then run on the remaining training data to help determine whether the model has **generalization ability**.
<div class="alert alert-blocks alert-info" style="font-size: 15px;">
    <h4>Reference</h4>
    <p><a href="https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f" style="color: orange;">Hold-out vs. Cross-validation in Machine Learning</a></p>
</div>
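
The two-stage idea described above can be sketched on toy data. This is only an illustration: `X_toy` and `y_toy` are hypothetical random arrays standing in for the real dataset, and the split sizes simply mirror the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical toy data standing in for Train_data.csv
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = rng.integers(0, 2, size=100)

# Step 1: set aside the hold-out set; it stays untouched until the final evaluation
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, random_state=42
)

# Step 2: run KFold only on the remaining development data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X_dev)]

print(f"Hold-out size: {len(X_hold)}")
print(f"(train, validation) sizes per fold: {fold_sizes}")
```

Because the hold-out rows never enter any fold, the cross-validation scores and the final test score come from disjoint data.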

In [None]:
# Split the training set and testing set (hold-out).
df = pd.read_csv(os.path.join(DATA_PATH, "Train_data.csv"))
X, y = df.iloc[:, :-1], df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
print(f"Shape of X_train: {X_train.shape}\nShape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}\nShape of y_test: {y_test.shape}")

# 2. Exploratory Data Analysis
What follows is an exploratory data analysis of the training set, covering some characteristics of the features and of the prediction target. Let's go!

## *2.1 Basic Description*
Basic information about the training data is shown, including:
* Data appearance
* Data types of features
* NaN ratio of each feature
* Stats of features

In [None]:
print("=====DataFrame: X_train=====")
display(X_train.head())
describe(X_train, stats=True)

## *2.2 Categorical Features* 
There are three categorical features in the dataset: `protocol_type`, `service` and `flag`. Let's dig deeper into these features!
### 2.2.1 Unique Values

In [None]:
cat_features = ['protocol_type', 'service', 'flag']
for f in cat_features:
    print(f"=====Unique values of {f}=====")
    unique_vals = X_train[f].unique()
    print(unique_vals)
    print(f"Number of unique values: {len(unique_vals)}\n")

### 2.2.2 Ratio of Each Unique Value

In [None]:
for f in cat_features:
    val_counts = X_train[f].value_counts()
    fig = go.Figure()
    fig.add_trace(go.Pie(
        labels=val_counts.index,
        values=val_counts
    ))
    fig.update_traces(textposition='inside') 
    fig.update_layout(
        title=f"Pie Chart of {f}",
        uniformtext_minsize=12, 
        uniformtext_mode='hide'
    )
    fig.show()

## *2.3 Numeric Features*
### 2.3.1 Univariate Distribution
First, the univariate histograms of the features are plotted to give us a preliminary understanding of their distributions.

In [None]:
numeric_features = [col for col in X_train.columns if col not in cat_features]

fig = make_subplots(rows=10, cols=4, subplot_titles=numeric_features)
for i in range(1, 11):
    for j in range(1, 5):
        feature_idx = 4 * (i-1) + (j-1)
        if feature_idx >= len(numeric_features):   # No features left for the remaining subplots
            break
        feature = numeric_features[feature_idx]
        feature_series = X_train[feature]
        sub_fig = go.Histogram(x=feature_series, name=feature)
        fig.add_trace(
            sub_fig,
            row=i,
            col=j
        )
        
fig.update_layout(height=1200, title_text="Univariate Distribution of Numeric Features") 
fig.show()

### 2.3.2 Statistical Dispersion and Variation
Based on the distributions above, some features behave almost like constant features. Hence, I'll compute two measurements to understand the features better:
* The proportion of the **most frequent value** in each feature - from the perspective of **value count**
* The **variance** of each feature - from the perspective of **value**
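
As a minimal sketch of what these two measurements capture, here they are on a tiny made-up table. The column names `nearly_constant` and `spread_out` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy table: one nearly constant column, one well spread column
toy = pd.DataFrame({
    "nearly_constant": [0, 0, 0, 0, 1],
    "spread_out": [1, 2, 3, 4, 5],
})

# Share of rows taken by the most frequent value in each column
# (value_counts sorts in descending order, so the first entry is the largest)
max_proportion = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Sample variance of each column (pandas uses ddof=1 by default)
variance = toy.var()

summary = pd.DataFrame({"Max Proportion": max_proportion, "Variance": variance})
print(summary)
```

A column can look suspicious on one measurement and fine on the other, which is why both are reported side by side.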

In [None]:
n_samples = X_train.shape[0]   # Total number of samples

# Get the proportion of the value with the most count in each feature
max_proportions = pd.DataFrame()
for f in numeric_features:
    feature_series = X_train[f]
    max_proportion = np.max(feature_series.value_counts()) / n_samples
    max_proportions[f] = [max_proportion]
max_proportions.index = ["Max Proportion"]

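# Get the variance of each numeric feature
# (restricting to numeric columns avoids errors on the categorical ones)
variances = pd.DataFrame(X_train[numeric_features].var()).T
variances.index = ["Variance"]

# DataFrame.append was removed in pandas 2.0, so the rows are stacked with pd.concat
disp_and_var = pd.concat([max_proportions, variances])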
print("=====Statistical dispersion and variation=====")
display(disp_and_var)

### 2.3.3 Bivariate Distribution
After simple univariate analysis, let's move on to the bivariate part. First, I'll filter out the features with **high max proportion** or **low variance** based on the pre-defined thresholds to simplify the analysis. The thresholds are defined as follows:
* Max proportion: 0.99
* Variance: 0.001

<div class="alert alert-blocks alert-warning">
    <h4>Notice</h4>
    <p>To keep the visualizations clear, I'll show only a few joint distributions. The others can be plotted in the same way.</p>
    <p>Furthermore, the <b>groundtruths</b> are added in to help us find whether there is any <b>clustering</b> property.</p>
</div>
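
A minimal sketch of this kind of threshold filter, using a made-up summary table. The feature names `feat_a`, `feat_b` and `feat_c` and their values are hypothetical; only the two thresholds come from the text above.

```python
import pandas as pd

# Hypothetical summary table: one row per feature
summary = pd.DataFrame(
    {"Max Proportion": [0.999, 0.60, 0.50], "Variance": [0.5, 0.0001, 2.0]},
    index=["feat_a", "feat_b", "feat_c"],
)

MAX_PROPORTION_THRESHOLD = 0.99
VARIANCE_THRESHOLD = 0.001

# Each comparison is wrapped in parentheses because `&` binds tighter
# than `<` and `>` in Python
keep_mask = (summary["Max Proportion"] < MAX_PROPORTION_THRESHOLD) & (
    summary["Variance"] > VARIANCE_THRESHOLD
)
kept = summary.index[keep_mask].tolist()
print(kept)   # feat_a fails the proportion test, feat_b fails the variance test
```

A feature is kept only when it passes both tests, so a single dominant value or a near-zero variance is enough to drop it.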

In [None]:
# Filter out features with high "max proportion" or low "variance"
disp_and_var_T = disp_and_var.T   # Take the transpose
# Both conditions need their own parentheses because `&` binds tighter than `>`
features_remained = disp_and_var_T[(disp_and_var_T['Max Proportion'] < 0.99) & (disp_and_var_T['Variance'] > 0.001)].index.tolist()
X_train = X_train.loc[:, features_remained]
print(f"After filtering, there are {len(features_remained)} numeric features remained.")

# Plot bivariate distributions 
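features_picked = features_remained[-5:]   # The last five remaining features serve as the sample
df_train = X_train.loc[:, features_picked].copy()   # Copy so that adding a column doesn't touch X_train
df_train['gt'] = y_train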
fig = px.scatter_matrix(df_train, 
                        dimensions=features_picked,
                        color="gt", 
                        symbol="gt")
fig.update_traces(diagonal_visible=False)
fig.update_layout(height=1200, title_text="Bivariate Distribution of Numeric Feature Pairs (Randomly Picked)") 
fig.show()

## *2.4 Groundtruth Distribution*
In the final part of the simple EDA, let's take a look at the distribution of the groundtruths to see if there's a problem of **Class Imbalance**.
<div class="alert alert-blocks alert-info" style="font-size: 15px;">
    <h4>Reference</h4>
    <p><a href="https://machinelearningmastery.com/what-is-imbalanced-classification/" style="color: orange;">A Gentle Introduction to Imbalanced Classification</a></p>
</div>
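
Besides the pie chart, the imbalance can be summarized with a couple of numbers. The sketch below uses a hypothetical label series; the label names and counts are made up for illustration and are not the dataset's actual distribution.

```python
import pandas as pd

# Hypothetical labels standing in for y_train
labels = pd.Series(["normal"] * 6 + ["anomaly"] * 4, name="class")

# Proportion of each class
proportions = labels.value_counts(normalize=True)

# Ratio between the largest and the smallest class; 1.0 means perfectly balanced
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests that plain accuracy-style metrics are not badly distorted, while a large ratio is a hint to look at per-class precision and recall.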

In [None]:
class_count = y_train.value_counts()   # Series.value_counts keeps plain class labels as the index
fig = go.Figure()
fig.add_trace(go.Pie(
    labels=class_count.index,
    values=class_count
))
fig.update_traces(textposition='inside') 
fig.update_layout(
    title="Pie Chart of Groundtruths",
    uniformtext_minsize=12, 
    uniformtext_mode='hide'
)
fig.show()

# 3. KFold Cross Validation with RandomForestClassifier (RFC)
As mentioned above, KFold cross validation is used to measure the **generalization ability** of the model. In this part, I'll take `RandomForestClassifier` as the baseline model.
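
For comparison, scikit-learn's `cross_validate` can run the same kind of evaluation in a few lines. This is a compact sketch on synthetic data from `make_classification`, not the notebook's actual pipeline; the sample count, feature count and `n_estimators=50` are arbitrary choices to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification data (stand-in for the real features)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ["precision_macro", "recall_macro", "f1_macro"]

# One fit per fold; each requested metric is stored under the key "test_<name>"
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_toy,
    y_toy,
    cv=kf,
    scoring=metric_names,
)

for name in metric_names:
    print(f"{name}: {scores[f'test_{name}'].mean():.3f}")
```

The manual loop used in this notebook has one advantage over this shortcut: it keeps every fitted model and its feature importances for later steps. With `cross_validate`, the fitted estimators are only returned when `return_estimator=True` is passed.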

In [None]:
# Encode the labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)

# Evaluate the model performance using KFold CV
kf = KFold(5, shuffle=True, random_state=42)
models = []   # Trained model record
fi = []   # Feature importance record
val_metrics = []   # Evaluation metrics record
fold = 0

for train_idx, val_idx in kf.split(X_train):
    print(f"=====Evaluation of fold{fold} starts=====")
    # Prepare the training and validation sets
    X_train_, X_val = X_train.iloc[train_idx, :], X_train.iloc[val_idx, :]
    y_train_, y_val = y_train[train_idx], y_train[val_idx]
    
    # Train the classifier (rfc)
    rfc = RandomForestClassifier(n_estimators=500, random_state=42)   # Fixed seed for reproducible results
    rfc.fit(X_train_, y_train_)
    models.append(rfc)    # Record the trained model
    fi.append(rfc.feature_importances_)   # Record the feature importance
    
    # Predict and evaluate the performance
    y_val_pred = rfc.predict(X_val)
    p_r_f1_mac = list(precision_recall_fscore_support(y_val, y_val_pred, average='macro')[:3])
    p_r_f1_mic = list(precision_recall_fscore_support(y_val, y_val_pred, average='micro')[:3])
    p_r_f1_wei = list(precision_recall_fscore_support(y_val, y_val_pred, average='weighted')[:3])
    val_metrics.append([p_r_f1_mac, p_r_f1_mic, p_r_f1_wei])   # Concatenate the evaluation metrics and record
    print(f"=====Classification Report=====\n{classification_report(y_val, y_val_pred)}")
    
    print(f"=====Evaluation of fold{fold} finishes=====\n")
    fold += 1

In [None]:
# Summarize the average performance in KFold CV
avg_metrics = np.mean(val_metrics, axis=0)
print("=====Average evaluation metrics over 5 folds=====")
for i, method in enumerate(['Macro', 'Micro', 'Weighted']):
    print(f"=====Metrics {method}=====")
    print(f"Precision = {avg_metrics[i][0]} | Recall = {avg_metrics[i][1]} | F1-score = {avg_metrics[i][2]}")

# 4. Feature Importance  
After training the model, we can inspect the **feature importance** of each feature to see which features dominate the decision making process of the random forest.
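
The averaging and ranking step can be sketched with a `pd.Series`. The feature names and importance values below are hypothetical, standing in for the per-fold `feature_importances_` arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical importances from two folds for three features
feature_names = ["feat_a", "feat_b", "feat_c"]
fold_importances = [
    np.array([0.2, 0.5, 0.3]),
    np.array([0.4, 0.5, 0.1]),
]

# Average across folds, then rank from most to least important
avg_importance = pd.Series(
    np.mean(fold_importances, axis=0), index=feature_names
).sort_values(ascending=False)

print(avg_importance)
```

Averaging over folds smooths out the fold-to-fold noise in a single forest's importances, which makes the ranking a little more trustworthy.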

In [None]:
# Sort feature importance
avg_fi = np.mean(fi, axis=0)   # Calculate average feature importance over 5 folds
fi_dict = {}
for feature, feature_importance in zip(features_remained, avg_fi):
    fi_dict[feature] = feature_importance
fi_dict = dict(sorted(fi_dict.items(), key=lambda item: item[1], reverse=True))

# Plot feature importance
fig = go.Figure([go.Bar(x=list(fi_dict.keys()), y=list(fi_dict.values()))])
fig.update_layout(title="Feature Importance")
fig.show()

# 5. Evaluation on Unseen Data (Testing Set)
Finally, I'll evaluate the model on the unseen data (i.e. the testing set) to see how well the classifier performs. The models trained in the five folds are combined into a simple ensemble, in the spirit of **bagging**. In other words, a **majority voting mechanism** over their predictions is used to make the final prediction more stable.
<div class="alert alert-blocks alert-info" style="font-size: 15px;">
    <h4>Reference</h4>
    <p><a href="https://stackabuse.com/ensemble-voting-classification-in-python-with-scikit-learn" style="color: orange;">Ensemble/Voting Classification in Python with Scikit-Learn</a></p>
</div>
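
The voting step itself is small enough to show on its own. Below is a minimal sketch with made-up 0/1 predictions from five hypothetical models on four samples.

```python
import numpy as np

# Hypothetical predictions: one row per model, one column per sample
fold_predictions = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
])

# Fraction of models voting for class 1 on each sample
vote_share = fold_predictions.mean(axis=0)

# A sample is labeled 1 when at least half of the models say so
voted = (vote_share >= 0.5).astype(int)
print(voted)   # [1 0 1 0]
```

With an odd number of models and binary labels, a tie is impossible, so the `>= 0.5` threshold never has to break one.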

In [None]:
# Process the data to meet the model input 
X_test = X_test.loc[:, features_remained]
y_test = label_encoder.transform(y_test)

# Do inference using trained model from each fold 
y_test_preds = []
for rfc in models:
    y_test_pred = rfc.predict(X_test)
    y_test_preds.append(y_test_pred)

# Take majority voting
y_test_pred_voted = np.where(
    np.mean(y_test_preds, axis=0) >= 0.5, 
    1, 
    0
)

# Summarize the performance evaluated on testing set
print("=====Evaluation metrics on testing set=====")
for i, method in enumerate(['Macro', 'Micro', 'Weighted']):
    print(f"=====Metrics {method}=====")
    precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_test_pred_voted, average=method.lower())
    print(f"Precision = {precision} | Recall = {recall} | F1-score = {f1_score}")

# 6. Future Work
This is just a simple ML workflow from **exploratory data analysis** to final **model evaluation**. There are still many things to try:
* Do more **exploratory data analysis** to observe the feature interactions.
* Add the **categorical features** as predictors.
* Implement a more advanced **feature selection pipeline** to select relevant features.
* Try different **classifiers** and compare the models.

<div class="alert alert-blocks alert-info" style="font-size: 15px; text-align: center;">
    <h4>That's all! Hope this would help!</h4>
    <h4>Thanks for your attention!</h4>
</div>