# Introduction
This notebook is based on a project that used the CIC Darknet Traffic dataset.  One goal of the project was to explore which models (classifiers) could be trained to predict the purpose of darknet traffic from a set of features and from subsets of those features.  The dataset was modified from the version available on the UNB/CIC website (https://www.unb.ca/cic/datasets/darknet2020.html).  The changes are discussed in the "Changes to the Original Dataset" section below.

## Goals of this Notebook
1. Show some feature selection methods that can be used to identify the best features to use when attempting to determine the *Purpose* of the internet traffic
1. Show how various types of models/classifiers can be used to see which is the most accurate at predicting the *Purpose* of the internet traffic

## Changes to the Original Dataset

### Column Names
The original name of the target column was *Label*, which was also the name of a feature column.  Both columns were renamed as follows:
* The feature column was renamed to *Conn Type* since it defines what type of connection was used to transfer the data (e.g. VPN, non VPN, etc.)
* The target column was renamed to *Purpose* because it defines what the purpose of the internet traffic was (e.g. email, audio streaming, etc.)
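
A minimal sketch of how such a rename could be done with pandas, assuming the raw CSV has two columns that are both named *Label*. When pandas reads duplicate headers, it renames the second one to `Label.1`. The sample rows and the `Flow Duration` column below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample with a duplicated "Label" header
raw_csv = io.StringIO(
    "Flow Duration,Label,Label\n"
    "120,VPN,Email\n"
    "340,Tor,Audio-Streaming\n"
)

# pandas de-duplicates repeated headers: the second "Label" becomes "Label.1"
sample = pd.read_csv(raw_csv)

# Give both columns descriptive names
sample = sample.rename(columns={"Label": "Conn Type", "Label.1": "Purpose"})
print(sample.columns.tolist())
```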

### Column Values
* "Timestamp" - all values in the "Timestamp" column have been converted to seconds since the epoch
* "Src IP" and "Dst IP" - all IP addresses have been converted to integers representing a country code for the country associated with that IP.  If that IP was not designated for any one country, then the value was 0.
* "Conn Type" - all values were converted to integers
* "Purpose" - all names were replaced with integers
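
Two of these conversions can be sketched with pandas as follows. This is an illustration, not the exact preprocessing that produced the dataset: the timestamp format, the UTC assumption, and the sample values are all hypothetical. The IP-to-country mapping is omitted because it needs an external geolocation table.

```python
import pandas as pd

# Hypothetical sample rows; the timestamp format is an assumption
sample = pd.DataFrame({
    "Timestamp": ["2020-01-01 00:00:00", "2020-01-01 00:01:00"],
    "Purpose": ["Email", "Audio-Streaming"],
})

# Timestamp -> whole seconds since the Unix epoch (values treated as UTC)
ts = pd.to_datetime(sample["Timestamp"], utc=True)
epoch = pd.Timestamp("1970-01-01", tz="UTC")
sample["Timestamp"] = (ts - epoch) // pd.Timedelta(seconds=1)

# Purpose names -> integer codes (codes follow the sorted order of the names)
sample["Purpose"] = sample["Purpose"].astype("category").cat.codes

print(sample)
```

With category codes, the integer assigned to each name depends on the sorted order of the names present, so the mapping should be saved if the original names are needed later.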


In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import IsolationForest, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, auc, confusion_matrix, f1_score, mean_squared_error, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve
from sklearn.multiclass import OneVsRestClassifier, OutputCodeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

### CONSTANTS
SEED = 123
NUM_FEATURES = 5
TRAIN_PCT = 0.75

MAX_DEPTH = 4
MAX_ITER = 300
N_NEIGHBORS = 5


In [None]:
traffic = pd.read_csv("/kaggle/input/cicdarknet2020-numeric/DarkNetTraffic.csv")
print(f'Number of Rows: {traffic.shape[0]}')
print(f'Number of Columns: {traffic.shape[1]}')
traffic.head()

In [None]:
# Remove duplicate entries
traffic.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=False)

# Remove constant columns
traffic = traffic.loc[:, traffic.nunique() != 1]

# Look at the dataset again
print(f'Number of Rows: {traffic.shape[0]}')
print(f'Number of Columns: {traffic.shape[1]}')
traffic.head()

# Feature Selection
To choose the best features, we take the top N features (by importance score) from each of several classifiers and combine the results into one set of features.  The selection function also takes the train percentage used to split the data when training these models.

### Why were those model settings chosen?
These are only a few of the many classifiers available.  Each classifier also has many settings, and some parameters are similar or even identical across classifiers.  Adding or changing classifier settings *could* affect which features are selected.
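
The aggregation idea can be sketched on its own with made-up importance scores. The model names, feature names, and scores below are hypothetical; the point is only that taking the top N per model and then the union can yield more than N features.

```python
import pandas as pd

# Hypothetical importance scores from two models for four features
scores = pd.DataFrame({
    "features": ["f1", "f2", "f3", "f4"],
    "ModelA": [0.50, 0.30, 0.15, 0.05],
    "ModelB": [0.10, 0.45, 0.05, 0.40],
})

top_n = 2
selected = set()
for model_name in ["ModelA", "ModelB"]:
    # Keep the top_n rows by this model's score and collect their feature names
    selected.update(scores.nlargest(top_n, model_name)["features"])

print(sorted(selected))
```

Here both models rank `f2` in their top two, so the union holds three features rather than four.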

In [None]:
def selectFeatures(x, y, train_size_pct=0.75):
    """
    selectFeatures
        x : The features of the dataset to be used for predictions
        y : The target class for each row in "x"
        train_size_pct : (default = 0.75) In the range (0.0, 1.0), the ratio by which to split the data for training and testing
        @return (list) The names of the selected features
    """

    # Create classifiers
    rf = RandomForestClassifier(max_depth=MAX_DEPTH, criterion='entropy', random_state=SEED)
    et = ExtraTreesClassifier(max_depth=MAX_DEPTH, criterion='entropy', random_state=SEED)
    dectree = DecisionTreeClassifier(max_depth=MAX_DEPTH, random_state=SEED)

    classifier_mapping = {
        "RandomForest" : rf,
        "ExtraTrees" : et,
        "DecisionTree" : dectree
    }

    ### Split the dataset
    # random_state makes the split repeatable across runs
    X_train_fs, X_test_fs, Y_train_fs, Y_test_fs = train_test_split(x, y, train_size=train_size_pct, random_state=SEED)

    model_features = {}

    for model_name, model in classifier_mapping.items():
        print(f'[Training] {model_name}')
        start_train = datetime.now()
        model.fit(X_train_fs, Y_train_fs)
        print(">>> Training Time: {}".format(datetime.now() - start_train))
        model_features[model_name] = model.feature_importances_
        model_score = model.score(X_test_fs, Y_test_fs)
        # The score above is computed on the held-out test split
        print(f'>>> Test Accuracy : {model_score*100.0}')
        print("")

    cols = X_train_fs.columns.values
    feature_df = pd.DataFrame({'features': cols})
    for model_name, model in classifier_mapping.items():
        feature_df[model_name] = model_features[model_name]

    ### Grab the nlargest features (by score) from each ensemble group
    all_f = []
    for model_name, model in classifier_mapping.items():
        try:
            all_f.append(feature_df.nlargest(NUM_FEATURES, model_name))
        except KeyError as e:
            print(f'*** Failed to add features for {model_name} : {e}')

    # Concatenate the top-N feature names from every model into one list
    result = []
    for top_features in all_f:
        result.extend(top_features['features'].to_list())

    # selected_features holds the combined result; set() drops duplicate names
    selected_features = list(set(result))

    return selected_features

# Train/Test with Models
Now that we have the dataset versions based on various feature sets, we can start evaluating the models.  However, we should first choose a metric by which to compare the models.  For this project, the three primary metrics were *accuracy*, *precision*, and *recall*.  There were also two additional metrics, *mean-squared error* and *F1 score*.

Several functions were defined to help ease the training and testing of the models.

## [FUNCTION] calculateMetrics
The **calculateMetrics** function is a helper function that will print out all the metrics we are interested in and some extra ones too.
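
For multi-class targets, scikit-learn needs an `average` setting for precision, recall, and F1. The toy labels below (made up for illustration) show how `"macro"` (every class counts equally) and `"weighted"` (classes count in proportion to their support) can give different numbers on imbalanced data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: class 0 has four true samples, class 1 has two
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

# Macro: unweighted mean of the per-class scores
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted: per-class scores averaged by the number of true samples per class
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"macro recall    : {macro_recall:.4f}")
print(f"macro precision : {macro_precision:.4f}")
print(f"macro F1        : {macro_f1:.4f}")
print(f"weighted F1     : {weighted_f1:.4f}")
```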

In [None]:
def calculateMetrics(y_test, y_pred):
    acc = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average="macro")
    precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
    mse = mean_squared_error(y_test, y_pred)
    # f1_score expects the true labels first, then the predictions
    f1score = f1_score(y_test, y_pred, average='weighted')
    print(">>> Metrics")
    print(f'- Accuracy  : {acc}')
    print(f'- Recall    : {recall}')
    print(f'- Precision : {precision}')
    print(f'- MSE       : {mse}')
    print(f'- F1 Score  : {f1score}')

    return [round(acc, 6), round(recall, 6), round(precision, 6), round(mse, 6), round(f1score, 6)]

## [FUNCTION] train_test_model
The process for evaluating every classifier shares the same general flow: train, test, analyze.  The **train_test_model** function performs these three steps for the model and dataset given to it.  It splits the dataset according to the specified train percentage, trains the model, scores the model's training accuracy, makes predictions, then grades the model's predictions.
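
The train, test, analyze flow can be sketched on synthetic data as follows. Passing `random_state` and `stratify` to `train_test_split` is a choice made for this sketch: the first makes the split repeatable, and the second keeps class proportions similar in both parts. The data and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 rows, 4 features, label decided by the sign of feature 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Split: 75% train, 25% test, repeatable and stratified by class
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=123, stratify=y_demo
)

# Train, then score on both parts to compare training and test accuracy
clf = DecisionTreeClassifier(max_depth=4, random_state=123).fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two accuracies is a common sign of overfitting, which is why both are worth printing.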

In [None]:
def train_test_model(model_name, model, x, y, train_size_pct):

    # Split the data
    # random_state gives every model the same repeatable split
    X_train, X_test, Y_train, Y_test = train_test_split(x, y, train_size=train_size_pct, random_state=SEED)

    # Training
    print(f'\n[Training] {model_name}')
    start_train = datetime.now()
    model.fit(X_train, Y_train)
    print(f'>>> Training time: {datetime.now() - start_train}')

    ### Analyze Training
    train_acc = model.score(X_train, Y_train)
    print(f'>>> Training accuracy: {train_acc}')

    ### Testing
    start_predict = datetime.now()
    y_pred = model.predict(X_test)
    print(f'>>> Testing time: {datetime.now() - start_predict}')

    ### Analyze Testing
    calculateMetrics(Y_test, y_pred)


## [FUNCTION] evaluateIndividualClassifiers
The **evaluateIndividualClassifiers** function evaluates the dataset with several individual classifiers, using a few variations of each classifier's main setting.
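
One possible extension, sketched here on synthetic data, is to collect each model's score in a table so the classifiers can be compared side by side. The data, the two placeholder models, and the `results` frame are assumptions made for this sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the traffic data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=123
)

# Placeholder models keyed by a display name
models = {
    "DecisionTree-4": DecisionTreeClassifier(max_depth=4, random_state=123),
    "KNeighbors-5": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and record its test accuracy as one row
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({"model": name, "accuracy": accuracy_score(y_te, model.predict(X_te))})

results = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
print(results)
```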

In [None]:
def evaluateIndividualClassifiers(x, y, train_size_pct):
    """
    evaluateIndividualClassifiers
        x : The features of the dataset to be used for predictions
        y : The target class for each row in "x"
        train_size_pct : {float in the range(0.0, 1.0)} the percentage of the dataset that should be used for training
    """

    max_depth_x2 = MAX_DEPTH * 2
    max_iter_x2 = MAX_ITER * 2
    n_neighbors_x2 = N_NEIGHBORS * 2
    n_neighbors_d2 = N_NEIGHBORS // 2

    rf = RandomForestClassifier(max_depth=MAX_DEPTH, criterion='entropy', random_state=SEED)
    rf_x2 = RandomForestClassifier(max_depth=max_depth_x2, criterion='entropy', random_state=SEED)
    et = ExtraTreesClassifier(max_depth=MAX_DEPTH, criterion='entropy', random_state=SEED)
    dectree = DecisionTreeClassifier(max_depth=MAX_DEPTH, random_state=SEED)
    knn = KNeighborsClassifier(n_neighbors=N_NEIGHBORS)
    knn_x2 = KNeighborsClassifier(n_neighbors=n_neighbors_x2)
    knn_d2 = KNeighborsClassifier(n_neighbors=n_neighbors_d2)
    mlpnn = MLPClassifier(max_iter=MAX_ITER)
    mlpnnE = MLPClassifier(max_iter=MAX_ITER, early_stopping=True)
    mlpnn_x2 = MLPClassifier(max_iter=max_iter_x2)
    mlpnnE_x2 = MLPClassifier(max_iter=max_iter_x2, early_stopping=True)

    classifier_mapping = {
        f'RandomForest-{MAX_DEPTH}' : rf,
        f'RandomForest-{max_depth_x2}' : rf_x2,
        f'ExtraTrees-{MAX_DEPTH}' : et,
        f'DecisionTree-{MAX_DEPTH}' : dectree,
        f'KNeighbors-{N_NEIGHBORS}' : knn,
        f'KNeighbors-{n_neighbors_x2}' : knn_x2,
        f'KNeighbors-{n_neighbors_d2}' : knn_d2,
        f'MLP-{MAX_ITER}' : mlpnn,
        f'MLP-{MAX_ITER}-early' : mlpnnE,
        f'MLP-{max_iter_x2}' : mlpnn_x2,
        f'MLP-{max_iter_x2}-early' : mlpnnE_x2,
    }

    for model_name, model in classifier_mapping.items():

        train_test_model(model_name, model, x, y, train_size_pct)


# Feature Selection & Evaluation
We will examine four different feature sets:
1. All Features
1. Selected Features (from "All Features")
1. All Features, except *Src Port* and *Dst Port*
1. Selected Features (from "All Features, except *Src Port* and *Dst Port*")

We use "all features" and "selected features" to compare the time and accuracy of the two approaches.  We also remove the ports because some literature indicates that ports often become important features.  It will be interesting to see which features are selected when the ports are and are not available, and to compare the performance of the models.

After we select which features will be used, we will train and test several individual models (as opposed to ensemble models) and compare the results.
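
The four feature sets can be organized as in the sketch below. The frame is a tiny made-up example: only the `Src Port` and `Dst Port` names come from the dataset, and the "selected" column lists are placeholders for whatever the selection step returns.

```python
import pandas as pd

# Tiny made-up frame standing in for the feature columns
demo = pd.DataFrame({
    "Src Port": [443, 80, 53],
    "Dst Port": [51000, 51001, 51002],
    "Flow Duration": [120, 340, 15],
    "Total Fwd Packet": [10, 4, 1],
})

# Drop both port columns in one call
no_ports = demo.drop(columns=["Src Port", "Dst Port"])

feature_sets = {
    "all": demo,
    "selected (all)": demo[["Src Port", "Flow Duration"]],  # placeholder selection
    "no ports": no_ports,
    "selected (no ports)": no_ports[["Flow Duration"]],     # placeholder selection
}

for name, frame in feature_sets.items():
    print(f"{name}: {list(frame.columns)}")
```

Keeping the sets in one dictionary makes it easy to run the same evaluation over each of them in a loop.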

In [None]:
# NOTE: TRAIN_PCT is defined as a constant, but we could also create a list and loop through various percentages

# (1) SELECT | ALL features
X = traffic.iloc[:, 0:(traffic.shape[1]-1)]
Y = traffic.iloc[:, -1]

print(f'[*] Beginning evaluations: All Features')
evaluateIndividualClassifiers(X, Y, TRAIN_PCT)  

In [None]:
# NOTE: TRAIN_PCT is defined as a constant, but we could also create a list and loop through various percentages

# (2) SELECT | choose from ALL features
selected_features = selectFeatures(X, Y)
print(f'Selected Features "from All": {selected_features}')
Xse_all = X[selected_features]

print(f'[*] Beginning evaluations: Selected Features (from "All Features")')
evaluateIndividualClassifiers(Xse_all, Y, TRAIN_PCT)

In [None]:
# NOTE: TRAIN_PCT is defined as a constant, but we could also create a list and loop through various percentages

# (3) SELECT | ALL features except 'Src Port' and 'Dst Port'
X_noPorts = X.drop(columns=['Src Port', 'Dst Port'])

print(f'[*] Beginning evaluations: All Features, except Src Port and Dst Port')
evaluateIndividualClassifiers(X_noPorts, Y, TRAIN_PCT)

In [None]:
# NOTE: TRAIN_PCT is defined as a constant, but we could also create a list and loop through various percentages

# (4) SELECT | choose from ALL features except 'Src Port' and 'Dst Port'
selected_features = selectFeatures(X_noPorts, Y)
print(f'Selected Features "from All w/o Ports": {selected_features}')
Xse_noPorts = X_noPorts[selected_features]

print(f'[*] Beginning evaluations: Selected Features (from "All Features, except Src Port and Dst Port")')
evaluateIndividualClassifiers(Xse_noPorts, Y, TRAIN_PCT)