# Bonus Task - Non-Sequential Models

In [1]:
import pandas as pd

from preprocessing import *

import glob

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.ensemble import IsolationForest

from sklearn.model_selection import KFold

import numpy as np

In [17]:
from sklearn.metrics import precision_score, recall_score

In [2]:
csvs = []
for i in range(9, 13):
    csvs.extend(glob.glob(f"data/CTU-13-Dataset/{i}/*.binetflow"))

In [3]:
csvs

['data/CTU-13-Dataset/9/capture20110817.binetflow',
 'data/CTU-13-Dataset/10/capture20110818.binetflow',
 'data/CTU-13-Dataset/11/capture20110818-2.binetflow',
 'data/CTU-13-Dataset/12/capture20110819.binetflow']

In [4]:
dfs = []
for csv in csvs:
    dfs.append(pd.read_csv(csv))

In [5]:
for i, df in enumerate(dfs):
    # encode labels: botnet=1, background=2, normal=0
    dfs[i] = encode_labels(df)
    
    # combine background and normal flow labels
    dfs[i].loc[dfs[i]["Label"] == 2, "Label"] = 0
    
    # drop StartTime
    dfs[i].drop(columns=["StartTime"], inplace=True)
    
    # numerically encode features
    dfs[i] = encode_features(df)

## Class Imbalance

In [6]:
for df in dfs:
    print(df["Label"].value_counts()/df.shape[0])

0    0.911384
1    0.088616
Name: Label, dtype: float64
0    0.918802
1    0.081198
Name: Label, dtype: float64
0    0.923879
1    0.076121
Name: Label, dtype: float64
0    0.993339
1    0.006661
Name: Label, dtype: float64


In [7]:
dfs[0]["Dur"].describe()

count    2.087508e+06
mean     2.945965e+02
std      8.375559e+02
min      0.000000e+00
25%      3.200000e-04
50%      9.890000e-04
75%      5.064933e+00
max      3.600080e+03
Name: Dur, dtype: float64

## Anomaly Detection using Isolation Forests

We would like to model the problem of detecting botnets as a anomaly detection task. We believe that this is a sound way to proceed due to the fact that the class imbalance is quite stark as we have seen above. Also, in a realistic scenario, we will not have labelled bot traffic on which to train our models, but what we can do is capture non-malicious traffic in a controlled setting where all nodes are known. This allows us to model non-malicious traffic and then isolate traffic that falls outside this distribution. This is the kind of setup that anomaly detection methods like Isolation Forests have.

We want to model the data in two ways:
1. Model non-malicious netflows:
   - **Train Set:** A random selection of normal and background netflows from the scenarios 9, 10, 11
   - **Test Set:** A random selection of netflows from scenarios 9, 10, 11, 12 (including normal, background and botnet)
   - We train the model to fit the distribution of non-malicious netflows
   - We see the performance of this model on a dataset containing both non-malicious and malicious netflows.
   - The hope is that the model can isolate those flows that do not correspond to the non-malicious distribution
   - The test set also contains scenario 12 which is not seen during training. This will test the ability of a model to generalize to new scenarios
2. Model non-malicious hosts:

   - Did not have the time to do this

In [8]:
normal_dfs = []
bot_dfs = []
for df in dfs[:-1]:
    normal_dfs.append(df.loc[df["Label"] == 0])
    bot_dfs.append(df.loc[df["Label"] == 1])

In [9]:
bot_test_df = dfs[-1].loc[dfs[-1]["Label"] == 1]

In [10]:
bot_test_flows = bot_test_df.sample(frac=1).reset_index(drop=True).to_numpy()

In [11]:
normal_flows = pd.concat(normal_dfs).sample(frac=1).reset_index(drop=True).to_numpy()
bot_flows = pd.concat(bot_dfs).sample(frac=1).reset_index(drop=True).to_numpy()

In [12]:
from sklearn.preprocessing import StandardScaler

In [13]:
scaler = StandardScaler()
scaler.fit(normal_flows)
normal_flows = scaler.transform(normal_flows)
bot_flows = scaler.transform(bot_flows)
bot_test_flows = scaler.transform(bot_test_flows)

In [14]:
kfold = KFold(n_splits=5, shuffle=True)

In [15]:
normal_folds = kfold.split(normal_flows)
bot_folds = kfold.split(bot_flows)

## The below cell may take a while (2-3 minutes per fold)

In [18]:
# y_preds = []
# y_valids = []
for ((train, norm_valid,), (_, bot_valid)) in zip(normal_folds, bot_folds):
    X_train = normal_flows[train]
    X_valid = np.vstack((normal_flows[norm_valid], bot_flows[bot_valid]))
    y_valid = np.concatenate((np.zeros((normal_flows[norm_valid].shape[0],)), np.ones((bot_flows[bot_valid].shape[0],))))

    clf = IsolationForest(max_samples=100, random_state=1337, )
    clf.fit(X_train)
    
    y_pred_valid = clf.predict(X_valid)
    y_pred_valid[y_pred_valid == 1] = 0
    y_pred_valid[y_pred_valid == -1] = 1
    
    y_pred_test = clf.predict(bot_test_flows)
    y_pred_test[y_pred_test == 1] = 0
    y_pred_test[y_pred_test == -1] = 1
    y_test = np.ones((bot_test_flows.shape[0], ))
    print("Precision and Recall for scenario 9, 10, 11")
    print(precision_score(y_valid, y_pred_valid))
    print(recall_score(y_valid, y_pred_valid))
    print("Recall for unseen botnet flows in scenario 12")
    print(recall_score(y_valid, y_pred_valid))
    print("------------------------------------\n Next fold")

Precision and Recall for scenario 9, 10, 11
0.20280184897045964
0.5962003973222484
Recall for unseen botnet flows in scenario 12
0.5962003973222484
Precision and Recall for scenario 9, 10, 11
0.20314212006145105
0.5849818867798534
Recall for unseen botnet flows in scenario 12
0.5849818867798534
Precision and Recall for scenario 9, 10, 11
0.20093424000804608
0.6003505843071786
Recall for unseen botnet flows in scenario 12
0.6003505843071786
Precision and Recall for scenario 9, 10, 11
0.21069923534908871
0.5984808013355593
Recall for unseen botnet flows in scenario 12
0.5984808013355593


The sequential vs non-sequential methods is an interesting experiment.

Let us first look at the results. The non-sequential models do quite poorly considering that they had all the data from several scenarios to learn from. This is because they do not leverage the power of sequential data

We have see that the sequence-based models show promising results given that the ngram flows we analysed here were discretized using a combination of just two features. Sequential models leverage online learning based on a stream netflows. In the sketches and hash based approaches, we can approximate the flow profiles in reasonable time and without considerable memory overhead. Additionally we should not overlook that fact that in our profiling approach, we try to learn from the botnet flow of one host to classify another type; which is a difficult setting given the data. 

In conclusion, given a scenario with limited data, we do not expect the non-sequential model to perform well enough. Though in our experiment, we use all data available, in the limited setting it will not do so well. Furthermore, as suggested earlier, if the scenario is such that we must tackle botnets in real time as the stream of data flows in, non-sequential methods are quite useless as they are built on assumptions of IID. This is reasonably well tackled by using sequential approaches.