# ERSP 22-23 Michael Yang

This is a notebook I wrote that annotates the training process of a very simple, low parameter feedforward neural network on the [CIC-IDS2017 Intrusion Detection Evaluation Dataset](https://www.unb.ca/cic/datasets/ids-2017.html).

It also uses the TRUSTEE framework published by Jacobs, et al. in the paper [AI/ML for Network Security: The Emperor has no Clothes](https://sites.cs.ucsb.edu/~arpitgupta/pdfs/trustee.pdf) to evaluate the neural network.

My research was under the Systems and Networking Lab (SNL) at UCSB, headed by Arpit Gupta. My mentor was Roman Beltiukov. All research was conducted as part of the Early Research Scholars Program run by the UCSB CS department, which I highly recommend!

For context on this project, head to [my website](https://whugimy.github.io/projects/ersp/)!

First, we import all libraries and set a random seed for reproducibility. We can see that this notebook was most recently run on a CPU, not a GPU. The model has intentionally been kept simple and small for this reason.

In [1]:
import os
import random
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm_notebook
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

import torch

SEED=190
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



cpu


Next we take a look at the features in our balanced dataset. The original published full dataset is relatively skewed towards normal network traffic. We have created a balanced subsample containing 50% normal points and 50% intrusions. The .csv file has not been included in the repository due to size constraints; the file was approximately 10GB.
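
The code that built `balanced_set.csv` is not part of this notebook. As a rough sketch of one way such a 50/50 subsample could be drawn, assuming a `Label` column where 0 means benign (the toy frame, the column `f`, and the downsampling strategy are illustrative assumptions, not the procedure actually used):

```python
import pandas as pd

# Toy, heavily skewed frame: 90 benign rows (label 0) and 10 attack rows.
toy = pd.DataFrame({'Label': [0] * 90 + [1] * 6 + [2] * 4, 'f': range(100)})

# Keep every attack row and downsample benign rows to the same count.
attacks = toy[toy['Label'] != 0]
benign = toy[toy['Label'] == 0].sample(n=len(attacks), random_state=0)

# Concatenate and shuffle so the classes are interleaved.
balanced = (
    pd.concat([benign, attacks])
    .sample(frac=1.0, random_state=0)
    .reset_index(drop=True)
)

benign_share = (balanced['Label'] == 0).mean()
print(len(balanced), benign_share)
```

Downsampling only the benign class keeps every rare attack row, which matters when some attack types have fewer than a hundred examples.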

In [2]:
df = pd.read_csv('balanced_set.csv')
print(df.columns)

Index(['Protocol', 'Flow Duration', 'Tot Fwd Pkts', 'Tot Bwd Pkts',
       'TotLen Fwd Pkts', 'TotLen Bwd Pkts', 'Fwd Pkt Len Max',
       'Fwd Pkt Len Min', 'Fwd Pkt Len Mean', 'Fwd Pkt Len Std',
       'Bwd Pkt Len Max', 'Bwd Pkt Len Min', 'Bwd Pkt Len Mean',
       'Bwd Pkt Len Std', 'Flow Byts/s', 'Flow Pkts/s', 'Flow IAT Mean',
       'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Tot',
       'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min',
       'Bwd IAT Tot', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max',
       'Bwd IAT Min', 'Fwd PSH Flags', 'Bwd PSH Flags', 'Fwd URG Flags',
       'Bwd URG Flags', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s',
       'Bwd Pkts/s', 'Pkt Len Min', 'Pkt Len Max', 'Pkt Len Mean',
       'Pkt Len Std', 'Pkt Len Var', 'FIN Flag Cnt', 'SYN Flag Cnt',
       'RST Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', 'URG Flag Cnt',
       'CWE Flag Count', 'ECE Flag Cnt', 'Down/Up Ratio', 'Pkt Size Avg',
       'Fwd Seg Size Avg', 'Bwd

We now prepare our data for the model. We first normalize our data to prevent gradient explosions. We use z-score normalization.

We now split our pandas dataframe into two sets of separate `X` and `Y` dataframes for training and testing later. We also drop known problematic or unnecessary features from `X`. I have very little networking expertise; these features were suggested by Roman Beltiukov.

We finally output the shape of our dataframes. We've kept 68 features and randomly taken 10% of our dataset to use as a testing set. This random subsample was done with `sklearn`'s `train_test_split` function.

In [3]:
X = df.drop('Label', axis=1)

X = (X-X.mean())/X.std()
Y = df['Label']

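# List columns that became NaN after normalization (e.g. zero-variance features); the result is not stored or displayed here
X.columns[X.isna().any()].tolist()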
X = X.drop(['Protocol','Bwd PSH Flags','Bwd URG Flags','Fwd Byts/b Avg','Fwd Pkts/b Avg','Fwd Blk Rate Avg','Bwd Byts/b Avg','Bwd Pkts/b Avg','Bwd Blk Rate Avg'], axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.1)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(4944481, 68)
(549387, 68)
(4944481,)
(549387,)
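
To make the z-score step concrete, here is a minimal sketch on a toy frame (the column names `a`, `b`, and `const` are made up for illustration). It also shows that a zero-variance column turns into NaN after normalization, because its standard deviation is 0; whether that is the reason for the dropped features above is an assumption on my part.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [10.0, 20.0, 60.0],
    'const': [5.0, 5.0, 5.0],  # zero variance
})

# z-score: subtract the column mean, divide by the column standard deviation.
z = (toy - toy.mean()) / toy.std()

# A constant column gives 0 / 0, which pandas reports as NaN.
nan_cols = z.columns[z.isna().any()].tolist()
z_clean = z.drop(columns=nan_cols)

print(nan_cols)
print(z_clean)
```

Note that this sketch, like the cell above, computes the mean and standard deviation over the whole frame. Computing them on the training split only and reusing them for the test split is a common alternative that avoids leaking test-set statistics.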


We check that the remaining features are the ones we expect.

In [4]:
print(X_train.columns)

Index(['Flow Duration', 'Tot Fwd Pkts', 'Tot Bwd Pkts', 'TotLen Fwd Pkts',
       'TotLen Bwd Pkts', 'Fwd Pkt Len Max', 'Fwd Pkt Len Min',
       'Fwd Pkt Len Mean', 'Fwd Pkt Len Std', 'Bwd Pkt Len Max',
       'Bwd Pkt Len Min', 'Bwd Pkt Len Mean', 'Bwd Pkt Len Std', 'Flow Byts/s',
       'Flow Pkts/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Tot', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Tot', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Fwd URG Flags', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s',
       'Bwd Pkts/s', 'Pkt Len Min', 'Pkt Len Max', 'Pkt Len Mean',
       'Pkt Len Std', 'Pkt Len Var', 'FIN Flag Cnt', 'SYN Flag Cnt',
       'RST Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', 'URG Flag Cnt',
       'CWE Flag Count', 'ECE Flag Cnt', 'Down/Up Ratio', 'Pkt Size Avg',
       'Fwd Seg Size Avg', 'Bwd Seg Size Avg', 'Subflow Fwd Pkts',
       'Subflow F

From here we start using PyTorch to create a Dataset and Dataloader to input to our model class.

In [5]:
from torch.utils.data.dataset import Dataset

class Dataset(Dataset):
    def __init__(self, features, labels):
        self.x = torch.tensor(features.values, dtype=torch.float32)
        self.y = torch.tensor(labels.values, dtype=torch.long)
    def __getitem__(self, index):
        return self.x[index], self.y[index]
    def __len__(self):
        return len(self.x)
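
As a small illustration of how a dataset class like this feeds a `DataLoader`, here is a sketch on a toy frame. The class is renamed `FrameDataset` (a hypothetical name) so that it does not shadow the imported `Dataset` base class; the toy features and labels are made up.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class FrameDataset(Dataset):  # hypothetical name, same logic as the class above
    def __init__(self, features, labels):
        self.x = torch.tensor(features.values, dtype=torch.float32)
        self.y = torch.tensor(labels.values, dtype=torch.long)

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return len(self.x)


features = pd.DataFrame({'f1': [0.1, 0.2, 0.3, 0.4, 0.5],
                         'f2': [1.0, 0.0, 1.0, 0.0, 1.0]})
labels = pd.Series([0, 1, 0, 1, 0])

loader = DataLoader(FrameDataset(features, labels), batch_size=2, shuffle=False)

# Five rows with batch_size=2 give two full batches and one partial batch.
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(batch_shapes)
```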

We now declare our datasets with our created dataframes as inputs.

In [6]:
train_ds = Dataset(X_train, Y_train)
valid_ds = Dataset(X_test, Y_test)

We now take a quick look at a single datapoint. As we can see, the data has been normalized so that the values are all fairly close to 0.

In [7]:
train_ds[99]

(tensor([-1.5358e-02, -2.1717e-02, -2.7047e-02, -2.3709e-02, -1.5888e-02,
         -5.6697e-01, -3.4663e-01, -6.8576e-01, -5.4623e-01, -6.4991e-01,
         -3.9474e-01, -6.2277e-01, -6.0950e-01, -5.4105e-02,  7.6153e-01,
         -1.8992e-01, -2.9473e-03, -8.7435e-03, -3.0280e-03, -1.4896e-02,
         -2.0126e-01, -3.0395e-03, -8.4495e-03, -3.1752e-03, -2.3742e-01,
         -1.4929e-01, -2.1273e-01, -2.1053e-01, -6.5273e-02, -1.6745e-01,
         -1.6794e-02, -2.3241e-02, -3.0489e-02,  5.5680e-01,  9.0088e-01,
         -3.6769e-01, -6.9129e-01, -6.6822e-01, -6.8519e-01, -1.5088e-01,
         -5.4206e-02, -1.6745e-01, -5.0728e-01,  1.2159e+00, -8.5239e-01,
         -2.0755e-01, -1.6794e-02, -5.0729e-01,  6.4583e-01, -7.2154e-01,
         -6.8576e-01, -6.2277e-01, -2.1717e-02, -2.3709e-02, -2.7047e-02,
         -1.5887e-02,  8.5792e-01, -3.2385e-01, -2.0935e-02,  2.2355e+00,
         -5.7687e-02, -4.6765e-02, -6.5431e-02, -4.6213e-02, -1.4954e-02,
         -1.4701e-03, -7.4057e-03, -1.

We define a very simple FFN model class. This model has few parameters and is fairly small overall. It also does not make use of dropout to prevent overfitting. However, for the purpose of this sample, it still performs fairly well on the dataset, as we will see later.

In [8]:
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential( # mostly linear fully connected layers followed by ReLU activation function. No dropout.
            nn.Linear(68, 100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,100),
            nn.ReLU(),
            nn.Linear(100,15)
        )
    def forward(self, x):
        x = self.layers(x)
        # Return raw logits of shape (num_samples, num_classes).
        # No softmax is applied here; nn.CrossEntropyLoss applies log-softmax internally.
        return x
    def pred(self, x):
        with torch.no_grad():
            x_tensor = torch.tensor(x.values, dtype=torch.float32)
            output = self.layers(x_tensor)
            return torch.argmax(output, dim=1)
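
Since `forward` returns the output of the last linear layer without a softmax, the loss has to work on raw logits. A minimal sketch (random toy logits, not this model's outputs) showing that `nn.CrossEntropyLoss` on logits matches log-softmax followed by negative log-likelihood, and that `argmax` over logits gives the same class as `argmax` over probabilities:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 15)            # 4 samples, 15 classes
targets = torch.tensor([0, 3, 14, 7])  # arbitrary toy labels

# CrossEntropyLoss expects raw logits and applies log-softmax itself.
ce = nn.CrossEntropyLoss()(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Softmax is monotonic, so the predicted class is unchanged by it.
pred_from_logits = torch.argmax(logits, dim=1)
pred_from_probs = torch.argmax(F.softmax(logits, dim=1), dim=1)

print(ce.item(), nll.item())
```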
        

We define a general-purpose training loop below. The inline comments explain each step in detail.

Minibatches: [http://d2l.ai/chapter_optimization/minibatch-sgd.html](http://d2l.ai/chapter_optimization/minibatch-sgd.html)


In [9]:
def train(model, epochs, optimizer, criterion, train_loader, valid_loader):
    best_acc = 0
    for epoch in (range(epochs)): # we run some number of training epochs
        
        running_loss = 0.0 # accumulated loss since the last progress report
        running_total = 0.0 # number of batches since the last progress report

        model.train() # set model to train mode (affects layers such as dropout/batch norm; this model has none)
        for batch, (x,y) in (enumerate(train_loader)): # dataset is too big to load at once, so we split it into minibatches.
            x,y = x.to(device), y.to(device) # load data to our device
            y_hat = model(x) # process data in model to get initial results
            _, predicted = torch.max(y_hat.data,1) # predictions are the maximum value across the 1 axis
            loss = criterion(y_hat, y) # we calculate the loss (we are trying to minimize it)
            optimizer.zero_grad() # set the gradient to 0 (get rid of prior gradient)
            loss.backward() # back propagation (get gradient)
            optimizer.step() # gradient descent step (update parameters)
            running_loss += loss.item() # add to total loss
            running_total += 1 # add to total batch count
            if batch % 50 == 0: # output results for every 50th batch
                print(f'   [epoch: {epoch}, batch: {batch*len(y):5d}/{len(train_loader.dataset)}, loss: {running_loss / running_total:.5f}]')
                running_loss = 0.0
                running_total = 0.0
        current_accuracy = validate(model, valid_loader) # at the end of each epoch, we validate the model against the held-out split
        if (current_accuracy > best_acc): # if it performed better, we save the best model checkpoint
            best_acc = current_accuracy
            torch.save(model, 'best_checkpoint.pt')
            print('   best updated!')
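
The core of the loop is the order `forward -> loss -> zero_grad -> backward -> step`. A stripped-down sketch of just that part, on a made-up toy problem with a single linear layer (the names `toy_model`, `toy_optimizer`, and `toy_criterion` and the learning rate of 0.1 are illustrative choices, not the settings used in this notebook):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x = torch.randn(64, 4)
y = (x[:, 0] > 0).long()  # toy rule: class depends on the sign of feature 0

toy_model = nn.Linear(4, 2)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.1)
toy_criterion = nn.CrossEntropyLoss()

losses = []
for step in range(30):
    logits = toy_model(x)             # forward pass
    loss = toy_criterion(logits, y)   # compute loss on raw logits
    toy_optimizer.zero_grad()         # clear gradients from the previous step
    loss.backward()                   # backpropagation
    toy_optimizer.step()              # parameter update
    losses.append(loss.item())

print(f'first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}')
```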


We write a validation loop. It is largely similar to the training loop, but we don't track gradients or update parameters. Its sole purpose is to evaluate the model against the held-out split.

In [10]:
def validate(model, valid_loader):
    with torch.no_grad():
        correct = 0.0
        total = 0

        model.eval()
        for batch, (x,y) in (enumerate(valid_loader)):
            x,y = x.to(device), y.to(device)
            y_hat = model(x)
            _, predicted = torch.max(y_hat.data,1)
            total += y.size(0)
            correct += (predicted == y).sum().item()

        accuracy = correct / total
        print(f'   Validation Accuracy: {100 * correct / total} %')
    
        return accuracy

Set hyperparameters! We use Cross Entropy Loss (standard for classification) and [Adam](https://arxiv.org/abs/1412.6980) as our optimizer, with its default learning rate and no learning rate scheduler. I am aware that techniques like [untuned warmup](https://arxiv.org/pdf/1910.04209.pdf#:~:text=Adaptive%20optimization%20algorithms%20such%20as,schedule%20for%20the%20learning%20rate.) have demonstrated strong performance, but for simplicity's sake this is not used here. We can also see that the model has only about 170k parameters.

In [11]:
import torch.optim as optim

model = FFN()
test_optimizer = optim.Adam(model.parameters())
test_criterion = nn.CrossEntropyLoss()
batch_size = 512
test_batch_size = 1000

model = model.to(device)
train_load = torch.utils.data.DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_load = torch.utils.data.DataLoader(valid_ds, batch_size=test_batch_size, shuffle=False)

model_params = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_params])
print('model parameters:',params)

model parameters: 170015
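
The 170015 figure can be checked by hand: each `nn.Linear(i, o)` layer has `i * o` weights plus `o` biases, and the model above has one 68-to-100 layer, sixteen 100-to-100 layers, and one 100-to-15 layer. A quick sketch of that arithmetic:

```python
# Layer widths: 68 inputs, 17 hidden activations of width 100, 15 outputs.
layer_sizes = [68] + [100] * 17 + [15]

# Each consecutive pair (i, o) is one Linear layer: i * o weights + o biases.
per_layer = [i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(len(per_layer), total)
```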


Below we run the training loop. We can see that the model indeed improves over time!

In [28]:
train(model, 5, test_optimizer, test_criterion, train_load, valid_load)

   [epoch: 0, batch:     0/4944481, loss: 2.70962]
   [epoch: 0, batch: 25600/4944481, loss: 2.00988]
   [epoch: 0, batch: 51200/4944481, loss: 1.34024]
   [epoch: 0, batch: 76800/4944481, loss: 1.12538]
   [epoch: 0, batch: 102400/4944481, loss: 0.80638]
   [epoch: 0, batch: 128000/4944481, loss: 0.52152]
   [epoch: 0, batch: 153600/4944481, loss: 0.33524]
   [epoch: 0, batch: 179200/4944481, loss: 0.30501]
   [epoch: 0, batch: 204800/4944481, loss: 0.26055]
   [epoch: 0, batch: 230400/4944481, loss: 0.23211]
   [epoch: 0, batch: 256000/4944481, loss: 0.22333]
   [epoch: 0, batch: 281600/4944481, loss: 0.43532]
   [epoch: 0, batch: 307200/4944481, loss: 0.27668]
   [epoch: 0, batch: 332800/4944481, loss: 0.23332]
   [epoch: 0, batch: 358400/4944481, loss: 0.22486]
   [epoch: 0, batch: 384000/4944481, loss: 0.22042]
   [epoch: 0, batch: 409600/4944481, loss: 0.21422]
   [epoch: 0, batch: 435200/4944481, loss: 0.20378]
   [epoch: 0, batch: 460800/4944481, loss: 0.21706]
   [epoch: 0, ba

We have been saving the best model as a `.pt` file. We now load it for evaluation.

In [12]:
model = torch.load('/home/myang/model_training/best_checkpoint.pt')

We ensure that it is the right one...

In [13]:
validate(model, train_load)

   Validation Accuracy: 95.36254664544165 %


0.9536254664544166

... and before we start, we take a look at its results in more detail with `sklearn`'s `classification_report` function. We can see that it performs fairly well. The 0 classification, which is benign, has a 0.97 f1 score across the full dataset. When detecting any of our 14 different intrusion types, the performance varies, and seems to loosely track how well represented each class is within the data, though not uniformly: class 1 has about 160k instances but only a 0.31 f1 score, while class 13 has 1,730 instances and a 0.83 f1 score. For example, class 4, which is represented in 87 instances out of approximately 5.5 million, isn't ever caught, which is fairly expected. Meanwhile, classes 14 and 9 are essentially perfect. The performance is overall good, but that could be problematic for reasons described [here](https://whugimy.github.io/projects/ersp/).

In [14]:
from sklearn.metrics import classification_report
preds = model.pred(X)
print(classification_report(Y, preds))



              precision    recall  f1-score   support

           0       0.95      0.99      0.97   2746934
           1       0.72      0.20      0.31    160639
           2       0.00      0.00      0.00       611
           3       0.00      0.00      0.00       230
           4       0.00      0.00      0.00        87
           5       0.74      0.52      0.61    139890
           6       1.00      1.00      1.00    461912
           7       1.00      0.99      1.00    576191
           8       0.71      0.87      0.78    193354
           9       1.00      1.00      1.00    187589
          10       0.96      0.99      0.98    286191
          11       0.97      0.99      0.98     41508
          12       0.86      0.98      0.92     10990
          13       0.72      0.99      0.83      1730
          14       1.00      1.00      1.00    686012

    accuracy                           0.95   5493868
   macro avg       0.71      0.70      0.69   5493868
weighted avg       0.95   
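
The macro average in the report above (0.69 f1) sits well below the 0.95 accuracy because it weights every class equally, so the rare classes with a 0.00 f1 score pull it down. A toy sketch of that effect using `sklearn`'s `f1_score` (the 95/5 split and the always-predict-0 classifier are made up for illustration):

```python
from sklearn.metrics import f1_score

# 95 samples of class 0, 5 of class 1; the "model" always predicts class 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

# zero_division=0 reports 0 for a class that is never predicted.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(per_class, macro, weighted)
```

The macro score is the plain mean of the per-class scores, while the weighted score scales each class by its support, so the dominant class hides the missed minority class.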



We again ensure the model is the right one...

In [15]:
model_parameters = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])
print(params)

170015


And finally, we can begin working with TRUSTEE. We first convert our dataframes into PyTorch `tensor` objects. TRUSTEE is largely built on `sklearn`'s decision tree training framework (not the same process, but same classes and formats), so we need to use tensors instead of dataframes.

In [16]:
from trustee import ClassificationTrustee
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# X = df.drop('Label', axis=1)

x_tensor = torch.tensor(X.values, dtype=torch.float32)
y_tensor = torch.tensor(Y.values, dtype=torch.long)

When we load a sample data point in our tensor, we see that it's still the normalized data—which we want! The FFN doesn't know data is being normalized before being put in, so this way the inputs for TRUSTEE are the same as that of the FFN.

In [17]:
x_tensor[1]

tensor([-1.5356e-02, -2.1717e-02, -2.7047e-02, -2.3092e-02, -1.5337e-02,
        -3.7221e-01,  2.2031e+00,  2.4280e-01, -5.4623e-01, -4.1921e-01,
         2.1418e+00,  1.1738e-01, -6.0950e-01, -4.4034e-03, -2.6216e-01,
        -1.8986e-01, -2.9473e-03, -8.7419e-03, -3.0266e-03, -1.4896e-02,
        -2.0126e-01, -3.0395e-03, -8.4495e-03, -3.1752e-03, -2.3742e-01,
        -1.4929e-01, -2.1273e-01, -2.1053e-01, -6.5273e-02, -1.6745e-01,
        -1.6794e-02, -2.4769e-02, -3.4850e-02, -2.3577e-01, -2.4631e-01,
         2.3738e+00, -4.6699e-01,  1.0238e-01, -4.7545e-01, -1.4637e-01,
        -5.4206e-02, -1.6745e-01, -5.0728e-01, -8.2245e-01, -8.5239e-01,
        -2.0755e-01, -1.6794e-02, -5.0729e-01,  6.4583e-01,  3.6519e-01,
         2.4280e-01,  1.1738e-01, -2.1717e-02, -2.3092e-02, -2.7047e-02,
        -1.5337e-02, -6.8569e-01, -3.2391e-01, -2.0935e-02, -1.5090e+00,
        -5.7687e-02, -4.6765e-02, -6.5431e-02, -4.6213e-02, -1.4954e-02,
        -1.4701e-03, -7.4057e-03, -1.3927e-01])

We can look at this exact point in the dataframe (also normalized) to look at the feature names as well.

In [18]:
X.iloc[1]

Flow Duration     -0.015356
Tot Fwd Pkts      -0.021717
Tot Bwd Pkts      -0.027047
TotLen Fwd Pkts   -0.023092
TotLen Bwd Pkts   -0.015337
                     ...   
Active Min        -0.046213
Idle Mean         -0.014954
Idle Std          -0.001470
Idle Max          -0.007406
Idle Min          -0.139268
Name: 1, Length: 68, dtype: float64

Finally, we can begin the training process. TRUSTEE trains many decision trees, then chooses the best ones in a fairly complex way. If possible, it would be best to read about it in the [paper](https://sites.cs.ucsb.edu/~arpitgupta/pdfs/trustee.pdf). However, in short, TRUSTEE optimizes trees for two factors: Fidelity and Agreement.

Fidelity refers to the similarity of the tree and the black-box model. We optimize the trees such that they maximize fidelity, and are as close a representation to the black-box as possible.

Agreement refers to the similarity of the tree to other trees we train. We train a large distribution of trees, all of which undergo the same training process, but have some randomization such that they may each differ in some way. Each tree is given an agreement score based on how similar it is to other trees, in the hopes that choosing the tree with the highest agreement to output means we've chosen the tree that best represents the full distribution. That being said, the distribution could have multiple peaks; and we're not entirely sure if this is the best methodology.
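
To make the two ideas concrete, here is a rough sketch using plain `sklearn`. It is not TRUSTEE's implementation, and the exact scoring TRUSTEE uses may differ: fidelity is approximated here as the weighted f1 score between a tree's predictions and the black-box's predictions, and agreement as the average fraction of samples on which a tree predicts the same label as each of the other trees. The random forest standing in for the black-box and all data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a stand-in black-box model.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
black_box = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
bb_labels = black_box.predict(X)

# Train several student trees on the black-box's labels (not the ground truth),
# each on a different random half of the data.
rng = np.random.default_rng(0)
students = []
for seed in range(3):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    tree = DecisionTreeClassifier(max_depth=4, random_state=seed)
    students.append(tree.fit(X[idx], bb_labels[idx]))

preds = np.array([t.predict(X) for t in students])

# Fidelity: how closely each tree reproduces the black-box's predictions.
fidelity = [f1_score(bb_labels, p, average='weighted') for p in preds]

# Agreement: mean overlap between one tree's predictions and every other tree's.
agreement = []
for i in range(len(students)):
    overlaps = [np.mean(preds[i] == preds[j])
                for j in range(len(students)) if j != i]
    agreement.append(float(np.mean(overlaps)))

print(fidelity, agreement)
```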

The verbose training output is shown below.

In [19]:
trustee = ClassificationTrustee(expert=model)
trustee.fit(X, Y, num_iter=5, num_stability_iter=5, samples_size=1.0, verbose=True, predict_method_name='pred')

Initializing training dataset using MLP(
  (layers): Sequential(
    (0): Linear(in_features=68, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=100, bias=True)
    (3): ReLU()
    (4): Linear(in_features=100, out_features=100, bias=True)
    (5): ReLU()
    (6): Linear(in_features=100, out_features=100, bias=True)
    (7): ReLU()
    (8): Linear(in_features=100, out_features=100, bias=True)
    (9): ReLU()
    (10): Linear(in_features=100, out_features=100, bias=True)
    (11): ReLU()
    (12): Linear(in_features=100, out_features=100, bias=True)
    (13): ReLU()
    (14): Linear(in_features=100, out_features=100, bias=True)
    (15): ReLU()
    (16): Linear(in_features=100, out_features=100, bias=True)
    (17): ReLU()
    (18): Linear(in_features=100, out_features=100, bias=True)
    (19): ReLU()
    (20): Linear(in_features=100, out_features=100, bias=True)
    (21): ReLU()
    (22): Linear(in_features=100, out_features=100, bias=True)
   

We then use the TRUSTEE model's explain function to get:
1. `dt`: the full decision tree
2. `pruned_dt`: the pruned decision tree for easier reading (we only really care about the top features, reasoning denoted on website)

In [23]:
dt, pruned_dt, agreement, reward = trustee.explain()

We run the decision tree on our training dataset to see how it did.

In [24]:
dt_y_pred = dt.predict(X_train)



We save the decision tree model for later use.

In [25]:
import pickle
with open('trustee_decision_tree_model.pkl', 'wb') as f:
    pickle.dump(dt, f)
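
Loading the tree back later mirrors the save. A minimal round-trip sketch on a toy tree (the iris data, `toy_tree`, and the temporary file path are stand-ins, not this notebook's objects):

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# Save to a temporary location, then load it back.
path = os.path.join(tempfile.mkdtemp(), 'toy_tree.pkl')
with open(path, 'wb') as f:
    pickle.dump(toy_tree, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored tree should make identical predictions.
same = bool((restored.predict(iris.data) == toy_tree.predict(iris.data)).all())
print(same)
```

Pickled `sklearn` models are generally only safe to load with the same library version that wrote them, and pickle files from untrusted sources should not be loaded at all.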

We look at the full classification report for the TRUSTEE tree model. It performed noticeably worse than the FFN model.

In [22]:
print(classification_report(Y_train, dt_y_pred))



              precision    recall  f1-score   support

           0       0.95      0.99      0.97   2472823
           1       0.72      0.20      0.31    144311
           2       0.00      0.00      0.00       551
           3       0.00      0.00      0.00       206
           4       0.00      0.00      0.00        75
           5       0.74      0.52      0.61    125805
           6       1.00      1.00      1.00    415856
           7       1.00      0.99      1.00    518214
           8       0.71      0.87      0.78    173820
           9       1.00      1.00      1.00    168771
          10       0.96      0.99      0.98    257658
          11       0.97      0.99      0.98     37349
          12       0.86      0.98      0.92      9897
          13       0.72      0.99      0.83      1573
          14       1.00      1.00      1.00    617572

    accuracy                           0.95   4944481
   macro avg       0.71      0.70      0.69   4944481
weighted avg       0.95   



Now we try the same with the pruned decision tree.

In [None]:
pruned_dt_y_pred = pruned_dt.predict(X_train)



And as expected, with a shorter tree, it performs a bit worse than the full DT, and significantly worse than the FFN.

In [None]:
print(classification_report(Y_train, pruned_dt_y_pred))



              precision    recall  f1-score   support

           0       0.94      0.92      0.93   2472823
           1       0.00      0.00      0.00    144311
           2       0.00      0.00      0.00       551
           3       0.00      0.00      0.00       206
           4       0.00      0.00      0.00        75
           5       0.32      0.55      0.41    125805
           6       0.98      0.94      0.96    415856
           7       0.98      0.97      0.97    518214
           8       0.71      0.78      0.74    173820
           9       0.40      0.50      0.45    168771
          10       0.68      0.98      0.80    257658
          11       0.00      0.00      0.00     37349
          12       0.00      0.00      0.00      9897
          13       0.00      0.00      0.00      1573
          14       0.99      1.00      1.00    617572

    accuracy                           0.87   4944481
   macro avg       0.40      0.44      0.42   4944481
weighted avg       0.86   



Finally, we use `graphviz` to look at an image of the tree. We input our features and parameters as labels.

In [29]:
from sklearn import tree
tree.export_graphviz(pruned_dt, 'prunedgraphvizoutput.gv', feature_names=X_train.columns, class_names=['Benign', 'Infilteration', 'Brute Force -Web',
       'Brute Force -XSS', 'SQL Injection', 'DoS attacks-SlowHTTPTest',
       'DoS attacks-Hulk', 'DDoS attacks-LOIC-HTTP', 'FTP-BruteForce',
       'SSH-Bruteforce', 'Bot', 'DoS attacks-GoldenEye',
       'DoS attacks-Slowloris', 'DDOS attack-LOIC-UDP',
       'DDOS attack-HOIC'])

And finally, below, we can see our decision tree! It would appear that Packet Length is the topmost decision split. After further discussion with Roman, it's hard to say how impactful this feature is. If all malicious data had the same packet length, it could be problematic in some cases, but there are also cases where that's okay. If you were to suddenly get a burst of data all with the same size, it could be indicative of an attack, but if your network expects all packets to be of some fixed size, that feature is useless and the model we trained wouldn't help you at all.

In [42]:
import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(
    pruned_dt,
    filled=True,
    rounded=True,
    special_characters=True,
    class_names=['Benign', 'Infilteration', 'Brute Force -Web',
       'Brute Force -XSS', 'SQL Injection', 'DoS attacks-SlowHTTPTest',
       'DoS attacks-Hulk', 'DDoS attacks-LOIC-HTTP', 'FTP-BruteForce',
       'SSH-Bruteforce', 'Bot', 'DoS attacks-GoldenEye',
       'DoS attacks-Slowloris', 'DDOS attack-LOIC-UDP',
       'DDOS attack-HOIC'],
    feature_names=X_train.columns
)

graph = graphviz.Source(dot_data)
graph.render("dt_explanation")

ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH
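
The `ExecutableNotFound` error above means the Graphviz `dot` binary was not found on the PATH, so no image was rendered. As a fallback that needs no external binary, `sklearn`'s `export_text` prints a tree as indented text. A minimal sketch on a toy tree (the iris data and `toy_tree` are stand-ins; the same call pattern would presumably work with `pruned_dt` and `list(X_train.columns)`):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# export_text returns a plain string; feature_names is passed as a list.
text_view = export_text(toy_tree, feature_names=list(iris.feature_names))
print(text_view)
```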

In [32]:
from trustee.report.trust import TrustReport
trust_report = TrustReport(
        model,
        X=X,
        y=Y,
        max_iter=5,
        num_pruning_iter=5,
        train_size=0.7,
        trustee_num_iter=5,
        trustee_num_stability_iter=5,
        trustee_sample_size=1.0,
        analyze_branches=True,
        analyze_stability=True,
        top_k=10,
        verbose=True,
        class_names=['Benign', 'Infilteration', 'Brute Force -Web',
       'Brute Force -XSS', 'SQL Injection', 'DoS attacks-SlowHTTPTest',
       'DoS attacks-Hulk', 'DDoS attacks-LOIC-HTTP', 'FTP-BruteForce',
       'SSH-Bruteforce', 'Bot', 'DoS attacks-GoldenEye',
       'DoS attacks-Slowloris', 'DDOS attack-LOIC-UDP',
       'DDOS attack-HOIC'],
        feature_names=X_train.columns,
        is_classify=True,
        predict_method_name='pred',
        skip_retrain = True
    )

Running Trust Report...
Preparing data...
Splitting dataset for training and testing...
X size: 5493868; y size: 5493868
Done!
Done!
Progress |----------------------------------------------------------------------------------------------------| 0.8% Complete
Collecting blackbox information...
Done!
Progress |█---------------------------------------------------------------------------------------------------| 1.6% Complete
Collecting trustee information...
Fitting blackbox model...
Done!
Blackbox model score report with training data:

              precision    recall  f1-score   support

           0      0.954     0.990     0.972    823804
           1      0.713     0.198     0.310     48210
           2      0.000     0.000     0.000       168
           3      0.000     0.000     0.000        64
           4      0.000     0.000     0.000        24
           5      0.739     0.527     0.615     41854
           6      0.998     0.998     0.998    138781
           7      0.997   

In [34]:
with open('trustee_report.pkl', 'wb') as f:
    pickle.dump(trust_report, f)