# Lab 4, Exercise 2

## Instructions
The goal of this exercise is to build a straightforward machine learning pipeline for a problem with more than two classes.  A lot of the data preprocessing has already been done, so the main focus of this exercise is to become familiar with loading data, training a model, doing inference, and analyzing the results.

In [1]:
import numpy as np
import pandas as pd

## Load the data

For example, here's the first couple rows of the dataset:

| Source IP    |  Source Port |  Destination IP   |  Destination Port |  Protocol |  Flow Duration |  Flow Bytes/s |  Flow Packets/s |  Flow IAT Mean |  Flow IAT Std |  Flow IAT Max |  Flow IAT Min | Fwd IAT Mean |  Fwd IAT Std |  Fwd IAT Max |  Fwd IAT Min | Bwd IAT Mean |  Bwd IAT Std |  Bwd IAT Max |  Bwd IAT Min | Active Mean |  Active Std |  Active Max |  Active Min | Idle Mean |  Idle Std |  Idle Max |  Idle Min | label |
|--------------|--------------|-------------------|-------------------|-----------|----------------|---------------|-----------------|----------------|---------------|---------------|---------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|-------------|-------------|-------------|-------------|-----------|-----------|-----------|-----------|-------|
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 7248168        | 21126\.02798  | 29\.11080428    | 34515\.08571   | 273869\.2625  | 3897923       | 5             | 89483\.55556 | 437167\.5917 | 3898126      | 29           | 56614\.03906 | 349855\.1098 | 3898131      | 7            | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 5157723        | 1052\.790156  | 3\.683796125    | 286540\.1667   | 878838\.5256  | 3743359       | 135           | 644715\.375  | 1272066\.058 | 3743562      | 509          | 568901\.6667 | 1209110\.287 | 3743573      | 451          | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |



In [2]:
# Import CSV data as a Pandas dataframe
# The data is in 'data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv'

# CODE HERE
csv_data = pd.read_csv('data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv')

# Create data and labels that can be used by sklearn's 'train_test_split'
# Create the labels

# CODE HERE
labels = csv_data['label']

# Create the data
# -Keep just the numeric features (i.e., those features between 'Flow Duration' and 'Idle Min')
# -Make sure not to keep the labels

# CODE HERE
data = csv_data.iloc[:,5:28]

# You should now have data and labels that can be used by sklearn's 'train_test_split'

## Create a single train/test split for experimentation

In [3]:
# Randomly pick 50% of the data for the training set, and keep the remaining 50% for the test set
# Use sklearn's 'train_test_split'
# CODE HERE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data,labels,test_size=0.5,random_state=123)

## Train a classifier

In [4]:
# Train a random forest classifier using default hyperparameters
# Hint: Not counting any import statements, this can be done in a single line of code
# CODE HERE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=1000,
    max_depth=100,
    min_samples_leaf=2,
    random_state=123
).fit(X_train, y_train)

## Test the classifier on the test set

In [5]:
# Predict the labels on the test set

# CODE HERE
preds = model.predict(X_test)

# Use accuracy and a confusion matrix to measure performance
# Hint: Use sklearn's built-in metrics

# CODE HERE
from sklearn import metrics
labels_uniq = csv_data['label'].unique()
acc = metrics.accuracy_score(y_test, preds)
matrix = metrics.confusion_matrix(y_test, preds, labels=labels_uniq)
mat_df = pd.DataFrame(
    matrix,
    index=['t:'+x for x in labels_uniq],
    columns=['p:'+x for x in labels_uniq]
)
print('Accuracy: {}'.format(acc))
print('Confusion Matrix:\n{}'.format(matrix))
print('Confusion Matrix with Labels:\n{}'.format(mat_df))

Accuracy: 0.8329189457981104
Confusion Matrix:
[[ 263   87    3    0    3    3    5    1]
 [  42  667   16    2    3    1   50    3]
 [   2  105   50    1    0    1    6    4]
 [   5   26    3  396    5    0   13    2]
 [   5   57    3    9   50    2    9    1]
 [   3    7    1    0    0  532    6    1]
 [  12   87    3   18   13    7  275    1]
 [   6   22    1    2    0    0    4 1117]]
Confusion Matrix with Labels:
                 p:AUDIO  p:BROWSING  p:CHAT  p:FILE-TRANSFER  p:MAIL  p:P2P  \
t:AUDIO              263          87       3                0       3      3   
t:BROWSING            42         667      16                2       3      1   
t:CHAT                 2         105      50                1       0      1   
t:FILE-TRANSFER        5          26       3              396       5      0   
t:MAIL                 5          57       3                9      50      2   
t:P2P                  3           7       1                0       0    532   
t:VIDEO           

In [6]:
# Determine important features

# CODE HERE
ft_imps = [ft for ft in zip(data.columns[5:28], model.feature_importances_)]
ft_imps.sort(key=lambda ft: ft[1], reverse=True)
print('Feature\t\tImportance')
for ft in ft_imps: print('{}\t{}'.format(ft[0].strip(), ft[1]))

Feature		Importance
Flow IAT Min	0.10533890391696793
Bwd IAT Std	0.08009979525074445
Active Mean	0.07590030696535331
Bwd IAT Min	0.07552757192054471
Active Std	0.0696217162658118
Idle Mean	0.069297552320734
Fwd IAT Min	0.06891040683994405
Active Min	0.06597273058790702
Fwd IAT Mean	0.06455746253116558
Fwd IAT Std	0.06417815436232036
Fwd IAT Max	0.055787701072625005
Bwd IAT Max	0.055218405109308685
Flow IAT Max	0.05076004169364114
Bwd IAT Mean	0.04353350919648317
Active Max	0.0399038486969354
Idle Std	0.003020806400454983
Idle Min	0.0029491987303549485
Idle Max	0.0


Questions:

1) What is the overall accuracy using the default parameters?  

2) What is the confusion matrix for the tested approach?  What are the classes where the model performs well?  What are the classes where the model performs poorly?

3) What are the top 5 most important features?

4) What hyperparameters could you tune in the random forest to improve performance? What is the best accuracy you can attain?

5) Bonus: How would you improve the pipeline above to automatically tune the hyperparameters?  How would you improve the pipeline to use multiple train/test splits?