# Lab 4, Exercise 2

## Instructions
The goal of this exercise is to build a straightforward machine learning pipeline for a problem with more than two classes.  A lot of the data preprocessing has already been done, so the main focus of this exercise is to become familiar with loading data, training a model, doing inference, and analyzing the results.

In [1]:
import numpy as np
import pandas as pd

## Load the data

For example, here's the first couple rows of the dataset:

| Source IP    |  Source Port |  Destination IP   |  Destination Port |  Protocol |  Flow Duration |  Flow Bytes/s |  Flow Packets/s |  Flow IAT Mean |  Flow IAT Std |  Flow IAT Max |  Flow IAT Min | Fwd IAT Mean |  Fwd IAT Std |  Fwd IAT Max |  Fwd IAT Min | Bwd IAT Mean |  Bwd IAT Std |  Bwd IAT Max |  Bwd IAT Min | Active Mean |  Active Std |  Active Max |  Active Min | Idle Mean |  Idle Std |  Idle Max |  Idle Min | label |
|--------------|--------------|-------------------|-------------------|-----------|----------------|---------------|-----------------|----------------|---------------|---------------|---------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|-------------|-------------|-------------|-------------|-----------|-----------|-----------|-----------|-------|
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 7248168        | 21126\.02798  | 29\.11080428    | 34515\.08571   | 273869\.2625  | 3897923       | 5             | 89483\.55556 | 437167\.5917 | 3898126      | 29           | 56614\.03906 | 349855\.1098 | 3898131      | 7            | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 5157723        | 1052\.790156  | 3\.683796125    | 286540\.1667   | 878838\.5256  | 3743359       | 135           | 644715\.375  | 1272066\.058 | 3743562      | 509          | 568901\.6667 | 1209110\.287 | 3743573      | 451          | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |



In [2]:
# Import CSV data as a Pandas dataframe
# The data is in 'data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv'

# CODE HERE
csv_data = pd.read_csv('data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv')

# Create data and labels that can be used by sklearn's 'train_test_split'
# Create the labels

# CODE HERE
labels = csv_data['label']

# Create the data
# -Keep just the numeric features (i.e., those features between 'Flow Duration' and 'Idle Min')
# -Make sure not to keep the labels

# CODE HERE
data = csv_data.iloc[:,5:28]

# You should now have data and labels that can be used by sklearn's 'train_test_split'

## Create a single train/test split for experimentation

In [3]:
# Randomly pick 50% of the data for the training set, and keep the remaining 50% for the test set
# Use sklearn's 'train_test_split'
# CODE HERE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data,labels,test_size=0.5,random_state=42)

## Train a classifier

In [4]:
# Train a random forest classifier using default hyperparameters
# Hint: Not counting any import statements, this can be done in a single line of code
# CODE HERE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)



## Test the classifier on the test set

In [5]:
# Predict the labels on the test set

# CODE HERE
preds = model.predict(X_test)

# Use accuracy and a confusion matrix to measure performance
# Hint: Use sklearn's built-in metrics

# CODE HERE
from sklearn import metrics
labels_uniq = csv_data['label'].unique()
acc = metrics.accuracy_score(y_test, preds)
matrix = metrics.confusion_matrix(y_test, preds, labels=labels_uniq)
mat_df = pd.DataFrame(
    matrix,
    index=['t:'+x for x in labels_uniq],
    columns=['p:'+x for x in labels_uniq]
)
print('Accuracy: {}'.format(acc))
print('Confusion Matrix:\n{}'.format(matrix))
print('Confusion Matrix with Labels:\n{}'.format(mat_df))

Accuracy: 0.8145201392342118
Confusion Matrix:
[[ 287   82    0    3    4    0   12    3]
 [  57  635   33    8    6    1   52    1]
 [   6   84   47    1    0    0    5    6]
 [   7   27    3  372    4    0   28    1]
 [   4   53    6   13   42    0   17    2]
 [   4    7    0    2    0  505    8    0]
 [  14   90    5   32    5   11  266    3]
 [   2   22    4    4    0    3    1 1122]]
Confusion Matrix with Labels:
                 p:AUDIO  p:BROWSING  p:CHAT  p:FILE-TRANSFER  p:MAIL  p:P2P  \
t:AUDIO              287          82       0                3       4      0   
t:BROWSING            57         635      33                8       6      1   
t:CHAT                 6          84      47                1       0      0   
t:FILE-TRANSFER        7          27       3              372       4      0   
t:MAIL                 4          53       6               13      42      0   
t:P2P                  4           7       0                2       0    505   
t:VIDEO           

In [6]:
# Determine important features

# CODE HERE
ft_imps = [ft for ft in zip(data.columns[5:28], model.feature_importances_)]
ft_imps.sort(key=lambda ft: ft[1], reverse=True)
print('Feature\t\tImportance')
for ft in ft_imps: print('{}\t{}'.format(ft[0].strip(), ft[1]))

Feature		Importance
Active Mean	0.10515457310120781
Fwd IAT Min	0.09589100740451892
Flow IAT Min	0.08354734280312069
Bwd IAT Std	0.07723398973690583
Active Min	0.07689374865493556
Idle Mean	0.06969341662197867
Fwd IAT Mean	0.0658461807116965
Bwd IAT Min	0.06448185662688093
Fwd IAT Std	0.06325935819009602
Active Std	0.05477230498503107
Bwd IAT Max	0.051339436169573024
Flow IAT Max	0.05133361534142815
Bwd IAT Mean	0.04605061147048475
Fwd IAT Max	0.04498200340507924
Active Max	0.03393568389701756
Idle Std	0.0033827536084262244
Idle Min	0.002773785217306177
Idle Max	0.0


Questions:

1) What is the overall accuracy using the default parameters?  

The overall accuracy using the default parameters is about 0.8145, or 81.45%.

2) What is the confusion matrix for the tested approach?  What are the classes where the model performs well?  What are the classes where the model performs poorly?

The confusion matrix, shown above, is copied here for convenience (labeled confusion matrix is left above):
```
[[ 287   82    0    3    4    0   12    3]
 [  57  635   33    8    6    1   52    1]
 [   6   84   47    1    0    0    5    6]
 [   7   27    3  372    4    0   28    1]
 [   4   53    6   13   42    0   17    2]
 [   4    7    0    2    0  505    8    0]
 [  14   90    5   32    5   11  266    3]
 [   2   22    4    4    0    3    1 1122]]
```
The model seems to perform very well for 'VOIP' and 'P2P.' On the other hand, some classes where the model performs poorly are 'CHAT' and 'MAIL.'

3) What are the top 5 most important features?

The top 5 most important features, as shown above, are: Active Mean, Fwd IAT Min, Flow IAT Min, Bwd IAT Std, and Active Min.

4) What hyperparameters could you tune in the random forest to improve performance? What is the best accuracy you can attain?

There are a lot of hyperparameters that can be changed to tune the model and to improve performance. Some of these include: n_estimators (number of trees in the forest), max_depth (max depth of a tree), and min_samples_leaf (minimum number of samples to be considered a leaf). After a few tries of modifying these parameters, the best accuracy obtained was about 0.8277 or 82.77%, as shown below. However, if the model is extremely overfit to the test data set, then an accuracy of or near 100% could be possible, although extremely unlikely.

5) Bonus: How would you improve the pipeline above to automatically tune the hyperparameters?  How would you improve the pipeline to use multiple train/test splits?

Combining lists of values for several hyperparameters into a dictionary and performing cross validation on the combinations can help automatically tune the hyperparameters. For example, using sklearn.model_selection.GridSearchCV will return a model with the best performing hyperparameter values. Additionally, using multiple train/test splits can also help automatically tune the hyperparameters. In this case, using sklearn.model_selection.StratifiedKFold can be helpful, since it also preserves the percentage of samples for each class. K-fold cross validation can then help determine how to split the data for the model to perform the best.

In [7]:
# tuning
tune_model = RandomForestClassifier(
    n_estimators=1000,
    max_depth=100,
    min_samples_leaf=2,
    min_samples_split=3,
    n_jobs=-1,
    random_state=42
).fit(X_train, y_train)
tune_preds = tune_model.predict(X_test)
print('Accuracy: {}'.format(metrics.accuracy_score(y_test, tune_preds)))

Accuracy: 0.8276976628543014
