
In [None]:
# install awscli to download the data
!pip3 install awscli --upgrade --user

# download the data and save it in the `data` directory
!mkdir -p data
!~/.local/bin/aws s3 sync --no-sign-request --region us-west-1 "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/" data/.

# Anomaly Detection with LSTM in Network Traffic Data

This project explores anomaly detection in network traffic using an RNN-LSTM model.

The dataset (CSE-CIC-IDS2018) can be obtained [here](https://www.unb.ca/cic/datasets/ids-2018.html).

This is part of the capstone project for the Machine Learning Nanodegree from Udacity.

## Data Exploration

---



In [1]:
import pandas as pd
import numpy as np
import os
import glob
from sklearn import svm
from sklearn.ensemble import IsolationForest
import time

from lib.helper_functions import *

In [2]:
# if a saved dataframe file exists, load it
# otherwise, load the raw csv files, build the dataframe, and save it
dataframe_file = 'flowmeter_dataframe.pkl'
exists = os.path.isfile(dataframe_file)
if exists:
    df = pd.read_pickle(dataframe_file)
else:
    directory = '/home/jlee/cse-cic-ids2018/Processed Traffic Data for ML Algorithms'
    df = pd.DataFrame()
    # note: 'Thuesday' is the spelling used in the dataset's own file name
    df = read_clean_combine_csv(directory, df, 'Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv')
    # save dataframe to file for future use
    pd.to_pickle(df, dataframe_file)

In [3]:
# total memory footprint of the dataframe in MiB
df.memory_usage().sum() / 1024**2

2338.5255813598633

In [4]:
len(df)

8284195

In [5]:
df = df.sort_values(by=['Timestamp'])

In [6]:
# keep only rows with a timestamp after 2018-01-01 and renumber the index
df = df[df['Timestamp'] > pd.to_datetime('2018-01-01')].reset_index(drop=True)

In [7]:
# get count of each label
print(df['Label'].value_counts())

Benign                      6112137
DDOS attack-HOIC             686012
DoS attacks-Hulk             461912
Bot                          286191
FTP-BruteForce               193360
SSH-Bruteforce               187589
Infilteration                161934
DoS attacks-SlowHTTPTest     139890
DoS attacks-GoldenEye         41508
DoS attacks-Slowloris         10990
DDOS attack-LOIC-UDP           1730
Brute Force -Web                611
Brute Force -XSS                230
SQL Injection                    87
Name: Label, dtype: int64


In [8]:
# get the relative frequency of each label
print(df['Label'].value_counts()/len(df))

Benign                      0.737808
DDOS attack-HOIC            0.082810
DoS attacks-Hulk            0.055758
Bot                         0.034547
FTP-BruteForce              0.023341
SSH-Bruteforce              0.022644
Infilteration               0.019547
DoS attacks-SlowHTTPTest    0.016886
DoS attacks-GoldenEye       0.005011
DoS attacks-Slowloris       0.001327
DDOS attack-LOIC-UDP        0.000209
Brute Force -Web            0.000074
Brute Force -XSS            0.000028
SQL Injection               0.000011
Name: Label, dtype: float64


In short, 73.8% of the data points in this dataset are 'Benign'; the remaining 26.2% are some form of malicious attack.
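
The benign and malicious shares can be re-derived from the label counts printed above. The following is a minimal sketch that only redoes the arithmetic; the dictionary is copied by hand from the `value_counts()` output and the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    'Benign': 6112137,
    'DDOS attack-HOIC': 686012,
    'DoS attacks-Hulk': 461912,
    'Bot': 286191,
    'FTP-BruteForce': 193360,
    'SSH-Bruteforce': 187589,
    'Infilteration': 161934,
    'DoS attacks-SlowHTTPTest': 139890,
    'DoS attacks-GoldenEye': 41508,
    'DoS attacks-Slowloris': 10990,
    'DDOS attack-LOIC-UDP': 1730,
    'Brute Force -Web': 611,
    'Brute Force -XSS': 230,
    'SQL Injection': 87,
}

total = sum(label_counts.values())
benign = label_counts['Benign']
malicious = total - benign  # everything that is not 'Benign'

print(f"total rows:      {total}")
print(f"benign share:    {benign / total:.1%}")
print(f"malicious share: {malicious / total:.1%}")
```

The total agrees with `len(df)` after the timestamp filter, and the malicious count agrees with the size of the malicious test set used later.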

## Training the "Norm"
The goal is to train a model on "normal" activity only and to flag anything that deviates from it as an anomaly.
This approach is known as "Novelty Detection". It is implemented here with One-Class SVM and Isolation Forest, as outlined in the scikit-learn documentation:
- https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection
- https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#sphx-glr-auto-examples-svm-plot-oneclass-py
- https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html
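
To make the idea concrete, here is a minimal novelty-detection sketch on synthetic 2-D data. The data, the cluster positions, and the variable names are assumptions chosen for illustration; they are not the network-traffic features used in this notebook. The model is fitted on "normal" points only and then asked to judge unseen normal points and far-away anomalies.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Synthetic "normal" activity: a tight Gaussian cluster around the origin.
X_train = 0.3 * rng.randn(200, 2)
X_normal_new = 0.3 * rng.randn(50, 2)
# Synthetic anomalies: points placed far away from the cluster.
X_anomalies = rng.uniform(low=4, high=6, size=(50, 2))

# Fit on normal data only; the model never sees an anomaly during training.
clf = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(X_train)

# predict() returns +1 for inliers and -1 for outliers.
pred_normal = clf.predict(X_normal_new)
pred_anomalies = clf.predict(X_anomalies)

print("unseen normal points flagged:", (pred_normal == -1).mean())
print("anomalies flagged:           ", (pred_anomalies == -1).mean())
```

On such well-separated toy data the anomalies are easy to flag; real traffic features overlap far more, so the results below are much less clean.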

In [9]:
# use 66% of the 'Benign' rows as the training set
# the remaining 34% of the 'Benign' rows form the benign test set
# all non-benign rows are treated as the outliers
train_set = df.loc[df['Label'] == 'Benign'].sample(frac=.66, random_state=123)
test_set = df.drop(train_set.index)
test_benign = test_set.loc[test_set['Label'] == 'Benign']
test_malic = test_set.loc[test_set['Label'] != 'Benign']
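
The same split logic can be checked on a toy frame. This is a sketch under stated assumptions: the ten-row frame and its feature values are made up for illustration, and only the split pattern mirrors the cell above. It shows that the training set contains benign rows only, that train and test do not overlap, and that every malicious row ends up in the test set.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are illustrative only).
toy = pd.DataFrame({
    'Label': ['Benign'] * 6 + ['Bot'] * 2 + ['SQL Injection'] * 2,
    'Flow Duration': range(10),
})

# Same pattern as above: sample benign rows for training, test on the rest.
train = toy.loc[toy['Label'] == 'Benign'].sample(frac=.66, random_state=123)
test = toy.drop(train.index)
test_benign = test.loc[test['Label'] == 'Benign']
test_malic = test.loc[test['Label'] != 'Benign']

print("train labels:", set(train['Label']))
print("overlap between train and test:", len(train.index.intersection(test.index)))
print("sizes:", len(train), len(test_benign), len(test_malic))
```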

In [10]:
print(len(train_set))
print(len(test_set))
print(len(test_benign))
print(len(test_malic))
print(len(df))

4034010
4250171
2078127
2172044
8284181


In [11]:
# drop the non-feature columns, then keep a 1% random sample of each set
excluded_cols = ['Dst Port', 'Protocol', 'Timestamp', 'Label']
feature_cols = train_set.columns.difference(excluded_cols)
train_features_set = train_set[feature_cols].sample(frac=.01, random_state=123)
test_features_benign = test_benign[feature_cols].sample(frac=.01, random_state=123)
test_features_malic = test_malic[feature_cols].sample(frac=.01, random_state=123)

### One-Class SVM

In [15]:
start = time.time()
# fit the model
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma='scale')
clf.fit(train_features_set)
print(time.time() - start)

458.05040287971497


In [16]:
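# predict() returns +1 for inliers (normal) and -1 for outliers (anomalous)
y_pred_train = clf.predict(train_features_set)
y_pred_test = clf.predict(test_features_benign)
y_pred_outliers = clf.predict(test_features_malic)
# errors: benign rows flagged as outliers, and malicious rows accepted as inliers
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size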

In [17]:
print(str(n_error_train) + '/' + str(len(train_features_set)))
print(str(n_error_test) + '/' + str(len(test_features_benign)))
print(str(n_error_outliers) + '/' + str(len(test_features_malic)))

7467/40340
7943/20781
9828/21720


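On this dataset, the One-Class SVM flagged about 38% of the held-out 'Benign' flows as anomalous and accepted about 45% of the 'Malicious' flows (everything not benign) as normal. That is an overall test error of roughly 42% on a test sample that is about 50% normal and 50% anomalous network traffic, which is only slightly better than the ~50% expected from random guessing.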
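
The error rates follow directly from the counts printed above. A minimal sketch of the arithmetic, with the numbers copied by hand from the One-Class SVM output and illustrative variable names:

```python
# Misclassified / total, copied from the One-Class SVM output above.
benign_errors, benign_total = 7943, 20781   # held-out benign rows flagged as outliers
malic_errors, malic_total = 9828, 21720     # malicious rows accepted as inliers

false_positive_rate = benign_errors / benign_total
false_negative_rate = malic_errors / malic_total
overall_error = (benign_errors + malic_errors) / (benign_total + malic_total)

print(f"benign flagged as anomalous:  {false_positive_rate:.1%}")
print(f"malicious accepted as normal: {false_negative_rate:.1%}")
print(f"overall test error:           {overall_error:.1%}")
```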

### Isolation Forest

In [18]:
start = time.time()
rng = np.random.RandomState(42)

# fit the model
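# note: the `behaviour` argument was removed in scikit-learn 0.24
# (its 'new' setting became the default), so it is omitted here
clf = IsolationForest(max_samples=1000,
                      random_state=rng, contamination=0.01)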
clf.fit(train_features_set)
print(time.time() - start)

2.677292585372925


In [19]:
y_pred_train = clf.predict(train_features_set)
y_pred_test = clf.predict(test_features_benign)
y_pred_outliers = clf.predict(test_features_malic)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

In [20]:
print(str(n_error_train) + '/' + str(len(train_features_set)))
print(str(n_error_test) + '/' + str(len(test_features_benign)))
print(str(n_error_outliers) + '/' + str(len(test_features_malic)))

404/40340
204/20781
21607/21720


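Isolation Forest flagged only ~1% of the 'Benign' samples as anomalous, on both the training and the held-out sets, but it also flagged only ~0.5% of the malicious samples, so ~99.5% of the attacks were accepted as normal. The ~1% flag rate on benign data matches the `contamination=0.01` setting, which determines the fraction of training points treated as outliers. With these features and this threshold, Isolation Forest therefore does not separate malicious from benign network activity.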
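
The role of `contamination` can be illustrated on synthetic data. This is a sketch under stated assumptions: the Gaussian "normal" data and uniform "anomalies" are made up, and the ROC AUC check is an additional, threshold-independent evaluation that is not part of the original notebook. `contamination` only places the decision threshold on the anomaly scores, whereas `score_samples` exposes the underlying ranking.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)

# Synthetic data for illustration only.
X_train = rng.randn(2000, 2)                             # "normal" training points
X_normal_new = rng.randn(500, 2)                         # unseen "normal" points
X_anomalies = rng.uniform(low=6, high=8, size=(500, 2))  # far-away "anomalies"

clf = IsolationForest(max_samples=1000, contamination=0.01, random_state=rng)
clf.fit(X_train)

# The threshold is chosen so that about `contamination` of the
# training points are predicted as outliers (-1).
train_flagged = (clf.predict(X_train) == -1).mean()

# score_samples(): lower means more abnormal, so negate it to get
# a score where higher means more anomalous.
X_eval = np.vstack([X_normal_new, X_anomalies])
y_true = np.r_[np.zeros(len(X_normal_new)), np.ones(len(X_anomalies))]
auc = roc_auc_score(y_true, -clf.score_samples(X_eval))

print(f"fraction of training points flagged: {train_flagged:.3f}")
print(f"ROC AUC on the evaluation set:       {auc:.3f}")
```

Applying the same kind of score-based check to the real features would show whether the poor result above comes from the threshold alone or from the anomaly scores themselves failing to rank attacks above benign traffic.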