## Anomaly detection using [KDDCUP](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) data Part 2

This notebook builds on the previous one, which used the dataset to build a benchmark classification model that distinguishes normal activity from an intrusion. This time we build a multi-class classification model to determine which type of intrusion is occurring, and we also look at how the model performs when it meets an unseen attack.

For *seen attacks*, our approach is:
1. Use a machine learning algorithm combined with an optimization method to identify the type of intrusion

For *unseen attacks*, our approach is two-fold:
1. First determine whether the sample is normal or not
2. Produce a probability distribution over the previously seen intrusions, showing which ones it most resembles
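
The two-fold idea for unseen attacks can be sketched on synthetic data. This is only an illustration under stated assumptions: the clusters, the names `detector`, `typer`, and `two_stage_predict`, and the use of `LogisticRegression` in place of the SVM are all hypothetical stand-ins, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the KDD features: one "normal" cluster
# and two known attack clusters (all values are made up).
normal = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
attack_a = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
attack_b = rng.normal(loc=[5.0, -5.0], scale=0.5, size=(50, 2))

X = np.vstack([normal, attack_a, attack_b])
is_attack = np.array([0] * 50 + [1] * 100)
attack_type = np.array(["attack_a"] * 50 + ["attack_b"] * 50)

# Stage 1: normal vs. intrusion, trained on every row.
detector = LogisticRegression().fit(X, is_attack)
# Stage 2: which known attack, trained on the attack rows only.
typer = LogisticRegression().fit(X[is_attack == 1], attack_type)

def two_stage_predict(samples):
    """Flag intrusions, then score flagged rows against the known attack types."""
    flagged = detector.predict(samples).astype(bool)
    similarity = np.zeros((len(samples), len(typer.classes_)))
    if flagged.any():
        similarity[flagged] = typer.predict_proba(samples[flagged])
    return flagged, similarity

# Two hypothetical samples: one near the attack clusters, one near the normal cluster.
unseen = np.array([[6.0, 3.0], [0.1, -0.2]])
flagged, similarity = two_stage_predict(unseen)
print(flagged)
print(dict(zip(typer.classes_, similarity[0])))
```

Rows that stage 1 does not flag keep an all-zero similarity row, so the two outputs can be read side by side.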

#### Load the data

In [None]:
import os
import pandas as pd

# Load the 10% subset from the current working directory (the file has no header row)
df = pd.read_csv(os.path.join(os.getcwd(), 'kddcup.data_10_percent.gz'), compression='gzip', header=None)
df.columns = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df.head()

## Data Pre-processing

#### One hot encoding

In [None]:
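# One-hot encode every categorical (object-typed) column except the target
df = pd.get_dummies(data=df, columns=df.loc[:, df.dtypes == object].columns.drop('label'))
# Map each label string to a contiguous integer code in order of first appearance;
# label_names[code] recovers the original string
df['label'], label_names = pd.factorize(df['label'])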
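
To make the encoding step concrete, here is a minimal sketch on a toy frame. The three rows are invented for illustration, and `pd.factorize` is shown as one possible way to turn label strings into integer codes.

```python
import pandas as pd

# Toy rows, invented for illustration; the real data has 41 features plus the label.
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "src_bytes": [181, 239, 0],
    "label": ["normal.", "smurf.", "normal."],
})

# One-hot encode the categorical feature; numeric columns and the label pass through.
encoded = pd.get_dummies(toy, columns=["protocol_type"])

# Integer codes in order of first appearance, plus the lookup back to the names.
codes, classes = pd.factorize(encoded["label"])
encoded["label"] = codes

print(encoded)
print(list(classes))
```

Keeping `classes` around makes it possible to translate predicted integer codes back into attack names later.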

#### Split dataset

In [None]:
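from sklearn.model_selection import train_test_split

# Drop the target from the features; otherwise the model is trained on the label itself
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='label'),
                                                    df.label, test_size=0.30,
                                                    random_state=101)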

#### Fit SVM

In [None]:
from sklearn import svm
poly = svm.SVC(C=1, # Regularization parameter (smaller values regularize more strongly)
               kernel='poly', # Use a polynomial kernel
               degree=23, # Degree of the polynomial kernel (scikit-learn's default is 3)
               probability=True, # Enables predict_proba; uses internal cross-validation, so fit is slower
               verbose=True, # Print progress during training
              ).fit(X_train, y_train)

In [None]:
poly_pred = poly.predict(X_test)
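
One way to inspect multi-class predictions such as `poly_pred` is an overall accuracy plus a confusion matrix. The sketch below uses small made-up arrays (`y_true`, `y_hat`) rather than the notebook's results, so the numbers say nothing about the actual model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for three classes; stand-ins for y_test and poly_pred.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

# Fraction of samples whose predicted class matches the true class.
acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])

print(acc)
print(cm)
```

Off-diagonal cells show which intrusion types get confused with each other, which overall accuracy alone hides when the classes are imbalanced.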
