# <center> Intrusion Detection Using SVM </center>

* [Data Overview](#overview)
    - [Data Reading](#read)
    - [Dataset Info](#info)
* [Data Cleaning](#clean)
    - [Column Renaming](#re)
    - [Binary Target Conversion](#bin)
    - [Categorical Encoding](#enc)
    - [Separating Target](#tar)
    - [Feature Scaling](#scale)
    - [Dimensionality Reduction](#red)
* [Modeling](#model)
    - [Linear Kernel](#lin)
    - [Polynomial Kernel](#poly)
    - [RBF Kernel](#rbf)
    - [Hyperparameter Tuning](#tun)
        - [Grid Search](#grid)
        - [Randomized Search](#random)
        - [Best Model Comparison](#compare)



In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats  # explicit submodule import so scipy.stats.expon is available
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

sns.set_palette('dark')
sns.set_style('darkgrid')

## Data Overview <a id='overview'></a>

### Data Reading <a id='read'></a>

In [2]:
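# Note the assignment: `train` is read from NSL_KDD_Test.csv (22,543 rows) and
# `test` from NSL_KDD_Train.csv (125,972 rows); all outputs below follow from it.
train = pd.read_csv('../input/nslkdd/NSL_KDD_Test.csv')
test = pd.read_csv('../input/nslkdd/NSL_KDD_Train.csv')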

train.head()

Unnamed: 0,0,tcp,private,REJ,0.1,0.2,0.3,0.4,0.5,0.6,...,10.1,0.04.1,0.06.1,0.22,0.23,0.24,0.25,1.2,1.3,neptune
0,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.0,0.06,0.0,0.0,0.0,0.0,1.0,1.0,neptune
1,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.0,0.0,0.0,0.0,normal
2,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.0,0.0,1.0,0.28,0.0,0.0,0.0,0.0,saint
3,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.0,0.0,0.83,0.71,mscan
4,0,tcp,http,SF,267,14515,0,0,0,0,...,255,1.0,0.0,0.01,0.03,0.01,0.0,0.0,0.0,normal


### Dataset Info <a id='info'></a>

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22543 entries, 0 to 22542
Data columns (total 42 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   0        22543 non-null  int64  
 1   tcp      22543 non-null  object 
 2   private  22543 non-null  object 
 3   REJ      22543 non-null  object 
 4   0.1      22543 non-null  int64  
 5   0.2      22543 non-null  int64  
 6   0.3      22543 non-null  int64  
 7   0.4      22543 non-null  int64  
 8   0.5      22543 non-null  int64  
 9   0.6      22543 non-null  int64  
 10  0.7      22543 non-null  int64  
 11  0.8      22543 non-null  int64  
 12  0.9      22543 non-null  int64  
 13  0.10     22543 non-null  int64  
 14  0.11     22543 non-null  int64  
 15  0.12     22543 non-null  int64  
 16  0.13     22543 non-null  int64  
 17  0.14     22543 non-null  int64  
 18  0.15     22543 non-null  int64  
 19  0.16     22543 non-null  int64  
 20  0.17     22543 non-null  int64  
 21  0.18     225

In [4]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125972 entries, 0 to 125971
Data columns (total 42 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   0         125972 non-null  int64  
 1   tcp       125972 non-null  object 
 2   ftp_data  125972 non-null  object 
 3   SF        125972 non-null  object 
 4   491       125972 non-null  int64  
 5   0.1       125972 non-null  int64  
 6   0.2       125972 non-null  int64  
 7   0.3       125972 non-null  int64  
 8   0.4       125972 non-null  int64  
 9   0.5       125972 non-null  int64  
 10  0.6       125972 non-null  int64  
 11  0.7       125972 non-null  int64  
 12  0.8       125972 non-null  int64  
 13  0.9       125972 non-null  int64  
 14  0.10      125972 non-null  int64  
 15  0.11      125972 non-null  int64  
 16  0.12      125972 non-null  int64  
 17  0.13      125972 non-null  int64  
 18  0.14      125972 non-null  int64  
 19  0.15      125972 non-null  int64  
 20  0.16

**Column names and values taken from the dataset description:**

**duration**: real

**protocol_type**: {'tcp','udp', 'icmp'} 

**service**: {'aol', 'auth', 'bgp', 'courier', 'csnet_ns', 'ctf', 'daytime', 'discard', 'domain', 'domain_u', 'echo', 'eco_i', 'ecr_i', 'efs', 'exec', 'finger', 'ftp', 'ftp_data', 'gopher', 'harvest', 'hostnames', 'http', 'http_2784', 'http_443', 'http_8001', 'imap4', 'IRC', 'iso_tsap', 'klogin', 'kshell', 'ldap', 'link', 'login', 'mtp', 'name', 'netbios_dgm', 'netbios_ns', 'netbios_ssn', 'netstat', 'nnsp', 'nntp', 'ntp_u', 'other', 'pm_dump', 'pop_2', 'pop_3', 'printer', 'private', 'red_i', 'remote_job', 'rje', 'shell', 'smtp', 'sql_net', 'ssh', 'sunrpc', 'supdup', 'systat', 'telnet', 'tftp_u', 'tim_i', 'time', 'urh_i', 'urp_i', 'uucp', 'uucp_path', 'vmnet', 'whois', 'X11', 'Z39_50'} 

**flag**: { 'OTH', 'REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S0', 'S1', 'S2', 'S3', 'SF', 'SH' }

**src_bytes**: real

**dst_bytes**: real

**land**: {'0', '1'}

**wrong_fragment**: real

**urgent**: real

**hot**: real

**num_failed_logins**: real

**logged_in**: {'0', '1'}

**num_compromised**: real

**root_shell**: real

**su_attempted**: real

**num_root**: real

**num_file_creations**: real

**num_shells**: real

**num_access_files**: real

**num_outbound_cmds**: real

**is_host_login**: {'0', '1'}

**is_guest_login**: {'0', '1'}

**count**: real

**srv_count**: real

**serror_rate**: real

**srv_serror_rate**: real

**rerror_rate**: real

**srv_rerror_rate**: real

**same_srv_rate**: real

**diff_srv_rate**: real

**srv_diff_host_rate**: real

**dst_host_count**: real

**dst_host_srv_count**: real

**dst_host_same_srv_rate**: real

**dst_host_diff_srv_rate**: real

**dst_host_same_src_port_rate**: real

**dst_host_srv_diff_host_rate**: real

**dst_host_serror_rate**: real

**dst_host_srv_serror_rate**: real

**dst_host_rerror_rate**: real

**dst_host_srv_rerror_rate**: real

<font color='green'> **class**: {'normal', 'neptune', 'warezclient', 'ipsweep', 'portsweep',
       'teardrop', 'nmap', 'satan', 'smurf', 'pod', 'back',
       'guess_passwd', 'ftp_write', 'multihop', 'rootkit',
       'buffer_overflow', 'imap', 'warezmaster', 'phf', 'land',
       'loadmodule', 'spy', 'perl'} </font>

The column names need to be replaced, since they are inconsistent and uninformative. Moreover, our goal is intrusion detection, so the class feature should be binary: we do not care about the type of intrusion, so all attack types get the same label, while "normal" gets a different one.

Since our goal is to fit an SVM, we need to scale the features so that those with large ranges (such as `src_bytes`) do not dominate the kernel's distance computations. We also reduce the number of features using PCA to mitigate the curse of dimensionality.
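
The scale, reduce, fit sequence described here can also be chained in a single estimator. The following is a minimal sketch on synthetic data, not the notebook's own code: `make_classification`, the 200/100 split and `n_components=5` are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the intrusion data (assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Each step is fitted on the training rows only, then applied to held-out rows.
pipe = make_pipeline(MinMaxScaler(), PCA(n_components=5), SVC(kernel="rbf"))
pipe.fit(X_demo[:200], y_demo[:200])

score = pipe.score(X_demo[200:], y_demo[200:])  # accuracy on the held-out rows
print(round(score, 2))
```

Wrapping the steps in a pipeline keeps the scaler and PCA statistics tied to the training data, which is the same discipline the cells below apply by hand with `fit_transform` on the training set and `transform` on the test set.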

## Data Cleaning <a id='clean'></a>

### Column Renaming <a id='re'></a>

In [5]:
columns = ['duration', 'protocol_type', 'service', 'flag',
           'src_bytes', 'dst_bytes', 'land', 'wrong_fragment',
           'urgent', 'hot', 'num_failed_logins', 'logged_in',
           'num_compromised', 'root_shell', 'su_attempted',
           'num_root', 'num_file_creations', 'num_shells',
           'num_access_files', 'num_outbound_cmds', 'is_host_login',
           'is_guest_login', 'count', 'srv_count', 'serror_rate',
           'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
           'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
           'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate',
           'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
           'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
           'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
           'dst_host_srv_rerror_rate', 'class']

In [6]:
train.columns = columns
test.columns = columns

train.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.0,0.06,0.0,0.0,0.0,0.0,1.0,1.0,neptune
1,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.0,0.0,0.0,0.0,normal
2,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.0,0.0,1.0,0.28,0.0,0.0,0.0,0.0,saint
3,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.0,0.0,0.83,0.71,mscan
4,0,tcp,http,SF,267,14515,0,0,0,0,...,255,1.0,0.0,0.01,0.03,0.01,0.0,0.0,0.0,normal
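
The column labels reported by `info()` (`0`, `tcp`, `private`, ...) suggest that the CSV files have no header row, so `read_csv` used the first record as labels. Below is a minimal sketch of the difference, assuming header-less files; the two-row sample and the shortened `names` list are hypothetical stand-ins for the real data.

```python
import io
import pandas as pd

# Hypothetical two-record sample standing in for a header-less CSV file.
raw = "0,tcp,private,REJ,neptune\n2,tcp,ftp_data,SF,normal\n"
names = ["duration", "protocol_type", "service", "flag", "class"]

# Default behaviour: the first record is consumed as the header row.
lossy = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= assigns the labels at read time.
full = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(len(lossy), len(full))  # 1 2
```

With `header=None, names=columns` the renaming step and the read happen in one call, and no record is spent on the header.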


### Binary Target Conversion <a id='bin'></a>

In [7]:
train['class'].unique()


array(['neptune', 'normal', 'saint', 'mscan', 'guess_passwd', 'smurf',
       'apache2', 'satan', 'buffer_overflow', 'back', 'warezmaster',
       'snmpgetattack', 'processtable', 'pod', 'httptunnel', 'nmap', 'ps',
       'snmpguess', 'ipsweep', 'mailbomb', 'portsweep', 'multihop',
       'named', 'sendmail', 'loadmodule', 'xterm', 'worm', 'teardrop',
       'rootkit', 'xlock', 'perl', 'land', 'xsnoop', 'sqlattack',
       'ftp_write', 'imap', 'udpstorm', 'phf'], dtype=object)

In [8]:
train['class'] = (train['class'] != "normal") * 1.0
test['class'] = (test['class'] != "normal") * 1.0

train.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.0,0.06,0.0,0.0,0.0,0.0,1.0,1.0,1.0
1,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.0,0.0,0.0,0.0,0.0
2,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.0,0.0,1.0,0.28,0.0,0.0,0.0,0.0,1.0
3,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.0,0.0,0.83,0.71,1.0
4,0,tcp,http,SF,267,14515,0,0,0,0,...,255,1.0,0.0,0.01,0.03,0.01,0.0,0.0,0.0,0.0
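
An equivalent way to write the conversion, shown here as a small sketch on a hand-made label series (the five labels are taken from the `head()` output above), is to compare with `ne` and cast to integers, then check the resulting class balance.

```python
import pandas as pd

# Five labels as they appear in the first rows of the table above.
labels = pd.Series(["neptune", "normal", "saint", "mscan", "normal"])

# 1 marks any attack type, 0 marks normal traffic.
is_attack = labels.ne("normal").astype(int)

print(is_attack.tolist())                       # [1, 0, 1, 1, 0]
print(is_attack.value_counts(normalize=True))   # share of each class
```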


### Categorical Encoding <a id='enc'></a>

In [9]:
categorical = ['protocol_type', 'service', 'flag']

for cat in categorical:
    print(cat, "train:", train[cat].unique(),"\n\n", cat, "test:", test[cat].unique(),
          "\n\n============================================\n")

protocol_type train: ['tcp' 'icmp' 'udp'] 

 protocol_type test: ['udp' 'tcp' 'icmp'] 


service train: ['private' 'ftp_data' 'eco_i' 'telnet' 'http' 'smtp' 'ftp' 'ldap' 'pop_3'
 'courier' 'discard' 'ecr_i' 'imap4' 'domain_u' 'mtp' 'systat' 'iso_tsap'
 'other' 'csnet_ns' 'finger' 'uucp' 'whois' 'netbios_ns' 'link' 'Z39_50'
 'sunrpc' 'auth' 'netbios_dgm' 'uucp_path' 'vmnet' 'domain' 'name' 'pop_2'
 'http_443' 'urp_i' 'login' 'gopher' 'exec' 'time' 'remote_job' 'ssh'
 'kshell' 'sql_net' 'shell' 'hostnames' 'echo' 'daytime' 'pm_dump' 'IRC'
 'netstat' 'ctf' 'nntp' 'netbios_ssn' 'tim_i' 'supdup' 'bgp' 'nnsp' 'rje'
 'printer' 'efs' 'X11' 'ntp_u' 'klogin' 'tftp_u'] 

 service test: ['other' 'private' 'http' 'remote_job' 'ftp_data' 'name' 'netbios_ns'
 'eco_i' 'mtp' 'telnet' 'finger' 'domain_u' 'supdup' 'uucp_path' 'Z39_50'
 'smtp' 'csnet_ns' 'uucp' 'netbios_dgm' 'urp_i' 'auth' 'domain' 'ftp'
 'bgp' 'ldap' 'ecr_i' 'gopher' 'vmnet' 'systat' 'http_443' 'efs' 'whois'
 'imap4' 'iso_tsap' 'echo' 'k

In [10]:
whole = pd.concat([test.assign(ind="test"), train.assign(ind="train")])

for col in categorical:
    one_hot = pd.get_dummies(whole[col], drop_first=True)
    whole = whole.drop(columns=[col])
    whole = pd.concat([whole, one_hot],axis=1)
    

test, train = whole[whole["ind"].eq("test")].drop(columns="ind"),\
whole[whole["ind"].eq("train")].drop(columns="ind")
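
Concatenating both sets before calling `get_dummies` guarantees identical columns, but it also lets the test set influence the encoding. One possible alternative, sketched below on toy frames, is to fit `OneHotEncoder` on the training categories only and ignore unseen test categories. The toy data and the choice of `handle_unknown="ignore"` are assumptions, and unlike the cell above this sketch does not drop the first level.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical frames (hypothetical values for illustration).
train_cat = pd.DataFrame({"protocol_type": ["tcp", "icmp", "udp"],
                          "flag": ["REJ", "SF", "SF"]})
test_cat = pd.DataFrame({"protocol_type": ["udp", "tcp"],
                         "flag": ["S0", "SF"]})  # "S0" never appears in train_cat

# Categories are learned from the training frame only.
enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train_cat).toarray()

# An unseen category is encoded as all zeros for that feature.
test_enc = enc.transform(test_cat).toarray()

print(enc.categories_)
print(test_enc)
```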

### Separating Target <a id='tar'></a>

In [11]:
y_train, y_test = train['class'], test['class']
X_train, X_test = train.drop(columns=['class']), test.drop(columns=['class'])

### Feature Scaling <a id='scale'></a>

In [12]:
ss = MinMaxScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
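
Because the scaler learns its minimum and maximum from the training set only, transformed test values are not guaranteed to stay inside [0, 1]. A small sketch with made-up numbers shows the effect; clipping afterwards with `np.clip` is one optional way to bound them, not something the notebook does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: the test set contains a value above the training max.
X_tr = np.array([[0.0], [10.0], [20.0]])
X_te = np.array([[5.0], [30.0]])

scaler = MinMaxScaler().fit(X_tr)      # learns min=0 and max=20 from training rows
scaled_te = scaler.transform(X_te)     # (x - min) / (max - min)
print(scaled_te.ravel())               # values 0.25 and 1.5

# Optional: bound the transformed values to the training range.
clipped = np.clip(scaled_te, 0.0, 1.0)
print(clipped.ravel())
```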

### Dimensionality Reduction <a id='red'></a>

In [13]:
pca = PCA()
pca.fit(X_train)

explained_variance = pd.Series(pca.explained_variance_ratio_).apply(lambda x: '%.15f' % x)
explained_variance.values

array(['0.370630907695916', '0.142926409889068', '0.111884157673804',
       '0.055637880481352', '0.036518296906625', '0.036001437293584',
       '0.029027504874700', '0.024292679235707', '0.019873276980867',
       '0.016762396076397', '0.016610165760644', '0.013106164554563',
       '0.012703100596021', '0.011851044858280', '0.010636520070133',
       '0.008402246410946', '0.007881178185683', '0.006911783492382',
       '0.006107279855784', '0.005144040833769', '0.005116248243070',
       '0.004231510059533', '0.003720133070979', '0.003427613907724',
       '0.002904209264885', '0.002621606838305', '0.002389424296031',
       '0.002257159863088', '0.002007326780320', '0.001836228509059',
       '0.001489671156833', '0.001444706235414', '0.000954866897955',
       '0.000854128796458', '0.000774622244797', '0.000720790690272',
       '0.000645615471177', '0.000602819329319', '0.000580068036118',
       '0.000564140805677', '0.000545303707449', '0.000537907026165',
       '0.0005232192

We can see that the first 3 components explain only about 63% of the variance, while the first 12 explain about 87%. So let's keep 12 components and reduce the training and testing sets.

In [14]:
pca = PCA(n_components=12)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
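
The number of components to keep can be read off the cumulative explained variance. The sketch below uses the first twelve ratios from the output above, rounded to four decimals; `components_for` is a hypothetical helper name, and the 0.80 threshold is only an example.

```python
import numpy as np

# First twelve explained-variance ratios from the PCA output above (rounded).
ratios = np.array([0.3706, 0.1429, 0.1119, 0.0556, 0.0365, 0.0360,
                   0.0290, 0.0243, 0.0199, 0.0168, 0.0166, 0.0131])
cumulative = np.cumsum(ratios)

def components_for(threshold, cumulative):
    """Smallest number of leading components whose cumulative ratio reaches
    `threshold` (the threshold must be reachable within `cumulative`)."""
    return int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative[2])    # share explained by the first 3 components
print(cumulative[11])   # share explained by the first 12 components
print(components_for(0.80, cumulative))
```

scikit-learn can also make this choice directly: passing a float between 0 and 1 as `n_components` keeps as many components as are needed to reach that share of variance.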

## Modeling <a id='model'></a>

### Linear Kernel <a id='lin'></a>

In [15]:
lin = SVC(kernel='linear')
lin.fit(X_train, y_train)

y_pred = lin.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.96      0.75      0.84     67342
         1.0       0.77      0.96      0.86     58630

    accuracy                           0.85    125972
   macro avg       0.86      0.86      0.85    125972
weighted avg       0.87      0.85      0.85    125972



### Polynomial Kernel <a id='poly'></a>

In [16]:
poly = SVC(kernel='poly')
poly.fit(X_train, y_train)

y_pred = poly.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.93      0.86      0.90     67342
         1.0       0.86      0.93      0.89     58630

    accuracy                           0.89    125972
   macro avg       0.89      0.90      0.89    125972
weighted avg       0.90      0.89      0.89    125972



### RBF Kernel <a id='rbf'></a>

In [17]:
rbf = SVC(kernel='rbf')
rbf.fit(X_train, y_train)

y_pred = rbf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.93      0.86      0.89     67342
         1.0       0.85      0.93      0.89     58630

    accuracy                           0.89    125972
   macro avg       0.89      0.89      0.89    125972
weighted avg       0.89      0.89      0.89    125972



### Hyperparameter Tuning <a id='tun'></a>

We will only use a fraction of our dataset (the first 10,000 rows of the training set) to speed up the search.

#### Grid Search <a id='grid'></a>

In [18]:
svm = SVC()
param_grid = {'C':[1,10,100,1000],'gamma':[1,0.1,0.001,0.0001], 'kernel':['linear','rbf']}

grid = GridSearchCV(svm,param_grid,refit = True)
grid.fit(X_train[:10000],y_train[:10000])

grid.best_params_

{'C': 100, 'gamma': 1, 'kernel': 'rbf'}
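
Slicing with `[:10000]` takes rows in file order, which may not reflect the overall class ratio. A hedged alternative is to draw a stratified random subsample with `train_test_split`; the sketch below uses synthetic arrays, and the sizes and the 30% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))             # synthetic stand-in for the PCA features
y_demo = (rng.random(1000) < 0.3).astype(int)   # synthetic labels, roughly 30% positives

# train_size selects the subsample; stratify keeps the class ratio of y_demo.
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=200, stratify=y_demo, random_state=0
)

print(y_demo.mean(), y_sub.mean())  # the two positive rates should be close
```

The subsample can then be passed to `GridSearchCV.fit` in place of the sliced arrays.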

#### Randomized Search <a id='random'></a>

In [19]:
# Scale parameters of the exponential distributions for C and gamma.
c_scales = [.1, 1, 10, 100, 1000]
gamma_scales = [.01, 1, 10]

# One RBF search space per (gamma scale, C scale) pair, 15 in total.
distributions = [
    {
        "C": scipy.stats.expon(scale=c_scale),
        "gamma": scipy.stats.expon(scale=gamma_scale),
        "kernel": ["rbf"],
    }
    for gamma_scale in gamma_scales
    for c_scale in c_scales
]

# A single polynomial search space over the degree.
distributions.append({
    "C": scipy.stats.expon(scale=10),
    "degree": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "kernel": ["poly"],
})
clf = RandomizedSearchCV(svm, distributions, random_state=0, cv=5)

clf.fit(X_train[:10000], y_train[:10000])
clf.best_params_

{'C': 8.985603929935616, 'gamma': 18.59604240911851, 'kernel': 'rbf'}
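
Note that `RandomizedSearchCV` draws only `n_iter` candidates (10 by default), so not every search space in the list is necessarily sampled. When `C` and `gamma` span several orders of magnitude, a single log-uniform distribution is a common alternative to many exponential ones. The following is a sketch under assumptions, not the notebook's method: the synthetic data, the bounds and `cv=3` are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic problem so the search runs quickly (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# loguniform samples evenly across orders of magnitude between the two bounds.
param_dist = {
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-2, 1e1),
    "kernel": ["rbf"],
}

search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=3, random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_)
```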

#### Best Model Comparison <a id='compare'></a>

In [20]:
grid = SVC(C=1000, gamma=1, kernel='rbf')
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.95      0.87      0.91     67342
         1.0       0.87      0.95      0.90     58630

    accuracy                           0.91    125972
   macro avg       0.91      0.91      0.91    125972
weighted avg       0.91      0.91      0.91    125972



In [21]:
rand = SVC(C=1044.612, gamma=0.459, kernel='rbf')
rand.fit(X_train, y_train)

y_pred = rand.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.94      0.89      0.92     67342
         1.0       0.88      0.94      0.91     58630

    accuracy                           0.91    125972
   macro avg       0.91      0.91      0.91    125972
weighted avg       0.91      0.91      0.91    125972



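As seen, both tuned models reach 0.91 accuracy on the test set. The grid search model has slightly higher precision on class 0 (0.95 vs 0.94) and recall on class 1 (0.95 vs 0.94), while the random search model is more balanced, with slightly higher F1 scores on both classes (0.92 and 0.91 vs 0.91 and 0.90).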
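
To compare models without reading the printed reports by eye, the same metrics can be pulled out as numbers with `output_dict=True`. The sketch below uses tiny hand-made label vectors; `summarize`, `pred_a` and `pred_b` are hypothetical names, not results from the models above.

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 1, 1, 1]
pred_a = [0, 0, 1, 1, 1, 1]   # hypothetical predictions of model A
pred_b = [0, 0, 0, 1, 1, 0]   # hypothetical predictions of model B

def summarize(y_true, y_pred):
    """Return accuracy and macro-averaged F1 from the classification report."""
    report = classification_report(y_true, y_pred, output_dict=True)
    return {
        "accuracy": report["accuracy"],
        "macro_f1": report["macro avg"]["f1-score"],
    }

summary_a = summarize(y_true, pred_a)
summary_b = summarize(y_true, pred_b)
print(summary_a)
print(summary_b)
```

Unrounded values make it easier to tell whether two models that both print as 0.91 actually differ.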