<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_14_04_ids_kdd99.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training an Intrusion Detection System with KDD99

The [KDD-99 dataset](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) is well known in the security field and serves as something of a "hello world" for machine-learning-based intrusion detection. An intrusion detection system (IDS) is a program that monitors computers and networks for malicious activity or policy violations. Any detected intrusion or violation is typically reported to an administrator or collected centrally. IDS deployments range in scope from single computers to large networks.

Although the KDD99 dataset is over 20 years old, it is still widely used to demonstrate intrusion detection systems. It was the dataset for the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector: a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. The dataset contains a standard set of data to be audited, including various intrusions simulated in a military network environment.

## Read in Raw KDD-99 Dataset

The following code reads the KDD99 CSV dataset into a Pandas data frame. The standard format of KDD99 does not include column names, so this notebook downloads a copy of the file (`kdd-with-columns.csv`) to which they have been added.

In [None]:
import pandas as pd
from tensorflow.keras.utils import get_file

pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 5)

try:
    path = get_file(
        'kdd-with-columns.csv',
        origin='https://github.com/jeffheaton/jheaton-ds2/raw/main/'
               'kdd-with-columns.csv',
        archive_format=None)
except Exception:
    print('Error downloading')
    raise

print(path)

# Original file: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
df = pd.read_csv(path)

print("Read {} rows.".format(len(df)))
# Uncomment the next line to sample only 10% of the dataset
# df = df.sample(frac=0.1, replace=False)

# For now, just drop columns that contain missing values (axis=1)
df.dropna(inplace=True, axis=1)

# display the data frame (up to 20 rows and 20 columns)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 20)
df

/root/.keras/datasets/kdd-with-columns.csv
Read 494021 rows.


| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 138403 | 0 | tcp | http | SF | 272 | 171 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 0.5 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 10576 | 0 | icmp | ecr_i | SF | 1032 | 0 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 1.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | smurf. |
| 412704 | 0 | icmp | ecr_i | SF | 520 | 0 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 1.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | smurf. |
| 180227 | 0 | icmp | ecr_i | SF | 1032 | 0 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 1.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | smurf. |
| 108748 | 0 | tcp | private | S0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 6 | 0.02 | 0.07 | 0.0 | 0.00 | 1.0 | 1.0 | 0.0 | 0.0 | neptune. |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 320466 | 0 | icmp | ecr_i | SF | 1032 | 0 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 1.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | smurf. |
| 337133 | 0 | icmp | ecr_i | SF | 1032 | 0 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 1.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | smurf. |
| 446201 | 0 | icmp | ecr_i | SF | 520 | 0 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 1.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | smurf. |
| 422894 | 0 | icmp | ecr_i | SF | 520 | 0 | 0 | 0 | 0 | 0 | ... | 255 | 1.00 | 0.00 | 1.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | smurf. |
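
If you start from a header-less file instead, pandas can attach the column names while reading. The sketch below is illustrative only: the two rows are trimmed to six fields plus the label, and the `names` list is a hypothetical subset chosen for this example, not the full KDD99 schema.

```python
import io
import pandas as pd

# Two rows in a header-less layout, trimmed to six fields plus the
# label for illustration (the real file has 42 fields per row).
raw = io.StringIO(
    "0,tcp,http,SF,272,171,normal.\n"
    "0,icmp,ecr_i,SF,1032,0,smurf.\n"
)

# Hypothetical subset of column names used only for this sketch.
names = ["duration", "protocol_type", "service", "flag",
         "src_bytes", "dst_bytes", "outcome"]

# header=None tells pandas that the first line is data, not a header;
# names= supplies the column labels.
sample = pd.read_csv(raw, header=None, names=names)
print(sample)
```

Without `header=None`, pandas would treat the first data row as the header and silently lose one record.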


## Analyzing a Dataset

Before we preprocess the KDD99 dataset, let's look at the individual columns and distributions.  You can use the following script to give a high-level overview of how a dataset appears.

In [None]:
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

def expand_categories(values):
    result = []
    s = values.value_counts()
    t = float(len(values))
    for v in s.index:
        result.append("{}:{}%".format(v,round(100*(s[v]/t),2)))
    return "[{}]".format(",".join(result))

def analyze(df):
    print()
    cols = df.columns.values
    total = float(len(df))

    print("{} rows".format(int(total)))
    for col in cols:
        uniques = df[col].unique()
        unique_count = len(uniques)
        if unique_count > 100:
            # Too many values to list: show the count and its share of rows
            print("** {}:{} ({}%)".format(
                col, unique_count, int((unique_count / total) * 100)))
        else:
            # Few values: list each one with its percentage of rows
            print("** {}:{}".format(col, expand_categories(df[col])))

The analysis looks at how many unique values are present in each column. For example, duration, a numeric value, has 2495 unique values; the percentage in parentheses is the number of unique values relative to the number of rows, which here truncates to 0%. A text/categorical column such as protocol_type has only a few unique values, so the program lists each value with the percentage of rows it covers. To save display space, columns with more than 100 unique values show only their unique count rather than per-value percentages.

In [None]:
# Analyze KDD-99
analyze(df)


494021 rows
** duration:2495 (0%)
** src_bytes:3300 (0%)
** dst_bytes:10725 (2%)
** wrong_fragment:[-0.047720137096623795:99.75%,22.206606340509904:0.2%,7.370388688772218:0.05%]
** urgent:[-0.002571465497564627:100.0%,181.47713661627023:0.0%,362.956844698038:0.0%,544.4365527798058:0.0%]
** hot:[-0.04413586697647766:99.35%,2.5130734380826283:0.44%,35.756794403851:0.06%,1.2344687855530754:0.05%,5.070282743141734:0.02%,7.627492048200841:0.02%,6.348887395671287:0.01%,3.791678090612181:0.01%,17.856329268437264:0.01%,38.314003708910114:0.01%,28.08516648867369:0.01%,24.24935253108503:0.0%,30.642375793732793:0.0%,22.970747878555475:0.0%,25.527957183614582:0.0%,8.906096700730394:0.0%,21.692143226025923:0.0%,15.299119963378159:0.0%,20.41353857349637:0.0%,12.741910658319053:0.0%,19.134933920966816:0.0%,11.4633060057895:0.0%]
** num_failed_logins:[-0.009782174730841357:99.99%,64.42488106133553:0.01%,128.85954429740192:0.0%,322.16353400560104:0.0%,257.72887076953464:0.0%,193.29420753346827:0.0%]
*
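
Much of the same summary can be approximated with built-in pandas calls. The following is a minimal sketch on a made-up data frame (the `toy` values are invented for illustration), not a replacement for the `analyze` function above.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "protocol_type": ["tcp", "icmp", "icmp", "udp", "icmp"],
    "src_bytes": [272, 1032, 520, 105, 1032],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()

# Share of rows per category, as percentages rounded to 2 places.
shares = (toy["protocol_type"].value_counts(normalize=True) * 100).round(2)

print(unique_counts)
print(shares)
```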

## Encode the feature vector

We use the same two functions provided earlier to preprocess the data. The first converts a numeric column to Z-Scores, and the second creates dummy variables from a categorical column.

In [None]:
# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd

# Encode text values to dummy variables(i.e. [1,0,0],
# [0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
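
To make the effect of these two encodings concrete, here is a small worked example on a made-up three-row data frame. It repeats the same arithmetic inline rather than calling the functions above, so it can run on its own; the `toy` data is invented for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "src_bytes": [100.0, 200.0, 300.0],
    "protocol_type": ["tcp", "udp", "tcp"],
})

# Z-score: subtract the mean, then divide by the sample standard
# deviation (pandas .std() uses ddof=1).
mean = toy["src_bytes"].mean()   # 200.0
sd = toy["src_bytes"].std()      # 100.0
toy["src_bytes"] = (toy["src_bytes"] - mean) / sd

# Dummy variables: one indicator column per category, named
# "column-value" to mirror encode_text_dummy.
dummies = pd.get_dummies(toy["protocol_type"])
for value in dummies.columns:
    toy[f"protocol_type-{value}"] = dummies[value]
toy = toy.drop(columns="protocol_type")

print(toy)
```

The numeric column becomes -1, 0, 1, and the text column is replaced by `protocol_type-tcp` and `protocol_type-udp` indicator columns.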


Again, just as we did for anomaly detection, we preprocess the data set.  We convert all numeric values to Z-Score and translate all categorical to dummy variables.


In [None]:
# Now encode the feature vector

pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 5)

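# Iterate over a snapshot of the column names, because the encoders
# add and drop columns as they go.
for name in list(df.columns):
    if name == 'outcome':
        pass  # the target column is encoded separately below
    elif name in ['protocol_type', 'service', 'flag', 'land', 'logged_in',
                  'is_host_login', 'is_guest_login']:
        encode_text_dummy(df, name)
    else:
        encode_numeric_zscore(df, name)

# Drop any columns that now contain NaN (for example, a constant column
# has a standard deviation of 0, so its z-scores are undefined)
df.dropna(inplace=True, axis=1)

# display 5 rows
df[0:5]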


# Convert to numpy - Classification
x_columns = df.columns.drop('outcome')
x = df[x_columns].values
dummies = pd.get_dummies(df['outcome']) # Classification
outcomes = dummies.columns
num_classes = len(outcomes)
y = dummies.values
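
The relationship between the one-hot target matrix `y`, the `outcomes` index, and the original labels can be seen on a tiny example. This is a sketch with four made-up labels, using the same `get_dummies` approach as the cell above.

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration only.
labels = pd.Series(["normal.", "smurf.", "neptune.", "smurf."])

dummies = pd.get_dummies(labels)      # columns are sorted alphabetically
outcomes = dummies.columns            # position in this index -> class name
y = dummies.values.astype("float32")  # one row per sample, one column per class

# argmax turns a one-hot (or softmax) row back into a class index, and
# indexing `outcomes` turns that index back into a label.
class_idx = np.argmax(y, axis=1)
recovered = outcomes[class_idx]

print(list(outcomes))
print(list(recovered))
```

Keeping `outcomes` around is what lets you translate the network's predicted class indices back into attack names later.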

We will attempt to predict what type of attack is underway.  The outcome column specifies the attack type.  A value of normal indicates that there is no attack underway.  We display the outcomes; some attack types are much rarer than others.

In [None]:
df.groupby('outcome')['outcome'].count()

outcome
back.                 2203
buffer_overflow.        30
ftp_write.               8
guess_passwd.           53
imap.                   12
                     ...  
smurf.              280790
spy.                     2
teardrop.              979
warezclient.          1020
warezmaster.            20
Name: outcome, Length: 23, dtype: int64

## Train the Neural Network

We now train the neural network to classify the different KDD99 outcomes. The code provided here implements a relatively simple neural network: three small ReLU hidden layers, followed by a single-unit linear layer and a softmax output layer. We train it with the provided KDD99 data.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

# Split into train/test: 25% test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Create neural net
model = Sequential()
# Only the first layer needs input_dim; later layers infer their input size
model.add(Dense(10, input_dim=x.shape[1], activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(10, activation='relu'))
# Single linear unit ahead of the softmax output layer
model.add(Dense(1, kernel_initializer='normal'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Stop once val_loss has not improved by at least 1e-3 for 5 epochs,
# then restore the weights from the best epoch
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                        patience=5, verbose=1, mode='auto',
                        restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)


Epoch 1/1000
11579/11579 - 34s - loss: 0.1183 - val_loss: 0.0479 - 34s/epoch - 3ms/step
Epoch 2/1000
11579/11579 - 26s - loss: 0.0428 - val_loss: 0.0377 - 26s/epoch - 2ms/step
Epoch 3/1000
11579/11579 - 26s - loss: 0.0347 - val_loss: 0.0334 - 26s/epoch - 2ms/step
Epoch 4/1000
11579/11579 - 26s - loss: 0.0301 - val_loss: 0.0334 - 26s/epoch - 2ms/step
Epoch 5/1000
11579/11579 - 25s - loss: 0.0305 - val_loss: 0.0321 - 25s/epoch - 2ms/step
Epoch 6/1000
11579/11579 - 26s - loss: 0.0286 - val_loss: 0.0342 - 26s/epoch - 2ms/step
Epoch 7/1000
11579/11579 - 25s - loss: 0.0263 - val_loss: 0.0338 - 25s/epoch - 2ms/step
Epoch 8/1000
11579/11579 - 25s - loss: 0.0273 - val_loss: 0.0334 - 25s/epoch - 2ms/step
Epoch 9/1000
11579/11579 - 25s - loss: 0.0274 - val_loss: 0.0331 - 25s/epoch - 2ms/step
Epoch 10/1000
Restoring model weights from the end of the best epoch: 5.
11579/11579 - 25s - loss: 0.0254 - val_loss: 0.0361 - 25s/epoch - 2ms/step
Epoch 10: early stopping


<keras.callbacks.History at 0x7f455945cc70>
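
To see why training stopped at epoch 10 and restored epoch 5, the early-stopping rule can be replayed by hand on the validation losses from the log above. This is a simplified plain-Python illustration of the `min_delta`/`patience` logic, not Keras's actual implementation, and it uses the rounded loss values as printed.

```python
# Validation losses copied from the training log above (epochs 1-10).
val_losses = [0.0479, 0.0377, 0.0334, 0.0334, 0.0321,
              0.0342, 0.0338, 0.0334, 0.0331, 0.0361]

min_delta = 1e-3   # smallest decrease that counts as an improvement
patience = 5       # epochs without improvement before stopping

best = float("inf")
best_epoch = None
wait = 0
stopped_epoch = None

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best - min_delta:
        # Improved by more than min_delta: remember it and reset the counter.
        best, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1
        if wait >= patience:
            stopped_epoch = epoch
            break

print(f"best epoch: {best_epoch}, stopped at epoch: {stopped_epoch}")
```

Epoch 4 does not count as an improvement because it matches epoch 3 to within `min_delta`, while epoch 5 does; the five epochs after that never beat 0.0321 by the required margin.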

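We can now evaluate the neural network. On the held-out 25% split, it reaches roughly 99% accuracy. Note that this same split also served as the validation data for early stopping, and that the most common outcomes dominate the score, so accuracy alone says little about the rarest attack types.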

In [None]:
# Measure accuracy
pred = model.predict(x_test)
# print(pred)
pred = np.argmax(pred,axis=1)
y_eval = np.argmax(y_test,axis=1)
score = metrics.accuracy_score(y_eval, pred)
print("Validation score: {}".format(score))

Validation score: 0.9918384531925575
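
Because some attack types are far rarer than others, it can help to look at per-class results alongside overall accuracy. The sketch below uses made-up labels (not KDD99 predictions) to show how scikit-learn's `recall_score` and `confusion_matrix` can expose a class that is never predicted even when accuracy looks reasonable.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration: class 0 is common, class 2 is rare.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

accuracy = metrics.accuracy_score(y_true, y_hat)
# average=None returns one recall value per class.
per_class_recall = metrics.recall_score(y_true, y_hat, average=None)
# Rows are true classes, columns are predicted classes.
confusion = metrics.confusion_matrix(y_true, y_hat)

print(accuracy)
print(per_class_recall)
print(confusion)
```

In the notebook, the same calls could be applied to `y_eval` and `pred`, with the `outcomes` index mapping each class position back to its attack name.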
