<a href="https://colab.research.google.com/github/tariqsoft/content/blob/main/Anomaly_based_IDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<span style="font-size:3em; text-align:center">Information System Security -</span>

<span style="font-size:3em; text-align:center">Anomaly-based Intrusion Detection System</span>

The data is spread across 8 CSV files, each containing traffic from different attacks captured at different times. The first step is therefore to merge all files into a single pandas DataFrame.

In [8]:
import pandas as pd
import glob

In [9]:
# Saving all .csv files in folder to list.
path = "MachineLearningCVE/"
files = [file for file in glob.glob(path + "**/*.csv", recursive=True)]

In [10]:
[print(f) for f in files]

[]

In [11]:
# Reading all the CSV files into DataFrames and collecting those DataFrames in one list.

dataset = [pd.read_csv(f) for f in files]

In [12]:
# Here we can see the number of rows and columns for each table.

for d in dataset:
    print(d.shape)

In [13]:
# We already established that all tables have the same number of columns, but are they the same columns?
# This next piece of code compares the columns of each table with those of the next one, which is
# enough to show that all tables share the same columns.

same_columns = True

for i in range(len(dataset) - 1):
    if not dataset[i].columns.equals(dataset[i + 1].columns):
        same_columns = False
        print(i)
        break

same_columns

In [None]:
# Combining all tables into one dataset. This is possible since all tables have the same columns,
# as we checked in the cell above.
# Note: keep=False drops every copy of a duplicated row, not just the repeats.

dataset = pd.concat(dataset).drop_duplicates(keep=False)
dataset.reset_index(drop=True, inplace=True)

In [None]:
# By checking the shape of the dataset we can confirm that the concatenation was successful.

dataset.shape

# Preliminary data analysis

Some general information about the dataset: it contains roughly 2.5 million records across 79 columns. The data consists mostly of int64 and float64 values, except for 3 attributes of 'object' type.

The dataset contains network traffic recorded during different attacks, described by values such as port numbers, IP addresses, packet lengths, SYN/ACK/FIN/... flag counts, packet sizes and others.

In [None]:
#dataset = pd.read_csv('Dataset_clean.csv', index_col=[0])
dataset.info()

In [None]:
dataset.describe()

Upon further inspection we can see that the dataset contains 15 labels. The labels represent network/web attacks plus the BENIGN state, which is the network traffic of a normal business day.

In [None]:
# Dataset contains 15 labels.
# If whitespace has not been stripped from the column names yet, use ' Label' instead:
#print(dataset[' Label'].unique())
#len(dataset[' Label'].unique())

print(dataset['Label'].unique())
len(dataset['Label'].unique())

In [None]:
dataset.head()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

Most attack records in the dataset belong to the DDoS and DoS Hulk classes. This may pose a problem later in model training, since most other attacks are represented by very little data. Model selection will be strongly influenced by this.
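The imbalance can be quantified before plotting. The following is a minimal sketch, assuming a `Label` column like the one above; the toy label values and counts are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy labels standing in for dataset['Label'].
toy_labels = pd.Series(
    ["BENIGN"] * 6 + ["DDoS"] * 3 + ["Heartbleed"] * 1,
    name="Label",
)

# Share of each class among attack records only (BENIGN excluded).
attacks = toy_labels[toy_labels != "BENIGN"]
attack_share = attacks.value_counts(normalize=True)

print(attack_share)
```

Running the same two lines on the real `Label` column would show which attacks fall below, say, 1% of the attack records.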

In [None]:
data = dataset['Label'].where(dataset['Label'] != "BENIGN")

In [None]:
plt.figure(figsize=(15,6))
chart = sns.countplot(x=data, palette="Set1")
plt.xticks(rotation=45, horizontalalignment="right")

# Data Cleaning

This chapter contains the data cleaning code. We rename columns and remove NaN and non-finite values (-inf, inf) to get the data ready for visualization and model training.
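As a compact overview, the cleaning steps can be sketched as one function. This is only an illustration under stated assumptions: `clean_frame` is a hypothetical helper, not part of the notebook, and the toy frame holds made-up values.

```python
import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: strip spaces from column names, then drop
    rows containing NaN or +/-inf values."""
    out = df.copy()
    out.columns = [col.replace(" ", "") for col in out.columns]
    out = out.replace([np.inf, -np.inf], np.nan)
    return out.dropna().reset_index(drop=True)


# Toy frame with made-up values, mimicking the raw column naming.
toy = pd.DataFrame({
    " Flow Bytes/s": [10.0, np.inf, np.nan, 4.0],
    " Label": ["BENIGN", "DDoS", "BENIGN", "PortScan"],
})
cleaned = clean_frame(toy)
print(cleaned)
```

Unlike the cells in this chapter, the sketch also resets the index after dropping rows, so that label-based and positional indexing agree afterwards.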

## Renaming columns

In [None]:
# Removing whitespace from column names.

col_names = [col.replace(' ', '') for col in dataset.columns]
dataset.columns = col_names
dataset.head()

In [None]:
# Here we can see that the 'Label' column contains some weird characters.

dataset["Label"].unique()

In [None]:
# This next snippet uses regular expressions to strip non-letter characters and replace whitespace with underscores.

import re

label_names = dataset['Label'].unique()

label_names = [re.sub("[^a-zA-Z ]+", "", l) for l in label_names]
label_names = [re.sub(r"\s", "_", l) for l in label_names]
label_names = [lab.replace("__", "_") for lab in label_names]

label_names, len(label_names)

In [None]:
# Replacing 'Label' column values with new readable values.

labels = dataset['Label'].unique()

for i in range(0,len(label_names)):
    dataset['Label'] = dataset['Label'].replace({labels[i] : label_names[i]})

dataset['Label'].unique()

In [None]:
len(dataset['Label'].unique())

In [None]:
# Saving cleaned dataset.

#dataset.to_csv("Dataset_clean.csv", index=False)

## Removing NULL values

In [None]:
#dataset = pd.read_csv("Dataset_clean.csv", index_col=0)
dataset.head()

In [None]:
# Checking if there are any NULL values in the dataset.

dataset.isnull().values.any()

In [None]:
# Checking which column/s contain NULL values.

[col for col in dataset if dataset[col].isnull().values.any()]

In [None]:
# Checking how many NULL values this column contains.

dataset['FlowBytes/s'].isnull().sum()

In [None]:
# Only 334 rows in the entire dataset contain NULL values, which is about 0.01% of all rows, so we
# can safely remove all NULL rows without spoiling the data.

334/dataset.shape[0]*100

In [None]:
# Removing rows that contain NULL values and checking if number of removed rows is equal to the number of null values.

before = dataset.shape

dataset.dropna(inplace=True)

after = dataset.shape

before[0] - after[0]

In [None]:
dataset.isnull().any().any()

## Removing *non-finite* values

In [None]:
import numpy as np

In [None]:
labl = dataset['Label']
dataset = dataset.loc[:, dataset.columns != 'Label'].astype('float64')

In [None]:
# Checking if all values are finite.

np.all(np.isfinite(dataset))

In [None]:
# Checking what column/s contain non-finite values.

nonfinite = [col for col in dataset if not np.all(np.isfinite(dataset[col]))]

nonfinite

In [15]:
# Checking how many non-finite values the 'FlowBytes/s' column contains.

finite = np.isfinite(dataset['FlowBytes/s']).sum()

dataset.shape[0] - finite


In [None]:
# Checking how many non-finite values the 'FlowPackets/s' column contains.

finite = np.isfinite(dataset['FlowPackets/s']).sum()

dataset.shape[0] - finite

In [None]:
# Same as before: since there is only a small number of non-finite values, we can safely remove them
# without spoiling the dataset.

# Replacing infinite values with NaN values.
dataset = dataset.replace([np.inf, -np.inf], np.nan)

In [None]:
# We can see that we now have NaN values again.

np.any(np.isnan(dataset))

In [None]:
# Bringing the labels back into the dataset before deleting NaN rows.

dataset = dataset.merge(labl, how='outer', left_index=True, right_index=True)

In [None]:
# Removing new NaN values.

dataset.dropna(inplace=True)

In [None]:
dataset.shape

In [None]:
dataset.head()

In [None]:
# Saving cleaned dataset.

#dataset.to_csv("Dataset_clean_dropna.csv", index=False)

# Data visualization

By now we know that our dataset has 78 features and is split into 15 categories (14 attacks and 1 "normal" state).
The next step is to visualize what the dataset looks like in feature space.
For this we use principal component analysis (PCA) to reduce dimensionality and then pass the reduced dataset to t-SNE (t-distributed Stochastic Neighbor Embedding) for a visual representation in 2D space.

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

%matplotlib inline

In [None]:
# We are going to pick 10,000 random rows from the dataset for visualization purposes.
# Setting the random seed for reproducibility of results.

np.random.seed(42)

rand_perm = np.random.permutation(dataset.shape[0])

In [None]:
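feature_cols = dataset.columns[:-1]

# rand_perm holds row positions, so we index with .iloc (the index labels have gaps after dropna).
dataset_subset = dataset.iloc[rand_perm[:10000], :]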

In [None]:
dataset_subset = dataset_subset.replace([np.inf, -np.inf], np.nan)
dataset_subset.dropna(inplace=True)

In [None]:
data_subset = dataset_subset[feature_cols].values

In [None]:
# Performing the principal component analysis. With just 19 components the explained variance ratio stays at about 99%, which is great.

pca = PCA(n_components=19)
pca_res = pca.fit_transform(data_subset)

data_subset = None
np.sum(pca.explained_variance_ratio_)
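The component count can also be chosen by scikit-learn from a variance target instead of being fixed by hand. The following is a minimal sketch on synthetic data; `toy_features`, its shape and the 0.99 threshold are assumptions made for illustration, not values from the notebook run.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix: 500 rows, 20 columns,
# built from 5 latent factors plus a little noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
toy_features = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

# A float in (0, 1) asks PCA for the fewest components whose
# cumulative explained variance ratio exceeds that threshold.
pca_auto = PCA(n_components=0.99, svd_solver="full")
reduced = pca_auto.fit_transform(toy_features)

print(pca_auto.n_components_, pca_auto.explained_variance_ratio_.sum())
```

After fitting, `n_components_` reports how many components were kept, which can then be compared with the hand-picked value.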

In [None]:
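# Computing t-SNE on the PCA-reduced data (data_subset was cleared in the previous cell).
# The number of iterations is left at its default of 1000.

tsne = TSNE(n_components=2, verbose=0, perplexity=40)
tsne_res = tsne.fit_transform(pca_res)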
print("done")

In [None]:
dataset_subset['tsne_firstD'] = tsne_res[:,0]
dataset_subset['tsne_secondD'] = tsne_res[:,1]

In [14]:
plt.figure(figsize=(16,16))

sns.scatterplot(
    x="tsne_firstD", y="tsne_secondD",
    palette=sns.color_palette("hls", dataset_subset["Label"].nunique()),
    data=dataset_subset,
    hue="Label",
    legend="full",
    alpha=0.3
)


The plot above shows the distribution of the data in 2D space. The attacks are clearly not well separated spatially from the normal state. Distinct clusters of attacks can hardly be seen; instead, attack points lie in the same regions as the "normal state" data points.

This suggests that an ML model will probably have some difficulty with this kind of data, so the model has to be chosen with this in mind.

# Data preparation

In this chapter, the final data preparation steps are taken before we use the data for model training and testing.

These steps include:

* Data scaling
* Label encoding
* Data splitting

In [None]:
#dataset = pd.read_csv("Dataset_clean_dropna.csv")

## Scaling the data

The next few cells contain the code for scaling the features to a range suitable for the ML algorithm.

In [None]:
# Splitting dataset into features and labels.

labels = dataset['Label']
features = dataset.loc[:, dataset.columns != 'Label'].astype('float64')

In [None]:
features.head()

In [None]:
# For scaling the data, we use RobustScaler class from sklearn.

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

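For scaling the data we used the RobustScaler class from sklearn. RobustScaler centers each feature on its median and scales it by the interquartile range, so outliers have little influence on the scaling and are preserved in the scaled data.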
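To illustrate what this kind of scaling computes, here is a minimal NumPy sketch of median/IQR scaling, which is what RobustScaler does with its default settings. The single-column array `x` is made up for illustration.

```python
import numpy as np

# Made-up column with one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
iqr = q3 - q1

# Robust scaling: subtract the median, divide by the interquartile range.
x_scaled = (x - median) / iqr

print(x_scaled.ravel())
```

The outlier stays far away from the other values after scaling, but it does not distort the scale of the remaining points the way a min-max or mean/standard-deviation scaling would.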

In [None]:
scaler = RobustScaler()
scaler.fit(features)

features = scaler.transform(features)

In [None]:
# Checking if scaling has been successful.
features[0]

## Label encoding

Label encoding is needed when a dataset contains categorical values (e.g. 0-5, A/B/C, 55+). It turns categorical values into numerical ones by replacing each category with an integer, starting from 0.
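The idea can be sketched with plain NumPy. scikit-learn's LabelEncoder assigns codes in the same sorted order of class names; the label values below are made up for illustration.

```python
import numpy as np

# Made-up label values for illustration.
toy_labels = np.array(["DDoS", "BENIGN", "PortScan", "BENIGN", "DDoS"])

# np.unique returns the sorted classes and, with return_inverse=True,
# the integer code of each element (its index in the sorted classes).
classes, codes = np.unique(toy_labels, return_inverse=True)

print(classes)  # sorted class names
print(codes)    # integer code per row

# Decoding is an index lookup into the class array.
decoded = classes[codes]
```

Because the codes are just positions in the sorted class list, the integers carry no ordering between attacks; they only serve as class identifiers for the classifier.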

In [None]:
# No need to do previous operations, just load clean saved dataset.

#dataset = pd.read_csv('Dataset_clean.csv', index_col=[0])

In [None]:
from sklearn.preprocessing import LabelEncoder

The 'Label' column contains categorical values, 15 of them (14 types of attacks in our dataset + 1 normal state).

To convert these into numerical values we will use the 'LabelEncoder' class from sklearn.

In [None]:
LE = LabelEncoder()

LE.fit(labels)
labels = LE.transform(labels)

In [None]:
# Labels have been replaced with integers.

np.unique(labels)

In [None]:
# Checking that encoding reversal works.

d = LE.inverse_transform(labels)
d = pd.Series(d)
d.unique()

## Splitting the data

The final step of data preparation is splitting the data into training and testing sets. _sklearn_ already provides a function that does all the splitting for us. This step is important because it gives us representative data for evaluating our model: both the training and the testing set should contain similar data variance.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# The next step is to split training and testing data. For this we will use sklearn function train_test_split().

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=.2)

features_train.shape, features_test.shape, labels_train.shape, labels_test.shape
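With strongly imbalanced classes, one option is a stratified split, which keeps the class proportions the same in both parts. This was not used in the cell above; the sketch below only illustrates the `stratify` and `random_state` parameters of `train_test_split` on made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1.
toy_X = rng.normal(size=(100, 3))
toy_y = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    stratify=toy_y,    # keep class proportions equal in both parts
    random_state=42,   # fixed seed so the split is repeatable
)

print(np.bincount(y_tr), np.bincount(y_te))
```

Note that stratification requires every class to have at least two samples, which may matter for the rarest attacks.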

In [None]:
# Clearing variables.

dataset = None
finite = None
labl = None
d = None
features = None
labels = None

# Model training

To complete this task we chose a neural network: specifically a multi-layer perceptron, that is, a feedforward multi-class classifier trained with the backpropagation algorithm. The network will be used to classify 14 different attacks and 1 normal state, as we saw from the labels in previous chapters.

In this chapter we explain the parts of the network and its hyperparameters.

In [None]:
import tensorflow as tf
import datetime

#%load_ext tensorboard

Our TensorFlow Sequential model has 3 layers: an input layer, 1 hidden layer and an output layer.

* Input layer has 78 neurons, one for each feature.
* Hidden layer has 67 neurons. This number follows a [rule of thumb](https://www.heatonresearch.com/2017/06/01/hidden-layers.html): 2/3 of the number of input neurons plus the number of output neurons.
* Output layer has 15 neurons, one for each class we predict.
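The hidden layer size can be checked with a few lines of arithmetic. The variable names are chosen here for illustration only.

```python
n_inputs = 78    # one input neuron per feature
n_outputs = 15   # one output neuron per class

# Rule of thumb: 2/3 of the input size plus the output size.
n_hidden = round(2 / 3 * n_inputs + n_outputs)

print(n_hidden)
```

Rounding guards against floating-point error in `2 / 3 * 78`, which may land slightly below 52.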

For activation functions we used the standard choices for multi-class classification: ReLU for the hidden layer and _softmax_ for the output layer.

Finally, we use a Dropout layer with its rate set to 0.2, which randomly switches off 20% of the hidden neurons in each training iteration. This technique reduces overfitting, which tends to improve accuracy on unseen data.
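What dropout does during training can be sketched in NumPy. This is an illustration of the technique, not the Keras implementation; the all-ones `activations` array is made up so that the effect is easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.2

# Made-up activations of a hidden layer for a batch of 4 samples.
activations = np.ones((4, 67))

# Each unit is kept with probability 1 - rate; kept units are scaled
# by 1 / (1 - rate) so the expected sum of activations is unchanged.
keep_mask = rng.random(activations.shape) >= rate
dropped = activations * keep_mask / (1.0 - rate)

print(keep_mask.mean())      # fraction of units kept, close to 0.8
print(np.unique(dropped))    # dropped units are 0, kept units are scaled up
```

Dropout is only active during training; at evaluation and prediction time all neurons are used.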

In [None]:
model = tf.keras.models.Sequential([

    tf.keras.layers.Flatten(input_shape=(78,)),
    tf.keras.layers.Dense(67, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(15, activation='softmax')
])

For optimization we used the Adam optimizer, which adapts the learning rate during training.
The loss function is sparse categorical crossentropy, which is standard for multi-class classification problems with integer-encoded labels.
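To make the loss concrete, here is a minimal NumPy sketch of softmax followed by sparse categorical crossentropy. The function names mirror the Keras loss name but are written here for illustration, and the logits are made up.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(y_true: np.ndarray, probs: np.ndarray) -> float:
    """Mean negative log-probability assigned to the true integer class."""
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())


# Made-up logits for 2 samples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
y_true = np.array([0, 2])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1), loss)
```

"Sparse" refers to the label format: the true classes are given as integers (as produced by the label encoding step) rather than as one-hot vectors.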

In [None]:
model.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

In [None]:
import os

In the next cell we set up training logs for TensorBoard as well as the training callbacks.

* TensorBoard - callback that logs training data.
* EarlyStopping - callback that monitors the 'loss' metric; if the loss does not improve within 10 epochs, the callback stops the training and restores the network weights from the best epoch up to that point.
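The patience logic can be sketched in plain Python. This is a simplified illustration, not the Keras implementation; `early_stopping` is a hypothetical function and the loss values are made up.

```python
def early_stopping(losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch losses.

    Simplified sketch: an epoch counts as an improvement only if its
    loss is strictly lower than the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # Stop here; best_epoch marks the weights to restore.
                return epoch, best_epoch
    # Ran through all epochs without triggering the stop condition.
    return len(losses) - 1, best_epoch


# Made-up loss curve that stops improving after epoch 2 (0-based).
stop_epoch, best_epoch = early_stopping([1.0, 0.8, 0.7, 0.71, 0.72, 0.73], patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model ends up with the weights from `best_epoch`, not from the epoch at which training stopped.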

In [None]:
log_dir = os.path.join(
    "train_logs",
    datetime.datetime.now().strftime("%Y%m%d-%H%M%S"),
)

# TF callback that sets up TensorBoard with training logs.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# TF callback that stops training once the training loss has stopped improving. It also
# restores weights from the best training epoch.
eary_stop_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=10, restore_best_weights=True)

In [None]:
model.fit(features_train,
          labels_train,
          epochs=100,
          callbacks=[tensorboard_callback, eary_stop_callback])

We can see that training stopped after 18 out of 100 epochs because the 'loss' metric had not improved in the previous 10 epochs.

After training we evaluate the model on the test set (next cell) and find that it classifies the traffic with **91.2% accuracy**.

In [None]:
# Evaluating model accuracy.
model.evaluate(features_test, labels_test, verbose=2)
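Because the classes are heavily imbalanced, a single accuracy figure can hide poor results on rare attacks. The following is a minimal sketch of per-class recall; `per_class_recall` is a hypothetical helper and the label arrays are made up. For the trained model, the predictions could presumably be obtained with `model.predict(features_test).argmax(axis=1)`.

```python
import numpy as np


def per_class_recall(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> np.ndarray:
    """Fraction of samples of each true class that were predicted correctly."""
    recall = np.full(n_classes, np.nan)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recall[c] = (y_pred[mask] == c).mean()
    return recall


# Made-up labels: class 0 dominates, class 2 is rare and always missed.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

accuracy = (y_true == y_pred).mean()
recall = per_class_recall(y_true, y_pred, n_classes=3)
print(accuracy, recall)
```

In this toy case the overall accuracy looks acceptable while one class is never detected, which is the situation a per-class view is meant to expose.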

In [None]:
# Saving the model.

model.save('saved_models/IDS_model_' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.h5')

# Conclusion

In this project we built a neural network classifier that can distinguish 14 network/web attacks and normal traffic with 91% accuracy. This model is a proof of concept that a feedforward neural network trained with backpropagation can be used for classifying attacks in anomaly-based intrusion detection systems.


**Propositions**

We propose a couple of ways to improve model accuracy, as well as some other architectures worth trying.

The accuracy of this model can probably be improved by _feature engineering_ and _feature selection_, that is, by picking the features that have the most influence on the model.

Regarding this model, we propose tuning its hyperparameters. Changing the hidden layer activation function, early stopping callback, dropout rate, optimizer and loss function may increase accuracy to some extent. Another way, albeit more complicated and resource intensive, is to use a genetic algorithm to evolve the best neural network architecture for this specific task.

Finally, we propose trying some other ML algorithms. Random forest classifiers have been used in intrusion detection systems for a while now. Alternatively, we found some sources using autoencoders for anomaly detection.