Student (auditeur): Didier GORGES

kaggle notebook : https://www.kaggle.com/dgcnam/sec201-lab-session-predicting-attacks/edit/run/61813075

# Loading data

## 1. Open a new Jupyter notebook and name it ‘SEC201 - Lab session - predicting attacks’

Done

## 2. In File > Add or Include Data, search for “UNSW_NB15” dataset and include it

Done

## 3. In ‘Data > input > unsw-nb15‘, get the exact path of CSV file 'UNSW_NB15_training-set.csv' and load it as training_set using Pandas.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

df = pd.read_csv('/kaggle/input/unsw-nb15/UNSW_NB15_training-set.csv')

## 4. List following information for the training set


### 4.1. Column number

In [None]:
print(f'Column number: { df.columns.size }')

### 4.2. Column names

In [None]:
print('Column names')
print(df.columns)

### 4.3. Column types

In [None]:
print('Column types')
print(df.dtypes)

### 4.4. Size of the data set

In [None]:
print(f'Size of the dataset : { len(df) }')

## 5. Look at the file head using: df.head()

In [None]:
df.head()

## 6. Which columns are categories? List them; extract existing values.

In [None]:
print(df.dtypes[df.dtypes == 'object'].index)


In [None]:
for category in df.dtypes[ df.dtypes == 'object' ].index:
    print(category)
    print(list(set(df[category])))

## 7. Which columns are numeric? List them; extract min, max, mean, median and standard deviation values for ‘rate’.

In [None]:
print('Numeric columns :')
# select_dtypes is the public API; _get_numeric_data is a private pandas method
newdf = df.select_dtypes(include='number')
print(newdf.columns)

In [None]:
print("Extract min, max, mean, median and standard deviation values for 'rate'")
for f in [ 'min', 'max', 'mean', 'median', 'std']:
    print(f'{ f } = { getattr(df.rate, f)() }')

In [None]:
print('Extract min, max, mean, median and standard deviation values for all numeric columns')
function_list = ['min', 'max', 'mean', 'median', 'std']
# DataFrame.append was removed in pandas 2.0; agg computes every statistic in one call.
# Transposing gives one row per column and one column per statistic.
stats = newdf.agg(function_list).T.rename_axis('name').reset_index()
stats

## 8. Based on this information
### 8.1. Define the goal of the analysis.

The goal of the analysis is to check whether the entries are labelled correctly, so that the labels can be used to detect attacks.


### 8.2. Identify the target properties you will want to analyse
I want to analyse the 'label' and 'attack_cat' properties against the other properties.

## 9. Check whether the positive label (1) matches attack categories and whether attack categories match labelled data.

The following code shows that the negative label (0) matches the Normal category only, and that the positive label (1) matches all the other attack categories.

In [None]:
label_normal = df.loc[df.attack_cat == 'Normal'].label.unique()
print(f'{ len(label_normal) } distinct label value(s) where attack_cat == Normal: { label_normal }')

label_attack = df.loc[df.attack_cat != 'Normal'].label.unique()
print(f'{ len(label_attack) } distinct label value(s) where attack_cat != Normal: { label_attack }')
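
The same check can also be read from a single cross-tabulation. The sketch below is only an illustration: the `toy` frame and its values are invented stand-ins for the training set, and `is_consistent` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the training set (values invented for illustration)
toy = pd.DataFrame({
    'attack_cat': ['Normal', 'Normal', 'DoS', 'Exploits', 'DoS'],
    'label':      [0,        0,        1,     1,          1],
})

# Rows: attack category, columns: label value, cells: number of entries
table = pd.crosstab(toy['attack_cat'], toy['label'])
print(table)

# Consistent labelling means each category appears under exactly one label
is_consistent = bool(((table > 0).sum(axis=1) == 1).all())
print(f'Labels consistent with categories: {is_consistent}')
```

On the real data, replacing `toy` with `df` should give one row per attack category, with non-zero counts in only one label column if the labelling is consistent.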


## 10. Which is the number of occurrences for each attack category?

In [None]:
print('Number of occurrences for each attack category :')
df.groupby("attack_cat").count()["id"]

## 11. Which protocols and services appear in the positively labelled entries? In the negatively labelled ones?

In [None]:
print('protocols appearing in negatively labelled entries :')
print(df.loc[df.label == 0].groupby('proto').count()['id'].sort_values(ascending=False).index.tolist())

In [None]:
print('protocols appearing in positively labelled entries:')
print(df.loc[df.label == 1].groupby('proto').count()['id'].sort_values(ascending=False).index.tolist())

In [None]:
print('services appearing in negatively labelled entries :')
print(df.loc[df.label == 0].groupby('service').count()['id'].sort_values(ascending=False).index.tolist())

In [None]:
print('services appearing in positively labelled entries:')
print(df.loc[df.label == 1].groupby('service').count()['id'].sort_values(ascending=False).index.tolist())

## 12. What do you conclude about the traffic being analysed?

In this data set, the 'label' and 'attack_cat' properties are consistent. If 'label' is 1, 'attack_cat' is not 'Normal'. If 'label' is 0, 'attack_cat' is 'Normal'.

There are large differences between the mean and the median of some properties ('rate', 'sload', 'dload', 'sjit', ...), which indicates strongly skewed distributions with outliers, possibly due to attacks.
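
As a small illustration of why a gap between mean and median points to outliers, the sketch below uses an invented `rate` series (not values from the data set): a few very large values pull the mean far above the median.

```python
import pandas as pd

# Invented values: mostly small rates plus two very large ones
rate = pd.Series([10, 12, 15, 11, 13, 14, 100000, 250000])

mean, median = rate.mean(), rate.median()
print(f'mean = {mean:.1f}, median = {median:.1f}')

# A mean far above the median signals a right-skewed distribution
ratio = mean / median
print(f'mean / median = {ratio:.1f}')
```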

Attack traffic uses a wider variety of protocols and services than legitimate traffic. Protocols and services that do not appear in legitimate traffic are suspicious.
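
One possible way to list the protocols that appear only in attack traffic is a set difference. This is a minimal sketch on an invented `toy` frame; the names `normal_protos`, `attack_protos` and `attack_only` are assumptions, not part of the lab.

```python
import pandas as pd

# Toy stand-in for the training set (values invented for illustration)
toy = pd.DataFrame({
    'proto': ['tcp', 'udp', 'tcp', 'udp', 'ospf', 'sctp', 'tcp'],
    'label': [0,     0,     1,     1,     1,      1,      0],
})

# Distinct protocols seen in each class
normal_protos = set(toy.loc[toy.label == 0, 'proto'])
attack_protos = set(toy.loc[toy.label == 1, 'proto'])

# Protocols that never occur in legitimate traffic
attack_only = sorted(attack_protos - normal_protos)
print(f'Protocols seen only in attack traffic: {attack_only}')
```

The same pattern would apply to the 'service' column.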

# Data visualisation

## 13. Visualise the distribution of services, protocols and attack types as histograms. Use pyplot and seaborn libraries.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.gcf()
fig.set_size_inches(8, 5)
cplot = sns.countplot(y='service', data=df)
cplot.set_title('Distribution of services')
plt.show()

In [None]:
plt.figure(figsize=(30, 15))
barplot = sns.countplot(y='proto', data=df)
barplot.set_title('Distribution of protocols')
plt.show()

In [None]:
table = df[['proto', 'id']].pivot_table(index=['proto'], aggfunc='count').sort_values(['id'],ascending=False,inplace=False).head(10)
table.plot(kind='bar', title='Distribution of top 10 protocols', legend=False)

In [None]:
df.groupby('proto').count().describe()

In [None]:
barplot = sns.countplot(y='attack_cat', data=df)
barplot.set_title('Distribution of attack types')

In [None]:
plt.figure(figsize=(30,15))
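# Restrict to numeric columns: recent pandas versions raise on object columns in corr()
heatmap = sns.heatmap(df.select_dtypes(include='number').corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')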
heatmap.set_title('Correlation heatmap', fontdict={'fontsize': 12}, pad=10)
plt.show()

## 14. Build the correlation matrix between parameters for labelled and unlabelled entries.

In [None]:
print('Correlation matrix for labelled entries')
plt.figure(figsize=(30,15))
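# Restrict to numeric columns: recent pandas versions raise on object columns in corr()
heatmap = sns.heatmap(df[df.label == 1].select_dtypes(include='number').corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')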
heatmap.set_title('Correlation heatmap', fontdict={'fontsize': 12}, pad=10)
plt.show()

In [None]:
print('Correlation matrix for unlabelled entries')
plt.figure(figsize=(30,15))
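# Restrict to numeric columns: recent pandas versions raise on object columns in corr()
heatmap = sns.heatmap(df[df.label == 0].select_dtypes(include='number').corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')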
heatmap.set_title('Correlation heatmap', fontdict={'fontsize': 12}, pad=10)
plt.show()

In [None]:
print('Correlation matrix for labelled - unlabelled entries')
plt.figure(figsize=(30,15))
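# Restrict to numeric columns: recent pandas versions raise on object columns in corr()
num = df.select_dtypes(include='number')
heatmap = sns.heatmap(num[df.label == 1].corr() - num[df.label == 0].corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')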
heatmap.set_title('Correlation heatmap', fontdict={'fontsize': 12}, pad=10)
plt.show()

In [None]:
print('Top 20 differences between the correlation matrix of labelled entries and that of unlabelled entries')
# Restrict to numeric columns: recent pandas versions raise on object columns in corr()
num = df.select_dtypes(include='number')
x = (num[df.label == 1].corr() - num[df.label == 0].corr()).stack().dropna()
# DataFrame.append was removed in pandas 2.0; build all rows first, skipping the diagonal
rows = [{ 'a': a, 'b': b, 'diff': v, 'abs': abs(v) } for (a, b), v in x.items() if a != b]
t = pd.DataFrame(rows, columns=['a', 'b', 'diff', 'abs'])
print(t.sort_values('abs', ascending=False)[['a', 'b', 'diff']].head(20))


In [None]:
print('correlation of sttl and dttl, for labelled entries')
print(df.loc[df.label == 1][['sttl', 'dttl']].corr())
print('correlation of sttl and dttl, for unlabelled entries')
print(df.loc[df.label == 0][['sttl', 'dttl']].corr())

In [None]:
print('min, max, mean, median and standard deviation values for rate, sttl and dttl for unlabelled entries')
function_list = ['min', 'max', 'mean', 'median', 'std']
# DataFrame.append was removed in pandas 2.0; agg computes every statistic in one call
stats = df.loc[df.label == 0, ['rate', 'sttl', 'dttl']].agg(function_list).T.rename_axis('name').reset_index()
stats

In [None]:
print('min, max, mean, median and standard deviation values for rate, sttl and dttl for labelled entries')
function_list = ['min', 'max', 'mean', 'median', 'std']
# DataFrame.append was removed in pandas 2.0; agg computes every statistic in one call
stats = df.loc[df.label == 1, ['rate', 'sttl', 'dttl']].agg(function_list).T.rename_axis('name').reset_index()
stats


In [None]:
df.loc[df.label == 0][['sttl','dttl']].plot.hist(bins=256, alpha=0.5, title='unlabelled entries')


In [None]:
df.loc[df.label == 1][['sttl','dttl']].plot.hist(bins=256, alpha=0.5, title='labelled entries')

## 15. Based on the Exploratory Data Analysis
### 15.1. Describe what you learnt from the dataset

I learnt that the label field is a flag that indicates if the entry is considered as an attack.

The kind of attack is given in the attack_cat field. The label values of the entries are consistent with their attack_cat values.

The protocols and services used in attacks are more varied than those used in legitimate traffic.

The correlation matrices show that the sttl and dttl fields are distributed differently for labelled and unlabelled entries. Correlations also differ for some other fields.

The median of rate is 118 for unlabelled entries and 100000 for labelled entries.

The median of dttl is 29 for unlabelled entries and 0 for labelled entries.

The median of sttl is 62 for unlabelled entries and 254 for labelled entries.
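
Per-label medians like these can be obtained in one call with `groupby`. The sketch below runs on an invented `toy` frame whose values were chosen only to resemble the figures above; it is not an extract of the data set.

```python
import pandas as pd

# Toy stand-in for the training set (values invented for illustration)
toy = pd.DataFrame({
    'label': [0, 0, 0, 1, 1, 1],
    'rate':  [100.0, 120.0, 140.0, 90000.0, 100000.0, 110000.0],
    'sttl':  [31, 62, 62, 254, 254, 254],
    'dttl':  [29, 29, 252, 0, 0, 0],
})

# One row per label value, one column per metric
medians = toy.groupby('label')[['rate', 'sttl', 'dttl']].median()
print(medians)
```

On the real data, the equivalent call would be `df.groupby('label')[['rate', 'sttl', 'dttl']].median()`.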

### 15.2. Draw the first conclusions

The entries contain both normal and attack traffic and appear to be correctly classified, with no bias identified. When grouped by the label field, the two classes show different metric profiles.

### 15.3. Emit recommendations for enforcing the cybersecurity of the target system

We can use this data set to train machine learning classifiers, like XGBoost, in order to estimate a probability of attack on new entries.
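
As a rough sketch of what such a classifier could look like, the example below trains scikit-learn's `GradientBoostingClassifier`, used here as a stand-in for XGBoost, on synthetic data. The two features only loosely mimic sttl and dttl; all values are generated, not taken from the data set, and the parameters are assumptions rather than a tuned setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic features loosely mimicking sttl and dttl (invented, not real data)
normal = np.column_stack([rng.normal(60, 5, n), rng.normal(30, 5, n)])
attack = np.column_stack([rng.normal(250, 5, n), rng.normal(0, 5, n)])
X = np.vstack([normal, attack])
y = np.array([0] * n + [1] * n)  # 0 = normal, 1 = attack

# Hold out a quarter of the entries to evaluate the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Estimated probability of attack for each held-out entry
proba = clf.predict_proba(X_test)[:, 1]
accuracy = clf.score(X_test, y_test)
print(f'Accuracy on held-out entries: {accuracy:.2f}')
```

With the real data set, the categorical columns ('proto', 'service', 'state') would first need to be encoded, and 'attack_cat' would have to be dropped from the features since it reveals the label.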