# CICIDS2017 dataset
------------------------------------------------

<div>
    <b>Aim</b>: This notebook provides a way to create a clean CICIDS2017 dataset from all the .csv files.</div>

<br>

<div>
    <b>Dataset description</b>: The CICIDS2017 dataset, created by the <a href="https://www.unb.ca/cic/datasets/ids-2017.html">Canadian Institute for Cybersecurity (CIC)</a>, consists of labeled network flows. It contains benign traffic and the most up-to-date common attacks, and is made up of 2,830,743 records with a total of 78 features.
</div>

In [5]:
import pandas as pd
import numpy as np
import glob
import os

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
sns.set_theme(style="white", color_codes=True)
pd.set_option('display.max_columns', None, 'max_colwidth', None, 'display.expand_frame_repr', False)

## Data Cleaning
------------------------------------------------

In [7]:
DATA_DIR  = os.path.join(os.path.abspath("../.."), "data")
IMAGE_DIR = os.path.join(os.path.abspath("../.."), "images")

if os.path.exists(DATA_DIR):
    print(f"The directory {DATA_DIR} exists.")
else:
    print(f"The directory {DATA_DIR} does not exist.")


The directory C:\Users\tangy\Desktop\Files\信安赛\projects\dbn-based-nids-master\data exists.


### Get data

In [8]:
def clean_column_name(column):
    column = column.strip(' ')
    column = column.replace('/', '_')
    column = column.replace(' ', '_')
    column = column.lower()
    return column

In [None]:
# Read all the .csv files
filenames = glob.glob(os.path.join(DATA_DIR, 'raw', '*.csv'))
datasets = [pd.read_csv(filename) for filename in filenames]

# Remove white spaces and rename the columns
for dataset in datasets:
    dataset.columns = [clean_column_name(column) for column in dataset.columns]

# Concatenate the datasets
dataset = pd.concat(datasets, axis=0, ignore_index=True)
dataset.drop(labels=['fwd_header_length.1'], axis= 1, inplace=True)

Have an initial inspection of the data

In [None]:
dataset.info()

In [None]:
dataset.head(5)

In [None]:
dataset.describe(include=[int, float])

In [None]:
dataset.describe(include=[object]).transpose()

In [None]:
dataset.label.value_counts()

### Dealing with duplicates

We first check whether there are duplicates after combining the eight .csv files.

In [None]:
dataset.duplicated().any()

As we can see from above, there are duplicates and we need to get rid of them.

In [None]:
print('Data size BEFORE deleting instances with duplicate values: ', dataset.shape[0], end='\n\n')

# Remove duplicate rows (keep=False drops every copy of a duplicated row, not just the repeats)
dataset.drop_duplicates(inplace=True, keep=False, ignore_index=True)

print('Data size AFTER deleting instances containing duplicate values: ', dataset.shape[0])
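
The `keep` argument of `drop_duplicates` decides how many copies of a repeated row survive. A minimal sketch on a made-up frame (the column names `a` and `b` are illustrative only) compares `keep=False`, as used in this notebook, with `keep='first'`:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2, 3, 3, 3], "b": [0, 0, 5, 7, 7, 7]})

# keep=False drops every row that has a duplicate, including the first copy
none_kept = toy.drop_duplicates(keep=False, ignore_index=True)

# keep='first' keeps one representative of each group of identical rows
first_kept = toy.drop_duplicates(keep="first", ignore_index=True)

print(len(toy), len(none_kept), len(first_kept))
```

With `keep=False` only the rows that were unique to begin with remain, so the resulting dataset can be noticeably smaller than with `keep='first'`.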

### Dealing with missing values

Then, we check if there are missing values in each feature.

In [None]:
dataset.isnull().sum().sum()

As we can see from above, there are 320 missing values in this **CIC-IDS-2017** dataset.

We have a few options to deal with missing values:
1. Get rid of the corresponding instances.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).
4. Use imputation methods.
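
As a rough illustration of these four options, the sketch below applies each one to a small made-up frame. The columns `x` and `y` are hypothetical, and linear interpolation is used only as the simplest stand-in for an imputation method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})

rows_dropped = toy.dropna(axis=0)          # 1. drop instances with a missing value
cols_dropped = toy.dropna(axis=1)          # 2. drop attributes with a missing value
median_filled = toy.fillna(toy.median())   # 3. fill with a fixed value (here the column median)
interpolated = toy.interpolate()           # 4. impute from neighbouring rows (linear interpolation)

print(rows_dropped.shape, list(cols_dropped.columns))
print(median_filled.loc[1, "x"], interpolated.loc[1, "x"])
```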

In [None]:
dataset.isnull().sum() / dataset.shape[0]

Since the dataset is large and the fraction of instances with missing values is small, the simplest option is to remove the instances containing missing values. However, we first need to ensure that the missing values are not tied to a specific label.
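
One way to check whether missing values concentrate in a particular label is to count the affected rows per label and compare that count with the label's total size. The sketch below does this on made-up data; only the column names `flow_bytes_s` and `label` come from this notebook, the values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "flow_bytes_s": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "label": ["BENIGN", "BENIGN", "BENIGN", "DoS Hulk", "DoS Hulk", "PortScan"],
})

# Flag rows that contain at least one missing value
has_missing = toy.isnull().any(axis=1)

# Per label: number of affected rows and their share of that label
summary = pd.DataFrame({
    "missing": has_missing.groupby(toy["label"]).sum(),
    "total": toy["label"].value_counts(),
})
summary["fraction"] = summary["missing"] / summary["total"]
print(summary)
```

If one label shows a much larger fraction than the others, dropping the rows would disproportionately remove that class.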

In [None]:
dataset.columns[dataset.isnull().any()]

All the missing values come from the `flow_bytes_s` feature.

In [None]:
print('Data size BEFORE deleting instances with missing values: ', dataset.shape[0], end='\n\n')

# Remove rows with missing values
dataset.dropna(axis=0, inplace=True, how="any")

print('Data size AFTER deleting instances containing missing values: ', dataset.shape[0])

### Dealing with infinite values

Checking if all values are finite.

In [None]:
np.all(np.isfinite(dataset.drop(['label'], axis=1)))

In [None]:
# Replace infinite values with NaN
dataset.replace([-np.inf, np.inf], np.nan, inplace=True)

# Check which labels are related to infinite values
dataset[(dataset['flow_bytes_s'].isnull()) & (dataset['flow_packets_s'].isnull())].label.unique()

In [None]:
print('Data size BEFORE deleting instances with infinite values: ', dataset.shape[0], end='\n\n')

# Remove rows with infinite values (now NaN)
dataset.dropna(axis=0, how='any', inplace=True)

print('Data size AFTER deleting instances containing infinite values: ', dataset.shape[0])

### Dealing with features with quasi null std deviation

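The standard deviation, denoted by sigma (σ), is the square root of the average squared difference from the mean. A feature whose standard deviation is close to zero is almost constant and carries little information.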
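
A small numeric check of this definition, on made-up values, also shows that pandas and NumPy use different default normalisations (`ddof=1` for `pandas.Series.std`, `ddof=0` for `numpy.std`):

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(values) / len(values)

# Population standard deviation: square root of the mean squared deviation
sigma = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

print(sigma)                      # population formula
print(np.std(values))             # NumPy default, ddof=0
print(pd.Series(values).std())    # pandas default, ddof=1 (sample estimate)
```

For a dataset of this size the two conventions give nearly identical results, so the choice does not affect which features fall under the threshold.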

In [None]:
dataset_std = dataset.std(numeric_only=True)
dataset_std

In [None]:
# Find features whose standard deviation is below the threshold
constant_features = [column for column, std in dataset_std.items() if std < 0.01]

# Drop the constant features
dataset.drop(labels=constant_features, axis=1, inplace=True)

Some features, such as `bwd_psh_flags`, `fwd_urg_flags`, `bwd_urg_flags`, `cwe_flag_count`, `fwd_avg_bytes_bulk`, `fwd_avg_packets_bulk`, `fwd_avg_bulk_rate`, `bwd_avg_bytes_bulk`, `bwd_avg_packets_bulk`, `bwd_avg_bulk_rate`, don't vary. Their standard deviation is zero, so their correlation with any other feature is NaN by definition.
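
The NaN correlation of a constant feature can be reproduced on a made-up frame (the column names below are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "varying_a": [1.0, 2.0, 3.0, 4.0],
    "varying_b": [2.0, 4.0, 6.0, 8.5],
    "constant": [0.0, 0.0, 0.0, 0.0],
})

# Pearson correlation divides by the product of the standard deviations,
# so a zero-variance column yields NaN
corr = toy.corr()
print(toy.std())
print(corr)
```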

### Observing the outliers

In [None]:
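# Restrict the check to numeric columns (the `label` column is not numeric)
numeric = dataset.select_dtypes(include=np.number)

Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# Identify outliers with the interquartile range rule
filt = (numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))
print(filt.sum())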
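
To make the interquartile range rule concrete, here is a minimal sketch on a made-up series in which a single value lies far from the rest:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Flag points more than 1.5 * IQR outside the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(lower_bound, upper_bound, outliers.tolist())
```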

In [None]:
fig = plt.figure(figsize=(15, 8))
sns.boxplot(data=dataset[["average_packet_size", "avg_bwd_segment_size"]], orient="h")

#plt.title('Summary of some variables containing outliers', fontsize=18)
plt.show()
fig.savefig(os.path.join(IMAGE_DIR, 'outliers.pdf'))

Convert the dtype of some features

In [None]:
dataset[['flow_bytes_s', 'flow_packets_s']] = dataset[['flow_bytes_s', 'flow_packets_s']].apply(pd.to_numeric)

Create a new feature, `destination_port_category`

In [None]:
conds = [
    (dataset['destination_port'] >= 1) & (dataset['destination_port'] < 1024),
    (dataset['destination_port'] >= 1024) & (dataset['destination_port'] < 49152),
    (dataset['destination_port'] >= 49152) & (dataset['destination_port'] <= 65535)
]

choices = [
    "1 - 1023", 
    "1024 - 49151",
    "49152 - 65535"
]

dataset.insert(1, 'destination_port_category', np.select(conds, choices, default="0"))

## Data Exploration
------------------------------------------------

### Correlation Matrix

In [None]:
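# Only numeric columns can be correlated; `label` and `destination_port_category` are skipped
dataset_corr = dataset.corr(numeric_only=True)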
dataset_corr.head(5)

In [None]:
fig = plt.figure(figsize=(15, 15))
sns.set(font_scale=1.0)
ax = sns.heatmap(dataset_corr, annot=False)
fig.savefig(os.path.join(IMAGE_DIR, 'correlation matrix.pdf'))

We can see that some features seem to be highly correlated. Hence, we might need to remove them since they bring redundant information.

In [None]:
# Create & Apply mask
mask = np.triu(np.ones_like(dataset_corr, dtype=bool))
tri_df = dataset_corr.mask(mask)

# Find Features that meet the threshold
correlated_features = [c for c in tri_df.columns if any(tri_df[c] > 0.98)]

# Drop the highly correlated features
dataset.drop(labels=correlated_features, axis=1, inplace=True)
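
The effect of the triangular mask is easier to see on a small example. The sketch below uses made-up features `f1`, `f2`, `f3` (hypothetical names), where `f2` is an exact linear copy of `f1`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + 1.0,        # exact linear copy of f1
    "f3": rng.normal(size=200),    # independent noise
})

corr = toy.corr()

# Hide the diagonal and upper triangle so each pair is inspected once
mask = np.triu(np.ones_like(corr, dtype=bool))
lower = corr.mask(mask)

to_drop = [c for c in lower.columns if (lower[c] > 0.98).any()]
print(to_drop)
```

Only one feature of each highly correlated pair is flagged (here the one that appears first in column order), so the information it carries is still available through the feature that is kept.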

### Label 

In [None]:
fig = plt.figure(figsize=(30, 10))

attack = dataset['label'].value_counts()

attack_count = attack.values
attack_type = attack.index

bar = plt.bar(attack_type, attack_count, align='center')

for rect in bar:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2.0, height, format(height, ','), ha='center', va='bottom')

plt.title('Distribution of different types of network activity in the dataset')
plt.xlabel('Network activity')
plt.ylabel('Number of instances')
plt.grid(True)
plt.show()
fig.savefig(os.path.join(IMAGE_DIR, 'network_activity.pdf'))

The dataset is clearly imbalanced and we need to deal with it. We can merge a few minority classes that have similar characteristics and behavior to form new attack classes.

In [None]:
dataset['label'] = dataset['label'].str.replace('Web Attack �', 'Web Attack', regex=False)

# Proposed Groupings
attack_group = {
    'BENIGN': 'Benign',
    'PortScan': 'PortScan',
    'DDoS': 'DoS/DDoS',
    'DoS Hulk': 'DoS/DDoS',
    'DoS GoldenEye': 'DoS/DDoS',
    'DoS slowloris': 'DoS/DDoS', 
    'DoS Slowhttptest': 'DoS/DDoS',
    'Heartbleed': 'DoS/DDoS',
    'FTP-Patator': 'Brute Force',
    'SSH-Patator': 'Brute Force',
    'Bot': 'Botnet ARES',
    'Web Attack Brute Force': 'Web Attack',
    'Web Attack Sql Injection': 'Web Attack',
    'Web Attack XSS': 'Web Attack',
    'Infiltration': 'Infiltration'
}

# Create grouped label column
dataset['label_category'] = dataset['label'].map(lambda x: attack_group[x])
dataset['label_category'].value_counts()

In [None]:
fig = plt.figure(figsize=(12, 5))

attack = dataset['label_category'].value_counts()

attack_count = attack.values
attack_type = attack.index

bar = plt.bar(attack_type, attack_count, align='center')

for rect in bar:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2.0, height, format(height, ','), ha='center', va='bottom', fontsize=12)

plt.title('Distribution of different types of network activity in the dataset', fontsize=18)
plt.xlabel('Network activity', fontsize=16)
plt.ylabel('Number of instances', fontsize=16)
plt.grid(True)
plt.show()
fig.savefig(os.path.join(IMAGE_DIR, 'network_activity_category.pdf'))

### Port Usage Comparison

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

benign_ports = dataset.loc[dataset['label'] == 'BENIGN', 'destination_port_category']
malicious_ports = dataset.loc[dataset['label'] != 'BENIGN', 'destination_port_category']

# get rid of rows with specific value
benign_ports = benign_ports[benign_ports != "0"]
malicious_ports = malicious_ports[malicious_ports != "0"]

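# count flows per port category, in the same order as the x tick labels
benign_ports = benign_ports.value_counts().reindex(choices, fill_value=0)
malicious_ports = malicious_ports.value_counts().reindex(choices, fill_value=0)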

indexes = np.arange(3)
width = 0.4
rect1 = plt.bar(indexes, benign_ports.values, width, color="steelblue", label="benign")
rect2 = plt.bar(indexes + width, malicious_ports.values, width, color="indianred", label="malicious")

def add_text(rect):
    # add text to top of each bar
    for r in rect:
        h = r.get_height()
        plt.text(r.get_x() + r.get_width()/2, h*1.01, s=format(h, ",") ,fontsize=12, ha='center', va='bottom')

add_text(rect1)
add_text(rect2)

ax.set_xticks(indexes + width / 2)
ax.set_xticklabels(["1 - 1,023", "1,024 - 49,151", "49,152 - 65,535"])
plt.title('Distribution of Port Usage\nAccording to Network Activity Type')
plt.xlabel('Port Range')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.show()
fig.savefig(os.path.join(IMAGE_DIR, 'port_usage_comparison.pdf'))

## Data Preparation
------------------------------------------------

### Data Splitting

The first step of data preparation is splitting the data into training, validation, and testing sets. scikit-learn already provides a function, `train_test_split`, that does the splitting for us. This step is important so that we have representative data for evaluating our model: the training, validation, and testing samples should be similarly distributed.
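
To illustrate what a stratified split preserves, the sketch below imitates a 60/20/20 split with pandas alone on made-up data. It is not scikit-learn's implementation, and the column names and class sizes are hypothetical; it only shows that sampling within each class keeps the class proportions identical across the three sets.

```python
import pandas as pd

toy = pd.DataFrame({
    "feature": range(100),
    "label": ["Benign"] * 80 + ["Attack"] * 20,
})

# Draw 60% of each class for training, mirroring a stratified split
train = toy.groupby("label").sample(frac=0.6, random_state=42)
rest = toy.drop(train.index)

# Split the remainder evenly, again per class, into validation and test
val = rest.groupby("label").sample(frac=0.5, random_state=42)
test = rest.drop(val.index)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), part["label"].value_counts(normalize=True).to_dict())
```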

In [None]:
labels = dataset['label_category']
features = dataset.drop(labels=['label', 'label_category', 'destination_port_category'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42, stratify=labels)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42, stratify=y_test)

In [None]:
X_train.shape, X_test.shape, X_val.shape

### Scaling features to a range

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, QuantileTransformer
from sklearn.compose import ColumnTransformer

In [None]:
categorical_features = features.select_dtypes(exclude=["int64", "float64"]).columns
numeric_features = features.select_dtypes(exclude=[object]).columns

preprocessor = ColumnTransformer(transformers=[
    ('categoricals', OneHotEncoder(drop='first', handle_unknown='error'), categorical_features),
    ('numericals', QuantileTransformer(), numeric_features)
])

Preprocess the features

In [None]:
columns = numeric_features.tolist()

X_train = pd.DataFrame(preprocessor.fit_transform(X_train), columns=columns)
X_test = pd.DataFrame(preprocessor.transform(X_test), columns=columns)
X_val = pd.DataFrame(preprocessor.transform(X_val), columns=columns)

Preprocess the labels

In [None]:
le = LabelEncoder()

y_train = pd.DataFrame(le.fit_transform(y_train), columns=["label"])
y_test = pd.DataFrame(le.transform(y_test), columns=["label"])
y_val = pd.DataFrame(le.transform(y_val), columns=["label"])

Save the results

In [None]:
X_train.to_pickle(os.path.join(DATA_DIR, 'processed', 'train/train_features.pkl'))
X_val.to_pickle(os.path.join(DATA_DIR, 'processed', 'val/val_features.pkl'))
X_test.to_pickle(os.path.join(DATA_DIR, 'processed', 'test/test_features.pkl'))

y_train.to_pickle(os.path.join(DATA_DIR, 'processed', 'train/train_labels.pkl'))
y_val.to_pickle(os.path.join(DATA_DIR, 'processed', 'val/val_labels.pkl'))
y_test.to_pickle(os.path.join(DATA_DIR, 'processed', 'test/test_labels.pkl'))

In [None]:
X_train.shape

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def balance_dataset(X, y, undersampling_strategy, oversampling_strategy):
    """Undersample the majority classes, then oversample the minority classes with SMOTE."""
    under_sampler = RandomUnderSampler(sampling_strategy=undersampling_strategy, random_state=0)
    X_under, y_under = under_sampler.fit_resample(X, y)

    # Fix the seed so the synthetic samples are reproducible
    over_sampler = SMOTE(sampling_strategy=oversampling_strategy, random_state=0)
    X_bal, y_bal = over_sampler.fit_resample(X_under, y_under)

    return X_bal, y_bal

## Balance the training set using combination of `SMOTE` & `RandomUnderSampler`
------------------------

***Label Encoder Transformation***
```json
{
    "Benign": 0,
    "DoS/DDoS": 3,
    "PortScan": 4,
    "Brute Force": 2,
    "Web Attack": 5,
    "Botnet ARES": 1
}
```
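
`LabelEncoder` assigns integer codes following the sorted order of the class names, which is why the codes above do not follow class frequency. A plain-Python sketch (not scikit-learn itself) reproduces the mapping for the six categories listed:

```python
class_names = ["Benign", "DoS/DDoS", "PortScan", "Brute Force", "Web Attack", "Botnet ARES"]

# Codes follow the sorted order of the class names
mapping = {name: code for code, name in enumerate(sorted(class_names))}
print(mapping)
```

On the fitted encoder, `le.classes_` holds the same sorted list, and `le.inverse_transform` maps the integer codes back to the category names.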

In [None]:
undersampling_strategy = {
    0: 800000,
    3: 192161,
    4: 34383,
    2: 5131,
    5: 1271,
    1: 1166,
}

oversampling_strategy = {
    0: 800000,
    3: 212102,
    4: 44460,
    2: 50115,
    5: 50284,
    1: 50149,
}

# Balance the training set
X_train_bal, y_train_bal = balance_dataset(X_train, y_train, undersampling_strategy, oversampling_strategy)

# Save the balanced training set
X_train_bal.to_pickle(os.path.join(DATA_DIR, 'processed', 'train/train_features_balanced.pkl'))
y_train_bal.to_pickle(os.path.join(DATA_DIR, 'processed', 'train/train_labels_balanced.pkl'))

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))

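# count instances per class, in the same order as the x tick labels
class_order = [0, 3, 4, 2, 5, 1]
imbalanced = y_train.squeeze().value_counts().reindex(class_order, fill_value=0)
balanced = y_train_bal.squeeze().value_counts().reindex(class_order, fill_value=0)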

indexes = np.arange(6)
width = 0.4
rect1 = plt.bar(indexes, imbalanced.values, width, color="steelblue", label="imbalanced")
rect2 = plt.bar(indexes + width, balanced.values, width, color="indianred", label="balanced")

def add_text(rect):
    """Add text to top of each bar."""
    for r in rect:
        h = r.get_height()
        plt.text(r.get_x() + r.get_width()/2, h*1.01, s=format(h, ",") ,fontsize=12, ha='center', va='bottom')

add_text(rect1)
add_text(rect2)

ax.set_xticks(indexes + width / 2)
ax.set_xticklabels(['Benign', 'DoS/DDoS', 'PortScan', 'Brute Force', 'Web Attack', 'Botnet ARES'])
plt.xlabel('Traffic Activity', fontsize=16)
plt.ylabel('# instances', fontsize=16)
plt.legend()
plt.grid()
plt.show()
fig.savefig(os.path.join(IMAGE_DIR, 'balanced_dataset.pdf'))