# CIC-IDS2017: Denial of Service Attacks

## Overview
This project builds a machine learning-based network intrusion detection system that classifies network flows as benign or malicious, using the CIC-IDS2017 benchmark dataset. This notebook focuses on denial-of-service (DoS) attacks.

## Import Packages

In [None]:
import os
import numpy as np
import pandas as pd
from src.utils import load_config, make_path, load_data
from src.processing import clean_data, prepare_labels_binary, split_data

In [None]:
os.getcwd()

## Get Data

We use `load_data` to read the Wednesday (i.e., DoS) data from CIC-IDS2017.

In [23]:
df = load_data('raw','Wednesday-workingHours.pcap_ISCX.csv')

Loaded 692,703 rows from: Wednesday-workingHours.pcap_ISCX.csv


We use `clean_data` to remove rows that contain missing (`NaN`) or infinite (`np.inf`) values. We will review and clean specific features below.

In [24]:
df = clean_data(df)

Data Cleaned
Initial rows: 692,703

Removed 1,008 rows with NaN values (0.15%)
Removed 289 rows with np.inf values (0.04%)

Final rows: 691,406
Total removed: 1,297 (0.19%)
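The internals of `clean_data` are not shown in this notebook. As a rough sketch of the kind of filtering described above, the following uses a hypothetical helper, `drop_nonfinite_rows`, on toy data with a made-up column name. It is an assumption about the approach, not the project's actual implementation.

```python
import numpy as np
import pandas as pd


def drop_nonfinite_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data: drop rows with NaN or +/-inf."""
    # First drop rows that contain NaN in any column.
    no_nan = df.dropna()
    # Then flag rows that hold +/-inf in any numeric column.
    numeric = no_nan.select_dtypes(include=[np.number])
    has_inf = np.isinf(numeric).any(axis=1)
    return no_nan.loc[~has_inf]


# Toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "flow_bytes_per_s": [10.0, np.nan, np.inf, 25.0],
    "label": ["BENIGN", "BENIGN", "DoS Hulk", "DoS Hulk"],
})
cleaned = drop_nonfinite_rows(toy)
print(f"Kept {len(cleaned)} of {len(toy)} rows")
```

Counting NaN and infinite rows separately, as the cell output above does, makes it easier to see which kind of problem dominates.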


Since the current problem is binary classification of benign flows versus any of several denial-of-service attacks, we use `prepare_labels_binary` to exclude flows labeled "Heartbleed", which is a memory disclosure exploit rather than a DoS attack, and to create a binary label for the remaining rows. The binary label is stored in the `is_attack` column. The original `Label` column is dropped.

In [25]:
df = prepare_labels_binary(df, exclude_values=['Heartbleed'])

Excluded 11 rows with labels: ['Heartbleed']
Labels Prepared

Binary labels:
{0: 439683, 1: 251712}

Original labels by binary label:
is_attack  label           
0          BENIGN              439683
1          DoS Hulk            230124
           DoS GoldenEye        10293
           DoS slowloris         5796
           DoS Slowhttptest      5499
Name: count, dtype: int64
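For readers without access to `src.processing`, the labeling step can be sketched roughly as follows. The helper `to_binary_labels` and its parameters are hypothetical and only illustrate the idea of excluding some labels and mapping everything except `BENIGN` to 1.

```python
import pandas as pd


def to_binary_labels(df, label_col="label", benign_value="BENIGN", exclude_values=()):
    """Hypothetical stand-in for prepare_labels_binary."""
    # Remove rows whose label is in the exclusion list.
    kept = df[~df[label_col].isin(list(exclude_values))].copy()
    # Any label other than the benign value counts as an attack.
    kept["is_attack"] = (kept[label_col] != benign_value).astype(int)
    return kept


# Toy data using label names that appear in the output above.
toy = pd.DataFrame({"label": ["BENIGN", "DoS Hulk", "Heartbleed", "DoS slowloris"]})
labeled = to_binary_labels(toy, exclude_values=["Heartbleed"])
print(labeled["is_attack"].value_counts().to_dict())
```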


There is a moderate (~64-36) class imbalance. Using `split_data`, we split the data into training and test sets while maintaining the class proportions of the full dataset.

In [26]:
df_train, df_test = split_data(df)

Data Split

Dataset Sizes:
  Full dataset:   691,395 rows
  Training set:   553,116 rows (80.0%)
  Test set:       138,279 rows (20.0%)

Class Balance Comparison:
----------------------------------------------------------------------
Class           Full Dataset         Training Set         Test Set            
----------------------------------------------------------------------
Benign          439,683 (63.59%)     351,746 (63.59%)     87,937 (63.59%)     
Attack          251,712 (36.41%)     201,370 (36.41%)     50,342 (36.41%)     
----------------------------------------------------------------------
✓ Stratification successful (class distribution differences <0.5%)
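`split_data` is a project helper whose code is not shown here. A stratified 80/20 split like the one reported above can be sketched with scikit-learn's `train_test_split`. The toy data and the `random_state` value below are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset with an imbalance loosely similar to the one above (65% / 35%).
toy = pd.DataFrame({
    "feature": range(100),
    "is_attack": [0] * 65 + [1] * 35,
})

# stratify= keeps the class proportions the same in both subsets.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["is_attack"], random_state=0
)

print(toy_train["is_attack"].mean(), toy_test["is_attack"].mean())
```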


## Train Models

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X_train = df_train.drop('is_attack', axis=1)
y_train = df_train['is_attack']

In [None]:
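# Note: train_classifier_cv is not imported above; import it from the
# project module that defines it before running this cell.
log_reg_class = LogisticRegression(random_state=76, max_iter=500)
log_reg_train = train_classifier_cv(
    X_train, y_train, model=log_reg_class,
    scale_features=True)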

In [None]:
rand_forest_class = RandomForestClassifier()
rand_forest_train = train_classifier_cv(
    X_train, y_train, model=rand_forest_class,
    scale_features=False)
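The definition of `train_classifier_cv` is not included in this notebook. One plausible reading of its arguments is sketched below: features are optionally standardized inside a pipeline, and the model is scored with stratified k-fold cross-validation. The helper `cv_scores`, the fold count, and the F1 scoring choice are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_scores(X, y, model, scale_features, n_splits=5):
    """Hypothetical sketch of a cross-validated training helper."""
    # Scaling inside the pipeline is refit on each training fold,
    # which avoids leaking validation-fold statistics.
    estimator = make_pipeline(StandardScaler(), model) if scale_features else model
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return cross_val_score(estimator, X, y, cv=folds, scoring="f1")


# Synthetic data with a class imbalance similar to the one in this notebook.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.64, 0.36], random_state=0
)
scores = cv_scores(X_toy, y_toy, LogisticRegression(max_iter=500), scale_features=True)
print(scores.mean())
```

Scaling matters for logistic regression because its solver is sensitive to feature magnitudes, whereas tree-based models such as random forests are not, which matches the `scale_features` settings used above.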

## Explore Data

### Destination Port
All destination port (`destination_port`) values fall within the valid range (0 to 65,535). For the vast majority of flows, the destination port is 80 (HTTP, 43%), 53 (DNS, 28%), or 443 (HTTPS, 14%). In the training data, all DoS attacks target port 80, although DoS attacks in general could target other ports. To help the model generalize, we will consider excluding `destination_port` from the model.

In [None]:
(df_train['destination_port'] < 0).sum()

In [None]:
(df_train['destination_port'] > 65535).sum()

In [None]:
df_train['destination_port'].value_counts(normalize=True)

In [None]:
df_train.groupby('is_attack')['destination_port'].value_counts()

### Flow Duration
Flow duration is provided in microseconds. For interpretability, we convert to seconds when summarizing it. A relatively small number of flows have durations of zero or less. Negative durations are impossible, and we drop all flows with non-positive durations. All flows have durations under 120 seconds, which is consistent with how the CIC-IDS2017 dataset was generated.

In [None]:
(df_train['Flow Duration'] / 1_000_000).describe()

In [None]:
(df_train['Flow Duration'] <= 0).sum()

In [None]:
df_train = df_train[df_train['Flow Duration'] > 0]

In [None]:
df_train.groupby('is_attack')['Flow Duration'].describe()
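The cells above summarize duration in seconds without storing the converted values. A minimal sketch of filtering and converting in one step is shown below on toy data. The `flow_duration_s` column name is hypothetical and does not appear in the dataset.

```python
import pandas as pd

# Toy durations in microseconds, including invalid (non-positive) values.
toy = pd.DataFrame({
    "Flow Duration": [-1, 0, 1_500_000, 119_999_999],
    "is_attack": [0, 0, 1, 1],
})

# Keep only flows with a strictly positive duration.
valid = toy[toy["Flow Duration"] > 0].copy()

# Store the duration in seconds (hypothetical column name).
valid["flow_duration_s"] = valid["Flow Duration"] / 1_000_000

print(valid.groupby("is_attack")["flow_duration_s"].describe())
```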

### Total Packets

In [None]:
df_train[['Total Fwd Packets','Total Backward Packets']].describe()

In [None]:
df_train.groupby('is_attack')[['Total Fwd Packets','Total Backward Packets']].describe()