# CIC-IDS-2017: Denial of Service Attacks

## Overview
This project builds network intrusion detection models using network flow data from the CIC-IDS-2017 benchmark dataset. This notebook explores **supervised learning for binary classification of flows as benign or denial-of-service (DoS) attacks.** A DoS attack occurs when an attacker attempts to overwhelm a target host or network with requests or connections, reducing service availability for legitimate users.

### Simplicity vs. Performance
DoS detection is a relatively easy classification problem. As shown below, random forest and gradient boosting models achieve excellent performance (precision and recall >0.95) when trained on all available features (54 after dropping artifact, duplicative, and zero-variance variables).

In this notebook, we ask:

> **Can we build simpler models (<10 features) that achieve similar performance through principled feature selection?**

A simpler model can offer faster inference, lower computational cost, easier interpretation for security analysts, and potentially more robust generalization. We will use knowledge of DoS attack signatures to select and engineer features.

### Addressing Dataset Artifacts
We decided *a priori* to drop `destination_port` from the data. In this dataset, all DoS attacks target port 80. While DoS attacks frequently target HTTP servers on port 80, they do not do so exclusively. Real-world DoS attacks commonly target ports 22 (SSH), 53 (DNS), 443 (HTTPS), 3389 (RDP), and more.

A generalizable intrusion detection system should identify attacks based on traffic behavior patterns, not port numbers. Including `destination_port` could result in a model that performs well on this dataset but fails to detect attacks on other ports in production environments.

## Import Packages and Configuration

In [1]:
%load_ext autoreload
%autoreload 2

In [54]:
import numpy as np
import pandas as pd

from src.utilities import load_data
from src.processing import prepare_labels, drop_features, clean_data, split_data
from src.training import train_models, get_feature_importance
from src.testing import eval_model

In [55]:
import yaml
with open('config/dos_train.yml', 'r') as f:
    config = yaml.safe_load(f)

## Process Data

### 1. Load Data
`load_data` loads CIC-IDS-2017 data and, by default, cleans column names (converts to lowercase, strips leading and trailing whitespace, converts inner whitespace and forward slashes to underscores).

In [46]:
df = load_data('data/raw/Wednesday-workingHours.pcap_ISCX.csv')

Load Data
----------------------------------------------------------------------
Loaded:  data/raw/Wednesday-workingHours.pcap_ISCX.csv
Rows:    692,703
Columns: 79
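
The column-name cleaning described above can be sketched as follows. This is a minimal illustration, not the project's actual `load_data` implementation; the helper name `clean_column_names` and the sample column names are assumptions made for the example.

```python
import re

import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalize column names as described in the text."""
    def _clean(name: str) -> str:
        name = name.strip().lower()          # strip outer whitespace, lowercase
        return re.sub(r"[\s/]+", "_", name)  # inner whitespace and "/" -> "_"
    return df.rename(columns=_clean)


# Illustrative raw headers with stray whitespace and a forward slash
raw = pd.DataFrame(columns=[" Destination Port", "Flow Bytes/s", " Fwd IAT Mean"])
cleaned = clean_column_names(raw)
print(list(cleaned.columns))
# ['destination_port', 'flow_bytes_s', 'fwd_iat_mean']
```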



### 2. Prepare Labels
`prepare_labels` creates the `is_attack` column with binary labels and drops the original `label` column with multi-class labels. Through `exclude_values`, we also remove an unrelated attack type (Heartbleed, which is a memory disclosure vulnerability rather than a DoS attack). The DoS attacks included are:
  - DoS Hulk (volumetric flood)
  - DoS GoldenEye (volumetric flood)
  - DoS Slowloris (slow-rate attack)
  - DoS Slowhttptest (slow-rate attack)

In [47]:
binary_map = {
    0: ['BENIGN'], 
    1: ['DoS Hulk','DoS GoldenEye','DoS slowloris','DoS Slowhttptest']
}
df = prepare_labels(df, exclude_values=['Heartbleed'], replace_values=binary_map)

Prepare Labels
----------------------------------------------------------------------
Initial Rows: 692,703

Initial Distribution:
label
BENIGN              440031
DoS GoldenEye        10293
DoS Hulk            231073
DoS Slowhttptest      5499
DoS slowloris         5796
Heartbleed              11

Removed 11 rows with labels:
- Heartbleed

Mapped old to new values:
- BENIGN              -> 0
- DoS Hulk            -> 1
- DoS GoldenEye       -> 1
- DoS slowloris       -> 1
- DoS Slowhttptest    -> 1

----------------------------------------------------------------------
Final Rows: 692,692
Total Removed: 11 (0.0016%)

Final Distribution:
original          label
BENIGN            0        440,031
DoS GoldenEye     1         10,293
DoS Hulk          1        231,073
DoS Slowhttptest  1          5,499
DoS slowloris     1          5,796
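
The exclude-then-map logic can be sketched with plain pandas. This is an assumption about how such a step might work, not the source of `prepare_labels`; the function name `to_binary_labels` and its parameters are hypothetical.

```python
import pandas as pd


def to_binary_labels(df, mapping, exclude=(), label_col="label", new_col="is_attack"):
    """Hypothetical sketch: drop excluded classes, then map the rest to 0/1."""
    kept = df[~df[label_col].isin(exclude)].copy()
    # Invert {new_value: [old_values]} into {old_value: new_value}
    lookup = {old: new for new, olds in mapping.items() for old in olds}
    unmapped = set(kept[label_col]) - set(lookup)
    if unmapped:
        # Fail loudly rather than silently producing NaN labels
        raise ValueError(f"Unmapped labels: {sorted(unmapped)}")
    kept[new_col] = kept[label_col].map(lookup).astype(int)
    return kept.drop(columns=label_col)


toy = pd.DataFrame({"label": ["BENIGN", "DoS Hulk", "Heartbleed", "DoS slowloris"]})
toy_map = {0: ["BENIGN"], 1: ["DoS Hulk", "DoS slowloris"]}
out = to_binary_labels(toy, toy_map, exclude=["Heartbleed"])
print(out["is_attack"].tolist())
# [0, 1, 1]
```

Raising on unmapped labels guards against case mismatches such as `DoS slowloris` versus `DoS Slowloris`, which would otherwise become missing values.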



### 3. Drop Features
We drop `destination_port` and other features listed below as well as features with zero variance.

In [49]:
artifact = [
    'destination_port'
]

duplicative = [
    'fwd_header_length.1',
    'average_packet_size',
    'avg_fwd_segment_size',
    'avg_bwd_segment_size',
    'subflow_fwd_packets',
    'subflow_fwd_bytes',
    'subflow_bwd_packets',
    'subflow_bwd_bytes'
]

other = [
    'init_win_bytes_forward',
    'init_win_bytes_backward',
    'act_data_pkt_fwd',
    'min_seg_size_forward',
    'down_up_ratio'
]

drop = artifact + duplicative + other

df = drop_features(df, drop=drop, rm_zv=True)

Drop Features
----------------------------------------------------------------------
Initial Columns: 79

Dropped named columns:
- destination_port
- fwd_header_length.1
- average_packet_size
- avg_fwd_segment_size
- avg_bwd_segment_size
- subflow_fwd_packets
- subflow_fwd_bytes
- subflow_bwd_packets
- subflow_bwd_bytes
- init_win_bytes_forward
- init_win_bytes_backward
- act_data_pkt_fwd
- min_seg_size_forward
- down_up_ratio

Dropped zero-variance columns:
- bwd_psh_flags
- fwd_urg_flags
- bwd_urg_flags
- cwe_flag_count
- fwd_avg_bytes_bulk
- fwd_avg_packets_bulk
- fwd_avg_bulk_rate
- bwd_avg_bytes_bulk
- bwd_avg_packets_bulk
- bwd_avg_bulk_rate

----------------------------------------------------------------------
Final Columns: 55
Dropped 14 columns
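
Zero-variance removal can be sketched as below. This is a minimal illustration assuming that "zero variance" means a column holds a single distinct value; the helper name `drop_zero_variance` is hypothetical and is not the project's `drop_features`.

```python
import pandas as pd


def drop_zero_variance(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: drop columns that hold a single distinct value."""
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)


toy = pd.DataFrame({"flow_duration": [3, 120, 45], "bwd_urg_flags": [0, 0, 0]})
reduced = drop_zero_variance(toy)
print(list(reduced.columns))
# ['flow_duration']
```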



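In total, 24 columns are removed (14 named and 10 zero-variance), leaving 55 columns: 54 features plus the `is_attack` label. The remaining columns are: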

In [50]:
df.columns

Index(['flow_duration', 'total_fwd_packets', 'total_backward_packets',
       'total_length_of_fwd_packets', 'total_length_of_bwd_packets',
       'fwd_packet_length_max', 'fwd_packet_length_min',
       'fwd_packet_length_mean', 'fwd_packet_length_std',
       'bwd_packet_length_max', 'bwd_packet_length_min',
       'bwd_packet_length_mean', 'bwd_packet_length_std', 'flow_bytes_s',
       'flow_packets_s', 'flow_iat_mean', 'flow_iat_std', 'flow_iat_max',
       'flow_iat_min', 'fwd_iat_total', 'fwd_iat_mean', 'fwd_iat_std',
       'fwd_iat_max', 'fwd_iat_min', 'bwd_iat_total', 'bwd_iat_mean',
       'bwd_iat_std', 'bwd_iat_max', 'bwd_iat_min', 'fwd_psh_flags',
       'fwd_header_length', 'bwd_header_length', 'fwd_packets_s',
       'bwd_packets_s', 'min_packet_length', 'max_packet_length',
       'packet_length_mean', 'packet_length_std', 'packet_length_variance',
       'fin_flag_count', 'syn_flag_count', 'rst_flag_count', 'psh_flag_count',
       'ack_flag_count', 'urg_flag_count', 

In [51]:
df.shape

(692692, 55)

### 4. Clean Features
`clean_data` removes rows with missing, infinite, or negative values.

In [52]:
df = clean_data(df)

Clean Data
----------------------------------------------------------------------
Initial Rows: 692,692

Removed 1,008 rows with NaN values (0.15%)
Removed 289 rows with values <0 (0.04%)
Removed 746 rows with negative values (0.11%)

----------------------------------------------------------------------
Final Rows: 690,649
Total Removed: 2,043 (0.29%)
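
A row filter of this kind can be sketched as follows. This is an illustration under the assumption that a row is dropped if any feature is NaN, infinite, or negative; the function name `remove_invalid_rows` is hypothetical and the real `clean_data` may differ.

```python
import numpy as np
import pandas as pd


def remove_invalid_rows(df: pd.DataFrame, label_col: str = "is_attack") -> pd.DataFrame:
    """Hypothetical sketch: keep rows whose features are finite and non-negative."""
    features = df.drop(columns=label_col)
    has_nan = features.isna().any(axis=1)
    has_inf = np.isinf(features).any(axis=1)
    has_neg = (features < 0).any(axis=1)
    return df[~(has_nan | has_inf | has_neg)]


toy = pd.DataFrame({
    "flow_bytes_s": [10.0, np.inf, np.nan, 5.0],
    "flow_duration": [1.0, 2.0, 3.0, -1.0],
    "is_attack": [0, 1, 0, 1],
})
valid = remove_invalid_rows(toy)
print(len(valid))
# 1
```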



### 5. Split Data
`split_data` takes the processed DataFrame, splits the data for training and testing, and returns a dictionary containing: `X_train`, `X_test`, `y_train`, `y_test`.

In [53]:
dat = split_data(df)

Split Data
----------------------------------------------------------------------
Dataset Sizes:
Full:      690,649 rows
Training:  552,519 rows (80.0%)
Test:      138,130 rows (20.0%)

Class Balance Comparison:
----------------------------------------------------------------------
Class       Full Dataset     Training Set         Test Set
----------------------------------------------------------------------
0        439,101 (63.6%)  351,281 (63.6%)   87,820 (63.6%)
1        251,548 (36.4%)  201,238 (36.4%)   50,310 (36.4%)
----------------------------------------------------------------------
Stratification successful (class distribution differences <0.5%)
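
A stratified 80/20 split can be sketched with scikit-learn's `train_test_split`. The toy data, the `random_state` value, and the use of this particular function are assumptions for illustration; the source does not show how `split_data` is implemented.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with roughly the class balance reported above (about 36% attacks)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "flow_duration": rng.random(1000),
    "is_attack": (rng.random(1000) < 0.36).astype(int),
})

X = toy.drop(columns="is_attack")
y = toy["is_attack"]

# stratify=y keeps the class proportions nearly identical in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
split = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}

print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```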



## Train Models

### Model Classes
We will compare the following types of models with hyperparameters specified by the loaded configuration file:
- Logistic regression (`sklearn.linear_model.LogisticRegression`)
- Random forest (`sklearn.ensemble.RandomForestClassifier`)
- Gradient boosting (`xgboost.XGBClassifier`)
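
The cross-validation reported in the training output can be approximated with a scikit-learn pipeline. This is a sketch on synthetic data, not the project's `train_models`: the data, the shuffled stratified folds, and the reading of "±" as the standard deviation across folds are all assumptions. The hyperparameters match those printed in the logistic regression output.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flow features (assumption, not CIC-IDS-2017 data)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.64, 0.36], random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv, scoring=["precision", "recall", "f1"])

for metric in ("precision", "recall", "f1"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.4f} (± {fold_scores.std():.4f})")
```

Placing the scaler inside the pipeline avoids leaking statistics from validation folds into training.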

### All Features

#### Performance
As mentioned, DoS detection is a relatively easy classification problem. Below, random forest and gradient boosting models achieve excellent performance during cross-validation on training data (precision and recall >0.97) with the 54 available features. Logistic regression reaches similar recall (0.98) but lower precision (0.85).

In [20]:
models_all_features = train_models(dat, models_config=config['models'])

Train Model: logistic_regression
----------------------------------------------------------------------
Model:
- Module:              sklearn.linear_model
- Class:               LogisticRegression

Hyperparameters (set explicitly):
- max_iter             1000
- class_weight         balanced
- random_state         42

Scale Features: True
----------------------------------------------------------------------
Cross-validation Results (10-Fold):
Precision: 0.8503 (± 0.0211)
Recall:    0.9827 (± 0.0019)
F1-Score:  0.9116 (± 0.0125)

Per-Fold Results:
Fold  1: Precision=0.8494, Recall=0.9819, F1=0.9109
Fold  2: Precision=0.8430, Recall=0.9829, F1=0.9075
Fold  3: Precision=0.8396, Recall=0.9814, F1=0.9049
Fold  4: Precision=0.8452, Recall=0.9832, F1=0.9090
Fold  5: Precision=0.8434, Recall=0.9817, F1=0.9073
Fold  6: Precision=0.8402, Recall=0.9834, F1=0.9062
Fold  7: Precision=0.9131, Recall=0.9873, F1=0.9488
Fold  8: Precision=0.8411, Recall=0.9817, F1=0.9060
Fold  9: Precision=0.8461, Reca

#### Feature Importance
Below, we list the 10 most important features in the random forest and XGBoost models. One might expect DoS attacks to be identified mainly by forward features (summarizing packets sent from the attacker to the server), since forward packets directly reflect the attacker's tactics, techniques, and procedures. For example, a large number or high rate of forward packets might be expected to rank highest. Instead, many of the most important features are backward features (summarizing packets sent from the server to the attacker). This may reflect deterioration of server performance under DoS attacks.

In [22]:
get_feature_importance(models_all_features, 'random_forest')[:10]

| | feature | importance |
|---|---|---|
| 33 | bwd_packets_s | 0.105052 |
| 12 | bwd_packet_length_std | 0.091858 |
| 36 | packet_length_mean | 0.07875 |
| 11 | bwd_packet_length_mean | 0.070484 |
| 9 | bwd_packet_length_max | 0.068961 |
| 38 | packet_length_variance | 0.058687 |
| 35 | max_packet_length | 0.04903 |
| 34 | min_packet_length | 0.042045 |
| 4 | total_length_of_bwd_packets | 0.036899 |
| 6 | fwd_packet_length_min | 0.035907 |


In [23]:
get_feature_importance(models_all_features, 'xgboost')[:10]

| | feature | importance |
|---|---|---|
| 36 | packet_length_mean | 0.360001 |
| 12 | bwd_packet_length_std | 0.26986 |
| 47 | active_std | 0.0759 |
| 33 | bwd_packets_s | 0.073929 |
| 39 | fin_flag_count | 0.035289 |
| 52 | idle_max | 0.018294 |
| 11 | bwd_packet_length_mean | 0.012829 |
| 44 | urg_flag_count | 0.012729 |
| 30 | fwd_header_length | 0.010325 |
| 15 | flow_iat_mean | 0.008645 |
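
A ranking like the ones above can be produced from a fitted tree ensemble's `feature_importances_` attribute. The sketch below uses synthetic data and a hypothetical helper named `importance_table`; it is not the project's `get_feature_importance`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def importance_table(model, feature_names):
    """Hypothetical sketch: rank features by a fitted model's importances."""
    table = pd.DataFrame(
        {"feature": feature_names, "importance": model.feature_importances_}
    )
    # The retained index is each feature's original column position
    return table.sort_values("importance", ascending=False)


# Synthetic stand-in data (assumption, not CIC-IDS-2017 flows)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=42
)
names = [f"feature_{i}" for i in range(5)]
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

top = importance_table(forest, names)
print(top.head(3))
```

For scikit-learn's random forest these impurity-based importances sum to 1, so each value can be read as a share of the total. They can be biased toward features with many distinct values, which is one reason to compare rankings across model types as done here.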


#### Conclusions
With 54 features, we see strong performance during cross-validation on training data (precision and recall >0.97) by random forest and gradient boosting models. Notably, some of the most important features are backward features (summarizing packets sent from the server to the client), which may reflect a deterioration of server performance under DoS attacks. Next, we consider how to maintain performance with a simpler model.

### Domain Knowledge

We begin by summarizing the tactics, techniques, and procedures of the DoS attacks in our training data and hypothesize important features to include when training a classifier.

1. **Hulk** aims to flood the server with a high volume of HTTP requests. Flows include a large number of forward packets sent over a short duration, resulting in a high forward packet rate and high active time. The server may respond with errors or resets, resulting in abnormal backward traffic patterns.

2. **Slowloris** is a "low and slow" attack that aims to overload the server by holding open connections with incomplete requests. Slowloris sends HTTP request headers with long delays between them, never sending the final `\r\n\r\n` (blank line) that would complete the headers and allow the server to process the request. Flows have low forward packet rate, long duration, high idle time, and high inter-arrival times.

3. **Slow HTTP Post** (also known as *Slow POST* or *R-U-Dead-Yet/RUDY*; labeled `DoS Slowhttptest` in the dataset) is also a "low and slow" attack. It sends complete HTTP POST request headers (including a `Content-Length` header declaring a large body size) but then sends the body data in small packets at a very slow rate, so the server keeps the connection open while it waits for the rest of the body. Like Slowloris, flows have low forward packet rate, long duration, high idle time, and high inter-arrival times.

4. **GoldenEye** is a hybrid, multi-vector attack that can combine flood, Slowloris, and Slow HTTP Post tactics simultaneously. It randomizes request characteristics for evasion. The key detection signature is aggregate behavior showing multiple attack patterns from the same source.

From these descriptions, we select the following candidate features. The descriptions suggest a variety of forward features. Based on the models above, we have included some backward features as well.

In [25]:
candidates = [
    'flow_duration',
    'total_fwd_packets',
    'total_backward_packets',
    'fwd_packet_length_mean',
    'bwd_packet_length_mean',
    'fwd_packets_s',
    'bwd_packets_s',
    'fwd_iat_mean',
    'bwd_iat_mean',
    'active_mean',
    'idle_mean',
    'fwd_header_length'
]

As shown below, `total_fwd_packets`, `total_backward_packets`, and `fwd_header_length` are almost perfectly correlated with one another (pairwise correlations >0.99), so only one of them needs to be included. We keep `total_fwd_packets` and omit `total_backward_packets` and `fwd_header_length`.

In [26]:
dat['X_train'][candidates].corr()

| | flow_duration | total_fwd_packets | total_backward_packets | fwd_packet_length_mean | bwd_packet_length_mean | fwd_packets_s | bwd_packets_s | fwd_iat_mean | bwd_iat_mean | active_mean | idle_mean | fwd_header_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| flow_duration | 1.0 | 0.028387 | 0.022119 | 0.148592 | 0.69795 | -0.195849 | -0.085514 | 0.658458 | 0.392525 | 0.122491 | 0.875292 | 0.038588 |
| total_fwd_packets | 0.028387 | 1.0 | 0.998549 | 0.006578 | 0.023148 | -0.007669 | -0.003819 | 0.00102 | 0.001466 | 0.024825 | 0.008427 | 0.998236 |
| total_backward_packets | 0.022119 | 0.998549 | 1.0 | 0.004696 | 0.021534 | -0.007805 | -0.002328 | 0.000336 | 0.000657 | 0.017547 | 0.005985 | 0.996419 |
| fwd_packet_length_mean | 0.148592 | 0.006578 | 0.004696 | 1.0 | 0.038813 | -0.083898 | -0.002317 | 0.07283 | 0.117093 | 0.054342 | 0.12942 | 0.008996 |
| bwd_packet_length_mean | 0.69795 | 0.023148 | 0.021534 | 0.038813 | 1.0 | -0.206467 | -0.084855 | 0.421819 | 0.078013 | -0.034168 | 0.797278 | 0.032891 |
| fwd_packets_s | -0.195849 | -0.007669 | -0.007805 | -0.083898 | -0.206467 | 1.0 | 0.042341 | -0.137452 | -0.082481 | -0.039467 | -0.173559 | -0.009944 |
| bwd_packets_s | -0.085514 | -0.003819 | -0.002328 | -0.002317 | -0.084855 | 0.042341 | 1.0 | -0.060018 | -0.036015 | -0.017231 | -0.075784 | -0.005254 |
| fwd_iat_mean | 0.658458 | 0.00102 | 0.000336 | 0.07283 | 0.421819 | -0.137452 | -0.060018 | 1.0 | 0.718596 | 0.018708 | 0.714656 | 0.003706 |
| bwd_iat_mean | 0.392525 | 0.001466 | 0.000657 | 0.117093 | 0.078013 | -0.082481 | -0.036015 | 0.718596 | 1.0 | 0.087205 | 0.364471 | 0.002452 |
| active_mean | 0.122491 | 0.024825 | 0.017547 | 0.054342 | -0.034168 | -0.039467 | -0.017231 | 0.018708 | 0.087205 | 1.0 | 0.000197 | 0.02748 |
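
Rather than scanning the matrix by eye, highly correlated pairs can be listed programmatically. The sketch below is a hedged illustration: the helper name `correlated_pairs`, the 0.99 default threshold, and the toy data are assumptions rather than part of the project code.

```python
import numpy as np
import pandas as pd


def correlated_pairs(X: pd.DataFrame, threshold: float = 0.99) -> pd.Series:
    """Hypothetical sketch: list feature pairs with |correlation| > threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: the second column is a near-exact multiple of the first
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "total_fwd_packets": base,
    "fwd_header_length": 20 * base + rng.normal(scale=0.01, size=500),
    "flow_duration": rng.normal(size=500),
})

pairs = correlated_pairs(toy)
print(pairs)
```

Applied to `dat['X_train'][candidates]`, a helper like this would be expected to surface the three pairs among `total_fwd_packets`, `total_backward_packets`, and `fwd_header_length`.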


We subset data accordingly.

In [27]:
not_selected = [
    'total_backward_packets',
    'fwd_header_length'
]
selected = [x for x in candidates if x not in not_selected]

dat_sub = dat.copy()
dat_sub['X_train'] = dat_sub['X_train'][selected]
dat_sub['X_test'] = dat_sub['X_test'][selected]

#### Performance
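Random forest and gradient boosting models achieve comparable performance (precision and recall >0.97) with the feature subset. Logistic regression degrades: its recall falls from 0.98 to 0.83, although its precision rises from 0.85 to 0.89.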

In [28]:
models_sub_features = train_models(dat_sub, models_config=config['models'])

Train Model: logistic_regression
----------------------------------------------------------------------
Model:
- Module:              sklearn.linear_model
- Class:               LogisticRegression

Hyperparameters (set explicitly):
- max_iter             1000
- class_weight         balanced
- random_state         42

Scale Features: True
----------------------------------------------------------------------
Cross-validation Results (10-Fold):
Precision: 0.8877 (± 0.0013)
Recall:    0.8330 (± 0.0030)
F1-Score:  0.8595 (± 0.0020)

Per-Fold Results:
Fold  1: Precision=0.8898, Recall=0.8329, F1=0.8605
Fold  2: Precision=0.8881, Recall=0.8353, F1=0.8609
Fold  3: Precision=0.8879, Recall=0.8316, F1=0.8588
Fold  4: Precision=0.8862, Recall=0.8285, F1=0.8564
Fold  5: Precision=0.8867, Recall=0.8307, F1=0.8578
Fold  6: Precision=0.8888, Recall=0.8354, F1=0.8612
Fold  7: Precision=0.8871, Recall=0.8347, F1=0.8601
Fold  8: Precision=0.8860, Recall=0.8352, F1=0.8599
Fold  9: Precision=0.8869, Reca

#### Feature Importance
Below, we list the most important features in the random forest and XGBoost models trained on the feature subset. We continue to see that backward features are most important for classification. In both models, the most important features were `bwd_packet_length_mean` and `bwd_packets_s`. Again, this may reflect deterioration of server performance under DoS attacks.

In [29]:
get_feature_importance(models_sub_features, 'random_forest')

| | feature | importance |
|---|---|---|
| 3 | bwd_packet_length_mean | 0.234941 |
| 5 | bwd_packets_s | 0.2145 |
| 2 | fwd_packet_length_mean | 0.163773 |
| 6 | fwd_iat_mean | 0.112129 |
| 4 | fwd_packets_s | 0.072385 |
| 9 | idle_mean | 0.071857 |
| 0 | flow_duration | 0.045654 |
| 1 | total_fwd_packets | 0.036819 |
| 8 | active_mean | 0.025456 |
| 7 | bwd_iat_mean | 0.022487 |


In [30]:
get_feature_importance(models_sub_features, 'xgboost')

| | feature | importance |
|---|---|---|
| 5 | bwd_packets_s | 0.423891 |
| 3 | bwd_packet_length_mean | 0.346839 |
| 2 | fwd_packet_length_mean | 0.116839 |
| 0 | flow_duration | 0.031683 |
| 1 | total_fwd_packets | 0.031308 |
| 6 | fwd_iat_mean | 0.017183 |
| 4 | fwd_packets_s | 0.011638 |
| 7 | bwd_iat_mean | 0.009481 |
| 9 | idle_mean | 0.008682 |
| 8 | active_mean | 0.002456 |


#### Conclusions
With 10 features, we continue to see strong performance during cross-validation on training data (precision and recall >0.97) by random forest and gradient boosting models. Some of the most important features are backward features.

## Test Models
Below, we see that performance on test data is excellent and consistent between models with 54 features (F1 of about 0.987 for both random forest and XGBoost) and models with 10 features (F1 of about 0.985).

### All Features

In [32]:
rnf_eval = eval_model(models_all_features['random_forest'], dat)

Test Model
----------------------------------------------------------------------
Precision: 0.9784
Recall:    0.9962
F1:        0.9872



In [33]:
xgb_eval = eval_model(models_all_features['xgboost'], dat)

Test Model
----------------------------------------------------------------------
Precision: 0.9782
Recall:    0.9953
F1:        0.9867



### Domain Knowledge

In [34]:
rnf_sub_eval = eval_model(models_sub_features['random_forest'], dat_sub)

Test Model
----------------------------------------------------------------------
Precision: 0.9766
Recall:    0.9945
F1:        0.9854



In [35]:
xgb_sub_eval = eval_model(models_sub_features['xgboost'], dat_sub)

Test Model
----------------------------------------------------------------------
Precision: 0.9764
Recall:    0.9933
F1:        0.9847
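
For reference, the three reported metrics follow from counts of true positives, false positives, and false negatives. The sketch below computes them from scratch on toy labels; the function name `precision_recall_f1` is hypothetical and this is not the project's `eval_model`. The final lines check that the F1 reported for the all-features random forest is the harmonic mean of its reported precision and recall.

```python
import numpy as np


def precision_recall_f1(y_true, y_pred):
    """Hypothetical sketch: binary metrics with the attack class (1) as positive."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # attacks correctly flagged
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # benign flows flagged
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # attacks missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
# (0.75, 0.75, 0.75)

# Harmonic mean of the reported precision (0.9784) and recall (0.9962)
reported_f1 = 2 * 0.9784 * 0.9962 / (0.9784 + 0.9962)
print(round(reported_f1, 4))
# 0.9872
```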



## Conclusion
We identified a subset of 10 features that yields test performance comparable to models trained on all 54 features. The subset includes several backward features, which may reflect deterioration of server availability under DoS attacks.