# CIC-IDS-2017: Denial of Service Attacks

## Overview
This project builds network intrusion detection models using network flow data from the CIC-IDS-2017 benchmark dataset. This notebook explores **supervised learning for binary classification of flows as benign or denial-of-service (DoS) attacks.** A DoS attack occurs when an attacker attempts to overwhelm a target host or network with requests or connections, reducing service availability for legitimate users.

### Simplicity vs. Performance
DoS detection is a relatively easy classification problem. As shown below, standard algorithms achieve excellent performance (precision and recall >0.95) with models trained on all available features (54 after dropping empty and duplicative variables; see below).

In this notebook, we ask:

> **Can we build simpler models (≤10 features) that achieve similar performance through principled feature selection?**

A simpler model can provide faster inference, lower computational costs, easier interpretability for security analysts, and potentially more robust generalization. We will use knowledge of DoS attack signatures to select and engineer features.

### Addressing Dataset Artifacts
We decided *a priori* to drop `destination_port` from the data. In this dataset, all DoS attacks target port 80. While DoS attacks frequently target HTTP servers on port 80, they do not do so exclusively. Real-world DoS attacks commonly target ports 22 (SSH), 53 (DNS), 443 (HTTPS), 3389 (RDP), and more.

A generalizable intrusion detection system should identify attacks based on traffic behavior patterns, not port numbers. The same DoS attack, for example, could target HTTP servers on port 8080. Including `destination_port` could result in a model that performs well on this dataset but fails to detect attacks on other ports in production environments.
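One way to spot this kind of artifact is to count how many distinct destination ports each class uses. The snippet below is a minimal sketch on made-up toy flows (not the real dataset); the variable names are illustrative only.

```python
import pandas as pd

# Toy flows (made-up values) mimicking the artifact: every attack row uses port 80.
flows = pd.DataFrame({
    "destination_port": [80, 80, 80, 443, 53, 80, 22],
    "label": ["dos", "dos", "dos", "benign", "benign", "benign", "benign"],
})

# Number of distinct destination ports observed for each class.
ports_per_class = flows.groupby("label")["destination_port"].nunique()

# A class seen on only one port lets a model use the port as a shortcut.
single_port_classes = ports_per_class[ports_per_class == 1].index.tolist()

print(ports_per_class.to_dict())
print(single_port_classes)
```

If a class appears in `single_port_classes`, the port is a candidate for removal before training.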

## Import Packages and Configuration

In [3]:
import numpy as np
import pandas as pd

from src.io import load_data
from src.processing import clean_data, drop_features, keep_features, prepare_labels
from src.training import split_data, train_models
from src.testing import evaluate_models
from src.finalizing import finalize_model

In [4]:
import yaml
with open('config/train_dos_binary.yml', 'r') as f:
    config = yaml.safe_load(f)

## Process Data

### 1. Load Data
`load_data()` loads CIC-IDS-2017 data and, by default, cleans column names (converts to lowercase, strips leading and trailing whitespace, converts non-alphanumeric characters like inner whitespace and forward slashes to underscores).
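The internals of `load_data()` are not shown in this notebook. As a rough sketch of the column-name cleaning described above, assuming that only whitespace and forward slashes are replaced (the dot in `fwd_header_length.1` survives in the cleaned names), a hypothetical helper could look like this:

```python
import re


def clean_column_name(name: str) -> str:
    """Hypothetical helper, not the project's implementation.

    Lowercases the name, strips leading and trailing whitespace, and replaces
    runs of inner whitespace or forward slashes with a single underscore.
    """
    name = name.strip().lower()
    return re.sub(r"[\s/]+", "_", name)


# Raw names written in the style of the dataset's CSV headers.
raw_columns = [" Destination Port", "Flow Bytes/s", " Fwd Header Length.1"]
cleaned = [clean_column_name(c) for c in raw_columns]
print(cleaned)
```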

In [5]:
df = load_data(directory='data/raw', filenames=['Wednesday-workingHours.pcap_ISCX.csv'])

Load Data
----------------------------------------------------------------------
Directory:
data/raw

Files:
Wednesday-workingHours.pcap_ISCX.csv
(692,703 rows, 79 columns)

Cleaned column names.

----------------------------------------------------------------------
Loaded Rows:    692,703
Loaded Columns: 79
Memory:         470.98 MB



### 2. Clean Data and Drop Features
We want to remove rows with missing, infinite, or negative values, as well as features that may be dataset artifacts, that duplicate other features, or that have zero variance. First, we use `clean_data()` to remove rows with missing or infinite values. Second, we use `drop_features()` to drop `destination_port`, the other features listed below, and any features with zero variance, leaving 54 features and the label column. Some of the dropped features (e.g., `init_win_bytes_forward`) contain legitimate -1 values, so dropping them first lets us run `clean_data()` again to remove rows with erroneous negative values without discarding valid flows.
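The two-pass row cleaning can be illustrated in plain pandas. This is a sketch under assumptions: `drop_bad_rows` is a hypothetical stand-in written for this example, not the project's `clean_data()`, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def drop_bad_rows(df, missing=True, infinite=True, negative=False):
    """Hypothetical stand-in for the cleaning step (not the project's clean_data())."""
    numeric = df.select_dtypes(include="number")
    bad = pd.Series(False, index=df.index)
    if missing:
        bad |= df.isna().any(axis=1)            # any NaN in the row
    if infinite:
        bad |= np.isinf(numeric).any(axis=1)    # any +/-inf in numeric columns
    if negative:
        bad |= (numeric < 0).any(axis=1)        # any negative numeric value
    return df.loc[~bad]


toy = pd.DataFrame({
    "flow_duration": [10.0, np.nan, 5.0, 7.0],
    "flow_bytes_s": [1.5, 2.0, np.inf, 3.0],
    "fwd_iat_min": [0.0, 1.0, 2.0, -3.0],
    "label": ["BENIGN", "BENIGN", "DoS Hulk", "DoS Hulk"],
})

first_pass = drop_bad_rows(toy)                          # missing and infinite only
second_pass = drop_bad_rows(first_pass, negative=True)   # then negative values
print(len(toy), len(first_pass), len(second_pass))
```

Running the negative-value check only after the columns with legitimate -1 values are gone avoids discarding otherwise valid rows.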

In [6]:
df = clean_data(df, missing=True, infinite=True, negative=False)

Clean Data
----------------------------------------------------------------------
Dropped 1,008 rows with missing values.
Dropped 289 rows with infinite values.

----------------------------------------------------------------------
Rows Before: 692,703
Rows After:  691,406



In [7]:
artifact = [
    'destination_port'
]

duplicative = [
    'fwd_header_length.1',
    'average_packet_size',
    'avg_fwd_segment_size',
    'avg_bwd_segment_size',
    'subflow_fwd_packets',
    'subflow_fwd_bytes',
    'subflow_bwd_packets',
    'subflow_bwd_bytes'
]

misc = [
    'init_win_bytes_forward',
    'init_win_bytes_backward',
    'act_data_pkt_fwd',
    'min_seg_size_forward',
    'down_up_ratio'
]

drop = artifact + duplicative + misc

df = drop_features(df, drop=drop, zero_variance=True)

Drop Features
----------------------------------------------------------------------
Dropped 14 named columns:
- destination_port
- fwd_header_length.1
- average_packet_size
- avg_fwd_segment_size
- avg_bwd_segment_size
- subflow_fwd_packets
- subflow_fwd_bytes
- subflow_bwd_packets
- subflow_bwd_bytes
- init_win_bytes_forward
- init_win_bytes_backward
- act_data_pkt_fwd
- min_seg_size_forward
- down_up_ratio

Dropped 10 columns with zero variance:
- bwd_psh_flags
- fwd_urg_flags
- bwd_urg_flags
- cwe_flag_count
- fwd_avg_bytes_bulk
- fwd_avg_packets_bulk
- fwd_avg_bulk_rate
- bwd_avg_bytes_bulk
- bwd_avg_packets_bulk
- bwd_avg_bulk_rate

----------------------------------------------------------------------
Columns Before: 79
Columns After:  55



In [8]:
df = clean_data(df, negative=True)

Clean Data
----------------------------------------------------------------------
Dropped 0 rows with missing values.
Dropped 0 rows with infinite values.
Dropped 750 rows with negative values.

----------------------------------------------------------------------
Rows Before: 691,406
Rows After:  690,656



### 3. Prepare Labels
`prepare_labels()` is used to exclude an unrelated attack type (Heartbleed, which is a memory disclosure vulnerability), collapse the remaining benign and DoS labels to binary, and clean the label values. The DoS attacks included are:
  - DoS Hulk (volumetric flood)
  - DoS GoldenEye (volumetric flood)
  - DoS Slowloris (slow-rate attack)
  - DoS Slowhttptest (slow-rate attack)
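For readers without access to `prepare_labels()`, the same idea (drop one label, then collapse the rest to binary) can be sketched with plain pandas. The toy label series is made up, and the check for unmapped labels is an addition of this sketch rather than documented behavior of the project helper.

```python
import pandas as pd

raw = pd.Series(
    ["BENIGN", "DoS Hulk", "Heartbleed", "DoS slowloris", "DoS GoldenEye"],
    name="label",
)

mapping = {
    "BENIGN": "benign",
    "DoS Hulk": "dos",
    "DoS GoldenEye": "dos",
    "DoS slowloris": "dos",
    "DoS Slowhttptest": "dos",
}

# Drop the unrelated attack type, then map the remaining labels to two classes.
kept = raw[~raw.isin(["Heartbleed"])]
binary = kept.map(mapping)

# Series.map() yields NaN for labels missing from the mapping, so verify none slipped through.
unmapped = int(binary.isna().sum())
print(binary.tolist(), unmapped)
```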

In [9]:
dos_binary_labels = {
    'BENIGN': 'benign', 
    'DoS Hulk': 'dos',
    'DoS GoldenEye': 'dos',
    'DoS slowloris': 'dos',
    'DoS Slowhttptest': 'dos'
}
df = prepare_labels(df, drop_labels=['Heartbleed'], replace_labels=dos_binary_labels)

Prepare Labels
----------------------------------------------------------------------
Dropped 7 rows with labels:
- Heartbleed

Replaced label values:
- BENIGN -> benign
- DoS Hulk -> dos
- DoS GoldenEye -> dos
- DoS slowloris -> dos
- DoS Slowhttptest -> dos

Cleaned label values.

----------------------------------------------------------------------
Label Distribution:
_raw_label        label 
BENIGN            benign    439101
DoS GoldenEye     dos        10288
DoS Hulk          dos       229965
DoS Slowhttptest  dos         5499
DoS slowloris     dos         5796

----------------------------------------------------------------------
Rows Before: 690,656
Rows After:  690,649



The remaining feature and label columns are:

In [10]:
df.columns

Index(['flow_duration', 'total_fwd_packets', 'total_backward_packets',
       'total_length_of_fwd_packets', 'total_length_of_bwd_packets',
       'fwd_packet_length_max', 'fwd_packet_length_min',
       'fwd_packet_length_mean', 'fwd_packet_length_std',
       'bwd_packet_length_max', 'bwd_packet_length_min',
       'bwd_packet_length_mean', 'bwd_packet_length_std', 'flow_bytes_s',
       'flow_packets_s', 'flow_iat_mean', 'flow_iat_std', 'flow_iat_max',
       'flow_iat_min', 'fwd_iat_total', 'fwd_iat_mean', 'fwd_iat_std',
       'fwd_iat_max', 'fwd_iat_min', 'bwd_iat_total', 'bwd_iat_mean',
       'bwd_iat_std', 'bwd_iat_max', 'bwd_iat_min', 'fwd_psh_flags',
       'fwd_header_length', 'bwd_header_length', 'fwd_packets_s',
       'bwd_packets_s', 'min_packet_length', 'max_packet_length',
       'packet_length_mean', 'packet_length_std', 'packet_length_variance',
       'fin_flag_count', 'syn_flag_count', 'rst_flag_count', 'psh_flag_count',
       'ack_flag_count', 'urg_flag_count', 

In [11]:
df.shape

(690649, 55)

### 4. Split Data
`split_data()` takes the processed DataFrame, splits the data for training and testing, and returns a dictionary containing: `X_train`, `X_test`, `y_train`, `y_test`. Stratification by label values is the default behavior. Verbose output confirms the same class balance in training and test sets.
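`split_data()` is a project helper whose internals are not shown here. A stratified split of this kind can be sketched with scikit-learn's `train_test_split`, using a test size of 0.2 and a random state of 76 (the values this notebook's split output reports). The toy DataFrame is synthetic, and the dictionary layout simply mirrors the keys described above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the class balance seen in this notebook (64% / 36%).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "flow_duration": rng.integers(1, 1000, size=1000),
    "label": ["benign"] * 640 + ["dos"] * 360,
})

X = toy.drop(columns="label")
y = toy["label"]

# stratify=y keeps the class proportions the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=76, stratify=y
)

data = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}
print(len(X_train), len(X_test))
print(round((y_train == "dos").mean(), 3), round((y_test == "dos").mean(), 3))
```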

In [12]:
data = split_data(df)

Split Data
----------------------------------------------------------------------
Test Size:    0.2
Random State: 76
Stratify:     True

Dataset Sizes:
Full:      690,649 rows
Training:  552,519 rows (80.0%)
Test:      138,130 rows (20.0%)

Class Balance Comparison:
----------------------------------------------------------------------
Class        Full Dataset     Training Set         Test Set
----------------------------------------------------------------------
benign    439,101 (63.6%)  351,281 (63.6%)   87,820 (63.6%)
dos       251,548 (36.4%)  201,238 (36.4%)   50,310 (36.4%)
----------------------------------------------------------------------
Success: Class distribution differences <0.5%



## Train Models

### Model Classes
We will compare the following types of models with hyperparameters specified by the loaded configuration file:
- Logistic regression (`sklearn.linear_model.LogisticRegression`)
- Random forest (`sklearn.ensemble.RandomForestClassifier`)
- Gradient boosting (`xgboost.XGBClassifier`)

### All Features

#### Performance
As mentioned, DoS detection is a relatively easy classification problem. Below, tree-based ensembles (random forest and XGBoost) achieve excellent performance during cross-validation on training data (precision and recall >0.97) with the 54 available features.

The verbose output includes feature importance. One might hypothesize that DoS attacks are best identified by forward features (summarizing packets sent from the client/attacker to the server), since forward packets directly reflect the attacker's tactics, techniques, and procedures. For example, we might expect a large number or high rate of forward packets to be most important. However, for the tree-based models, which perform best, many of the most important features are backward features (summarizing packets sent from the server to the client/attacker). This may reflect deterioration of server performance under DoS attacks.

In [13]:
results_all_features = train_models(data, 
                                    models=config['models'], 
                                    filename='dos_binary_results.pkl', 
                                    cv_k=5)

Train Models
----------------------------------------------------------------------
LabelEncoder():
- benign             -> 0
- dos                -> 1

----------------------------------------------------------------------
logistic_regression
----------------------------------------------------------------------
StandardScaler applied

Cross-validation Scores:
Weighted Average:
- precision: 0.9355
- recall:    0.9266
- f1_score:  0.9276

Per Class:
           class  precision  recall  f1_score  support
          benign     0.9888  0.8946    0.9394   351281
             dos     0.8423  0.9824    0.9070   201238
overall_weighted     0.9355  0.9266    0.9276   552519

Feature Importances (coefficient-based):
                    feature  importance
          fwd_header_length     43.2240
     total_backward_packets     24.3780
          total_fwd_packets     24.3640
      bwd_packet_length_min     21.3325
          bwd_header_length     21.0865
     packet_length_variance     17.3226
tota

#### Conclusions
With 54 features, we see strong performance during cross-validation on training data (precision and recall >0.97) by random forest and XGBoost models. Notably, some of the most important features are backward features (summarizing packets sent from the server to the client), which may reflect a deterioration of server performance under DoS attacks. Next, we consider how to maintain performance with a simpler model.
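As a reading aid for the cross-validation tables, the `overall_weighted` row appears to be a support-weighted mean of the per-class scores. This is an assumption about how `train_models()` aggregates, but it is consistent with the logistic regression numbers reported above, as this short calculation suggests:

```python
# Per-class scores and supports copied from the logistic regression output above.
per_class = {
    "benign": {"precision": 0.9888, "recall": 0.8946, "f1_score": 0.9394, "support": 351281},
    "dos": {"precision": 0.8423, "recall": 0.9824, "f1_score": 0.9070, "support": 201238},
}


def weighted_average(scores, metric):
    """Average a metric across classes, weighting each class by its support."""
    total_support = sum(row["support"] for row in scores.values())
    return sum(row[metric] * row["support"] for row in scores.values()) / total_support


for metric in ("precision", "recall", "f1_score"):
    print(metric, round(weighted_average(per_class, metric), 4))
```

The results agree with the reported weighted averages to about three decimal places; any difference in the last digit comes from using the already-rounded per-class values as inputs.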

### Domain Knowledge

We begin by summarizing the tactics, techniques, and procedures of the DoS attacks in our training data and hypothesize important features to include when training a classifier.

1. **Hulk** aims to flood the server with a high volume of HTTP requests. Flows include a large number of forward packets sent over a short duration, resulting in a high forward packet rate and high active time. The server may respond with errors or resets, resulting in abnormal backward traffic patterns.

2. **Slowloris** is a "low and slow" attack that aims to overload the server by holding open connections with incomplete requests. Slowloris sends HTTP request headers with long delays between them, never sending the final `\r\n\r\n` (blank line) that would complete the headers and allow the server to process the request. Flows have low forward packet rate, long duration, high idle time, and high inter-arrival times.

3. **Slow HTTP Post** (also known as *Slow POST* or *R-U-Dead-Yet/RUDY*) is also a "low and slow" attack. It sends complete HTTP POST request headers (including a `Content-Length` header declaring a large body size) but then sends the body data in small packets at a very slow rate. This causes the server to wait. Like Slowloris, flows have low forward packet rate, long duration, high idle time, and high inter-arrival times.

4. **GoldenEye** is a hybrid, multi-vector attack that can combine flood, Slowloris, and Slow HTTP Post tactics simultaneously. It randomizes request characteristics for evasion. The key detection signature is aggregate behavior showing multiple attack patterns from the same source (although we do not include source IP address).

From these descriptions, we select the following candidate features. The descriptions suggest a variety of forward features. Based on the models above, we have included some backward features as well.
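To make the contrast between flood and low-and-slow signatures concrete, the sketch below derives a forward packet rate and a mean forward inter-arrival time for two made-up flows. It assumes that flow duration is expressed in microseconds and that forward packets span the whole flow; the function name and the example values are hypothetical.

```python
def forward_rate_features(total_fwd_packets, flow_duration_us):
    """Derive simple rate features from a packet count and a duration in microseconds."""
    duration_s = flow_duration_us / 1_000_000
    # Packets per second; guard against zero-duration flows.
    fwd_packets_s = total_fwd_packets / duration_s if duration_s > 0 else 0.0
    # Mean gap between consecutive forward packets (needs at least two packets).
    if total_fwd_packets > 1:
        fwd_iat_mean_us = flow_duration_us / (total_fwd_packets - 1)
    else:
        fwd_iat_mean_us = 0.0
    return {"fwd_packets_s": fwd_packets_s, "fwd_iat_mean_us": fwd_iat_mean_us}


# Made-up flows: a flood sends many packets quickly, a slow attack sends few over a long time.
flood = forward_rate_features(total_fwd_packets=200, flow_duration_us=500_000)
slow = forward_rate_features(total_fwd_packets=10, flow_duration_us=100_000_000)

print(flood)
print(slow)
```

The flood-like flow has a high packet rate and short inter-arrival times, while the slow flow shows the opposite pattern, which is why both rate and timing features appear among the candidates.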

In [14]:
candidates = [
    'flow_duration',
    'total_fwd_packets',
    'total_backward_packets',
    'fwd_packet_length_mean',
    'bwd_packet_length_mean',
    'fwd_packets_s',
    'bwd_packets_s',
    'fwd_iat_mean',
    'bwd_iat_mean',
    'active_mean',
    'idle_mean',
    'fwd_header_length'
]

As shown below, some of these features are highly correlated (>0.99), so only one feature from each correlated group needs to be included. For this reason, we omit `total_backward_packets` and `fwd_header_length`, which are highly correlated with each other and with `total_fwd_packets`.

In [15]:
data['X_train'][candidates].corr()

| | flow_duration | total_fwd_packets | total_backward_packets | fwd_packet_length_mean | bwd_packet_length_mean | fwd_packets_s | bwd_packets_s | fwd_iat_mean | bwd_iat_mean | active_mean | idle_mean | fwd_header_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| flow_duration | 1.0 | 0.028387 | 0.022119 | 0.148592 | 0.69795 | -0.195849 | -0.085514 | 0.658458 | 0.392525 | 0.122491 | 0.875292 | 0.038588 |
| total_fwd_packets | 0.028387 | 1.0 | 0.998549 | 0.006578 | 0.023148 | -0.007669 | -0.003819 | 0.00102 | 0.001466 | 0.024825 | 0.008427 | 0.998236 |
| total_backward_packets | 0.022119 | 0.998549 | 1.0 | 0.004696 | 0.021534 | -0.007805 | -0.002328 | 0.000336 | 0.000657 | 0.017547 | 0.005985 | 0.996419 |
| fwd_packet_length_mean | 0.148592 | 0.006578 | 0.004696 | 1.0 | 0.038813 | -0.083898 | -0.002317 | 0.07283 | 0.117093 | 0.054342 | 0.12942 | 0.008996 |
| bwd_packet_length_mean | 0.69795 | 0.023148 | 0.021534 | 0.038813 | 1.0 | -0.206467 | -0.084855 | 0.421819 | 0.078013 | -0.034168 | 0.797278 | 0.032891 |
| fwd_packets_s | -0.195849 | -0.007669 | -0.007805 | -0.083898 | -0.206467 | 1.0 | 0.042341 | -0.137452 | -0.082481 | -0.039467 | -0.173559 | -0.009944 |
| bwd_packets_s | -0.085514 | -0.003819 | -0.002328 | -0.002317 | -0.084855 | 0.042341 | 1.0 | -0.060018 | -0.036015 | -0.017231 | -0.075784 | -0.005254 |
| fwd_iat_mean | 0.658458 | 0.00102 | 0.000336 | 0.07283 | 0.421819 | -0.137452 | -0.060018 | 1.0 | 0.718596 | 0.018708 | 0.714656 | 0.003706 |
| bwd_iat_mean | 0.392525 | 0.001466 | 0.000657 | 0.117093 | 0.078013 | -0.082481 | -0.036015 | 0.718596 | 1.0 | 0.087205 | 0.364471 | 0.002452 |
| active_mean | 0.122491 | 0.024825 | 0.017547 | 0.054342 | -0.034168 | -0.039467 | -0.017231 | 0.018708 | 0.087205 | 1.0 | 0.000197 | 0.02748 |
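Rather than reading a correlation matrix by eye, the highly correlated pairs can be listed programmatically. The following is a minimal sketch on synthetic data; the helper name, the 0.99 threshold default, and the rule of dropping the later column of each pair are assumptions of this example, not part of the project code.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(X, threshold=0.99):
    """List (col_a, col_b, |r|) for column pairs whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if r > threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Synthetic features: two near-duplicates of total_fwd_packets plus one unrelated column.
rng = np.random.default_rng(0)
fwd = rng.integers(1, 500, size=200).astype(float)
toy = pd.DataFrame({
    "total_fwd_packets": fwd,
    "total_backward_packets": fwd + rng.normal(0, 1, size=200),
    "fwd_header_length": fwd * 20,
    "flow_duration": rng.integers(1, 10**6, size=200).astype(float),
})

pairs = highly_correlated_pairs(toy)
# Keep the first column of each pair and mark the second one for removal.
to_drop = sorted({b for _, b, _ in pairs})
print([(a, b) for a, b, _ in pairs])
print(to_drop)
```

On the toy data this flags the same two columns that are omitted in this notebook, leaving `total_fwd_packets` as the representative of the group.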


We use `keep_features()` to subset data accordingly.

In [16]:
not_selected = [
    'total_backward_packets',
    'fwd_header_length'
]
selected = [x for x in candidates if x not in not_selected]

data_sub = keep_features(data, keep=selected)

----------------------------------------------------------------------
Keep Features
----------------------------------------------------------------------
Kept 10 columns in data['X_train'].
- flow_duration
- total_fwd_packets
- fwd_packet_length_mean
- bwd_packet_length_mean
- fwd_iat_mean
- bwd_iat_mean
- fwd_packets_s
- bwd_packets_s
- active_mean
- idle_mean

Kept 10 columns in data['X_test'].
- flow_duration
- total_fwd_packets
- fwd_packet_length_mean
- bwd_packet_length_mean
- fwd_iat_mean
- bwd_iat_mean
- fwd_packets_s
- bwd_packets_s
- active_mean
- idle_mean



#### Performance
We achieve comparable performance (precision and recall >0.97) with the feature subset using random forest and XGBoost models. We continue to see that backward features are important for classification. In both tree-based models, the most important features were `bwd_packet_length_mean` and `bwd_packets_s`. Again, this may reflect deterioration of server performance under DoS attacks.

In [17]:
results_sub_features = train_models(data_sub, 
                                    models=config['models'], 
                                    filename='dos_binary_results_sub.pkl', 
                                    cv_k=5)

Train Models
----------------------------------------------------------------------
LabelEncoder():
- benign             -> 0
- dos                -> 1

----------------------------------------------------------------------
logistic_regression
----------------------------------------------------------------------
StandardScaler applied

Cross-validation Scores:
Weighted Average:
- precision: 0.9004
- recall:    0.9008
- f1_score:  0.9001

Per Class:
           class  precision  recall  f1_score  support
          benign     0.9076  0.9397    0.9233   351281
             dos     0.8877  0.8330    0.8595   201238
overall_weighted     0.9004  0.9008    0.9001   552519

Feature Importances (coefficient-based):
               feature  importance
     total_fwd_packets     29.6749
bwd_packet_length_mean      2.1352
fwd_packet_length_mean      2.0064
             idle_mean      1.0300
         fwd_packets_s      0.9361
           active_mean      0.7808
         flow_duration      0.4904
    

#### Conclusions
With 10 features, we continue to see strong performance during cross-validation on training data (precision and recall >0.97) by random forest and XGBoost models. Some of the most important features were backward features.

## Test Models
Below, for random forest and XGBoost models, performance on test data is excellent (>0.97) and consistent between models with 54 features and models with 10 features. For logistic regression, recall declines notably from the model with all features (0.76) to the model with the feature subset (0.37).

### All Features

In [None]:
evaluate_all = evaluate_models(results_all_features, data)

### Domain Knowledge

In [None]:
evaluate_sub = evaluate_models(results_sub_features, data_sub)

## Finalize Model
Because our goal is a simpler model, we select XGBoost as the best-performing model on the feature subset. We use `finalize_model()` to retrain the model on the full dataset and save the final model, together with the encoder and feature names, as a `skops` file for deployment.

In [None]:
final_model = finalize_model(results=results_sub_features, 
                             model_name='xgboost', 
                             data=data_sub, 
                             filename='dos_binary_final_sub.skops')

## Conclusion
We identified a 10-feature subset that achieves performance comparable to models trained on all 54 features. The subset includes several backward features, which likely reflect deterioration of server availability under DoS attacks. The final model has been saved.