# Network Forensic Analysis Tool on CIC-IDS-2017  
### Random Forest + RFE (Top 20 Features, Highly Correlated Features Removed)

This notebook builds a **forensic-style network log analysis tool** using the **CIC-IDS-2017** dataset:

- Loads and merges multiple CIC-IDS-2017 CSV files (Monday, Wednesday, Friday PortScan, Friday DDoS)
- Cleans the data (NaNs, infinities, constant columns)
- Creates a binary label: **0 = BENIGN, 1 = ATTACK**
- Removes **highly correlated (multicollinear) features**
- Uses **Recursive Feature Elimination (RFE)** with **RandomForest** to select the **top 20 features**
- Trains a final **RandomForest** model on these 20 features
- Evaluates the model (accuracy, classification report, confusion matrix)
- Picks **one random flow** and predicts whether it is **BENIGN** or **ATTACK** (for forensic triage)


## 1. Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## 2. Load and Merge CIC-IDS-2017 CSV Files

We load four of the CIC-IDS-2017 per-day CSV files:

- Monday-WorkingHours.pcap_ISCX.csv (mostly BENIGN)
- Wednesday-workingHours.pcap_ISCX.csv (mixed traffic)
- Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
- Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv

Update the paths below if your files are in a different folder.


In [2]:
# List of CIC-IDS-2017 CSV files (update paths if needed)
files = [
    "../data/Monday-WorkingHours.pcap_ISCX.csv",
    "../data/Wednesday-workingHours.pcap_ISCX.csv",
    "../data/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
    "../data/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
]
files

['../data/Monday-WorkingHours.pcap_ISCX.csv',
 '../data/Wednesday-workingHours.pcap_ISCX.csv',
 '../data/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv',
 '../data/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv']

In [3]:
def load_cic_file(path: str) -> pd.DataFrame:
    """Load a CIC-IDS-2017 CSV and normalize column names."""
    print(f"Loading: {path}")
    df_tmp = pd.read_csv(path)
    df_tmp.columns = df_tmp.columns.str.strip()
    print("  Shape:", df_tmp.shape)
    return df_tmp

dfs = [load_cic_file(f) for f in files]

len(dfs)

Loading: ../data/Monday-WorkingHours.pcap_ISCX.csv
  Shape: (529918, 79)
Loading: ../data/Wednesday-workingHours.pcap_ISCX.csv
  Shape: (692703, 79)
Loading: ../data/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
  Shape: (286467, 79)
Loading: ../data/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
  Shape: (225745, 79)


4
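
The `str.strip()` call in `load_cic_file` matters because CIC-IDS-2017 CSV exports are often reported to carry leading spaces in their header names. The sketch below reproduces that situation with an in-memory CSV; the header text and values are made up for illustration and are not read from the real files.

```python
import io
import pandas as pd

# Toy CSV whose header names carry leading spaces (illustrative values only).
raw = io.StringIO(" Destination Port, Flow Duration, Label\n80,1200,BENIGN\n")

df_raw = pd.read_csv(raw)
print("before strip:", list(df_raw.columns))

# Same normalization as load_cic_file: remove surrounding whitespace.
df_raw.columns = df_raw.columns.str.strip()
print("after strip :", list(df_raw.columns))
```

Without the strip, a lookup such as `df["Label"]` would raise a `KeyError` on a frame whose column is actually named `" Label"`.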

## 3. Align Columns and Concatenate

We keep only the columns that are common across all files, including the **Label** column, and then merge them into a single dataframe.
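
As a minimal sketch of the alignment step, the toy frames below stand in for two daily files with slightly different column sets. The frame names and the `Extra A` column are hypothetical.

```python
import pandas as pd

# Two toy "daily" frames; only day_a has the extra column.
day_a = pd.DataFrame({"Flow Duration": [1], "Label": ["BENIGN"], "Extra A": [0]})
day_b = pd.DataFrame({"Label": ["DDoS"], "Flow Duration": [2]})

# Columns present in every frame, sorted for a stable order.
common = sorted(set(day_a.columns) & set(day_b.columns))

# Select the shared columns in the same order, then stack the rows.
merged = pd.concat([day_a[common], day_b[common]], ignore_index=True)
print(common)
print(merged)
```

Selecting the shared columns before concatenating avoids the NaN-filled columns that `pd.concat` would otherwise create for names missing from some frames.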


In [11]:
# Find common columns across all dataframes
common_cols = set(dfs[0].columns)
for d in dfs[1:]:
    common_cols = common_cols.intersection(set(d.columns))

common_cols = sorted(list(common_cols))
print("Number of common columns:", len(common_cols))
print("Sample common columns:", common_cols[:20])

# Keep only common columns and concatenate
dfs_common = [d[common_cols].copy() for d in dfs]
df_full = pd.concat(dfs_common, axis=0, ignore_index=True)
print("Merged shape:", df_full.shape)

# Keep a reproducible 10% random sample of the merged rows
df_full = df_full.sample(frac=0.10, random_state=52)
df_full.head()

Number of common columns: 79
Sample common columns: ['ACK Flag Count', 'Active Max', 'Active Mean', 'Active Min', 'Active Std', 'Average Packet Size', 'Avg Bwd Segment Size', 'Avg Fwd Segment Size', 'Bwd Avg Bulk Rate', 'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk', 'Bwd Header Length', 'Bwd IAT Max', 'Bwd IAT Mean', 'Bwd IAT Min', 'Bwd IAT Std', 'Bwd IAT Total', 'Bwd PSH Flags', 'Bwd Packet Length Max', 'Bwd Packet Length Mean']
Merged shape: (1734833, 79)


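| | ACK Flag Count | Active Max | Active Mean | Active Min | Active Std | Average Packet Size | Avg Bwd Segment Size | Avg Fwd Segment Size | Bwd Avg Bulk Rate | Bwd Avg Bytes/Bulk | ... | Subflow Bwd Packets | Subflow Fwd Bytes | Subflow Fwd Packets | Total Backward Packets | Total Fwd Packets | Total Length of Bwd Packets | Total Length of Fwd Packets | URG Flag Count | act_data_pkt_fwd | min_seg_size_forward |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1159211 | 0 | 0 | 0.0 | 0 | 0.0 | 60.0 | 72.0 | 32.0 | 0 | 0 | ... | 2 | 64 | 2 | 2 | 2 | 144 | 64 | 0 | 1 | 20 |
| 601501 | 0 | 7005579 | 7005579.0 | 7005579 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0 | 0 | 7 | 0 | 7 | 0 | 0 | 0 | 0 | 40 |
| 1383497 | 0 | 0 | 0.0 | 0 | 0.0 | 3.0 | 6.0 | 0.0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 6 | 0 | 0 | 0 | 40 |
| 1281978 | 1 | 0 | 0.0 | 0 | 0.0 | 37.833333 | 46.0 | 27.0 | 0 | 0 | ... | 1 | 135 | 5 | 1 | 5 | 46 | 135 | 0 | 4 | 20 |
| 1396436 | 0 | 0 | 0.0 | 0 | 0.0 | 3.0 | 6.0 | 0.0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 6 | 0 | 0 | 0 | 40 |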


## 4. Data Cleaning and Binary Label Creation

Steps:

1. Drop fully empty columns.  
2. Replace infinities with NaN and drop rows with NaN.  
3. Normalize the `Label` text.  
4. Create a binary label `Attack_Binary`:  
   - `0` → BENIGN  
   - `1` → any attack label (DDoS, PortScan, etc.).
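
The four steps above can be sketched on a toy frame before running them on the full data. The column names and values here are invented for illustration; the label mapping uses the same substring rule as the notebook (any label containing `BENIGN` maps to 0).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for CIC-IDS-2017 rows (made-up values).
toy = pd.DataFrame({
    "Flow Bytes/s": [1200.0, np.inf, 350.0, np.nan],
    "Empty Col": [np.nan, np.nan, np.nan, np.nan],
    "Label": [" BENIGN", "DDoS", "PortScan ", "BENIGN"],
})

# 1. Drop columns that are entirely NaN.
toy = toy.dropna(axis=1, how="all")

# 2. Treat +/- infinity as missing, then drop incomplete rows.
toy = toy.replace([np.inf, -np.inf], np.nan).dropna()

# 3. Strip stray whitespace from the label text.
toy["Label"] = toy["Label"].astype(str).str.strip()

# 4. Labels containing BENIGN -> 0, anything else -> 1.
is_benign = toy["Label"].str.upper().str.contains("BENIGN")
toy["Attack_Binary"] = (~is_benign).astype(int)

print(toy)
```

Only the first and third rows survive: the second is dropped because of the infinite rate and the fourth because of the missing value.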


In [12]:
df = df_full.copy()

# Drop fully empty columns
df = df.dropna(axis=1, how="all")

# Replace infinities and drop NaNs
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()

# Normalize Label text
df["Label"] = df["Label"].astype(str).str.strip()

# Binary label mapping
def map_attack_binary(label: str) -> int:
    label = label.upper()
    if "BENIGN" in label:
        return 0
    else:
        return 1

df["Attack_Binary"] = df["Label"].apply(map_attack_binary)

print("After cleaning shape:", df.shape)
print("Binary label counts:")
df["Attack_Binary"].value_counts()

After cleaning shape: (173255, 80)
Binary label counts:


Attack_Binary
0    119627
1     53628
Name: count, dtype: int64

## 5. Keep Only Numeric, Non-Constant Features

We keep only numeric columns and remove any feature that has the same value for all rows (constant).
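
A small sketch of the same filter on a toy frame, with hypothetical column names: `select_dtypes` removes the text column and `nunique() > 1` removes the column that never changes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "duration": [10, 250, 31],
    "const_flag": [0, 0, 0],                 # constant: same value in every row
    "Label": ["BENIGN", "DDoS", "BENIGN"],   # non-numeric
})

# Numeric columns only, then discard those with a single distinct value.
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
kept = [c for c in numeric_cols if toy[c].nunique() > 1]
print(kept)
```

A constant column carries no information for a tree split, so removing it changes nothing about what the model can learn while shrinking the correlation matrix computed later.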

In [13]:
# Keep only numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Remove constant columns
numeric_nonconst = [c for c in numeric_cols if df[c].nunique() > 1]

# Ensure Attack_Binary is included
if "Attack_Binary" not in numeric_nonconst:
    numeric_nonconst.append("Attack_Binary")

df_num = df[numeric_nonconst].copy()
print("Numeric non-constant columns (including label):", len(numeric_nonconst))
df_num.head()

Numeric non-constant columns (including label): 69


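| | ACK Flag Count | Active Max | Active Mean | Active Min | Active Std | Average Packet Size | Avg Bwd Segment Size | Avg Fwd Segment Size | Bwd Header Length | Bwd IAT Max | ... | Subflow Fwd Bytes | Subflow Fwd Packets | Total Backward Packets | Total Fwd Packets | Total Length of Bwd Packets | Total Length of Fwd Packets | URG Flag Count | act_data_pkt_fwd | min_seg_size_forward | Attack_Binary |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1159211 | 0 | 0 | 0.0 | 0 | 0.0 | 60.0 | 72.0 | 32.0 | 40 | 4 | ... | 64 | 2 | 2 | 2 | 144 | 64 | 0 | 1 | 20 | 0 |
| 601501 | 0 | 7005579 | 7005579.0 | 7005579 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0 | 7 | 0 | 7 | 0 | 0 | 0 | 0 | 40 | 1 |
| 1383497 | 0 | 0 | 0.0 | 0 | 0.0 | 3.0 | 6.0 | 0.0 | 20 | 0 | ... | 0 | 1 | 1 | 1 | 6 | 0 | 0 | 0 | 40 | 1 |
| 1281978 | 1 | 0 | 0.0 | 0 | 0.0 | 37.833333 | 46.0 | 27.0 | 20 | 0 | ... | 135 | 5 | 1 | 5 | 46 | 135 | 0 | 4 | 20 | 0 |
| 1396436 | 0 | 0 | 0.0 | 0 | 0.0 | 3.0 | 6.0 | 0.0 | 20 | 0 | ... | 0 | 1 | 1 | 1 | 6 | 0 | 0 | 0 | 40 | 1 |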


## 6. Remove Multicollinearity (Highly Correlated Features)

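To avoid redundancy and multicollinearity, we:

- Compute the absolute Pearson correlation matrix between features.  
- For each pair of features whose absolute correlation is **> 0.90**, drop the one that appears later in the column order.

This keeps the feature set compact, shortens the RFE run, and makes feature importances easier to interpret, because importance is no longer split across near-duplicate columns. It can also reduce overfitting, although tree ensembles are fairly tolerant of correlated inputs.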
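
To see how the upper-triangle filter behaves, here is a minimal sketch on synthetic data. The columns `a`, `a_scaled`, and `b` are hypothetical: `a_scaled` is a linear copy of `a`, while `b` is drawn independently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3.0 * a + 1.0,   # linear copy of "a", so |corr| is ~1
    "b": b,                      # independent of "a"
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is dropped if it correlates > 0.90 with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]
print(to_drop)
```

Because only the later column of each correlated pair is dropped, the result depends on column order. In this notebook the columns were sorted alphabetically during the merge, so the alphabetically earlier feature of each pair is the one that survives.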


In [14]:
feature_cols_base = [c for c in df_num.columns if c != "Attack_Binary"]

corr_matrix = df_num[feature_cols_base].corr().abs()

# Upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Features to drop
to_drop = [col for col in upper.columns if any(upper[col] > 0.90)]

print("Dropping due to multicollinearity:", len(to_drop), "features")

df_base = df_num.drop(columns=to_drop)
base_features = [c for c in df_base.columns if c != "Attack_Binary"]
print("Features remaining after multicollinearity removal:", len(base_features))

Dropping due to multicollinearity: 33 features
Features remaining after multicollinearity removal: 35


## 7. Train/Test Split

We now split the cleaned, decorrelated feature set into training and test sets.

- 70% for training  
- 30% for testing  
- Stratified by `Attack_Binary` so both splits keep the same BENIGN/ATTACK ratio.
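
The effect of `stratify` can be checked on toy labels. This sketch assumes a 70/30 class mix chosen only for illustration; it is not the notebook's actual class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 700 benign (0) and 300 attack (1).
y_toy = np.array([0] * 700 + [1] * 300)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

# With stratification, the attack share is matched in both splits.
print("train attack share:", y_tr.mean())
print("test attack share :", y_te.mean())
```

Without `stratify`, the two shares would differ by sampling noise, which matters more as the minority class gets rarer.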


In [15]:
X_base = df_base[base_features].values
y = df_base["Attack_Binary"].values

X_train_base, X_test_base, y_train, y_test = train_test_split(
    X_base, y, test_size=0.30, random_state=42, stratify=y
)

print("Train shape:", X_train_base.shape)
print("Test shape:", X_test_base.shape)

Train shape: (121278, 35)
Test shape: (51977, 35)


## 8. Feature Selection with RFE (Top 20 Features)

We use **Recursive Feature Elimination (RFE)** with a **RandomForest** estimator to select the **top 20 features**:

- RFE repeatedly fits the model and removes the least important feature (`step=1`) until the target count is reached.  
- We keep at most 20 features (or fewer if fewer than 20 remain after cleaning).
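
The elimination loop is easier to follow on a small synthetic problem. The sketch below uses `make_classification` with 8 features and keeps 3; the data, the forest size, and the variable names are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 8 features, of which 3 are informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0, shuffle=False,
)

rfe_toy = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=3,
    step=1,  # drop one feature per refit
)
rfe_toy.fit(X_toy, y_toy)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept
# features and grows with how early a feature was eliminated.
print("kept mask:", rfe_toy.support_)
print("ranking  :", rfe_toy.ranking_)
```

With `step=1`, going from 35 features to 20 in the notebook requires 15 elimination rounds plus a final fit, each training a 200-tree forest, which is why this cell is the slowest in the notebook.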

In [16]:
rf_estimator = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

n_features_to_select = min(20, X_train_base.shape[1])
rfe = RFE(
    estimator=rf_estimator,
    n_features_to_select=n_features_to_select,
    step=1
)

rfe.fit(X_train_base, y_train)

selected_mask = rfe.support_
selected_features = [f for f, m in zip(base_features, selected_mask) if m]

print(f"Selected TOP {n_features_to_select} features via RFE:")
for f in selected_features:
    print("-", f)

# Reduce train/test to selected features only
X_train_sel = X_train_base[:, selected_mask]
X_test_sel = X_test_base[:, selected_mask]

print("Train/Test shapes (selected features):", X_train_sel.shape, X_test_sel.shape)

Selected TOP 20 features via RFE:
- Average Packet Size
- Avg Fwd Segment Size
- Bwd Header Length
- Bwd Packet Length Min
- Bwd Packets/s
- Destination Port
- Flow Bytes/s
- Flow Duration
- Flow IAT Max
- Flow IAT Mean
- Flow Packets/s
- Fwd IAT Mean
- Fwd IAT Min
- Fwd Packet Length Min
- Init_Win_bytes_backward
- Init_Win_bytes_forward
- Min Packet Length
- PSH Flag Count
- Subflow Bwd Bytes
- Subflow Fwd Bytes
Train/Test shapes (selected features): (121278, 20) (51977, 20)


## 9. Train Final RandomForest on Selected Features

We now train a **RandomForestClassifier** using only the 20 selected features and evaluate it on the test set:

- Accuracy  
- Precision, Recall, F1-score (classification report)  
- Confusion matrix

In [17]:
rf_final = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

rf_final.fit(X_train_sel, y_train)

y_pred = rf_final.predict(X_test_sel)
acc = accuracy_score(y_test, y_pred)

print(f"=== RandomForest (Top {n_features_to_select} Features via RFE) Evaluation ===")
print(f"Accuracy: {acc:.4f}\n")
print(classification_report(y_test, y_pred, target_names=["BENIGN", "ATTACK"]))

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

=== RandomForest (Top 20 Features via RFE) Evaluation ===
Accuracy: 0.9993

              precision    recall  f1-score   support

      BENIGN       1.00      1.00      1.00     35888
      ATTACK       1.00      1.00      1.00     16089

    accuracy                           1.00     51977
   macro avg       1.00      1.00      1.00     51977
weighted avg       1.00      1.00      1.00     51977

Confusion Matrix:
 [[35871    17]
 [   17 16072]]
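
Because the classification report rounds to two decimals, every score shows as 1.00. As a worked check, the per-class rates can be recomputed directly from the confusion matrix printed above; the variable names in this sketch are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted,
# in the order [BENIGN, ATTACK].
cm = np.array([[35871, 17],
               [17, 16072]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_attack = tp / (tp + fp)       # share of ATTACK alerts that are real
recall_attack = tp / (tp + fn)          # share of real attacks that are caught
false_positive_rate = fp / (fp + tn)    # share of benign flows flagged

print(f"accuracy           : {accuracy:.4f}")
print(f"precision (ATTACK) : {precision_attack:.4f}")
print(f"recall (ATTACK)    : {recall_attack:.4f}")
print(f"false positive rate: {false_positive_rate:.6f}")
```

For triage, the two off-diagonal counts are the relevant ones: 17 benign flows were flagged as attacks and 17 attack flows were missed out of 51977 test flows.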


## 10. Forensic-Style Single Flow Prediction

To simulate a **forensic analyst** inspecting a single network flow:

1. We pick **one random row** from `df_base`. Because this frame contains both training and test rows, the flow may have been seen during training, so this step is a demonstration rather than an additional evaluation.  
2. Extract the same base features and apply the RFE mask.  
3. Use the trained RandomForest to predict whether this flow is **BENIGN** or **ATTACK**.  
4. Show the predicted class probabilities as a confidence score.
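
The steps above could be wrapped in a small reusable helper. The sketch below is one possible shape for it: `triage_flow` is a hypothetical function, and the model and mask are toy stand-ins for `rf_final` and `selected_mask`, trained on synthetic data rather than CIC-IDS-2017.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def triage_flow(model, flow_features, mask):
    """Hypothetical helper: return (label, attack probability) for one flow."""
    # Shape the flow as a single-row matrix and keep only the selected features.
    row = np.asarray(flow_features, dtype=float).reshape(1, -1)[:, mask]
    proba = model.predict_proba(row)[0]
    # predict_proba columns follow model.classes_, so look up class 1 explicitly.
    attack_idx = list(model.classes_).index(1)
    label = "ATTACK" if model.predict(row)[0] == 1 else "BENIGN"
    return label, float(proba[attack_idx])

# Toy stand-ins for rf_final and selected_mask (synthetic data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)      # "attack" when feature 0 is positive
mask_toy = np.array([True, False, True, False])

model_toy = RandomForestClassifier(n_estimators=20, random_state=0)
model_toy.fit(X_toy[:, mask_toy], y_toy)

label, p_attack = triage_flow(model_toy, [2.5, 0.0, 0.0, 0.0], mask_toy)
print(label, round(p_attack, 3))
```

Looking up the attack column through `model.classes_` avoids relying on the probability columns being ordered `[BENIGN, ATTACK]`, which holds for 0/1 labels but is worth making explicit in reusable code.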


In [19]:
# Sample one random flow from the base dataframe
sample = df_base.sample(1, random_state=None)

# Extract base features and apply RFE mask
sample_X_base = sample[base_features].values
sample_X_sel = sample_X_base[:, selected_mask]

# Predict
sample_pred = rf_final.predict(sample_X_sel)[0]
sample_proba = rf_final.predict_proba(sample_X_sel)[0]
sample_actual = sample["Attack_Binary"].iloc[0]

print(f"=== Random Flow Prediction (RF + RFE Top {n_features_to_select}) ===")
print("Actual Label   :", "ATTACK" if sample_actual == 1 else "BENIGN")
print("Predicted Label:", "ATTACK" if sample_pred == 1 else "BENIGN")
print("Probabilities  -> BENIGN: {:.4f}, ATTACK: {:.4f}".format(sample_proba[0], sample_proba[1]))

print("\nSample row (all base features + label):")
sample

=== Random Flow Prediction (RF + RFE Top 20) ===
Actual Label   : BENIGN
Predicted Label: BENIGN
Probabilities  -> BENIGN: 1.0000, ATTACK: 0.0000

Sample row (all base features + label):


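| | ACK Flag Count | Active Max | Active Mean | Active Std | Average Packet Size | Avg Fwd Segment Size | Bwd Header Length | Bwd IAT Max | Bwd IAT Mean | Bwd IAT Std | ... | Fwd Packet Length Min | Idle Std | Init_Win_bytes_backward | Init_Win_bytes_forward | Min Packet Length | PSH Flag Count | Subflow Bwd Bytes | Subflow Fwd Bytes | URG Flag Count | Attack_Binary |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1146427 | 0 | 0 | 0.0 | 0.0 | 61.0 | 36.0 | 40 | 48 | 48.0 | 0.0 | ... | 36 | 0.0 | -1 | -1 | 36 | 0 | 136 | 72 | 0 | 0 |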
