# Network Traffic Anomaly Detection – Supervised Model Training

In this notebook, we train supervised machine learning models to detect anomalous or malicious network traffic in the cleaned CICIDS2017 dataset, and we save held-out validation and test sets for later evaluation.

Specifically, we will:

- Reuse the data preparation pipeline from `preprocess.py` to ensure consistency.
- Train three supervised classifiers:
  - Random Forest
  - Decision Tree
  - XGBoost

The goal of this step is a binary classifier that distinguishes normal traffic from attack traffic.
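
The helper `train_model` is imported from `train_model.py`, and its body is not shown in this notebook. As a rough illustration of what such a dispatcher could look like, here is a minimal sketch. The function name `train_model_sketch`, the hyperparameters, and the synthetic demo data are all assumptions, not the project's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def train_model_sketch(x_train, y_train, model_type="random_forest", random_state=42):
    """Hypothetical stand-in for train_model: pick an estimator by name and fit it."""
    if model_type == "random_forest":
        model = RandomForestClassifier(n_estimators=100, random_state=random_state, n_jobs=-1)
    elif model_type == "decision_tree":
        model = DecisionTreeClassifier(random_state=random_state)
    elif model_type == "xgboost":
        # Imported lazily so the sketch still runs when xgboost is not installed.
        from xgboost import XGBClassifier
        model = XGBClassifier(eval_metric="logloss", random_state=random_state)
    else:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    model.fit(x_train, y_train)
    return model


# Small synthetic binary problem, used only to show the call pattern.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = train_model_sketch(X_demo, y_demo, model_type="decision_tree")
demo_predictions = demo_model.predict(X_demo)
```

Keeping the model choice behind a single string argument lets the three training cells share one call signature.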


In [2]:
import sys
import os

sys.path.append(os.path.abspath("../src"))

from utils import load_data_files, save_object
from preprocess import (
    clean_data,
    handle_infinite_values,
    clean_data_2,
    apply_and_save_scaler,
    separate_features_and_target,
    split_data
)
from train_model import train_model

In [3]:
# Load and prepare the dataset

# Load the selected CICIDS2017 CSV files
file_paths = [
    "../data/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",     # DDoS
    "../data/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv", # Port Scan
    "../data/Tuesday-WorkingHours.pcap_ISCX.csv",                   # Brute Force (FTP & SSH)
    "../data/Wednesday-workingHours.pcap_ISCX.csv"                  # DoS (Slowloris, Hulk, etc.)
]
df = load_data_files(file_paths)
print(df["Attack"].value_counts())

# Clean the dataset
df = clean_data(df)

# Separate features and target variable
X, y = separate_features_and_target(df)

# Handle infinite values first
X_clean = handle_infinite_values(X)

# Remove features with low variance
X_clean = clean_data_2(X_clean, 0.01)

# Scale the features
X_scaled = apply_and_save_scaler(X_clean, '../models/scalers/2.1_scaler.pkl')

# Split the dataset into training, validation, and test sets (two-stage split)
x_temp, x_test, y_temp, y_test = split_data(X_scaled, y)
x_train, x_val, y_train, y_val = split_data(x_temp, y_temp)

# Save the testing and validation sets for later use
save_object(x_test, "../data/dataset/1_x_test.pkl")
save_object(y_test, "../data/dataset/1_y_test.pkl")
save_object(x_val, "../data/dataset/1_x_val.pkl")
save_object(y_val, "../data/dataset/1_y_val.pkl")

Attack
0    124023
1     75977
Name: count, dtype: int64
Removed rows with missing values. Remaining rows: 199906
Checking for infinite values in the dataset:
202
VarianceThreshold removed 6 low-variance features
Remaining features: 64
Using StandardScaler for scaling features.
Scaler saved as ../models/scalers/2.1_scaler.pkl.
Data split into training and testing sets.
Data split into training and testing sets.
Object saved to ../data/dataset/1_x_test.pkl.
Object saved to ../data/dataset/1_y_test.pkl.
Object saved to ../data/dataset/1_x_val.pkl.
Object saved to ../data/dataset/1_y_val.pkl.
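
For readers without access to `preprocess.py`, the preparation steps above can be approximated with pandas and scikit-learn directly. The sketch below rests on several assumptions: the data and column names are synthetic, rows containing infinite values are dropped (the real `handle_infinite_values` may impute them instead), and both splits use a stratified 20% hold-out (the real `split_data` ratio is not shown). Unlike the cell above, this variant fits the scaler on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (hypothetical column names).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "flow_duration": rng.normal(size=200),
    "flow_bytes_per_s": rng.normal(size=200),
    "constant_flag": np.zeros(200),  # zero variance, removed in step 2
})
X.loc[5, "flow_bytes_per_s"] = np.inf  # mimic a divide-by-zero artifact
y = pd.Series(rng.integers(0, 2, size=200), name="Attack")

# 1. Replace +/-inf with NaN, then drop the affected rows from X and y together.
X = X.replace([np.inf, -np.inf], np.nan)
finite_rows = X.notna().all(axis=1)
X, y = X[finite_rows], y[finite_rows]

# 2. Keep only features whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# 3. Two-stage stratified split: hold out a test set, then a validation set.
x_temp, x_test, y_temp, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42
)
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.2, stratify=y_temp, random_state=42
)

# 4. Fit the scaler on the training split, then apply it to all three splits.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_val_scaled = scaler.transform(x_val)
x_test_scaled = scaler.transform(x_test)
```

Fitting the scaler after the split keeps validation and test statistics out of the preprocessing, so the held-out scores are less likely to be optimistic.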


In [4]:
# Train the model - Random Forest
model_rf = train_model(x_train, y_train, model_type='random_forest')

In [5]:
# Save the model - Random Forest
save_object(model_rf, "../models/random_forest_model.pkl")

Object saved to ../models/random_forest_model.pkl.


In [6]:
# Train the model - Decision Tree
model_dt = train_model(x_train, y_train, model_type='decision_tree')

In [7]:
# Save the model - Decision Tree
save_object(model_dt, "../models/decision_tree_model.pkl")

Object saved to ../models/decision_tree_model.pkl.


In [8]:
# Train the model - XGBoost
model_xgb = train_model(x_train, y_train, model_type='xgboost')

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
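
The warning above suggests that `use_label_encoder` is being passed to XGBoost, which recent releases ignore. One way to silence it is to drop that key before constructing the classifier. The sketch below is hedged: the `xgb_params` dictionary and its values are hypothetical, since the arguments used inside `train_model` are not shown in this notebook.

```python
# Hypothetical keyword arguments, as train_model might pass them (an assumption).
xgb_params = {
    "n_estimators": 100,
    "eval_metric": "logloss",
    "use_label_encoder": False,
}

# Drop the key that recent XGBoost releases report as "not used".
xgb_params.pop("use_label_encoder", None)

try:
    from xgboost import XGBClassifier
    xgb_demo = XGBClassifier(**xgb_params)
except ImportError:
    # xgboost is optional for this sketch; the parameter cleanup still applies.
    xgb_demo = None
```

The warning does not affect the trained model, so this change only cleans up the log output.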


In [9]:
# Save the model - XGBoost
save_object(model_xgb, "../models/xgboost_model.pkl")

Object saved to ../models/xgboost_model.pkl.
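
The saved models and the held-out validation set can later be reloaded for evaluation. The following is a minimal sketch of that round trip under stated assumptions: `save_object` is assumed to write plain pickle files, `load_object` is a hypothetical counterpart, the data is synthetic, and the class names assume that label 0 is benign and label 1 is attack.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def load_object(path):
    """Hypothetical counterpart to save_object, assuming plain pickle files."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Synthetic binary data standing in for the scaled CICIDS2017 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
x_tr, x_val_demo, y_tr, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "decision_tree_model.pkl"
    # Save a fitted model, then load it back as a later notebook would.
    with open(model_path, "wb") as f:
        pickle.dump(DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr), f)
    reloaded_model = load_object(model_path)

# Score the reloaded model on the held-out validation split.
y_val_pred = reloaded_model.predict(x_val_demo)
val_report = classification_report(
    y_val_demo, y_val_pred, target_names=["benign", "attack"]
)
val_f1 = f1_score(y_val_demo, y_val_pred)
print(val_report)
```

With 124023 versus 75977 samples per class in the loaded data, per-class precision and recall are more informative than accuracy alone.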
