# Network Traffic Anomaly Detection – Unsupervised Model Training

In this notebook, we train and evaluate an unsupervised anomaly detection model using the CICIDS2017 dataset. Unlike supervised learning, which requires labeled attack data, unsupervised techniques aim to learn the structure of normal behavior and identify deviations without relying on labels during training.

## Objectives:
- Preprocess and normalize network traffic data using selected features.
- Apply **Isolation Forest**, an unsupervised model effective for high-dimensional anomaly detection.
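- Reduce dimensionality using **PCA (Principal Component Analysis)** to lower memory use and training cost before fitting Isolation Forest.
- Train an **autoencoder** on min-max-scaled features as a second unsupervised model.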
- Evaluate the model using labeled data (for validation only) to compute precision, recall, F1-score, and ROC-AUC.
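The evaluation objective can be illustrated with a minimal sketch. It assumes scikit-learn's `IsolationForest` and uses synthetic data rather than CICIDS2017, so the variable names and printed numbers here are illustrative only and are not results from this notebook. Labels are used only after training, to score the predictions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(42)

# Synthetic stand-in data (NOT CICIDS2017): normal points near 0, attacks near 6.
X_train = rng.normal(0.0, 1.0, size=(1000, 4))   # normal traffic only
X_eval = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 4)),          # held-out normal traffic
    rng.normal(6.0, 1.0, size=(200, 4)),          # attack traffic
])
y_eval = np.r_[np.zeros(300, dtype=int), np.ones(200, dtype=int)]  # 1 = attack

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X_train)  # no labels are passed during training

# predict() returns -1 for anomalies and +1 for inliers; map to 1 = attack.
y_pred = (model.predict(X_eval) == -1).astype(int)
# score_samples() is higher for inliers, so negate it to get an anomaly score.
anomaly_score = -model.score_samples(X_eval)

metrics = {
    "precision": precision_score(y_eval, y_pred),
    "recall": recall_score(y_eval, y_pred),
    "f1": f1_score(y_eval, y_pred),
    "roc_auc": roc_auc_score(y_eval, anomaly_score),
}
print(metrics)
```

ROC-AUC is computed from the continuous anomaly score, so it does not depend on the `contamination` threshold, whereas precision, recall, and F1 do.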

In [1]:
import sys
import os
sys.path.append(os.path.abspath("../src"))

from utils import load_data_files, save_object
from preprocess import (
    clean_data,
    separate_normal_and_attack,
    separate_features_and_target,
    handle_infinite_values,
    clean_data_2,
    apply_and_save_scaler,
    load_scaler_and_transform,
    apply_and_save_pca,
    load_pca_and_transform,
)
from train_model import train_isolation_forest, autoencoder

In [6]:
# Load and prepare dataset

# Load dataset
file_paths = [
    "../data/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",     # DDoS
    "../data/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv", # Port Scan
    "../data/Tuesday-WorkingHours.pcap_ISCX.csv",                   # Brute Force (FTP & SSH)
    "../data/Wednesday-workingHours.pcap_ISCX.csv"                  # DoS (Slowloris, Hulk, etc.)
]
df = load_data_files(file_paths)
print(df["Attack"].value_counts())

# Clean the dataset
df = clean_data(df)

# Separate normal and attack data
normal_df = separate_normal_and_attack(df)

# Separate features and target variable
X, y = separate_features_and_target(normal_df)
X_test, y_test = separate_features_and_target(df)

# Handle infinite values first
X = handle_infinite_values(X)
X_test = handle_infinite_values(X_test)

# Remove features with low variance
X = clean_data_2(X, 0.01)
X_test = clean_data_2(X_test, 0.01)

save_object(y_test, "../data/dataset/2_y_test.pkl")


Attack
0    123986
1     76014
Name: count, dtype: int64
Removed rows with missing values. Remaining rows: 199887
Normal traffic rows: 123956
Checking for infinite values in the dataset:
130
Checking for infinite values in the dataset:
174
VarianceThreshold removed 6 low-variance features
Remaining features: 64
VarianceThreshold removed 6 low-variance features
Remaining features: 64
Object saved to ../data/dataset/2_y_test.pkl.
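The bodies of `handle_infinite_values` and `clean_data_2` live in `../src` and are not shown here. The sketch below is one plausible way to implement the same two steps, not the project's actual code: `replace_infinite` and `fit_variance_filter` are hypothetical names, and filling with the column median is an assumption. It also selects the low-variance columns on the training frame only and reuses that column list for the test frame, which keeps both sets aligned even if their variances differ.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def replace_infinite(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn +/-inf into NaN, then fill with the column median."""
    cleaned = df.replace([np.inf, -np.inf], np.nan)
    return cleaned.fillna(cleaned.median())


def fit_variance_filter(train_df: pd.DataFrame, threshold: float = 0.01) -> list:
    """Hypothetical helper: names of the columns whose variance exceeds the threshold."""
    selector = VarianceThreshold(threshold=threshold).fit(train_df)
    return list(train_df.columns[selector.get_support()])


# Toy frames standing in for X and X_test.
train = pd.DataFrame({
    "flow_rate": [1.0, np.inf, 3.0, 5.0],
    "constant": [0.0, 0.0, 0.0, 0.0],
    "duration": [10.0, 20.0, 30.0, 40.0],
})
test = pd.DataFrame({
    "flow_rate": [2.0, -np.inf],
    "constant": [0.0, 0.0],
    "duration": [15.0, 25.0],
})

train = replace_infinite(train)
test = replace_infinite(test)

kept = fit_variance_filter(train, threshold=0.01)  # fitted on training data only
train_sel, test_sel = train[kept], test[kept]      # same columns in both sets
print(kept)
```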


In [7]:
# Scale the features for Isolation Forest
X_iso = apply_and_save_scaler(X, "../models/scalers/2.2_iso_scaler.pkl")
X_test_iso = load_scaler_and_transform(X_test, "../models/scalers/2.2_iso_scaler.pkl")

# Apply PCA for dimensionality reduction
X_pca = apply_and_save_pca(X_iso, 2, "../models/pca/pca.pkl")
X_test_pca = load_pca_and_transform(X_test_iso, "../models/pca/pca.pkl")

save_object(X_test_pca, "../data/dataset/2pca_x_test.pkl")


Using StandardScaler for scaling features.
Scaler saved as ../models/scalers/2.2_iso_scaler.pkl.
PCA applied and saved as ../models/pca/pca.pkl. Reduced from 64 to 2 components.
Object saved to ../data/dataset/2pca_x_test.pkl.
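Reducing 64 features to 2 components discards some variance, so it can be worth checking how much is retained. The sketch below assumes the helpers wrap scikit-learn's `StandardScaler` and `PCA` (their source is not shown here) and uses synthetic data. Both transformers are fitted on the training set only and then applied unchanged to the test set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 10 observed features driven by 2 latent factors plus noise.
mixing = rng.normal(size=(2, 10))
X_train = rng.normal(size=(500, 2)) @ mixing + 0.05 * rng.normal(size=(500, 10))
X_test = rng.normal(size=(100, 2)) @ mixing + 0.05 * rng.normal(size=(100, 10))

# Fit on training data only, then reuse the fitted objects for the test data.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

# Share of the (scaled) training variance kept by the 2 components.
retained = float(pca.explained_variance_ratio_.sum())
print(f"Variance retained by 2 components: {retained:.3f}")
```

On real traffic features the retained share may be much lower than in this toy example; a low value would suggest trying more components.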


In [8]:
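# Train Isolation Forest model on the PCA-reduced normal traffic
n_estimators = 100    # number of trees in the forest
max_samples = 0.2     # sampling size per tree (scikit-learn reads a float as a fraction of the rows)
contamination = 0.38  # assumed share of anomalies; sets the decision threshold
seed = 42             # random seed for reproducibility
model_iso = train_isolation_forest(X_pca, n_estimators, max_samples, contamination, seed)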

Isolation Forest model trained.
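To make the role of `contamination` concrete, the sketch below assumes `train_isolation_forest` forwards its arguments to scikit-learn's `IsolationForest` (an assumption; the wrapper's source is not shown). In scikit-learn, the decision threshold is placed at the `contamination` quantile of the training scores, so roughly that share of the training rows is flagged as anomalous, even when the training set holds only normal traffic. The data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in for the PCA-reduced normal traffic (2 components).
X_normal = rng.normal(size=(2000, 2))

flagged_share = {}
for contamination in (0.05, 0.38):
    model = IsolationForest(
        n_estimators=100,
        max_samples=0.2,             # each tree sees 20% of the rows
        contamination=contamination,
        random_state=42,
    ).fit(X_normal)
    # predict() returns -1 for rows scored below the learned threshold.
    flagged_share[contamination] = float(np.mean(model.predict(X_normal) == -1))

print(flagged_share)
```

Because the threshold is tied to the training data, a value such as 0.38 chosen to match the attack share of a mixed test set will also mark about 38% of the normal training rows as anomalies; comparing several values against the validation labels is one way to tune it.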


In [9]:
# Save the Isolation Forest model
save_object(model_iso, "../models/isolation_forest_model2.pkl")

Object saved to ../models/isolation_forest_model2.pkl.


In [10]:
# Scale the features for Autoencoder
X_ae = apply_and_save_scaler(X, '../models/scalers/2.2_ae_scaler.pkl', 'minmax')
X_test_ae = load_scaler_and_transform(X_test, '../models/scalers/2.2_ae_scaler.pkl')

save_object(X_test_ae, '../data/dataset/2ae_x_test.pkl')

Using MinMaxScaler for scaling features.
Scaler saved as ../models/scalers/2.2_ae_scaler.pkl.
Object saved to ../data/dataset/2ae_x_test.pkl.
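One detail of min-max scaling worth keeping in mind: the scaler is fitted on normal traffic only, so test rows that fall outside the training range map outside [0, 1]. The sketch below assumes the helper wraps scikit-learn's `MinMaxScaler` and uses toy values; `clip=True` is an optional scikit-learn setting, not something this notebook is known to use.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data: feature 0 spans [0, 10], feature 1 spans [10, 30].
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
# A test row outside the training range on both features.
X_test = np.array([[12.0, 5.0]])

scaler = MinMaxScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)       # values fall outside [0, 1]

clipping_scaler = MinMaxScaler(clip=True).fit(X_train)
X_test_clipped = clipping_scaler.transform(X_test)  # values are clipped into [0, 1]

print(X_test_scaled)   # [[ 1.2  -0.25]]
print(X_test_clipped)  # [[1. 0.]]
```

Leaving the values unclipped preserves how far a row lies outside the normal range, which can help an autoencoder produce a large reconstruction error for it; clipping bounds the inputs at the cost of that information.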


In [11]:
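# Train autoencoder on the min-max-scaled normal traffic
n_epochs = 50      # maximum number of training epochs
batch_size = 128   # rows per gradient update
encoding_dim = 5   # size of the encoded (bottleneck) representation
model_ae = autoencoder(X_ae, n_epochs, batch_size, encoding_dim)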


Epoch 1/50
775/775 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - loss: 0.0174 - val_loss: 0.0017 - learning_rate: 0.0010
Epoch 2/50
775/775 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 0.0014 - val_loss: 8.3904e-04 - learning_rate: 0.0010
Epoch 3/50
775/775 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 8.2703e-04 - val_loss: 6.8347e-04 - learning_rate: 0.0010
Epoch 4/50
775/775 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 7.0675e-04 - val_loss: 5.8917e-04 - learning_rate: 0.0010
Epoch 5/50
775/775 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 6.1977e-04 - val_loss: 5.2259e-04 - learning_rate: 0.0010
Epoch 6/50
775/775 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 5.1555e-04 - val_loss: 3.9143e-04 - learning_rate: 0.0010
Epoch 7/50
775/775 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 4.1273

In [12]:
save_object(model_ae, "../models/autoencoder_model2.pkl")

Object saved to ../models/autoencoder_model2.pkl.
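This notebook stops at saving the autoencoder. A common way to turn such a model into an anomaly detector is to threshold the per-row reconstruction error, with the threshold taken from the errors on normal traffic. The sketch below shows that idea with NumPy only; the function names, the mean-squared-error measure, and the quantile are assumptions, and the toy arrays stand in for `X` and the model's reconstruction of it.

```python
import numpy as np


def reconstruction_error(X: np.ndarray, X_hat: np.ndarray) -> np.ndarray:
    """Mean squared error between each row and its reconstruction."""
    return np.mean((X - X_hat) ** 2, axis=1)


def fit_threshold(errors_normal: np.ndarray, quantile: float = 0.95) -> float:
    """Threshold at a chosen quantile of the errors seen on normal traffic."""
    return float(np.quantile(errors_normal, quantile))


def flag_anomalies(errors: np.ndarray, threshold: float) -> np.ndarray:
    """Return 1 where the error exceeds the threshold, else 0."""
    return (errors > threshold).astype(int)


# Toy inputs and reconstructions (in practice X_hat would come from the trained model).
X = np.array([[0.0, 0.0], [1.0, 1.0]])
X_hat = np.array([[0.0, 0.0], [0.0, 1.0]])
errors = reconstruction_error(X, X_hat)           # [0.0, 0.5]

errors_normal = np.array([0.01, 0.02, 0.03, 0.04])
threshold = fit_threshold(errors_normal, quantile=0.75)
flags = flag_anomalies(np.array([0.02, 0.5]), threshold)
print(errors, threshold, flags)
```

The quantile plays a role similar to `contamination` in Isolation Forest: it fixes the share of normal rows that will be flagged, so it is best tuned against the labeled validation data.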
