# UNSW-NB15 Network Intrusion Detection - Comprehensive ML Pipeline

**Authors:** Research Team  
**Date:** 2025  
**Version:** 1.0.0

---

## Overview

This notebook implements a comprehensive machine learning pipeline for network intrusion detection using the UNSW-NB15 dataset. The pipeline includes:

- Data acquisition and exploratory data analysis
- Feature engineering specific to UNSW-NB15
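- Preprocessing (one-hot encoding, robust scaling) with host-based CV splitting
- Models trained in this notebook: LightGBM and XGBoost, with a logistic regression smoke test; CatBoost and TabTransformer are left as optional extensions
- Ensemble learning and calibration (outlined under Next Steps, not yet implemented here)
- Evaluation (macro F1, PR-AUC, accuracy) and visualization (confusion matrices, PR/ROC curves, feature importance)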

**Dataset:** UNSW-NB15 (Training + Test sets)

**Targets:**
- Binary: `label` (0=Normal, 1=Attack)
- Multi-class: `attack_cat` (Normal, DoS, Exploits, Fuzzers, Generic, Reconnaissance, etc.)

---

## 1. Environment Setup and Package Installation

In [None]:
# Install required packages (uncomment if needed)
# !pip install -r requirements.txt

In [None]:
# Core libraries
import os
import sys
import json
import time
import warnings
from pathlib import Path
from typing import Dict, List, Tuple, Optional

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.preprocessing import (
    RobustScaler, StandardScaler, OneHotEncoder, 
    OrdinalEncoder, LabelEncoder
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, confusion_matrix,
    f1_score, precision_score, recall_score, accuracy_score,
    roc_auc_score, average_precision_score
)

# Imbalanced learning
from imblearn.over_sampling import SMOTENC

# Gradient boosting models
import lightgbm as lgb
import xgboost as xgb
try:
    import catboost as cb
    CATBOOST_AVAILABLE = True
except ImportError:
    CATBOOST_AVAILABLE = False
    print("⚠ CatBoost not available")

# Deep learning
try:
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import Dataset, DataLoader, TensorDataset
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False
    print("⚠ PyTorch not available")

# HPO (optional)
try:
    import optuna
    OPTUNA_AVAILABLE = True
except ImportError:
    OPTUNA_AVAILABLE = False
    print("⚠ Optuna not available")

# Utilities
from tqdm.auto import tqdm

# Import custom utilities
from utils import (
    load_config, save_config_snapshot, ensure_directories,
    create_data_inventory, create_eda_overview,
    create_numeric_summary, create_categorical_summary,
    create_target_distribution, compute_metrics,
    plot_confusion_matrix, plot_pr_curves, plot_roc_curves,
    plot_calibration_curve, create_feature_catalog,
    find_optimal_threshold, create_reproducibility_manifest
)

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

print("✓ All packages imported successfully")
print(f"Python version: {sys.version.split()[0]}")
print(f"Pandas: {pd.__version__}, NumPy: {np.__version__}")
print(f"LightGBM: {lgb.__version__}, XGBoost: {xgb.__version__}")
if TORCH_AVAILABLE:
    print(f"PyTorch: {torch.__version__}")

## 2. Configuration Loading and Directory Setup

In [None]:
# Load configuration
config = load_config('config.json')

# Set random seeds for reproducibility
SEED = config['project']['seed']
np.random.seed(SEED)
if TORCH_AVAILABLE:
    torch.manual_seed(SEED)

# Ensure all directories exist
ensure_directories(config)

# Save configuration snapshot
save_config_snapshot(
    config,
    os.path.join(config['output']['tables_dir'], 'config_snapshot.json')
)

print(f"\n✓ Configuration loaded: {config['project']['name']}")
print(f"Random seed: {SEED}")
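
The cells in this notebook read a fixed set of keys from `config.json`. The sketch below lists those keys as a Python dictionary. The key names are taken from the code in this notebook; every value (paths, seed, fold count, model parameters) is a placeholder assumption, not the project's actual configuration.

```python
import json

# Hypothetical minimal configuration: the keys mirror what this notebook reads,
# but all values are illustrative placeholders.
example_config = {
    "project": {"name": "example-project", "seed": 42},
    "data": {
        "train_path": "data/train.csv",
        "test_path": "data/test.csv",
        "processed_path": "data/processed/dataset.parquet",
    },
    "output": {
        "artifacts_dir": "artifacts",
        "tables_dir": "artifacts/tables",
        "figs_dir": "artifacts/figs",
    },
    "targets": {"binary": "label", "multi": "attack_cat"},
    "features": {
        "categorical": ["proto", "service", "state"],
        "numeric": ["dur", "sbytes", "dbytes", "spkts", "dpkts"],
    },
    "cv": {"n_splits": 5, "strategy": "host"},
    "imbalance": {"use_smote_nc": False, "smote_k_neighbors": 5},
    "models": {
        "lightgbm": {"n_estimators": 500, "learning_rate": 0.05},
        "xgboost": {"n_estimators": 500, "learning_rate": 0.05},
    },
}

# Round-trip through JSON to confirm the structure is serializable.
config_text = json.dumps(example_config, indent=2)
restored = json.loads(config_text)
print(sorted(restored.keys()))
```

Keeping every tunable value in one file like this is what allows the configuration snapshot saved above to document a run.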

## 3. Data Acquisition (UNSW-NB15)

### Option 1: Download from Kaggle
### Option 2: Load from local files

In [None]:
# Option 1: Kaggle download (requires kaggle.json setup)
# Uncomment and run if needed
# !kaggle datasets download -d mrwellsdavid/unsw-nb15
# !unzip -q unsw-nb15.zip -d data/

In [None]:
# Option 2: Load from local files
train_path = config['data']['train_path']
test_path = config['data']['test_path']

# Check if files exist
if not os.path.exists(train_path):
    print(f"⚠ Training file not found: {train_path}")
    print("Please download UNSW-NB15 dataset and place CSV files in data/ directory")
    raise FileNotFoundError(train_path)

if not os.path.exists(test_path):
    print(f"⚠ Test file not found: {test_path}")
    print("Please download UNSW-NB15 dataset and place CSV files in data/ directory")
    raise FileNotFoundError(test_path)

# Load data
print("Loading data...")
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

# Add split column
df_train['split'] = 'train'
df_test['split'] = 'test'

# Combine
df = pd.concat([df_train, df_test], ignore_index=True)

print(f"\n✓ Data loaded successfully")
print(f"Training set: {len(df_train):,} rows")
print(f"Test set: {len(df_test):,} rows")
print(f"Total: {len(df):,} rows, {len(df.columns)} columns")

# Create data inventory
inventory_df = create_data_inventory(
    [train_path, test_path],
    os.path.join(config['output']['tables_dir'], 'data_inventory.csv')
)
display(inventory_df)

## 4. Exploratory Data Analysis (EDA)

In [None]:
# First look at the data
print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
display(df.head())

In [None]:
# Create EDA overview
eda_overview = create_eda_overview(
    df, config,
    os.path.join(config['output']['tables_dir'], 'eda_overview.csv')
)
display(eda_overview)

In [None]:
# Numeric summary
numeric_summary = create_numeric_summary(
    df, config['features']['numeric'],
    os.path.join(config['output']['tables_dir'], 'summary_numeric.csv')
)
display(numeric_summary.head(10))

In [None]:
# Categorical summary
categorical_summary = create_categorical_summary(
    df, config['features']['categorical'],
    os.path.join(config['output']['tables_dir'], 'summary_categorical.csv')
)
display(categorical_summary)

## 5. Target Distribution Analysis

In [None]:
# Binary target distribution
target_binary = config['targets']['binary']
binary_dist = create_target_distribution(
    df, target_binary, 'split',
    os.path.join(config['output']['tables_dir'], 'target_distribution_binary.csv'),
    is_binary=True
)
display(binary_dist)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
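# sort_index() puts label 0 before label 1 so the colors and tick labels below line up
df[target_binary].value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['green', 'red'])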
axes[0].set_title('Binary Label Distribution (Overall)')
axes[0].set_xlabel('Label')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['Normal (0)', 'Attack (1)'], rotation=0)

df.groupby('split')[target_binary].value_counts().unstack().plot(kind='bar', ax=axes[1])
axes[1].set_title('Binary Label Distribution by Split')
axes[1].set_xlabel('Split')
axes[1].set_ylabel('Count')
axes[1].legend(['Normal', 'Attack'])
plt.tight_layout()
plt.savefig(os.path.join(config['output']['figs_dir'], 'target_binary_dist.png'), dpi=300)
plt.show()

In [None]:
# Multi-class target distribution
target_multi = config['targets']['multi']
multi_dist = create_target_distribution(
    df, target_multi, 'split',
    os.path.join(config['output']['tables_dir'], 'target_distribution_multi.csv'),
    is_binary=False
)
display(multi_dist.head(15))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
attack_counts = df[target_multi].value_counts()
attack_counts.plot(kind='barh', ax=axes[0])
axes[0].set_title('Attack Category Distribution (Overall)')
axes[0].set_xlabel('Count')

df.groupby('split')[target_multi].value_counts().unstack().T.plot(kind='bar', ax=axes[1])
axes[1].set_title('Attack Category Distribution by Split')
axes[1].set_xlabel('Attack Category')
axes[1].set_ylabel('Count')
axes[1].legend(title='Split')  # keep the real column names ('test', 'train') as legend labels
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig(os.path.join(config['output']['figs_dir'], 'target_multi_dist.png'), dpi=300)
plt.show()
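
The distribution tables show how unevenly the attack categories are represented, which is why the baseline later uses `class_weight='balanced'`. As a minimal sketch of what that setting computes, the example below derives balanced weights from class counts with scikit-learn. The labels are toy values chosen for illustration, not UNSW-NB15 counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only, not UNSW-NB15 counts).
y = np.array(["Normal"] * 6 + ["Generic"] * 3 + ["Worms"] * 1)
classes = np.unique(y)

# 'balanced' weight for class c = n_samples / (n_classes * count(c)).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))

for name, w in class_weights.items():
    print(f"{name}: {w:.3f}")
```

Rare classes receive proportionally larger weights, so each class contributes roughly equally to the loss.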

## 6. Data Type Conversion and Missing Value Imputation

In [None]:
# Track imputation
imputation_report = []

# Categorical features: convert to string
for col in config['features']['categorical']:
    if col in df.columns:
        before_missing = df[col].isna().sum()
        
        # Handle special case: service with "-" values
        if col == 'service':
            df[col] = df[col].replace('-', '_missing_')
        
        # Convert to string and fill missing with '_missing_'
        df[col] = df[col].fillna('_missing_').astype(str)
        
        after_missing = df[col].isna().sum()
        imputation_report.append({
            'column': col,
            'strategy': 'fill(_missing_)',
            'before_missing': before_missing,
            'after_missing': after_missing
        })

# Numeric features: convert to float and fill with median
for col in config['features']['numeric']:
    if col in df.columns:
        before_missing = df[col].isna().sum()
        
        # Convert to numeric, coercing errors
        df[col] = pd.to_numeric(df[col], errors='coerce')
        
        # Replace inf with NaN
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)
        
        # Fill with median
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)
        
        after_missing = df[col].isna().sum()
        imputation_report.append({
            'column': col,
            'strategy': f'median({median_val:.2f})',
            'before_missing': before_missing,
            'after_missing': after_missing
        })

# Save imputation report
imputation_df = pd.DataFrame(imputation_report)
imputation_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'imputation_report.csv'),
    index=False
)

print("✓ Data type conversion and imputation completed")
display(imputation_df.head(10))
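
The cell above computes each median over the combined train and test rows, so test-set values influence the fill value. A hedged alternative, sketched here on a toy frame, fits the median on training rows only and then applies it everywhere. The use of `SimpleImputer` and the column `dur_imputed` are illustrative choices, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a 'split' column like the one created in Section 3.
toy = pd.DataFrame({
    "dur": [1.0, np.nan, 3.0, 100.0, np.nan],
    "split": ["train", "train", "train", "test", "test"],
})

train_rows = toy["split"] == "train"

# Fit the median on training rows only, then apply it to every row.
imputer = SimpleImputer(strategy="median")
imputer.fit(toy.loc[train_rows, ["dur"]])
toy["dur_imputed"] = imputer.transform(toy[["dur"]])[:, 0]

print(toy)
```

The training median here is 2.0; the large test value 100.0 does not affect it.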

## 7. Feature Engineering (UNSW-NB15 Specific)

In [None]:
# Feature engineering
print("Performing feature engineering...")

# Bytes-related features
df['bytes_total'] = df['sbytes'] + df['dbytes']
df['bytes_ratio_sd'] = df['sbytes'] / (df['dbytes'] + 1)
df['bytes_per_sec'] = df['bytes_total'] / df['dur'].clip(lower=1e-6)

# Packets-related features
df['pkts_total'] = df['spkts'] + df['dpkts']
df['pkts_per_sec'] = df['pkts_total'] / df['dur'].clip(lower=1e-6)

# Port bucketing function
def port_to_bucket(port):
    try:
        port = int(port)
        if 0 <= port <= 1023:
            return 'well_known'
        elif 1024 <= port <= 49151:
            return 'registered'
        elif 49152 <= port <= 65535:
            return 'dynamic'
        else:
            return 'other'
    except (TypeError, ValueError, OverflowError):  # non-numeric, missing, or infinite port
        return 'unknown'

# Apply port bucketing
if 'sport' in df.columns:
    df['sport_bucket'] = df['sport'].apply(port_to_bucket)
else:
    df['sport_bucket'] = 'unknown'

if 'dsport' in df.columns:
    df['dsport_bucket'] = df['dsport'].apply(port_to_bucket)
else:
    df['dsport_bucket'] = 'unknown'

# Proto × Service interaction
if 'proto' in df.columns and 'service' in df.columns:
    df['proto_service'] = df['proto'].astype(str) + '_' + df['service'].astype(str)
else:
    df['proto_service'] = 'unknown'

# Time-based features (if stime exists)
if 'stime' in df.columns:
    try:
        df['hour'] = pd.to_datetime(df['stime'], unit='s', errors='coerce').dt.hour
        df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
        df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
    except (TypeError, ValueError, OverflowError):  # unparseable timestamps
        df['hour'] = 0
        df['hour_sin'] = 0
        df['hour_cos'] = 0

# Clean inf values from engineered features
engineered_numeric = [
    'bytes_total', 'bytes_ratio_sd', 'bytes_per_sec',
    'pkts_total', 'pkts_per_sec'
]

for col in engineered_numeric:
    if col in df.columns:
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)
        df[col] = df[col].fillna(0)

print("✓ Feature engineering completed")
print(f"New features created: {len(engineered_numeric) + 3}")
print(f"Total columns: {len(df.columns)}")

# Create feature catalog
feature_catalog = create_feature_catalog(
    config['features'],
    os.path.join(config['output']['tables_dir'], 'feature_catalog.csv')
)
display(feature_catalog.head(15))
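
Two of the transformations above are easier to see on small inputs: port bucketing and the cyclic hour encoding. The sketch below restates `port_to_bucket` in a self-contained form and adds a helper named `encode_hour`, which is hypothetical and exists only for this illustration.

```python
import numpy as np

def port_to_bucket(port):
    """Map a port number to its IANA range name."""
    try:
        port = int(port)
    except (TypeError, ValueError):
        return "unknown"
    if 0 <= port <= 1023:
        return "well_known"
    if 1024 <= port <= 49151:
        return "registered"
    if 49152 <= port <= 65535:
        return "dynamic"
    return "other"

def encode_hour(hour):
    """Return (sin, cos) coordinates of an hour on the 24-hour circle."""
    angle = 2 * np.pi * hour / 24
    return np.sin(angle), np.cos(angle)

buckets = [port_to_bucket(p) for p in (80, 8080, 51000, 70000, "n/a")]
print(buckets)

# Hours 23 and 0 are adjacent on the circle even though |23 - 0| = 23.
p23, p0, p12 = (np.array(encode_hour(h)) for h in (23, 0, 12))
print(np.linalg.norm(p23 - p0) < np.linalg.norm(p12 - p0))
```

The sine/cosine pair preserves the fact that late-night and early-morning flows are close in time, which a raw hour column would hide.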

## 8. Cross-Validation Strategy: Host-Based Splitting

In [None]:
# CV strategy
n_splits = config['cv']['n_splits']
cv_strategy = config['cv']['strategy']

print(f"CV Strategy: {cv_strategy} with {n_splits} splits")

if cv_strategy == 'host':
    # Host-based: group by (srcip, dstip)
    if 'srcip' in df.columns and 'dstip' in df.columns:
        df['group_key'] = df['srcip'].astype(str) + '_' + df['dstip'].astype(str)
        
        # Encode group_key to numeric (LabelEncoder is already imported in Section 1)
        le_group = LabelEncoder()
        df['group_id'] = le_group.fit_transform(df['group_key'])
        
        # GroupKFold
        gkf = GroupKFold(n_splits=n_splits)
        
        # Assign fold
        df['cv_fold'] = -1
        for fold_idx, (train_idx, val_idx) in enumerate(gkf.split(df, groups=df['group_id'])):
            df.loc[val_idx, 'cv_fold'] = fold_idx
        
        print(f"✓ Host-based CV: {df['group_key'].nunique()} unique host pairs")
    else:
        print("⚠ srcip/dstip not found, falling back to stratified CV")
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        df['cv_fold'] = -1
        for fold_idx, (train_idx, val_idx) in enumerate(skf.split(df, df[target_multi])):
            df.loc[val_idx, 'cv_fold'] = fold_idx

elif cv_strategy == 'time':
    # Time-based: sort by stime and split into blocks
    if 'stime' in df.columns:
        df = df.sort_values('stime').reset_index(drop=True)
        df['cv_fold'] = pd.qcut(df.index, q=n_splits, labels=False, duplicates='drop')
        print("✓ Time-based CV")
    else:
        print("⚠ stime not found, falling back to stratified CV")
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        df['cv_fold'] = -1
        for fold_idx, (train_idx, val_idx) in enumerate(skf.split(df, df[target_multi])):
            df.loc[val_idx, 'cv_fold'] = fold_idx
else:
    # Default: stratified
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    df['cv_fold'] = -1
    for fold_idx, (train_idx, val_idx) in enumerate(skf.split(df, df[target_multi])):
        df.loc[val_idx, 'cv_fold'] = fold_idx
    print("✓ Stratified CV")

# Fold sizes
fold_sizes = []
for fold in range(n_splits):
    n_val = (df['cv_fold'] == fold).sum()
    n_train = (df['cv_fold'] != fold).sum()
    fold_sizes.append({
        'fold': fold,
        'n_train': n_train,
        'n_valid': n_val,
        'train_pct': round(n_train / len(df) * 100, 2),
        'valid_pct': round(n_val / len(df) * 100, 2)
    })

fold_sizes_df = pd.DataFrame(fold_sizes)
fold_sizes_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'fold_sizes.csv'),
    index=False
)
display(fold_sizes_df)

# Leakage check (host-based)
leakage_checks = []
if cv_strategy == 'host' and 'group_key' in df.columns:
    has_leakage = False
    for fold in range(n_splits):
        train_groups = set(df[df['cv_fold'] != fold]['group_key'].unique())
        val_groups = set(df[df['cv_fold'] == fold]['group_key'].unique())
        overlap = train_groups & val_groups
        if len(overlap) > 0:
            has_leakage = True
            break
    
    leakage_checks.append({
        'check_name': 'host_group_leakage',
        'passed': not has_leakage,
        'detail': 'No overlap' if not has_leakage else f'{len(overlap)} groups overlap'
    })
else:
    leakage_checks.append({
        'check_name': 'host_group_leakage',
        'passed': True,
        'detail': 'N/A (not using host-based CV)'
    })

leakage_df = pd.DataFrame(leakage_checks)
leakage_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'leakage_checks.csv'),
    index=False
)
display(leakage_df)

## 9. Preprocessing Pipeline Definition

In [None]:
# Define feature columns for modeling
categorical_features = [
    'proto', 'service', 'state',
    'sport_bucket', 'dsport_bucket'
    # 'proto_service'  # High cardinality, optional
]

numeric_features = [
    'dur', 'sbytes', 'dbytes', 'spkts', 'dpkts',
    'sload', 'dload', 'sloss', 'dloss',
    'sinpkt', 'dinpkt', 'sjit', 'djit',
    'swin', 'dwin', 'stcpb', 'dtcpb',
    'smeansz', 'dmeansz', 'trans_depth',
    'res_bdy_len', 'ct_srv_src', 'ct_state_ttl',
    'ct_dst_ltm', 'ct_src_ltm', 'ct_src_dport_ltm',
    'ct_dst_sport_ltm', 'ct_dst_src_ltm',
    'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd',
    'is_sm_ips_ports',
    # Engineered
    'bytes_total', 'bytes_ratio_sd', 'bytes_per_sec',
    'pkts_total', 'pkts_per_sec'
]

# Filter to existing columns
categorical_features = [c for c in categorical_features if c in df.columns]
numeric_features = [c for c in numeric_features if c in df.columns]

print(f"Categorical features ({len(categorical_features)}): {categorical_features}")
print(f"Numeric features ({len(numeric_features)}): {numeric_features[:10]}...")

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features),
        ('num', RobustScaler(), numeric_features)
    ],
    remainder='drop'
)

# Pipeline summary
pipeline_summary = [
    {'step_order': 1, 'step_name': 'OneHotEncoder', 'params_json': json.dumps({'handle_unknown': 'ignore'})},
    {'step_order': 2, 'step_name': 'RobustScaler', 'params_json': json.dumps({})}
]

# SMOTENC is only recorded in this summary; no cell in this notebook applies it.
if config['imbalance']['use_smote_nc']:
    pipeline_summary.append({
        'step_order': 3,
        'step_name': 'SMOTENC',
        'params_json': json.dumps({'k_neighbors': config['imbalance']['smote_k_neighbors']})
    })

pipeline_df = pd.DataFrame(pipeline_summary)
pipeline_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'pipeline_summary.csv'),
    index=False
)
display(pipeline_df)

## 10. Quick Baseline Model (Logistic Regression) - Smoke Test

In [None]:
# Smoke test with fold 0
print("Running smoke test with Logistic Regression on fold 0...")

fold = 0
train_mask = df['cv_fold'] != fold
val_mask = df['cv_fold'] == fold

X_train = df.loc[train_mask, categorical_features + numeric_features]
y_train = df.loc[train_mask, target_multi]
X_val = df.loc[val_mask, categorical_features + numeric_features]
y_val = df.loc[val_mask, target_multi]

# Encode target
le_target = LabelEncoder()
y_train_enc = le_target.fit_transform(y_train)
y_val_enc = le_target.transform(y_val)

# Preprocess
X_train_pp = preprocessor.fit_transform(X_train)
X_val_pp = preprocessor.transform(X_val)

# Train logistic regression
lr = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    random_state=SEED,
    n_jobs=-1
)
lr.fit(X_train_pp, y_train_enc)

# Predict
y_pred = lr.predict(X_val_pp)
y_proba = lr.predict_proba(X_val_pp)

# Metrics
metrics = compute_metrics(y_val_enc, y_pred, y_proba, le_target.classes_)
print(f"\n✓ Smoke test results (Fold {fold}):")
print(f"  Accuracy: {metrics['accuracy']:.4f}")
print(f"  Macro F1: {metrics['macro_f1']:.4f}")
print(f"  PR-AUC (OVR): {metrics.get('ovr_pr_auc', 0):.4f}")

# Classification report
report_dict = classification_report(y_val_enc, y_pred, target_names=le_target.classes_, output_dict=True, zero_division=0)
report_df = pd.DataFrame(report_dict).T
report_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'baseline_logreg_fold0_report.csv')
)
display(report_df)

# Confusion matrix
plot_confusion_matrix(
    y_val_enc, y_pred, le_target.classes_,
    os.path.join(config['output']['figs_dir'], 'smoke_cm_fold0.png'),
    title='Smoke Test - Confusion Matrix (Fold 0)'
)

# PR curve
plot_pr_curves(
    y_val_enc, y_proba, le_target.classes_,
    os.path.join(config['output']['figs_dir'], 'smoke_pr_fold0.png'),
    title='Smoke Test - PR Curves (Fold 0)'
)

print("\n✓ Smoke test completed successfully!")

## 11. Save Processed Data

In [None]:
# Save processed data
processed_path = config['data']['processed_path']
os.makedirs(os.path.dirname(processed_path), exist_ok=True)

df.to_parquet(processed_path, index=False, engine='pyarrow')
print(f"✓ Processed data saved to {processed_path}")
print(f"File size: {os.path.getsize(processed_path) / 1024 / 1024:.2f} MB")

# Schema snapshot
schema_data = []
for col in df.columns:
    if col in categorical_features:
        role = 'categorical_feature'
    elif col in numeric_features:
        role = 'numeric_feature'
    elif col in [target_binary, target_multi]:
        role = 'target'
    elif col in ['cv_fold', 'split']:
        role = 'metadata'
    else:
        role = 'id_or_dropped'
    
    schema_data.append({
        'column': col,
        'dtype': str(df[col].dtype),
        'role': role
    })

schema_df = pd.DataFrame(schema_data)
schema_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'processed_schema.csv'),
    index=False
)
display(schema_df.head(20))

## 12. Metric Definitions

In [None]:
# Metric definitions table
metric_defs = [
    {'metric_name': 'macro_f1', 'definition': 'F1-score averaged across all classes (unweighted)', 'averaging': 'macro'},
    {'metric_name': 'weighted_f1', 'definition': 'F1-score averaged across all classes (weighted by support)', 'averaging': 'weighted'},
    {'metric_name': 'accuracy', 'definition': 'Overall classification accuracy', 'averaging': 'N/A'},
    {'metric_name': 'ovr_pr_auc', 'definition': 'Average Precision (PR-AUC) using One-vs-Rest', 'averaging': 'ovr'},
    {'metric_name': 'precision_macro', 'definition': 'Precision averaged across all classes', 'averaging': 'macro'},
    {'metric_name': 'recall_macro', 'definition': 'Recall averaged across all classes', 'averaging': 'macro'},
]

metric_defs_df = pd.DataFrame(metric_defs)
metric_defs_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'metric_definitions.csv'),
    index=False
)
display(metric_defs_df)
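
The definitions in this table can be made concrete with scikit-learn. The function below, `sketch_metrics`, is a hypothetical stand-in written for illustration; it is not the project's `compute_metrics` helper, whose implementation is not shown in this notebook. It assumes integer-encoded labels and at least three classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score
from sklearn.preprocessing import label_binarize

def sketch_metrics(y_true, y_pred, y_proba, n_classes):
    """Hypothetical stand-in for a metrics helper; not the project's compute_metrics."""
    labels = np.arange(n_classes)
    # One-vs-rest indicator matrix, shape (n_samples, n_classes) for n_classes >= 3.
    y_onehot = label_binarize(y_true, classes=labels)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        # Macro average of per-class average precision.
        "ovr_pr_auc": average_precision_score(y_onehot, y_proba, average="macro"),
    }

y_true = np.array([0, 1, 2, 2, 1, 0])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
])
y_pred = y_proba.argmax(axis=1)

toy_metrics = sketch_metrics(y_true, y_pred, y_proba, n_classes=3)
print(toy_metrics)
```

Macro averaging gives every class equal weight regardless of support, which is why it is the headline metric for an imbalanced target such as `attack_cat`.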

## 13. Full Model Training and Evaluation

### 13.1 LightGBM - 5-Fold CV

In [None]:
# LightGBM training
print("=" * 60)
print(f"Training LightGBM with {n_splits}-Fold CV")
print("=" * 60)

lgbm_cv_scores = []
lgbm_preds = []
lgbm_probas = []
lgbm_feature_importance = []

for fold in range(n_splits):
    print(f"\n--- Fold {fold} ---")
    
    train_mask = df['cv_fold'] != fold
    val_mask = df['cv_fold'] == fold
    
    X_train = df.loc[train_mask, categorical_features + numeric_features]
    y_train = df.loc[train_mask, target_multi]
    X_val = df.loc[val_mask, categorical_features + numeric_features]
    y_val = df.loc[val_mask, target_multi]
    
    # Encode target
    le = LabelEncoder()
    y_train_enc = le.fit_transform(y_train)
    y_val_enc = le.transform(y_val)
    
    # Preprocess
    X_train_pp = preprocessor.fit_transform(X_train)
    X_val_pp = preprocessor.transform(X_val)
    
    # Train LightGBM
    lgbm_params = config['models']['lightgbm'].copy()
    lgbm_clf = lgb.LGBMClassifier(**lgbm_params)
    
    start_time = time.time()
    lgbm_clf.fit(
        X_train_pp, y_train_enc,
        eval_set=[(X_val_pp, y_val_enc)],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
    )
    train_time = time.time() - start_time
    
    # Predict
    y_pred = lgbm_clf.predict(X_val_pp)
    y_proba = lgbm_clf.predict_proba(X_val_pp)
    
    # Metrics
    metrics = compute_metrics(y_val_enc, y_pred, y_proba, le.classes_)
    lgbm_cv_scores.append({
        'fold': fold,
        'macro_f1': metrics['macro_f1'],
        'ovr_pr_auc': metrics.get('ovr_pr_auc', 0),
        'accuracy': metrics['accuracy'],
        'train_time_sec': round(train_time, 2)
    })
    
    # Store predictions
    for i, (true, pred) in enumerate(zip(y_val_enc, y_pred)):
        lgbm_preds.append({'fold': fold, 'y_true': true, 'y_pred': pred})
    
    # Store probabilities
    for i, proba_row in enumerate(y_proba):
        proba_dict = {'fold': fold}
        for cls_idx, cls_name in enumerate(le.classes_):
            proba_dict[f'p_{cls_name}'] = proba_row[cls_idx]
        lgbm_probas.append(proba_dict)
    
    print(f"Fold {fold}: Macro F1 = {metrics['macro_f1']:.4f}, PR-AUC = {metrics.get('ovr_pr_auc', 0):.4f}")

# Feature importance (from last fold)
feature_names = (preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features).tolist() +
                 numeric_features)

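if len(feature_names) == lgbm_clf.n_features_in_:
    # LGBMClassifier.feature_importances_ follows importance_type (default 'split'),
    # so request gain explicitly from the underlying booster.
    gain_importance = lgbm_clf.booster_.feature_importance(importance_type='gain')
    split_importance = lgbm_clf.booster_.feature_importance(importance_type='split')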
    
    for idx, feat in enumerate(feature_names):
        lgbm_feature_importance.append({
            'feature': feat,
            'gain_importance': gain_importance[idx] if idx < len(gain_importance) else 0,
            'split_importance': split_importance[idx] if idx < len(split_importance) else 0
        })
    
    lgbm_fi_df = pd.DataFrame(lgbm_feature_importance)
    lgbm_fi_df['rank_gain'] = lgbm_fi_df['gain_importance'].rank(ascending=False)
    lgbm_fi_df['rank_split'] = lgbm_fi_df['split_importance'].rank(ascending=False)
    lgbm_fi_df = lgbm_fi_df.sort_values('gain_importance', ascending=False)
    lgbm_fi_df.to_csv(
        os.path.join(config['output']['tables_dir'], 'lgbm_feature_importance.csv'),
        index=False
    )

# Save results
lgbm_cv_df = pd.DataFrame(lgbm_cv_scores)
lgbm_cv_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'lgbm_cv_scores.csv'),
    index=False
)

lgbm_preds_df = pd.DataFrame(lgbm_preds)
lgbm_preds_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'lgbm_preds.csv'),
    index=False
)

lgbm_probas_df = pd.DataFrame(lgbm_probas)
lgbm_probas_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'lgbm_probas.csv'),
    index=False
)

print("\n" + "=" * 60)
print("LightGBM CV Results:")
print(lgbm_cv_df)
print(f"Mean Macro F1: {lgbm_cv_df['macro_f1'].mean():.4f} ± {lgbm_cv_df['macro_f1'].std():.4f}")
print(f"Mean PR-AUC: {lgbm_cv_df['ovr_pr_auc'].mean():.4f} ± {lgbm_cv_df['ovr_pr_auc'].std():.4f}")
print("=" * 60)
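
The training loop appends one dictionary per validation row, which is slow on a dataset of this size. As a sketch of an equivalent vectorized approach, the hypothetical helper `fold_frames` below builds the per-fold prediction and probability tables directly from arrays. The class names and values are toy inputs.

```python
import numpy as np
import pandas as pd

def fold_frames(fold, y_true, y_pred, y_proba, class_names):
    """Build prediction and probability frames for one fold without Python loops."""
    preds = pd.DataFrame({"fold": fold, "y_true": y_true, "y_pred": y_pred})
    probas = pd.DataFrame(y_proba, columns=[f"p_{name}" for name in class_names])
    probas.insert(0, "fold", fold)
    return preds, probas

# Toy fold output (illustrative values).
class_names = ["DoS", "Normal"]
y_true = np.array([0, 1, 1])
y_proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
y_pred = y_proba.argmax(axis=1)

preds_df, probas_df = fold_frames(0, y_true, y_pred, y_proba, class_names)
print(preds_df)
print(probas_df)
```

Collecting one frame per fold and calling `pd.concat` once at the end would produce the same columns as the row-wise version.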

### 13.2 XGBoost - 5-Fold CV

In [None]:
# XGBoost training
print("=" * 60)
print(f"Training XGBoost with {n_splits}-Fold CV")
print("=" * 60)

xgb_cv_scores = []
xgb_preds = []
xgb_probas = []
xgb_feature_importance = []

for fold in range(n_splits):
    print(f"\n--- Fold {fold} ---")
    
    train_mask = df['cv_fold'] != fold
    val_mask = df['cv_fold'] == fold
    
    X_train = df.loc[train_mask, categorical_features + numeric_features]
    y_train = df.loc[train_mask, target_multi]
    X_val = df.loc[val_mask, categorical_features + numeric_features]
    y_val = df.loc[val_mask, target_multi]
    
    # Encode target
    le = LabelEncoder()
    y_train_enc = le.fit_transform(y_train)
    y_val_enc = le.transform(y_val)
    
    # Preprocess
    X_train_pp = preprocessor.fit_transform(X_train)
    X_val_pp = preprocessor.transform(X_val)
    
    # Train XGBoost
    xgb_params = config['models']['xgboost'].copy()
    xgb_clf = xgb.XGBClassifier(**xgb_params)
    
    start_time = time.time()
    xgb_clf.fit(
        X_train_pp, y_train_enc,
        eval_set=[(X_val_pp, y_val_enc)],
        verbose=False
    )
    train_time = time.time() - start_time
    
    # Predict
    y_pred = xgb_clf.predict(X_val_pp)
    y_proba = xgb_clf.predict_proba(X_val_pp)
    
    # Metrics
    metrics = compute_metrics(y_val_enc, y_pred, y_proba, le.classes_)
    xgb_cv_scores.append({
        'fold': fold,
        'macro_f1': metrics['macro_f1'],
        'ovr_pr_auc': metrics.get('ovr_pr_auc', 0),
        'accuracy': metrics['accuracy'],
        'train_time_sec': round(train_time, 2)
    })
    
    # Store predictions
    for i, (true, pred) in enumerate(zip(y_val_enc, y_pred)):
        xgb_preds.append({'fold': fold, 'y_true': true, 'y_pred': pred})
    
    # Store probabilities
    for i, proba_row in enumerate(y_proba):
        proba_dict = {'fold': fold}
        for cls_idx, cls_name in enumerate(le.classes_):
            proba_dict[f'p_{cls_name}'] = proba_row[cls_idx]
        xgb_probas.append(proba_dict)
    
    print(f"Fold {fold}: Macro F1 = {metrics['macro_f1']:.4f}, PR-AUC = {metrics.get('ovr_pr_auc', 0):.4f}")

# Feature importance (from last fold)
if len(feature_names) == xgb_clf.n_features_in_:
    importance = xgb_clf.feature_importances_
    
    for idx, feat in enumerate(feature_names):
        xgb_feature_importance.append({
            'feature': feat,
            'importance': importance[idx] if idx < len(importance) else 0
        })
    
    xgb_fi_df = pd.DataFrame(xgb_feature_importance)
    xgb_fi_df['rank'] = xgb_fi_df['importance'].rank(ascending=False)
    xgb_fi_df = xgb_fi_df.sort_values('importance', ascending=False)
    xgb_fi_df.to_csv(
        os.path.join(config['output']['tables_dir'], 'xgb_feature_importance.csv'),
        index=False
    )

# Save results
xgb_cv_df = pd.DataFrame(xgb_cv_scores)
xgb_cv_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'xgb_cv_scores.csv'),
    index=False
)

xgb_preds_df = pd.DataFrame(xgb_preds)
xgb_preds_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'xgb_preds.csv'),
    index=False
)

xgb_probas_df = pd.DataFrame(xgb_probas)
xgb_probas_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'xgb_probas.csv'),
    index=False
)

print("\n" + "=" * 60)
print("XGBoost CV Results:")
print(xgb_cv_df)
print(f"Mean Macro F1: {xgb_cv_df['macro_f1'].mean():.4f} ± {xgb_cv_df['macro_f1'].std():.4f}")
print(f"Mean PR-AUC: {xgb_cv_df['ovr_pr_auc'].mean():.4f} ± {xgb_cv_df['ovr_pr_auc'].std():.4f}")
print("=" * 60)

### 13.3 Model Comparison Summary

In [None]:
# Main results table
main_results = [
    {
        'model': 'LightGBM',
        'macro_f1_mean': lgbm_cv_df['macro_f1'].mean(),
        'macro_f1_std': lgbm_cv_df['macro_f1'].std(),
        'ovr_pr_auc_mean': lgbm_cv_df['ovr_pr_auc'].mean(),
        'accuracy_mean': lgbm_cv_df['accuracy'].mean(),
        'train_time_sec_mean': lgbm_cv_df['train_time_sec'].mean()
    },
    {
        'model': 'XGBoost',
        'macro_f1_mean': xgb_cv_df['macro_f1'].mean(),
        'macro_f1_std': xgb_cv_df['macro_f1'].std(),
        'ovr_pr_auc_mean': xgb_cv_df['ovr_pr_auc'].mean(),
        'accuracy_mean': xgb_cv_df['accuracy'].mean(),
        'train_time_sec_mean': xgb_cv_df['train_time_sec'].mean()
    }
]

main_results_df = pd.DataFrame(main_results)
main_results_df.to_csv(
    os.path.join(config['output']['tables_dir'], 'main_results.csv'),
    index=False
)

print("\n" + "=" * 80)
print("MODEL COMPARISON SUMMARY")
print("=" * 80)
display(main_results_df)
print("=" * 80)

### 13.4 Visualization - PR and ROC Curves

In [None]:
# Plot PR curves for LightGBM (using fold 0)
fold = 0
train_mask = df['cv_fold'] != fold
val_mask = df['cv_fold'] == fold

X_val = df.loc[val_mask, categorical_features + numeric_features]
y_val = df.loc[val_mask, target_multi]

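# Rebuild the label encoding from all labels. This matches the per-fold encoders
# used in Section 13.1 only if every class appears in each training fold.
le = LabelEncoder()
le.fit(df[target_multi])
y_val_enc = le.transform(y_val)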

# Get probabilities from saved results
lgbm_probas_fold0 = lgbm_probas_df[lgbm_probas_df['fold'] == fold]
proba_cols = [c for c in lgbm_probas_fold0.columns if c.startswith('p_')]
y_proba_lgbm = lgbm_probas_fold0[proba_cols].values

# Plot PR curves
plot_pr_curves(
    y_val_enc, y_proba_lgbm, le.classes_,
    os.path.join(config['output']['figs_dir'], 'pr_curve_lgbm_ovr.png'),
    title='LightGBM - Precision-Recall Curves (Fold 0, OVR)'
)

# Plot ROC curves
plot_roc_curves(
    y_val_enc, y_proba_lgbm, le.classes_,
    os.path.join(config['output']['figs_dir'], 'roc_curve_lgbm_ovr.png'),
    title='LightGBM - ROC Curves (Fold 0, OVR)'
)

# Confusion matrix
lgbm_preds_fold0 = lgbm_preds_df[lgbm_preds_df['fold'] == fold]
y_pred_lgbm = lgbm_preds_fold0['y_pred'].values

plot_confusion_matrix(
    y_val_enc, y_pred_lgbm, le.classes_,
    os.path.join(config['output']['figs_dir'], 'cm_lgbm_fold0.png'),
    title='LightGBM - Confusion Matrix (Fold 0)'
)

### 13.5 Feature Importance Visualization

In [None]:
# Plot top 20 features for LightGBM
if len(lgbm_feature_importance) > 0:
    lgbm_fi_df = pd.read_csv(
        os.path.join(config['output']['tables_dir'], 'lgbm_feature_importance.csv')
    )
    
    top_features = lgbm_fi_df.nlargest(20, 'gain_importance')
    
    plt.figure(figsize=(10, 8))
    plt.barh(range(len(top_features)), top_features['gain_importance'])
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Gain Importance')
    plt.title('LightGBM - Top 20 Features by Gain')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.savefig(
        os.path.join(config['output']['figs_dir'], 'feature_importance_lgbm.png'),
        dpi=300
    )
    plt.show()
    
    print("\nTop 10 Most Important Features (LightGBM):")
    display(lgbm_fi_df.head(10))
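
Because categorical features are one-hot encoded, their importance is spread over several columns (for example `proto_tcp` and `proto_udp`). The sketch below sums those columns back onto the source feature. The helper `aggregate_importance` is hypothetical, and its prefix matching assumes that no numeric feature name starts with a categorical feature name followed by an underscore.

```python
import pandas as pd

def aggregate_importance(fi_df, categorical_features, value_col="gain_importance"):
    """Sum importances of one-hot columns back onto their source feature."""
    def source_feature(name):
        # Longest prefix first, so a longer categorical name wins over a shorter one.
        for cat in sorted(categorical_features, key=len, reverse=True):
            if name.startswith(cat + "_"):
                return cat
        return name

    grouped = fi_df.assign(source=fi_df["feature"].map(source_feature))
    return grouped.groupby("source")[value_col].sum().sort_values(ascending=False)

# Toy importance table (illustrative values, not model output).
toy_fi = pd.DataFrame({
    "feature": ["proto_tcp", "proto_udp", "sport_bucket_dynamic", "sbytes"],
    "gain_importance": [10.0, 5.0, 2.0, 20.0],
})

agg = aggregate_importance(toy_fi, ["proto", "sport_bucket"])
print(agg)
```

Aggregating this way makes categorical and numeric features directly comparable in a ranking.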

## 14. Reproducibility Manifest

In [None]:
# Create reproducibility manifest
manifest_df = create_reproducibility_manifest(
    os.path.join(config['output']['tables_dir'], 'reproducibility_manifest.csv')
)
display(manifest_df)
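
The implementation of `create_reproducibility_manifest` lives in `utils` and is not shown here. As a rough idea of what such a manifest can record, the sketch below collects the interpreter version and installed package versions with the standard library. `sketch_manifest` and its column names are assumptions made for this example only.

```python
import platform
from importlib import metadata

import pandas as pd

def sketch_manifest(packages):
    """Collect interpreter and package versions; a stand-in, not the project's helper."""
    rows = [{"component": "python", "version": platform.python_version()}]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        rows.append({"component": name, "version": version})
    return pd.DataFrame(rows)

manifest = sketch_manifest(["pandas", "numpy", "scikit-learn", "lightgbm", "xgboost"])
print(manifest)
```

Recording versions next to the configuration snapshot and the random seed covers the main inputs needed to rerun the analysis.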

## 15. Summary and Conclusion

In [None]:
print("\n" + "=" * 80)
print("UNSW-NB15 ANALYSIS - SUMMARY")
print("=" * 80)
print(f"\nDataset: {len(df):,} samples, {len(df.columns)} columns (features, targets, and metadata)")
print(f"Target classes: {len(le_target.classes_)} ({', '.join(le_target.classes_)})")
print(f"CV Strategy: {cv_strategy} with {n_splits} folds")
print(f"\nModels trained: LightGBM, XGBoost")
print(f"\nBest model by Macro F1: ", end="")

best_f1 = main_results_df.loc[main_results_df['macro_f1_mean'].idxmax()]
print(f"{best_f1['model']} (F1={best_f1['macro_f1_mean']:.4f})")

print(f"\n✓ All results saved to: {config['output']['artifacts_dir']}")
print(f"  - Tables: {config['output']['tables_dir']}")
print(f"  - Figures: {config['output']['figs_dir']}")
print("=" * 80)

---

## Next Steps (Optional Extensions)

1. **TabTransformer (PyTorch)**: Implement a deep learning model for tabular data
2. **Hyperparameter Optimization (Optuna)**: Automated HPO for better performance
3. **Ensemble Learning**: Combine predictions from multiple models
4. **Calibration**: Isotonic/Platt calibration for probability estimates
5. **Ablation Studies**: Test impact of SMOTE, focal loss, etc.
6. **SHAP Analysis**: Detailed explainability with SHAP values
7. **Cross-Dataset Evaluation**: Test on CIC-IDS2017 or other datasets

The two cells below are placeholders: TabTransformer training and Optuna HPO are intended to live in separate notebooks, and the remaining extensions would need new sections or notebooks.
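
As an illustration of the ensemble extension, the simplest way to combine the saved LightGBM and XGBoost probabilities is soft voting, a weighted average of the class-probability matrices. The sketch below is a hedged example on toy values. `soft_vote` is a hypothetical helper, and it assumes every matrix has the same shape and class column order.

```python
import numpy as np

def soft_vote(proba_list, weights=None):
    """Weighted average of class-probability matrices with identical shape and column order."""
    stacked = np.stack(proba_list, axis=0)          # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(proba_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()               # normalize so rows still sum to 1
    return np.tensordot(weights, stacked, axes=1)   # (n_samples, n_classes)

# Toy probabilities from two models (illustrative values).
proba_a = np.array([[0.7, 0.3], [0.4, 0.6]])
proba_b = np.array([[0.5, 0.5], [0.2, 0.8]])

blended = soft_vote([proba_a, proba_b])
labels = blended.argmax(axis=1)
print(blended)
print(labels)
```

Weights could later be tuned on out-of-fold predictions, and the blended probabilities are a natural input for the calibration step.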

In [None]:
# Placeholder for TabTransformer implementation
# See separate notebook: tabtransformer_training.ipynb

In [None]:
# Placeholder for Optuna HPO
# See separate notebook: hyperparameter_optimization.ipynb

---

**End of Notebook**

For questions or issues, please refer to the project README or contact the research team.