# ⚠️ IMPORTANT: Feature Compatibility Note

**The Challenge:**
- **CIC-IDS2018** uses flow-based features (aggregated statistics over multiple packets)
- **Live Capture** extracts packet-based features (attributes from individual packets)
- These are fundamentally different feature spaces!

**Our Approach:**
This notebook selects 17 features from CIC-IDS2018 that have the **closest semantic meaning** to what can be extracted from live packets:

**Training Features (CIC-IDS2018):**
- Port numbers, protocol, packet lengths
- TCP flags (SYN, ACK, FIN, RST, PSH, URG)
- Packet rates, header lengths, traffic ratios

**Live Capture Features (preprocessor.py):**
- src_ip, dst_ip, src_port, dst_port
- total_length, payload_size, ttl
- tcp_syn_flag, tcp_ack_flag, tcp_fin_flag, tcp_rst_flag
- window_size, sequence_number, packet_rate

**Important Notes:**
1. The features do not match exactly; the mapping is an **approximation**.
2. The model learns attack patterns from flow-level statistics, not from individual packets.
3. After training, you'll need to create a **feature mapping layer** in your deployment.
4. Alternative: implement flow tracking to compute CIC-IDS2018 features in real time.
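
The feature mapping layer from note 3 could look roughly like the following minimal sketch. It assumes live-capture features arrive as a dict keyed by preprocessor names and that the training column order is known at deployment time. `LIVE_TO_CIC` and `to_model_vector` are hypothetical names, and the three-entry mapping is illustrative only.

```python
# Hypothetical mapping from live-capture names to CIC-IDS2018 column names.
# Only a few entries are shown; a real layer would cover every training feature.
LIVE_TO_CIC = {
    "dst_port": "Dst Port",
    "tcp_syn_flag": "SYN Flag Cnt",
    "total_length": "TotLen Fwd Pkts",
}


def to_model_vector(live_features, training_order, default=0.0):
    """Arrange live-capture values in the column order used during training.

    Columns the live capture cannot provide fall back to `default`.
    """
    by_cic_name = {
        LIVE_TO_CIC[name]: float(value)
        for name, value in live_features.items()
        if name in LIVE_TO_CIC
    }
    return [by_cic_name.get(col, default) for col in training_order]


training_order = ["Dst Port", "Protocol", "TotLen Fwd Pkts", "SYN Flag Cnt"]
packet = {"dst_port": 443, "tcp_syn_flag": 1, "total_length": 60, "ttl": 64}
vector = to_model_vector(packet, training_order)
print(vector)  # [443.0, 0.0, 60.0, 1.0]
```

Filling missing columns with a constant is a simplification. Whether 0.0 is a sensible default depends on the scaler and model that are eventually deployed.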

**For Production Use:**
Consider training on synthetic data that matches your exact live capture features, or implement a full flow tracker.
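
A flow tracker of the kind suggested above might start from something like this sketch. It aggregates forward-direction packets per 5-tuple and derives a handful of CIC-style statistics. The class and function names, the `timestamp` and `protocol` packet fields, and the use of seconds as the time unit are assumptions made for illustration. It does not reproduce CICFlowMeter's exact definitions (bidirectional flows, flow timeouts, and so on).

```python
from collections import defaultdict


class FlowStats:
    """Running statistics for one flow (forward direction only, for brevity)."""

    def __init__(self):
        self.fwd_packets = 0
        self.fwd_bytes = 0
        self.syn_count = 0
        self.first_ts = None
        self.last_ts = None

    def update(self, length, syn, ts):
        """Fold one packet into the running totals."""
        self.fwd_packets += 1
        self.fwd_bytes += length
        self.syn_count += int(syn)
        if self.first_ts is None:
            self.first_ts = ts
        self.last_ts = ts

    def features(self):
        """Return flow-level statistics named after CIC-IDS2018 columns."""
        duration = self.last_ts - self.first_ts
        return {
            "Tot Fwd Pkts": self.fwd_packets,
            "TotLen Fwd Pkts": self.fwd_bytes,
            "Fwd Pkt Len Mean": self.fwd_bytes / self.fwd_packets,
            "SYN Flag Cnt": self.syn_count,
            # The rate is undefined for a zero-duration flow; report 0.0 then
            "Fwd Pkts/s": self.fwd_packets / duration if duration > 0 else 0.0,
        }


flows = defaultdict(FlowStats)


def observe(packet):
    """Update the flow that this packet belongs to and return its key."""
    key = (packet["src_ip"], packet["dst_ip"],
           packet["src_port"], packet["dst_port"], packet["protocol"])
    flows[key].update(packet["total_length"], packet["tcp_syn_flag"], packet["timestamp"])
    return key


base = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
        "src_port": 50000, "dst_port": 80, "protocol": 6}
observe({**base, "total_length": 60, "tcp_syn_flag": 1, "timestamp": 0.0})
key = observe({**base, "total_length": 100, "tcp_syn_flag": 0, "timestamp": 2.0})
feats = flows[key].features()
print(feats)
```

A production tracker would also need to expire idle flows and handle the reverse direction. Both are omitted here to keep the sketch short.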

# IDS/IPS Training on Kaggle with CSE-CIC-IDS2018 Dataset

This notebook trains an Intrusion Detection System using the CSE-CIC-IDS2018 dataset on Kaggle infrastructure.

**Features:**
- Load CIC-IDS2018 dataset from Kaggle Datasets
- Efficient memory management for large datasets
- Advanced preprocessing with feature selection and balancing
- Train one of several model types (Random Forest, XGBoost, LightGBM, or MLP), selected in the configuration
- Save models for deployment

**Dataset:** [CSE-CIC-IDS2018](https://www.kaggle.com/datasets/solarmainframe/ids-intrusion-csv)

## 1. Install Required Dependencies

Install packages not pre-installed on Kaggle.

In [None]:
%%time
# Install dependencies (if not already installed)
!pip install -q imbalanced-learn
!pip install -q xgboost
!pip install -q lightgbm

print("✓ Dependencies installed successfully")

## 2. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import logging
import gc
import warnings
import joblib
from pathlib import Path

# Sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Imbalanced-learn
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Models
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# TensorFlow (if available)
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers, models, callbacks
    TF_AVAILABLE = True
    print(f"✓ TensorFlow {tf.__version__} available")
except ImportError:
    TF_AVAILABLE = False
    print("⚠ TensorFlow not available, will use traditional ML models")

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Check GPU availability
if TF_AVAILABLE:
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        print(f"✓ GPU detected: {len(gpus)} device(s)")
        for gpu in gpus:
            print(f"  - {gpu}")
            # Enable memory growth to avoid OOM
            tf.config.experimental.set_memory_growth(gpu, True)
    else:
        print("⚠ No GPU detected, using CPU")
else:
    print("⚠ TensorFlow not available")

print("✓ All libraries imported successfully")

## 3. Configure Kaggle Environment

Set up paths and check available datasets.

In [None]:
# Kaggle paths
KAGGLE_INPUT = Path('/kaggle/input')
KAGGLE_WORKING = Path('/kaggle/working')

# Check if running on Kaggle
IS_KAGGLE = KAGGLE_INPUT.exists()

if IS_KAGGLE:
    print("✓ Running on Kaggle")
    print(f"Input directory: {KAGGLE_INPUT}")
    print(f"Working directory: {KAGGLE_WORKING}")
    
    # List available datasets
    print("\nAvailable datasets:")
    for item in KAGGLE_INPUT.iterdir():
        print(f"  - {item.name}")
        # List CSV files in dataset
        if item.is_dir():
            csv_files = list(item.glob('*.csv'))
            if csv_files:
                print(f"    CSV files found: {len(csv_files)}")
                for csv in csv_files[:3]:  # Show first 3
                    print(f"      • {csv.name}")
else:
    print("⚠ Not running on Kaggle - using local paths")
    KAGGLE_INPUT = Path('../data/cic-ids2018')
    KAGGLE_WORKING = Path('.')

## 4. Load Dataset

**Dataset:** CSE-CIC-IDS2018 from Kaggle Datasets

Add the dataset to your Kaggle notebook:
1. Go to "Add Data" → Search "ids-intrusion-csv"
2. Add "solarmainframe/ids-intrusion-csv"

In [None]:
# Find dataset directory
dataset_dirs = list(KAGGLE_INPUT.glob('*ids*')) + list(KAGGLE_INPUT.glob('*intrusion*'))

if dataset_dirs:
    DATA_DIR = dataset_dirs[0]
    print(f"✓ Dataset found: {DATA_DIR}")
else:
    # Fallback: list all and let user choose
    print("Available directories:")
    for d in KAGGLE_INPUT.iterdir():
        if d.is_dir():
            print(f"  {d.name}")
    DATA_DIR = KAGGLE_INPUT / 'ids-intrusion-csv'  # Default

print(f"\nUsing data directory: {DATA_DIR}")

# List CSV files
csv_files = sorted(list(DATA_DIR.glob('*.csv')))
print(f"Found {len(csv_files)} CSV files:")
for f in csv_files:
    size_mb = f.stat().st_size / (1024 * 1024)
    print(f"  • {f.name:20s} ({size_mb:>6.1f} MB)")

## 5. Define Configuration & Helper Functions
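
The configuration cell below maps raw labels to simplified categories. Because `Series.replace` leaves values without a matching key unchanged, a label spelled differently from the keys of `SIMPLIFIED_MAPPING` would pass through silently. A small check like the following sketch can surface such labels before training. `unmapped_labels` is a hypothetical helper, and the example labels are made up for illustration rather than taken from the dataset.

```python
from collections import Counter


def unmapped_labels(labels, mapping):
    """Count labels that are neither keys nor already-simplified values of `mapping`."""
    known = set(mapping) | set(mapping.values())
    return Counter(label for label in labels if label not in known)


# Illustrative inputs only; in the notebook these would be the values of
# df['Label'] and SIMPLIFIED_MAPPING.
example_mapping = {"Benign": "Normal", "DoS-Hulk": "DoS"}
example_labels = ["Benign", "DoS-Hulk", "DoS attacks-Hulk", "DoS attacks-Hulk"]

missing = unmapped_labels(example_labels, example_mapping)
print(missing)
```

An empty result suggests that every label is covered. Any remaining entries point at keys that may need to be added or corrected in the mapping.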

In [None]:
# Configuration
CONFIG = {
    'use_simplified': True,  # Use simplified attack categories
    'balance_method': 'hybrid',  # 'hybrid', 'undersample', 'oversample', 'smote', or None
    'min_samples_per_class': 10000,  # Minimum samples to keep per class (for hybrid)
    'max_samples_per_class': 100000,  # Maximum samples to keep per class (for hybrid)
    'test_size': 0.2,
    'scaler_type': 'robust',  # 'standard', 'minmax', 'robust'
    'correlation_threshold': 0.9,
    'random_state': 42,
    
    # Which days to load (None = all, or list specific files)
    'days_to_load': [
        # '02-15-2018.csv',
        # '02-16-2018.csv',
        # '02-20-2018.csv',
        # '02-21-2018.csv',
        # '02-22-2018.csv',
        # '02-23-2018.csv',
    ],
    
    # Model training
    'model_type': 'random_forest',  # 'random_forest', 'xgboost', 'lightgbm', 'mlp'
    'epochs': 50,  # For neural networks
    'batch_size': 32,
}

# Attack class mappings
SIMPLIFIED_MAPPING = {
    'Benign': 'Normal',
    'FTP-BruteForce': 'BruteForce',
    'SSH-Bruteforce': 'BruteForce',
    'DoS-GoldenEye': 'DoS',
    'DoS-Slowloris': 'DoS',
    'DoS-SlowHTTPTest': 'DoS',
    'DoS-Hulk': 'DoS',
    'Heartbleed': 'Exploit',
    'Web-BruteForce': 'Web',
    'Web-XSS': 'Web',
    'Infiltration': 'Infiltration',
    'Botnet': 'Botnet',
    'DDoS-LOIC-HTTP': 'DDoS',
    'DDoS-HOIC': 'DDoS',
}

print("✓ Configuration loaded")
print(f"  Days to load: {len(CONFIG['days_to_load']) if CONFIG['days_to_load'] else 'all available'}")
print(f"  Simplified labels: {CONFIG['use_simplified']}")
print(f"  Balance method: {CONFIG['balance_method']}")
print(f"  Model: {CONFIG['model_type']}")
if CONFIG['balance_method'] == 'hybrid':
    print(f"  Min samples/class: {CONFIG['min_samples_per_class']:,}")
    print(f"  Max samples/class: {CONFIG['max_samples_per_class']:,}")

In [None]:
def load_and_preprocess_day(csv_path, chunksize=50000):
    """Load and preprocess a single day's data with memory optimization"""
    print(f"\nLoading {csv_path.name}...")
    
    # Drop unnecessary columns during read
    drop_cols = ['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Timestamp']
    
    # Load in chunks to reduce memory
    chunks = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, low_memory=False):
        # Drop unnecessary columns
        chunk = chunk.drop(columns=[col for col in drop_cols if col in chunk.columns], errors='ignore')
        
        # Remove duplicate header rows (copy to avoid modifying a view of the chunk)
        if 'Dst Port' in chunk.columns:
            chunk = chunk[chunk['Dst Port'] != 'Dst Port'].copy()
        
        # Map textual infinities to inf before numeric conversion
        chunk = chunk.replace(["Infinity", "infinity"], np.inf)
        
        # Convert feature columns to numeric, downcasting to float32 where possible;
        # values that cannot be parsed become NaN
        for col in chunk.columns:
            if col != 'Label':
                chunk[col] = pd.to_numeric(chunk[col], errors='coerce', downcast='float')
        
        # Drop rows with infinite or missing values (including failed conversions)
        chunk = chunk.replace([np.inf, -np.inf], np.nan).dropna()
        
        chunks.append(chunk)
    
    # Combine chunks
    df = pd.concat(chunks, ignore_index=True)
    del chunks
    gc.collect()
    
    print(f"  Loaded {len(df):,} samples")
    print(f"  Memory: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    print(f"  Attack distribution: {df['Label'].value_counts().to_dict()}")
    
    return df

print("✓ Helper functions defined")

In [None]:
# Load data from selected days using a generator
def load_days_generator():
    """Yield one preprocessed DataFrame per selected day."""
    # None or an empty list means "load every CSV found in DATA_DIR"
    days = CONFIG['days_to_load'] or [f.name for f in csv_files]
    for day in days:
        day_path = DATA_DIR / day
        if day_path.exists():
            yield load_and_preprocess_day(day_path)
        else:
            print(f"⚠ {day} not found, skipping...")

# Combine all days
print(f"\n{'='*50}")
print("Combining data from all days...")
df = pd.concat(load_days_generator(), ignore_index=True)
gc.collect()  # Force garbage collection

# Optimize data types after combining
for col in df.columns:
    if col != 'Label':
        df[col] = pd.to_numeric(df[col], errors='coerce', downcast='float')

print(f"✓ Total samples: {len(df):,}")
print(f"✓ Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"\n{'='*50}")
print("Label distribution:")
print(df['Label'].value_counts())
print(f"{'='*50}")

## 6. Data Preprocessing

In [None]:
# Select 17 packet-level features compatible with live capture
print("\n" + "="*50)
print("FEATURE SELECTION: Mapping to 17 packet-level features")
print("="*50)
print("\n⚠️  IMPORTANT: Selecting features that match preprocessor.py")
print("Live capture extracts: IPs, ports, length, TTL, flags, payload size, etc.\n")

# Map CIC-IDS2018 features to closest preprocessor equivalents
# Preprocessor extracts from single packets:
# src_ip_numeric, dst_ip_numeric, total_length, fragment_offset, is_fragment,
# payload_size, ttl, src_port, dst_port, tcp_syn_flag, tcp_ack_flag,
# tcp_fin_flag, tcp_rst_flag, window_size, sequence_number, udp_length, packet_rate

FEATURE_MAPPING = {
    # Direct matches or close approximations
    'Dst Port': 'dst_port',              # ✓ Direct match
    'Protocol': 'protocol',              # ✓ Direct match (6=TCP, 17=UDP, 1=ICMP)
    'TotLen Fwd Pkts': 'total_length',   # ~ Total length of packets
    'Fwd Pkt Len Max': 'payload_size',   # ~ Max payload approximates payload size
    'Fwd Pkt Len Mean': 'mean_length',   # ~ Mean packet length
    'Flow Byts/s': 'packet_rate',        # ~ Bytes per second approximates rate
    
    # TCP flags (CIC-IDS2018 stores per-flow counts; live capture sees per-packet 0/1 flags)
    'PSH Flag Cnt': 'tcp_psh_flag',      # ~ PSH count vs. per-packet flag
    'URG Flag Cnt': 'tcp_urg_flag',      # ~ URG count vs. per-packet flag
    'FIN Flag Cnt': 'tcp_fin_flag',      # ~ FIN count vs. per-packet flag
    'SYN Flag Cnt': 'tcp_syn_flag',      # ~ SYN count vs. per-packet flag
    'RST Flag Cnt': 'tcp_rst_flag',      # ~ RST count vs. per-packet flag
    'ACK Flag Cnt': 'tcp_ack_flag',      # ~ ACK count vs. per-packet flag
    
    # Header and size features
    'Fwd Header Len': 'header_length',   # Header length
    'Fwd Pkt Len Min': 'min_length',     # Min packet length
    'Fwd Pkt Len Std': 'length_std',     # Packet length variation
    
    # Additional useful features
    'Down/Up Ratio': 'down_up_ratio',    # Traffic ratio
    'Fwd Pkts/s': 'packet_rate_fwd',     # Packet rate
}

# Collect CIC-IDS2018 column names in order of preference
selected_cic_names = []

print("Searching for compatible features...\n")

# First, try to get all mapped features
for cic_feat, preprocessor_equiv in FEATURE_MAPPING.items():
    if cic_feat in df.columns:
        selected_cic_names.append(cic_feat)
        print(f"  ✓ {cic_feat:25s} → {preprocessor_equiv}")

# If we have less than 17, add more features
if len(selected_cic_names) < 17:
    print(f"\nNeed {17 - len(selected_cic_names)} more features...")
    
    # Additional candidates that might be useful
    backup_features = [
        'Tot Fwd Pkts', 'Tot Bwd Pkts', 'TotLen Bwd Pkts',
        'Bwd Pkt Len Mean', 'Flow Duration', 'Flow IAT Mean',
        'Fwd IAT Mean', 'Pkt Len Mean', 'Pkt Len Std',
        'Pkt Len Var', 'Fwd Seg Size Avg', 'Init Fwd Win Byts'
    ]
    
    for feat in backup_features:
        if feat in df.columns and feat not in selected_cic_names:
            selected_cic_names.append(feat)
            print(f"  + {feat:25s} (backup feature)")
            if len(selected_cic_names) >= 17:
                break

# Limit to exactly 17
selected_cic_names = selected_cic_names[:17]

print(f"\n{'='*50}")
print(f"✓ Final selection: {len(selected_cic_names)} features")
print(f"{'='*50}")
for i, feat in enumerate(selected_cic_names, 1):
    print(f"  {i:2d}. {feat}")

# Filter dataframe to only include these features + Label
df_filtered = df[selected_cic_names + ['Label']].copy()

# Save feature names globally for metadata
SELECTED_FEATURE_NAMES = selected_cic_names

# Replace original dataframe
del df
df = df_filtered
del df_filtered
gc.collect()

print(f"\n✓ Filtered dataset shape: {df.shape}")
print(f"✓ Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print("="*50)

In [None]:
# Simplify attack classes if configured
if CONFIG['use_simplified']:
    print("Applying simplified class mapping...")
    df['Label'] = df['Label'].replace(SIMPLIFIED_MAPPING)
    print(f"✓ Classes reduced to: {df['Label'].nunique()} unique values")

# Separate features and labels
X = df.drop('Label', axis=1)
y = df['Label']
del df  # Free memory immediately
gc.collect()

print(f"\nFeature matrix shape: {X.shape}")
print(f"✓ Number of features: {X.shape[1]} (target: 17 for live capture compatibility)")
print(f"Memory usage: {X.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"\nLabel distribution:")
print(y.value_counts())

In [None]:
# Convert all columns to numeric with memory-efficient types
print("Converting features to numeric...")
for col in X.columns:
    X[col] = pd.to_numeric(X[col], errors='coerce', downcast='float')

# Handle any remaining NaNs
X.fillna(0, inplace=True)

# Remove constant and near-constant features
print("\nRemoving constant features...")
variances = X.var()
constant_features = variances[variances < 1e-8].index.tolist()
if constant_features:
    X = X.drop(columns=constant_features)
    print(f"  Dropped {len(constant_features)} constant features")

del variances, constant_features
gc.collect()

print(f"\n✓ Final feature matrix: {X.shape}")
print(f"✓ Memory usage: {X.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

## 7. Handle Class Imbalance
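
The hybrid strategy used in the cell below clamps every class count into a `[min, max]` range, oversampling with replacement to grow a class and undersampling without replacement to shrink it. The following dependency-free sketch illustrates that rule on a list of labels. `hybrid_targets` and `resample_indices` are hypothetical names, not part of imbalanced-learn. One design choice worth considering is to run such resampling on the training split only, so that duplicated minority rows cannot end up in both the training and the test set.

```python
import random
from collections import Counter, defaultdict


def hybrid_targets(class_counts, min_samples, max_samples):
    """Clamp each class count into the range [min_samples, max_samples]."""
    return {cls: min(max(n, min_samples), max_samples)
            for cls, n in class_counts.items()}


def resample_indices(labels, targets, seed=42):
    """Return row indices that realise `targets`.

    A class that must grow keeps all its rows and adds random duplicates;
    a class that must shrink keeps a random subset without replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    chosen = []
    for cls, idx in by_class.items():
        target = targets[cls]
        if target > len(idx):
            chosen += idx + rng.choices(idx, k=target - len(idx))
        else:
            chosen += rng.sample(idx, target)
    return chosen


# Toy training-split labels: 8 "Normal" rows and 2 "DoS" rows
train_labels = ["Normal"] * 8 + ["DoS"] * 2
targets = hybrid_targets(Counter(train_labels), min_samples=4, max_samples=6)
indices = resample_indices(train_labels, targets)
balanced = Counter(train_labels[i] for i in indices)
print(targets, balanced)
```

The returned indices can be used to select rows from both the feature matrix and the label vector, which keeps the two aligned.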

In [None]:
# Balance classes before splitting
balance_method = CONFIG.get('balance_method', 'hybrid')
print(f"Applying {balance_method} balancing...")

if balance_method == 'hybrid':
    # Smart balancing: oversample minorities, cap majorities
    min_samples = CONFIG.get('min_samples_per_class', 10000)
    max_samples = CONFIG.get('max_samples_per_class', 100000)
    
    print(f"  Target range: {min_samples:,} - {max_samples:,} samples per class")
    
    # Calculate current distribution
    class_counts = y.value_counts()
    print(f"\nOriginal distribution:")
    for cls, count in class_counts.items():
        print(f"  {cls}: {count:,}")
    
    # Build sampling strategy
    sampling_strategy = {}
    for cls, count in class_counts.items():
        if count < min_samples:
            # Oversample minority classes
            sampling_strategy[cls] = min_samples
        elif count > max_samples:
            # Undersample majority classes
            sampling_strategy[cls] = max_samples
        else:
            # Keep as is
            sampling_strategy[cls] = count
    
    print(f"\nTarget distribution:")
    for cls, target in sampling_strategy.items():
        original = class_counts[cls]
        change = "↑ oversample" if target > original else "↓ undersample" if target < original else "→ unchanged"
        print(f"  {cls}: {original:,} → {target:,} {change}")
    
    # First oversample minorities
    minority_classes = [cls for cls, count in class_counts.items() if count < min_samples]
    if minority_classes:
        print(f"\nOversampling {len(minority_classes)} minority classes...")
        oversample_strategy = {cls: sampling_strategy[cls] for cls in minority_classes}
        oversampler = RandomOverSampler(sampling_strategy=oversample_strategy, random_state=42)
        X_resampled, y_resampled = oversampler.fit_resample(X, y)
    else:
        X_resampled, y_resampled = X, y
    
    # Then undersample majorities
    majority_classes = [cls for cls, count in class_counts.items() if count > max_samples]
    if majority_classes:
        print(f"Undersampling {len(majority_classes)} majority classes...")
        undersample_strategy = {cls: sampling_strategy[cls] for cls in majority_classes}
        undersampler = RandomUnderSampler(sampling_strategy=undersample_strategy, random_state=42)
        X_balanced, y_balanced = undersampler.fit_resample(X_resampled, y_resampled)
    else:
        X_balanced, y_balanced = X_resampled, y_resampled
    
    print(f"\n✓ Before: {X.shape[0]:,} samples")
    print(f"✓ After: {X_balanced.shape[0]:,} samples")
    print(f"\nFinal balanced distribution:")
    print(pd.Series(y_balanced).value_counts())
    
elif balance_method == 'undersample':
    # Custom undersampling with minimum threshold
    min_samples = CONFIG.get('min_samples_per_class', 10000)
    class_counts = y.value_counts()
    
    # Set all classes to max(minority_class_size, min_samples)
    target_size = max(class_counts.min(), min_samples)
    print(f"  Target samples per class: {target_size:,}")
    
    sampler = RandomUnderSampler(sampling_strategy={cls: min(count, target_size) for cls, count in class_counts.items()}, random_state=42)
    X_balanced, y_balanced = sampler.fit_resample(X, y)
    print(f"✓ Before: {X.shape[0]:,} samples")
    print(f"✓ After: {X_balanced.shape[0]:,} samples")
    print(f"\nBalanced distribution:")
    print(pd.Series(y_balanced).value_counts())
    
elif balance_method == 'oversample':
    sampler = RandomOverSampler(random_state=42)
    X_balanced, y_balanced = sampler.fit_resample(X, y)
    print(f"✓ Before: {X.shape[0]:,} samples")
    print(f"✓ After: {X_balanced.shape[0]:,} samples")
    print(f"\nBalanced distribution:")
    print(pd.Series(y_balanced).value_counts())
    
elif balance_method == 'smote':
    sampler = SMOTE(random_state=42, k_neighbors=5)
    X_balanced, y_balanced = sampler.fit_resample(X, y)
    print(f"✓ Before: {X.shape[0]:,} samples")
    print(f"✓ After: {X_balanced.shape[0]:,} samples")
    print(f"\nBalanced distribution:")
    print(pd.Series(y_balanced).value_counts())
    
else:
    X_balanced, y_balanced = X, y
    print("✓ No balancing applied")

# Convert back to float32 if needed
if isinstance(X_balanced, pd.DataFrame):
    for col in X_balanced.columns:
        X_balanced[col] = X_balanced[col].astype('float32')

# Free memory
del X, y
gc.collect()
print(f"\n✓ Memory after balancing: {X_balanced.memory_usage(deep=True).sum() / 1024**2:.1f} MB" if isinstance(X_balanced, pd.DataFrame) else f"✓ Memory after balancing: {X_balanced.nbytes / 1024**2:.1f} MB")

## 8. Train-Test Split & Scaling
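
`CONFIG['scaler_type']` lists `'standard'`, `'minmax'`, and `'robust'` as options. The practical difference between standard and robust scaling shows up when a feature has extreme values, which rate-style flow features often do. The sketch below reimplements both formulas with the standard library for illustration, assuming the usual definitions (mean and population standard deviation versus median and interquartile range). It is not the scikit-learn implementation.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # assumed non-zero in this sketch
    return [(v - median) / iqr for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumed non-zero in this sketch
    return [(v - mean) / std for v in values]


# One large outlier among otherwise small values
flow_bytes = [1.0, 2.0, 3.0, 4.0, 100.0]
robust = robust_scale(flow_bytes)
standard = standard_scale(flow_bytes)
print(robust)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(standard)
```

With robust scaling the four typical values stay spread over a unit-sized range and only the outlier is large. With standard scaling the outlier inflates the standard deviation and compresses the typical values together.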

In [None]:
# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_balanced)

print(f"Class mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {i}: {label}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_encoded, test_size=CONFIG['test_size'],
    random_state=CONFIG['random_state'], stratify=y_encoded
)

print(f"\n✓ Train: {X_train.shape[0]:,} samples")
print(f"✓ Test: {X_test.shape[0]:,} samples")
print(f"✓ Features: {X_train.shape[1]} (compatible with 17-feature live capture)")

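# Save feature names before deleting. Prefer the actual columns, because constant
# features may have been dropped after the initial 17-feature selection.
feature_names = X_train.columns.tolist() if hasattr(X_train, 'columns') else (list(SELECTED_FEATURE_NAMES) if 'SELECTED_FEATURE_NAMES' in globals() else [])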

del X_balanced, y_balanced, y_encoded
gc.collect()

In [None]:
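# Scale features with the scaler chosen in CONFIG (convert to float32 to save memory)
SCALERS = {'standard': StandardScaler, 'minmax': MinMaxScaler, 'robust': RobustScaler}
scaler_type = CONFIG.get('scaler_type', 'standard')
if scaler_type not in SCALERS:
    raise ValueError(f"Unknown scaler type: {scaler_type}")
print(f"Scaling features with {SCALERS[scaler_type].__name__}...")
scaler = SCALERS[scaler_type]()
X_train_scaled = scaler.fit_transform(X_train).astype('float32')
X_test_scaled = scaler.transform(X_test).astype('float32')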

# Delete unscaled versions
del X_train, X_test
gc.collect()

print(f"✓ Scaled training data: {X_train_scaled.shape}")
print(f"✓ Scaled test data: {X_test_scaled.shape}")
print(f"✓ Train memory: {X_train_scaled.nbytes / 1024**2:.1f} MB")
print(f"✓ Test memory: {X_test_scaled.nbytes / 1024**2:.1f} MB")

## 9. Model Training

Train the model selected by `CONFIG['model_type']`.

In [None]:
model_type = CONFIG['model_type']
n_classes = len(label_encoder.classes_)

print(f"Training {model_type.upper()} model for {n_classes} classes...")
print("="*60)

if model_type == 'random_forest':
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=20,
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    model.fit(X_train_scaled, y_train)
    
elif model_type == 'xgboost':
    import xgboost as xgb
    
    # Check for GPU
    try:
        import subprocess
        gpu_available = subprocess.run(['nvidia-smi'], capture_output=True).returncode == 0
    except OSError:
        gpu_available = False
    
    xgb_params = {
        'n_estimators': 100,
        'max_depth': 10,
        'learning_rate': 0.1,
        'n_jobs': -1,
        'random_state': 42,
        'verbosity': 1
    }
    
    if gpu_available:
        # XGBoost 2.0+ selects the device via 'device' (replacing 'gpu_id' and 'gpu_hist')
        xgb_params['device'] = 'cuda'
        print("  Using GPU acceleration for XGBoost")
    else:
        xgb_params['device'] = 'cpu'
    
    model = xgb.XGBClassifier(**xgb_params)
    model.fit(X_train_scaled, y_train, eval_set=[(X_test_scaled, y_test)])

elif model_type == 'lightgbm':
    import lightgbm as lgb
    
    # Check for GPU
    try:
        import subprocess
        gpu_available = subprocess.run(['nvidia-smi'], capture_output=True).returncode == 0
    except OSError:
        gpu_available = False
    
    lgb_params = {
        'n_estimators': 100,
        'max_depth': 10,
        'learning_rate': 0.1,
        'n_jobs': -1,
        'random_state': 42,
        'verbose': 1
    }
    
    if gpu_available:
        lgb_params['device'] = 'gpu'
        print("  Using GPU acceleration for LightGBM")
    
    model = lgb.LGBMClassifier(**lgb_params)
    model.fit(X_train_scaled, y_train, eval_set=[(X_test_scaled, y_test)])
    
elif model_type == 'mlp':
    if not TF_AVAILABLE:
        raise RuntimeError("TensorFlow not available for MLP training")
    
    # TensorFlow MLP
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
    from tensorflow.keras.callbacks import EarlyStopping
    
    model = Sequential([
        Dense(256, activation='relu', input_shape=(X_train_scaled.shape[1],)),
        BatchNormalization(),
        Dropout(0.3),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dropout(0.2),
        Dense(n_classes, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    
    history = model.fit(
        X_train_scaled, y_train,
        validation_data=(X_test_scaled, y_test),
        epochs=CONFIG['epochs'],
        batch_size=CONFIG['batch_size'],
        callbacks=[early_stop],
        verbose=1
    )
    
else:
    raise ValueError(f"Unknown model type: {model_type}")

print("\n✓ Training complete!")

## 10. Evaluation
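
The classification report printed below is derived from the confusion matrix. As a reminder of how the per-class numbers relate to it, here is a minimal sketch that computes precision, recall, and F1 from a matrix whose rows are true labels and whose columns are predicted labels (the layout used by scikit-learn's `confusion_matrix`). `per_class_metrics` is a hypothetical helper and the 2×2 matrix is toy data.

```python
def per_class_metrics(matrix):
    """Precision, recall and F1 per class from a square confusion matrix
    (rows = true labels, columns = predicted labels)."""
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


toy_matrix = [[8, 2],
              [1, 9]]
metrics = per_class_metrics(toy_matrix)
macro_f1 = sum(m["f1"] for m in metrics) / len(metrics)
for k, m in enumerate(metrics):
    print(k, m)
print("macro F1:", macro_f1)
```

Macro-averaged F1 weights every class equally, which can be more informative than overall accuracy when some attack classes are rare.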

In [None]:
# Make predictions
y_pred = model.predict(X_test_scaled)

# For neural networks, get class predictions
if model_type == 'mlp':
    y_pred = np.argmax(y_pred, axis=1)

# Classification report
print("Classification Report:")
print("="*60)
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\n✓ Overall Accuracy: {accuracy:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\n✓ Confusion Matrix shape: {cm.shape}")

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.title(f'Confusion Matrix - {model_type.upper()}')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig(KAGGLE_WORKING / 'confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Confusion matrix saved")

## 11. Save Model & Artifacts

Save the trained model, scaler, label encoder, and metadata for deployment.

In [None]:
# Save model
if model_type == 'mlp':
    model_path = KAGGLE_WORKING / 'ids_model_mlp.keras'
    model.save(model_path)
    print(f"✓ Keras model saved to {model_path}")
else:
    model_path = KAGGLE_WORKING / f'ids_model_{model_type}.joblib'
    joblib.dump(model, model_path)
    print(f"✓ {model_type.upper()} model saved to {model_path}")

# Save scaler
scaler_path = KAGGLE_WORKING / 'scaler.joblib'
joblib.dump(scaler, scaler_path)
print(f"✓ Scaler saved to {scaler_path}")

# Save label encoder
encoder_path = KAGGLE_WORKING / 'label_encoder.joblib'
joblib.dump(label_encoder, encoder_path)
print(f"✓ Label encoder saved to {encoder_path}")

In [None]:
# Save metadata
import json

metadata = {
    'model_type': model_type,
    'n_classes': n_classes,
    'class_names': label_encoder.classes_.tolist(),
    'feature_names': feature_names,
    'n_features': X_train_scaled.shape[1],
    'accuracy': float(accuracy),
    'training_samples': X_train_scaled.shape[0],
    'test_samples': X_test_scaled.shape[0],
    'config': CONFIG
}

metadata_path = KAGGLE_WORKING / 'model_metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"✓ Metadata saved to {metadata_path}")
print("\n" + "="*60)
print(f"ALL ARTIFACTS SAVED TO {KAGGLE_WORKING.resolve()}")
print("Download these files to use in your local IDS system:")
print("  - Model file (.keras or .joblib)")
print("  - scaler.joblib")
print("  - label_encoder.joblib")
print("  - model_metadata.json")
print("  - confusion_matrix.png")
print("="*60)