
# 🚨 Intrusion Detection System (IDS) with XGBoost — NSL-KDD

**Team 11: AI for Cybersecurity** — *Model Engineer Notebook*

This notebook builds an Intrusion Detection System (IDS) using the **NSL-KDD** dataset and the **XGBoost** algorithm, following a standard ML workflow: exploration, preprocessing, baseline, tuning, evaluation, interpretation, and export.

## Objectives
- Load and explore the NSL-KDD dataset
- Preprocess data (encoding, scaling, train/validation split, SMOTE)
- Train baseline and XGBoost models
- Evaluate with classification metrics & confusion matrix
- Explain predictions with **SHAP**
- Discuss ethics and limitations
- Export the trained model for reuse



## 1. Environment Setup

> Uncomment and run the cell below to install any missing libraries (e.g., in Colab). Skip it if they are already installed.


In [None]:

# If running on a fresh environment (e.g., Colab), uncomment to install:
# %pip install -q xgboost scikit-learn pandas numpy shap imbalanced-learn kagglehub matplotlib



## 2. Imports & Configuration


In [None]:

import os
import sys
import json
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (classification_report, confusion_matrix, 
                             f1_score, precision_score, recall_score, accuracy_score)

from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

from xgboost import XGBClassifier

import shap
import joblib

# Matplotlib settings
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['axes.grid'] = True



## 3. Data Loading & Exploration

This section attempts to locate NSL-KDD files automatically. **You have two options**:

- **Option A (Recommended):** Use `kagglehub` to download the dataset programmatically.
- **Option B:** Manually place the files locally (e.g., `./data/`) and set `DATA_DIR` accordingly.

**Common NSL-KDD file names:**
- `KDDTrain+.csv` or `KDDTrain+.txt`
- `KDDTest+.csv` or `KDDTest+.txt`

> Both the `.csv` and `.txt` variants are comma-separated; the loader accepts either extension.


In [None]:

# Locate or download dataset
DATA_DIR = os.environ.get("DATA_DIR", "./data")
DATA_DIR = Path(DATA_DIR)
DATA_DIR.mkdir(parents=True, exist_ok=True)

train_candidates = [
    DATA_DIR / "KDDTrain+.csv", DATA_DIR / "KDDTrain+.txt",
    DATA_DIR / "KDDTrain.csv",  DATA_DIR / "KDDTrain.txt"
]
test_candidates = [
    DATA_DIR / "KDDTest+.csv", DATA_DIR / "KDDTest+.txt",
    DATA_DIR / "KDDTest.csv",  DATA_DIR / "KDDTest.txt"
]

# Try kagglehub if files missing
if not any(p.exists() for p in train_candidates+test_candidates):
    try:
        import kagglehub
        path = Path(kagglehub.dataset_download("hassan06/nslkdd"))
        # Move files into DATA_DIR if needed
        for p in path.iterdir():
            if p.is_file():
                dest = DATA_DIR / p.name
                if not dest.exists():
                    dest.write_bytes(p.read_bytes())
        print("Downloaded with kagglehub to:", DATA_DIR.resolve())
    except Exception as e:
        print("kagglehub download failed or not available:", e)
        print("Please place NSL-KDD files into", DATA_DIR.resolve())

def load_nsl_kdd(train_path, test_path):
    def read_any(p):
        # Both .txt and .csv variants are comma-separated. The original .txt
        # files have no header row; some re-packaged .csv files do.
        df = pd.read_csv(p, header=None, low_memory=False)
        # 'duration' (first column) is numeric, so a non-numeric first cell
        # means the first row is a header: re-read using it as column names.
        if pd.isna(pd.to_numeric(df.iloc[0, 0], errors="coerce")):
            df = pd.read_csv(p, header=0)
        return df
    df_train = read_any(train_path)
    df_test  = read_any(test_path)
    return df_train, df_test

# Column names (NSL-KDD has 41 feature columns + 'label' + 'difficulty')
NSL_KDD_COLUMNS = [
    "duration","protocol_type","service","flag","src_bytes","dst_bytes","land","wrong_fragment","urgent",
    "hot","num_failed_logins","logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_host_login","is_guest_login",
    "count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate",
    "dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
    "dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate",
    "label","difficulty"
]

# Find actual files
def find_first_existing(cands):
    for p in cands:
        if Path(p).exists():
            return p
    return None

train_file = find_first_existing(train_candidates)
test_file  = find_first_existing(test_candidates)

if train_file is None or test_file is None:
    raise FileNotFoundError(
        f"NSL-KDD files not found in {DATA_DIR.resolve()}. "
        "Expected something like KDDTrain+.csv/.txt and KDDTest+.csv/.txt."
    )

df_train, df_test = load_nsl_kdd(train_file, test_file)

# If files had no headers, apply column names
if df_train.shape[1] == len(NSL_KDD_COLUMNS):
    df_train.columns = NSL_KDD_COLUMNS
if df_test.shape[1] == len(NSL_KDD_COLUMNS):
    df_test.columns = NSL_KDD_COLUMNS

print("Train shape:", df_train.shape, "| Test shape:", df_test.shape)
display(df_train.head())



### 3.1 Basic EDA
- Shapes, dtypes, missing values
- Class distribution (multi-class & binary)
- Quick feature correlation (numeric only)


In [None]:

# Dtypes & missing
display(df_train.dtypes.head(20))
print("\nMissing values (train):", df_train.isna().sum().sum())
print("Missing values (test):", df_test.isna().sum().sum())

# Class distribution (multi-class)
print("\nLabel distribution (train):\n", df_train['label'].value_counts())

# Binary mapping for overview
def to_binary(y):
    return (y != 'normal').astype(int)

y_train_bin = to_binary(df_train['label'])
print("\nBinary distribution (train):\n", pd.Series(y_train_bin).value_counts())

# Quick numeric correlation (sample to speed up)
num_cols = df_train.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols) > 0:
    corr = df_train[num_cols].sample(min(len(df_train), 5000), random_state=42).corr()
    plt.figure()
    plt.imshow(corr, aspect='auto')
    plt.title("Numeric Feature Correlation (sample)")
    plt.colorbar()
    plt.tight_layout()
    plt.show()



## 4. Data Preprocessing

- Handle missing values (imputation)
- Encode categoricals (`protocol_type`, `service`, `flag`)
- Scale numericals
- Convert labels to binary (normal → 0, attacks → 1)
- **Important:** Apply **SMOTE on the training set only** to avoid leakage
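
The last point can be illustrated with a minimal, standard-library sketch of the interpolation idea behind SMOTE. It is not the `imblearn` implementation used in this notebook: the helper `smote_like_oversample`, the toy data, and the use of a single nearest neighbour are assumptions made for illustration. Synthetic minority points are created from the training portion only, and the held-out rows are never touched.

```python
import random

def smote_like_oversample(minority, n_new, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and its nearest minority neighbour (simplified: k=1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Nearest *other* minority point by squared Euclidean distance
        neighbour = min(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Toy data (hypothetical): (features, label); label 1 is the minority class
rows = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0),
        ((0.2, 0.2), 0), ((0.4, 0.1), 0), ((1.0, 1.1), 1), ((1.2, 0.9), 1),
        ((0.9, 1.0), 1), ((1.1, 1.2), 1)]

# 1) Split FIRST, so held-out rows cannot influence the synthetic points
train, held_out = rows[:4] + rows[6:9], rows[4:6] + rows[9:]

# 2) Oversample the minority class of the training portion only
train_minority = [x for x, y in train if y == 1]
train_majority = [x for x, y in train if y == 0]
new_points = smote_like_oversample(
    train_minority, len(train_majority) - len(train_minority)
)
train_balanced = train + [(p, 1) for p in new_points]

print("train class counts:",
      sum(y == 0 for _, y in train_balanced),
      sum(y == 1 for _, y in train_balanced))
print("held-out size unchanged:", len(held_out))
```

Placing the sampler inside an `imblearn` pipeline, as this notebook does, gives the same guarantee automatically: samplers run during `fit` and are skipped during `predict`.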


In [None]:

# Separate features / labels
X_train_raw = df_train.drop(columns=['label', 'difficulty'], errors='ignore')
X_test_raw  = df_test.drop(columns=['label', 'difficulty'], errors='ignore')
y_train_mc  = df_train['label'].copy()
y_test_mc   = df_test['label'].copy()

# Binary labels
y_train = (y_train_mc != 'normal').astype(int)
y_test  = (y_test_mc  != 'normal').astype(int)

# Identify categorical / numerical columns
categorical_features = ['protocol_type', 'service', 'flag']
numerical_features = [c for c in X_train_raw.columns if c not in categorical_features]

# Preprocessors
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Train/test split is already provided by NSL-KDD (KDDTrain+ vs KDDTest+)
# We will still reserve a validation split from the train set for tuning if needed.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_raw, y_train, test_size=0.2, random_state=42, stratify=y_train
)

print("Train split:", X_tr.shape, "Val split:", X_val.shape)

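# SMOTE is only instantiated here. It runs inside the imblearn pipelines below,
# where samplers are applied during fit() and skipped during predict().
smote = SMOTE(random_state=42)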



## 5. Baseline Model (Random Forest)


In [None]:

baseline_clf = RandomForestClassifier(
    n_estimators=200, random_state=42, n_jobs=-1
)

baseline_pipe = ImbPipeline(steps=[
    ("preprocess", preprocessor),
    ("smote", smote),
    ("clf", baseline_clf)
])

baseline_pipe.fit(X_tr, y_tr)

def evaluate_model(estimator, X, y, label="Eval"):
    y_pred = estimator.predict(X)
    y_proba = None
    try:
        y_proba = estimator.predict_proba(X)[:,1]
    except Exception:
        pass
    f1 = f1_score(y, y_pred)
    prec = precision_score(y, y_pred)
    rec = recall_score(y, y_pred)
    acc = accuracy_score(y, y_pred)
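    # Fix the label order so the matrix is always 2x2, even if one class is absent
    tn, fp, fn, tp = confusion_matrix(y, y_pred, labels=[0, 1]).ravel()
    # FPR is undefined when there are no negative (normal) samples
    fpr = fp / (fp + tn) if (fp + tn) > 0 else float("nan")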
    print(f"=== {label} ===")
    print(f"F1: {f1:.4f} | Precision: {prec:.4f} | Recall: {rec:.4f} | Accuracy: {acc:.4f} | FPR: {fpr:.4f}")
    print("\nClassification Report:\n", classification_report(y, y_pred))
    # Confusion matrix plot
    cm = np.array([[tn, fp],[fn, tp]])
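    plt.figure()
    plt.imshow(cm, cmap="Blues")
    plt.grid(False)  # the global 'axes.grid' setting would draw lines over the matrix
    plt.xticks([0, 1], ["normal", "attack"])
    plt.yticks([0, 1], ["normal", "attack"])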
    plt.title(f"Confusion Matrix — {label}")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    for (i, j), v in np.ndenumerate(cm):
        plt.text(j, i, int(v), ha='center', va='center')
    plt.tight_layout()
    plt.show()
    return {"f1": f1, "precision": prec, "recall": rec, "accuracy": acc, "fpr": fpr}

print("\nBaseline on Validation Set:")
baseline_val_metrics = evaluate_model(baseline_pipe, X_val, y_val, label="Baseline (Val)")



## 6. XGBoost Model — Training & Hyperparameter Tuning
We use XGBoost's **scikit-learn wrapper** (`XGBClassifier`) so the model plugs directly into pipelines and `GridSearchCV`.
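
Because grid search is exhaustive, its cost can be estimated before launching it. As a rough sketch, the snippet below enumerates a grid with the same keys and values as the one searched in this section and counts the resulting model fits. The `cv_folds` variable name is illustrative. It also shows how the `<step>__<parameter>` key convention splits into a pipeline step name and an estimator parameter.

```python
from itertools import product

# Same keys and values as the grid searched in this section
param_grid = {
    "clf__max_depth": [4, 6, 8],
    "clf__learning_rate": [0.05, 0.1],
    "clf__subsample": [0.8, 1.0],
    "clf__colsample_bytree": [0.8, 1.0],
}
cv_folds = 3

# Every combination of values is one candidate configuration
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# "<step>__<param>" keys address a parameter of a named pipeline step
step, param = "clf__max_depth".split("__", 1)

n_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} fits (plus one final refit)")
print("step:", step, "| parameter:", param)
```

Each of those fits also re-runs preprocessing and SMOTE on its training folds, so adding one more value to any list multiplies the total run time accordingly.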


In [None]:

xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    n_estimators=300,
    random_state=42,
    n_jobs=-1,
    tree_method="hist"  # fast & memory efficient
)

xgb_pipe = ImbPipeline(steps=[
    ("preprocess", preprocessor),
    ("smote", smote),
    ("clf", xgb)
])

param_grid = {
    "clf__max_depth": [4, 6, 8],
    "clf__learning_rate": [0.05, 0.1],
    "clf__subsample": [0.8, 1.0],
    "clf__colsample_bytree": [0.8, 1.0],
}

grid = GridSearchCV(
    xgb_pipe,
    param_grid=param_grid,
    scoring="f1",
    n_jobs=-1,
    cv=3,
    verbose=1
)

grid.fit(X_tr, y_tr)
print("Best Params:", grid.best_params_)
best_xgb = grid.best_estimator_

print("\nValidation Performance (Best XGB):")
xgb_val_metrics = evaluate_model(best_xgb, X_val, y_val, label="XGB (Val)")



## 7. Final Evaluation on Test Set
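
Every metric printed by `evaluate_model` derives from the four confusion-matrix counts. The sketch below recomputes them by hand, with attacks as the positive class. The counts are made up for illustration and are not results from this notebook; `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute the notebook's metrics from confusion-matrix counts.
    Positive class = attack (1), negative class = normal (0)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # False positive rate: share of normal traffic flagged as an attack
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"f1": f1, "precision": precision, "recall": recall,
            "accuracy": accuracy, "fpr": fpr}

# Illustrative counts only
metrics = binary_metrics(tn=900, fp=100, fn=50, tp=950)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For an IDS, recall (missed attacks) and FPR (analyst workload) usually matter more than accuracy, which can look high even when one of the two is poor.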


In [None]:

print("Baseline on Test Set:")
_ = evaluate_model(baseline_pipe, X_test_raw, y_test, label="Baseline (Test)")

print("\nXGB on Test Set:")
_ = evaluate_model(best_xgb, X_test_raw, y_test, label="XGB (Test)")



## 8. Model Interpretability with SHAP
We compute **SHAP values** for the best XGBoost model to explain which features drive predictions.
> Note: SHAP values are computed on a random sample of up to 2,000 validation rows for speed.
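
To make the idea concrete, here is a brute-force sketch of exact Shapley values for a tiny hand-written scoring function. It is illustrative only: `toy_model`, its coefficients, and the baseline values are hypothetical, and `shap.TreeExplainer` uses a much faster tree-specific algorithm rather than this enumeration. The sketch shows the additivity property that SHAP plots rely on: the per-feature contributions sum to the difference between the prediction and the baseline output.

```python
from itertools import permutations

def toy_model(x):
    # Hypothetical score with an interaction term between two features
    return 2.0 * x["src_bytes"] + 1.0 * x["count"] + 0.5 * x["src_bytes"] * x["count"]

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all feature orderings. Features not yet 'revealed' keep their
    baseline value."""
    features = list(instance)
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        prev = model(current)
        for f in order:
            current[f] = instance[f]
            new = model(current)
            contrib[f] += new - prev
            prev = new
    return {f: total / len(orderings) for f, total in contrib.items()}

instance = {"src_bytes": 3.0, "count": 2.0}
baseline = {"src_bytes": 0.0, "count": 0.0}

phi = shapley_values(toy_model, instance, baseline)
print("contributions:", phi)
print("sum of contributions:", sum(phi.values()))
print("prediction - baseline:", toy_model(instance) - toy_model(baseline))
```

The interaction term is split evenly between the two features, which is why neither contribution equals its standalone linear term.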


In [None]:

# Extract trained XGB model and the fitted preprocessor from the pipeline
fitted_preprocessor = best_xgb.named_steps["preprocess"]
fitted_clf = best_xgb.named_steps["clf"]

# Transform a sample of validation data to model-ready numeric features
sample_idx = np.random.RandomState(42).choice(len(X_val), size=min(2000, len(X_val)), replace=False)
X_val_sample = X_val.iloc[sample_idx]
y_val_sample = y_val.iloc[sample_idx]

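X_val_transformed = fitted_preprocessor.transform(X_val_sample)
# ColumnTransformer may return a SciPy sparse matrix (one-hot columns);
# convert to a dense array so the SHAP plots can index it.
if hasattr(X_val_transformed, "toarray"):
    X_val_transformed = X_val_transformed.toarray()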

# Get feature names from the preprocessor
def get_feature_names(preprocessor, numeric_features, categorical_features):
    num_features_out = numeric_features
    cat_encoder = preprocessor.named_transformers_["cat"].named_steps["onehot"]
    cat_features_out = cat_encoder.get_feature_names_out(categorical_features).tolist()
    return num_features_out + cat_features_out

feature_names = get_feature_names(fitted_preprocessor, 
                                  fitted_preprocessor.transformers_[0][2], 
                                  fitted_preprocessor.transformers_[1][2])

# SHAP analysis
explainer = shap.TreeExplainer(fitted_clf)
shap_values = explainer.shap_values(X_val_transformed)

# Summary plot (show=True renders the figure, so no extra plt.show() is needed)
shap.summary_plot(shap_values, X_val_transformed, feature_names=feature_names, show=True)

# Dependence plot for the feature with the largest mean |SHAP| value
top_feature_idx = int(np.argmax(np.abs(shap_values).mean(axis=0)))
top_feature_name = feature_names[top_feature_idx]
shap.dependence_plot(top_feature_name, shap_values, X_val_transformed, feature_names=feature_names, show=True)



## 9. Ethical Analysis

- **Privacy:** Even anonymized network telemetry can reveal sensitive behavior. Use data minimization and access controls; avoid storing PII; follow GDPR-like principles where applicable.
- **False Positives:** A high false positive rate (FPR) overwhelms analysts and causes alert fatigue. Balance precision and recall carefully, and route low-confidence alerts to a sandbox or review queue rather than blocking outright.
- **Bias & Coverage:** Rare attack types may be under-detected. Use techniques like **SMOTE**, targeted data augmentation, and periodic re-training on fresh telemetry.
- **Human-in-the-Loop:** Keep humans in decision loops for critical actions (blocking/quarantine). Provide explanations (e.g., **SHAP**) to support trust and oversight.
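
The false-positive and human-in-the-loop points above can be combined into a simple confidence-based triage rule. The sketch below is one possible policy, not a recommendation: the function name `triage_alert`, the action labels, and both thresholds are hypothetical and would need tuning against validation data and analyst capacity. It assumes the score is the model's predicted attack probability, such as the second column of `predict_proba`.

```python
def triage_alert(attack_probability, review_threshold=0.5, escalate_threshold=0.9):
    """Map a predicted attack probability to an action.
    Thresholds are illustrative assumptions, not tuned values."""
    if not 0.0 <= attack_probability <= 1.0:
        raise ValueError("attack_probability must be in [0, 1]")
    if attack_probability >= escalate_threshold:
        # High confidence: a human analyst decides whether to block
        return "escalate_to_analyst"
    if attack_probability >= review_threshold:
        # Lower confidence: isolate and observe instead of blocking
        return "sandbox_and_monitor"
    return "allow"

for p in (0.12, 0.63, 0.97):
    print(f"{p:.2f} -> {triage_alert(p)}")
```

Note that no branch blocks traffic automatically; the highest tier hands the decision to a person, in line with the human-in-the-loop point.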



## 10. Conclusion & Future Work

**Summary:** We built an IDS on NSL-KDD with an XGBoost core, addressed class imbalance via SMOTE, and explained predictions using SHAP. We benchmarked against a Random Forest baseline and evaluated with F1, precision, recall, accuracy, and FPR.

**Future Enhancements:**
- Add MLflow for experiment tracking and reproducibility
- Evaluate deep learning or autoencoder-based anomaly detection
- Explore streaming inference and real-time features
- Perform feature drift and data quality monitoring in production



## 11. Export Trained Model
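Save the trained XGBoost pipeline (preprocessing, SMOTE step, and classifier) to disk for reuse in downstream applications. Loading it with `joblib.load` requires the same libraries, including `imbalanced-learn`, ideally at matching versions. Because joblib uses pickle, only load files from trusted sources.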


In [None]:

OUTPUT_DIR = Path("./models")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

model_path = OUTPUT_DIR / "xgb_nslkdd_pipeline.joblib"
joblib.dump(best_xgb, model_path)
print("Saved model to:", model_path.resolve())



## 12. Appendix: Model Spec & Config

```yaml
model:
  name: xgboost_classifier
  objective: binary:logistic
  metrics: [f1, precision, recall, accuracy, fpr]
  imbalance: SMOTE
  interpretability: SHAP
  selection: grid_search (cv=3)
  features:
    categorical: [protocol_type, service, flag]
    numerical: all_others
artifacts:
  pipeline: models/xgb_nslkdd_pipeline.joblib
  notebooks: this_notebook.ipynb
```
