# Lesson 11: Assignment Guide - NSL-KDD Network Intrusion Detection

**This lesson walks you through YOUR ASSIGNMENT step by step!**

**What you'll do:**
1. Load and explore the NSL-KDD dataset
2. Build a BASELINE model
3. Apply ONE optimization technique
4. Compare baseline vs optimized
5. Prepare your report

---

## PART 1: Load and Understand the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
import warnings
# Hide library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Load the training and test sets
train_df = pd.read_csv('../datasets/NSL_KDD/NSL_ppTrain.csv')
test_df = pd.read_csv('../datasets/NSL_KDD/NSL_ppTest.csv')

print(f"Training data: {train_df.shape}")
print(f"Test data: {test_df.shape}")
print(f"\nColumns: {train_df.columns.tolist()[:10]}...")

In [None]:
# Check the target variable
print("Attack Categories (this is what you predict):")
print(train_df['atakcat'].value_counts())

# Visualize
train_df['atakcat'].value_counts().plot(kind='bar', color='steelblue')
plt.title('NSL-KDD: Class Distribution (HIGHLY IMBALANCED!)')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Key Observations:
- **benign**: Normal traffic (~67,000)
- **dos**: Denial of Service attacks (~46,000)
- **probe**: Scanning attacks (~11,000)
- **r2l**: Remote to Local attacks (~995)
- **u2r**: User to Root attacks (only ~52!)

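**This is HIGHLY IMBALANCED!** Going by the counts above, benign has roughly 1,300 times as many samples as u2r, so a model can score high accuracy while missing almost every rare attack.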
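
To put a number on the imbalance, the sketch below computes each class's share and the largest-to-smallest ratio. It uses the rounded counts from the list above as assumed inputs, not exact dataset statistics; in the notebook you would read the real counts from `train_df['atakcat'].value_counts()`.

```python
# Approximate training-set class counts quoted in the observations above
# (rounded values from this lesson, not exact dataset statistics).
class_counts = {
    "benign": 67_000,
    "dos": 46_000,
    "probe": 11_000,
    "r2l": 995,
    "u2r": 52,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)

# Share of the training set taken up by each class
class_share = {name: count / total for name, count in class_counts.items()}

# How many times larger the biggest class is than the smallest
imbalance_ratio = class_counts[largest] / class_counts[smallest]

# Accuracy of a "model" that always predicts the largest class
majority_baseline_accuracy = class_counts[largest] / total

for name, share in class_share.items():
    print(f"{name:>7}: {share:7.3%}")
print(f"Imbalance ratio ({largest} : {smallest}) = {imbalance_ratio:.0f} : 1")
print(f"Always predicting '{largest}' gives {majority_baseline_accuracy:.1%} accuracy on this data")
```

The last line is a useful sanity check for your report: any model should beat the accuracy of always predicting the majority class.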

---

## PART 2: Data Preprocessing

In [None]:
# Separate features and target
target_col = 'atakcat'  # What we predict

# Drop 'label' (specific attack name) and 'atakcat' (our target)
X_train = train_df.drop(['label', 'atakcat'], axis=1)
y_train = train_df[target_col]

X_test = test_df.drop(['label', 'atakcat'], axis=1)
y_test = test_df[target_col]

print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")

In [None]:
# Check for categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns
print(f"Categorical columns: {categorical_cols.tolist()}")

# One-hot encode
X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols)

# Align the test columns to the training columns: categories missing from the
# test set become all-zero columns, and test-only categories are dropped
X_train_encoded, X_test_encoded = X_train_encoded.align(
    X_test_encoded, join='left', axis=1, fill_value=0
)

print(f"\nAfter encoding: {X_train_encoded.shape}")
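
To see what the alignment step does, here is a minimal sketch on two toy frames. The column names and values (`duration`, `protocol_type`, `sctp`) are made up for illustration and are not claimed to match the NSL-KDD files.

```python
import pandas as pd

# Toy stand-ins for the train/test frames (hypothetical values, not NSL-KDD rows)
toy_train = pd.DataFrame({
    "duration": [0, 2, 5],
    "protocol_type": ["tcp", "udp", "icmp"],
})
toy_test = pd.DataFrame({
    "duration": [1, 3],
    "protocol_type": ["tcp", "sctp"],  # 'sctp' never appears in training
})

train_enc = pd.get_dummies(toy_train, columns=["protocol_type"])
test_enc = pd.get_dummies(toy_test, columns=["protocol_type"])

# join='left' keeps exactly the training columns, in training order.
# Columns missing from the test frame are created and filled with 0;
# columns that exist only in the test frame are dropped.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

After alignment both frames have identical columns, which is what the model needs: it can only accept the features it was trained on.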

In [None]:
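# Scale numeric features to [0, 1].
# Fit the scaler on the training set only, then reuse it on the test set
# so no information from the test data leaks into preprocessing.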
numeric_cols = X_train.select_dtypes(include=['float64', 'int64']).columns

scaler = MinMaxScaler()
X_train_encoded[numeric_cols] = scaler.fit_transform(X_train_encoded[numeric_cols])
X_test_encoded[numeric_cols] = scaler.transform(X_test_encoded[numeric_cols])

print("Scaling complete!")
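
The following sketch shows the min-max formula by hand, as a plain-Python illustration rather than the `MinMaxScaler` implementation. The helper names and feature values are hypothetical. It shows one consequence of fitting on the training set only: test values outside the training range land outside [0, 1].

```python
def fit_min_max(values):
    """Return the (min, max) learned from the training values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # Constant feature: map everything to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


train_values = [0, 50, 100, 200]  # hypothetical feature values
test_values = [25, 200, 300]      # 300 exceeds the training maximum

# Learn the range from the training data only
lo, hi = fit_min_max(train_values)

train_scaled = apply_min_max(train_values, lo, hi)
test_scaled = apply_min_max(test_values, lo, hi)

print("train:", train_scaled)
print("test: ", test_scaled)
```

A scaled test value above 1 is expected here and is not an error; it simply means the test set contains a value larger than anything seen in training.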

---

## PART 3: BASELINE Model

In [None]:
# Train baseline Random Forest
print("Training Baseline Model...")
baseline_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

baseline_model.fit(X_train_encoded, y_train)
baseline_pred = baseline_model.predict(X_test_encoded)

print("\n" + "="*60)
print("BASELINE MODEL RESULTS")
print("="*60)
print(classification_report(y_test, baseline_pred))

In [None]:
# Save baseline metrics
baseline_f1_weighted = f1_score(y_test, baseline_pred, average='weighted')
baseline_f1_macro = f1_score(y_test, baseline_pred, average='macro')
baseline_accuracy = accuracy_score(y_test, baseline_pred)

print(f"Baseline Accuracy:   {baseline_accuracy:.4f}")
print(f"Baseline F1-weighted: {baseline_f1_weighted:.4f}")
print(f"Baseline F1-macro:    {baseline_f1_macro:.4f}")
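
The two F1 averages can differ a lot on imbalanced data. The sketch below recomputes both from per-class F1 scores and supports. All numbers are invented for illustration and are not results from this dataset.

```python
# Hypothetical per-class F1 scores and supports (true samples per class),
# chosen only to illustrate the arithmetic.
per_class_f1 = {"benign": 0.95, "dos": 0.90, "probe": 0.80, "r2l": 0.20, "u2r": 0.10}
support = {"benign": 900, "dos": 700, "probe": 200, "r2l": 190, "u2r": 10}

# Macro: plain mean, so every class counts equally
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted: mean weighted by the number of true samples in each class
total_support = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total_support

print(f"F1-macro:    {macro_f1:.3f}")
print(f"F1-weighted: {weighted_f1:.3f}")
```

In this toy case the weighted score looks healthy while the macro score exposes the weak rare classes, which is why macro-F1 is the more telling number when you care about r2l and u2r.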

---

## PART 4: OPTIMIZATION

**Choose ONE of these techniques:**
- Option A: Hyperparameter Tuning (Lesson 07)
- Option B: Feature Selection (Lesson 08)
- Option C: Handling Class Imbalance (Lesson 09)
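
If you pick Option A instead, one common approach is a grid search with cross-validation. The sketch below is an assumption about how that could look, not necessarily the method used in Lesson 07. It runs on small synthetic data so it finishes quickly, and the grid values are placeholders rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (NOT NSL-KDD)
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.7, 0.2, 0.1],
    random_state=42,
)

# Small illustrative grid; the values are assumptions, not recommendations
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro-F1 so minority classes count equally
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV macro-F1: {search.best_score_:.4f}")
```

On the real data you would fit the search on `X_train_encoded` and `y_train` only, then evaluate `search.best_estimator_` once on the test set, exactly as the baseline is evaluated.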

---

### OPTION C: Class Weights (Recommended for this dataset!)
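
It helps to know what `class_weight='balanced'` computes. scikit-learn documents the weight of each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to the rounded counts from Part 1, which are assumed inputs here, so the exact weights on the real data will differ slightly.

```python
# Rounded class counts from Part 1 (approximations, not exact statistics)
class_counts = {
    "benign": 67_000,
    "dos": 46_000,
    "probe": 11_000,
    "r2l": 995,
    "u2r": 52,
}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' weight: n_samples / (n_classes * class_count)
balanced_weights = {
    name: n_samples / (n_classes * count)
    for name, count in class_counts.items()
}

for name, weight in balanced_weights.items():
    print(f"{name:>7}: weight = {weight:8.3f}")
```

Each class ends up with the same total weight (count times weight), so a misclassified u2r sample costs the model far more than a misclassified benign one.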

In [None]:
# Optimized model with class_weight='balanced'
print("Training Optimized Model (with class weights)...")

optimized_model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # THE KEY OPTIMIZATION!
    random_state=42,
    n_jobs=-1
)

optimized_model.fit(X_train_encoded, y_train)
optimized_pred = optimized_model.predict(X_test_encoded)

print("\n" + "="*60)
print("OPTIMIZED MODEL RESULTS (Class Weights)")
print("="*60)
print(classification_report(y_test, optimized_pred))

In [None]:
# Save optimized metrics
optimized_f1_weighted = f1_score(y_test, optimized_pred, average='weighted')
optimized_f1_macro = f1_score(y_test, optimized_pred, average='macro')
optimized_accuracy = accuracy_score(y_test, optimized_pred)

print(f"Optimized Accuracy:   {optimized_accuracy:.4f}")
print(f"Optimized F1-weighted: {optimized_f1_weighted:.4f}")
print(f"Optimized F1-macro:    {optimized_f1_macro:.4f}")

---

## PART 5: COMPARISON

In [None]:
print("="*60)
print("BASELINE vs OPTIMIZED COMPARISON")
print("="*60)

comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'F1-weighted', 'F1-macro'],
    'Baseline': [baseline_accuracy, baseline_f1_weighted, baseline_f1_macro],
    'Optimized': [optimized_accuracy, optimized_f1_weighted, optimized_f1_macro]
})
comparison['Improvement'] = comparison['Optimized'] - comparison['Baseline']
comparison['% Change'] = (comparison['Improvement'] / comparison['Baseline'] * 100).round(2)

print(comparison.to_string(index=False))

In [None]:
# Visualize comparison
metrics = ['Accuracy', 'F1-weighted', 'F1-macro']
baseline_vals = [baseline_accuracy, baseline_f1_weighted, baseline_f1_macro]
optimized_vals = [optimized_accuracy, optimized_f1_weighted, optimized_f1_macro]

x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(x - width/2, baseline_vals, width, label='Baseline', color='lightcoral')
bars2 = ax.bar(x + width/2, optimized_vals, width, label='Optimized', color='lightgreen')

ax.set_ylabel('Score')
ax.set_title('Baseline vs Optimized Model Comparison')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()
ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()

---

## PART 6: What to Include in Your Report

### 1. Introduction
- Describe the NSL-KDD dataset
- Explain the classification task
- Mention the class imbalance problem

### 2. Methodology
- Data preprocessing steps (encoding, scaling)
- Baseline model (Random Forest with default settings)
- Optimization technique (e.g., class weights) and why you chose it

### 3. Results
- Baseline metrics (table + confusion matrix)
- Optimized metrics (table + confusion matrix)
- Comparison chart
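
The confusion matrices for the report can be produced with `confusion_matrix`, which the notebook already imports. Below is a minimal sketch on a handful of made-up labels. In the notebook you would pass `y_test` together with `baseline_pred` or `optimized_pred` instead.

```python
from sklearn.metrics import confusion_matrix

# Fix the label order so rows and columns are easy to read
labels = ["benign", "dos", "u2r"]

# Made-up labels for illustration only
y_true_demo = ["benign", "benign", "dos", "dos", "u2r", "u2r"]
y_pred_demo = ["benign", "benign", "dos", "benign", "benign", "u2r"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

print("true \\ pred".ljust(12) + "".join(name.rjust(8) for name in labels))
for name, row in zip(labels, cm):
    print(name.ljust(12) + "".join(str(value).rjust(8) for value in row))
```

Off-diagonal cells in the benign column are the ones to watch: they are attacks that the model let through as normal traffic.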

### 4. Discussion
- Did optimization improve performance?
- Which classes improved/worsened?
- What would you try next?
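
To answer the per-class question, you can compare F1 scores class by class using `f1_score` with `average=None`. The labels and predictions below are invented for illustration. In the notebook, substitute `y_test`, `baseline_pred` and `optimized_pred`.

```python
from sklearn.metrics import f1_score

labels = ["benign", "dos", "u2r"]

# Made-up labels and predictions for illustration only
y_true_demo = ["benign"] * 4 + ["dos"] * 2 + ["u2r"] * 2
baseline_demo = ["benign"] * 4 + ["dos"] * 2 + ["benign"] * 2
optimized_demo = ["benign"] * 2 + ["u2r"] * 2 + ["dos"] * 2 + ["u2r"] * 2

# average=None returns one F1 score per class, in the order given by labels
f1_base = f1_score(y_true_demo, baseline_demo, labels=labels,
                   average=None, zero_division=0)
f1_opt = f1_score(y_true_demo, optimized_demo, labels=labels,
                  average=None, zero_division=0)

for name, base, opt in zip(labels, f1_base, f1_opt):
    print(f"{name:>7}: baseline={base:.3f}  optimized={opt:.3f}  change={opt - base:+.3f}")
```

In this toy example the rare class improves while the majority class gets slightly worse, which is the kind of trade-off worth discussing in your report.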

### 5. Conclusion
- Summary of findings
- Best model recommendation

---

## Good luck with your assignment!