# Lesson 09: Handling Imbalanced Data

**What you'll learn:**
- What class imbalance is and why it's a problem
- Oversampling (duplicate minority-class samples)
- Undersampling (reduce the majority class)
- Class weights (tell the model to pay more attention to rare classes)

**This is ONE of the optimization techniques for your assignment!**

**Your NSL-KDD dataset is HIGHLY imbalanced!**

---

## Section 1: Understanding Imbalance

### READ

**Class imbalance**: Some classes have many more samples than others.

**Your assignment (NSL-KDD):**
- benign: ~67,000 samples
- u2r: only 52 samples!

**The problem:** A model can "cheat" by always predicting the majority class.
- It gets high accuracy but never catches rare attacks!
- That makes it useless for security.
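
To make the "cheating" concrete, here is a minimal sketch using only the two class counts quoted above (67,000 is the rounded figure, and the other NSL-KDD classes are left out for simplicity). The "model" is hypothetical: it ignores its input and always answers `benign`.

```python
# Toy two-class version of the counts quoted above (67,000 is approximate).
y_true = ["benign"] * 67_000 + ["u2r"] * 52

# A "model" that ignores its input and always predicts the majority class.
y_pred = ["benign"] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for u2r: of all real u2r samples, how many did we catch?
u2r_total = sum(t == "u2r" for t in y_true)
u2r_caught = sum(t == "u2r" and p == "u2r" for t, p in zip(y_true, y_pred))
u2r_recall = u2r_caught / u2r_total

print(f"Accuracy:   {accuracy:.4f}")
print(f"u2r recall: {u2r_recall:.4f}")
```

Accuracy comes out above 99.9% while recall for `u2r` is zero. This is why the rest of the lesson looks at per-class recall and F1 instead of accuracy alone.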

### TRY IT

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score

# Load data
df = pd.read_csv('../datasets/tomatjus.csv')
X = df.drop('quality', axis=1)
y = df['quality']

print("Class Distribution:")
print(y.value_counts())
print(f"\nImbalance ratio: {y.value_counts().max() / y.value_counts().min():.1f}x")

In [None]:
# Visualize
y.value_counts().plot(kind='bar', color=['skyblue', 'lightgreen', 'salmon'])
plt.title('Class Distribution (Imbalanced)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

In [None]:
# Split data (stratify=y keeps the class proportions the same in train and test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set class distribution:")
print(y_train.value_counts())

---

## Section 2: Baseline (No Handling)

In [None]:
# Train without handling imbalance
rf_baseline = RandomForestClassifier(random_state=42)
rf_baseline.fit(X_train, y_train)
pred_baseline = rf_baseline.predict(X_test)

print("BASELINE (no imbalance handling):")
print("="*50)
print(classification_report(y_test, pred_baseline))

---

## Section 3: Class Weights (Easiest Method!)

### READ

**Class weights** tell the model to pay MORE attention to minority classes.

Instead of changing the data, we change how the model learns.
The model gets penalized MORE for mistakes on minority classes.

### TRY IT

In [None]:
# Train with class_weight='balanced'
rf_balanced = RandomForestClassifier(
    class_weight='balanced',  # This is the key!
    random_state=42
)
rf_balanced.fit(X_train, y_train)
pred_balanced = rf_balanced.predict(X_test)

print("WITH CLASS WEIGHTS:")
print("="*50)
print(classification_report(y_test, pred_balanced))

### EXPLAIN

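`class_weight='balanced'` automatically calculates one weight per class, inversely proportional to how often that class appears in the training labels:
- Minority classes get HIGHER weights
- The model pays more attention to them
- Compare recall for the minority classes with the baseline report - it often improves, but check your own numbers!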
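
The "balanced" heuristic in scikit-learn is `n_samples / (n_classes * class_count)`. The sketch below works it out by hand on a hypothetical 90 / 8 / 2 label split (the class names and counts are made up for illustration, not taken from the lesson's dataset).

```python
from collections import Counter

# Hypothetical toy labels (not the lesson's dataset): 90 / 8 / 2 split.
labels = ["low"] * 90 + ["medium"] * 8 + ["high"] * 2

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Balanced heuristic: weight = n_samples / (n_classes * class_count)
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

for cls, w in weights.items():
    print(f"{cls:7s} count={counts[cls]:3d} weight={w:.2f}")
```

The rarest class gets the largest weight, and `weight * count` is the same for every class, so each class contributes equally to the training loss overall. To check the numbers on your own `y_train`, `sklearn.utils.class_weight.compute_class_weight('balanced', classes=..., y=...)` should give the same values.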

---

## Section 4: Oversampling (Duplicate Minority)

In [None]:
# Combine features and target for oversampling
# (training set only - the test set must keep its real distribution)
train_data = X_train.copy()
train_data['quality'] = y_train.values

# Get the target count (size of largest class)
target_count = y_train.value_counts().max()

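# Oversample each minority class to match the majority
oversampled_list = []
for class_name in y_train.unique():
    class_data = train_data[train_data['quality'] == class_name]
    if len(class_data) < target_count:
        # Draw with replacement so minority rows get duplicated
        class_data = class_data.sample(target_count, replace=True, random_state=42)
    # The majority class is kept as-is (resampling it with replacement would drop some rows)
    oversampled_list.append(class_data)

oversampled_data = pd.concat(oversampled_list)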

print("After Oversampling:")
print(oversampled_data['quality'].value_counts())

In [None]:
# Train on oversampled data
X_over = oversampled_data.drop('quality', axis=1)
y_over = oversampled_data['quality']

rf_over = RandomForestClassifier(random_state=42)
rf_over.fit(X_over, y_over)
pred_over = rf_over.predict(X_test)

print("WITH OVERSAMPLING:")
print("="*50)
print(classification_report(y_test, pred_over))

---

## Section 5: Comparing All Methods

In [None]:
print("="*50)
print("COMPARISON (F1-score weighted)")
print("="*50)

results = {
    'Baseline': f1_score(y_test, pred_baseline, average='weighted'),
    'Class Weights': f1_score(y_test, pred_balanced, average='weighted'),
    'Oversampling': f1_score(y_test, pred_over, average='weighted')
}

for name, score in results.items():
    print(f"{name:15s}: {score:.3f}")

# Note: weighted F1 is dominated by the majority class, so also check macro F1
best = max(results, key=results.get)
print(f"\nBest method (by weighted F1): {best}")

In [None]:
# Also compare F1-macro (treats all classes equally)
print("\nF1-score (macro) - treats all classes equally:")
print(f"Baseline:      {f1_score(y_test, pred_baseline, average='macro'):.3f}")
print(f"Class Weights: {f1_score(y_test, pred_balanced, average='macro'):.3f}")
print(f"Oversampling:  {f1_score(y_test, pred_over, average='macro'):.3f}")
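
To see why the two averages can disagree, here is a hand-computed sketch on hypothetical toy labels (95 `normal`, 5 `attack`, invented for illustration). It follows the usual definitions: macro is the plain mean of the per-class F1 scores, and weighted is the mean weighted by each class's number of true samples.

```python
from collections import Counter

# Hypothetical toy labels: 95 majority samples, 5 minority samples.
y_true = ["normal"] * 95 + ["attack"] * 5
# The model gets every "normal" right but finds only 1 of the 5 attacks.
y_pred = ["normal"] * 95 + ["attack"] + ["normal"] * 4


def f1_for_class(cls, y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


support = Counter(y_true)  # number of true samples per class
per_class = {cls: f1_for_class(cls, y_true, y_pred) for cls in support}

# Macro: plain mean, every class counts the same.
f1_macro = sum(per_class.values()) / len(per_class)
# Weighted: mean weighted by how many true samples each class has.
f1_weighted = sum(per_class[c] * support[c] for c in support) / len(y_true)

print(f"Per-class F1: {per_class}")
print(f"Macro F1:     {f1_macro:.3f}")
print(f"Weighted F1:  {f1_weighted:.3f}")
```

The weighted score stays high because the majority class dominates it, while the macro score drops sharply because the poorly detected minority class counts just as much. For imbalanced problems like intrusion detection, macro F1 is usually the more honest number to report.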

---

## Quick Reference

```python
# Method 1: Class Weights (EASIEST)
model = RandomForestClassifier(class_weight='balanced')

# Method 2: Oversampling
# Duplicate minority class samples to match majority

# Method 3: Undersampling
# Reduce majority class to match minority (loses data)
```
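
Undersampling is listed above but not demonstrated in this lesson. The sketch below mirrors the oversampling loop from Section 4, with `toy_train` as a hypothetical stand-in for `train_data` (its column names and class counts are made up for illustration).

```python
import pandas as pd

# Hypothetical toy training frame standing in for train_data.
toy_train = pd.DataFrame({
    "feature": range(100),
    "quality": ["low"] * 80 + ["medium"] * 15 + ["high"] * 5,
})

# Target count is the size of the SMALLEST class this time.
target_count = int(toy_train["quality"].value_counts().min())

undersampled_list = []
for class_name in toy_train["quality"].unique():
    class_data = toy_train[toy_train["quality"] == class_name]
    # Sample WITHOUT replacement: rows are discarded, never duplicated.
    undersampled_list.append(
        class_data.sample(target_count, replace=False, random_state=42)
    )

undersampled_data = pd.concat(undersampled_list)
print(undersampled_data["quality"].value_counts())
```

Every class ends up with as many rows as the rarest one, so most of the majority data is thrown away. With a class as small as 52 samples, that leaves very little to train on, which is one reason class weights or oversampling are often tried first.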

**For your assignment:** Start with `class_weight='balanced'` - it's the easiest!

---

## Next Lesson

In **Lesson 10: Model Comparison**, you'll learn:
- How to compare multiple models fairly
- Cross-validation for reliable comparison
- Visualizing model comparison