# Imbalanced Classification: SMOTE and Class Weights

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain what class imbalance is and why it causes problems
2. Demonstrate why accuracy fails for imbalanced datasets
3. Apply `class_weight='balanced'` to handle imbalance in logistic regression
4. Implement random undersampling and oversampling
5. Understand and apply SMOTE (Synthetic Minority Oversampling Technique)
6. Use stratified cross-validation and appropriate metrics for evaluation
7. Avoid data leakage when resampling

## Prerequisites

- Logistic regression (Notebook 01)
- Classification metrics: precision, recall, F1, PR-AUC (Notebook 02)
- Threshold tuning concepts (Notebook 03)
- Python, NumPy, sklearn fundamentals

## Table of Contents

1. [What is Class Imbalance?](#1)
2. [Why Accuracy Fails](#2)
3. [Strategy 1: Class Weights](#3)
4. [Strategy 2: Random Undersampling](#4)
5. [Strategy 3: Random Oversampling](#5)
6. [Strategy 4: SMOTE](#6)
7. [Comparing All Strategies](#7)
8. [Evaluation Best Practices](#8)
9. [Common Mistakes](#9)
10. [Exercise](#10)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, f1_score, precision_recall_curve,
    average_precision_score, confusion_matrix, roc_auc_score
)

np.random.seed(42)
sns.set_style("whitegrid")
%matplotlib inline

<a id='1'></a>
## 1. What is Class Imbalance?

**Class imbalance** occurs when one class significantly outnumbers the other(s). This is common in real-world problems:

| Domain | Positive Class | Typical Prevalence |
|--------|---------------|--------------------|
| Fraud detection | Fraudulent transaction | 0.1% - 1% |
| Medical diagnosis | Disease positive | 1% - 5% |
| Spam detection | Spam email | 10% - 30% |
| Churn prediction | Customer churns | 5% - 15% |
| Click-through rate | User clicks ad | 0.5% - 3% |

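**Why is it a problem?** Most classifiers minimize an average loss over all training samples, so the majority class dominates the objective. At the default 0.5 decision threshold, the resulting model tends to predict the majority class for most inputs.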

In [None]:
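# Create an imbalanced dataset with a nominal 10:1 ratio
X, y = make_classification(
    n_samples=2200, n_features=10, n_informative=5, n_redundant=2,
    weights=[0.909, 0.091],  # nominal ~10:1 ratio before label noise
    # flip_y randomly reassigns 5% of labels, so the realized ratio
    # printed below is somewhat less extreme than 10:1
    flip_y=0.05, random_state=42
)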

print(f"Total samples: {len(y)}")
print(f"Class 0 (majority): {np.sum(y == 0)} ({np.mean(y == 0):.1%})")
print(f"Class 1 (minority): {np.sum(y == 1)} ({np.mean(y == 1):.1%})")
print(f"Imbalance ratio: {np.sum(y == 0) / np.sum(y == 1):.1f}:1")

# Split data BEFORE any resampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(f"\nTrain: {len(y_train)} samples (class 1: {np.sum(y_train == 1)})")
print(f"Test:  {len(y_test)} samples (class 1: {np.sum(y_test == 1)})")

In [None]:
# Visualize the class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

counts = np.bincount(y)
axes[0].bar(["Class 0 (Majority)", "Class 1 (Minority)"], counts,
            color=["steelblue", "coral"])
axes[0].set_ylabel("Count")
axes[0].set_title("Class Distribution")
for i, v in enumerate(counts):
    axes[0].text(i, v + 10, str(v), ha="center", fontweight="bold")

# Show first 2 features
axes[1].scatter(X[y == 0, 0], X[y == 0, 1], alpha=0.3, label="Class 0", s=10)
axes[1].scatter(X[y == 1, 0], X[y == 1, 1], alpha=0.7, label="Class 1", s=30,
                edgecolors="k", linewidths=0.5)
axes[1].set_xlabel("Feature 0")
axes[1].set_ylabel("Feature 1")
axes[1].set_title("Feature Space (first 2 features)")
axes[1].legend()

plt.tight_layout()
plt.show()

<a id='2'></a>
## 2. Why Accuracy Fails

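With a nominal 10:1 imbalance ratio, a model that always predicts class 0 is right about 10 times out of 11 (~91% accuracy), yet it never identifies a single minority sample. The exact figure for this dataset depends on the realized class ratio printed above.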

In [None]:
# Baseline: always predict majority class
y_pred_naive = np.zeros_like(y_test)
naive_accuracy = np.mean(y_pred_naive == y_test)

print("=== Naive Model (always predict class 0) ===")
print(f"Accuracy: {naive_accuracy:.4f}  (looks decent!)")
print(f"F1 (minority): {f1_score(y_test, y_pred_naive):.4f}  (zero -- useless!)")
print()
print(classification_report(y_test, y_pred_naive, zero_division=0))

# Unweighted logistic regression
model_default = LogisticRegression(max_iter=300, random_state=42)
model_default.fit(X_train_s, y_train)
y_pred_default = model_default.predict(X_test_s)

print("\n=== Default Logistic Regression (no class weights) ===")
print(classification_report(y_test, y_pred_default))
print(f"PR-AUC: {average_precision_score(y_test, model_default.predict_proba(X_test_s)[:, 1]):.4f}")

<a id='3'></a>
## 3. Strategy 1: Class Weights

Setting `class_weight='balanced'` adjusts the loss function to penalize misclassifications of the minority class more heavily.

The weight for class $c$ is computed as:

$$w_c = \frac{n}{k \cdot n_c}$$

where $n$ = total samples, $k$ = number of classes, $n_c$ = samples in class $c$.

This is usually the **simplest approach to try first**: it needs no extra library and leaves the training data unchanged.
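
As a quick check of the formula, the sketch below compares manually computed weights with scikit-learn's `compute_class_weight` helper. The label vector `y_toy` is made up for illustration and is not the notebook's dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 90 majority (0) and 10 minority (1) samples.
y_toy = np.array([0] * 90 + [1] * 10)

classes = np.unique(y_toy)
n, k = len(y_toy), len(classes)

# Manual weights from w_c = n / (k * n_c)
manual = {int(c): n / (k * np.sum(y_toy == c)) for c in classes}

# scikit-learn's helper for the same 'balanced' heuristic
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print("manual :", manual)
print("sklearn:", dict(zip(classes.tolist(), sk_weights.tolist())))
```

With these counts the minority weight is 100 / (2 * 10) = 5.0, nine times the majority weight of 100 / (2 * 90) ≈ 0.556, which mirrors the 9:1 class ratio.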

In [None]:
# Logistic regression with balanced class weights
model_balanced = LogisticRegression(
    class_weight="balanced", max_iter=300, random_state=42
)
model_balanced.fit(X_train_s, y_train)
y_pred_balanced = model_balanced.predict(X_test_s)

print("=== Logistic Regression with class_weight='balanced' ===")
print(classification_report(y_test, y_pred_balanced))
print(f"PR-AUC: {average_precision_score(y_test, model_balanced.predict_proba(X_test_s)[:, 1]):.4f}")

# Show the effective class weights
n = len(y_train)
k = 2
for c in [0, 1]:
    nc = np.sum(y_train == c)
    w = n / (k * nc)
    print(f"\nClass {c}: n_c={nc}, weight={w:.4f}")

<a id='4'></a>
## 4. Strategy 2: Random Undersampling

Randomly remove samples from the majority class until both classes are balanced.

**Pros**: Fast, reduces training time.

**Cons**: Discards potentially useful majority-class data.

In [None]:
def random_undersample(X, y, random_state=42):
    """Undersample majority class to match minority class size."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    min_count = counts.min()

    indices = []
    for c in classes:
        c_indices = np.where(y == c)[0]
        sampled = rng.choice(c_indices, size=min_count, replace=False)
        indices.extend(sampled)

    indices = np.array(indices)
    rng.shuffle(indices)
    return X[indices], y[indices]

X_under, y_under = random_undersample(X_train_s, y_train)
print(f"After undersampling: {len(y_under)} samples")
print(f"Class distribution: {np.bincount(y_under)}")

model_under = LogisticRegression(max_iter=300, random_state=42)
model_under.fit(X_under, y_under)
y_pred_under = model_under.predict(X_test_s)

print("\n=== Logistic Regression with Undersampling ===")
print(classification_report(y_test, y_pred_under))
print(f"PR-AUC: {average_precision_score(y_test, model_under.predict_proba(X_test_s)[:, 1]):.4f}")

<a id='5'></a>
## 5. Strategy 3: Random Oversampling

Randomly duplicate samples from the minority class until both classes are balanced.

**Pros**: No information loss from majority class.

**Cons**: Exact duplicates can cause overfitting. Increases training time.
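
The duplication concern can be made concrete with a small sketch. It uses `sklearn.utils.resample` on made-up toy data (`X_toy`, `y_toy` are illustrative names) and counts how many distinct minority rows remain after balancing. Note that, unlike a version that keeps every original row and only adds extras, `resample` redraws the whole minority set with replacement.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
# Hypothetical toy data: 95 majority rows, 5 minority rows, 3 features.
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 95 + [1] * 5)

X_min, X_maj = X_toy[y_toy == 1], X_toy[y_toy == 0]

# Draw minority rows with replacement until they match the majority count.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])

n_unique_minority = len(np.unique(X_min_up, axis=0))
print("Class counts after oversampling:", np.bincount(y_bal))
print("Distinct minority rows:", n_unique_minority, "of", len(X_min_up))
```

The classes are balanced by count, but the minority side still contains at most five distinct points, so the model sees no new information about the minority class.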

In [None]:
def random_oversample(X, y, random_state=42):
    """Oversample minority class to match majority class size."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    max_count = counts.max()

    X_list, y_list = [], []
    for c in classes:
        c_indices = np.where(y == c)[0]
        if len(c_indices) < max_count:
            # Oversample: sample with replacement
            extra = rng.choice(c_indices, size=max_count - len(c_indices), replace=True)
            all_indices = np.concatenate([c_indices, extra])
        else:
            all_indices = c_indices
        X_list.append(X[all_indices])
        y_list.append(y[all_indices])

    X_res = np.vstack(X_list)
    y_res = np.concatenate(y_list)

    # Shuffle
    shuffle_idx = rng.permutation(len(y_res))
    return X_res[shuffle_idx], y_res[shuffle_idx]

X_over, y_over = random_oversample(X_train_s, y_train)
print(f"After oversampling: {len(y_over)} samples")
print(f"Class distribution: {np.bincount(y_over)}")

model_over = LogisticRegression(max_iter=300, random_state=42)
model_over.fit(X_over, y_over)
y_pred_over = model_over.predict(X_test_s)

print("\n=== Logistic Regression with Oversampling ===")
print(classification_report(y_test, y_pred_over))
print(f"PR-AUC: {average_precision_score(y_test, model_over.predict_proba(X_test_s)[:, 1]):.4f}")

<a id='6'></a>
## 6. Strategy 4: SMOTE

**SMOTE (Synthetic Minority Oversampling Technique)** creates new synthetic samples instead of duplicating existing ones:

1. For each minority sample, find its k nearest neighbors (in the minority class).
2. Randomly pick one neighbor.
3. Create a new sample at a random point on the line segment between the original and the neighbor.

This produces more diverse synthetic samples than random duplication.
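
The three steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imbalanced-learn` implementation: the function name `smote_sketch`, its parameters, and the toy points are assumptions made for this example, and base samples are drawn at random rather than visited one by one.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, random_state=0):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.RandomState(random_state)
    # Calling kneighbors() without a query returns, for each fitted point,
    # its k nearest neighbors excluding the point itself.
    nn = NearestNeighbors(n_neighbors=k).fit(X_min)
    neighbor_idx = nn.kneighbors(return_distance=False)  # shape (n_min, k)

    base = rng.randint(0, len(X_min), size=n_new)   # step 1: pick a minority sample
    pick = rng.randint(0, k, size=n_new)            # step 2: pick one of its k neighbors
    neighbor = neighbor_idx[base, pick]
    gap = rng.uniform(0, 1, size=(n_new, 1))        # step 3: position on the segment
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])

rng = np.random.RandomState(42)
X_min_toy = rng.normal(size=(20, 2))  # hypothetical minority-class points
X_new = smote_sketch(X_min_toy, n_new=50)
print(X_new.shape)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already covered by the minority class.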

The cells below use the SMOTE implementation from the `imbalanced-learn` library (imported as `imblearn`). If the import fails, they fall back to duplicating minority samples with small Gaussian noise, which is only a rough stand-in for SMOTE.

In [None]:
# Try to use SMOTE from imblearn
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    IMBLEARN_AVAILABLE = True
    print("imblearn is available. Using SMOTE.")
except ImportError:
    IMBLEARN_AVAILABLE = False
    print("imblearn not installed. Will use manual oversampling as fallback.")
    print("To install: pip install imbalanced-learn")

In [None]:
if IMBLEARN_AVAILABLE:
    # SMOTE with imblearn
    smote = SMOTE(random_state=42)
    X_smote, y_smote = smote.fit_resample(X_train_s, y_train)

    print(f"After SMOTE: {len(y_smote)} samples")
    print(f"Class distribution: {np.bincount(y_smote)}")

    model_smote = LogisticRegression(max_iter=300, random_state=42)
    model_smote.fit(X_smote, y_smote)
    y_pred_smote = model_smote.predict(X_test_s)

    print("\n=== Logistic Regression with SMOTE ===")
    print(classification_report(y_test, y_pred_smote))
    print(f"PR-AUC: {average_precision_score(y_test, model_smote.predict_proba(X_test_s)[:, 1]):.4f}")
else:
    # Fallback: manual oversampling with noise (poor man's SMOTE)
    print("Using manual oversampling with added noise as SMOTE substitute.")
    rng = np.random.RandomState(42)
    minority_idx = np.where(y_train == 1)[0]
    majority_count = np.sum(y_train == 0)
    minority_count = len(minority_idx)
    n_synthetic = majority_count - minority_count

    # Duplicate with small Gaussian noise
    base_indices = rng.choice(minority_idx, size=n_synthetic, replace=True)
    X_synthetic = X_train_s[base_indices] + rng.normal(0, 0.1, size=(n_synthetic, X_train_s.shape[1]))
    y_synthetic = np.ones(n_synthetic, dtype=int)

    X_smote = np.vstack([X_train_s, X_synthetic])
    y_smote = np.concatenate([y_train, y_synthetic])

    print(f"After manual oversampling with noise: {len(y_smote)} samples")
    print(f"Class distribution: {np.bincount(y_smote)}")

    model_smote = LogisticRegression(max_iter=300, random_state=42)
    model_smote.fit(X_smote, y_smote)
    y_pred_smote = model_smote.predict(X_test_s)

    print("\n=== Logistic Regression with Manual Oversampling + Noise ===")
    print(classification_report(y_test, y_pred_smote))
    print(f"PR-AUC: {average_precision_score(y_test, model_smote.predict_proba(X_test_s)[:, 1]):.4f}")

In [None]:
# If imblearn is available, show the Pipeline approach (best practice)
if IMBLEARN_AVAILABLE:
    # Scaling and SMOTE both live inside the pipeline, so each is fit
    # only on the training part of every CV fold.
    pipeline = ImbPipeline([
        ("scaler", StandardScaler()),
        ("smote", SMOTE(random_state=42)),
        ("classifier", LogisticRegression(max_iter=300, random_state=42)),
    ])

    # Use stratified cross-validation on the unscaled training data
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="f1")

    print("SMOTE + LogisticRegression Pipeline (5-fold stratified CV):")
    print(f"  F1 scores: {scores.round(4)}")
    print(f"  Mean F1:   {scores.mean():.4f} +/- {scores.std():.4f}")
    print("\nNote: SMOTE is applied inside each CV fold, avoiding data leakage.")
else:
    print("Skipping Pipeline demo (imblearn not available).")
    print("The key idea: always resample INSIDE the cross-validation loop.")

<a id='7'></a>
## 7. Comparing All Strategies

In [None]:
# Collect results for comparison
from sklearn.metrics import precision_score, recall_score

models = {
    "Default (no weights)": model_default,
    "Class weight=balanced": model_balanced,
    "Undersampling": model_under,
    "Oversampling": model_over,
    "SMOTE / Manual+Noise": model_smote,
}

results = []
for name, m in models.items():
    yp = m.predict(X_test_s)
    yprob = m.predict_proba(X_test_s)[:, 1]
    results.append({
        "Method": name,
        "Accuracy": np.mean(yp == y_test),
        "F1 (minority)": f1_score(y_test, yp),
        "Precision (minority)": precision_score(y_test, yp),
        "Recall (minority)": recall_score(y_test, yp),
        "PR-AUC": average_precision_score(y_test, yprob),
    })

df_results = pd.DataFrame(results)
print(df_results.to_string(index=False, float_format="{:.4f}".format))

In [None]:
# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Bar chart of key metrics
x = np.arange(len(df_results))
width = 0.35

axes[0].bar(x - width/2, df_results["F1 (minority)"], width, label="F1 (minority)", color="steelblue")
axes[0].bar(x + width/2, df_results["PR-AUC"], width, label="PR-AUC", color="coral")
axes[0].set_xticks(x)
axes[0].set_xticklabels(df_results["Method"], rotation=30, ha="right", fontsize=9)
axes[0].set_ylabel("Score")
axes[0].set_title("F1 and PR-AUC Comparison")
axes[0].legend()
axes[0].set_ylim(0, 1.1)

# PR curves for all models
for name, m in models.items():
    yprob = m.predict_proba(X_test_s)[:, 1]
    prec, rec, _ = precision_recall_curve(y_test, yprob)
    ap = average_precision_score(y_test, yprob)
    axes[1].plot(rec, prec, linewidth=2, label=f"{name} (AP={ap:.3f})")

axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("Precision-Recall Curves")
axes[1].legend(fontsize=8, loc="lower left")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<a id='8'></a>
## 8. Evaluation Best Practices

When working with imbalanced data:

1. **Always use stratified cross-validation** (`StratifiedKFold`) so that each fold has approximately the same class ratio as the full training set.
2. **Focus on F1 or PR-AUC**, not accuracy.
3. **Report the full classification report** so both classes are visible.
4. **Use stratified train/test splits** (`stratify=y` in `train_test_split`).
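
To illustrate point 1, the sketch below counts minority samples per validation fold for plain `KFold` and for `StratifiedKFold`. The labels `y_toy` and the helper `minority_per_fold` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 190 majority and 10 minority samples.
y_toy = np.array([0] * 190 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

def minority_per_fold(splitter):
    """Count minority samples in each validation fold."""
    return [int(y_toy[val_idx].sum()) for _, val_idx in splitter.split(X_toy, y_toy)]

plain = minority_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = minority_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold minority counts:          ", plain)
print("StratifiedKFold minority counts:", strat)
```

`StratifiedKFold` places two minority samples in every fold here, while plain `KFold` distributes them by chance, which makes per-fold F1 noisier and can leave a fold with very few positives.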

In [None]:
# Demonstrate stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, m_class in [("Default", LogisticRegression(max_iter=300, random_state=42)),
                       ("Balanced", LogisticRegression(class_weight="balanced", max_iter=300, random_state=42))]:
    f1_scores_cv = cross_val_score(m_class, X_train_s, y_train, cv=cv, scoring="f1")
    print(f"{name:12s} - 5-fold F1: {f1_scores_cv.mean():.4f} +/- {f1_scores_cv.std():.4f}  {f1_scores_cv.round(3)}")

print("\nStratified CV ensures each fold preserves the class ratio.")

<a id='9'></a>
## 9. Common Mistakes

1. **Oversampling before splitting (DATA LEAKAGE!)**: If you oversample or apply SMOTE to the entire dataset before the train/test split, duplicates (or synthetic neighbors) of samples that end up in the test set will also appear in the training set. This leads to over-optimistic results. **Always split first, then resample only the training set.**

2. **Using accuracy for imbalanced data**: 99% accuracy on a 99:1 dataset is exactly what a model that always predicts the majority class achieves. Always use F1, PR-AUC, or business-specific metrics.

3. **Not using stratified splits**: Without stratification, a random split may put very few minority samples in the test set, giving unreliable metric estimates.

4. **Applying SMOTE outside the CV loop**: When doing cross-validation, SMOTE must be applied inside each fold (use `imblearn.pipeline.Pipeline`). Otherwise you leak information across folds.

5. **Ignoring the simplest solution**: `class_weight='balanced'` often works just as well as resampling with much less complexity. Try it first.
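
Mistake 1 can be made visible by counting test rows that also occur verbatim in the training set. The following is a minimal sketch on made-up toy data; `X_toy`, `y_toy`, and the helper `count_leaked_rows` are illustrative names, not part of the notebook's workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical toy data: 180 majority rows, 20 minority rows.
X_toy = rng.normal(size=(200, 4))
y_toy = np.array([0] * 180 + [1] * 20)

def count_leaked_rows(X_train, X_test):
    """Number of test rows that also appear verbatim in the training set."""
    train_rows = {row.tobytes() for row in X_train}
    return sum(row.tobytes() in train_rows for row in X_test)

# WRONG: duplicate minority rows first, then split.
min_idx = np.where(y_toy == 1)[0]
extra = rng.choice(min_idx, size=160, replace=True)
X_dup = np.vstack([X_toy, X_toy[extra]])
y_dup = np.concatenate([y_toy, y_toy[extra]])
X_tr_leaky, X_te_leaky, _, _ = train_test_split(
    X_dup, y_dup, test_size=0.3, random_state=0
)

# RIGHT: split first; any resampling would then touch only the training part.
X_tr_clean, X_te_clean, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

leaked_bad = count_leaked_rows(X_tr_leaky, X_te_leaky)
leaked_ok = count_leaked_rows(X_tr_clean, X_te_clean)
print("Test rows also in train (oversample, then split):", leaked_bad)
print("Test rows also in train (split first):           ", leaked_ok)
```

When the split comes first, no test row has a copy in the training set, so the test score measures performance on data the model has not seen.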

In [None]:
# Demonstrate the data leakage problem
print("=== Data Leakage Demo ===")
print("\nWRONG approach: oversample entire dataset, then split")
X_over_all, y_over_all = random_oversample(X, y, random_state=42)
X_tr_bad, X_te_bad, y_tr_bad, y_te_bad = train_test_split(
    X_over_all, y_over_all, test_size=0.3, random_state=42
)
sc_bad = StandardScaler()
X_tr_bad_s = sc_bad.fit_transform(X_tr_bad)
X_te_bad_s = sc_bad.transform(X_te_bad)
m_bad = LogisticRegression(max_iter=300, random_state=42).fit(X_tr_bad_s, y_tr_bad)
print(f"  F1 (looks great but LEAKED): {f1_score(y_te_bad, m_bad.predict(X_te_bad_s)):.4f}")

print("\nCORRECT approach: split first, then oversample only training data")
print(f"  F1 (honest): {f1_score(y_test, model_over.predict(X_test_s)):.4f}")
print("\nThe leaked F1 is inflated: copies of test rows were in the training set,")
print("and the resampled test set is balanced, unlike the data seen in practice!")

<a id='10'></a>
## 10. Exercise

**Task**: Handle a severely imbalanced dataset (20:1 ratio).

1. Generate data: `make_classification(n_samples=2100, weights=[0.952, 0.048], n_features=8, n_informative=4, random_state=42)`.
2. Split 70/30 with stratification.
3. Scale features.
4. Train three models: default, `class_weight='balanced'`, and random oversampling (reuse `random_oversample` from Section 5).
5. Compare F1 (minority class) and PR-AUC for all three.
6. Which method works best?

In [None]:
# Reference solution (try the exercise yourself first)
# ----------------------------------------------------

# Step 1: Generate data
X_ex, y_ex = make_classification(
    n_samples=2100, weights=[0.952, 0.048], n_features=8,
    n_informative=4, n_redundant=2, random_state=42
)
print(f"Class distribution: {np.bincount(y_ex)}")
print(f"Ratio: {np.sum(y_ex == 0) / np.sum(y_ex == 1):.1f}:1")

# Step 2: Split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ex, y_ex, test_size=0.3, random_state=42, stratify=y_ex
)

# Step 3: Scale
sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)
X_te_s = sc.transform(X_te)

# Step 4: Three models
# (a) Default
m1 = LogisticRegression(max_iter=300, random_state=42).fit(X_tr_s, y_tr)

# (b) Balanced weights
m2 = LogisticRegression(class_weight="balanced", max_iter=300, random_state=42).fit(X_tr_s, y_tr)

# (c) Manual oversampling
X_over_ex, y_over_ex = random_oversample(X_tr_s, y_tr, random_state=42)
m3 = LogisticRegression(max_iter=300, random_state=42).fit(X_over_ex, y_over_ex)

# Step 5: Compare
print(f"\n{'Method':<25} {'F1 (minority)':>15} {'PR-AUC':>10}")
print("-" * 52)
for name, m in [("Default", m1), ("Balanced weights", m2), ("Oversampling", m3)]:
    yp = m.predict(X_te_s)
    yprob = m.predict_proba(X_te_s)[:, 1]
    f1 = f1_score(y_te, yp)
    ap = average_precision_score(y_te, yprob)
    print(f"{name:<25} {f1:>15.4f} {ap:>10.4f}")

# Step 6: Conclusion
print("\nConclusion: class_weight='balanced' is often the simplest and most")
print("effective first approach for imbalanced classification.")