# Handling Imbalanced Datasets in Machine Learning

## Introduction

An **imbalanced dataset** is one where the class distribution is far from uniform: one class (the majority class) has significantly more samples than the other class(es) (the minority class). This is one of the most common challenges in real-world machine learning applications.

### What is Class Imbalance?

Class imbalance occurs when the classes contain very different numbers of observations. For example:
- **Fraud Detection**: 99% legitimate transactions, 1% fraudulent
- **Disease Diagnosis**: 95% healthy patients, 5% diseased
- **Email Spam**: 80% legitimate emails, 20% spam
- **Manufacturing Defects**: 99.5% good products, 0.5% defective
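
A quick way to quantify imbalance is to count the labels and divide the majority count by the minority count. The sketch below uses a hypothetical label list, `labels`, that matches the fraud example above (99% legitimate, 1% fraudulent); all variable names are illustrative.

```python
from collections import Counter

# Hypothetical labels: 0 = legitimate, 1 = fraud (99% / 1%, as in the fraud example)
labels = [0] * 990 + [1] * 10

counts = Counter(labels)
# most_common() sorts classes from most to least frequent
majority_class, majority_count = counts.most_common()[0]
minority_class, minority_count = counts.most_common()[-1]

minority_share = minority_count / len(labels)
imbalance_ratio = majority_count / minority_count

print(f"Minority share: {minority_share:.1%}")      # Minority share: 1.0%
print(f"Imbalance ratio: {imbalance_ratio:.0f}:1")  # Imbalance ratio: 99:1
```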

### Why is Imbalance a Problem?

**1. Biased Model Training**
   - Most ML models minimize the average loss over all samples, so the majority class dominates training
   - The model can learn to predict the majority class almost always
   - Minority class patterns are largely ignored

**2. Poor Minority Class Performance**
   - High accuracy but poor recall for the minority class
   - Example: always predicting "No Fraud" on a 99:1 dataset gives 99% accuracy!

**3. Misleading Evaluation Metrics**
   - Accuracy becomes misleading
   - Use precision, recall, and F1-score instead
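
For reference, these metrics are defined from confusion-matrix counts, with the minority class treated as the positive class ($TP$, $FP$, and $FN$ are true positives, false positives, and false negatives):

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

None of the three uses the true-negative count, so a large number of correctly classified majority samples cannot inflate them the way it inflates accuracy.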

### Example of the Problem

```
Dataset: 1000 samples
- Class 0 (Healthy): 950 samples (95%)
- Class 1 (Disease): 50 samples (5%)

Model predicts everything as Class 0:
- Accuracy: 95% ✓ (looks great!)
- Disease Detection: 0% ✗ (completely fails at the important task!)
```
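
These numbers can be checked by hand. The following sketch rebuilds the 950/50 example in plain Python and scores a "model" that always predicts Class 0. The variable names are illustrative and no library is assumed.

```python
# Ground truth from the example: 950 healthy (0) and 50 diseased (1)
y_true = [0] * 950 + [1] * 50
# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Confusion-matrix counts, with Class 1 (disease) as the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against empty denominators (no positives predicted or present)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 95.00%
print(f"Recall:   {recall:.2%}")    # Recall:   0.00%
print(f"Missed disease cases: {fn} of {tp + fn}")  # Missed disease cases: 50 of 50
```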

### Real-World Impact

- **Healthcare**: Missing rare diseases can be fatal
- **Finance**: Undetected fraud causes millions in losses
- **Security**: Failed intrusion detection compromises systems
- **Manufacturing**: Defective products reach customers

### Learning Objectives

In this notebook, you will learn:
1. How to detect and measure class imbalance
2. Resampling techniques (Oversampling and Undersampling)
3. When to use each technique
4. Best practices for handling imbalanced data
5. Proper evaluation metrics

Let's dive in and master these techniques!

## Step 1: Import Libraries and Create Imbalanced Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, classification_report, 
                              accuracy_score, precision_score, recall_score, 
                              f1_score, roc_auc_score, roc_curve)
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Create an imbalanced dataset
# Simulating credit card fraud detection: about 97% legitimate, 3% fraud (before label noise)
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.97, 0.03],  # target proportions: 97% class 0, 3% class 1
    random_state=42,
    flip_y=0.05  # randomly reassign 5% of labels (noise); this raises the observed minority share above 3%
)

# Create DataFrame for better visualization
feature_names = [f'Feature_{i+1}' for i in range(20)]
df = pd.DataFrame(X, columns=feature_names)
df['Target'] = y

print("=" * 100)
print("IMBALANCED DATASET CREATED")
print("=" * 100)
print(f"\nDataset Shape: {df.shape}")
print(f"Number of Features: {X.shape[1]}")
print(f"Number of Samples: {X.shape[0]}")

# Analyze class distribution
class_dist = Counter(y)
print("\n" + "-" * 100)
print("CLASS DISTRIBUTION:")
print("-" * 100)
for cls, count in class_dist.items():
    percentage = (count / len(y)) * 100
    print(f"Class {cls}: {count:,} samples ({percentage:.2f}%)")

imbalance_ratio = class_dist[0] / class_dist[1]
print(f"\nImbalance Ratio (Majority:Minority): {imbalance_ratio:.2f}:1")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Bar plot of class distribution
classes = ['Legitimate (0)', 'Fraud (1)']
counts = [class_dist[0], class_dist[1]]
colors = ['lightgreen', 'lightcoral']
bars = axes[0].bar(classes, counts, color=colors, alpha=0.7, edgecolor='black', linewidth=2)
axes[0].set_ylabel('Number of Samples', fontsize=12, fontweight='bold')
axes[0].set_title('Class Distribution (Imbalanced)', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars:
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height):,}\n({height/len(y)*100:.1f}%)',
                ha='center', va='bottom', fontsize=11, fontweight='bold')

# Pie chart
axes[1].pie(counts, labels=classes, colors=colors, autopct='%1.1f%%',
            startangle=90, explode=(0, 0.1), shadow=True)
axes[1].set_title('Class Proportion', fontsize=14, fontweight='bold')

# Sample distribution across first 100 samples
sample_size = 100
sample_y = y[:sample_size]
axes[2].scatter(range(sample_size), sample_y, c=sample_y, cmap='RdYlGn_r', 
                alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
axes[2].set_xlabel('Sample Index', fontsize=12, fontweight='bold')
axes[2].set_ylabel('Class Label', fontsize=12, fontweight='bold')
axes[2].set_title('First 100 Samples (Visual Imbalance)', fontsize=14, fontweight='bold')
axes[2].set_yticks([0, 1])
axes[2].set_yticklabels(['Legitimate', 'Fraud'])
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("⚠️ WARNING: This is a highly imbalanced dataset!")
print("   Training a model on this data without handling imbalance will lead to poor results.")
print("=" * 100)

## Step 2: Baseline Model (Without Handling Imbalance)

Let's train a model on the imbalanced data to see why it's problematic.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("BASELINE MODEL (No Balancing)")
print("=" * 100)

# Train logistic regression on imbalanced data
baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train, y_train)

# Predictions
y_pred_baseline = baseline_model.predict(X_test)
y_pred_proba_baseline = baseline_model.predict_proba(X_test)[:, 1]

# Evaluation
accuracy_baseline = accuracy_score(y_test, y_pred_baseline)
precision_baseline = precision_score(y_test, y_pred_baseline)
recall_baseline = recall_score(y_test, y_pred_baseline)
f1_baseline = f1_score(y_test, y_pred_baseline)
roc_auc_baseline = roc_auc_score(y_test, y_pred_proba_baseline)

print("\nTraining Set Class Distribution:")
print(f"  Class 0: {Counter(y_train)[0]:,} samples")
print(f"  Class 1: {Counter(y_train)[1]:,} samples")

print("\n" + "-" * 100)
print("PERFORMANCE METRICS:")
print("-" * 100)
print(f"Accuracy:  {accuracy_baseline:.4f} (Looks good but misleading!)")
print(f"Precision: {precision_baseline:.4f} (Of predicted frauds, how many are correct?)")
print(f"Recall:    {recall_baseline:.4f} ⚠️ (Of actual frauds, how many did we catch?)")
print(f"F1-Score:  {f1_baseline:.4f}")
print(f"ROC-AUC:   {roc_auc_baseline:.4f}")

print("\n" + "-" * 100)
print("CLASSIFICATION REPORT:")
print("-" * 100)
print(classification_report(y_test, y_pred_baseline, 
                          target_names=['Legitimate (0)', 'Fraud (1)']))

# Confusion Matrix
cm_baseline = confusion_matrix(y_test, y_pred_baseline)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Baseline Model Performance (Imbalanced Data)', fontsize=16, fontweight='bold')

# Confusion Matrix
sns.heatmap(cm_baseline, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Pred Legitimate', 'Pred Fraud'],
            yticklabels=['True Legitimate', 'True Fraud'],
            cbar_kws={'label': 'Count'})
axes[0].set_title('Confusion Matrix', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Actual', fontsize=11, fontweight='bold')
axes[0].set_xlabel('Predicted', fontsize=11, fontweight='bold')

# Metrics comparison
metrics_names = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
metrics_values = [accuracy_baseline, precision_baseline, recall_baseline, f1_baseline, roc_auc_baseline]
colors_metric = ['green' if m > 0.7 else 'orange' if m > 0.5 else 'red' for m in metrics_values]

axes[1].bar(metrics_names, metrics_values, color=colors_metric, alpha=0.7, edgecolor='black', linewidth=2)
axes[1].axhline(y=0.5, color='red', linestyle='--', linewidth=2, label='Poor Performance')
axes[1].axhline(y=0.7, color='orange', linestyle='--', linewidth=2, label='Moderate')
axes[1].axhline(y=0.9, color='green', linestyle='--', linewidth=2, label='Good')
axes[1].set_ylabel('Score', fontsize=12, fontweight='bold')
axes[1].set_title('Performance Metrics', fontsize=12, fontweight='bold')
axes[1].set_ylim([0, 1])
axes[1].tick_params(axis='x', rotation=45)
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

# Add value labels
for i, (name, value) in enumerate(zip(metrics_names, metrics_values)):
    axes[1].text(i, value + 0.02, f'{value:.3f}', ha='center', fontsize=10, fontweight='bold')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_baseline)
axes[2].plot(fpr, tpr, color='darkorange', linewidth=2, label=f'ROC curve (AUC = {roc_auc_baseline:.3f})')
axes[2].plot([0, 1], [0, 1], color='navy', linewidth=2, linestyle='--', label='Random Classifier')
axes[2].set_xlabel('False Positive Rate', fontsize=11, fontweight='bold')
axes[2].set_ylabel('True Positive Rate', fontsize=11, fontweight='bold')
axes[2].set_title('ROC Curve', fontsize=12, fontweight='bold')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("🚨 PROBLEM IDENTIFIED:")
print(f"   • High Accuracy ({accuracy_baseline:.1%}) gives FALSE confidence")
print(f"   • Low Recall ({recall_baseline:.1%}) means we're MISSING most frauds!")
# Read the detected count straight from the confusion matrix (true positives)
print(f"   • Out of {Counter(y_test)[1]} frauds, we only detected {cm_baseline[1, 1]}")
print("   • This model is USELESS for fraud detection despite 'high accuracy'")
print("=" * 100)

## Method 1: Random Oversampling

**Random Oversampling** increases the number of minority class samples by randomly duplicating existing minority samples until classes are balanced.

### How It Works:
1. Identify minority class samples
2. Randomly select samples from minority class
3. Duplicate them until reaching desired balance
4. Combine with majority class
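
The four steps above can be sketched with NumPy alone. This is a minimal illustration on hypothetical toy data (8 majority rows, 2 minority rows), not the implementation behind imbalanced-learn's `RandomOverSampler`, which the notebook uses for the actual experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 8 majority samples (class 0) and 2 minority samples (class 1)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

# Step 1: identify minority (and majority) samples
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Steps 2-3: draw minority rows with replacement until the class counts match
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Step 4: combine the original data with the duplicated minority rows
X_res = np.vstack([X, X[extra_idx]])
y_res = np.concatenate([y, y[extra_idx]])

print(np.bincount(y_res).tolist())  # [8, 8]
```

Every added row is an exact copy of an existing minority row, which is why this method adds no new information and can encourage overfitting.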

### When to Use:
- Small datasets where you can't afford to lose data
- When minority class patterns need to be emphasized
- As a baseline before trying more sophisticated methods

### Advantages:
- ✅ Simple and easy to implement
- ✅ No information loss
- ✅ Works well with enough diverse minority samples

### Disadvantages:
- ❌ Creates exact copies (no new information)
- ❌ Can lead to overfitting
- ❌ May not generalize well to new data
- ❌ Increases dataset size and training time

In [None]:
# Random Oversampling
print("RANDOM OVERSAMPLING")
print("=" * 100)

# Apply Random Oversampling
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print("\nBefore Oversampling:")
print(f"  Class 0: {Counter(y_train)[0]:,} samples")
print(f"  Class 1: {Counter(y_train)[1]:,} samples")
print(f"  Total: {len(y_train):,} samples")

print("\nAfter Oversampling:")
print(f"  Class 0: {Counter(y_train_ros)[0]:,} samples")
print(f"  Class 1: {Counter(y_train_ros)[1]:,} samples")
print(f"  Total: {len(y_train_ros):,} samples")

print(f"\nNew minority class samples added: {Counter(y_train_ros)[1] - Counter(y_train)[1]:,}")
print(f"Dataset size increase: {((len(y_train_ros) - len(y_train)) / len(y_train) * 100):.1f}%")

# Train model on oversampled data
model_ros = LogisticRegression(random_state=42, max_iter=1000)
model_ros.fit(X_train_ros, y_train_ros)

# Predictions
y_pred_ros = model_ros.predict(X_test)
y_pred_proba_ros = model_ros.predict_proba(X_test)[:, 1]

# Evaluation
accuracy_ros = accuracy_score(y_test, y_pred_ros)
precision_ros = precision_score(y_test, y_pred_ros)
recall_ros = recall_score(y_test, y_pred_ros)
f1_ros = f1_score(y_test, y_pred_ros)
roc_auc_ros = roc_auc_score(y_test, y_pred_proba_ros)

print("\n" + "-" * 100)
print("PERFORMANCE AFTER OVERSAMPLING:")
print("-" * 100)
print(f"Accuracy:  {accuracy_ros:.4f}")
print(f"Precision: {precision_ros:.4f}")
print(f"Recall:    {recall_ros:.4f} ✅ (Much better!)")
print(f"F1-Score:  {f1_ros:.4f}")
print(f"ROC-AUC:   {roc_auc_ros:.4f}")

print("\n" + "-" * 100)
print("CLASSIFICATION REPORT:")
print("-" * 100)
print(classification_report(y_test, y_pred_ros, 
                          target_names=['Legitimate (0)', 'Fraud (1)']))

# Confusion Matrix
cm_ros = confusion_matrix(y_test, y_pred_ros)

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Random Oversampling Results', fontsize=16, fontweight='bold')

# Class distribution before and after
dist_before = [Counter(y_train)[0], Counter(y_train)[1]]
dist_after = [Counter(y_train_ros)[0], Counter(y_train_ros)[1]]
x_pos = np.arange(2)
width = 0.35

axes[0, 0].bar(x_pos - width/2, dist_before, width, label='Before', alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].bar(x_pos + width/2, dist_after, width, label='After', alpha=0.7, color='coral', edgecolor='black')
axes[0, 0].set_xticks(x_pos)
axes[0, 0].set_xticklabels(['Class 0', 'Class 1'])
axes[0, 0].set_ylabel('Number of Samples', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Training Set: Before vs After Oversampling', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(axis='y', alpha=0.3)

# Confusion Matrix
sns.heatmap(cm_ros, annot=True, fmt='d', cmap='Greens', ax=axes[0, 1],
            xticklabels=['Pred Legitimate', 'Pred Fraud'],
            yticklabels=['True Legitimate', 'True Fraud'])
axes[0, 1].set_title('Confusion Matrix (Oversampled Model)', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Actual', fontsize=11, fontweight='bold')
axes[0, 1].set_xlabel('Predicted', fontsize=11, fontweight='bold')

# Metrics comparison: Baseline vs Oversampled
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
baseline_scores = [accuracy_baseline, precision_baseline, recall_baseline, f1_baseline, roc_auc_baseline]
ros_scores = [accuracy_ros, precision_ros, recall_ros, f1_ros, roc_auc_ros]

x_pos = np.arange(len(metrics))
width = 0.35

axes[0, 2].bar(x_pos - width/2, baseline_scores, width, label='Baseline', alpha=0.7, color='lightcoral')
axes[0, 2].bar(x_pos + width/2, ros_scores, width, label='Oversampled', alpha=0.7, color='lightgreen')
axes[0, 2].set_xticks(x_pos)
axes[0, 2].set_xticklabels(metrics, rotation=45, ha='right')
axes[0, 2].set_ylabel('Score', fontsize=11, fontweight='bold')
axes[0, 2].set_title('Performance Comparison', fontsize=12, fontweight='bold')
axes[0, 2].set_ylim([0, 1])
axes[0, 2].legend()
axes[0, 2].grid(axis='y', alpha=0.3)

# ROC Curves comparison
fpr_ros, tpr_ros, _ = roc_curve(y_test, y_pred_proba_ros)
axes[1, 0].plot(fpr, tpr, color='red', linewidth=2, label=f'Baseline (AUC={roc_auc_baseline:.3f})')
axes[1, 0].plot(fpr_ros, tpr_ros, color='green', linewidth=2, label=f'Oversampled (AUC={roc_auc_ros:.3f})')
axes[1, 0].plot([0, 1], [0, 1], color='navy', linewidth=2, linestyle='--', label='Random')
axes[1, 0].set_xlabel('False Positive Rate', fontsize=11, fontweight='bold')
axes[1, 0].set_ylabel('True Positive Rate', fontsize=11, fontweight='bold')
axes[1, 0].set_title('ROC Curve Comparison', fontsize=12, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# Relative improvement over the baseline, in percent
def pct_change(new, old):
    # Return NaN when the baseline score is 0 to avoid dividing by zero
    return (new - old) / old * 100 if old > 0 else float('nan')

recall_improvement = pct_change(recall_ros, recall_baseline)
f1_improvement = pct_change(f1_ros, f1_baseline)

improvements = {
    'Recall': recall_improvement,
    'F1-Score': f1_improvement,
    'Precision': pct_change(precision_ros, precision_baseline)
}

axes[1, 1].barh(list(improvements.keys()), list(improvements.values()), 
                color=['green' if v > 0 else 'red' for v in improvements.values()],
                alpha=0.7, edgecolor='black')
axes[1, 1].axvline(x=0, color='black', linewidth=2)
axes[1, 1].set_xlabel('Improvement (%)', fontsize=11, fontweight='bold')
axes[1, 1].set_title('Metric Improvements vs Baseline', fontsize=12, fontweight='bold')
axes[1, 1].grid(axis='x', alpha=0.3)

# Detected frauds comparison
# True positives read directly from the confusion matrices (avoids float truncation)
frauds_detected_baseline = int(cm_baseline[1, 1])
frauds_detected_ros = int(cm_ros[1, 1])
total_frauds = Counter(y_test)[1]

detection_data = [frauds_detected_baseline, frauds_detected_ros, total_frauds]
labels = ['Baseline\nDetected', 'Oversampled\nDetected', 'Total\nFrauds']
colors_det = ['red', 'green', 'blue']

axes[1, 2].bar(labels, detection_data, color=colors_det, alpha=0.7, edgecolor='black', linewidth=2)
axes[1, 2].set_ylabel('Number of Frauds', fontsize=11, fontweight='bold')
axes[1, 2].set_title('Fraud Detection Count', fontsize=12, fontweight='bold')
axes[1, 2].grid(axis='y', alpha=0.3)

# Add value labels
for i, (label, value) in enumerate(zip(labels, detection_data)):
    axes[1, 2].text(i, value + 2, f'{int(value)}', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("📊 KEY IMPROVEMENTS:")
print(f"   • Recall improved by {recall_improvement:.1f}%")
print(f"   • Now detecting {frauds_detected_ros}/{total_frauds} frauds vs {frauds_detected_baseline}/{total_frauds} before")
print(f"   • F1-Score changed by {f1_improvement:.1f}%")
print("   • Check precision too: recall gains usually come with more false positives")
print("=" * 100)

## Method 2: Random Undersampling

**Random Undersampling** reduces the number of majority class samples by randomly removing them until classes are balanced.

### How It Works:
1. Identify majority class samples
2. Randomly select a subset of majority class
3. Keep only selected samples
4. Combine with all minority class samples
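
The same steps can be sketched with NumPy. As with the oversampling sketch, the toy data is hypothetical and this is not the code behind imbalanced-learn's `RandomUnderSampler`; it only illustrates the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 8 majority samples (class 0) and 2 minority samples (class 1)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

# Step 1: identify majority (and minority) samples
majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Steps 2-3: keep a random subset of the majority class, drawn without replacement
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# Step 4: combine the kept majority rows with all minority rows
keep_idx = np.sort(np.concatenate([kept_majority, minority_idx]))
X_res, y_res = X[keep_idx], y[keep_idx]

print(np.bincount(y_res).tolist())  # [2, 2]
print(f"Discarded {len(y) - len(y_res)} of {len(y)} rows")  # Discarded 6 of 10 rows
```

Six of the eight majority rows are thrown away here, which makes the information-loss trade-off concrete.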

### When to Use:
- Large datasets where losing majority samples is acceptable
- When computational resources are limited
- When majority class has redundant information

### Advantages:
- ✅ Reduces dataset size (faster training)
- ✅ Reduces memory requirements
- ✅ Balances classes effectively

### Disadvantages:
- ❌ Loses potentially useful information
- ❌ May discard important majority class patterns
- ❌ Can lead to underfitting
- ❌ Not suitable for small datasets

In [None]:
# Random Undersampling
print("RANDOM UNDERSAMPLING")
print("=" * 100)

# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print("\nBefore Undersampling:")
print(f"  Class 0: {Counter(y_train)[0]:,} samples")
print(f"  Class 1: {Counter(y_train)[1]:,} samples")
print(f"  Total: {len(y_train):,} samples")

print("\nAfter Undersampling:")
print(f"  Class 0: {Counter(y_train_rus)[0]:,} samples")
print(f"  Class 1: {Counter(y_train_rus)[1]:,} samples")
print(f"  Total: {len(y_train_rus):,} samples")

print(f"\nMajority class samples removed: {Counter(y_train)[0] - Counter(y_train_rus)[0]:,}")
print(f"Dataset size reduction: {((len(y_train) - len(y_train_rus)) / len(y_train) * 100):.1f}%")

# Train model on undersampled data
model_rus = LogisticRegression(random_state=42, max_iter=1000)
model_rus.fit(X_train_rus, y_train_rus)

# Predictions
y_pred_rus = model_rus.predict(X_test)
y_pred_proba_rus = model_rus.predict_proba(X_test)[:, 1]

# Evaluation
accuracy_rus = accuracy_score(y_test, y_pred_rus)
precision_rus = precision_score(y_test, y_pred_rus)
recall_rus = recall_score(y_test, y_pred_rus)
f1_rus = f1_score(y_test, y_pred_rus)
roc_auc_rus = roc_auc_score(y_test, y_pred_proba_rus)

print("\n" + "-" * 100)
print("PERFORMANCE AFTER UNDERSAMPLING:")
print("-" * 100)
print(f"Accuracy:  {accuracy_rus:.4f}")
print(f"Precision: {precision_rus:.4f}")
print(f"Recall:    {recall_rus:.4f} ✅")
print(f"F1-Score:  {f1_rus:.4f}")
print(f"ROC-AUC:   {roc_auc_rus:.4f}")

# Confusion Matrix
cm_rus = confusion_matrix(y_test, y_pred_rus)

# Comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Random Undersampling Results', fontsize=16, fontweight='bold')

# Dataset size comparison
methods = ['Original', 'Oversampling', 'Undersampling']
sizes = [len(y_train), len(y_train_ros), len(y_train_rus)]
colors_size = ['blue', 'coral', 'lightgreen']

axes[0, 0].bar(methods, sizes, color=colors_size, alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 0].set_ylabel('Dataset Size', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Training Set Size Comparison', fontsize=12, fontweight='bold')
axes[0, 0].grid(axis='y', alpha=0.3)

for i, (method, size) in enumerate(zip(methods, sizes)):
    axes[0, 0].text(i, size + 100, f'{size:,}', ha='center', fontsize=10, fontweight='bold')

# Performance comparison: All three methods
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
baseline_scores = [accuracy_baseline, precision_baseline, recall_baseline, f1_baseline, roc_auc_baseline]
ros_scores = [accuracy_ros, precision_ros, recall_ros, f1_ros, roc_auc_ros]
rus_scores = [accuracy_rus, precision_rus, recall_rus, f1_rus, roc_auc_rus]

x_pos = np.arange(len(metrics))
width = 0.25

axes[0, 1].bar(x_pos - width, baseline_scores, width, label='Baseline', alpha=0.7, color='lightcoral')
axes[0, 1].bar(x_pos, ros_scores, width, label='Oversampling', alpha=0.7, color='lightblue')
axes[0, 1].bar(x_pos + width, rus_scores, width, label='Undersampling', alpha=0.7, color='lightgreen')
axes[0, 1].set_xticks(x_pos)
axes[0, 1].set_xticklabels(metrics, rotation=45, ha='right')
axes[0, 1].set_ylabel('Score', fontsize=11, fontweight='bold')
axes[0, 1].set_title('Performance Metrics: All Methods', fontsize=12, fontweight='bold')
axes[0, 1].set_ylim([0, 1.1])
axes[0, 1].legend()
axes[0, 1].grid(axis='y', alpha=0.3)

# Confusion matrices comparison
axes[1, 0].axis('off')
fig_text = f"""
CONFUSION MATRIX COMPARISON

Baseline Model:
  True Negatives:  {cm_baseline[0,0]:4d}  |  False Positives: {cm_baseline[0,1]:4d}
  False Negatives: {cm_baseline[1,0]:4d}  |  True Positives:  {cm_baseline[1,1]:4d}

Oversampling:
  True Negatives:  {cm_ros[0,0]:4d}  |  False Positives: {cm_ros[0,1]:4d}
  False Negatives: {cm_ros[1,0]:4d}  |  True Positives:  {cm_ros[1,1]:4d}

Undersampling:
  True Negatives:  {cm_rus[0,0]:4d}  |  False Positives: {cm_rus[0,1]:4d}
  False Negatives: {cm_rus[1,0]:4d}  |  True Positives:  {cm_rus[1,1]:4d}

Key: Lower False Negatives = Better fraud detection
"""
axes[1, 0].text(0.1, 0.5, fig_text, fontsize=10, family='monospace',
                verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# ROC Curves - All methods
fpr_rus, tpr_rus, _ = roc_curve(y_test, y_pred_proba_rus)
axes[1, 1].plot(fpr, tpr, color='red', linewidth=2, label=f'Baseline (AUC={roc_auc_baseline:.3f})')
axes[1, 1].plot(fpr_ros, tpr_ros, color='blue', linewidth=2, label=f'Oversampling (AUC={roc_auc_ros:.3f})')
axes[1, 1].plot(fpr_rus, tpr_rus, color='green', linewidth=2, label=f'Undersampling (AUC={roc_auc_rus:.3f})')
axes[1, 1].plot([0, 1], [0, 1], color='navy', linewidth=2, linestyle='--', label='Random')
axes[1, 1].set_xlabel('False Positive Rate', fontsize=11, fontweight='bold')
axes[1, 1].set_ylabel('True Positive Rate', fontsize=11, fontweight='bold')
axes[1, 1].set_title('ROC Curves: All Methods', fontsize=12, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("📊 COMPARISON SUMMARY:")
print("=" * 100)
print(f"{'Metric':<15} {'Baseline':<12} {'Oversampling':<15} {'Undersampling':<15}")
print("-" * 100)
print(f"{'Accuracy':<15} {accuracy_baseline:<12.4f} {accuracy_ros:<15.4f} {accuracy_rus:<15.4f}")
print(f"{'Precision':<15} {precision_baseline:<12.4f} {precision_ros:<15.4f} {precision_rus:<15.4f}")
print(f"{'Recall':<15} {recall_baseline:<12.4f} {recall_ros:<15.4f} {recall_rus:<15.4f}")
print(f"{'F1-Score':<15} {f1_baseline:<12.4f} {f1_ros:<15.4f} {f1_rus:<15.4f}")
print(f"{'ROC-AUC':<15} {roc_auc_baseline:<12.4f} {roc_auc_ros:<15.4f} {roc_auc_rus:<15.4f}")
print(f"{'Training Size':<15} {len(y_train):<12,} {len(y_train_ros):<15,} {len(y_train_rus):<15,}")
print("=" * 100)

## Summary and Best Practices

### Key Takeaways

1. **Imbalanced data is common** in real-world ML problems
2. **Accuracy is misleading** - use precision, recall, F1-score, and ROC-AUC instead
3. **Both methods improve minority class recall**, usually at the cost of lower precision
4. **Choose based on your constraints**:
   - **Oversampling**: When you have a small dataset or can afford longer training time
   - **Undersampling**: When you have a large dataset and need faster training

### Decision Framework

| Factor | Favor Oversampling | Favor Undersampling |
|--------|-------------------|---------------------|
| Dataset Size | Small (<10K) | Large (>100K) |
| Minority Class Diversity | High | Low |
| Computational Resources | Available | Limited |
| Training Time | Can afford longer | Need fast training |
| Information Loss Tolerance | Low | High |
| Overfitting Risk | Acceptable | Want to minimize |

### Oversampling vs Undersampling: Quick Comparison

**Oversampling Wins When:**
- Dataset is small
- Can't afford to lose any information
- Minority class has good diversity
- Computational resources are available

**Undersampling Wins When:**
- Dataset is very large
- Training time is critical
- Memory is limited
- Majority class has redundant samples

### Best Practices

✅ **DO:**
- Always split data BEFORE resampling (avoid data leakage!)
- Use a stratified split to maintain class proportions
- Evaluate with multiple metrics (precision, recall, F1, ROC-AUC)
- Try both oversampling and undersampling
- Consider advanced techniques (SMOTE, Tomek Links, etc.)
- Use cross-validation for robust evaluation, resampling only the training folds
- Monitor for overfitting

❌ **DON'T:**
- Resample before the train-test split (major data leakage!)
- Rely only on accuracy
- Ignore the confusion matrix
- Apply resampling to the test set
- Forget about class weights as an alternative
- Ignore domain context and costs
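
The first rule in both lists, split before resampling, can be illustrated with a small NumPy sketch. The labels and the 80/20 split are hypothetical, and row ids stand in for feature rows. Because duplicates are drawn only from training rows, no copy of a test row can end up in the training set.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical labels: 90 majority (0) and 10 minority (1); row ids stand in for features
y = np.array([0] * 90 + [1] * 10)
row_ids = np.arange(len(y))

# 1) Split FIRST: hold out 20% of each class so the test set stays stratified
test_idx = np.concatenate([
    rng.choice(row_ids[y == cls], size=np.sum(y == cls) // 5, replace=False)
    for cls in (0, 1)
])
train_idx = np.setdiff1d(row_ids, test_idx)

# 2) Oversample the minority class using training rows only
train_minority = train_idx[y[train_idx] == 1]
train_majority = train_idx[y[train_idx] == 0]
extra = rng.choice(train_minority,
                   size=len(train_majority) - len(train_minority),
                   replace=True)
train_resampled = np.concatenate([train_idx, extra])

# The test set is untouched and shares no rows with the resampled training set
print(np.intersect1d(train_resampled, test_idx).size)  # 0
print(np.bincount(y[train_resampled]).tolist())        # [72, 72]
print(np.bincount(y[test_idx]).tolist())               # [18, 2]
```

Had the duplication happened before the split, copies of the same minority row could land on both sides, and the test score would partly measure memorization.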

### Alternative Approaches

Beyond random resampling:
1. **SMOTE** (Synthetic Minority Over-sampling Technique) - Creates synthetic examples
2. **Tomek Links** - Removes borderline majority samples
3. **Class Weights** - Penalize model more for misclassifying minority class
4. **Ensemble Methods** - Use algorithms like balanced Random Forest
5. **Anomaly Detection** - Treat minority class as anomalies
6. **Cost-Sensitive Learning** - Assign different costs to different errors
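
Class weights (item 3) leave the data untouched and instead rescale each sample's contribution to the loss. A common heuristic, and the one scikit-learn documents for `class_weight='balanced'`, is `n_samples / (n_classes * count_per_class)`. The sketch below computes it with NumPy for a hypothetical 950/50 label vector.

```python
import numpy as np

# Hypothetical labels: 950 majority (0) and 50 minority (1)
y = np.array([0] * 950 + [1] * 50)

counts = np.bincount(y)  # samples per class: [950, 50]
n_classes = len(counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights
class_weights = len(y) / (n_classes * counts)
print(class_weights.round(3).tolist())  # [0.526, 10.0]

# Per-sample weights: look up each sample's class weight
sample_weights = class_weights[y]

# After weighting, both classes carry the same total weight in the loss
total_per_class = [sample_weights[y == c].sum() for c in (0, 1)]
print(np.allclose(total_per_class, 500.0))  # True
```

In scikit-learn, passing `class_weight='balanced'` to `LogisticRegression` applies this weighting during fitting, so no resampling step is needed.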

### When to Use What

```
Small Dataset + High Imbalance → Oversampling (or SMOTE)
Large Dataset + Moderate Imbalance → Undersampling
Very High Imbalance (>99:1) → Anomaly Detection
Fast Training or Retraining Needed → Undersampling
High Cost of False Negatives → Oversampling + Tuned Threshold
```

### Real-World Tips

1. **Start Simple**: Try random oversampling/undersampling first
2. **Understand Costs**: What's more costly - false positive or false negative?
3. **Domain Knowledge**: Use it to guide your choice
4. **Iterate**: Try multiple approaches and compare
5. **Production Monitoring**: Track performance on new data
6. **Threshold Tuning**: Adjust prediction threshold based on business needs
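
Threshold tuning (tip 6) is easy to try once a model outputs probabilities. The sketch below uses hypothetical labels and predicted fraud probabilities for ten transactions. `precision_recall_at` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical true labels and predicted fraud probabilities for 10 transactions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.55, 0.90])

def precision_recall_at(threshold):
    """Precision and recall for the positive class at a given decision threshold."""
    y_pred = (y_proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

for threshold in (0.5, 0.4):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.67, recall=0.67
# threshold=0.4: precision=0.60, recall=1.00
```

Lowering the threshold from 0.5 to 0.4 catches every fraud in this toy data at the cost of one more false alarm. Whether that trade is worthwhile depends on the relative cost of false negatives and false positives.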

---

**Next Step**: Learn about SMOTE, a more sophisticated oversampling technique that creates synthetic samples instead of duplicating existing ones!