# Handling Imbalanced Datasets Using SMOTE

## Introduction

**SMOTE** (Synthetic Minority Over-sampling Technique) is an advanced oversampling method that creates synthetic (artificial) samples for the minority class instead of simply duplicating existing samples.

### The Problem with Random Oversampling

Random oversampling creates **exact duplicates** of minority class samples, which can lead to:
- **Overfitting**: Model memorizes specific instances
- **No new information**: Just repeating the same data
- **Poor generalization**: Model doesn't learn the underlying pattern
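
To make the "no new information" point concrete, here is a minimal NumPy sketch of random oversampling on a toy array. The helper name `random_oversample` and the sample values are illustrative assumptions, not part of any library.

```python
import numpy as np

def random_oversample(X_min, n_target, seed=0):
    """Hypothetical helper: pad minority rows up to n_target by sampling with replacement."""
    rng = np.random.default_rng(seed)
    extra_idx = rng.integers(0, len(X_min), size=n_target - len(X_min))
    return np.vstack([X_min, X_min[extra_idx]])

# Three toy minority samples with two features each
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]])
X_over = random_oversample(X_min, n_target=10)

n_unique = len(np.unique(X_over, axis=0))
print(X_over.shape)  # (10, 2)
print(n_unique)      # 3 -> ten rows, but still only three distinct points
```

The row count grows, yet the number of distinct points does not, which is why a model can end up memorizing those few points.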

### What is SMOTE?

SMOTE generates **new synthetic samples** by:
1. Taking a minority class sample
2. Finding its K nearest neighbors (from minority class)
3. Creating new samples along the line segments joining the sample and its neighbors
4. Adding these synthetic samples to the dataset

### Key Advantages of SMOTE

✅ **Creates new information** - synthetic samples, not duplicates
✅ **Better generalization** - model learns patterns, not specific instances
✅ **Reduces overfitting** - more diverse training data
✅ **Interpolates** between existing samples rather than extrapolating

### How SMOTE Works (Step-by-Step)

```
For each minority class sample x_i:
1. Find K nearest minority neighbors (typically K=5)
2. Randomly select one neighbor x_nn
3. Generate synthetic sample:
   x_new = x_i + λ × (x_nn - x_i)
   where λ is a random number in [0, 1]
4. Repeat until desired balance is achieved
```
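
The steps above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not imbalanced-learn's implementation: the function name `smote_sketch` is hypothetical, neighbors are found by brute-force Euclidean distance, and λ is drawn from [0, 1).

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style interpolation between minority samples."""
    rng = np.random.default_rng(seed)

    # Step 1: k nearest minority neighbors of every minority sample
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    neighbor_idx = np.argsort(dist, axis=1)[:, :k]

    # Step 2: pick a base sample x_i and one of its k neighbors x_nn at random
    base = rng.integers(0, len(X_min), size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    x_i = X_min[base]
    x_nn = X_min[neighbor_idx[base, pick]]

    # Step 3: interpolate along the segment from x_i to x_nn
    lam = rng.random((n_synthetic, 1))
    return x_i + lam * (x_nn - x_i)

rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))                 # toy minority class
X_new = smote_sketch(X_min, n_synthetic=30, k=5)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point is a convex combination of two real minority points, it always falls inside the bounding box of the original minority samples.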

### Visual Example

```
Original minority samples: ● ● ●
Nearest neighbors connected: ●--●  ●--●
Synthetic samples created:   ●--◉--●
New balanced dataset: ● ● ● ◉ ◉ ◉ (originals + synthetics)
```

### When to Use SMOTE

- Imbalanced classification problems
- When random oversampling causes overfitting
- Small to medium-sized datasets
- When minority class has some structure/pattern
- Classification tasks (not regression)

### Variants of SMOTE

1. **SMOTE** (original) - Standard synthetic oversampling
2. **Borderline-SMOTE** - Focuses on borderline samples
3. **ADASYN** - Adaptive synthetic sampling
4. **SMOTE-ENN** - SMOTE + Edited Nearest Neighbors
5. **SMOTE-Tomek** - SMOTE + Tomek links removal
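
The cleaning step in SMOTE-Tomek relies on Tomek links: pairs of points from opposite classes that are each other's nearest neighbor. A minimal NumPy sketch of finding them is shown below. The helper `find_tomek_links` is hypothetical and uses brute-force distances, so it is only suitable for small arrays. Which member of each link gets removed depends on the library's settings.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbor
    nearest = dist.argmin(axis=1)
    return [
        (i, int(j))
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]

# Toy 1-D data: points 1 and 2 sit close together but carry different labels
X = np.array([[0.0], [1.0], [1.2], [3.0], [3.1]])
y = np.array([0, 0, 1, 1, 1])
links = find_tomek_links(X, y)
print(links)  # [(1, 2)]
```

Points 3 and 4 are also mutual nearest neighbors, but they share a label, so they do not form a Tomek link.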

Let's implement SMOTE and see the difference!

## Step 1: Import Libraries and Create Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, classification_report, 
                              accuracy_score, precision_score, recall_score, 
                              f1_score, roc_auc_score, roc_curve)
from imblearn.over_sampling import SMOTE, RandomOverSampler, BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Create imbalanced dataset
X, y = make_classification(
    n_samples=5000,
    n_features=2,  # Using 2 features for easy visualization
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=2,
    weights=[0.95, 0.05],  # 95% vs 5% imbalance
    random_state=42,
    flip_y=0.01
)

print("=" * 100)
print("IMBALANCED DATASET FOR SMOTE DEMONSTRATION")
print("=" * 100)
print(f"\nDataset Shape: {X.shape}")
print(f"\nClass Distribution:")
for cls, count in Counter(y).items():
    print(f"  Class {cls}: {count:,} samples ({count/len(y)*100:.2f}%)")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Visualization of original data
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], 
            c='blue', label='Class 0 (Majority)', alpha=0.5, s=30)
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], 
            c='red', label='Class 1 (Minority)', alpha=0.8, s=50, edgecolors='black')
plt.xlabel('Feature 1', fontweight='bold')
plt.ylabel('Feature 2', fontweight='bold')
plt.title('Original Imbalanced Training Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
counts = [Counter(y_train)[0], Counter(y_train)[1]]
plt.bar(['Majority (0)', 'Minority (1)'], counts, color=['skyblue', 'salmon'], 
        alpha=0.7, edgecolor='black', linewidth=2)
plt.ylabel('Number of Samples', fontweight='bold')
plt.title('Training Data Distribution', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
for i, count in enumerate(counts):
    plt.text(i, count + 20, f'{count}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
print("=" * 100)

## Step 2: Comparing Random Oversampling vs SMOTE

In [None]:
# Apply Random Oversampling
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# Apply SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("RESAMPLING COMPARISON")
print("=" * 100)
print(f"\nOriginal Training Set: {Counter(y_train)}")
print(f"After Random Oversampling: {Counter(y_train_ros)}")
print(f"After SMOTE: {Counter(y_train_smote)}")

# Visualize the difference
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Original data
axes[0].scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], 
                c='blue', label='Class 0', alpha=0.4, s=20)
axes[0].scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], 
                c='red', label='Class 1', alpha=0.8, s=40, edgecolors='black')
axes[0].set_title('Original Imbalanced Data', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Random Oversampling
axes[1].scatter(X_train_ros[y_train_ros==0, 0], X_train_ros[y_train_ros==0, 1], 
                c='blue', label='Class 0', alpha=0.4, s=20)
axes[1].scatter(X_train_ros[y_train_ros==1, 0], X_train_ros[y_train_ros==1, 1], 
                c='red', label='Class 1 (Duplicates)', alpha=0.6, s=40, marker='s')
axes[1].set_title('Random Oversampling (Duplicates)', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

# SMOTE
axes[2].scatter(X_train_smote[y_train_smote==0, 0], X_train_smote[y_train_smote==0, 1], 
                c='blue', label='Class 0', alpha=0.4, s=20)
minority_original = X_train[y_train==1]
axes[2].scatter(minority_original[:, 0], minority_original[:, 1], 
                c='darkred', label='Original Minority', alpha=0.9, s=60, 
                edgecolors='black', linewidth=2, marker='o')
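# imbalanced-learn's SMOTE returns the original rows first and appends the
# synthetic rows after them, so everything past len(X_train) is synthetic.
# (Element-wise np.isin would test individual values, not whole rows.)
X_synthetic = X_train_smote[len(X_train):]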
axes[2].scatter(X_synthetic[:, 0], X_synthetic[:, 1], 
                c='orangered', label='SMOTE Synthetic', alpha=0.7, s=40, 
                marker='^', edgecolors='black', linewidth=0.5)
axes[2].set_title('SMOTE (Original + Synthetic)', fontsize=12, fontweight='bold')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("KEY OBSERVATION:")
print("  • Random Oversampling: Creates exact duplicates (overlapping points)")
print("  • SMOTE: Creates new synthetic samples in the feature space")
print("  • SMOTE samples are interpolated between existing minority samples")
print("=" * 100)

## Step 3: Model Performance Comparison

Let's train models and compare performance across different resampling techniques.

In [None]:
# Train models on different datasets
results = {}

# Baseline (No resampling)
model_base = LogisticRegression(random_state=42)
model_base.fit(X_train, y_train)
y_pred_base = model_base.predict(X_test)
results['Baseline'] = {
    'accuracy': accuracy_score(y_test, y_pred_base),
    'precision': precision_score(y_test, y_pred_base),
    'recall': recall_score(y_test, y_pred_base),
    'f1': f1_score(y_test, y_pred_base),
    'roc_auc': roc_auc_score(y_test, model_base.predict_proba(X_test)[:, 1])
}

# Random Oversampling
model_ros = LogisticRegression(random_state=42)
model_ros.fit(X_train_ros, y_train_ros)
y_pred_ros = model_ros.predict(X_test)
results['Random Oversampling'] = {
    'accuracy': accuracy_score(y_test, y_pred_ros),
    'precision': precision_score(y_test, y_pred_ros),
    'recall': recall_score(y_test, y_pred_ros),
    'f1': f1_score(y_test, y_pred_ros),
    'roc_auc': roc_auc_score(y_test, model_ros.predict_proba(X_test)[:, 1])
}

# SMOTE
model_smote = LogisticRegression(random_state=42)
model_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = model_smote.predict(X_test)
results['SMOTE'] = {
    'accuracy': accuracy_score(y_test, y_pred_smote),
    'precision': precision_score(y_test, y_pred_smote),
    'recall': recall_score(y_test, y_pred_smote),
    'f1': f1_score(y_test, y_pred_smote),
    'roc_auc': roc_auc_score(y_test, model_smote.predict_proba(X_test)[:, 1])
}

# Display results
print("MODEL PERFORMANCE COMPARISON")
print("=" * 100)
results_df = pd.DataFrame(results).T
print(results_df.round(4))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Performance metrics
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
x_pos = np.arange(len(metrics))
width = 0.25

for i, (method, scores) in enumerate(results.items()):
    values = [scores[m] for m in metrics]
    axes[0].bar(x_pos + i*width, values, width, label=method, alpha=0.7)

axes[0].set_xticks(x_pos + width)
axes[0].set_xticklabels(['Accuracy', 'Precision', 'Recall', 'F1', 'ROC-AUC'], rotation=45)
axes[0].set_ylabel('Score', fontweight='bold')
axes[0].set_title('Performance Metrics Comparison', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].set_ylim([0, 1.1])
axes[0].grid(axis='y', alpha=0.3)

# Focus on key metrics for imbalanced data
key_metrics = ['Recall', 'F1', 'ROC-AUC']
key_values = {method: [scores['recall'], scores['f1'], scores['roc_auc']] 
              for method, scores in results.items()}

x_pos = np.arange(len(key_metrics))
for i, (method, values) in enumerate(key_values.items()):
    axes[1].bar(x_pos + i*width, values, width, label=method, alpha=0.7)

axes[1].set_xticks(x_pos + width)
axes[1].set_xticklabels(key_metrics)
axes[1].set_ylabel('Score', fontweight='bold')
axes[1].set_title('Key Metrics for Imbalanced Data', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].set_ylim([0, 1.1])
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("📊 KEY FINDINGS:")
print(f"  • SMOTE Recall: {results['SMOTE']['recall']:.4f} vs Random OS: {results['Random Oversampling']['recall']:.4f}")
print(f"  • SMOTE F1: {results['SMOTE']['f1']:.4f} vs Random OS: {results['Random Oversampling']['f1']:.4f}")
print("  • SMOTE often generalizes better than random oversampling, but compare the numbers above on your own data")
print("=" * 100)

## Step 4: SMOTE Variants

SMOTE has several variants that address specific scenarios. Let's explore some of them.

In [None]:
# Apply different SMOTE variants
smote_variants = {
    'SMOTE': SMOTE(random_state=42),
    'Borderline-SMOTE': BorderlineSMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'SMOTE-Tomek': SMOTETomek(random_state=42),
    'SMOTE-ENN': SMOTEENN(random_state=42)
}

variant_results = {}

print("SMOTE VARIANTS COMPARISON")
print("=" * 100)

for name, sampler in smote_variants.items():
    try:
        X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
        print(f"\n{name}:")
        print(f"  Class Distribution: {Counter(y_resampled)}")
        print(f"  Total Samples: {len(y_resampled)}")
        
        # Train and evaluate
        model = LogisticRegression(random_state=42)
        model.fit(X_resampled, y_resampled)
        y_pred = model.predict(X_test)
        
        variant_results[name] = {
            'samples': len(y_resampled),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred),
            'roc_auc': roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        }
    except Exception as e:
        # Include the sampler name: the header above is only printed on success
        print(f"\n{name}: failed with error: {e}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Sample counts
methods = list(variant_results.keys())
sample_counts = [variant_results[m]['samples'] for m in methods]
axes[0].barh(methods, sample_counts, color='skyblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Total Samples After Resampling', fontweight='bold')
axes[0].set_title('Dataset Size: SMOTE Variants', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Performance comparison
metrics_v = ['recall', 'f1', 'roc_auc']
x_pos = np.arange(len(methods))
width = 0.25

for i, metric in enumerate(metrics_v):
    values = [variant_results[m][metric] for m in methods]
    axes[1].bar(x_pos + i*width, values, width, label=metric.upper(), alpha=0.7)

axes[1].set_xticks(x_pos + width)
axes[1].set_xticklabels(methods, rotation=45, ha='right')
axes[1].set_ylabel('Score', fontweight='bold')
axes[1].set_title('Performance: SMOTE Variants', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].set_ylim([0, 1.1])
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("VARIANT EXPLANATIONS:")
print("-" * 100)
print("• SMOTE: Standard synthetic oversampling")
print("• Borderline-SMOTE: Focuses on samples near decision boundary")
print("• ADASYN: Adaptive - generates more synthetics for harder-to-learn samples")
print("• SMOTE-Tomek: SMOTE + removes Tomek links (overlapping samples)")
print("• SMOTE-ENN: SMOTE + Edited Nearest Neighbors (removes noisy samples)")
print("=" * 100)

## Summary: SMOTE Best Practices

### Key Takeaways

1. **SMOTE creates synthetic samples** by interpolating between existing minority samples
2. **Often better than random oversampling** - avoids exact duplicates, which can reduce overfitting
3. **Multiple variants available** for different scenarios
4. **Works best with moderate imbalance** (not extreme cases)

### When to Use SMOTE

✅ **Use SMOTE When:**
- Imbalance ratio is moderate (1:10 to 1:100)
- Dataset is small to medium-sized
- Random oversampling causes overfitting
- Minority class has clear patterns
- Features are continuous/numerical

❌ **Avoid SMOTE When:**
- Extreme imbalance (>1:1000) - consider anomaly detection
- Very high dimensional data - curse of dimensionality
- Minority class is too sparse - not enough neighbors
- Features are categorical - plain SMOTE interpolates numeric values, so use SMOTENC or SMOTEN instead
- Dataset is very large - undersampling might be better
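
To see why categorical features are a problem, consider what SMOTE-style interpolation does to one-hot encoded values. The sketch below uses a made-up three-colour encoding and a fixed λ purely for illustration.

```python
import numpy as np

# Two minority samples whose only feature is a one-hot encoded colour.
# Columns are [red, green, blue] (an illustrative encoding).
x_i = np.array([1.0, 0.0, 0.0])    # red
x_nn = np.array([0.0, 1.0, 0.0])   # green

lam = 0.3                          # fixed here; SMOTE draws it at random
x_new = x_i + lam * (x_nn - x_i)
print(x_new)                       # roughly [0.7, 0.3, 0.0]

# The result is no longer a valid one-hot vector
is_valid_one_hot = bool(np.isin(x_new, [0.0, 1.0]).all())
print(is_valid_one_hot)            # False
```

The synthetic point is "70% red, 30% green", which corresponds to no real category. imbalanced-learn provides `SMOTENC` for mixed numerical and categorical data for this reason.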

### SMOTE Parameters

**Important Parameters:**
- `k_neighbors`: Number of nearest neighbors (default=5)
  - Higher → More diverse synthetics
  - Lower → Closer to original samples
- `sampling_strategy`: Desired ratio after resampling
  - 'auto' → Balance to 1:1
  - float → Custom ratio (e.g., 0.5 → 1:2)
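
As a rough guide to what a float `sampling_strategy` means in the binary case, the arithmetic can be sketched as follows. The helper `synthetic_needed` is hypothetical, the counts are illustrative, and clamping at zero is a simplification made here rather than library behavior.

```python
def synthetic_needed(n_majority, n_minority, ratio=1.0):
    """Hypothetical helper: minority samples to add so minority/majority reaches `ratio`."""
    target_minority = int(ratio * n_majority)
    return max(target_minority - n_minority, 0)

# Counts resembling a 95% / 5% split of 4,000 training rows (illustrative numbers)
print(synthetic_needed(3800, 200))             # 3600 -> 3800:3800, i.e. 1:1
print(synthetic_needed(3800, 200, ratio=0.5))  # 1700 -> 1900:3800, i.e. 1:2
```

A lower ratio generates fewer synthetic points, which can be a reasonable middle ground when full 1:1 balancing hurts precision too much.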

### Best Practices

✅ **DO:**
- Apply SMOTE only to training data (after train-test split)
- Experiment with different k_neighbors values
- Try SMOTE variants for better results
- Combine with undersampling for extreme imbalance
- Use cross-validation for robust evaluation
- Check for noise/outliers before applying SMOTE

❌ **DON'T:**
- Apply SMOTE before splitting data (data leakage!)
- Apply plain SMOTE to categorical features, even encoded ones (use SMOTENC or SMOTEN instead)
- Expect miracles with extremely imbalanced data
- Forget to scale features before SMOTE
- Use on test/validation data

### Comparison Summary

| Aspect | Random Oversampling | SMOTE |
|--------|---------------------|-------|
| **Method** | Duplicate samples | Create synthetic samples |
| **Overfitting Risk** | Higher (exact duplicates) | Lower |
| **Generalization** | Often weaker | Often better |
| **Training Time** | Fast | Slower |
| **Diversity** | No new info | New synthetic points |
| **Best For** | Very small datasets | Medium datasets |

### Real-World Tips

1. **Start with standard SMOTE**, then try variants
2. **Borderline-SMOTE** can help when classes overlap (focuses on samples near the decision boundary)
3. **SMOTE-Tomek/ENN** clean up noise after generation
4. **ADASYN** adapts to local difficulty
5. **Monitor both precision and recall** - balance matters
6. **Try ensemble methods** (Random Forest, XGBoost) with SMOTE, and compare against a baseline without resampling
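
The advice to monitor both precision and recall can be illustrated with simple confusion-matrix arithmetic. The counts below are made up for illustration and are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical test set: 950 negatives, 50 positives.
# Case 1: a model that always predicts the majority class.
tn, fp, fn, tp = 950, 0, 50, 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                         # 0.95
print(precision_recall_f1(tp, fp, fn))  # (0.0, 0.0, 0.0)

# Case 2: made-up counts for a model trained on resampled data.
p, r, f1 = precision_recall_f1(tp=40, fp=120, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.25 0.8 0.381
```

The first model looks strong on accuracy while finding no positives at all. The second finds most positives but raises many false alarms, which is the trade-off to watch after oversampling.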

### Code Template

```python
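from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X, y: your feature matrix and labels, assumed to be loaded already

# 1. Split data FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Apply SMOTE only to training data
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# 3. Train model (any scikit-learn classifier can be used here)
model = LogisticRegression(random_state=42)
model.fit(X_train_resampled, y_train_resampled)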

# 4. Evaluate on original test data
y_pred = model.predict(X_test)
```

---

**Congratulations!** You now understand SMOTE and how to handle imbalanced datasets effectively!