# Handling Imbalanced Data in Machine Learning

This notebook provides a comprehensive guide to understanding and handling imbalanced datasets in machine learning. We will explore various techniques, including oversampling, undersampling, and advanced methods like SMOTE and ensemble learning.

## 1. Introduction to Imbalanced Data

An imbalanced dataset is one in which the classes are represented by markedly different numbers of observations. This is common in real-world problems such as fraud detection, medical diagnosis, and spam filtering. When one class (the majority class) significantly outnumbers another (the minority class), machine learning models tend to become biased toward the majority class and perform poorly on the minority class, which is often the class of interest.

### 1.1 Why is it a problem?

Most standard machine learning algorithms optimize an objective that weights every training example equally, so they implicitly favor overall accuracy. On an imbalanced dataset, a model can achieve high accuracy by simply predicting the majority class for all instances while completely ignoring the minority class. This is problematic because in many applications correctly identifying the minority class is crucial (e.g., detecting a rare disease or a fraudulent transaction).
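
The effect is easy to reproduce with a few lines of NumPy. The sketch below uses hypothetical labels with a 95:5 class ratio and a "model" that always predicts the majority class; the numbers are illustrative only and are not taken from the dataset used later in this notebook.

```python
import numpy as np

# Hypothetical labels: 950 negatives (majority) and 50 positives (minority).
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: fraction of true positives that were found.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.2%}")
print(f"Minority recall: {minority_recall:.2%}")
```

The accuracy looks excellent even though not a single minority instance is detected.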

### 1.2 Overview of Techniques

There are several techniques to handle imbalanced data, which can be broadly categorized into:

*   **Data-level methods:** These techniques modify the training data to create a balanced class distribution. This includes oversampling the minority class, undersampling the majority class, or a combination of both.
*   **Algorithm-level methods:** These techniques modify the learning algorithm to be more sensitive to the minority class. This often involves assigning different weights to the classes or using cost-sensitive learning.
*   **Ensemble methods:** These techniques combine multiple models to improve performance. Specialized ensemble methods have been developed for imbalanced learning.

## 2. Setup and Dataset Creation

In [None]:
!uv pip install imbalanced-learn

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, NearMiss
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier, BalancedBaggingClassifier
from collections import Counter

# Set plotting style
sns.set_style('whitegrid')

### 2.1 Create a Synthetic Imbalanced Dataset

In [None]:
# Create a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, weights=[0.95, 0.05],
                           flip_y=0, random_state=42)

# Visualize the class distribution
counter = Counter(y)
print(f'Original dataset shape {X.shape}')
print(f'Original dataset samples per class {counter}')

plt.figure(figsize=(8, 6))
sns.countplot(x=y)
plt.title('Original Class Distribution')
plt.show()

## 3. Evaluation Metrics for Imbalanced Data

As discussed earlier, accuracy is not a suitable metric for evaluating models trained on imbalanced data. Instead, we should use metrics that provide a better picture of the model's performance on the minority class. Some of these metrics are:

*   **Confusion Matrix:** A table that summarizes the performance of a classification model.
*   **Precision:** The ratio of correctly predicted positive observations to the total predicted positive observations. `Precision = TP / (TP + FP)`
*   **Recall (Sensitivity):** The ratio of correctly predicted positive observations to all observations in the actual class. `Recall = TP / (TP + FN)`
*   **F1-Score:** The harmonic mean of precision and recall. `F1 Score = 2 * (Recall * Precision) / (Recall + Precision)`
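*   **ROC-AUC:** The Area Under the Receiver Operating Characteristic Curve. It measures how well the classifier ranks positive instances above negative ones across all decision thresholds.
*   **PR-AUC:** The area under the Precision-Recall curve. Because it ignores true negatives, it is often more informative than ROC-AUC when the positive class is rare.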
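
As a worked example of the formulas above, the following sketch computes the metrics by hand from a set of hypothetical confusion-matrix counts (the counts are made up for illustration).

```python
# Hypothetical confusion-matrix counts for the positive (minority) class.
tp, fp, fn, tn = 30, 10, 20, 940

precision = tp / (tp + fp)                           # 30 / 40
recall = tp / (tp + fn)                              # 30 / 50
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy = (tp + tn) / (tp + fp + fn + tn)           # 970 / 1000

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

Accuracy is 0.97 here even though 40% of the positive instances are missed, which is why precision, recall, and F1 are reported separately.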

### 3.1 Helper Function for Evaluation

In [None]:
def evaluate_model(X_test, y_test, model):
    # Predict probabilities
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Predict classes
    y_pred = model.predict(X_test)
    
    # Print classification report
    print('Classification Report:')
    print(classification_report(y_test, y_pred))
    
    # Print ROC-AUC score
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    print(f'ROC-AUC Score: {roc_auc:.4f}')
    
    # Plotting Precision-Recall Curve
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    pr_auc = auc(recall, precision)
    
    plt.figure(figsize=(8, 6))
    plt.plot(recall, precision, label=f'PR Curve (AUC = {pr_auc:.4f})')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.legend(loc='best')
    plt.show()

## 4. Baseline Model

Let's first train a simple Logistic Regression model on the original, imbalanced dataset. This will serve as our baseline to see how the different data handling techniques improve the model's performance.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train a baseline logistic regression model
baseline_model = LogisticRegression(random_state=42)
baseline_model.fit(X_train, y_train)

# Evaluate the baseline model
print('--- Baseline Model Evaluation ---')
evaluate_model(X_test, y_test, baseline_model)

## 5. Oversampling Techniques

Oversampling techniques increase the number of instances in the minority class to balance the dataset. Let's explore some popular oversampling methods.

### 5.1 Random Oversampling

Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.
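
To make the mechanism concrete, here is a minimal NumPy sketch of the idea on hypothetical toy data. It is an illustration only, not how `RandomOverSampler` is implemented.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y == 1)
n_needed = (y == 0).sum() - minority_idx.size  # duplicates required to balance

# Draw minority rows with replacement and append them to the data.
extra_idx = rng.choice(minority_idx, size=n_needed, replace=True)
X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])

print(np.bincount(y_balanced))  # samples per class after oversampling
```

Because the added rows are exact copies, the model sees no new information; it simply pays more attention to the existing minority samples.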

In [None]:
# Apply Random Oversampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_ros.shape}')
print(f'Resampled dataset samples per class {Counter(y_ros)}')

# Train a logistic regression model on the resampled data
ros_model = LogisticRegression(random_state=42)
ros_model.fit(X_ros, y_ros)

# Evaluate the model
print('--- Random Oversampling Model Evaluation ---')
evaluate_model(X_test, y_test, ros_model)

### 5.2 SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is a more sophisticated oversampling technique. Instead of duplicating minority class instances, it creates synthetic instances by interpolating between existing minority class instances.
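
The interpolation step can be sketched in a few lines of NumPy. The points, the value of `k`, and the helper `smote_sample` are hypothetical and chosen for illustration; the `SMOTE` class used below handles neighbor search and sampling ratios itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2D.
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])
k = 2  # number of nearest minority neighbors to consider (assumed value)


def smote_sample(X_min, k, rng):
    """Create one synthetic point between a minority sample and a neighbor."""
    i = rng.integers(len(X_min))
    # Euclidean distances from sample i to every minority sample.
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    # Position 0 of the argsort is the sample itself, so skip it.
    neighbors = np.argsort(dist)[1:k + 1]
    j = rng.choice(neighbors)
    gap = rng.random()  # uniform in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])


synthetic = np.array([smote_sample(X_min, k, rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority samples, so it is new but still plausible for the minority class.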

In [None]:
# Apply SMOTE
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_smote.shape}')
print(f'Resampled dataset samples per class {Counter(y_smote)}')

# Train a logistic regression model on the resampled data
smote_model = LogisticRegression(random_state=42)
smote_model.fit(X_smote, y_smote)

# Evaluate the model
print('--- SMOTE Model Evaluation ---')
evaluate_model(X_test, y_test, smote_model)

### 5.3 Borderline SMOTE

Borderline SMOTE is a variant of SMOTE that focuses on generating synthetic samples along the decision boundary between the minority and majority classes. It identifies minority class samples that are difficult to classify (i.e., those for which at least half, but not all, of the nearest neighbors belong to the majority class) and generates synthetic samples from them.
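
The selection step can be illustrated with a small NumPy sketch. It follows the usual description of the method, in which a minority sample is "noise" if all of its `m` nearest neighbors are majority samples, "danger" if at least half (but not all) are, and "safe" otherwise. The helper `borderline_categories`, the value `m=3`, and the 1-D data are hypothetical.

```python
import numpy as np


def borderline_categories(X, y, m=3, minority_label=1):
    """Label each minority sample as 'noise', 'danger', or 'safe'."""
    categories = {}
    for i in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # a sample is not its own neighbor
        neighbors = np.argsort(dist)[:m]
        n_majority = int((y[neighbors] != minority_label).sum())
        if n_majority == m:
            categories[int(i)] = "noise"   # surrounded by the majority class
        elif n_majority >= m / 2:
            categories[int(i)] = "danger"  # close to the class boundary
        else:
            categories[int(i)] = "safe"    # well inside the minority region
    return categories


# Hypothetical 1-D data: majority samples at -6..4, minority samples listed below.
X_major = np.arange(-6, 5, dtype=float)
X_minor = np.array([1.5, 3.6, 5.0, 5.8, 6.2, 6.6])
X = np.concatenate([X_major, X_minor]).reshape(-1, 1)
y = np.array([0] * len(X_major) + [1] * len(X_minor))

for idx, category in borderline_categories(X, y).items():
    print(f"x = {X[idx, 0]:.1f}: {category}")
```

Only the "danger" samples are used as seeds for synthetic points, which concentrates the new samples near the boundary.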

In [None]:
# Apply Borderline SMOTE
bsmote = BorderlineSMOTE(random_state=42)
X_bsmote, y_bsmote = bsmote.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_bsmote.shape}')
print(f'Resampled dataset samples per class {Counter(y_bsmote)}')

# Train a logistic regression model on the resampled data
bsmote_model = LogisticRegression(random_state=42)
bsmote_model.fit(X_bsmote, y_bsmote)

# Evaluate the model
print('--- Borderline SMOTE Model Evaluation ---')
evaluate_model(X_test, y_test, bsmote_model)

### 5.4 ADASYN (Adaptive Synthetic Sampling)

ADASYN is another adaptive oversampling technique. It generates more synthetic data for minority class samples that are harder to learn, based on the density of majority class samples in their neighborhood.

In [None]:
# Apply ADASYN
adasyn = ADASYN(random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_adasyn.shape}')
print(f'Resampled dataset samples per class {Counter(y_adasyn)}')

# Train a logistic regression model on the resampled data
adasyn_model = LogisticRegression(random_state=42)
adasyn_model.fit(X_adasyn, y_adasyn)

# Evaluate the model
print('--- ADASYN Model Evaluation ---')
evaluate_model(X_test, y_test, adasyn_model)

## 6. Undersampling Techniques

Undersampling techniques reduce the number of instances in the majority class to balance the dataset. This can be useful when the dataset is very large and training time is a concern. However, it can also lead to loss of important information.

### 6.1 Random Undersampling

Random undersampling involves randomly removing examples from the majority class.

In [None]:
# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_rus.shape}')
print(f'Resampled dataset samples per class {Counter(y_rus)}')

# Train a logistic regression model on the resampled data
rus_model = LogisticRegression(random_state=42)
rus_model.fit(X_rus, y_rus)

# Evaluate the model
print('--- Random Undersampling Model Evaluation ---')
evaluate_model(X_test, y_test, rus_model)

### 6.2 Tomek Links

A Tomek link is a pair of instances from opposite classes that are each other's nearest neighbors. In the context of undersampling, the majority class instance of each link is removed, which helps to clean the class boundary.
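
The definition translates directly into code. The following NumPy sketch finds mutual nearest neighbors with different labels on hypothetical 1-D data; `find_tomek_links` is an illustrative helper, not part of imbalanced-learn.

```python
import numpy as np


def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    # Pairwise Euclidean distances; the diagonal is set to inf so that
    # a point is never selected as its own nearest neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# Hypothetical data: three majority (0) and three minority (1) samples.
X = np.array([[0.0], [1.2], [2.0], [2.3], [4.0], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

links = find_tomek_links(X, y)
print(links)

# Undersampling step: drop the majority member of every link.
majority_in_links = [i if y[i] == 0 else j for i, j in links]
keep = np.ones(len(y), dtype=bool)
keep[majority_in_links] = False
print(X[keep].ravel(), y[keep])
```

Note that the samples at 4.0 and 4.5 are also mutual nearest neighbors, but they share a label and therefore do not form a Tomek link.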

In [None]:
# Apply Tomek Links
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_tl.shape}')
print(f'Resampled dataset samples per class {Counter(y_tl)}')

# Train a logistic regression model on the resampled data
tl_model = LogisticRegression(random_state=42)
tl_model.fit(X_tl, y_tl)

# Evaluate the model
print('--- Tomek Links Model Evaluation ---')
evaluate_model(X_test, y_test, tl_model)

### 6.3 NearMiss

NearMiss is an undersampling technique that selects majority class samples based on their distance to minority class samples. There are three versions of NearMiss:

*   **NearMiss-1:** Selects majority class samples with the smallest average distance to the *three* closest minority class samples.
*   **NearMiss-2:** Selects majority class samples with the smallest average distance to the *three* farthest minority class samples.
*   **NearMiss-3:** Selects a given number of the closest majority class samples for each minority class sample.
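
The NearMiss-1 rule can be sketched with NumPy as follows. The helper `nearmiss1_indices` and the 1-D data are hypothetical; the sketch keeps as many majority samples as there are minority samples, which is one possible balancing target.

```python
import numpy as np


def nearmiss1_indices(X, y, n_neighbors=3, majority_label=0):
    """Return the indices kept by a NearMiss-1 style selection."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # Distance from every majority sample to every minority sample.
    dist = np.linalg.norm(X[maj_idx][:, None, :] - X[min_idx][None, :, :], axis=2)
    # Average distance to the n_neighbors closest minority samples.
    avg_closest = np.sort(dist, axis=1)[:, :n_neighbors].mean(axis=1)
    # Keep the majority samples with the smallest average distance.
    keep_major = maj_idx[np.argsort(avg_closest)[:min_idx.size]]
    return np.sort(np.concatenate([keep_major, min_idx]))


# Hypothetical data: majority samples at 0..7, minority samples at 8, 8.5, 9.
X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 8.5, 9], dtype=float).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 3)

kept = nearmiss1_indices(X, y)
print(kept)
print(X[kept].ravel())
```

The retained majority samples are those sitting next to the minority cluster. A NearMiss-2 variant would average over the farthest minority samples instead of the closest ones.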

In [None]:
# Apply NearMiss (version=1 is the default; version=2 or version=3 select the other variants)
nm = NearMiss(version=1)
X_nm, y_nm = nm.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_nm.shape}')
print(f'Resampled dataset samples per class {Counter(y_nm)}')

# Train a logistic regression model on the resampled data
nm_model = LogisticRegression(random_state=42)
nm_model.fit(X_nm, y_nm)

# Evaluate the model
print('--- NearMiss Model Evaluation ---')
evaluate_model(X_test, y_test, nm_model)

## 7. Combination Techniques

Combination techniques, also known as hybrid methods, combine oversampling and undersampling techniques to achieve a better-balanced dataset. These methods can often provide better results than using either oversampling or undersampling alone.

### 7.1 SMOTE + Tomek Links

This method first uses SMOTE to oversample the minority class and then removes Tomek links from the resampled data. By default, imbalanced-learn's `SMOTETomek` removes both instances of each link rather than only the majority class instance. This helps to clean the class boundary and remove noise.

In [None]:
# Apply SMOTE + Tomek Links
smt = SMOTETomek(random_state=42)
X_smt, y_smt = smt.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_smt.shape}')
print(f'Resampled dataset samples per class {Counter(y_smt)}')

# Train a logistic regression model on the resampled data
smt_model = LogisticRegression(random_state=42)
smt_model.fit(X_smt, y_smt)

# Evaluate the model
print('--- SMOTE + Tomek Links Model Evaluation ---')
evaluate_model(X_test, y_test, smt_model)

### 7.2 SMOTE + ENN (Edited Nearest Neighbors)

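This method is similar to SMOTE + Tomek Links, but it uses Edited Nearest Neighbors (ENN) for cleaning. In its classic form, ENN removes any instance whose class label differs from that of at least two of its three nearest neighbors. imbalanced-learn's `EditedNearestNeighbours` is stricter by default (`kind_sel='all'`): it removes an instance unless all of its nearest neighbors share its label.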
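
The classic ENN rule with three neighbors can be sketched in NumPy as shown below. The helper `enn_keep_mask` and the 1-D data are hypothetical, and the sketch implements the majority-vote rule only.

```python
import numpy as np


def enn_keep_mask(X, y, k=3):
    """Keep a sample unless more than half of its k nearest neighbors disagree with its label."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Number of neighbors whose label differs from the sample's own label.
    disagree = (y[neighbors] != y[:, None]).sum(axis=1)
    return disagree * 2 <= k


# Hypothetical data: class 0 at 0..3, class 1 at 10..13,
# plus one class-1 sample at 1.5 sitting inside the class-0 region.
X = np.array([0, 1, 2, 3, 1.5, 10, 11, 12, 13], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

keep = enn_keep_mask(X, y)
print(keep)
print(X[keep].ravel())
```

Only the isolated sample at 1.5 is removed: all three of its nearest neighbors carry the other label.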

In [None]:
# Apply SMOTE + ENN
sme = SMOTEENN(random_state=42)
X_sme, y_sme = sme.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled dataset shape {X_sme.shape}')
print(f'Resampled dataset samples per class {Counter(y_sme)}')

# Train a logistic regression model on the resampled data
sme_model = LogisticRegression(random_state=42)
sme_model.fit(X_sme, y_sme)

# Evaluate the model
print('--- SMOTE + ENN Model Evaluation ---')
evaluate_model(X_test, y_test, sme_model)

## 8. Ensemble Methods

Ensemble methods combine the predictions of several base estimators to improve the overall performance. Some ensemble methods are specifically designed to handle imbalanced datasets.

### 8.1 BalancedRandomForestClassifier

This is a variant of the Random Forest algorithm where each tree is trained on a balanced bootstrap sample of the data. It randomly undersamples the majority class for each tree.

In [None]:
# Apply BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(random_state=42)
brf.fit(X_train, y_train)

# Evaluate the model
print('--- BalancedRandomForestClassifier Model Evaluation ---')
evaluate_model(X_test, y_test, brf)

### 8.2 BalancedBaggingClassifier

This is a bagging classifier that uses a balanced bootstrap sample for each base estimator.

In [None]:
# Apply BalancedBaggingClassifier
bbc = BalancedBaggingClassifier(random_state=42)
bbc.fit(X_train, y_train)

# Evaluate the model
print('--- BalancedBaggingClassifier Model Evaluation ---')
evaluate_model(X_test, y_test, bbc)

### 8.3 EasyEnsembleClassifier

The EasyEnsemble classifier trains an ensemble of AdaBoost learners, each on a different balanced sample of the data obtained by randomly undersampling the majority class. It is an undersampling-based ensemble.
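
The underlying idea can be sketched with scikit-learn alone. Note the substitution: the sketch below uses plain logistic regression as the base learner and a fixed number of ten subsets, both assumptions made for brevity, so it illustrates the undersample-and-average scheme rather than reproducing `EasyEnsembleClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic imbalanced data with the same settings as earlier in the notebook.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           weights=[0.95, 0.05], flip_y=0, random_state=42)

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

models = []
for _ in range(10):  # number of balanced subsets (assumed value)
    # Draw as many majority samples as there are minority samples.
    sampled_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    subset = np.concatenate([sampled_maj, min_idx])
    models.append(LogisticRegression(max_iter=1000).fit(X[subset], y[subset]))

# Ensemble prediction: average the positive-class probabilities.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print(proba[:5])
```

Each base learner sees every minority sample but a different slice of the majority class, so the ensemble as a whole uses far more of the majority data than a single undersampled model would.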

In [None]:
# Apply EasyEnsembleClassifier
eec = EasyEnsembleClassifier(random_state=42)
eec.fit(X_train, y_train)

# Evaluate the model
print('--- EasyEnsembleClassifier Model Evaluation ---')
evaluate_model(X_test, y_test, eec)

## 9. Final Comparison

After applying various techniques, we can compare their performance to see which one worked best for our synthetic dataset. The choice of the best technique will depend on the specific dataset and the problem at hand.
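
One way to make this comparison concrete is to collect the same metrics for every fitted model into a single table. The sketch below is self-contained and therefore uses two stand-in models; `summarize` is a hypothetical helper, and in this notebook the dictionary would instead hold the models fitted above (for example `baseline_model`, `smote_model`, or `brf`).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           weights=[0.95, 0.05], flip_y=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Stand-in for a resampled model: random undersampling done by hand.
rng = np.random.default_rng(42)
maj_idx = np.flatnonzero(y_train == 0)
min_idx = np.flatnonzero(y_train == 1)
subset = np.concatenate([rng.choice(maj_idx, size=min_idx.size, replace=False), min_idx])

models = {
    "baseline": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "undersampled": LogisticRegression(max_iter=1000).fit(X_train[subset], y_train[subset]),
}


def summarize(models, X_test, y_test):
    """Evaluate each fitted model on the untouched test set."""
    rows = []
    for name, model in models.items():
        y_pred = model.predict(X_test)
        y_score = model.predict_proba(X_test)[:, 1]
        rows.append({
            "model": name,
            "recall": recall_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
            "roc_auc": roc_auc_score(y_test, y_score),
            "pr_auc": average_precision_score(y_test, y_score),
        })
    return pd.DataFrame(rows).set_index("model")


results = summarize(models, X_test, y_test)
print(results.round(3))
```

Sorting the table by the metric that matters most for the application (often minority recall or PR-AUC) gives a quick ranking of the techniques.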

## 10. Practical Tips and Best Practices

Here are some practical tips and best practices for handling imbalanced data:

*   **Choose the right evaluation metric:** As we've seen, accuracy is not a good metric for imbalanced datasets. Use metrics like Precision, Recall, F1-score, ROC-AUC, and the area under the Precision-Recall curve.
*   **Don't test on the resampled data:** Always resample only the training data and test your model on the original, untouched test set. This prevents data leakage and gives a more realistic estimate of the model's performance on unseen data.
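*   **Cross-validation:** Use stratified cross-validation to ensure that the class distribution in each fold is representative of the original dataset. Apply any resampling inside each training fold (for example, with `imblearn.pipeline.Pipeline`) rather than before splitting, so that the validation folds stay untouched.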
*   **Consider class weights:** Many scikit-learn estimators have a `class_weight` parameter that can be set to `'balanced'` to weight each class inversely to its frequency. This is a simple and often effective technique.
*   **Experiment with different techniques:** There is no one-size-fits-all solution for imbalanced data. It's important to experiment with different techniques to find the one that works best for your specific problem.
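
The cross-validation and class-weight tips can be combined in a short experiment. The sketch below compares an unweighted and a class-weighted logistic regression under stratified 5-fold cross-validation, scored by minority-class recall; the number of folds and the choice of metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           weights=[0.95, 0.05], flip_y=0, random_state=42)

# Stratified folds keep the 95:5 class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    # scoring="recall" measures recall on the positive (minority) class.
    cv_scores[name] = cross_val_score(model, X, y, cv=cv, scoring="recall")
    print(f"{name}: mean recall = {cv_scores[name].mean():.3f} "
          f"(+/- {cv_scores[name].std():.3f})")
```

Because class weighting changes only the loss function, no samples are created or discarded, which makes it a convenient first technique to try before resampling.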