# Data Resampling Algorithms for Imbalanced Data

## Introduction
When one class heavily outnumbers another, a machine learning model can score well on overall accuracy while rarely identifying the minority class. This notebook explores resampling techniques, both oversampling and undersampling, that balance the classes in a dataset. We use `imbalanced-learn` together with `pandas` and `scikit-learn`.
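
To make the problem concrete, here is a minimal sketch using only NumPy. The labels are hypothetical (900 majority and 100 minority samples, not the dataset built later in this notebook), and the "model" simply predicts the majority class every time.

```python
import numpy as np

# Hypothetical labels: 900 majority (0) and 100 minority (1) samples.
y_true = np.array([0] * 900 + [1] * 100)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good because most samples are class 0.
accuracy = np.mean(y_pred == y_true)

# Recall on the minority class: share of true class-1 samples predicted as 1.
minority_recall = np.mean(y_pred[y_true == 1] == 1)

print(f"accuracy:        {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy is high even though no minority sample is ever detected, which is why the sections below report per-class metrics rather than accuracy alone.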

## Setup
First, we need to install the required libraries and import them.


In [None]:
# Install imbalanced-learn if not already installed
!pip install -U imbalanced-learn

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

# Set style for plots
sns.set(style='whitegrid')


## Create a Synthetic Imbalanced Dataset
We'll create a synthetic dataset with an imbalanced class distribution using `make_classification`.


In [None]:
# Create a synthetic imbalanced dataset
X, y = make_classification(n_classes=2, n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)

# Convert to DataFrame for easier handling
data = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
data['target'] = y

# Display the class distribution
class_counts = data['target'].value_counts()
sns.barplot(x=class_counts.index, y=class_counts.values)
plt.title('Class Distribution Before Resampling')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.show()


## Resampling Techniques
We will explore three resampling techniques: SMOTE (Synthetic Minority Over-sampling Technique), Random Under-sampling, and SMOTE followed by ENN (Edited Nearest Neighbors).


### 1. SMOTE
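SMOTE balances the class distribution by generating synthetic minority-class samples. Each new sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority-class neighbors.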
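
The interpolation idea can be sketched in a few lines of NumPy. This is an illustrative simplification, not the `imbalanced-learn` implementation: the function name `smote_sketch`, its parameters, and the toy data are all assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k neighbours for each new row.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Random position along the segment from the base sample to the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 points in 2D (hypothetical data).
X_min = np.random.default_rng(42).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=80)
print(X_new.shape)  # (80, 2)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already occupied by the minority class.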


In [None]:
# Split the dataset into training and testing sets
# (stratify=y keeps the class ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Display the class distribution after SMOTE
resampled_class_counts = pd.Series(y_resampled).value_counts()
sns.barplot(x=resampled_class_counts.index, y=resampled_class_counts.values)
plt.title('Class Distribution After SMOTE')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.show()

# Train a Random Forest Classifier
rf_smote = RandomForestClassifier(random_state=42)
rf_smote.fit(X_resampled, y_resampled)

# Evaluate the model
y_pred_smote = rf_smote.predict(X_test)
print('Classification Report for SMOTE:\n', classification_report(y_test, y_pred_smote))
print('Confusion Matrix for SMOTE:\n', confusion_matrix(y_test, y_pred_smote))


### 2. Random Under-sampling
Random under-sampling balances the dataset by randomly discarding majority-class samples. It is simple and fast, but the discarded samples may carry useful information.
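
As a rough illustration of the idea, the sketch below keeps a random subset of each class equal in size to the smallest class. The helper `random_undersample` and the toy arrays are hypothetical and exist only for this example; `RandomUnderSampler` provides more options.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()

    # Sample n_keep row indices per class, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 90 majority rows and 10 minority rows (hypothetical).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))  # [10 10]
```

All minority rows are kept, while 80 of the 90 majority rows are dropped in this toy case.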


In [None]:
# Apply Random Under-sampling
under_sampler = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = under_sampler.fit_resample(X_train, y_train)

# Display the class distribution after under-sampling
resampled_under_class_counts = pd.Series(y_resampled_under).value_counts()
sns.barplot(x=resampled_under_class_counts.index, y=resampled_under_class_counts.values)
plt.title('Class Distribution After Random Under-sampling')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.show()

# Train a Random Forest Classifier
rf_under = RandomForestClassifier(random_state=42)
rf_under.fit(X_resampled_under, y_resampled_under)

# Evaluate the model
y_pred_under = rf_under.predict(X_test)
print('Classification Report for Random Under-sampling:\n', classification_report(y_test, y_pred_under))
print('Confusion Matrix for Random Under-sampling:\n', confusion_matrix(y_test, y_pred_under))


### 3. SMOTE + Edited Nearest Neighbors (ENN)
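SMOTE followed by ENN first generates synthetic minority samples and then removes samples whose class label disagrees with the labels of their nearest neighbors. In `SMOTEENN`, this cleaning step is applied to both classes by default, so the resulting class counts are usually close to balanced rather than exactly equal.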
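
The cleaning step can be sketched as follows. This is a simplified version that uses a majority vote among the k nearest neighbors; `enn_sketch` and the toy data are assumptions made for illustration. The `EditedNearestNeighbours` class in `imbalanced-learn` also offers a stricter rule through its `kind_sel` parameter, so its output may differ.

```python
import numpy as np

def enn_sketch(X, y, k=3):
    """Drop samples whose label differs from the majority label of their k nearest neighbours."""
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Fraction of each sample's neighbours that share its label.
    agree = (y[neighbours] == y[:, None]).mean(axis=1)
    keep = agree > 0.5
    return X[keep], y[keep]

# Two well-separated clusters, plus one class-1 point sitting inside the class-0 cluster.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

X_clean, y_clean = enn_sketch(X, y)
print(y_clean)  # [0 0 0 0 1 1 1 1]
```

Only the point at (0.5, 0.5) is removed, because all of its nearest neighbors belong to the other class.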


In [None]:
# Apply SMOTE + ENN
smote_enn = SMOTEENN(random_state=42)
X_resampled_enn, y_resampled_enn = smote_enn.fit_resample(X_train, y_train)

# Display the class distribution after SMOTE + ENN
resampled_enn_class_counts = pd.Series(y_resampled_enn).value_counts()
sns.barplot(x=resampled_enn_class_counts.index, y=resampled_enn_class_counts.values)
plt.title('Class Distribution After SMOTE + ENN')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.show()

# Train a Random Forest Classifier
rf_enn = RandomForestClassifier(random_state=42)
rf_enn.fit(X_resampled_enn, y_resampled_enn)

# Evaluate the model
y_pred_enn = rf_enn.predict(X_test)
print('Classification Report for SMOTE + ENN:\n', classification_report(y_test, y_pred_enn))
print('Confusion Matrix for SMOTE + ENN:\n', confusion_matrix(y_test, y_pred_enn))


## Conclusion
In this notebook, we explored three resampling techniques for handling imbalanced datasets: SMOTE, Random Under-sampling, and SMOTE followed by ENN. SMOTE keeps every original sample and adds synthetic minority samples, random under-sampling is simple but discards majority-class data, and SMOTE + ENN additionally removes samples whose labels disagree with those of their neighbors.

Make sure to choose the resampling technique based on the specific context and requirements of your dataset.
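
One way to compare techniques is to read minority-class precision, recall, and F1 off the confusion matrices printed in the sections above. The sketch below assumes the row/column layout used by `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes). The helper `minority_metrics` is hypothetical, and the two matrices are made-up numbers, not results from this notebook.

```python
import numpy as np

def minority_metrics(cm):
    """Precision, recall and F1 for class 1 from a 2x2 confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up confusion matrices, for illustration only.
examples = {
    "method A": [[260, 10], [12, 18]],
    "method B": [[230, 40], [5, 25]],
}
for name, cm in examples.items():
    p, r, f = minority_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

A method with higher minority recall often pays for it with lower precision, so which trade-off is acceptable depends on the relative cost of missed positives and false alarms in your application.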