## Imbalanced Learning

Imbalanced learning refers to the situation in a classification problem where the distribution of classes in the training data is not balanced, meaning one class has significantly more samples than the other(s). In such cases, the model can become biased towards the majority class, leading to poorer performance in correctly predicting the minority class.

Imbalanced learning can cause several issues:

- Biased Predictions: Models tend to predict the majority class more often than they should, so minority-class instances are frequently mislabeled.

- Low Recall for Minority Class: The model might have a high accuracy due to correctly classifying the majority class but might fail to identify most instances of the minority class, resulting in low recall for the minority class.

- Misleading Model Evaluation: Accuracy alone can be misleading on imbalanced datasets, as it can be high even if the model only ever predicts the majority class.
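
To make the last point concrete, here is a minimal sketch using made-up labels (990 majority and 10 minority samples, chosen purely for illustration). A "model" that always predicts the majority class scores 99% accuracy while finding none of the minority samples.

```python
import numpy as np

# Hypothetical labels: 990 majority (0) and 10 minority (1) samples.
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = np.mean(y_pred == y_true)

# Minority recall: fraction of true minority samples that were found.
minority_mask = y_true == 1
minority_recall = np.mean(y_pred[minority_mask] == 1)

print(f"Accuracy: {accuracy:.2f}")                # Accuracy: 0.99
print(f"Minority recall: {minority_recall:.2f}")  # Minority recall: 0.00
```

For this reason it is usually worth reporting per-class precision, recall, and F1 (for example with `sklearn.metrics.classification_report`) alongside accuracy.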

Techniques for addressing class imbalance can be broadly grouped into two types: data-level techniques, which resample the training data, and algorithm-level techniques, which change how the model is trained.

### Data Level Techniques

#### Under-Sampling:

- Definition: Removing samples from the majority class to balance the class distribution. This can be done randomly or with more sophisticated methods like Cluster Centroids and Tomek links.

- Purpose: Helps reduce the dominance of the majority class and can prevent the model from being biased towards it.

- Pros: Simple and computationally efficient, can lead to faster training times.

- Cons: May discard informative majority-class samples, which can hurt performance, especially when data is limited.
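
The random variant mentioned in the definition can be sketched in a few lines of NumPy. The helper `random_under_sample` and the toy arrays below are hypothetical, written only for illustration; imbalanced-learn provides `RandomUnderSampler` for real use.

```python
import numpy as np

def random_under_sample(X, y, random_state=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=n_keep, replace=False)
        for cls in classes
    ])
    keep_idx.sort()  # preserve the original row order
    return X[keep_idx], y[keep_idx]

# Toy data: 95 majority rows and 5 minority rows (illustrative only).
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = random_under_sample(X_toy, y_toy)
print(np.bincount(y_bal))  # [5 5]
```

Note how 90 of the 95 majority rows are thrown away, which is exactly the information loss listed under the cons.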

In [39]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.combine import SMOTEENN
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

In [30]:
# Generate a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.99, 0.01], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

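Cluster Centroids under-samples by replacing majority-class samples with the centroids of K-means clusters fitted to them. It can be an effective technique when the majority class forms reasonably compact clusters that a small number of centroids can summarize.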

In [31]:
# Create the Cluster Centroids under-sampling object
cluster_centroids = ClusterCentroids(random_state=42)

# Apply under-sampling on the training set
X_train_resampled, y_train_resampled = cluster_centroids.fit_resample(X_train, y_train)

# Check class distribution before and after under-sampling
# print("Class distribution before under-sampling:", pd.Series(y_train).value_counts())
# print("Class distribution after under-sampling:", pd.Series(y_train_resampled).value_counts())

# Create a RandomForestClassifier and train on the resampled data
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after under-sampling:", accuracy)



Accuracy after under-sampling: 0.705


A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. Removing the majority-class member of each link is effective when majority and minority instances overlap, particularly at the class borders.

In [40]:
# Create the Tomek Links under-sampling object
tomek_links = TomekLinks()

# Apply under-sampling on the training set
X_train_resampled, y_train_resampled = tomek_links.fit_resample(X_train, y_train)

# Create a RandomForestClassifier and train on the resampled data
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after Tomek Links under-sampling:", accuracy)

Accuracy after Tomek Links under-sampling: 0.985
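
To show what a Tomek link is, the sketch below finds pairs of opposite-class samples that are mutual nearest neighbors. The function `find_tomek_links` and the 1-D toy data are hypothetical and use a brute-force distance matrix, so this is only suitable for small arrays.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    # Pairwise squared Euclidean distances; the diagonal is masked out.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy 1-D data (illustrative): points 2 and 3 sit on the class border.
X_toy = np.array([[0.0], [1.0], [2.9], [3.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

links = find_tomek_links(X_toy, y_toy)
print(links)  # [(2, 3)]
```

Because only border pairs are removed, Tomek Links typically deletes few samples and leaves the class ratio almost unchanged, so it acts more as a cleaning step than as a balancing step.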


#### Over-Sampling:

- Definition: Creating additional samples for the minority class to balance the class distribution.

- Purpose: Increase the representation of the minority class and improve the model's performance.

- Pros: Helps the model learn from the minority class without discarding any data; can be done by simple duplication or with synthetic methods like SMOTE and ADASYN.

- Cons: May lead to overfitting, synthetic samples might not fully represent the true distribution.

In [32]:
# Create the SMOTE over-sampling object
smote = SMOTE(random_state=42)

# Apply over-sampling on the training set
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Check class distribution before and after over-sampling
# print("Class distribution before over-sampling:", pd.Series(y_train).value_counts())
# print("Class distribution after over-sampling:", pd.Series(y_train_resampled).value_counts())

# Create a RandomForestClassifier and train on the resampled data
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after over-sampling:", accuracy)

Accuracy after over-sampling: 0.97
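
The core idea behind SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as follows. The helper `smote_like_samples` is a hypothetical simplification and not imbalanced-learn's implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, random_state=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(random_state)
    # Pairwise distances within the minority class only.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # k nearest minority neighbors

    base = rng.integers(0, len(X_min), size=n_new)            # random starting points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor for each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority points at the corners of the unit square (illustrative only).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_min, n_new=6)
print(X_new.shape)  # (6, 2)
```

Every synthetic point lies on a line segment between two existing minority points, which is why SMOTE cannot create samples outside the region the minority class already covers.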


#### Combined Over-Under Sampling:

- Definition: Combining over-sampling and under-sampling to balance the class distribution.

- Purpose: Take advantage of both methods to improve performance.

- Pros: Overcomes limitations of individual techniques, can lead to better results.

- Cons: Requires careful tuning and experimentation.

SMOTEENN is effective when the dataset is both imbalanced and noisy: SMOTE over-samples the minority class, and Edited Nearest Neighbors then removes samples that disagree with their neighbors and could interfere with learning.

In [33]:
# Create the SMOTEENN object, which combines SMOTE and Edited Nearest Neighbors
smote_enn = SMOTEENN(random_state=42)

# Apply combined under-sampling and over-sampling on the training set
X_train_resampled, y_train_resampled = smote_enn.fit_resample(X_train, y_train)

# Check class distribution before and after combined resampling
# print("Class distribution before combined resampling:", pd.Series(y_train).value_counts())
# print("Class distribution after combined resampling:", pd.Series(y_train_resampled).value_counts())

# Create a RandomForestClassifier and train on the resampled data
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after combined resampling:", accuracy)

Accuracy after combined resampling: 0.97
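
The cleaning half of SMOTEENN can be illustrated with a simplified Edited Nearest Neighbors step. The function below is a hypothetical majority-vote variant for binary 0/1 labels; imbalanced-learn's version has more options and may apply a stricter agreement rule by default.

```python
import numpy as np

def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their k nearest neighbors."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # For binary 0/1 labels, the neighbor vote is 1 when most neighbors are labeled 1.
    votes = (y[neighbors].mean(axis=1) > 0.5).astype(int)
    keep = votes == y
    return X[keep], y[keep]

# Two clean 1-D clusters plus one noisy class-1 point at 1.5 inside the class-0 cluster.
X_toy = np.array([0, 1, 2, 3, 10, 11, 12, 13, 1.5], dtype=float).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

X_clean, y_clean = edited_nearest_neighbours(X_toy, y_toy)
print(y_clean)  # [0 0 0 0 1 1 1 1]
```

Only the noisy point is removed; the two clusters are left intact.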


#### Synthetic Data Generation:

- Definition: Creating artificial data to address limited data availability and improve model performance.

- Purpose: Mitigate class imbalance by providing more data for the minority class.

- Pros: Potential for better generalization and improved model performance.

- Cons: Risk of overfitting if not handled carefully.

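SMOTE creates synthetic minority samples by interpolating between a minority point and its nearest minority neighbors, treating all minority points alike. ADASYN uses the same interpolation but generates more samples around minority points that have many majority-class neighbors, which can help when the decision boundary is complex but may also amplify noise.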

In [34]:
# Create the SMOTE over-sampling object
smote = SMOTE(random_state=42)

# Create the ADASYN over-sampling object
adasyn = ADASYN(random_state=42)

# Apply SMOTE to generate synthetic data for the minority class
X_train_resampled_smote, y_train_resampled_smote = smote.fit_resample(X_train, y_train)

# Apply ADASYN to generate synthetic data for the minority class
X_train_resampled_adasyn, y_train_resampled_adasyn = adasyn.fit_resample(X_train, y_train)

# Check class distribution before and after over-sampling with SMOTE
# print("Class distribution before SMOTE over-sampling:", pd.Series(y_train).value_counts())
# print("Class distribution after SMOTE over-sampling:", pd.Series(y_train_resampled_smote).value_counts())

# Check class distribution before and after over-sampling with ADASYN
# print("Class distribution before ADASYN over-sampling:", pd.Series(y_train).value_counts())
# print("Class distribution after ADASYN over-sampling:", pd.Series(y_train_resampled_adasyn).value_counts())

# Create a RandomForestClassifier and train on the resampled data using SMOTE
rf_model_smote = RandomForestClassifier(random_state=42)
rf_model_smote.fit(X_train_resampled_smote, y_train_resampled_smote)

# Create a RandomForestClassifier and train on the resampled data using ADASYN
rf_model_adasyn = RandomForestClassifier(random_state=42)
rf_model_adasyn.fit(X_train_resampled_adasyn, y_train_resampled_adasyn)

# Make predictions on the test set using SMOTE model
y_pred_smote = rf_model_smote.predict(X_test)

# Make predictions on the test set using ADASYN model
y_pred_adasyn = rf_model_adasyn.predict(X_test)

# Calculate accuracy for SMOTE model
accuracy_smote = accuracy_score(y_test, y_pred_smote)
print("Accuracy after SMOTE over-sampling:", accuracy_smote)

# Calculate accuracy for ADASYN model
accuracy_adasyn = accuracy_score(y_test, y_pred_adasyn)
print("Accuracy after ADASYN over-sampling:", accuracy_adasyn)

Accuracy after SMOTE over-sampling: 0.97
Accuracy after ADASYN over-sampling: 0.975
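
What distinguishes ADASYN from SMOTE is how it decides where to generate samples. The sketch below computes, for each minority point, the share of the synthetic budget it would receive, based on the fraction of majority samples among its k nearest neighbors. The function `adasyn_allocation` and the toy data are hypothetical simplifications for illustration.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=3):
    """Share of synthetic samples an ADASYN-style weighting gives each minority point."""
    minority_idx = np.flatnonzero(y == minority_label)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Fraction of each minority point's k neighbors that are majority samples.
    ratio = np.mean(y[neighbors[minority_idx]] != minority_label, axis=1)
    total = ratio.sum()
    # Normalize so the shares sum to 1 (uniform if no minority point has majority neighbors).
    return ratio / total if total > 0 else np.full(len(ratio), 1 / len(ratio))

# Majority points at 0..3; one minority point at the border (3.5) and a safe cluster at 9..11.
X_toy = np.array([0, 1, 2, 3, 3.5, 9, 10, 11], dtype=float).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

shares = adasyn_allocation(X_toy, y_toy)
print(shares)
```

In this toy case the whole budget goes to the border point at 3.5, because the three points in the safe cluster have only minority neighbors. SMOTE, by contrast, would spread new samples across all four minority points.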


### Algorithm Level Techniques

#### Ensemble Methods:

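- Definition: Combine several models, often each trained on a differently resampled or reweighted view of the data, to address class imbalance and improve performance.

- Purpose: Improve minority-class predictions by aggregating the votes of many models rather than relying on a single one.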

- Pros: Effective in handling class imbalance and improving overall performance.

- Cons: May increase computational complexity and resource requirements.
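
One way to build such an ensemble is balanced bagging: each base model sees all minority rows plus an equally sized random subset of majority rows, and the predictions are averaged. The sketch below is a hand-rolled illustration under assumed settings (25 trees, depth 5, a fresh synthetic dataset); imbalanced-learn ships ready-made versions such as `BalancedBaggingClassifier` and `BalancedRandomForestClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset with roughly 5% minority samples.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_tr == 1)
majority_idx = np.flatnonzero(y_tr == 0)

trees = []
for _ in range(25):
    # Each tree sees all minority rows plus an equally sized random majority subset.
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([minority_idx, sampled_majority])
    trees.append(DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr[idx], y_tr[idx]))

# Average the per-tree minority probabilities and threshold at 0.5.
proba = np.mean([tree.predict_proba(X_te)[:, 1] for tree in trees], axis=0)
y_pred = (proba >= 0.5).astype(int)

recall = recall_score(y_te, y_pred)
print("Minority recall:", recall)
```

Because every tree draws a different majority subset, the ensemble as a whole still sees most of the majority data, which softens the information loss of plain under-sampling.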

#### Class Weighting:

- Definition: Assigning higher weights to the minority class and lower weights to the majority class during training.

- Purpose: Adjust impact of different classes on the model's loss function.

- Pros: Simple to implement, no changes to the dataset.

- Cons: Not as effective in severe class imbalance, performance improvement may be limited.

In [35]:
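# Calculate class weights inversely proportional to class frequency,
# so that the minority class receives the larger weight
class_counts = np.bincount(y_train)
class_weight = {cls: len(y_train) / (len(class_counts) * class_counts[cls]) for cls in np.unique(y_train)}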

# Create a DecisionTreeClassifier with class weights
dt_model = DecisionTreeClassifier(class_weight=class_weight, random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with class weighting:", accuracy)

Accuracy with class weighting: 0.965
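
A common weighting heuristic is inverse class frequency, `n_samples / (n_classes * class_count)`, which is what scikit-learn computes for `class_weight="balanced"`. The sketch below checks the manual formula against `compute_class_weight` on made-up labels (90 majority, 10 minority).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 majority (0) and 10 minority (1) samples.
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_toy)

# Manual "balanced" heuristic: n_samples / (n_classes * count of each class).
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# The same heuristic via scikit-learn.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(dict(zip(classes.tolist(), manual.tolist())))
print(np.allclose(manual, balanced))  # True
```

The minority class ends up with a weight of 5.0 versus about 0.56 for the majority class. Most scikit-learn classifiers that accept `class_weight` also accept the string `"balanced"` directly, which avoids computing the dictionary by hand.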


#### Cost-Sensitive Learning:

- Definition: Considering varying misclassification costs during training.

- Purpose: Optimize model performance in scenarios with imbalanced misclassification costs.

- Pros: Better decision-making in applications with imbalanced costs.

- Cons: Requires accurate estimation of misclassification costs.

In [36]:
# Define misclassification costs for each class
class_costs = {0: 1.0, 1: 10.0}

# Create a DecisionTreeClassifier with class weights based on misclassification costs
dt_model = DecisionTreeClassifier(class_weight=class_costs, random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with cost-sensitive learning:", accuracy)

Accuracy with cost-sensitive learning: 0.975
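
Another way to apply misclassification costs is at prediction time, by moving the decision threshold. Assuming the model outputs reasonably calibrated probabilities, predicting the positive class is cheaper in expectation whenever `p > cost_fp / (cost_fp + cost_fn)`. The helper below is a hypothetical sketch; it reuses the 1:10 cost ratio from the cell above, and the probabilities are made up.

```python
import numpy as np

def cost_sensitive_predict(proba_pos, cost_fp=1.0, cost_fn=10.0):
    """Predict 1 whenever the expected cost of predicting 0 exceeds that of predicting 1."""
    # Expected cost of predicting 1 is (1 - p) * cost_fp; of predicting 0 it is p * cost_fn.
    # Predicting 1 is therefore cheaper when p > cost_fp / (cost_fp + cost_fn).
    threshold = cost_fp / (cost_fp + cost_fn)
    return (proba_pos > threshold).astype(int), threshold

# Hypothetical predicted probabilities for the minority (positive) class.
proba = np.array([0.02, 0.08, 0.10, 0.30, 0.60])
labels, threshold = cost_sensitive_predict(proba)
print(round(threshold, 4), labels)  # 0.0909 [0 0 1 1 1]
```

With a false negative costing ten times as much as a false positive, the threshold drops from 0.5 to about 0.09, so samples with even a modest minority probability are flagged.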


#### Anomaly Detection:

- Definition: Treat the minority class as rare or unusual data points and identify them as deviations from a model of the normal (majority) data, using methods such as OneClassSVM and IsolationForest.

- Purpose: Detect anomalies or outliers despite class imbalance.

- Pros: Useful for identifying rare events or anomalies with low representation.

- Cons: May struggle to distinguish genuine anomalies from novel patterns, especially in unsupervised anomaly detection.

One-Class SVM is particularly useful in unsupervised scenarios where no labeled anomalies are available for training. However, it can struggle in high-dimensional spaces due to the curse of dimensionality.

In [37]:
# Treat the minority class as an anomaly (class 1 is the minority class)
anomaly_mask = y_train == 1

# Create a One-Class SVM classifier
svm_model = OneClassSVM()

# Train the model on the majority class (normal instances)
svm_model.fit(X_train[~anomaly_mask])

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Map scikit-learn's output (1 for inliers, -1 for outliers) to class labels (0 for normal, 1 for anomaly)
y_pred[y_pred == 1] = 0  # Normal instances
y_pred[y_pred == -1] = 1  # Anomalies

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with anomaly detection:", accuracy)

Accuracy with anomaly detection: 0.455
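
One plausible contributor to the low accuracy is the `nu` parameter, which scikit-learn documents as an upper bound on the fraction of training errors and defaults to 0.5. With the default, roughly half of the normal training rows can end up flagged as outliers. The sketch below compares two `nu` values on random Gaussian data standing in for the majority class; the data and the value 0.05 are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 5))  # stand-in for majority-class training rows

flagged_by_nu = {}
for nu in (0.5, 0.05):
    model = OneClassSVM(nu=nu).fit(X_normal)
    # predict returns -1 for outliers and 1 for inliers.
    flagged_by_nu[nu] = np.mean(model.predict(X_normal) == -1)
    print(f"nu={nu}: {flagged_by_nu[nu]:.0%} of training rows flagged as outliers")
```

Setting `nu` close to the expected anomaly rate is a reasonable starting point, though the best value depends on the data and should be checked with a metric other than accuracy.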


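Isolation Forest is generally faster and more scalable than One-Class SVM, especially for high-dimensional data, and usually requires little hyperparameter tuning. Its anomaly scores, based on how quickly a point is isolated by random splits, are also comparatively easy to interpret.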

In [38]:
# Treat the minority class as an anomaly (class 1 is the minority class)
anomaly_mask = y_train == 1

# Create an Isolation Forest classifier
if_model = IsolationForest()

# Train the model on the majority class (normal instances)
if_model.fit(X_train[~anomaly_mask])

# Make predictions on the test set
y_pred = if_model.predict(X_test)

# Map scikit-learn's output (1 for inliers, -1 for outliers) to class labels (0 for normal, 1 for anomaly)
y_pred[y_pred == 1] = 0  # Normal instances
y_pred[y_pred == -1] = 1  # Anomalies

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Isolation Forest:", accuracy)

Accuracy with Isolation Forest: 0.96
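
Besides hard labels, Isolation Forest exposes a continuous anomaly score through `score_samples`, where lower values mean more abnormal. The sketch below fits a forest on random Gaussian data (an assumed stand-in for the majority class) and scores one typical point and one obviously distant point.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 5))  # stand-in for majority-class training rows

forest = IsolationForest(random_state=0).fit(X_normal)

# One point at the center of the training data and one far away from it.
X_query = np.vstack([np.zeros(5), np.full(5, 8.0)])
scores = forest.score_samples(X_query)
print(scores)  # the second (distant) point receives the lower score
```

Ranking samples by this score lets you choose a threshold that matches the expected anomaly rate instead of relying on the default cut-off. Setting `random_state` also makes the results reproducible between runs.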
