# 1. Data Balancing

Data balancing addresses the problem of imbalanced datasets where some classes have many more samples than others. Imbalanced data can cause machine learning models to be biased toward majority classes, resulting in poor predictive performance on minority classes. Balancing techniques aim to create a more even class distribution to improve model fairness and accuracy.
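
Before balancing anything, it helps to quantify how skewed the classes are. The sketch below is a minimal illustration using only the standard library; the helper name `class_distribution` and the 90/10 label split are assumptions made for the example, not part of the original notebook.

```python
from collections import Counter

def class_distribution(labels):
    """Return per-class counts and the majority-to-minority imbalance ratio."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority

# Hypothetical labels: 90 negatives and 10 positives
labels = [0] * 90 + [1] * 10
counts, ratio = class_distribution(labels)

print(counts)
print(f"Imbalance ratio: {ratio:.1f}")
```

A ratio close to 1 means the classes are roughly balanced; the larger the ratio, the more the majority class dominates.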

# 2. Why Data Balancing is Important

When classes are imbalanced, models tend to favour the majority class, leading to misleadingly high overall accuracy but poor recall or precision for minority classes. This is critical in fields like medical diagnosis or fraud detection, where minority class detection is vital.
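
The effect can be seen with a small worked example. The labels below are hypothetical (95 negatives, 5 positives), and the "model" is a deliberately degenerate one that always predicts the majority class.

```python
# Hypothetical ground truth: 95 negatives (class 0) and 5 positives (class 1)
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of true positives that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.2f}")       # Accuracy: 0.95
print(f"Minority recall: {recall:.2f}")  # Minority recall: 0.00
```

The model scores 95% accuracy while detecting none of the minority cases, which is why accuracy alone is a poor guide on imbalanced data.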

# 3. Common Techniques for Data Balancing

## 3.1 Random Undersampling

Random undersampling removes randomly chosen samples from the majority class until its count matches the minority class.  
It is simple and fast, but it may discard useful information.

In [None]:
%pip install imbalanced-learn

In [8]:
import pandas as pd
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Create a simple imbalanced dataset: 14 samples of class 0, 6 of class 1
# (six minority samples are enough for the default k_neighbors=5 of the SMOTE cell further down)
data = {
    'feature1': list(range(1, 21)),
    'feature2': list(range(20, 0, -1)),
    'target':   [0] * 14 + [1] * 6  # Class 0 is the majority
}

df = pd.DataFrame(data)

X = df.drop('target', axis=1)
y = df['target']

print("Original class distribution:", Counter(y))

# By default, every class except the minority is reduced to the minority count
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

print("Resampled class distribution:", Counter(y_resampled))

# Rebuild a DataFrame from the resampled features and labels
resampled_df = pd.DataFrame(X_resampled, columns=X.columns)
resampled_df['target'] = y_resampled.values

print("Resampled shape:", resampled_df.shape)


Original class distribution: Counter({0: 14, 1: 6})
Resampled class distribution: Counter({0: 6, 1: 6})
Resampled shape: (12, 3)


## 3.2 Random Oversampling

Random oversampling duplicates randomly chosen samples from the minority class until its count matches the majority class.  
It is easy to implement, but the repeated samples can cause the model to overfit.

In [9]:
from imblearn.over_sampling import RandomOverSampler

# Duplicate randomly chosen minority-class rows of X, y (defined in the cell above)
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
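
To make the duplication explicit, here is a minimal from-scratch sketch of the idea using only the standard library. It is not how imbalanced-learn implements `RandomOverSampler`; the function name `random_oversample`, the seed, and the toy rows are assumptions for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # All original rows belonging to this class
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data: 6 rows of class 0 and 2 rows of class 1
rows = [(1, 10), (2, 9), (3, 8), (4, 7), (5, 6), (6, 5), (7, 4), (8, 3)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
print(len(set(new_rows)), "distinct rows out of", len(new_rows))
```

The class counts become equal, but the number of distinct rows does not grow. The model sees the same minority points repeatedly, which is where the overfitting risk comes from.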

## 3.3 SMOTE (Synthetic Minority Over-sampling Technique)

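SMOTE generates synthetic minority class examples by interpolating between an existing minority sample and one of its nearest minority neighbours.  
Because the new points are not exact duplicates, it tends to overfit less than random oversampling, although it can create noisy samples where classes overlap.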

In [10]:
from imblearn.over_sampling import SMOTE

# The default k_neighbors=5 requires at least 6 samples in the minority class
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
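
The interpolation step can be sketched from scratch with the standard library. This is a simplified illustration of the idea, not imbalanced-learn's implementation; the function name `smote_sketch`, the choice of `k=2`, the seed, and the toy points are all assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class points; all of them lie on the line x + y = 11
minority = [(6.0, 5.0), (7.0, 4.0), (8.0, 3.0), (9.0, 2.0)]
new_points = smote_sketch(minority, n_new=3)
for point in new_points:
    print(point)
```

Each synthetic point falls between two existing minority points, so here it also lies on the line x + y = 11. This is why SMOTE suits numeric features where such in-between values are meaningful.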

## 3.4 Tomek Links and Edited Nearest Neighbours

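Both are undersampling methods that clean overlapping or noisy samples near class boundaries. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour, and Edited Nearest Neighbours (ENN) removes samples whose label disagrees with their nearest neighbours.  
Removing such samples can improve class separability and data quality.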
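
As a rough illustration of the Tomek-link idea, the sketch below finds pairs of opposite-class points that are each other's nearest neighbour and drops the majority-class member of each pair. It uses only the standard library and a brute-force search; the function name `tomek_links` and the 1-D toy data are assumptions, and ENN is not shown.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nearest(i):
        # Brute-force nearest neighbour of point i among all other points
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        # i < j avoids reporting the same pair twice
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Hypothetical 1-D data: class 0 is the majority; 4.0 and 4.2 straddle the boundary
points = [(0.0,), (1.0,), (2.2,), (4.0,), (4.2,), (6.0,)]
labels = [0, 0, 0, 0, 1, 1]
majority_class = 0

links = tomek_links(points, labels)
# Drop the majority-class member of every link
drop = {i if labels[i] == majority_class else j for i, j in links}
kept = [i for i in range(len(points)) if i not in drop]

print(links)  # [(3, 4)]
print(kept)   # [0, 1, 2, 4, 5]
```

Only the majority point sitting right next to a minority point is removed, which widens the gap between the classes without touching the rest of the data.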

# 4. When to Use Each Technique

| Technique            | Use Case                                                   | Pros                                     | Cons                                      |
| -------------------- | ---------------------------------------------------------- | ---------------------------------------- | ----------------------------------------- |
| Random Undersampling | Large datasets where majority-class samples can be spared  | Simple, fast                             | May lose important data                   |
| Random Oversampling  | Small datasets where no samples can be spared              | Easy to implement                        | Risk of overfitting to duplicated samples |
| SMOTE                | Numeric features where interpolated samples are plausible  | Creates new samples, reduces overfitting | More complex, may create noise            |
| Tomek Links / ENN    | Noisy or overlapping class boundaries                      | Improves data quality                    | May remove borderline cases               |

# 5. Summary

Data balancing helps train fair and accurate models on imbalanced datasets. The right method depends on the dataset size, the domain, and the tolerance for overfitting or data loss. SMOTE is widely used because it adds minority samples without exact duplication, though it is more complex and may create noisy samples.