#### Problems of imbalanced datasets

* An imbalanced dataset is one in which the target classes are not represented equally: one class is significantly more common than the others. The more common class is called the majority class, and the less common class is called the minority class.
* Imbalanced datasets are common in real-world problems such as credit card fraud detection, rare disease diagnosis, and anomaly detection. They can produce misleading accuracy scores, overfitting, and biased predictions, because many machine learning algorithms minimize overall error and therefore tend to favor the majority class at the expense of the minority class.
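
To make the accuracy problem concrete, here is a minimal sketch that assumes a hypothetical 900/100 label split and a trivial "model" that always predicts the majority class. Both the labels and the baseline are illustrative assumptions, not part of any real dataset.

```python
import numpy as np

# Hypothetical labels: 900 majority (0) and 100 minority (1) samples
y_true = np.array([0] * 900 + [1] * 100)

# A trivial baseline that ignores the features and always predicts class 0
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: fraction of true 1s that were predicted as 1
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.90
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The baseline looks accurate while never detecting a single minority sample, which is why accuracy alone is a poor metric on imbalanced data.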

#### Two techniques to overcome this:

* Upsampling
* Downsampling

In [2]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Class sizes for a two-class dataset with a 9:1 imbalance
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:
n_class_0,n_class_1

(900, 100)

In [4]:
# Build one dataframe per class; class 1 is centered at 2 instead of 0
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [6]:
df = pd.concat([class_0, class_1]).reset_index(drop=True)

In [7]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

### Upsampling

In [8]:
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [9]:
from sklearn.utils import resample

# Sample the minority class with replacement until it matches the majority class size
df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42,
)

In [10]:
df_minority_upsampled.shape

(900, 3)

In [11]:
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [14]:
df_upsampled['target'].value_counts()

0    900
1    900
Name: target, dtype: int64
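
Random upsampling balances the counts but adds no new information: every extra row is a copy of an existing minority row. The sketch below checks this on a freshly generated minority class. The data is regenerated here only so the block runs on its own, so the values are an assumption and will differ from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Illustrative minority class of 100 rows (regenerated so this block is self-contained)
rng = np.random.default_rng(123)
df_minority = pd.DataFrame({
    'feature_1': rng.normal(loc=2, scale=1, size=100),
    'feature_2': rng.normal(loc=2, scale=1, size=100),
    'target': 1,
})

# Upsample with replacement to 900 rows, as in the cell above
df_minority_upsampled = resample(
    df_minority, replace=True, n_samples=900, random_state=42
)

# Count how many distinct rows the upsampled frame really contains
n_rows = len(df_minority_upsampled)
n_unique = len(df_minority_upsampled.drop_duplicates())
print(f"{n_rows} rows, {n_unique} unique")
```

The number of unique rows can never exceed the original 100, which is the source of the overfitting risk discussed in the summary.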

### Downsampling

In [15]:
# Rebuild the imbalanced dataframe from class_0 and class_1 created above
df = pd.concat([class_0, class_1]).reset_index(drop=True)
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

In [16]:
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [17]:
from sklearn.utils import resample

# Sample the majority class without replacement down to the minority class size
df_majority_downsampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42,
)

In [19]:
df_majority_downsampled.shape

(100, 3)

In [20]:
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

In [21]:
df_downsampled['target'].value_counts()

1    100
0    100
Name: target, dtype: int64
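
As an alternative sketch, the same random downsampling can be written in one step with pandas alone, by drawing an equal number of rows from each class. This assumes pandas 1.1 or newer (for `groupby(...).sample`), and the data is regenerated here so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Illustrative imbalanced dataframe: 900 rows of class 0, 100 rows of class 1
rng = np.random.default_rng(123)
df = pd.DataFrame({
    'feature_1': np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    'feature_2': np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    'target': [0] * 900 + [1] * 100,
})

# Size of the smallest class
n_minority = df['target'].value_counts().min()

# Draw n_minority rows from every class without replacement
df_balanced = df.groupby('target').sample(n=n_minority, random_state=42)

print(df_balanced['target'].value_counts())
print(f"rows discarded: {len(df) - len(df_balanced)}")  # rows discarded: 800
```

The last line makes the cost explicit: 800 of the 900 majority rows are thrown away.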

# SUMMARY

<b> 1. Upsampling (Over-sampling the minority class) </b>

* When to use:

  * Small dataset: the dataset is small, and removing majority class samples (downsampling) would discard valuable information.
  * Preserving data: you want to keep every data point from the majority class.
  * Sufficient compute: your computational resources can handle the larger dataset, since upsampling adds duplicated or synthetic data points.
  * Severe imbalance: the minority class is very rare (e.g., 1:100 or less), and you want a more balanced class distribution.

* Methods:

  * Random oversampling: duplicate randomly chosen samples from the minority class.
  * SMOTE (Synthetic Minority Oversampling Technique): generate synthetic samples by interpolating between a minority sample and its nearest minority class neighbors.
  * ADASYN (Adaptive Synthetic Sampling): similar to SMOTE, but generates more synthetic data around minority samples that are harder to classify.

* Examples:

  * Fraud detection datasets with very few fraudulent transactions.
  * Medical datasets with rare disease cases.
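
The interpolation idea behind SMOTE can be sketched in plain NumPy. This is a simplified illustration, not the reference implementation from any library: the function name `smote_like`, its parameters, and the brute-force neighbor search are all assumptions made for the example.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point

    base = rng.integers(0, len(X_min), size=n_new)  # random base point per new sample
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one of its neighbors
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    # New point lies on the segment between the base point and the chosen neighbor
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Illustrative minority class: 100 points in 2 dimensions
rng = np.random.default_rng(123)
X_min = rng.normal(loc=2, scale=1, size=(100, 2))

X_new = smote_like(X_min, n_new=800)
print(X_new.shape)  # (800, 2)
```

Unlike random oversampling, the generated points are not exact copies, although they always lie between existing minority samples. In practice a maintained implementation such as the one in the imbalanced-learn package is usually preferable to a hand-written version.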

<b> 2. Downsampling (Under-sampling the majority class) </b>

* When to use:

  * Large dataset: the dataset is large enough that removing samples from the majority class won't cause significant information loss.
  * Memory constraints: computational resources are limited, and you want to reduce the dataset size.
  * Class ratio not extreme: the imbalance is moderate (e.g., 1:10), and you can afford to remove some majority class samples without losing valuable information.

* Methods:

  * Random under-sampling: remove randomly chosen samples from the majority class.
  * Cluster-based sampling: use clustering algorithms to reduce the majority class while retaining its diversity.
  * Tomek links: find pairs of samples from opposite classes that are each other's nearest neighbors, and remove the majority class sample of each pair. This cleans up the region near the decision boundary.

* Examples:

  * Spam classification datasets with many "non-spam" emails.
  * Social media datasets where positive interactions vastly outnumber negative interactions.
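
Tomek link removal can also be sketched in plain NumPy. This is a minimal illustration under stated assumptions: the helper name `tomek_link_majority_indices` is hypothetical, the neighbor search is brute force, and the data is generated only for the example.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority samples that belong to a Tomek link."""
    # Pairwise Euclidean distances; ignore each point's distance to itself
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nn = dists.argmin(axis=1)  # nearest neighbor of every sample

    idx = np.arange(len(X))
    mutual = nn[nn] == idx   # i and nn[i] are each other's nearest neighbor
    opposite = y != y[nn]    # and they carry different labels
    is_link = mutual & opposite
    return idx[is_link & (y == majority_label)]

# Illustrative imbalanced data: 900 majority points, 100 minority points
rng = np.random.default_rng(123)
X = np.vstack([rng.normal(0, 1, size=(900, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 900 + [1] * 100)

drop = tomek_link_majority_indices(X, y)
keep = np.setdiff1d(np.arange(len(X)), drop)
X_clean, y_clean = X[keep], y[keep]
print(f"{len(drop)} majority samples removed")
```

This method removes only a few boundary samples, so it cleans the class overlap rather than balancing the counts. It is often combined with another resampling step.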

<b> Key Considerations: </b>

* Risk of Overfitting (Upsampling): Random oversampling repeats the same minority samples, so the model may memorize them instead of learning patterns that generalize. SMOTE and its variants can mitigate this by generating new points instead of exact copies, although synthetic points may still not reflect the true minority distribution.

* Loss of Information (Downsampling): Downsampling reduces the dataset size, which may discard potentially useful data from the majority class.

* Combine Techniques: Sometimes, combining upsampling and downsampling can be effective. For instance, upsample the minority class to some extent and downsample the majority class to create a balanced dataset without excessive data duplication or loss.
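
A combined approach might look like the following sketch, which meets in the middle at a hypothetical target of 300 rows per class. The target size is an arbitrary assumption for illustration, and pandas `DataFrame.sample` is used here in place of `sklearn.utils.resample` to keep the block dependency-light.

```python
import numpy as np
import pandas as pd

# Illustrative imbalanced dataframe: 900 rows of class 0, 100 rows of class 1
rng = np.random.default_rng(123)
df = pd.DataFrame({
    'feature_1': np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    'feature_2': np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    'target': [0] * 900 + [1] * 100,
})

target_size = 300  # hypothetical per-class size, chosen only for this example

df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]

# Downsample the majority class without replacement
df_majority_down = df_majority.sample(n=target_size, replace=False, random_state=42)
# Upsample the minority class with replacement
df_minority_up = df_minority.sample(n=target_size, replace=True, random_state=42)

# Concatenate and shuffle so the classes are not stored in two contiguous blocks
df_combined = (
    pd.concat([df_majority_down, df_minority_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(df_combined['target'].value_counts())
```

Compared with pure upsampling to 900 rows, each minority row is repeated about three times on average instead of nine, and compared with pure downsampling, 300 majority rows are kept instead of 100.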