## Handling Imbalanced Datasets

In data science, an imbalanced dataset refers to a situation where the classes in a classification problem are not represented equally. For example, if you're working on a binary classification problem to identify whether an email is spam or not, and 95% of the emails in your dataset are non-spam (negative class) while only 5% are spam (positive class), the dataset is considered imbalanced.

Imbalanced datasets can pose challenges for machine learning models, leading to several issues:

1. **Bias Toward Majority Class**: Models may become biased toward predicting the majority class, as they tend to minimize overall error, which may result in poor performance for the minority class.

2. **Evaluation Metrics**: Standard metrics like accuracy can be misleading because a model that always predicts the majority class still scores high accuracy while never detecting the minority class. Precision, recall, and F1-score for the minority class are usually more informative. ROC-AUC is also common, although under heavy imbalance the area under the precision-recall curve is often the more sensitive summary.

3. **Learning Challenges**: Models may struggle to learn the characteristics of the minority class due to its underrepresentation, which can lead to poor generalization.
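
To make the point about accuracy concrete, here is a minimal sketch using the 95% / 5% spam split from the example above. The labels are made up for illustration, and the "model" is simply a constant prediction of the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 95 non-spam (0) and 5 spam (1) emails
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning raised when no positives are predicted
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# prints: accuracy=0.95, recall=0.00, f1=0.00
```

The accuracy looks strong even though not a single spam email is caught, which is why recall and F1 on the minority class are worth reporting alongside it.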

To address the challenges of imbalanced datasets, several techniques can be used:

- **Resampling Methods**: Techniques such as oversampling the minority class (e.g., using Synthetic Minority Over-sampling Technique (SMOTE)) or undersampling the majority class to balance the class distribution.

- **Algorithmic Approaches**: Using algorithms that tend to cope better with imbalanced data (tree-based ensembles are a common choice), or adjusting model settings such as class weights or the decision threshold to make the model more sensitive to the minority class.

- **Cost-sensitive Learning**: Incorporating different costs for misclassifying different classes, which can help the model to pay more attention to the minority class.

- **Anomaly Detection Methods**: When the minority class is extremely rare, treating it as an anomaly detection problem rather than a standard classification problem.
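
As one illustration of cost-sensitive learning, the sketch below compares a plain logistic regression with one trained using scikit-learn's `class_weight="balanced"` option, which weights each class inversely to its frequency. The synthetic 90/10 data, the choice of model, and the split sizes are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic 90/10 data (illustrative assumption): two Gaussian clusters
X = np.vstack([
    rng.normal(loc=0, scale=1, size=(900, 2)),  # majority class
    rng.normal(loc=2, scale=1, size=(100, 2)),  # minority class
])
y = np.array([0] * 900 + [1] * 100)

# Stratify so both splits keep the 90/10 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

print(f"minority recall, unweighted: {recall_plain:.2f}")
print(f"minority recall, balanced:   {recall_weighted:.2f}")
```

Weighting typically raises minority-class recall at the cost of more false positives, so precision should be checked as well before adopting it.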

Understanding and addressing the imbalance in your dataset is crucial for building effective and reliable machine learning models.



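## Methods to Handle Imbalance

The rest of this notebook demonstrates two resampling methods:

1. Upsampling the minority class
2. Downsampling the majority class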

In [1]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Compute the class sizes for a 90/10 split between class 0 and class 1
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [2]:
n_class_0,n_class_1

(900, 100)

### Create a DataFrame with Imbalanced Classes

In [3]:

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [4]:
# Stack both classes and renumber the rows 0..999
df = pd.concat([class_0, class_1]).reset_index(drop=True)

In [5]:
df.tail()

     feature_1  feature_2  target
995   1.376371   2.845701       1
996   2.239810   0.880077       1
997   1.131760   1.640703       1
998   2.902006   0.390305       1
999   2.697490   2.013570       1


In [6]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

## Upsampling

In [7]:

# Split the data by class
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [8]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [9]:
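from sklearn.utils import resample

# Draw minority rows with replacement until they match the majority count
df_minority_upsampled = resample(
    df_minority,
    replace=True,                # sample with replacement
    n_samples=len(df_majority),  # match the majority class size
    random_state=42,             # reproducible draw
)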

In [10]:
df_minority_upsampled.shape

(900, 3)

In [11]:
df_minority_upsampled.head()

     feature_1  feature_2  target
951   1.125854   1.843917       1
992   2.196570   1.397425       1
914   1.932170   2.998053       1
971   2.272825   3.034197       1
960   2.870056   1.550485       1


In [12]:
# Combine the untouched majority class with the upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [13]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
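
One caveat with upsampling is that the duplicated minority rows can end up in both the training and the test set if the data is split afterwards. A common precaution, sketched below, is to split first and resample only the training portion. The data is regenerated here so the block runs on its own, and the 80/20 split and variable names such as `train_balanced` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(123)

# Rebuild a 900/100 dataset similar to the one above (illustrative data)
df = pd.DataFrame({
    "feature_1": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "feature_2": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "target": [0] * 900 + [1] * 100,
})

# Split first, keeping the class ratio in both parts
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

train_majority = train_df[train_df["target"] == 0]
train_minority = train_df[train_df["target"] == 1]

# Upsample the minority class inside the training split only
train_minority_up = resample(
    train_minority,
    replace=True,
    n_samples=len(train_majority),
    random_state=42,
)

# Recombine and shuffle so the classes are not stored in contiguous blocks
train_balanced = pd.concat([train_majority, train_minority_up]).sample(
    frac=1, random_state=42
)

print(train_balanced["target"].value_counts())
print(test_df["target"].value_counts())
```

The test set keeps its original 90/10 distribution, so metrics computed on it reflect the class balance the model would face in practice.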

## Downsampling

In [14]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [15]:
# Split the data by class before downsampling
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [16]:
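from sklearn.utils import resample

# Draw a subset of majority rows, without replacement, to match the minority count
df_majority_downsampled = resample(
    df_majority,
    replace=False,               # sample without replacement
    n_samples=len(df_minority),  # match the minority class size
    random_state=42,             # reproducible draw
)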

In [17]:
df_majority_downsampled.shape

(100, 3)

In [18]:
# Combine the untouched minority class with the downsampled majority class
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

In [19]:
df_downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64