# Handling Imbalanced Dataset

Suppose a dataset has 1000 data points and the output is Yes/No (a binary classification problem), with 900 points labelled 'Yes' and 100 labelled 'No'.

Most of the data points are 'Yes'. The class ratio is 9:1, so this is considered an imbalanced dataset.

If the dataset is imbalanced, a model trained on it tends to become biased towards the majority class ('Yes').
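To make the bias concrete, here is a minimal sketch (not part of the original notebook) of a "model" that always predicts the majority class. The label encoding (1 = Yes, 0 = No) and the variable names are assumptions chosen for illustration.

```python
# Hypothetical labels matching the 900/100 split described above (1 = Yes, 0 = No).
y_true = [1] * 900 + [0] * 100

# A "model" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)

# Overall accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of actual 'No' points that were predicted as 'No'.
minority_total = sum(t == 0 for t in y_true)
minority_hits = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / minority_total

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.90
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The accuracy looks good even though the model never identifies a single 'No', which is why accuracy alone can be misleading on imbalanced data.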

So it is necessary to rebalance the dataset so that the classes are roughly equal in size. Two common techniques are:
1. Upsampling: increase the number of minority-class data points
2. Downsampling: decrease the number of majority-class data points
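The two techniques can be sketched with the standard library alone, before bringing in pandas or scikit-learn. The row identifiers below are hypothetical stand-ins for real rows, and the seed is arbitrary.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins for rows: 900 majority ids and 100 minority ids.
majority = [f"maj_{i}" for i in range(900)]
minority = [f"min_{i}" for i in range(100)]

# Upsampling: draw from the minority class WITH replacement until it
# matches the majority size, so individual rows are repeated.
minority_upsampled = rng.choices(minority, k=len(majority))

# Downsampling: draw from the majority class WITHOUT replacement until it
# matches the minority size, so the remaining majority rows are discarded.
majority_downsampled = rng.sample(majority, k=len(minority))

print(len(majority), len(minority_upsampled))    # 900 900
print(len(majority_downsampled), len(minority))  # 100 100
```

Upsampling keeps every original row but introduces duplicates, while downsampling avoids duplicates at the cost of throwing data away.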

## Create a Sample Dataset with 1000 Data Points

In [1]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Compute the size of each class from the majority ratio
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:
n_class_0,n_class_1

(900, 100)

## Create a DataFrame with the Imbalanced Dataset

In [4]:

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [7]:
class_0.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 0 | -1.085631 | 0.551302 | 0 |
| 1 | 0.997345 | 0.419589 | 0 |
| 2 | 0.282978 | 1.815652 | 0 |
| 3 | -1.506295 | -0.25275 | 0 |
| 4 | -0.5786 | -0.292004 | 0 |


In [9]:
class_1.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 0 | 1.699768 | 2.139033 | 1 |
| 1 | 1.367739 | 2.025577 | 1 |
| 2 | 1.795683 | 1.803557 | 1 |
| 3 | 2.213696 | 3.312255 | 1 |
| 4 | 3.033878 | 3.187417 | 1 |


In [10]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [11]:
df.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 0 | -1.085631 | 0.551302 | 0 |
| 1 | 0.997345 | 0.419589 | 0 |
| 2 | 0.282978 | 1.815652 | 0 |
| 3 | -1.506295 | -0.25275 | 0 |
| 4 | -0.5786 | -0.292004 | 0 |


In [6]:
df.tail()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 995 | 1.376371 | 2.845701 | 1 |
| 996 | 2.23981 | 0.880077 | 1 |
| 997 | 1.13176 | 1.640703 | 1 |
| 998 | 2.902006 | 0.390305 | 1 |
| 999 | 2.69749 | 2.01357 | 1 |


In [12]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

### Upsampling
- Increase the number of class `1` (minority) rows from 100 to 900, matching the majority class

In [13]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [14]:
from sklearn.utils import resample
df_minority_upsampled=resample(
    df_minority,
    replace=True,  # sample with replacement, so minority rows can repeat
    n_samples=len(df_majority),  # match the majority class size (900)
    random_state=42
)

In [15]:
df_minority_upsampled.shape

(900, 3)

In [16]:
df_minority_upsampled.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 951 | 1.125854 | 1.843917 | 1 |
| 992 | 2.19657 | 1.397425 | 1 |
| 914 | 1.93217 | 2.998053 | 1 |
| 971 | 2.272825 | 3.034197 | 1 |
| 960 | 2.870056 | 1.550485 | 1 |


In [17]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [18]:
df_upsampled['target'].value_counts()

0    900
1    900
Name: target, dtype: int64
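As a rough alternative sketch, the same upsampling can be done with pandas alone, using `DataFrame.sample` in place of `sklearn.utils.resample`. The frame below is a small stand-in rebuilt so the block runs on its own; the seed values and the shuffle step are assumptions added for illustration, not part of the notebook above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)  # assumed seed, for repeatability only

# Stand-in for the imbalanced frame built earlier: 900 zeros, 100 ones.
df = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})
df_majority = df[df["target"] == 0]
df_minority = df[df["target"] == 1]

# Upsample with pandas alone (swapped in here for sklearn's resample).
df_minority_upsampled = df_minority.sample(
    n=len(df_majority), replace=True, random_state=42
)

# Concatenate, then shuffle so the two classes are interleaved.
df_upsampled = (
    pd.concat([df_majority, df_minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

counts = df_upsampled["target"].value_counts()
print(counts[0], counts[1])  # 900 900

# Every upsampled row is a copy of one of the 100 original minority rows.
print(df_minority_upsampled.index.nunique() <= len(df_minority))  # True
```

Because the 900 minority rows are drawn from only 100 originals, the upsampled frame contains many exact duplicates, which is worth keeping in mind when evaluating a model trained on it.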

### Downsampling

In [19]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

0    900
1    100
Name: target, dtype: int64


In [20]:
# Split the rows by class before downsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [21]:
from sklearn.utils import resample
df_majority_downsampled=resample(
    df_majority,
    replace=False,  # sample without replacement
    n_samples=len(df_minority),  # match the minority class size (100)
    random_state=42
)

In [22]:
df_majority_downsampled.shape

(100, 3)

In [23]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])

In [24]:
df_downsampled['target'].value_counts()

1    100
0    100
Name: target, dtype: int64