## Handling Imbalanced Datasets
A dataset is imbalanced when the classes of a categorical column, typically the target (for example yes/no or male/female), are represented by very different numbers of rows.

The problem with an imbalanced dataset is that the model tends to be biased toward the majority class, because most of the data points it learns from belong to that class.
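A quick way to see this bias is to score a baseline that always predicts the majority class. The sketch below uses made-up labels with the same 90/10 split as this notebook; `y_true` and `y_pred` are illustrative names, not part of the original code.

```python
import numpy as np

# Hypothetical labels: 900 negatives (0) and 100 positives (1).
y_true = np.array([0] * 900 + [1] * 100)

# A "model" that ignores the features and always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good even though nothing was learned.
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: share of true positives predicted as positive.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.90
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The baseline is right 90% of the time yet never finds a single minority example, which is why accuracy alone can be misleading on imbalanced data.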

There are two common ways to handle an imbalanced dataset:

1. Upsampling: increase the number of minority-class data points.
2. Downsampling: decrease the number of majority-class data points.
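The two approaches lead to very different dataset sizes. As a small arithmetic sketch, assuming the 900/100 class sizes used in this notebook and that both classes are brought to exactly equal counts:

```python
n_majority, n_minority = 900, 100  # class sizes used in this notebook

# Upsampling grows the minority class to the majority size.
upsampled_total = n_majority + n_majority

# Downsampling shrinks the majority class to the minority size.
downsampled_total = n_minority + n_minority

# Majority rows that downsampling throws away.
rows_discarded = n_majority - n_minority

print(upsampled_total, downsampled_total, rows_discarded)  # 1800 200 800
```

Upsampling keeps every original row but repeats minority rows, while downsampling keeps only unique rows but discards most of the majority class.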

In [2]:
import numpy as np
import pandas as pd

In [5]:
# Set the random seed for reproducibility
np.random.seed(123)

# Compute the number of rows in each class (90% class 0, 10% class 1)
n_samples = 1000
class_0_ratio=0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples-n_class_0

In [6]:
n_class_0,n_class_1


(900, 100)

In [8]:
# Create one DataFrame per class; together they form the imbalanced dataset

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0]*n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_1),
    'target': [1]*n_class_1
})

In [11]:
# Stack both classes into a single DataFrame with a fresh 0..n-1 index
df = pd.concat([class_0, class_1]).reset_index(drop=True)

In [14]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64
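Besides raw counts, it can help to look at class proportions and the majority-to-minority ratio. The sketch below rebuilds a stand-in `target` column with the same 900/100 split rather than reusing `df`, so it runs on its own; `imbalance_ratio` is an illustrative name.

```python
import pandas as pd

# Stand-in for the notebook's `target` column: 900 zeros followed by 100 ones.
target = pd.Series([0] * 900 + [1] * 100, name="target")

counts = target.value_counts()
# normalize=True returns each class's share of the rows instead of its count.
proportions = target.value_counts(normalize=True)

# How many majority rows exist for every minority row.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```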

## Upsampling

In [None]:
# Split the data into the minority class (1) and the majority class (0)
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [20]:
from sklearn.utils import resample
# Sample the minority class with replacement until it matches the majority class size
df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)

In [21]:
df_minority_upsampled.shape

(900, 3)

In [22]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [23]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
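Because upsampling draws with replacement, the 900 upsampled minority rows are repeats of at most 100 distinct originals. The sketch below illustrates this on a rebuilt dataset. It makes two substitutions that are assumptions, not the notebook's method: pandas' `DataFrame.sample` stands in for `resample`, and a local `np.random.default_rng` generator replaces the global seed, so the feature values will differ from the cells above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)  # assumption: local generator, not the notebook's global seed
df = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "feature_2": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})
df_majority = df[df["target"] == 0]
df_minority = df[df["target"] == 1]

# Draw minority rows with replacement until they match the majority count.
df_minority_upsampled = df_minority.sample(
    n=len(df_majority), replace=True, random_state=42
)

# Combine and shuffle so the classes are not stored in two contiguous blocks.
df_upsampled = (
    pd.concat([df_majority, df_minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

# The original index labels reveal how many distinct minority rows were drawn.
n_distinct = df_minority_upsampled.index.nunique()
print(df_upsampled["target"].value_counts().to_dict())
print(f"distinct minority rows used: {n_distinct} of {len(df_minority)}")
```

The repeated rows add no new information, so a model can still overfit to the few minority examples it sees many times.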

## Downsampling

In [24]:
# Reset the random seed so the same dataset is recreated
np.random.seed(123)

# Compute the number of rows in each class (90% class 0, 10% class 1)
n_samples = 1000
class_0_ratio=0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples-n_class_0

# Create one DataFrame per class; together they form the imbalanced dataset

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0]*n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_1),
    'target': [1]*n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)
print(df['target'].value_counts())


target
0    900
1    100
Name: count, dtype: int64


In [25]:
# Split the data into the minority class (1) and the majority class (0)
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [27]:
from sklearn.utils import resample
# Sample the majority class without replacement down to the minority class size
df_majority_downsampled = resample(df_majority, replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

In [28]:
df_majority_downsampled.shape

(100, 3)

In [29]:
df_downsampled=pd.concat([df_majority_downsampled,df_minority])

In [30]:
df_downsampled['target'].value_counts()

target
0    100
1    100
Name: count, dtype: int64
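The upsampling and downsampling steps above follow the same pattern, so they can be folded into one function. The helper below is a hypothetical sketch: `balance_classes` and its `strategy` parameter are names introduced here for illustration. It uses the same `sklearn.utils.resample` call as the notebook.

```python
import pandas as pd
from sklearn.utils import resample


def balance_classes(df, target="target", strategy="down", random_state=42):
    """Hypothetical helper: resample every class to the same size.

    strategy="down" shrinks each class to the smallest class size;
    strategy="up" grows each class to the largest class size.
    """
    if strategy not in {"up", "down"}:
        raise ValueError("strategy must be 'up' or 'down'")
    counts = df[target].value_counts()
    n_target = int(counts.max() if strategy == "up" else counts.min())

    parts = []
    for _, group in df.groupby(target):
        if len(group) == n_target:
            parts.append(group)  # already the right size, keep as is
        else:
            parts.append(
                resample(
                    group,
                    replace=(strategy == "up"),  # with replacement only when growing
                    n_samples=n_target,
                    random_state=random_state,
                )
            )
    return pd.concat(parts).reset_index(drop=True)


# Usage on a toy frame with the same 900/100 split as the notebook.
toy = pd.DataFrame({"feature_1": range(1000), "target": [0] * 900 + [1] * 100})
down = balance_classes(toy, strategy="down")
up = balance_classes(toy, strategy="up")
print(len(down), len(up))  # 200 1800
```

Downsampling draws without replacement, so every row in `down` is a distinct original row. Upsampling draws with replacement, so `up` contains repeated minority rows.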