# How to handle imbalanced datasets

## Two common resampling methods for this are :-

        1. Upsampling
        2. Downsampling

Now let's take a practical look at them :-

        Step 1 :- Create an imbalanced dataset by drawing random values from a normal distribution
        Step 2 :- Merge the two classes into one dataset
        Step 3 :- Look at downsampling
        Step 4 :- Look at upsampling


In [11]:
import numpy as np
import pandas as pd


total_samples = 1000
imbalance_ratio = 0.9

# Row counts for the majority and minority classes
dset1 = int(total_samples * imbalance_ratio)
dset2 = total_samples - dset1
dset1,dset2

(900, 100)

In [68]:
np.random.seed(50) # fix the seed so the same random values are generated on every run
df1 = pd.DataFrame({
    "Column0" : np.random.normal(loc=0, scale=1, size=dset1),
    "Column1" : np.random.normal(loc=0, scale=1, size=dset1),
    "Tgt" : [0] * dset1
})
df2 = pd.DataFrame({
    "Column0" : np.random.normal(loc=0, scale=1, size=dset2),
    "Column1" : np.random.normal(loc=0, scale=1, size=dset2),
    "Tgt" : [1] * dset2
})
df1, df2

(      Column0   Column1  Tgt
 0   -1.560352 -0.098932    0
 1   -0.030978  0.587077    0
 2   -0.620928 -1.054589    0
 3   -1.464580 -1.028763    0
 4    1.411946  0.324330    0
 ..        ...       ...  ...
 895 -0.436827  0.597436    0
 896 -0.328299 -1.395772    0
 897  1.131414  0.664117    0
 898 -0.250921 -0.931639    0
 899 -0.543287  0.407108    0
 
 [900 rows x 3 columns],
      Column0   Column1  Tgt
 0   0.153333  1.432912    1
 1   1.851210  1.850155    1
 2  -0.063522  0.436402    1
 3   0.182513 -0.616574    1
 4   0.754748 -0.152089    1
 ..       ...       ...  ...
 95 -0.392789 -0.623098    1
 96  1.942472  0.068365    1
 97  0.885720  0.189992    1
 98  0.442306 -1.118375    1
 99  1.042787 -0.796007    1
 
 [100 rows x 3 columns])

In [69]:
df = pd.concat([df1, df2]) # merge both classes into one imbalanced dataset
del df1, df2 # these DataFrames are no longer needed
df['Tgt'].value_counts() # class counts in the Tgt column

0    900
1    100
Name: Tgt, dtype: int64
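
The counts above can also be read as class shares, which makes the degree of imbalance easier to compare across datasets. The following is a minimal sketch on a stand-in `Tgt` column with the same 900/100 split; the names `tgt`, `shares` and `ratio` are illustrative and not part of the notebook.

```python
import pandas as pd

# Stand-in target column with the same 900/100 split as the notebook's df
tgt = pd.Series([0] * 900 + [1] * 100, name="Tgt")

counts = tgt.value_counts()                # absolute count per class
shares = tgt.value_counts(normalize=True)  # fraction of rows per class
ratio = counts.max() / counts.min()        # majority rows per minority row

print(shares.to_dict())
print(ratio)
```
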

In [45]:
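## resetting the index; reset_index returns a new DataFrame, so df keeps its old index unless the result is assigned back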
df.reset_index(drop=True)

      Column0   Column1  Tgt
0   -1.560352 -0.098932    0
1   -0.030978  0.587077    0
2   -0.620928 -1.054589    0
3   -1.464580 -1.028763    0
4    1.411946  0.324330    0
..        ...       ...  ...
995 -0.392789 -0.623098    1
996  1.942472  0.068365    1
997  0.885720  0.189992    1
998  0.442306 -1.118375    1
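
Because `reset_index` returns a new DataFrame, calling it without assigning the result leaves the original frame with the duplicated index labels that `pd.concat` carried over. The sketch below shows this on two tiny illustrative frames (`a`, `b` and the other names are made up for the example) and two ways to get a clean index.

```python
import pandas as pd

# Two small frames whose default indexes both start at 0 (illustrative data)
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [4, 5]})
combined = pd.concat([a, b])

print(combined.index.tolist())    # labels from a and b are carried over as-is
combined.reset_index(drop=True)   # returns a new frame; `combined` is unchanged
print(combined.index.is_unique)

# Option 1: assign the result back to a name
fixed = combined.reset_index(drop=True)
# Option 2: let concat build a fresh index in one step
fixed_alt = pd.concat([a, b], ignore_index=True)
print(fixed.index.tolist(), fixed_alt.index.tolist())
```
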


In [49]:
from sklearn.utils import resample
# resample draws a random sample from the data; by default it samples with replacement (replace=True)

# df1 and df2 can be recreated from df with boolean filtering
df1 = df[df["Tgt"]==0] # majority class: 900 rows
df2 = df[df["Tgt"]==1] # minority class: 100 rows

### df1 holds the majority class (Tgt = 0), so we reduce it to the size of df2 and then combine the two. This method is called downsampling.

In [52]:
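reduced_df1 = resample(df1, replace=True, # replace=True samples with replacement, so the same row can be drawn more than once
                       n_samples=len(df2),
                       random_state=50) # draws 100 rows from the 900; the rest of the majority data is discarded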
reduced_df1

      Column0   Column1  Tgt
688  0.110897  0.142652    0
480  0.485773 -0.733560    0
109 -2.444716  1.819445    0
289  0.920007  1.743950    0
132 -0.640622 -0.417895    0
..        ...       ...  ...
342 -0.356296  1.559659    0
811  0.083715 -1.222527    0
275 -0.224136 -1.253117    0
224  1.089509  0.420127    0


In [59]:
reduced_df1 = resample(df1, replace=False, # replace=False samples without replacement, so each row is drawn at most once
                       n_samples=len(df2),
                       random_state=50) # keeps 100 of the 900 rows; the other 800 are discarded
reduced_df1

      Column0   Column1  Tgt
806  0.416837 -1.854719    0
188  2.712144 -0.107462    0
222  1.577870  2.094607    0
407 -1.188551 -0.919913    0
7    1.070268  1.352667    0
..        ...       ...  ...
690  0.500014 -1.318416    0
104 -0.168575 -2.233004    0
253  1.124327 -1.260511    0
46  -0.358858  1.161654    0
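
The difference between the two downsampling cells can be checked directly by counting repeated index labels in each sample. This is a rough sketch on a stand-in majority frame (`majority`, `with_repl` and `without_repl` are illustrative names), assuming scikit-learn is installed as in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Stand-in majority class with 900 uniquely indexed rows
majority = pd.DataFrame({"value": np.arange(900), "Tgt": 0})

with_repl = resample(majority, replace=True, n_samples=100, random_state=50)
without_repl = resample(majority, replace=False, n_samples=100, random_state=50)

# Rows drawn more than once show up as duplicated index labels
print(with_repl.index.duplicated().sum())     # may be greater than 0
print(without_repl.index.duplicated().sum())  # 0: no row is drawn twice
```

For downsampling, `replace=False` is usually the more natural choice, since there are already more majority rows than needed and repeats only shrink the variety of the kept rows.
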


### df2 holds the minority class (Tgt = 1), so we increase it to the size of df1 and then combine the two. This method is called upsampling.

In [58]:
elevated_df2 = resample(df2, replace=True,  # sample with replacement: df2 has only 100 rows but we need 900, so rows must repeat
                        n_samples=len(df1),
                        random_state=50)
elevated_df2

     Column0   Column1  Tgt
48 -2.151085 -0.592497    1
96  1.942472  0.068365    1
11  0.577360  0.211948    1
33  1.547619  0.821354    1
94  0.835437  0.716605    1
..       ...       ...  ...
75  0.770849  0.625531    1
80  0.799130  0.998024    1
34 -0.098642  0.249638    1
88 -0.667568 -0.597570    1


In [62]:
len(df1), len(reduced_df1), len(df2), len(elevated_df2)

(900, 100, 100, 900)

Now we can concatenate df1 with elevated_df2 (upsampling), or reduced_df1 with df2 (downsampling).

In [66]:
pd.concat([df1, elevated_df2])["Tgt"].value_counts()

0    900
1    900
Name: Tgt, dtype: int64

In [67]:
pd.concat([df2, reduced_df1])["Tgt"].value_counts()

1    100
0    100
Name: Tgt, dtype: int64

Both approaches turn the imbalanced dataset into a balanced one: upsampling gives 900 rows per class, and downsampling gives 100 rows per class.