## Correcting Class Imbalance in Datasets for Binary Classification Problems

**Creator - Gourab Nath**

In [20]:
#Package for class imbalance correction
import imblearn

In [4]:
#Numpy and pandas
import pandas as pd
import numpy as np

### A. Pima Indian Diabetes Dataset

You may download the dataset from this link: http://bit.ly/2UfPQNI

In [5]:
#Read the file - change the path to the folder where you saved diabetes.csv
pima = pd.read_csv("C:\\Users\\Gourab\\Downloads\\diabetes.csv")

In [9]:
#Class distribution of the target
pd.crosstab(pima.Outcome, columns='count')

col_0,count
Outcome,Unnamed: 1_level_1
0,500
1,268


In [12]:
#Split the predictors and target
X = pima.drop('Outcome', axis=1)
y = pima.Outcome

In [13]:
#Train-Test split using stratified random sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=4)

In [52]:
#Checking the class (percentage) distribution of the target in the training set
pd.crosstab(y_train, columns='count')/len(y_train)*100

col_0,count
Outcome,Unnamed: 1_level_1
0,65.176909
1,34.823091


In [None]:
#Check the percentage class distribution of y and y_test
#(Do it yourself - these will match closely with y_train)



### B. Under-Sampling Techniques

#### B.1. Random Under-sampling

In [31]:
#Performing random undersampling on the training dataset
rus = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=1.0)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

*sampling_strategy = 1.0 ensures that, after under-sampling, the number of minority observations equals the number of majority observations. When a float is given, it is the desired ratio of minority to majority observations after resampling, so sampling_strategy = 0.8 would leave the minority class at 0.8 times the size of the majority class. A float must lie in the range (0, 1], so a value such as 1.2 is rejected. Find the details of this class in the link given below.*

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
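
As a rough illustration of the arithmetic only, the sketch below computes how many majority observations would remain for a given float `sampling_strategy`, assuming the float is the minority-to-majority ratio after resampling. The helper `majority_size_after_undersampling` is hypothetical and not part of imblearn; 187 and 350 are the class counts of the training set used in this notebook.

```python
def majority_size_after_undersampling(n_minority, n_majority, sampling_strategy):
    """Majority count for which minority/majority is about sampling_strategy (rounded down)."""
    if not 0 < sampling_strategy <= 1:
        raise ValueError("a float sampling_strategy is assumed to lie in (0, 1]")
    target = int(n_minority / sampling_strategy)
    if target > n_majority:
        # Under-sampling can only remove majority rows, never add them
        raise ValueError("ratio is below the current minority/majority ratio")
    return target


n_minority, n_majority = 187, 350  # training-set class counts in this notebook

balanced = majority_size_after_undersampling(n_minority, n_majority, 1.0)
partial = majority_size_after_undersampling(n_minority, n_majority, 0.8)
print(balanced, partial)
```

With a ratio of 1.0 the majority class is cut down to the minority size, which is why the percentage table shows a 50/50 split.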

In [53]:
#Checking the class distribution of the target
pd.crosstab(y_rus, columns='count')/len(y_rus)*100

col_0,count
Outcome,Unnamed: 1_level_1
0,50.0
1,50.0


#### B.2. Tomek Links

In [39]:
#Undersampling using Tomek links on training data
tomek = imblearn.under_sampling.TomekLinks()
X_tomek, y_tomek = tomek.fit_resample(X_train, y_train)

In [41]:
#Checking the class (frequency) distribution of the target after the undersampling
pd.crosstab(y_tomek, columns='count')

col_0,count
Outcome,Unnamed: 1_level_1
0,313
1,187


In [55]:
#The class (frequency) distribution of the training data
pd.crosstab(y_train, columns='count')

col_0,count
Outcome,Unnamed: 1_level_1
0,350
1,187


*This shows that (350 - 313) = 37 Tomek links were identified and that the majority observation of each link was deleted. The 187 minority observations are unchanged.*
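
To make the idea of a Tomek link concrete, here is a minimal pure-Python sketch, assuming the usual definition: two observations of different classes that are each other's nearest neighbour. The toy points, the labels and the helper names are made up for illustration and are not imblearn's implementation.

```python
import math


def nearest_neighbour(i, points):
    """Index of the point closest to points[i] (Euclidean distance), excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) with i < j that are mutual nearest neighbours with different labels."""
    nn = [nearest_neighbour(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Toy data (hypothetical): 0 = majority class, 1 = minority class
points = [(0.0, 0.0), (1.0, 0.0), (1.2, 0.0), (3.0, 0.0), (5.0, 0.0), (5.1, 0.0)]
labels = [0, 0, 1, 1, 0, 0]

links = tomek_links(points, labels)
# Remove only the majority member of each link
to_drop = {i if labels[i] == 0 else j for i, j in links}
kept = [k for k in range(len(points)) if k not in to_drop]
print(links, kept)
```

Points 4 and 5 are also mutual nearest neighbours, but they share a class, so they do not form a Tomek link.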

In [56]:
#Checking the class (percentage) distribution of the target after the undersampling
pd.crosstab(y_tomek, columns='count')/len(y_tomek)*100

col_0,count
Outcome,Unnamed: 1_level_1
0,62.6
1,37.4


The description of the imblearn.under_sampling.TomekLinks() function can be found in the link given below:
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.TomekLinks.html

#### B.3. Wilson's ENN

In [42]:
#Undersampling using Wilson's ENN on training data
enn = imblearn.under_sampling.EditedNearestNeighbours()
X_enn, y_enn = enn.fit_resample(X_train, y_train)

In [44]:
#Checking the class (frequency) distribution of the target after the undersampling
pd.crosstab(y_enn, columns='count')

col_0,count
Outcome,Unnamed: 1_level_1
0,175
1,187


*The output indicates that (350 - 175) = 175 majority observations were deleted. With the library defaults (n_neighbors=3, kind_sel='all'), a majority observation is kept only if all three of its nearest neighbours also belong to the majority class, so each of these 175 observations had at least one minority observation among its three nearest neighbours and was therefore removed.*

The description of the imblearn.under_sampling.EditedNearestNeighbours() function can be found in the link given below:
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.EditedNearestNeighbours.html
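
The editing rule can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: k = 3, Euclidean distance, and the strict rule that a majority observation survives only when all of its k nearest neighbours are majority observations too. The toy data and the helpers `k_nearest` and `enn_keep_mask` are hypothetical and not taken from imblearn.

```python
import math


def k_nearest(i, points, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def enn_keep_mask(points, labels, majority_label, k=3):
    """True for rows to keep: minority rows always, majority rows only if
    all k nearest neighbours carry the majority label."""
    keep = []
    for i, label in enumerate(labels):
        if label != majority_label:
            keep.append(True)
        else:
            neighbours = k_nearest(i, points, k)
            keep.append(all(labels[j] == majority_label for j in neighbours))
    return keep


# Toy data (hypothetical): index 4 is a majority point sitting inside the minority cluster
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (7.2, 0.0),
          (7.0, 0.0), (7.5, 0.0), (8.0, 0.0)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

keep = enn_keep_mask(points, labels, majority_label=0, k=3)
print(keep)
```

Only the stray majority point at index 4 is edited out; the compact majority cluster and all minority points stay.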

### C. Over-Sampling Techniques

#### C.1. Random Over-Sampling

In [36]:
#Performing random oversampling on the training dataset
ros = imblearn.over_sampling.RandomOverSampler(sampling_strategy=1.0)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

*sampling_strategy = 1.0 ensures that, after over-sampling, the number of minority observations equals the number of majority observations. If sampling_strategy = 0.8, the minority class is over-sampled until it is 0.8 times the size of the majority class. Find the details of this class in the link given below.*

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html#imblearn.over_sampling.RandomOverSampler
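
A minimal sketch of what random over-sampling does, assuming minority rows are duplicated by sampling with replacement until the requested minority-to-majority ratio is reached. The function `random_oversample` and its `seed` argument are hypothetical names for this illustration; the label vector only mimics the 350/187 split of the training set.

```python
import random


def random_oversample(labels, minority_label, sampling_strategy, seed=0):
    """Return row indices after duplicating minority rows (with replacement)
    until minority/majority reaches sampling_strategy."""
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    n_majority = len(labels) - len(minority)
    n_extra = int(n_majority * sampling_strategy) - len(minority)
    if n_extra < 0:
        raise ValueError("ratio is below the current minority/majority ratio")
    rng = random.Random(seed)
    extra = rng.choices(minority, k=n_extra)  # duplicates are allowed
    return list(range(len(labels))) + extra


# Stand-in for y_train: 350 majority (0) and 187 minority (1) labels
labels = [0] * 350 + [1] * 187

resampled = random_oversample(labels, minority_label=1, sampling_strategy=0.8)
n_minority_after = sum(labels[i] for i in resampled)
print(len(resampled), n_minority_after)
```

No new information is created here: every added row is an exact copy of an existing minority row.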

In [54]:
#Checking the class (percentage) distribution of the target after over-sampling
pd.crosstab(y_ros, columns='count')/len(y_ros)*100

col_0,count
Outcome,Unnamed: 1_level_1
0,50.0
1,50.0


#### C.2. SMOTE

In [57]:
#Applying SMOTE on the training data
smote = imblearn.over_sampling.SMOTE(sampling_strategy=1.0, k_neighbors=2)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

*sampling_strategy has the same function as mentioned before. k_neighbors=2 sets the value of k (the number of nearest minority neighbours) used by SMOTE.*

The description of the imblearn.over_sampling.SMOTE() function can be found in the link given below:
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
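
Unlike random over-sampling, SMOTE creates new points rather than copies. The sketch below shows the core interpolation step in pure Python, assuming the basic SMOTE recipe: pick a minority point, pick one of its k nearest minority neighbours, and place a synthetic point at a random position on the segment between them. The toy points and the helper `smote_sample` are hypothetical; this is not imblearn's implementation.

```python
import math
import random


def smote_sample(minority_points, k, rng):
    """Create one synthetic point between a random minority point
    and one of its k nearest minority neighbours."""
    i = rng.randrange(len(minority_points))
    base = minority_points[i]
    others = [p for j, p in enumerate(minority_points) if j != i]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


rng = random.Random(0)
# Toy minority observations (hypothetical)
minority_points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 8.0)]

synthetic = [smote_sample(minority_points, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

A larger k lets the synthetic points spread over a wider neighbourhood, while a small k such as 2 keeps them close to existing minority observations.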

In [58]:
#Checking the class (frequency) distribution of the target after SMOTE
pd.crosstab(y_smote, columns='count')

col_0,count
Outcome,Unnamed: 1_level_1
0,350
1,350


### D. Exercise (For You)

#### D.1. Fit Classification Models

Fit different classification models on the original training dataset, on the training datasets obtained after under-sampling, and on the training datasets obtained after over-sampling. Evaluate each model on the test dataset using AUROC and compare the results.

#### D.2. Try Hybrid Techniques

Create new samples from the original training dataset by using hybrid techniques such as:

* SMOTE + Tomek
* SMOTE + ENN
* SMOTE + ENN + Tomek


Fit different classification models on these datasets, evaluate each model on the test dataset using AUROC, and compare the results.
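
Both exercises compare models by AUROC, so it may help to see what that number measures. The sketch below uses the pairwise definition: the share of (positive, negative) pairs in which the positive observation gets the higher score, with ties counted as one half. The function `auroc` and the four toy scores are illustrative only; in practice `sklearn.metrics.roc_auc_score` can be used, and it should be given predicted probabilities or scores rather than hard class labels.

```python
def auroc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): true labels and predicted scores for class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

result = auroc(y_true, scores)
print(result)
```

Because AUROC depends only on how the scores rank the observations, it stays comparable across models trained on differently resampled training sets, as long as all of them are scored on the same untouched test set.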