**Handling Imbalanced Classes with Downsampling**

https://chrisalbon.com/machine_learning/preprocessing_structured_data/handling_imbalanced_classes_with_downsampling/

**Downsampling: a strategy for handling imbalanced classes by drawing a random subset of the majority class, without replacement, that is the same size as the minority class.**

**Preliminaries**

In [1]:
#Load Libraries

import numpy as np
from sklearn.datasets import load_iris

**Load Iris dataset**

In [2]:
#Load Iris dataset

iris = load_iris()

#Create a feature matrix

X = iris.data

#Create a target vector

y = iris.target


**Make Iris data imbalanced**

In [3]:
#Remove the first 40 observations (all class 0), leaving 10 of class 0 and 100 of classes 1 and 2

X = X[40:,:]
y = y[40:]


In [9]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [10]:
#Create a binary target vector: 0 for class 0, 1 for every other class

y = np.where(y == 0, 0, 1)

#Look at the imbalanced target vector
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
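
The printed vector makes the imbalance hard to judge at a glance, so a per-class count is a useful check. The following is a minimal sketch; `y_demo` is a synthetic stand-in with the same 10/100 split as the output above (an assumption made to keep the snippet self-contained), not the Iris target itself.

```python
import numpy as np

# Stand-in for the binary target vector above: 10 zeros followed by 100 ones
y_demo = np.array([0] * 10 + [1] * 100)

# Count the observations in each class; position k holds the count for class k
class_counts = np.bincount(y_demo)

print(class_counts[0], class_counts[1])  # prints: 10 100
```

With 10 observations in class 0 and 100 in class 1, class 1 is the majority class that gets downsampled.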

**Downsample the majority class to match minority class**

In [17]:
#Indices of each class's observations

i_class0 = np.where(y==0)[0]
i_class1 = np.where(y==1)[0]

In [20]:
#Number of observations in each class

n_class0 = len(i_class0)
n_class1 = len(i_class1)

In [23]:
#Randomly sample n_class0 observations from class 1, without replacement
#(no seed is set, so the sampled indices differ from run to run)
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False)
i_class1_downsampled

array([ 88,  60,  17,  34,  82,  77, 105,  83,  84,  27])
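
Because no seed is set, the indices above change on every run. If reproducible output matters, one option is NumPy's seeded `Generator`. This sketch assumes a synthetic target vector (`y_demo`) with the same 10/100 split and an arbitrary seed of 0, so the indices it draws will not match the ones shown above.

```python
import numpy as np

# Synthetic stand-in for the binary target: 10 zeros followed by 100 ones
y_demo = np.array([0] * 10 + [1] * 100)

# Indices of each class's observations
i_class0 = np.where(y_demo == 0)[0]
i_class1 = np.where(y_demo == 1)[0]

# A seeded generator returns the same sample every time the script runs
rng = np.random.default_rng(0)  # the seed value 0 is arbitrary
i_class1_downsampled = rng.choice(i_class1, size=len(i_class0), replace=False)

print(len(i_class1_downsampled))  # prints: 10
```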

In [24]:
#Look at the data for the downsampled class
X[i_class1_downsampled]

array([[ 6.4,  2.8,  5.6,  2.1],
       [ 6.3,  3.3,  6. ,  2.5],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6.4,  2.9,  4.3,  1.3],
       [ 7.7,  2.8,  6.7,  2. ],
       [ 7.7,  3.8,  6.7,  2.2],
       [ 6.7,  3. ,  5.2,  2.3],
       [ 6.3,  2.7,  4.9,  1.8],
       [ 6.7,  3.3,  5.7,  2.1],
       [ 5.8,  2.7,  4.1,  1. ]])

In [25]:
#Join together class 0's target vector with downsampled class 1's target vector
np.hstack((y[i_class0],y[i_class1_downsampled]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
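
The cell above joins only the target vectors; the feature matrix needs the same treatment so that rows and labels stay aligned. The sketch below wraps both steps in one helper. The name `downsample_majority`, its `seed` parameter, and the synthetic `X_demo` / `y_demo` arrays are all assumptions introduced for illustration, and the helper assumes exactly two classes of unequal size. Indexing with the concatenated indices gives the same result as stacking the two pieces with `np.vstack` and `np.hstack`.

```python
import numpy as np

def downsample_majority(X, y, seed=None):
    """Randomly cut the larger of two classes down to the size of the smaller."""
    rng = np.random.default_rng(seed)

    # Identify which label is the minority and which is the majority
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]

    # Indices of each class's observations
    i_minority = np.where(y == minority)[0]
    i_majority = np.where(y == majority)[0]

    # Sample as many majority observations as there are minority observations
    i_majority_down = rng.choice(i_majority, size=len(i_minority), replace=False)

    # Keep all minority rows plus the sampled majority rows, in that order
    keep = np.concatenate((i_minority, i_majority_down))
    return X[keep], y[keep]

# Synthetic data with the same shape and class split as the example above
X_demo = np.arange(110 * 4, dtype=float).reshape(110, 4)
y_demo = np.array([0] * 10 + [1] * 100)

X_bal, y_bal = downsample_majority(X_demo, y_demo, seed=0)
print(X_bal.shape, y_bal.shape)  # prints: (20, 4) (20,)
```

The returned arrays are ordered with the minority class first, so shuffling them before training may be worthwhile for models that are sensitive to row order.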