# Class Imbalance
Class imbalance is common in real-world applications of classification algorithms. This notebook works through an example of an imbalanced classification problem and several methods for dealing with it.

## Dataset
The data used in the example is from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/balance+scale). This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The class is found by comparing (left-distance * left-weight) with (right-distance * right-weight): the scale tips toward the side with the greater product, and if the two products are equal, it is balanced.

Attribute Information:

1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)
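
The labelling rule described above can be sketched in a few lines. This is a minimal illustration, not the code that produced the UCI file: the function name `balance_class` and the enumeration of every combination of the four attributes are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R' or 'B' by comparing weight * distance on each side."""
    left_product = left_weight * left_distance
    right_product = right_weight * right_distance
    if left_product > right_product:
        return "L"
    if right_product > left_product:
        return "R"
    return "B"


# Enumerate every combination of the four attributes (each takes values 1..5).
levels = range(1, 6)
class_counts = Counter(balance_class(*row) for row in product(levels, repeat=4))
print(class_counts)
```

Under that assumption the enumeration has 625 rows, 49 of which are balanced, which is consistent with the class proportions reported later in this notebook.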


## Load packages and dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [4]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data', names=['balance','Left_Weight','Left_Distance','Right_Weight','Right_Distance'])
df.head()

| | balance | Left_Weight | Left_Distance | Right_Weight | Right_Distance |
|---|---|---|---|---|---|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |


**Review the class distribution**

In [5]:
df['balance'].value_counts(normalize=True)

R    0.4608
L    0.4608
B    0.0784
Name: balance, dtype: float64

**Collapse the three classes into two: balanced (1) versus imbalanced (0)**

In [6]:
df['balance'] = [1 if b =='B' else 0 for b in df['balance']]
df['balance'].value_counts(normalize=True)

0    0.9216
1    0.0784
Name: balance, dtype: float64

## Example: applying a classification algorithm directly to the imbalanced dataset

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, f1_score, classification_report, precision_score, recall_score

In [10]:
# Get training set with input and output
y = df.balance
X = df.drop('balance', axis=1)

# train model
clf = LogisticRegression().fit(X,y)

# predict 
y_pred = clf.predict(X)

# print result
print("accuracy score: ", accuracy_score(y, y_pred))
print("\n confusion matrix: ")
confusion_matrix(y,y_pred)

accuracy score:  0.9216

 confusion matrix: 


array([[576,   0],
       [ 49,   0]], dtype=int64)

In [12]:
np.unique(y_pred)

array([0], dtype=int64)

In [14]:
roc_curve(y, y_pred)

(array([0., 1.]), array([0., 1.]), array([1, 0], dtype=int64))

In [17]:
print(classification_report(y,y_pred))

precision    recall  f1-score   support

           0       0.92      1.00      0.96       576
           1       0.00      0.00      0.00        49

    accuracy                           0.92       625
   macro avg       0.46      0.50      0.48       625
weighted avg       0.85      0.92      0.88       625



In [18]:
f1_score(y,y_pred)

0.0

In [20]:
precision_score(y,y_pred)

0.0

In [21]:
recall_score(y,y_pred)

0.0

These are the model's predictions on the training set. Every observation is predicted to be class 0, the majority class. The model achieves high accuracy (0.92) simply by ignoring the minority class completely, which is why precision, recall, and F1 for class 1 are all 0.
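
To see why accuracy alone is misleading here, the figures can be recomputed by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix. The variable names are illustrative, and balanced accuracy (the mean of the per-class recalls) is brought in as one common complementary metric; it is not something the cells above compute.

```python
# Confusion matrix from the run above: rows are true classes, columns are predictions.
tn, fp = 576, 0  # true class 0: predicted 0, predicted 1
fn, tp = 49, 0   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

recall_majority = tn / (tn + fp)  # share of class 0 correctly identified
recall_minority = tp / (tp + fn)  # share of class 1 correctly identified
balanced_accuracy = (recall_majority + recall_minority) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"recall (class 0):  {recall_majority:.4f}")
print(f"recall (class 1):  {recall_minority:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

The accuracy equals the majority-class share of the data, while the balanced accuracy of 0.5 is what any constant prediction would obtain.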


## Up-sampling
Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.
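
As a rough, library-independent illustration of the idea, duplication with replacement can be sketched with the standard library alone. The toy `majority` and `minority` lists are hypothetical placeholders, not rows from the balance-scale data.

```python
import random

# Hypothetical toy data: 10 majority observations and 3 minority observations.
majority = [("maj", i) for i in range(10)]
minority = [("min", i) for i in range(3)]

# Draw from the minority class WITH replacement until it matches the majority size.
rng = random.Random(123)
minority_upsampled = rng.choices(minority, k=len(majority))

balanced = majority + minority_upsampled
print(len(majority), len(minority_upsampled))
print(len(set(minority_upsampled)), "distinct minority observations after up-sampling")
```

No new information is created: every up-sampled row is a copy of one of the original minority rows.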

In [22]:
# Import resample module
from sklearn.utils import resample


In [23]:
# Separate majority and minority samples
df_majority = df[df['balance']==0]
df_minority = df[df['balance']==1]
print(df_majority.shape, df_minority.shape)


(576, 5) (49, 5)


In [24]:
df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples = 576,
    random_state = 123
)
print(df_minority_upsampled.shape)

(576, 5)


In [25]:
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [26]:
df_upsampled['balance'].value_counts()

1    576
0    576
Name: balance, dtype: int64

In [27]:
# Separate input features (X) and target variable (y)
y = df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)
 
# Train model
clf_1 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_1 = clf_1.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_1 ) )
# [0 1]
 
# How's our accuracy?
print( classification_report(y, pred_y_1) )

[0 1]
              precision    recall  f1-score   support

           0       0.51      0.51      0.51       576
           1       0.51      0.52      0.52       576

    accuracy                           0.51      1152
   macro avg       0.51      0.51      0.51      1152
weighted avg       0.51      0.51      0.51      1152



In [28]:
confusion_matrix(y,pred_y_1)

array([[296, 280],
       [279, 297]], dtype=int64)

The model is no longer predicting just one class. Accuracy dropped from 0.92 to 0.51, but because the two classes are now the same size it is a more meaningful performance metric.
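
The scores above are computed on the same rows the model was fitted on, and after up-sampling those rows contain many exact duplicates. One way to get a less optimistic estimate, sketched here under stated assumptions, is to split the data first and up-sample only the training part. To stay self-contained, the sketch rebuilds the 625 rows from the rule in the Dataset section rather than downloading the file; the names `data`, `cols`, `train_balanced`, and the 70/30 split are choices made for this example only.

```python
from itertools import product

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Assumption: every combination of the four attributes (values 1..5) appears once.
cols = ["Left_Weight", "Left_Distance", "Right_Weight", "Right_Distance"]
data = pd.DataFrame(list(product(range(1, 6), repeat=4)), columns=cols)
data["balance"] = (
    data["Left_Weight"] * data["Left_Distance"]
    == data["Right_Weight"] * data["Right_Distance"]
).astype(int)

# Split FIRST, so that no duplicated minority row can end up in the test set.
train, test = train_test_split(
    data, test_size=0.3, stratify=data["balance"], random_state=123
)

# Up-sample the minority class inside the training split only.
train_majority = train[train["balance"] == 0]
train_minority = train[train["balance"] == 1]
train_minority_up = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=123
)
train_balanced = pd.concat([train_majority, train_minority_up])

model = LogisticRegression().fit(train_balanced[cols], train_balanced["balance"])
test_pred = model.predict(test[cols])

# The test split keeps the original class proportions.
print(classification_report(test["balance"], test_pred, zero_division=0))
```

Because the test split keeps the original class proportions, the per-class precision and recall in this report are more informative than overall accuracy.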

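## Down-sampling
Down-sampling is the process of randomly removing observations from the majority class so that it no longer dominates the training data.
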
In [29]:
# Separate majority and minority samples
df_majority = df[df['balance']==0]
df_minority = df[df['balance']==1]

df_majority_downsampled = resample(
    df_majority,
    replace=False,
    n_samples=49,
    random_state = 123
)

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

df_downsampled.balance.value_counts()

1    49
0    49
Name: balance, dtype: int64

In [30]:
# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)
 
# Train model
clf_2 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_2 = clf_2.predict(X)
print( classification_report(y, pred_y_2) )

precision    recall  f1-score   support

           0       0.56      0.59      0.57        49
           1       0.57      0.53      0.55        49

    accuracy                           0.56        98
   macro avg       0.56      0.56      0.56        98
weighted avg       0.56      0.56      0.56        98



In [31]:
confusion_matrix(y, pred_y_2)

array([[29, 20],
       [23, 26]], dtype=int64)

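The model again predicts both classes, and its training accuracy is slightly higher than with up-sampling (0.56 versus 0.51), although it is now measured on only 98 observations.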
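
With only 98 rows, a difference of a few points may simply reflect which majority rows happened to be drawn. A quick, hedged way to check is to repeat the down-sampling with several seeds and look at the spread. As an assumption for self-containment, the sketch rebuilds the data from the rule in the Dataset section; the names `data`, `cols`, and `accuracies` and the choice of 10 seeds are illustrative.

```python
from itertools import product

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

# Assumption: every combination of the four attributes (values 1..5) appears once.
cols = ["Left_Weight", "Left_Distance", "Right_Weight", "Right_Distance"]
data = pd.DataFrame(list(product(range(1, 6), repeat=4)), columns=cols)
data["balance"] = (
    data["Left_Weight"] * data["Left_Distance"]
    == data["Right_Weight"] * data["Right_Distance"]
).astype(int)

majority = data[data["balance"] == 0]
minority = data[data["balance"] == 1]

accuracies = []
for seed in range(10):
    # Draw a different random subset of the majority class for each seed.
    majority_down = resample(
        majority, replace=False, n_samples=len(minority), random_state=seed
    )
    sample = pd.concat([majority_down, minority])
    model = LogisticRegression().fit(sample[cols], sample["balance"])
    accuracies.append(accuracy_score(sample["balance"], model.predict(sample[cols])))

print([round(a, 2) for a in accuracies])
print("min:", min(accuracies), "max:", max(accuracies))
```

If the spread across seeds is comparable to the gap between the up-sampled and down-sampled runs, that gap should not be read as a real improvement.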

## imblearn package for Over-Sampling

In [32]:
from imblearn.over_sampling import RandomOverSampler

In [33]:
y = df.balance
X = df.drop('balance', axis=1)

In [35]:
ros = RandomOverSampler(random_state=123)

In [36]:
X_resampled, y_resampled = ros.fit_resample(X, y)

In [40]:
y_resampled.value_counts()

1    576
0    576
Name: balance, dtype: int64

In [42]:
# train model
clf = LogisticRegression().fit(X_resampled, y_resampled)

# predict 
y_pred = clf.predict(X_resampled)

# print result
print("accuracy score: ", accuracy_score(y_resampled, y_pred))
print("\n confusion matrix: ")
confusion_matrix(y_resampled,y_pred)

accuracy score:  0.5121527777777778

 confusion matrix: 


array([[292, 284],
       [278, 298]], dtype=int64)

In [43]:
from sklearn.svm import LinearSVC
clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("accuracy score: ", accuracy_score(y_resampled, y_pred))
print("\n confusion matrix: ")
confusion_matrix(y_resampled,y_pred)

accuracy score:  0.5121527777777778

 confusion matrix: 


array([[292, 284],
       [278, 298]], dtype=int64)

In [44]:
from imblearn.over_sampling import SMOTE, ADASYN

In [51]:
X_resampled, y_resampled = SMOTE().fit_resample(X, y)

print(y_resampled.value_counts())

clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

1    576
0    576
Name: balance, dtype: int64

accuracy score:  0.6041666666666666

confusion matrix: 


array([[332, 244],
       [212, 364]], dtype=int64)

In [53]:
X_resampled, y_resampled = ADASYN().fit_resample(X, y)

print(y_resampled.value_counts())

clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

1    578
0    576
Name: balance, dtype: int64

accuracy score:  0.5719237435008665

confusion matrix: 


array([[328, 248],
       [246, 332]], dtype=int64)

In [54]:
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE, KMeansSMOTE
X_resampled, y_resampled = BorderlineSMOTE().fit_resample(X, y)

print(y_resampled.value_counts())

clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

1    576
0    576
Name: balance, dtype: int64

accuracy score:  0.6015625

confusion matrix: 


array([[350, 226],
       [233, 343]], dtype=int64)

In [55]:
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE, KMeansSMOTE
X_resampled, y_resampled = SVMSMOTE().fit_resample(X, y)

print(y_resampled.value_counts())

clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

0    576
1    343
Name: balance, dtype: int64

accuracy score:  0.6953210010881393

confusion matrix: 


array([[531,  45],
       [235, 108]], dtype=int64)

In [65]:
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE, KMeansSMOTE
X_resampled, y_resampled = KMeansSMOTE(cluster_balance_threshold=0.13).fit_resample(X, y)

print(y_resampled.value_counts())



1    576
0    576
Name: balance, dtype: int64


In [66]:
clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)


accuracy score:  0.8784722222222222

confusion matrix: 


array([[471, 105],
       [ 35, 541]], dtype=int64)

## imblearn package for Under-Sampling

In [67]:
from imblearn.under_sampling import RandomUnderSampler, AllKNN, CondensedNearestNeighbour

In [70]:
X_resampled, y_resampled = RandomUnderSampler().fit_resample(X, y)

print(y_resampled.value_counts())
clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

1    49
0    49
Name: balance, dtype: int64

accuracy score:  0.5510204081632653

confusion matrix: 


array([[27, 22],
       [22, 27]], dtype=int64)

In [71]:
X_resampled, y_resampled = AllKNN().fit_resample(X, y)

print(y_resampled.value_counts())
clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

0    435
1     49
Name: balance, dtype: int64

accuracy score:  0.8987603305785123

confusion matrix: 


array([[435,   0],
       [ 49,   0]], dtype=int64)

In [72]:
X_resampled, y_resampled = CondensedNearestNeighbour().fit_resample(X, y)

print(y_resampled.value_counts())
clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

0    115
1     49
Name: balance, dtype: int64

accuracy score:  0.7012195121951219

confusion matrix: 


array([[115,   0],
       [ 49,   0]], dtype=int64)

## imblearn package for combined over- and under-sampling

In [73]:
from imblearn.combine import SMOTEENN, SMOTETomek

In [74]:
X_resampled, y_resampled = SMOTEENN().fit_resample(X, y)

print(y_resampled.value_counts())
clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

1    498
0    397
Name: balance, dtype: int64

accuracy score:  0.6

confusion matrix: 


array([[166, 231],
       [127, 371]], dtype=int64)

In [75]:
X_resampled, y_resampled = SMOTETomek().fit_resample(X, y)

print(y_resampled.value_counts())
clf = LinearSVC().fit(X_resampled, y_resampled)
# predict 
y_pred = clf.predict(X_resampled)

# print result
print("\naccuracy score: ", accuracy_score(y_resampled, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y_resampled,y_pred)

1    576
0    576
Name: balance, dtype: int64

accuracy score:  0.5842013888888888

confusion matrix: 


array([[349, 227],
       [252, 324]], dtype=int64)

## imblearn.ensemble

In [84]:
from imblearn.ensemble import BalancedRandomForestClassifier
clf = BalancedRandomForestClassifier(max_depth=10, random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)

print("\naccuracy score: ", accuracy_score(y, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y,y_pred)


accuracy score:  0.6528

confusion matrix: 


array([[359, 217],
       [  0,  49]], dtype=int64)

In [85]:
clf.n_classes_

2

In [87]:
from imblearn.ensemble import BalancedBaggingClassifier 
clf = BalancedBaggingClassifier(random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)

print("\naccuracy score: ", accuracy_score(y, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y,y_pred)


accuracy score:  0.8288

confusion matrix: 


array([[471, 105],
       [  2,  47]], dtype=int64)

In [88]:
from imblearn.ensemble import EasyEnsembleClassifier  
clf = EasyEnsembleClassifier(random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)

print("\naccuracy score: ", accuracy_score(y, y_pred))
print("\nconfusion matrix: ")
confusion_matrix(y,y_pred)


accuracy score:  0.368

confusion matrix: 


array([[196, 380],
       [ 15,  34]], dtype=int64)