# Class Imbalance
Class imbalance is common in real-world applications of classification algorithms. This example demonstrates class imbalance in a classification problem and methods to deal with it.

## Dataset
The data used in this example is from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/balance+scale). The data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The class is found by comparing (left-distance * left-weight) with (right-distance * right-weight): the scale tips toward the side with the greater product, and if the two products are equal, it is balanced.

Attribute Information:

1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)
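
The labelling rule described above can be sketched in a few lines. This is a minimal illustration rather than part of the original notebook: the helper name `balance_class` is hypothetical, and the enumeration assumes the data set contains every combination of the four attributes exactly once (5^4 = 625 rows).

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R', or 'B' by comparing the two weight * distance products."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if right > left:
        return "R"
    return "B"


# Assumption: each combination of the four attributes (values 1..5) appears once.
counts = Counter(balance_class(*row) for row in product(range(1, 6), repeat=4))
total = sum(counts.values())
for label, n in counts.most_common():
    print(label, n, round(n / total, 4))
```

Under that assumption the balanced class covers 49 of the 625 combinations, which is why `B` turns out to be the minority class.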


## Load packages and dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [4]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data', names=['balance','Left_Weight','Left_Distance','Right_Weight','Right_Distance'])
df.head()

|   | balance | Left_Weight | Left_Distance | Right_Weight | Right_Distance |
|---|---|---|---|---|---|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |


**Review the class distribution**

In [5]:
df['balance'].value_counts(normalize=True)

R    0.4608
L    0.4608
B    0.0784
Name: balance, dtype: float64

**Collapse the three classes into two: balanced (1) or not balanced (0)**

In [6]:
df['balance'] = [1 if b =='B' else 0 for b in df['balance']]
df['balance'].value_counts(normalize=True)

0    0.9216
1    0.0784
Name: balance, dtype: float64

## Example of applying a classification algorithm directly to an imbalanced dataset

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, f1_score, classification_report, precision_score, recall_score

In [10]:
# Get training set with input and output
y = df.balance
X = df.drop('balance', axis=1)

# train model
clf = LogisticRegression().fit(X,y)

# predict 
y_pred = clf.predict(X)

# print result
print("accuracy score: ", accuracy_score(y, y_pred))
print("\n confusion matrix: ")
confusion_matrix(y,y_pred)

accuracy score:  0.9216

 confusion matrix: 


array([[576,   0],
       [ 49,   0]], dtype=int64)

In [12]:
np.unique(y_pred)

array([0], dtype=int64)

In [14]:
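# roc_curve receives hard 0/1 predictions here, so it can only return the
# trivial endpoints; scores such as clf.predict_proba(X)[:, 1] would trace a full curve
roc_curve(y, y_pred)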

(array([0., 1.]), array([0., 1.]), array([1, 0], dtype=int64))

In [17]:
print(classification_report(y,y_pred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96       576
           1       0.00      0.00      0.00        49

    accuracy                           0.92       625
   macro avg       0.46      0.50      0.48       625
weighted avg       0.85      0.92      0.88       625



In [18]:
f1_score(y,y_pred)

0.0

In [20]:
precision_score(y,y_pred)

0.0

In [21]:
recall_score(y,y_pred)

0.0

These are the model's predictions on the training set. Every example is predicted as class 0, the majority class. The model reaches a high accuracy (0.92) while completely ignoring the minority class: precision, recall, and F1 for class 1 are all 0.
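
To make the gap between accuracy and the minority-class metrics concrete, the sketch below recomputes them by hand from the confusion matrix printed above. It uses only the four counts from that matrix; the variable names and the balanced-accuracy line are illustrative additions, not part of the original notebook.

```python
# Counts from the confusion matrix above (rows: true class, columns: predicted class).
tn, fp = 576, 0   # true class 0
fn, tp = 49, 0    # true class 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Recall and precision for the minority class (class 1).
# With no positive predictions, precision is undefined; it is reported
# as 0.0 here, matching the precision_score output above.
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

# Balanced accuracy: the mean of per-class recall, so each class counts equally.
recall_majority = tn / (tn + fp)
balanced_accuracy = (recall + recall_majority) / 2

print(f"accuracy:           {accuracy:.4f}")
print(f"minority recall:    {recall:.4f}")
print(f"minority precision: {precision:.4f}")
print(f"balanced accuracy:  {balanced_accuracy:.4f}")
```

The accuracy of 0.9216 is exactly the share of the majority class, while the balanced accuracy is 0.5, the same as the macro-averaged recall in the classification report. A metric that weights both classes equally exposes what plain accuracy hides.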
