# Imbalanced classes
Imbalanced classes appear in many domains, including:
* Fraud detection
* Spam filtering
* Disease screening
* SaaS subscription churn

## Techniques for handling imbalanced classes
The following techniques are used to tackle imbalanced classes:
1. Up-sample Minority Class
2. Down-sample Majority Class

## Algorithms
The following algorithms are used to build a classification model:
1. Logistic Regression Algorithm
2. SVM Algorithm
3. Random Forest Algorithm

## Validation
Accuracy (`accuracy_score`) and Area Under the ROC Curve (AUROC, `roc_auc_score`) are used for validation.

## Notes
* Focus purely on addressing imbalanced classes.
* Do not split the dataset into training and test sets; every score below is computed on the data the model was trained on.
* Do not tune hyperparameters or implement cross-validation.


## Input Dataset
Input: Download `balance-scale.data` from http://archive.ics.uci.edu/ml/datasets/balance+scale and rename it to `balance_scale_data.txt`.

The dataset records whether a scale is balanced, based on the weights and distances on its two arms.
* Target variable: balance
* Input features: var1 to var4

The target variable has 3 classes:
* R for right-heavy, i.e. when var3 * var4 > var1 * var2
* L for left-heavy, i.e. when var3 * var4 < var1 * var2
* B for balanced, i.e. when var3 * var4 = var1 * var2
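
The labeling rule above can be written as a small function. This is a minimal sketch, not part of the original notebook: the name `label_balance` is hypothetical, and the enumeration assumes that each feature takes the integer values 1 to 5 (consistent with the 625 rows in the dataset, but not stated in this document).

```python
from collections import Counter
from itertools import product


def label_balance(var1, var2, var3, var4):
    """Return 'R', 'L', or 'B' by comparing var3 * var4 with var1 * var2."""
    left = var1 * var2
    right = var3 * var4
    if right > left:
        return "R"
    if right < left:
        return "L"
    return "B"


# Assumption: every feature ranges over the integers 1..5 (5**4 = 625 rows).
counts = Counter(label_balance(*row) for row in product(range(1, 6), repeat=4))
print(counts["L"], counts["R"], counts["B"])  # 288 288 49
```

Under that assumption the rule reproduces the class counts reported by `value_counts()` further down, which is a quick sanity check that the rule and the data agree.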

The target variable 'balance' is converted to a binary indicator variable:
* 0: balance is R or L
* 1: balance is B

## Import packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

## Read input and create an indicator variable

In [2]:
df = pd.read_csv('balance_scale_data.txt', 
                 names=['balance', 'var1', 'var2', 'var3', 'var4'])
df.head()

  balance  var1  var2  var3  var4
0       B     1     1     1     1
1       R     1     1     1     2
2       R     1     1     1     3
3       R     1     1     1     4
4       R     1     1     1     5


In [3]:
df['balance'].value_counts() 

L    288
R    288
B     49
Name: balance, dtype: int64

In [4]:
# Transform into binary classification
df['balance'] = [1 if b=='B' else 0 for b in df.balance]
df['balance'].value_counts() 

0    576
1     49
Name: balance, dtype: int64

## Separate input features (X) and target variable (y)

In [5]:
y = df.balance
X = df.drop('balance', axis=1)

## Train a model, predict on the training set, and get the accuracy
Notes:
* The accuracy is high (0.9216), but only because the model predicts 0 for every observation.
* The minority class is ignored completely.

In [11]:
clf_0 = LogisticRegression().fit(X, y)
pred_y_0 = clf_0.predict(X)
print("Score Accuracy:{} ".format(accuracy_score(y, pred_y_0)))  # signature is (y_true, y_pred)
print("unique value from predicted result: {}".format(np.unique(pred_y_0)))

Score Accuracy:0.9216 
unique value from predicted result: [0]
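
The 0.9216 score is exactly what any model that always predicts the majority class would achieve, which can be checked from the class counts alone. A minimal sketch in plain Python; the variable names are illustrative and the code does not use the fitted model.

```python
# Class counts taken from the value_counts() output above.
n_majority = 576  # balance == 0
n_minority = 49   # balance == 1

y_true = [0] * n_majority + [1] * n_minority
y_pred = [0] * len(y_true)  # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_minority

print(accuracy)         # 0.9216
print(minority_recall)  # 0.0
```

High accuracy and zero recall on the minority class can coexist, which is why accuracy alone is a poor guide on imbalanced data.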


## Up-sample Minority Class
Up-sampling randomly duplicates observations from the minority class (sampling with replacement) until it matches the size of the majority class.

In [12]:

# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]
 
# Upsample minority class 
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
print(df_upsampled.balance.value_counts())

1    576
0    576
Name: balance, dtype: int64


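## Use the up-sampled dataset to train a model and get the accuracy

Note: The model is no longer predicting just one class. Accuracy is much lower (about 0.51), but it is now a more meaningful number because both classes are equally represented.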

In [16]:
y = df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)
clf_1 = LogisticRegression().fit(X, y)
pred_y_1 = clf_1.predict(X) 
print(np.unique(pred_y_1))
print("Score Accuracy:{} ".format(accuracy_score(y, pred_y_1)))  # signature is (y_true, y_pred)



[0 1]
Score Accuracy:0.5138888888888888 


## Down-sample Majority Class
Down-sampling randomly removes observations from the majority class (sampling without replacement) until it matches the size of the minority class.

In [20]:
# Downsample majority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,    # sample without replacement
                                   n_samples=49,     # to match minority class
                                   random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
print(df_downsampled.balance.value_counts())

1    49
0    49
Name: balance, dtype: int64


## Use the down-sampled dataset to train a model and get the accuracy
Note: The model is no longer predicting just one class.

In [21]:
# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)
 
# Train model
clf_2 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_2 = clf_2.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_2 ) )  # [0 1]
print("Score Accuracy:{} ".format(accuracy_score(y, pred_y_2)))  # signature is (y_true, y_pred)

[0 1]
Score Accuracy:0.5816326530612245 


## Change Your Performance Metric
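Area Under the ROC Curve (AUROC) measures how well the predicted probabilities rank positive observations above negative ones: 0.5 corresponds to a random ranking and 1.0 to a perfect one. Unlike accuracy, it does not depend on a single classification threshold.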

In [22]:
# Predict class probabilities
prob_y_2 = clf_2.predict_proba(X)
 
# Keep only the positive class
prob_y_2 = [p[1] for p in prob_y_2]
print( roc_auc_score(y, prob_y_2) ) 

0.5680966264056644
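
One way to read an AUROC value is as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below illustrates that definition in plain Python. It is not scikit-learn's implementation, the function name `auroc_by_pairs` is hypothetical, and the toy labels and scores are made up for illustration.

```python
def auroc_by_pairs(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 2 negatives, 2 positives.
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(auroc_by_pairs(y_toy, scores_toy))            # 0.75

# A constant score carries no ranking information.
print(auroc_by_pairs(y_toy, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

Read this way, the value of about 0.57 above means the down-sampled logistic regression ranks a positive above a negative only slightly more often than chance.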


## Apply the SVC algorithm to the original dataset

In [24]:
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)
 
# Train model
clf_3 = SVC(kernel='linear',
            class_weight='balanced', # penalize mistakes on the minority class more heavily
            probability=True)        # needed for predict_proba
 
clf_3.fit(X, y)
 
# Predict on training set
pred_y_3 = clf_3.predict(X)
 
# Is the model still predicting just one class?
print( np.unique( pred_y_3 ) ) # [0 1]
 
# How's the accuracy?
print("Score Accuracy:{} ".format(accuracy_score(y, pred_y_3)))  # signature is (y_true, y_pred)

[0 1]
Score Accuracy:0.688 


In [25]:
# What about AUROC?
prob_y_3 = clf_3.predict_proba(X)
prob_y_3 = [p[1] for p in prob_y_3]
print( roc_auc_score(y, prob_y_3) )

0.46947633219954643
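
The `class_weight='balanced'` option used above weights each class in inverse proportion to its frequency; the scikit-learn documentation gives the formula `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates that formula in plain Python for the class counts in this dataset (576 and 49). The variable names are illustrative, and the code does not call scikit-learn itself.

```python
# Class counts taken from the value_counts() output earlier in this document.
class_counts = {0: 576, 1: 49}

n_samples = sum(class_counts.values())  # 625
n_classes = len(class_counts)           # 2

# Documented 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * count)
           for label, count in class_counts.items()}

for label, weight in weights.items():
    print(label, round(weight, 4))
# 0 0.5425
# 1 6.3776
```

An error on a minority-class observation is therefore penalized roughly 11.8 times (576 / 49) as heavily as an error on a majority-class observation.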


## Apply the Random Forest algorithm to the original dataset
Note:
Accuracy and AUROC are both excellent. However, they are computed on the training data, so the model may be overfitting.

In [27]:
clf_4 = RandomForestClassifier()
clf_4.fit(X, y)
 
# Predict on training set
pred_y_4 = clf_4.predict(X)
 
# Is the model still predicting just one class?
print( np.unique( pred_y_4 ) ) 
 
# How's the accuracy?
print("Score Accuracy:{} ".format(accuracy_score(y, pred_y_4)))  # signature is (y_true, y_pred)
 
# What about AUROC?
prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print(  "roc_auc_score: {}".format(roc_auc_score(y, prob_y_4)) ) 

[0 1]
Score Accuracy:0.9776 
roc_auc_score: 0.9967403628117915
