# EXERCISE: IMBALANCED CLASSIFICATION

Still using the Loan dataset, combine cross-validated hyperparameter tuning with the following methods:
* Resampling: undersampling
* Penalized (class weighting)

Email to Brigita.gems@gmail.com  
Subject: imbalance classification

For this bankloan data we optimize the F1 score, with two goals:
- reduce the number of applicants predicted to default who could actually repay (false positives); these are lost prospective customers
- reduce the number of applicants predicted to repay who actually cannot (false negatives); these cost the company money
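
The two goals correspond to false positives and false negatives for the default class, and F1 penalizes both. The sketch below computes precision, recall, and F1 from raw confusion-matrix counts. The counts are illustrative assumptions, not values read from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class (default = 1) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted defaults, how many really default
    recall = tp / (tp + fn)     # of real defaults, how many were caught
    return 2 * precision * recall / (precision + recall)

# Illustrative counts (assumed, not from the dataset)
tp, fp, fn = 15, 7, 22
f1 = f1_from_counts(tp, fp, fn)
print(round(f1, 4))
```

Lowering either `fp` or `fn` raises F1, which is why it suits a problem where both kinds of mistake are costly.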

In [124]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from imblearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2
from sklearn.metrics import classification_report, f1_score

import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [125]:
bankloan = pd.read_csv('bankloan.csv')
bankloan.head(3)

   age  ed  employ  address  income  debtinc   creddebt   othdebt  default
0   41   3      17       12     176      9.3  11.359392  5.008608        1
1   27   1      10        6      31     17.3   1.362202  4.000798        0
2   40   1      15       14      55      5.5   0.856075  2.168925        0


In [126]:
fitur = ['employ', 'debtinc', 'creddebt', 'othdebt']
target = 'default'

X = bankloan[fitur]
y = bankloan[target]

## EDA

In [127]:
bankloan['default'].value_counts()/bankloan.shape[0]*100

0    73.857143
1    26.142857
Name: default, dtype: float64
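
With roughly 74% non-defaults, accuracy alone is misleading. As a rough sketch, a model that always predicts class 0 reaches about 74% accuracy yet has an F1 of 0 for the default class. The counts below (517 and 183 out of 700 rows) are inferred from the percentages above and the 140-row test set, so treat them as an assumption.

```python
n_neg, n_pos = 517, 183  # assumed counts, inferred from 73.86% / 26.14% of 700 rows

# Baseline that always predicts "no default" (class 0)
tp, fp = 0, 0
fn, tn = n_pos, n_neg

accuracy = (tp + tn) / (n_neg + n_pos)
# With no positive predictions precision is undefined; F1 is conventionally reported as 0
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```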

## Data Splitting

In [128]:
# split the full data into two parts: train_val and test
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, 
    y, 
    stratify = y,
    test_size = 0.2, 
    random_state = 1899)

# # this split is for threshold tuning
# # split train_val into two parts: train and val
# X_train, X_val, y_train, y_val = train_test_split(
#     X_train_val,
#     y_train_val, 
#     stratify = y_train_val,
#     test_size = 0.25, 
#     random_state = 1899)
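
Passing `stratify = y` keeps the class ratio roughly the same in both parts of the split. The idea can be sketched in plain Python as below. This is a simplified illustration, not scikit-learn's actual implementation, and `stratified_split` is a hypothetical helper. The label counts are assumed from the class percentages shown earlier.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly equal class ratios in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with the same imbalance as the loan data (assumed counts)
labels = [0] * 517 + [1] * 183
train_idx, test_idx = stratified_split(labels)
test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_pos)
```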

## Modeling

## 1. BENCHMARK Default

- no resampling method
- no hyperparameter tuning (the "before" baseline)


In [129]:
# fit on the train_val set
logreg = LogisticRegression()
logreg.fit(X_train_val, y_train_val)

# predict on the test set
y_pred_1 = logreg.predict(X_test)

# classification report
print('Benchmark report:')
print(classification_report(y_test, y_pred_1))

Benchmark report:
              precision    recall  f1-score   support

           0       0.81      0.93      0.87       103
           1       0.68      0.41      0.51        37

    accuracy                           0.79       140
   macro avg       0.75      0.67      0.69       140
weighted avg       0.78      0.79      0.77       140



## 2. Resampling: Undersampling

In [130]:
# ALGORITHM CHAIN

# balancing
rus = RandomUnderSampler()

# model
model = LogisticRegression()

# pipeline
estimator = Pipeline([('balancing', rus), ('clf', model)]) 
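
As a rough picture of what the `balancing` step does, random undersampling keeps every minority-class row and draws an equally sized random subset of the majority class. The sketch below is a plain-Python illustration with made-up toy data. `random_undersample` is a hypothetical helper, not the imblearn implementation.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Keep all minority rows and an equally sized random subset of each other class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_keep = min(counts.values())  # size of the smallest class

    keep = []
    for label in counts:
        idx = [i for i, yi in enumerate(y) if yi == label]
        keep.extend(rng.sample(idx, n_keep))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 8 non-defaults and 3 defaults (made-up values)
X = [[i] for i in range(11)]
y = [0] * 8 + [1] * 3
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

Inside an imblearn `Pipeline` the sampler is applied only during `fit`, so the test set is never resampled. Because the subset is random, passing `random_state` to `RandomUnderSampler` would make the reported scores reproducible.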


In [131]:
# fit the model on the undersampled set (via the estimator pipeline)
estimator.fit(X_train_val, y_train_val)

# predict model
y_pred_2 = estimator.predict(X_test)

# classification report
print('Undersampling report:')
print(classification_report(y_test, y_pred_2))

Undersampling report:
              precision    recall  f1-score   support

           0       0.93      0.73      0.82       103
           1       0.53      0.84      0.65        37

    accuracy                           0.76       140
   macro avg       0.73      0.78      0.73       140
weighted avg       0.82      0.76      0.77       140



## 3. Penalized

In [132]:
# fit the model with class weighting (penalized)
model_balanced = LogisticRegression(class_weight='balanced')
model_balanced.fit(X_train_val, y_train_val)

# predict model
y_pred_3 = model_balanced.predict(X_test)

# classification report
print('Penalized report:')
print(classification_report(y_test, y_pred_3))

Penalized report:
              precision    recall  f1-score   support

           0       0.95      0.72      0.82       103
           1       0.53      0.89      0.67        37

    accuracy                           0.76       140
   macro avg       0.74      0.81      0.74       140
weighted avg       0.84      0.76      0.78       140
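
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`, so errors on the rarer class cost more. The sketch below reproduces that formula in plain Python. The train_val class counts (414 and 146) are inferred from the class percentages and the test-set support, so they are an assumption.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights following n_samples / (n_classes * count_per_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Assumed train_val counts: 414 non-defaults and 146 defaults (560 rows)
y_toy = [0] * 414 + [1] * 146
weights = balanced_class_weights(y_toy)
print({label: round(w, 3) for label, w in weights.items()})
```

Under these assumed counts a misclassified default weighs almost three times as much as a misclassified non-default, without discarding any rows.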



## 4. Resampling (Undersampling) with Hyperparameter Tuning

In [133]:
# Algorithm Chain

rus = RandomUnderSampler() # resampling
model = LogisticRegression() # model
estimator = Pipeline([('balancing',rus),('clf',model)]) 

# Hyperparameter space
hyperparam_space = {
    'clf__C':[100, 10, 1, 0.1, 0.01, 0.001],
    'clf__solver':['liblinear', 'newton-cg']
}

# Stratified cross-validation (number of folds)
skf = StratifiedKFold(n_splits=5)

# Hyperparameter Tuning
grid_search = GridSearchCV(
    estimator, # the pipeline (resampler + model) to tune
    param_grid = hyperparam_space, # candidate values, taken from the hyperparameter space above
    cv = skf, # cross-validation strategy
    scoring = 'f1', # optimize the F1 score
    n_jobs = -1 # use all cores
)
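
To see how much work this grid implies, the sketch below enumerates the candidate combinations in plain Python. The `clf__` prefix follows the `<step name>__<parameter>` convention, which routes each value to the pipeline step named `clf`. The variable names `candidates` and `n_fits` are only illustrative.

```python
from itertools import product

# Same grid as above; 'clf__C' means parameter C of the pipeline step named 'clf'
hyperparam_space = {
    'clf__C': [100, 10, 1, 0.1, 0.01, 0.001],
    'clf__solver': ['liblinear', 'newton-cg'],
}
n_splits = 5

names = list(hyperparam_space)
candidates = [dict(zip(names, values)) for values in product(*hyperparam_space.values())]

n_fits = len(candidates) * n_splits  # one fit per candidate per fold
print(len(candidates), n_fits)
print(candidates[0])
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the whole train_val set.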


In [134]:
# fit the grid search (runs the hyperparameter tuning)
grid_search.fit(X_train_val, y_train_val)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('balancing', RandomUnderSampler()),
                                       ('clf', LogisticRegression())]),
             n_jobs=-1,
             param_grid={'clf__C': [100, 10, 1, 0.1, 0.01, 0.001],
                         'clf__solver': ['liblinear', 'newton-cg']},
             scoring='f1')

In [135]:
# view the best score and best parameters
print('best score:', grid_search.best_score_)
print('best param:', grid_search.best_params_)

best score: 0.6242486297043421
best param: {'clf__C': 0.01, 'clf__solver': 'liblinear'}


In [136]:
# view the best estimator (same settings as best_params_ above)
print(grid_search.best_estimator_)

Pipeline(steps=[('balancing', RandomUnderSampler()),
                ('clf', LogisticRegression(C=0.01, solver='liblinear'))])


In [137]:
# predict 
y_pred_4 = grid_search.best_estimator_.predict(X_test)

In [138]:
# classification report
print('Undersampling after Hyperparameter Tuning')
print(classification_report(y_test, y_pred_4))

Undersampling after Hyperparameter Tuning
              precision    recall  f1-score   support

           0       0.91      0.62      0.74       103
           1       0.44      0.84      0.58        37

    accuracy                           0.68       140
   macro avg       0.68      0.73      0.66       140
weighted avg       0.79      0.68      0.70       140



## 5. Penalized with Hyperparameter Tuning

In [139]:
# model
model_balanced = LogisticRegression(class_weight='balanced')

# hyperparameter space
hyperparam_space = {
    'C':[100, 10, 1, 0.1, 0.01, 0.001], 
    'solver':['liblinear','newton-cg']  
}

# stratified cross validation
skf = StratifiedKFold(n_splits = 5)

# hyperparameter tuning
grid_search = GridSearchCV(
    model_balanced, # model to tune
    param_grid = hyperparam_space, # hyperparameter space
    cv = skf, # cross-validation strategy
    scoring ='f1', # metrics
    n_jobs = -1 # use all cores
)

In [140]:
# fit the grid search (runs the hyperparameter tuning)
grid_search.fit(X_train_val, y_train_val)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [100, 10, 1, 0.1, 0.01, 0.001],
                         'solver': ['liblinear', 'newton-cg']},
             scoring='f1')

In [141]:
# view the best score and best parameters
print('best score:', grid_search.best_score_)
print('best param:', grid_search.best_params_)

best score: 0.6220290204014038
best param: {'C': 0.01, 'solver': 'liblinear'}


In [142]:
grid_search.best_estimator_

LogisticRegression(C=0.01, class_weight='balanced', solver='liblinear')

In [143]:
# predict
y_pred_5 = grid_search.best_estimator_.predict(X_test)

In [144]:
# classification report
print('Penalized after Hyperparameter Tuning')
print(classification_report(y_test, y_pred_5))

Penalized after Hyperparameter Tuning
              precision    recall  f1-score   support

           0       0.93      0.66      0.77       103
           1       0.48      0.86      0.62        37

    accuracy                           0.71       140
   macro avg       0.70      0.76      0.69       140
weighted avg       0.81      0.71      0.73       140



# SUMMARY

In [145]:
f1_1 = f1_score(y_test, y_pred_1)
f1_2 = f1_score(y_test, y_pred_2)
f1_3 = f1_score(y_test, y_pred_3)
f1_4 = f1_score(y_test, y_pred_4)
f1_5 = f1_score(y_test, y_pred_5)

In [146]:
score_list = [f1_1, f1_2, f1_3, f1_4, f1_5] 
model_names = ['Default', 'Undersampling', 'Penalized', 'Undersampling with Tuning', 'Penalized with Tuning']
df_summary = pd.DataFrame({
    'method':model_names,
    'score':score_list
})
df_summary

                      method     score
0                    Default  0.508475
1              Undersampling  0.645833
2                  Penalized  0.666667
3  Undersampling with Tuning  0.579439
4      Penalized with Tuning  0.615385


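# Conclusion

Based on test-set F1, the best model is the penalized one (algorithm-level, class_weight='balanced') without hyperparameter tuning (F1 = 0.667). Tuning selected the parameters with the best cross-validated F1 (about 0.62 for both tuned models), but this did not carry over to a higher F1 on the 140-row test set.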
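
The comparison can be double-checked with a few lines of plain Python. The sketch below uses the test-set F1 values from the summary table and the `best_score_` values printed by the two grid searches, rounded to four decimals. The dictionary names are only illustrative.

```python
# Test-set F1 per method (from the summary table)
test_f1 = {
    'Default': 0.508475,
    'Undersampling': 0.645833,
    'Penalized': 0.666667,
    'Undersampling with Tuning': 0.579439,
    'Penalized with Tuning': 0.615385,
}
# Mean cross-validated F1 of the tuned models (best_score_, rounded)
cv_f1 = {
    'Undersampling with Tuning': 0.6242,
    'Penalized with Tuning': 0.6220,
}

best_method = max(test_f1, key=test_f1.get)
print('best on test set:', best_method)

# Gap between cross-validated and test F1 for the tuned models
gaps = {name: round(cv - test_f1[name], 3) for name, cv in cv_f1.items()}
print(gaps)
```

A small test set makes such gaps noisy, so the ranking between close methods should be read with caution.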

In [None]:
# STEPS IN A MACHINE LEARNING WORKFLOW:

1. Explore the data: decide what to do with the data, identify the target, and choose a suitable type of machine learning.

2. Preprocessing: handle missing data and outliers (this can also be done during exploration), then apply scaling, encoding, feature engineering (e.g. polynomial features), and feature selection if needed.

3. Model selection: use cross-validation to choose a model that is stable and performs best.

4. Model optimization: imbalanced-classification handling, hyperparameter tuning.

5. Evaluation: find the model with the best performance. If the results are unsatisfactory, go back to the earlier steps.

Pro tip: it is better to run the preprocessing-to-optimization steps inside a pipeline to prevent data leakage (algorithm chain, model evaluation method).
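
The leakage mentioned in the pro tip can be illustrated with a scaling step. In the sketch below, preprocessing statistics computed on all rows differ from those learned on the training rows only. A pipeline inside cross-validation refits each step on the training folds, which corresponds to the leak-free variant. The income values and the helpers `fit_scaler` and `transform` are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn standardization parameters (mean, std) from the given values only."""
    return mean(values), pstdev(values)

def transform(values, params):
    """Standardize values with previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Made-up incomes; the last two rows play the role of the held-out fold
income = [31, 55, 28, 120, 176, 40, 64, 400]
train, held_out = income[:6], income[6:]

# Leaky: statistics computed on all rows, so held-out rows influence preprocessing
leaky_params = fit_scaler(income)

# Leak-free: statistics learned from the training rows only, then reused on held-out rows
clean_params = fit_scaler(train)
held_out_scaled = transform(held_out, clean_params)

print(leaky_params != clean_params)
```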