#### Exercise: Handling imbalanced data in machine learning

1. Use [this notebook](https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/main/13_imbalanced/handling_imbalanced_data.ipynb) but handle the imbalanced data using a simple logistic regression model from the sklearn library. The original notebook uses a neural network; here, use sklearn logistic regression (or any other classification model) and improve the f1-score of the minority class using:
    1. Undersampling
    1. Oversampling: duplicate copies
    1. Oversampling: SMOTE
    1. Ensemble

    [Solution](https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/main/14_imbalanced/handling_imbalanced_data_exercise_solution_telecom_churn.ipynb)    
   
2. Take this dataset for bank customer churn prediction: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling
    1. Build a deep learning model to predict customer churn at the bank
    1. Once the model is built, print the classification report and analyze precision, recall and f1-score
    1. Improve the f1-score of the minority class using techniques such as undersampling, oversampling and ensembling
    
    [Solution](https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced/Handling%20Imbalanced%20Data%20In%20Customer%20Churn%20Using%20ANN/Bank%20Turnover%20Customer%20Churn%20Using%20ANN.ipynb)
    
    Thanks https://github.com/src-sohail for providing this solution.
     


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [39]:
df = pd.read_csv('D:/NLP_Codebasics/deep-learning-keras-tf-tutorial-codebasics/Chrun_Prediction_Exercise_Dataset/Churn_Modelling.csv')
df.head(2)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0


In [41]:
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
df.head(2)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0


In [43]:
df = pd.get_dummies(df, columns=['Geography', 'Gender'], dtype=int)
df.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,619,42,2,0.0,1,1,1,101348.88,1,1,0,0,1,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,0,1,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,1,0,0,1,0
3,699,39,1,0.0,2,0,0,93826.63,0,1,0,0,1,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,0,1,1,0


In [45]:
# Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']] = scaler.fit_transform(df[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']])
df.head(2)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,0.538,0.324324,0.2,0.0,0.0,1,1,0.506735,1,1,0,0,1,0
1,0.516,0.310811,0.1,0.334031,0.0,0,1,0.562709,0,0,0,1,1,0
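The cell above fits `MinMaxScaler` on the full DataFrame before any train/test split, so the test rows contribute to the learned minimum and maximum. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on a small synthetic frame; the column names are reused from the notebook, but the values and the `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the churn data (values are made up for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CreditScore": rng.uniform(300, 900, size=200),
    "Age": rng.uniform(18, 90, size=200),
    "Exited": rng.integers(0, 2, size=200),
})
num_cols = ["CreditScore", "Age"]

# Split first, then scale.
train_df, test_df = train_test_split(demo, test_size=0.2, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()

scaler = MinMaxScaler()
# Learn min/max from the training rows only...
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])
# ...and reuse those same min/max values for the test rows.
test_df[num_cols] = scaler.transform(test_df[num_cols])

print(train_df[num_cols].agg(["min", "max"]))
```

With this order the training columns span exactly 0 to 1, while test values may fall slightly outside that range, which is expected.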


In [47]:
X = df.drop(columns='Exited')
y = df['Exited']

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [51]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

In [53]:
y_pred = classifier.predict(X_test)

In [55]:
from sklearn.metrics import accuracy_score, classification_report
print(accuracy_score(y_test, y_pred))
print()
print(classification_report(y_test, y_pred))

0.816

              precision    recall  f1-score   support

           0       0.83      0.97      0.89      1604
           1       0.60      0.21      0.31       396

    accuracy                           0.82      2000
   macro avg       0.72      0.59      0.60      2000
weighted avg       0.79      0.82      0.78      2000
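
The report above shows why accuracy alone is misleading on this data: the model reaches 0.816 accuracy, yet recall for class 1 is only 0.21. The short check below uses only numbers printed in the report to reproduce the class 1 f1-score and to compare the accuracy with a model that always predicts class 0.

```python
# Numbers copied from the classification report above (class 1 = customers who exited).
precision_1 = 0.60
recall_1 = 0.21
support_0, support_1 = 1604, 396

# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy of a model that always predicts the majority class (0).
majority_baseline = support_0 / (support_0 + support_1)

print(f"f1 (class 1): {f1_1:.2f}")                            # f1 (class 1): 0.31
print(f"always-predict-0 accuracy: {majority_baseline:.3f}")  # always-predict-0 accuracy: 0.802
```

The baseline model is therefore only about 1.4 percentage points more accurate than always predicting "no churn", which is why the rest of the notebook focuses on the f1-score of class 1.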



In [75]:
class0_count, class1_count = df['Exited'].value_counts()

In [77]:
class0_count

7963

In [79]:
class1_count

2037

In [81]:
df_class0 = df[df['Exited'] == 0]
df_class1 = df[df['Exited'] == 1]

In [83]:
df_class1.shape

(2037, 14)

In [86]:
def feature_extraction(data):
    # Separate features and target, then make a stratified 80/20 train/test split
    X = data.drop(columns='Exited')
    y = data['Exited']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

    return X_train, X_test, y_train, y_test

In [125]:
def classification(X_train, X_test, y_train, y_test):
    # Fit logistic regression, print accuracy and the classification report,
    # and return the predictions for X_test
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print()
    print(classification_report(y_test, y_pred))
    return y_pred
        

## Undersampling

In [91]:
df_class0_under = df_class0.sample(class1_count)

df_under = pd.concat([df_class0_under, df_class1])
df_under['Exited'].value_counts()

Exited
0    2037
1    2037
Name: count, dtype: int64

In [93]:
X_train_under, X_test_under, y_train_under, y_test_under = feature_extraction(df_under)

In [95]:
classification(X_train_under, X_test_under, y_train_under, y_test_under)

0.698159509202454

              precision    recall  f1-score   support

           0       0.69      0.71      0.70       407
           1       0.71      0.68      0.69       408

    accuracy                           0.70       815
   macro avg       0.70      0.70      0.70       815
weighted avg       0.70      0.70      0.70       815
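
Note that `df_under` is balanced before `feature_extraction` splits it, so the test set is also roughly 50/50 (support 407 and 408). The report above is therefore not directly comparable with the baseline, whose test set had 1604 and 396 rows. One way to keep the comparison fair is to split first and undersample only the training rows. The sketch below shows that order on synthetic data; the helper name `undersample_train` and the `demo` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def undersample_train(train_df, target="Exited", random_state=0):
    """Downsample the majority class of a training frame to the minority size."""
    counts = train_df[target].value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()
    parts = []
    for label, group in train_df.groupby(target):
        if label == minority_label:
            parts.append(group)
        else:
            parts.append(group.sample(n_minority, random_state=random_state))
    return pd.concat(parts)

# Synthetic stand-in: 800 rows of class 0 and 200 rows of class 1.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Exited": [0] * 800 + [1] * 200,
})

# Split first; the test set keeps the original class ratio.
train_df, test_df = train_test_split(
    demo, test_size=0.2, stratify=demo["Exited"], random_state=0
)
# Balance only the training rows.
train_under = undersample_train(train_df)

print(train_under["Exited"].value_counts().to_dict())
print(test_df["Exited"].value_counts().to_dict())
```

Here the training frame ends up with 160 rows per class, while the test frame stays at 160 and 40 rows, mirroring the imbalance the model would face in practice.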



## Oversampling by Duplication

In [99]:
df_class1_over = df_class1.sample(class0_count, replace=True)

df_over = pd.concat([df_class0, df_class1_over])

df_over['Exited'].value_counts()

Exited
0    7963
1    7963
Name: count, dtype: int64

In [101]:
X_train_over, X_test_over, y_train_over, y_test_over = feature_extraction(df_over)
classification(X_train_over, X_test_over, y_train_over, y_test_over)

0.7065285624607659

              precision    recall  f1-score   support

           0       0.70      0.72      0.71      1593
           1       0.71      0.69      0.70      1593

    accuracy                           0.71      3186
   macro avg       0.71      0.71      0.71      3186
weighted avg       0.71      0.71      0.71      3186
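
Because the duplication here happens before the split, copies of the same minority row can land in both the training and the test set, which tends to make the test scores optimistic. The sketch below measures how often that happens on synthetic data; the frame and the sizes (200 minority rows oversampled to 800) are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic minority class: 200 distinct rows, identified by their index label.
minority = pd.DataFrame({"feature": rng.normal(size=200), "Exited": 1})

# Oversample by duplication first (the same order as in the cell above)...
oversampled = minority.sample(800, replace=True, random_state=0)

# ...then split 80/20 at random afterwards.
shuffled = oversampled.sample(frac=1.0, random_state=1)
train_part, test_part = shuffled.iloc[:640], shuffled.iloc[640:]

# A test row counts as "leaked" if a copy of the same original row is in the training part.
leaked = test_part.index.isin(train_part.index)
print(f"test rows with a duplicate in train: {leaked.mean():.0%}")
```

Splitting first and oversampling only the training rows avoids this overlap.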



## Oversampling by SMOTE

In [106]:
from imblearn.over_sampling import SMOTE

In [110]:
sampler = SMOTE()
X_sm, y_sm = sampler.fit_resample(X, y)

In [114]:
y_sm.value_counts()

Exited
1    7963
0    7963
Name: count, dtype: int64

In [120]:
df_sm = pd.concat([X_sm, y_sm], axis=1)

In [131]:
X_train_sm, X_test_sm, y_train_sm, y_test_sm = feature_extraction(df_sm)
classification(X_train_sm, X_test_sm, y_train_sm, y_test_sm)

0.7002510985561833

              precision    recall  f1-score   support

           0       0.70      0.70      0.70      1593
           1       0.70      0.70      0.70      1593

    accuracy                           0.70      3186
   macro avg       0.70      0.70      0.70      3186
weighted avg       0.70      0.70      0.70      3186



array([0, 1, 1, ..., 0, 0, 0], dtype=int64)
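
Unlike duplication, SMOTE creates new minority rows by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea with plain NumPy. It is a simplified illustration, not imblearn's implementation; the function name `smote_like` and the random input data are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)               # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one of their k neighbours
    gap = rng.random((n_new, 1))                                 # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_min = np.random.default_rng(1).random((20, 3))   # made-up minority features
X_new = smote_like(X_min, n_new=50)
print(X_new.shape)   # (50, 3)
```

Since every synthetic point lies on a segment between two real points, interpolating 0/1 columns such as `Geography_France` can yield fractional values. As with the other resampling methods, applying SMOTE before the split (as in the cells above) places synthetic rows in the test set as well.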

In [129]:
df[0:2]

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,0.538,0.324324,0.2,0.0,0.0,1,1,0.506735,1,1,0,0,1,0
1,0.516,0.310811,0.1,0.334031,0.0,0,1,0.562709,0,0,0,1,1,0


## Ensemble Method

In [157]:
df_ensemble = X_train.copy()
df_ensemble['Churn'] = y_train

In [159]:
df_en_class0 = df_ensemble[df_ensemble['Churn'] == 0]
df_en_class1 = df_ensemble[df_ensemble['Churn'] == 1]

In [161]:
df_en_class0.shape

(6359, 14)

In [163]:
df_en_class1.shape

(1641, 14)

In [175]:
def ensemble(df_class0, start, end, df_class1):
    # Combine one slice of majority-class rows with all minority-class rows
    df_new = pd.concat([df_class0[start:end], df_class1])
    X_train = df_new.drop(columns='Churn')
    y_train = df_new['Churn']
    return X_train, y_train

In [177]:
df_en_class0.shape[0]/df_en_class1.shape[0]

3.8750761730652044

In [179]:
df_en_class0.shape[0]/3

2119.6666666666665
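
The three majority-class slices used in the following cells (`0:2120`, `2120:4240`, `4240:6360`) can also be generated programmatically. The sketch below does this with `np.array_split` on a made-up frame that has the same number of rows as `df_en_class0`; the helper name `majority_chunks` is hypothetical.

```python
import numpy as np
import pandas as pd

def majority_chunks(df_majority, n_chunks):
    """Split the majority-class rows into n_chunks nearly equal, non-overlapping parts."""
    positions = np.array_split(np.arange(len(df_majority)), n_chunks)
    return [df_majority.iloc[idx] for idx in positions]

# Made-up frame with the same number of majority rows as df_en_class0 (6359).
demo_class0 = pd.DataFrame({"feature": np.arange(6359), "Churn": 0})

chunks = majority_chunks(demo_class0, n_chunks=3)
print([len(c) for c in chunks])   # [2120, 2120, 2119]
```

Each chunk can then be combined with all minority rows to train one model of the ensemble, so every majority row is used exactly once.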

In [183]:
X_train_1, y_train_1 = ensemble(df_en_class0, 0, 2120, df_en_class1)
y_pred_1 = classification(X_train_1, X_test, y_train_1, y_test)

0.764

              precision    recall  f1-score   support

           0       0.89      0.80      0.85      1604
           1       0.43      0.61      0.51       396

    accuracy                           0.76      2000
   macro avg       0.66      0.71      0.68      2000
weighted avg       0.80      0.76      0.78      2000



In [189]:
X_train_2, y_train_2 = ensemble(df_en_class0,2120, 2120+2120, df_en_class1)
y_pred_2 = classification(X_train_2, X_test, y_train_2, y_test)

0.769

              precision    recall  f1-score   support

           0       0.89      0.81      0.85      1604
           1       0.44      0.60      0.51       396

    accuracy                           0.77      2000
   macro avg       0.66      0.70      0.68      2000
weighted avg       0.80      0.77      0.78      2000



In [191]:
X_train_3, y_train_3 = ensemble(df_en_class0, 2120+2120, 2120+2120+2120, df_en_class1)
y_pred_3 = classification(X_train_3, X_test, y_train_3, y_test)

0.7695

              precision    recall  f1-score   support

           0       0.89      0.81      0.85      1604
           1       0.44      0.59      0.50       396

    accuracy                           0.77      2000
   macro avg       0.66      0.70      0.68      2000
weighted avg       0.80      0.77      0.78      2000



In [193]:
y_pred_1.shape

(2000,)

In [195]:
y_pred_2.shape

(2000,)

In [197]:
y_pred_3.shape

(2000,)

In [201]:
len(y_pred_1)

2000

In [203]:
y_pred_final = y_pred_1.copy()

In [207]:
y_pred_final[0]

0

In [209]:
# Majority vote: predict churn (1) when at least two of the three models predict 1
y_pred_final = y_pred_1.copy()
for i in range(len(y_pred_1)):
    s = y_pred_1[i] + y_pred_2[i] + y_pred_3[i]
    if s > 1:
        y_pred_final[i] = 1
    else:
        y_pred_final[i] = 0
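
The same majority vote can be written without an explicit loop. The sketch below uses three small made-up arrays in place of `y_pred_1`, `y_pred_2` and `y_pred_3` to show the vectorized form.

```python
import numpy as np

# Small made-up prediction vectors standing in for y_pred_1, y_pred_2, y_pred_3.
p1 = np.array([0, 1, 1, 0, 1])
p2 = np.array([0, 1, 0, 0, 1])
p3 = np.array([1, 1, 0, 0, 0])

# Stack to shape (3, n_samples) and count the votes for class 1 per sample.
votes = np.vstack([p1, p2, p3]).sum(axis=0)
# A sample is labelled 1 when at least 2 of the 3 models say 1.
y_vote = (votes > 1).astype(int)
print(y_vote)   # [0 1 0 0 1]
```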
        

In [213]:
print(classification_report(y_test, y_pred_final))

              precision    recall  f1-score   support

           0       0.89      0.81      0.85      1604
           1       0.44      0.60      0.50       396

    accuracy                           0.77      2000
   macro avg       0.66      0.70      0.68      2000
weighted avg       0.80      0.77      0.78      2000

