# Supervised Learning: Beta Bank

## Introduction

In this project I work with data from Beta Bank. The goal is to build a machine learning model that predicts whether a customer will leave the bank soon, so that the bank can maintain its customer count by keeping existing customers longer instead of relying on attracting new ones.

For this project I will be going through a few steps:

1. Data Overview
2. Data Preprocessing
3. Model Training

First, I will look over the data to get an idea of what I'm working with and note anything that needs attention later during preprocessing.

Next, I will preprocess the data, dealing with the issues that came up during the overview so that the data is clean before I start building models.

Lastly, I will train the models and tune their hyperparameters to get the best results.

## Data Overview

In [1]:
# Importing necessary libraries and packages

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer, classification_report, precision_score, recall_score, roc_auc_score, accuracy_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.utils import shuffle, resample
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
# Pulling data file

df = pd.read_csv('/datasets/Churn.csv')

# Getting the first 10 rows
df.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [3]:
# Pulling info on dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The info shows that the Tenure column has missing values: only 9,091 of its 10,000 rows are non-null. I will look into that during preprocessing.
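As a side note, the share of missing values per column can be read off directly instead of being inferred from `info()`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the bank data); on the real data the Tenure share would be 909 / 10000, roughly 9.1%.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, np.nan, 5.0],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
})

# Count and share of missing values per column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share.round(2))
```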

## Data Preprocessing

Now I will look more closely at the missing Tenure values.

In [4]:
df.Tenure.isna().sum()

909

In [5]:
df[df.Tenure.isna()]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.00,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.00,1,0,0,84509.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,9945,15703923,Cameron,744,Germany,Male,41,,190409.34,2,1,1,138361.48,0
9956,9957,15707861,Nucci,520,France,Female,46,,85216.61,1,1,0,117369.52,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
9985,9986,15586914,Nepean,659,France,Male,36,,123841.49,2,1,0,96833.00,0


In [6]:
df[df['Tenure'].notna()]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9995,15719294,Wood,800,France,Female,29,2.0,0.00,2,0,0,167773.55,0
9995,9996,15606229,Obijiaku,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1


In [7]:
print(df.Tenure.describe())

count    9091.000000
mean        4.997690
std         2.894723
min         0.000000
25%         2.000000
50%         5.000000
75%         7.000000
max        10.000000
Name: Tenure, dtype: float64


The Tenure values range from 0 to 10, and the mean (about 5.0) is almost equal to the median (5.0), so the distribution looks roughly symmetric with no outliers. Filling the NaN values with the column mean is therefore a reasonable choice.
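One caveat with filling before the train/test split is that the fill value is computed from rows that later end up in the test set. The effect is probably small here, but a minimal sketch of the alternative is shown below, assuming scikit-learn's `SimpleImputer` and a made-up Tenure column. The median strategy and the toy values are my assumptions, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up Tenure values with gaps; not the real bank data.
toy = pd.DataFrame(
    {"Tenure": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0]}
)

train, test = train_test_split(toy, test_size=0.3, random_state=0)

# Learn the fill value from the training rows only, then reuse it on the test rows.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train[["Tenure"]])
test_filled = imputer.transform(test[["Tenure"]])

print("Fill value learned from train:", imputer.statistics_[0])
```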

In [8]:
df.Tenure = df['Tenure'].fillna(df.Tenure.mean()).round(2)

In [9]:
df.tail(15)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9985,9986,15586914,Nepean,659,France,Male,36,5.0,123841.49,2,1,0,96833.0,0
9986,9987,15581736,Bartlett,673,Germany,Male,47,1.0,183579.54,2,0,1,34047.54,0
9987,9988,15588839,Mancini,606,Spain,Male,30,8.0,180307.73,2,1,1,1914.41,0
9988,9989,15589329,Pirozzi,775,France,Male,30,4.0,0.0,2,1,0,49337.84,0
9989,9990,15605622,McMillan,841,Spain,Male,28,4.0,0.0,2,1,1,179436.6,0
9990,9991,15798964,Nkemakonam,714,Germany,Male,33,3.0,35016.6,1,1,0,53667.08,0
9991,9992,15769959,Ajuluchukwu,597,France,Female,53,4.0,88381.21,1,1,0,69384.71,1
9992,9993,15657105,Chukwualuka,726,Spain,Male,36,2.0,0.0,1,1,0,195192.4,0
9993,9994,15569266,Rahman,644,France,Male,28,7.0,155060.41,1,1,0,29179.52,0
9994,9995,15719294,Wood,800,France,Female,29,2.0,0.0,2,0,0,167773.55,0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


All missing values are now filled, and the data is ready for modelling.

## Model Selection and Training

Now I will build the target and feature frames, dropping the columns that are not useful for training.

In [11]:
df.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


The RowNumber, CustomerId, and Surname columns are identifiers that will not be useful for training, so I will drop them. Geography and Gender are still stored as strings (object dtype), so I need to convert them to numerical values before the models can use them.

In [12]:
# Checking for imbalances
class_distribution = df.Exited.value_counts()
print(class_distribution)

0    7963
1    2037
Name: Exited, dtype: int64


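About 20% of customers (2,037 of 10,000) have closed their accounts. With this imbalance, a model can reach high accuracy by mostly predicting the majority class, which would lead to poor recall and F1 scores for the customers who leave.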
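To make the effect of the imbalance concrete, here is a small sketch of a "model" that always predicts that the customer stays. The class counts come from the `value_counts()` output above. The always-zero predictor is a hypothetical baseline I am adding for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the value_counts() output above.
n_stayed, n_exited = 7963, 2037
y_true = np.array([0] * n_stayed + [1] * n_exited)

# A baseline that always predicts the majority class (customer stays).
y_pred = np.zeros_like(y_true)

churn_rate = n_exited / (n_stayed + n_exited)
baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning: no positives are predicted at all.
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Churn rate: {churn_rate:.3f}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline F1: {baseline_f1:.3f}")
```

The baseline reaches almost 80% accuracy while never identifying a single leaving customer, which is why F1 is a better target metric than accuracy here.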

In [13]:
# Setting the target column
target = df.Exited
# One-hot encoding the Gender and Geography columns
gender_dummy = pd.get_dummies(df.Gender)
geo_dummy = pd.get_dummies(df.Geography)

df_new = pd.concat([df, gender_dummy, geo_dummy], axis=1)
df_new = df_new.drop(['Gender', 'Geography'], axis=1)


features = df_new.drop(['Exited', 'RowNumber', 'CustomerId', 'Surname'], axis=1) 
features.head(10)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Female,Male,France,Germany,Spain
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,1,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,1,0,0,0,1
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,1,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,1,0,1,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,1,0,0,0,1
5,645,44,8.0,113755.78,2,1,0,149756.71,0,1,0,0,1
6,822,50,7.0,0.0,2,1,1,10062.8,0,1,1,0,0
7,376,29,4.0,115046.74,4,1,0,119346.88,1,0,0,1,0
8,501,44,4.0,142051.07,2,0,1,74940.5,0,1,1,0,0
9,684,27,2.0,134603.88,1,1,1,71725.73,0,1,1,0,0
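The encoding above keeps every dummy column, so Female/Male and France/Germany/Spain are fully redundant with each other. That is harmless for tree models but can matter for Logistic Regression. A minimal sketch of a more compact variant, assuming `pd.get_dummies` with `drop_first=True` and a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Made-up rows using the same categorical values as the bank data.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [42, 41, 29, 50],
})

# drop_first=True removes one dummy per category, since it can be
# inferred from the remaining ones.
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)

print(encoded.columns.tolist())
```

With this variant the column names carry the original column as a prefix (for example `Geography_Spain`), which also makes the feature frame easier to read.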


In [14]:
target.head()

0    1
1    0
2    1
3    0
4    0
Name: Exited, dtype: int64

In [15]:
# Splitting the data into 70% training and 30% test
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.3, random_state=321)


# Creating a dictionary for multiple model training

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=123),
    "Logistic Regression": LogisticRegression(random_state=123),
    "Random Forest": RandomForestClassifier(random_state=123)
}

for name, model in models.items():
    model.fit(features_train, target_train)

In [16]:
# Making the prediction variable and testing the F1 score on test data

for name, model in models.items():
    predicted_test = model.predict(features_test)
    f1 = f1_score(target_test, predicted_test)
    print("F1 Score:", f1.round(3))

F1 Score: 0.523
F1 Score: 0.089
F1 Score: 0.586


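The scores are printed in the order Decision Tree, Logistic Regression, Random Forest. Without any tuning, the Random Forest model has the highest F1 score (0.586) of the three, while Logistic Regression barely detects the positive class (0.089). None of these models takes into account the fact that the classes are heavily imbalanced.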
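The split above does not pass `stratify`, so the share of leaving customers in the train and test parts can drift slightly from the overall 20%. A minimal sketch of a stratified split, using synthetic labels with the same rough imbalance (the random features and the 800/200 counts are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: random-noise features and an 80/20 label imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 800 + [1] * 200)

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=321, stratify=y
)

print("Positive share in train:", round(float(y_train.mean()), 3))
print("Positive share in test:", round(float(y_test.mean()), 3))
```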

In [17]:
# predicted_test still holds the predictions of the last model in the loop (Random Forest)
print(classification_report(target_test, predicted_test))

              precision    recall  f1-score   support

           0       0.87      0.97      0.92      2381
           1       0.79      0.47      0.59       619

    accuracy                           0.86      3000
   macro avg       0.83      0.72      0.75      3000
weighted avg       0.86      0.86      0.85      3000



In [18]:
precision = precision_score(target_test, predicted_test)
recall = recall_score(target_test, predicted_test)
f1 = f1_score(target_test, predicted_test)

print(f"Precision: {precision.round(3)}, Recall: {recall.round(3)}, F1 Score: {f1.round(3)}")

Precision: 0.791, Recall: 0.465, F1 Score: 0.586


The class imbalance results in poor scores for the positive class: recall is only 0.465, so the Random Forest misses more than half of the customers who actually leave.
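The F1 score reported above follows directly from precision and recall, since F1 is their harmonic mean. A short worked check using the two printed values:

```python
# Precision and recall as reported in the output above.
precision = 0.791
recall = 0.465

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.586
```

Because the harmonic mean is pulled towards the smaller value, the low recall keeps F1 well below the precision of 0.791.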

I will now try to improve the quality of the models and deal with the class imbalances.

In [19]:
# Setting up the model's variables
dt = DecisionTreeClassifier(random_state=123, class_weight='balanced')
# Added the balanced parameter to make the model more focused on the minority class
lr = LogisticRegression(random_state=123, class_weight='balanced') 
rf = RandomForestClassifier(random_state=123, class_weight='balanced')


# Defining the pipelines

pipeline_dt = Pipeline([
    ('scaler', StandardScaler()),
    ('model', dt)
])

pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('model', lr)
])

pipeline_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('model', rf)
])


In [20]:
# Setting up the hyperparameters for each model
param_grid_dt = {
    'model__max_depth': [5, 10, 15],
    'model__min_samples_split': [2, 5, 10]
}

param_grid_lr = {
    'model__C': [0.1, 1, 10],
    'model__solver': ['liblinear']
}

param_grid_rf = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [10, 20],
    'model__min_samples_split': [2, 5, 10]
}

# Performing GridSearch for all the models, prioritizing on the F1 scores

grid_search_dt = GridSearchCV(pipeline_dt, param_grid_dt, cv=5, scoring='f1', n_jobs=-1)
grid_search_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, scoring='f1', n_jobs=-1)
grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, scoring='f1', n_jobs=-1)

# Fitting the GridSearch parameters to the train dataset

grid_search_dt.fit(features_train, target_train)
grid_search_lr.fit(features_train, target_train)
grid_search_rf.fit(features_train, target_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('model',
                                        RandomForestClassifier(class_weight='balanced',
                                                               random_state=123))]),
             n_jobs=-1,
             param_grid={'model__max_depth': [10, 20],
                         'model__min_samples_split': [2, 5, 10],
                         'model__n_estimators': [100, 200]},
             scoring='f1')

Now that the grid search has fitted each model with its best parameters, I will compare their best cross-validated F1 scores.

In [21]:
# Printing the F1 scores
print("F1 Score for Decision Tree:", grid_search_dt.best_score_.round(3))
print("F1 Score for Logistic Regression:", grid_search_lr.best_score_.round(3))
print("F1 Score Random Forest:", grid_search_rf.best_score_.round(3))

F1 Score for Decision Tree: 0.56
F1 Score for Logistic Regression: 0.492
F1 Score Random Forest: 0.624


The Random Forest model again performs best, with a mean cross-validated F1 of 0.624 compared with 0.56 for the Decision Tree and 0.492 for Logistic Regression. I will focus on this model and re-tune it to see if I can improve the scores. I will also use upsampling as another way to deal with the class imbalance.
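Besides `best_score_`, a fitted grid search also exposes the winning parameter combination and the per-combination results. The sketch below shows how these could be inspected. It runs on synthetic data from `make_classification` with a deliberately tiny grid, both of which are my assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 80/20), standing in for the bank features.
X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=123, class_weight="balanced", n_estimators=20),
    param_grid={"max_depth": [3, 6], "min_samples_split": [2, 10]},
    cv=3,
    scoring="f1",
)
search.fit(X, y)

# best_score_ is the mean F1 across the CV folds, not a test-set score.
print("Best params:", search.best_params_)
print("Best mean CV F1:", round(search.best_score_, 3))

# One row per parameter combination, best first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values("rank_test_score"))
```

Looking at `std_test_score` helps judge whether the gap between two parameter combinations is larger than the fold-to-fold noise.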

In [22]:
# Upsampling the train data

# Merging the train data
merged_train = pd.concat([features_train, target_train], axis=1)

# Separate majority and minority classes
train_majority = merged_train[merged_train.Exited == 0]
train_minority = merged_train[merged_train.Exited == 1]

# Upsample minority class
train_minority_upsampled = resample(train_minority, 
                                 replace=True,        # Sample with replacement
                                 n_samples=len(train_majority), # Match number of majority class
                                 random_state=42)     # Set seed for reproducibility

# Combine majority class with upsampled minority class
train_upsampled = pd.concat([train_majority, train_minority_upsampled])

# Shuffle the combined dataset
train_upsampled = train_upsampled.sample(frac=1, random_state=42).reset_index(drop=True)

features_train_upsampled = train_upsampled.drop(['Exited'], axis=1)
target_train_upsampled = train_upsampled.Exited


In [23]:
train_upsampled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11164 entries, 0 to 11163
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      11164 non-null  int64  
 1   Age              11164 non-null  int64  
 2   Tenure           11164 non-null  float64
 3   Balance          11164 non-null  float64
 4   NumOfProducts    11164 non-null  int64  
 5   HasCrCard        11164 non-null  int64  
 6   IsActiveMember   11164 non-null  int64  
 7   EstimatedSalary  11164 non-null  float64
 8   Female           11164 non-null  uint8  
 9   Male             11164 non-null  uint8  
 10  France           11164 non-null  uint8  
 11  Germany          11164 non-null  uint8  
 12  Spain            11164 non-null  uint8  
 13  Exited           11164 non-null  int64  
dtypes: float64(3), int64(6), uint8(5)
memory usage: 839.6 KB


Now I have the upsampled training data, with both classes equally represented, and will use it for further testing.
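One thing to keep in mind with upsampled data is that copies of the same minority row can land in both the training and validation folds if cross-validation is run on it directly, which inflates the scores. A minimal sketch of one way around this is to upsample inside each fold, so the validation part is never resampled. The helper `upsample_minority` and the synthetic data are hypothetical, and the helper assumes the minority class is labelled 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def upsample_minority(X, y, random_state=42):
    """Resample the minority class (label 1) up to the majority class size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_min_up = resample(
        X_min, replace=True, n_samples=len(X_maj), random_state=random_state
    )
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate(
        [np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]
    )
    return X_bal, y_bal


# Synthetic imbalanced data (about 80/20), standing in for the bank features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X, y):
    # Upsample only the training part of the fold; validation stays untouched.
    X_bal, y_bal = upsample_minority(X[train_idx], y[train_idx])
    model = RandomForestClassifier(n_estimators=50, random_state=123)
    model.fit(X_bal, y_bal)
    fold_scores.append(f1_score(y[valid_idx], model.predict(X[valid_idx])))

print("Fold F1 scores:", np.round(fold_scores, 3))
print("Mean F1:", round(float(np.mean(fold_scores)), 3))
```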

In [24]:
# Setting the parameters manually instead of running a grid search on the upsampled data,
# to avoid distorting the cross-validation process

# Reinitializing the model with adjusted parameters
rf = RandomForestClassifier(
    random_state=123,
    class_weight='balanced',
    n_estimators=150,
    bootstrap=True,
    min_samples_split=5,
    min_samples_leaf=4
)

rf.fit(features_train_upsampled, target_train_upsampled)

RandomForestClassifier(class_weight='balanced', min_samples_leaf=4,
                       min_samples_split=5, n_estimators=150, random_state=123)

In [25]:
# Predict on the test set
predicted_test = rf.predict(features_test)


# Display metrics
print(f"Accuracy: {accuracy_score(target_test, predicted_test).round(3)}")
print(f"F1 Score: {f1_score(target_test, predicted_test).round(2)}")
print()
print(classification_report(target_test, predicted_test))

Accuracy: 0.841
F1 Score: 0.61

              precision    recall  f1-score   support

           0       0.90      0.90      0.90      2381
           1       0.61      0.61      0.61       619

    accuracy                           0.84      3000
   macro avg       0.76      0.76      0.76      3000
weighted avg       0.84      0.84      0.84      3000



With all the methods combined, the model now performs above the required threshold, reaching an F1 score of 0.61 on the test set. I am ready to perform the final testing and to cross-check with the AUC-ROC.

## Final Testing

Now I'm going to perform the final testing with the test data.

In [26]:
# Getting the best model
best_model = rf

# Calculating the F1 score
Y_pred = best_model.predict(features_test)

f1 = f1_score(target_test, Y_pred)
print("F1 Score on Test Data:", f1.round(2))

F1 Score on Test Data: 0.61


In [27]:
print(classification_report(target_test, Y_pred))

              precision    recall  f1-score   support

           0       0.90      0.90      0.90      2381
           1       0.61      0.61      0.61       619

    accuracy                           0.84      3000
   macro avg       0.76      0.76      0.76      3000
weighted avg       0.84      0.84      0.84      3000



In [28]:
# Getting the probabilities for the AUC-ROC
target_proba = best_model.predict_proba(features_test)[:, 1]

# Calculate the AUC-ROC score
auc_roc = roc_auc_score(target_test, target_proba)
print("AUC-ROC Score on Test Data:", auc_roc.round(3))

AUC-ROC Score on Test Data: 0.851
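For intuition, the AUC-ROC can be read as the probability that a randomly chosen leaving customer receives a higher predicted probability than a randomly chosen staying customer. The sketch below checks this pairwise interpretation against `roc_auc_score` on a tiny made-up example (the labels and scores are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny made-up example: true labels and predicted churn probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Share of (positive, negative) pairs where the positive gets the higher score;
# ties count as half a win.
pairs = [float(p > n) + 0.5 * float(p == n) for p in pos for n in neg]
auc_by_pairs = float(np.mean(pairs))

auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_by_pairs, auc_sklearn)
```

Both values agree, so a score of 0.851 means the model orders such a pair correctly about 85% of the time, independent of any decision threshold.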


In [29]:
# Cross-validating the model on the full (non-upsampled) dataset
f1_scores = cross_val_score(rf, features, target, cv=5, scoring='f1')
print(f"Cross-validated F1 Scores: {f1_scores.round(3)}")
print(f"Mean F1 Score: {f1_scores.mean().round(3)}")

Cross-validated F1 Scores: [0.613 0.647 0.606 0.623 0.627]
Mean F1 Score: 0.623


After addressing the class imbalance, the model achieves satisfactory results: an F1 score of 0.61 on the test set, a mean cross-validated F1 of 0.623, and an AUC-ROC of 0.851. The AUC-ROC indicates that the model distinguishes well between the classes overall, while the lower F1 score shows that at the default decision threshold it still misses a sizeable share of the minority class (recall 0.61).
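Since the F1 score depends on the decision threshold while the AUC-ROC does not, one possible follow-up is to scan thresholds on a validation split instead of relying on the default 0.5 cut-off. The sketch below is only an illustration under assumptions: it uses synthetic data and a hypothetical validation split, and the threshold should be chosen on validation data, not on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 80/20), standing in for the bank features.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=123)
model.fit(X_train, y_train)
proba = model.predict_proba(X_valid)[:, 1]

# Evaluate F1 at a range of cut-offs instead of only the default 0.5.
thresholds = np.arange(0.1, 0.91, 0.05)
scores = [
    f1_score(y_valid, (proba >= t).astype(int), zero_division=0)
    for t in thresholds
]

best = int(np.argmax(scores))
print(f"Best threshold: {thresholds[best]:.2f}, F1: {scores[best]:.3f}")
```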

## Conclusion

The target column is highly imbalanced, which led to poor F1 scores in the baseline models. Hyperparameter tuning combined with class weighting raised the scores and showed that the Random Forest model was the best option for this data, but the improvement was modest, so the class imbalance had to be addressed further. I chose to upsample the minority class in the training data to avoid data loss, and re-adjusted the Random Forest parameters while keeping the class weighting. The results then reached satisfactory levels, with an F1 score of 0.61 and an AUC-ROC of 0.851 on the test data. The model is tuned and ready for the task at hand.