# Imbalanced Classification Project - Beta Bank



## 1. Defining the Question

### a) Specifying the Data Analysis Question

Beta Bank needs to predict whether a customer will leave the bank soon.

### b) Defining the Metric for Success

We will have accomplished our objective if we build a model with an F1 score above 0.59 that predicts whether a customer will leave the bank soon.
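The F1 score is the harmonic mean of precision and recall, so it only rewards a model that finds leavers without flooding the bank with false alarms. The sketch below computes it from confusion-matrix counts; the function name `f1_from_counts` and the counts themselves are hypothetical, chosen only to illustrate the 0.59 threshold.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 60 leavers caught, 30 false alarms, 40 leavers missed.
tp, fp, fn = 60, 30, 40
score = f1_from_counts(tp, fp, fn)
print(f"F1 = {score:.3f}; meets the 0.59 target: {score > 0.59}")
```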

### c) Understanding the Context

Beta Bank customers are leaving: little by little, chipping away every month. The bankers
figured out it's cheaper to retain existing customers than to attract new ones.
We need to predict whether a customer will leave the bank soon. Data on
clients' past behavior and termination of contracts with the bank is available.


### d) Recording the Experimental Design

1. Load libraries and datasets.
2. Prepare the data.
3. Analyze the data.
4. Build the models.
5. Conclusions and recommendations.

### e) Data Relevance

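The dataset is relevant to the research question: it records each customer's profile and account activity together with an `Exited` flag that marks whether the customer left the bank.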

## 2. Data Cleaning & Analysis

In [None]:
# Loading the required libraries

import pandas as pd

import numpy as np



In [None]:
#read dataset
churn_df=pd.read_csv("https://bit.ly/2XZK7Bo")

churn_df.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [None]:
churn_df.shape

(10000, 14)

In [None]:
#check null values in each column
churn_df.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64
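Only `Tenure` has gaps: 909 of 10,000 rows, about 9.1%. The sketch below reproduces this kind of check on a small hypothetical frame (the values are made up; only the column names mirror the dataset) and shows the row-dropping approach that the next cell applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, 1.0, np.nan],
    "Exited": [1, 0, 1, 0, 0],
})

# Share of missing values per column (the mean of a boolean mask).
missing_share = toy.isnull().mean()
print(missing_share)

# Keep only the rows where Tenure is present.
toy_clean = toy.dropna(subset=["Tenure"])
print(toy_clean.shape)
```

Dropping rows is the simplest option; it trades a smaller dataset for not having to guess the missing values.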

In [None]:

# Drop identifier columns that carry no predictive signal
churn_df = churn_df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

# Drop the 909 rows where Tenure is missing
churn_df = churn_df[churn_df['Tenure'].notna()]

# One-hot encode the non-numerical columns (Geography, Gender)
new_churn_df = pd.get_dummies(churn_df)

In [None]:
new_churn_df.head()

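| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 608 | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 502 | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | 699 | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 850 | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 0 | 1 | 1 | 0 |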
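One-hot encoding replaces each text column with one indicator column per category, which is why `Geography` and `Gender` turn into the five indicator columns shown above. A minimal sketch on hypothetical rows follows; `dtype=int` is an assumption made here so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
    "Age": [42, 41, 39],
})

# Numeric columns are kept as they are; text columns are expanded.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```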


In [None]:

from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Separate the target from the features
target = new_churn_df['Exited']
features = new_churn_df.drop(['Exited'], axis=1)

# Check class balance
class_0 = new_churn_df[new_churn_df['Exited'] == 0]
class_1 = new_churn_df[new_churn_df['Exited'] == 1]

# Print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)





class 0: (7237, 14)
class 1: (1854, 14)


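Observation: the classes are imbalanced. About 20% of the remaining customers (1,854 of 9,091) exited, which is roughly four stayers for every leaver.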
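With a class ratio this uneven, a purely random split can leave the training and validation sets with slightly different shares of leavers. The sketch below shows one option, the `stratify` argument of `train_test_split`, which keeps the ratio the same in both parts. It uses synthetic labels built from the counts printed above and a placeholder feature; the splits in this notebook do not stratify, so this is an optional variation rather than the method used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported above (7237 vs 1854).
y = np.array([0] * 7237 + [1] * 1854)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, illustration only

# stratify=y keeps the share of class 1 equal in both parts of the split.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

print(f"overall positive share: {y.mean():.3f}")
print(f"train positive share:   {y_train.mean():.3f}")
print(f"valid positive share:   {y_valid.mean():.3f}")
```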

## Models with imbalanced classes

In [None]:
# Check how the models behave on the imbalanced classes
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

# No random_state is set for the split, so the scores vary between runs
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25)

model = LogisticRegression(random_state=12345)
model1 = RandomForestClassifier()

# Fit both models on the training data
model.fit(features_train, target_train)
model1.fit(features_train, target_train)

# Predict on the validation data
predicted_valid = model.predict(features_valid)
predicted_valid1 = model1.predict(features_valid)

# Note: ROC AUC is computed from hard 0/1 predictions here, not from predicted probabilities
print('LogisticRegression: ROC AUC = ',str(round(roc_auc_score(target_valid, predicted_valid)*100,1)), '%')
print('LogisticRegression: F1-Score = ',str(round(f1_score(target_valid, predicted_valid)*100,1)), '%')

print('RandomForestClassifier: ROC AUC = ',str(round(roc_auc_score(target_valid, predicted_valid1)*100,1)), '%')
print('RandomForestClassifier: F1-Score = ',str(round(f1_score(target_valid, predicted_valid1)*100,1)), '%')



LogisticRegression: ROC AUC =  51.5 %
LogisticRegression: F1-Score =  9.3 %
RandomForestClassifier: ROC AUC =  70.3 %
RandomForestClassifier: F1-Score =  56.4 %
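The ROC AUC values above are computed from the output of `predict`, that is, from hard 0/1 labels, so they reflect a single decision threshold. ROC AUC is more commonly computed from the predicted probability of the positive class, which evaluates every threshold. The sketch below contrasts the two on synthetic data from `make_classification`, which stands in for the bank data; the printed values are illustrative only and say nothing about the models in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 80% / 20%), standing in for the bank data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

clf = LogisticRegression(max_iter=1000, random_state=12345).fit(X_train, y_train)

# AUC from hard 0/1 predictions: only the default 0.5 threshold is evaluated.
auc_from_labels = roc_auc_score(y_valid, clf.predict(X_valid))

# AUC from the predicted probability of class 1: all thresholds are evaluated.
auc_from_scores = roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1])

print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```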


## Dealing with Imbalanced Classes

## Method 1

#### NearMiss

In [None]:
# a) NearMiss, an undersampling technique
from imblearn.under_sampling import NearMiss

nm = NearMiss()
# Note: the full dataset is resampled before splitting, so the validation set is balanced too
features_nm, target_nm = nm.fit_resample(features, target)

# Split the resampled dataset into training and validation data
features_train, features_valid, target_train, target_valid = train_test_split(
    features_nm, target_nm, test_size=0.25)





In [None]:
# Build and evaluate a logistic regression model on the balanced data
LogisticRegressionclf = LogisticRegression(random_state=12345)

# Fit the model
LogisticRegressionclf.fit(features_train, target_train)

# Predict on the validation data
y_predicted = LogisticRegressionclf.predict(features_valid)

print('LogisticRegression: ROC AUC = ',str(round(roc_auc_score(target_valid, y_predicted)*100,1)), '%')
print('LogisticRegression: F1-Score = ',str(round(f1_score(target_valid, y_predicted)*100,1)), '%')


LogisticRegression: ROC AUC =  87.7 %
LogisticRegression: F1-Score =  86.2 %
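In the cells above, `NearMiss` runs before `train_test_split`, so the validation set is balanced as well and its scores describe balanced data rather than the bank's real customer mix. The sketch below shows the alternative order: split first, then resample only the training portion. It assumes synthetic data from `make_classification` and uses plain random undersampling (a hypothetical helper, `undersample_majority`) as a simple stand-in for `NearMiss`, so its printed score is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def undersample_majority(X, y, seed=12345):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    idx_0 = np.flatnonzero(y == 0)
    idx_1 = np.flatnonzero(y == 1)
    n = min(len(idx_0), len(idx_1))
    keep = np.concatenate([
        rng.choice(idx_0, size=n, replace=False),
        rng.choice(idx_1, size=n, replace=False),
    ])
    return X[keep], y[keep]


# Synthetic imbalanced data (about 80% / 20%), standing in for the bank data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=12345
)

# 1) Split first, so the validation set keeps the original class ratio.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

# 2) Balance the training portion only.
X_bal, y_bal = undersample_majority(X_train, y_train)

# 3) Train on balanced data, evaluate on untouched validation data.
clf = LogisticRegression(max_iter=1000, random_state=12345).fit(X_bal, y_bal)
f1_valid = f1_score(y_valid, clf.predict(X_valid))
print("F1 on untouched validation data:", round(f1_valid, 3))
```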


## Method 2

#### Synthetic Minority Oversampling Technique

In [None]:
# b) SMOTE, an oversampling technique
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# Note: the full dataset is resampled before splitting, so the validation set contains synthetic rows too
features_smote, target_smote = smote.fit_resample(features, target)

# Split the resampled dataset into training and validation data
features_trainn, features_validd, target_trainn, target_validd = train_test_split(
    features_smote, target_smote, test_size=0.25)
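SMOTE creates new minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below illustrates that idea with plain NumPy. It is a simplified illustration under stated assumptions (brute-force neighbour search, a hypothetical function name `smote_like`, made-up points), not the `imblearn` implementation.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, seed=12345):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))      # pick a minority row
        nb = rng.choice(neighbours[base])    # pick one of its k neighbours
        gap = rng.random()                   # position on the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic


# Hypothetical minority points at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like(X_min, n_new=5)
print(new_rows)
```

Every synthetic row lies on a line segment between two real minority rows, so it stays inside the region the minority class already occupies.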

In [None]:
# Build and evaluate a RandomForestClassifier model on the balanced data

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

rfc = RandomForestClassifier()

# Fit the model on the training data
rfc.fit(features_trainn, target_trainn)

# Predict on the validation data
rfc_predict = rfc.predict(features_validd)

# Check performance
print('F1 score:',f1_score(target_validd, rfc_predict))
print('ROC AUC:',roc_auc_score(target_validd, rfc_predict))

F1 score: 0.9047483650838785
ROC AUC: 0.9078688219141262


## Findings and Recommendations

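1. The data had imbalanced classes: about 20% of customers exited.
2. Balancing the classes increased the F1 score of the models. Logistic regression rose from 9.3% to 86.2% after NearMiss undersampling, and the random forest rose from 56.4% to 90.5% after SMOTE oversampling. Both balanced scores were measured on validation sets that had been resampled as well, so they describe balanced data rather than the bank's original customer mix.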