# BANK CUSTOMER CHURN PREDICTION


<img align="left" width="500" height="400" src="https://drive.google.com/uc?export=view&id=1cndfDAb6JDdtMtxxSl6bIyZfDztJeTkS">

## Introduction.

### Customer churn refers to a customer leaving a company or organization, in our case a bank. Some studies show that acquiring a new customer can cost about five times as much as satisfying and retaining an existing one. Tracking and predicting the bank's churn rate can therefore help reduce marketing costs, increase capital, and grow the total customer base.

### In this project, we will perform Exploratory Data Analysis (EDA) and churn prediction using machine learning and deep learning techniques on the bank customers dataset taken from Kaggle.

## Overview of Notebook

### 1. Load and Manipulate Data
### 2. Exploratory Data Analysis
### 3. Feature Engineering for the baseline model
### 4. Data Preparation for the Model fitting
### 5. Model fitting and selection
### 6. Handling the problem of Imbalanced dataset
### 7. Conclusion

### Check my GitHub repo for more info: https://github.com/tanish265/Bank-Customers-Churn-Prediction

In [None]:
# importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
# tf.test.gpu_device_name()

## 1. Load and Manipulate Data

In [None]:
df=pd.read_csv('../input/predicting-churn-for-bank-customers/Churn_Modelling.csv')
df.head()

In [None]:
df.info()

In [None]:
#  Checking missing values in dataset
df.isnull().sum().sum()

In [None]:
# Checking unique values in a column to categorize into continuous and categorical columns.
df.nunique()

In [None]:
# Dropping columns which are not necessary for prediction
df = df.drop(["RowNumber", "CustomerId", "Surname"], axis = 1)

In [None]:
df.shape

In [None]:
df.dtypes

## 2. Exploratory Data Analysis

In [None]:
labels = 'Exited', 'Retained'
sizes = [df.Exited[df['Exited']==1].count(), df.Exited[df['Exited']==0].count()]
explode = (0, 0.1)
fig1, ax1 = plt.subplots(figsize=(9, 7))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title("Proportion of customer churned and retained", size = 20)
plt.show()

### From the pie chart above, we can see that around 20% of customers churned (exited) and 80% were retained. The dataset is therefore somewhat imbalanced, so the model has to predict churn accurately for this 20%, who are of most interest to the bank.
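
With roughly 80% of customers retained, plain accuracy can look good even for a model that never flags a churner. The sketch below uses a made-up label list (an assumption, not the Kaggle data) and a hypothetical baseline that always predicts "retained", to show why the per-class scores are worth checking alongside accuracy.

```python
# Illustrative labels only: 1 = exited, 0 = retained (20% / 80% split assumed).
y_true = [1] * 20 + [0] * 80

# Hypothetical baseline that predicts "retained" for every customer.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the churn class: churners correctly flagged / all churners.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
churn_recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, churn recall={churn_recall:.2f}")
```

The baseline matches the majority class every time, so its accuracy equals the retained share while its churn recall is zero.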

### Now visualizing countplots for categorical columns.

In [None]:
sns.countplot(x='Geography', hue = 'Exited',data = df).set_title('Countplot-Geography Column')


In [None]:
sns.countplot(x='Gender', hue = 'Exited',data = df).set_title('Countplot-Gender Column')

In [None]:
sns.countplot(x='HasCrCard', hue = 'Exited',data = df).set_title('Countplot-HasCreditCard Column')

In [None]:
sns.countplot(x='IsActiveMember', hue = 'Exited',data = df).set_title('Countplot-IsActiveMember Column')

### From the above countplots we can infer that-

#### 1. The number of retained customers is highest in France, and the number of exited customers is highest in Germany, which means the bank needs to focus more on customers from Germany, followed by France, so that they don't churn.
#### 2. The proportion of female customers churning is greater than that of male customers.
#### 3. Surprisingly, customers who had a credit card churned more, which may be a coincidence.
#### 4. As expected, the inactive members churned more.

In [None]:
 # Relations based on the continuous data attributes
fig, axarr = plt.subplots(3, 2, figsize=(20, 12))
sns.boxplot(y='CreditScore',x = 'Exited', hue = 'Exited',data = df, ax=axarr[0][0]).set_title('Boxplot- Credit Score Column')
sns.boxplot(y='Age',x = 'Exited', hue = 'Exited',data = df , ax=axarr[0][1]).set_title('Boxplot- Age Column')
sns.boxplot(y='Tenure',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][0])
sns.boxplot(y='Balance',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][1])
sns.boxplot(y='NumOfProducts',x = 'Exited', hue = 'Exited',data = df, ax=axarr[2][0])
sns.boxplot(y='EstimatedSalary',x = 'Exited', hue = 'Exited',data = df, ax=axarr[2][1])

### From the above boxplots we can infer that-

#### -- There is no significant difference in credit score, estimated salary, or number of products between customers who churned and those who did not.
#### -- Older customers churn more than younger ones, which indicates that the bank needs to focus more on older customers.
#### -- Customers with either a very short or a very long tenure with the bank tend to churn more.
#### -- Customers who churned generally have a higher bank balance, which is a bad sign, as it can lead to a capital shortfall for the bank.

## 3. Feature Engineering

### We would like to add features that are likely to have an impact on the probability of churning.

In [None]:
# 1st Attribute - Balance Salary Ratio
df['BalanceSalaryRatio'] = df.Balance/df.EstimatedSalary
sns.boxplot(y='BalanceSalaryRatio',x = 'Exited', hue = 'Exited',data = df)
plt.ylim(-1, 5)

### We can clearly see that customers with a high BalanceSalaryRatio churn more, which neither the balance nor the salary feature showed on its own.

In [None]:
#  2nd Attribute-Tenure By Age
df['TenureByAge'] = df.Tenure/(df.Age)
sns.boxplot(y='TenureByAge',x = 'Exited', hue = 'Exited',data = df)
plt.ylim(-0.2, 0.7)
plt.show()

In [None]:
# 3rd Attribute- Credit Score Given Age
df['CreditScoreGivenAge'] = df.CreditScore/(df.Age)
sns.boxplot(y='CreditScoreGivenAge',x = 'Exited', hue = 'Exited',data = df)
plt.show()

In [None]:
df.head()

In [None]:
df.shape

## 4. Data Preparation for the Model fitting

In [None]:
# Arranging columns by data type for easier manipulation

continuous_vars = ['CreditScore',  'Age', 'Tenure', 'Balance','NumOfProducts', 'EstimatedSalary', 'BalanceSalaryRatio',
                   'TenureByAge','CreditScoreGivenAge']
categorical_vars = ['HasCrCard', 'IsActiveMember','Geography', 'Gender']
df = df[['Exited'] + continuous_vars + categorical_vars]
df.head()

#### Correlation Matrix for continuous attributes

In [None]:
sns.set()
sns.set(font_scale = 1.25)
sns.heatmap(df[continuous_vars].corr(), annot = True,fmt = ".1f")
plt.show()

### We can see from the correlation matrix that only the columns we created show significant correlation, and only with the columns they were derived from.
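
A derived ratio is expected to correlate with the column it was built from. As a rough illustration, the sketch below computes the Pearson correlation by hand on a few made-up rows (hypothetical values, not the dataset); `pearson` is a helper written only for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only.
tenure = [1, 3, 5, 7, 9]
age = [40, 35, 50, 30, 45]
tenure_by_age = [t / a for t, a in zip(tenure, age)]

print(round(pearson(tenure, tenure_by_age), 2))  # derived feature vs. its numerator
print(round(pearson(tenure, age), 2))            # two original columns
```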

In [None]:
# Recode 0 as -1 in HasCrCard and IsActiveMember so that a 0 value contributes a negative signal to the model instead of no signal.
df.loc[df.HasCrCard == 0, 'HasCrCard'] = -1
df.loc[df.IsActiveMember == 0, 'IsActiveMember'] = -1
df.head()

### One-hot encoding categorical columns

In [None]:
df['Gender'].unique()

In [None]:
df['Geography'].unique()

In [None]:
from sklearn.preprocessing import LabelEncoder 
  
le = LabelEncoder() 
  
df['Gender']= le.fit_transform(df['Gender']) 
df['Geography']= le.fit_transform(df['Geography']) 

# Gender 0-Female,1-Male
# Geography 0-France,1-Germany,2-Spain

In [None]:
df.head()

In [None]:
df1 = pd.get_dummies(data=df, columns=['Gender','Geography'])
df1.columns
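
One-hot encoding replaces a categorical column with one indicator column per category, so no artificial ordering is implied between, say, France, Germany and Spain. Below is a minimal hand-rolled sketch of the idea; the `one_hot` helper and the sample rows are assumptions for illustration, and `pd.get_dummies` does the equivalent on a DataFrame.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a 1 in its category's slot."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Sample values for illustration only.
geography = ["France", "Spain", "Germany", "France"]
categories, encoded = one_hot(geography)

print(categories)
for value, row in zip(geography, encoded):
    print(value, row)
```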

In [None]:
df1.head()

In [None]:
continuous_vars

### Scaling the continuous attributes using MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df1[continuous_vars] = scaler.fit_transform(df1[continuous_vars])
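
Min-max scaling maps each column to the range [0, 1] via x' = (x - min) / (max - min). The sketch below applies that formula to a few made-up credit scores; `min_max_scale` is a hypothetical helper written for this example, not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero, map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values for illustration only.
credit_scores = [350, 600, 725, 850]
scaled = min_max_scale(credit_scores)
print(scaled)
```

Fitting the scaler on the training split only and reusing it on the test split is a common way to keep test-set statistics out of training.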

In [None]:
for col in df1:
    print(f'{col}: {df1[col].unique()}')

## 5. Model fitting and selection


### For model fitting, we will try several machine learning algorithms to get an idea of which one performs best. Since this is a classification problem, we will try the following algorithms:
### 1. Logistic Regression
### 2. Logistic Regression with degree-2 polynomial features
### 3. SVM with RBF kernel and polynomial kernel
### 4. Random Forest Classifier
### 5. Extreme Gradient Boosting Classifier


## We will also use deep learning after these techniques.

In [None]:
# Support functions
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform

# Fit models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Scoring functions
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

In [None]:
df1.head()
df1.shape

In [None]:
X = df1.drop('Exited',axis='columns')
y = df1['Exited']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)

In [None]:
X_train.shape

### Figuring out the importance of features in our dataset

In [None]:
# We perform training on the Random Forest model and generate the importance of the features

features_label = X_train.columns
forest = RandomForestClassifier (n_estimators = 1000, random_state = 0, n_jobs = -1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for i in range(X_train.shape[1]):
    # Index the labels with the sorted order so each name matches its importance
    print ("%2d) %-*s %f" % (i + 1, 30, features_label[indices[i]], importances[indices[i]]))

In [None]:
# Visualization of the Feature importances
plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), importances[indices], color = "green", align = "center")
plt.xticks(range(X_train.shape[1]), features_label[indices], rotation = 90)
plt.show()
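
When ranking features, the names have to be reordered with the same permutation as the scores, otherwise labels and values drift apart. Here is a small sketch with made-up importances (the numbers and the three column names picked here are for illustration only):

```python
# Made-up importances for illustration; real values come from forest.feature_importances_.
feature_names = ["CreditScore", "Age", "Balance"]
importances = [0.2, 0.5, 0.3]

# Indices that sort the importances from largest to smallest.
order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)

# Apply the same order to both the names and the scores.
ranked = [(feature_names[i], importances[i]) for i in order]
for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank:2d}) {name:<30s} {score:f}")
```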

In [None]:
# Function to give best model score and parameters
def best_model(model):
    print(model.best_score_)    
    print(model.best_params_)
    print(model.best_estimator_)


### Different models were tuned with GridSearchCV to find the best parameters.

### Each machine learning technique is now fitted on the training data using the best parameters obtained from GridSearchCV.
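
The grid-search code itself is not shown in this notebook. As a rough sketch of what such a search could look like, the block below tunes `C` for a logistic regression on synthetic data from `make_classification`; the grid values, `cv=3` and the data are all assumptions, not the settings used to obtain the parameters in the cells that follow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn features (assumption, not the Kaggle data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate hyperparameters; every combination is scored with cross-validation.
param_grid = {"C": [0.1, 1, 10], "max_iter": [250]}

search = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

# These are the attributes a fitted search exposes for the winning combination.
print(search.best_score_)
print(search.best_params_)
```

The fitted `search` object could be passed to a reporting helper that reads `best_score_`, `best_params_` and `best_estimator_`.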

In [None]:
# Fit primal logistic regression
log_primal = LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, max_iter=250, multi_class='auto',n_jobs=None, 
                                penalty='l2', random_state=None, solver='lbfgs',tol=1e-05, verbose=0, warm_start=False)
log_primal.fit(X_train,y_train)

In [None]:
# Fit logistic regression on degree-2 polynomial features
poly2 = PolynomialFeatures(degree=2)
df_train_pol2 = poly2.fit_transform(X_train)
log_pol2 = LogisticRegression(C=50, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, max_iter=300, multi_class='auto', n_jobs=None, 
                              penalty='l2', random_state=None, solver='liblinear',tol=0.0001, verbose=0, warm_start=False)
log_pol2.fit(df_train_pol2,y_train)

In [None]:
# Fit SVM with RBF Kernel
SVM_RBF = SVC(C=150, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf', max_iter=-1, probability=True, 
              random_state=None, shrinking=True,tol=0.001, verbose=False)
SVM_RBF.fit(X_train,y_train)

In [None]:
# Fit SVM with Pol Kernel
SVM_POL = SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,  decision_function_shape='ovr', degree=2, gamma=0.1, kernel='poly',  max_iter=-1,
              probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)
SVM_POL.fit(X_train,y_train)

In [None]:
# Fit Random Forest classifier
RF = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',max_depth=8, max_features=7, max_leaf_nodes=None,min_impurity_decrease=0.0,
                            min_samples_leaf=1, min_samples_split=3,min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
                            oob_score=False, random_state=None, verbose=0,warm_start=False)
RF.fit(X_train,y_train)

In [None]:
# Fit Extreme Gradient Boost Classifier
XGB = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,colsample_bytree=1, gamma=0.01, learning_rate=0.1, max_delta_step=0,max_depth=5,
                    min_child_weight=1, missing=None, n_estimators=100,n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,reg_alpha=0, 
                    reg_lambda=1, scale_pos_weight=1, seed=None,  subsample=1)
XGB.fit(X_train,y_train)

### Reviewing the fit of each tuned model. Our main interest is the performance in predicting 1's (customers who churn).

In [None]:
# Normal logistic regression
print(classification_report(y_train, log_primal.predict(X_train)))

In [None]:
# Logistic Regression with degree-2 polynomial features
print(classification_report(y_train,  log_pol2.predict(df_train_pol2)))

In [None]:
# SVM with RBF kernel
print(classification_report(y_train,  SVM_RBF.predict(X_train)))

In [None]:
# SVM with polynomial kernel
print(classification_report(y_train,  SVM_POL.predict(X_train)))

In [None]:
# Random Forest Classifier
print(classification_report(y_train,  RF.predict(X_train)))

In [None]:
# Extreme Gradient Boosting
print(classification_report(y_train,  XGB.predict(X_train)))

### Clearly, XGBoost gives the best training accuracy, 89%, on our dataset.

### Checking accuracy on the test data with the XGBoost model

In [None]:
print(classification_report(y_test,  XGB.predict(X_test)))

### The final accuracy on the test data comes to 86%, which is quite good, but because our dataset is somewhat imbalanced, the accuracy for customers who exited comes out low.


## Using Artificial Neural Network technique

In [None]:
X_train.shape

### Fitting a model with 2 hidden layers and dropout regularization. The final training accuracy comes to 85.28%.

In [None]:
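# Import from tensorflow.keras so that the layers and the model come from the same Keras package
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout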

In [None]:
# creating the model
model = Sequential()

# first hidden layer
model.add(Dense(8,activation = 'relu', input_dim = 16))
model.add(Dropout(0.1))

# second hidden layer
model.add(Dense( 8, activation = 'relu'))
model.add(Dropout(0.1))

# output layer
model.add(Dense( 1,activation = 'sigmoid'))

# Compiling the NN
# binary_crossentropy loss function used when a binary output is expected
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy']) 

model.fit(X_train, y_train, batch_size = 10, epochs = 50)
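
For reference, the sigmoid output and the binary cross-entropy loss minimized above can be written out by hand. This is a plain-Python sketch with made-up scores and labels, meant to show the formulas only; it is not Keras's implementation.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# Made-up raw scores (logits) and labels for illustration only.
logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
print(probs)
print(binary_cross_entropy(labels, probs))
```

A confident correct prediction contributes a loss near zero, while a probability of 0.5 contributes log(2) regardless of the label.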

### Fitting a model with 2 hidden layers without dropout regularization. The final training accuracy comes to 86.21%, better than the previous one.

In [None]:
# creating the model
model = Sequential()


# first hidden layer
model.add(Dense(8,activation = 'relu', input_dim = 16))

# second hidden layer
model.add(Dense( 8, activation = 'relu'))

# output layer
model.add(Dense( 1,activation = 'sigmoid'))

# Compiling the NN
# binary_crossentropy loss function used when a binary output is expected
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy']) 

model.fit(X_train, y_train, batch_size = 10, epochs = 50)

In [None]:
y_test.shape

### Evaluating this model on the test data gives an accuracy of 85.85%, which is close to that of our Random Forest model.

In [None]:
model.evaluate(X_test, y_test)

In [None]:
# Manually verifying some predictions
yp = model.predict(X_test)
yp[:10]

In [None]:
# Convert predicted probabilities to class labels using a 0.5 threshold
y_pred = (yp > 0.5).astype(int).ravel()
y_pred[:10]

In [None]:
y_test[:10]

### The classification report for this model is almost the same as that of the Random Forest model.

In [None]:
from sklearn.metrics import confusion_matrix , classification_report

print(classification_report(y_test,y_pred))

### Confusion Matrix

In [None]:
cm = tf.math.confusion_matrix(labels=y_test,predictions=y_pred)

plt.figure(figsize = (10,7))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
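
The per-class numbers in a classification report follow directly from the confusion-matrix cells. Here is a sketch with made-up counts (rows are true classes, columns are predicted classes; these are not the notebook's results):

```python
# Made-up confusion matrix for illustration: rows = truth, columns = prediction.
#      pred 0  pred 1
cm = [[90, 10],   # true 0 (retained)
      [15, 35]]   # true 1 (exited)

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```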

## 6. Handling the problem of Imbalanced dataset

### Removing the imbalance in our dataset with the SMOTE oversampling technique

In [None]:
X.shape

In [None]:
y.shape

In [None]:
y.value_counts()

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
# Oversample the minority class until both classes have the same count
X_sm, y_sm = smote.fit_resample(X, y)

y_sm.value_counts()

### Now we have an equal number of churned and retained customers.
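
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The following is a deliberately simplified pure-Python sketch of that idea. The helper `smote_like`, the single-nearest-neighbour choice and the toy points are assumptions made for illustration; `imblearn`'s `SMOTE` picks among k nearest neighbours and is what the notebook actually runs.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate n_new synthetic points between minority samples and their nearest neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest other minority point by squared Euclidean distance.
        neighbour = min(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy 2-D minority samples for illustration only.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=4)
print(new_points)
```

Because synthetic rows are built from existing ones, a common precaution is to resample only the training split so that the test rows remain real observations.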

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=15, stratify=y_sm)

In [None]:
y_train.value_counts()

### Fitting with the XGB model generated using GridSearchCV.

In [None]:
XGB2 = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,colsample_bytree=1, gamma=0.01, learning_rate=0.2, max_delta_step=0,max_depth=7,
                    min_child_weight=1, missing=None, n_estimators=100,n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,reg_alpha=0, 
                    reg_lambda=1, scale_pos_weight=1, seed=None,  subsample=1)
XGB2.fit(X_train,y_train)

### Training set accuracy comes to 97%, which is great in itself.

In [None]:
a=XGB2
print(classification_report(y_train,  a.predict(X_train)))

### Testing set accuracy comes to 91%, up from the 86% we got from our previous XGB model.

In [None]:
print(classification_report(y_test,  a.predict(X_test)))

In [None]:
XGB2 = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,colsample_bytree=1, gamma=0.01, learning_rate=0.2, max_delta_step=0,max_depth=7,
                    min_child_weight=1, missing=None, n_estimators=100,n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,reg_alpha=0, 
                    reg_lambda=1, scale_pos_weight=1, seed=None,  subsample=1)
# Fit the re-created estimator so that the model saved to disk is a trained one
XGB2.fit(X_train,y_train)

In [None]:
import joblib 
  
# Save the model as a pickle in a file 
joblib.dump(XGB2, 'churnXGB.pkl') 
  
# Load the model from the file 
# XGB_from_joblib = joblib.load('churnXGB.pkl')  
  
# Use the loaded model to make predictions 
# XGB_from_joblib.predict(X_test) 

## 7. Conclusion

### We can see that balancing the dataset increased our overall test accuracy to 91%. It also individually increased the accuracy for the customers who churned (from 57% previously to 91% now), which matters more to us than the accuracy for the customers who were retained.