# <center> Project - Bank Churn Prediction Case Study

 
### In this project, we aim to predict churn for a bank: given a bank customer, can we build a neural-network classifier that determines whether they will leave?

### Objective :
Given a bank customer, build a neural-network-based classifier that predicts whether they will leave within the next 6 months.
### Context :
Service businesses such as banks have to worry about the problem of churn, i.e. customers leaving to join another service provider. It is important to understand which aspects of the service influence a customer's decision to leave, so that management can focus its improvement efforts on those priorities.

### Data Description :

The case study uses an open-source dataset from Kaggle. The dataset contains 10,000 sample points with 14 columns such as CustomerId, CreditScore, Geography, Gender, Age, Tenure, Balance etc.

Link to the Kaggle project site: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

### Marks Distribution:
The points distribution for this case is as follows:
1. Read the dataset
2. Drop the columns that are unique to each user, such as IDs
3. Distinguish the feature and target set
4. Divide the data set into training and test sets
5. Normalize the train and test data
6. Initialize & build the model. Identify the points of improvement and implement them.
7. Predict the results using 0.5 as a threshold 
8. Print the Accuracy score and confusion matrix 

## Reference Link - 

* Getting Started with Neural Networks - https://courses.analyticsvidhya.com/courses/getting-started-with-neural-networks
* Introduction to PyTorch for Deep Learning - https://courses.analyticsvidhya.com/courses/introduction-to-pytorch-for-deeplearning
* A Comprehensive Learning Path for Deep Learning in 2020 - https://courses.analyticsvidhya.com/courses/Comprehensive-learning-path-for-deep-learning-in-2020


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import tensorflow as tf
print(tf.__version__)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix
from tensorflow.keras import optimizers

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, precision_recall_curve, auc

#### Read the dataset

In [None]:
ds = pd.read_csv("/kaggle/input/dl-intro-to-nn/bank.csv")

In [None]:
ds.head(2)

In [None]:
ds.shape

In [None]:
ds['CustomerId'].nunique()

#### Drop the columns which are unique for all users like IDs

In [None]:
# RowNumber, CustomerId and Surname identify individual customers and carry no predictive signal, hence dropping them
ds = ds.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
ds

In [None]:
ds['Geography'].value_counts()

In [None]:
ds['Exited'].value_counts()

In [None]:
ds['Exited'].value_counts(normalize=True)

In [None]:
pd.crosstab(ds['Geography'],ds['Exited'])

In [None]:
?pd.crosstab

In [None]:
pd.crosstab(ds['Geography'],ds['Exited'],normalize='index')

In [None]:
ds.info()

In [None]:
sns.countplot(x=ds['Exited']);


#### Distinguish the feature and target set

In [None]:
ds

In [None]:
ds_new=pd.get_dummies(ds,drop_first=True)
print(ds_new.shape)
ds_new.head()
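
As a rough illustration of what `pd.get_dummies(..., drop_first=True)` does to the categorical columns, here is a minimal sketch on a hypothetical toy frame (the values are made up; only the column names mirror the dataset). Each categorical column loses its first level in sorted order, and that level becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from bank.csv.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
})

# drop_first=True removes the first level (sorted order) of every
# categorical column, so k categories become k-1 indicator columns.
toy_encoded = pd.get_dummies(toy, drop_first=True)

print(list(toy_encoded.columns))
print(toy_encoded.shape)
```

A row with 0 in both `Geography_Germany` and `Geography_Spain` therefore represents France in this toy example.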

In [None]:
# X = ds.iloc[:,0:10].values # Credit Score through Estimated Salary
# y = ds.iloc[:,10].values # Exited

X=ds_new.drop('Exited',axis=1).values
y=ds_new['Exited'].values

In [None]:
# # Encoding categorical (string based) data. Country: there are 3 options: France, Spain and Germany
# # This will convert those strings into scalar values for analysis
# print(X[:8,1], '... will now become: ')

# label_X_country_encoder = LabelEncoder()
# X[:,1] = label_X_country_encoder.fit_transform(X[:,1])
# print(X[:8,1])

In [None]:
# # We will do the same thing for gender. this will be binary in this dataset
# print(X[:6,2], '... will now become: ')

# label_X_gender_encoder = LabelEncoder()
# X[:,2] = label_X_gender_encoder.fit_transform(X[:,2])
# print(X[:6,2])

In [None]:
# # The Problem here is that we are treating the countries as one variable with ordinal values (0 < 1 < 2). 
# # Therefore, one way to get rid of that problem is to split the countries into respective dimensions.
# # Gender does not need this as it is binary

# # Converting the string features into their own dimensions. Gender doesn't matter here because its binary
# #countryhotencoder = OneHotEncoder(categories = [1]) # 1 is the country column
# countryhotencoder = ColumnTransformer([("countries", OneHotEncoder(), [1])], remainder="passthrough")
# X = countryhotencoder.fit_transform(X)
# #X = countryhotencoder.fit_transform(X).toarray()

In [None]:
X.shape

In [None]:
X

In [None]:
# A 0 in both remaining country columns means the customer belongs to the country whose column was dropped.
# Dropping one dummy column avoids a redundant dimension; pd.get_dummies(drop_first=True) above already does this.
# X = X[:,1:] # Would drop the first one-hot country column (no longer needed).

#### Divide the data set into Train and test sets

In [None]:
# Splitting the dataset into the Training and Testing set.

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
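
Because the two `Exited` classes are not equally frequent, one possible refinement is a stratified split, which keeps the class ratio the same in the train and test sets. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) rather than the notebook's data; `stratify` is a standard `train_test_split` parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 100 rows, 20 of them positive.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy preserves the 20% positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Whether stratification changes the results noticeably on 10,000 rows is an open question; it mainly protects against an unlucky split.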

#### Normalize the train and test data

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
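
To make the scaling step concrete, here is a minimal NumPy sketch of the same idea on hypothetical toy values: the mean and standard deviation are computed from the training rows only and then reused for the test rows, which is why the cell above calls `fit_transform` on the train set but only `transform` on the test set.

```python
import numpy as np

# Hypothetical toy data: two features (e.g. a score and an age).
X_tr = np.array([[600.0, 30.0], [700.0, 40.0], [800.0, 50.0]])
X_te = np.array([[650.0, 35.0]])

# Statistics come from the training set only.
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# The same training statistics are applied to both sets.
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std

print(X_tr_scaled)
print(X_te_scaled)
```

Fitting the scaler on the test set as well would leak information about the test distribution into preprocessing.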

In [None]:
X_train.shape

# <center> Initialize & Build the Neural Network Model

In [None]:
# 1. Should perform well on both Train and Test i.e. it shouldn't overfit
# 2. build a confusion matrix for both train and test - compute recall and precision
# 3. Use the Adam optimizer with a learning rate of 0.005

In [None]:
# Initializing the ANN
classifier = Sequential()

classifier.add(Dense(32, activation = 'relu'))
classifier.add(Dense(16, activation = 'relu'))
classifier.add(Dense(8, activation = 'relu'))
classifier.add(Dense(1,  activation = 'sigmoid'))

In [None]:
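# Adam optimizer; recent TensorFlow/Keras versions expect `learning_rate` instead of the deprecated `lr`
adam = optimizers.Adam(learning_rate = 0.005)

classifier.compile(optimizer = adam, loss = 'binary_crossentropy', metrics=['accuracy'])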

classifier.fit(X_train, y_train, batch_size =500, epochs = 12, verbose = 1)

In [None]:
classifier.summary()

In [None]:
print('Train accuracy Model1: '+ str(classifier.evaluate(X_train,y_train)[1]))

# predict_classes was removed from recent TensorFlow versions; threshold the sigmoid output at 0.5 instead
Y_pred_cls_train = (classifier.predict(X_train, batch_size=200, verbose=0) > 0.5).astype('int32').ravel()
print('Recall_score: ' + str(recall_score(y_train,Y_pred_cls_train)))
print('Precision_score: ' + str(precision_score(y_train, Y_pred_cls_train)))
print('F-score: ' + str(f1_score(y_train,Y_pred_cls_train)))
confusion_matrix(y_train, Y_pred_cls_train)
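
Step 7 of the task list asks for predictions at a 0.5 threshold. A minimal sketch of that conversion, using hypothetical sigmoid outputs in place of real model predictions, could look like this:

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1),
# the shape a single-unit Keras output layer produces.
probs = np.array([[0.12], [0.50], [0.73], [0.49]])

# Strictly greater than 0.5 counts as class 1 (customer exits);
# ravel() flattens to the 1-D shape that sklearn metrics expect.
labels = (probs > 0.5).astype("int32").ravel()

print(labels)
```

Lowering the threshold would trade precision for recall, which may matter if missing a churner is costlier than a false alarm.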

In [None]:
print('Test accuracy Model1: '+ str(classifier.evaluate(X_test,y_test)[1]))
# predict_classes was removed from recent TensorFlow versions; threshold the sigmoid output at 0.5 instead
Y_pred_cls = (classifier.predict(X_test, batch_size=200, verbose=0) > 0.5).astype('int32').ravel()
print('Recall_score: ' + str(recall_score(y_test,Y_pred_cls)))
print('Precision_score: ' + str(precision_score(y_test, Y_pred_cls)))
print('F-score: ' + str(f1_score(y_test,Y_pred_cls)))
confusion_matrix(y_test, Y_pred_cls)
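
To show how the printed scores relate to the confusion matrix, here is a small worked sketch in plain Python. The helper `binary_metrics` and the counts are hypothetical; the layout `[[TN, FP], [FN, TP]]` is the one `sklearn.metrics.confusion_matrix` returns for binary labels 0 and 1.

```python
def binary_metrics(cm):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix.

    cm is laid out as [[TN, FP], [FN, TP]] (rows = true class, columns = predicted).
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only, not results from this model.
cm = [[1500, 100], [200, 200]]
metrics = binary_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is 0.85 while recall is only 0.5, which illustrates why accuracy alone can look flattering on an imbalanced target.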