**1. Introduction:**


Churn is defined as the movement of a customer from one company to another. 
The reasons can, for example, be: 
• Availability of latest technology 
• Customer-friendly bank staff 
• Low interest rates 
• Location 
• Services offered 


It is very important for a bank to predict the churn rate of its customers, which in turn helps it decide on marketing strategies.
The cost of attracting a new customer can be five to six times higher than the cost of holding on to an existing customer.
Long-term customers become less costly to serve, generate higher profits, and may also provide new referrals. On the other hand, losing a customer usually leads to a loss in profit for the bank.


With this dataset we can analyse the features responsible for customer churn. This helps the bank plan its future strategies and predict whether a given customer will churn. 

In [None]:
#Import the libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve


In [None]:
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#Read the Dataframe
df = pd.read_csv('/kaggle/input/predicting-churn-for-bank-customers/Churn_Modelling.csv', delimiter=',')
df.shape

In [None]:
df.head()

In [None]:
df.info()

We can see from the info that there are 14 columns in total. 
Of these 14 columns, 11 are numerical features and 3 are categorical features. 
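
The split between numerical and categorical columns can also be counted programmatically instead of being read off `df.info()`. The following is a minimal sketch; it assumes a small made-up frame (`toy`) in place of the real dataset, and the names `numeric_cols` and `categorical_cols` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Male"],
    "Balance": [0.0, 83807.86, 159660.8],
    "Exited": [1, 0, 1],
})

# select_dtypes splits the columns by dtype family.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numerical:", numeric_cols)
print(len(categorical_cols), "categorical:", categorical_cols)
```

Applied to the real `df`, the same two calls should reproduce the counts stated above.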

In [None]:
df.isnull().sum()

There are no null values present in the dataset. 

The Exited column is the target column; it tells us which customers have churned. 

In [None]:
df.describe()

In [None]:
df.describe(include = 'object')

In [None]:
df['Exited'].value_counts()

'0' means the customer did not churn. 
'1' means the customer churned.

In [None]:
sns.countplot(x='Exited', data=df)  # pass the column by keyword; recent seaborn versions expect this

From the above plot it can be seen that the data is imbalanced, which might bias the model towards the majority class.
We will deal with the imbalanced data further in the process. 
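
One common way to account for class imbalance, shown here only as a sketch and not necessarily the approach taken later in this notebook, is to stratify the train/test split and to reweight the classes in the model. The data below is synthetic (about 20% positives) and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 20% positives), not the bank dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 1.2).astype(int)

# stratify=y keeps the class ratio (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# class_weight="balanced" weights samples inversely to class frequency.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

print("positive share, train:", y_train.mean())
print("positive share, test: ", y_test.mean())
```

Reweighting usually trades some precision for better recall on the minority class, so the classification report should be checked after such a change.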

In [None]:
df.nunique()

RowNumber and CustomerId contain only unique, customer-specific values, so they play no role in predicting whether a particular customer will churn. 
Nothing can be inferred from RowNumber and CustomerId, so we can drop these 2 columns. 
Surname also seems irrelevant, as nothing can be inferred from it, so we will drop Surname as well.

In [None]:
df.drop(['RowNumber','CustomerId','Surname'], axis = 1, inplace = True)

In [None]:
df.shape

In [None]:
df.columns

NumOfProducts, IsActiveMember and HasCrCard are stored as int in the dataset but are really categorical columns. 
So we will convert these 3 columns to object type.

In [None]:
df['NumOfProducts'] = df['NumOfProducts'].astype('object')
df['IsActiveMember'] = df['IsActiveMember'].astype('object')
df['HasCrCard'] = df['HasCrCard'].astype('object')

In [None]:
df.info()

Now we have 5 numerical and 5 categorical feature columns, plus the numerical target column Exited. 

In [None]:
labels = 'Churn', 'No Churn'
sizes = [df.Exited[df['Exited']==1].count(), df.Exited[df['Exited']==0].count()]
explode = (0, 0.2)
fig1, ax1 = plt.subplots(figsize=(10, 8))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title("Proportion of customer churned and retained", size = 20)
plt.show()

Only about 20% of the customers in the data exited, while approximately 80% were retained by the bank. 
The data is imbalanced. 

In [None]:
cat_cols = ['Geography','Gender','NumOfProducts','HasCrCard','IsActiveMember']
for i in cat_cols:
    sns.countplot(x=i, hue='Exited', data=df)  # keyword arguments work across seaborn versions
    plt.show()

From the above plots we can see that more customers churn in Germany and France than in Spain.
The majority of the data also comes from France. 
Overall, far fewer customers churn than are retained by the bank. 
Slightly more female than male customers exit the bank. 
Customers with 4 products (NumOfProducts) are all retained by the bank, while customers with only 1 product are more likely to exit. 
More customers with a credit card churn than customers without one, so it is possible that cardholders are not satisfied with the services and switch to another bank that offers better services. Because the count plot shows absolute numbers, this may partly reflect that there are more cardholders overall.
Looking at IsActiveMember, more inactive members than active members exit the bank, so the bank should pay more attention to inactive customers by providing them with more information, services and programs in order to satisfy and retain them. 
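
Count plots compare absolute numbers, so a large group can show more churners even when its churn rate is lower. A churn rate per category makes the comparison fairer. The sketch below uses a tiny made-up frame (`toy`), not the real data; the column names follow the dataset, everything else is illustrative.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "HasCrCard": [1, 1, 1, 1, 1, 1, 0, 0],
    "Exited":    [1, 1, 0, 0, 0, 0, 1, 0],
})

# Number of churners, group size, and churn rate within each group.
summary = toy.groupby("HasCrCard")["Exited"].agg(
    churned="sum", customers="count", churn_rate="mean"
)
print(summary)
```

In this toy example the cardholders contain more churners in absolute terms, yet their churn rate is lower than that of the non-cardholders. Running the same `groupby` on each column in `cat_cols` of the real `df` is one way to check the observations above.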

In [None]:
df.columns

In [None]:
num_cols = ['CreditScore','Age','Tenure','Balance','EstimatedSalary']
for i in num_cols:
    sns.histplot(df[i], kde=True)  # histplot replaces the deprecated distplot
    plt.show()

EstimatedSalary has 9999 unique values, which fall in the range 0 to 200000. 
Tenure takes a fixed set of values in the range 0 to 10 years. 

In [None]:
fig, axarr = plt.subplots(3, 2, figsize=(20, 12))
sns.boxplot(y='CreditScore',x = 'Exited', hue = 'Exited',data = df, ax=axarr[0][0])
sns.boxplot(y='Age',x = 'Exited', hue = 'Exited',data = df , ax=axarr[0][1])
sns.boxplot(y='Tenure',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][0])
sns.boxplot(y='Balance',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][1])
sns.boxplot(y='EstimatedSalary',x = 'Exited', hue = 'Exited',data = df, ax=axarr[2][0])

In [None]:
num_cols = ['CreditScore','Age','Tenure','Balance','EstimatedSalary']
for i in num_cols:
    fig, ax = plt.subplots(figsize=(10, 8))  # figsize must be a (width, height) tuple
    sns.boxplot(y = df[i],x = df['Exited'], hue = df['Exited'])
    plt.show()

The points that can be inferred from the boxplots are:
1. Not much difference can be seen between the credit scores of customers who exited and those who were retained by the bank. 
   The upper whisker, the median and the spread of the values are approximately the same, so not much can be said about the CreditScore pattern
   of churned versus retained customers. 
2. Older customers, that is those above roughly 40 years, exit the bank more than younger customers. So the bank needs
   to review its decisions and implement more strategies and programs targeting older customers, for example schemes
   such as better investment plans or tax-free bonds. There are exceptions: many older customers stick with
   their bank, as can be seen from the extreme values (outliers) among the customers who have not churned.
3. From the boxplot of Tenure it can be seen that the majority of customers have been with the bank for around 5 years. Customers with a short tenure
   (around 2 years) or a long tenure (around 8 years) appear more likely to churn. 
4. Customers with a higher bank balance tend to exit the bank more, which might be a serious problem for the bank.
5. EstimatedSalary makes little difference: its average, highest and lowest values are almost the same for both groups. 
   So this is not an important feature, as not much can be inferred from it. 
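
The visual reading of the boxplots can be backed up with the quartiles of each numerical column per Exited group. This is a minimal sketch on made-up values (`toy`); only the column names are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "Age":    [30, 35, 38, 41, 45, 50, 58, 33],
    "Exited": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Quartiles of Age within each Exited group, one row per group.
quartiles = toy.groupby("Exited")["Age"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["Q1", "median", "Q3"]
print(quartiles)
```

These are the same quantities a boxplot draws as the box edges and the centre line, so the table and the plot should agree.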
    


In [None]:
sns.heatmap(df.select_dtypes(include='number').corr(), annot = True)  # correlate numeric columns only

Looking at the correlations, the variables are only weakly correlated with each other. Age is more strongly related to customers exiting
than the other variables.
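
To read the relationship with the target directly instead of scanning the heatmap, the correlations with Exited can be extracted and ranked by absolute value. The sketch below runs on a small made-up frame (`toy`); the numbers are illustrative and say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "Age":       [25, 32, 41, 47, 52, 60],
    "Balance":   [0.0, 50000.0, 0.0, 120000.0, 90000.0, 30000.0],
    "Geography": ["France", "Spain", "France", "Germany", "Germany", "Spain"],
    "Exited":    [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every numeric feature with the target,
# sorted so the strongest (positive or negative) comes first.
corr_with_target = (
    toy.select_dtypes(include="number")
       .corr()["Exited"]
       .drop("Exited")
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value does not by itself prove that a feature is useless.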

In [None]:
df.columns

In [None]:
df['CreditScoregivenAge'] = df['CreditScore']/(df['Age'])

In [None]:
df['CreditScoregivenAge'][:10]

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
bin_cols = ['Gender','HasCrCard','IsActiveMember']
for i in bin_cols:
    df[i] = le.fit_transform(df[i])

In [None]:
multi_cols = ['Geography','Tenure','NumOfProducts',]
df = pd.get_dummies(data = df, columns = multi_cols, drop_first = True)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
from sklearn.preprocessing import MinMaxScaler
num_cols = ['CreditScore','Age','Balance','EstimatedSalary','CreditScoregivenAge',]
minmax = MinMaxScaler()
df[num_cols]= minmax.fit_transform(df[num_cols].values)

In [None]:
x = df.drop(['Exited'], axis = 1)
y = df['Exited']

In [None]:
from sklearn.model_selection import train_test_split,KFold

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size = 0.30, random_state = 0)
xtrain.shape, xtest.shape, ytrain.shape, ytest.shape

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
LR = LogisticRegression()
LR.fit(xtrain,ytrain)

In [None]:
ypred = LR.predict(xtest)
ypred

In [None]:
acc = accuracy_score(ytest,ypred)
acc

In [None]:
print(classification_report(ytest,ypred))

It can be seen that the recall score for churned customers is very poor: the model fails to identify many of the customers who actually churned.
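
Recall for the churned class is TP / (TP + FN): the share of actual churners that the model flags. On imbalanced data a model can reach high accuracy while its recall stays low. The following sketch illustrates this with made-up labels; the arrays are not model output from this notebook.

```python
import numpy as np

# Made-up labels: 8 retained (0) and 2 churned (1) customers.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Hypothetical predictions that catch only one of the two churners.
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # churners correctly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # churners missed

accuracy = float(np.mean(y_true == y_pred))
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Here 9 of 10 predictions are correct, yet half of the churners are missed, which is why recall matters more than accuracy for this problem.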


In [None]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(ytest,ypred)
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
plt.figure(figsize = (8,5))
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")

The base model misclassifies a high number of data points in spite of a good accuracy.
We will tune the hyperparameters and try building other models to get better predictions. 

In [None]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

In [None]:
Knn = KNeighborsClassifier()
AdaBoost = AdaBoostClassifier()
GB = GradientBoostingClassifier()
RF = RandomForestClassifier()
XGB = XGBClassifier()

In [None]:
models = []
models.append(('LR',LR))
models.append(('KNN',Knn))
models.append(('AdaBoost',AdaBoost))
models.append(('Gradientboost',GB))
models.append(('RandomForest',RF))
models.append(('XGBoost',XGB))

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
#Kfold cross validation
results = []
names = []
for name, model in models:
    kfold = KFold(shuffle=True, n_splits = 5, random_state = 0)  
    cv_results = cross_val_score(model, x, y, cv=kfold, scoring='roc_auc')
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, np.mean(cv_results), np.var(cv_results,ddof=1)))
    
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
Knn.fit(xtrain,ytrain)

In [None]:
RF.fit(xtrain,ytrain)

In [None]:
AdaBoost.fit(xtrain,ytrain)

In [None]:
GB.fit(xtrain,ytrain)

In [None]:
XGB.fit(xtrain,ytrain)

In [None]:
fpr_knn, tpr_knn, _ = roc_curve(ytest, Knn.predict_proba(np.array(xtest.values))[:,1])
fpr_lr, tpr_lr, _ = roc_curve(ytest, LR.predict_proba(np.array(xtest.values))[:,1])
fpr_ada, tpr_ada, _ = roc_curve(ytest, AdaBoost.predict_proba(np.array(xtest.values))[:,1])
fpr_gb, tpr_gb, _ = roc_curve(ytest, GB.predict_proba(np.array(xtest.values))[:,1])
fpr_rf, tpr_rf, _ = roc_curve(ytest, RF.predict_proba(np.array(xtest.values))[:,1])

In [None]:
plt.figure(figsize = (12,6), linewidth= 1)
# Note: .score() returns accuracy on the test set, not the area under the ROC curve
plt.plot(fpr_knn, tpr_knn, label = 'KNN accuracy: ' + str(round(Knn.score(xtest,ytest), 5)))
plt.plot(fpr_lr, tpr_lr, label = 'LR accuracy: ' + str(round(LR.score(xtest,ytest), 5)))
plt.plot(fpr_ada, tpr_ada, label = 'AdaBoost accuracy: ' + str(round(AdaBoost.score(xtest,ytest), 5)))
plt.plot(fpr_gb, tpr_gb, label = 'GB accuracy: ' + str(round(GB.score(xtest,ytest), 5)))
plt.plot(fpr_rf, tpr_rf, label = 'RF accuracy: ' + str(round(RF.score(xtest,ytest), 5)))
plt.plot([0,1], [0,1], 'k--', label = 'Random guessing: 0.5')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve ')
plt.legend(loc='best')
plt.show()


Based on the cross-validation scores and the ROC curves, we can see that the Gradient Boosting classifier shows a good
ROC AUC with low variance across folds. So we will use the Gradient Boosting classifier for prediction.
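
Comparing models by mean score and variance is easier with the fold results collected in one table. The sketch below uses hypothetical per-fold ROC AUC values and placeholder model names (`model_a` and so on); in the notebook the inputs would be the `names` and `results` lists filled by the cross-validation loop.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold ROC AUC scores (made up for illustration).
cv_scores = {
    "model_a": [0.76, 0.78, 0.77, 0.75, 0.79],
    "model_b": [0.86, 0.87, 0.86, 0.85, 0.86],
    "model_c": [0.74, 0.79, 0.70, 0.77, 0.75],
}

# Mean and sample variance (ddof=1) per model, best mean first.
summary = pd.DataFrame({
    "mean_auc": {name: np.mean(s) for name, s in cv_scores.items()},
    "variance": {name: np.var(s, ddof=1) for name, s in cv_scores.items()},
}).sort_values("mean_auc", ascending=False)
print(summary)
```

A model that combines a high mean with a low variance, like `model_b` in this toy table, is the kind of candidate the selection above is looking for.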

In [None]:
ypredGB = GB.predict(xtest)
ypredGB

In [None]:
cm=confusion_matrix(ytest,ypredGB)
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
plt.figure(figsize = (8,5))
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")

In [None]:
plt.figure(figsize = (12,6), linewidth= 1)
plt.plot(fpr_gb, tpr_gb, label = 'GB accuracy: ' + str(round(GB.score(xtest,ytest), 5)))  # .score() is accuracy
plt.plot([0,1], [0,1], 'k--', label = 'Random guessing: 0.5')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve ')
plt.legend(loc='best')
plt.show()


In [None]:
print(classification_report(ytest,ypredGB))

Gradient Boosting has increased the recall for churned customers compared to the logistic regression base model. However, there are still misclassified data points, which might be 
reduced further by tuning the hyperparameters. The model still needs improvement before it can reliably predict which customers are likely to churn and exit the bank. 
Gradient Boosting has better accuracy than the other models. This model can help the bank identify the customers who are likely to exit, and thus make decisions and strategies that help retain them. 
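
Hyperparameter tuning of the Gradient Boosting model could be sketched with `GridSearchCV`, scored on recall so that the search favours finding churners. The data below is synthetic, and the grid values are assumptions chosen to keep the example small, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (roughly 80/20), not the bank dataset.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small illustrative grid; the values are assumptions, not tuned choices.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="recall",  # optimise recall of the positive (churned) class
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With the real data, `xtrain` and `ytrain` would take the place of `X` and `y`, and the tuned model should still be evaluated on the held-out test set.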