# Churn Prediction for Bank Customers using ML

**Table of Contents:**

1. Case Study

2. Attribute Information

3. Our Approach

4. Importing required libraries

5. Loading the dataset

6. Exploratory Data Analysis

7. Data Visualization

8. Checking Data Distribution

9. Checking Correlation among Variables

10. Feature Engineering

11. Creating a Validation Set

12. Train-Test Split

13. Oversampling using SMOTE

14. Data Standardization

15. Importing performance metrics

16. Model Selection

17. Model Training

18. Performance Evaluation

19. Hyperparameter Optimization

**Case Study:**

As we know, it is much more expensive for a company to sign up a new client than to keep an existing one.

It is advantageous for them to know what leads a client towards the decision to leave the company.

Churn prediction allows companies to develop loyalty programs and retention campaigns to keep as many customers as possible.

**Attribute Information:**

* RowNumber — Corresponds to the record (row) number and has no effect on the output.

* CustomerId — Contains random values and has no effect on customer leaving the bank.
 
* Surname — The surname of a customer has no impact on their decision to leave the bank.
 
* CreditScore — Can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.
 
* Geography — A customer’s location can affect their decision to leave the bank.
 
* Gender — It’s interesting to explore whether gender plays a role in a customer leaving the bank.
 
* Age — This is certainly relevant, since older customers are less likely to leave their bank than younger ones.
 
* Tenure — Refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
 
* Balance — Also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
 
* NumOfProducts — Refers to the number of products that a customer has purchased through the bank.

* HasCrCard — Denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
 
* IsActiveMember — Active customers are less likely to leave the bank.
 
* EstimatedSalary — As with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
 
* Exited — Whether or not the customer left the bank.

**Our Approach:**

From a business perspective, the bank benefits from this analysis only if the model correctly identifies the customers who are likely to leave.

For that reason we should focus on improving the recall score rather than optimizing only the accuracy score or the ROC AUC score.

Since the dataset is imbalanced, accuracy can look high even when most churners are missed, so the goal is to maximize the True Positives and minimize the False Negatives.
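
To make the point about recall concrete, here is a minimal sketch in plain Python. The labels are hypothetical toy values, not taken from the dataset, and `accuracy_and_recall` is a helper written only for this illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 means 'churned'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners that were caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners that were missed
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical toy labels: 8 customers stayed (0) and 2 churned (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that predicts "stays" for everyone looks accurate but finds no churners.
always_stay = [0] * 10
print(accuracy_and_recall(y_true, always_stay))    # (0.8, 0.0)

# A model that catches both churners at the cost of two false alarms.
catches_churn = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(accuracy_and_recall(y_true, catches_churn))  # (0.8, 1.0)
```

Both toy models reach the same accuracy, but only the second one would be of any use for a retention campaign.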

In [None]:
import warnings
warnings.simplefilter('ignore')

### Importing required libraries -

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [None]:
pd.set_option("display.max_columns", None)
pd.options.display.float_format = '{:.2f}'.format

### Loading the dataset -

In [None]:
filepath = '../input/churn-for-bank-customers/churn.csv'
data = pd.read_csv(filepath)

### Performing EDA -

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.dtypes

In [None]:
data.info()

In [None]:
data.describe().T

Checking for any missing values -

In [None]:
data.isna().sum(axis=0)

Checking for any duplicate rows -

In [None]:
duplicate_rows = data[data.duplicated()]
duplicate_rows.shape[0]

Dropping all the columns which are not required for the analysis -

In [None]:
data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

Columns with numerical data -

In [None]:
numerical_data = data.select_dtypes(include='number')
numerical_data.columns

Columns with categorical data -

In [None]:
categorical_data = data.select_dtypes(exclude='number')
categorical_data.columns

Checking the outcome labels -

In [None]:
data['Exited'].value_counts()

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=data, x='Exited')
plt.show()

### Data Visualization (based on categorical features) -

Customer distribution based on gender -

In [None]:
data['Gender'].value_counts()

In [None]:
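gender_counts = data['Gender'].value_counts()
plt.figure(figsize=(7, 6))
# Take the labels from the counts themselves so that they always match the slices
plt.pie(gender_counts.values, center=(0, 0), radius=1.5, labels=gender_counts.index, autopct='%1.1f%%', pctdistance=0.5)
plt.axis('equal')
plt.show()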

Customer churn based on gender -

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Exited', hue='Gender', data=data)
plt.show()

Customer distribution based on geography -

In [None]:
print(data['Geography'].value_counts())

In [None]:
geography_counts = data['Geography'].value_counts()
plt.figure(figsize=(7, 6))
# Take the labels from the counts themselves so that they always match the slices
plt.pie(geography_counts.values, center=(0, 0), radius=1.5, labels=geography_counts.index, autopct='%1.1f%%', pctdistance=0.5)
plt.axis('equal')
plt.show()

Customer churn based on geography -

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Exited', hue='Geography', data=data)
plt.show()

Checking the gender distribution of exited customers based on their location -

In [None]:
plt.figure(figsize=(8, 5))
crosstab = pd.crosstab(data['Geography'], data['Gender'], values=data['Exited'], aggfunc='sum')
sns.heatmap(crosstab, annot=True, fmt='d')
plt.show()

Checking the number of active members -

In [None]:
data['IsActiveMember'].value_counts()

Customer churn based on whether the customer is an active member or not -

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Exited', hue='IsActiveMember', data=data)
plt.show()

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=data, y='IsActiveMember', hue='Gender')
plt.show()

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=data, y='IsActiveMember', hue='Geography')
plt.show()

Checking the number of customers who have a credit card -

In [None]:
data['HasCrCard'].value_counts()

Customer churn based on whether the customer has a credit card or not -

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Exited', hue='HasCrCard', data=data)
plt.show()

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=data, y='HasCrCard', hue='Gender')
plt.show()

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=data, y='HasCrCard', hue='Geography')
plt.show()

Checking the number of products a customer has purchased through the bank -

In [None]:
data['NumOfProducts'].value_counts()

Customer churn based on the number of products a customer has purchased through the bank -

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Exited', hue='NumOfProducts', data=data)
plt.show()

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=data, y='NumOfProducts', hue='Gender')
plt.show()

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=data, y='NumOfProducts', hue='Geography')
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
sns.set(style="darkgrid")
ax = sns.pointplot(x='Geography', y='CreditScore', hue='Gender', data=data, dodge=True)
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
sns.set(style="darkgrid")
ax = sns.pointplot(x='Geography', y='Balance', hue='Gender', data=data, dodge=True)
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
sns.set(style="darkgrid")
ax = sns.pointplot(x='Geography', y='EstimatedSalary', hue='Gender', data=data, dodge=True)
plt.show()

Checking the distribution of data -

In [None]:
feat_set1 = ['CreditScore', 'Balance', 'EstimatedSalary']
data[feat_set1].hist(figsize=(15, 8))
plt.show()

In [None]:
sns.set(style="whitegrid")
fig = plt.figure(figsize=(12, 5))
fig.subplots_adjust(right=1.5)

plt.subplot(1, 3, 1)
sns.boxplot(y=data['CreditScore'])

plt.subplot(1, 3, 2)
sns.boxplot(y=data['Balance'])

plt.subplot(1, 3, 3)
sns.boxplot(y=data['EstimatedSalary'])

plt.show()

In [None]:
def diagnostic_plot(data, col):
    fig = plt.figure(figsize=(9, 4))
    fig.subplots_adjust(right=1.5)
    
    plt.subplot(1, 2, 1)
    sns.histplot(data[col], kde=True, color='red')  # distplot is deprecated in recent seaborn versions
    plt.title('Histogram')
    
    plt.subplot(1, 2, 2)
    stats.probplot(data[col], dist='norm', fit=True, plot=plt)
    plt.title('Q-Q Plot')
    
    plt.show()

In [None]:
diagnostic_plot(data, 'Balance')

In [None]:
diagnostic_plot(data, 'CreditScore')

In [None]:
diagnostic_plot(data, 'EstimatedSalary')

Checking the age distribution of the customers -

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(data['Age'], kde=True, color='red')  # distplot is deprecated in recent seaborn versions
plt.show()

In [None]:
print("Minimum Age is {}".format(data['Age'].min()))
print("Maximum Age is {}".format(data['Age'].max()))

In [None]:
print("Mean: {:.2f}".format(data['Age'].mean()))
print("Median: {:.2f}".format(data['Age'].median()))

Checking the correlation between independent variables -

In [None]:
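# Keep only numeric columns: 'Geography' and 'Gender' are still strings at this point
X = data.drop('Exited', axis=1).select_dtypes(include='number')
X.corr(method='spearman')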

Checking the correlation of independent variables with dependent variable -

In [None]:
plt.figure(figsize=(8, 5))
X.corrwith(data['Exited']).plot(kind='barh', title="Correlation with 'Exited' column -")
plt.show()

Plotting the Correlation Matrix -

In [None]:
plt.figure(figsize = (10, 8))
corr = data.select_dtypes(include='number').corr(method='spearman')
mask = np.triu(np.ones_like(corr, dtype=bool))
cormat = sns.heatmap(corr, mask=mask, annot=True, cmap='YlGnBu', linewidths=1, fmt=".2f")
cormat.set_title('Correlation Matrix')
plt.show()

### Feature Engineering -

In [None]:
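# Bins are right-closed by default, so include_lowest=True keeps customers aged exactly 18
data['Age Group'] = pd.cut(x=data['Age'], 
                            bins=[18, 40, 60, 95], 
                            labels=['Youngster', 'Middle-Aged', 'Senior Citizen'],
                            include_lowest=True)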

In [None]:
data['Exited'].groupby(data['Age Group']).sum()

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Exited', hue='Age Group', data=data)
plt.show()

One-hot encoding the categorical features -

In [None]:
feat_set2 = ['Geography', 'Gender']
data_encoded = pd.get_dummies(data[feat_set2], drop_first=True)
data_encoded.sample(5)
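
As a rough illustration of what one-hot encoding with `drop_first=True` produces, here is a plain-Python sketch. The helper `one_hot_drop_first` and the `is_` column prefix are made up for this example, and the alphabetical ordering of categories is an assumption of the sketch, not something verified against pandas here.

```python
def one_hot_drop_first(values):
    """One-hot encode category labels, dropping the first (baseline) category.

    Categories are sorted alphabetically; the first one gets no column of its
    own and is represented by all indicator columns being 0.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the implicit baseline
    return [{f"is_{cat}": int(value == cat) for cat in kept} for value in values]


rows = one_hot_drop_first(["France", "Spain", "Germany", "France"])
for row in rows:
    print(row)
```

Dropping one category avoids a redundant column: if a customer is in neither Germany nor Spain, they must be in France.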

In [None]:
df1 = data.join(data_encoded)
df1.sample(5)

In [None]:
df2 = df1.drop(['Geography', 'Gender', 'Age Group'], axis=1)
df2.sample(5)

Columns present in the processed data -

In [None]:
df2.columns

Shuffling the dataset - 

In [None]:
from sklearn.utils import resample
df3 = resample(df2, replace=False, n_samples=None, random_state=42)
df3.head()

Saving the last 500 records from the shuffled dataset for validation -

In [None]:
final_data = df3.iloc[0:9500, :]
valid_set = df3.iloc[9500:10000, :]

In [None]:
final_data.shape

In [None]:
valid_set.shape

### Splitting the data into independent & dependent variables -

In [None]:
X = final_data.drop('Exited', axis=1)
y = final_data['Exited']

In [None]:
X_val = valid_set.drop('Exited', axis=1)
y_val = valid_set['Exited']

### Splitting the data into train & test sets -

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, random_state=42)

### Oversampling the minority class instances using SMOTE -

In [None]:
from collections import Counter
print(Counter(y_train))

In [None]:
from imblearn.over_sampling import SMOTE
oversampler = SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=42)
X_train, y_train = oversampler.fit_resample(X_train, y_train)

In [None]:
print(Counter(y_train))

In [None]:
print(X_train.shape)
print(y_train.shape)
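
The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority-class neighbours. The code below is a simplified illustration with hypothetical feature values and a hand-picked neighbour, not the imbalanced-learn implementation.

```python
import random


def synthetic_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(42)
minority_point = [600.0, 45.0]    # hypothetical [CreditScore, Age] of a churned customer
nearest_neighbor = [650.0, 50.0]  # hypothetical neighbour from the same class
new_point = synthetic_sample(minority_point, nearest_neighbor, rng)
print(new_point)
```

One consequence of this interpolation is that binary or one-hot columns can receive fractional values in the synthetic rows, which is worth keeping in mind when inspecting the oversampled data.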

### Scaling the data -

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_val_scaled = scaler.transform(X_val)
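
Standardization rescales each feature to zero mean and unit variance. The sketch below shows the arithmetic for a single column in plain Python, using hypothetical balance values; the helper names are invented for this illustration. As in the cell above, the statistics are learned from the training data only and then reused for the other sets.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values) or 1.0  # avoid dividing by zero for constant columns
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


train_balance = [0.0, 50000.0, 100000.0]  # hypothetical training values
test_balance = [75000.0]                  # hypothetical unseen value

mean, std = fit_standardizer(train_balance)
scaled_train = standardize(train_balance, mean, std)
scaled_test = standardize(test_balance, mean, std)
print(scaled_train)
print(scaled_test)
```

Reusing the training statistics keeps information about the test and validation sets from leaking into the preprocessing step.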

### Importing performance metrics for binary classification -

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, auc

### Model Selection -

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.pipeline import Pipeline
pipeline_lr = Pipeline([('lr', LogisticRegression())])
pipeline_svc = Pipeline([('svc', SVC())])
pipeline_knn = Pipeline([('knn', KNeighborsClassifier())])
pipeline_nb = Pipeline([('nb', GaussianNB())])

pipelines = [pipeline_lr, pipeline_svc, pipeline_knn, pipeline_nb]

pipe_dict = {0: 'Logistic Regression', 
             1: 'Support Vector Classifier', 
             2: 'K-Neighbors Classifier', 
             3: 'Naive Bayes Classifier'}

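for pipe in pipelines:
    pipe.fit(X_train_scaled, y_train)

for i, model in enumerate(pipelines):
    y_pred = model.predict(X_test_scaled)
    # ROC AUC is computed from continuous scores rather than hard class labels.
    # SVC() exposes no predict_proba by default, so fall back to decision_function.
    if hasattr(model, 'predict_proba'):
        y_score = model.predict_proba(X_test_scaled)[:, 1]
    else:
        y_score = model.decision_function(X_test_scaled)
    print(pipe_dict[i])
    print("Accuracy Score: {:.4f}".format(accuracy_score(y_test, y_pred)))
    print("ROC AUC Score: {:.4f}".format(roc_auc_score(y_test, y_score)))
    print("F1 Score: {:.4f}".format(f1_score(y_test, y_pred)))
    print("Recall: {:.4f}\n".format(recall_score(y_test, y_pred)))
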

## Decision Tree Classifier

### Model Training -

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

### Model Evaluation -

In [None]:
y_pred_dt = dt.predict(X_test)
y_pred_proba_dt = dt.predict_proba(X_test)[:, 1]

In [None]:
print("Train accuracy :{:.4f}".format(accuracy_score(y_train, dt.predict(X_train))))
print("Test accuracy :{:.4f}".format(accuracy_score(y_test, dt.predict(X_test))))

In [None]:
conmat = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(conmat, annot=True, fmt='d')
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("Classification Report")
print(classification_report(y_test, y_pred_dt))

In [None]:
print("F1 Score: {:.4f}".format(f1_score(y_test, y_pred_dt)))
print("Precision: {:.4f}".format(precision_score(y_test, y_pred_dt)))
print("Recall: {:.4f}".format(recall_score(y_test, y_pred_dt)))

In [None]:
print("AUC Score: {:.4f}".format(roc_auc_score(y_test, y_pred_proba_dt)))
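
One way to read the AUC score: it is the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner, with ties counted as half. The sketch below computes it by brute force over all pairs; the labels and scores are hypothetical, and `roc_auc_by_pairs` is a helper written for this illustration.

```python
def roc_auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_by_pairs(toy_labels, toy_scores))  # 0.75
```

Because the metric depends on how the scores rank the customers, it is usually computed from predicted probabilities, as in the cell above, rather than from 0/1 predictions.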

In [None]:
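from sklearn.model_selection import StratifiedKFold, cross_val_score
# Shuffle and stratify the folds, since the oversampled training set may not be randomly ordered
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
dt_cv_f1 = np.mean(cross_val_score(dt, X_train, y_train, cv=kfold, scoring='f1'))
print("Cross Validation F1 Score: {:.4f}".format(dt_cv_f1))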
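
As background on how folds are built: without shuffling, k-fold cross-validation cuts the rows into contiguous blocks in their existing order. The sketch below illustrates why row order then matters. The label list is hypothetical and assumes that synthetic minority rows sit at the end of the data, and `contiguous_folds` is a helper invented for this example.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (start, stop) index ranges of unshuffled, contiguous folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)  # the first folds absorb the remainder
        yield start, start + size
        start += size


# Hypothetical labels: 6 original rows followed by 4 appended minority rows.
labels = [0, 0, 0, 0, 1, 0] + [1, 1, 1, 1]
for start, stop in contiguous_folds(len(labels), 5):
    print(labels[start:stop])
```

In this toy case some folds contain only one class, which makes per-fold F1 scores hard to interpret; shuffled or stratified folds avoid that.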

In [None]:
fpr_dt, tpr_dt, threshold_dt = roc_curve(y_test, y_pred_proba_dt)
auc_dt = auc(fpr_dt, tpr_dt)

sns.set_style('darkgrid')  # the 'seaborn-darkgrid' style name is not available in recent matplotlib versions
plt.figure(figsize=(8, 5))
plt.plot(fpr_dt, tpr_dt, label="Decision Tree Classifier (area = {:.4f})".format(auc_dt))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.legend(loc='lower right', frameon=True)
plt.title("ROC Curve")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.show()

### Performance Evaluation on Validation Set -

In [None]:
y_pred_dt_val = dt.predict(X_val)
y_pred_proba_dt_val = dt.predict_proba(X_val)[:, 1]

In [None]:
print("Train accuracy :{:.4f}".format(accuracy_score(y_train, dt.predict(X_train))))
print("Validation accuracy :{:.4f}".format(accuracy_score(y_val, dt.predict(X_val))))

In [None]:
conmat = confusion_matrix(y_val, y_pred_dt_val)
sns.heatmap(conmat, annot=True, fmt='d')
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("F1 Score: {:.4f}".format(f1_score(y_val, y_pred_dt_val)))
print("Precision: {:.4f}".format(precision_score(y_val, y_pred_dt_val)))
print("Recall: {:.4f}".format(recall_score(y_val, y_pred_dt_val)))

In [None]:
print("AUC Score: {:.4f}".format(roc_auc_score(y_val, y_pred_proba_dt_val)))

## Random Forest Classifier

### Model Training -

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

### Model Evaluation -

In [None]:
y_pred_rf = rf.predict(X_test)
y_pred_proba_rf = rf.predict_proba(X_test)[:, 1]

In [None]:
print("Train accuracy :{:.4f}".format(accuracy_score(y_train, rf.predict(X_train))))
print("Test accuracy :{:.4f}".format(accuracy_score(y_test, rf.predict(X_test))))

In [None]:
conmat = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(conmat, annot=True, fmt='d')
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("Classification Report")
print(classification_report(y_test, y_pred_rf))

In [None]:
print("F1 Score: {:.4f}".format(f1_score(y_test, y_pred_rf)))
print("Precision: {:.4f}".format(precision_score(y_test, y_pred_rf)))
print("Recall: {:.4f}".format(recall_score(y_test, y_pred_rf)))

In [None]:
print("AUC Score: {:.4f}".format(roc_auc_score(y_test, y_pred_proba_rf)))

In [None]:
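from sklearn.model_selection import StratifiedKFold, cross_val_score
# Shuffle and stratify the folds, since the oversampled training set may not be randomly ordered
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_cv_f1 = np.mean(cross_val_score(rf, X_train, y_train, cv=kfold, scoring='f1'))
print("Cross Validation F1 Score: {:.4f}".format(rf_cv_f1))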

In [None]:
fpr_rf, tpr_rf, threshold_rf = roc_curve(y_test, y_pred_proba_rf)
auc_rf = auc(fpr_rf, tpr_rf)

sns.set_style('darkgrid')  # the 'seaborn-darkgrid' style name is not available in recent matplotlib versions
plt.figure(figsize=(8, 5))
plt.plot(fpr_rf, tpr_rf, label="Random Forest Classifier (area = {:.4f})".format(auc_rf))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.legend(loc='lower right', frameon=True)
plt.title("ROC Curve")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.show()

### Performance Evaluation on Validation Set -

In [None]:
y_pred_rf_val = rf.predict(X_val)
y_pred_proba_rf_val = rf.predict_proba(X_val)[:, 1]

In [None]:
print("Train accuracy :{:.4f}".format(accuracy_score(y_train, rf.predict(X_train))))
print("Validation accuracy :{:.4f}".format(accuracy_score(y_val, rf.predict(X_val))))

In [None]:
conmat = confusion_matrix(y_val, y_pred_rf_val)
sns.heatmap(conmat, annot=True, fmt='d')
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("F1 Score: {:.4f}".format(f1_score(y_val, y_pred_rf_val)))
print("Precision: {:.4f}".format(precision_score(y_val, y_pred_rf_val)))
print("Recall: {:.4f}".format(recall_score(y_val, y_pred_rf_val)))

In [None]:
print("AUC Score: {:.4f}".format(roc_auc_score(y_val, y_pred_proba_rf_val)))

## XGBoost Classifier

### Model Training -

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

### Model Evaluation -

In [None]:
y_pred_xgb = xgb.predict(X_test)
y_pred_proba_xgb = xgb.predict_proba(X_test)[:, 1]

In [None]:
print("Train accuracy :{:.4f}".format(accuracy_score(y_train, xgb.predict(X_train))))
print("Test accuracy :{:.4f}".format(accuracy_score(y_test, xgb.predict(X_test))))

In [None]:
conmat = confusion_matrix(y_test, y_pred_xgb)
sns.heatmap(conmat, annot=True, fmt='d')
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("Classification Report")
print(classification_report(y_test, y_pred_xgb))

In [None]:
print("F1 Score: {:.4f}".format(f1_score(y_test, y_pred_xgb)))
print("Precision: {:.4f}".format(precision_score(y_test, y_pred_xgb)))
print("Recall: {:.4f}".format(recall_score(y_test, y_pred_xgb)))

In [None]:
print("AUC Score: {:.4f}".format(roc_auc_score(y_test, y_pred_proba_xgb)))

In [None]:
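from sklearn.model_selection import StratifiedKFold, cross_val_score
# Shuffle and stratify the folds, since the oversampled training set may not be randomly ordered
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
xgb_cv_f1 = np.mean(cross_val_score(xgb, X_train, y_train, cv=kfold, scoring='f1'))
print("Cross Validation F1 Score: {:.4f}".format(xgb_cv_f1))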

In [None]:
fpr_xgb, tpr_xgb, threshold_xgb = roc_curve(y_test, y_pred_proba_xgb)
auc_xgb = auc(fpr_xgb, tpr_xgb)

sns.set_style('darkgrid')  # the 'seaborn-darkgrid' style name is not available in recent matplotlib versions
plt.figure(figsize=(8, 5))
plt.plot(fpr_xgb, tpr_xgb, label="XGBoost Classifier (area = {:.4f})".format(auc_xgb))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.legend(loc='lower right', frameon=True)
plt.title("ROC Curve")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.show()

### Performance Evaluation on Validation Set -

In [None]:
y_pred_xgb_val = xgb.predict(X_val)
y_pred_proba_xgb_val = xgb.predict_proba(X_val)[:, 1]

In [None]:
print("Train accuracy :{:.4f}".format(accuracy_score(y_train, xgb.predict(X_train))))
print("Validation accuracy :{:.4f}".format(accuracy_score(y_val, xgb.predict(X_val))))

In [None]:
conmat = confusion_matrix(y_val, y_pred_xgb_val)
sns.heatmap(conmat, annot=True, fmt='d')
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("F1 Score: {:.4f}".format(f1_score(y_val, y_pred_xgb_val)))
print("Precision: {:.4f}".format(precision_score(y_val, y_pred_xgb_val)))
print("Recall: {:.4f}".format(recall_score(y_val, y_pred_xgb_val)))

In [None]:
print("AUC Score: {:.4f}".format(roc_auc_score(y_val, y_pred_proba_xgb_val)))

### Optimizing the Hyperparameters for XGBoost using RandomizedSearchCV -

In [None]:
from sklearn.model_selection import RandomizedSearchCV

params = { 'subsample': [1],
           'reg_lambda': [0.001, 0.01, 0.1, 1, 10, 100],
           'n_estimators': [100, 300, 500],
           'min_child_weight': [1],
           'max_depth': [3, 4, 5, 10],
           'learning_rate': [0.001, 0.01, 0.1, 0.3],
           'gamma': [0, 0.01, 0.1, 0.3],
           'colsample_bytree': [1] }

# random_state makes the sampled parameter combinations reproducible
rdm = RandomizedSearchCV(xgb, param_distributions=params, scoring='f1', n_jobs=-1, cv=5, random_state=42)
rdm.fit(X_train, y_train)

In [None]:
rdm.best_params_

In [None]:
rdm.best_score_
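
Conceptually, a randomized search evaluates only a random subset of the parameter grid instead of every combination. The sketch below shows that sampling step in plain Python for a reduced, illustrative grid. `sample_candidates` is a hypothetical helper, and drawing without replacement from the fully enumerated grid is an assumption of this sketch, not a description of scikit-learn's internals.

```python
import itertools
import random


def sample_candidates(param_grid, n_iter, seed):
    """Draw up to n_iter distinct parameter combinations from a grid of lists."""
    names = sorted(param_grid)
    all_combos = list(itertools.product(*(param_grid[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, k=min(n_iter, len(all_combos)))
    return [dict(zip(names, combo)) for combo in chosen]


# Reduced grid for illustration: 4 * 4 * 3 = 48 possible combinations.
grid = {
    'max_depth': [3, 4, 5, 10],
    'learning_rate': [0.001, 0.01, 0.1, 0.3],
    'n_estimators': [100, 300, 500],
}

candidates = sample_candidates(grid, n_iter=10, seed=42)
for candidate in candidates:
    print(candidate)
```

Each sampled combination would then be scored with cross-validation, and the best-scoring one reported, which is far cheaper than fitting all combinations of a large grid.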

If you find this notebook useful, please share your valuable feedback.

Suggestions of any kind are welcome.

Don't forget to upvote if you like my work.