# Bank Customer Churn Prediction Analysis
**Problem definition-** A fictitious bank serving over 10,000 customers across 3 European countries faces a growing problem: customer churn has gradually increased over time. The bank aims to address this proactively by developing and deploying a predictive model that identifies at-risk customers before they decide to leave. Such a model would support more targeted, cost-efficient customer retention strategies.

**Project Goal-** The primary objective of this project is to deploy a robust, generalizable churn prediction model that helps the bank forecast customer churn on both existing and new, unseen data. This will allow the bank to take targeted actions to reduce the churn rate. In addition, the project aims to uncover the underlying patterns and statistics behind the factors that influence customer churn. Specifically, I will explore the relationships between churn and demographic factors such as age, income, and region, as well as behavioral factors such as the number of products a customer holds and the number of years they have been with the bank.
To achieve these objectives, I will use three machine learning algorithms to predict the target variable: a linear model (logistic regression), a non-linear model (decision tree), and an ensemble method (random forest). Their results are compared to select the best-performing model.


## Data Understanding

In [None]:
# Basic libraries for EDA
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns 
from seaborn import heatmap

# Upsampling library
from imblearn.over_sampling import SMOTE

# modelling
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# validation and evaluation metrics libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score, roc_curve

In [None]:
# loads dataset
dfm = pd.read_csv(r"C:\Users\naniv\Downloads\UOP\Semester_3\Customer_Analytics\Capstone_project\BankData.csv")
dfm.head(5)

In [None]:
# dataframe shape
print(f'There are {dfm.shape[0]} rows and {dfm.shape[1]} columns in the dataset')

In [None]:
# data types of the attributes
dfm.dtypes

In [None]:
# Dataframe information
dfm.info()

In [None]:
# checks duplicates
dupl = dfm[dfm['CustomerId'].duplicated()]
dupl

In [None]:
# checks missing values

dfm.isnull().sum()

In [None]:
# Set the RowNumber column as index
dfm = dfm.set_index('RowNumber')
dfm.head(2)

In [None]:
# uniqueness of data available 

a = dfm['CustomerId'].nunique()
b = dfm.shape[0]

if a == b:
    print('Each row in the dataframe represents individual customers. Proceed further :)')
else:
    print('There are duplicate customers in the dataset. Please check!')

In [None]:
# statistical summary of the dataframe

dfm.describe()

In [None]:
dfm.describe(include = 'object')

In [None]:
# Data distributions of the numerical variables

fig,ax = plt.subplots(3,2, figsize = (14,14))

ax[0,0].hist(dfm['CreditScore'])
ax[0,0].set_title('CreditScore')
ax[0,1].hist(dfm['Age'])
ax[0,1].set_title('Age')
ax[1,0].hist(dfm['Tenure'])
ax[1,0].set_title('Tenure')
ax[1,1].hist(dfm['EstimatedSalary'])
ax[1,1].set_title('EstimatedSalary')
ax[1,1].set_ylim(950,None)
ax[2,0].hist(dfm['NumOfProducts'])
ax[2,0].set_title('Number of products')
plt.show()

In [None]:
# There is class imbalance in the target variable.
# This is common in churn data: most customers stay and only a minority leave the bank.
# The ratio of the target classes is roughly 4:1.
dfm['Exited'].value_counts()

In [None]:
# percentage of data belonging to minority class
class1per = (dfm['Exited'].value_counts()[1] / dfm.shape[0]) * 100
print(f'Percentage of Minority class {class1per:.2f} %')
# the imbalance is moderate and needs to be addressed with a resampling technique

In [None]:
# Target variable distribution
sns.displot(dfm['Exited'])
plt.title('Target Variable Distribution')
plt.show()

In [None]:
# categorical variables distribution

catcols09 = ['Gender', 'Geography','HasCrCard', 'IsActiveMember']

for i in catcols09:
    sns.histplot(dfm[i])
    plt.title(f'Distribution of {i}')
    plt.show()

In [None]:
# Data validations

catcols09 = ['Gender', 'Geography','HasCrCard', 'IsActiveMember']

for t in catcols09:
    count = dfm[t].value_counts()
    
    unq_valper = count / len(dfm) * 100 
    print(f'Percentage of unique values of column {t}')
    print(unq_valper)
    print('-----------------------------')

In [None]:
# Data Validations
numcols09 = ['CreditScore', 'Age','Tenure', 'NumOfProducts','Balance','EstimatedSalary' ]

for m in numcols09:
    count = dfm[m].value_counts()
    
    unq_valper = (count / len(dfm) * 100 ).nlargest(3)
    print(f'Percentage of unique values of column {m}')
    print(unq_valper)
    print('-----------------------------')

## Exploratory Data Analysis

In [None]:
# pair plot for correlation

sns.pairplot(dfm)
plt.show()

In [None]:
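# correlation between numeric features
# using the Spearman method (pandas defaults to Pearson unless method is set)
corr11 = dfm.corr(method = 'spearman', numeric_only = True)
corr11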
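
The choice between Pearson and Spearman matters when a relationship is monotonic but not linear. A minimal sketch on synthetic data, where the column names `x` and `y` are made up for illustration and are not part of the bank dataset:

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but non-linear relationship (illustrative only)
demo = pd.DataFrame({"x": np.arange(1, 21, dtype=float)})
demo["y"] = np.exp(demo["x"] / 3.0)

# Pearson measures linear association; Spearman measures rank (monotonic) association
pearson_r = demo.corr(method="pearson").loc["x", "y"]
spearman_r = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman reports a perfect correlation while Pearson is pulled down by the curvature.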


In [None]:
# visual representation of correlation

fig, ax = plt.subplots(figsize = (8,8))
ax = heatmap(
    corr11,
    annot = True,
    ax = ax,
    cmap = "RdBu_r",
    vmin = -1,  # correlation coefficients range from -1 to 1
    vmax = 1,
)
fontdict = { 'fontsize': 20}
ax.set_title("Heatmap of Correlation between continuous variables", fontdict= fontdict, pad =40)
plt.show()


In [None]:
# Age has the highest correlation with the target variable
# let's look at patterns of Age vs each categorical variable, split by Exited


for i in catcols09:
    sns.boxplot(data = dfm, x = i, y = 'Age', hue = 'Exited')
    plt.title(f'{i} vs Age boxplot' )
    plt.show()

In [None]:
# skewness values to check whether the data is skewed
print(dfm.skew(numeric_only = True)) # rule of thumb: values outside -1 to 1 indicate strong skew
# Age is right-skewed: older customers are less represented in the dataset than younger ones (life expectancy in France, Spain and Germany is below 84)
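
To make the sign of the skewness value concrete, here is a small sketch using two hypothetical series (the values are invented and do not come from the bank dataset). A symmetric series has zero skew, while a long tail of large values gives a positive (right) skew:

```python
import pandas as pd

# Hypothetical values, not taken from the bank dataset
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

sym_skew = symmetric.skew()
right_skew = right_skewed.skew()

print(f"Symmetric series skew:    {sym_skew:.3f}")
print(f"Right-skewed series skew: {right_skew:.3f}")
```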

In [None]:
# Bar plot comparing mean Balance across categorical columns, split by Exited
borders = ['top','right']
for i in catcols09:
    ax = sns.barplot(data = dfm, x = i, y = 'Balance', hue = 'Exited',errwidth = 0)
    for j in borders:
        ax.spines[j].set_visible(False) # removes borders 
    ax.grid(True,which= 'major', axis = 'y', linestyle= '-',linewidth = 0.3,zorder = 0) # set gridlines
    ax.set_axisbelow(True) # draw the gridlines behind the bars
    ax.set_title(f'Relation of {i} with Balance Amount')
    plt.show()

In [None]:
# Validating findings in visualizations with actual numbers
# average balance of customers who stayed vs. customers who exited
amnt0 = np.mean(dfm[dfm['Exited']==0]['Balance'])
amnt1 = np.mean(dfm[dfm['Exited']== 1]['Balance'])
print(f'Customers who stayed with the bank have an average balance of {np.round(amnt0)} \ncustomers who exited the bank have an average balance of {np.round(amnt1)}')

In [None]:
#Checks count of customers exited and not exited using number of products

prod0=dfm[dfm['Exited'] == 0] ['NumOfProducts']
prod1 = dfm[dfm['Exited']== 1] ['NumOfProducts']

print('Number of products held by customers who did not exit')
print('--' * 20)
print(prod0.value_counts())

print('--' * 30)

print('Number of products held by customers who exited')
print('--' * 20)
print(prod1.value_counts())

In [None]:
# visualize to support above results
#negatively correlated num of products vs balance in bank account
sns.scatterplot(data = dfm, x ='NumOfProducts',y= 'Balance', hue ='Exited')
plt.title('Num Of Products Registered vs Balance Amount')
plt.show()

In [None]:
# customers who exited with a balance at or below the average
 
countbal = (dfm['Balance'] <= np.mean(dfm['Balance'])) & (dfm['Exited'] == 1 )

print(f'Number of customers with at most the average balance who exited the bank: {countbal.sum()}' )

## Data Preparation & Modelling

In [None]:
# there are no duplicates and no missing values
# Data cleaning
cleaned_dfm = dfm.drop(labels = ['CustomerId', 'Surname'], axis = 1)

In [None]:
# encoding categorical features
df_encoded= pd.get_dummies(cleaned_dfm, drop_first = True)
df_encoded.head()

In [None]:
X1 = df_encoded.drop(labels = ['Exited'], axis = 1) 
y1 = df_encoded['Exited']

X_train1, X_test1, y_train1, y_test1 = train_test_split(X1,y1, test_size = 0.30, random_state = 42, stratify = y1)
# stratify keeps the class ratio of y1 the same in the train and test sets (it does not rebalance the classes)
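
As a quick check of what `stratify` does, the sketch below splits synthetic 4:1 labels (invented for illustration, not the bank data) and compares the positive rate in each split. The split keeps the original class ratio in both parts; it does not change the ratio itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 4:1 imbalanced labels (800 negatives, 200 positives); illustrative only
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the original 20% positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate  train: {train_rate:.3f}  test: {test_rate:.3f}")
```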


In [None]:
# scaling data
scaler = StandardScaler(with_mean = True)

X_train1 = scaler.fit_transform(X_train1)
X_test1 = scaler.transform(X_test1)
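
The scaler is fitted on the training set only and then reused on the test set, so the test data is standardized with the training mean and standard deviation. A minimal sketch on random data (the arrays `train` and `test` are synthetic stand-ins) that checks this against the formula z = (x - mean_train) / std_train:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # synthetic training features
test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))    # synthetic test features

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = demo_scaler.transform(test)        # reuses the train statistics

# Same result by hand, using the population standard deviation (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```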

In [None]:
# Baseline model performance 


algorithms = {"logistic Regression": LogisticRegression(),
       "Decision Tree": DecisionTreeClassifier(),
       "Random Forest Classifier": RandomForestClassifier()} # dictionary of supervised models


for key , algo in algorithms.items():
    
    #fits the data in to the model
    algo.fit(X_train1, y_train1)
    
    #Prediction
    prediction = algo.predict(X_test1)
    
    #accuracy 
    acc = accuracy_score(y_test1, prediction)
    print(f'Accuracy of {key} {acc:.2f} ')
    
    #Classification Report 
    report = classification_report(y_test1, prediction,zero_division=0)
    print(f'Classification Report of  {key}')
    print(report)
    
    
    #confusion matrix 
    conf_mtrx = confusion_matrix(y_test1, prediction)
    print(f"Confusion Matrix {key}:")
    print(conf_mtrx)
    print('--------------------------------------------------------------')
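
With a roughly 4:1 class ratio, accuracy alone can look good even for a model that never predicts churn, which is why the classification report and confusion matrix are printed as well. A small sketch with made-up labels (not the notebook's results) shows the effect:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels with a 4:1 class ratio; illustrative only
y_true = np.array([0] * 80 + [1] * 20)

# A "model" that always predicts the majority class (no churn)
y_pred_majority = np.zeros_like(y_true)

acc_majority = accuracy_score(y_true, y_pred_majority)
rec_majority = recall_score(y_true, y_pred_majority)

print(f"Accuracy: {acc_majority:.2f}")
print(f"Recall on the churn class: {rec_majority:.2f}")
```

The accuracy matches the share of the majority class while recall on the churn class is zero, so recall and F1 for class 1 are the more informative numbers here.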

## Model Evaluation

In [None]:
# k-fold cross validation
# note: plain KFold does not preserve the class ratio in each fold;
# for an imbalanced target, StratifiedKFold is the usual choice

kf = KFold(n_splits = 5)

print(kf)
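
For an imbalanced target, scikit-learn's `StratifiedKFold` keeps the class ratio the same in every fold, which plain `KFold` does not guarantee. The sketch below uses made-up labels that are sorted on purpose to exaggerate the difference; on shuffled data such as the training split here, the gap would be much smaller.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up labels, deliberately sorted to exaggerate the effect; illustrative only
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))

def fold_positive_rates(splitter):
    """Return the share of positive labels in each validation fold."""
    return [y_demo[test_idx].mean() for _, test_idx in splitter.split(X_demo, y_demo)]

plain_rates = fold_positive_rates(KFold(n_splits=5))
strat_rates = fold_positive_rates(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_rates)
print("StratifiedKFold:", strat_rates)
```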

In [None]:
for key, algo in algorithms.items():
    kfoldscores = cross_val_score(algo, X_train1,y_train1 , cv = kf)
    print(f'kfold Cross validation of {key} {kfoldscores}')
    print('*************************')
    print(f'Kfold cross validation score mean {kfoldscores.mean()}')
    print('--------------------------------------------')

In [None]:
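# Upsampling the minority class using SMOTE.
# Split first and resample only the training portion, so that no synthetic
# samples (or the real points they were built from) end up in the test set.
X_train2, X_test2, y_train2, y_test2 = train_test_split(X1,y1, test_size = 0.30, random_state = 42, stratify = y1)

sm = SMOTE(sampling_strategy='auto', k_neighbors= 5, random_state = 42)

x_resamp , y_resamp = sm.fit_resample(X_train2, y_train2)

X_train2 = scaler.fit_transform(x_resamp)
y_train2 = y_resamp
X_test2 = scaler.transform(X_test2)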
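
The core idea of SMOTE is to place each synthetic minority sample on the line segment between a real minority sample and one of its nearest minority neighbours. The toy NumPy sketch below illustrates only that interpolation step. It is not imblearn's implementation, and the helper `smote_like_samples` and the random `minority_demo` points are hypothetical names introduced for illustration.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Toy SMOTE-style interpolation between minority points and their k nearest neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k neighbours for each new sample
    base_idx = rng.integers(0, len(minority), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return minority[base_idx] + lam * (minority[nb_idx] - minority[base_idx])

rng = np.random.default_rng(1)
minority_demo = rng.normal(size=(20, 2))  # hypothetical minority-class points
synthetic = smote_like_samples(minority_demo, n_new=30)
print(synthetic.shape)
```

Because every synthetic point is built from real minority points, resampling before the train/test split would let information from test rows shape the training data, which is why resampling is normally restricted to the training portion.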

In [None]:
# checks classes counts after upsampling and compares it with count of before upsampling
y_resamp.value_counts(), y1.value_counts()

In [None]:
# model performance after upsampling target variable

algorithms = {"logistic Regression": LogisticRegression(),
       "Decision Tree": DecisionTreeClassifier(),
       "Random Forest Classifier": RandomForestClassifier()}


for key , algo in algorithms.items():
    
    #fits the data in to the model
    algo.fit(X_train2, y_train2)
    
    #Prediction
    prediction = algo.predict(X_test2)
    
    #accuracy 
    acc = accuracy_score(y_test2, prediction)
    print(f'Accuracy of {key} {acc:.2f} ')
    
    #Classification Report 
    report = classification_report(y_test2, prediction,zero_division=0)
    print(f'Classification Report of  {key}')
    print(report)
    
    
    #confusion matrix 
    conf_mtrx = confusion_matrix(y_test2, prediction)
    print(f"Confusion Matrix {key}:")
    print(conf_mtrx)
    print('--------------------------------------------------------------')

In [None]:
# best model evaluation metrics and confusion matrix

randforest = RandomForestClassifier()

randforest.fit(X_train2, y_train2)

prediction= randforest.predict(X_test2)

acc = accuracy_score(y_test2, prediction)
print(f'Accuracy {acc:.2f} ')

clf_report =  classification_report(y_test2, prediction)
print(f'Classification Report')
print(clf_report)

confu_matrix = confusion_matrix(y_test2, prediction)
print('Confusion Matrix')
print(confu_matrix)

In [None]:
for key, algo in algorithms.items():
    kfoldscores = cross_val_score(algo, X_train2,y_train2 , cv = 5)
    print(f'kfold Cross validation of {key} {kfoldscores}')
    print('*************************')
    print(f'Kfold cross validation score mean {kfoldscores.mean()}')
    print('--------------------------------------------')

In [None]:
# visualize confusion matrix

cm_display = ConfusionMatrixDisplay(confusion_matrix = confu_matrix, display_labels = randforest.classes_)

cm_display.plot()

plt.show()
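
The precision, recall and F1 values in the classification report can be derived directly from the confusion matrix. A short sketch with a hypothetical matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes); the counts are invented and are not the model's results:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted
cm_demo = np.array([[850, 150],
                    [100, 900]])
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```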

In [None]:
# Based on the F1 score and the cross-validation score, the random forest classifier is chosen as the final model
# Feature importances for the final model

randforest = RandomForestClassifier()

randforest.fit(X_train2, y_train2)

print('Feature importances for Model Random Forest Classifier')
print('------------------------------------------------------')
for col , val in sorted(
    zip(X1.columns, 
       randforest.feature_importances_,
       ),
    key = lambda x: x[1],
    reverse = True,
    ):
    print(f'{col:15}{val:10.3f}')


In [None]:
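# ROC AUC score, computed from the predicted probabilities of the positive class
# (hard 0/1 predictions would reduce the ROC curve to a single threshold)
rocauc = roc_auc_score(y_test2, randforest.predict_proba(X_test2)[:, 1])
rocauc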
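
ROC AUC measures how well a model ranks positives above negatives, so it is normally computed from scores or probabilities instead of hard class labels. The sketch below uses made-up labels and scores (not the model's output) to show how thresholding the scores first can lower the reported value:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities; illustrative only
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.45, 0.55, 0.6, 0.8, 0.9])

# Hard labels obtained by thresholding the scores at 0.5
hard_labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(f"AUC from probabilities: {auc_from_scores:.4f}")
print(f"AUC from hard labels:   {auc_from_labels:.4f}")
```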

**References**


https://www.kaggle.com/datasets/saurabhbadole/bank-customer-churn-prediction-dataset
https://creativecommons.org/licenses/by-nc-sa/4.0/
https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.grid.html
https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancya
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
https://medium.com/@rithpansanga/choosing-the-right-size-a-look-at-the-differences-between-upsampling-and-downsampling-methods-daae83915c19#:~:text=If%20the%20focus%20is%20on,may%20be%20a%20better%20option.
https://youtu.be/4SivdTLIwHc?feature=shared
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
https://www.geeksforgeeks.org/spearmans-rank-correlation/
https://aws.amazon.com/what-is/data-preparation/
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html
https://www.datacamp.com/tutorial/understanding-logistic-regression-python
https://www.ibm.com/topics/decision-trees
https://www.ibm.com/topics/random-forest