In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Bank Customer Churn

![](https://miro.medium.com/max/800/0*dzmm3qresODlScte)


In this notebook we look at what **customer churn** means and why it is important to identify the factors that drive it. We also look at how the bank can reduce churn and how to identify customers who are likely to churn at an early stage.

**Customer churn is the inactivity or disengagement of customers with our current business.** We will identify the factors that drive it through EDA (**Exploratory Data Analysis**). We will also apply a machine learning algorithm to identify such customers at an early stage so that we can engage with them and improve retention.

## Let's Start

In [None]:
# Importing important libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Importing Dataset

customer_data = pd.read_csv('/kaggle/input/churn-for-bank-customers/churn.csv')

In [None]:
# Checking dataset top rows in order to verify whether it's correctly loaded

customer_data.head()

In [None]:
# Checking the summary of dataset

customer_data.describe()

**Highlights of Summary statistics**
* We have a total of 10K customer records.
* Our customers' credit scores range from 350 to 850.
* The youngest customer is 18 years old and the oldest is 92.
* The maximum tenure of any customer with our bank is 10 years.
* Every customer uses at least one of our bank products, and the maximum is 4 products.
* Many customers hold a credit card (`HasCrCard` = 1), and we also have customers without one.
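
The ranges quoted above can also be read off directly instead of scanning the full `describe()` table. The sketch below is one possible way to do it; the helper name `column_ranges` and the tiny `toy` frame are hypothetical and only stand in for `customer_data`.

```python
import pandas as pd

def column_ranges(df, columns):
    """Return the minimum and maximum of each requested column (one row per column)."""
    return df[columns].agg(["min", "max"]).T

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({
    "CreditScore": [350, 620, 850],
    "Age": [18, 40, 92],
    "Tenure": [0, 5, 10],
    "NumOfProducts": [1, 2, 4],
})

ranges = column_ranges(toy, ["CreditScore", "Age", "Tenure", "NumOfProducts"])
print(ranges)
```

On the real data the same call would be `column_ranges(customer_data, [...])` with the column names of interest.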

In [None]:
# Let's check the info of our file, which shows the data type of each column, the number of non-null values, etc.

customer_data.info()

**Highlights from info**
* We have no null values in our dataset.
* We have 3 different data types in our dataset: integer, object and float.
* Now let's drop the columns RowNumber, CustomerId and Surname, as these are basically unique identifiers and will not contribute anything to our analysis.

In [None]:
# dropping RowNumber,CustomerId and Surname

customer_data = customer_data.drop(['RowNumber','CustomerId','Surname'],axis = 1)
customer_data.head()

In [None]:
# Converting the two text columns to the category dtype

customer_data['Geography'] = customer_data['Geography'].astype("category")
customer_data['Gender'] = customer_data['Gender'].astype("category")

In [None]:
customer_data.info()

We have successfully dropped all three columns mentioned above.<br>
**NOTE**: No data cleanup is done on this file, since it has no null values and we treat it as having no problematic outliers, so we move directly to the EDA part.
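
The note above can be backed by a quick check before moving on. This is a minimal sketch, assuming the common 1.5 x IQR rule as the outlier definition (the notebook itself does not state which rule it has in mind); `iqr_outlier_counts` and the `toy` frame are hypothetical names used only for illustration.

```python
import pandas as pd

def iqr_outlier_counts(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(mask.sum())
    return pd.Series(counts, name="outliers")

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({
    "Age": [30, 32, 35, 38, 40, 92],
    "Balance": [0.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

null_counts = toy.isna().sum()                        # missing values per column
outliers = iqr_outlier_counts(toy, ["Age", "Balance"])
print(null_counts)
print(outliers)
```

Running the same two checks on `customer_data` would show whether the "no nulls, no outliers" assumption holds under this particular rule; a different rule may give a different answer.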

# **Descriptive Analytics**
Descriptive analytics is the analysis of historical data. In this case we have data on 10K customers that has been collected over a period of time. Descriptive analytics is useful for understanding trends from the past.

# **Exploratory Data Analysis**

# Visualizations
Any analysis begins with visualization, because it is hard to spot a trend by just looking at raw data.<br>


In [None]:
fig, axs = plt.subplots(figsize=(20, 10))
sizes = [customer_data.Exited[customer_data['Exited']==1].count(), customer_data.Exited[customer_data['Exited']==0].count()]
axs.pie(sizes, explode=(0, 0.1), labels=['Exited', 'Retained'], autopct='%1.1f%%',shadow=True)
axs.axis('equal')
plt.title("Churned and Retained proportion", size = 25)
plt.show()

Around 20% of our customers have left the bank over the observed period.
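
The share shown in the pie chart can also be computed as a number. A minimal sketch, assuming `Exited` is coded as 0 (retained) and 1 (exited) as in the cells above; `churn_share` and `toy` are hypothetical names.

```python
import pandas as pd

def churn_share(df, target="Exited"):
    """Return the share of rows per value of the target column, ordered by value."""
    return df[target].value_counts(normalize=True).sort_index()

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({"Exited": [0, 0, 0, 0, 1]})

share = churn_share(toy)
print(share)
```

Calling `churn_share(customer_data)` gives the exact retained and exited proportions behind the chart.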

# **Continuous Variable Distribution**

In [None]:
# Let's visualize the distributions of CreditScore, Age, Balance and EstimatedSalary

sns.set(rc={'figure.figsize':(25,15)})
fig,axs = plt.subplots(2,2)
sns.set_theme(palette="crest_r")
sns.histplot(data = customer_data,x = "CreditScore",ax=axs[0,0])
sns.histplot(data = customer_data,x = "Age",ax=axs[0,1])
sns.histplot(data = customer_data,x = "Balance",ax=axs[1,0])
sns.histplot(data = customer_data,x = "EstimatedSalary",ax=axs[1,1])

**Highlights from above graphs**

* **Graph 1-** The majority of our customers have a credit score below 700 (an ideal credit score is >= 700), i.e. we have a larger number of less reliable customers.
* **Graph 2-** Age is right-skewed: most of our customers are relatively young, mainly between 29 and 40 years of age, with a small number of much older customers.
* **Graph 3-** Bank balance is roughly normally distributed for most customers, apart from around 3,600 customers with a zero balance.
* **Graph 4-** Estimated salary is roughly uniformly distributed, which means we have about the same number of customers at every salary level.
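
The readings taken from the histograms can be cross-checked numerically. The sketch below is one possible set of checks, not the method used to produce the figures quoted above; `distribution_checks` and `toy` are hypothetical names, and the 700 threshold is the one used in the text.

```python
import pandas as pd

def distribution_checks(df):
    """Numeric counterparts of the histogram observations."""
    return {
        # share of customers with a credit score below 700
        "share_credit_below_700": float((df["CreditScore"] < 700).mean()),
        # positive skewness indicates a right-skewed Age distribution
        "age_skewness": float(df["Age"].skew()),
        # number of customers holding a zero balance
        "zero_balance_customers": int((df["Balance"] == 0).sum()),
    }

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({
    "CreditScore": [600, 650, 710, 800],
    "Age": [25, 30, 35, 80],
    "Balance": [0.0, 0.0, 1200.5, 98000.0],
})

checks = distribution_checks(toy)
print(checks)
```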

# **Categorical Variable Distribution**

In [None]:
# let's visualize categorical variables 

sns.set(rc={'figure.figsize':(25,15)})
fig,axs = plt.subplots(2,3)
sns.set_theme(palette="Paired")
sns.countplot(data = customer_data,x = 'Gender',ax=axs[0,0])
sns.countplot(data = customer_data,x = 'Geography',ax=axs[0,1])
sns.countplot(data = customer_data,x = 'Tenure',ax=axs[0,2])
sns.countplot(data = customer_data,x = 'NumOfProducts',ax=axs[1,0])
sns.countplot(data = customer_data,x = 'HasCrCard',ax=axs[1,1])
sns.countplot(data = customer_data,x = 'IsActiveMember',ax=axs[1,2])

**Insights from above graphs**
* **Graph 1-** The majority of our customers are male.
* **Graph 2-** Most of our customers are from France, and we have almost the same number of customers from Germany and Spain.
* **Graph 3-** Most of our customers have been with the bank for between 1 and 9 years.
* **Graph 4-** Most of our customers use one of our bank products, followed by those using 2 products.
* **Graph 5-** Almost 70% of bank customers have a credit card.
* **Graph 6-** Surprisingly, we have almost the same number of digitally active and inactive customers.

In [None]:
# let's check correlation between all numerical columns

sns.set(rc={'figure.figsize':(20,10)})
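# numeric_only=True skips the categorical Geography and Gender columns (needs pandas >= 1.5)
corr = customer_data.corr(numeric_only=True)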
sns.heatmap(corr,annot=True)

Age shows some correlation with Exited status, i.e. age is likely one of the factors that drive churn. We'll look at this in detail later in this notebook when we explore Age against Exit status. The heatmap also shows no strong correlation among the independent variables, i.e. no sign of multicollinearity.
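
To make the heatmap easier to read for this question, the correlations with `Exited` can be pulled out and ranked by absolute value. This is a minimal sketch; `correlation_with_target` and the `toy` frame are hypothetical, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

def correlation_with_target(df, target="Exited"):
    """Pearson correlation of every numeric column with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({
    "Age": [25, 30, 45, 50, 60, 35],
    "EstimatedSalary": [50, 60, 55, 52, 58, 61],
    "Geography": ["France", "Spain", "Germany", "Germany", "France", "Spain"],
    "Exited": [0, 0, 1, 1, 1, 0],
})

ranked = correlation_with_target(toy)
print(ranked)
```

Non-numeric columns such as `Geography` are left out of the ranking, so they still need the categorical exploration further down.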

# Continuous Variable Exploration with respect to Exit Status 

In [None]:
# Let's explore all the continuous variables with respect to Exit status

sns.set(rc={'figure.figsize':(25,30)})
fig,axs = plt.subplots(3,2)
sns.set_theme(palette="tab10")
sns.boxplot(data = customer_data,x = "Exited",y = "Age",ax = axs[0,0])
sns.boxplot(data = customer_data,x = "Exited",y = "CreditScore",ax = axs[0,1])
sns.boxplot(data = customer_data,x = "Exited",y = "Balance",ax = axs[1,0])
sns.boxplot(data = customer_data,x = "Exited",y = "EstimatedSalary",ax = axs[1,1])
sns.boxplot(data = customer_data,x = "Exited",y = "Tenure",ax=axs[2,0])

**Insights from above graphs**
* **Graph 1-** Age shows a clear difference between the two groups: older customers tend to churn more than younger ones. We have already seen in the correlation plot that Age is related to Exited status.
* **Graph 2-** From the credit score alone it is hard to tell whether a customer tends to churn, because the distributions are almost the same for both groups. So credit score cannot be considered one of the factors.
* **Graph 3-** Balance shows broadly the same behaviour for both groups, except that customers with a high balance tend to churn a little more.
* **Graph 4-** Estimated salary shows the same behaviour for both groups.
* **Graph 5-** Tenure also shows similar behaviour, but customers with an average tenure tend to stay.
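
The box plots can be complemented with a small table of group medians, which makes the "similar" versus "different" judgement above easier to verify. A minimal sketch; `summary_by_exit` and `toy` are hypothetical names.

```python
import pandas as pd

def summary_by_exit(df, columns, target="Exited"):
    """Median of each continuous column, split by the target value."""
    return df.groupby(target)[columns].median()

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({
    "Age": [25, 30, 35, 45, 50, 60],
    "CreditScore": [640, 650, 660, 645, 650, 655],
    "Exited": [0, 0, 0, 1, 1, 1],
})

medians = summary_by_exit(toy, ["Age", "CreditScore"])
print(medians)
```

A column whose medians differ clearly between the two rows (like `Age` in the toy data) is a candidate churn factor, while near-identical medians suggest little separation.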

# Categorical Variable Exploration with respect to Exit Status 

In [None]:
from statsmodels.graphics.mosaicplot import mosaic

sns.set(rc = {'figure.figsize':(20,30)})
fig,axs = plt.subplots(3,2)
cd =  customer_data.copy(deep=True)
cd['Exited'] = np.where(cd['Exited']==1,"(Exited)","(Non-Exited)")
cd['IsActiveMember'] = np.where(cd['IsActiveMember']==1,"Active","Non-Active")
cd['HasCrCard'] = np.where(cd['HasCrCard']==1,"Credit Card User","Non-Credit Card user")
cd['NumOfProducts'] = np.where(cd['NumOfProducts']==1,"Product=1",
                               (np.where(cd['NumOfProducts']==2,"Product=2",
                                         (np.where(cd['NumOfProducts']==3,"Product=3","Product=4")))))
mosaic(cd,['Gender','Exited'],ax=axs[0,0])
mosaic(cd,['Geography','Exited'],ax=axs[0,1])
mosaic(cd,['Tenure','Exited'],ax=axs[1,0])
mosaic(cd,['IsActiveMember','Exited'],ax=axs[1,1])
mosaic(cd,['HasCrCard','Exited'],ax=axs[2,0])
mosaic(cd,['NumOfProducts','Exited'],ax=axs[2,1])
plt.show()

**Insights from above graphs**
* How to read a mosaic plot: the width of each bar reflects the size of that group (the wider the bar, the more customers in it), and the height of each segment within a bar shows the proportion of that group that exited or stayed.
* **Graph 1** - We have more male customers, but female customers tend to churn more.
* **Graph 2** - Germany has the highest churn. France and Spain have about the same number of churned customers, but France has more customers overall, so France performs best of the three regions.
* **Graph 3** - Every tenure shows almost the same level of churn, so Tenure is not a good factor for identifying churn.
* **Graph 4** - Customers who are not digitally active tend to churn more than digitally active ones.
* **Graph 5** - We have almost the same proportion of churn among credit card holders and non-holders.
* **Graph 6** - Customers tend to churn more if they are using only one product of our bank.
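
The segment heights in a mosaic plot correspond to churn rates within each group, which can be computed directly. The sketch below assumes `Exited` is still coded 0/1 (as in `customer_data`, not the relabelled copy `cd`); `churn_rate_by` and `toy` are hypothetical names.

```python
import pandas as pd

def churn_rate_by(df, column, target="Exited"):
    """Share of exited customers within each level of a categorical column."""
    # observed=True avoids empty groups when the column has the category dtype
    rates = df.groupby(column, observed=True)[target].mean()
    return rates.sort_values(ascending=False)

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({
    "Geography": ["France"] * 4 + ["Germany"] * 2 + ["Spain"] * 2,
    "Exited": [1, 0, 0, 0, 1, 0, 0, 0],
})

rates = churn_rate_by(toy, "Geography")
print(rates)
```

The same call with `"Gender"`, `"IsActiveMember"`, `"HasCrCard"` or `"NumOfProducts"` gives the numbers behind the other panels.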

# **Insights from EDA**

**Based on our EDA, the important factors for identifying customer churn are:**
* Age
* IsActiveMember
* Geography
* NumOfProducts
* Gender

# **Conclusion and Recommendation**

* Our priority should be to bring most of our customers onto the digital platform, because digitally inactive customers churn more. This can be done by promoting the digital platform with first-time joining offers, cashback, etc.

* Germany has a high churn percentage compared to France and Spain, so the bank should focus on Germany and identify which facilities are present in France and Spain (number of ATMs, number of bank branches, locality of branches, etc.) and should be introduced in Germany in order to improve retention.

* Churn is high among older customers. One possible reason is that older customers are looking for a good retirement plan or a high return on fixed deposits, which other banks might be offering. In this case we should target such customers and offer them better deals in order to retain them.

* We should also focus on acquiring reliable customers, because more than 68% of our customers have a credit score below 700 (an ideal credit score is 700 and above).

* The bank should offer more than 1 product to its customers, because customers with more than 1 product have churned less.

* The bank should engage more with female customers by offering facilities like doorstep KYC, welcome goodies for joining the bank, etc.

* The bank should also focus on customers who tend to churn without withdrawing their entire balance from the bank.
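
The last recommendation can be sized with a simple filter. This is a rough sketch, assuming the `Balance` column reflects the balance still held at the time of the snapshot (the dataset description would need to confirm this); `churned_with_balance` and `toy` are hypothetical names.

```python
import pandas as pd

def churned_with_balance(df):
    """Count churned customers and those among them who still show a positive balance."""
    churned = df[df["Exited"] == 1]
    with_balance = churned[churned["Balance"] > 0]
    return {
        "churned": len(churned),
        "churned_with_balance": len(with_balance),
        "balance_held": float(with_balance["Balance"].sum()),
    }

# Tiny illustrative frame with made-up values, NOT the real dataset
toy = pd.DataFrame({
    "Exited": [1, 1, 1, 0, 0],
    "Balance": [0.0, 1500.0, 2500.0, 300.0, 0.0],
})

result = churned_with_balance(toy)
print(result)
```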