# Bank Customer Churn Research Project

**Research question:**
This notebook will focus on determining which non-sociodemographic customer attributes can help us predict whether or not the customer will leave the bank.

More specifically, we will explore some of the features of the clients provided in the dataset, such as:
- Credit Score
- Tenure
- Balance
- Number of products
- Salary

**Explanation:**
I want to eliminate the distortion caused by some sociodemographic attributes (e.g. age, sex), since they can predetermine the psychological and behavioral aspects of a customer. For example, younger people are likely to be more flexible in choosing a bank, while older people may find it difficult to switch between banks.

For the purposes of this analysis, I want to focus only on what could be called the business-related features of a customer.


**Motivation:**
I work for a bank, and one of my recent projects involved determining the customer attributes that influence customer activity. I did the research by manually exploring the database and plotting Excel graphs. Now that I've gained some skills in Python, I could possibly find some new insights and use Machine Learning tools to predict customers' activity based on their characteristics.

**Expected results:**
1. A set of factors which determine a customer's decision to leave the bank + their correlation coefficients.
2. A model that helps predict whether or not the customer will leave the bank.

**Practical value:**

The model will have a practical application in the banking sector, particularly for the bank I work for. This solution will help us find weak spots in our customer base and thus re-activate the customers who are likely to leave the bank.

**Further research:**

Once I find some interesting insights from this research, I wish to conduct a more detailed product-related analysis. In particular, I want to find out which specific banking products (deposits, current accounts, credit cards, etc.) make the customer stay with the bank.

## Dataset used:
Please find the link for the dataset used in this project: https://www.kaggle.com/mathchi/churn-for-bank-customers

**Content of the dataset**:
- **RowNumber:** corresponds to the record (row) number and has no effect on the output.
- **CustomerId:** contains random values and has no effect on customer leaving the bank.
- **Surname:** the surname of a customer has no impact on their decision to leave the bank.
- **CreditScore:** can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.
- **Geography:** a customer’s location can affect their decision to leave the bank.
- **Gender:** it’s interesting to explore whether gender plays a role in a customer leaving the bank.
- **Age:** this is certainly relevant, since older customers are less likely to leave their bank than younger ones.
- **Tenure:** refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
- **Balance:** also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
- **NumOfProducts:** refers to the number of products that a customer has purchased through the bank.
- **HasCrCard:** denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
- **IsActiveMember:** active customers are less likely to leave the bank.
- **EstimatedSalary:** as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
- **Exited:** whether or not the customer left the bank.  (0=No,1=Yes)

**Data cleaning:**

**Step 1**

The provided dataset is quite clean: it has no null values and no evident outliers. This was checked with the following functions:
- .isnull().values.any()
- .info()
- .describe()

However, given our initial purpose, we do not need all of the features in the final dataset. Therefore, I deleted the columns that describe sociodemographic aspects of a customer (e.g. 'Surname', 'Gender', 'Age'), along with the identifier and flag columns that are not studied here ('CustomerId', 'HasCrCard', 'IsActiveMember').
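
As a minimal sketch of this step, the columns can be removed in a single call. The frame `toy` and its values are made up for illustration and stand in for the real dataset; `drop(columns=...)` returns a new frame, so the original stays available.

```python
import pandas as pd

# Made-up miniature of the dataset (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Surname": ["A", "B"],
    "Gender": ["Female", "Male"],
    "Age": [40, 35],
    "CreditScore": [650, 700],
    "Exited": [0, 1],
})

# Columns treated as sociodemographic in this notebook
sociodemographic = ["Surname", "Gender", "Age"]

# drop() returns a new DataFrame; `toy` itself is left unchanged
trimmed = toy.drop(columns=sociodemographic)
print(trimmed.columns.tolist())
```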

**Step 2**

Right now the dataset contains values from three countries: Germany, Spain, France.
We don't know where the data comes from: whether it's from a single bank or aggregated data. So, to avoid diluting the insights, let's focus on one country. If the analysis provides some actionable insights, the findings could be extrapolated to other countries. We may still need the aggregate data, so I will save a copy of the original dataset.

For now, the dataset is ready to go. Let's get down to work!


In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
raw_data = pd.read_csv('./churn.csv')

In [None]:
raw_data

**Let's check if there are any null values**

In [None]:
raw_data.isnull().values.any()

In [None]:
raw_data['Geography'].unique()

In [None]:
raw_data.info()

In [None]:
raw_data.describe()

**So, the dataset appears to be pretty clean. No null values, no evident outliers**
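
The checks above are visual. One possible way to make the "no evident outliers" claim more explicit is the common 1.5 × IQR rule, sketched below. This is an assumption about how one might check it, not what was done in this notebook; the helper `iqr_outlier_counts` and the `toy` frame are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparison aligns the per-column bounds with the frame's columns
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum()

# Made-up data with one obviously extreme credit score
toy = pd.DataFrame({
    "CreditScore": [600, 620, 640, 650, 660, 680, 700, 1500],
    "Tenure": [1, 2, 3, 4, 5, 6, 7, 8],
})

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

On the real data, the same helper could be called as `iqr_outlier_counts(raw_data)`; columns with a non-zero count would deserve a closer look.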

**_The only thing that may help us conduct a more precise analysis is to choose one country. We don't know where the data comes from: whether it's from a single bank or aggregated data. So, to avoid diluting the insights, let's focus on Germany, for instance._**

In [None]:
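# .copy() makes df independent of raw_data, so the column deletions below do not trigger SettingWithCopyWarning
df = raw_data[raw_data['Geography'] == 'Germany'].copy()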

In [None]:
df

**So, Germany accounts for 25% of the whole dataset.**

In [None]:
del df['Surname']

In [None]:
del df['Gender']

In [None]:
del df['Age']

In [None]:
del df['CustomerId']

In [None]:
del df['HasCrCard']

In [None]:
del df['IsActiveMember']

In [None]:
df

In [None]:
df.describe()

In [None]:
df['Exited'].value_counts()

**So, of the German customers in our subset, 814 have left the bank. Let's explore each of the parameters and their impact on the exit factor.**
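
A raw count is easier to interpret next to the share it represents. The sketch below shows how counts, shares and the churn rate relate; the series `exited` contains made-up values standing in for `df['Exited']`.

```python
import pandas as pd

# Made-up stand-in for df['Exited'] (1 = left the bank, 0 = stayed)
exited = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Exited")

counts = exited.value_counts()                 # absolute counts per class
shares = exited.value_counts(normalize=True)   # same, as fractions of the total
churn_rate = exited.mean()                     # mean of a 0/1 column = share of 1s

print(counts)
print(shares)
print(f"Churn rate: {churn_rate:.1%}")
```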

In [None]:
# Customers who left the bank
left = df.loc[df['Exited'] == 1]

In [None]:
# Customers who did not leave the bank
not_left = df.loc[df['Exited'] == 0]

**Credit Score**

In [None]:
# Credit Score & Those customers who left the bank
plt.figure(figsize=(8,6))
plt.xlabel('CreditScore')
plt.ylabel('Count of customers')
plt.hist(left['CreditScore'],bins=10, alpha=1.0, label='Left the bank')
plt.legend(loc='upper right')
plt.show()

**Balance**

In [None]:
# Balance & Those customers who left the bank
plt.figure(figsize=(8,6))
plt.xlabel('Balance')
plt.ylabel('Count of customers')
plt.hist(left['Balance'],bins=15, alpha=1.0, label='Left the bank')
plt.legend(loc='upper right')
plt.show()

**Number of products**

In [None]:
# Num of products & Those customers who left the bank
plt.figure(figsize=(8,6))
plt.xlabel('Number of products')
plt.ylabel('Count of customers')
plt.hist(left['NumOfProducts'],bins=3, alpha=1.0, label='Left the bank')
plt.legend(loc='upper right')
plt.show()

**Estimated Salary**

In [None]:
# Estimated Salary & Those customers who left the bank
plt.figure(figsize=(8,6))
plt.xlabel('Estimated Salary')
plt.ylabel('Count of customers')
plt.hist(left['EstimatedSalary'],bins=10, alpha=1.0, label='Left the bank')
plt.legend(loc='upper right')
plt.show()

**Now we will build a heatmap with correlation indices of the chosen parameters.**

In [None]:
del df['RowNumber']

In [None]:
df

In [None]:
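k = 7
# Geography is still a text column, so restrict the correlation to numeric columns
numeric_df = df.select_dtypes(include='number')
# Columns with the largest correlation to Exited (Exited itself comes first)
col = numeric_df.corr().nlargest(k, 'Exited')['Exited'].index
corr = df[col].corr()
plt.figure(figsize=(10,6))
sb.heatmap(corr, annot=True)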

**As we can see, the features show different tendencies:**
1. Credit score: customers with an average score of 600-700 leave the bank more frequently than others (let's call them the 'middle zone').
2. Balance: customers with balances of 100k-150k leave the bank more frequently than others (again, the 'middle zone').
3. Number of products: customers with 1 product are the most frequent leavers.
4. Estimated salary: poor correlation; the histogram does not reveal any specific insights, since customers with various salaries leave the bank.

And, according to the correlation matrix, balance has the highest correlation coefficient, though we cannot fully trust this diagram, because the histogram above showed us a 'pyramid' structure, i.e. a non-monotonic relationship that a linear correlation coefficient does not capture well.
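
The histograms above count leavers only, so a tall bar may simply reflect a range that contains many customers overall. A complementary view, sketched here on made-up data, is the churn rate within each group. The bin edges and the `toy` frame are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up customers (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "NumOfProducts": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "Balance": [90_000, 110_000, 120_000, 140_000, 95_000,
                105_000, 130_000, 160_000, 115_000, 125_000],
    "Exited": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1],
})

# Share of leavers and group size for each number of products
rate_by_products = toy.groupby("NumOfProducts")["Exited"].agg(["mean", "count"])

# Share of leavers per balance range; bin edges are arbitrary for this sketch
balance_bins = pd.cut(toy["Balance"], bins=[0, 100_000, 150_000, 250_000])
rate_by_balance = toy.groupby(balance_bins, observed=True)["Exited"].mean()

print(rate_by_products)
print(rate_by_balance)
```

Applied to `df`, the same grouping would show whether the 'middle zone' groups really churn at a higher rate, or are simply the largest groups.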

**Let's go on to the Machine Learning exercise.**

**This is going to be a classification task, since the exit status is binary (0/1).
Therefore, we will use a decision tree classifier here.**

In [None]:
df

**The target (y) is Exited status (0/1).**

In [None]:
y = df['Exited'].copy()

In [None]:
y

In [None]:
features = ['CreditScore','Tenure','Balance','NumOfProducts','EstimatedSalary']

In [None]:
X = df[features].copy()

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
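
One optional refinement, sketched on synthetic data rather than on `X` and `y`: passing `stratify` to `train_test_split` keeps the share of leavers roughly equal in the train and test parts, which can matter when the classes are imbalanced. The arrays `X_toy` and `y_toy` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 300 rows, 5 features, 30% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = np.array([1] * 90 + [0] * 210)

# stratify=y_toy preserves the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=324, stratify=y_toy
)

print(f"Share of positives in train: {y_tr.mean():.3f}")
print(f"Share of positives in test:  {y_te.mean():.3f}")
```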

In [None]:
y_train

In [None]:
y_train[:50]

In [None]:
y_test[:50]

In [None]:
ExitedClassifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
ExitedClassifier.fit(X_train, y_train)

In [None]:
type(ExitedClassifier)

In [None]:
predictions = ExitedClassifier.predict(X_test)

In [None]:
predictions[:40]

In [None]:
y_test

In [None]:
accuracy_score(y_true = y_test, y_pred = predictions)

**So, the accuracy score of our model is 73%, which is quite good! Remember, the human-calculated rate is around 80%.**

However, it is still recommended to look in more detail at the recall and precision scores, since accuracy alone can be misleading when one class is much more frequent than the other.
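
A minimal sketch of such a check, using made-up labels in place of `y_test` and `predictions`. It also computes the accuracy of a trivial baseline that always predicts the majority class, which is a useful reference point for any accuracy figure.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels standing in for y_test and predictions (1 = left the bank)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted leavers, how many really left
recall = recall_score(y_true, y_pred)        # of real leavers, how many were caught

# Accuracy of always predicting the more frequent class
baseline_accuracy = max(y_true.mean(), 1 - y_true.mean())

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy={accuracy:.2f}, baseline={baseline_accuracy:.2f}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
print(cm)
```

In this toy case the model matches the baseline accuracy while catching only a quarter of the leavers, which is exactly the kind of situation that accuracy alone hides.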