# **A Machine Learning Program for Detecting Bank Churn**

## **Introduction**

Customer churn refers to customers leaving the financial institution they have been working with. It has emerged as one of the major problems for financial institutions, including banks ([ukessays, 2016](http://www.ukessays.com/essays/marketing/customer-churn-management-in-banking-and-finance-marketing-essay.php)).

The code below proceeds in three phases:

1. Analyzing the data and feature engineering
2. Building a Machine Learning model
3. Building a demo API that detects bank churn from user inputs
      


**Key terms used in the code**
1. Model: An algorithm that learns from the data and makes predictions
2. Features: The input variables, usually stored as columns
3. Observations: The individual records, usually stored as rows
4. Label or Target: The column being predicted
5. Outlier: A value that lies far from the rest of the values in a feature
6. Feature engineering: Here, checking how features relate to the target and removing features that don't help the accuracy of our model

# **Analyzing data and feature engineering**

## Overview on the Dataset

1. The dataset is labelled (every observation comes with a known value of the target), thus we'll use **Supervised Learning models**
2. The target column classifies each customer as churned or not, thus we'll implement **Classification models**

### 1. Importing all Libraries


In [None]:
import numpy as np # For data manipulation
import pandas as pd # For data representation
import matplotlib.pyplot as plt # For basic visualization
import seaborn as sns  # For statistical visualization
from sklearn.model_selection import train_test_split # For splitting the data into training and testing
from sklearn.neighbors import KNeighborsClassifier # K neighbors classification model
from sklearn.naive_bayes import GaussianNB # Gaussian Naive bayes classification model
from sklearn.svm import SVC # Support Vector Classifier model
from sklearn.tree import DecisionTreeClassifier # Decision Tree Classifier model
from sklearn.linear_model import LogisticRegression # Logistic Regression model
from sklearn.ensemble import RandomForestClassifier # Random Forest Classifier model
from sklearn.metrics import accuracy_score # For checking the accuracy of the model

### 2. Importing the dataset and grasping basic insights 

In [None]:
# Importing the dataset
churn_dataset = pd.read_csv('../input/Churn_Modelling.csv')
# Visualizing first five elements in the dataset
churn_dataset.head()

In [None]:
# Checking basic information (rows, columns, missing values, datatypes of columns, etc) in our dataset
churn_dataset.info()

From the above cell, we conclude that there are no missing values in our dataset (since all features have 10000 non-null values). 

However, some features hold non-numerical data (pandas reports them with the `object` dtype). The models used below expect numeric inputs, so we'll need to either delete these features or **encode them as numbers**. Those features are:
1. Surname
2. Geography
3. Gender
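For reference, this is roughly what encoding such features could look like with one-hot encoding. It is only a sketch on a made-up toy DataFrame: the rows are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 35, 51, 29],
})

# One-hot encode the two categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, which is why this approach only suits features with a handful of distinct values.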

In [None]:
# Checking statistical information in our dataset
churn_dataset.describe()

**As for features with numerical datatypes, we analyze their statistical distributions** (count, mean, standard deviation, minimum, quartiles, maximum). 
This also helps us spot outliers (e.g. a maximum of 200 in the age column would clearly signal an outlier in our dataset).
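Reading the summary table by eye works for obvious cases. A more systematic check is the interquartile-range (IQR) rule, sketched below on made-up ages. The values are hypothetical, and the 1.5 multiplier is the common convention, not something taken from this notebook.

```python
import pandas as pd

# Hypothetical ages, with one implausible value planted on purpose
ages = pd.Series([25, 31, 38, 42, 47, 53, 60, 200], name="Age")

# Interquartile-range rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers.tolist())
```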

### 3. Feature Engineering

**a. Working with categorical features (Features with non-numerical datatypes)**

We'll assess whether these features are worth encoding by counting their unique values (as a rule of thumb for this notebook, we'll drop a feature if it has more than 3 meaningful unique values).

In [None]:
# Counting the unique values in each categorical feature
for col in churn_dataset.columns:  # Looping over all columns
    if churn_dataset[col].dtype == 'object':  # Text columns are stored as objects
        num_of_unique_cat = churn_dataset[col].nunique()  # Number of distinct values
        print("feature '{col_name}' has '{unique_cat}' unique categories".format(col_name=col, unique_cat=num_of_unique_cat))


* Since the Surname feature has more than 3 unique values, we'll **delete the feature**

In [None]:
# Deleting the Surname feature from the dataset
churn_dataset = churn_dataset.drop("Surname", axis=1)

In [None]:
# Creating a pivot table showing the churn rate (mean of Exited)
# for each gender and geographical region
visualization_1 = churn_dataset.pivot_table("Exited", index="Gender", columns="Geography")
visualization_1

From the table above, we can easily detect the following trends:
1. A higher proportion of females than males have exited the bank in every region represented in the dataset
2. Germany is the country with the highest churn rate
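A note on how to read such a table: `pivot_table` aggregates with the mean by default, and since "Exited" is a 0/1 column, each cell is the share of customers in that group who left. The sketch below shows this on made-up rows (hypothetical values, real column names) and cross-checks it against an explicit `groupby`.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "Geography": ["France", "France", "France", "Germany", "Germany", "Germany"],
    "Exited": [1, 0, 0, 1, 1, 0],
})

# pivot_table aggregates with the mean by default, so each cell is a churn rate
rates = toy.pivot_table("Exited", index="Gender", columns="Geography")

# The same numbers via an explicit groupby, to make the aggregation visible
check = toy.groupby(["Gender", "Geography"])["Exited"].mean().unstack()
print(rates)
```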

Although these features (Geography and Gender) might correlate with the target column, we drop them to keep the prediction model general (not limited to customers from Germany, France and Spain).

In [None]:
# Deleting gender and geography features from the dataset
churn_dataset = churn_dataset.drop(["Geography", "Gender"], axis=1)

**b. Working with numerical features**

We'll be analyzing the correlation between these features and the target.

Some features, such as RowNumber and CustomerId, are identifiers that carry no predictive information, thus they should also be removed from the dataset.

In [None]:
# Removing RowNumber and CustomerId features from the dataset
churn_dataset = churn_dataset.drop(["RowNumber", "CustomerId"], axis=1)

In [None]:
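# Computing pairwise correlations between all remaining (numerical) features
correlation = churn_dataset.corr()
# Plotting the correlation matrix as a heatmap
sns.heatmap(correlation, square=True, annot=False, cbar=True)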

The heatmap above depicts how the features correlate with one another (and with the target feature, "Exited").

Trends from the heatmap visualization:
1. Every remaining feature shows some correlation with the target, whether weak or strong (thus we are considering all of them for our model)
2. Age, Balance, NumOfProducts, IsActiveMember and CreditScore are the features with the most noticeable correlation.
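Colors on a heatmap can be hard to compare, so it may help to rank the features by the strength of their correlation with the target. The sketch below does this on synthetic stand-in data: the values are generated, the target is built to depend on age only, and the column name `EstimatedSalary` is assumed from the usual layout of this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; names mirror the dataset but values are generated
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=500)
salary = rng.normal(100_000, 30_000, size=500)
# Make the synthetic target depend on age only, so the ranking is predictable
exited = (age + rng.normal(0, 10, size=500) > 55).astype(int)
toy = pd.DataFrame({"Age": age, "EstimatedSalary": salary, "Exited": exited})

# Correlation of every feature with the target, strongest first by absolute value
target_corr = toy.corr()["Exited"].drop("Exited")
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```

On the real data, the same two lines applied to `churn_dataset` would give the ranking behind the heatmap.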

# Building a Machine Learning Model

## 1. Preparing the data 

a. Shuffle the data (to randomize all data)

b. Split feature data from the target (To easily differentiate what is being predicted from determinants)

c. Split feature data and target into training and testing sets (For validation accuracy)
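As a side note, `train_test_split` shuffles by default and can also keep the class ratio equal in both splits. The sketch below shows a reproducible, stratified variant on synthetic data. The 20% positive rate, the 25% test size and the seed are arbitrary choices for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 8 features, roughly 20% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.2).astype(int)

# shuffle=True is the default; stratify keeps the class ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Fixing `random_state` makes the split, and therefore the reported accuracy, repeatable between runs.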




In [None]:
# Shuffling the dataset
churn_dataset = churn_dataset.reindex(np.random.permutation(churn_dataset.index))

In [None]:
# Splitting feature data from the target
data = churn_dataset.drop("Exited", axis=1)
target = churn_dataset["Exited"]

In [None]:
# Splitting feature data and target into training and testing
X_train, X_test, y_train, y_test = train_test_split(data, target)

## 2. Choosing the best classification model to use

In [None]:
# Creating a python list containing all candidate models
models = [GaussianNB(), KNeighborsClassifier(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(n_estimators=5, random_state=0), LogisticRegression()]
model_names = ["Gaussian Naive bayes", "K-nearest neighbors", "Support vector classifier", "Decision tree classifier", "Random Forest", "Logistic Regression"]
# Fitting each model on the training set and scoring it on the test set
for name, clf in zip(model_names, models):
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)*100  # accuracy_score expects (y_true, y_pred)
    print(name, ":", accuracy, "%")

From the above, we can see that the Random Forest Classifier is the model with the highest accuracy, thus it is the one we are going to use.
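A single train/test split can favor one model by chance, so a comparison like this is often repeated with cross-validation. The sketch below runs on synthetic data from `make_classification`, not the churn dataset. The two candidates, the fold count and all hyperparameters are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn data (8 features, imbalanced classes)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# 5-fold cross-validation: each model is scored on five different held-out folds
cv_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and the spread across folds gives a steadier basis for picking a model than one accuracy number.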

In [None]:
# Working with the selected model
model = RandomForestClassifier(n_estimators=100, random_state=0)
y_pred = model.fit(X_train, y_train).predict(X_test)
print("Our accuracy is:", accuracy_score(y_test, y_pred)*100, "%")

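An accuracy of about 86% on the held-out test set suggests that our model predicts unseen customers reasonably well, although accuracy alone can look better than it is when one class is much more common than the other.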
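To see why accuracy is worth reading alongside other metrics, consider the small sketch below. The labels are made up (8 customers who stayed, 2 who churned, not the dataset's real class balance), and the "model" simply predicts that everybody stays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 customers stayed (0), 2 churned (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A trivial "model" that always predicts "stayed"
y_always_stay = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_stay)   # high despite catching no churners
rec = recall_score(y_true, y_always_stay)     # share of churners actually caught
cm = confusion_matrix(y_true, y_always_stay)  # rows: true class, columns: predicted
print(acc, rec)
print(cm)
```

Here the accuracy is 80% while the recall on churners is zero, which is why a confusion matrix or recall score is a useful companion to the accuracy reported above.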

# A Demo-API detecting Bank Churn from user inputs


### To run the following demo, fork the notebook and uncomment the code


## The following are inputs:

1. Credit score of the client
2. Age of the client
3. Tenure of the client
4. Current balance in the bank account of the client
5. Number of products the client uses
6. Does the client have a credit card
7. Is the client an active member
8. Estimated salary of the client

In [None]:
#print("Enter the credit score of the client \n")
#credit_score = int(input())
#print("Enter the age of the client \n")
#age = int(input())
#print("Enter the tenure of the client \n")
#tenure = int(input())
#print("Enter the current balance of the client \n")
#balance = float(input())
#print("Enter the number of products the client uses \n")
#product_no = int(input())
#print("Enter 1 if the client has a credit card or 0 if not \n")
#credit_card = int(input())
#print("Enter 1 if the client is an active member or 0 if not \n")
#active_member = int(input())
#print("Enter the estimated salary of the client \n")
#salary = float(input())

# The inputs must follow the same order as the training columns
#X_user = np.array([credit_score, age, tenure, balance, product_no, credit_card, active_member, salary])

# The target "Exited" is 1 for clients who left the bank and 0 for those who stayed
#y_pred = model.predict([X_user])
#index = y_pred[0]
#if index == 1:
#    print("\n Client is on the threshold of exiting the bank")
#    print("\n Consider taking further steps to incentivise the client")
#elif index == 0:
#    print("\n Client is not exiting the bank")
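The same prediction step could also be wrapped in a small function, which is easier to reuse and test than interactive prompts. The sketch below is hypothetical: `predict_churn` and `FEATURES` are names introduced here, the column names `Tenure`, `HasCrCard` and `EstimatedSalary` are assumed from the usual layout of this dataset, and the classifier is trained on synthetic data only so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed training column order (hypothetical constant, not defined in the notebook)
FEATURES = ["CreditScore", "Age", "Tenure", "Balance",
            "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"]

def predict_churn(clf, client):
    """Return (label, probability of churn) for one client given as a dict."""
    # Build the row in the same column order used during training
    row = np.array([[client[name] for name in FEATURES]], dtype=float)
    label = int(clf.predict(row)[0])
    proba = float(clf.predict_proba(row)[0, 1])  # column 1 is the "exited" class
    return label, proba

# Stand-in classifier trained on synthetic data, only so the sketch runs on its own
X, y = make_classification(n_samples=300, n_features=len(FEATURES), random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Hypothetical client values
client = {"CreditScore": 650, "Age": 40, "Tenure": 3, "Balance": 60000.0,
          "NumOfProducts": 2, "HasCrCard": 1, "IsActiveMember": 1,
          "EstimatedSalary": 50000.0}
label, proba = predict_churn(demo_model, client)
print(label, round(proba, 2))
```

In the notebook, `demo_model` would be replaced by the trained `model`. Returning the probability as well as the label lets the bank rank clients by risk instead of relying on a yes/no answer.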


# Conclusion and Recommendation

In conclusion, we have been able to draw insights from our churn dataset and predict clientele dynamics with the Random Forest Classifier model at about 86% accuracy.  

However, this accuracy could be improved by collecting more relevant data, with more informative features, from a wider range of individuals. Additionally, I believe a deeper understanding of the banking domain would help in choosing which data to use in an improved model. 