
---

<img src="https://petlja.org/images/img/python.png"
     alt="Description"
     width="250"
     style="float: right; margin-right: 10px;" />

Churn Modeling
==============

***Predicting service cancellation from bank customer data (customer churn)***

**Author:** *Žarko Milošev*

---

# Table of Contents:

1. [Data Preparation](#1.-Data-Preparation)
    - [Imports](#Imports)
    - [Loading data into a pandas DataFrame](#Loading)
    - [Data Cleaning](#Cleaning)
<br>
2. [EDA - Exploratory Data Analysis](#2.-EDA)
    - [General overview of data Distribution](#Distribution)
    - [Data Visualisation](#Visualisation)
    - [Explore Categorical Columns](#Categorical-Features)
    - [Explore Numerical Features - Discrete values](#Discrete-Features)
    - [Outlier Detection](#Outliers)
    - [Histograms](#Histograms)
    - [Correlation Matrix](#Correlations)
<br>
3. [Data Preprocessing](#3.-Preprocessing)
    - [Removal of data with low information value](#Removal-of-data-with-low-information-value)
    - [Outlier handling](#Outlier-removal)
    - [Handling Imbalanced Data - Oversampling](#Oversampling)
    - [Handling Imbalanced Data - Undersampling](#Undersampling)
    - [Encoding Categorical Features](#Encoding)
    - [Feature Scaling - data normalisation](#Scaling)
    - [Splitting into train and test](#Train-test-split)
<br>
4. [Data Modeling](#4.-Modeling)
    - [Optimal model parameters - Cross Validation](#Optimal-model-parameters)
    - [Model evaluation OOB](#OOB)
    - [Feature optimisation](#Feature-optimisation)
    - [Training the Model](#Training)
    - [Results](#Results)
5. [Conclusion](#5.-Conclusion)



---

[<< Project topic](#Churn-Modeling) | [Content Table](#Table-of-Contents:) | [Next Chapter >>](#1.-Data-Preparation)

---

# **1. Data Preparation**  
*Since the data set is available directly from the site, no ETL work is needed. We can proceed straight to importing the modules required to work with the data, and then to loading the data itself.*

---

[<< Previous Chapter](#Table-of-Contents:) | [Content Table](#Table-of-Contents:) | [Next Chapter >>](#2.-EDA)

---

## **Imports** 
*The first thing to do is to import all of the modules needed for this kernel.*

In [None]:
#NumPy and Pandas DataFrame object type
import numpy as np
import pandas as pd
#Options for Pandas DF
pd.options.display.max_rows = None
pd.options.display.max_columns = None
#Plotting libs for visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#Preprocessing libs
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
#Chosen models from the scikit-learn library
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
#All of the metrics used in this example
from sklearn.metrics import (accuracy_score, precision_score, average_precision_score, 
                             recall_score, roc_auc_score, roc_curve, 
                             classification_report, f1_score, confusion_matrix)
#Libs for testing optimal model parameters 
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from collections import OrderedDict
from sklearn.datasets import make_classification

## **Loading**  
*In this step the data is loaded and a small portion is displayed in the table*

In [None]:
df = pd.read_csv('/kaggle/input/churn-modelling/Churn_Modelling.csv')
#Only the last expression of a cell is displayed, so the shape is printed explicitly
print(df.shape)
df.head(10)

## **Cleaning**  
*Checking for missing values with further examination of the data set*

In [None]:
df.info()

# **2. EDA**  
*It is important to examine every feature (column) to get a full picture of the data distribution across the set. In the previous output of the *`.info()`* command, several columns are flagged as object type, meaning they contain categorical data (in this case strings). The other columns hold numerical values, either float or integer.*

---

[<< Previous Chapter](#1.-Data-Preparation) | [Content Table](#Table-of-Contents:) | [Next Chapter >>](#3.-Preprocessing)

---

## **Distribution**  
*Examining the distribution of the data across all of the columns*

In [None]:
#statistical overview of numerical features (columns)
df.describe()

In [None]:
#Number of unique values in each column
df.nunique()

In [None]:
#Mean of every numerical feature, grouped by IsActiveMember (0 and 1)
#numeric_only=True skips the string columns, which cannot be averaged
df.groupby(df['IsActiveMember']).mean(numeric_only=True).head()

## **Visualisation**  
*Visualising the data gives better insight into the distribution*

In [None]:
#The distribution of Exited column
labels = 'Churn', 'Active'
sizes = [df.Exited[df['Exited']==1].count(), df.Exited[df['Exited']==0].count()]
explode = (0, 0.05)
fig1, ax1 = plt.subplots(figsize=(4, 4))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.2f%%',textprops={'fontsize': 24})
ax1.axis('equal')
plt.show()

## **Categorical Features**

In [None]:
#Distribution of churn (Exited) over categorical features
fig, axarr = plt.subplots(2, 2, figsize=(16, 16))
sns.countplot(x='Geography', hue = 'Exited',data = df, ax=axarr[0][0])
sns.countplot(x='Gender', hue = 'Exited',data = df, ax=axarr[0][1])
sns.countplot(x='HasCrCard', hue = 'Exited',data = df, ax=axarr[1][0])
sns.countplot(x='IsActiveMember', hue = 'Exited',data = df, ax=axarr[1][1])

## **Discrete Features**

In [None]:
# Distribution of churn (Exited) over numerical features
fig, axarr = plt.subplots(3, 2, figsize=(16, 24))
sns.boxplot(y='CreditScore',x = 'Exited', hue = 'Exited',data = df, ax=axarr[0][0])
sns.boxplot(y='Age',x = 'Exited', hue = 'Exited',data = df , ax=axarr[0][1])
sns.boxplot(y='Tenure',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][0])
sns.boxplot(y='Balance',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][1])
sns.boxplot(y='NumOfProducts',x = 'Exited', hue = 'Exited',data = df, ax=axarr[2][0])
sns.boxplot(y='EstimatedSalary',x = 'Exited', hue = 'Exited',data = df, ax=axarr[2][1])

## **Outliers**  
*Discovering outliers is an important part of EDA. They must be flagged for the preprocessing tasks, or eliminated using other methods*
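
One common way to flag outlier candidates numerically is the 1.5 × IQR rule, the same convention box plots use by default for their whiskers. The following is only a minimal sketch on made-up ages, not the thresholds applied later in this notebook. The helper `iqr_outliers` and the sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical ages, for illustration only
ages = [22, 25, 29, 31, 35, 38, 40, 42, 45, 92]
outliers = iqr_outliers(ages)
print(outliers)
```

A rule like this only proposes candidates. Whether a flagged value is really an error or just a rare customer still has to be judged from the plots.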

In [None]:
#The distribution of Churn across Tenure and NumOfProducts
fig, axarr = plt.subplots(1,2, figsize=(16, 8))
sns.countplot(x='Tenure', hue = 'Exited',data = df, ax=axarr[0])
sns.countplot(x='NumOfProducts', hue = 'Exited',data = df, ax=axarr[1])

In [None]:
#Possible outliers among customers that have 3 or 4 products, displaying the number of customers per product count
df.NumOfProducts.value_counts()

In [None]:
#Examining the customer age distribution, since the box plots of the numerical values showed indications of outliers
df.Age.value_counts().plot(kind='bar',figsize=(16,4))
df.Age.value_counts().tail(5)

In [None]:
#Mean churn rate for each age
churn_age = df[['Exited']].groupby(df['Age']).mean()
churn_age.plot(kind='bar',figsize=(20,8))

## **Histograms**  

In [None]:
#Using histograms to examine the selected distributions for Age, Credit Score, Balance and EstimatedSalary
fig, axa = plt.subplots(2, 2, figsize=(16, 16))
df.CreditScore.hist(bins=500, ax = axa[0][0])
df.Age.hist(bins=75, ax = axa[0][1])
df[(df.Balance!=0)].Balance.hist(bins=225,figsize=(24,16), ax = axa[1][0])
df.EstimatedSalary.hist(bins=300,figsize=(24,16), ax = axa[1][1])

## **Correlations**

In [None]:
# Correlation matrix (numeric columns only, since string columns cannot be correlated)
plt.subplots(figsize=(14,12))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap="tab20c")
plt.show()

In [None]:
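#Examining the correlation of every numeric feature with the target (Exited)
numeric_df = df.select_dtypes(include='number')
numeric_df.drop(["Exited"], axis = 1).corrwith(numeric_df['Exited']).plot.bar(figsize = (16, 8), title = "Correlation with Exited bool", fontsize = 20, rot = 45, grid = True)
numeric_df.corrwith(numeric_df['Exited'])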

# **3. Preprocessing** 
*During preprocessing the data is modified so that, in the end, it can be processed by the chosen model's algorithm. Given the methods chosen, all of the data needs to be numerical, meaning that all categorical values must be translated into numbers. Any outliers caught in the previous [EDA chapter](#2.-EDA) are eliminated as well.*

---

[<< Previous Chapter](#2.-EDA) | [Content Table](#Table-of-Contents:) | [Next Chapter >>](#4.-Modeling)

---

In [None]:
#Taking a fresh copy to work with
data = df.copy()

## **Removal of data with low information value**  
*Some of the columns carry no useful information for the model, for example the IDs and row numbers, as well as the customer names*

In [None]:
# removal of non related data
data = data.drop(["RowNumber", "CustomerId", "Surname"], axis = 1)
#result
data.head()

## **Outlier removal**
*Removing previously detected outliers*

In [None]:
#outlier removal
data = data[(data.CreditScore>400)]
data = data[(data.Age < 78)]
data = data[(data.NumOfProducts<3)]
#result
data.info()

## **Encoding**  
*Gender and Geography must be transformed to numerical values so that the Random Forest algorithm can process the data as expected*

In [None]:
#transforming text values to numerical 
dataLebelEncoded = data.copy()
# Label encoder, used to encode Female/Male as 0/1
le= LabelEncoder()
dataLebelEncoded['Gender']= le.fit_transform(dataLebelEncoded['Gender'])
dataLebelEncoded.head()
#dataLebelEncoded.info()
#dataLebelEncoded.hist(bins = 100, figsize=(24,24))

In [None]:
#encoding the Geography values...
dataEncoded = dataLebelEncoded.copy()
dataEncoded = pd.get_dummies(dataEncoded, columns=['Geography'])
dataEncoded.rename(columns={"Geography_France":"France",
                   "Geography_Germany":"Germany",
                   "Geography_Spain":"Spain"}, inplace=True)
dataEncoded.head()
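
To make the two encodings concrete, here is a plain-Python sketch of the idea behind them, using a few made-up rows. It only illustrates the concept: `LabelEncoder` numbers the sorted unique labels, and `pd.get_dummies` creates one indicator column per category. The helper names `label_encode` and `one_hot` and the sample rows are hypothetical.

```python
def label_encode(values):
    """Map each label to its index among the sorted unique labels."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

def one_hot(values):
    """Build one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {cat: [int(v == cat) for v in values] for cat in categories}

# Made-up rows, for illustration only
genders = ["Female", "Male", "Male", "Female"]
countries = ["France", "Spain", "Germany", "France"]

gender_codes, gender_mapping = label_encode(genders)
country_columns = one_hot(countries)

print(gender_mapping)             # {'Female': 0, 'Male': 1}
print(gender_codes)               # [0, 1, 1, 0]
print(country_columns["France"])  # [1, 0, 0, 1]
```

One indicator column per country is a reasonable choice here because the three countries have no natural order, while a single integer code would suggest one.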

## **Scaling**  
*Note that only some of the numerical features undergo normalisation. Compare the results to the *[Histograms](#Histograms)* and notice that the values now range from 0 to 1.*

In [None]:
#Value normalisation
scaler = MinMaxScaler() 
normalizacija = ["CreditScore", "Age", "Balance",'EstimatedSalary']
dataScaled = dataEncoded.copy()  #explicit copy, so that dataEncoded keeps the unscaled values
dataScaled[normalizacija] = scaler.fit_transform(dataEncoded[normalizacija])
dataScaled.head()
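
Min-max scaling maps each column linearly onto the range 0 to 1 using x' = (x - min) / (max - min). A minimal sketch of that formula on made-up credit scores follows. The function name and the values are illustrative and not taken from the data set.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up credit scores, for illustration only
credit_scores = [350, 475, 600, 850]
scaled = min_max_scale(credit_scores)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Because the minimum and maximum define the whole mapping, a single extreme value compresses all other values into a narrow band, which is one reason to deal with outliers before scaling.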

In [None]:
# visualizing the distribution after normalization
fig, axa = plt.subplots(2, 2, figsize=(16, 16))
dataScaled.CreditScore.hist(bins=400, ax = axa[0][0])
dataScaled.Age.hist(bins=60, ax = axa[0][1])
dataScaled[(dataScaled.Balance!=0)].Balance.hist(bins=225,figsize=(24,16), ax = axa[1][0])
dataScaled.EstimatedSalary.hist(bins=300,figsize=(24,16), ax = axa[1][1])

In [None]:
#Let's see what we have so far
dataScaled.describe()

## **Train test split**  
*The main objective is to prepare a training set and a test set.*

In [None]:
#Separating the target y (Exited, i.e. churned customers) from the features X
X = dataScaled.drop("Exited", axis=1)
y = dataScaled["Exited"]
X.shape, y.shape

In [None]:
# Splitting the set into a test part and a training part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
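
With an imbalanced target, a purely random split can leave the training and test sets with slightly different churn shares. `train_test_split` accepts `stratify=y` to keep the class proportions equal and `random_state` to make the split repeatable. The plain-Python sketch below shows the idea of a stratified split on made-up labels. The helper `stratified_split` is hypothetical and is not how scikit-learn implements it.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Split row indices so that every class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up target: 80 retained customers (0) and 20 churned customers (1)
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
test_churn_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_churn_share)  # 80 20 0.2
```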

# **4. Modeling**  
*In the modeling phase various classifiers are tested in order to find the best and most consistent model* 

---

[<< Previous Chapter](#3.-Preprocessing) | [Content Table](#Table-of-Contents:) | [Conclusion >>](#5.-Conclusion)

---

## **Optimal model parameters**  
*It is important to run heavy tasks such as grid-search cross-validation on as many processor cores as possible, since the folds can be computed in parallel. Notice the *`n_jobs=-1`* parameter in the *`gsCV = GridSearchCV(gsCV_model, tuned_parameters,n_jobs=-1, verbose=1)`* command, which makes use of all available CPU cores.*

In [None]:
#Taking a smaller sample
sampleData = dataScaled.sample(frac=0.1, replace=True)

#splitting the data into y and X
X_sample=sampleData.drop(['Exited'], axis=1)
y_sample=sampleData['Exited']

X_sample.shape, y_sample.shape

In [None]:

gsCV_model = RandomForestClassifier(oob_score=True)
tuned_parameters = {'max_depth': [10, 20, 30, 50],
                    'min_samples_leaf': [1, 2, 3, 5],
                    'min_samples_split': [2, 3, 4, 5, 6, 8],
                    'n_estimators': [100]}  #no trailing comma here, it would wrap the dict in a tuple
gsCV = GridSearchCV(gsCV_model, tuned_parameters,n_jobs=-1, verbose=1)  
gsCV.fit(X_sample,y_sample)
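
To get a feeling for how much work this search is, the number of fits can be counted up front. The sketch below enumerates the same grid in plain Python. The fold count of 5 is an assumption: it is the default in recent scikit-learn versions when `cv` is not passed.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "max_depth": [10, 20, 30, 50],
    "min_samples_leaf": [1, 2, 3, 5],
    "min_samples_split": [2, 3, 4, 5, 6, 8],
    "n_estimators": [100],
}
cv_folds = 5  # assumption: scikit-learn's default when cv is not passed

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 96 480
```

Each of these fits trains a forest of 100 trees, which is why the search is run on a 10% sample and spread over all cores.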

In [None]:
#The results:
print('Best Score: {:.3f} \n'.format(gsCV.best_score_ ))
print('Parameters used:')
print(gsCV.best_params_)

*Note that you can shuffle the input data and use different ranges for the grid search. The best result was: *`{'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}`*   
so these are going to be used for further testing.*

In [None]:
ensemble_clfs = [("RandomForestClassifier, max_depth='30', min_samples_leaf='1', min_samples_split='2'", 
                   RandomForestClassifier (max_depth=30,   min_samples_leaf=1,   min_samples_split=2, oob_score=True))]
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
#The range in which we are searching for the optimal number of estimators
min_estimators = 20
max_estimators = 800
#Change the step if needed (here it is 10). The smaller the step, the more fits there are to calculate, so the operation takes more time
for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1,10):
        clf.set_params(n_estimators=i)
        clf.fit(X_sample, y_sample)
        #Recording the OOB error rate for every `n_estimators=i`
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

## **OOB**  
*Calculating the optimal number of estimators*

In [None]:
#Visualising the plot for "OOB error rate" vs. "n_estimators" 
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show()
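
Besides reading the curve by eye, the setting with the lowest recorded OOB error can be picked programmatically. A minimal sketch follows. The pairs are made-up values in the same `(n_estimators, oob_error)` shape that `error_rate[label]` holds, not results from this notebook.

```python
# Hypothetical (n_estimators, oob_error) pairs, same shape as error_rate[label]
clf_err = [(20, 0.190), (30, 0.176), (40, 0.171), (50, 0.173), (60, 0.172)]

# The pair with the smallest OOB error
best_n, best_err = min(clf_err, key=lambda pair: pair[1])
print(best_n, best_err)  # 40 0.171
```

On a small sample the OOB error fluctuates from run to run, so the minimum is better read as a rough guide to where the curve flattens than as an exact optimum.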

## **Feature optimisation**  
*Cross validation of features*

In [None]:
#Recursive feature elimination, scored by accuracy (the share of correct classifications)
clf_rf_4 = RandomForestClassifier(n_estimators = 290) 
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation
#fitting the classifier
rfecv = rfecv.fit(X_train, y_train.values.ravel())

print('Optimal number of features: ', rfecv.n_features_)
print('Best features', X_train.columns[rfecv.support_])

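#Plot features VS cross-validation scores
#grid_scores_ was removed in scikit-learn 1.2; newer versions expose cv_results_ instead
cv_scores = rfecv.cv_results_['mean_test_score'] if hasattr(rfecv, 'cv_results_') else rfecv.grid_scores_
plt.figure()
plt.xlabel("Number of features")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(cv_scores) + 1), cv_scores)
plt.show()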

clf_rf = RandomForestClassifier(n_estimators = 100)      
clr_rf = clf_rf.fit(X_train,y_train.values.ravel())
#printing the score 
y_pred=clf_rf.predict(X_test)
ac = accuracy_score(y_test,y_pred)
print('Accuracy: ',ac)

## **Training**  
*Some of the fits (the ones with higher parameter value) *

In [None]:
#Setting the parameters
RFC = RandomForestClassifier(bootstrap=True, 
                             class_weight=None,
                             criterion='gini', 
                             max_depth=30,
                             max_features=10,
                             min_samples_leaf=1,
                             min_samples_split=3,
                             n_estimators=200,
                             oob_score=True, 
                             random_state=743,
                             verbose=0, 
                             warm_start=False,
                             n_jobs=-1)

#Training the model
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)

## **Results**  
*The classification report and confusion matrix are presented to give better insight into what exactly is predicted*

In [None]:
#Report
print('Classification report for the Random Forest classifier: \n')
print(classification_report(y_test, y_pred))

#Visual representation
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion_matrix(y_test, y_pred))
plt.title('Confusion Matrix')
fig.colorbar(cax)
plt.xlabel('Predicted')
plt.ylabel('Real')
plt.show()
print(confusion_matrix(y_test, y_pred))

***Note: these figures can differ slightly depending on the random seeds*** 

---
  
During testing there were 1602 samples of class 0 and 325 samples of class 1 in total.  

Predicted as class 0: 1724, of which 1526 are correct and 198 are incorrect, so the precision for class 0 is 89%. Comparing the 1526 correctly classified samples with the 1602 actual class 0 samples gives a recall of 95%. However, the best score on the scoreboard does not necessarily answer the business question we started with:  **Which customers are about to cancel their subscription?**  
<br>
Predicted as class 1: 203, of which 127 are correct and 76 are incorrect, so the precision for class 1 is 63%. Comparing the 127 correctly classified samples with the 325 actual class 1 samples gives a recall of 39%. So the answer to the question would be: **At this time the model can detect 39% of the customers that are about to cancel their subscription, and 63% of the customers it flags really are about to cancel, meaning that the remaining 37% were not about to cancel anyway**  
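
The percentages above can be recomputed directly from the four confusion matrix counts. In standard terminology, the share of correct samples within a predicted class is the precision, and the share of an actual class that was found is the recall. A short sketch of the arithmetic, using only the counts quoted above:

```python
# Counts quoted in the text above
tn, fp = 1526, 76    # actual class 0: predicted as 0 / predicted as 1
fn, tp = 198, 127    # actual class 1: predicted as 0 / predicted as 1

precision_0 = tn / (tn + fn)   # 1526 / 1724
recall_0 = tn / (tn + fp)      # 1526 / 1602
precision_1 = tp / (tp + fp)   # 127 / 203
recall_1 = tp / (tp + fn)      # 127 / 325
accuracy = (tn + tp) / (tn + fp + fn + tp)

for name, value in [("precision class 0", precision_0), ("recall class 0", recall_0),
                    ("precision class 1", precision_1), ("recall class 1", recall_1),
                    ("overall accuracy", accuracy)]:
    print(f"{name}: {value:.0%}")
```

The overall accuracy looks comfortable only because class 0 dominates the test set. For the churn question, the recall of class 1 is the figure to watch.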

*A retest with other ensemble methods is recommended*

*Looking at the overall results, there is a noticeable bias toward the majority class, in our case *`Exited[0]`*. The class imbalance is an issue that can be addressed using either undersampling or oversampling.*
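
As an illustration of the oversampling idea, the sketch below duplicates randomly chosen minority rows until both classes are equally frequent. It is a plain-Python sketch on made-up rows with a hypothetical helper `random_oversample`. On DataFrames, `sklearn.utils.resample` (imported at the top of the notebook) can draw the extra minority rows in a similar way.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label, majority_n = counts.most_common(1)[0]
    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        if label == majority_label:
            continue
        pool = [r for r, l in zip(rows, labels) if l == label]
        extra = rng.choices(pool, k=majority_n - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Made-up training set: 8 retained customers (0) and 2 churned customers (1)
rows = [f"customer_{i}" for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling of this kind is normally applied to the training set only, so that the test set keeps the real class distribution and the reported scores stay honest.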

# **5. Conclusion**  
*In conclusion:*  

*The challenge here was to predict which customers will cancel their subscription, *`Exited[1]`*. RFC caught 127 out of a total of 325, so the recall here is quite low; however, the number of false positives is not so high (76). This means that 37% of the customers flagged as exiting are incorrectly classified. Considering that the recall on *`class [0]`* is 95%, class balancing should improve our 39% hit rate at least to some extent.*



---

[<< Previous Chapter](#4.-Modeling) | [Content Table](#Table-of-Contents:) | [Conclusion >>](#5.-Conclusion)

---