# Customer Churn Prediction using XGBoost and ANN

![](https://miro.medium.com/max/762/1*X-oZNRw5Pnef-kR9CgLx1g.png)

*(Awesome, evocative picture. Courtesy of <a href="https://medium.com/@superalbert">Super Albert</a>,  on <a href="https://medium.com/diogo-menezes-borges/predicting-banks-churn-with-artificial-neural-networks-f48393fb1f9c">Medium</a>)*

## Word Before 

Briefly, this is my first official Notebook posted on Kaggle. Although I've been a user of this platform for 2 years now, searching for data, reading notebooks, browsing through users and competitions, I've never participated myself. But this platform has helped me tremendously during my studies and has been a fascinating digital refuge for me, in search of the amazing things we humans have learned to do. That is why I also wish to start contributing more to this industry and its community. I will try to make my submissions as thorough as possible, so that others may learn from me, as I have learned from others. Also, without wanting to seem unprofessional, I would like to keep an easy-going, reader-friendly language and let you be the judges of that.


## Objectives, tasks and questions

As per this <a href="https://en.wikipedia.org/wiki/Churn_rate">Wikipedia Article</a>, churn "is a measure of the number of individuals or items moving out of a collective group over a specific period." The general purpose or end-goal of this notebook is to predict if a bank's customer will churn, by making use of Supervised Machine Learning methods. I chose this dataset because I believe it to be an everyday industrial use of these technologies and techniques. Obviously, the dataset is very basic and real corporations from the financial sector probably have a much more thorough and extensive 'case file' on their clientele. But this data will do just fine. Below, I am going to enumerate the steps I plan to follow.


### Step 1: Data Mining
- Check database for duplicates and missing values.
- Check data types.
- Check dispersion of data
- Check Pearson's Correlation Matrix 
- Detect outliers 

### Step 2: Data analysis 
- What is the range of values for credit scores? 
- Where do our customers come from? 
- What is the gender distribution?
- Our newest and oldest customers?
- Active / inactive distribution. 
- Salary in relation to the balance. 
- Plot the results! 
- Come up with interesting numbers?
- What are the actionable insights?

### Step 3: Machine Learning:
- Do we have enough data? Is there some way to use data augmentation?
- Is this a Regression or a Classification problem?
- Candidates for y: Exited (binary classification) & Credit Score (regression) & Gender (binary classification)
- Candidates for algorithms: XGBoost & ANN.


# I. Importing libraries

Here we have quite a lot of imports to take in, but they are (I hope) neatly organized.

In [None]:
# Basics 
import pandas as pd 
import numpy as np 
# Data Viz 
import matplotlib.pyplot as plt 
import seaborn as sns 
# Miscellaneous
import warnings 
import os 
import time 
# Machine Learning
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import confusion_matrix 
from imblearn.over_sampling import SMOTE
import xgboost as xgb
# Neural Networks 
import tensorflow as tf
print("Tensorflow Version:",tf.__version__)
from tensorflow.keras.layers import Input, Dense, Dropout, LeakyReLU
from tensorflow.keras.models import Model
# Ignore pesky warnings (the warnings module is already imported above)
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 
warnings.filterwarnings("ignore", category=UserWarning) 
%matplotlib inline 

# II. Helper Functions

Next, I am going to write a series of functions that will help along the way to plot some graphs or fit and evaluate some models.

In [None]:
##### EDA Helper Functions 

def correlation_matrix(df):
    '''
    This function plots a Pearson's Correlation Matrix as a heatmap.
    '''
    # Pearson correlation is only defined for numeric columns
    corr = df.select_dtypes(include="number").corr()
    f, ax = plt.subplots(figsize=(10, 10))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    sns.heatmap(corr, mask=None, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})

##### DATA PREPARATION HELPER FUNCTIONS #####

def make_dummy(df, feature):
    '''
    This function makes dummy variables out of a data frame's feature.
    '''
    dummy = pd.get_dummies(df[feature])
    return dummy

##### MACHINE LEARNING HELPER FUNCTIONS #####

def classifier(model, X_train, y_train, X_test, y_test, printOut=False):
    '''
    This function trains a model and evaluates it with classification performance metrics.
    It returns the accuracy on the test set (a fraction between 0 and 1).
    '''
    from sklearn.metrics import accuracy_score

    # Fit / train
    model.fit(X_train, y_train)

    # Predict
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    accuracy_score_train = accuracy_score(y_train, y_pred_train)
    accuracy_score_test = accuracy_score(y_test, y_pred_test)

    # Printing results (accuracy_score returns a fraction, so format it as a percentage)
    if printOut:
        print("TRAIN Results:\n")
        print(f"Accuracy: {accuracy_score_train:.2%}")
        print("\n")
        print("TEST Results:\n")
        print(f"Accuracy: {accuracy_score_test:.2%}")
        print("\n")

    # , precision_score_test, recall_score_test, f1_score_test, roc_auc_test
    return accuracy_score_test

def map_feature_importances(clf, X, top=20):
    '''
    This function plots the `top` most important features of a RF or XGB model.
    '''
    feat_importances = pd.Series(clf.feature_importances_, index=X.columns.values)
    feat_importances.nlargest(top).sort_values().plot(kind='barh', color='darkgrey', figsize=(20,10))
        
        
def plot_xgb(model,X_train,X_test,y_train,y_test):
    '''
    This function prints out the booster's Log Loss and Classification Errors over time. 
    '''
    from sklearn.metrics import accuracy_score
    from matplotlib import pyplot
    eval_set = [(X_train, y_train), (X_test, y_test)]
    # make predictions for test data
    y_pred = model.predict(X_test)
    predictions = [round(value) for value in y_pred]

    # evaluate predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

    # retrieve performance metrics
    results = model.evals_result()
    epochs = len(results['validation_0']['error'])
    x_axis = range(0, epochs)

    # plot log loss
    fig, ax = pyplot.subplots(figsize=(8,8))
    ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
    ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
    ax.legend()
    pyplot.ylabel('Log Loss')
    pyplot.title('XGBoost Log Loss')
    pyplot.show()

    # plot classification error
    fig, ax = pyplot.subplots(figsize=(8,8))
    ax.plot(x_axis, results['validation_0']['error'], label='Train')
    ax.plot(x_axis, results['validation_1']['error'], label='Test')
    ax.legend()
    pyplot.ylabel('Classification Error')
    pyplot.title('XGBoost Classification Error')
    pyplot.show()
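
The `classifier` helper above only reports accuracy, while its commented-out return line hints at precision, recall, F1 and ROC AUC. Below is a minimal sketch of how those could be collected with scikit-learn. The helper name `classification_report_dict` and the toy labels are my own assumptions for illustration; they are not used anywhere else in this notebook.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def classification_report_dict(y_true, y_pred, y_score=None):
    """Hypothetical helper: collect common binary-classification metrics."""
    report = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    # ROC AUC needs scores or probabilities, not hard 0/1 predictions
    if y_score is not None:
        report["roc_auc"] = roc_auc_score(y_true, y_score)
    return report

# Made-up labels, predictions and predicted probabilities
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 0]
toy_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]

report = classification_report_dict(toy_true, toy_pred, toy_score)
print(report)
```

On imbalanced classes, precision and recall for the minority class usually say more than accuracy alone, which is why I would consider returning them as well.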

# III. Data Mining

In [None]:
# Dealing with data frame 
df = pd.read_csv("../input/churn-for-bank-customers/churn.csv", index_col=False)
print("The dataset has ", df.shape[0], " data points and ", df.shape[1], " features.\n\n")

In [None]:
df.head()

The dataset has no missing values. The statistics look all right. The data types are as expected for such features. I consider these features to be sufficient. Feature Engineering could be helpful, but nothing useful seems to come of the present ones, as demonstrated in <a href="https://www.kaggle.com/ahalimzdemir/churn-project">this notebook</a>, where the newly created features ended up at the bottom of the models' importance charts.

### Continuous variables
- Credit Score: MAX is 850; MIN is 350; AVG is 650; STD is 96.7 points.
- Age: MAX is 92 yrs; MIN is 18 yrs; AVG is 38 yrs; STD is 10.5 yrs.
- Tenure: MAX is 10 years; MIN is 0; AVG is 5; STD is 2.89.
- Balance: MAX is $250,898; MIN is $0; AVG is $77,000; STD is $62,000.
- Num of Products: MAX is 4; MIN is 1; AVG is 1.53.
- Estimated Salary: MAX is 199,992; MIN is 11; AVG is 100,000; STD is 57,000. (Quite high variance. Must further inspect this.)


In [None]:
# I've commented the following lines out, but checking the dataset for missing values and data types is a must.
#df.isna().sum()
#df.dtypes
df.describe() # Finally, this prints out a lot of statistics. 
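
For reference, here is a small sketch of the Step 1 checks (duplicates, missing values, data types) on a toy frame. The frame `toy` and its values are made up for illustration; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the churn data (values are made up)
toy = pd.DataFrame({
    "CustomerId": [1, 2, 2, 4],
    "CreditScore": [619, 608, 608, None],
    "Geography": ["France", "Spain", "Spain", "Germany"],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
missing_per_column = toy.isna().sum()       # number of missing values per column
column_types = toy.dtypes                   # data type of each column

print("duplicates:", n_duplicates)
print(missing_per_column)
print(column_types)
```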

We can even check out some of the more interesting features using different data visualization techniques. 

#### Box Plot Insights
- Some true outliers with very low credit scores.
- Some very old customers are outliers.
- Tenure is evenly spread, with no outliers.
- Salaries are evenly spread, with no outliers.
- Very few outliers have 4 products.
- Too many customers seem to have very low balance.

#### Histogram Insights
- Normal(ish) distribution for credit scores.
- Predominantly young clientele. Mostly working class.
- Large number of old clients. Very few brand new clients.
- Vast majority of clients have 1 or 2 products, rarely 3, almost never 4.
- Estimated salary's distribution looks off. I should inspect that in detail.
- Balance: if not for the vast majority having 0 balance, it would be a perfectly normal distribution.
- Idea: Perhaps the 0 balance corresponds to only exited clients?
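
The zero-balance idea above is easy to test with a cross-tabulation. This is only a sketch on a made-up frame; the column names `Balance` and `Exited` match the dataset, but the numbers in `toy` are invented and say nothing about the real answer.

```python
import pandas as pd

# Made-up balances and churn flags
toy = pd.DataFrame({
    "Balance": [0.0, 0.0, 0.0, 0.0, 1200.5, 830.0, 99.9, 0.0],
    "Exited":  [0,   0,   1,   0,   1,      0,     0,   0],
})

# Label each customer as having a zero or a non-zero balance
balance_group = (
    toy["Balance"].eq(0)
    .map({True: "zero", False: "non-zero"})
    .rename("BalanceGroup")
)

# Rows: balance group; columns: Exited; each row sums to 1
share = pd.crosstab(balance_group, toy["Exited"], normalize="index")
print(share)
```

If zero balances belonged only to exited clients, the "zero" row would show a share of 1.0 in the `Exited = 1` column.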

#### Pie Plot insights 
- Slightly more males than females. Almost 50/50. 
- Around 75% have a credit card. 
- 50 / 50 Active and Inactive members. 
- 25 % of clients are EXITED. 
- Half the customers are French. The rest is evenly split between Germans and Spaniards.

#### Observation: 
- If I am to classify Exited or not, the classes are imbalanced. This will need to be addressed. 

In [None]:

df.boxplot(column="CreditScore")

In [None]:
df.Gender.value_counts().plot(kind='pie')

This tiny line of code is awesome for producing histograms and scatter plots for the whole dataset. It takes a bit of time to run but the end result is worth it. We can even color code the plot's hue by a feature of our choice. In this case, blue is remaining customers and orange is exited customers.

### Insights 

Exited customers are older, on average, than those still active. This kind of makes sense, as clients who have left must have been with the bank some time. The young ones have not really had the reason or the opportunity to leave yet. The bank should look out for middle-aged clients who might be looking for alternatives.

The exited customers also had higher balances on average. Again, this makes more sense if we look back at the previous observation. Having younger clients naturally means that their savings are likely to be lower, and so is their income. Either way, what we can observe is basically two large groups (or sub-distributions) in each distribution (exited/not-exited): those with a low balance (actually most of it is absolute zero) and the others, whose balances follow a bell-shaped, roughly Gaussian distribution.

The customers that remain mostly have 1 or 2 products. Those who have exited also held up to 4, in a small proportion. I believe there might be something unexplained in the data here. Perhaps it is because the bank used to have more products but now it doesn't, and older customers, with greater tenure, benefited from different products/services that are no longer available. Or perhaps the tendency is for clients to not use those products anymore.

Indeed, there were more exited customers who didn't have a card. This is a sure sign of not sticking with the bank much.

In [None]:
sns.pairplot(df, hue="Exited", palette="coolwarm")

### Pearson's Correlation Matrix

Ideally, we don't want our features to be correlated too strongly. This would make including both features of such pairs redundant, since they influence the result in similar fashion. The matrix shows us that:

- Number of products is negatively correlated with Balance. (Remaining customers have max 1 product. This explains the negative correlation.)
- Noticeable positive correlation between Exited and Age. (The older the client, the higher the probability of having left.)
- Some small positive correlation between Exited and Balance.
- Some negative correlation between Exited and Active/Inactive.

Overall, there is no sign of multicollinearity. The correlation coefficients are way too small to consider any feature for removal. Generally, we want to eliminate one feature of a pair whose correlation has an absolute value greater than 0.5.


In [None]:
# correlation table

corr = df.select_dtypes(include="number").corr()  # Pearson correlation needs numeric columns only
f, ax = plt.subplots(figsize=(10, 10))
cmap = sns.diverging_palette(220, 20, as_cmap=True)
sns.heatmap(corr, mask=None, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
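
Reading the heatmap by eye works for a dozen features, but the 0.5 rule of thumb can also be checked in code. Here is a small sketch; the helper name `strongly_correlated_pairs` and the random toy data are assumptions of mine, made up for illustration.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(frame, threshold=0.5):
    """Hypothetical helper: feature pairs with |Pearson r| above threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so that each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs.abs() > threshold]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

flagged = strongly_correlated_pairs(toy)
print(flagged)
```

An empty result on the real data would back up the decision to keep every feature.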

# IV. Data Analysis

The way I would analyze the data is visual inspection using Tableau, as it makes the whole process much easier and more pleasant. Perhaps I will come back to this section soon to uncover more of the data's hidden secrets. Stay tuned.

### Further questions:
- Are there any 0 Tenure customers who have already Exited?
- Which Tenure group (there are 10 groups) has the highest Churn rate?
- What about the gender pay gap?
- Which country has the biggest problems in customer retention?
- Are there any new features that could be engineered? 
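
Several of these questions come down to a grouped mean of the `Exited` column. A minimal sketch, on a made-up frame that only borrows the dataset's column names:

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "Tenure":    [0, 3, 3, 0, 5, 5],
    "Exited":    [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the churn rate
churn_by_country = (
    toy.groupby("Geography")["Exited"].mean().sort_values(ascending=False)
)
churn_by_tenure = toy.groupby("Tenure")["Exited"].mean()

# Customers with 0 years of tenure who have already exited
new_customers_gone = int(((toy["Tenure"] == 0) & (toy["Exited"] == 1)).sum())

print(churn_by_country)
print(churn_by_tenure)
print("0-tenure customers who exited:", new_customers_gone)
```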

# V. Machine Learning

First, we need to prepare the data. We eliminate useless features and one-hot-encode categorical features.

In [None]:
# Preparing data 
data = df.drop(['RowNumber','CustomerId','Surname'], axis=1)

dummy_feature_list = ['Gender','Geography'] 
for f in dummy_feature_list:
    dumm = make_dummy(data,f)
    data = pd.concat([data,dumm], axis=1)
data = data.drop(dummy_feature_list, axis=1)
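
As a side note, `pd.get_dummies` can also encode several columns in one call through its `columns` argument. The sketch below uses a made-up frame. Unlike the loop above, it prefixes the new columns with the original feature name, so the resulting column names differ from the ones used in the rest of this notebook.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [42, 41, 39],
})

# Encode both categorical columns at once and drop the originals
encoded = pd.get_dummies(toy, columns=["Gender", "Geography"], dtype=int)
print(encoded.columns.tolist())
```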


### Train/Test splitting, SMOTE and Standardization 
XGBoost does not require the data to be standardized, but the ANN we will use later does. Either way, I noticed no difference between training the XGB on standardized and un-standardized data. What is truly important, and had a great effect on performance, is using the SMOTE algorithm to "produce" artificial data points for our minority class: customers who are no longer with the bank, who are outnumbered 3 to 1. Ideally, we'd like data that is equally representative of all the classes we study. When we do not have that, we can use sampling strategies.

In [None]:
# ORIGINAL DATA
X = data.drop(['Exited'],axis=1) # The features
y = data.Exited # The target feature
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42) # Splitting data 

# RESAMPLED DATA
seed = 42
k = 1 
sm = SMOTE(sampling_strategy = 'minority', k_neighbors = k, random_state = seed)
res_X, res_y = sm.fit_resample(X, y)
res_X_train, res_X_test, res_y_train, res_y_test = train_test_split(res_X, res_y,test_size=0.2, random_state=42)

# RESCALED DATA 
scaler = StandardScaler()
std_X = scaler.fit_transform(res_X)
std_X_train,std_X_test,std_y_train,std_y_test = train_test_split(std_X,res_y,test_size=0.2,random_state=42) # Splitting data

# Going to stick to rescaled data but rename it for convenience's sake
X_train,X_test,y_train,y_test = std_X_train,std_X_test,std_y_train,std_y_test
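
One caveat worth flagging: the cell above resamples and scales the full dataset before splitting, so synthetic points and scaling statistics can carry information from rows that end up in the test set. Below is a hedged sketch of the alternative order (split, then resample the training part, then fit the scaler on it). To keep it self-contained it uses plain random oversampling from scikit-learn as a stand-in for SMOTE, and the data and all `toy_` / `_tr` / `_te` names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 40 + [0] * 160)  # made-up imbalanced labels

# 1) Split first, so the test set keeps the original class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# 2) Oversample the minority class in the training part only
#    (random oversampling here; SMOTE would take this place)
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

# 3) Fit the scaler on the training data and reuse it on the test data
toy_scaler = StandardScaler().fit(X_bal)
X_bal_std = toy_scaler.transform(X_bal)
X_te_std = toy_scaler.transform(X_te)

print("train class counts:", np.bincount(y_bal))
print("test class counts: ", np.bincount(y_te))
```

With this order the test metrics describe performance on untouched, imbalanced data, so they would likely come out lower than the ones reported in this notebook.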

### Hyperparameter choice 

I used a grid of hyperparameter combinations as input for a Randomized Search with Cross Validation, which is faster than, though not as exhaustive as, a full-on grid search. Even so, it is still fairly slow, but it lets us come up with good hyperparameter choices more easily.

In [None]:
xgb_clf = xgb.XGBClassifier()

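xgb_params = {"colsample_bytree":[0.3,0.6,0.8],
              "gamma":[0,0.01,0.1,0.5],
              "learning_rate":[0.01,0.03,0.1],
              "max_depth": [2,4,6,None],
              "n_estimators": [100,200, 500, 1000],
              "subsample":[0.5,1],  # must be > 0: it is the share of rows sampled for each tree
              # "min_samples_split" is a scikit-learn tree parameter, not an XGBoost one, so it is left out
              "reg_lambda":[0,0.001,0.01,1],  # L2 regularization (alias: lambda)
              "reg_alpha":[0,0.001,0.01,1]}   # L1 regularization (alias: alpha)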

xgb_cv_model = RandomizedSearchCV(estimator= xgb_clf, 
                                 param_distributions = xgb_params, 
                                 n_iter=50, cv=10, scoring='accuracy', 
                                 n_jobs=-1, verbose=2).fit(std_X_train, std_y_train)
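
To get a feel for why the randomized search is cheaper, here is a small sketch that counts model fits. The grid `toy_grid` is a made-up, smaller stand-in and not the one used in this notebook; `ParameterGrid` and `ParameterSampler` are the scikit-learn utilities that enumerate and sample parameter combinations.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Made-up grid, smaller than the one used for the real search
toy_grid = {
    "learning_rate": [0.01, 0.03, 0.1],
    "max_depth": [2, 4, 6],
    "n_estimators": [100, 200, 500, 1000],
    "subsample": [0.5, 0.8, 1.0],
}

n_total = len(ParameterGrid(toy_grid))  # every possible combination
sampled = list(ParameterSampler(toy_grid, n_iter=50, random_state=0))

cv_folds = 10
full_grid_fits = n_total * cv_folds        # exhaustive grid search
randomized_fits = len(sampled) * cv_folds  # randomized search with n_iter=50

print("full grid fits: ", full_grid_fits)
print("randomized fits:", randomized_fits)
```

The gap grows quickly with every extra hyperparameter, since the number of grid combinations multiplies while `n_iter` stays fixed.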

In [None]:
# These are the best hyperparameters that the search could give us. 
xgb_cv_model.best_params_

Here we can see our model's results. Pretty good.

In [None]:
# instantiate xgboost with best parameters
booster = xgb.XGBClassifier(subsample=0.5,colsample_bytree=0.6, alpha=0.001,gamma=0.5,learning_rate=0.03, 
                           max_depth=6, n_estimators=500, random_state=1, verbosity=1)
booster_metrics = classifier(booster,X_train, y_train, X_test, y_test, printOut=True)

In [None]:
# create a baseline
eval_set = [(X_train, y_train), (X_test, y_test)]
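# Recent XGBoost releases expect eval_metric on the estimator rather than in fit()
booster.set_params(eval_metric=["error", "logloss"])
booster.fit(X_train,y_train,eval_set=eval_set)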

### Plotting the training results 

Here we can see that the validation sticks close to the training in both loss and classification error. We can also see that there is room for a bit of improvement if we let the Booster grow more trees. However, progress is going to be very slow after these 500 iterations.

In [None]:
plot_xgb(booster,X_train,X_test,y_train,y_test)

### Plotting the feature importances 

Random Forests and XGBoost both come equipped with these feature importance lists, which tell us which variables had the greatest predictive power.

It seems to me that the assumptions made by the uploader of this dataset were proven true. The country is important. But not as much as the gender of the client. Activity is the most important predictor.

In [None]:
feat_importances = pd.Series(booster.feature_importances_, index=X.columns.values)
feat_importances.nlargest(20).sort_values().plot(kind='barh', color='darkgrey', figsize=(20,10))

# VI. Deep Learning (ANN)

Now let's attempt a simple neural network and see if it performs better than the XGBoost. Honestly, on this type of data, I assumed it wouldn't (and it didn't). Plus, the cost (time, computational power) of training is much greater. There's also the question of architecture choice and hyperparameter tuning, which is more difficult with NNs.

Note: I tried smaller architectures and they didn't perform as well. Also, without adding dropout layers, the NN overfits.

In [None]:
# Build the model using the functional API
# create model
i = Input(shape=std_X_train[-1].shape)
x = Dense(512,  activation='relu')(i)
x = Dropout(0.2)(x)
x = Dense(512,  activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(256,  activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(256,  activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.4)(x)
x = Dense(1, activation='sigmoid')(x)

model = Model(i, x)

# Compile and fit
# Note: make sure you are using the GPU for this!
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

r = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)

In [None]:
# Plot loss per iteration
import matplotlib.pyplot as plt
plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.legend()

In [None]:
# Plot accuracy per iteration
plt.plot(r.history['accuracy'], label='acc')
plt.plot(r.history['val_accuracy'], label='val_acc')
plt.legend()

As can be seen from the charts, the model simply starts overfitting after 15 epochs, where it has reached its peak. After this, validation accuracy does not increase anymore and the model just learns the training data by heart.
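
Instead of picking the number of epochs by eye, the stopping point can be found with an early stopping rule. Keras offers this as the `tf.keras.callbacks.EarlyStopping` callback; the sketch below only illustrates the underlying logic in plain Python. The function `early_stopping_epoch` and the loss values are made up for illustration and do not come from the runs above.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Hypothetical helper: return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improving at first, then drifting upwards
toy_val_loss = [0.60, 0.52, 0.47, 0.45, 0.46, 0.47, 0.49, 0.52]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_loss, patience=3)
print("stop at epoch", stop_epoch, "and keep the weights from epoch", best_epoch)
```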

## Question

I haven't figured out why this NN behaves as such. Why is the validation accuracy higher, and the validation loss lower, than those of the training set? I welcome you all to comment and try to solve this. Perhaps I should try different train/test splits?

## Possible answers

Przemysław Dolata had <a href="https://www.researchgate.net/post/When_can_Validation_Accuracy_be_greater_than_Training_Accuracy_for_Deep_Learning_Models">this</a> to say about dropout (a regularization technique I am making heavy use of in this example): "The training loss is higher because you've made it artificially harder for the network to give the right answers. However, during validation all of the units are available, so the network has its full computational power - and thus it might perform better than in training."
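
To make that explanation concrete, here is a small NumPy sketch of (inverted) dropout. It is my own illustration of the general technique, not the Keras implementation; the function `dropout_forward` and the toy activations are assumptions made up for this example.

```python
import numpy as np

def dropout_forward(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of the units while training."""
    if not training or rate == 0.0:
        return activations  # inference: every unit is used as-is
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the surviving units so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((1000, 16))  # made-up activations of a 16-unit layer

train_out = dropout_forward(acts, rate=0.2, training=True, rng=rng)
eval_out = dropout_forward(acts, rate=0.2, training=False, rng=rng)

print("share of units zeroed while training:", float((train_out == 0).mean()))
print("share of units zeroed at inference:  ", float((eval_out == 0).mean()))
```

The training metrics are computed with roughly a fifth of each layer switched off, while the validation metrics use the full network, which fits the behaviour seen in the charts.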


In [None]:
# Build the model using the functional API
# create model
i = Input(shape=std_X_train[-1].shape)
x = Dense(1024,  activation='relu')(i)
x = Dropout(0.2)(x)
x = Dense(512,  activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(256,  activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(256,  activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.4)(x)
x = Dense(1, activation='sigmoid')(x)

model = Model(i, x)

# Choosing optimizer (taken from tf.keras, to match the layers imported from tensorflow.keras)
opt = tf.keras.optimizers.Adam(learning_rate=0.001)

# Compile and fit
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])

r = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=15)

In [None]:
# Plot loss per iteration
import matplotlib.pyplot as plt
plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.legend()

In [None]:
# Plot accuracy per iteration
plt.plot(r.history['accuracy'], label='acc')
plt.plot(r.history['val_accuracy'], label='val_acc')
plt.legend()

### Eliminating dropout

This is what happens if we eliminate the regularization. Frankly, there is not much going on. The model is behaving in approximately the same way, only this time it gets a little worse and the training and validation performance cross over each other earlier, at the 6th epoch.

In [None]:
# Build the model using the functional API
# create model
i = Input(shape=std_X_train[-1].shape)
x = Dense(1024,  activation='relu')(i)
#x = Dropout(0.1)(x)
x = Dense(512,  activation='relu')(x)
#x = Dropout(0.1)(x)
x = Dense(256,  activation='relu')(x)
#x = Dropout(0.1)(x)
x = Dense(256,  activation='relu')(x)
#x = Dropout(0.1)(x)
x = Dense(64, activation='relu')(x)
#x = Dropout(0.1)(x)
x = Dense(32, activation='relu')(x)
#x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
#x = Dropout(0.1)(x)
x = Dense(1, activation='sigmoid')(x)

model = Model(i, x)

# Choosing optimizer (taken from tf.keras, to match the layers imported from tensorflow.keras)
opt = tf.keras.optimizers.Adam(learning_rate=0.001)

# Compile and fit
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])

r = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=7)

In [None]:
# Plot accuracy per iteration
plt.plot(r.history['accuracy'], label='acc')
plt.plot(r.history['val_accuracy'], label='val_acc')
plt.legend()

# Conclusions 

This type of Machine Learning task is rather trivial. The data is small and clean. The predicted outcome is binary. The data points are labeled and the learning is supervised. Still, there is much to be learned from the Exploratory Data Analysis and from basic Statistical Assumptions about the underlying structure of the data. When I say this, I am referring to the relationships that exist between the studied variables and even to their own individual characteristics, such as distribution.

This type of work could come in handy to an organization that cares about customer retention. I would recommend using machine learning to flag customers that are potentially willing to leave the bank. After this, a case worker can pick up the person's profile and analyse it. A decision on how to proceed with the client is made by balancing the worker's expertise, experience and gut feeling with the insights gained through Data Analytics.

We got a brief taste here of what it means to search for good hyperparameters, and we saw what regularization techniques can do for the model's performance. We saw that re-sampling imbalanced datasets can be used to reduce bias. Basic data visualization on-the-go is great for testing assumptions and getting creative.

I truly hope that you find this notebook useful and that it also finds you in your time of need. ^^ Please leave a comment if you liked it and don't be afraid to question, criticize or direct to my attention any peculiarity that you find here. 