<a href="https://colab.research.google.com/github/tdegamboda/Kaggle_Rough/blob/main/Customer_Attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bank Churn Analysis

A manager at the bank is concerned that more and more customers are leaving the bank's credit card services. They would like to predict which customers are likely to churn, so that the bank can proactively approach those customers, offer them better services, and change their minds.

I got this dataset from https://leaps.analyttica.com/home. I have been using this site for a while to find datasets and work through them. The site also explains how to solve a particular business problem.

The dataset covers about 10,000 customers and records their age, salary, marital status, credit card limit, credit card category, and so on. There are nearly 18 features.

Only 16.07% of the customers have churned, so the two classes are imbalanced. This makes it harder to train a model to predict churning customers.
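
The churn share quoted above can be computed directly from the `Attrition_Flag` column. The following is a minimal sketch on a toy series (the 84/16 split is made up for illustration and is not the real data); on the actual notebook you would pass `data['Attrition_Flag']` instead.

```python
import pandas as pd

# Toy stand-in for the Attrition_Flag column (hypothetical counts, not the real data).
flags = pd.Series(
    ["Existing Customer"] * 84 + ["Attrited Customer"] * 16,
    name="Attrition_Flag",
)

# normalize=True returns each class as a fraction of the total instead of a raw count.
class_share = flags.value_counts(normalize=True)
churn_rate = class_share["Attrited Customer"]
print(f"Churn rate: {churn_rate:.2%}")  # prints "Churn rate: 16.00%"
```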

In [None]:
#Colab library to upload files to notebook
from google.colab import files

# Install Kaggle library
!pip install -q kaggle

In [None]:
# Upload kaggle API key file
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d sakshigoyal7/credit-card-customers

Downloading credit-card-customers.zip to /content
  0% 0.00/379k [00:00<?, ?B/s]
100% 379k/379k [00:00<00:00, 56.1MB/s]


## We will start by importing all necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
# pyo.init_notebook_mode()
# sns.set_style('darkgrid')

In [None]:
data = pd.read_csv("credit-card-customers.zip", compression = 'zip', header = 0, sep = ",", quotechar='"')

data.columns

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
      dtype='object')

In [None]:
# We will not be using the last two columns of this dataset hence we drop them

data = data[data.columns[:-2]]
data.head()

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0


In [None]:
fig = make_subplots(rows = 2, cols = 1)

tr1 = go.Box(x = data['Customer_Age'], name = 'Age Box Plot', boxmean = True)
tr2 = go.Histogram(x = data['Customer_Age'], name = 'Age Histogram')

fig.add_trace(tr1, row = 1, col = 1)
fig.add_trace(tr2, row = 2, col = 1)

fig.update_layout(height = 700, width = 1200, title_text = "Distribution of Customer Ages")
fig.show()

Customer ages appear approximately normally distributed, so we will treat the age feature as normal in further work.
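
One quick way to sanity-check a normality assumption beyond eyeballing the histogram is the 68/95 rule: for a normal distribution, roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two. The sketch below uses synthetic ages (the mean of 46 and spread of 8 are assumptions chosen for illustration); swap in `data['Customer_Age']` to check the real column.

```python
import numpy as np
import pandas as pd

# Synthetic, bell-shaped "ages" (hypothetical parameters, not the real data).
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=46, scale=8, size=10_000))

mean, sd = ages.mean(), ages.std()

# Fraction of observations within 1 and 2 standard deviations of the mean.
within_1sd = ((ages - mean).abs() <= sd).mean()
within_2sd = ((ages - mean).abs() <= 2 * sd).mean()

print(f"within 1 sd: {within_1sd:.3f} (normal: about 0.683)")
print(f"within 2 sd: {within_2sd:.3f} (normal: about 0.954)")
```

Large departures from these reference fractions would be a hint that the normality assumption deserves a closer look.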

In [None]:
fig1 = ex.pie(data, names = 'Gender', title = 'Proportion Of Customer Genders')
fig1.update_layout(height = 500, width = 700)
fig1.show()

Given that the shares of males and females differ by only 5.8%, we can treat the two genders as roughly evenly represented.

In [None]:
fig = make_subplots(rows = 2, cols = 1)

tr1 = go.Box(x = data['Dependent_count'], name = 'Dependent Count Box Plot', boxmean = True)
tr2 = go.Histogram(x = data['Dependent_count'], name = 'Dependent Count Histogram')

fig.add_trace(tr1, row = 1, col = 1)
fig.add_trace(tr2, row = 2, col = 1)

fig.update_layout(height = 700, width = 1200, title_text = "Distribution of Dependent Counts (Close Family Size)")
fig.show()

The dependent counts are roughly normally distributed with a slight right skew. On average, customers have about 2 dependents.

In [None]:
fig1 = ex.pie(data, names = "Education_Level", title = "Proportion of Education Levels")
fig1.update_layout(height = 500, width = 700)
fig1.show()

If we assume that an "Unknown" education level refers to a lack of any formal education, then more than 70% of the customers have a formal education level. Among the clients with a formal education, 71.4% have completed or are currently pursuing a higher level of education.
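
The two shares above are conditional proportions, which can be sketched as follows. The category counts are toy values, and the groupings are assumptions made for this illustration: both "Unknown" and "Uneducated" are treated as no formal education, and "College", "Graduate", "Post-Graduate" and "Doctorate" as higher education. Adjust the two sets to match whatever definition you actually intend.

```python
import pandas as pd

# Toy counts per education level (hypothetical, chosen only for illustration).
counts = {
    "Graduate": 30, "High School": 20, "Unknown": 15, "Uneducated": 15,
    "College": 10, "Post-Graduate": 5, "Doctorate": 5,
}
education = pd.Series(
    [level for level, n in counts.items() for _ in range(n)],
    name="Education_Level",
)

# Assumed groupings (not stated in the dataset itself).
no_formal = {"Unknown", "Uneducated"}
higher = {"College", "Graduate", "Post-Graduate", "Doctorate"}

formal_mask = ~education.isin(no_formal)
formal_share = formal_mask.mean()                          # share with formal education
higher_share = education[formal_mask].isin(higher).mean()  # share of those in higher education

print(f"formal education: {formal_share:.1%}")
print(f"higher education among formally educated: {higher_share:.1%}")
```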

In [None]:
fig1 = ex.pie(data, names = "Marital_Status", title = "Proportion of Marital Statuses")
fig1.update_layout(height = 500, width = 700)
fig1.show()

The majority of clients at this bank are married. In general, the distribution of marital status is as expected based on global rates.

In [None]:
fig1 = ex.pie(data, names = "Income_Category", title = "Proportion of Income Levels")
fig1.update_layout(height = 500, width = 700)
fig1.show()

In [None]:
fig1 = ex.pie(data, names = "Card_Category", title = "Proportion of Different Card Categories")
fig1.update_layout(height = 500, width = 700)
fig1.show()

It is interesting to compare the distribution of income bands with the card categories. A staggering majority of clients use the Blue card even though income levels vary widely. Only 6.8% of the bank's clients use higher card tiers (above Blue), even though 7.8% of clients fall in the highest income group.
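
Rather than comparing two separate pie charts, the relationship between income band and card tier can be read from a single cross-tabulation. This is a minimal sketch on a six-row toy frame; the "Gold" and "Silver" tier names and the row values are assumptions for illustration. On the real notebook, pass `data['Income_Category']` and `data['Card_Category']`.

```python
import pandas as pd

# Toy frame (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "Income_Category": [
        "Less than $40K", "Less than $40K",
        "$60K - $80K", "$60K - $80K",
        "$80K - $120K", "$80K - $120K",
    ],
    "Card_Category": ["Blue", "Blue", "Blue", "Silver", "Blue", "Gold"],
})

# normalize="index" turns each row into proportions, so every income band sums to 1.
card_by_income = pd.crosstab(
    toy["Income_Category"], toy["Card_Category"], normalize="index"
)
print(card_by_income.round(2))
```

Each row then shows how a given income band is split across card tiers, which makes it easy to see whether higher-income clients actually hold higher tiers.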

In [None]:
fig = make_subplots(rows = 2, cols = 1)

tr1 = go.Box(x = data['Months_on_book'], name = 'Months on Book Box Plot', boxmean = True)
tr2 = go.Histogram(x = data['Months_on_book'], name = 'Months on Book Histogram')

fig.add_trace(tr1, row = 1, col = 1)
fig.add_trace(tr2, row = 2, col = 1)

fig.update_layout(height = 700, width = 1200, title_text = "Distribution of Months on Book")
fig.show()

In [None]:
kurt = data['Months_on_book'].kurt()
print("Kurtosis of the distribution of months on books: {}".format(kurt))

Kurtosis of the distribution of months on books: 0.40010012019986707


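Note that pandas' `kurt()` reports excess kurtosis, for which a normal distribution scores 0. A value of about 0.40 therefore points to a distribution that is slightly more peaked, with somewhat heavier tails, than a normal one, rather than a flat one. Given this departure, we will not assume normality for this feature.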
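
To help interpret the number printed above: `Series.kurt()` and `DataFrame.kurt()` return excess kurtosis (Fisher's definition), so flat distributions score below 0 and peaked, heavy-tailed ones score above 0. The sketch below compares three synthetic samples; the sample size and seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

# Three synthetic samples with known shapes.
samples = pd.DataFrame({
    "uniform (flat)": rng.uniform(-1, 1, n),      # theoretical excess kurtosis: -1.2
    "normal": rng.normal(0, 1, n),                # theoretical excess kurtosis: 0
    "laplace (peaked)": rng.laplace(0, 1, n),     # theoretical excess kurtosis: 3
})

# DataFrame.kurt() computes excess kurtosis column by column.
excess_kurtosis = samples.kurt()
print(excess_kurtosis.round(2))
```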

In [None]:
fig = make_subplots(rows = 2, cols = 1)

tr1 = go.Box(x = data['Total_Relationship_Count'], name = 'Total No. of Products Box Plot', boxmean = True)
tr2 = go.Histogram(x = data['Total_Relationship_Count'], name = 'Total No. of Products Histogram')

fig.add_trace(tr1, row = 1, col = 1)
fig.add_trace(tr2, row = 2, col = 1)

fig.update_layout(height = 700, width = 1200, title_text = "Distribution of the Total No. of Products Held by the Customer")
fig.show()

The total number of products held by the customer is irregularly distributed. It looks roughly uniform where the total number of products is greater than 4 and left skewed everywhere else. This may suggest that the feature will be of limited use as a predictor of churn status.

In [None]:
fig = make_subplots(rows = 2, cols = 1)

tr1 = go.Box(x = data['Months_Inactive_12_mon'], name = 'No. of Months Inactive Box Plot', boxmean = True)
tr2 = go.Histogram(x = data['Months_Inactive_12_mon'], name = 'No. of Months Inactive Histogram')

fig.add_trace(tr1, row = 1, col = 1)
fig.add_trace(tr2, row = 2, col = 1)

fig.update_layout(height = 700, width = 1200, title_text = "Distribution of the Number of Months Inactive in the Last 12 Months")
fig.show()

In [None]:
fig = make_subplots(rows=2, cols=1)

tr1=go.Box(x=data['Credit_Limit'],name='Credit_Limit Box Plot',boxmean=True)
tr2=go.Histogram(x=data['Credit_Limit'],name='Credit_Limit Histogram')

fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)

fig.update_layout(height=700, width=1200, title_text="Distribution of the Credit Limit")
fig.show()

In [None]:
fig = make_subplots(rows=2, cols=1)

tr1=go.Box(x=data['Total_Trans_Amt'],name='Total_Trans_Amt Box Plot',boxmean=True)
tr2=go.Histogram(x=data['Total_Trans_Amt'],name='Total_Trans_Amt Histogram')

fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)

fig.update_layout(height=700, width=1200, title_text="Distribution of the Total Transaction Amount (Last 12 months)")
fig.show()

The total transaction amount is clearly multimodal. This generally implies that there are underlying groups in the data that can be clustered and examined further.
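
As an illustration of how such underlying groups might be separated, here is a small hand-rolled one-dimensional k-means. It is only a sketch: the notebook does not say which clustering method to use, the choice of three clusters is an assumption, and the amounts are synthetic rather than taken from `data['Total_Trans_Amt']`. The helper name `kmeans_1d` is hypothetical.

```python
import numpy as np

# Synthetic transaction amounts with three well-separated modes (not the real data).
rng = np.random.default_rng(1)
amounts = np.concatenate([
    rng.normal(2_000, 300, 500),
    rng.normal(4_500, 300, 300),
    rng.normal(8_000, 400, 200),
])

def kmeans_1d(values, k, n_iter=50):
    """Plain Lloyd's algorithm on a 1-D array; returns (centers, labels)."""
    # Start from evenly spaced quantiles so the result is deterministic.
    centers = np.quantile(values, np.linspace(0, 1, k + 2)[1:-1])
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of the values assigned to it.
        new_centers = np.array([values[labels == j].mean() for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans_1d(amounts, k=3)
print("cluster centers:", np.round(np.sort(centers)))
print("cluster sizes:", np.bincount(labels))
```

On real data the modes will overlap more, so it is worth comparing a few values of `k` before reading much into the clusters.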

In [None]:
fig1 = ex.pie(data, names = "Attrition_Flag", title = "Proportion of Churn v. Not Churn Customers")
fig1.update_layout(height = 500, width = 700)
fig1.show()

Given the uneven distribution of the two classes, we will upsample the Attrited Customers. This gives the models selected later a better chance of picking up on small details that would almost certainly be missed with such a large difference in class sizes.
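
One simple way to upsample is to draw minority-class rows with replacement until both classes are the same size. The sketch below shows that idea on a ten-row toy frame (the values are illustrative); it is not necessarily the method used later in this notebook, and other approaches such as synthetic oversampling exist.

```python
import pandas as pd

# Toy frame with an 8:2 class imbalance (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 8 + ["Attrited Customer"] * 2,
    "Total_Trans_Ct": [42, 33, 20, 20, 28, 50, 61, 45, 10, 15],
})

majority = toy[toy["Attrition_Flag"] == "Existing Customer"]
minority = toy[toy["Attrition_Flag"] == "Attrited Customer"]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

# Combine and shuffle so the classes are not grouped together.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
counts = balanced["Attrition_Flag"].value_counts()
print(counts)
```

Upsampling should be applied to the training split only, after the train/test split, so that duplicated rows do not leak into the evaluation data.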