# Predicting & Visualising Customer Churn
This notebook visualises the features in the data and uses them to predict whether a customer will leave the credit card company.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from collections import Counter
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder, StandardScaler as ss, MinMaxScaler as mms
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/credit-card-customers/BankChurners.csv')

In [None]:
df = df.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1'], axis=1)
df = df.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1)
df = df.drop('CLIENTNUM', axis=1)

In [None]:
df

In [None]:
X = df.drop('Attrition_Flag', axis=1)
y = df['Attrition_Flag']

## Categorical visualisation
The first step is to visualise the different features.

The pie chart below shows that roughly 84% of customers in our data stayed with the company, while 16% left.

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
count = Counter(y)
ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
ax.set_title('Percentage of existing and attrited customers')
plt.show()

The next graph is a bar chart of the customers' education level. Most customers have some form of education, with only around 1,500 recorded as uneducated.

In [None]:
fig, ax = plt.subplots(figsize=(15, 6))
count = Counter(X['Education_Level'])
count = pd.Series(count).sort_values(ascending=False)
labels = []

for i in count.keys():
    labels.append(i + ' (' + str(count[i]/len(X['Education_Level'])*100)[:5] + '%)')

plt.bar(labels, count, color='blue')
plt.title('Education level of the customers')
plt.xlabel('Education level')
plt.ylabel('Number of customers')
plt.show()
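
The percentage labels above are built by slicing a string, which truncates rather than rounds. As a minimal sketch of an alternative, `value_counts(normalize=True)` returns the share of each category directly, already sorted from most to least frequent. The toy series below is made up for illustration and is not the bank data.

```python
import pandas as pd

# Toy categorical column (illustrative values only, not the bank dataset)
toy = pd.Series(['Graduate', 'High School', 'Graduate',
                 'Unknown', 'Graduate', 'High School'])

# Share of each category, sorted from most to least frequent
shares = toy.value_counts(normalize=True)

# Build bar labels such as "Graduate (50.0%)" with proper rounding
labels = [f'{level} ({share:.1%})' for level, share in shares.items()]
print(labels)
```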

The gender split at this bank is fairly even, with women making up roughly 53% of customers and men roughly 47%.

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
count = Counter(X['Gender'])

ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
ax.set_title('Gender of customers')
plt.show()

The marital status chart shows that around half of the customers are married, roughly 40% are single, and about 7% each are divorced or of unknown status.

In [None]:
fig, ax = plt.subplots(figsize=(8, 7))
count = Counter(X['Marital_Status'])
labels = []

for i in count:
    labels.append(i + ' (' + str(count[i]/len(X['Marital_Status'])*100)[:5] + '%)')
    
plt.bar(labels, count.values(), color='green')
plt.title('Marital status for customers')
plt.ylabel('Number of customers')
plt.xlabel('Marital status')
plt.show()

The vast majority (93%) of customers use Blue cards, followed by Silver (5%), Gold (1%) and Platinum (0.2%).

In [None]:
fig, ax = plt.subplots(figsize=(7, 7))
count = Counter(X['Card_Category'])

ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
ax.set_title('Card category of customers')
plt.show()

The most common number of relationships a customer has with the bank ('Total_Relationship_Count') is three (22%), followed by four, five and six at roughly 18% each, then two (12%) and one (9%).

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
count = Counter(X['Total_Relationship_Count'])
count = pd.Series(count).sort_values(ascending=False)
labels = []

for i in count.keys():
    labels.append(str(i) + ' (' + str(count[i]/len(X['Total_Relationship_Count'])*100)[:5] + '%)')
    
plt.bar(labels, count, color='purple')
plt.title('Number of relationships for customers')
plt.ylabel('Customers')
plt.xlabel('Number of relationships')
plt.show()

More than a third of customers earn less than $40K. The next most common category is $40-60K, which is about half as frequent. After that come $80-120K (15%), $60-80K (14%), Unknown (11%) and $120K+ (7%).

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
count = Counter(X['Income_Category'])

ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
ax.set_title('Income per customer')
plt.show()

## Numerical visualisation

Some features in our dataset are correlated, for example Months_on_book and Customer_Age, and Avg_Utilization_Ratio and Total_Revolving_Bal.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
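# Restrict to numeric columns: recent pandas versions raise an error if corr() sees string columns
sns.heatmap(X.select_dtypes(include='number').corr(), annot=True)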
plt.show()
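
Reading the strongest pairs off a dense heatmap by eye is error-prone. The sketch below ranks feature pairs by absolute Pearson correlation instead. The helper name `top_correlated_pairs` and the toy data are assumptions made for illustration; the same function could be called on `X`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include='number').corr().abs()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data, not the bank dataset: b tracks a closely, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

top = top_correlated_pairs(toy, n=1)
print(top)
```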

To visualise that correlation, we use scatter plots of the three most strongly correlated feature pairs: '**Total_Trans_Amt**' and '**Total_Trans_Ct**', '**Total_Revolving_Bal**' and '**Avg_Utilization_Ratio**', '**Months_on_book**' and '**Customer_Age**'.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 5))

ax1.scatter(X['Total_Trans_Amt'], df['Total_Trans_Ct'])
ax2.scatter(X['Total_Revolving_Bal'], df['Avg_Utilization_Ratio'])
ax3.scatter(X['Months_on_book'], df['Customer_Age'])

ax1.set_xlabel('Total_Trans_Amt', fontsize=20)
ax1.set_ylabel('Total_Trans_Ct', fontsize=20)

ax2.set_xlabel('Total_Revolving_Bal', fontsize=20)
ax2.set_ylabel('Avg_Utilization_Ratio', fontsize=20)

ax3.set_xlabel('Months_on_book', fontsize=20)
ax3.set_ylabel('Customer_Age', fontsize=20)

ax2.set_title('Correlation of features', fontsize=40, pad=40)

plt.show()

Next, we check the distribution of the five least evenly distributed features and see how they change under a log transform, a Box-Cox transform, the standard scaler and the min-max scaler.

The graphs below show that 'Credit_Limit' and 'Avg_Utilization_Ratio' work best without a transformation, 'Avg_Open_To_Buy' and 'Total_Amt_Chng_Q4_Q1' are best with Box-Cox, and 'Total_Trans_Amt' needs the log transform.

In [None]:
cols =['Credit_Limit','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Avg_Utilization_Ratio']

for col in cols:
    i = 0
    
    fig, axes = plt.subplots(1, 5, figsize=(15, 3))
    
    f1 = df[col]
    f2 = (df[col]+1).transform(np.log)
    f3 = pd.DataFrame(stats.boxcox(df[col]+1)[0])
    f4 = pd.DataFrame(ss().fit_transform(np.array(df[col]).reshape(-1, 1)))
    f5 = pd.DataFrame(mms().fit_transform(np.array(df[col]).reshape(-1, 1)))
    
    for column in [[f1, 'cyan', 'Normal'], [f2, 'pink', 'Log'], [f3, 'lightgreen', 'Box Cox'], 
                   [f4, 'skyblue', 'Standard'], [f5, 'yellow', 'MinMax']]:
        feature = column[0]
        colour = column[1]
        name = column[2]
        
        feature.hist(ax=axes[i], color=colour)
        deciles = feature.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
        
        for pos in np.array(deciles).reshape(1, -1)[0]:
            handle = axes[i].axvline(pos, color='darkblue', linewidth=1)

        axes[i].legend([handle], ['decile'])
        axes[i].set_xlabel(name)
        
        i += 1 
    
    axes[2].set_title(col, fontsize=15, pad=15)
    axes[3].set_title('')
    axes[4].set_title('')
                    
    plt.show()

plt.show()
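
The histograms are judged by eye. One way to put a number on "more evenly distributed" is the sample skewness, which is 0 for a symmetric distribution. The sketch below is an assumption-laden illustration on synthetic log-normal data, not the bank features, and `skew_report` is a hypothetical helper. It also shows that standard scaling is a linear rescaling, so it leaves the shape (and the skewness) unchanged; the same holds for min-max scaling.

```python
import numpy as np
from scipy import stats

def skew_report(values):
    """Sample skewness of a feature under the transforms compared above."""
    values = np.asarray(values, dtype=float)
    shifted = values + 1  # same +1 shift the notebook applies before log / Box-Cox
    return {
        'raw': float(stats.skew(values)),
        'log': float(stats.skew(np.log(shifted))),
        'boxcox': float(stats.skew(stats.boxcox(shifted)[0])),
        # Standard scaling only shifts and rescales, so skewness is unchanged
        'standard': float(stats.skew((values - values.mean()) / values.std())),
    }

# Synthetic right-skewed data standing in for a feature such as a transaction amount
rng = np.random.default_rng(0)
toy = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

report = skew_report(toy)
for name, value in report.items():
    print(f'{name:>8}: {value:.3f}')
```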

These techniques are applied below:

In [None]:
# 'Credit_Limit' and 'Avg_Utilization_Ratio' are left untransformed
X['Avg_Open_To_Buy'] = stats.boxcox(X['Avg_Open_To_Buy']+1)[0]
X['Total_Amt_Chng_Q4_Q1'] = stats.boxcox(X['Total_Amt_Chng_Q4_Q1']+1)[0]
X['Total_Trans_Amt'] = (X['Total_Trans_Amt']+1).transform(np.log)

Now we bin the five features with the widest range of values, reducing the number of unique values per feature from thousands to roughly one hundred equal-width bins.

In [None]:
for i in ['Credit_Limit', 'Total_Revolving_Bal','Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 
            'Total_Trans_Amt']:
    col = X[i]
    diff = col.max() - col.min()
    bins = np.digitize(col, np.arange(col.min(), col.max(), (diff/100)).tolist())
    X[i+'_bin'] = bins
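
As a minimal sketch of the same equal-width binning idea, `pd.cut` with an integer `bins` argument splits the range between the minimum and maximum into that many intervals, and `labels=False` returns the integer bin index. The toy series is synthetic; this is an alternative formulation, not the cell above rewritten.

```python
import numpy as np
import pandas as pd

# Synthetic continuous feature with close to 1,000 distinct values (not the bank data)
rng = np.random.default_rng(0)
toy = pd.Series(rng.uniform(0, 5000, size=1000), name='toy_feature')

# 100 equal-width bins between min and max; labels=False yields indices 0..99
toy_bin = pd.cut(toy, bins=100, labels=False)

print(toy.nunique(), 'unique values ->', toy_bin.nunique(), 'bins in use')
```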

The final piece of feature visualisation is the distribution of the binned variables, which are roughly centred except for 'Credit_Limit_bin'.

In [None]:
cols = ['Credit_Limit_bin', 'Total_Revolving_Bal_bin', 'Avg_Open_To_Buy_bin', 'Total_Amt_Chng_Q4_Q1_bin',
        'Total_Trans_Amt_bin']
colours = ['pink', 'lightblue', 'lightgreen', 'skyblue', 'yellow']

fig1, axes1 = plt.subplots(1, 2, figsize=(8, 3))
fig2, axes2 = plt.subplots(1, 3, figsize=(15, 3))

# The first two features go on the first figure, the remaining three on the second
for ax, name, colour in zip([*axes1, *axes2], cols, colours):
    col = X[name]
    pd.DataFrame(col).hist(ax=ax, color=colour)
    deciles = col.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])

    # Mark each decile with a vertical line
    for pos in deciles:
        handle = ax.axvline(pos, color='darkblue', linewidth=1.15)

    ax.legend([handle], ['decile'])

plt.show()

## Predicting Customer Churn
Now we use our dataset to predict whether a customer will stop using the credit card company.

We first use a LabelEncoder to convert the 'Gender', 'Education_Level', 'Marital_Status', 'Income_Category', and 'Card_Category' columns from categorical to numerical, and drop 'Credit_Limit' together with the binned columns. Afterwards, we split X and y into train and test sets: the train set gets 80% of the data and the test set 20%.

In [None]:
cat_cols = ['Gender','Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
for col in cat_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    
le = LabelEncoder()
y = le.fit_transform(y)
X = X.drop(['Credit_Limit', 'Total_Revolving_Bal_bin', 'Avg_Open_To_Buy_bin', 'Total_Amt_Chng_Q4_Q1_bin',
            'Total_Trans_Amt_bin'], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
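
The split above is random, so the class ratio in each part and the scores reported later can vary between runs. As a sketch under the assumption that reproducibility is wanted, `train_test_split` accepts `stratify` to keep the class ratio equal in both parts and `random_state` to fix the shuffle. The arrays below are toy stand-ins with the same 16% / 84% imbalance as the target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y with a 16% / 84% class imbalance
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 160 + [1] * 840)

# stratify keeps the class ratio identical in both parts; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

print('train share of class 1:', y_tr.mean())
print('test share of class 1: ', y_te.mean())
```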

In [None]:
count = Counter(y_train)
print('Distribution of target 1 & 0:', count[1], '&', count[0])

However, the '0' target has roughly five times fewer samples than the '1' target. Therefore, we use SMOTE to oversample the training set until both classes are equally represented.

In [None]:
smote = SMOTE()
X_train, y_train = smote.fit_resample(X_train, y_train)
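
To give an intuition for what SMOTE does, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified illustration written in plain NumPy, not imblearn's implementation, and the function name `smote_like_oversample` is hypothetical.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation (illustration only, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours for each new sample
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new sample at a random position on the segment between the two points
    lam = rng.random((n_new, 1))
    return minority[base] + lam * (minority[picked] - minority[base])

# Toy minority class with 20 two-dimensional points
rng = np.random.default_rng(1)
toy_minority = rng.normal(size=(20, 2))
synthetic = smote_like_oversample(toy_minority, n_new=80)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.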

In [None]:
count = Counter(y_train)
print('Distribution of target 1 & 0:', count[1], '&', count[0])

Next, we will create a selection of classifiers and use the best one for our final output. The predictors that will be used are XGBoost, Random Forest, K Nearest Neighbours, SGD Classifier and SVC.

In [None]:
classifiers = [[XGBClassifier(),'XGB Classifier'], [RandomForestClassifier(),'Random Forest'], 
    [KNeighborsClassifier(), 'K-Nearest Neighbours'], [SGDClassifier(),'SGD Classifier'], [SVC(),'SVC']]

To evaluate our models, we loop over them, fit each one on the training set and report three metrics: the test-set score (accuracy), the mean cross-validation score and the ROC AUC.

In [None]:
score_list = []
cross_val_list = []
roc_auc_list = []

for classifier in classifiers:
    model = classifier[0]
    model.fit(X_train, y_train)
    model_name = classifier[1]
    
    pred = model.predict(X_test)

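    score = model.score(X_test, y_test)  # accuracy on the held-out test set
    # Note: cross_val_score refits clones of the model on folds of the test set only
    cross_val = cross_val_score(model, X_test, y_test).mean()
    # Note: ROC AUC is computed here from hard class predictions, not probabilities
    roc_auc = roc_auc_score(y_test, pred)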
    
    score_list.append(score)
    cross_val_list.append(cross_val)
    roc_auc_list.append(roc_auc)
    
    print(model_name, 'model score:     ' + str(round(score*100, 2)) + '%')
    print(model_name, 'cross val score: ' +str(round(cross_val*100, 2)) + '%')
    print(model_name, 'roc auc score:   ' + str(round(roc_auc*100, 2)) + '%')
    
    if model_name != classifiers[-1][1]:
        print('')
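
The loop above passes hard class predictions to `roc_auc_score`. ROC AUC is normally computed from continuous scores, so that the curve covers every decision threshold. The sketch below shows one way to obtain such scores for models with and without `predict_proba`. The helper `positive_scores` is hypothetical, and the data is synthetic rather than the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def positive_scores(model, X):
    """Continuous scores for class 1 (hypothetical helper, not part of scikit-learn)."""
    if hasattr(model, 'predict_proba'):
        return model.predict_proba(X)[:, 1]
    # Models such as SGDClassifier with hinge loss only expose a decision function
    return model.decision_function(X)

# Synthetic, imbalanced toy data standing in for the bank dataset
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   weights=[0.16, 0.84], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

auc_by_model = {}
for name, model in [('KNN', KNeighborsClassifier()),
                    ('SGD', SGDClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    auc_by_model[name] = roc_auc_score(y_te, positive_scores(model, X_te))
    print(name, 'roc auc:', round(auc_by_model[name], 3))
```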

As seen from the bar chart below, XGBoost and Random Forest consistently outperform the other models. By a small margin, the winning model is the XGBoost Classifier, which achieves scores of **96%, 95%, and 94%** on the three metrics.

The Random Forest scores slightly lower, with **95%, 93% and 92%**.

Next is K-Nearest Neighbours with **81%, 88% and 79%**.

The SGD and SVC classifiers score between **40% and 84%**.

In [None]:
labels = ['XGBoost', 'Random Forest', 'KNN', 'SGD Classifier', 'SVC']
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 5))

ax1.bar(labels, score_list, color='blue')
ax2.bar(labels, cross_val_list, color='red')
ax3.bar(labels, roc_auc_list, color='green')

ax1.set_title('Model score')
ax2.set_title('Cross validation score')
ax3.set_title('ROC AUC score')

plt.show()

We have chosen the XGBoost Classifier as the one to make our final prediction with, and the results are shown below.

In [None]:
model = XGBClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)

score = model.score(X_test, y_test)
cross_val = cross_val_score(model, X_test, y_test).mean()
roc_auc = roc_auc_score(y_test, pred)

print('model score:     ' + str(round(score*100, 2)) + '%')
print('cross val score: ' +str(round(cross_val*100, 2)) + '%')
print('roc auc score:   ' + str(round(roc_auc*100, 2)) + '%')

### Thank you for reading my notebook.

### If you enjoyed this notebook and found it helpful, please upvote it so that I can make more of these.