# Credit churn - data exploration and modeling

0. [Introduction](#0)

1. [Preparation](#1)

    1.1 [Packages](#1.1)
    
    1.2 [Data](#1.2)
    
    1.3 [Data dictionary](#1.3)
    
2. [EDA and visualization](#2)

3. [Classification model](#3)

    3.1 [Imputation, label encoding and baseline](#3.1)
    
    3.2 [Dataset balancing](#3.2)
    
    3.3 [Feature selection](#3.3)
    
    3.4 [Neural Networks?](#3.4)

## 0. Introduction <a id=0></a>

Credit card churning is a widespread phenomenon where people apply for multiple credit cards (or, more generally, open multiple credit lines) to take advantage of signup bonuses, with no intention of keeping all of them active in the long term. Credit institutions therefore have a strong interest in identifying churning customers and, more generally, in predicting whether a client is going to cancel their credit card. We explore a dataset of 10000 customers with this goal in mind and then build a predictive model.

## 1. Preparation <a id=1></a>

### 1.1 Packages <a id=1.1></a>

In [None]:
import warnings

import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score, precision_recall_fscore_support
from sklearn.model_selection import RandomizedSearchCV
from imblearn.over_sampling import ADASYN, SMOTE

import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf


init_notebook_mode(connected=True)
cf.set_config_file(sharing='public',theme='white',offline=True)

!pip install -Uqq fastbook

import fastbook
fastbook.setup_book()

from fastai.vision.all import *
from fastai.tabular.all import *
from fastbook import *

warnings.filterwarnings(action='ignore', category=UserWarning)

### 1.2 Data <a id=1.2></a>

As specified in the original dataset, we can drop the last two columns.

In [None]:
#import data and print first five rows of dataframe
df = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
df = df.iloc[:, :-2]
df.head()

### 1.3 Data dictionary <a id=1.3></a>

`CLIENTNUM` - ID of the customer holding the credit card.

`Customer_Age` - Age of the customer.

`Gender` - Sex of the customer.

`Dependent_count` - Number of dependents of the customer.

`Education_Level` - Educational qualification of the customer.

`Marital_Status` - Civil status of the customer.

`Income_Category` - Annual income range of the customer.

`Card_Category` - Type of card owned by the customer.

`Months_on_book` - Number of months elapsed since the account opening.

`Total_Relationship_Count` - Total number of products held by the customer.

`Months_Inactive_12_mon` - Number of months with no transactions in the last year.

`Contacts_Count_12_mon` - Number of contacts with the bank in the last year.

`Credit_Limit` - Credit limit on the credit card.

`Total_Revolving_Bal` - Total revolving balance on the credit card.

`Avg_Open_To_Buy` - Average card "Open To Buy" (=credit limit - account balance) in the last year.

`Total_Amt_Chng_Q4_Q1` - Change in transaction amount over the last year (Q4 over Q1).

`Total_Trans_Amt` - Total amount of transactions made in the last year.

`Total_Trans_Ct` - Number of transactions made in the last year.

`Total_Ct_Chng_Q4_Q1` - Change in transaction number over the last year (Q4 over Q1).

`Avg_Utilization_Ratio` - Average card "Utilization ratio" (=account balance / credit limit) in the last year.

`Attrition_Flag` - Target variable. "Attrited Customer" if the customer closed their account, otherwise "Existing Customer".

In [None]:
df.info()

There are four categorical features, namely `Education_Level`, `Income_Category`, `Card_Category` and `Marital_Status`, in addition to the dependent variable `Attrition_Flag`. Even though the non-null count does not show any missing data, looking at the individual levels we see that the columns `Education_Level`, `Income_Category` and `Marital_Status` contain entries marked as *Unknown*.
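A minimal sketch of how such placeholder entries can be counted, since `df.info()` does not report them as missing. The toy frame below is made up for illustration; only the column names and the *Unknown* label come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School", "Unknown"],
    "Income_Category": ["$40K - $60K", "Unknown", "$120K +", "Less than $40K"],
    "Marital_Status": ["Married", "Single", "Unknown", "Single"],
    "Customer_Age": [45, 39, 51, 40],
})

# Columns in which the "Unknown" placeholder is expected.
cat_cols = ["Education_Level", "Income_Category", "Marital_Status"]

# Compare every cell with the placeholder and count the matches per column.
unknown_counts = toy[cat_cols].eq("Unknown").sum()
print(unknown_counts)
```

On the real dataframe the same expression would be applied to `df[cat_cols]`.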

## 2. EDA and visualization <a id=2></a>

In [None]:
#Age histogram
age_att = df.loc[df['Attrition_Flag'] == 'Attrited Customer', 'Customer_Age']
age_ex = df.loc[df['Attrition_Flag'] == 'Existing Customer', 'Customer_Age']
conc_age = pd.concat([age_att, age_ex], axis=1)
conc_age.columns = ['Attrited Customer', 'Existing Customer']
conc_age.iplot(kind='hist', keys=['Attrited Customer', 'Existing Customer'],
           colors=['blue', 'orange'], histnorm='percent', opacity=0.5, bins=30,
           title='Customers\' age', xTitle='Age', yTitle='% customers')

#Marital_Status normalized bar plot
mar_count = df.groupby(['Marital_Status', 'Attrition_Flag']).count()['CLIENTNUM'].unstack()
mar_count = mar_count.loc[['Single', 'Married', 'Divorced', 'Unknown']]
mar_count = mar_count / mar_count.sum()

mar_count.iplot(kind='bar', orientation='h',
                title='Customers\' marital status', xTitle='normalized count', yTitle='Marital status',                
                bargap=0.5, colors=['blue', 'orange'])

#Education_Level normalized bar plot
edu_count = df.groupby(['Education_Level', 'Attrition_Flag']).count()['CLIENTNUM'].unstack()
edu_count = edu_count.loc[['Uneducated', 'High School', 'College', 'Graduate', 'Post-Graduate', 'Doctorate','Unknown']]
edu_count = edu_count / edu_count.sum()

edu_count.iplot(kind='bar', orientation='h', 
                title='Customers\' education', xTitle='normalized count', yTitle='Education level',                
                bargap=0.5, colors=['blue', 'orange'])

#Income_Category normalized bar plot
inc_count = df.groupby(['Income_Category', 'Attrition_Flag']).count()['CLIENTNUM'].unstack()
inc_count = inc_count.loc[['Less than $40K','$40K - $60K', '$60K - $80K' , '$80K - $120K', '$120K +', 'Unknown']]
#regex=False: treat '$' as a literal character, not as the regex end-of-string anchor
inc_count.index = inc_count.index.str.replace('$', '', regex=False)
inc_count = inc_count / inc_count.sum()

inc_count.iplot(kind='bar', orientation='h',
                title='Customers\' annual income', xTitle='normalized count', yTitle='Income range ($)',                
                bargap=0.5, colors=['blue', 'orange'])

Attrited customers seem to have a slightly higher education level and a slightly lower annual income, but overall, from a demographic standpoint, there are no striking differences between customers who have churned and customers who have not. Let us look at credit card usage and spending habits.
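The normalized counts shown in the bar plots divide each class column by its own total, so the two classes can be compared despite their different sizes. As a sketch, the same table can be obtained with `pd.crosstab` and `normalize="columns"`; the toy data below is invented for illustration.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "Marital_Status": ["Single", "Married", "Married", "Single", "Divorced", "Married"],
    "Attrition_Flag": ["Attrited Customer", "Existing Customer", "Existing Customer",
                       "Existing Customer", "Attrited Customer", "Existing Customer"],
})

# Each column (one per class) is divided by its own total, so it sums to 1.
shares = pd.crosstab(toy["Marital_Status"], toy["Attrition_Flag"], normalize="columns")
print(shares)
```

Unlike the `groupby(...).count().unstack()` route, `crosstab` fills combinations that never occur with 0 instead of NaN.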

In [None]:
#Normalized histogram of annual transactions
df_att = df.loc[df['Attrition_Flag'] == 'Attrited Customer', 'Total_Trans_Amt']
df_ex = df.loc[df['Attrition_Flag'] == 'Existing Customer', 'Total_Trans_Amt']
conc = pd.concat([df_att, df_ex], axis=1)
conc.columns = ['Attrited Customer', 'Existing Customer']

print("Churned customers' mean total transactions' amount: {:.2f}$".format(conc['Attrited Customer'].mean()))
print("Non-churned customers' mean total transactions' amount: {:.2f}$".format(conc['Existing Customer'].mean()))

conc.iplot(kind='hist', keys=['Attrited Customer', 'Existing Customer'],
           colors=['blue', 'orange'], histnorm='percent', opacity=0.6,
           title='Annual transactions', xTitle='Total transactions\' amount ($)', yTitle='% customers')

#Scatter plot: annual transactions' amount vs number
df.iplot(kind='scatter', x='Total_Trans_Amt', y='Total_Trans_Ct', categories='Attrition_Flag', size=6,
        xTitle='Total annual transactions\' amount ($)', yTitle='no. annual transactions')

df_att = df.loc[df['Attrition_Flag'] == 'Attrited Customer', 'Total_Ct_Chng_Q4_Q1']
df_ex = df.loc[df['Attrition_Flag'] == 'Existing Customer', 'Total_Ct_Chng_Q4_Q1']
conc = pd.concat([df_att, df_ex], axis=1)
conc.columns = ['Attrited Customer', 'Existing Customer']

#Box plot of ratio of change in transactions' count over the last year
conc.iplot(kind='box', keys=['Attrited Customer', 'Existing Customer'],
           colors=['blue', 'orange'], opacity=0.5, boxpoints='all', legend=False,
           title='Change in transactions count Q4/Q1 -- Box Plot', yTitle='change (ratio)')

As expected, there is a positive correlation between the number of annual transactions and their total amount. More interestingly, the dataset appears to split into three clusters of clients, one of which (high spenders, i.e. `Total_Trans_Amt` > 11k) contains no attrited customers at all. Churned customers used their credit card less often and spent significantly less money. They also show a more pronounced reduction in credit card usage over the last 12 months.
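A claim such as "the high-spender segment contains no attrited customers" can be checked numerically by binning `Total_Trans_Amt` and computing the attrition rate per bin. The sketch below uses invented toy values. The 11k boundary comes from the text, while the lower cut at 3000 is a hypothetical choice made only for this example.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "Total_Trans_Amt": [900, 2500, 4800, 7600, 9000, 12500, 15000, 1800],
    "Attrition_Flag": ["Attrited Customer", "Attrited Customer", "Existing Customer",
                       "Existing Customer", "Attrited Customer", "Existing Customer",
                       "Existing Customer", "Existing Customer"],
})

# Segment boundaries: 11000 is taken from the text, 3000 is a hypothetical cut.
bins = [0, 3000, 11000, np.inf]
labels = ["low", "medium", "high"]
toy["segment"] = pd.cut(toy["Total_Trans_Amt"], bins=bins, labels=labels)

# Fraction of attrited customers within each spending segment.
churned = toy["Attrition_Flag"].eq("Attrited Customer")
rate = churned.groupby(toy["segment"], observed=False).mean()
print(rate)
```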

In [None]:
#Boxplot of revolving balance
df_att = df.loc[df['Attrition_Flag'] == 'Attrited Customer', 'Total_Revolving_Bal']
df_ex = df.loc[df['Attrition_Flag'] == 'Existing Customer', 'Total_Revolving_Bal']
conc = pd.concat([df_att, df_ex], axis=1)
conc.columns = ['Attrited Customer', 'Existing Customer']

conc.iplot(kind='box', keys=['Attrited Customer', 'Existing Customer'],
           colors=['blue', 'orange'], opacity=0.5, legend=False,
           title='Revolving balance', yTitle='Total revolving balance ($)')

We notice a major difference when analyzing the revolving balance of the customers' credit lines: more than half of the churned customers have paid off all their debt. It is unclear whether the revolving balance is measured before or after the termination of the credit line, and whether customers who cancel their credit cards are asked to pay off all their outstanding debt. If so, this feature has no real predictive value and should not be used for classification, since it leaks the outcome. However, since it is included in the dataset, we will use it in our models.

## 3. Classification model <a id=3></a>

### 3.1 Imputation, label encoding and baseline <a id=3.1></a>

We change the *Unknown* entries in the categorical features to NaN and then we impute such missing data, while label encoding all categorical features (we can treat `Education_Level`, `Income_Category` and `Card_Category` as ordinal features, and follow the obvious ordering in the choice of the labels). For the dependent variable, we will consider *Attrited Customer* as the positive class (label = 1).
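As a small sketch of the encoding idea: when a column is converted to an ordered categorical whose category list omits *Unknown*, pandas turns those entries into NaN, and the integer codes follow the chosen order. The short series below is a made-up sample.

```python
import pandas as pd

# Ordered levels, from lowest to highest ("Unknown" is deliberately left out).
edu_order = ["Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate"]

# Made-up sample of raw values.
raw = pd.Series(["Graduate", "Unknown", "High School", "Doctorate"])

# Values that are not in the category list (here "Unknown") become NaN.
edu = pd.Series(pd.Categorical(raw, categories=edu_order, ordered=True))

# Integer labels follow the order above; pandas encodes NaN as -1.
codes = edu.cat.codes
print(edu.isna().sum(), codes.tolist())
```

These are the plain pandas codes. The labels produced later by the fastai preprocessing steps may be offset differently.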

In [None]:
#Ordered levels of each feature ("Unknown" is left out on purpose, so it becomes NaN)
income_ord = ['Less than $40K', '$40K - $60K', '$60K - $80K', '$80K - $120K', '$120K +']
edu_ord = ['Uneducated', 'High School', 'College', 'Graduate', 'Post-Graduate', 'Doctorate']
card_ord = ['Blue', 'Silver', 'Gold', 'Platinum']
marital_status = ['Single', 'Married', 'Divorced']

#set_categories returns a new column: its inplace argument is no longer available in recent pandas versions
df['Income_Category'] = df['Income_Category'].astype('category').cat.set_categories(income_ord, ordered=True)
df['Education_Level'] = df['Education_Level'].astype('category').cat.set_categories(edu_ord, ordered=True)
df['Card_Category'] = df['Card_Category'].astype('category').cat.set_categories(card_ord, ordered=True)
df['Marital_Status'] = df['Marital_Status'].astype('category').cat.set_categories(marital_status, ordered=False)

dep_var = 'Attrition_Flag'
df[dep_var] = (df[dep_var] == 'Attrited Customer').astype(int)

print("# existing customers: {}\n".format(len(df.loc[df[dep_var] == 0])))
print("# attrited customers: {}\n".format(len(df.loc[df[dep_var] == 1])))

Our dataset is **heavily unbalanced**: only 16% of the datapoints have `Attrition_Flag` = 1. This poses a problem which needs to be addressed, both while training classification models (to avoid bias) and while evaluating their performance (a naive model which always predicts `Attrition_Flag` = 0 will have an accuracy of almost 84%). In particular, as metrics we need to look separately at precision and recall for each of the two levels of the dependent variable, not just at the overall accuracy. Since the bank is especially interested in predicting which customers will churn, we will prioritize the recall for the positive class: this will be reflected in the choice of the ROC thresholds.
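The point about accuracy can be made concrete with a short sketch: on synthetic labels with the same 84/16 split, a model that always predicts class 0 reaches 84% accuracy while never finding a single churner.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Synthetic labels with the class balance quoted in the text: 84% class 0, 16% class 1.
y_true = np.array([0] * 84 + [1] * 16)

# Naive model: always predict "existing customer" (class 0).
y_naive = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_naive)
# zero_division=0 avoids a warning for class 1, which is never predicted.
prec, rec, _, _ = precision_recall_fscore_support(y_true, y_naive, zero_division=0)

print("accuracy: {:.2f}".format(acc))
print("recall, class 0: {:.2f}  recall, class 1: {:.2f}".format(rec[0], rec[1]))
```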

As a baseline, let us train a RandomForest classifier using all variables in the unbalanced dataset. We reserve a random 20% of the data for validation (the default of `RandomSplitter`, which does not stratify on the target).

In [None]:
cont, cat = cont_cat_split(df, max_card=1, dep_var='Attrition_Flag')
procs = [Categorify, FillMissing]
to_rf1 = TabularPandas(df, procs, cat, cont, y_names=dep_var, y_block=CategoryBlock, splits=RandomSplitter()(range_of(len(df))))

train_X, train_y = to_rf1.train.xs, to_rf1.train.y
valid_X, valid_y = to_rf1.valid.xs, to_rf1.valid.y

rf1 = RandomForestClassifier(n_estimators=500, oob_score=True)
rf1.fit(train_X, train_y)
pred_y = rf1.predict(valid_X)

print("OOB score: {:.3f}\n".format(rf1.oob_score_))

print(classification_report(valid_y, pred_y))

sns.heatmap(confusion_matrix(valid_y, pred_y), annot=True, fmt='d')
plt.xlabel('Predicted flag')
plt.ylabel('True flag')
plt.title('Confusion Matrix -- RandomForest')

#ROC curve and AUC
probs = rf1.predict_proba(valid_X)
fpr, tpr, thr = roc_curve(valid_y, probs[:, 1])

fig, ax = plt.subplots(figsize=(7, 5))
plt.plot([0,1],[0,1],"k--")
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve - AUC={:.4f}".format(roc_auc_score(valid_y, probs[:, 1])))

As we see from the classification report, the recall for the positive level of the dependent variable is significantly lower than the overall accuracy. It could of course be increased by lowering the probability threshold of the binary classification, at the cost of reduced precision.
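The trade-off can be illustrated with a few hand-picked probabilities (made up for this sketch, not model output): lowering the threshold from 0.5 to 0.25 catches every positive but lets more false positives through.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted churn probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_churn = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.30, 0.45, 0.70, 0.90])

results = {}
for thr in (0.5, 0.25):
    # Predict class 1 whenever the probability reaches the threshold.
    y_pred = (p_churn >= thr).astype(int)
    results[thr] = (precision_score(y_true, y_pred), recall_score(y_true, y_pred))
    print("threshold {:.2f}: precision {:.3f}, recall {:.3f}".format(thr, *results[thr]))
```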

### 3.2 Dataset balancing <a id=3.2></a>
 
We can balance the dataset by adding synthetic data. We do so by using an upsampling algorithm like SMOTE or ADASYN. However, before upsampling we remove from the dataset a test set which we'll use for validation (and we repeat this over multiple folds). This is to prevent data leakage: we construct the synthetic datapoints (which will constitute almost half of the final training set) without them seeing the validation set, aiming for a more robust predictor.
We choose the probability threshold for predicting `Attrition_Flag` = 0/1 using ROC.
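The procedure described above can be sketched in a few lines under simplifying assumptions: the data is synthetic, the classifier is a logistic regression, and plain random oversampling (duplicating minority rows) stands in for SMOTE/ADASYN, which create new synthetic points instead. The helper names `oversample_minority` and `youden_threshold` are hypothetical. The threshold is the ROC point that maximizes TPR - FPR (Youden's J statistic).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random rows of class 1 until both classes have equal size.

    Plain random oversampling, used here as a stand-in for SMOTE/ADASYN.
    Assumes class 1 is the minority class.
    """
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]


def youden_threshold(y_true, scores):
    """Return the ROC threshold that maximizes TPR - FPR (Youden's J)."""
    fpr, tpr, thr = roc_curve(y_true, scores)
    return thr[np.argmax(tpr - fpr)]


# Synthetic, imbalanced data (roughly 16% positives), for illustration only.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.84], random_state=0)
rng = np.random.default_rng(0)

recalls = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in folds.split(X, y):
    # Oversample the training part only: the validation fold stays untouched.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], rng)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    scores = clf.predict_proba(X[valid_idx])[:, 1]
    thr = youden_threshold(y[valid_idx], scores)
    recalls.append(recall_score(y[valid_idx], (scores >= thr).astype(int)))

print("Mean recall for the positive class: {:.3f}".format(np.mean(recalls)))
```

As in the notebook, the threshold is picked on the validation fold itself, which makes the reported recall somewhat optimistic. Choosing it on a held-out part of the training data would be a stricter alternative.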

In [None]:
def test_model_oversamp(df, model, ovs_method, norm=False, n_iter=5, n_epochs=15, thr_mod=1):
    res = []
    roc = []
    procs=[Categorify, FillMissing]
    if norm:
        procs = procs + [Normalize]
    ind0 = df[df['Attrition_Flag'] == 0].index.to_list()
    ind1 = df[df['Attrition_Flag'] == 1].index.to_list()
    np.random.shuffle(ind0)
    np.random.shuffle(ind1)
    #Build n_iter validation folds: fold i holds 1/n_iter of the attrited customers and 1/(3*n_iter) of the existing ones
    test_chks = [ind1[i::n_iter] for i in range(n_iter)]
    train_chks = [ind0[i::3*n_iter] for i in range(n_iter)]
    for i in range(n_iter):
        #Validation fold
        test_idx = test_chks[i] + train_chks[i]
        test_idx.sort()
        #Merge the training folds
        test_set = set(test_idx)  #set membership is much faster than scanning a list
        train_idx = [a for a in range(len(df)) if a not in test_set]

        cont, cat = cont_cat_split(df, max_card=1, dep_var='Attrition_Flag')
        to = TabularPandas(df, procs, cat, cont, y_names='Attrition_Flag', y_block=CategoryBlock, splits=(train_idx, test_idx))
        train_X, train_y = to.train.xs, to.train.y
        test_X, test_y = to.valid.xs, to.valid.y

        #Upsampling of training set
        ovsX, ovsy = ovs_method.fit_resample(train_X, train_y) 
            
        #Neural Network setup using FastAI (the torch.nn module `nn` acts as a flag for this branch)
        if model is nn:
            ovs_concat = pd.concat([to.valid.items, pd.concat([ovsX, pd.DataFrame({'Attrition_Flag':ovsy})], axis=1)]).reset_index().drop('index', axis=1)
            dls = TabularDataLoaders.from_df(ovs_concat, y_names='Attrition_Flag', y_block=CategoryBlock, valid_idx=range(len(test_idx)), bs=512)
            metrics = RocAucBinary()
            learn = tabular_learner(dls, metrics=metrics)
            learn.fit_one_cycle(n_epochs, 1e-2)
            probs, _ = learn.get_preds()
            probs = np.array(probs)
        else:
            ovsX = np.array(ovsX)
            test_X = np.array(test_X)
            model.fit(ovsX, ovsy)
            probs = model.predict_proba(test_X)
        
        #Choose probability threshold using ROC
        fpr, tpr, thr = roc_curve(test_y, probs[:, 1])
        best_thr = thr[np.argmin(fpr - tpr)]

        #Predictions of the model
        preds = np.where(probs[:, 1] > best_thr/thr_mod, 1, 0)

        #Accuracy and precision/recall scores
        acc = accuracy_score(test_y, preds)
        prec_rec = precision_recall_fscore_support(test_y, preds)
        res.append((acc, prec_rec[:2]))
        roc.append(roc_auc_score(test_y, probs[:, 1]))
        print("Iteration {} concluded".format(i+1))
  
    accu = np.mean([a for a,_ in res])
    accu_std = np.std([a for a,_ in res])
    auc = np.mean(roc)
    prec0 = np.mean([pr[0][0] for _,pr in res])
    prec0_std = np.std([pr[0][0] for _,pr in res])
    prec1 = np.mean([pr[0][1] for _,pr in res])
    prec1_std = np.std([pr[0][1] for _,pr in res])
    rec0 = np.mean([pr[1][0] for _,pr in res])
    rec0_std = np.std([pr[1][0] for _,pr in res])
    rec1 = np.mean([pr[1][1] for _,pr in res])
    rec1_std = np.std([pr[1][1] for _,pr in res])
    print("\nMean AUC score: {:.3f}".format(auc))
    print("\nMean accuracy: {0:.3f} (std: {1:.3f})".format(accu, accu_std))
    print("\nCLASS 0: \n \t mean precision: {0:.3f} (std: {1:.3f}), \t mean recall: {2:.3f} (std: {3:.3f})".format(prec0, prec0_std, rec0, rec0_std))
    print("CLASS 1: \n \t mean precision: {0:.3f} (std: {1:.3f}), \t mean recall: {2:.3f} (std: {3:.3f})".format(prec1, prec1_std, rec1, rec1_std))
    return res, roc  

In [None]:
rf = RandomForestClassifier(n_estimators=300, max_features='sqrt')
ovs_method = ADASYN()
res, roc = test_model_oversamp(df, rf, ovs_method=ovs_method)

Pretty good! With upsampling, the recall for the positive class has jumped to 94%, with a still good overall accuracy.
Let us try using a different model, namely XGBoost.

In [None]:
model_xgb = xgb.XGBClassifier(use_label_encoder=False, max_depth=25, eval_metric='logloss')
ovs_method = ADASYN()
res, roc = test_model_oversamp(df, model_xgb, ovs_method=ovs_method)

We see that XGB gives better overall results than RandomForest. We perform a randomized search over a grid of hyperparameters to tune it.
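To give a sense of what a randomized search saves, the sketch below counts the combinations in the grid used in this notebook and draws the 10 candidates that `RandomizedSearchCV` evaluates by default (`n_iter=10`). The `random_state` value is an arbitrary choice for reproducibility.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the tuning cell.
param_tuning = {
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [5, 10, 25, 40],
    'subsample': [0.5, 0.7],
    'colsample_bytree': [0.5, 0.7],
    'n_estimators': [100, 200, 400],
}

# An exhaustive grid search would try every combination.
n_grid = len(ParameterGrid(param_tuning))

# A randomized search only draws n_iter of them.
sampled = list(ParameterSampler(param_tuning, n_iter=10, random_state=0))

print("Full grid: {} combinations, sampled: {}".format(n_grid, len(sampled)))
```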

In [None]:
to_all = TabularPandas(df, procs, cat, cont, y_names=dep_var, y_block=CategoryBlock)
train_X, train_y = to_all.train.xs, to_all.train.y

param_tuning = {
        'learning_rate': [0.01, 0.1, 0.5],
        'max_depth': [5, 10, 25, 40],
        'subsample': [0.5, 0.7],
        'colsample_bytree': [0.5, 0.7],
        'n_estimators' : [100, 200, 400]
        }

model= xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')

grid = RandomizedSearchCV(estimator=model, param_distributions=param_tuning)
grid.fit(train_X, train_y)

best_xgb = grid.best_estimator_

print(grid.best_params_)

ovs_method = ADASYN()
res, roc = test_model_oversamp(df, best_xgb, ovs_method=ovs_method)

### 3.3 Feature selection <a id=3.3></a>

Up until now we have worked with all features in the original dataset. It is worth checking whether we can restrict to a subset of features (thus improving the interpretability of the model) without losing prediction accuracy. To do so, we first train a RandomForest on the whole dataset and take a look at the corresponding feature importances.

In [None]:
to_all = TabularPandas(df, procs, cat, cont, y_names=dep_var, y_block=CategoryBlock)

train_X, train_y = to_all.train.xs, to_all.train.y

rf_all = RandomForestClassifier(n_estimators=500, max_features='sqrt', oob_score=True)
rf_all.fit(train_X, train_y)

fi = pd.DataFrame({'feature':train_X.columns, 'feature_imp': rf_all.feature_importances_}).sort_values('feature_imp', ascending=False)
plt.subplots(figsize=(13,8))
plt.barh(fi.feature, fi.feature_imp)
plt.title("Feature importance")

print("OOB score: {:.3f}\n".format(rf_all.oob_score_))

We notice that all categorical features have low importance. The most significant variables (namely `Total_Trans_Amt`, `Total_Trans_Ct`, `Total_Revolving_Bal` and `Total_Ct_Chng_Q4_Q1`) have been analyzed in the preliminary dataset exploration. Let's toss out the less important features and check how the OOB score changes.

In [None]:
feat_filt = fi.loc[fi['feature_imp'] > 0.02, 'feature'].to_list()
x_filt = train_X[feat_filt]

rf_filt = RandomForestClassifier(n_estimators=500, oob_score=True)
rf_filt.fit(x_filt, train_y)

print("OOB score: {:.3f}\n".format(rf_filt.oob_score_))
print("Number of features: {}".format(len(feat_filt)))

cluster_columns(x_filt)

Going from 20 features to just 14, the OOB score has actually slightly improved! Incidentally, we have tossed out all categorical features. Looking at the hierarchical feature clustering, it seems that we can also drop one of `Credit_Limit` and `Avg_Open_To_Buy` (we discard the latter, which has lower importance). We also try removing `CLIENTNUM`, which we expect to have no significant predictive value.
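The redundancy between `Credit_Limit` and `Avg_Open_To_Buy` follows from the data dictionary: the second is the credit limit minus the account balance. A rough sketch with synthetic values (the ranges are assumptions, and the revolving balance stands in for the account balance) shows how a rank correlation matrix exposes such near-duplicate features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic values, for illustration only (the ranges are assumptions).
credit_limit = rng.uniform(1500, 35000, size=n)
revolving_bal = rng.uniform(0, 1500, size=n)

toy = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Total_Revolving_Bal": revolving_bal,
    # Open to buy = credit limit - balance, as in the data dictionary.
    "Avg_Open_To_Buy": credit_limit - revolving_bal,
})

# Spearman (rank) correlation between every pair of columns.
rank_corr = toy.corr(method="spearman")
print(rank_corr.round(3))
```

A pair with rank correlation close to 1 carries nearly the same ordering information, so a tree-based model typically loses little when one of the two is dropped.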

In [None]:
x_filt2 = train_X[feat_filt].drop(['Avg_Open_To_Buy', 'CLIENTNUM'], axis=1)

rf_filt2 = RandomForestClassifier(n_estimators=500, oob_score=True)
rf_filt2.fit(x_filt2, train_y)

print("OOB score: {:.3f}\n".format(rf_filt2.oob_score_))

The OOB score has not decreased. We restrict to the remaining 12 features and check the upsampled XGB algorithm CV performance as before.

In [None]:
df_fin = df[feat_filt + ['Attrition_Flag']].drop(['Avg_Open_To_Buy', 'CLIENTNUM'], axis=1)

model = best_xgb
ovs_method = ADASYN()
res, roc = test_model_oversamp(df_fin, model, ovs_method=ovs_method)

The overall performance has improved passing from the original 20 features to the 12 most significant ones, reaching an **AUC score of 99.2% and a recall for the positive class of over 97%**.

### 3.4 Neural Networks? <a id=3.4></a> 

Let us try using a neural network for the classification model, both with the full set of features and with the filtered one. We train a two-layer NN for 15 epochs and cross-validate as before.

In [None]:
#All features
res, roc = test_model_oversamp(df, nn, norm=True, ovs_method=ADASYN())

In [None]:
#Selected features
res, roc = test_model_oversamp(df_fin, nn, norm=True, ovs_method=ADASYN())

As before, using only the 12 most important features gives better results, but the NN approach does not perform as well as XGB.