# Credit Card Customers - EDA and XGBoost

**Introduction**

This notebook uses data available at [https://www.kaggle.com/sakshigoyal7/credit-card-customers](https://www.kaggle.com/sakshigoyal7/credit-card-customers) and:

* Reads in the relevant data.
* Performs exploratory data analysis (EDA) to identify trends and estimate feature importance.
* Tests several classifier models on the data (spoiler alert - XGBoost wins).
* Attempts to tune the winning classifier to improve performance.
* Evaluates feature importance for the final model.
* Discusses insights and potential next steps.

The first step gets the data and some basic information, checking column data types assigned by Pandas and establishing whether there is missing data in any columns.  

In [None]:
import pandas as pd

#whilst not useful for analysis, the CLIENTNUM is taken as the index as, were the model productionised, it can be used as a key to match prediction to customer
#similarly, the original Naive Bayes classifier columns were included in the data, so these are removed
df = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv',index_col='CLIENTNUM').iloc[:,:-2]

print(df.shape)
print(df.info())

The dataframe has 10,127 rows and 20 columns, with *Attrition_Flag* the target variable. The majority of columns are numeric, but there are also object dtype columns, which hold the categorical variables. These columns will be re-encoded as new binary columns before building the model.
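
As a minimal sketch of the checks described above (missing values per column and which columns are non-numeric), the snippet below uses a small hypothetical frame rather than the real data; the values in `toy` are invented purely for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Customer_Age": [45, 52, None, 38],
    "Gender": ["M", "F", "F", "M"],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer"],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()

# List the non-numeric columns, i.e. the ones that would need re-encoding
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_cols)
```

Note that `isna()` only catches true nulls; a category that encodes "not known" as a string would not be counted and would need to be checked separately.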

Firstly, let's visualize the data in each column, grouping results by *Attrition_Flag*. I'll use boxplots for numeric columns and countplots for categorical columns.

In [None]:
from pandas.api.types import is_numeric_dtype
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

#we get our numeric columns and produce a 4x4 boxplot grid, removing empty plots
numeric_cols = df.select_dtypes('number').columns.tolist()

fig,ax = plt.subplots(4,4,figsize=(20,15))
ax = ax.flatten()
for i,col in enumerate(numeric_cols):
    boxplot = sns.boxplot(x=df['Attrition_Flag'],y=df[col],ax=ax[i],order=['Attrited Customer','Existing Customer'])
    boxplot.set_title(col)
#there are 14 columns, so plots 15 & 16 are empty and can be removed
fig.delaxes(ax[14])
fig.delaxes(ax[15])
#tidy the layout once, after all plots are drawn
plt.tight_layout()

Looking at the numeric columns, we can see that:
* Existing customers make more transactions (count and amounts) and tend to have more relationships with the bank versus leavers.
* Existing customers have higher revolving balances and credit utilisation percentages.
* Leavers ('Attrited Customer') tend to have been inactive more often and contacted the bank more frequently.
* Certain columns (e.g. Customer Age) appear to be similar across the two groups.

A key assumption made at this point is that, for leavers, any column relating to a time period (e.g. months inactive) reflects their tenure rather than a fixed window. If the latter is true, differences in these columns may be due to timing (for instance, if months inactive counts the last twelve months, and a customer left 8 months ago, their inactivity would be high by default).

Before moving on to categorical columns, I'll look at a correlation matrix to understand which numeric features may be correlated with one another. Visualizing this will be useful for any attempts to reduce the number of features in the model without compromising significantly on accuracy.

In [None]:
import numpy as np
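#restrict to numeric columns, as correlation is undefined for the object dtype columns
corr_matrix = df.select_dtypes('number').corr()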
mask = np.triu(np.ones_like(corr_matrix))
plt.figure(figsize=(25,20))
colmap = sns.diverging_palette(150, 150, s=100, as_cmap=True)
sns.heatmap(corr_matrix,cmap=colmap,mask=mask,center=0,annot=True,fmt=".2f")

In [None]:
#now we get the category columns (minus Attrition_Flag) and build a 2x3 stacked bar chart grid, removing empty plots
#Attrition_Flag is the first object column, so slicing from index 1 leaves it out of category_cols
category_cols = df.select_dtypes('object').columns.tolist()[1:]

fig,ax = plt.subplots(2,3,figsize=(20,15))
ax = ax.flatten()
for i,col in enumerate(category_cols):
    #share of each category within each customer group (each row sums to 1)
    pivot = df.pivot_table(index='Attrition_Flag',columns=col,aggfunc='size')
    pivot = pivot.div(pivot.sum(axis=1),axis=0)
    graph = pivot.plot(kind='bar',stacked=True,ax=ax[i],rot=0)
    graph.yaxis.set_major_formatter(ticker.PercentFormatter(xmax=1,decimals=1))
    graph.set_ylabel(col)

#there are 5 columns, so plot 6 is empty and can be removed
fig.delaxes(ax[5])
plt.tight_layout()

Here I used a stacked countplot to compare the distribution of categories across the two customer groups. We see that the distribution of the categorical variables across our two customer groups is reasonably close for all columns. This suggests they may be less valuable in building a model (because any differences are less clear-cut).
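
To put a number on how close the distributions are, the same within-group shares can be computed with `pd.crosstab`. This is a minimal sketch on a hypothetical toy frame (the rows are invented), not a result from the real data; `max_gap` is an illustrative name for the largest difference in share between the two customer groups.

```python
import pandas as pd

# Hypothetical toy data: 3 leavers and 4 existing customers
toy = pd.DataFrame({
    "Attrition_Flag": ["Attrited Customer"] * 3 + ["Existing Customer"] * 4,
    "Gender": ["F", "F", "M", "F", "M", "F", "M"],
})

# Share of each category within each customer group (rows sum to 1)
shares = pd.crosstab(toy["Attrition_Flag"], toy["Gender"], normalize="index")

# Largest absolute difference in share between the two groups
max_gap = (shares.loc["Attrited Customer"] - shares.loc["Existing Customer"]).abs().max()

print(shares)
print(round(max_gap, 3))
```

A small gap for every category is consistent with the column carrying little information about the target, though it is only a rough indicator.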

In the next step, let's split out the target and predictor columns, and re-encode categorical columns as boolean ready for modelling.

In [None]:
#split into target and predictor variables
#the target variable is first re-encoded to 0 (stayed) and 1 (left/lost), as some of the models don't support non-numeric labels
df['Attrition_Flag'] = df['Attrition_Flag'].map({'Attrited Customer':1,'Existing Customer':0})
target=df['Attrition_Flag']
predictor = df.drop(['Attrition_Flag'],axis=1)

predictor = pd.get_dummies(predictor,drop_first=True)
print(target.shape)
print(predictor.shape)

Re-encoding the categorical columns has increased the number of predictor columns from 19 to 32 (the target column is held separately).
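
The growth in column count follows from how `pd.get_dummies` works: with `drop_first=True`, a categorical column with *k* distinct values becomes *k* - 1 binary columns. A minimal sketch on a hypothetical three-row frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two categorical columns
toy = pd.DataFrame({
    "Credit_Limit": [1500.0, 3200.0, 8000.0],
    "Gender": ["M", "F", "F"],                  # 2 categories -> 1 dummy column
    "Card_Category": ["Blue", "Silver", "Gold"] # 3 categories -> 2 dummy columns
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded.shape)
```

Dropping the first level avoids a redundant column, since its value is implied when all the other dummies for that variable are zero.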

In the final steps before creating a model, let's use scikit-learn's feature selection module to assess the estimated feature importance of the predictor variables: which features does it expect will most and least influence the model?

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

plt.figure(figsize=(15,8))
fmodel = SelectKBest(score_func=mutual_info_classif,k='all')
fmodel.fit(predictor,target)
featureimportance = pd.DataFrame(fmodel.scores_,index=predictor.columns,columns=['score']).sort_values(by=['score'])
plt.barh(featureimportance.index,featureimportance['score'])
tickvals = plt.gca().get_xticks().tolist()
plt.gca().xaxis.set_major_locator(ticker.FixedLocator(tickvals))
plt.gca().set_xticklabels(["{:,.1%}".format(x) for x in tickvals])
plt.tight_layout()

From this graph we can see that Total Transaction Amount is expected to most influence the model, and the columns with the highest influence are all numeric columns, which supports the graphs seen above.

Note that the two features deemed most important for predicting churn (Total Transaction Amount / Total Transaction Count) are also highly correlated, as seen above. It may be possible to remove either of these features without losing accuracy in the model (though dimension reduction techniques won't be used here).
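
One way to list such pairs programmatically is to scan the upper triangle of the absolute correlation matrix. The sketch below runs on synthetic data generated for illustration; the column names are borrowed from the dataset only for readability, and the 0.8 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: the first two columns are built to be strongly correlated
rng = np.random.default_rng(0)
count = rng.normal(size=200)
toy = pd.DataFrame({
    "Total_Trans_Ct": count,
    "Total_Trans_Amt": count * 2 + rng.normal(scale=0.1, size=200),
    "Customer_Age": rng.normal(size=200),
})

# Keep only the upper triangle so each pair is reported once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Threshold of 0.8 is an assumption chosen for this sketch
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```

Each pair flagged this way is a candidate for dropping one of the two features, subject to checking the effect on model performance.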

Conversely, categorical columns are largely expected to have little influence on the model, which again supports the results of our countplot grid.

I'll revisit the feature importance graph later with the final model: for now it's time to split the data into training, test and holdout sets. The data description noted a significant class imbalance in the target variable, with leavers accounting for just 16% of records, so the splits are stratified on the target to keep that proportion the same in each set.

For this model, I'll train against ~80% of records, with test and holdouts being 15% and ~5% respectively.

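The predictor columns will also be scaled after splitting the data. Standardization rescales each column to zero mean and unit variance, which shrinks large absolute values while preserving the shape of each column's distribution and the relative distances between data points. This matters most for distance-based and gradient-based models such as K-Nearest Neighbors and logistic regression; tree-based models are largely insensitive to it.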
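
As a minimal sketch of what standardization does, the snippet below applies `StandardScaler` to a tiny hypothetical array. Fitting the scaler on the training rows only and reusing it for the test rows is a common convention and is an assumption of this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: a large-scale column and a ratio column
train = np.array([[1000.0, 0.1], [3000.0, 0.5], [5000.0, 0.9]])
test = np.array([[3000.0, 0.5]])

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)

# Apply the same transformation to both sets
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, both training columns are on the same footing even though their original ranges differed by several orders of magnitude.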

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#build into stratified train/test/holdout splits (to account for target imbalance)
#first hold back 15% of all records as the test set
X_train,X_test,y_train,y_test = train_test_split(predictor,target,test_size=0.15,random_state=1909,stratify=target)
#then split the remaining 85% into train (95% of it) and holdout (5% of it), so the three sets don't overlap
X_train,X_holdout,y_train,y_holdout = train_test_split(X_train,y_train,train_size=0.95,random_state=1909,stratify=y_train)

#now the data is scaled: the scaler is fitted on the training set only and then applied to all three sets
scaler = StandardScaler().fit(X_train)
def scale(data):
    return pd.DataFrame(scaler.transform(data),columns=data.columns,index=data.index)
X_train,X_test,X_holdout = scale(X_train),scale(X_test),scale(X_holdout)

Now onto model building. Firstly, it's worth trying out several different models with 'cookie cutter' (no hyperparameters tuned, seed set if applicable for reproducible results) setups.

The next code block:
* imports the required packages
* loops over each model, fitting it to our training data and predicting the target column for our test and holdout data
* evaluates the results using the classification_report function, storing them in a DataFrame and plotting them

In [None]:
#now we initialise our models, preparing for a loop
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.metrics import classification_report

#we make a list of (model name, model) pairs
modellist = [('KNN',KNeighborsClassifier()),
             ('Random Forest',RandomForestClassifier(random_state=1909)),
             ('Logistic Regression',LogisticRegression()),
             ('XGBoost',xgb.XGBClassifier(objective='binary:logistic',verbosity=0,use_label_encoder=False,random_state=1909))]

#create an empty list
base = []

#loop through the models
for name, model in modellist:
    model.fit(X_train,y_train)
    predictions = model.predict(X_test)
    predictions_holdout = model.predict(X_holdout)
    linked_datasets = [(y_test,predictions,'test'),(y_holdout,predictions_holdout,'holdout')]
    for actual,predicted,split in linked_datasets:
        #We keep the results for our key class (Attrited Customers / 1) only, slicing [1:2]. All models perform reasonably well at predicting customers who stay
        results_table = pd.DataFrame(classification_report(actual,predicted,output_dict=True)).transpose()[1:2]
        results_table['model'] = name
        results_table['split'] = split
        base.append(results_table)

results = pd.concat(base).round(decimals=2)
print(results)

fig,ax = plt.subplots(nrows=1,ncols=2,sharey=True,figsize=(12,8))
for i,split in enumerate(('test','holdout')):
    sns.barplot(data=results[results['split']==split],x='model',y='recall',ax=ax[i])
    ax[i].set_title('Recall scores ('+split+')')
    #format the y axis as percentages on each subplot
    ax[i].yaxis.set_major_formatter(ticker.PercentFormatter(xmax=1,decimals=1))
plt.tight_layout()

Of the four models used, XGBoost is a clear winner (and K-Nearest Neighbors performed particularly poorly). 
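
The models are compared on recall for the positive class, i.e. the share of actual leavers that the model correctly flags: recall = TP / (TP + FN). A minimal sketch on hypothetical labels (invented for illustration) shows the calculation by hand alongside scikit-learn's `recall_score`:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = left the bank, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: of the customers who actually left, how many were caught?
manual_recall = tp / (tp + fn)
library_recall = recall_score(y_true, y_pred)

print(manual_recall, library_recall)
```

Recall is a natural metric here because missing a leaver (a false negative) is likely to cost more than wrongly flagging a customer who stays.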

The next step is to see whether the performance can be improved further by tweaking some of the hyperparameters used for the XGBoost model. For this I'll take advantage of XGBoost's scikit-learn API to perform grid search cross-validation (GridSearchCV).

In [None]:
import numpy as np
from sklearn.model_selection import GridSearchCV
#initialise our winning model, using the GPU histogram tree method (which supports gradient-based sampling)
xgbmodel = xgb.XGBClassifier(random_state=1909,
                             use_label_encoder=False,
                             verbosity=0,
                             objective='binary:logistic',
                             tree_method='gpu_hist'
                             )

#set up hyperparameters to tune - focus is on reducing overfitting to improve generalization without a penalty to performance on test/holdout datasets 
parameters = {"learning_rate":np.arange(0.1,0.8,0.1),
              "max_depth":np.arange(2,9,1)}
#perform 3-fold cross validation (cv=3), using recall as the scoring method
xgb_crossval = GridSearchCV(estimator=xgbmodel, param_grid=parameters,cv=3,scoring='recall',n_jobs=-1)
xgb_crossval.fit(X_train,y_train)
print("The best parameters are: ", xgb_crossval.best_params_)
print('The recall score for the best parameters is {0:.3f}'.format(xgb_crossval.best_score_))

The cross-validated model is now used to make predictions on the test and holdout sets.

In [None]:
tuned_predictions = xgb_crossval.predict(X_test)
tuned_predictions_holdout = xgb_crossval.predict(X_holdout)

print(classification_report(y_test,tuned_predictions))
print(classification_report(y_holdout,tuned_predictions_holdout))


The scores on both sets are lower than those of the untuned model. Given the high performance of the untuned model, and that cross-validation selects the hyperparameters that generalize best across folds of the training data rather than those that suit these particular test and holdout sets, this is not unexpected. It also points to different patterns within the data across our subsets: the default hyperparameters of the untuned model (max_depth=6 and learning_rate=0.3) were within the ranges searched, but were not selected as the best combination.
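
To see how a particular hyperparameter combination fared during the search, rather than only the winner, the fitted search object exposes `cv_results_`. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` and a scikit-learn decision tree in place of XGBoost to keep it dependency-light, so the scores bear no relation to the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (roughly 16% positives) for illustration only
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.84, 0.16], random_state=0)

# A decision tree stands in for XGBoost here (an assumption of this sketch)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 6]},
                      cv=3, scoring="recall")
search.fit(X, y)

# One row per combination, best-ranked first
cv_table = (pd.DataFrame(search.cv_results_)
            [["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]]
            .sort_values("rank_test_score"))
print(cv_table)
```

Comparing the mean and standard deviation of the top few rows shows whether the selected combination is clearly better or only marginally ahead of the alternatives.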

As both have high performance versus the original Naive Bayes classifier, either could be productionised. Because it involves fewer steps, the final steps will use the original (untuned) model, which had a score of 96% on the test set and 93% on the holdout set.

Firstly, the graph below shows the final tree of the boosted ensemble and the values it contributes to predictions. Some points to note:
* The values used to split branches are based on the scaled data, not the actual data.
* Each data point ends up in exactly one leaf of the tree. The leaf value is a raw score rather than a class label: the scores from all trees in the ensemble are summed and passed through a logistic function to give the predicted probability.
* Leaf values are higher for outcomes of the 'positive' class (here 'Attrited Customer').
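
To make the link between leaf values and predictions concrete: with the `binary:logistic` objective, the leaf values a data point reaches in each tree are summed into a raw margin, and the logistic (sigmoid) function turns that margin into a probability. The function name `leaf_sum_to_probability`, the `base_margin` argument and the example leaf values below are hypothetical, chosen only for this sketch.

```python
import math

def leaf_sum_to_probability(leaf_values, base_margin=0.0):
    """Convert summed leaf values (log-odds) into a probability.

    base_margin is an assumed starting score of 0.0, i.e. a prior
    probability of 0.5; a real model may use a different value.
    """
    margin = base_margin + sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical leaf values reached by one customer across three trees
likely_leaver = leaf_sum_to_probability([0.4, 0.3, -0.1])
likely_stayer = leaf_sum_to_probability([-0.5, -0.2, 0.1])

print(likely_leaver, likely_stayer)
```

A positive total pushes the probability above 0.5 (predicted to leave) and a negative total pushes it below 0.5 (predicted to stay), which is why a single plotted tree shows only part of the picture.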

In [None]:
#fit the xgbmodel and plot the final tree
xgbmodel.fit(X_train,y_train)
fig,ax = plt.subplots(figsize=(25,25))
finaltree = xgb.plot_tree(xgbmodel,num_trees=-1,rankdir="LR",ax=ax)

In the final tree we can see that the model heavily relies on numeric features, with Gender and Education Level being the only categorical fields used in any branch.

We'll also revisit the feature importance graph from earlier, seeing how the 'winning' model measured importance compared with the estimates earlier in the process.

In [None]:
fig,ax = plt.subplots(figsize=(15,8))
#collect the model's feature importances, sorted so the most important feature is plotted at the top
importance = pd.DataFrame(xgbmodel.feature_importances_,index=predictor.columns,columns=['Importance']).sort_values(by='Importance')
importance_plot = importance.plot(kind='barh',ax=ax,legend=False)
importance_plot.xaxis.set_major_formatter(ticker.PercentFormatter(xmax=1,decimals=1))
plt.tight_layout()

The final model's feature importance graph broadly maps back to the original graph, though some of the most important features have moved up or down the ranking (e.g. Total Transaction Amount has dropped in importance, though Total Transaction Count remains high).

At this point, a high-performing, production-ready model is available to be deployed. Revisiting the original question, what can be drawn from the EDA and modelling process that translates into actionable insight for the business stakeholder?

* Some of the most important model features for predicting customer churn can't be directly influenced (e.g. Total Transaction Count) - knowing who will leave doesn't necessarily mean that process can be halted.
* That said, initiatives could be launched that reward use of the card, which may increase customers' incentive to use it (which should translate into them being more likely to remain with the bank).
* The EDA also noted that customers with a higher number of relationships with the bank are more likely to stay: is there scope to contact credit card customers and offer them more products?


In terms of the model, the following next steps are suggested:
* If more records are available (it's unknown whether the 10.1k sample is 5%/25%/75% etc. of the bank's total customer count), test model performance on the remaining customers.
* Productionize the model and set up periodic reviews.
* Look at options to reduce unnecessary features and/or engineer new features that could support model accuracy. This may involve building a 'challenger' model to compete with the 'champion' model.
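
As a rough sketch of how a champion/challenger comparison might look, the snippet below trains two scikit-learn models on synthetic data and promotes the challenger only if its holdout recall beats the champion's by a set margin. The choice of models, the synthetic data and the `MIN_UPLIFT` threshold are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for customer records
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.84, 0.16], random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hypothetical roles: the current model and a candidate replacement
champion = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
challenger = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)

# Score both on the same holdout set, using recall as in the rest of the notebook
champion_recall = recall_score(y_hold, champion.predict(X_hold))
challenger_recall = recall_score(y_hold, challenger.predict(X_hold))

# Hypothetical promotion rule: require a minimum uplift before switching
MIN_UPLIFT = 0.02
winner = "challenger" if challenger_recall - champion_recall >= MIN_UPLIFT else "champion"
print(champion_recall, challenger_recall, winner)
```

Requiring a minimum uplift guards against swapping models on the strength of noise in a single holdout sample.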