# Credit Card Default Predictive Modelling

# Content
* Introduction
* Preparation
* Exploratory Data Analysis
* Modelling & Evaluation
* Conclusions
* Reference

# Introduction

This kernel practices data analysis and machine learning techniques. The aim is to predict the default of credit card clients using several classification models and to compare their performance.

The best-performing model is then identified; optimizing it with feature engineering and parameter tuning is left as further work.

# Preparation

### About the dataset

The dataset comes from the UCI Machine Learning Repository. It contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

There are 25 variables, including 24 predictor variables and 1 target variable, as follows:
* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* <font color = 'blue'>default.payment.next.month: Default payment (1=yes, 0=no)   — Target Variable</font>

In [None]:
# Loading packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Exploratory Data Analysis

Check the first 5 rows of the dataset 

In [None]:
# import csv
df = pd.read_csv('../input/UCI_Credit_Card.csv')
df.head()

Check the number of rows and columns.

In [None]:
df.shape

Check whether there is missing data in each column.

In [None]:
# Checking missing data
df.isnull().sum()

Check the summary for each feature (column).

In [None]:
# Check the summary for each feature
df.describe().transpose()

The target variable is "default.payment.next.month". There are 24 predictors. 

First, check the data imbalance for the target, "Default" and "No Default" classes.

In [None]:
# Print the class counts (a bare expression in the middle of a cell is not displayed)
print(df['default.payment.next.month'].value_counts())
plt.title('Default Payment Next Month - data imbalance check')
ax1 = sns.countplot(x= 'default.payment.next.month', data = df)
ax1.set_xticklabels(['No Default','Default'])
plt.show()

This is a binary classification problem. About <font color = 'blue'>22%</font> of the clients belong to the "Default" class, so the classes are moderately imbalanced, but not severely so.
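
The class share can also be read off numerically rather than from the plot. The snippet below is a minimal sketch on a made-up ten-row target series (an assumption for illustration, not the real data); in the notebook, `df['default.payment.next.month']` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for the target column; in the notebook this would be
# df['default.payment.next.month'].
target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
                   name='default.payment.next.month')

# Share of each class (fractions that sum to 1)
class_share = target.value_counts(normalize=True)
print(class_share)

# Ratio of the majority class to the minority class
imbalance_ratio = class_share.max() / class_share.min()
print('majority : minority =', round(imbalance_ratio, 2))
```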

Then, let's take a look at how different predictors affect our target.

In [None]:
# Education Distribution
plt.title('Education Distribution')
ax2 = sns.countplot(x= 'EDUCATION', hue = 'default.payment.next.month', data = df)
ax2.set_xticklabels(['Unknown','graduate school','university','high school','others','unknown','unknown'],rotation = 90)
plt.show()

From the plot above, we can see that most defaulters have a graduate school, university or high school education. Among them, clients with a university degree account for the largest number of defaults. Since the plot shows counts, this partly reflects how many clients are in each group.

In [None]:
# SEX distribution
plt.title('Sex Distribution')
ax3 = sns.countplot(x= 'SEX', hue = 'default.payment.next.month', data = df)
ax3.set_xticklabels(['Male','Female'])
plt.show()

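There are more female defaulters than male defaulters in absolute numbers. Because the plot shows counts and the two groups differ in size, it does not by itself show which group has the higher probability of default.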
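
A count plot shows absolute numbers, so groups of different sizes are easier to compare through a default rate per group. The snippet below is a minimal sketch on made-up toy data (the rows are an assumption, not taken from the dataset); the same `groupby` would apply to `df` with `SEX` or `EDUCATION` as the grouping column.

```python
import pandas as pd

# Made-up toy data, not the real dataset:
# 6 clients with SEX=2 (2 defaults) and 4 clients with SEX=1 (2 defaults)
toy = pd.DataFrame({
    'SEX': [2, 2, 2, 2, 2, 2, 1, 1, 1, 1],
    'default.payment.next.month': [1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
})

# Group size, number of defaults and default rate per group
summary = (toy.groupby('SEX')['default.payment.next.month']
              .agg(['count', 'sum', 'mean']))
summary.columns = ['clients', 'defaults', 'default_rate']
print(summary)
```

In this toy example both groups have the same number of defaults, yet the smaller group has the higher default rate, which a count plot alone would hide.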

In [None]:
# Age Distribution
plt.title('Age Distribution \n Default(Red) vs. No Default(Grey)')
agedist0 = df[df['default.payment.next.month']==0]['AGE']
agedist1 = df[df['default.payment.next.month']==1]['AGE']
sns.distplot(agedist0, bins = 100, color = 'grey')
sns.distplot(agedist1, bins = 100, color = 'red')
plt.show()

The probability of default increases as age rises to about 30. For clients over 30, the probability decreases with age.

In [None]:
# Credit Amount Distribution
plt.title('Credit Amount Distribution \n Default(Red) vs. No Default(Grey)')
cadist0 = df[df['default.payment.next.month']==0]['LIMIT_BAL']
cadist1 = df[df['default.payment.next.month']==1]['LIMIT_BAL']
sns.distplot(cadist0, bins = 100, color = 'grey')
sns.distplot(cadist1, bins = 100, color = 'red')
plt.xlabel('Credit Limit')
plt.show()

Clients with lower credit limits tend to default more often, especially those with a credit limit of around 50,000 NT dollars.
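
To back up a reading like this with numbers, a continuous variable can be binned and the default rate computed per bin. The snippet below is a minimal sketch on made-up toy rows; the bin edges and labels are arbitrary assumptions chosen for illustration. The same approach would work for `AGE`.

```python
import pandas as pd

# Made-up toy rows, not the real dataset
toy = pd.DataFrame({
    'LIMIT_BAL': [20000, 50000, 50000, 80000, 120000,
                  150000, 200000, 300000, 360000, 500000],
    'default.payment.next.month': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Arbitrary, hypothetical bin edges (right-inclusive intervals)
bins = [0, 100000, 250000, 1000000]
labels = ['<=100k', '100k-250k', '>250k']
toy['limit_band'] = pd.cut(toy['LIMIT_BAL'], bins=bins, labels=labels)

# Default rate within each credit limit band
rate_by_band = (toy.groupby('limit_band', observed=True)
                   ['default.payment.next.month'].mean())
print(rate_by_band)
```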

# Modelling & Evaluation

### Modelling preparation

First, divide the features into predictors (X) and target (Y) before fitting the models.

In [None]:
# Define predictor and target variables with X and Y
X = df.columns[:24]
Y = df.columns[-1]

Then, split the dataset into train and test sets.

In [None]:
# training and test dataset split, leaving 30% as test set
x_train, x_test, y_train, y_test = train_test_split(df[X],df[Y], 
                                                    test_size = .3, shuffle = True, random_state = 0)

In [None]:
# Check the shapes of the train and test sets
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
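
Because the target classes are imbalanced, one optional variation is to keep the class proportions identical in both subsets via the `stratify` argument of `train_test_split`. This is not what the split above does; the snippet below is only a sketch on synthetic stand-in arrays (`X_demo` and `y_demo` are hypothetical names) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# Same settings as the split above, plus stratification on the target
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True,
    random_state=0, stratify=y_demo)

# Both subsets keep roughly the original share of positives
print('train default share:', y_tr.mean())
print('test default share: ', y_te.mean())
```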

### Logistic Regression

In [None]:
clfLR = LogisticRegression(solver = 'lbfgs',
                           max_iter = 500,
                           random_state = 0)

clfLR.fit(x_train,y_train)

predLR = clfLR.predict(x_test)

In [None]:
# Cross Validation
cross_val_score_LR = cross_val_score(clfLR, x_test, y_test, cv = 10)
print('cross_val_score: ',cross_val_score_LR.mean().round(2))

# Precision Score
print('precision score is ',precision_score(y_test, predLR).round(2))

# Recall Score
print('recall_score is ',recall_score(y_test, predLR).round(4))
# F1 Score
print('f1 score is ',f1_score(y_test, predLR).round(3))

# ROC_AUC
print('ROC AUC is ',roc_auc_score(y_test, predLR).round(2))
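
The ROC AUC above is computed from hard 0/1 predictions. `roc_auc_score` also accepts continuous scores such as predicted probabilities, which lets it rank clients across all thresholds instead of a single one. The snippet below is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the credit card data) that computes both variants side by side.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 22% positives
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.78, 0.22], random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

clf = LogisticRegression(solver='lbfgs', max_iter=500, random_state=0)
clf.fit(x_tr, y_tr)

# AUC from hard class labels (a single operating point)
auc_from_labels = roc_auc_score(y_te, clf.predict(x_te))

# AUC from the predicted probability of the positive class
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1])

print('ROC AUC from labels:       ', round(auc_from_labels, 3))
print('ROC AUC from probabilities:', round(auc_from_scores, 3))
```

For `SVC`, `decision_function` could supply the scores, since `predict_proba` is only available when the model is created with `probability = True`.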

### Support Vector Machine

In [None]:
clfSVC = SVC(kernel = 'rbf',
             gamma = 'scale',
             random_state = 0)

clfSVC.fit(x_train,y_train)

predSVC = clfSVC.predict(x_test)

In [None]:
# Cross Validation
cross_val_score_SVC = cross_val_score(clfSVC, x_test, y_test, cv = 10)
print('cross_val_score: ',cross_val_score_SVC.mean().round(2))

# Precision Score
print('precision score is ',precision_score(y_test, predSVC).round(2))

# Recall Score
print('recall_score is ',recall_score(y_test, predSVC).round(4))
# F1 Score
print('f1 score is ',f1_score(y_test, predSVC).round(3))

# ROC_AUC
print('ROC AUC is ',roc_auc_score(y_test, predSVC).round(2))

### K Nearest Neighbors

In [None]:
clfKNN = KNeighborsClassifier(n_neighbors = 3)
clfKNN.fit(x_train,y_train)

predKNN = clfKNN.predict(x_test)

In [None]:
# Cross Validation
cross_val_score_KNN = cross_val_score(clfKNN, x_test, y_test, cv = 10)
print('cross_val_score: ',cross_val_score_KNN.mean().round(2))

# Precision Score
print('precision score is ',precision_score(y_test, predKNN).round(2))

# Recall Score
print('recall_score is ',recall_score(y_test, predKNN).round(4))
# F1 Score
print('f1 score is ',f1_score(y_test, predKNN).round(3))

# ROC_AUC
print('ROC AUC is ',roc_auc_score(y_test, predKNN).round(2))

### Random Forest

In [None]:
clfRF = RandomForestClassifier(criterion = 'gini',
                              n_estimators = 100,
                              verbose = False,
                              random_state = 0)

clfRF.fit(x_train,y_train)

predRF = clfRF.predict(x_test)

In [None]:
# Cross Validation
cross_val_score_RF = cross_val_score(clfRF, x_test, y_test, cv = 10)
print('cross_val_score: ',cross_val_score_RF.mean().round(2))

# Precision Score
print('precision score is ',precision_score(y_test, predRF).round(2))

# Recall Score
print('recall_score is ',recall_score(y_test, predRF).round(4))
# F1 Score
print('f1 score is ',f1_score(y_test, predRF).round(3))

# ROC_AUC
print('ROC AUC is ',roc_auc_score(y_test, predRF).round(2))

### XGBoost

In [None]:
clfXGB = xgb.XGBClassifier()
clfXGB.fit(x_train,y_train)
predXGB = clfXGB.predict(x_test)

In [None]:
# Cross Validation
cross_val_score_XGB = cross_val_score(clfXGB, x_test, y_test, cv = 10)
print('cross_val_score: ',cross_val_score_XGB.mean().round(2))

# Precision Score
print('precision score is ',precision_score(y_test, predXGB).round(2))

# Recall Score
print('recall_score is ',recall_score(y_test, predXGB).round(4))
# F1 Score
print('f1 score is ',f1_score(y_test, predXGB).round(3))

# ROC_AUC
print('ROC AUC is ',roc_auc_score(y_test, predXGB).round(2))

### LightGBM

In [None]:
clfLGB = LGBMClassifier(n_estimators = 100,
                        learning_rate = .2,
                        random_state = 0)

clfLGB.fit(x_train,y_train)

predLGB = clfLGB.predict(x_test)

In [None]:
# Cross Validation
cross_val_score_LGB = cross_val_score(clfLGB, x_test, y_test, cv = 10)
print('cross_val_score: ',cross_val_score_LGB.mean().round(2))

# Precision Score
print('precision score is ',precision_score(y_test, predLGB).round(2))

# Recall Score
print('recall_score is ',recall_score(y_test, predLGB).round(4))
# F1 Score
print('f1 score is ',f1_score(y_test, predLGB).round(3))

# ROC_AUC
print('ROC AUC is ',roc_auc_score(y_test, predLGB).round(2))

### CatBoost

In [None]:
clfCB = CatBoostClassifier(iterations = 100,
                           learning_rate = .2,
                           depth = 5,
                           eval_metric = 'AUC',
                           random_seed = 0)

clfCB.fit(x_train,y_train)

predCB = clfCB.predict(x_test)

In [None]:
# Cross Validation
cross_val_score_CB = cross_val_score(clfCB, x_test, y_test, cv = 10)
print('cross_val_score: ',cross_val_score_CB.mean().round(2))

# Precision Score
print('precision score is ',precision_score(y_test, predCB).round(2))

# Recall Score
print('recall_score is ',recall_score(y_test, predCB).round(4))
# F1 Score
print('f1 score is ',f1_score(y_test, predCB).round(3))

# ROC_AUC
print('ROC AUC is ',roc_auc_score(y_test, predCB).round(2))

### Confusion Matrix

In [None]:
# Confusion Matrix
cmLR = confusion_matrix(y_test, predLR)
cmSVC = confusion_matrix(y_test, predSVC)
cmKNN = confusion_matrix(y_test, predKNN)
cmRF = confusion_matrix(y_test, predRF)
cmXGB = confusion_matrix(y_test, predXGB)
cmLGB = confusion_matrix(y_test, predLGB)
cmCB = confusion_matrix(y_test, predCB)

# Confusion Matrix List
cmList = [cmLR, cmSVC, cmKNN, cmRF, cmXGB, cmLGB, cmCB]
cmTitle = ['Logistic Regression','Support Vector Machines','K Nearest Neighbors','Random Forest','XGBoost','LightGBM','CatBoost']

# One panel per model on a 2 x 4 grid
fig, axes = plt.subplots(2, 4, figsize = (30,10))
for ax, cm, title in zip(axes.flat, cmList, cmTitle):
    ax.set_title(title)
    # fmt = 'd' prints the counts as plain integers
    sns.heatmap(cm, annot = True, fmt = 'd', cmap = 'YlGnBu', ax = ax)

# Hide the unused eighth panel
axes.flat[-1].axis('off')
plt.show()

From the confusion matrices above, it is observed that:
* Logistic Regression has no false positives but the most false negatives. It predicts almost every client as "No Default", which points to underfitting.
* The ensemble models find more true positives than the other models.
* Which gradient boosting model to select for further work depends on the cost of each error type (cost of a false positive vs. cost of a false negative).
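
The cost-based choice in the last point can be sketched as a weighted sum of the two error counts. In the snippet below, the confusion matrices and both cost values are hypothetical numbers chosen for illustration. They are not results from this kernel, and real costs would have to come from the business side.

```python
import numpy as np

# Hypothetical confusion matrices in scikit-learn layout: [[TN, FP], [FN, TP]]
cm_a = np.array([[900, 100], [150, 150]])
cm_b = np.array([[950, 50], [200, 100]])

# Assumed costs per error: a missed default (FN) is taken to be
# five times as costly as a false alarm (FP)
COST_FP = 1.0
COST_FN = 5.0

def total_cost(cm, cost_fp=COST_FP, cost_fn=COST_FN):
    """Return the total cost of the errors in a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    return fp * cost_fp + fn * cost_fn

for name, cm in [('model A', cm_a), ('model B', cm_b)]:
    print(name, 'total cost:', total_cost(cm))
```

Under these assumed costs, model A would be preferred even though it produces more false positives, because it misses fewer defaults.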

# Conclusions

* Following the machine learning pipeline, we analyzed the distributions of selected features (both predictors and target), built models and evaluated the performance of each model.
* Seven different models are used: logistic regression, support vector machine, K nearest neighbors, random forest, XGBoost, LightGBM and CatBoost.
* The models are evaluated with cross validation, precision, recall, F1 score, ROC AUC and confusion matrices.
* With mostly default parameters in all models, the gradient boosting models outperform the others, and CatBoost performs best among them.



### Further Work
* Feature engineering is not applied in this kernel. In a real business case, it is better to work with the relevant teams to figure out the best approach. After all, "there is no free lunch". For example, one-hot encoding can be applied to the categorical variables.
* The ratio of "default" to "no default" is about 1:3 in the dataset, which may affect the accuracy of each model. There are several ways to address this, for instance SMOTE.
* Parameter tuning can be applied as well, for example by trading off different learning_rate and n_estimators combinations in the gradient boosting models.
* HAVE FUN.
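
Two of the ideas above, one-hot encoding and tuning `learning_rate` together with `n_estimators`, can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the grid values are arbitrary, and scikit-learn's `GradientBoostingClassifier` stands in for the gradient boosting libraries used in this kernel.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with two categorical codes and one numeric column
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'EDUCATION': rng.integers(1, 5, size=n),   # codes 1..4
    'MARRIAGE': rng.integers(1, 4, size=n),    # codes 1..3
    'LIMIT_BAL': rng.integers(1, 50, size=n) * 10000,
})
# Made-up target, loosely tied to LIMIT_BAL plus noise
y = (toy['LIMIT_BAL'] + rng.normal(0, 100000, size=n) < 200000).astype(int)

# One-hot encode the categorical codes; LIMIT_BAL stays numeric
X_encoded = pd.get_dummies(toy, columns=['EDUCATION', 'MARRIAGE'])

# Small, arbitrary grid over the two parameters named above
param_grid = {'learning_rate': [0.05, 0.2], 'n_estimators': [50, 100]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring='roc_auc', cv=3)
search.fit(X_encoded, y)

print('best parameters:', search.best_params_)
print('best CV ROC AUC:', round(search.best_score_, 3))
```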

# Reference

* Default of credit card clients dataset, https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
* Machine Learning Pipeline, https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b
* Feature distribution, https://www.kaggle.com/gpreda/default-of-credit-card-clients-predictive-models 
* Logistic Regression, https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* Support Vector Machines, https://scikit-learn.org/stable/modules/svm.html
* K Nearest Neighbors, https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
* Random Forest Classifier, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* XGBoost, https://xgboost.readthedocs.io/en/latest/ 
* LightGBM, https://lightgbm.readthedocs.io/en/latest/Python-API.html
* CatBoost, https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier-docpage/
* Cross Validation, https://www.ritchieng.com/machine-learning-cross-validation/ 
* Model Evaluation, https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation
* Accuracy & Precision & Recall & F1, https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
* Feature Importance, https://tech.yandex.com/catboost/doc/dg/features/feature-importances-calculation-docpage/
* Parameter Tuning, https://tech.yandex.com/catboost/doc/dg/concepts/parameter-tuning-docpage/
* Feature Engineering, https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63