# **Credit Default** - Advanced Models

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

------

## Overview

Continuing with the credit default problem we already know, we now implement more advanced models. Both the dummy and the logit model achieved an AUC of approximately 0.5, and our goal is to surpass this baseline. We skip exploratory data analysis and compare the following models: logit with lasso regularization, K-nearest neighbors (KNN), decision tree, random forest, gradient boosted trees, and a (very simple) ensemble.
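
An AUC of 0.5 is what any model that ignores the features will score. The following is a minimal sketch on synthetic labels; the 22% positive rate and the `*_demo` names are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced labels (assumption: roughly 22% positives, not the real data)
rng = np.random.default_rng(1)
y_demo = (rng.random(1000) < 0.22).astype(int)
X_demo = rng.normal(size=(1000, 3))

# A dummy classifier ignores the features and predicts the class prior for everyone
dummy = DummyClassifier(strategy='prior').fit(X_demo, y_demo)
p_demo = dummy.predict_proba(X_demo)[:, 1]

# All scores are tied, so the ROC curve is the diagonal and the AUC is 0.5
auc_dummy = roc_auc_score(y_demo, p_demo)
print('Dummy AUC = {0:.4f}'.format(auc_dummy))
```

Any model worth keeping should rank defaulters above non-defaulters more often than this constant prediction does.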

<img src="https://greendayonline.com/wp-content/uploads/2017/03/Recovering-From-Student-Loan-Default.jpg" width="500" height="500" align="center"/>


Image: https://greendayonline.com/wp-content/uploads/2017/03/Recovering-From-Student-Loan-Default.jpg

#### The Credit Card Default Dataset 

We will try to predict the probability of defaulting on a credit card account at a Taiwanese bank. A credit card default happens when a customer fails to pay the minimum due on a credit card bill for more than 6 months. 

We will use a dataset from a Taiwanese bank with 30,000 observations (Source: *Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.*). Each observation represents an account at the bank at the end of October 2005. We renamed the variable `default_payment_next_month` to `customer_default`. The target variable to predict is `customer_default` -- i.e., whether the customer will default in the following month (1 = Yes or 0 = No). The dataset also includes 23 other explanatory features.

Variables are defined as follows:

| Feature name     | Variable Type | Description 
|------------------|---------------|--------------------------------------------------------
| customer_default | Binary        | 1 = default in following month; 0 = no default 
| LIMIT_BAL        | Continuous    | Credit limit   
| SEX              | Categorical   | 1 = male; 2 = female
| EDUCATION        | Categorical   | 1 = graduate school; 2 = university; 3 = high school; 4 = others
| MARRIAGE         | Categorical   | 0 = unknown; 1 = married; 2 = single; 3 = others
| AGE              | Continuous    | Age in years  
| PAY1             | Categorical   | Repayment status in September, 2005 
| PAY2             | Categorical   | Repayment status in August, 2005 
| PAY3             | Categorical   | Repayment status in July, 2005 
| PAY4             | Categorical   | Repayment status in June, 2005 
| PAY5             | Categorical   | Repayment status in May, 2005 
| PAY6             | Categorical   | Repayment status in April, 2005 
| BILL_AMT1        | Continuous    | Balance in September, 2005  
| BILL_AMT2        | Continuous    | Balance in August, 2005  
| BILL_AMT3        | Continuous    | Balance in July, 2005  
| BILL_AMT4        | Continuous    | Balance in June, 2005 
| BILL_AMT5        | Continuous    | Balance in May, 2005  
| BILL_AMT6        | Continuous    | Balance in April, 2005  
| PAY_AMT1         | Continuous    | Amount paid in September, 2005
| PAY_AMT2         | Continuous    | Amount paid in August, 2005
| PAY_AMT3         | Continuous    | Amount paid in July, 2005
| PAY_AMT4         | Continuous    | Amount paid in June, 2005
| PAY_AMT5         | Continuous    | Amount paid in May, 2005
| PAY_AMT6         | Continuous    | Amount paid in April, 2005

The measurement scale for repayment status is:   

    -2 = payment two months in advance   
    -1 = payment one month in advance   
    0 = pay duly   
    1 = payment delay for one month   
    2 = payment delay for two months   
    3 = payment delay for three months   
    4 = payment delay for four months   
    5 = payment delay for five months   
    6 = payment delay for six months   
    7 = payment delay for seven months   
    8 = payment delay for eight months   
    9 = payment delay for nine months or more  

-------

## **Part 0**: Setup

Put all import statements, constants and helper functions at the top of your notebook.

### Imports

It is a good idea to put all of your import statements at the top of a project in one location. You can then quickly see all of the requirements needed to run the project.

In [None]:
# Standard imports
import numpy  as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras

import itertools
import pandas_profiling

# Plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline  
import seaborn as sns
sns.set(style="white")

# scikit-learn
from sklearn.dummy        import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors    import KNeighborsClassifier
from sklearn.tree         import DecisionTreeClassifier, plot_tree
from sklearn.ensemble     import RandomForestClassifier
from sklearn.ensemble     import GradientBoostingClassifier

# Supporting functions from scikit-learn
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from graphviz import Source
from sklearn.decomposition import PCA

# ignore some warnings 
import warnings
warnings.filterwarnings('ignore')

### Constants

It is a good idea to move as many of the constant values used in your project as possible up to the top, and to give them variable names in ALL_CAPS so you can quickly identify constant values.

In [None]:
# Set a seed for replication
SEED = 1

### Custom Functions

We will also define a few "helper functions" to automate repetitive tasks that we will perform below.

In [None]:
def plot_confusion_matrix(cm, classes=[0,1], normalize=False, title='Confusion Matrix', cmap=plt.cm.Reds):
    """ 
    Function to plot a sklearn confusion matrix, showing number of cases per prediction condition 
    
    Args:
        cm         an sklearn confusion matrix
        classes    levels of the class being predicted; default to binary outcome
        normalize  apply normalization by setting `normalize=True`
        title      title for the plot
        cmap       color map
    """
    plt.imshow(cm, aspect='auto', interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.locator_params(nbins=2)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 2.
    # add FP, TP, FN, TN counts
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, round (cm[i, j],2), horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')

In [None]:
def plot_roc(fpr, tpr, title='ROC Curve', note=''):
    """
    Function to plot an ROC curve in a consistent way.
    
    Args:
        fpr        False Positive Rate (list of multiple points)
        tpr        True Positive Rate (list of multiple points)
        title      Title above the plot
        note       Note to display in the bottom-right of the plot
    """
    plt.figure(1)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr)
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title(title)
    if note: plt.text(0.6, 0.2, note)
    plt.show()

In [None]:
def print_feature_importance(tree_model, feature_names):
    """
    Function to print a list of features from an sklearn tree model (ranked by importance of the feature)
    
    Args:
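        tree_model       A fitted sklearn tree-based model with a `feature_importances_` attribute
                         (e.g. DecisionTreeClassifier or RandomForestClassifier)
        feature_names    A list of feature names, in the column order used to fit the model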
    """
    print('Feature'.center(12), '   ',  'Importance')
    print('=' * 30)
    for index in reversed(np.argsort(tree_model.feature_importances_)):
        print(str(feature_names[index]).center(12) , '   ', '{0:.4f}'.format(tree_model.feature_importances_[index]).center(8)) 

## **Part 1**: Load, Preprocess and Split Data

In [None]:
# Load Data
data = pd.read_csv('credit_data.csv')

# Move target variable to first column (not necessary, but easier to see)
data = data.set_index('customer_default').reset_index() 

# One-hot-encode SEX and MARRIAGE  
data = pd.get_dummies(data=data, columns=['SEX', 'MARRIAGE'])

# Remove 'id'
data = data.drop(columns=['ID'])

# Select target
y = np.array(data['customer_default'])

# Select features (a list comprehension keeps the column order stable across runs, unlike a set)
features = [col for col in data.columns if col != 'customer_default']
X = data.loc[:, features]

### Split Data: Training, Validation, & Testing

In all of the code that follows, we divide the data into three parts: **training** (60%), **validation** (20%) and **test** (20%). In the Python code, we refer to these subsets as:

| Subset      | Pct.  | X code var     | Target code var |
|-------------|-------|----------------|-----------------|
| training    | 60%   | X_train_train  | y_train_train   |
| validation  | 20%   | X_train_val    | y_train_val     |
| testing     | 20%   | X_test         | y_test          |
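
The 60/20/20 proportions come from two successive splits: holding out 20% leaves 80%, and a quarter of that 80% is 20% of the full data. A minimal sketch of that arithmetic, using a synthetic stand-in with 30,000 rows (the `*_demo` arrays and shortened variable names are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (30,000)
X_demo = np.arange(30000).reshape(-1, 1)
y_demo = np.zeros(30000, dtype=int)

# First split: hold out 20% for testing, leaving 80%
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
# Second split: 25% of the remaining 80% is 20% of the full data
X_tt, X_va, y_tt, y_va = train_test_split(X_tr, y_tr, test_size=0.25, random_state=1)

sizes = {'training': len(X_tt), 'validation': len(X_va), 'testing': len(X_te)}
for name, n in sizes.items():
    print('{0:<12}{1:>7}{2:>8.0%}'.format(name, n, n / len(X_demo)))
```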


In [None]:
# Split data into "train", "validation", and "test" 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
X_train_train, X_train_val, y_train_train, y_train_val = train_test_split(X_train, y_train, test_size=0.25, random_state=SEED)

## **Part 2**: Logit Model with L1 Regularization ("Lasso")

In [None]:
# Build pipeline (not strictly necessary as pipeline only contains one step)
estimators = []
estimators.append(('logit_model_l1', LogisticRegression()))  # tell it to use a logit model
pipeline = Pipeline(estimators) 
pipeline.set_params(logit_model_l1__penalty='l1')            # tell it to regularize with L1 norm
pipeline.set_params(logit_model_l1__solver='liblinear')      # tell it to use the liblinear solver, which supports the L1 penalty

# Tune C  
results = []
for c in np.logspace(-4, 5, 10):
    pipeline.set_params(logit_model_l1__C=c) 
    pipeline.fit(X_train_train,y_train_train)
    y_train_pred = pipeline.predict_proba(X_train_val)       # use validation set during hyper-parameter tuning
    auc_lml1 = roc_auc_score(y_train_val, y_train_pred[:,1])   
    results.append( (auc_lml1, c)  )
logit_model_l1 = pipeline.named_steps['logit_model_l1']      # capture model so we can use it later

# View results 
print('C'.center(12), '   ', 'AUC'.center(8), '\n', '=' * 25)
for (auc, c) in results:
    print('{0:.4f}'.format(c).rjust(12), '   ',  '{0:.4f}'.format(auc).center(8))

In [None]:
# Select best C       NOTE: The "best" C does not have to have the highest AUC
best_C = 0.100        # perhaps select a lower C with a slightly lower AUC so Lasso will find a simpler model

In [None]:
# Test final model 
pipeline.set_params(logit_model_l1__C=best_C)
pipeline.fit(X_train,y_train)
y_prob_logit_lasso = pipeline.predict_proba(X_test)
fpr_logit_lasso, tpr_logit_lasso, _ = roc_curve(y_test, y_prob_logit_lasso[:, 1])
best_auc_logit_lasso = roc_auc_score(y_test, y_prob_logit_lasso[:,1])
print('L1 Regularized Logit Model: Final test score for C = {0:.4f} has AUC = {1:.4f}'.format(best_C, best_auc_logit_lasso))

In [None]:
plot_roc(fpr_logit_lasso, tpr_logit_lasso, 'ROC Curve for L1 Regularized Logit Model', 'AUC = %2.4f' % best_auc_logit_lasso)

In [None]:
# Compare coefficients between a standard logit model and a regularized logit model, to see how Lasso drops predictors
logit_model = Pipeline([('s', StandardScaler()), ('m', LogisticRegression())]).fit(X_train_train, y_train_train).named_steps['m']
print('REGULARIZATION'.center(20), 'NONE'.center(10), 'L1'.center(10))
print('=' * 50)
for (varname, lm_coef, lml1_coef) in zip(features, logit_model.coef_[0], logit_model_l1.coef_[0]):
    lm_coeff  = "{0:.4f}".format(lm_coef).rjust(10)
    lml1_coef = "{0:.4f}".format(lml1_coef).rjust(10) if abs(lml1_coef) > 0.0001 else ""
    print(str(varname).center(20), lm_coeff, lml1_coef)
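
To see why a lower `C` tends to give a simpler model, it can help to count how many coefficients survive at each value. This is a minimal sketch on synthetic data (an assumption: 20 standardized features, only 5 informative), so the counts will differ from those on the credit data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only 5 of them informative
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     n_redundant=0, random_state=1)
X_demo = StandardScaler().fit_transform(X_demo)

# Smaller C means a stronger L1 penalty, so more coefficients are set exactly to zero
nonzero_counts = []
for c in [0.001, 0.01, 0.1, 1.0, 100.0]:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=c).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.sum(model.coef_[0] != 0)))
    print('C = {0:>8.3f}   non-zero coefficients = {1}'.format(c, nonzero_counts[-1]))
```

Picking a `C` slightly below the AUC-maximizing value trades a little validation AUC for fewer active predictors.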

## **Part 3**: K-Nearest Neighbors (KNN)

Do we need to standardize the data? Yes! Standardization is required as the KNN algorithm measures the distances between pairs of samples and these measurements depend on the measurement units. 
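
The effect of measurement units on distances can be seen with a handful of made-up customers. In this sketch (all values are hypothetical and chosen only for illustration), the nearest neighbour of the first customer changes once both features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five hypothetical customers: [credit limit, age in years]
X_demo = np.array([[ 20000.0, 25.0],
                   [ 21000.0, 60.0],
                   [ 60000.0, 26.0],
                   [500000.0, 40.0],
                   [300000.0, 45.0]])

def nearest_to_first(X):
    """Return the row index (1 or higher) closest to row 0 in Euclidean distance."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

# Raw units: credit limit dominates, so the customer who is 35 years older looks closest
nearest_raw = nearest_to_first(X_demo)
# Standardized units: both features contribute, so the similar-aged customer is closest
nearest_std = nearest_to_first(StandardScaler().fit_transform(X_demo))

print('Nearest neighbour (raw):         ', nearest_raw)
print('Nearest neighbour (standardized):', nearest_std)
```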

In [None]:
# Build pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('knn_model', KNeighborsClassifier()))
pipeline = Pipeline(estimators)

# Tune K
results = []
for k in range(5, 100, 10):
    pipeline.set_params(knn_model__n_neighbors = k) 
    pipeline.fit(X_train_train ,y_train_train)
    y_prob = pipeline.predict_proba(X_train_val)
    auc = roc_auc_score(y_train_val, y_prob[:,1])
    results.append( (auc, k) )
    
# View results 
print('K'.rjust(5), '   ', 'AUC'.center(8), '\n', '=' * 20)
for (auc, k) in results:
    print('{0}'.format(k).rjust(5), '   ',  '{0:.4f}'.format(auc).center(8))    

In [None]:
# Select best K
best_K = sorted(results)[-1][1]
auc = sorted(results)[-1][0]
print('Best value of K = %d, with AUC = %2.4f' % (best_K, auc))

In [None]:
# Test final model 
pipeline.set_params(knn_model__n_neighbors = best_K)
pipeline.fit(X_train, y_train)
y_prob_knn = pipeline.predict_proba(X_test)
fpr_knn, tpr_knn, _ = roc_curve(y_test, y_prob_knn[:, 1])
best_auc_knn = roc_auc_score(y_test, y_prob_knn[:,1])
print('KNN Model: Final test score for K = %d has AUC = %2.4f' % (best_K, best_auc_knn))

In [None]:
# Plot ROC curve
plot_roc(fpr_knn, tpr_knn, 'ROC Curve for KNN Model', 'K = %d \n\nAUC = %2.4f' % (best_K, best_auc_knn))

## **Part 4**: Decision Trees

Next, we turn to **Decision Trees**, which choose their splits using impurity measures such as Gini impurity or entropy (the latter borrowed from information theory). Unlike a **Logit Model** or a **KNN Classifier**, a decision tree is not sensitive to the scaling of its features -- it only needs to find a cut point along a feature, and the ordering of values does not change when the feature is rescaled. The implementation of decision trees in sklearn, however, expects categorical features to be "one-hot-encoded" into separate variables (which is convenient for the implementation, but not technically required by the algorithm).
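
A quick way to check the scaling claim is to fit the same tree twice, once on the raw features and once after multiplying one feature by a constant. This is a sketch on synthetic data; the factor of 1000 and the `*_demo` names are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=1)

# Rescale one feature by a positive factor (e.g. a change of currency units)
X_scaled = X_demo.copy()
X_scaled[:, 0] *= 1000.0

tree_raw    = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_scaled, y_demo)

# The cut points move with the units, but the resulting partition of the samples does not
same_predictions = np.array_equal(tree_raw.predict(X_demo), tree_scaled.predict(X_scaled))
print('Identical predictions after rescaling:', same_predictions)
```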

#### One Decision Tree 

The basic decision tree model starts with just one tree. We will see that a single tree is likely to be unreliable, but we fit one next for illustration.

In [None]:
# Fit Model      
tree_model = DecisionTreeClassifier(random_state=SEED)  # just one tree, so no pipeline needed
tree_model.fit(X_train, y_train)

# Test model
y_prob_tree = tree_model.predict_proba(X_test)
fpr_tree, tpr_tree, _ = roc_curve(y_test, y_prob_tree[:, 1])
auc_one_tree = roc_auc_score(y_test, y_prob_tree[:,1])
print('One Decision Tree Model: Final test score has AUC = {0:0.4f}'.format(auc_one_tree))

In [None]:
plot_roc(fpr_tree, tpr_tree, 'ROC Curve for One Decision Tree', 'AUC = %2.4f' %  auc_one_tree)

In [None]:
# Print feature importance
print_feature_importance(tree_model, features)

In [None]:
# Visualize the tree    
#     this may not work on all computers - requires graphviz to be installed
#     install graphviz on a Mac computer by running:  brew install -v graphviz
#     install graphviz on Linux computer by running:  sudo apt-get install graphviz
Source(export_graphviz(tree_model, out_file=None, feature_names=features, max_depth=3))
# !dot -Tpng tree.dot -o tree.png

## **Part 5**: A Random Forest of Decision Trees

Just one tree may be an arbitrary solution to the prediction problem. So next, we fit a random forest: many trees, each grown on a bootstrap sample of the rows and considering a random subset of features at each split. Averaging across the trees gives a more reliable prediction.
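
The averaging step can be reproduced by hand from the fitted trees, which sklearn exposes as `estimators_`. This is a minimal sketch on synthetic data; the sample size and the 25 trees are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# Each fitted tree lives in forest.estimators_; average their class probabilities by hand
manual_avg = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)

# The forest's own probabilities are the same average
matches = np.allclose(manual_avg, forest.predict_proba(X_demo))
print('Forest probability equals mean of tree probabilities:', matches)
```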

In [None]:
# Build pipeline
estimators = []
estimators.append(('forest_model', RandomForestClassifier()))
pipeline = Pipeline(estimators)
pipeline.set_params(forest_model__random_state = SEED)
    
# Tune N   
results = []
for n in [10, 50, 150, 200, 250]:
    pipeline.set_params(forest_model__n_estimators = n) 
    pipeline.fit(X_train_train, y_train_train)
    y_train_pred = pipeline.predict_proba(X_train_val)
    auc = roc_auc_score(y_train_val, y_train_pred[:,1])
    results.append( (auc, n))

# View results 
print('N'.rjust(5), '   ', 'AUC'.center(8), '\n', '=' * 20)
for (auc, n) in results:
    print('{0}'.format(n).rjust(5), '   ',  '{0:.4f}'.format(auc).center(8))   

In [None]:
# Test final model
best_N = 150   # AUC will generally improve with more trees, but eventually level off
pipeline.set_params(forest_model__n_estimators=best_N)
pipeline.fit(X_train, y_train)
y_prob_forest = pipeline.predict_proba(X_test)
fpr_forest, tpr_forest, _ = roc_curve(y_test, y_prob_forest[:, 1])
best_auc_forest = roc_auc_score(y_test, y_prob_forest[:,1])
print('Random Forest Model: Final test score for N = %d has AUC = %2.4f' % (best_N, best_auc_forest))

In [None]:
plot_roc(fpr_forest, tpr_forest, 'ROC Curve for a Random Forest', 'N = %d \n\nAUC = %2.4f' %  (best_N, best_auc_forest))

In [None]:
# Print feature importance
forest_model = pipeline.named_steps['forest_model']
print_feature_importance(forest_model, features)

## **Part 6**: Gradient Boosted Trees 

Just one tree may be an arbitrary and unreliable model. So next we use gradient boosting, which fits a sequence of trees in which each new tree corrects the errors of the trees before it.
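
Because the trees are added one after another, a single fitted model can report its predictions after each stage through `staged_predict_proba`. The sketch below uses synthetic data and illustrative names (`gbt`, `staged_auc`) to trace validation AUC as trees are added, which is one possible alternative to refitting for every candidate number of trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                     random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

gbt = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_fit, y_fit)

# staged_predict_proba yields the predictions after 1, 2, ..., n_estimators trees
staged_auc = [roc_auc_score(y_val, p[:, 1]) for p in gbt.staged_predict_proba(X_val)]
for n in [1, 10, 50, 100]:
    print('{0}'.format(n).rjust(5), '   ', '{0:.4f}'.format(staged_auc[n - 1]).center(8))
```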

In [None]:
# Build a pipeline
estimators = []
estimators.append(('g_boosted_trees', GradientBoostingClassifier()))
pipeline = Pipeline(estimators)
pipeline.set_params(g_boosted_trees__random_state = SEED)
    
# Tune N using validation set 
results = []
for n in [10, 50, 150, 200, 250]:
    pipeline.set_params(g_boosted_trees__n_estimators = n) 
    pipeline.fit(X_train_train, y_train_train)
    y_train_pred = pipeline.predict_proba(X_train_val)
    auc = roc_auc_score(y_train_val, y_train_pred[:,1])
    results.append( (auc, n))

# View results 
print('N'.rjust(5), '   ', 'AUC'.center(8), '\n', '=' * 20)
for (auc, n) in results:
    print('{0}'.format(n).rjust(5), '   ',  '{0:.4f}'.format(auc).center(8))   

In [None]:
# Test final model
best_N = 150   # AUC will generally improve with more iterations, but eventually level off
pipeline.set_params(g_boosted_trees__n_estimators = best_N)
pipeline.fit(X_train, y_train)
y_prob_boosted = pipeline.predict_proba(X_test)
fpr_boosted, tpr_boosted, _ = roc_curve(y_test, y_prob_boosted[:, 1])
best_auc_boosted = roc_auc_score(y_test, y_prob_boosted[:,1])
print('Gradient Boosted Trees Model: Final test score for N = %d has AUC = %2.4f' % (best_N, best_auc_boosted))

In [None]:
plot_roc(fpr_boosted, tpr_boosted, 'ROC Curve for Boosted Trees', 'N = %d \nAUC = %2.4f' % (best_N, best_auc_boosted) )

## **Part 7**: A (Very Simple) Ensemble Model  

We now use a very simple approach to ensemble the various models we have tried so far -- we will simply average the predictions from:   
  
  * Logit + Lasso
  * K-Nearest Neighbors
  * Random forest
  * Gradient boosted trees
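
The same kind of unweighted probability averaging is also available in sklearn as `VotingClassifier` with `voting='soft'`. The sketch below, on synthetic data with only two member models chosen for brevity (an assumption, not the four models above), checks that soft voting matches a manual average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

# Soft voting averages the predicted class probabilities of the member models
voter = VotingClassifier(
    estimators=[('logit', LogisticRegression(max_iter=1000)),
                ('forest', RandomForestClassifier(n_estimators=25, random_state=1))],
    voting='soft')
voter.fit(X_demo, y_demo)

# Manual average over the fitted member models, for comparison
manual_avg = np.mean([m.predict_proba(X_demo) for m in voter.estimators_], axis=0)
matches = np.allclose(manual_avg, voter.predict_proba(X_demo))
print('Soft voting equals the manual average:', matches)
```

Averaging by hand, as done in this notebook, has the advantage that each model keeps its own tuned pipeline and is fitted only once.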
  

In [None]:
# Average predictions across 4 models
y_prob_ensemble = np.mean( np.array([y_prob_logit_lasso[:,1], y_prob_knn[:,1], y_prob_forest[:,1], y_prob_boosted[:,1]]), axis=0)
fpr_ensemble, tpr_ensemble, _ = roc_curve(y_test, y_prob_ensemble)
ensembled_auc = roc_auc_score(y_test, y_prob_ensemble)
print('Simple Ensemble: Final test score has AUC = {0:0.4f}'.format(ensembled_auc))

In [None]:
# Plot ROC  
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_ensemble, tpr_ensemble)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve for Average Ensemble')
plt.show()

## **SUMMARY OF ROC CURVES**

In [None]:
plot_roc(fpr_logit_lasso, tpr_logit_lasso, 'ROC Curve for L1 Regularized Logit Model', 'AUC = %2.4f' % best_auc_logit_lasso)

In [None]:
plot_roc(fpr_knn, tpr_knn, 'ROC Curve for KNN Model', 'K = %d \n\nAUC = %2.4f' % (best_K, best_auc_knn))

In [None]:
plot_roc(fpr_forest, tpr_forest, 'ROC Curve for a Random Forest', 'N = %d \n\nAUC = %2.4f' %  (best_N, best_auc_forest))

In [None]:
plot_roc(fpr_boosted, tpr_boosted, 'ROC Curve for Boosted Trees', 'N = %d \nAUC = %2.4f' % (best_N, best_auc_boosted) )

## **SUMMARY OF AUC VALUES**

In [None]:
# Print summary of AUC scores 
width     = 30
width_box = 100
models    = ['Baseline', 'Logit + Lasso', 'KNN', 'Random Forest', 'Gradient Boosting', 'Averaged Ensemble']
results   = [0.5000, best_auc_logit_lasso, best_auc_knn, best_auc_forest, best_auc_boosted, ensembled_auc]

print(str('=' * width).center(width_box))
print('Summary of AUC Scores'.center(width_box))
print(str('=' * width).center(width_box))
for i in range(len(models)):
    line = models[i].center(width - 8) + '{0:.4f}'.format(results[i])
    print(line.center(width_box))
print()

# Plot ROC
fig, ax1 = plt.subplots(1, 1)
fig.set_size_inches(12, 12)
ax1.xaxis.grid(True, linestyle='--', which='major', color='grey', alpha=.25)
ax1.yaxis.grid(True, linestyle='--', which='major', color='grey', alpha=.25)
plt.plot([0, 1], [0, 1], 'k--', label = 'Baseline (AUC 0.5)')
plt.plot(fpr_logit_lasso, tpr_logit_lasso, label = 'Logit + Lasso (AUC {})'.format(round(best_auc_logit_lasso, 4)))
plt.plot(fpr_knn,         tpr_knn,         label = 'KNN (AUC {})'.format(round(best_auc_knn, 4)))
plt.plot(fpr_forest,      tpr_forest,      label = 'Random Forest (AUC {})'.format(round(best_auc_forest, 4)))
plt.plot(fpr_boosted,     tpr_boosted,     label = 'Gradient Boosting (AUC {})'.format(round(best_auc_boosted, 4)))
plt.plot(fpr_ensemble,    tpr_ensemble,    label = 'Averaged Ensemble (AUC {})'.format(round(ensembled_auc, 4)) , linewidth=2)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()