## Table of contents

- [Imports](#im)
- [Read in Data](#1) 
- [EDA](#2) 
- [Data Preprocessing](#3) 
- [Modeling](#4) 
- [Hyperparameter tuning](#5) 
- [Model Interpretation](#6) 
- [Results on test set](#7) 

## Imports <a name="im"></a>

In [None]:
#imports
import os
%matplotlib inline
import string
import sys
from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# data
from sklearn.compose import ColumnTransformer, make_column_transformer

# Classifiers
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# classifiers / models
from sklearn.linear_model import LogisticRegression

# other
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from lightgbm.sklearn import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, plot_confusion_matrix

# altair 
import altair as alt
alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Read in Data <a name="1"></a>

In [None]:
#read in data 
data = pd.read_csv("../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv")

In [None]:
#Split data into train and test splits
train_df, test_df = train_test_split(data, test_size = 0.2, random_state=123)

#rename default col name
train_df = train_df.rename(columns={'default.payment.next.month': 'default'})
test_df = test_df.rename(columns={'default.payment.next.month': 'default'})

#rename column for consistency
train_df = train_df.rename(columns={"PAY_0": "PAY_1"})
test_df = test_df.rename(columns={"PAY_0": "PAY_1"})

## EDA <a name="2"></a>

### Features
**Note: all amounts are in NT (New Taiwan) Dollars**

* `ID`: ID of each client
* `LIMIT_BAL`: Amount of given credit (includes individual and family/supplementary credit)
* `SEX`: Male or Female
* `EDUCATION`: graduate school, university, high school, other, or unknown
* `MARRIAGE`: Marital Status (married, single, or other)
* `AGE`: Age in years

The repayment status features (`PAY_1` to `PAY_6`) use the following scale: -2 = no consumption; -1 = paid in full; 0 = use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. The bill and payment amounts are in NT dollars.

* `PAY_1`, `PAY_2`, ..., `PAY_6`: Repayment status in September, August,... , April 2005
* `BILL_AMT1`, `BILL_AMT2`, ..., `BILL_AMT6`: Amount of bill statement in September, August,... , April 2005
* `PAY_AMT1`, `PAY_AMT2`, ..., `PAY_AMT6`: Amount of previous payment in September, August,... , April 2005
* `default`: Default payment (1=yes, 0=no)

In [None]:
# recode categorical values as readable labels for plotting
plot_df = train_df.copy()
plot_df["SEX"] = plot_df["SEX"].replace({1: "male", 2: "female"})
# group MARRIAGE code 0 with 3 ("other"), then map codes to labels
plot_df["MARRIAGE"] = plot_df["MARRIAGE"].replace({0: 3})
plot_df["MARRIAGE"] = plot_df["MARRIAGE"].replace({1: "married", 2: "single", 3: "other"})
# fold EDUCATION codes 0 and 6 into 5
plot_df["EDUCATION"] = plot_df["EDUCATION"].replace({0: 5, 6: 5})

In [None]:
alt.Chart(plot_df, title = "Proportion of non-default (0) to default (1) cases").mark_bar().encode(
    alt.X("default:O"),
    alt.Y("count(default)"),
    alt.Color("default:N", legend= None),
    tooltip='count(default)'
).properties(
    width=300,
    height=300
)


In [None]:
# normalized counts of target class
train_df["default"].value_counts(normalize=True)

> Around 22% of cases defaulted and 78% did not. Because the classes are imbalanced and predicting each class correctly is equally important, macro-averaged F1 (F1-macro) will be used as the scoring metric. F1-macro computes the F1 score of each class separately and returns the unweighted average of the two scores, so both classes count equally regardless of how many examples each one has.
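To make the metric concrete, here is a minimal sketch that computes macro F1 by hand. The labels are made up for illustration and are not taken from the dataset; in the notebook itself the same quantity comes from scikit-learn's `f1_macro` scorer. Both toy models below have 80% accuracy, but macro F1 separates them clearly.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Illustrative labels: 8 non-defaults (0) and 2 defaults (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts the majority class.
always_zero = [0] * 10
# A model that catches one default at the cost of one false alarm.
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(round(f1_macro(y_true, always_zero), 4))  # 0.4444
print(round(f1_macro(y_true, mixed), 4))        # 0.6875
```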

In [None]:
#default based on gender
gender_plot = alt.Chart(plot_df, title="Default cases based on gender").mark_bar().encode(
    alt.Y('count(default)', stack ="normalize", axis=alt.Axis(format='%')),
    alt.X('SEX:O'),
    alt.Color('default:O', scale=alt.Scale(scheme='tableau10')),
    tooltip='count(default)'
).properties(
    width=200,
    height=300
)#.configure_axisX(labelAngle=360)

#default based on marriage status plot
marriage_plot = alt.Chart(plot_df, title="Default cases based on marriage").mark_bar().encode(
    alt.Y('count(default)', stack ="normalize", axis=alt.Axis(format='%')),
    alt.X('MARRIAGE:O'),
    alt.Color('default:O', scale=alt.Scale(scheme='tableau10')),
    tooltip='count(default)'
).properties(
    width=300,
    height=300
)#.configure_axisX(labelAngle=360)

#concat gender and marriage plot
alt.hconcat(gender_plot, marriage_plot).configure_axisX(labelAngle=360)

> There is no visible difference in the proportion of defaults by sex or marital status.

In [None]:
lom = ["PAY_6", "PAY_5", "PAY_4", "PAY_3", "PAY_2", "PAY_1",]

alt.Chart(plot_df).mark_bar(size=20).encode(
   alt.X(alt.repeat('repeat'), type='quantitative'),
   alt.Y("count(default)", stack ="normalize", axis=alt.Axis(format='%')),
    alt.Color('default:O', scale=alt.Scale(scheme='tableau10')),
    tooltip='count(default)'
).properties(height = 250, width = 250
).repeat(repeat = lom, columns =3, title = "Repayment Status from April - September 2005")

> The above plot shows the proportion of default to non-default cases for each repayment status value in the months April to September.
> * Recall: -2 = no consumption; -1 = paid in full; 0 = use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
> 
> A payment delay of one to two months or more appears to be associated with a higher chance of defaulting. Another observation is that values of 1 are missing from April and May; this could be due to an error in data collection.

In [None]:
#heatmap for all features, excluding ID. 
df_cor = train_df.drop(columns = ["ID"])
corrmat = df_cor.corr(method='pearson')
f, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(corrmat, vmax=1., square=True, annot=False, cmap=plt.cm.Blues)
plt.title("Correlation map", fontsize=14)
plt.show()

> The correlation matrix highlights two interesting points: the two dark blue squares in the middle of the plot. One square shows that the repayment statuses of different months are positively correlated with each other; in other words, a client's repayment status tends to be similar from one month to the next. Similarly, the monthly bill amounts are highly correlated with each other, which makes sense because people usually have more or less the same expenses every month.

In [None]:
# available credit = limit balance - bill amount
# compute the available credit for each of the 6 billing months, then average across months

bill_month = ["BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6"]

tmp_df = pd.DataFrame()
for i in bill_month:
    tmp_df["avail" + i] = plot_df["LIMIT_BAL"] - plot_df[i]

# row-wise mean across the 6 months, rounded to the nearest dollar
plot_df["average_available_credit"] = tmp_df.mean(axis=1).round()


# plot
alt.Chart(plot_df, title = "Default cases based on Average available credit").mark_bar().encode(
    alt.X("average_available_credit", bin=True, title= "Average available credit"),
    #alt.Y("count(LIMIT_BAL)"),
    alt.Y("count(default)", stack ="normalize", axis=alt.Axis(format='%')),
    alt.Color('default:O', scale=alt.Scale(scheme='tableau10')),
    tooltip='count(default)'
).properties(
    width=500,
    height=300
)

> The above plot shows the proportion of default and non-default cases by `average_available_credit`. Available credit for a given month is `LIMIT_BAL` - `BILL_AMT` for that month. Averaging the available credit over the 6 months helps reduce noise such as one-time large purchases. Because monthly bill amounts are relatively stable, as the correlation matrix showed earlier, average available credit may be a useful indicator for predicting default.

> The first bin and the last three bins in the plot above have very few observations, which makes any pattern there hard to see. Focusing on the 2nd to 7th bins (> 500 observations each), users with less available credit after their monthly bill are more likely to default. The plot also shows that individuals whose bills exceed their limit balance (negative available credit) have the highest chance of defaulting.
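As a worked example of the calculation described above, here is a minimal pure-Python sketch for two hypothetical clients. The limits and bill amounts are invented for illustration; the function name is not part of the notebook.

```python
def average_available_credit(limit_bal, bill_amounts):
    """Mean of (limit - bill) over the billing months, rounded to the nearest dollar."""
    available = [limit_bal - bill for bill in bill_amounts]
    return round(sum(available) / len(available))


# Hypothetical client with one unusually large bill in the third month.
steady_spender = average_available_credit(
    100_000, [20_000, 22_000, 90_000, 21_000, 19_000, 20_000]
)
# Hypothetical client whose bills exceed the limit in most months.
over_limit = average_available_credit(
    50_000, [55_000, 60_000, 52_000, 48_000, 58_000, 51_000]
)

print(steady_spender)  # 68000
print(over_limit)      # -4000
```

The single large bill lowers the first client's average only moderately, while the second client ends up with a negative value, which corresponds to the leftmost bins of the plot.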

## Data Preprocessing <a name="3"></a>

In [None]:
#Separate data into X_train and y_train 
X_train, y_train = train_df.drop(columns = ['default']), train_df['default']
X_test, y_test = test_df.drop(columns = ['default']), test_df['default']


#defining features
drop_features = ["ID"]
numeric_features = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4','BILL_AMT5',
                    'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3', 'PAY_AMT4','PAY_AMT5','PAY_AMT6',
                    'PAY_1', 'PAY_2','PAY_3', 'PAY_4', 'PAY_5','PAY_6']
categorical_features = ['EDUCATION', 'MARRIAGE']
binary_features = ['SEX']

#transformer pipelines for features
numeric_transformer = make_pipeline(
    StandardScaler(),
)

categorical_transformer = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), 
)

binary_transformer = make_pipeline(
    OneHotEncoder(drop="if_binary"), 
)

#preprocessor 
preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
    (binary_transformer, binary_features),
)
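The columns produced by a column transformer follow the order in which the transformers are listed: here numeric first, then categorical, then binary. A minimal sketch on a tiny made-up frame shows the resulting layout (the data and the `toy_` names are illustrative only; one column stands in for each feature group):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with one column of each kind used in the notebook.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "LIMIT_BAL": [20000.0, 50000.0, 90000.0, 120000.0],
        "EDUCATION": [1, 2, 2, 3],
        "SEX": [1, 2, 2, 1],
    }
)

toy_preprocessor = make_column_transformer(
    ("drop", ["ID"]),
    (StandardScaler(), ["LIMIT_BAL"]),
    (OneHotEncoder(handle_unknown="ignore"), ["EDUCATION"]),
    (OneHotEncoder(drop="if_binary"), ["SEX"]),
)

encoded = toy_preprocessor.fit_transform(toy)
# Convert defensively in case a sparse matrix comes back.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded

print(encoded.shape)   # (4, 5)
print(encoded[:, 1:])  # three EDUCATION indicator columns, then one SEX column
```

Column 0 is the scaled `LIMIT_BAL`, columns 1 to 3 are the one-hot `EDUCATION` levels, and column 4 is the single `SEX` indicator. Any list of feature names built later has to follow the same order.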

## Modeling <a name="4"></a>

In [None]:
#function to calculate cv score with sd
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation
    """
    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    for i in range(len(mean_scores)):
        # use positional access; the Series is indexed by metric name
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))

    return pd.Series(data=out_col, index=mean_scores.index)

**Dummy Classifier**

In [None]:
#store results and define metric
results = {}
scoring_metric = "f1_macro"

# dummy baseline model
dummy = DummyClassifier()

results['Dummy'] = mean_std_cross_val_scores(dummy, X_train, y_train, return_train_score=True, scoring=scoring_metric)
pd.DataFrame(results)

**Logistic Regression**

In [None]:
#logistic regression pipeline and cross validate
pipe_lr = make_pipeline(preprocessor, LogisticRegression(class_weight="balanced", max_iter= 1000, random_state = 123))

results["Logistic_Regression"] = mean_std_cross_val_scores(pipe_lr, X_train, y_train, return_train_score=True, scoring=scoring_metric)

#print results
pd.DataFrame(results)

**KNN, Random Forest, and LGBM Classifiers**

In [None]:
#ratio for imbalanced data
ratio = np.bincount(y_train)[0] / np.bincount(y_train)[1]

#pipelines for KNN, RandomForestClassifier, and LGBMClassifier 
pipe_knn = make_pipeline(preprocessor, KNeighborsClassifier())
pipe_rf = make_pipeline(preprocessor, RandomForestClassifier(class_weight="balanced", random_state= 123))
pipe_lgbm = make_pipeline(preprocessor, LGBMClassifier(scale_pos_weight=ratio, random_state=123))

classifiers = {
    "kNN" : pipe_knn,
    "random forest": pipe_rf,
    "LGBM": pipe_lgbm
}

#for loop to loop over the classifiers for cross validation 
for name, classifier in classifiers.items():
    results[name] = mean_std_cross_val_scores(classifier, X_train, y_train, 
                                              return_train_score=True, 
                                              scoring=scoring_metric)

#print results
pd.DataFrame(results)
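Two imbalance corrections are used above: `class_weight="balanced"` for logistic regression and random forest, and `scale_pos_weight` for LightGBM. Below is a minimal sketch of the arithmetic behind both, using made-up class counts that roughly follow the 78/22 split seen in the EDA (not the actual training counts). The balanced-weight formula is the one documented by scikit-learn, `n_samples / (n_classes * count)`.

```python
from collections import Counter

# Made-up labels with roughly the 78/22 split reported in the EDA.
y = [0] * 780 + [1] * 220
counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# scale_pos_weight as computed in the notebook: negatives divided by positives.
scale_pos_weight = counts[0] / counts[1]

# class_weight="balanced": n_samples / (n_classes * count of that class).
balanced = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(round(scale_pos_weight, 3))                     # 3.545
print(round(balanced[0], 3), round(balanced[1], 3))   # 0.641 2.273
print(round(balanced[1] / balanced[0], 3))            # 3.545
```

The ratio between the two balanced weights equals `scale_pos_weight`, so both settings up-weight the default class by the same relative amount.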

## Hyperparameter tuning <a name="5"></a>
* For simplicity, only the model with the best results using default hyperparameters will be tuned

In [None]:
#Parameters for LGBMClassifier Hyperparameter tuning 
param_grid = {
        'lgbmclassifier__max_depth': np.arange(1,50),
        'lgbmclassifier__learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
        'lgbmclassifier__n_estimators': np.arange(20,100),
        'lgbmclassifier__num_leaves': np.arange(2,100)  # LightGBM requires num_leaves > 1
}

#RandomizedSearchCV for lgbm model
randomizedcv_lgbm = RandomizedSearchCV(pipe_lgbm, param_grid, n_jobs=-1, 
                                       n_iter=10, 
                                       cv=5, 
                                       return_train_score = True, 
                                       scoring = scoring_metric,
                                       random_state = 123)
randomizedcv_lgbm.fit(X_train, y_train);
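`RandomizedSearchCV` does not try every combination; it evaluates only `n_iter` candidates drawn from the parameter space. Below is a minimal pure-Python sketch of that idea on a deliberately small, hypothetical grid (the values are for illustration and are not the grid used above). It enumerates the toy grid and samples without replacement, which is only practical because the grid is tiny.

```python
import itertools
import random

# Hypothetical grid, much smaller than the one in the notebook.
toy_grid = {
    "max_depth": [2, 4, 8],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [20, 50, 100],
}


def sample_candidates(grid, n_iter, seed):
    """Draw n_iter distinct parameter combinations uniformly at random."""
    names = list(grid)
    all_combos = list(itertools.product(*(grid[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, n_iter)
    return [dict(zip(names, combo)) for combo in chosen], len(all_combos)


candidates, total = sample_candidates(toy_grid, n_iter=5, seed=123)
print(f"{len(candidates)} of {total} combinations evaluated")  # 5 of 27 combinations evaluated
```

With a grid as large as the one above, ten iterations cover only a tiny fraction of the space, so the reported best parameters are a good candidate rather than a guaranteed optimum.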

In [None]:
#results of LGBM hyperparameter tuning 
pd.DataFrame(randomizedcv_lgbm.cv_results_)[
    [
        "mean_train_score",
        "mean_test_score",
        "param_lgbmclassifier__n_estimators",
        "param_lgbmclassifier__max_depth",
        "param_lgbmclassifier__learning_rate",
        "param_lgbmclassifier__num_leaves",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().head(3)

In [None]:
#use best hyperparameters on lgbm model
best_lgbm_model = randomizedcv_lgbm.best_estimator_
results["lgbm (tuned)"] = mean_std_cross_val_scores(
    best_lgbm_model, X_train, y_train, return_train_score=True, scoring=scoring_metric
)
pd.DataFrame(results)

> It looks like hyperparameter tuning improved the score. The best score is currently 0.689 F1-macro using a Gradient Boosted tree model (LightGBM). 

## Model Interpretation <a name="6"></a>

In [None]:
# get list of feature names after preprocessing data 
categorical_OHE = list(
    best_lgbm_model.named_steps["columntransformer"]
    .named_transformers_['pipeline-2']
    .named_steps["onehotencoder"]
    .get_feature_names(categorical_features)
)

binary_OHE = list(
    best_lgbm_model.named_steps["columntransformer"]
    .named_transformers_['pipeline-3']
    .named_steps["onehotencoder"]
    .get_feature_names(binary_features)
)

# order must match the preprocessor output: numeric, then categorical, then binary
feature_names = numeric_features + categorical_OHE + binary_OHE

In [None]:
import shap

#lgbm model with best parameters 

preprocessor.fit(X_train, y_train)

X_train_enc = pd.DataFrame(
    data=preprocessor.transform(X_train),
    columns=feature_names,
    index=X_train.index,
)

lgbm_tuned = LGBMClassifier(
    scale_pos_weight=ratio,
    random_state=123,
    learning_rate = randomizedcv_lgbm.best_params_["lgbmclassifier__learning_rate"],
    max_depth = randomizedcv_lgbm.best_params_["lgbmclassifier__max_depth"],
    n_estimators = randomizedcv_lgbm.best_params_["lgbmclassifier__n_estimators"],
    num_leaves = randomizedcv_lgbm.best_params_["lgbmclassifier__num_leaves"],
)

lgbm_tuned.fit(X_train_enc, y_train)
lgbm_explainer = shap.TreeExplainer(lgbm_tuned)
lgbm_shap_values = lgbm_explainer.shap_values(X_train_enc)

In [None]:
# shap summary plot
shap.summary_plot(lgbm_shap_values[0], X_train_enc)

> The plot above shows the feature importances of the LightGBM model, ranked from most important at the top to least important at the bottom. `PAY_1`, the repayment status in September, is the most important feature, followed by pay amount and limit balance. This ties back to the exploratory plots from earlier: gender and marriage did not seem to have an effect on default there, and those features are also not very important in this model.
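The ordering in a SHAP summary plot is based on the magnitude of each feature's SHAP values across all samples. Below is a minimal sketch of that ranking on a tiny, made-up SHAP matrix (the numbers and the `toy_` names are illustrative, not output from the model above):

```python
# Rows are samples, columns are features; values are made-up SHAP contributions.
toy_features = ["PAY_1", "LIMIT_BAL", "SEX_2"]
toy_shap = [
    [0.90, -0.20, 0.01],
    [-0.40, 0.30, -0.02],
    [0.70, -0.10, 0.00],
    [-0.60, 0.20, 0.03],
]


def mean_abs_by_feature(matrix, names):
    """Mean absolute SHAP value per feature (column)."""
    n_rows = len(matrix)
    return {
        name: sum(abs(row[j]) for row in matrix) / n_rows
        for j, name in enumerate(names)
    }


importance = mean_abs_by_feature(toy_shap, toy_features)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)  # ['PAY_1', 'LIMIT_BAL', 'SEX_2']
```

Because absolute values are used, a feature ranks highly whether it pushes predictions toward default or away from it; the direction is shown by the colour and position of the points in the plot.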

## Results on Test Set <a name="7"></a>


In [None]:
# predict on test set using best model
best_model = randomizedcv_lgbm.best_estimator_.fit(X_train, y_train)
preds = best_model.predict(X_test)
pd.DataFrame(f1_score(y_test, preds, average = 'macro'), index = ["F1_macro on test set"], columns = ["LGBMClassifier"])

In [None]:
# classification report
print(classification_report(y_test, best_model.predict(X_test), target_names=["No default", "Default"]))

In [None]:
# confusion matrix
plot_confusion_matrix(best_model, 
                      X_test, 
                      y_test, 
                      display_labels=["No default", "Default on payment"], 
                      values_format="d", 
                      cmap=plt.cm.Blues,)

> The F1 macro score on the test set using LGBMClassifier is 0.693. I trust the results because the test score is consistent with the cross validation score and the test set has not been touched or influenced during training or preprocessing. Further improvements include feature engineering and more model exploration. 
