# Clients and evaluators
To use the interactive features of this notebook, you must be signed in to Kaggle (accounts are free) and enable Internet in the settings panel on the right. Do not enable an accelerator: neither the GPU nor the TPU is used to train these models, and they do not speed up processing.

Click **Run All** above. Expect roughly three and a half minutes of loading time before the visualizations appear.

If you want to see how the data was processed, or investigate the models in more depth, scroll down.

In [None]:
# install packages and import libraries
print("Installing packages...")
!pip install -Uqqq --use-feature=2020-resolver pycaret
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pycaret.classification import * # machine learning library
from ipywidgets import interact, interactive
from IPython.display import display

print("Loading data...")
# read data
data = pd.read_csv('../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')

# clean data
# education: 0, 5, and 6 are undocumented or unknown codes, so fold them into 4 (others)
ed_filter = data.EDUCATION.isin([0, 5, 6])
data.loc[ed_filter, 'EDUCATION'] = 4

# marriage: 0 is an undocumented code, so fold it into 3 (others)
data.loc[data.MARRIAGE == 0, 'MARRIAGE'] = 3

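# split dataset into learning and verification
# sample first and drop by the original index labels, then reset the index;
# resetting before the drop would remove the first rows by position instead of the sampled rows
dataset = data.sample(frac=0.999, random_state=8675309)
data_unseen = data.drop(dataset.index).reset_index(drop=True)
dataset = dataset.reset_index(drop=True)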

# do pycaret setup
classifier = setup(
    data=dataset,
    target="default.payment.next.month",
    ignore_features=["ID"],
    silent=True,
    verbose=False,
    profile=False,
    session_id=8675309,
)

print("Training Logistic Regression model...")
log_reg = create_model("lr", verbose=False)
tuned_log_reg = tune_model(log_reg, choose_better=True, verbose=False)
final_log_reg = finalize_model(tuned_log_reg)
log_reg_results = predict_model(final_log_reg, probability_threshold=0.46, data=data_unseen)

print("Training Light Gradient Boosting Machine...")
grad_boost = create_model("lightgbm", verbose=False)
tuned_grad_boost = tune_model(grad_boost, choose_better=True, verbose=False)
final_grad_boost = finalize_model(tuned_grad_boost)
grad_boost_results = predict_model(final_grad_boost, probability_threshold=0.34, data=data_unseen)

print("Setup complete.")
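
A holdout split only works if the held-out rows are truly absent from the training sample. The following is a minimal sketch of that pattern on a toy DataFrame (the column values and the `toy`, `train`, and `unseen` names are made up for illustration). The key detail is that `drop` matches on index labels, so the sampled frame has to keep its original index until after the drop.

```python
import pandas as pd

# Toy stand-in for the credit card data (hypothetical values).
toy = pd.DataFrame({"ID": range(1, 1001), "LIMIT_BAL": range(1000, 2000)})

# Sample 99.9% of the rows for training; keep the original index for now.
train = toy.sample(frac=0.999, random_state=0)

# Drop the sampled rows by their original index labels to get the held-out rows.
unseen = toy.drop(train.index).reset_index(drop=True)

# Only now is it safe to renumber the training rows.
train = train.reset_index(drop=True)

# No customer ID should appear in both sets.
overlap = set(train["ID"]) & set(unseen["ID"])
print(len(train), len(unseen), len(overlap))
```

If the index were reset before the `drop`, the labels would be 0 through 998 and the drop would simply remove the first 999 rows by position, whether or not they had been sampled.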

# Model Evaluation
Below are several metrics for evaluating both the Logistic Regression and Light Gradient Boosting Machine models. To see more metrics, scroll down to the interactive widget for each model. The widgets are not used here because, depending on the metric you select, they can add several minutes to loading times.

# Logistic Regression
The first two graphs are for the Logistic Regression model. Note that this model ended up predicting 0 (No Default) for every case, because the majority of customers did not default on their loans. This is why the Feature Importance plot is on a 1e-5 scale: no feature carried much weight in deciding whether a customer defaulted. The second graph shows the error rates for each predicted class. Since the model predicted 0 for every case, all of the errors are in the first column.
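
To see why a model that always predicts "No Default" can still look accurate, here is a minimal sketch with made-up labels. The 22% default share is an assumption chosen for illustration, not a figure taken from this notebook's output.

```python
import numpy as np

# Hypothetical labels: 1 = default, 0 = no default, with 22 defaults in 100 customers.
y_true = np.array([1] * 22 + [0] * 78)

# A model that predicts "No Default" for everyone.
y_pred = np.zeros_like(y_true)

# Accuracy only counts matches, so it equals the share of the majority class.
accuracy = (y_pred == y_true).mean()

# Recall asks how many of the real defaults were caught.
true_pos = int(((y_pred == 1) & (y_true == 1)).sum())
false_neg = int(((y_pred == 0) & (y_true == 1)).sum())
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy matches the share of non-defaulting customers while recall is zero, which is why accuracy alone is a weak guide on an imbalanced dataset like this one.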

In [None]:
# Feature Importance
plot_model(final_log_reg, plot="feature")

In [None]:
# Class prediction error (errors broken down by predicted class)
plot_model(final_log_reg, plot="error")

# Light Gradient Boosting Machine
The next three graphs are for the Light Gradient Boosting Machine. The first uses SHAP values to show each field's impact on the final result. It is similar to feature importance, but it measures each feature's contribution to individual predictions rather than the overall relationship between the feature and the outcome. PyCaret's `interpret_model` currently supports only tree-based models, which is why there is no SHAP plot for the Logistic Regression. The second graph is the feature importance plot, like the one for the Logistic Regression, but here the scale shows that the features are actually driving the decision rather than being only marginally related, as above. The third graph shows error rates as before, but this time the errors include both false positives (predicted 1, actually 0) and false negatives (predicted 0, actually 1).
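
SHAP values are built on Shapley values: a feature's contribution is its average marginal effect on the prediction across every order in which the features could be added. The sketch below computes exact Shapley values by brute force for a toy scoring function. The `score` function, its feature names, and its coefficients are all hypothetical, and this is not how the `shap` library or PyCaret computes them for tree models, where a much faster tree-specific algorithm is used.

```python
from itertools import permutations

# Hypothetical "model": a risk score built from three made-up binary inputs.
def score(late_payment, high_balance, low_limit):
    return 0.1 + 0.4 * late_payment + 0.2 * high_balance + 0.2 * late_payment * low_limit

features = ["late_payment", "high_balance", "low_limit"]
customer = {"late_payment": 1, "high_balance": 1, "low_limit": 1}
baseline = {"late_payment": 0, "high_balance": 0, "low_limit": 0}

def shapley_values(model, x, base, names):
    """Average each feature's marginal contribution over all feature orderings."""
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(base)          # start from the baseline customer
        previous = model(**current)
        for name in order:
            current[name] = x[name]   # switch one feature to the customer's value
            value = model(**current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

contributions = shapley_values(score, customer, baseline, features)
print(contributions)
```

The contributions add up to the difference between the customer's score and the baseline score, and the interaction between `late_payment` and `low_limit` is shared between those two features.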

In [None]:
# Plot SHAP values for each field to show which have the most impact on the final result.
# SHAP values usually differ from Feature Importance. There is no SHAP plot for the Logistic Regression
# model because interpret_model only supports tree-based models.
interpret_model(final_grad_boost, plot="summary")

In [None]:
# Feature Importance
plot_model(final_grad_boost, plot="feature")

In [None]:
# Class prediction error (errors broken down by predicted class)
plot_model(final_grad_boost, plot="error")

# Predictor
This example predictor lets you select a customer that the model did not see during training. The number in the dropdown is the customer's ID. After you select one, you will see the customer's information, the predictions from both the Logistic Regression and the Light Gradient Boosting Machine, and the actual result.

In [None]:
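# Predictor
def show_customer(Customer):
    # Customer is the row position in data_unseen chosen from the dropdown
    row = data_unseen.iloc[Customer]
    display(row)
    print('Predictions:')
    # Cast to int so the comparison works whether Label is stored as text or as a number
    lr_label = int(log_reg_results['Label'].iloc[Customer])
    gb_label = int(grad_boost_results['Label'].iloc[Customer])
    print('Logistic Regression: ' + ('Default' if lr_label == 1 else 'No Default'))
    print('Light Gradient Boost: ' + ('Default' if gb_label == 1 else 'No Default'))
    print('Actual: ' + ('Default' if row['default.payment.next.month'] == 1 else 'No Default'))

# Label each dropdown entry with the customer's actual ID instead of assuming a fixed ID range
interact(show_customer, Customer=[(str(int(cid)), pos) for pos, cid in enumerate(data_unseen['ID'])])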

# Processing and model-building code
If you want to see how the Light Gradient Boosting Machine was chosen and how the models were tuned, look below. Comments in the code blocks explain what each step does.

In [None]:
# Compare models to find the best algorithm
# Ridge classifier excluded because it does not output class probabilities, which the probability thresholds used in this notebook rely on
# Gradient boost, extreme gradient boost, catboost, svm, and linear discriminant removed for time
compare_models(exclude = ['ridge', 'gbc', 'xgboost', 'catboost', 'svm', 'lda'])

In [None]:
# Calculate optimal threshold for Logistic Regression
# False negatives (missed defaults) carry the largest penalty and true positives the largest reward,
# so the chosen threshold favors catching defaults
optimize_threshold(final_log_reg, true_positive=10, true_negative=5, false_positive=-10, false_negative=-15)

In [None]:
# Calculate optimal threshold for Light Gradient Boosting Machine
# False negatives (missed defaults) carry the largest penalty and true positives the largest reward,
# so the chosen threshold favors catching defaults
optimize_threshold(final_grad_boost, true_positive=10, true_negative=5, false_positive=-10, false_negative=-15)
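
The idea behind the two threshold cells above can be sketched without PyCaret: score every candidate threshold by the total payoff of its confusion matrix and keep the best one. This is a simplified illustration, not PyCaret's implementation. The labels and probabilities are made up, the helper names are hypothetical, and using `>=` for the cutoff is an assumption. Only the four payoff values are copied from the calls above.

```python
import numpy as np

# Payoffs copied from the optimize_threshold calls above.
PAYOFF = {"tp": 10, "tn": 5, "fp": -10, "fn": -15}

def total_payoff(y_true, y_prob, threshold):
    """Sum the payoff of every prediction made at the given probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return tp * PAYOFF["tp"] + tn * PAYOFF["tn"] + fp * PAYOFF["fp"] + fn * PAYOFF["fn"]

def best_threshold(y_true, y_prob, thresholds):
    """Return the first threshold with the highest total payoff, and that payoff."""
    payoffs = [total_payoff(y_true, y_prob, t) for t in thresholds]
    best = int(np.argmax(payoffs))
    return float(thresholds[best]), payoffs[best]

# Hypothetical labels and predicted probabilities of default.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90])
thresholds = np.arange(0, 101) / 100  # 0.00, 0.01, ..., 1.00

threshold, payoff = best_threshold(y_true, y_prob, thresholds)
print(threshold, payoff)
```

Because a missed default costs more than a false alarm, the best threshold in this toy data sits below 0.5, in the same direction as the 0.46 and 0.34 thresholds used in this notebook.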

In [None]:
# This evaluates the Logistic Regression model by several useful metrics.
# Some of these, such as Feature Selection and Manifold Learning, take a very long time to calculate.
evaluate_model(final_log_reg)

In [None]:
# This evaluates the Light Gradient Boosting Machine model by several useful metrics.
# Some of these, such as Feature Selection and Manifold Learning, take a very long time to calculate.
evaluate_model(final_grad_boost)

In [None]:
# Test the model against the holdout data set to get accuracy information.
# Note that this uses the tuned model rather than the finalized one: finalize_model retrains on the
# full dataset, holdout included, so the finalized model has no holdout left to be evaluated on.
predict_model(tuned_log_reg, probability_threshold=0.46)

In [None]:
# Test the model against the holdout data set to get accuracy information.
# Note that this uses the tuned model rather than the finalized one: finalize_model retrains on the
# full dataset, holdout included, so the finalized model has no holdout left to be evaluated on.
predict_model(tuned_grad_boost, probability_threshold=0.34)

In [None]:
# View final results from logistic regression against unseen data
log_reg_results

In [None]:
# View final results from light gradient boosting machine against unseen data
grad_boost_results