## 3. Gradient Boosted Trees with XGBoost

1. Trees for defaults

You will now train a gradient boosted tree model on the credit data and look at a sample of its predictions. Do you remember when you first looked at the predictions of the logistic regression model? They didn't look good. Do you think this model will be different?

In [20]:
# Import the libraries used in this section
import pandas as pd
import matplotlib as mtlb
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
import xgboost as xgb

In [2]:
# Read in the csv file
clean_loan_data = pd.read_csv('clean_loan_data.csv')

In [3]:
# Create two data sets for numeric and non-numeric data
cred_num = clean_loan_data.select_dtypes(exclude=['object'])
cred_str = clean_loan_data.select_dtypes(include=['object'])

# One-hot encode the non-numeric columns
cred_str_onehot = pd.get_dummies(cred_str)

# Union the one-hot encoded columns to the numeric ones
cr_loan_prep = pd.concat([cred_num, cred_str_onehot], axis=1)

In [4]:
# Create the X and y data sets
X = cr_loan_prep.drop(['loan_status'], axis=1)
y = cr_loan_prep[['loan_status']]

# Use train_test_split to create the training and test sets (40% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)

In [5]:
# Train a gradient boosted tree model with the default hyperparameters
clf_gbt = xgb.XGBClassifier().fit(X_train, np.ravel(y_train))

# Predict with a model
gbt_preds = clf_gbt.predict_proba(X_test)

# Create data frames of the predicted probabilities of default and the true labels
preds_df = pd.DataFrame(gbt_preds[:,1], columns = ['prob_default'])
true_df = y_test

# Concatenate the two data frames for comparison (uncomment to print)
#print(pd.concat([true_df.reset_index(drop = True), preds_df], axis = 1))

2. Gradient boosted portfolio performance

At this point you've looked at predicting probability of default using both LogisticRegression() and XGBClassifier(). You've looked at some scoring and have seen samples of the predictions, but what is the overall effect on portfolio performance? Try using expected loss as a scenario to express the importance of testing different models.

The data frame portfolio, built below, combines the probabilities of default from both models, the loss given default (assume an LGD of 20% for now), and loan_amnt, which is assumed to be the exposure at default.
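
Expected loss for a single loan is the product of its probability of default (PD), loss given default (LGD), and exposure at default (EAD). A minimal sketch of that arithmetic follows; the helper `expected_loss` and the loan figures are made up for illustration and are not part of the notebook.

```python
def expected_loss(prob_default, lgd, exposure):
    """Expected loss for one loan: PD * LGD * EAD."""
    return prob_default * lgd * exposure

# Illustrative loans (hypothetical values): (PD, LGD, exposure)
loans = [
    (0.05, 0.2, 10_000),
    (0.50, 0.2, 2_000),
]

# Portfolio expected loss is the sum of the per-loan expected losses
per_loan = [expected_loss(pd_, lgd, ead) for pd_, lgd, ead in loans]
total = sum(per_loan)

print([round(x, 2) for x in per_loan])
print(round(total, 2))
```

The first loan contributes about 100 and the second about 200, so a small, risky loan can weigh more on the total than a large, safe one.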

In [6]:
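# Load the logistic regression probabilities of default from file
lr_prob_default = pd.read_csv('logreg_prob_default.csv')

# Check that both sets of predictions have the same number of rows
print(len(lr_prob_default))
print(len(preds_df))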

In [16]:
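# Build the portfolio from a copy so that preds_df itself is not modified
portfolio = preds_df.copy()
portfolio['lr_prob_default'] = lr_prob_default
# Note: this assignment aligns on the row index, so it takes the first
# len(portfolio) rows of cr_loan_prep rather than the X_test rows
portfolio['loan_amnt'] = cr_loan_prep[['loan_amnt']]
portfolio['lgd'] = 0.2
print(portfolio.head(2))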

   prob_default  lr_prob_default  loan_amnt  lgd
0      0.990942         0.445779       1000  0.2
1      0.983987         0.223447       5500  0.2


In [17]:
print(portfolio.head(5))

# Create expected loss columns for each model: expected loss = PD * LGD * EAD
portfolio['gbt_expected_loss'] = portfolio['prob_default'] * portfolio['lgd'] * portfolio['loan_amnt']
portfolio['lr_expected_loss'] = portfolio['lr_prob_default'] * portfolio['lgd'] * portfolio['loan_amnt']

# Print the sum of the expected loss for lr
print('LR expected loss: ', np.sum(portfolio['lr_expected_loss']))

# Print the sum of the expected loss for gbt
print('GBT expected loss: ', np.sum(portfolio['gbt_expected_loss']))

   prob_default  lr_prob_default  loan_amnt  lgd
0      0.990942         0.445779       1000  0.2
1      0.983987         0.223447       5500  0.2
2      0.000807         0.288558      35000  0.2
3      0.001239         0.169358      35000  0.2
4      0.084892         0.114182       2500  0.2
LR expected loss:  4752523.770084315
GBT expected loss:  4407645.75624012


3. Assessing gradient boosted trees

You have now used XGBClassifier() to predict probabilities of default. The model also has a .predict() method, which returns the predicted class for loan_status instead of a probability.

You should check the model's initial performance by looking at the metrics from the classification_report(). Keep in mind that you have not set thresholds for these models yet.
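
To make the note about thresholds concrete: for a binary classifier, turning probabilities into class labels amounts to comparing each probability with a cutoff, and .predict() on a binary XGBClassifier is understood here to behave like a 0.5 cutoff. The sketch below uses made-up probabilities and a hypothetical helper `labels_at` to show how the labels shift when the cutoff changes.

```python
import numpy as np

# Hypothetical probabilities of default, shaped like predict_proba(X_test)[:, 1]
prob_default = np.array([0.99, 0.62, 0.45, 0.08])

def labels_at(prob_default, threshold):
    """Assign 1 (default) when the probability exceeds the threshold, else 0."""
    return (prob_default > threshold).astype(int)

# Cutoff of 0.5, then a lower cutoff that flags more loans as defaults
print(labels_at(prob_default, 0.5))
print(labels_at(prob_default, 0.4))
```

Lowering the cutoff from 0.5 to 0.4 moves the third loan into the default class, which would generally raise recall for defaults at the cost of precision.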

In [21]:
# Predict the labels for loan status
gbt_preds = clf_gbt.predict(X_test)

# Check the values created by the predict method
print(gbt_preds)

# Print the classification report of the model
target_names = ['Non-Default', 'Default']
print(metrics.classification_report(y_test, gbt_preds, target_names=target_names))

[1 1 0 ... 0 0 0]
              precision    recall  f1-score   support

 Non-Default       0.93      0.99      0.96      9198
     Default       0.94      0.74      0.83      2586

    accuracy                           0.93     11784
   macro avg       0.94      0.86      0.89     11784
weighted avg       0.93      0.93      0.93     11784
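
The f1-score column in the report is the harmonic mean of precision and recall. As a rough check, the sketch below recomputes it from the rounded precision and recall values printed above; the helper `f1_from` is hypothetical, and because the inputs are already rounded, the last digit may not always match the report.

```python
def f1_from(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall taken from the classification report above
f1_non_default = f1_from(0.93, 0.99)
f1_default = f1_from(0.94, 0.74)

print(round(f1_non_default, 2))
print(round(f1_default, 2))
```

The lower F1 for the Default class comes from its recall of 0.74: roughly a quarter of the true defaults in the test set are predicted as non-defaults.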

