![WMLE LOGOS](https://github.com/sanjayksau/wmle2024/blob/main/logo3.png?raw=true)

# XGBoost Workbook: Loan Default Prediction Using HELOC Dataset



## Introduction
In this workbook, we will use XGBoost to predict whether a customer will default on their Home Equity Line of Credit (HELOC) based on several financial features.

A Home Equity Line of Credit (HELOC) is a revolving line of credit that lets homeowners borrow against the equity in their home, using the home as collateral. Home equity is the difference between the home's current market value and the outstanding balance of any loans secured by it; banks typically offer a HELOC as a percentage of that equity.

The target variable in this dataset is a binary variable called RiskPerformance. The value “Bad” indicates that an applicant was 90 days past due or worse at least once over a period of 24 months from when the credit account was opened. The value “Good” indicates that they have made their payments without ever being more than 90 days overdue.

There are 23 predictors in the dataset. See the links below for details on the dataset:


- https://pbiecek.github.io/xai_stories/story-heloc-credits.html

- https://docs.interpretable.ai/stable/examples/fico/

- https://community.fico.com/s/explainable-machine-learning-challenge

## Install Required Libraries

In [1]:
!pip install xgboost



## Load the HELOC Dataset
First, let's load the HELOC dataset from GitHub.





In [2]:
import pandas as pd

#TODO: Load the HELOC dataset
url = "https://raw.githubusercontent.com/benoitparis/explainable-challenge/refs/heads/master/heloc_dataset_v1.csv"

#TODO: Display the first few rows of the dataset


|  | RiskPerformance | ExternalRiskEstimate | MSinceOldestTradeOpen | MSinceMostRecentTradeOpen | AverageMInFile | NumSatisfactoryTrades | NumTrades60Ever2DerogPubRec | NumTrades90Ever2DerogPubRec | PercentTradesNeverDelq | MSinceMostRecentDelq | ... | PercentInstallTrades | MSinceMostRecentInqexcl7days | NumInqLast6M | NumInqLast6Mexcl7days | NetFractionRevolvingBurden | NetFractionInstallBurden | NumRevolvingTradesWBalance | NumInstallTradesWBalance | NumBank2NatlTradesWHighUtilization | PercentTradesWBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bad | 55 | 144 | 4 | 84 | 20 | 3 | 0 | 83 | 2 | ... | 43 | 0 | 0 | 0 | 33 | -8 | 8 | 1 | 1 | 69 |
| 1 | Bad | 61 | 58 | 15 | 41 | 2 | 4 | 4 | 100 | -7 | ... | 67 | 0 | 0 | 0 | 0 | -8 | 0 | -8 | -8 | 0 |
| 2 | Bad | 67 | 66 | 5 | 24 | 9 | 0 | 0 | 100 | -7 | ... | 44 | 0 | 4 | 4 | 53 | 66 | 4 | 2 | 1 | 86 |
| 3 | Bad | 66 | 169 | 1 | 73 | 28 | 1 | 1 | 93 | 76 | ... | 57 | 0 | 5 | 4 | 72 | 83 | 6 | 4 | 3 | 91 |
| 4 | Bad | 81 | 333 | 27 | 132 | 12 | 0 | 0 | 100 | -7 | ... | 25 | 0 | 1 | 1 | 51 | 89 | 3 | 1 | 0 | 80 |
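
One possible way to complete the two TODOs above is `pd.read_csv` followed by `head()`. The sketch below is only an illustration: to stay runnable offline it reads a three-row in-memory excerpt (`sample_csv`, a stand-in assumed here, with values copied from the preview above) instead of the real file. Passing `url` to `pd.read_csv` should work the same way, assuming the link is reachable.

```python
import io

import pandas as pd

# Stand-in for the real file: three rows and three columns copied from the
# preview above. In the workbook you would pass `url` instead of this buffer.
sample_csv = io.StringIO(
    "RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen\n"
    "Bad,55,144\n"
    "Bad,61,58\n"
    "Bad,67,66\n"
)

# pd.read_csv accepts a URL, a local path, or a file-like object.
df_sample = pd.read_csv(sample_csv)

# head() returns the first rows (five by default) for a quick sanity check.
print(df_sample.head())
print(df_sample.shape)
```

The name `df_sample` is used here only to avoid clashing with the workbook's own `df`.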


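## Data Preprocessing
Before training the model, we need to clean and preprocess the data. This includes handling missing values and ensuring that the target variable (RiskPerformance) is binary (0 for "Good", 1 for "Bad"). Note that this dataset marks missing or unavailable information with special negative codes (such as the -7 and -8 visible in the preview above) rather than NaN, so `dropna()` alone does not remove those entries.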

In [3]:
# Drop rows that contain NaN values, for simplicity.
# Special negative codes such as -7 and -8 are ordinary numbers and are not removed here.
df = df.dropna()

# Map target labels: 'RiskPerformance' to binary (1 for "Bad", 0 for "Good")
df['RiskPerformance'] = df['RiskPerformance'].map({'Good': 0, 'Bad': 1})

# TODO: Display the first few rows after preprocessing.
# Verify that the target column RiskPerformance has been converted to binary (0 for "Good", 1 for "Bad").


|  | RiskPerformance | ExternalRiskEstimate | MSinceOldestTradeOpen | MSinceMostRecentTradeOpen | AverageMInFile | NumSatisfactoryTrades | NumTrades60Ever2DerogPubRec | NumTrades90Ever2DerogPubRec | PercentTradesNeverDelq | MSinceMostRecentDelq | ... | PercentInstallTrades | MSinceMostRecentInqexcl7days | NumInqLast6M | NumInqLast6Mexcl7days | NetFractionRevolvingBurden | NetFractionInstallBurden | NumRevolvingTradesWBalance | NumInstallTradesWBalance | NumBank2NatlTradesWHighUtilization | PercentTradesWBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 55 | 144 | 4 | 84 | 20 | 3 | 0 | 83 | 2 | ... | 43 | 0 | 0 | 0 | 33 | -8 | 8 | 1 | 1 | 69 |
| 1 | 1 | 61 | 58 | 15 | 41 | 2 | 4 | 4 | 100 | -7 | ... | 67 | 0 | 0 | 0 | 0 | -8 | 0 | -8 | -8 | 0 |
| 2 | 1 | 67 | 66 | 5 | 24 | 9 | 0 | 0 | 100 | -7 | ... | 44 | 0 | 4 | 4 | 53 | 66 | 4 | 2 | 1 | 86 |
| 3 | 1 | 66 | 169 | 1 | 73 | 28 | 1 | 1 | 93 | 76 | ... | 57 | 0 | 5 | 4 | 72 | 83 | 6 | 4 | 3 | 91 |
| 4 | 1 | 81 | 333 | 27 | 132 | 12 | 0 | 0 | 100 | -7 | ... | 25 | 0 | 1 | 1 | 51 | 89 | 3 | 1 | 0 | 80 |
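
Because `dropna()` only removes `NaN`, negative codes such as the -7 and -8 in the preview survive it. The sketch below uses a small hypothetical frame (`demo`, with values taken from the first three preview rows) to count those codes per column and, optionally, convert them to `NaN`, which XGBoost treats as missing. Treating every negative value as a special code is an assumption made for this illustration; check the dataset documentation linked in the introduction before applying it to the full data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; the values come from the preview rows above.
demo = pd.DataFrame({
    "RiskPerformance": ["Bad", "Bad", "Bad"],
    "MSinceMostRecentDelq": [2, -7, -7],
    "NetFractionInstallBurden": [-8, -8, 66],
})

# Map the target to 0/1 as in the cell above.
demo["RiskPerformance"] = demo["RiskPerformance"].map({"Good": 0, "Bad": 1})

features = demo.drop(columns="RiskPerformance")

# dropna() leaves the frame unchanged because there are no NaN values yet.
rows_after_dropna = len(demo.dropna())
print(rows_after_dropna, "of", len(demo), "rows kept by dropna()")

# Count negative entries per feature (assumed here to be special codes).
special_counts = (features < 0).sum()
print(special_counts)

# Optionally turn them into NaN so that XGBoost handles them as missing.
features_nan = features.mask(features < 0, np.nan)
print(features_nan)
```

Whether to keep the codes as numbers, convert them to `NaN`, or drop the affected rows is a modeling choice; this sketch only shows how to inspect them.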


## Split the Dataset
We will now split the dataset into training and testing sets to evaluate the model.

In [4]:
from sklearn.model_selection import train_test_split

#Split the data into features (X) and target (y)
X = df.drop(columns='RiskPerformance')  # Features

#TODO: Assign the 'RiskPerformance' column as the target column: y

#TODO: Split the data into training (80%) and test (20%) sets

#TODO: Display the shapes of the train and test sets


((8367, 23), (2092, 23))
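
A possible completion of the TODOs above, shown on synthetic data so that it runs without the download. The frame `fake_df` is a hypothetical stand-in with 10,459 rows (the sum of the two row counts in the output above), 23 feature columns, and a binary target. `test_size=0.2` gives the 80/20 split; `random_state=42` and `stratify` are illustrative choices and not necessarily what produced the output above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,459 rows, 23 numeric features and a binary target.
rng = np.random.default_rng(0)
n_rows, n_features = 10459, 23
fake_df = pd.DataFrame(
    rng.integers(0, 100, size=(n_rows, n_features)),
    columns=[f"feature_{i}" for i in range(n_features)],
)
fake_df["RiskPerformance"] = rng.integers(0, 2, size=n_rows)

X_demo = fake_df.drop(columns="RiskPerformance")  # features
y_demo = fake_df["RiskPerformance"]               # target

# 80% train / 20% test; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Display the shapes of the train and test feature sets.
print((X_tr.shape, X_te.shape))
```

In the workbook itself the same call would take `X` and `y` and assign to `X_train`, `X_test`, `y_train`, and `y_test`, which the later cells expect.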

## Train an XGBoost Model
Next, we will train an XGBoost model using the training data.

In [5]:
import xgboost as xgb

# Instantiate the XGBoost classifier
# objective='binary:logistic': logistic regression for binary classification; outputs probabilities
# eval_metric='logloss': negative log-likelihood
# (use_label_encoder is omitted because this XGBoost version reports it as unused)
xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss')

# Fit the model
xgb_model.fit(X_train, y_train)

# Model training completed
print("XGBoost model training completed.")

XGBoost model training completed.
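
To make the `logloss` metric concrete, the sketch below computes the binary log loss (the mean negative log-likelihood) by hand for a few hypothetical labels and predicted probabilities. The function name `binary_log_loss` and the sample numbers are illustrative only; they are not part of the XGBoost API and are not results from this dataset.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 to avoid log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels (1 = "Bad", 0 = "Good") and predicted P("Bad").
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly right and confident
uninformed = [0.5, 0.5, 0.5, 0.5]  # always predicts 50%

loss_confident = binary_log_loss(labels, confident)
loss_uninformed = binary_log_loss(labels, uninformed)  # equals ln(2)

print(loss_confident)
print(loss_uninformed)
```

Lower is better: confident and correct probabilities give a smaller loss than always predicting 50%, which yields ln(2), roughly 0.693.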


## Hyperparameter Tuning using GridSearchCV
Now, let’s perform hyperparameter tuning using GridSearchCV to find the best parameters for our XGBoost model.

In [6]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for GridSearchCV
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
}

# Instantiate the GridSearchCV object with 3-fold cross-validation
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, verbose=1, n_jobs=-1)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Output the best parameters found by GridSearchCV
print("Best parameters: ", grid_search.best_params_)

# Get the best model from the grid search
best_model = grid_search.best_estimator_


Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters:  {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
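
The counts in the log line above follow directly from the grid: 3 values for each of 3 hyperparameters give 3 × 3 × 3 = 27 candidates, and 3-fold cross-validation fits each candidate 3 times. The sketch below reproduces that arithmetic with `itertools.product`; it only enumerates the combinations and does not train anything.

```python
from itertools import product

param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(f"{n_candidates} candidates, {n_fits} fits")
print(candidates[0])
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set; that extra fit is not part of the 81. Adding a fourth value to any one list would raise the count to 36 candidates and 108 fits, so grids grow quickly.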





## Evaluate the Tuned Model
Finally, we evaluate the performance of the model after hyperparameter tuning by making predictions on the test set.

In [7]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Make predictions with the best model found by GridSearchCV

#TODO: call the best_model.predict with test samples: X_test as argument and assign it y_pred


# Confusion matrix for the tuned model
# Entry (i, j) counts samples whose true label is class i and whose predicted label is class j.
# With labels 0 ("Good") and 1 ("Bad", the positive class), the layout is:
# [[tn fp]
#  [fn tp]]
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix (Tuned Model):")
print(conf_matrix)


# Evaluate the model's accuracy
# The accuracy score is the fraction of correctly classified samples.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after tuning: ", accuracy)


# TODO: Compute the following metrics:
# precision: tp / (tp + fp)
# recall: tp / (tp + fn)
# classification_report
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
# https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.classification_report.html



Confusion matrix (Tuned Model):
[[652 352]
 [275 813]]
Accuracy after tuning:  0.7002868068833652
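
As a cross-check for the metrics TODO above, precision and recall can be derived by hand from the confusion matrix that was printed. The sketch assumes scikit-learn's layout for labels 0 and 1 (rows are true classes, columns are predicted classes, so the matrix reads `[[tn, fp], [fn, tp]]`) and treats "Bad" (1) as the positive class.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
conf = [[652, 352],
        [275, 813]]

tn, fp = conf[0]
fn, tp = conf[1]

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of all predicted "Bad", how many were truly "Bad"
recall = tp / (tp + fn)     # of all truly "Bad", how many were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way matches the value printed by `accuracy_score`, and the results from `precision_score` and `recall_score` on `y_test` and `y_pred` should agree with the hand-computed figures.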


In [8]:
# TODO: Possible extensions
# - Modify the parameter grid in GridSearchCV to explore other combinations of hyperparameters.
# - Apply feature scaling and feature engineering techniques and check whether model performance improves.