# Predictive Modeling with XGBoost

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/noportman/noportman.github.io/blob/main/docs/notebooks/XGBoost.ipynb)


![Python](https://img.shields.io/badge/Python-3.10-blue)
![NumPy](https://img.shields.io/badge/NumPy-Numerical%20Computing-orange?logo=numpy)
![Pandas](https://img.shields.io/badge/Pandas-Data%20Wrangling-lightgrey?logo=pandas)
![Matplotlib](https://img.shields.io/badge/Matplotlib-Visualization-blue?logo=matplotlib)
![Scikit-Learn](https://img.shields.io/badge/scikit--learn-ML%20Toolkit-f7931e?logo=scikit-learn)
![XGBoost](https://img.shields.io/badge/XGBoost-Gradient%20Boosting-red?logo=xdot)

![Status](https://img.shields.io/badge/Status-Completed-brightgreen)
![License](https://img.shields.io/badge/License-MIT-yellow)

## Introduction

This notebook walks through an end-to-end workflow with NumPy, Pandas, Matplotlib, and XGBoost: load the loan data, train a gradient-boosted classifier, and evaluate it with accuracy and ROC AUC.

## Module Import

In [None]:
# !pip install xgboost==1.6.1

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime

import xgboost
from sklearn.metrics import (
    accuracy_score,
    auc,
    confusion_matrix,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

import warnings

warnings.simplefilter(action="ignore", category=UserWarning)

## Dataset Import

We use the LendingClub loans dataset, with `default` as the target column.

In [None]:
url = "https://docs.google.com/spreadsheets/d/10L8BpkV4q1Zsou4daYoWul_8PFA9rsv2/export?format=csv&id=10L8BpkV4q1Zsou4daYoWul_8PFA9rsv2&gid=1710894028"
df = pd.read_csv(url, index_col=False)

In [None]:
df.info()

In [None]:
df.head(6)

In [None]:
df.default.value_counts(normalize=True)
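
The class proportions printed above also give a useful yardstick: a model that always predicts the majority class already reaches an accuracy equal to that class's share. A minimal sketch of the calculation, using a hypothetical 80/20 label column (`labels`) in place of `df["default"]`:

```python
import pandas as pd

# Hypothetical label column standing in for df["default"]: 80% zeros, 20% ones
labels = pd.Series([0] * 80 + [1] * 20, name="default")

# Share of each class, as in df.default.value_counts(normalize=True)
class_share = labels.value_counts(normalize=True)

# A classifier that always predicts the majority class scores this accuracy
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()
print(f"Always predicting {majority_class} gives {baseline_accuracy:.0%} accuracy")
```

Any accuracy reported later is easier to judge against this baseline than in isolation.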

## Training and Test Datasets

Let's split the data 70/30 into a training set (which we will use to build models) and a test set (on which we will evaluate any model we build).

In [None]:
X = df.drop(["default"], axis=1)
y = df["default"]


# Encode string class values as integers to avoid errors in newer versions of XGBoost
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(y)
y = label_encoder.transform(y)


# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
eval_set = [(X_test, y_test)]
print(X_train.shape, X_test.shape)
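
The cell above passes the test set as `eval_set`. When early stopping is driven by that same set, the test metrics can come out somewhat optimistic. One common alternative, sketched here on synthetic data, is to carve a separate validation set out of the training portion and stratify both splits on the label. All names ending in `_demo` and the two feature columns are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loans table (hypothetical columns, not the real data)
rng = np.random.default_rng(7)
X_demo = pd.DataFrame(
    {"feature_a": rng.normal(size=1000), "feature_b": rng.normal(size=1000)}
)
y_demo = (rng.random(1000) < 0.2).astype(int)  # roughly 20% positives

# Step 1: hold out 30% as the final test set, preserving the class ratio
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7, stratify=y_demo
)

# Step 2: split the remaining 70% again to get a validation set for early stopping
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=7, stratify=y_trainval
)

# The validation set, not the test set, is what early stopping would monitor
eval_set_demo = [(X_val_demo, y_val_demo)]
print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```

With this layout the test set is touched only once, for the final evaluation.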

In [None]:
print("Initializing xgboost.sklearn.XGBClassifier and starting training...")

st = datetime.now()

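# eval_metric and early_stopping_rounds are set on the estimator (supported since
# XGBoost 1.6); passing them to fit() is deprecated and fails in recent releases.
clf = xgboost.sklearn.XGBClassifier(
    objective="binary:logistic",
    learning_rate=0.05,
    seed=9616,
    max_depth=20,
    gamma=10,
    n_estimators=500,
    eval_metric="auc",
    early_stopping_rounds=20,
)

# Training stops once AUC on eval_set has not improved for 20 consecutive rounds
clf.fit(
    X_train,
    y_train,
    eval_set=eval_set,
    verbose=False,
)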

print(f"Training time: {datetime.now() - st}")

# Make predictions: hard class labels for accuracy, probabilities for ROC AUC
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print(f"Total time including prediction: {datetime.now() - st}")

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# ROC AUC is a ranking metric, so score it on predicted probabilities, not hard labels
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC AUC: %.4f" % roc_auc)
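
ROC AUC measures how well a model ranks positives above negatives, so it is normally computed from predicted probabilities (for example `clf.predict_proba(X_test)[:, 1]`) rather than from the 0/1 output of `predict`. The toy example below, with made-up labels and scores, shows how much information thresholding can discard:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up example: every positive has a higher score than every negative,
# but only one positive crosses the default 0.5 threshold
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.90])
hard_labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)       # ranking is perfect
auc_from_labels = roc_auc_score(y_true, hard_labels)  # ranking collapsed to 0/1

print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

On an imbalanced target such as loan defaults, few predicted probabilities may exceed 0.5, so the gap between the two numbers can be substantial.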

In [None]:
# Remember: the F score shown here (importance_type="weight", the default) counts
# how often a feature is used to split the data across all trees in the model.
# It gives a relative sense of importance, not causality.

xgboost.plot_importance(clf)

## Model Interpretation

**1. Top Predictive Features:**

`fico_score` is the most frequently used split feature (F score: 83), suggesting that the model leans heavily on creditworthiness when predicting default.

`installment` (72) and `rev_balance` (58) follow closely, indicating that loan repayment terms and revolving balance also carry substantial weight in the model's decisions.

**2. Moderately Important Features:**

`inquiries` (52) and `log_income` (47) contribute meaningfully, possibly capturing borrower activity and financial capability.

**3. Low Importance Feature:**

`records` (11) is rarely used for splits. This may mean that it has little variance or that it adds little information about default risk beyond the other features.