# Home Credit Default Risk

*Can you predict how capable each applicant is of repaying a loan?*

Spencer Brothers


## Tasks

- Set up a training set and a validation set from the `application_train.csv` data set to do cross-validation. Alternatively, you could perform cross-validation using a different framework, such as k-fold cross-validation as implemented in modeling packages such as caret, tidymodels, or scikit-learn. The model performance that matters, of course, is the estimated performance on the test set as well as the Kaggle score.

- Identify the performance benchmark established by the majority class classifier.

- Fit several different logistic regression models using different predictors. Do interaction terms improve the model? Compare model performance using not just accuracy but also AUC.

- Explore using algorithms like random forest and gradient boosting. Compare model performance.

- Perform the data transformations required by a given algorithm. For example, some algorithms require numeric data and perform better when it has been standardized or normalized.

- Experiment with upsampling and downsampling the data to adjust for the imbalanced target variable. (See APM Ch. 16.) Does this strategy improve model performance?

- Try combining model predictions--this is called an ensemble model--to improve performance.

- Try additional feature engineering to boost model performance. Can you combine variables or bin numeric variables?  Explore the notebooks at Kaggle for data transformation ideas. In particular, use the other data sets at Kaggle--beyond the application data--to create additional features.

- For machine learning models, experiment with hyperparameter tuning to try to boost performance.


## Table of Contents

## Introduction
The next step is to explore different modeling ideas for the project, with the aim of developing a model that beats a benchmark model (such as the majority class classifier) and produces results that -- hopefully! -- can be used to solve the business problem.

### Business Problem

Most lending services are based on credit, which excludes a large demographic of people (those with no credit history) from buying a home. Taking an uninformed lending approach is an unsustainable business practice that may leave underserved populations worse off, so using smart lending practices is essential to both Home Credit’s longevity and financial equity for unbanked populations.

### Benefit of a Solution

By better modeling clients’ behaviors, Home Credit can successfully predict clients’ repayment abilities. This supports Home Credit’s goals in two key areas:

1. Home Credit will decrease the costs of clients defaulting on loans or making late payments, supporting Home Credit's sustainability in an ever-changing economic and political ecosystem.

2. Clients capable of repayment will receive necessary resources that empower their financial success when other financial institutions fail to lend. Loans will be given with a principal, maturity, and repayment schedule that optimize clients’ lending experience.

### Objectives of this notebook

*TO DO: Rewrite this section when finished with other sections*

- Practice feature engineering to improve model performance.

- Practice cross-validation.

- Learn about the properties of different modeling algorithms by experimenting with different methods and comparing different candidate models.

- Learn from group members.

## Data Preparation

Describe any additional data preparation related to modeling:

- variable transformations
- feature engineering
- handling of NAs.

### Setup

Import libraries and read in data.

In [36]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from tqdm import tqdm

In [66]:
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, make_scorer, precision_recall_curve


In [38]:
# mount to drive to access data
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:
# read raw data into pandas dataframes
data_folder = '/content/drive/MyDrive/MSBA_Practice_Project/data'

app_train = pd.read_csv(f'{data_folder}/application_train.csv')
app_test = pd.read_csv(f'{data_folder}/application_test.csv')

### Feature Engineering

Along with `application_{train|test}.csv`, Home Credit also provides transactional datasets with information about each applicant. For this project, we will use `previous_application.csv` in the hope of improving model performance. Because `previous_application.csv` may contain multiple records per applicant, we compute the following per-applicant aggregations before joining the transactional data with `application_{train|test}.csv`:

1. Proportion of Past Loans Refused (PROP_NAME_PREV_REFUSED)
    - A high proportion of past loan refusals may indicate financial instability or a history of risky borrowing behavior, suggesting a higher likelihood of default.
2. Average Time Since Loan Decision (AVG_DAYS_DECISION)
    - The recency of past credit decisions can provide insight into financial behavior. Frequent recent loan applications may indicate financial distress or an increased reliance on borrowing.
3. Average Previous Credit Amount (AVG_PREV_CREDIT)
    - The typical size of previous loans can serve as an indicator of financial habits. Larger past loans may suggest significant debt obligations, which could impact the ability to repay future loans.
4. Average Down Payment (AVG_DOWN_PAYMENT)
    - Consistently low down payments may suggest over-leveraging, increasing the risk of default by indicating a lack of financial reserves.
5. Average Repayment Discrepancy (AVG_PREV_REPAYMENT_DISC)
    - The difference between the expected and actual last due date for previous loans can highlight repayment behavior. Large discrepancies may indicate late payments or loan extensions, both of which are potential risk factors.
6. Average Rate of Down Payment (AVG_RATE_DOWN_PAYMENT)
    - A lower down payment rate may suggest that borrowers are stretching their finances to secure a loan, which could indicate higher financial risk.
7. Count of Previous Loans (CNT_PREV_LOANS)
    - The total number of previous loans provides insight into borrowing patterns. A high number of past loans could indicate experience in managing debt but may also suggest a dependency on credit, which could be a risk factor.

In [40]:
# read transactional data into dataframe
prev_app = pd.read_csv(f'{data_folder}/previous_application.csv')

# Create a new column for the difference between DAYS_LAST_DUE and DAYS_LAST_DUE_1ST_VERSION
prev_app["PREV_REPAYMENT_DISC"] = prev_app["DAYS_LAST_DUE"] - prev_app["DAYS_LAST_DUE_1ST_VERSION"]

# Aggregate previous applications
prev_agg = prev_app.groupby("SK_ID_CURR").agg(
    # Proportion of past loans refused
    PROP_NAME_PREV_REFUSED=("NAME_CONTRACT_STATUS", lambda x: (x == "Refused").mean()),

    # Average time since loan decision (recent approvals may indicate cash flow issues)
    AVG_DAYS_DECISION=("DAYS_DECISION", "mean"),

    # Average previous credit amount (larger past loans might indicate risky borrowing behavior)
    AVG_PREV_CREDIT=("AMT_CREDIT", "mean"),

    # Average down payment (low down payments suggest high leverage, possible risk)
    AVG_DOWN_PAYMENT=("AMT_DOWN_PAYMENT", "mean"),

    # Average difference between actual and expected last due date (delayed repayment = risk)
    AVG_PREV_REPAYMENT_DISC=("PREV_REPAYMENT_DISC", "mean"),

    # Average rate of down payment (low rates could indicate over-leveraging)
    AVG_RATE_DOWN_PAYMENT=("RATE_DOWN_PAYMENT", "mean"),

    # Count of previous loans (a high number of past loans might indicate dependency on credit)
    CNT_PREV_LOANS=("SK_ID_PREV", "count")
).reset_index()

# Merge aggregated previous applications with application_train
train_merged = app_train.merge(prev_agg, on="SK_ID_CURR", how="left")
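
To make the aggregation pattern concrete, here is a minimal sketch on made-up data. The frames `prev_toy`, `app_toy`, `agg_toy`, and `merged_toy` are hypothetical stand-ins for the real files, and only three of the seven features are reproduced.

```python
import pandas as pd

# Toy stand-ins for previous_application.csv and application_train.csv (values are made up)
prev_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_PREV": [10, 11, 12, 13],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Refused"],
    "AMT_CREDIT": [1000.0, 2000.0, 3000.0, 500.0],
})
app_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

# One row per applicant, using the same named-aggregation pattern as the cell above
agg_toy = prev_toy.groupby("SK_ID_CURR").agg(
    PROP_NAME_PREV_REFUSED=("NAME_CONTRACT_STATUS", lambda x: (x == "Refused").mean()),
    AVG_PREV_CREDIT=("AMT_CREDIT", "mean"),
    CNT_PREV_LOANS=("SK_ID_PREV", "count"),
).reset_index()

# Left join keeps every applicant; applicant 3 has no history, so its aggregates are NaN
merged_toy = app_toy.merge(agg_toy, on="SK_ID_CURR", how="left")
print(merged_toy)
```

Applicants with no previous applications end up with missing aggregates after the left join, so the engineered features also need the missing-data handling described in the next section.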

### Handling Missing Data

Many columns in the main dataset consist mostly of missing values, and most rows are missing at least one value. By binning numeric columns and treating missing values as their own category, we can fill all the missing data. While binning loses some detail, it gives a more faithful view of many highly predictive columns that may have structural missingness, or where missing data reflects a difference in populations, as in `EXT_SOURCE_{1|2|3}`.

#### Missing Categorical Data

We used an LLM (ChatGPT with GPT-4o) to analyze the data dictionary, `HomeCredit_columns_description.csv`, and return a list of categorical features. We use this list to handle missing values in categorical columns and to encode them as integers for model building and evaluation.

While many categorical variables are nominal, we will use label encoding (that is, treat every categorical variable as if it were ordinal) to avoid adding excess dimensionality to an already large dataset.

In [41]:
categorical_columns = [
    'TARGET', 'NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
    'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE',
    'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE',
    'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'FONDKAPREMONT_MODE',
    'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OWN_CAR_AGE', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG','YEARS_BUILD_AVG','COMMONAREA_AVG','ELEVATORS_AVG',
    'ENTRANCES_AVG','FLOORSMIN_AVG','LANDAREA_AVG','LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
    'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMIN_MEDI',  'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FLAG_DOCUMENT_2',
    'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
    'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'REG_REGION_NOT_LIVE_REGION',
    'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'WEEKDAY_APPR_PROCESS_START', 'CODE_GENDER'
]

numeric_columns = [col for col in train_merged.columns.values if col not in categorical_columns]
numeric_columns.remove('SK_ID_CURR')
print(len(numeric_columns) + len(categorical_columns)) # sanity check: should be 128 = app_cols + agg_cols - 1 (ID_col)

for col in tqdm(categorical_columns):
  # astype(str) turns NaN into the string 'nan', so replace that string explicitly
  train_merged[col] = train_merged[col].astype(str).replace('nan', 'missing')


128
128


100%|██████████| 94/94 [00:12<00:00,  7.76it/s]



#### Missing Numeric Data

Because this notebook is focusing on tree-based modeling algorithms, we will use binning to handle missing data.

In [42]:
# Bin numerical columns using quantiles
# (labels=False returns bin codes; NaN stays NaN and becomes the string 'nan' after astype(str))
max_bins = 50
for col in tqdm(numeric_columns):
  train_merged[col] = pd.qcut(train_merged[col], q=max_bins, labels=False, duplicates='drop').astype(str)


100%|██████████| 34/34 [00:04<00:00,  7.25it/s]


In [43]:
# fill missing values with 'missing'
train_merged.replace('nan','missing',inplace=True)
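
The two cells above can be illustrated on a small made-up column. This is only a sketch: the series `s` and the choice of three bins are assumptions for illustration, not values from the project data.

```python
import numpy as np
import pandas as pd

# Toy numeric column with missing values (values are made up)
s = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0, np.nan, 50.0, 60.0])

# Quantile-bin into 3 bins; labels=False returns bin codes and NaN stays NaN
codes = pd.qcut(s, q=3, labels=False, duplicates="drop")

# Casting to str turns NaN into the literal string 'nan',
# which is then relabelled so that missingness becomes its own level
binned = codes.astype(str).replace("nan", "missing")
print(binned.tolist())
```

Because missing values get their own level instead of an imputed number, a tree-based model can split on "missing" directly.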

In [44]:
train_merged

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PROP_NAME_PREV_REFUSED | AVG_DAYS_DECISION | AVG_PREV_CREDIT | AVG_DOWN_PAYMENT | AVG_PREV_REPAYMENT_DISC | AVG_RATE_DOWN_PAYMENT | CNT_PREV_LOANS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 16 | 18 | 24.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 31.0 | 34.0 | 0.0 | 9.0 | 0.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 19 | 43 | 38.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 47.0 | 12.0 | 15.0 | 7.0 | 1.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 1 | 1 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 0.0 | 17.0 | 16.0 | 33.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 10 | 14 | 31.0 | ... | missing | missing | missing | 1.0 | 46.0 | 42.0 | 35.0 | 24.0 | 30.0 | 7.0 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 8 | 22 | 19.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 32.0 | 12.0 | 23.0 | 30.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 12 | 8 | 29.0 | ... | missing | missing | missing | 0.0 | 46.0 | 5.0 | 0.0 | 16.0 | 0.0 | 0.0 |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 2 | 10 | 6.0 | ... | missing | missing | missing | 0.0 | 0.0 | 9.0 | 12.0 | 19.0 | 11.0 | 0.0 |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 12 | 30 | 32.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 15.0 | 19.0 | 33.0 | 0.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 14 | 17 | 17.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 44.0 | 28.0 | 0.0 | 36.0 | 0.0 | 0.0 |


In [45]:
# sanity check: review binned columns
# for col in numeric_columns:
#   print(train_merged[col].value_counts())

## Modeling Process

We essentially de-linearized the data by binning because this notebook will focus on tree-based models. Specifically, we will train an XGBoost model, a gradient boosting algorithm.

Hyperparameters will be selected using a randomized search with 5-fold stratified cross-validation, scored by the area under the ROC curve (AUC).

The best model parameters will then be used to train a final XGBoost model, which we will evaluate in the next section.

***NOTE***: If using Google Colab, be careful about running this section, and make sure to use the A100 GPU runtime.

In [46]:
# Load & Preprocess Data
X = train_merged.drop(columns=['TARGET', 'SK_ID_CURR'])  # Drop ID and target
y = train_merged['TARGET'].astype(int)  # Extract target

# Convert all categorical features to numerical values using Label Encoding
label_encoders = {}
for col in X.columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le  # Save encoders for later use if needed
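
As a side note on what the encoder does with the string-typed bins, the sketch below runs `LabelEncoder` on a few made-up labels. The list `labels` is a hypothetical example, not project data.

```python
from sklearn.preprocessing import LabelEncoder

# String bin labels similar to those produced by the binning step (toy values)
labels = ["0.0", "2.0", "10.0", "missing", "2.0"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted as strings, so '10.0' comes before '2.0'
print(list(le.classes_))   # ['0.0', '10.0', '2.0', 'missing']
print(list(encoded))       # [0, 2, 1, 3, 2]

# A saved encoder can map the integer codes back to the original labels
print(list(le.inverse_transform(encoded)))
```

The integer codes follow string sort order rather than numeric bin order, so they are best read as arbitrary category identifiers. Tree-based models can still split on them.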

In [65]:
# Define XGBoost Model
xgb_model = xgb.XGBClassifier(
    tree_method='hist',
    device='cuda', # Enable GPU acceleration
    n_jobs=-1
)

# Define Hyperparameter Grid for RandomizedSearchCV
param_dist = {
    'n_estimators': [500, 1000],
    'learning_rate': [0.01, 0.1],
    'max_depth': [4, 6, 8],
    'reg_lambda': [1, 3, 5],  # L2 regularization
    'subsample': [0.7, 1.0],  # Fraction of data to use per tree
    'colsample_bytree': [0.7, 1.0]  # Fraction of features per tree
}

# Set Up RandomizedSearchCV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    cv=cv,
    scoring='roc_auc',
    n_iter=30,  # Randomly test 30 parameter sets
    n_jobs=-1,
    verbose=1,
    random_state=42
)

# Train with RandomizedSearchCV
random_search.fit(X, y)

# Output Best Hyperparameters
print("Best parameters:", random_search.best_params_)
print("Best ROC AUC:", random_search.best_score_)


Fitting 5 folds for each of 30 candidates, totalling 150 fits




Best parameters: {'subsample': 1.0, 'reg_lambda': 5, 'n_estimators': 500, 'max_depth': 4, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
Best ROC AUC: 0.7616342715054478


The above code block took about 15 minutes to run, which is reasonable considering it fit 150 boosted ensembles (30 candidates × 5 folds).

## Model Performance

As seen above, the kaggle

In [54]:
# Extract the best trained model
best_model = random_search.best_estimator_

# Perform 5-Fold Cross-Validation for Evaluation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Store scores for each fold
scores = {'Accuracy': [], 'Precision': [], 'Recall': [], 'F1': [], 'ROC AUC': []}

# Loop through each fold
for train_idx, val_idx in tqdm(cv.split(X, y), desc='5-Fold Cross Validation'):
    # Split data
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    # Train model
    best_model.fit(X_train, y_train)

    # Make predictions
    y_pred = best_model.predict(X_val)
    y_pred_proba = best_model.predict_proba(X_val)[:, 1]  # Get probabilities for ROC AUC

    # Compute metrics
    scores['Accuracy'].append(accuracy_score(y_val, y_pred))
    scores['Precision'].append(precision_score(y_val, y_pred, pos_label=1))
    scores['Recall'].append(recall_score(y_val, y_pred, pos_label=1))
    scores['F1'].append(f1_score(y_val, y_pred, pos_label=1))
    scores['ROC AUC'].append(roc_auc_score(y_val, y_pred_proba))

# Compute average scores
avg_scores = {metric: np.mean(values) for metric, values in scores.items()}

# Print evaluation results
print("\n\nOptimized Model Evaluation with 5-Fold Cross-Validation:")
for metric, score in avg_scores.items():
    print(f"mean {metric}: {score:.4f}")

5it [00:11,  2.31s/it]



Optimized Model Evaluation with 5-Fold Cross-Validation:
Accuracy: 0.9196
Precision: 0.5338
Recall: 0.0278
F1: 0.0529
ROC AUC: 0.7616
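
For context on the accuracy figure, the majority class benchmark from the Tasks list can be sketched with scikit-learn's `DummyClassifier`. The data here is synthetic: `y_toy` assumes roughly 8% positives purely for illustration, and `X_toy` is a placeholder. On the real data one would pass the notebook's `X` and `y` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic imbalanced target (about 8% positives); a stand-in for TARGET, not the real data
rng = np.random.default_rng(42)
y_toy = (rng.random(10_000) < 0.08).astype(int)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)
proba = baseline.predict_proba(X_toy)[:, 1]

# Accuracy equals the share of the majority class; a constant score gives AUC = 0.5
acc = accuracy_score(y_toy, pred)
auc = roc_auc_score(y_toy, proba)
print(f"baseline accuracy: {acc:.4f}, baseline ROC AUC: {auc:.4f}")
```

With a heavily imbalanced target, the baseline's accuracy is already high, which is why AUC is the more informative metric for comparing models here.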





In [55]:
avg_scores

{'Accuracy': 0.9195606023391942,
 'Precision': 0.5337665698141087,
 'Recall': 0.02783484390735146,
 'F1': 0.052904416308474624,
 'ROC AUC': 0.7616342715054478}
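
Precision, recall, and F1 above are computed at the default 0.5 probability cut-off used by `predict`. One possible follow-up, sketched below on synthetic scores, is to pick the cut-off from the precision-recall curve instead. The arrays `y_true` and `scores` are made-up stand-ins for `y_val` and `y_pred_proba`, and maximizing F1 is only one of several reasonable selection criteria.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

# Synthetic labels and scores standing in for y_val and y_pred_proba (not real results)
rng = np.random.default_rng(0)
y_true = (rng.random(5_000) < 0.08).astype(int)
scores = np.clip(0.08 + 0.15 * y_true + rng.normal(0, 0.08, size=5_000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision/recall have one more entry than thresholds; drop the final (1, 0) point
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1 = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[np.argmax(f1)]

# Compare F1 at the default cut-off with F1 at the selected cut-off
f1_default = f1_score(y_true, (scores >= 0.5).astype(int), zero_division=0)
f1_tuned = f1_score(y_true, (scores >= best_threshold).astype(int), zero_division=0)
print(f"threshold={best_threshold:.3f}  F1@0.5={f1_default:.3f}  F1@best={f1_tuned:.3f}")
```

In practice the threshold would need to be chosen on validation folds only, so that the reported metrics stay honest. ROC AUC is unaffected by the choice because it does not depend on a cut-off.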

## Results

Summarize and discuss your findings.