## Notebook Overview

This notebook builds upon the feature-engineered dataset from the previous notebook (`04_feature_engineering.ipynb`) and focuses on **Model Training and Evaluation**. Our primary goal is to develop a credit risk prediction model that excels at identifying potential loan defaulters, thereby minimizing financial losses for retail banks while also considering their desired balance between risk aversion and loan approval rates. This translates to maximizing the recall of the positive class (loan defaulters) while maintaining acceptable precision and overall model performance.

### 0.5.1 Objectives

The main objectives of this notebook are:

1. **Model Selection:** Choose algorithms suitable for imbalanced classification problems.
2. **Model Training:** Train models with a focus on identifying potential defaulters.
3. **Hyperparameter Tuning:** Optimize models to increase recall for the positive class.
4. **Model Evaluation:** Assess models primarily on recall, while considering precision, F2-score, AUC-PR, and overall performance.
5. **Model Comparison:** Compare different models based on their ability to identify true positives and balance the precision-recall trade-off.
6. **Threshold Adjustment:** Explore the impact of classification thresholds on recall and precision, collaborating with retail banks to determine the optimal threshold.

### 0.5.2 Importance of Focusing on Recall

Prioritizing recall for defaulter prediction is crucial for minimizing financial losses, which is the primary business objective in credit risk assessment. The cost of missing a potential defaulter (false negative) is typically much higher than the cost of incorrectly classifying a non-defaulter as high-risk (false positive). While we prioritize recall, we will also carefully consider the precision-recall trade-off and aim for a model that maximizes recall without severely impacting precision. Techniques like threshold adjustment and cost-sensitive learning will be used to balance these metrics effectively. Furthermore, demonstrating a thorough approach to risk identification aligns with regulatory expectations in the financial sector, supporting the banks' compliance needs. This approach also allows for more conservative lending practices, which can be adjusted based on the bank's specific risk tolerance.
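
To make the false-negative versus false-positive trade-off concrete, here is a minimal sketch that picks a classification threshold by minimizing a total misclassification cost. The cost values (`COST_FN`, `COST_FP`), the helper `expected_cost`, and the toy labels and scores are illustrative assumptions, not figures supplied by the banks.

```python
import numpy as np

# Hypothetical unit costs: a missed defaulter is assumed to be 10x as costly
# as a wrongly flagged non-defaulter. Real values would come from the bank.
COST_FN = 10.0
COST_FP = 1.0


def expected_cost(y_true, y_score, threshold):
    """Total misclassification cost when rows with score >= threshold are flagged."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    return float(COST_FN * false_negatives + COST_FP * false_positives)


# Toy labels and predicted default probabilities (made up for illustration).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.12, 0.22, 0.33, 0.41, 0.58, 0.72, 0.91])

# Candidate thresholds 0.1, 0.2, ..., 0.9.
thresholds = np.round(np.linspace(0.1, 0.9, 9), 1)
costs = [expected_cost(y_true, y_score, t) for t in thresholds]
best_threshold = float(thresholds[int(np.argmin(costs))])

print(f"Cost at the default 0.5 threshold: {expected_cost(y_true, y_score, 0.5)}")
print(f"Lowest-cost threshold on this toy data: {best_threshold}")
```

With an asymmetric cost like this, the lowest-cost threshold tends to fall below 0.5, which corresponds to the higher-recall, lower-precision operating point described above.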

### 0.5.3 Our Approach

In this notebook, we will focus on the following modeling tasks:

1. **Data Preparation:** Address class imbalance using techniques like SMOTE or class weighting.
2. **Baseline Model:** A logistic regression model with class weights inversely proportional to class frequencies will serve as our baseline. This will provide a benchmark for evaluating more complex models.
3. **Advanced Models:** Train and evaluate models known for handling imbalanced data:
   - Decision Trees with adjusted class weights
   - Random Forest with balanced class weights
   - Gradient Boosting (XGBoost, LightGBM) with `scale_pos_weight` adjustment
4. **Hyperparameter Tuning:** We will employ techniques like GridSearchCV or RandomizedSearchCV, optimizing for the F2-score (which gives more weight to recall) or a custom cost-sensitive scoring function.
5. **Model Evaluation:** Prioritize recall in our metrics, while also considering precision, F2-score, AUC-PR, and AUC-ROC.
6. **Threshold Adjustment:** We will experiment with different classification thresholds and work closely with retail banks to determine the optimal threshold that balances their desired level of risk aversion with acceptable loan approval rates.
7. **Ensemble Methods:** Explore ensemble techniques that can improve recall without severely impacting precision.
8. **Cost-Sensitive Learning:** Incorporate misclassification costs to reflect the higher cost of false negatives, aligning the model's objective with the business goal of minimizing financial losses.
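
As a rough illustration of the class-weighting options listed above, the sketch below derives weights from the training-split class counts that LightGBM reports in the log further down (19,860 positives and 226,148 negatives). The variable names are illustrative, and the negatives-over-positives ratio is a common starting heuristic for `scale_pos_weight` rather than a tuned value.

```python
import numpy as np

# Class counts as reported by LightGBM for the training split in this notebook.
n_negative = 226_148
n_positive = 19_860
n_samples = n_negative + n_positive

# scikit-learn's class_weight="balanced" heuristic:
# weight_c = n_samples / (n_classes * count_c)
counts = np.array([n_negative, n_positive])
balanced_weights = n_samples / (2 * counts)

# Common heuristic for gradient-boosting libraries: negatives / positives.
scale_pos_weight = n_negative / n_positive

weights_by_class = {c: round(float(w), 3) for c, w in enumerate(balanced_weights)}
print(f"balanced class weights: {weights_by_class}")
print(f"scale_pos_weight: {scale_pos_weight:.2f}")
```

Values like these could be passed as `LogisticRegression(class_weight="balanced")` or `XGBClassifier(scale_pos_weight=scale_pos_weight)`. Note that the ratio of the two balanced weights equals `scale_pos_weight`, so both options express the same relative emphasis on defaulters.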

By the end of this notebook, we aim to have a model (or ensemble of models) that excels at identifying potential loan defaulters, providing the bank with a powerful tool for risk assessment and mitigation.


In [1]:
import pandas as pd
import numpy as np
import os

from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    precision_recall_curve,
    auc,
    confusion_matrix,
    f1_score,
    recall_score,
    precision_score,
    make_scorer,
)
from retail_bank_risk.model_training_utils import downscale_dtypes
from retail_bank_risk.advanced_visualizations_utils import (
    plot_confusion_matrix,
    plot_model_performance,
    shap_summary_plot,
    shap_force_plot,
    plot_roc_curve,
    plot_precision_recall_curve,
    plot_combined_confusion_matrices,
    plot_learning_curve,
)

from joblib import Parallel, delayed
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings for cleaner output

import matplotlib.pyplot as plt
import seaborn as sns
import shap

After loading the necessary libraries, we load the training and test datasets and verify their dimensions and initial rows to ensure data integrity.

In [2]:
train_df = pd.read_parquet(
    "../data/processed/application_train_engineered.parquet"
)
test_df = pd.read_parquet(
    "../data/processed/application_test_engineered.parquet"
)

In [3]:
print(f"Training Data Shape: {train_df.shape}")
print(f"Test Data Shape: {test_df.shape}")

train_df.head()

Training Data Shape: (307511, 77)
Test Data Shape: (48744, 76)


|  | reg_city_not_work_city_0 | reg_city_not_work_city_1 | region_rating_client_w_city | region_rating_client | name_contract_type_cash loans | name_contract_type_revolving loans | code_gender_m | code_gender_f | flag_own_car_n | flag_own_car_y | ... | is_anomaly_true | age_group | income_group | credit_amount_group | debt_to_income_ratio | credit_to_goods_ratio | annuity_to_income_ratio | ext_source_mean | credit_exceeds_goods | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1.0 | 1.0 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1.0 | 3.0 | 1.0 | 2.007889 | 1.158397 | 0.121978 | 0.201162 | 1 | 1 |
| 1 | 1 | 0 | 2.0 | 2.0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 3.0 | 4.0 | 4.0 | 4.79075 | 1.145199 | 0.132217 | 0.588812 | 1 | 0 |
| 2 | 1 | 0 | 1.0 | 1.0 | 0 | 1 | 1 | 0 | 0 | 1 | ... | 0 | 3.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.1 | 0.642739 | 0 | 0 |
| 3 | 1 | 0 | 1.0 | 1.0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 3.0 | 1.0 | 1.0 | 2.316167 | 1.052803 | 0.2199 | 0.68046 | 1 | 0 |
| 4 | 0 | 1 | 1.0 | 1.0 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 3.0 | 1.0 | 2.0 | 4.222222 | 1.0 | 0.179963 | 0.39776 | 0 | 0 |


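The data appears to have been processed correctly: the encoded variables are present, the engineered features have been created, and the dataset shapes match expectations. The training set has one more column than the test set because it includes `target`.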

Next, we optimize data types to minimize memory usage without information loss.
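
The implementation of `downscale_dtypes` lives in the project's utility module and is not shown here. As a rough idea of what such a helper might do, the sketch below downcasts numeric columns with `pd.to_numeric`. The function name `downcast_numeric` and the demo frame are hypothetical and are not the project's code.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Use an unsigned dtype for non-negative columns, a signed one otherwise.
        kind = "unsigned" if out[col].min() >= 0 else "integer"
        out[col] = pd.to_numeric(out[col], downcast=kind)
    for col in out.select_dtypes(include=[np.floating]).columns:
        # Asks pandas for the smallest float dtype (float32 at minimum).
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Small made-up frame: one binary flag and one ratio-like float column.
demo = pd.DataFrame({
    "flag": np.array([0, 1, 1, 0], dtype="int64"),
    "ratio": np.array([0.5, 1.25, 2.0, 4.75], dtype="float64"),
})
small = downcast_numeric(demo)

print(small.dtypes)
print(f"bytes before: {demo.memory_usage(deep=True).sum()}")
print(f"bytes after:  {small.memory_usage(deep=True).sum()}")
```

Downcasting floats to `float32` keeps roughly seven significant digits, so it is worth confirming that this precision is enough for ratio features before relying on it.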

In [4]:
train_df, test_df = downscale_dtypes(train_df, test_df, target_column='target')

train_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 77 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   reg_city_not_work_city_0                 307511 non-null  uint8  
 1   reg_city_not_work_city_1                 307511 non-null  uint8  
 2   region_rating_client_w_city              307511 non-null  float32
 3   region_rating_client                     307511 non-null  float32
 4   name_contract_type_cash loans            307511 non-null  uint8  
 5   name_contract_type_revolving loans       307511 non-null  uint8  
 6   code_gender_m                            307511 non-null  uint8  
 7   code_gender_f                            307511 non-null  uint8  
 8   flag_own_car_n                           307511 non-null  uint8  
 9   flag_own_car_y                           307511 non-null  uint8  
 10  flag_own_realty_y               

Next, we split the dataset into predictors and the target variable for modeling.

In [5]:
X = train_df.drop("target", axis=1)
y = train_df["target"]

print(f"Number of Features: {X.shape[1]}")
print("Feature Names:", X.columns.tolist())

Number of Features: 76
Feature Names: ['reg_city_not_work_city_0', 'reg_city_not_work_city_1', 'region_rating_client_w_city', 'region_rating_client', 'name_contract_type_cash loans', 'name_contract_type_revolving loans', 'code_gender_m', 'code_gender_f', 'flag_own_car_n', 'flag_own_car_y', 'flag_own_realty_y', 'flag_own_realty_n', 'name_type_suite_unaccompanied', 'name_type_suite_family', 'name_type_suite_spouse, partner', 'name_type_suite_children', 'name_type_suite_other_a', 'name_type_suite_mode', 'name_type_suite_other_b', 'name_type_suite_group of people', 'name_income_type_working', 'name_income_type_state servant', 'name_income_type_commercial associate', 'name_income_type_pensioner', 'name_income_type_unemployed', 'name_income_type_student', 'name_income_type_businessman', 'name_income_type_maternity leave', 'name_education_type', 'name_family_status_single / not married', 'name_family_status_married', 'name_family_status_civil marriage', 'name_family_status_widow', 'name_famil

We then split the data into training and validation sets, stratifying on the target so that both sets preserve the class distribution and model performance can be evaluated on unseen data.

In [6]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training Set Shape: {X_train.shape}")
print(f"Validation Set Shape: {X_val.shape}")

Training Set Shape: (246008, 76)
Validation Set Shape: (61503, 76)
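
A quick way to confirm that `stratify=y` did its job is to compare the positive rate in each split. The sketch below does this on synthetic data with roughly 8% positives. The names `X_demo`, `y_demo`, `y_tr`, and `y_va` are placeholders, and in the notebook the same check would apply to `y_train` and `y_val`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for the target: roughly 8% positives, as in this dataset.
y_demo = (rng.random(10_000) < 0.08).astype(int)
X_demo = rng.normal(size=(10_000, 3))

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the three rates should agree to within rounding.
print(f"overall positive rate:    {y_demo.mean():.4f}")
print(f"training positive rate:   {y_tr.mean():.4f}")
print(f"validation positive rate: {y_va.mean():.4f}")
```

The logs further down are consistent with this: 19,860 of 246,008 training rows and 4,965 of 61,503 validation rows are positive, about 8.07% in both cases.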


Before optimizing and tuning complex models, it's essential to establish baseline models. They provide a benchmark against which we can measure the performance of more sophisticated models and tuning strategies.

We start by defining the baseline pipelines.

In [7]:
import re

def sanitize_feature_names(X):
    """Replace each run of non-word characters in column names with '_'."""
    return X.rename(columns=lambda x: re.sub(r'[^\w]+', '_', x))

In [8]:
from sklearn.preprocessing import FunctionTransformer

sanitize_transformer = FunctionTransformer(sanitize_feature_names)
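
To show what the sanitizing step does, the sketch below applies the same regex to three column names taken from this dataset. The motivation is an assumption on our part: some libraries (LightGBM, for instance) reject feature names containing certain special characters, so normalizing the names up front avoids surprises. The one-row `demo` frame is made up.

```python
import re

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def sanitize_feature_names(X):
    # Same regex as the notebook cell above: runs of non-word characters become "_".
    return X.rename(columns=lambda x: re.sub(r'[^\w]+', '_', x))


demo = pd.DataFrame(
    [[1, 0, 1]],
    columns=[
        "name_contract_type_cash loans",
        "name_type_suite_spouse, partner",
        "name_family_status_single / not married",
    ],
)

sanitized = FunctionTransformer(sanitize_feature_names).fit_transform(demo)
for old, new in zip(demo.columns, sanitized.columns):
    print(f"{old!r} -> {new!r}")
```

Because the step is stateless, wrapping it in `FunctionTransformer` mainly serves to keep it inside the pipeline so that training, validation, and test data are renamed identically.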

In [9]:
pipeline_dummy = Pipeline([
    ('sanitize', sanitize_transformer),
    ('classifier', DummyClassifier(strategy='most_frequent'))
])

pipeline_lr = Pipeline([
    ('sanitize', sanitize_transformer),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

pipeline_dt = Pipeline([
    ('sanitize', sanitize_transformer),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

pipeline_rf = Pipeline([
    ('sanitize', sanitize_transformer),
    ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))
])

pipeline_xgb = Pipeline([
    ('sanitize', sanitize_transformer),
    ('classifier', xgb.XGBClassifier(
        objective="binary:logistic",
        eval_metric='logloss',
        random_state=42,
        n_jobs=-1
    ))
])

pipeline_lgb = Pipeline([
    ('sanitize', sanitize_transformer),
    ('classifier', lgb.LGBMClassifier(
        objective='binary',
        random_state=42,
        n_jobs=-1
    ))
])

Next, we will organize models into a list for streamlined training and evaluation.

In [10]:
baseline_models = [
    ('Dummy Classifier', pipeline_dummy),
    ('Logistic Regression', pipeline_lr),
    ('Decision Tree', pipeline_dt),
    ('Random Forest', pipeline_rf),
    ('XGBoost', pipeline_xgb),
    ('LightGBM', pipeline_lgb)
]

Next, we define a function that automates training and evaluation, reporting key metrics and saving a confusion-matrix plot for each model.

In [11]:
def train_evaluate_model(name, pipeline, X_train, y_train, X_val, y_val):
    print(f"Training {name}...")
    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_val)
    y_pred_proba = pipeline.predict_proba(X_val)[:, 1]

    recall = recall_score(y_val, y_pred, zero_division=0)
    precision = precision_score(y_val, y_pred, zero_division=0)
    f1 = f1_score(y_val, y_pred, zero_division=0)
    roc_auc = roc_auc_score(y_val, y_pred_proba)
    precision_vals, recall_vals, _ = precision_recall_curve(y_val, y_pred_proba)
    pr_auc = auc(recall_vals, precision_vals)

    print(f"\nClassification Report for {name}:")
    print(classification_report(y_val, y_pred))

    cm = confusion_matrix(y_val, y_pred)
    print(f"Confusion Matrix for {name}:\n{cm}\n")


    print(f"{name} Metrics:")
    print(f"Recall: {recall:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"AUC-PR: {pr_auc:.4f}")
    print(f"AUC-ROC: {roc_auc:.4f}\n")


    os.makedirs("../images", exist_ok=True)
    save_path = f"../images/confusion_matrix_{name.lower().replace(' ', '_')}.png"
    plot_confusion_matrix(
        y_true=y_val.values,
        y_pred=y_pred,
        labels=["Non-Defaulter", "Defaulter"],
        save_path=save_path
    )
    print(f"Confusion matrix saved to {save_path}")

    print("="*60 + "\n")

    return {
        'model': name,
        'recall': recall,
        'precision': precision,
        'f1_score': f1,
        'auc_pr': pr_auc,
        'auc_roc': roc_auc
    }
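
One caveat about the `auc(recall_vals, precision_vals)` computation above: trapezoidal interpolation of a precision-recall curve can be optimistic when a model produces few distinct scores. For a classifier that assigns every row the same score, the curve has only two points and the trapezoid evaluates to (1 + prevalence) / 2, which would explain a Dummy Classifier AUC-PR of about 0.54 at a prevalence of about 8%. The sketch below compares it with `average_precision_score` on made-up labels. Switching metrics is a suggestion, not something the notebook does.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Made-up labels with 8% positives, close to the prevalence in this notebook.
y_true = np.array([1] * 8 + [0] * 92)
# A most-frequent dummy classifier gives every row the same score for class 1.
constant_scores = np.zeros(len(y_true))

precision_vals, recall_vals, _ = precision_recall_curve(y_true, constant_scores)
trapezoid_auc = auc(recall_vals, precision_vals)
avg_precision = average_precision_score(y_true, constant_scores)

print(f"prevalence:         {y_true.mean():.4f}")
print(f"trapezoidal AUC-PR: {trapezoid_auc:.4f}")
print(f"average precision:  {avg_precision:.4f}")
```

Average precision uses a step-wise sum instead of linear interpolation, so an uninformative classifier scores at the prevalence rather than near 0.5.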

In [12]:
# Define number of parallel jobs based on your CPU cores
parallel_jobs = 6   # Adjust as needed

# Train and evaluate all baseline models in parallel
baseline_results = Parallel(n_jobs=parallel_jobs)(
    delayed(train_evaluate_model)(name, pipeline, X_train, y_train, X_val, y_val)
    for name, pipeline in baseline_models
)

Training Dummy Classifier...

Classification Report for Dummy Classifier:
Training Logistic Regression...
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     56538
           1       0.00      0.00      0.00      4965

    accuracy                           0.92     61503
   macro avg       0.46      0.50      0.48     61503
weighted avg       0.85      0.92      0.88     61503

Confusion Matrix for Dummy Classifier:
[[56538     0]
 [ 4965     0]]

Dummy Classifier Metrics:
Recall: 0.0000
Precision: 0.0000
F1 Score: 0.0000
AUC-PR: 0.5404
AUC-ROC: 0.5000





Training Decision Tree...
Training Random Forest...
Training XGBoost...
Training LightGBM...
[LightGBM] [Info] Number of positive: 19860, number of negative: 226148
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.067998 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3532
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 71
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432482
[LightGBM] [Info] Start training from score -2.432482
Confusion matrix saved to ../images/confusion_matrix_dummy_classifier.png


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     56538
           1       0.55      0.01      0.02      4965

    accuracy                           0.92     61503
   macro avg       0.74      0.51      0.49     61503
weighted avg    

Parameters: { "use_label_encoder" } are not used.



Confusion matrix saved to ../images/confusion_matrix_logistic_regression.png


Classification Report for LightGBM:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56538
           1       1.00      1.00      1.00      4965

    accuracy                           1.00     61503
   macro avg       1.00      1.00      1.00     61503
weighted avg       1.00      1.00      1.00     61503

Confusion Matrix for LightGBM:
[[56538     0]
 [    0  4965]]

LightGBM Metrics:
Recall: 1.0000
Precision: 1.0000
F1 Score: 1.0000
AUC-PR: 1.0000
AUC-ROC: 1.0000


Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56538
           1       1.00      1.00      1.00      4965

    accuracy                           1.00     61503
   macro avg       1.00      1.00      1.00     61503
weighted avg       1.00      1.00      1.00     61503

Confusion Matrix for Decis
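
Because the parallel run interleaves each model's printed output, the log above is hard to read. Since `train_evaluate_model` returns one dictionary per model, the results can be collected into a single table instead. The sketch below uses only the two complete entries visible in the log; `example_results` is a stand-in for `baseline_results`.

```python
import pandas as pd

# Two entries copied from the log above; the other models are omitted because
# their metrics are truncated in this excerpt.
example_results = [
    {"model": "Dummy Classifier", "recall": 0.0, "precision": 0.0,
     "f1_score": 0.0, "auc_pr": 0.5404, "auc_roc": 0.5},
    {"model": "LightGBM", "recall": 1.0, "precision": 1.0,
     "f1_score": 1.0, "auc_pr": 1.0, "auc_roc": 1.0},
]

# One row per model, ranked by recall because recall is the primary metric.
results_df = (
    pd.DataFrame(example_results)
    .set_index("model")
    .sort_values("recall", ascending=False)
)
print(results_df)
```

In the notebook, `pd.DataFrame(baseline_results)` would produce the full comparison. Scores of exactly 1.0 on held-out data are unusual for a problem like this and may be worth checking for target leakage before the models are compared further.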

Next, we define pipelines for the different models.

This way, we ensure the same preprocessing steps are applied within each model's evaluation during cross-validation, preventing data leakage and making results more reliable.
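
A minimal sketch of this idea, under stated assumptions: the scaler sits inside the pipeline, so each cross-validation fold fits it on that fold's training portion only. The synthetic data from `make_classification`, the names `X_demo`, `y_demo`, and `f2_scorer`, and the choice of a class-weighted logistic regression are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)

# F2 weights recall more heavily than precision (beta=2).
f2_scorer = make_scorer(fbeta_score, beta=2, zero_division=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),  # re-fitted on each fold's training portion
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring=f2_scorer)

print(f"F2 per fold: {np.round(scores, 3)}")
print(f"Mean F2:     {scores.mean():.3f}")
```

The same `f2_scorer` and `cv` objects could be passed to `GridSearchCV` or `RandomizedSearchCV` through their `scoring` and `cv` arguments.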


Next, we define parameter grids for each model.
