## Notebook Overview

This notebook builds upon the feature-engineered dataset from the previous notebook (`04_feature_engineering.ipynb`) and focuses on **Model Training and Evaluation**. Our primary goal is to develop a credit risk prediction model that excels at identifying potential loan defaulters, thereby minimizing financial losses for retail banks while also considering their desired balance between risk aversion and loan approval rates.  This translates to maximizing the recall of the positive class (loan defaulters) while maintaining acceptable precision and overall model performance.

### 0.5.1 Objectives

The main objectives of this notebook are:

1. **Model Selection:** Choose algorithms suitable for imbalanced classification problems.
2. **Model Training:** Train models with a focus on identifying potential defaulters.
3. **Hyperparameter Tuning:** Optimize models to increase recall for the positive class.
4. **Model Evaluation:** Assess models primarily on recall, while considering precision, F2-score, AUC-PR, and overall performance.
5. **Model Comparison:** Compare different models based on their ability to identify true positives and balance the precision-recall trade-off.
6. **Threshold Adjustment:** Explore the impact of classification thresholds on recall and precision, collaborating with retail banks to determine the optimal threshold.

### 0.5.2 Importance of Focusing on Recall

Prioritizing recall for defaulter prediction is crucial for minimizing financial losses, which is the primary business objective in credit risk assessment.  The cost of missing a potential defaulter (false negative) is typically much higher than the cost of incorrectly classifying a non-defaulter as high-risk (false positive). While we prioritize recall, we will also carefully consider the precision-recall trade-off and aim for a model that maximizes recall without severely impacting precision. Techniques like threshold adjustment and cost-sensitive learning will be used to balance these metrics effectively.  Furthermore, demonstrating a thorough approach to risk identification aligns with regulatory expectations in the financial sector, supporting the banks' compliance needs. This approach also allows for more conservative lending practices, which can be adjusted based on the bank's specific risk tolerance.

### 0.5.3 Our Approach

In this notebook, we will focus on the following modeling tasks:

1. **Data Preparation:** Address class imbalance using techniques like SMOTE or class weighting.
2. **Baseline Model:** A logistic regression model with class weights inversely proportional to class frequencies will serve as our baseline. This will provide a benchmark for evaluating more complex models.
3. **Advanced Models:** Train and evaluate models known for handling imbalanced data:
   - Decision Trees with adjusted class weights
   - Random Forest with balanced class weights
   - Gradient Boosting (XGBoost, LightGBM) with `scale_pos_weight` adjustment
4. **Hyperparameter Tuning:** We will employ techniques like GridSearchCV or RandomizedSearchCV, optimizing for the F2-score (which gives more weight to recall) or a custom cost-sensitive scoring function.
5. **Model Evaluation:** Prioritize recall in our metrics, while also considering precision, F2-score, AUC-PR, and AUC-ROC.
6. **Threshold Adjustment:** We will experiment with different classification thresholds and work closely with retail banks to determine the optimal threshold that balances their desired level of risk aversion with acceptable loan approval rates.
7. **Ensemble Methods:** Explore ensemble techniques that can improve recall without severely impacting precision.
8. **Cost-Sensitive Learning:** Incorporate misclassification costs to reflect the higher cost of false negatives, aligning the model's objective with the business goal of minimizing financial losses.

By the end of this notebook, we aim to have a model (or ensemble of models) that excels at identifying potential loan defaulters, providing the bank with a powerful tool for risk assessment and mitigation.

In [22]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
import xgboost as xgb
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, auc, confusion_matrix, ConfusionMatrixDisplay

In [23]:
train_df = pd.read_parquet("../data/processed/application_train_engineered.parquet")
test_df = pd.read_parquet("../data/processed/application_test_engineered.parquet")

In [24]:
def downscale_dtypes(df_train, df_test=None):
    """Downscales numeric columns and encodes categoricals based on df_train."""
    downscale_actions = {}
    categorical_actions = {}

    for col in df_train.columns:

        if pd.api.types.is_numeric_dtype(df_train[col]):
            if pd.api.types.is_integer_dtype(df_train[col]):
                if df_train[col].min() >= 0:
                    if df_train[col].max() < np.iinfo(np.uint8).max:
                        downscale_actions[col] = np.uint8
                    elif df_train[col].max() < np.iinfo(np.uint16).max:
                        downscale_actions[col] = np.uint16
                    elif df_train[col].max() < np.iinfo(np.uint32).max:
                        downscale_actions[col] = np.uint32
                else:
                    if df_train[col].min() > np.iinfo(np.int8).min and df_train[col].max() < np.iinfo(np.int8).max:
                        downscale_actions[col] = np.int8
                    elif df_train[col].min() > np.iinfo(np.int16).min and df_train[col].max() < np.iinfo(np.int16).max:
                        downscale_actions[col] = np.int16
                    elif df_train[col].min() > np.iinfo(np.int32).min and df_train[col].max() < np.iinfo(np.int32).max:
                        downscale_actions[col] = np.int32

            elif pd.api.types.is_float_dtype(df_train[col]):
                if df_train[col].min() > np.finfo(np.float32).min and df_train[col].max() < np.finfo(np.float32).max:
                    downscale_actions[col] = np.float32

        elif hasattr(df_train[col].dtype, "categories"): # Assuming remaining cols with categories attribute are categorical.
            categorical_actions[col] = df_train[col].cat.categories


    df_train = df_train.astype(downscale_actions)


    if df_test is not None:
        df_test = df_test.astype(downscale_actions)

        for col, categories in categorical_actions.items(): # Use categories from TRAIN for test.
            if col in df_test.columns:
                df_test[col] = pd.Categorical(df_test[col], categories=categories).astype('category')


        return df_train, df_test
    else:
        return df_train

In [25]:
train_df, test_df = downscale_dtypes(train_df, test_df)

In [26]:
X = train_df.drop("target", axis=1)
y = train_df["target"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Define pipelines for different models.

This way, we ensure the same preprocessing steps are applied within each model's evaluation during cross-validation, preventing data leakage and making results more reliable.

In [27]:
pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

pipeline_dt = Pipeline([
    ('classifier', DecisionTreeClassifier(random_state=42))
])

pipeline_rf = Pipeline([
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline_xgb = Pipeline([
    ('classifier', xgb.XGBClassifier(objective="binary:logistic", random_state=42))
])

Define parameter grids for each model

In [28]:
param_grid_lr = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__class_weight': ['balanced', {0:1, 1:2}, {0:1, 1:5}, {0:1, 1:10}],
    'classifier__penalty': ['l1', 'l2', 'elasticnet', 'none']
}

param_grid_dt = {
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__class_weight': ['balanced', {0:1, 1:2}, {0:1, 1:5}, {0:1, 1:10}]
}

param_grid_rf = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__class_weight': ['balanced', {0:1, 1:2}, {0:1, 1:5}, {0:1, 1:10}]
}

param_grid_xgb = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.3],
    'classifier__max_depth': [3, 5, 7],
    'classifier__scale_pos_weight': [1, 2, 5, 10]
}

In [29]:
pipelines = [pipeline_lr, pipeline_dt, pipeline_rf, pipeline_xgb]
param_grids = [param_grid_lr, param_grid_dt, param_grid_rf, param_grid_xgb]
best_estimators = []


kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # Stratified k-fold for cross-validation


for pipeline, param_grid in zip(pipelines, param_grids):
    grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=kfold, scoring='recall', n_jobs=-1, verbose=1) # Recall is chosen as the main metric
    grid_search.fit(X_train, y_train)

    best_estimators.append(grid_search.best_estimator_)

    print(f"Best parameters for {type(pipeline.named_steps['classifier']).__name__}: {grid_search.best_params_}")
    print(f"Best recall score: {grid_search.best_score_}")

# 4. Model Evaluation and Comparison:

results = []
names = []


for name, model in zip(['Logistic Regression', 'Decision Tree', 'Random Forest', 'XGBoost'], best_estimators):


    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='recall', n_jobs=-1) # scoring should be aligned with business objectives
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

    y_pred = model.predict(X_val)
    print(classification_report(y_val, y_pred))

    y_pred_proba = model.predict_proba(X_val)[:, 1]


    precision, recall, thresholds = precision_recall_curve(y_val, y_pred_proba)
    pr_auc = auc(recall, precision)
    print("AUC-PR:", pr_auc)

    # AUC ROC Curve
    roc_auc = roc_auc_score(y_val, y_pred_proba)
    print("AUC-ROC:", roc_auc)


    cm = confusion_matrix(y_val, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Non-Defaulter', 'Defaulter'])
    disp.plot()

Fitting 5 folds for each of 96 candidates, totalling 480 fits


Traceback (most recent call last):
  File "/Users/vytautasbunevicius/retail-bank-risk-evaluation/venv/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 971, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vytautasbunevicius/retail-bank-risk-evaluation/venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 279, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vytautasbunevicius/retail-bank-risk-evaluation/venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 371, in _score
    y_pred = method_caller(
             ^^^^^^^^^^^^^^
  File "/Users/vytautasbunevicius/retail-bank-risk-evaluation/venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 89, in _cached_call
    result, _ 

ValueError: Solver lbfgs supports only 'l2' or None penalties, got l1 penalty.