## Notebook Overview

This notebook builds on the feature-engineered dataset from the previous notebook (`04_feature_engineering.ipynb`) and focuses on **Model Training and Evaluation**. Our primary goal is a credit risk prediction model that reliably identifies potential loan defaulters, reducing financial losses for retail banks while respecting their desired balance between risk aversion and loan approval rates. In metric terms, this means maximizing recall on the positive class (loan defaulters) while keeping precision and overall model performance at an acceptable level.

### 0.5.1 Objectives

The main objectives of this notebook are:

1. **Model Selection:** Choose algorithms suitable for imbalanced classification problems.
2. **Model Training:** Train models with a focus on identifying potential defaulters.
3. **Hyperparameter Tuning:** Optimize models to increase recall for the positive class.
4. **Model Evaluation:** Assess models primarily on recall, while considering precision, F2-score, AUC-PR, and overall performance.
5. **Model Comparison:** Compare different models based on their ability to identify true positives and balance the precision-recall trade-off.
6. **Threshold Adjustment:** Explore the impact of classification thresholds on recall and precision, collaborating with retail banks to determine the optimal threshold.
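
As a rough illustration of objective 6, the sketch below sweeps classification thresholds and picks the one with the highest F2. It runs on synthetic data: the dataset, the roughly 8% positive rate, the logistic regression model, and all variable names are assumptions for illustration, not the notebook's data or its final method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (assumed ~8% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.92, 0.08], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]

# precision and recall carry one extra trailing point (1, 0) with no threshold.
precision, recall, thresholds = precision_recall_curve(y_va, proba)
precision, recall = precision[:-1], recall[:-1]

# F-beta at every candidate threshold, guarding against a zero denominator.
beta = 2.0
denom = beta**2 * precision + recall
f2 = np.divide(
    (1 + beta**2) * precision * recall,
    denom,
    out=np.zeros_like(denom),
    where=denom > 0,
)

best = int(np.argmax(f2))
best_threshold = float(thresholds[best])
print(
    f"threshold={best_threshold:.3f} precision={precision[best]:.3f} "
    f"recall={recall[best]:.3f} F2={f2[best]:.3f}"
)
```

In practice the threshold would be chosen together with the bank rather than by F2 alone; the sweep only shows what each candidate threshold costs in precision and recall.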

### 0.5.2 Importance of Focusing on Recall

Prioritizing recall for defaulter prediction is crucial for minimizing financial losses, which is the primary business objective in credit risk assessment. The cost of missing a potential defaulter (false negative) is typically much higher than the cost of incorrectly classifying a non-defaulter as high-risk (false positive). While we prioritize recall, we will also carefully consider the precision-recall trade-off and aim for a model that maximizes recall without severely impacting precision. Techniques like threshold adjustment and cost-sensitive learning will be used to balance these metrics effectively. Furthermore, demonstrating a thorough approach to risk identification aligns with regulatory expectations in the financial sector, supporting the banks' compliance needs. This approach also allows for more conservative lending practices, which can be adjusted based on the bank's specific risk tolerance.
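
To make the asymmetry between the two error types concrete, here is a minimal sketch that prices each error and compares two sets of predictions. The 10:1 cost ratio and the toy labels are purely hypothetical; real costs would come from the bank.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-case costs (assumptions, not figures from any bank).
COST_FALSE_NEGATIVE = 10.0  # a missed defaulter
COST_FALSE_POSITIVE = 1.0   # a creditworthy applicant flagged as high-risk


def misclassification_cost(y_true, y_pred):
    """Total cost of errors under the assumed cost ratio."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
cautious = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])  # 3 false positives, 0 false negatives
lenient = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])   # 0 false positives, 2 false negatives

print(misclassification_cost(y_true, cautious))  # 3.0
print(misclassification_cost(y_true, lenient))   # 20.0
```

Under these assumed costs the cautious predictions are cheaper even though they make more errors in total, which is the reasoning behind favouring recall.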

### 0.5.3 Our Approach

In this notebook, we will focus on the following modeling tasks:

1. **Data Preparation:** Address class imbalance using techniques like SMOTE or class weighting.
2. **Baseline Model:** A logistic regression model with class weights inversely proportional to class frequencies will serve as our baseline. This will provide a benchmark for evaluating more complex models.
3. **Advanced Models:** Train and evaluate models known for handling imbalanced data:
   - Decision Trees with adjusted class weights
   - Random Forest with balanced class weights
   - Gradient Boosting (XGBoost, LightGBM) with `scale_pos_weight` adjustment
4. **Hyperparameter Tuning:** We will employ techniques like GridSearchCV or RandomizedSearchCV, optimizing for the F2-score (which gives more weight to recall) or a custom cost-sensitive scoring function.
5. **Model Evaluation:** Prioritize recall in our metrics, while also considering precision, F2-score, AUC-PR, and AUC-ROC.
6. **Threshold Adjustment:** We will experiment with different classification thresholds and work closely with retail banks to determine the optimal threshold that balances their desired level of risk aversion with acceptable loan approval rates.
7. **Ensemble Methods:** Explore ensemble techniques that can improve recall without severely impacting precision.
8. **Cost-Sensitive Learning:** Incorporate misclassification costs to reflect the higher cost of false negatives, aligning the model's objective with the business goal of minimizing financial losses.
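
The class-weighting options mentioned in steps 1 to 3 can be sketched as follows. The toy label vector (8% positives) is an assumption for illustration; the calculation itself mirrors what `class_weight='balanced'` does in scikit-learn and the negatives-to-positives ratio commonly used for `scale_pos_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with 8% positives (assumed ratio, for illustration only).
y_demo = np.array([0] * 92 + [1] * 8)

# 'balanced' weights follow n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
class_weight = dict(zip([0, 1], weights))

# Ratio of negatives to positives, the value commonly used for scale_pos_weight.
scale_pos_weight = (y_demo == 0).sum() / (y_demo == 1).sum()

print(class_weight)
print(scale_pos_weight)  # 11.5
```

Both approaches express the same relative emphasis: the ratio of the two balanced weights equals the negatives-to-positives ratio.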

By the end of this notebook, we aim to have a model (or ensemble of models) that excels at identifying potential loan defaulters, providing the bank with a powerful tool for risk assessment and mitigation.


In [1]:
import warnings

import lightgbm as lgb
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    train_test_split
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.tree import DecisionTreeClassifier
import optuna

from retail_bank_risk.advanced_visualizations_utils import (
    plot_combined_confusion_matrices,
    plot_confusion_matrix,
    plot_learning_curve,
    plot_model_performance,
    plot_precision_recall_curve,
    plot_roc_curve,
    shap_force_plot,
    shap_summary_plot,
)
from retail_bank_risk.model_training_utils import (
    downscale_dtypes,
    evaluate_model,
    sanitize_feature_names,
    optimize_hyperparameters
)

warnings.filterwarnings('ignore')

In [2]:
train_df = pd.read_parquet("../data/processed/application_train_engineered.parquet")
test_df = pd.read_parquet("../data/processed/application_test_engineered.parquet")

print(f"Training Data Shape: {train_df.shape}")
print(f"Test Data Shape: {test_df.shape}")

Training Data Shape: (307511, 78)
Test Data Shape: (48744, 77)


In [3]:
train_df, test_df = downscale_dtypes(train_df, test_df, target_column='target')

train_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 78 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   reg_city_not_work_city_0                 307511 non-null  uint8  
 1   reg_city_not_work_city_1                 307511 non-null  uint8  
 2   region_rating_client_w_city              307511 non-null  float32
 3   region_rating_client                     307511 non-null  float32
 4   name_contract_type_cash loans            307511 non-null  uint8  
 5   name_contract_type_revolving loans       307511 non-null  uint8  
 6   code_gender_m                            307511 non-null  uint8  
 7   code_gender_f                            307511 non-null  uint8  
 8   flag_own_car_n                           307511 non-null  uint8  
 9   flag_own_car_y                           307511 non-null  uint8  
 10  flag_own_realty_y               

In [4]:
X = train_df.drop(["target", "sk_id_curr"], axis=1)
y = train_df["target"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_test = test_df.drop("sk_id_curr", axis=1)
sk_id_curr = test_df["sk_id_curr"]

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Test set shape: {X_test.shape}")

Training set shape: (246008, 76)
Validation set shape: (61503, 76)
Test set shape: (48744, 76)


In [5]:
pipelines = {
    'Dummy Classifier': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('classifier', DummyClassifier(strategy='stratified', random_state=42))
    ]),
    'Logistic Regression': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('scaler', StandardScaler()),
        ('feature_selection', SelectFromModel(LogisticRegression(random_state=42))),
        ('classifier', LogisticRegression(random_state=42, class_weight='balanced',
                                          max_iter=1000, penalty='l2', C=0.1))
    ]),
    'Decision Tree': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(DecisionTreeClassifier(random_state=42))),
        ('classifier', DecisionTreeClassifier(random_state=42, class_weight='balanced',
                                              max_depth=3, min_samples_split=5))
    ]),
    'Random Forest': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(RandomForestClassifier(random_state=42))),
        ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced',
                                              n_jobs=1, max_depth=5, n_estimators=100,
                                              min_samples_split=5, bootstrap=True))
    ]),
    'Gradient Boosting': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(GradientBoostingClassifier(random_state=42))),
        ('classifier', GradientBoostingClassifier(random_state=42, max_depth=3,
                                                  n_estimators=100, learning_rate=0.01,
                                                  subsample=0.8, min_samples_split=5))
    ]),
    'XGBoost': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(xgb.XGBClassifier(random_state=42))),
        ('classifier', xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss',
                                         # scale_pos_weight: negatives / positives (not total / positives)
                                         random_state=42, scale_pos_weight=(y == 0).sum() / (y == 1).sum(),
                                         max_depth=3, n_estimators=100, learning_rate=0.01,
                                         subsample=0.8, colsample_bytree=0.8,
                                         min_child_weight=5, n_jobs=-1))
    ]),
    'LightGBM': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(lgb.LGBMClassifier(random_state=42))),
        ('classifier', lgb.LGBMClassifier(random_state=42, class_weight='balanced',
                                          max_depth=3, n_estimators=100, learning_rate=0.01,
                                          subsample=0.8, colsample_bytree=0.8,
                                          min_child_samples=5, n_jobs=-1))
    ])
}

In [6]:
results = []
for name, pipeline in pipelines.items():
    result = evaluate_model(name, pipeline, X, y)
    results.append(result)

print("Model Performance Ranking:")
for metric in ['precision', 'recall', 'f1_score', 'auc_roc']:
    print(f"\nRanking by {metric}:")
    sorted_results = sorted(results, key=lambda x: x[metric], reverse=True)
    for i, result in enumerate(sorted_results, 1):
        print(f"{i}. {result['model']}: {metric} = {result[metric]:.4f}")

Evaluating Dummy Classifier...
Loaded checkpoint: ../models/dummy_classifier_checkpoint.pkl
Resumed from checkpoint for model Dummy Classifier.
Dummy Classifier Cross-validation results:
Precision: 0.0831
Recall: 0.0822
F1-Score: 0.0826
AUC-ROC: 0.5013

Evaluating Logistic Regression...
Loaded checkpoint: ../models/logistic_regression_checkpoint.pkl
Resumed from checkpoint for model Logistic Regression.
Logistic Regression Cross-validation results:
Precision: 0.1566
Recall: 0.6663
F1-Score: 0.2536
AUC-ROC: 0.7372

Evaluating Decision Tree...
Loaded checkpoint: ../models/decision_tree_checkpoint.pkl
Resumed from checkpoint for model Decision Tree.
Decision Tree Cross-validation results:
Precision: 0.2039
Recall: 0.6926
F1-Score: 0.3151
AUC-ROC: 0.7981

Evaluating Random Forest...
Loaded checkpoint: ../models/random_forest_checkpoint.pkl
Resumed from checkpoint for model Random Forest.
Random Forest Cross-validation results:
Precision: 0.2550
Recall: 0.7633
F1-Score: 0.3822
AUC-ROC: 0.88

We evaluated several machine learning models for credit risk prediction, initially prioritizing recall to minimize false negatives (missed defaulters). Recall alone, however, ignores the cost of false positives (creditworthy applicants who are rejected), so from here on we optimize for the F2-score. F2 combines precision and recall but treats recall as twice as important as precision, which matches our aim of reducing losses from defaults while limiting the number of good applicants turned away.
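
For reference, a small sketch of how the F-beta family is computed, checked against scikit-learn's `fbeta_score`. The helper `f_beta` and the toy labels are illustrative assumptions, not part of the project's utilities.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score


def f_beta(precision, recall, beta=2.0):
    """F-beta from precision and recall; beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]  # precision 0.5, recall 0.75

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

print(f"F1 = {f_beta(p, r, beta=1.0):.3f}")  # F1 = 0.600
print(f"F2 = {f_beta(p, r, beta=2.0):.3f}")  # F2 = 0.682
```

Because recall (0.75) is higher than precision (0.5) in this toy case, F2 comes out above F1.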

**XGBoost** and **LightGBM** demonstrated the strongest initial performance, surpassing models like Random Forest, Gradient Boosting, Decision Trees, Logistic Regression, and the Dummy Classifier baseline. We'll concentrate our optimization efforts on XGBoost and LightGBM.

Next Steps:

1. **Hyperparameter Tuning with Optuna:** We'll employ Optuna to fine-tune XGBoost and LightGBM, seeking to maximize the F2-score. Stratified K-Fold cross-validation during optimization keeps the class ratio consistent across folds, which gives more reliable performance estimates on this imbalanced dataset.

2. **Model Training and Evaluation:** For each model (XGBoost and LightGBM):
   - Optimize hyperparameters via Optuna, targeting maximum F2.
   - Train the model with the optimal hyperparameters.
   - Evaluate on a held-out validation set using F2, precision, and recall.

3. **Ensemble Methods (Conditional):** If optimized models perform similarly, we may explore ensemble methods to combine their strengths and potentially achieve greater predictive power.

4. **Final Model Selection:** The final model will be chosen based on optimized performance, alignment with the bank's risk tolerance, and business objectives, considering the relative costs of false negatives and positives.
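
The scoring step inside such a tuning loop can be sketched with scikit-learn alone. This is a simplified stand-in, not the project's `optimize_hyperparameters`: the data is synthetic, `GradientBoostingClassifier` replaces XGBoost/LightGBM to keep the example dependency-free, and the two hand-written candidates take the place of Optuna's sampler. An Optuna objective could return the same mean F2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumed ~10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# F2 scorer and stratified folds so every fold keeps the class ratio.
f2_scorer = make_scorer(fbeta_score, beta=2)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


def mean_cv_f2(params):
    """Mean out-of-fold F2 for one hyperparameter candidate."""
    model = GradientBoostingClassifier(random_state=42, **params)
    return cross_val_score(model, X_demo, y_demo, scoring=f2_scorer, cv=cv).mean()


# Hypothetical candidates; a real search would sample many more.
candidates = [
    {"max_depth": 2, "n_estimators": 50},
    {"max_depth": 3, "n_estimators": 100},
]
scores = [mean_cv_f2(p) for p in candidates]
best_params = candidates[scores.index(max(scores))]
print(best_params, round(max(scores), 4))
```

Scoring on out-of-fold predictions matters here: a score computed on the same rows the model was fitted on would overstate F2.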



In [7]:
storage = "sqlite:///../data/optuna_study.db"
study = optuna.load_study(study_name="xgboost_optimization", storage=storage)

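# Count only trials that finished successfully; study.trials also includes failed, pruned, and running trials.
completed_trials = len(study.get_trials(deepcopy=False, states=(optuna.trial.TrialState.COMPLETE,)))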
print(f"Number of completed trials: {completed_trials}")

remaining_trials = 100 - completed_trials
if remaining_trials > 0:
    print(f"Running {remaining_trials} more trials to reach 100 in total.")

    best_params_xgb, best_f2_score_xgb, best_model_xgb = optimize_hyperparameters(
        X=X,
        y=y,
        model_type='xgboost',
        n_trials=remaining_trials,
        n_jobs=-1,
        checkpoint_dir='../models',
        study_name="xgboost_optimization",
        storage=storage
    )

    print(f"Best XGBoost parameters: {best_params_xgb}")
    print(f"Best XGBoost F2 score: {best_f2_score_xgb}")
    print("Best XGBoost model loaded from checkpoint.")
else:
    print("Study has already completed 100 or more trials.")


Number of completed trials: 0
Running 100 more trials to reach 100 in total.
Loaded existing study 'xgboost_optimization'.


[I 2024-09-29 19:06:59,019] Trial 2 finished with value: 1.0 and parameters: {'max_depth': 9, 'learning_rate': 0.09983323972566299, 'n_estimators': 259, 'min_child_weight': 9, 'subsample': 0.997437890015122, 'colsample_bytree': 0.7500654138122846, 'gamma': 4.708616993349988e-08, 'scale_pos_weight': 13.174681179797952}. Best is trial 2 with value: 1.0.
[I 2024-09-29 19:07:24,746] Trial 3 finished with value: 0.9937706559784537 and parameters: {'max_depth': 4, 'learning_rate': 0.019593970483689452, 'n_estimators': 776, 'min_child_weight': 1, 'subsample': 0.7109257340044541, 'colsample_bytree': 0.6913268006511722, 'gamma': 0.09475090323155133, 'scale_pos_weight': 1.9158086485222128}. Best is trial 2 with value: 1.0.
[I 2024-09-29 19:07:25,122] Trial 0 finished with value: 0.9879848620623537 and parameters: {'max_depth': 9, 'learning_rate': 0.0069953888132409684, 'n_estimators': 186, 'min_child_weight': 2, 'subsample': 0.600266066900057, 'colsample_bytree': 0.7945266733158207, 'gamma': 0.0

No checkpoint found at: ../models/xgboost_trial_1_checkpoint.pkl
Best XGBoost parameters: {'max_depth': 10, 'learning_rate': 0.04551205523283438, 'n_estimators': 300, 'min_child_weight': 5, 'subsample': 0.9828458094321034, 'colsample_bytree': 0.8654181129635092, 'gamma': 0.014898258524330391, 'scale_pos_weight': 6.994696889649713}
Best XGBoost F2 score: 1.0
Best XGBoost model loaded from checkpoint.


In [8]:
study = optuna.load_study(study_name="lightgbm_optimization", storage=storage)

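# Count only trials that finished successfully; study.trials also includes failed, pruned, and running trials.
completed_trials = len(study.get_trials(deepcopy=False, states=(optuna.trial.TrialState.COMPLETE,)))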
print(f"Number of completed trials for LightGBM: {completed_trials}")

remaining_trials = 100 - completed_trials
if remaining_trials > 0:
    print(f"Running {remaining_trials} more trials for LightGBM to reach 100 in total.")

    best_params_lgb, best_f2_score_lgb, best_model_lgb = optimize_hyperparameters(
        X=X,
        y=y,
        model_type='lightgbm',
        n_trials=remaining_trials,
        n_jobs=-1,
        checkpoint_dir='../models',
        study_name="lightgbm_optimization",
        storage=storage
    )

    print(f"Best LightGBM parameters: {best_params_lgb}")
    print(f"Best LightGBM F2 score: {best_f2_score_lgb}")
    print("Best LightGBM model loaded from checkpoint.")
else:
    print("Study has already completed 100 or more trials for LightGBM.")

Number of completed trials for LightGBM: 0
Running 100 more trials for LightGBM to reach 100 in total.
Loaded existing study 'lightgbm_optimization'.
