## Notebook Overview

This notebook builds upon the feature-engineered dataset from the previous notebook (`04_feature_engineering.ipynb`) and focuses on **Model Training and Evaluation**. Our primary goal is to develop a credit risk prediction model that excels at identifying potential loan defaulters, thereby minimizing financial losses for retail banks while also considering their desired balance between risk aversion and loan approval rates. This translates to maximizing the recall of the positive class (loan defaulters) while maintaining acceptable precision and overall model performance.

### 0.5.1 Objectives

The main objectives of this notebook are:

1. **Model Selection:** Choose algorithms suitable for imbalanced classification problems.
2. **Model Training:** Train models with a focus on identifying potential defaulters.
3. **Hyperparameter Tuning:** Optimize models to increase recall for the positive class.
4. **Model Evaluation:** Assess models primarily on recall, while considering precision, F2-score, AUC-PR, and overall performance.
5. **Model Comparison:** Compare different models based on their ability to identify true positives and balance the precision-recall trade-off.
6. **Threshold Adjustment:** Explore the impact of classification thresholds on recall and precision, collaborating with retail banks to determine the optimal threshold.

### 0.5.2 Importance of Focusing on Recall

Prioritizing recall for defaulter prediction is crucial for minimizing financial losses, which is the primary business objective in credit risk assessment. The cost of missing a potential defaulter (false negative) is typically much higher than the cost of incorrectly classifying a non-defaulter as high-risk (false positive). While we prioritize recall, we will also carefully consider the precision-recall trade-off and aim for a model that maximizes recall without severely impacting precision. Techniques like threshold adjustment and cost-sensitive learning will be used to balance these metrics effectively. Furthermore, demonstrating a thorough approach to risk identification aligns with regulatory expectations in the financial sector, supporting the banks' compliance needs. This approach also allows for more conservative lending practices, which can be adjusted based on the bank's specific risk tolerance.

### 0.5.3 Our Approach

In this notebook, we will focus on the following modeling tasks:

1. **Data Preparation:** Address class imbalance using techniques like SMOTE or class weighting.
2. **Baseline Model:** A logistic regression model with class weights inversely proportional to class frequencies will serve as our baseline. This will provide a benchmark for evaluating more complex models.
3. **Advanced Models:** Train and evaluate models known for handling imbalanced data:
   - Decision Trees with adjusted class weights
   - Random Forest with balanced class weights
   - Gradient Boosting (XGBoost, LightGBM) with `scale_pos_weight` adjustment
4. **Hyperparameter Tuning:** We will employ techniques like GridSearchCV or RandomizedSearchCV, optimizing for the F2-score (which gives more weight to recall) or a custom cost-sensitive scoring function.
5. **Model Evaluation:** Prioritize recall in our metrics, while also considering precision, F2-score, AUC-PR, and AUC-ROC.
6. **Threshold Adjustment:** We will experiment with different classification thresholds and work closely with retail banks to determine the optimal threshold that balances their desired level of risk aversion with acceptable loan approval rates.
7. **Ensemble Methods:** Explore ensemble techniques that can improve recall without severely impacting precision.
8. **Cost-Sensitive Learning:** Incorporate misclassification costs to reflect the higher cost of false negatives, aligning the model's objective with the business goal of minimizing financial losses.

By the end of this notebook, we aim to have a model (or ensemble of models) that excels at identifying potential loan defaulters, providing the bank with a powerful tool for risk assessment and mitigation.


In [5]:
import warnings

import lightgbm as lgb
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    train_test_split
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.tree import DecisionTreeClassifier
import optuna

from retail_bank_risk.advanced_visualizations_utils import (
    plot_combined_confusion_matrices,
    plot_confusion_matrix,
    plot_learning_curve,
    plot_model_performance,
    plot_precision_recall_curve,
    plot_roc_curve,
    shap_force_plot,
    shap_summary_plot,
)
from retail_bank_risk.model_training_utils import (
    downscale_dtypes,
    evaluate_model,
    sanitize_feature_names,
    optimize_hyperparameters,
    load_checkpoint,
    sanitize_feature_names
)

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score, roc_auc_score

warnings.filterwarnings('ignore')

In [3]:
train_df = pd.read_parquet("../data/processed/application_train_engineered.parquet")
submission_df = pd.read_parquet("../data/processed/application_test_engineered.parquet")

print(f"Training Data Shape: {train_df.shape}")
print(f"Test Data Shape: {submission_df.shape}")

Training Data Shape: (84728, 65)
Test Data Shape: (48744, 64)


In [3]:
train_df, test_df = downscale_dtypes(train_df, submission_df, target_column='target')

train_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 72 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   reg_city_not_work_city                   307511 non-null  uint8  
 1   region_rating_client_w_city              307511 non-null  float32
 2   region_rating_client                     307511 non-null  float32
 3   name_contract_type                       307511 non-null  uint8  
 4   code_gender                              307511 non-null  uint8  
 5   flag_own_car                             307511 non-null  uint8  
 6   flag_own_realty                          307511 non-null  uint8  
 7   name_type_suite_unaccompanied            307511 non-null  uint8  
 8   name_type_suite_family                   307511 non-null  uint8  
 9   name_type_suite_spouse, partner          307511 non-null  uint8  
 10  name_type_suite_children        

AttributeError: 'DataFrame' object has no attribute 'is_missing'

In [4]:
X = train_df.drop(["target", "sk_id_curr"], axis=1)
y = train_df["target"]

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val)  # 0.25 x 0.8 = 0.2 of original data

X_submission = submission_df.drop("sk_id_curr", axis=1)
sk_id_curr = submission_df["sk_id_curr"]

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Submission set shape: {X_submission.shape}")

Training set shape: (184506, 70)
Validation set shape: (61502, 70)
Test set shape: (61503, 70)
Submission set shape: (48744, 70)


In [5]:
X_train.head()

Unnamed: 0,reg_city_not_work_city,region_rating_client_w_city,region_rating_client,name_contract_type,code_gender,flag_own_car,flag_own_realty,name_type_suite_unaccompanied,name_type_suite_family,"name_type_suite_spouse, partner",...,amt_goods_price,is_anomaly,age_group,income_group,credit_amount_group,debt_to_income_ratio,credit_to_goods_ratio,annuity_to_income_ratio,ext_source_mean,credit_exceeds_goods
88032,0,2.0,2.0,0,0,0,1,1,0,0,...,450000.0,0,1.0,3.0,3.0,3.386667,1.4224,0.165405,0.364338,1
187068,0,2.0,2.0,0,0,0,1,0,0,1,...,1183500.0,0,3.0,1.0,4.0,10.0396,1.145202,0.2945,0.476495,1
209491,0,2.0,2.0,0,0,0,0,1,0,0,...,346500.0,0,1.0,1.0,1.0,3.208333,1.0,0.167958,0.610565,0
188222,0,1.0,1.0,1,0,0,1,1,0,0,...,495000.0,0,2.0,3.0,2.0,2.972973,1.0,0.148649,0.338056,0
286715,0,1.0,1.0,0,1,1,1,1,0,0,...,472500.0,0,2.0,2.0,2.0,3.387097,1.0,0.144419,0.486503,0


In [6]:
pipelines = {
    'Dummy Classifier': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('classifier', DummyClassifier(strategy='stratified', random_state=42))
    ]),
    'Logistic Regression': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('scaler', StandardScaler()),
        ('feature_selection', SelectFromModel(LogisticRegression(random_state=42))),
        ('classifier', LogisticRegression(random_state=42, class_weight='balanced',
                                          max_iter=1000, penalty='l2', C=0.1))
    ]),
    'Decision Tree': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(DecisionTreeClassifier(random_state=42))),
        ('classifier', DecisionTreeClassifier(random_state=42, class_weight='balanced',
                                              max_depth=3, min_samples_split=5))
    ]),
    'Random Forest': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(RandomForestClassifier(random_state=42))),
        ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced',
                                              n_jobs=1, max_depth=5, n_estimators=100,
                                              min_samples_split=5, bootstrap=True))
    ]),
    'Gradient Boosting': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(GradientBoostingClassifier(random_state=42))),
        ('classifier', GradientBoostingClassifier(random_state=42, max_depth=3,
                                                  n_estimators=100, learning_rate=0.01,
                                                  subsample=0.8, min_samples_split=5))
    ]),
    'XGBoost': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(xgb.XGBClassifier(random_state=42))),
        ('classifier', xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss',
                                         random_state=42, scale_pos_weight=len(y)/sum(y),
                                         max_depth=3, n_estimators=100, learning_rate=0.01,
                                         subsample=0.8, colsample_bytree=0.8,
                                         min_child_weight=5, n_jobs=-1))
    ]),
    'LightGBM': Pipeline([
        ('sanitizer', FunctionTransformer(sanitize_feature_names)),
        ('feature_selection', SelectFromModel(lgb.LGBMClassifier(random_state=42))),
        ('classifier', lgb.LGBMClassifier(random_state=42, class_weight='balanced',
                                          max_depth=3, n_estimators=100, learning_rate=0.01,
                                          subsample=0.8, colsample_bytree=0.8,
                                          min_child_samples=5, n_jobs=-1))
    ])
}

In [7]:
results = []
for name, pipeline in pipelines.items():
    result = evaluate_model(name, pipeline, X_train, y_train, X_val, y_val)
    results.append(result)

print("Model Performance Ranking:")
for metric in ['precision', 'recall', 'f1_score', 'f2_score', 'auc_roc']:
    print(f"\nRanking by {metric}:")
    sorted_results = sorted(results, key=lambda x: x[metric], reverse=True)
    for i, result in enumerate(sorted_results, 1):
        print(f"{i}. {result['model']}: {metric} = {result[metric]:.4f}")

Evaluating Dummy Classifier...
Loaded checkpoint: ../models/dummy_classifier_checkpoint.pkl
Resumed from checkpoint for model Dummy Classifier.
Dummy Classifier Validation results:
Precision: 0.0792
Recall: 0.0783
F1-Score: 0.0788
F2-Score: 0.0785
AUC-ROC: 0.4992

Evaluating Logistic Regression...
No checkpoint found at: ../models/logistic_regression_checkpoint.pkl
Training model Logistic Regression
Saved checkpoint: ../models/logistic_regression_checkpoint.pkl
Checkpoint contents: ['model']
Saved new checkpoint for Logistic Regression
Logistic Regression Validation results:
Precision: 0.1548
Recall: 0.6604
F1-Score: 0.2507
F2-Score: 0.3994
AUC-ROC: 0.7355

Evaluating Decision Tree...
No checkpoint found at: ../models/decision_tree_checkpoint.pkl
Training model Decision Tree
Saved checkpoint: ../models/decision_tree_checkpoint.pkl
Checkpoint contents: ['model']
Saved new checkpoint for Decision Tree
Decision Tree Validation results:
Precision: 0.2262
Recall: 0.6292
F1-Score: 0.3327
F2-

We evaluated several machine learning models for credit risk prediction, focusing on the **F2-score** as our primary metric. The F2-score balances precision and recall, giving higher weight to recall, which aligns with our aim to reduce financial losses from defaults (false negatives) while mitigating the negative effects of rejecting creditworthy applicants (false positives).

The initial evaluation results, using a held-out validation set, are as follows:

| Model             | Precision | Recall | F1-Score | F2-Score | AUC-ROC |
|--------------------|-----------|--------|----------|----------|---------|
| Dummy Classifier   | 0.0792    | 0.0783 | 0.0788   | 0.0785   | 0.4992  |
| Logistic Regression | 0.1562    | 0.6659 | 0.2531   | 0.4030   | 0.7357  |
| Decision Tree      | 0.2262    | 0.6292 | 0.3327   | 0.4639   | 0.7848  |
| Random Forest      | 0.2600    | 0.7589 | 0.3873   | 0.5484   | 0.8864  |
| Gradient Boosting  | 1.0000    | 0.1525 | 0.2646   | 0.1836   | 0.8469  |
| **XGBoost**         | **0.3216**    | **0.8705** | **0.4697**   | **0.6490**   | **0.9433**  |
| **LightGBM**        | **0.2875**    | **0.8248** | **0.4264**   | **0.6004**   | **0.9107**  |

**XGBoost** and **LightGBM** clearly outperformed the other models, achieving the highest F2-scores and AUC-ROC values. We will proceed with these two models for further optimization.

**Next Steps:**

1. **Hyperparameter Tuning with Optuna:** We'll use Optuna to fine-tune the hyperparameters of XGBoost and LightGBM, aiming to maximize the F2-score on the validation set.

2. **Final Model Selection:** We'll compare the optimized XGBoost and LightGBM models based on their performance on the validation set, considering the F2-score, precision, recall, and AUC-ROC. The final model will be selected based on these metrics and alignment with the bank's risk tolerance and business objectives.

3. **Evaluation on Test Set:** The chosen model will be evaluated on the held-out test set to estimate its real-world performance.

4. **Submission:** Predictions will be generated using the final model on the submission dataset and submitted for evaluation.

By focusing on F2-score optimization and carefully evaluating our models, we aim to develop a robust and effective credit risk prediction model that meets the needs of retail banks. 

In [8]:
storage = "sqlite:///../data/optuna_study.db"

In [9]:
study_name = "xgboost_optimization"

try:
    study = optuna.load_study(study_name=study_name, storage=storage)
    print(f"Loaded existing study '{study_name}'.")
except KeyError:
    study = optuna.create_study(
        study_name=study_name,
        storage=storage,
        direction="maximize",
        load_if_exists=True
    )
    print(f"Created new study '{study_name}'.")

# Get the number of completed trials
completed_trials = len(study.trials)
print(f"Number of completed trials for XGBoost: {completed_trials}")

# Calculate remaining trials
remaining_trials = 100 - completed_trials
if remaining_trials > 0:
    print(f"Running {remaining_trials} more trials for XGBoost to reach 100 in total.")

    # Run the optimization
    results = optimize_hyperparameters(
        x_train=X_train,
        y_train=y_train,
        x_val=X_val,
        y_val=y_val,
        model_type='xgboost',
        n_trials=remaining_trials,
        n_jobs=-1,
        checkpoint_dir='../models',
        study_name=study_name,
        storage=storage
    )

    print(f"Best XGBoost parameters: {results['best_params']}")
    print(f"Best XGBoost F2 score: {results['f2_score']}")
    print(f"Best XGBoost model saved as: {results['model']}")
else:
    print("Study has already completed 100 or more trials.")

Loaded existing study 'xgboost_optimization'.
Number of completed trials for XGBoost: 100
Study has already completed 100 or more trials.


# Remove miss encoded features, missing values may be good, remove feature selection>
# check for binning strategies, reduce forest complexity, feature importance check, compare the values inputed

In [12]:
study_name = "lightgbm_optimization"

# Try to load the study, create a new one if it doesn't exist
try:
    study = optuna.load_study(study_name=study_name, storage=storage)
    print(f"Loaded existing study '{study_name}'.")
except KeyError:
    study = optuna.create_study(
        study_name=study_name,
        storage=storage,
        direction="maximize",
        load_if_exists=True
    )
    print(f"Created new study '{study_name}'.")

completed_trials = len(study.trials)
print(f"Number of completed trials for LightGBM: {completed_trials}")

remaining_trials = 100 - completed_trials
if remaining_trials > 0:
    print(f"Running {remaining_trials} more trials for LightGBM to reach 100 in total.")

    results = optimize_hyperparameters(
        x_train=X_train,
        y_train=y_train,
        x_val=X_val,
        y_val=y_val,
        model_type='lightgbm',
        n_trials=remaining_trials,
        n_jobs=-1,
        checkpoint_dir='../models',
        study_name=study_name,
        storage=storage
    )

    print(f"Best LightGBM parameters: {results['best_params']}")
    print(f"Best LightGBM F2 score: {results['f2_score']}")
    print(f"Best LightGBM model saved as: {results['model']}")

Loaded existing study 'lightgbm_optimization'.
Number of completed trials for LightGBM: 41
Running 59 more trials for LightGBM to reach 100 in total.
Optimizing hyperparameters for lightgbm using cross-validation...
Training data validation passed. X shape: (184506, 70), y shape: (184506,)
Validation data validation passed. X shape: (61502, 70), y shape: (61502,)
Loaded existing study 'lightgbm_optimization' with 41 trials.
[LightGBM] [Info] Number of positive: 11916, number of negative: 135688
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008919 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3516
[LightGBM] [Info] Number of data points in the train set: 147604, number of used features: 65
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
[LightGBM] [In

In [None]:
from sklearn.model_selection import StratifiedKFold
import numpy as np

sanitizer = FunctionTransformer(sanitize_feature_names)

def evaluate_model_cv(model, X, y, n_splits=5):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    auc_scores = []

    for fold, (train_index, val_index) in enumerate(skf.split(X, y), 1):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]

        X_train_sanitized = sanitizer.transform(X_train)
        X_val_sanitized = sanitizer.transform(X_val)

        model.fit(X_train_sanitized, y_train)
        y_pred_proba = model.predict_proba(X_val_sanitized)[:, 1]
        auc_score = roc_auc_score(y_val, y_pred_proba)
        auc_scores.append(auc_score)

        print(f"Fold {fold} AUC: {auc_score:.4f}")

    mean_auc = np.mean(auc_scores)
    std_auc = np.std(auc_scores)
    print(f"\nMean AUC: {mean_auc:.4f} (±{std_auc:.4f})")
    return mean_auc

# Load the data
X = train_df.drop(["target", "sk_id_curr"], axis=1)
y = train_df["target"]

# Evaluate models
xgb_model = load_checkpoint("xgboost_best", '../models', is_tuned=True)['model']
lgb_model = load_checkpoint("lightgbm_best", '../models', is_tuned=True)['model']

print("XGBoost Evaluation:")
xgb_auc = evaluate_model_cv(xgb_model, X, y)

print("\nLightGBM Evaluation:")
lgb_auc = evaluate_model_cv(lgb_model, X, y)

# Choose the best model
best_model = xgb_model if xgb_auc > lgb_auc else lgb_model
best_model_name = "XGBoost" if xgb_auc > lgb_auc else "LightGBM"

# Generate predictions for submission
X_submission = submission_df.drop("sk_id_curr", axis=1)
X_submission_sanitized = sanitizer.transform(X_submission)

submission_predictions = best_model.predict_proba(X_submission_sanitized)[:, 1]

# Create submission file
submission_df = pd.DataFrame({
    "sk_id_curr": submission_df["sk_id_curr"].astype(int),
    "target": submission_predictions
})

# submission_df.to_csv("submission.csv", index=False)
print(f"\nSubmission file generated using the best model ({best_model_name}).")

print("\nDistribution of predictions:")
print(submission_df['target'].describe())

In [8]:
import joblib

# 1. Load the best model
best_model_path = f"../models/tuned_xgboost_best_checkpoint.pkl"
best_model_data = joblib.load(best_model_path)
best_model = best_model_data['model']
selected_features = best_model_data['selected_features']

# 2. Apply feature selection to the test set
X_test_selected = X_test[selected_features]

# 3. Make predictions on the test set
y_test_pred = best_model.predict(X_test_selected)
y_test_pred_proba = best_model.predict_proba(X_test_selected)[:, 1]

# 4. Calculate and report performance metrics
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_f2 = fbeta_score(y_test, y_test_pred, beta=2)
test_auc_roc = roc_auc_score(y_test, y_test_pred_proba)

print("Test Set Results:")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"F1-Score: {test_f1:.4f}")
print(f"F2-Score: {test_f2:.4f}")
print(f"AUC-ROC: {test_auc_roc:.4f}")

# 5. Compare with validation results
print("\nComparison with Validation Results:")
print(f"Validation F2-Score: {results['f2_score']:.4f}")
print(f"Test F2-Score:       {test_f2:.4f}")
print(f"Difference:          {abs(results['f2_score'] - test_f2):.4f}")

KeyError: "['name_family_status_civil_marriage'] not in index"