Freddie Mac MLPD Credit Risk Modeling (PD & LGD)

An explainable machine learning pipeline for building regulatory-grade credit risk models — Probability of Default (PD) and Loss Given Default (LGD) — using the Freddie Mac Multifamily Loan Performance Database (MLPD).


Overview

This project develops credit risk models for Freddie Mac's multifamily loan portfolio using a two-stage approach:

  1. Explainable ML for feature selection — LASSO, Random Forest, and gradient boosting (XGBoost/LightGBM) with SHAP-based attribution identify the most predictive and economically interpretable features.
  2. Regulatory-grade econometric models — The selected features feed into logistic regression (PD) and OLS/Tobit regression (LGD), satisfying interpretability and auditability requirements for model risk management (e.g., SR 11-7).

The workflow is designed so that ML methods inform—but do not replace—econometric models, striking the balance between predictive power and regulatory acceptability.
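As a minimal sketch of one leg of that hand-off (LASSO selection feeding an unpenalized logistic regression), assuming a preprocessed feature matrix `X` and binary default flag `y`; the helper name and hyperparameters are illustrative, not the project's final settings:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

def lasso_then_logit(X: pd.DataFrame, y: pd.Series):
    """Stage 1: L1-penalized selection. Stage 2: plain logistic regression."""
    Xs = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
    lasso = LogisticRegressionCV(
        penalty="l1", solver="saga", Cs=20, cv=5, scoring="roc_auc", max_iter=5000
    ).fit(Xs, y)
    selected = X.columns[np.abs(lasso.coef_[0]) > 1e-6].tolist()
    # Stage 2: unpenalized logit on the original (unscaled) selected features,
    # so coefficients keep their economic units for documentation.
    logit = sm.Logit(y, sm.add_constant(X[selected])).fit(disp=0)
    return selected, logit

# selected, fit = lasso_then_logit(X_train, y_train); print(fit.summary())
```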


Data

Source: Freddie Mac Multifamily Loan Performance Database (MLPD), as of 2025Q2.

| File | Description | Size |
|---|---|---|
| MLPD_datamart\mlpdy25q2_txt.txt | Full panel database (bar-delimited), 1994–2025Q2 | ~211 MB |
| MLPD_datamart\mlpd_y1994q1_y2008q4.csv.xlsx | Snapshot data, 1994–2008 | ~19 MB |
| MLPD_datamart\mlpd_y2009q1_y2025q2.csv.xlsx | Snapshot data, 2009–2025Q2 | ~124 MB |
| MLPD_data_dictionary.pdf | Freddie Mac data dictionary | ~0.2 MB |
| MLPD_Loss_Summary.pdf | Freddie Mac loss summary | ~0.8 MB |
| mlpd_terms_conditions.pdf | Freddie Mac terms and conditions | ~0.07 MB |

The panel file contains one observation per loan per quarter (~61,000+ unique loans). The snapshot files contain one observation per loan at its last reported quarter.

Data is not tracked in this repository. Store raw and processed data outside the repo under:

  • C:\Users\Han Wang\projects\data\MLPD\raw
  • C:\Users\Han Wang\projects\data\MLPD\processed

See Data Setup.

Key Variables

| Field | Name | Description |
|---|---|---|
| lnno | Loan ID | Unique loan identifier |
| quarter | Quarter | Observation period (e.g., y2020q1) |
| mrtg_status | Loan Status | 100=Current, 200=60+DPD, 250=Mod w/ loss, 300=Foreclosure, 450=REO, 500=Closed |
| amt_upb_endg | Ending Balance | UPB at end of quarter |
| amt_upb_pch | Original Balance | UPB at Freddie Mac purchase |
| rate_dcr | Original DCR | Debt Service Coverage Ratio at origination |
| rate_ltv | Original LTV | Loan-to-Value ratio at origination |
| rate_int | Note Rate | Current interest rate |
| code_int | Rate Type | FIX or VAR |
| cnt_mrtg_term | Mortgage Term | Loan term in months |
| cnt_io_per | IO Period | Interest-only period in months |
| code_st | State | Property state |
| geographical_region | Metro | Geographic region |
| Credit_loss | Credit Loss | Realized credit loss (defaulted loans only) |
| Sales_Price | Sales Price | REO disposition price |

Methodology

Target Variables

PD (Probability of Default)

  • Binary indicator: loan enters mrtg_status ∈ {200, 250, 300, 450} (60+ days delinquent or worse) at any point during its life.
  • Modeled at the loan-level using origination and early-life characteristics, avoiding look-ahead bias.

LGD (Loss Given Default)

  • Severity: Credit_loss / amt_upb_pch for disposed defaulted properties.
  • From the loss summary: 172 disposed properties, total credit loss $376.9M, average severity ~32% (34.4% post-2008, 11.3% pre-2008).
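Both targets can be constructed from the quarterly panel with a simple groupby. The sketch below assumes a DataFrame `panel` holding the MLPD fields listed above; the helper itself is illustrative, not the project's loader code:

```python
import pandas as pd

DEFAULT_STATUSES = {200, 250, 300, 450}  # 60+ DPD, mod w/ loss, foreclosure, REO

def build_targets(panel: pd.DataFrame) -> pd.DataFrame:
    """One row per loan: ever-default flag and, where available, loss severity."""
    panel = panel.copy()
    panel["is_default_obs"] = panel["mrtg_status"].isin(DEFAULT_STATUSES)
    loan = panel.groupby("lnno").agg(
        default_flag=("is_default_obs", "max"),
        amt_upb_pch=("amt_upb_pch", "first"),
        credit_loss=("Credit_loss", "max"),   # populated only for disposed defaults
    )
    # LGD severity = realized credit loss / original balance, defaulted loans only
    loan["lgd"] = (loan["credit_loss"] / loan["amt_upb_pch"]).where(loan["default_flag"] == 1)
    return loan.reset_index()
```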

Pipeline

Raw Panel Data
    │
    ▼
1. Data Processing & EDA
   ├── Parse bar-delimited panel file
   ├── Construct loan-level features from origination snapshot
   ├── Label defaults and compute LGD for disposed properties
   └── Train/validation/test split (temporal: pre-2020 / 2020-2022 / 2023+)
    │
    ▼
2. Feature Engineering
   ├── Origination characteristics (LTV, DCR, rate, term, IO period)
   ├── Property characteristics (state, metro, unit count)
   ├── Loan structure (fixed vs. variable, balloon term, amortization)
   ├── Interaction terms and transformations
   └── Macro/vintage controls (origination year, rate environment)
    │
    ▼
3. Explainable ML — Feature Selection
   ├── LASSO Logistic Regression (L1 regularization path)
   ├── Random Forest (Gini impurity importance + permutation importance)
   │   └── [Local] Activated-path modal cluster per loan (explain_loan_rf)
   ├── XGBoost / LightGBM (gain importance + SHAP values)
   │   └── [Local] Boosting-round decomposition per loan (explain_loan_xgb_sequential)
   └── Consensus feature ranking across methods
    │
    ▼
4. Econometric Models (Regulatory Use)
   ├── PD: Logistic Regression on SHAP-selected features
   │   ├── Coefficient sign/magnitude validation
   │   ├── Out-of-time validation (AUC, KS statistic, Gini coefficient)
   │   ├── Calibration (Hosmer-Lemeshow test, reliability diagrams)
   │   └── ⚠ Class imbalance checks (event rate ~0.29%; intercept correction, cut-off audit)
   └── LGD: OLS / Tobit / Beta Regression on SHAP-selected features
       ├── Fractional response model (LGD ∈ [0,1])
       ├── Out-of-time validation (RMSE, MAE)
       └── Residual analysis and heteroskedasticity checks
    │
    ▼
5. Model Validation & Reporting
   ├── SHAP summary plots (beeswarm, bar, dependence)
   ├── Partial dependence plots (PDPs)
   ├── Segment analysis (by vintage, LTV bucket, state, rate type)
   ├── LLM-assisted narrative generation (constrained Translator; SHAP input required)
   └── Regulatory documentation
    │
    ▼
6. [Challenger] PD Term Structure via Survival Analysis
   ├── Discrete-time hazard (DtH) model — quarterly intervals
   ├── Cox Proportional Hazard + Extended Cox PH (time-varying coefficients)
   ├── Random Boosting Forest (RBF) survival ensemble
   ├── Kaplan-Meier non-parametric benchmark
   └── PD term structure: PD_marginal(t) = Lifetime PD(t) − Lifetime PD(t−1)

Explainability Strategy

The key modeling constraint is regulatory interpretability. ML models are used as feature selectors and economic validators, not as the final scoring model:

| ML Method | Role | Regulatory Benefit |
|---|---|---|
| LASSO | Automatic variable selection via L1 penalty; produces sparse, interpretable coefficient paths | Directly maps to logistic regression with fewer features |
| Random Forest | Non-parametric importance ranking; captures non-linearities | Validates LASSO selection; identifies interaction effects |
| XGBoost | Gradient boosting with SHAP decomposition | SHAP values provide additive feature attribution (local + global) |
| SHAP | Unified feature attribution across all models | Provides directional consistency checks against economic theory |

The final econometric models must satisfy:

  • Monotonicity: Higher LTV → higher PD; lower DCR → higher PD
  • Economic interpretability: All retained variables have theoretically grounded signs
  • Stability: Coefficients stable across sub-samples and time periods
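A lightweight way to automate the sign check during validation is sketched below; the expected-sign map is an illustrative assumption (only a few fields shown) and `fit` is a fitted statsmodels results object:

```python
import numpy as np
import pandas as pd

# Expected coefficient directions from economic theory (+1 raises PD, -1 lowers PD).
# Illustrative subset only -- the real map would cover every retained variable.
EXPECTED_SIGNS = {"rate_ltv": +1, "rate_dcr": -1, "rate_int": +1}

def check_signs(fit, expected: dict = EXPECTED_SIGNS) -> pd.DataFrame:
    """Flag any retained coefficient whose sign contradicts the expected direction."""
    rows = []
    for name, sign in expected.items():
        if name in fit.params.index:
            rows.append({
                "feature": name,
                "coef": fit.params[name],
                "expected_sign": sign,
                "ok": np.sign(fit.params[name]) == sign,
            })
    return pd.DataFrame(rows)
```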

Activated-Path Explainability for Random Forest (Loan-Level)

Global Gini/permutation importance summarizes variable relevance across the full training set but cannot explain why the model flagged any specific loan. For loan-level explanations — required when model risk reviewers or auditors query individual predictions — the activated-path + modal cluster method is implemented in src/models/selection.py as a complement to global importance.

For observation $x$, the decision path through tree $T_k$ is the sequence of nodes traversed from root to leaf: $\text{path}_k(x) = (n_0 \to n_1 \to \cdots \to \ell)$, extractable via scikit-learn's decision_path(X) in $O(D)$ per tree. The procedure:

  1. Extract the activated path from every majority-vote tree.
  2. Represent each path as the set of (feature, split direction) tuples it traverses.
  3. Compute pairwise Jaccard similarity across trees: $J(k,k') = \frac{|\text{features}_k \cap \text{features}_{k'}|}{|\text{features}_k \cup \text{features}_{k'}|}$.
  4. Identify the modal cluster — the plurality of trees sharing the same core feature sequence.
  5. Report a frequency-weighted rule set, e.g.: LTV > 0.78 (73% of trees) → DCR < 1.15 (68%) → 7yr term (61%) → P(Default) = 0.82.

This is more defensible than cherry-picking the single most confident tree, which may follow an idiosyncratic path due to bootstrap sampling and random feature subsetting. The modal cluster represents the consensus reasoning of the forest for that specific loan.

Implementation target: src/models/selection.py — explain_loan_rf(loan_id, model, X).
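A condensed sketch of what explain_loan_rf might look like, assuming a fitted scikit-learn RandomForestClassifier; for brevity the modal cluster is approximated by exact rule-set matching rather than full pairwise Jaccard clustering:

```python
import numpy as np
from collections import Counter

def explain_loan_rf(x_row: np.ndarray, rf, feature_names):
    """Activated-path / modal-cluster explanation for a single loan (sketch).

    x_row: 1-D array of the loan's features, in training column order.
    Returns the (feature, direction) rules of the modal cluster, each with the
    fraction of majority-vote trees whose activated path contains that rule.
    """
    x = x_row.reshape(1, -1)
    pred = rf.predict(x)[0]
    path_sets = []
    for est in rf.estimators_:
        if est.predict(x)[0] != pred:                 # keep only majority-vote trees
            continue
        node_ids = est.decision_path(x).indices       # nodes visited root -> leaf
        tree = est.tree_
        rules = frozenset(
            (feature_names[tree.feature[n]],
             "<=" if x_row[tree.feature[n]] <= tree.threshold[n] else ">")
            for n in node_ids if tree.feature[n] >= 0  # skip the leaf node
        )
        path_sets.append(rules)
    # Modal cluster: the most common rule set; frequencies counted over all paths.
    modal = Counter(path_sets).most_common(1)[0][0]
    freq = Counter(r for rules in path_sets for r in rules)
    return [(feat, op, freq[(feat, op)] / len(path_sets)) for feat, op in modal]
```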

Boosting-Round Decomposition for XGBoost (Loan-Level)

In XGBoost, each tree $T_k$ contributes a residual correction $f_k(x)$ rather than a standalone prediction. Early trees (low $k$) capture dominant large-residual signals; late trees apply subtle corrections. Aggregated SHAP obscures this sequential structure. The boosting-round decomposition partitions the $K$ trees into tertiles (primary / secondary / refinement) and computes feature frequency in activated paths, weighted by each tree's leaf contribution magnitude:

$$\phi_j^{(\text{stage})}(x) = \sum_{k \in \text{stage}} |f_k(x)| \cdot \mathbf{1}[j \in \text{path}_k(x)]$$

A representative output:

Primary signal   (trees   1– 50):  LTV > 0.85      (weighted freq: 89%)
Secondary signal (trees  51–150):  DCR < 1.10      (weighted freq: 64%)
Refinement       (trees 151–300):  Loan Age < 24mo (weighted freq: 41%)

This hierarchical summary maps naturally onto the economic narrative expected in SR 11-7 documentation and provides richer input to the LLM narration step than a flat SHAP bar chart. Note that full additive attribution across all trees converges to TreeSHAP (Lundberg et al., 2020), which remains the rigorous baseline; the boosting-round decomposition is a supplementary diagnostic, not a replacement.

Implementation target: src/models/selection.py — explain_loan_xgb_sequential(loan_id, model, X).
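A sketch of the decomposition using xgboost's tree dump and leaf predictions (a binary objective and a single-row input are assumed; the stage partitioning follows the tertile idea above):

```python
import pandas as pd
import xgboost as xgb
from collections import defaultdict

def explain_loan_xgb_sequential(x_row: pd.DataFrame, booster: xgb.Booster, n_stages: int = 3):
    """Stage-wise, |leaf value|-weighted frequency of features on the activated path."""
    trees = booster.trees_to_dataframe()                               # one row per node
    leaf_ids = booster.predict(xgb.DMatrix(x_row), pred_leaf=True)[0]  # leaf node per tree
    n_trees = len(leaf_ids)
    weights = defaultdict(float)

    for k in range(n_trees):
        tk = trees[trees["Tree"] == k]
        # Map each child node ID to (parent ID, parent split feature) to trace the path.
        parent = {}
        for _, row in tk[tk["Feature"] != "Leaf"].iterrows():
            parent[row["Yes"]] = (row["ID"], row["Feature"])
            parent[row["No"]] = (row["ID"], row["Feature"])
        leaf_row = tk[tk["Node"] == int(leaf_ids[k])].iloc[0]
        leaf_weight = abs(leaf_row["Gain"])            # leaf value is stored in 'Gain'
        stage = min(k * n_stages // n_trees, n_stages - 1)
        node_id = leaf_row["ID"]
        while node_id in parent:                       # walk leaf -> root
            node_id, feat = parent[node_id]
            weights[(stage, feat)] += leaf_weight

    out = pd.DataFrame([(s, f, w) for (s, f), w in weights.items()],
                       columns=["stage", "feature", "weighted_freq"])
    return out.sort_values(["stage", "weighted_freq"], ascending=[True, False])
```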

LLM-Assisted Narrative Generation (Constrained Translator)

SHAP values and rule clusters are technically precise but require translation for non-quantitative stakeholders (model risk management, internal audit, credit officers). A constrained LLM Translator pipeline generates plain-language explanations for inclusion in reports/model_documentation/.

The critical operating constraint, supported by Geng et al. (2026): the LLM must receive structured attribution output as input and must not be asked to determine feature importance autonomously from raw loan characteristics. LLMs operating without structured attribution invoke domain priors that may be inconsistent with the model's empirical behavior — a failure mode that is particularly dangerous in low-default portfolios where feature importance signals are subtle.

Recommended workflow per loan explanation:

  1. Compute TreeSHAP (XGBoost) or modal path cluster (Random Forest) for the target observation.
  2. Format as a structured feature-attribution table: feature, direction, magnitude/frequency.
  3. Pass to LLM with a prompt constraining narration strictly to the provided attribution.
  4. Human reviewer verifies the narrative against the attribution table before inclusion in documentation.
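A sketch of steps 2–3, i.e. formatting the attribution table and constraining the prompt; the prompt wording is illustrative and no particular LLM client is assumed:

```python
import pandas as pd

PROMPT_TEMPLATE = """You are documenting a credit risk model for model risk management.
Write a short plain-language explanation of this loan's predicted default risk.
Use ONLY the feature attributions in the table below. Do not introduce any factor
that is not listed, and do not speculate about causes outside the table.

{attribution_table}
"""

def build_translator_prompt(attributions: pd.DataFrame) -> str:
    """attributions: columns ['feature', 'direction', 'magnitude'] from SHAP or the modal cluster."""
    return PROMPT_TEMPLATE.format(attribution_table=attributions.to_string(index=False))

# The resulting prompt is passed to whichever LLM client the project standardizes on;
# the generated narrative is then checked against `attributions` by a human reviewer.
```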

Class Imbalance and PD Calibration

The MLPD portfolio exhibits a severe class imbalance: 178 defaults among ~61,200 loans yields an empirical event rate of approximately 0.29% — well below the 1% threshold at which Schutte et al. (2026) document material calibration bias in logistic regression. Specific implications for this project:

  • Primary performance metric: Gini coefficient (= 2·AUC − 1) is robust to class imbalance at large sample sizes and is the appropriate headline metric. Accuracy-based metrics are misleading at this event rate.
  • Optimal cut-off: The standard 0.5 threshold is inappropriate. The optimal cut-off migrates downward as the event rate falls. Cut-off selection should be based on a cost-sensitive criterion (e.g., minimizing weighted misclassification cost) evaluated on out-of-time data.
  • Calibration: Raw logistic regression intercepts are biased downward at very low event rates. After model estimation, verify that the mean predicted PD aligns with the empirical default rate. If not, apply an intercept correction: $\hat{\beta}_0^* = \hat{\beta}_0 - \log\!\left(\frac{\bar{y}}{1-\bar{y}}\right) + \log\!\left(\frac{\pi_s}{1-\pi_s}\right)$, where $\bar{y}$ is the event rate in the estimation sample and $\pi_s$ is the population rate.
  • Re-sampling: SMOTE or undersampling may be applied during ML feature selection steps to improve gradient signal, but the final logistic regression should be estimated on the original (uncorrected) sample with intercept correction applied post-estimation.

Implementation target: src/utils/metrics.py — add calibration_check(y_true, y_pred, target_rate) and optimal_cutoff(y_true, y_pred, cost_ratio).
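A sketch of the two helpers (signatures follow the implementation target above; the cost-weighted cut-off criterion is one reasonable choice, not a prescribed one):

```python
import numpy as np

def calibration_check(y_true, y_pred, target_rate=None):
    """Compare mean predicted PD with the empirical (or supplied) default rate."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    target = y_true.mean() if target_rate is None else target_rate
    return {"mean_pd": y_pred.mean(), "target_rate": target,
            "ratio": y_pred.mean() / target}

def optimal_cutoff(y_true, y_pred, cost_ratio=10.0):
    """Cut-off minimizing expected cost, with a missed default costing `cost_ratio`
    times a false alarm. Evaluate on out-of-time data, not the training sample."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cutoffs = np.unique(y_pred)
    costs = [cost_ratio * ((y_true == 1) & (y_pred < c)).sum()   # missed defaults
             + ((y_true == 0) & (y_pred >= c)).sum()             # false alarms
             for c in cutoffs]
    return cutoffs[int(np.argmin(costs))]
```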

PD Term Structure: Survival Analysis Challenger

The binary PD formulation (did the loan ever default?) requires an additional mapping step to produce the lifetime PD term structure $\{PD(t)\}_{t=1}^{T}$ required for CECL expected loss calculations. The MLPD panel structure — one observation per loan per quarter from origination to disposition — directly supports a survival analysis formulation that generates this term structure natively.

Four models are implemented as challengers to the logistic regression baseline:

| Model | Assumption | Strength | Limitation |
|---|---|---|---|
| Kaplan-Meier | Non-parametric | No assumptions; benchmark | No covariate conditioning |
| Cox PH | Proportional hazards | Parsimonious; interpretable coefficients | Proportionality may fail over long horizons |
| Extended Cox PH | Time-varying coefficients | Relaxes proportionality | More parameters; less stable on sparse data |
| Random Boosting Forest (RBF) | Non-parametric ensemble | Captures non-linearities and interactions | Less interpretable; higher C-index but higher AIC |

The marginal PD at each horizon $t$ is recovered from the cumulative survival function $S(t)$:

$$PD_{\text{marginal}}(t) = S(t-1) - S(t) = \text{Lifetime PD}(t) - \text{Lifetime PD}(t-1)$$

Based on Moremoholo et al. (2026) — who apply the same model comparison on Freddie Mac single-family data — Cox PH is expected to achieve the best AIC/BIC (parsimony) while Extended Cox PH and RBF achieve higher C-index. For regulatory documentation under SR 11-7, Cox PH is the preferred production model; Extended Cox PH and RBF serve as benchmark challengers in model validation.

Implementation targets:

  • notebooks/07_PD_survival.ipynb — survival model estimation and comparison
  • src/models/pd_model.py — SurvivalPDModel class wrapping lifelines / scikit-survival
  • src/utils/metrics.py — c_index(), brier_score(), marginal_pd_curve()
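A sketch of how marginal_pd_curve could recover the term structure from a fitted lifelines CoxPHFitter; the quarterly grid, column names, and single-loan input are assumptions:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def marginal_pd_curve(cph: CoxPHFitter, loan_covariates: pd.DataFrame,
                      horizon_quarters: int = 40) -> pd.DataFrame:
    """Convert S(t) from a fitted Cox PH model into lifetime and marginal PD by quarter."""
    times = np.arange(0, horizon_quarters + 1)
    surv = cph.predict_survival_function(loan_covariates, times=times)  # rows: t, cols: loans
    lifetime_pd = 1.0 - surv
    marginal_pd = lifetime_pd.diff()                   # S(t-1) - S(t)
    marginal_pd.iloc[0] = lifetime_pd.iloc[0]          # first interval starts at t = 0
    return pd.DataFrame({"quarter": times,
                         "lifetime_pd": lifetime_pd.iloc[:, 0].values,
                         "marginal_pd": marginal_pd.iloc[:, 0].values})

# Typical use (column names illustrative):
# cph = CoxPHFitter().fit(train_df, duration_col="duration_q", event_col="default_flag")
# curve = marginal_pd_curve(cph, test_df.loc[[some_loan_index]])
```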

Repository Structure

MLPD/
├── README.md
├── .gitignore
├── requirements.txt
│
├── notebooks/
│   ├── 01_EDA.ipynb                    # Exploratory data analysis
│   ├── 02_feature_engineering.ipynb
│   ├── 03_PD_feature_selection.ipynb   # LASSO / RF / XGB + SHAP
│   ├── 04_PD_econometric.ipynb         # Logistic regression (final PD model)
│   ├── 05_LGD_feature_selection.ipynb
│   ├── 06_LGD_econometric.ipynb        # OLS / Tobit (final LGD model)
│   └── 07_PD_survival.ipynb            # [Challenger] Survival analysis PD term structure
│
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── loader.py            # Parse bar-delimited panel file
│   │   └── preprocess.py        # Feature construction, train/test split
│   ├── features/
│   │   └── engineer.py          # Feature engineering functions
│   ├── models/
│   │   ├── pd_model.py          # PD model classes (ML + econometric + SurvivalPDModel)
│   │   ├── lgd_model.py         # LGD model classes
│   │   └── selection.py         # SHAP-based feature selection + explain_loan_rf + explain_loan_xgb_sequential
│   └── utils/
│       ├── metrics.py           # AUC, KS, Gini, RMSE, calibration_check, optimal_cutoff, c_index, brier_score
│       └── plots.py             # SHAP plots, PDPs, calibration curves, survival curves
│
└── reports/
    ├── figures/                 # Model output charts
    └── model_documentation/     # Regulatory model documentation

Data Setup

  1. Download the MLPD from https://mf.freddiemac.com/investors/data.
  2. Place raw files and documentation under:
C:\Users\Han Wang\projects\data\MLPD\raw\
├── MLPD_datamart\
│   ├── mlpdy25q2_txt.txt
│   ├── mlpd_y1994q1_y2008q4.csv.xlsx
│   └── mlpd_y2009q1_y2025q2.csv.xlsx
├── MLPD_data_dictionary.pdf
├── MLPD_Loss_Summary.pdf
└── mlpd_terms_conditions.pdf
  3. Store derived outputs under:
C:\Users\Han Wang\projects\data\MLPD\processed\
  4. Keep code in the repository folder only:
C:\Users\Han Wang\projects\repos\MLPD
  5. If scripts expect direct file paths, point them to the nested datamart folder:
C:\Users\Han Wang\projects\data\MLPD\raw\MLPD_datamart
  6. Convert the raw files into easier-to-load parquet outputs with:
python scripts/convert_raw_to_parquet.py

This writes:

C:\Users\Han Wang\projects\data\MLPD\processed\panel\mlpdy25q2_part_###.parquet
C:\Users\Han Wang\projects\data\MLPD\processed\panel\manifest.csv
C:\Users\Han Wang\projects\data\MLPD\processed\snapshots\*.parquet
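The conversion script itself is not reproduced here; a minimal sketch of the panel-file leg (assuming pandas with a parquet engine such as pyarrow installed; the delimiter, paths, and part naming follow the layout above) might look like:

```python
import pathlib
import pandas as pd

RAW = pathlib.Path(r"C:\Users\Han Wang\projects\data\MLPD\raw\MLPD_datamart")
OUT = pathlib.Path(r"C:\Users\Han Wang\projects\data\MLPD\processed\panel")

def convert_panel(chunksize: int = 500_000) -> None:
    """Stream the bar-delimited panel file into numbered parquet parts plus a manifest."""
    OUT.mkdir(parents=True, exist_ok=True)
    manifest = []
    reader = pd.read_csv(RAW / "mlpdy25q2_txt.txt", sep="|",
                         chunksize=chunksize, low_memory=False)
    for i, chunk in enumerate(reader):
        part = OUT / f"mlpdy25q2_part_{i:03d}.parquet"
        chunk.to_parquet(part, index=False)
        manifest.append({"file": part.name, "rows": len(chunk)})
    pd.DataFrame(manifest).to_csv(OUT / "manifest.csv", index=False)
```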

Setup

# Clone the repo
git clone https://github.com/wanghan0987/MLPD.git
cd MLPD

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Requirements

Core dependencies (see requirements.txt):

| Package | Purpose |
|---|---|
| pandas, numpy | Data manipulation |
| scikit-learn | LASSO, Random Forest, cross-validation, metrics, decision_path |
| xgboost, lightgbm | Gradient boosting models |
| shap | SHAP-based feature attribution (TreeSHAP for local explanations) |
| statsmodels | Logistic regression, OLS, Tobit, diagnostic tests |
| lifelines | Kaplan-Meier, Cox PH, Extended Cox PH survival models |
| scikit-survival | Random Boosting Forest (RBF) survival ensemble |
| matplotlib, seaborn | Visualization |
| jupyter | Notebooks |

Portfolio Summary (as of 2025Q2)

From the MLPD Loss Summary:

| Metric | Value |
|---|---|
| Total unique loans (database) | ~61,200 |
| Loans defaulted with credit loss | 178 |
| Total defaulted UPB | $1,160 M |
| Total credit loss | $376.9 M |
| Average loss severity | ~32% |
| Post-2008 severity | 34.4% |
| Pre-2008 severity | 11.3% |

Notable risk concentrations in defaulted loans: GA (13%), TX (11%), FL (10%); high LTV buckets (75–82%) account for 45% of defaulted UPB; 7–10 year terms are disproportionately represented.


License & Disclaimer

The MLPD data is provided by Freddie Mac for informational purposes only. This project is for research and educational use. Model outputs should not be used for investment decisions. See mlpd_terms_conditions.pdf for Freddie Mac's terms of use.


References

  • Freddie Mac Multifamily Loan Performance Database: https://mf.freddiemac.com/investors/data
  • SR 11-7: Guidance on Model Risk Management (Federal Reserve / OCC)
  • Lundberg, S. & Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS.
  • Lundberg, S. et al. (2020). From Local Explanations to Global Understanding with Explainable AI for Trees. Nature Machine Intelligence. (TreeSHAP — basis for XGBoost local attribution and boosting-round decomposition)
  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B.
  • Geng, Y. et al. (2026). LLMs as Post-Hoc Explainability Tools in Credit Risk. arXiv:2602.18895. (Basis for constrained LLM Translator workflow; caution against autonomous LLM explainers)
  • Schutte, W. D. et al. (2026). Class Imbalance in Logistic Regression for Low-Default Portfolios. arXiv:2602.19663. (Gini robustness; intercept correction and cut-off audit at ~0.29% event rate)
  • Moremoholo, T. R. et al. (2026). PD Term Structure Under IFRS 9: Cox PH, Extended Cox PH, and RBF. International Journal of Financial Studies, 14, 62. (Survival model comparison on Freddie Mac data; marginal PD term structure construction)
  • Botha, M. et al. (2026). Term Structure of Loan Write-Off Risk Under IFRS 9. arXiv:2603.11897. (Two-stage LGD survival framework; DtH vs. CIST vs. GLM on mortgage data)
