An explainable machine learning pipeline for building regulatory-grade credit risk models — Probability of Default (PD) and Loss Given Default (LGD) — using the Freddie Mac Multifamily Loan Performance Database (MLPD).
This project develops credit risk models for Freddie Mac's multifamily loan portfolio using a two-stage approach:
- Explainable ML for feature selection — LASSO, Random Forest, and gradient boosting (XGBoost/LightGBM) with SHAP-based attribution identify the most predictive and economically interpretable features.
- Regulatory-grade econometric models — The selected features feed into logistic regression (PD) and OLS/Tobit regression (LGD), satisfying interpretability and auditability requirements for model risk management (e.g., SR 11-7).
The workflow is designed so that ML methods inform—but do not replace—econometric models, striking the balance between predictive power and regulatory acceptability.
Source: Freddie Mac Multifamily Loan Performance Database (MLPD), as of 2025Q2.
| File | Description | Size |
|---|---|---|
| `MLPD_datamart\mlpdy25q2_txt.txt` | Full panel database (bar-delimited), 1994–2025Q2 | ~211 MB |
| `MLPD_datamart\mlpd_y1994q1_y2008q4.csv.xlsx` | Snapshot data, 1994–2008 | ~19 MB |
| `MLPD_datamart\mlpd_y2009q1_y2025q2.csv.xlsx` | Snapshot data, 2009–2025Q2 | ~124 MB |
| `MLPD_data_dictionary.pdf` | Freddie Mac data dictionary | ~0.2 MB |
| `MLPD_Loss_Summary.pdf` | Freddie Mac loss summary | ~0.8 MB |
| `mlpd_terms_conditions.pdf` | Freddie Mac terms and conditions | ~0.07 MB |
The panel file contains one observation per loan per quarter (61,000+ unique loans). The snapshot files contain one observation per loan at its last reported quarter.
Data is not tracked in this repository. Store raw and processed data outside the repo under:
- Raw: `C:\Users\Han Wang\projects\data\MLPD\raw`
- Processed: `C:\Users\Han Wang\projects\data\MLPD\processed`
See Data Setup.
| Field | Name | Description |
|---|---|---|
| `lnno` | Loan ID | Unique loan identifier |
| `quarter` | Quarter | Observation period (e.g., y2020q1) |
| `mrtg_status` | Loan Status | 100=Current, 200=60+DPD, 250=Mod w/ loss, 300=Foreclosure, 450=REO, 500=Closed |
| `amt_upb_endg` | Ending Balance | UPB at end of quarter |
| `amt_upb_pch` | Original Balance | UPB at Freddie Mac purchase |
| `rate_dcr` | Original DCR | Debt Service Coverage Ratio at origination |
| `rate_ltv` | Original LTV | Loan-to-Value ratio at origination |
| `rate_int` | Note Rate | Current interest rate |
| `code_int` | Rate Type | FIX or VAR |
| `cnt_mrtg_term` | Mortgage Term | Loan term in months |
| `cnt_io_per` | IO Period | Interest-only period in months |
| `code_st` | State | Property state |
| `geographical_region` | Metro | Geographic region |
| `Credit_loss` | Credit Loss | Realized credit loss (defaulted loans only) |
| `Sales_Price` | Sales Price | REO disposition price |
PD (Probability of Default)
- Binary indicator: the loan enters `mrtg_status ∈ {200, 250, 300, 450}` (60+ days delinquent or worse) at any point during its life.
- Modeled at the loan level using origination and early-life characteristics, avoiding look-ahead bias.
LGD (Loss Given Default)
- Severity: `Credit_loss / amt_upb_pch` for disposed defaulted properties (see the construction sketch below).
- From the loss summary: 172 disposed properties, total credit loss $376.9M, average severity ~32% (34.4% post-2008, 11.3% pre-2008).
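A minimal sketch of how both targets could be constructed from the panel and snapshot files; `build_targets` and `DEFAULT_STATUSES` are hypothetical names, and the column names follow the data dictionary above.

```python
import pandas as pd

DEFAULT_STATUSES = {200, 250, 300, 450}   # 60+ DPD, mod w/ loss, foreclosure, REO

def build_targets(panel: pd.DataFrame, snapshot: pd.DataFrame) -> pd.DataFrame:
    """Loan-level default flag (PD target) and loss severity (LGD target)."""
    # PD: 1 if the loan ever enters a default status during its observed life
    status = pd.to_numeric(panel["mrtg_status"], errors="coerce")
    default_flag = (
        panel.assign(is_default=status.isin(DEFAULT_STATUSES))
             .groupby("lnno")["is_default"].max()
             .astype(int)
             .rename("default_flag")
    )
    # LGD severity: realized credit loss over original UPB, disposed defaulted loans only
    disposed = snapshot.loc[snapshot["Credit_loss"].notna()].set_index("lnno")
    lgd = (disposed["Credit_loss"] / disposed["amt_upb_pch"]).rename("lgd")
    return pd.concat([default_flag, lgd], axis=1)
```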
Raw Panel Data
│
▼
1. Data Processing & EDA
├── Parse bar-delimited panel file
├── Construct loan-level features from origination snapshot
├── Label defaults and compute LGD for disposed properties
└── Train/validation/test split (temporal: pre-2020 / 2020-2022 / 2023+)
│
▼
2. Feature Engineering
├── Origination characteristics (LTV, DCR, rate, term, IO period)
├── Property characteristics (state, metro, unit count)
├── Loan structure (fixed vs. variable, balloon term, amortization)
├── Interaction terms and transformations
└── Macro/vintage controls (origination year, rate environment)
│
▼
3. Explainable ML — Feature Selection
├── LASSO Logistic Regression (L1 regularization path)
├── Random Forest (Gini impurity importance + permutation importance)
│ └── [Local] Activated-path modal cluster per loan (explain_loan_rf)
├── XGBoost / LightGBM (gain importance + SHAP values)
│ └── [Local] Boosting-round decomposition per loan (explain_loan_xgb_sequential)
└── Consensus feature ranking across methods
│
▼
4. Econometric Models (Regulatory Use)
├── PD: Logistic Regression on SHAP-selected features
│ ├── Coefficient sign/magnitude validation
│ ├── Out-of-time validation (AUC, KS statistic, Gini coefficient)
│ ├── Calibration (Hosmer-Lemeshow test, reliability diagrams)
│ └── ⚠ Class imbalance checks (event rate ~0.29%; intercept correction, cut-off audit)
└── LGD: OLS / Tobit / Beta Regression on SHAP-selected features
├── Fractional response model (LGD ∈ [0,1])
├── Out-of-time validation (RMSE, MAE)
└── Residual analysis and heteroskedasticity checks
│
▼
5. Model Validation & Reporting
├── SHAP summary plots (beeswarm, bar, dependence)
├── Partial dependence plots (PDPs)
├── Segment analysis (by vintage, LTV bucket, state, rate type)
├── LLM-assisted narrative generation (constrained Translator; SHAP input required)
└── Regulatory documentation
│
▼
6. [Challenger] PD Term Structure via Survival Analysis
├── Discrete-time hazard (DtH) model — quarterly intervals
├── Cox Proportional Hazard + Extended Cox PH (time-varying coefficients)
├── Random Boosting Forest (RBF) survival ensemble
├── Kaplan-Meier non-parametric benchmark
└── PD term structure: PD_marginal(t) = Lifetime PD(t+1) − Lifetime PD(t)
The key modeling constraint is regulatory interpretability. ML models are used as feature selectors and economic validators, not as the final scoring model:
| ML Method | Role | Regulatory Benefit |
|---|---|---|
| LASSO | Automatic variable selection via L1 penalty; produces sparse, interpretable coefficient paths | Directly maps to logistic regression with fewer features |
| Random Forest | Non-parametric importance ranking; captures non-linearities | Validates LASSO selection; identifies interaction effects |
| XGBoost | Gradient boosting with SHAP decomposition | SHAP values provide additive feature attribution (local + global) |
| SHAP | Unified feature attribution across all models | Provides directional consistency checks against economic theory |
The final econometric models must satisfy:
- Monotonicity: Higher LTV → higher PD; lower DCR → higher PD
- Economic interpretability: All retained variables have theoretically grounded signs
- Stability: Coefficients stable across sub-samples and time periods
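A minimal sketch of a coefficient-sign check against these requirements, assuming a fitted statsmodels logit result for the final PD model; the expected-sign map is an illustrative subset.

```python
import numpy as np

# Expected signs from economic theory (illustrative subset; +1 means higher value -> higher PD)
EXPECTED_SIGNS = {"rate_ltv": +1, "rate_dcr": -1}

def check_coefficient_signs(logit_result, expected=EXPECTED_SIGNS):
    """Return retained variables whose estimated sign contradicts the expected direction."""
    violations = {}
    for feat, sign in expected.items():
        if feat in logit_result.params.index and np.sign(logit_result.params[feat]) != sign:
            violations[feat] = float(logit_result.params[feat])
    return violations

# Usage (assuming X_train holds the SHAP-selected features):
# import statsmodels.api as sm
# res = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=0)
# print(check_coefficient_signs(res))  # expect {} if all signs are theoretically grounded
```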
Global Gini/permutation importance summarizes variable relevance across the full training set but cannot explain why the model flagged any specific loan. For loan-level explanations — required when model risk reviewers or auditors query individual predictions — the activated-path + modal cluster method is implemented in src/models/selection.py as a complement to global importance.
For a target loan, the procedure is:
- Extract the activated path from every majority-vote tree via `decision_path(X)`.
- Represent each path as the set of (feature, split direction) tuples it traverses.
- Compute pairwise Jaccard similarity across trees: $J(k,k') = \frac{|\text{features}_k \cap \text{features}_{k'}|}{|\text{features}_k \cup \text{features}_{k'}|}$.
- Identify the modal cluster — the plurality of trees sharing the same core feature sequence.
- Report a frequency-weighted rule set, e.g.:
LTV > 0.78 (73% of trees) → DCR < 1.15 (68%) → 7yr term (61%) → P(Default) = 0.82.
This is more defensible than cherry-picking the single most confident tree, which may follow an idiosyncratic path due to bootstrap sampling and random feature subsetting. The modal cluster represents the consensus reasoning of the forest for that specific loan.
Implementation target: src/models/selection.py → explain_loan_rf(loan_id, model, X).
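A minimal sketch of what `explain_loan_rf` could look like, assuming a scikit-learn `RandomForestClassifier` and a feature DataFrame `X` indexed by loan ID; for brevity, the Jaccard/modal-cluster step is collapsed into a frequency-weighted split summary over the majority-vote trees.

```python
import numpy as np
from collections import defaultdict

def explain_loan_rf(loan_id, model, X, top_k=3):
    """Frequency-weighted rule set from the activated paths of majority-vote trees."""
    x = X.loc[[loan_id]]                 # single-row DataFrame for the target loan
    xv = x.to_numpy()[0]
    majority = model.predict(x)[0]       # forest-level prediction for this loan
    splits = defaultdict(list)           # (feature, direction) -> thresholds seen
    n_trees = 0
    for est in model.estimators_:
        if est.predict(x.to_numpy())[0] != majority:   # keep trees agreeing with the forest
            continue
        n_trees += 1
        for node in est.decision_path(x.to_numpy()).indices:   # nodes visited by this loan
            f = est.tree_.feature[node]
            if f < 0:                    # leaf nodes carry no split
                continue
            direction = "<=" if xv[f] <= est.tree_.threshold[node] else ">"
            splits[(X.columns[f], direction)].append(est.tree_.threshold[node])
    top = sorted(splits.items(), key=lambda kv: len(kv[1]), reverse=True)[:top_k]
    return [
        {"rule": f"{feat} {d} {np.median(thr):.2f}", "tree_frequency": len(thr) / n_trees}
        for (feat, d), thr in top
    ]
```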
In XGBoost, each tree is fit sequentially to the residuals of the ensemble built so far, so a loan-level explanation can be decomposed by boosting round: early rounds capture the primary signal, later rounds add refinements. Grouping rounds into stages and summarizing each stage's dominant splits for the target loan yields a hierarchical explanation.
A representative output:
Primary signal (trees 1– 50): LTV > 0.85 (weighted freq: 89%)
Secondary signal (trees 51–150): DCR < 1.10 (weighted freq: 64%)
Refinement (trees 151–300): Loan Age < 24mo (weighted freq: 41%)
This hierarchical summary maps naturally onto the economic narrative expected in SR 11-7 documentation and provides richer input to the LLM narration step than a flat SHAP bar chart. Note that full additive attribution across all trees converges to TreeSHAP (Lundberg et al., 2020), which remains the rigorous baseline; the boosting-round decomposition is a supplementary diagnostic, not a replacement.
Implementation target: src/models/selection.py → explain_loan_xgb_sequential(loan_id, model, X).
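A minimal sketch of what `explain_loan_xgb_sequential` could look like; the stage boundaries and signature are illustrative. Per-stage attribution is obtained by differencing `pred_contribs` (TreeSHAP) evaluated over cumulative `iteration_range`s, which works because tree contributions are additive.

```python
import numpy as np
import xgboost as xgb

def explain_loan_xgb_sequential(loan_id, model, X,
                                stages=((0, 50), (50, 150), (150, 300)), top_k=2):
    """Per-stage TreeSHAP attribution for one loan via iteration_range differencing."""
    booster = model.get_booster() if hasattr(model, "get_booster") else model
    dm = xgb.DMatrix(X.loc[[loan_id]])
    report = []
    for start, end in stages:
        # Contributions of the ensemble truncated after `end` rounds ...
        hi = booster.predict(dm, pred_contribs=True, iteration_range=(0, end))[0]
        # ... minus contributions after `start` rounds isolates rounds [start, end)
        lo = (booster.predict(dm, pred_contribs=True, iteration_range=(0, start))[0]
              if start > 0 else np.zeros_like(hi))
        stage_contrib = (hi - lo)[:-1]          # drop the bias term (last column)
        top = np.argsort(np.abs(stage_contrib))[::-1][:top_k]
        report.append({
            "rounds": (start, end),
            "top_features": [(X.columns[i], float(stage_contrib[i])) for i in top],
        })
    return report
```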
SHAP values and rule clusters are technically precise but require translation for non-quantitative stakeholders (model risk management, internal audit, credit officers). A constrained LLM Translator pipeline generates plain-language explanations for inclusion in reports/model_documentation/.
The critical operating constraint, supported by Geng et al. (2026), is that the LLM must receive structured attribution output as input and must not be asked to determine feature importance autonomously from raw loan characteristics. LLMs operating without structured attribution invoke domain priors that may be inconsistent with the model's empirical behavior, a failure mode that is particularly dangerous in low-default portfolios where feature importance signals are subtle.
Recommended workflow per loan explanation:
- Compute TreeSHAP (XGBoost) or modal path cluster (Random Forest) for the target observation.
- Format as a structured feature-attribution table: feature, direction, magnitude/frequency.
- Pass the table to the LLM with a prompt constraining narration strictly to the provided attribution (a minimal sketch follows this list).
- Human reviewer verifies the narrative against the attribution table before inclusion in documentation.
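As an illustration of step 3, a minimal sketch of a prompt builder that restricts narration to the supplied attribution table; the function name and row format are hypothetical, and no specific LLM API is assumed.

```python
def build_constrained_prompt(loan_id, attribution_rows):
    """Assemble an LLM prompt that restricts narration to the supplied attribution table.

    attribution_rows: iterable of dicts with keys 'feature', 'direction', 'value'
    (SHAP magnitude or tree frequency); this row format is an assumption.
    """
    table = "\n".join(
        f"| {r['feature']} | {r['direction']} | {r['value']:+.3f} |" for r in attribution_rows
    )
    return (
        f"You are documenting a credit risk model prediction for loan {loan_id}.\n"
        "Write a short plain-language explanation using ONLY the feature attributions below.\n"
        "Do not introduce features, magnitudes, or economic reasoning that are not in the table.\n"
        "If the table is insufficient to explain the prediction, state that explicitly.\n\n"
        "| Feature | Direction | Attribution |\n|---|---|---|\n" + table
    )
```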
The MLPD portfolio exhibits a severe class imbalance: 178 defaults among ~61,200 loans yields an empirical event rate of approximately 0.29% — well below the 1% threshold at which Schutte et al. (2026) document material calibration bias in logistic regression. Specific implications for this project:
- Primary performance metric: Gini coefficient (= 2·AUC − 1) is robust to class imbalance at large sample sizes and is the appropriate headline metric. Accuracy-based metrics are misleading at this event rate.
- Optimal cut-off: The standard 0.5 threshold is inappropriate. The optimal cut-off migrates downward as the event rate falls. Cut-off selection should be based on a cost-sensitive criterion (e.g., minimizing weighted misclassification cost) evaluated on out-of-time data.
- Calibration: Raw logistic regression intercepts are biased downward at very low event rates. After model estimation, verify that the mean predicted PD aligns with the empirical default rate. If not, apply an intercept correction: $\hat{\beta}_0^* = \hat{\beta}_0 + \log\left(\frac{\bar{y}}{1-\bar{y}}\right) - \log\left(\frac{\pi_s}{1-\pi_s}\right)$, where $\bar{y}$ is the sample event rate and $\pi_s$ is the population rate.
- Re-sampling: SMOTE or undersampling may be applied during the ML feature selection steps to improve the gradient signal, but the final logistic regression should be estimated on the original (un-resampled) sample, with the intercept correction applied post-estimation.
Implementation target: src/utils/metrics.py — add calibration_check(y_true, y_pred, target_rate) and optimal_cutoff(y_true, y_pred, cost_ratio).
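A minimal sketch of the two helpers named above, together with the intercept correction quoted in this section, using the signatures given as implementation targets; the default `cost_ratio` and the quantile grid are illustrative choices.

```python
import numpy as np

def calibration_check(y_true, y_pred, target_rate=None):
    """Compare the mean predicted PD with the empirical (or supplied target) default rate."""
    target = float(np.mean(y_true)) if target_rate is None else float(target_rate)
    mean_pd = float(np.mean(y_pred))
    return {"mean_predicted_pd": mean_pd, "target_rate": target, "ratio": mean_pd / target}

def intercept_correction(beta0_hat, sample_rate, population_rate):
    """Adjust the logistic intercept per the correction formula quoted in this section."""
    logit = lambda p: np.log(p / (1.0 - p))
    return beta0_hat + logit(sample_rate) - logit(population_rate)

def optimal_cutoff(y_true, y_pred, cost_ratio=10.0):
    """Threshold minimizing weighted misclassification cost on out-of-time data
    (cost_ratio = cost of a missed default relative to one false alarm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    grid = np.unique(np.quantile(y_pred, np.linspace(0.0, 1.0, 501)))
    costs = [
        cost_ratio * np.sum((y_pred < t) & (y_true == 1)) + np.sum((y_pred >= t) & (y_true == 0))
        for t in grid
    ]
    return float(grid[int(np.argmin(costs))])
```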
The binary PD formulation (did the loan ever default?) requires an additional mapping step to produce the lifetime PD term structure needed for horizon-dependent reporting (e.g., IFRS 9-style lifetime estimates). The survival models below estimate that term structure directly from time-to-default data.
Four models are implemented as challengers to the logistic regression baseline:
| Model | Assumption | Strength | Limitation |
|---|---|---|---|
| Kaplan-Meier | Non-parametric | No assumptions; benchmark | No covariate conditioning |
| Cox PH | Proportional hazards | Parsimonious; interpretable coefficients | Proportionality may fail over long horizons |
| Extended Cox PH | Time-varying coefficients | Relaxes proportionality | More parameters; less stable on sparse data |
| Random Boosting Forest (RBF) | Non-parametric ensemble | Captures non-linearities and interactions | Less interpretable; higher C-index but higher AIC |
The marginal PD at each horizon t is obtained as PD_marginal(t) = Lifetime PD(t+1) − Lifetime PD(t), where Lifetime PD(t) = 1 − S(t) is computed from the fitted survival function.
Based on Moremoholo et al. (2026) — who apply the same model comparison on Freddie Mac single-family data — Cox PH is expected to achieve the best AIC/BIC (parsimony) while Extended Cox PH and RBF achieve higher C-index. For regulatory documentation under SR 11-7, Cox PH is the preferred production model; Extended Cox PH and RBF serve as benchmark challengers in model validation.
Implementation targets:
- `notebooks/07_PD_survival.ipynb` — survival model estimation and comparison
- `src/models/pd_model.py` — `SurvivalPDModel` class wrapping `lifelines` / `scikit-survival`
- `src/utils/metrics.py` — `c_index()`, `brier_score()`, `marginal_pd_curve()`
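A minimal sketch of `marginal_pd_curve` using a fitted `lifelines` `CoxPHFitter`, following the PD_marginal(t) = Lifetime PD(t+1) − Lifetime PD(t) construction above; the duration/event column names in the usage comment are assumptions.

```python
import pandas as pd
from lifelines import CoxPHFitter

def marginal_pd_curve(cph: CoxPHFitter, covariates: pd.DataFrame, horizons) -> pd.DataFrame:
    """Marginal PD(t) = Lifetime PD(t+1) - Lifetime PD(t), with Lifetime PD(t) = 1 - S(t | x)."""
    surv = cph.predict_survival_function(covariates, times=horizons)  # index: horizons, columns: loans
    lifetime_pd = 1.0 - surv
    # Forward difference: the value at horizon t is Lifetime PD(t+1) - Lifetime PD(t)
    marginal_pd = (-lifetime_pd.diff(-1)).iloc[:-1]
    return marginal_pd

# Usage sketch (duration/event column names are assumptions):
# cph = CoxPHFitter().fit(loan_df, duration_col="quarters_on_book", event_col="default_flag")
# mpd = marginal_pd_curve(cph, loan_df[cph.params_.index], horizons=list(range(1, 41)))
```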
MLPD/
├── README.md
├── .gitignore
├── requirements.txt
│
├── notebooks/
│ ├── 01_EDA.ipynb # Exploratory data analysis
│ ├── 02_feature_engineering.ipynb
│ ├── 03_PD_feature_selection.ipynb # LASSO / RF / XGB + SHAP
│ ├── 04_PD_econometric.ipynb # Logistic regression (final PD model)
│ ├── 05_LGD_feature_selection.ipynb
│ ├── 06_LGD_econometric.ipynb # OLS / Tobit (final LGD model)
│ └── 07_PD_survival.ipynb # [Challenger] Survival analysis PD term structure
│
├── src/
│ ├── __init__.py
│ ├── data/
│ │ ├── loader.py # Parse bar-delimited panel file
│ │ └── preprocess.py # Feature construction, train/test split
│ ├── features/
│ │ └── engineer.py # Feature engineering functions
│ ├── models/
│ │ ├── pd_model.py # PD model classes (ML + econometric + SurvivalPDModel)
│ │ ├── lgd_model.py # LGD model classes
│ │ └── selection.py # SHAP-based feature selection + explain_loan_rf + explain_loan_xgb_sequential
│ └── utils/
│ ├── metrics.py # AUC, KS, Gini, RMSE, calibration_check, optimal_cutoff, c_index, brier_score
│ └── plots.py # SHAP plots, PDPs, calibration curves, survival curves
│
└── reports/
├── figures/ # Model output charts
└── model_documentation/ # Regulatory model documentation
- Download the MLPD from https://mf.freddiemac.com/investors/data.
- Place raw files and documentation under:
C:\Users\Han Wang\projects\data\MLPD\raw\
├── MLPD_datamart\
│ ├── mlpdy25q2_txt.txt
│ ├── mlpd_y1994q1_y2008q4.csv.xlsx
│ └── mlpd_y2009q1_y2025q2.csv.xlsx
├── MLPD_data_dictionary.pdf
├── MLPD_Loss_Summary.pdf
└── mlpd_terms_conditions.pdf
- Store derived outputs under:
C:\Users\Han Wang\projects\data\MLPD\processed\
- Keep code in the repository folder only:
C:\Users\Han Wang\projects\repos\MLPD
- If scripts expect direct file paths, point them to the nested datamart folder:
C:\Users\Han Wang\projects\data\MLPD\raw\MLPD_datamart
- Convert the raw files into easier-to-load parquet outputs with:
`python scripts/convert_raw_to_parquet.py`

This writes:
C:\Users\Han Wang\projects\data\MLPD\processed\panel\mlpdy25q2_part_###.parquet
C:\Users\Han Wang\projects\data\MLPD\processed\panel\manifest.csv
C:\Users\Han Wang\projects\data\MLPD\processed\snapshots\*.parquet
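A minimal sketch of what `scripts/convert_raw_to_parquet.py` might do, assuming the panel file is bar-delimited with a header row and that `pyarrow` (or `fastparquet`) is installed for parquet output:

```python
# Minimal sketch: chunked bar-delimited panel file -> parquet parts + manifest
from pathlib import Path
import pandas as pd

RAW = Path(r"C:\Users\Han Wang\projects\data\MLPD\raw\MLPD_datamart")
OUT = Path(r"C:\Users\Han Wang\projects\data\MLPD\processed\panel")
OUT.mkdir(parents=True, exist_ok=True)

manifest = []
reader = pd.read_csv(RAW / "mlpdy25q2_txt.txt", sep="|", dtype=str, chunksize=1_000_000)
for i, chunk in enumerate(reader):
    part = f"mlpdy25q2_part_{i:03d}.parquet"
    chunk.to_parquet(OUT / part, index=False)     # requires pyarrow or fastparquet
    manifest.append({"part": part, "rows": len(chunk)})
pd.DataFrame(manifest).to_csv(OUT / "manifest.csv", index=False)
```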
# Clone the repo
git clone https://github.com/hanisworking0987/MLPD.git
cd MLPD
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Core dependencies (see requirements.txt):
| Package | Purpose |
|---|---|
| `pandas`, `numpy` | Data manipulation |
| `scikit-learn` | LASSO, Random Forest, cross-validation, metrics, decision_path |
| `xgboost`, `lightgbm` | Gradient boosting models |
| `shap` | SHAP-based feature attribution (TreeSHAP for local explanations) |
| `statsmodels` | Logistic regression, OLS, Tobit, diagnostic tests |
| `lifelines` | Kaplan-Meier, Cox PH, Extended Cox PH survival models |
| `scikit-survival` | Random Boosting Forest (RBF) survival ensemble |
| `matplotlib`, `seaborn` | Visualization |
| `jupyter` | Notebooks |
From the MLPD Loss Summary:
| Metric | Value |
|---|---|
| Total unique loans (database) | ~61,200 |
| Loans defaulted with credit loss | 178 |
| Total defaulted UPB | $1,160 M |
| Total credit loss | $376.9 M |
| Average loss severity | ~32% |
| Post-2008 severity | 34.4% |
| Pre-2008 severity | 11.3% |
Notable risk concentrations in defaulted loans: GA (13%), TX (11%), FL (10%); high LTV buckets (75–82%) account for 45% of defaulted UPB; 7–10 year terms are disproportionately represented.
The MLPD data is provided by Freddie Mac for informational purposes only. This project is for research and educational use. Model outputs should not be used for investment decisions. See mlpd_terms_conditions.pdf for Freddie Mac's terms of use.
- Freddie Mac Multifamily Loan Performance Database: https://mf.freddiemac.com/investors/data
- SR 11-7: Guidance on Model Risk Management (Federal Reserve / OCC)
- Lundberg, S. & Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS.
- Lundberg, S. et al. (2020). From Local Explanations to Global Understanding with Explainable AI for Trees. Nature Machine Intelligence. (TreeSHAP — basis for XGBoost local attribution and boosting-round decomposition)
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B.
- Geng, Y. et al. (2026). LLMs as Post-Hoc Explainability Tools in Credit Risk. arXiv:2602.18895. (Basis for constrained LLM Translator workflow; caution against autonomous LLM explainers)
- Schutte, W. D. et al. (2026). Class Imbalance in Logistic Regression for Low-Default Portfolios. arXiv:2602.19663. (Gini robustness; intercept correction and cut-off audit at ~0.29% event rate)
- Moremoholo, T. R. et al. (2026). PD Term Structure Under IFRS 9: Cox PH, Extended Cox PH, and RBF. International Journal of Financial Studies, 14, 62. (Survival model comparison on Freddie Mac data; marginal PD term structure construction)
- Botha, M. et al. (2026). Term Structure of Loan Write-Off Risk Under IFRS 9. arXiv:2603.11897. (Two-stage LGD survival framework; DtH vs. CIST vs. GLM on mortgage data)