An explainable machine learning pipeline for building regulatory-grade credit risk models — Probability of Default (PD) and Loss Given Default (LGD) — using the Freddie Mac Multifamily Loan Performance Database (MLPD).
This project develops credit risk models for Freddie Mac's multifamily loan portfolio using a two-stage approach:
- Explainable ML for feature selection — LASSO, Random Forest, and gradient boosting (XGBoost/LightGBM) with SHAP-based attribution identify the most predictive and economically interpretable features.
- Regulatory-grade econometric models — The selected features feed into logistic regression (PD) and OLS/Tobit regression (LGD), satisfying interpretability and auditability requirements for model risk management (e.g., SR 11-7).
The workflow is designed so that ML methods inform—but do not replace—econometric models, striking the balance between predictive power and regulatory acceptability.
Source: Freddie Mac Multifamily Loan Performance Database (MLPD), as of 2025Q2.
| File | Description | Size |
|---|---|---|
| `MLPD_datamart\mlpdy25q2_txt.txt` | Full panel database (bar-delimited), 1994–2025Q2 | ~211 MB |
| `MLPD_datamart\mlpd_y1994q1_y2008q4.csv.xlsx` | Snapshot data, 1994–2008 | ~19 MB |
| `MLPD_datamart\mlpd_y2009q1_y2025q2.csv.xlsx` | Snapshot data, 2009–2025Q2 | ~124 MB |
| `MLPD_data_dictionary.pdf` | Freddie Mac data dictionary | ~0.2 MB |
| `MLPD_Loss_Summary.pdf` | Freddie Mac loss summary | ~0.8 MB |
| `mlpd_terms_conditions.pdf` | Freddie Mac terms and conditions | ~0.07 MB |
The panel file contains one observation per loan per quarter (61,000+ unique loans). The snapshot files contain one observation per loan at its last reported quarter.
Data is not tracked in this repository. Store raw and processed data outside the repo under:
- Raw: `C:\Users\Han Wang\projects\data\MLPD\raw`
- Processed: `C:\Users\Han Wang\projects\data\MLPD\processed`
See Data Setup.
| Field | Name | Description |
|---|---|---|
| `lnno` | Loan ID | Unique loan identifier |
| `quarter` | Quarter | Observation period (e.g., y2020q1) |
| `mrtg_status` | Loan Status | 100=Current, 200=60+DPD, 250=Mod w/ loss, 300=Foreclosure, 450=REO, 500=Closed |
| `amt_upb_endg` | Ending Balance | UPB at end of quarter |
| `amt_upb_pch` | Original Balance | UPB at Freddie Mac purchase |
| `rate_dcr` | Original DCR | Debt Service Coverage Ratio at origination |
| `rate_ltv` | Original LTV | Loan-to-Value ratio at origination |
| `rate_int` | Note Rate | Current interest rate |
| `code_int` | Rate Type | FIX or VAR |
| `cnt_mrtg_term` | Mortgage Term | Loan term in months |
| `cnt_io_per` | IO Period | Interest-only period in months |
| `code_st` | State | Property state |
| `geographical_region` | Metro | Geographic region |
| `Credit_loss` | Credit Loss | Realized credit loss (defaulted loans only) |
| `Sales_Price` | Sales Price | REO disposition price |
PD (Probability of Default)
- Binary indicator: the loan enters `mrtg_status ∈ {200, 250, 300, 450}` (60+ days delinquent or worse) at any point during its life.
- Modeled at the loan level using origination and early-life characteristics, avoiding look-ahead bias.
LGD (Loss Given Default)
- Severity: `Credit_loss / amt_upb_pch` for disposed defaulted properties (see the construction sketch below).
- From the loss summary: 172 disposed properties, total credit loss $376.9M, average severity ~32% (34.4% post-2008, 11.3% pre-2008).
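A minimal sketch of how both targets could be constructed from the panel and snapshot files; `build_targets` and `DEFAULT_STATUSES` are hypothetical names, and the column names follow the data dictionary above.

```python
import pandas as pd

DEFAULT_STATUSES = {200, 250, 300, 450}   # 60+ DPD, mod w/ loss, foreclosure, REO

def build_targets(panel: pd.DataFrame, snapshot: pd.DataFrame) -> pd.DataFrame:
    """Loan-level default flag (PD target) and loss severity (LGD target)."""
    # PD: 1 if the loan ever enters a default status during its observed life
    status = pd.to_numeric(panel["mrtg_status"], errors="coerce")
    default_flag = (
        panel.assign(is_default=status.isin(DEFAULT_STATUSES))
             .groupby("lnno")["is_default"].max()
             .astype(int)
             .rename("default_flag")
    )
    # LGD severity: realized credit loss over original UPB, disposed defaulted loans only
    disposed = snapshot.loc[snapshot["Credit_loss"].notna()].set_index("lnno")
    lgd = (disposed["Credit_loss"] / disposed["amt_upb_pch"]).rename("lgd")
    return pd.concat([default_flag, lgd], axis=1)
```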
Raw Panel Data
│
▼
1. Data Processing & EDA
├── Parse bar-delimited panel file
├── Construct loan-level features from origination snapshot
├── Label defaults and compute LGD for disposed properties
└── Train/validation/test split (temporal: pre-2020 / 2020-2022 / 2023+)
│
▼
2. Feature Engineering
├── Origination characteristics (LTV, DCR, rate, term, IO period)
├── Property characteristics (state, metro, unit count)
├── Loan structure (fixed vs. variable, balloon term, amortization)
├── Interaction terms and transformations
└── Macro/vintage controls (origination year, rate environment)
│
▼
3. Explainable ML — Feature Selection
├── LASSO Logistic Regression (L1 regularization path)
├── Random Forest (Gini impurity importance + permutation importance)
│ └── [Local] Activated-path modal cluster per loan (explain_loan_rf)
├── XGBoost / LightGBM (gain importance + SHAP values)
│ └── [Local] Boosting-round decomposition per loan (explain_loan_xgb_sequential)
└── Consensus feature ranking across methods
│
▼
4. Econometric Models (Regulatory Use)
├── PD: Logistic Regression on SHAP-selected features
│ ├── Coefficient sign/magnitude validation
│ ├── Out-of-time validation (AUC, KS statistic, Gini coefficient)
│ ├── Calibration (Hosmer-Lemeshow test, reliability diagrams)
│ └── ⚠ Class imbalance checks (event rate ~0.29%; intercept correction, cut-off audit)
└── LGD: OLS / Tobit / Beta Regression on SHAP-selected features
├── Fractional response model (LGD ∈ [0,1])
├── Out-of-time validation (RMSE, MAE)
└── Residual analysis and heteroskedasticity checks
│
▼
5. Model Validation & Reporting
├── SHAP summary plots (beeswarm, bar, dependence)
├── Partial dependence plots (PDPs)
├── Segment analysis (by vintage, LTV bucket, state, rate type)
├── LLM-assisted narrative generation (constrained Translator; SHAP input required)
└── Regulatory documentation
│
▼
6. [Challenger] PD Term Structure via Survival Analysis
├── Discrete-time hazard (DtH) model — quarterly intervals
├── Cox Proportional Hazard + Extended Cox PH (time-varying coefficients)
├── Random Boosting Forest (RBF) survival ensemble
├── Kaplan-Meier non-parametric benchmark
└── PD term structure: PD_marginal(t) = Lifetime PD(t+1) − Lifetime PD(t)
The key modeling constraint is regulatory interpretability. ML models are used as feature selectors and economic validators, not as the final scoring model:
| ML Method | Role | Regulatory Benefit |
|---|---|---|
| LASSO | Automatic variable selection via L1 penalty; produces sparse, interpretable coefficient paths | Directly maps to logistic regression with fewer features |
| Random Forest | Non-parametric importance ranking; captures non-linearities | Validates LASSO selection; identifies interaction effects |
| XGBoost | Gradient boosting with SHAP decomposition | SHAP values provide additive feature attribution (local + global) |
| SHAP | Unified feature attribution across all models | Provides directional consistency checks against economic theory |
The final econometric models must satisfy:
- Monotonicity: Higher LTV → higher PD; lower DCR → higher PD
- Economic interpretability: All retained variables have theoretically grounded signs
- Stability: Coefficients stable across sub-samples and time periods
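A minimal sketch of a coefficient-sign check against these requirements, assuming a fitted statsmodels logit result for the final PD model; the expected-sign map is an illustrative subset.

```python
import numpy as np

# Expected signs from economic theory (illustrative subset; +1 means higher value -> higher PD)
EXPECTED_SIGNS = {"rate_ltv": +1, "rate_dcr": -1}

def check_coefficient_signs(logit_result, expected=EXPECTED_SIGNS):
    """Return retained variables whose estimated sign contradicts the expected direction."""
    violations = {}
    for feat, sign in expected.items():
        if feat in logit_result.params.index and np.sign(logit_result.params[feat]) != sign:
            violations[feat] = float(logit_result.params[feat])
    return violations

# Usage (assuming X_train holds the SHAP-selected features):
# import statsmodels.api as sm
# res = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=0)
# print(check_coefficient_signs(res))  # expect {} if all signs are theoretically grounded
```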
Global Gini/permutation importance summarizes variable relevance across the full training set but cannot explain why the model flagged any specific loan. For loan-level explanations — required when model risk reviewers or auditors query individual predictions — the activated-path + modal cluster method is implemented in src/models/selection.py as a complement to global importance.
For a target loan, the procedure is:
- Extract the activated path from every majority-vote tree via `decision_path(X)`.
- Represent each path as the set of (feature, split direction) tuples it traverses.
- Compute pairwise Jaccard similarity across trees: $J(k,k') = \frac{|\text{features}_k \cap \text{features}_{k'}|}{|\text{features}_k \cup \text{features}_{k'}|}$.
- Identify the modal cluster — the plurality of trees sharing the same core feature sequence.
- Report a frequency-weighted rule set, e.g.:
LTV > 0.78 (73% of trees) → DCR < 1.15 (68%) → 7yr term (61%) → P(Default) = 0.82.
This is more defensible than cherry-picking the single most confident tree, which may follow an idiosyncratic path due to bootstrap sampling and random feature subsetting. The modal cluster represents the consensus reasoning of the forest for that specific loan.
Implementation target: src/models/selection.py → explain_loan_rf(loan_id, model, X).
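A minimal sketch of what `explain_loan_rf` could look like, assuming a scikit-learn `RandomForestClassifier` and a feature DataFrame `X` indexed by loan ID; for brevity, the Jaccard/modal-cluster step is collapsed into a frequency-weighted split summary over the majority-vote trees.

```python
import numpy as np
from collections import defaultdict

def explain_loan_rf(loan_id, model, X, top_k=3):
    """Frequency-weighted rule set from the activated paths of majority-vote trees."""
    x = X.loc[[loan_id]]                 # single-row DataFrame for the target loan
    xv = x.to_numpy()[0]
    majority = model.predict(x)[0]       # forest-level prediction for this loan
    splits = defaultdict(list)           # (feature, direction) -> thresholds seen
    n_trees = 0
    for est in model.estimators_:
        if est.predict(x.to_numpy())[0] != majority:   # keep trees agreeing with the forest
            continue
        n_trees += 1
        for node in est.decision_path(x.to_numpy()).indices:   # nodes visited by this loan
            f = est.tree_.feature[node]
            if f < 0:                    # leaf nodes carry no split
                continue
            direction = "<=" if xv[f] <= est.tree_.threshold[node] else ">"
            splits[(X.columns[f], direction)].append(est.tree_.threshold[node])
    top = sorted(splits.items(), key=lambda kv: len(kv[1]), reverse=True)[:top_k]
    return [
        {"rule": f"{feat} {d} {np.median(thr):.2f}", "tree_frequency": len(thr) / n_trees}
        for (feat, d), thr in top
    ]
```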
In XGBoost, each tree is fit sequentially to the residuals of the ensemble built so far, so a loan-level explanation can be decomposed by boosting round: early rounds capture the primary signal, later rounds add refinements. Grouping rounds into stages and summarizing each stage's dominant splits for the target loan yields a hierarchical explanation.
A representative output:
Primary signal (trees 1– 50): LTV > 0.85 (weighted freq: 89%)
Secondary signal (trees 51–150): DCR < 1.10 (weighted freq: 64%)
Refinement (trees 151–300): Loan Age < 24mo (weighted freq: 41%)
This hierarchical summary maps naturally onto the economic narrative expected in SR 11-7 documentation and provides richer input to the LLM narration step than a flat SHAP bar chart. Note that full additive attribution across all trees converges to TreeSHAP (Lundberg et al., 2020), which remains the rigorous baseline; the boosting-round decomposition is a supplementary diagnostic, not a replacement.
Implementation target: src/models/selection.py → explain_loan_xgb_sequential(loan_id, model, X).
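A minimal sketch of what `explain_loan_xgb_sequential` could look like; the stage boundaries and signature are illustrative. Per-stage attribution is obtained by differencing `pred_contribs` (TreeSHAP) evaluated over cumulative `iteration_range`s, which works because tree contributions are additive.

```python
import numpy as np
import xgboost as xgb

def explain_loan_xgb_sequential(loan_id, model, X,
                                stages=((0, 50), (50, 150), (150, 300)), top_k=2):
    """Per-stage TreeSHAP attribution for one loan via iteration_range differencing."""
    booster = model.get_booster() if hasattr(model, "get_booster") else model
    dm = xgb.DMatrix(X.loc[[loan_id]])
    report = []
    for start, end in stages:
        # Contributions of the ensemble truncated after `end` rounds ...
        hi = booster.predict(dm, pred_contribs=True, iteration_range=(0, end))[0]
        # ... minus contributions after `start` rounds isolates rounds [start, end)
        lo = (booster.predict(dm, pred_contribs=True, iteration_range=(0, start))[0]
              if start > 0 else np.zeros_like(hi))
        stage_contrib = (hi - lo)[:-1]          # drop the bias term (last column)
        top = np.argsort(np.abs(stage_contrib))[::-1][:top_k]
        report.append({
            "rounds": (start, end),
            "top_features": [(X.columns[i], float(stage_contrib[i])) for i in top],
        })
    return report
```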
SHAP values and rule clusters are technically precise but require translation for non-quantitative stakeholders (model risk management, internal audit, credit officers). A constrained LLM Translator pipeline generates plain-language explanations for inclusion in reports/model_documentation/.
The critical operating constraint, supported by Geng et al. (2026), is that the LLM must receive structured attribution output as input and must not be asked to determine feature importance autonomously from raw loan characteristics. LLMs operating without structured attribution invoke domain priors that may be inconsistent with the model's empirical behavior, a failure mode that is particularly dangerous in low-default portfolios where feature importance signals are subtle.
Recommended workflow per loan explanation:
- Compute TreeSHAP (XGBoost) or modal path cluster (Random Forest) for the target observation.
- Format as a structured feature-attribution table: feature, direction, magnitude/frequency.
- Pass the table to the LLM with a prompt constraining narration strictly to the provided attribution (a minimal sketch follows this list).
- Human reviewer verifies the narrative against the attribution table before inclusion in documentation.
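As an illustration of step 3, a minimal sketch of a prompt builder that restricts narration to the supplied attribution table; the function name and row format are hypothetical, and no specific LLM API is assumed.

```python
def build_constrained_prompt(loan_id, attribution_rows):
    """Assemble an LLM prompt that restricts narration to the supplied attribution table.

    attribution_rows: iterable of dicts with keys 'feature', 'direction', 'value'
    (SHAP magnitude or tree frequency); this row format is an assumption.
    """
    table = "\n".join(
        f"| {r['feature']} | {r['direction']} | {r['value']:+.3f} |" for r in attribution_rows
    )
    return (
        f"You are documenting a credit risk model prediction for loan {loan_id}.\n"
        "Write a short plain-language explanation using ONLY the feature attributions below.\n"
        "Do not introduce features, magnitudes, or economic reasoning that are not in the table.\n"
        "If the table is insufficient to explain the prediction, state that explicitly.\n\n"
        "| Feature | Direction | Attribution |\n|---|---|---|\n" + table
    )
```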
The MLPD portfolio exhibits a severe class imbalance: 178 defaults among ~61,200 loans yields an empirical event rate of approximately 0.29% — well below the 1% threshold at which Schutte et al. (2026) document material calibration bias in logistic regression. Specific implications for this project:
- Primary performance metric: Gini coefficient (= 2·AUC − 1) is robust to class imbalance at large sample sizes and is the appropriate headline metric. Accuracy-based metrics are misleading at this event rate.
- Optimal cut-off: The standard 0.5 threshold is inappropriate. The optimal cut-off migrates downward as the event rate falls. Cut-off selection should be based on a cost-sensitive criterion (e.g., minimizing weighted misclassification cost) evaluated on out-of-time data.
- Calibration: Raw logistic regression intercepts are biased downward at very low event rates. After model estimation, verify that the mean predicted PD aligns with the empirical default rate. If not, apply an intercept correction: $\hat{\beta}_0^* = \hat{\beta}_0 + \log\left(\frac{\bar{y}}{1-\bar{y}}\right) - \log\left(\frac{\pi_s}{1-\pi_s}\right)$, where $\bar{y}$ is the sample event rate and $\pi_s$ is the population rate.
- Re-sampling: SMOTE or undersampling may be applied during the ML feature selection steps to improve the gradient signal, but the final logistic regression should be estimated on the original (un-resampled) sample, with the intercept correction applied post-estimation.
Implementation target: src/utils/metrics.py — add calibration_check(y_true, y_pred, target_rate) and optimal_cutoff(y_true, y_pred, cost_ratio).
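A minimal sketch of the two helpers named above, together with the intercept correction quoted in this section, using the signatures given as implementation targets; the default `cost_ratio` and the quantile grid are illustrative choices.

```python
import numpy as np

def calibration_check(y_true, y_pred, target_rate=None):
    """Compare the mean predicted PD with the empirical (or supplied target) default rate."""
    target = float(np.mean(y_true)) if target_rate is None else float(target_rate)
    mean_pd = float(np.mean(y_pred))
    return {"mean_predicted_pd": mean_pd, "target_rate": target, "ratio": mean_pd / target}

def intercept_correction(beta0_hat, sample_rate, population_rate):
    """Adjust the logistic intercept per the correction formula quoted in this section."""
    logit = lambda p: np.log(p / (1.0 - p))
    return beta0_hat + logit(sample_rate) - logit(population_rate)

def optimal_cutoff(y_true, y_pred, cost_ratio=10.0):
    """Threshold minimizing weighted misclassification cost on out-of-time data
    (cost_ratio = cost of a missed default relative to one false alarm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    grid = np.unique(np.quantile(y_pred, np.linspace(0.0, 1.0, 501)))
    costs = [
        cost_ratio * np.sum((y_pred < t) & (y_true == 1)) + np.sum((y_pred >= t) & (y_true == 0))
        for t in grid
    ]
    return float(grid[int(np.argmin(costs))])
```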
The binary PD formulation (did the loan ever default?) requires an additional mapping step to produce the lifetime PD term structure needed for horizon-dependent reporting (e.g., IFRS 9-style lifetime estimates). The survival models below estimate that term structure directly from time-to-default data.
Four models are implemented as challengers to the logistic regression baseline:
| Model | Assumption | Strength | Limitation |
|---|---|---|---|
| Kaplan-Meier | Non-parametric | No assumptions; benchmark | No covariate conditioning |
| Cox PH | Proportional hazards | Parsimonious; interpretable coefficients | Proportionality may fail over long horizons |
| Extended Cox PH | Time-varying coefficients | Relaxes proportionality | More parameters; less stable on sparse data |
| Random Boosting Forest (RBF) | Non-parametric ensemble | Captures non-linearities and interactions | Less interpretable; higher C-index but higher AIC |
The marginal PD at each horizon t is obtained as PD_marginal(t) = Lifetime PD(t+1) − Lifetime PD(t), where Lifetime PD(t) = 1 − S(t) is computed from the fitted survival function.
Based on Moremoholo et al. (2026) — who apply the same model comparison on Freddie Mac single-family data — Cox PH is expected to achieve the best AIC/BIC (parsimony) while Extended Cox PH and RBF achieve higher C-index. For regulatory documentation under SR 11-7, Cox PH is the preferred production model; Extended Cox PH and RBF serve as benchmark challengers in model validation.
Implementation targets:
- `notebooks/07_PD_survival.ipynb` — survival model estimation and comparison
- `src/models/pd_model.py` — `SurvivalPDModel` class wrapping `lifelines` / `scikit-survival`
- `src/utils/metrics.py` — `c_index()`, `brier_score()`, `marginal_pd_curve()`
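A minimal sketch of `marginal_pd_curve` using a fitted `lifelines` `CoxPHFitter`, following the PD_marginal(t) = Lifetime PD(t+1) − Lifetime PD(t) construction above; the duration/event column names in the usage comment are assumptions.

```python
import pandas as pd
from lifelines import CoxPHFitter

def marginal_pd_curve(cph: CoxPHFitter, covariates: pd.DataFrame, horizons) -> pd.DataFrame:
    """Marginal PD(t) = Lifetime PD(t+1) - Lifetime PD(t), with Lifetime PD(t) = 1 - S(t | x)."""
    surv = cph.predict_survival_function(covariates, times=horizons)  # index: horizons, columns: loans
    lifetime_pd = 1.0 - surv
    # Forward difference: the value at horizon t is Lifetime PD(t+1) - Lifetime PD(t)
    marginal_pd = (-lifetime_pd.diff(-1)).iloc[:-1]
    return marginal_pd

# Usage sketch (duration/event column names are assumptions):
# cph = CoxPHFitter().fit(loan_df, duration_col="quarters_on_book", event_col="default_flag")
# mpd = marginal_pd_curve(cph, loan_df[cph.params_.index], horizons=list(range(1, 41)))
```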
MLPD/
├── README.md
├── .gitignore
├── requirements.txt
│
├── notebooks/
│ ├── 01_EDA.ipynb # Exploratory data analysis
│ ├── 02_feature_engineering.ipynb
│ ├── 03_PD_feature_selection.ipynb # LASSO / RF / XGB + SHAP
│ ├── 04_PD_econometric.ipynb # Logistic regression (final PD model)
│ ├── 05_LGD_feature_selection.ipynb
│ ├── 06_LGD_econometric.ipynb # OLS / Tobit (final LGD model)
│ └── 07_PD_survival.ipynb # [Challenger] Survival analysis PD term structure
│
├── src/
│ ├── __init__.py
│ ├── data/
│ │ ├── loader.py # Parse bar-delimited panel file
│ │ └── preprocess.py # Feature construction, train/test split
│ ├── features/
│ │ └── engineer.py # Feature engineering functions
│ ├── models/
│ │ ├── pd_model.py # PD model classes (ML + econometric + SurvivalPDModel)
│ │ ├── lgd_model.py # LGD model classes
│ │ └── selection.py # SHAP-based feature selection + explain_loan_rf + explain_loan_xgb_sequential
│ └── utils/
│ ├── metrics.py # AUC, KS, Gini, RMSE, calibration_check, optimal_cutoff, c_index, brier_score
│ └── plots.py # SHAP plots, PDPs, calibration curves, survival curves
│
└── reports/
├── figures/ # Model output charts
└── model_documentation/ # Regulatory model documentation
- Download the MLPD from https://mf.freddiemac.com/investors/data.
- Place raw files and documentation under:
C:\Users\Han Wang\projects\data\MLPD\raw\
├── MLPD_datamart\
│ ├── mlpdy25q2_txt.txt
│ ├── mlpd_y1994q1_y2008q4.csv.xlsx
│ └── mlpd_y2009q1_y2025q2.csv.xlsx
├── MLPD_data_dictionary.pdf
├── MLPD_Loss_Summary.pdf
└── mlpd_terms_conditions.pdf
- Store derived outputs under:
C:\Users\Han Wang\projects\data\MLPD\processed\
- Keep code in the repository folder only:
C:\Users\Han Wang\projects\repos\MLPD
- If scripts expect direct file paths, point them to the nested datamart folder:
C:\Users\Han Wang\projects\data\MLPD\raw\MLPD_datamart
- Convert the raw files into easier-to-load parquet outputs with:
`python scripts/convert_raw_to_parquet.py`

This writes:
C:\Users\Han Wang\projects\data\MLPD\processed\panel\mlpdy25q2_part_###.parquet
C:\Users\Han Wang\projects\data\MLPD\processed\panel\manifest.csv
C:\Users\Han Wang\projects\data\MLPD\processed\snapshots\*.parquet
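A minimal sketch of what `scripts/convert_raw_to_parquet.py` might do, assuming the panel file is bar-delimited with a header row and that `pyarrow` (or `fastparquet`) is installed for parquet output:

```python
# Minimal sketch: chunked bar-delimited panel file -> parquet parts + manifest
from pathlib import Path
import pandas as pd

RAW = Path(r"C:\Users\Han Wang\projects\data\MLPD\raw\MLPD_datamart")
OUT = Path(r"C:\Users\Han Wang\projects\data\MLPD\processed\panel")
OUT.mkdir(parents=True, exist_ok=True)

manifest = []
reader = pd.read_csv(RAW / "mlpdy25q2_txt.txt", sep="|", dtype=str, chunksize=1_000_000)
for i, chunk in enumerate(reader):
    part = f"mlpdy25q2_part_{i:03d}.parquet"
    chunk.to_parquet(OUT / part, index=False)     # requires pyarrow or fastparquet
    manifest.append({"part": part, "rows": len(chunk)})
pd.DataFrame(manifest).to_csv(OUT / "manifest.csv", index=False)
```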
# Clone the repo
git clone https://github.com/hanisworking0987/MLPD.git
cd MLPD
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Core dependencies (see requirements.txt):
| Package | Purpose |
|---|---|
| `pandas`, `numpy` | Data manipulation |
| `scikit-learn` | LASSO, Random Forest, cross-validation, metrics, decision_path |
| `xgboost`, `lightgbm` | Gradient boosting models |
| `shap` | SHAP-based feature attribution (TreeSHAP for local explanations) |
| `statsmodels` | Logistic regression, OLS, Tobit, diagnostic tests |
| `lifelines` | Kaplan-Meier, Cox PH, Extended Cox PH survival models |
| `scikit-survival` | Random Boosting Forest (RBF) survival ensemble |
| `matplotlib`, `seaborn` | Visualization |
| `jupyter` | Notebooks |
From the MLPD Loss Summary:
| Metric | Value |
|---|---|
| Total unique loans (database) | ~61,200 |
| Loans defaulted with credit loss | 178 |
| Total defaulted UPB | $1,160 M |
| Total credit loss | $376.9 M |
| Average loss severity | ~32% |
| Post-2008 severity | 34.4% |
| Pre-2008 severity | 11.3% |
Notable risk concentrations in defaulted loans: GA (13%), TX (11%), FL (10%); high LTV buckets (75–82%) account for 45% of defaulted UPB; 7–10 year terms are disproportionately represented.
The MLPD data is provided by Freddie Mac for informational purposes only. This project is for research and educational use. Model outputs should not be used for investment decisions. See mlpd_terms_conditions.pdf for Freddie Mac's terms of use.
- Freddie Mac Multifamily Loan Performance Database: https://mf.freddiemac.com/investors/data
- SR 11-7: Guidance on Model Risk Management (Federal Reserve / OCC)
- Lundberg, S. & Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS.
- Lundberg, S. et al. (2020). From Local Explanations to Global Understanding with Explainable AI for Trees. Nature Machine Intelligence. (TreeSHAP — basis for XGBoost local attribution and boosting-round decomposition)
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B.
- Geng, Y. et al. (2026). LLMs as Post-Hoc Explainability Tools in Credit Risk. arXiv:2602.18895. (Basis for constrained LLM Translator workflow; caution against autonomous LLM explainers)
- Schutte, W. D. et al. (2026). Class Imbalance in Logistic Regression for Low-Default Portfolios. arXiv:2602.19663. (Gini robustness; intercept correction and cut-off audit at ~0.29% event rate)
- Moremoholo, T. R. et al. (2026). PD Term Structure Under IFRS 9: Cox PH, Extended Cox PH, and RBF. International Journal of Financial Studies, 14, 62. (Survival model comparison on Freddie Mac data; marginal PD term structure construction)
- Botha, M. et al. (2026). Term Structure of Loan Write-Off Risk Under IFRS 9. arXiv:2603.11897. (Two-stage LGD survival framework; DtH vs. CIST vs. GLM on mortgage data)