# Build Models 

**Goal:**  
Build predictive models for the three credit-risk parameters below, using the synthetic dataset generated in GeneratePortfolioData.ipynb:
- PD (Probability of Default)
- LGD (Loss Given Default)
- EAD (Exposure at Default)
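
PD, LGD, and EAD are commonly combined into an expected loss per loan, EL = PD × LGD × EAD. The sketch below applies that standard formula to the five sample rows that appear in the dataset preview later in this notebook. The `EL` column name is an assumption introduced for illustration; it is not defined anywhere in the original notebook.

```python
import pandas as pd

# Five sample rows copied from the dataset preview (risk columns only)
sample = pd.DataFrame({
    "BorrowerID": [1, 2, 3, 4, 5],
    "PD": [0.0599, 0.0442, 0.0248, 0.1105, 0.1209],
    "LGD": [0.3606, 0.3472, 0.1261, 0.1895, 0.1278],
    "EAD": [121789.2, 150060.0, 233692.63, 72685.55, 110827.12],
})

# Expected loss per loan: EL = PD * LGD * EAD ("EL" is a hypothetical column name)
sample["EL"] = sample["PD"] * sample["LGD"] * sample["EAD"]

print(sample[["BorrowerID", "EL"]].round(2))
print(f"Total expected loss for the sample: {sample['EL'].sum():.2f}")
```

The same one-line multiplication would work on model predictions instead of the raw columns, once the three models below have been fitted.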


In [1]:
import pandas as pd
from pathlib import Path

# Path to the dataset
project_path = Path("/home/skumar/Desktop/credit-risk-analytics")
csv_file = project_path / "data/input_raw/credit_portfolio.csv"

# Load dataset
data = pd.read_csv(csv_file)
print("✅ Data loaded successfully")
data.head()


✅ Data loaded successfully


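| Unnamed: 0 | BorrowerID | Age | Income | CreditScore | LoanAmount | InterestRate | TermMonths | PD | LGD | EAD |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 59 | 24324 | 766 | 136717 | 0.0837 | 24 | 0.0599 | 0.3606 | 121789.2 |
| 1 | 2 | 49 | 96323 | 713 | 167002 | 0.1396 | 12 | 0.0442 | 0.3472 | 150060.0 |
| 2 | 3 | 35 | 29111 | 754 | 266691 | 0.0972 | 48 | 0.0248 | 0.1261 | 233692.63 |
| 3 | 4 | 63 | 58110 | 532 | 86835 | 0.1512 | 12 | 0.1105 | 0.1895 | 72685.55 |
| 4 | 5 | 28 | 36389 | 523 | 127232 | 0.0639 | 60 | 0.1209 | 0.1278 | 110827.12 |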
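
The `Unnamed: 0` column in the preview suggests the CSV was written with its row index included. The generating notebook is not shown here, so that is an assumption. If it holds, passing `index_col=0` to `pd.read_csv` turns that column into the index rather than a data column. A minimal sketch, using an in-memory copy of the first two preview rows rather than the real file:

```python
import io
import pandas as pd

# In-memory stand-in for credit_portfolio.csv (first two preview rows only).
# The empty first header field is what pandas displays as "Unnamed: 0".
csv_text = """,BorrowerID,Age,Income,CreditScore,LoanAmount,InterestRate,TermMonths,PD,LGD,EAD
0,1,59,24324,766,136717,0.0837,24,0.0599,0.3606,121789.2
1,2,49,96323,713,167002,0.1396,12,0.0442,0.3472,150060.0
"""

# index_col=0 uses the first CSV column as the DataFrame index
preview = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(preview.columns.tolist())
```

The models in this notebook select their feature columns by name, so the extra column does not affect them; the change only tidies the loaded frame.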


## PD Model

- Target: PD
- Features: Age, Income, CreditScore, LoanAmount, InterestRate, TermMonths
- Model: Logistic Regression on a binarized target (1 if PD > 0.1, else 0)


In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Create binary PD target (example: PD > 0.1 -> 1, else 0)
data["PD_binary"] = (data["PD"] > 0.1).astype(int)

# Features
X = data[["Age", "Income", "CreditScore", "LoanAmount", "InterestRate", "TermMonths"]]
y = data["PD_binary"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
pd_model = LogisticRegression(max_iter=1000)
pd_model.fit(X_train, y_train)

# Evaluate
y_pred = pd_model.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, y_pred)
print(f"✅ PD Model AUC: {auc:.3f}")


✅ PD Model AUC: 0.480
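
An AUC of 0.480 is slightly below 0.5, the value a random ranking would achieve, so this model does not separate the two classes. The notebook does not establish why. Two possible causes are features on very different scales (for example `Income` versus `InterestRate`) and a synthetic PD that was generated independently of the features. The sketch below shows one common adjustment: standardizing features inside a pipeline and stratifying the split. It runs on hypothetical stand-in data from `make_toy_portfolio` (a helper invented here, with an assumed link between credit score and PD), so its AUC says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_toy_portfolio(n=2000, seed=0):
    """Hypothetical stand-in data; NOT the notebook's credit_portfolio.csv."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Income": rng.uniform(20_000, 100_000, n),
        "CreditScore": rng.integers(500, 800, n),
        "InterestRate": rng.uniform(0.05, 0.16, n),
    })
    # Assumed relationship: lower credit scores imply higher PD, plus noise
    logit = -2.0 - 0.01 * (df["CreditScore"] - 650) + rng.normal(0, 0.5, n)
    df["PD"] = 1 / (1 + np.exp(-logit))
    return df


toy = make_toy_portfolio()
X_toy = toy[["Income", "CreditScore", "InterestRate"]]
y_toy = (toy["PD"] > 0.1).astype(int)  # same threshold as the notebook

# stratify keeps the class balance identical in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The scaler is fitted on the training data only, then applied to the test data
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

auc_toy = roc_auc_score(y_te, scaled_model.predict_proba(X_te)[:, 1])
print(f"Toy-data AUC with scaling: {auc_toy:.3f}")
```

If the same pipeline still scores near 0.5 on the real data, the more likely explanation is that the features carry no signal about PD in the synthetic generator.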


## LGD Model

- Target: LGD
- Features: same as PD
- Model: Linear Regression


In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Target (features X are reused from the PD cell)
y_lgd = data["LGD"]

# Train/test split
X_train_lgd, X_test_lgd, y_train_lgd, y_test_lgd = train_test_split(X, y_lgd, test_size=0.2, random_state=42)

# Fit model
lgd_model = LinearRegression()
lgd_model.fit(X_train_lgd, y_train_lgd)

# Evaluate
y_pred_lgd = lgd_model.predict(X_test_lgd)
mse = mean_squared_error(y_test_lgd, y_pred_lgd)
print(f"✅ LGD Model MSE: {mse:.4f}")


✅ LGD Model MSE: 0.0255
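
An MSE of 0.0255 corresponds to an RMSE of about 0.16 on a target that lies between 0 and 1, which is hard to judge in isolation. One way to put it in context is to compare against a baseline that always predicts the training mean, and to clip predictions to [0, 1], since linear regression can produce values outside that range. The sketch below illustrates both steps on hypothetical toy data; the arrays and the assumed linear relationship are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical toy data: LGD depends linearly on the first feature, plus noise
rng = np.random.default_rng(0)
n = 1000
X_lgd_toy = rng.uniform(0, 1, size=(n, 3))
lgd_toy = np.clip(0.1 + 0.5 * X_lgd_toy[:, 0] + rng.normal(0, 0.1, n), 0, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_lgd_toy, lgd_toy, test_size=0.2, random_state=42
)

# Baseline: always predict the mean LGD seen in training
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

# Clip predictions so they stay valid loss fractions
pred_clipped = np.clip(model.predict(X_te), 0.0, 1.0)

mse_baseline = mean_squared_error(y_te, baseline.predict(X_te))
mse_model = mean_squared_error(y_te, pred_clipped)
print(f"Baseline MSE: {mse_baseline:.4f}")
print(f"Model MSE:    {mse_model:.4f}")
```

A model MSE close to the baseline MSE would indicate that the features explain little of the variation in LGD.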


## EAD Model

- Target: EAD
- Features: same as PD/LGD
- Model: Linear Regression


In [4]:
# Target (features X are reused from the PD cell)
y_ead = data["EAD"]

# Train/test split
X_train_ead, X_test_ead, y_train_ead, y_test_ead = train_test_split(X, y_ead, test_size=0.2, random_state=42)

# Fit model
ead_model = LinearRegression()
ead_model.fit(X_train_ead, y_train_ead)

# Evaluate
y_pred_ead = ead_model.predict(X_test_ead)
mse_ead = mean_squared_error(y_test_ead, y_pred_ead)
print(f"✅ EAD Model MSE: {mse_ead:.2f}")


✅ EAD Model MSE: 281692031.21
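
The MSE above is in squared currency units, which makes it hard to read. Taking the square root gives an RMSE of roughly 16,800 in the same units as EAD. The sketch below does that conversion and, purely for scale, compares the result with the mean EAD of the five preview rows. Those five rows are not a representative sample, so the ratio is only a rough orientation; the variable names are introduced here for illustration.

```python
import math

# Value printed by the EAD cell above
reported_mse_ead = 281692031.21

# RMSE is in the same units as EAD itself
rmse_ead = math.sqrt(reported_mse_ead)
print(f"EAD RMSE: {rmse_ead:,.2f}")

# EAD values of the five preview rows, used only as a rough scale reference
preview_ead = [121789.2, 150060.0, 233692.63, 72685.55, 110827.12]
mean_preview_ead = sum(preview_ead) / len(preview_ead)
ratio = rmse_ead / mean_preview_ead
print(f"RMSE relative to mean preview EAD: {ratio:.1%}")
```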


## Summary

- Built **PD, LGD, and EAD models** on the synthetic dataset
- PD: Logistic Regression on a binarized target (PD > 0.1), test AUC 0.480, which is below the 0.5 level of a random ranking
- LGD: Linear Regression, test MSE 0.0255
- EAD: Linear Regression, test MSE 281692031.21
- Dataset ready for Monte Carlo simulation
