# Generate Synthetic Credit Portfolio Data

**Goal:**  
Create a mock credit portfolio dataset with borrowers, loans, and risk metrics (PD, LGD, EAD).  
This dataset will be used in later steps for exploratory analysis, modeling, and Monte Carlo simulations.

**Dataset Includes:**
- Borrower info: BorrowerID, Age, Income, CreditScore
- Loan info: LoanAmount, InterestRate, TermMonths
- Risk metrics: PD (Probability of Default), LGD (Loss Given Default), EAD (Exposure at Default)

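The dataset is saved to `data/input_raw/credit_portfolio.csv`, relative to the project root. The notebook itself writes to `../data/input_raw`, so it is assumed to run from a subfolder of the project.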


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from pathlib import Path

# Set random seed for reproducibility
np.random.seed(42)

# Base path for saving files
base_path = Path("../data/input_raw")
base_path.mkdir(parents=True, exist_ok=True)  # ensure folder exists

print("✅ Libraries imported and folders ready")


✅ Libraries imported and folders ready


## Define Dataset Parameters

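We will simulate `n_borrowers = 1000` records. The integer columns are drawn with `np.random.randint`, which excludes its upper bound, so the largest possible value is one below the limit passed in the code:
- Age: 21–69
- Income: $20,000–$149,999
- CreditScore: 300–849
- LoanAmount: $5,000–$499,999
- InterestRate: 3%–20% (uniform, rounded to four decimals)
- TermMonths: one of 12, 24, 36, 48, 60

Risk metrics (synthetic):
- PD: Beta distribution with `a=2, b=20` (values in 0–1)
- LGD: Beta distribution with `a=2, b=5` (values in 0–1)
- EAD: LoanAmount × random factor (0.8–1.0)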
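
One detail worth checking when turning these ranges into code: `np.random.randint(low, high)` samples from the half-open interval `[low, high)`, so `high` itself is never returned. The sketch below illustrates this on the age range. It uses its own `RandomState` (an illustrative choice, not part of the notebook) so the notebook's global seed is left untouched.

```python
import numpy as np

# Separate random state so the notebook's global seed is not disturbed
state = np.random.RandomState(0)

# randint samples from [low, high): the upper bound 70 is excluded
draws = state.randint(21, 70, size=100_000)
print("exclusive upper bound:", draws.min(), draws.max())

# To make 70 a possible value, pass high + 1
inclusive_draws = state.randint(21, 71, size=100_000)
print("inclusive upper bound:", inclusive_draws.min(), inclusive_draws.max())
```

If the inclusive ranges are the intended ones, each `high` argument would need to be increased by one. Doing so changes the generated values for a given seed.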


In [2]:
# Number of borrowers
n_borrowers = 1000

# Generate synthetic borrower and loan data
data = pd.DataFrame({
    "BorrowerID": range(1, n_borrowers + 1),
    "Age": np.random.randint(21, 70, size=n_borrowers),
    "Income": np.random.randint(20000, 150000, size=n_borrowers),
    "CreditScore": np.random.randint(300, 850, size=n_borrowers),
    "LoanAmount": np.random.randint(5000, 500000, size=n_borrowers),
    "InterestRate": np.round(np.random.uniform(0.03, 0.2, size=n_borrowers), 4),
    "TermMonths": np.random.choice([12, 24, 36, 48, 60], size=n_borrowers)
})

print("✅ Borrower and loan data generated")
data.head()


✅ Borrower and loan data generated


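| | BorrowerID | Age | Income | CreditScore | LoanAmount | InterestRate | TermMonths |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 59 | 24324 | 766 | 136717 | 0.0837 | 24 |
| 1 | 2 | 49 | 96323 | 713 | 167002 | 0.1396 | 12 |
| 2 | 3 | 35 | 29111 | 754 | 266691 | 0.0972 | 48 |
| 3 | 4 | 63 | 58110 | 532 | 86835 | 0.1512 | 12 |
| 4 | 5 | 28 | 36389 | 523 | 127232 | 0.0639 | 60 |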


## Generate Synthetic Risk Metrics

We create:
- PD (Probability of Default) → Beta distribution
- LGD (Loss Given Default) → Beta distribution
- EAD (Exposure at Default) → LoanAmount × random factor

<details>
<summary>Why Beta distribution?</summary>
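The Beta distribution is defined on the interval [0, 1], which makes it a natural fit for probabilities and loss rates. Its two shape parameters (`a` and `b`) control where the mass sits and how skewed it is, so the distribution can be shaped to mimic real-world default and loss rates.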
</details>
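
To get a feel for what the shape parameters imply, recall that a Beta(a, b) distribution has mean a / (a + b). The sketch below compares that theoretical mean with a sample mean for the two parameter pairs used in this notebook. The helper `beta_mean` and the separate `default_rng` generator are illustrative additions, not part of the notebook.

```python
import numpy as np


def beta_mean(a: float, b: float) -> float:
    """Theoretical mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Independent generator, so the notebook's np.random.seed state is unaffected
rng = np.random.default_rng(0)

# Shape parameters as used for PD and LGD in this notebook
shapes = {"PD": (2, 20), "LGD": (2, 5)}

sample_means = {}
for name, (a, b) in shapes.items():
    sample = rng.beta(a, b, size=200_000)
    sample_means[name] = sample.mean()
    print(f"{name}: theoretical mean = {beta_mean(a, b):.4f}, "
          f"sample mean = {sample_means[name]:.4f}")
```

With these parameters the average PD is roughly 9% and the average LGD roughly 29%. Both are modelling choices of this synthetic dataset rather than calibrated values.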


In [3]:
# PD: Probability of Default
data["PD"] = np.round(np.random.beta(a=2, b=20, size=n_borrowers), 4)

# LGD: Loss Given Default
data["LGD"] = np.round(np.random.beta(a=2, b=5, size=n_borrowers), 4)

# EAD: Exposure at Default
data["EAD"] = np.round(data["LoanAmount"] * np.random.uniform(0.8, 1.0, size=n_borrowers), 2)

print("✅ Risk metrics (PD, LGD, EAD) generated")
data.head()


✅ Risk metrics (PD, LGD, EAD) generated


| | BorrowerID | Age | Income | CreditScore | LoanAmount | InterestRate | TermMonths | PD | LGD | EAD |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 59 | 24324 | 766 | 136717 | 0.0837 | 24 | 0.0599 | 0.3606 | 121789.2 |
| 1 | 2 | 49 | 96323 | 713 | 167002 | 0.1396 | 12 | 0.0442 | 0.3472 | 150060.0 |
| 2 | 3 | 35 | 29111 | 754 | 266691 | 0.0972 | 48 | 0.0248 | 0.1261 | 233692.63 |
| 3 | 4 | 63 | 58110 | 532 | 86835 | 0.1512 | 12 | 0.1105 | 0.1895 | 72685.55 |
| 4 | 5 | 28 | 36389 | 523 | 127232 | 0.0639 | 60 | 0.1209 | 0.1278 | 110827.12 |
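
The three risk metrics are commonly combined into an expected loss figure, EL = PD × LGD × EAD. The notebook does not compute this column, so the following is only an illustrative sketch: it re-enters the first two preview rows by hand, and the `ExpectedLoss` column name is an assumption.

```python
import pandas as pd

# First two rows of the preview, re-entered by hand for illustration
sample = pd.DataFrame({
    "BorrowerID": [1, 2],
    "PD": [0.0599, 0.0442],
    "LGD": [0.3606, 0.3472],
    "EAD": [121789.20, 150060.00],
})

# Expected loss per loan: probability of default x loss severity x exposure
sample["ExpectedLoss"] = (sample["PD"] * sample["LGD"] * sample["EAD"]).round(2)

# Portfolio-level expected loss is the sum over loans
portfolio_el = sample["ExpectedLoss"].sum()

print(sample)
print(f"Portfolio expected loss: {portfolio_el:,.2f}")
```

Because PD and LGD both lie in [0, 1], each loan's expected loss can never exceed its EAD, which makes for a quick sanity check on the generated data.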


## Save the Synthetic Dataset

The dataset is written to `data/input_raw/credit_portfolio.csv`.



In [6]:
# Save dataset to CSV
output_file = base_path / "credit_portfolio.csv"
data.to_csv(output_file, index=False)

print(f"✅ Dataset saved to {output_file}")


✅ Dataset saved to ../data/input_raw/credit_portfolio.csv
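
Before the file is handed to later steps, it can help to reload it and confirm that the columns are present and within sensible bounds. The sketch below is a minimal example under stated assumptions: `validate_portfolio` is a hypothetical helper (not part of the notebook), and a temporary file with two hand-entered preview rows stands in for the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = [
    "BorrowerID", "Age", "Income", "CreditScore", "LoanAmount",
    "InterestRate", "TermMonths", "PD", "LGD", "EAD",
]


def validate_portfolio(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if df["BorrowerID"].duplicated().any():
        problems.append("duplicate BorrowerID values")
    for col in ("PD", "LGD"):
        if not df[col].between(0, 1).all():
            problems.append(f"{col} outside [0, 1]")
    # EAD is LoanAmount times a factor in [0.8, 1.0], so it should not exceed it
    if (df["EAD"] > df["LoanAmount"]).any():
        problems.append("EAD exceeds LoanAmount")
    return problems


# Two preview rows, re-entered by hand for illustration
rows = pd.DataFrame({
    "BorrowerID": [1, 2],
    "Age": [59, 49],
    "Income": [24324, 96323],
    "CreditScore": [766, 713],
    "LoanAmount": [136717, 167002],
    "InterestRate": [0.0837, 0.1396],
    "TermMonths": [24, 12],
    "PD": [0.0599, 0.0442],
    "LGD": [0.3606, 0.3472],
    "EAD": [121789.20, 150060.00],
})

# Round-trip through a temporary CSV, mirroring the save-then-load workflow
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "credit_portfolio.csv"
    rows.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(validate_portfolio(reloaded))
```

In the notebook, the same function could be applied to `pd.read_csv(output_file)` right after saving.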


### Summary

- Synthetic portfolio with 1000 borrowers created
- Includes borrower info, loan info, and risk metrics
- Saved in `data/input_raw/credit_portfolio.csv`
- Ready for next step: Exploratory Data Analysis (EDA)


In [8]:
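from pathlib import Path

# Alternative: save via an absolute path to the project folder.
# NOTE: this path is specific to one machine; adjust it before running elsewhere.
project_path = Path("/home/skumar/Desktop/credit-risk-analytics")
output_file = project_path / "data/input_raw/credit_portfolio.csv"
output_file.parent.mkdir(parents=True, exist_ok=True)  # ensure folder exists

data.to_csv(output_file, index=False)
print("✅ Dataset saved to:", output_file)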


✅ Dataset saved to: /home/skumar/Desktop/credit-risk-analytics/data/input_raw/credit_portfolio.csv
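
The hard-coded home directory above only resolves on one machine. A more portable sketch follows, assuming a hypothetical environment variable `CREDIT_RISK_PROJECT_ROOT` (not defined anywhere in this notebook) and falling back to the parent of the current working directory, which matches the `../data/input_raw` layout used earlier. The helper name `resolve_output_file` is likewise illustrative.

```python
import os
from pathlib import Path


def resolve_output_file(filename: str = "credit_portfolio.csv") -> Path:
    """Build the output path from an env var, else from the working directory."""
    # CREDIT_RISK_PROJECT_ROOT is a hypothetical variable name used for illustration
    root = Path(os.environ.get("CREDIT_RISK_PROJECT_ROOT", Path.cwd().parent))
    return root / "data" / "input_raw" / filename


output_file = resolve_output_file()
print(output_file)
```

The function only builds the path and does not write anything. The existing `mkdir` and `to_csv` calls would then use `output_file` as before.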
