<div style="position:relative; width:100%; height:200px;">
  <img src="https://raw.githubusercontent.com/stefanlessmann/VHB_ProDoc_ML/master/banner-nb.png" style="width:100%; object-fit:cover;" alt="ProDok-MachineLearning-Banner">
  <div style="
      position:absolute;
      left:4%;
      top:50%;
      transform:translateY(-50%);
      font-size:3.2vw;
      font-weight:750;
      color:#1f2a44;">
    ProDok – Machine Learning
  </div>
</div>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/stefanlessmann/VHB_ProDoc_ML/blob/master/P.I.1.data_exploration.ipynb)


# P.I.1. Data Exploration and Preparation Using Python
The practice sessions complement the lectures and provide hands-on experience with the concepts covered in the course. This session focuses on exploratory data analysis (EDA), data preparation, and the subtle but crucial differences between explanatory and predictive modeling.

The available time does not allow us to practice Python programming. Instead, we will use generative AI (i.e., an LLM) to generate the relevant code and focus on designing effective prompts and discussing the data science outputs that the generated Python code produces.

<p class="alert alert-warning"><strong>Disclaimer:</strong><br> It is crucial to inspect generated code carefully to ensure that it is correct and suitable for the task at hand. Never use generated code in research or in practice without ample testing and verification. We will devote as much attention to this pivotal aspect as the available time permits.</p>

In [None]:
# Preliminaries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Credit Risk Analytics Data 
We will work with a synthetic dataset that simulates credit risk analytics data. Specifically, the dataset represents credit applicants at the time of a loan application. It provides information about the applicants, their financial situation, loan characteristics, and so on, as detailed in the [data dictionary](#Data-Dictionary) below.

The dataset includes two target variables to facilitate classification and regression modeling. A binary target indicates whether the applicant defaulted on the loan. In case of default, a numerical variable `LGD` gives the share of the outstanding amount that was lost (i.e., loss given default).

All data are synthetic and created for educational purposes.


In [None]:
data_url = "https://raw.githubusercontent.com/stefanlessmann/VHB_ProDoc_ML/master/credit_data_100k.csv"

# Run this line to load the data from GitHub
#df = pd.read_csv(data_url)

# If you stored the data locally, use this line instead (faster!)
df = pd.read_csv("credit_data_100k.csv")
df.info()

## Data Dictionary
---

### Outcome Variables

| Variable | Type | Meaning |
|----------|------|--------|
| `default_12m` | binary (0/1) | Indicates whether the loan defaulted within 12 months after origination (1 = default, 0 = no default). |
| `LGD` | numeric | Loss Given Default (fraction of exposure lost if a default occurs). Defined only for loans that defaulted; otherwise missing. |

---

### Applicant Information

| Variable | Type | Meaning |
|----------|------|--------|
| `age` | integer | Applicant age in years at application time. |
| `income` | numeric (€) | Applicant annual income. |
| `employment_type` | categorical | Employment category (`salaried`, `self_employed`, `public_sector`, `student`, `retired`). |
| `housing_status` | categorical | Housing situation (`rent`, `own`, `mortgage`, `family`). |
| `years_at_job` | numeric | Years in current job. |
| `years_at_address` | numeric | Years at current residence. |
| `region` | categorical | Geographic region (`north`, `south`, `east`, `west`, `metro`). |

---

### Relationship with the Bank

| Variable | Type | Meaning |
|----------|------|--------|
| `has_bank_account` | binary (0/1) | Indicates whether the applicant already has an account with the bank (1 = yes, 0 = no). |

---

### Loan Characteristics

| Variable | Type | Meaning |
|----------|------|--------|
| `loan_purpose` | categorical | Stated purpose of the loan (`debt_consolidation`, `car`, `home_improvement`, `education`, `other`). |
| `loan_amount` | numeric (€) | Requested loan amount. |
| `secured` | binary (0/1) | Indicates whether the loan is secured by collateral (1 = secured, 0 = unsecured). |
| `dti` | numeric | Debt-to-income ratio (total monthly debt payments divided by monthly income). |

---

### Credit-History-Related Indicators

| Variable | Type | Meaning |
|----------|------|--------|
| `num_inquiries_6m` | integer | Number of credit inquiries in the past 6 months. |
| `num_delinquencies_24m` | integer | Number of delinquencies recorded in the past 24 months. |
| `num_collections` | integer | Number of accounts currently in collection. |
| `card_utilization` | numeric | Utilization ratio on the bank’s credit card (balance divided by credit limit). May be missing if not available. |

---

### Notes

- Missing values are represented as `NaN`.
- All variables are synthetic and do not correspond to real individuals or institutions.

# Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) aims to gain first-level insight into a dataset. It is the first step in empirical research and, more generally, whenever you begin to work with a new dataset.
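
To make "first-level insight" concrete, here is a minimal sketch of typical first EDA steps in `pandas`. It runs on a small simulated table, not on the course dataset: the frame `toy` and all its values are assumptions made up for illustration, and only the column names are borrowed from the data dictionary.

```python
import numpy as np
import pandas as pd

# Tiny simulated stand-in for the credit data (all values are made up)
rng = np.random.default_rng(0)
n = 1_000
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.lognormal(mean=10.5, sigma=0.5, size=n),
    "employment_type": rng.choice(["salaried", "self_employed", "retired"], size=n),
    "default_12m": rng.binomial(1, 0.1, size=n),
})
# LGD is only observed for defaulted loans, otherwise missing
toy["LGD"] = np.where(toy["default_12m"] == 1, rng.beta(2, 5, size=n), np.nan)

# 1) Share of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# 2) Summary statistics of the numeric columns
print(toy.describe().T)

# 3) Class balance of the binary target
default_rate = toy["default_12m"].mean()
print(f"Default rate: {default_rate:.3f}")

# 4) Default rate per category of a categorical variable
print(toy.groupby("employment_type")["default_12m"].mean())
```

Replacing `toy` with the loaded `df` applies the same checks to the actual data. Which checks matter most depends on the research question.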

The goal of this exercise is to use Python for EDA. We will *LLM-generate* the corresponding code. Below, we provide a possible prompt template, which we will discuss and complete in class.

## EDA Prompt Engineering

## LLM-Generated Result


In [None]:
# copy your generated code here


# Data Preparation (DPP)
The goal of data preparation (DPP) is to address potential data quality issues (e.g., those identified during EDA) and to facilitate statistical or machine learning modeling. For example, many statistical methods, linear regression being a popular example, cannot handle missing values. Thus, addressing missing values is a standard DPP task.
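
As an illustration of handling missing values, the following sketch shows one common option: median imputation combined with a missingness indicator, followed by one-hot encoding so that all columns are numeric. The frame `toy` and its values are assumptions made up for this example; only the column names come from the data dictionary. This is one possible approach, not a recommendation for the course dataset.

```python
import numpy as np
import pandas as pd

# Small made-up example with a numeric column containing missing values
toy = pd.DataFrame({
    "card_utilization": [0.20, np.nan, 0.55, 0.90, np.nan, 0.35],
    "housing_status": ["rent", "own", "mortgage", "rent", "family", "own"],
})

# Flag missingness before imputing so that this information is not lost
toy["card_utilization_missing"] = toy["card_utilization"].isna().astype(int)

# Median imputation (pandas computes the median from observed values only)
median_util = toy["card_utilization"].median()
toy["card_utilization"] = toy["card_utilization"].fillna(median_util)

# One-hot encode the categorical column; drop_first avoids perfect
# collinearity with the intercept in a regression model
toy_prepared = pd.get_dummies(
    toy, columns=["housing_status"], drop_first=True, dtype=float
)
print(toy_prepared)
```

In a predictive setting, the imputation value should be computed on the training data only and then reused for new data. Otherwise, information from the test set leaks into the model.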

We proceed as before: we provide a prompt template for generating DPP code, which we will discuss and complete in class.
 
>Remark: We assume that you execute the DPP prompt in the same chat session as the previous EDA prompt. Doing so ensures that the LLM retains information about the dataset and the EDA results in its context.

## DPP Prompt Engineering

## LLM-Generated Result

In [None]:
# copy your generated DPP code here

# Linear Regression Model
Linear regression is perhaps the most widely used statistical method in empirical research. A key learning goal of this exercise is to stress the subtle differences between an *explanatory* and a *predictive* model. Linear regression supports both explanatory and predictive modeling, although its use for explanatory modeling is traditionally more common.

Below, we provide a working example of running linear regression in Python using the `statsmodels` library. Unlike the famous `sklearn` library, the *go-to library* for machine learning in Python, the `statsmodels` library focuses on explanatory modeling.

The code assumes that a prepared, *ready-to-use* version of the dataset is stored in a variable `df_model`. As the dependent variable, we use `lgd`, which is defined only for defaulted loans. Therefore, the code includes a filtering step to select only the relevant subset of the data for modeling.

In [None]:
import statsmodels.api as sm

# ---------------------------
# 1) DEFINE TARGET (y)
# ---------------------------
# LGD is defined only for defaulted loans → restrict to observed LGD
if "lgd" not in df_model.columns:
    raise ValueError("LGD not found in df_model.")

df_lgd = df_model[df_model["lgd"].notna()].copy()
print("Rows with observed LGD:", df_lgd.shape[0])

y = df_lgd["lgd"].astype(np.float64)  # ensure numeric target

# ---------------------------
# 2) DEFINE PREDICTORS (X)
# ---------------------------
# Exclude:
# - lgd (target)
# - default_12m (post-event indicator; constant among defaulted loans,
#   so not a valid predictor for LGD)
exclude_cols = ["lgd", "default_12m"]
X_cols = [c for c in df_lgd.columns if c not in exclude_cols]
X = df_lgd[X_cols].copy()
print("Number of predictors before checks:", X.shape[1])

# Drop predictors that are neither numeric nor boolean (e.g., unencoded
# categoricals). This must happen before the float cast below, which would
# otherwise raise an error on string columns.
non_numeric = X.select_dtypes(exclude=[np.number, "bool"]).columns.tolist()
if non_numeric:
    print("Dropping non-numeric predictors:", non_numeric)
    X = X.drop(columns=non_numeric)

# statsmodels needs a plain float design matrix (bool dummies become 0.0/1.0)
X = X.astype(np.float64)

# ---------------------------
# 3) ADD INTERCEPT
# ---------------------------
X = sm.add_constant(X, has_constant="add")

print("Design matrix shape (with intercept):", X.shape)

# ---------------------------
# 4) FIT LINEAR REGRESSION
# ---------------------------
model = sm.OLS(y, X)
results = model.fit()

# ---------------------------
# 5) OUTPUT RESULTS
# ---------------------------
print("\n=== OLS SUMMARY ===")
print(results.summary())



## Residual Analysis

### Residuals vs. Fitted Values
Plotting residuals against fitted values is a common diagnostic tool to assess the assumptions of linear regression, such as homoscedasticity (constant variance of residuals) and linearity. The plot should ideally show a random scatter of points around the horizontal line at zero, without any discernible pattern. A funnel shape (i.e., increasing or decreasing spread of residuals) may indicate heteroscedasticity, while a systematic pattern (e.g., a curve) may suggest nonlinearity.

It is common practice to standardize residuals before plotting. This helps to identify outliers and to better assess the distribution of residuals. 
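
There is more than one way to standardize residuals. A simple z-score divides each residual by the overall residual standard deviation. *Internally studentized* residuals additionally account for the leverage $h_{ii}$ of each observation: $r_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}})$. The sketch below computes both with plain `numpy` on simulated data; the variables `x`, `y_toy`, and `X_toy` are assumptions made up for illustration. For a fitted `statsmodels` OLS model, `results.get_influence().resid_studentized_internal` provides the leverage-adjusted version directly.

```python
import numpy as np

# Simulated data for a simple linear model (made up for illustration)
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y_toy = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

# Design matrix with intercept and OLS fit via least squares
X_toy = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
resid = y_toy - X_toy @ beta

# Leverage h_ii: diagonal of the hat matrix H = X (X'X)^{-1} X'
XtX_inv = np.linalg.inv(X_toy.T @ X_toy)
leverage = np.einsum("ij,jk,ik->i", X_toy, XtX_inv, X_toy)

# Residual standard error with n - p degrees of freedom
p = X_toy.shape[1]
sigma_hat = np.sqrt(resid @ resid / (n - p))

# Internally studentized residuals (leverage-adjusted)
resid_student = resid / (sigma_hat * np.sqrt(1.0 - leverage))

# Simple z-score standardization for comparison
resid_z = (resid - resid.mean()) / resid.std()

print("Share with |r| > 2 (studentized):", np.mean(np.abs(resid_student) > 2))
print("Share with |r| > 2 (z-score):    ", np.mean(np.abs(resid_z) > 2))
```

With many observations and few predictors, leverages are small and the two versions are nearly identical. They differ more noticeably for observations with unusual predictor values.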

Other popular visualizations for residual analysis include the *Q-Q Plot*: A quantile-quantile plot compares the distribution of residuals to a theoretical distribution (e.g., normal distribution) to assess normality. If the residuals are normally distributed, the points in the Q-Q plot will approximately lie on a straight line. Deviations from this line may indicate non-normality, which can affect inference in linear regression.
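
To show what a Q-Q plot displays, the following sketch computes the plotted points by hand: sorted, standardized sample values are paired with the corresponding quantiles of the standard normal distribution. The helper `qq_points`, the plotting positions $(i - 0.5)/n$, and the simulated samples are assumptions chosen for illustration; plotting libraries may use slightly different conventions.

```python
from statistics import NormalDist

import numpy as np


def qq_points(values):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    v = np.sort(np.asarray(values, dtype=float))
    v = (v - v.mean()) / v.std(ddof=1)        # standardize the sample
    n = len(v)
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting positions in (0, 1)
    std_normal = NormalDist()
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical, v


rng = np.random.default_rng(1)
normal_sample = rng.normal(size=2_000)
skewed_sample = rng.exponential(size=2_000)   # right-skewed, clearly non-normal

theo_n, obs_n = qq_points(normal_sample)
theo_s, obs_s = qq_points(skewed_sample)

# The closer the points are to a straight line, the closer this correlation is to 1
r_normal = np.corrcoef(theo_n, obs_n)[0, 1]
r_skewed = np.corrcoef(theo_s, obs_s)[0, 1]
print(f"Normal sample: correlation of Q-Q points = {r_normal:.4f}")
print(f"Skewed sample: correlation of Q-Q points = {r_skewed:.4f}")
```

Plotting `obs_n` against `theo_n` gives points close to the 45-degree line, whereas the skewed sample bends away from it in the tails.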

Below we provide code to create these visualizations using the `seaborn` library for the residuals vs. fitted values plot and the `statsmodels` library for the Q-Q plot. 

We will briefly discuss the results in class. In general, the plots confirm what the $R^2$ value already suggested: The linear regression model does not fit the data well and is an inappropriate choice for this dataset. This was to be expected in an *LGD model* 😉.


In [None]:
# Residuals and fitted values
yhat = results.fittedvalues
e = results.resid

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1.1 Residuals vs fitted (seaborn preferred over matplotlib)
sns.scatterplot(x=yhat, y=e, ax=axes[0, 0], alpha=0.4)
axes[0, 0].axhline(0)
axes[0, 0].set_title("Residuals vs Fitted")
axes[0, 0].set_xlabel("Fitted values")
axes[0, 0].set_ylabel("Residuals")

# 1.2 Standardized residuals vs fitted
# Simple z-score standardization of e (ignores leverage; statsmodels offers
# results.get_influence().resid_studentized_internal as an alternative)
e_std = (e - np.mean(e)) / np.std(e)

sns.scatterplot(x=yhat, y=e_std, ax=axes[0, 1], alpha=0.4)
axes[0, 1].axhline(0)
axes[0, 1].axhline(2, linestyle="--")
axes[0, 1].axhline(-2, linestyle="--")
axes[0, 1].set_title("Standardized Residuals vs Fitted")
axes[0, 1].set_xlabel("Fitted values")
axes[0, 1].set_ylabel("Standardized residuals")

# 2.1 Histogram (seaborn)
sns.histplot(e, bins=40, ax=axes[1, 0])
axes[1, 0].set_title("Histogram of residuals")
axes[1, 0].set_xlabel("Residual")
axes[1, 0].set_ylabel("Count")

# 2.2 Q-Q plot (statsmodels preferred)
# fit=True standardizes the residuals so that the 45-degree line is a valid reference
sm.qqplot(e, line="45", fit=True, ax=axes[1, 1])
axes[1, 1].set_title("Q-Q plot of residuals")

plt.tight_layout()
plt.show()

# Conclusions and Outlook
The regression statistics suggest that the independent variables explain the variation in LGD only to a very limited degree (e.g., $R^2 < 0.1$). Therefore, we would not expect a *predictive* model that uses the same variables to forecast unknown LGDs accurately. You may wonder whether this is a *fault* of linear regression and whether advanced ML models would perform better. This point warrants a discussion in class.
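
The link between a low in-sample $R^2$ and limited predictive accuracy can be checked empirically by fitting the model on one part of the data and evaluating it on a held-out part. The sketch below does this with plain `numpy` on simulated data with a deliberately weak signal. All names (`X_sim`, `y_sim`, `r_squared`, `add_intercept`) and the 70/30 split are assumptions for illustration, and the numbers say nothing about the course dataset.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 relative to predicting the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def add_intercept(X):
    """Prepend a column of ones to the predictor matrix."""
    return np.column_stack([np.ones(len(X)), X])


# Simulated data: ten predictors, only the first one matters, noise dominates
rng = np.random.default_rng(7)
n, k = 2_000, 10
X_sim = rng.normal(size=(n, k))
y_sim = 0.4 + 0.05 * X_sim[:, 0] + rng.normal(scale=0.2, size=n)

# Random 70/30 split into training and test observations
idx = rng.permutation(n)
train, test = idx[:1400], idx[1400:]

# Fit OLS on the training data only
beta, *_ = np.linalg.lstsq(add_intercept(X_sim[train]), y_sim[train], rcond=None)

# Compare in-sample fit with out-of-sample predictive accuracy
r2_train = r_squared(y_sim[train], add_intercept(X_sim[train]) @ beta)
r2_test = r_squared(y_sim[test], add_intercept(X_sim[test]) @ beta)
print(f"In-sample R^2:     {r2_train:.3f}")
print(f"Out-of-sample R^2: {r2_test:.3f}")
```

The in-sample value describes how well the model explains the data it was fitted on, whereas the out-of-sample value is what matters for forecasting. Unlike the in-sample $R^2$ of OLS with an intercept, the out-of-sample $R^2$ can even be negative.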

In the next coding session, we will consider such advanced ML approaches, although in a different use case: binary classification for credit default prediction. After completing this session, you will be equipped with several coding examples illustrating standard ML practices. You are welcome to then come back to this notebook and examine, empirically, if advanced ML beats linear regression in this LGD modeling use case. 