<div style="position:relative; width:100%; height:200px;">
  <img src="https://raw.githubusercontent.com/stefanlessmann/VHB_ProDoc_ML/master/banner-nb.png" style="width:100%; object-fit:cover;" alt="ProDok-MachineLearning-Banner">
  <div style="
      position:absolute;
      left:4%;
      top:50%;
      transform:translateY(-50%);
      font-size:3.2vw;
      font-weight:750;
      color:#1f2a44;">
    ProDok – Machine Learning
  </div>
</div>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/stefanlessmann/VHB_ProDoc_ML/blob/master/P.I.2.cs_benchmark.ipynb)

# P.I.2 Machine Learning for Credit Risk Modeling: Model Selection, Evaluation, and Interpretation
The practice sessions complement the lectures and provide hands-on experience with the concepts covered in the course.
This session focuses on classic machine learning algorithms and practices. We will continue working with the credit risk analytics case and benchmark alternative supervised learning algorithms for default prediction. In this scope, we will revisit data organization principles, classification model evaluation, and hyperparameter tuning. The notebook concludes with an outlook on the explainable AI (XAI) lecture by illustrating some standard XAI outputs that shed light on the patterns inferred by our learning algorithms.

As in the previous practice session, the available time does not permit manual coding or extensive code reviews. We will provide coding demos for selected parts and otherwise rely on LLMs to generate the code we need. The focus of the practice session is on prompt engineering and discussing ML outputs.


# Data Preparation Reloaded

**Context:** 
We received a new sample of credit risk data. The sample comprises the same features and was taken 6 months after the first 100k batch was gathered. We developed Python code to ready the data for analysis. However, the code was AI-generated and not designed for reusability. Our first task is to *refactor* the data preparation code so that we can use it for the first and the second batch of data, as well as for future batches yet to come.

In [None]:
# Importing standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# We assume you stored the data locally. If not, use the URLs to load it from the web.
#url_100k = "https://raw.githubusercontent.com/stefanlessmann/VHB_ProDoc_ML/master/credit_data_100k.csv"
#url_25k = "https://raw.githubusercontent.com/stefanlessmann/VHB_ProDoc_ML/master/credit_data_25k.csv"

df_100 = pd.read_csv("credit_data_100k.csv")
df_25 = pd.read_csv("credit_data_25k.csv")

## Task: 
Use the code from `P.I.1.data_exploaration.ipynb` to create a reusable data preparation function. Apply your function to create *ready-for-modeling* versions of both datasets. Store the results in variables `dfDevelopment` and `dfHoldout`, respectively. 

In general, LLMs are good at refactoring code. Putting our generated data preparation code into a prompt and instructing an LLM to refactor it should work just fine. On the other hand, all we need to do is wrap our previous code in a function. Decide freely whether you want to approach the task by prompting an LLM or by coding the data preparation function yourself.

>Note that this exercise focuses on classification. Therefore, make sure to remove the `LGD` column from the data frames to prevent data leakage.
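To make the idea of a reusable preparation function concrete, here is a minimal sketch. It is not the preparation code from the previous session: the function name `prepare_credit_data`, the three cleaning steps, and the toy columns `income`, `purpose`, and `default` are assumptions chosen for illustration. Only the `LGD` column name is taken from the note above.

```python
import pandas as pd


def prepare_credit_data(df: pd.DataFrame, drop_cols=("LGD",)) -> pd.DataFrame:
    """Return a modeling-ready copy of a raw batch (illustrative steps only)."""
    out = df.copy()  # work on a copy so the caller's frame stays untouched
    # Remove columns that would leak information about the target
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    # Remove exact duplicate rows
    out = out.drop_duplicates()
    # Cast every non-numeric column to the category dtype
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out.reset_index(drop=True)


# Toy batch with hypothetical columns, standing in for a real data file
toy = pd.DataFrame({
    "income": [30.0, 45.5, 45.5, 60.0],
    "purpose": ["car", "home", "home", "car"],
    "default": [0, 1, 1, 0],
    "LGD": [0.0, 0.4, 0.4, 0.0],
})
toy_ready = prepare_credit_data(toy)
print(toy_ready.dtypes)
```

Because the function returns a new frame and takes the columns to drop as a parameter, the same call could be applied to `df_100`, `df_25`, and any later batch. Replace the body with the actual steps from your earlier notebook.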




In [2]:
# Place for your data preparation function



In [3]:
# Place to call your function and create modeling-ready datasets
dfDevelopment = None # Put your solution here
dfHoldout = None # Put your solution here

# Overfitting demonstrated
The lecture introduced you to the *fundamental problem of overfitting* and argued that decision trees are particularly vulnerable to it, which makes the case for tree pruning.

<p align="center">
  <img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/tree_pruning.png" alt="Overfitting in decision trees" width="480">
</p>

## Task:

Verify the illustrated relationship between tree depth and overfitting. To that end, your task is to:
1. Partition the data of `dfDevelopment` into a training and validation set using ratios of 80%/20%.
2. Train a decision tree classifier on the training data with varying tree depth (e.g., from 1 to 10).
3. Evaluate the performance of each model on both the training and validation sets using an appropriate metric.
4. Plot the training and validation performance against tree depth to visualize the overfitting phenomenon.

The coding task asks for a basic ML pipeline involving functions/classes such as `train_test_split` and `DecisionTreeClassifier`, amongst others. It is a good exercise to code the solution yourself, but you can also use AI; a lazy prompt should do, and no discussion of a prompting strategy is warranted.
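One way to approach the four steps is sketched below. To keep the example self-contained it uses synthetic data from `make_classification` as a stand-in for `dfDevelopment`; the choice of AUC as the metric, the sample size, and the random seeds are likewise assumptions, not part of the task.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and target of dfDevelopment (assumption)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], flip_y=0.05, random_state=42)

# Step 1: 80%/20% split, stratified to preserve the class ratio
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Steps 2 and 3: fit one tree per depth and score it on both partitions
depths = list(range(1, 11))
auc_train, auc_val = [], []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    auc_train.append(roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1]))
    auc_val.append(roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1]))

# Step 4: plot both curves against tree depth
fig, ax = plt.subplots()
ax.plot(depths, auc_train, marker="o", label="Training")
ax.plot(depths, auc_val, marker="o", label="Validation")
ax.set_xlabel("Maximum tree depth")
ax.set_ylabel("AUC")
ax.legend()
fig.tight_layout()
```

With the real data, replace the synthetic `X` and `y` with the feature columns and the default indicator of `dfDevelopment`. A widening gap between the two curves as depth grows is the overfitting pattern to look for.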

In [7]:
# Place for your solution

# Benchmarking
We finally hit the main part of the practice session: benchmarking different machine learning algorithms for credit risk modeling. This exercise addresses a common question in machine learning applications: *which algorithm is best suited for my data?*
Benchmarking is also standard in research: papers that introduce novel methodologies typically include benchmarking exercises to demonstrate the superiority of their method. A further learning goal of this exercise is to familiarize you with standard machine learning workflows and the relevant Python functions/classes.

We frame the benchmarking task such that it is realistic and perhaps a bit challenging. We discuss details in class, including our prompting strategy for the coding part. 

## Task: 
Compare several established machine learning algorithms using the development sample (i.e., `dfDevelopment`). Consider logistic regression (LR), neural networks (NN), decision trees (DT), random forests (RF), and extreme gradient boosting (XGB). For algorithms that have hyperparameters, make sure these are properly tuned. Assess the performance of the different algorithms in terms of i) the area under the ROC curve, ii) the area under the precision-recall curve, iii) the F-score, and iv) the Brier score (i.e., the mean squared error of the predicted probabilities compared to a zero-one coded target). Produce an estimate of model performance for each algorithm on a suitably selected subset of the development data. Also assess the models on the holdout data (i.e., `dfHoldout`). Plot the ROC curve, the PR curve, and the confusion matrix for logistic regression and the overall best-performing model (if different) on the holdout sample.
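The skeleton of such a benchmark might look as follows. This is a reduced sketch under several assumptions: it uses synthetic data instead of `dfDevelopment`, covers only LR and RF, tunes over deliberately small hypothetical grids, uses average precision as the estimate of the area under the PR curve, and computes the F-score at a 0.5 threshold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             f1_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the development sample (assumption)
X, y = make_classification(n_samples=1500, n_features=15, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Each candidate pairs an estimator with a small, illustrative tuning grid
candidates = {
    "LR": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
           {"logisticregression__C": [0.1, 1.0, 10.0]}),
    "RF": (RandomForestClassifier(n_estimators=100, random_state=0),
           {"max_depth": [3, 6, None]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    # Tune by cross-validation on the training part only
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=3)
    search.fit(X_train, y_train)
    # Score the refitted best model on data not used for tuning
    proba = search.predict_proba(X_test)[:, 1]
    rows.append({
        "model": name,
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
        "f1": f1_score(y_test, (proba >= 0.5).astype(int)),
        "brier": brier_score_loss(y_test, proba),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```

The remaining algorithms would slot into `candidates` in the same way, and the same metric block can be reused to score each tuned model on `dfHoldout`.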

In [None]:
# Place for your benchmarking code