# Applied Machine Learning: In-class Exercise 01-1

## Goal
Our goal for this exercise sheet is to learn the basics of scikit-learn for supervised learning by training a first simple model on training data and evaluating its performance on hold-out/test data.

## German Credit Dataset

The German credit dataset was donated by Prof. Dr. Hans Hofmann of the University of Hamburg in 1994 and contains 1,000 data points describing bank customers. The goal is to classify people as a good or bad credit risk based on 20 personal, demographic and financial features. The dataset is available at the UCI repository as [Statlog (German Credit Data) Data Set](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).

## Motivation of Risk Prediction
Customers who do not repay their loan on time represent an enormous risk for a bank: first, because they create an unintended gap in the bank’s planning, and second, because collecting the outstanding amount costs the bank additional time and money.

On the other hand, loans (through the interest charged on them) are an important revenue stream for banks. If a person’s loan application is rejected even though they would have met the repayment deadlines, the bank loses revenue as well as potential upselling opportunities.

Banks are therefore highly interested in a risk prediction model that accurately predicts the risk of future customers. This is where supervised learning models come into play.

## Data overview

n = 1,000 observations of bank customers

- `credit_risk`: is the customer a good or bad credit risk?
- `age`: age in years
- `amount`: amount asked by applicant
- `credit_history`: past credit history of applicant at this bank
- `duration`: duration of the credit in months
- `employment_duration`: present employment since
- `foreign_worker`: is applicant foreign worker?
- `housing`: type of apartment rented, owned, for free / no payment
- `installment_rate`: installment rate in percentage of disposable income
- `job`: current job information
- `number_credits`: number of existing credits at this bank
- `other_debtors`: other debtors/guarantors present?
- `other_installment_plans`: other installment plans the applicant is paying
- `people_liable`: number of people being liable to provide maintenance
- `personal_status_sex`: combination of sex and personal status of applicant
- `present_residence`: present residence since
- `property`: properties that applicant has
- `purpose`: reason customer is applying for a loan
- `savings`: savings accounts/bonds at this bank
- `status`: status/balance of checking account at this bank
- `telephone`: is there any telephone registered for this customer?

## Preprocessing

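Load the German Credit dataset from OpenML and print an overview. Note that the OpenML version names some columns differently from the overview above (e.g., `credit_amount` instead of `amount`). We will learn more about OpenML in the next sessions.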

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
# OpenML sometimes throws FutureWarnings; in these exercises we can ignore them
import warnings
warnings.filterwarnings("ignore")

german_data = fetch_openml(name="credit-g", version=1, as_frame=True)
X = german_data.data

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   checking_status         1000 non-null   category
 1   duration                1000 non-null   int64   
 2   credit_history          1000 non-null   category
 3   purpose                 1000 non-null   category
 4   credit_amount           1000 non-null   int64   
 5   savings_status          1000 non-null   category
 6   employment              1000 non-null   category
 7   installment_commitment  1000 non-null   int64   
 8   personal_status         1000 non-null   category
 9   other_parties           1000 non-null   category
 10  residence_since         1000 non-null   int64   
 11  property_magnitude      1000 non-null   category
 12  age                     1000 non-null   int64   
 13  other_payment_plans     1000 non-null   category
 14  housing                 1

In [2]:
X.describe(include="all")

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000.0 | 1000 | 1000 | 1000.0 | 1000 | 1000 | 1000.0 | 1000 | 1000 | 1000.0 | 1000 | 1000.0 | 1000 | 1000 | 1000.0 | 1000 | 1000.0 | 1000 | 1000 |
| unique | 4 | | 5 | 10 | | 5 | 5 | | 4 | 3 | | 4 | | 3 | 3 | | 4 | | 2 | 2 |
| top | no checking | | existing paid | radio/tv | | <100 | 1<=X<4 | | male single | none | | car | | none | own | | skilled | | none | yes |
| freq | 394 | | 530 | 280 | | 603 | 339 | | 548 | 907 | | 332 | | 814 | 713 | | 630 | | 596 | 963 |
| mean | | 20.903 | | | 3271.258 | | | 2.973 | | | 2.845 | | 35.546 | | | 1.407 | | 1.155 | | |
| std | | 12.058814 | | | 2822.736876 | | | 1.118715 | | | 1.103718 | | 11.375469 | | | 0.577654 | | 0.362086 | | |
| min | | 4.0 | | | 250.0 | | | 1.0 | | | 1.0 | | 19.0 | | | 1.0 | | 1.0 | | |
| 25% | | 12.0 | | | 1365.5 | | | 2.0 | | | 2.0 | | 27.0 | | | 1.0 | | 1.0 | | |
| 50% | | 18.0 | | | 2319.5 | | | 3.0 | | | 3.0 | | 33.0 | | | 1.0 | | 1.0 | | |
| 75% | | 24.0 | | | 3972.25 | | | 4.0 | | | 4.0 | | 42.0 | | | 2.0 | | 1.0 | | |


In [3]:
y = german_data.target
y

0      good
1       bad
2      good
3      good
4       bad
       ... 
995    good
996    good
997    good
998     bad
999    good
Name: class, Length: 1000, dtype: category
Categories (2, object): ['bad', 'good']

In [4]:
# One-hot encode categorical features before splitting, so that training and test sets share the same columns
X = pd.get_dummies(X)
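
To see what one-hot encoding does, here is a minimal sketch on a toy data frame. The frame `toy` and its values are made up for illustration and are not taken from the credit data.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column
toy = pd.DataFrame({
    "housing": pd.Categorical(["own", "rent", "own"]),
    "duration": [12, 24, 6],
})

# Each category level becomes its own indicator column;
# numeric columns are passed through unchanged
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))
```

Each level of `housing` turns into a separate `housing_<level>` column, which is why the encoded credit data has many more columns than the original 20 features.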

# Exercises:
Now, we can start building a model. To do so, we need to address the following questions:

- What is the problem we are trying to solve?
- What is an appropriate learning algorithm?
- How do we evaluate "good" performance?

## Split Data in Training and Test Data

Your task is to split the German Credit dataset (`X` and `y`) into 70\% training data and 30\%
test data by randomly sampling rows.
Later, we will use the training data to learn an ML model and use the test data
to assess its performance.

<details>
  <summary>Recap: Why do we need train and test data?</summary>

We use part of the available data (the training data) to train our model.
The remaining/hold-out data (test data) is used to evaluate the trained model.
This is exactly how we anticipate using the model in practice:
We want to fit the model to existing data and then make predictions on
new, unseen data points for which we do not know the outcome/target values.

Note: Hold-out splitting requires a dataset that is sufficiently
large such that both the training and test dataset are suitable representations
of the target population. What "sufficiently large" means depends on the
dataset at hand and the complexity of the problem.

The ratio of training to test data is also context dependent.
In practice, a 70\% to 30\% (~ 2:1) ratio is a good starting point.

</details>

<details>
<summary>Hint 1:</summary>

Use `np.random.default_rng` to get a random generator with a fixed seed,
and then use `rng.choice` to sample 70\% of the indices randomly without replacement.
Then, use the remaining indices for the test set.
Based on the ids, set up two datasets, one for training and one for testing/evaluating.

Set a seed (e.g., `100`) to make your results reproducible.

</details>

<details>
<summary>Hint 2:</summary>
Use `X_train = X.loc[...]` and `y_train = y.loc[...]` to retrieve the training instances.
</details>

In [5]:
#===SOLUTION===

# Split the dataset into training (70%) and test (30%) sets
rng = np.random.default_rng(seed=100)

indices = X.index.tolist()
train_size = int(0.7 * X.shape[0])
train_ids = rng.choice(indices, size=train_size, replace=False)
test_ids = list(set(indices) - set(train_ids))

X_train = X.loc[train_ids]
y_train = y.loc[train_ids]
X_test = X.loc[test_ids]
y_test = y.loc[test_ids]
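
For comparison, scikit-learn's `train_test_split` performs the same kind of random split in a single call. The sketch below runs on made-up placeholder data (`X_demo`, `y_demo`), not the credit data. Even with the same seed, it will generally select different rows than the manual solution above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the encoded features and the target
X_demo = pd.DataFrame({"x1": np.arange(10), "x2": np.arange(10) * 2})
y_demo = pd.Series(["good", "bad"] * 5, name="class")

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

print(X_tr.shape, X_te.shape)
```

Features and target are split with the same row indices, so `X_tr` and `y_tr` stay aligned.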

## Train a Model on the Training Dataset

Next, we train a logistic regression model on the training data.

<details><summary>Hint:</summary>
    Check the documentation of [`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
</details>


In [6]:
#===SOLUTION===

from sklearn.linear_model import LogisticRegression

# Train a logistic regression model using sklearn
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

## Inspect the Model
We can now inspect the model by looking at the coefficients of the features. Name at least two features that have a significant effect on the outcome.

<details><summary>Hint 1: </summary>
 `logreg.coef_[0]` contains the coefficients of the model.

</details>


In [7]:
#===SOLUTION===

# Comment: statsmodels (link below) is an alternative to sklearn for fitting linear models
# and it is closer to the linear models in R.
# https://www.statsmodels.org/stable/user-guide.html#regression-and-linear-models

# Print the model coefficients along with their corresponding feature names
coef = logreg.coef_[0]
features = X_train.columns
coefficients_df = pd.DataFrame({"Feature": features, "Coefficient": coef})
coefficients_df.sort_values(by="Coefficient", ascending=False)

| | Feature | Coefficient |
|---|---|---|
| 10 | checking_status_no checking | 1.123781 |
| 12 | credit_history_critical/other existing credit | 0.955700 |
| 18 | purpose_used car | 0.940480 |
| 41 | other_parties_guarantor | 0.869472 |
| 59 | foreign_worker_no | 0.674127 |
| ... | ... | ... |
| 11 | credit_history_all paid | -0.420558 |
| 20 | purpose_education | -0.456654 |
| 8 | checking_status_<0 | -0.470530 |
| 17 | purpose_new car | -0.708395 |


### Check the positive class of the model.

Before interpreting the model's weights, we need to check the positive class. The positive class determines which log odds are estimated, so misidentifying it leads to an incorrect interpretation of the coefficients.

<details><summary> Hint 1:</summary>
    `logreg.classes_` stores the negative and positive classes.
</details>

In [8]:
#===SOLUTION===

positive_class = logreg.classes_[1]
print(f"Positive class: {positive_class}")

Positive class: good


### Discuss the results
Which coefficients have a significant influence on the outcome?


===SOLUTION===

According to the coefficient table, e.g., the features derived from `credit_history` and `checking_status` have some of the largest coefficients in absolute value and thus strongly influence the predicted creditworthiness and the bank's risk assessment.
By looking at `logreg.classes_[1]`, we see that the class `good` (creditworthy client) is the positive class.
This means that a positive coefficient increases the predicted probability of being a creditworthy client, while a negative coefficient decreases it.
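
As a worked example of this interpretation: in logistic regression, a coefficient $\beta$ is a change in log odds, so $\exp(\beta)$ is the factor by which the odds of the positive class (`good`) are multiplied when the feature increases by one unit, all else being equal. The sketch below applies this to two coefficients copied from the table above. The dictionary `coefs` is only an illustrative container.

```python
import math

# Two coefficients copied from the coefficient table above
coefs = {
    "checking_status_no checking": 1.123781,
    "purpose_new car": -0.708395,
}

# exp(coefficient) = multiplicative change in the odds of the positive class
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 raises the odds of `good`, and a ratio below 1 lowers them. Keep in mind that sklearn's `LogisticRegression` applies L2 regularization by default, so the estimates are shrunk toward zero compared with an unpenalized fit.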


## Predict on the Test Dataset
Use the trained model to predict on the hold-out/test dataset.

In [9]:
#===SOLUTION===

# Predict labels on the test dataset
pred_labels = logreg.predict(X_test)

## Evaluation
What is the classification error on the test data (300 observations)?

<details>
<summary>Hint 1:</summary>
Use the `score` method of the logistic regression model to calculate the accuracy.
</details>

<details>
<summary>Hint 2:</summary>
The classification error is defined as $1 - \text{accuracy}$.
</details>

In [10]:
#===SOLUTION===

accuracy = logreg.score(X_test, y_test)
classification_error = 1 - accuracy
print(f"Classification error: {classification_error:.4f}")

Classification error: 0.2433
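
The relationship between accuracy and classification error can also be checked by hand. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not the model output above):

```python
# Hypothetical true and predicted labels for eight customers
y_true = ["good", "bad", "good", "good", "bad", "good", "bad", "good"]
y_pred = ["good", "good", "good", "bad", "bad", "good", "bad", "good"]

# Accuracy: share of predictions that match the true label
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy_demo = n_correct / len(y_true)

# Classification error: share of mismatches, i.e. 1 - accuracy
error_demo = 1 - accuracy_demo
print(f"accuracy = {accuracy_demo:.3f}, error = {error_demo:.3f}")
```

Six of the eight predictions match, so the accuracy is 0.75 and the classification error is 0.25.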


## Predicting probabilities instead of labels

Similarly, we can assess the performance of our model using the AUC. However, this requires predicted probabilities instead of predicted labels. Evaluate the model using the AUC. To do so, we need to use the `predict_proba` method of the model.


<details>
<summary>Hint 1:</summary>
Map the true labels to binary: 1 if "good", else 0.
</details>

<details>
<summary>Hint 2:</summary>
Use the `roc_auc_score` function from `sklearn.metrics` to calculate the AUC.
</details>

In [11]:
#===SOLUTION===

from sklearn.metrics import roc_auc_score

# Predict probabilities for the positive class ("good")
positive_index = list(logreg.classes_).index(positive_class)
pred_proba = logreg.predict_proba(X_test)[:, positive_index]

# Map true labels to binary: 1 if "good", else 0
y_test_bin = y_test.map(lambda x: 1 if x == "good" else 0)

# Evaluate performance using AUC
auc = roc_auc_score(y_test_bin, pred_proba)
print(f"AUC: {auc:.4f}")

AUC: 0.7830
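
To make the AUC less of a black box: it equals the probability that a randomly chosen positive (`good`) instance receives a higher predicted probability than a randomly chosen negative (`bad`) one, with ties counted as one half. The following sketch computes this pairwise definition on made-up scores. `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def pairwise_auc(y_true_bin, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true_bin, scores) if y == 1]
    neg = [s for y, s in zip(y_true_bin, scores) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 for a tie, 0 otherwise
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = "good") and predicted probabilities of "good"
y_demo_bin = [1, 0, 1, 1, 0]
p_demo = [0.9, 0.4, 0.35, 0.8, 0.2]

auc_demo = pairwise_auc(y_demo_bin, p_demo)
print(f"AUC: {auc_demo:.4f}")
```

Five of the six (positive, negative) pairs are ranked correctly here, giving an AUC of 5/6. Calling `roc_auc_score(y_demo_bin, p_demo)` on the same inputs should return the same value. Because only the ranking of the scores matters, the AUC does not depend on a particular classification threshold.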


## Summary
In this exercise sheet, we learned how to fit a logistic regression model using `sklearn.linear_model.LogisticRegression` on training data and how to assess its performance on unseen test data. We showed how to split data manually into training and test data, but in most scenarios this is handled by a single call to a resampling or benchmarking function. We will learn more about this in the next sections.
