# Applied Machine Learning: In-class Exercise 01-2

## Goal

You will learn how to estimate model performance with `scikit-learn` using resampling techniques such as 5-fold cross-validation. Additionally, you will compare a k‑NN model against a logistic regression model.

## German Credit Data
We work with the German credit data. You can load the dataset from OpenML and preprocess it for modeling.

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

from sklearn.datasets import fetch_openml

german_data = fetch_openml(name="credit-g", version=1, as_frame=True)

# One-hot encode the categorical features (numeric columns pass through unchanged)
X = pd.get_dummies(german_data.data, drop_first=True)
y = german_data.target

## Exercise: Fairly evaluate the performance of two learners

We first create two models using scikit-learn: a logistic regression model and a k‑nearest neighbors (k‑NN) model with $k=5$. We then compare their performance using cross-validation.

<details>
<summary>Hint 1:</summary>
You can use `LogisticRegression` and `KNeighborsClassifier` from `sklearn.linear_model` and `sklearn.neighbors`, respectively.
</details>

In [2]:
#===SOLUTION===

# Create the learners
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


log_reg = LogisticRegression(max_iter=1000)
knn = KNeighborsClassifier(n_neighbors=5)

## Set up a resampling instance

Use `scikit-learn` to set up a resampling strategy for 5‑fold cross-validation. For example, you can use the `KFold` class to configure the number of folds, shuffling, and a random state for reproducibility.

<details>
<summary>Hint 1:</summary>
You can use `KFold` from `sklearn.model_selection`.
</details>

In [3]:
#===SOLUTION===

# Set up 5-fold cross-validation with shuffling; the fixed random_state makes the splits reproducible
from sklearn.model_selection import KFold


cv = KFold(n_splits=5, shuffle=True, random_state=100)
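
To make the splitting concrete, here is a minimal sketch of how `KFold` partitions a dataset. The 10-row array `X_toy` is a hypothetical stand-in for the credit data and is used only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with a single feature
X_toy = np.arange(10).reshape(-1, 1)

# Same configuration as in the exercise
cv_toy = KFold(n_splits=5, shuffle=True, random_state=100)

# Each split yields the row indices used for training and for testing
folds = list(cv_toy.split(X_toy))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {i}: train={train_idx}, test={test_idx}")

# Across all folds, every sample lands in a test set exactly once
all_test = np.sort(np.concatenate([test_idx for _, test_idx in folds]))
```

With 10 samples and 5 folds, each test set holds 2 samples and each training set holds the remaining 8.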

## Run the resampling

Now use the resampling instance to evaluate both of the learners created above.

<details>
<summary>Hint 1:</summary>
You can use `cross_val_score` from `sklearn.model_selection`.
</details>


In [4]:
#===SOLUTION===

from sklearn.model_selection import cross_val_score

scores_logreg = cross_val_score(log_reg, X, y, cv=cv, scoring='accuracy')
scores_knn = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
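
As a rough illustration of what `cross_val_score` does for each learner, the following sketch writes the loop out by hand. It assumes synthetic data from `make_classification` in place of the credit data, and the names `X_demo`, `y_demo`, and `manual_scores` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the German credit data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_demo = KFold(n_splits=5, shuffle=True, random_state=100)

manual_scores = []
for train_idx, test_idx in cv_demo.split(X_demo):
    # Fit a fresh, unfitted copy of the model on the training part of the fold
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score it on the held-out part of the fold
    y_pred = fold_model.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], y_pred))
manual_scores = np.array(manual_scores)

# The library helper should give the same per-fold accuracies
reference_scores = cross_val_score(model, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
```

The key point is that the model is refit from scratch in every fold and is only ever scored on data it did not see during fitting.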

## Evaluation

Compute the cross-validated classification accuracy of both models. Which learner performed better?

<details>
<summary>Hint 1:</summary>
`cross_val_score` returns an array of scores. You can compute the mean of this array to get the average accuracy.
</details>



In [6]:
#===SOLUTION===

print(f"k-NN CV accuracy: {scores_knn.mean():.4f}")
print(f"Logistic Regression CV accuracy: {scores_logreg.mean():.4f}")

k-NN CV accuracy: 0.6450
Logistic Regression CV accuracy: 0.7480
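
Because the mean hides how much the accuracy varies between folds, it can help to report the spread as well. The sketch below uses a hypothetical helper, `summarize_scores`, and made-up example scores, not the results from this exercise.

```python
import numpy as np

def summarize_scores(name, scores):
    """Format the mean and sample standard deviation of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return f"{name}: {scores.mean():.4f} +/- {scores.std(ddof=1):.4f}"

# Hypothetical per-fold accuracies, for illustration only
example_scores = [0.70, 0.75, 0.80, 0.75, 0.75]
print(summarize_scores("Example learner", example_scores))
```

The same helper could be applied to `scores_knn` and `scores_logreg` from the cell above.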


## Summary

We can now use resampling to estimate the performance of different learners and compare them fairly. Compared with a single train-test split, cross-validation averages the results over several splits and therefore gives a lower-variance estimate of model performance. In this exercise, logistic regression (mean accuracy 0.7480) outperformed k‑NN with $k=5$ (mean accuracy 0.6450).