In [7]:
library(mlbench)
library(mlr3)
library(mlr3learners)

## Solution 2: Resampling strategies

### a)

The two main advantages of resampling are:

• We are able to use larger training sets (at the expense of test set size) because the high variance this incurs
for the resulting estimator is smoothed out by averaging across repetitions.

• Repeated sampling reduces the risk of getting lucky (or not so lucky) with a particular data split, which
is especially relevant with few observations.
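
Both points can be illustrated with a small simulation. The following is only a sketch in base R under stated assumptions: the data are simulated (not the credit data used below), the helper `holdout_error()` is hypothetical, and repeated random 90/10 splits stand in for a proper repeated CV. It compares the spread of error estimates from a single split with the spread of estimates averaged over 10 splits.

```r
set.seed(1)

# Simulated binary classification data (an assumption, not the credit data)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))
dat <- data.frame(x = x, y = y)

# Misclassification error of a logistic regression on one random split
# (hypothetical helper, introduced only for this illustration)
holdout_error <- function(data, ratio = 0.9) {
  train_idx <- sample(nrow(data), size = floor(ratio * nrow(data)))
  fit <- glm(y ~ x, data = data[train_idx, ], family = binomial())
  prob <- predict(fit, newdata = data[-train_idx, ], type = "response")
  mean((prob > 0.5) != data$y[-train_idx])
}

# 200 estimates, each based on a single 90/10 split
single <- replicate(200, holdout_error(dat))

# 200 estimates, each averaging the errors of 10 random 90/10 splits
averaged <- replicate(200, mean(replicate(10, holdout_error(dat))))

# Spread of the two kinds of estimates across repetitions
c(sd_single = sd(single), sd_averaged = sd(averaged))
```

With only 20 test observations per split, a single estimate depends heavily on which observations land in the test set; averaging over several splits should noticeably shrink that spread.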

### b)

You can find [german_credit_for_py.csv](https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/data/german_credit_for_py.csv) in our GitHub repository. The feature columns have already been preprocessed with *OneHotEncoder* for categorical features and *OrdinalEncoder* for ordinal features (installment_rate, present_residence, number_credits).

In [8]:
#| label: 2-b-1

# create task and learner
(task <- tsk("german_credit"))
learner <- lrn("classif.log_reg")

# train, predict and compute train error
learner$train(task)
preds <- learner$predict(task)
preds$score()

<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker,
    housing, job, other_debtors, other_installment_plans,
    people_liable, personal_status_sex, property, purpose, savings,
    status, telephone
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence

### c)

In [None]:
#| label: 2-c-1
#| output: false

# create different resampling strategies
set.seed(123)
resampling_3x10_cv <- rsmp("repeated_cv", folds = 10, repeats = 3)
resampling_10x3_cv <- rsmp("repeated_cv", folds = 3, repeats = 10)
resampling_ho <- rsmp("holdout", ratio = 0.9)

# evaluate without stratification
result_3x10_cv <- resample(task, learner, resampling_3x10_cv, store_models = TRUE)
result_10x3_cv <- resample(task, learner, resampling_10x3_cv, store_models = TRUE)
result_ho <- resample(task, learner, resampling_ho, store_models = TRUE)

# evaluate with stratification
task_stratified <- task$clone()
task_stratified$set_col_roles("foreign_worker", roles = "stratum")
result_stratified <- resample(
  task_stratified, learner, resampling_3x10_cv, store_models = TRUE)

INFO  [11:29:35.847] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 1/30)
INFO  [11:29:35.927] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 2/30)
INFO  [11:29:35.973] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 3/30)
INFO  [11:29:36.014] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 4/30)
INFO  [11:29:36.057] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 5/30)
INFO  [11:29:36.107] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 6/30)
INFO  [11:29:36.149] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 7/30)
INFO  [11:29:36.190] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 8/30)
INFO  [11:29:36.231] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 9/30)
INFO  [11:29:36.353] [mlr3] Applying learner 'classif.log_reg' on task 'german_credit' (iter 10/30)

In [None]:
#| label: 2-c-2

# aggregate results over splits (classif.ce, the misclassification error, is the default measure)
print(sapply(
  list(result_3x10_cv, result_10x3_cv, result_stratified, result_ho), 
  function(i) i$aggregate()))

classif.ce classif.ce classif.ce classif.ce 
 0.2486667  0.2557977  0.2525512  0.1800000 


### d)

Generalization error estimates are pretty stable across the different resampling strategies because we have a
fairly large number (1000) of observations. Still, the pessimistic bias of small training sets is visible: 10x3-CV,
using roughly 67% of data for training in each split, estimates a higher generalization error than 3x10-CV with
roughly 90% training data. Stratification by foreign worker does not seem to have much effect on the estimate.
However, we see a glaring difference when we use a single 90%-10% split, where the estimated GE is roughly 7
percentage points lower than with 3x10-CV, meaning we got a lower error just because of a lucky split.

Comparing the results (except for the unreliable one produced by a single split) with the training error from b)
indicates no serious overfitting.
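
To put the numbers in d) into perspective, the training shares and a rough standard error for the holdout estimate can be worked out by hand. The sketch below is only a back-of-the-envelope calculation: the helpers `train_share()` and `se_error()` are hypothetical, the true error of 0.25 is an assumption chosen to be close to the CV aggregates, and the binomial formula treats the test observations as independent.

```r
# Share of the data used for training in each split of k-fold CV
train_share <- function(folds) 1 - 1 / folds
train_share(10)  # 3x10-CV: 0.9
train_share(3)   # 10x3-CV: about 0.67

# Rough binomial standard error of a misclassification rate p
# estimated on n_test independent test observations
se_error <- function(p, n_test) sqrt(p * (1 - p) / n_test)

n <- 1000
n_test_holdout <- n * (1 - 0.9)   # 100 test observations in the 90/10 split
se_holdout <- se_error(0.25, n_test_holdout)  # assumed true error of 0.25
se_holdout                        # about 0.043

# Gap between the printed holdout and 3x10-CV aggregates, in SE units
gap <- abs(0.18 - 0.2486667)
gap / se_holdout                  # about 1.6
```

Under these assumptions, a deviation of this size is well within what a single test set of 100 observations can produce by chance, which is why the single-split estimate should not be over-interpreted.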

### e)

LOO is not a very good idea here – with 1000 observations it requires 1000 model fits, far more than the 30
fits of either repeated CV in c). Also, LOO has high variance by nature. Repeated CV with a sufficient number
of folds should give us a pretty good idea about the expected GE of our learner.
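
The cost argument can be made concrete with a minimal base-R sketch. It assumes simulated data rather than the credit task, and `cv_error()` is a hypothetical helper written for this illustration. It runs LOO and 10-fold CV through the same loop and counts the model fits each one needs.

```r
set.seed(1)

# Simulated data (an assumption; not the credit task)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(1.5 * dat$x))

# Generic CV loop: fold_id assigns each observation to exactly one test fold
# (hypothetical helper, introduced only for this illustration)
cv_error <- function(data, fold_id) {
  errs <- sapply(unique(fold_id), function(f) {
    test <- fold_id == f
    fit <- glm(y ~ x, data = data[!test, ], family = binomial())
    prob <- predict(fit, newdata = data[test, ], type = "response")
    mean((prob > 0.5) != data$y[test])
  })
  list(estimate = mean(errs), n_fits = length(errs), fold_errors = errs)
}

# LOO: every observation is its own test fold, so n models are fitted
loo <- cv_error(dat, fold_id = seq_len(n))

# 10-fold CV: observations are randomly assigned to 10 folds
cv10 <- cv_error(dat, fold_id = sample(rep(1:10, length.out = n)))

c(loo_fits = loo$n_fits, cv10_fits = cv10$n_fits)
c(loo_estimate = loo$estimate, cv10_estimate = cv10$estimate)
```

Each LOO fold error is either 0 or 1 because the test set contains a single observation, and the number of fits grows linearly with the number of observations.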