***
**Introduction to Machine Learning** <br>
__[https://slds-lmu.github.io/i2ml/](https://slds-lmu.github.io/i2ml/)__
***

# Exercise sheet 12: Nested Resampling

In [None]:
#| label: import
# Consider the following libraries for this exercise sheet:

library(mlr3verse)
library(mlr3tuning)

## Exercise 2: AutoML


In this exercise, we build a simple automated machine learning (AutoML) system that will make data-driven
choices on which learner/estimator to use and also conduct the necessary tuning.

`mlr3pipelines` makes this endeavor easy, modular, and guarded against many common modeling errors. <br>
We work on the `pima` data to classify patients as diabetic and design a system that is able to choose between $k$-NN
and a random forest, both with tuned hyperparameters. <br>
To this end, we will use a graph learner, a "single unit of data operation" that can be trained, resampled, evaluated,
etc. as a whole. In other words, it can be treated like any other learner.
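
To make the graph-learner idea concrete, here is a minimal sketch (not the solution to this exercise). It chains a scaling step with a decision tree and wraps the result with `as_learner()`. The `sonar` task, the `classif.rpart` learner, and all variable names are assumptions chosen purely for illustration.

```r
library(mlr3)
library(mlr3pipelines)

# Assumed toy setup: a scaling step followed by a decision tree.
graph <- po("scale") %>>% po("learner", lrn("classif.rpart"))

# Wrap the graph so that it behaves like a single learner.
graph_learner <- as_learner(graph)

# The whole pipeline is now trained and used for prediction as one unit.
task <- tsk("sonar")
graph_learner$train(task)
prediction <- graph_learner$predict(task)
prediction$score(msr("classif.ce"))
```

Because the pre-processing lives inside the learner, it is re-fitted on each training split during resampling, which is what guards against leaking information from test data.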

> a) Create a task object in `mlr3` (the problem is pre-specified under the ID `"pima"`).

In [None]:
# Enter your code here:

> b) Specify the above learners, giving each one a name via the `id` argument. Convert
each learner to a pipe operator by wrapping it in the sugar function `po()`, and store the results in a list.

In [None]:
# Enter your code here:

> c) Before starting the actual learning pipeline, take care of pre-processing. While this step is highly customizable,
you can use an existing sequence to impute missing values, encode categorical features, and remove variables
with constant value across all observations. For this, specify a pipeline (`ppl()`) of type "`robustify`" (setting
`factors_to_numeric` to `TRUE`).

In [None]:
# Enter your code here:

> d) Create another `ppl`, of type "`branch`" this time, to enable selection between your learners.

In [None]:
# Enter your code here:

> e) Chain both pipelines using the double pipe (`%>>%`) and plot the resulting graph. Next, convert it into a graph learner
with `as_learner()`.

In [None]:
# Enter your code here:

> f) Now you have a learner object just like any other. Take a look at its tunable hyperparameters. You will optimize
the learner selection, the number of neighbors in $k$-NN (between $3$ and $10$), and the number of split candidates
to try in the random forest (between $1$ and $5$). Define the search range for each like so:

`<learner>$param_set$values$<hyperparameter> <- to_tune(p_int(lower, upper))`

> `p_int` marks an integer hyperparameter with lower and upper bounds as defined; similar objects exist for other
data types. With `to_tune()`, you signal that the hyperparameter shall be optimized in the given range.
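
The pattern can be tried on a simpler learner first. This is a sketch only: the `classif.rpart` learner and its `minsplit` hyperparameter are assumptions chosen for illustration and are not part of the exercise.

```r
library(mlr3)
library(paradox)

# Assumed example learner; `minsplit` is an integer hyperparameter of rpart.
learner <- lrn("classif.rpart")
learner$param_set$values$minsplit <- to_tune(p_int(2, 20))

# Inspect the search space derived from the tune tokens set above.
learner$param_set$search_space()
```

Only hyperparameters marked with `to_tune()` appear in the search space; all others keep their fixed values.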

<div class="alert alert-block alert-info">
    <b>Hint:</b> You need to define dependencies, since which hyperparameters are tuned depends on which learner is selected in the
first place (no need to tune $k$ in a random forest).<br>
</div>
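
One way to express such a dependency is the `depends` argument of the `p_*()` constructors in `paradox`. The sketch below builds a stand-alone search space with `ps()`. The parameter names `learner`, `k`, and `mtry` are hypothetical placeholders, not the actual parameter IDs of your graph learner.

```r
library(paradox)

# Hypothetical stand-alone search space; parameter names are illustrative only.
search_space <- ps(
  learner = p_fct(c("kknn", "rf")),
  k = p_int(3, 10, depends = learner == "kknn"),
  mtry = p_int(1, 5, depends = learner == "rf")
)

# Each dependency is recorded with the dependent parameter (`id`),
# the parameter it depends on (`on`), and the condition (`cond`).
search_space$deps
```

With these dependencies, `k` is only sampled when `learner` is `"kknn"`, and `mtry` only when it is `"rf"`. The same `depends` argument should also be usable in a `p_int()` call wrapped in `to_tune()`.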

In [None]:
# Enter your code here:

> g) Conveniently, there is a sugar function, `tune_nested()`, that takes care of nested resampling in one step. Use
it to evaluate your tuned graph learner with
> - mean classification error as inner loss,
> - random search as tuning algorithm (allowing for $3$ evaluations), and
> - $3$-CV in both inner and outer loop.
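
For orientation, a call to `tune_nested()` on a much simpler setup might look as follows. This is a sketch under assumptions: it uses a decision tree on the `sonar` task rather than the graph learner from this exercise, and it passes the tuner as the first positional argument because the name of that argument has differed between `mlr3tuning` versions.

```r
library(mlr3)
library(mlr3tuning)

# Assumed toy setup: a decision tree with one tunable hyperparameter.
learner <- lrn("classif.rpart")
learner$param_set$values$minsplit <- to_tune(p_int(2, 20))

set.seed(123)
rr <- tune_nested(
  tnr("random_search"),                      # tuning algorithm
  task = tsk("sonar"),
  learner = learner,
  inner_resampling = rsmp("cv", folds = 3),  # used for tuning
  outer_resampling = rsmp("cv", folds = 3),  # used for performance estimation
  measure = msr("classif.ce"),               # inner loss
  term_evals = 3                             # evaluations per tuning run
)
rr
```

The returned object is a resample result for the outer loop, so each outer fold runs its own tuning on its training part.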

In [None]:
# Enter your code here:

> h) Lastly, extract performance estimates per outer fold (`score()`) and overall (`aggregate()`). If you want to risk
a look under the hood, try `extract_inner_tuning_archives()`.
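
The difference between `score()` and `aggregate()` can be seen on any resample result. The sketch below uses a plain (non-nested) `resample()` call with an assumed decision tree on the `sonar` task, purely for illustration.

```r
library(mlr3)

set.seed(123)
rr <- resample(tsk("sonar"), lrn("classif.rpart"), rsmp("cv", folds = 3))

measure <- msr("classif.ce")
rr$score(measure)      # one row per fold
rr$aggregate(measure)  # a single value aggregated over all folds
```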


In [None]:
# Enter your code here:

Congrats, you just designed a turn-key AutoML system that does (nearly) all the work with a few lines of code!

## Exercise 3: Kaggle Challenge

Make yourself familiar with the Titanic Kaggle challenge [https://www.kaggle.com/c/titanic](https://www.kaggle.com/c/titanic). <br>
Based on everything you have learned in this course, do your best to achieve a good performance in the survival
challenge.
- Try out different classifiers you have encountered during the course (or maybe even something new?)
- Improve the prediction by creating new features (feature engineering).
- Tune your parameters (see: https://mlr3book.mlr-org.com/tuning.html or https://scikit-learn.org/stable/modules/grid_search.html).
- How do you fare compared to the public leaderboard?


<div class="alert alert-block alert-info">
    <b> <code>mlr3</code> Hint:</b> Use the <code>titanic</code> package to directly access the data. Use <code>titanic::titanic_train</code> for training and <code>titanic::titanic_test</code> for your final prediction. <br>
</div>
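
As a possible starting point for feature engineering, the sketch below derives two features from the passenger data. It assumes the `titanic` package is installed. The feature names `FamilySize` and `IsAlone` are hypothetical choices, not something the challenge prescribes.

```r
# Assumes the `titanic` package is installed.
train <- titanic::titanic_train

# Hypothetical engineered features (names are illustrative):
# siblings/spouses plus parents/children plus the passenger.
train$FamilySize <- train$SibSp + train$Parch + 1
train$IsAlone <- train$FamilySize == 1

# Encode the target as a factor for classification.
train$Survived <- factor(train$Survived)

# Quick look at how the new feature relates to survival.
table(train$IsAlone, train$Survived)
```

Any transformation applied to the training data has to be applied identically to `titanic::titanic_test` before predicting.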

In [None]:
# Enter your code here: