***
**Introduction to Machine Learning** <br>
__[https://slds-lmu.github.io/i2ml/](https://slds-lmu.github.io/i2ml/)__
***

# Exercise sheet 12: Nested Resampling

In [None]:
#| label: import
# Consider the following libraries for this exercise sheet:

# general
import numpy as np
import pandas as pd
# sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.feature_selection import VarianceThreshold

## Exercise 2: AutoML


In this exercise, we build a simple automated machine learning (AutoML) system that will make data-driven
choices on which learner/estimator to use and also conduct the necessary tuning.

`sklearn.pipeline.Pipeline` makes this endeavor easy, modular and guarded against many common modeling
errors. <br>
We work on the [`pima`](https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/data/pima.csv) data to classify patients as diabetic and design a system that is able to choose between $k$-NN
and a random forest, both with tuned hyperparameters. <br>
The purpose of the pipeline is to assemble several transformation steps and a final estimator so that they can be cross-validated together while setting different parameters. In other words, the pipeline can be treated like any
other estimator.
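
As a minimal sketch of this idea (on synthetic data, not on `pima`; the names `toy_pipe`, `X_toy` and `y_toy` are made up for illustration), a pipeline can be passed to `cross_val_score` exactly like a plain estimator:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the pima data set).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# A transformation step followed by a final estimator.
toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
toy_pipe.set_params(knn__n_neighbors=7)

# The whole pipeline is re-fitted on the training part of each fold,
# so the scaler never sees the corresponding validation part.
toy_scores = cross_val_score(toy_pipe, X_toy, y_toy, cv=3)
print(toy_scores.mean())
```

Because the scaling happens inside the pipeline, no information from the validation folds leaks into the preprocessing.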

> a) Load the data set [`pima`](https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/data/pima.csv), encode the target "`diabetes`" as a $0$-$1$ vector and perform a stratified `train_test_split`.

In [None]:
# Enter your code here:

> b) As part of our modeling process, we want to perform certain preprocessing steps. While this step is highly
customizable, we want to include at least one-hot encoding of categorical features and imputation of missing
values. <br>
Instantiate a `ColumnTransformer` object that includes these two steps and selects the affected columns dynamically.

<div class="alert alert-block alert-info">
    <b>Hint:</b>   Strings are considered as <code>dtype = object</code> <br>
</div>
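
The hint can be illustrated with a small sketch on a made-up data frame (`toy_df` and its columns are hypothetical; the string column is created with `dtype=object` explicitly, since the default string dtype may differ between pandas versions). `make_column_selector` picks columns by dtype when the transformer is fitted, so no column names need to be hard-coded:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two numeric columns and one string column.
toy_df = pd.DataFrame({
    "height": [1.70, 1.82, 1.65, 1.90],
    "weight": [65.0, 80.0, 58.0, 92.0],
    "colour": pd.Series(["red", "blue", "red", "green"], dtype=object),
})

# Selectors are callables that return the matching column names.
cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_exclude=object)
print(cat_selector(toy_df))
print(num_selector(toy_df))

# One-hot encode the selected columns and pass the remaining ones through.
toy_ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), cat_selector)],
    remainder="passthrough",
)
toy_encoded = toy_ct.fit_transform(toy_df)
print(toy_encoded.shape)
```

This sketch leaves out imputation on purpose; an imputer can be added as a further transformer in the same way.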

In [None]:
# Enter your code here:

> c) Next, create one pipeline each for $k$-NN and the random forest. This way, each estimator can have its own
preprocessing steps. Include the previously created `ColumnTransformer`, a `VarianceThreshold` to
remove constant columns and the corresponding estimator as a final step. Additionally, scale the columns for
the $k$-NN estimator.

In [None]:
# Enter your code here:

> d) A very common ensembling technique is to predict according to the decisions of several estimators. In scikit-learn this is
implemented by the `VotingClassifier`, which, with soft voting, predicts the class label as the argmax of the sums
of the predicted probabilities. Instantiate a `VotingClassifier` with the two classifier pipelines for $k$-NN and
random forest.

<div class="alert alert-block alert-info">
    <b>Hint:</b> Set the parameters <code>voting = "soft"</code> and <code>n_jobs = -1</code> for parallel computation. <br>
</div>
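
To make the "argmax of the sums of the predicted probabilities" concrete, here is a small NumPy sketch with made-up probabilities for two classifiers and three observations (all numbers and names are illustrative assumptions, not model output):

```python
import numpy as np

# Hypothetical predicted class probabilities (columns: class 0, class 1).
proba_a = np.array([[0.90, 0.10],
                    [0.40, 0.60],
                    [0.45, 0.55]])
proba_b = np.array([[0.60, 0.40],
                    [0.30, 0.70],
                    [0.80, 0.20]])

# Soft voting: sum the probabilities per class, then take the argmax.
soft_vote = np.argmax(proba_a + proba_b, axis=1)
print(soft_vote)  # [0 1 0]

# For comparison: the label each classifier would predict on its own.
print(np.argmax(proba_a, axis=1), np.argmax(proba_b, axis=1))
```

For the third observation the two classifiers disagree; soft voting follows the more confident one, because its probability mass dominates the sum.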

In [None]:
# Enter your code here:

> e) Now you have an estimator object just like any other. Take a look at its tunable hyperparameters. You will
optimize the number of neighbors in $k$-NN (between $3$ and $10$), and the number of split candidates to try in the
random forest (between $1$ and $5$). Define the search range for each like so:

In [None]:
param_grid_voting = [{"<voting_estimator1>__<pipeline1_estimator>__<hyperparameter>":
                        list(<parameter_range>)},
                    {"<voting_estimator2>__<pipeline2_estimator>__<hyperparameter>":
                        list(<parameter_range>)}]

> Please note that the estimator names must match the labels given in the `VotingClassifier` and the
`Pipeline`, followed by the name of the hyperparameter of the estimator used in the pipeline. Each level of hyperparameters of our ensemble estimator is accessible through the separator `__` (double underscore).
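
One way to check the available names is `get_params()`. The following sketch uses hypothetical labels (`toy_knn_pipe`, `toy_knn`, `toy_rf_pipe`, `toy_rf`) and pipelines without any preprocessing, so the keys will differ from those of your own ensemble:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical labels; your own labels determine the parameter names.
toy_voting = VotingClassifier(
    estimators=[
        ("toy_knn_pipe", Pipeline([("toy_knn", KNeighborsClassifier())])),
        ("toy_rf_pipe", Pipeline([("toy_rf", RandomForestClassifier())])),
    ],
    voting="soft",
)

# get_params() lists every tunable parameter with its fully nested name.
toy_names = sorted(toy_voting.get_params().keys())
print([name for name in toy_names if name.endswith("__n_neighbors")])
print([name for name in toy_names if name.endswith("__max_features")])
```

Each printed key has the form `<voting label>__<pipeline step label>__<hyperparameter>`, which is the form expected in `param_grid_voting`.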

In [None]:
# Enter your code here:

> f) Nested resampling is a method to avoid the so-called *optimization bias* by tuning parameters and evaluating
performance on different subsets of your training data. Use
> - stratified $3$-CV in both the inner and the outer loop,
> - accuracy as inner performance measure,
> - grid search as tuning algorithm. <br>
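
For orientation, the scheme can be sketched compactly on synthetic data by wrapping a `GridSearchCV` object (inner loop) in `cross_val_score` (outer loop). This is only an illustration with a single $k$-NN learner and made-up names (`X_demo`, `y_demo`, `tuner`); it is not the explicit loop asked for in this exercise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the pima data set).
X_demo, y_demo = make_classification(n_samples=150, n_features=4, random_state=1)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Inner loop: grid search selects n_neighbors using accuracy.
tuner = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=inner,
    scoring="accuracy",
)

# Outer loop: each outer training set is tuned from scratch and the
# tuned model is scored on the held-out outer fold.
demo_scores = cross_val_score(tuner, X_demo, y_demo, cv=outer, scoring="accuracy")
print(demo_scores, demo_scores.mean())
```

The outer scores estimate the performance of the whole procedure "tune, then fit", not of one fixed hyperparameter setting.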

> You may use the following incomplete code to compute the nested resampling:

In [None]:
NUM_OUTER_FOLDS = <...>
nested_scores_voting = np.zeros(NUM_OUTER_FOLDS) # initialize scores with 0
# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g. "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = <...>(n_splits=<...>, shuffle=True, random_state=42)
outer_cv = <...>(n_splits=<...>, shuffle=True, random_state=42)

for i, (train_index, val_index) in enumerate(outer_cv.split(X_train, y_train)):
    # Nested CV with parameter optimization for ensemble pipeline
    clf_gs_voting = <...>(
        estimator=<...>,
        param_grid=<...>,
        cv=<...>,
        n_jobs=-1
    )
    clf_gs_voting.fit(X_train.iloc[<...>], y_train[<...>])
    nested_scores_voting[i] = clf_gs_voting.score(X_train.iloc[<...>], y_train[<...>])


In [None]:
# Enter your code here:

> g) Extract performance estimates per outer fold and overall (as mean). According to your results, determine the
best classifier object.


In [None]:
# Enter your code here:

> h) Lastly, evaluate the performance on the test set. Think about the class imbalance of your data set and how it
affects accuracy as a performance measure. Try to find a better-suited metric and compare the two.
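
The effect of imbalance can be seen in a toy example with made-up labels (90 negatives, 10 positives) and a trivial classifier that always predicts the majority class. `balanced_accuracy_score`, the mean of the per-class recalls, is used here as one possible alternative; other metrics may suit your problem better:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_true_toy = np.array([0] * 90 + [1] * 10)

# A trivial "classifier" that always predicts the majority class.
y_pred_toy = np.zeros_like(y_true_toy)

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
bal_toy = balanced_accuracy_score(y_true_toy, y_pred_toy)
print(acc_toy, bal_toy)  # 0.9 0.5
```

Accuracy rewards the trivial prediction with $0.9$, while the balanced accuracy of $0.5$ shows that the minority class is never detected.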

In [None]:
# Enter your code here:

Congrats, you just designed a turn-key AutoML system that does (nearly) all the work with a few lines of code!

## Exercise 3: Kaggle Challenge

Make yourself familiar with the Titanic Kaggle challenge [https://www.kaggle.com/c/titanic](https://www.kaggle.com/c/titanic). <br>
Based on everything you have learned in this course, do your best to achieve a good performance in the survival
challenge.
- Try out different classifiers you have encountered during the course (or maybe even something new?)
- Improve the prediction by creating new features (feature engineering).
- Tune your parameters (see: https://scikit-learn.org/stable/modules/grid_search.html).
- How do you fare compared to the public leaderboard?


In [None]:
# Enter your code here: