<img src='images/gdd-logo.png' width='300px' align='right' style="padding: 15px">


# Parameter tuning

So far, you have focused on increasing model performance by making the most of your data (with categorical feature encodings and imputation of missing values). In this notebook, you will instead focus on the model and its hyperparameters, and explore best practices for hyperparameter tuning with scikit-learn.

**Program**
- [Baseline model](#review)
    - [<mark>Exercise</mark>](#ex1)
- [Addressing Overfitting](#overfitting)
- [Reducing the complexity](#reduce)
- [Hyperparameter Tuning](#hyper)
    - [<mark>Bonus: Build the best model</mark>](#bonus)
- [Conclusion and next steps](#conc)

Let's first import all the libraries you will need for this notebook:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import RocCurveDisplay

We will again work with the `stroke` data from before:

In [None]:
stroke = pd.read_csv('data/stroke.csv').rename(columns=str.lower)
stroke.head()

## Baseline Model

The Decision Tree Classifier (using categorical features) from the previous notebook will act as the baseline model going forward. 

Below, you can see the pipeline we created in the first notebook:

In [None]:
# Variable definitions
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type']
missing_cols = ['age','bmi']
drop_cols = ['id','address']

target = 'stroke'

def create_Xy(df, drop_cols, target_col):
    df = df.drop(columns=drop_cols)
    return (
        df.drop(columns=target_col),
        df[target_col]
    )

# Create X and y
X, y = stroke.pipe(create_Xy, 
                   drop_cols=drop_cols, 
                   target_col=target,
                   )

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 42,
                                                    stratify = y,
                                                    )

In [None]:
# Step 1: import model
from sklearn.tree import DecisionTreeClassifier

# Step 2: Instantiate model and set parameters
model = DecisionTreeClassifier(max_depth=3, 
                               class_weight='balanced', 
                               random_state=123)

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(drop="if_binary"), categorical_cols)
], remainder='passthrough')

preprocessing = Pipeline(steps=[
    ('onehot', ct),
    ('impute', SimpleImputer(strategy='mean')),
])

pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', model)
])

# Step 3: Train model
pipeline.fit(X_train,y_train)

In [None]:
# Step 4: Evaluate model
fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(pipeline, X_train, y_train, ax=ax, name='Train')
RocCurveDisplay.from_estimator(pipeline, X_test, y_test, ax=ax, name='Test')

The training and test AUC are quite good already, but we can surely get them higher, right?

#### <mark>Exercise:</mark> Changing model parameters

Change the hyperparameters of the `DecisionTreeClassifier` to improve its performance.

1. What is the default parameter value for `max_depth`? What does it do?
2. Change the `max_depth` parameter to improve model performance. What happens to the training and test AUC if you leave it at the default?
3. Find out what other hyperparameters you can tune (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)). Which ones would be worth changing as well?
4. The model is overfitting with the default parameters. What does ***overfitting*** mean?

<details>
  <summary><span style="color:blue">Show solution</span></summary>

1. Default is `None`, meaning there is no restriction on the maximum depth.    
2. Leaving `max_depth` at the default (or setting it to a large value such as 20) results in a worse test score and an almost perfect training score, which means the model is overfitting the training data.
3. `min_samples_leaf` and `min_samples_split` are also parameters that may be worth tuning. With the defaults `1` and `2`, the model is allowed to create splits that could separate individual samples, which is part of why it is overfitting so drastically.
4. Overfitting means the model fitted too well (or even perfectly) on the training data and did not learn a meaningful pattern. It is thus too specific and cannot generalize well to new, unseen data.


</details>

In [None]:
# Change the parameters!
new_model = DecisionTreeClassifier(
    max_depth=3,
    # add other parameters here
    class_weight='balanced', 
    random_state=123)

new_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', new_model)
])

new_pipeline.fit(X_train,y_train)

fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(new_pipeline, X_train, y_train, ax=ax, name='Train')
RocCurveDisplay.from_estimator(new_pipeline, X_test, y_test, ax=ax, name='Test')

In [None]:
# %load answers/03-changing-hyperparameters.py

<mark>**Bonus:**</mark> 

**Precision**, **Recall** and their harmonic mean - the **F1-Score** - are other common metrics for classifiers. 
- Look up what they mean and why they are useful.
- Calculate Precision & Recall and the F1-Score for the model (hint: Use the function `classification_report()` from `sklearn.metrics`)
- Would you focus on maximizing precision or recall in this use case?

<details>
  <summary><span style="color:blue">Show solution</span></summary>

The **precision** for a class is the number of correctly classified positives (e.g., patients predicted to have a stroke who actually had one) divided by the total number of predicted positives, i.e., the sum of true positives and false positives.

**Recall** is defined as the number of true positives divided by the total number of actual positives in the dataset, i.e., the sum of true positives and false negatives (e.g., what share of the actual stroke victims did the model correctly identify).

<img src="images/precision_recall.png" width="500" align="center">

**F1 score** combines both precision and recall by calculating their harmonic mean:

$${\displaystyle F_{1}={\frac {2}{\mathrm {recall} ^{-1}+\mathrm {precision} ^{-1}}}=2\cdot {\frac {\mathrm {precision} \cdot \mathrm {recall} }{\mathrm {precision} +\mathrm {recall} }}={\frac {\mathrm {tp} }{\mathrm {tp} +{\frac {1}{2}}(\mathrm {fp} +\mathrm {fn} )}}}$$


</details>
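
The definitions above can be checked by hand on a small example. The sketch below uses made-up labels (not the stroke data) and computes the three metrics from the confusion-matrix counts; the variable names are only illustrative.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration: 1 = stroke, 0 = no stroke
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that were found
f1 = tp / (tp + 0.5 * (fp + fn))  # harmonic mean of precision and recall

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")

# The scikit-learn helpers should agree with the manual calculation
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here 3 of the 5 predicted positives are correct (precision 0.6) and 3 of the 4 actual positives are found (recall 0.75).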

In [None]:
from sklearn.metrics import classification_report

# your code here

In [None]:
# %load answers/03-metrics-classification-report.py

<a id=overfitting></a>

## Addressing Overfitting

The results above show us that the algorithm can severely overfit on the training data.

There are many ways to address overfitting, including:

- **Train with more data**: It won't work every time, but training with more data can help algorithms detect the signal better. This can involve data augmentation if needed.
- **Remove features**: If you have a large number of features, you should only select the most important features for training so that the model doesn’t learn from features that don't generalise well. 
- ★ **Reduce the complexity of the model**: An over-complex model is more likely to overfit. You can directly reduce the model’s complexity by looking at the type of model, or the specific model parameters.

Let's focus on the easiest to begin with - reducing the complexity of the model.
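
To make the link between complexity and overfitting concrete, here is a minimal sketch on synthetic data generated with `make_classification` (an assumption for illustration, not the stroke dataset). It compares training and test AUC for a few values of `max_depth`; the exact numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data with some label noise
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.9, 0.1], flip_y=0.05, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

auc_by_depth = {}
for depth in [2, 5, None]:  # None = no limit on the depth of the tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
    auc_by_depth[depth] = (train_auc, test_auc)
    print(f"max_depth={depth}: train AUC={train_auc:.3f}, test AUC={test_auc:.3f}")
```

A large gap between the training and test AUC for the unrestricted tree is the typical signature of overfitting.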

<a id=reduce></a>
### Reduce the complexity of the model

Leaving the `max_depth` parameter unrestricted increases model complexity a lot.

We can see this by plotting the default tree:

In [None]:
from sklearn.tree import plot_tree

fig,ax = plt.subplots(figsize=(20,20))

plot_tree(new_pipeline.named_steps['model'], ax=ax);

<mark>**Question:**</mark> What is the maximum number of end leaf nodes a tree could get to with a max depth of 50?

<details>
  <summary><span style="color:blue">Show solution</span></summary>

   Each split turns one leaf into two, so the number of leaves can at most double with every level. A tree of depth 50 can therefore have up to 2^50 (about 1.1 quadrillion) leaf nodes.

</details>
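
The arithmetic behind this bound is easy to verify. The helper `max_leaves` below is a hypothetical name used only for this sketch.

```python
def max_leaves(depth: int) -> int:
    """Upper bound on the number of leaves of a binary tree with the given depth."""
    return 2 ** depth


for depth in (3, 10, 20, 50):
    print(f"depth {depth:>2}: at most {max_leaves(depth):,} leaves")
```

In practice a fitted tree stays far below this bound, because every leaf has to contain at least one training sample.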

<a id=hyper></a>

### Hyperparameter Tuning

It would of course be a good idea to do some hyperparameter tuning to find the best value for `max_depth` (and/or other parameters).

In the previous exercise, you tried to tune these hyperparameters by hand. This is not the best approach, since it is time-consuming and inefficient.
 
For this problem, sklearn already contains a **Grid Search** algorithm called `GridSearchCV`, which allows you to test different parameter combinations on the dataset using ***Cross Validation***.

<details>
  <summary><span style="color:blue">Refresher Cross-Validation</span></summary>

Cross Validation (CV), specifically k-fold cross validation, allows you to train and test your algorithm on multiple, mutually exclusive subsets of your data, giving you a better estimate of the true performance. 

The data is split into k equally sized subsets. In each of k iterations, one subset is used as the validation set and the remaining k-1 subsets are used for training. Afterwards, the performance metrics of the k train-validation iterations are averaged.

Usually, a small proportion of the data is held out completely as a separate test set for final evaluation.

<img src="images/crossvalidation.png" style="display: block;margin-left: auto;margin-right: auto;height: 300px"/>

</details>
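
The k-fold procedure from the refresher can be sketched with `cross_val_score`. This is a minimal example on synthetic data (an assumption for illustration), using `StratifiedKFold`, a k-fold variant that keeps the class proportions similar in every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the stroke dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# k = 5 folds: each fold serves as the validation set exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

fold_scores = cross_val_score(tree, X_demo, y_demo, scoring="roc_auc", cv=cv)
print("AUC per fold:", fold_scores.round(3))
print("Mean AUC:", fold_scores.mean().round(3))
```

The spread of the per-fold scores gives a rough idea of how much the estimate depends on the particular split.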



In [None]:
from sklearn.model_selection import GridSearchCV

The `GridSearchCV` object requires a parameter grid (as a `{key: value}` dictionary) with the different options you want to explore.

In [None]:
params = {
    'preprocessing__impute__strategy': ['mean', 'median'],
    'model__max_depth': range(1, 21),
    'model__criterion': ['gini','entropy',],
}
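
The keys of the grid follow the `step__parameter` naming of the pipeline, with one `__` per nesting level. The sketch below checks the names and counts the candidates using `ParameterGrid`; `demo_pipeline` is a simplified stand-in for the notebook's pipeline (it leaves out the one-hot encoding step) and is only an assumption for illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Simplified stand-in pipeline with the same step names as in the notebook
demo_pipeline = Pipeline(steps=[
    ("preprocessing", Pipeline(steps=[("impute", SimpleImputer(strategy="mean"))])),
    ("model", DecisionTreeClassifier(random_state=123)),
])

demo_params = {
    "preprocessing__impute__strategy": ["mean", "median"],
    "model__max_depth": range(1, 21),
    "model__criterion": ["gini", "entropy"],
}

# Every key in the grid has to be a valid parameter name of the pipeline
valid_names = demo_pipeline.get_params().keys()
unknown = [name for name in demo_params if name not in valid_names]
print("Unknown parameter names:", unknown)

# 2 strategies x 20 depths x 2 criteria
n_candidates = len(ParameterGrid(demo_params))
print(f"{n_candidates} candidates, {n_candidates * 3} fits with cv=3")
```

Counting the candidates before running the search is a cheap way to estimate how long a grid search will take.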

You can then perform a parameter grid search to optimise your chosen metric.

In [None]:
grid = GridSearchCV(pipeline, params, scoring='roc_auc', cv = 3)

In [None]:
grid.fit(X_train, y_train)

In [None]:
cv_results = pd.DataFrame(grid.cv_results_).sort_values('rank_test_score')
cv_results.head()

In [None]:
(
    cv_results
    .groupby('param_model__max_depth')
    ['mean_test_score']
    .mean()
    .plot(title = 'Mean cross-validated AUC by max_depth', 
          xticks = range(0,21,2)
          )
);

We can then select the best model:

In [None]:
model_best = grid.best_estimator_
model_best

In [None]:
model_best.score(X_train, y_train), model_best.score(X_test, y_test)
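
Note that `.score()` on a fitted classifier (or on a pipeline ending in one) returns the mean accuracy, not the AUC that the grid search optimised. The following sketch, on synthetic data and with a deliberately small grid (both assumptions for illustration), shows how to read the search results and how to compute the test AUC explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    {"max_depth": [1, 2, 3, 5, 8]},
    scoring="roc_auc",
    cv=3,
)
demo_grid.fit(X_tr, y_tr)

best = demo_grid.best_estimator_
print("Best parameters:", demo_grid.best_params_)
print("Mean cross-validated AUC:", round(demo_grid.best_score_, 3))

# .score() of the classifier is accuracy; the AUC has to be computed separately
test_accuracy = best.score(X_te, y_te)
test_auc = roc_auc_score(y_te, best.predict_proba(X_te)[:, 1])
print(f"Test accuracy={test_accuracy:.3f}, test AUC={test_auc:.3f}")
```

Calling `.score()` on the fitted `GridSearchCV` object itself uses the metric passed via `scoring`, so `demo_grid.score(X_te, y_te)` returns the test AUC here.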

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16,6))

RocCurveDisplay.from_estimator(pipeline, X_train, y_train, ax=ax[0], name='Train')
RocCurveDisplay.from_estimator(pipeline, X_test, y_test, ax=ax[0], name='Test')
ax[0].set(title='Base model')

RocCurveDisplay.from_estimator(model_best, X_train, y_train, ax=ax[1], name='Train')
RocCurveDisplay.from_estimator(model_best, X_test, y_test, ax=ax[1], name='Test')
ax[1].set(title='Tuned model');

<a id=bonus></a>

### <mark>Bonus:</mark> Build your best model

It is now your turn to put it all together and see if you can build a better model.

1. Build a model pipeline that **maximizes test performance (AUC)**
2. Select a machine learning model of your choice (see here for [sklearn selection](https://scikit-learn.org/stable/supervised_learning.html))
3. Look up what hyperparameters it has and perform a grid search to tune its hyperparameters.

*Hint: In real use cases, more complex models such as Random Forests, Support Vector Machines, or Gradient Boosting models are more often used (as they are generally more performant).*

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# your code here

In [None]:
# %load answers/03-better-model.py

<details>
    
  <summary><span style="color:blue">Read about how to add different models to your hyperparameter search here. </span></summary>
  
## More control over the hyperparameter search

Let's now consider a scenario where you want to compare two different classifiers, a support vector machine and a random forest one.

Naively, you could create a dictionary like the following:

```python
all_parameters = {'model': [SVC(), RandomForestClassifier()],
                  'model__C': [.5, 1, 1.5], # SVC hyperparam
                  'model__kernel': ["linear", "poly", "rbf"], # SVC hyperparam
                  'model__n_estimators': [40, 60, 90], # RFC hyperparam
                  'model__max_depth': [2, 3, 5, 10], # RFC hyperparam
                  'model__min_samples_leaf': [1,5,8] # RFC hyperparam
                   }
```
Note that the key prefix has to match the name of the pipeline step that holds the classifier, which is `model` in the pipeline above.

However, if you use this dictionary with `GridSearchCV`, the search will try all possible combinations of hyperparameters. For example, it will pair a `RandomForestClassifier()` with every combination of the `C` and `kernel` hyperparameters. Those hyperparameters do not exist for a `RandomForestClassifier()`, and scikit-learn rejects parameters that an estimator does not have, so this grid cannot work as intended.

To avoid this issue, `GridSearchCV` also accepts a _list_ of dictionaries as an input to give you more control over what combinations of hyperparameters are tested. It will only compare all possible combinations of hyperparameters *within* each dictionary.


```python
svc_parameters = {'model': [SVC()],
                  'model__C': [.5, 1, 1.5],
                  'model__kernel': ["linear", "poly", "rbf"],
                   }
```

```python
rf_parameters = {'model': [RandomForestClassifier()],
                 'model__n_estimators': [40, 60, 90],
                 'model__max_depth': [2, 3, 5, 10],
                 'model__min_samples_leaf': [1,5,8]
                   }
```

```python
grid_all = GridSearchCV(pipeline, 
                        [svc_parameters, rf_parameters], 
                        scoring='accuracy')
```
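
To see how much the list of dictionaries shrinks the search, you can count the candidates with `ParameterGrid` before fitting anything. This sketch assumes the classifier step of the pipeline is named `model`, as in the pipeline defined earlier in this notebook; the variable names are only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

# Assumption: the classifier step of the pipeline is named "model"
svc_grid = {
    "model": [SVC()],
    "model__C": [0.5, 1, 1.5],
    "model__kernel": ["linear", "poly", "rbf"],
}
rf_grid = {
    "model": [RandomForestClassifier()],
    "model__n_estimators": [40, 60, 90],
    "model__max_depth": [2, 3, 5, 10],
    "model__min_samples_leaf": [1, 5, 8],
}

# A single merged dictionary crosses every value with every other value
merged_grid = {**svc_grid, **rf_grid, "model": [SVC(), RandomForestClassifier()]}

n_merged = len(ParameterGrid(merged_grid))
n_separate = len(ParameterGrid([svc_grid, rf_grid]))
print(f"Merged dictionary: {n_merged} candidates")
print(f"List of dictionaries: {n_separate} candidates")
```

The list version only contains the 9 SVC and 36 random forest combinations, instead of the 648 mixed combinations of the merged dictionary.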


</details>


---

<a id=conc></a>
# Conclusion and Next Steps

This notebook has covered how decision trees can overfit, how limiting model complexity (for example via `max_depth`) addresses this, and how to tune hyperparameters systematically with `GridSearchCV` and cross-validation.

In the next notebook you will explore another ensemble algorithm - Gradient Boosting.