# Parameter Tuning

In this notebook, we learn how to tune hyper-parameters with traditional methods, such as grid search and random search, and with newer methods, such as Successive Halving.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/03-parameter-tuning.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies for Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intermediate-v2/main/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.2"), "Please install scikit-learn 1.2"

## Digits dataset

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

In [None]:
X_train[0]

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(4, 4)
for i, ax in zip(range(16), axes.ravel()):
    ax.imshow(X[i].reshape(8, 8), cmap="gray_r")
    ax.set(xticks=(), yticks=(), title=y[i])
plt.tight_layout()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [None]:
params = {
    'max_depth': [2, 4, 8, 12, 16],
    'max_features': [4, 8, 16, 32]
}
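
Before fitting, it can help to estimate how much work this grid implies. The sketch below enumerates the same `params` dictionary with `ParameterGrid`; the names `demo_grid`, `n_candidates`, and `n_fits` are illustrative, and the fold count of 5 assumes `cv` is left at its default in `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
params = {
    "max_depth": [2, 4, 8, 12, 16],
    "max_features": [4, 8, 16, 32],
}

demo_grid = ParameterGrid(params)

# Every combination of max_depth and max_features is one candidate
n_candidates = len(demo_grid)

# Assumption: cv is not set, so 5-fold cross-validation is used
n_folds = 5
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

Each extra value in either list multiplies the number of candidates, which is why grid search gets expensive quickly as more parameters are added.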

In [None]:
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid=params,
    verbose=1,
    n_jobs=2,  # Update to the number of physical CPU cores
)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_params_

In [None]:
grid_search.score(X_test, y_test)

### Viewing results as a dataframe

In [None]:
import pandas as pd
cv_df = pd.DataFrame(grid_search.cv_results_)

In [None]:
cv_df.head()

In [None]:
param_results = (cv_df
    .astype({"param_max_depth": int, "param_max_features": int})
    .pivot(
        index="param_max_depth",
        columns="param_max_features",
        values="mean_test_score"
    )
    .rename_axis(index='max_depth', columns='max_features')
)

In [None]:
param_results

In [None]:
import seaborn as sns

In [None]:
_ = sns.heatmap(param_results, cmap='viridis')

## Exercise 1

1. Use a `RandomizedSearchCV` with the following parameter distribution for `RandomForestClassifier`:

```python
from scipy.stats import randint

param_dist = {
    "max_features": randint(1, 11),
    "min_samples_split": randint(2, 11)
}
```

Set `n_iter=20`, and set `random_state=0` so that the results are reproducible.

2. What were the best hyper-parameters found by the random search?
3. Evaluate the model on the test set.
4. Use `HalvingRandomSearchCV` with the same `param_dist`. What are the best hyper-parameters found by this search? Evaluate on the test set.
    - **Hint**: `n_iter` is not required. Set `verbose=1`.

In [None]:
from scipy.stats import randint

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
# The halving searches are experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

param_dist = {
    "max_features": randint(1, 11),
    "min_samples_split": randint(2, 11)
}
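
To see what a random search draws from `param_dist`, the distributions can be sampled directly. This is a small sketch using `ParameterSampler`; the variable name `demo_samples` and the choice of 5 draws are only for illustration. Note that `randint(low, high)` excludes `high`, so `randint(1, 11)` yields integers from 1 to 10.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_dist = {
    "max_features": randint(1, 11),       # integers 1..10
    "min_samples_split": randint(2, 11),  # integers 2..10
}

# Draw 5 candidate settings; random_state makes the draws repeatable
demo_samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))

for sample in demo_samples:
    print(sample)
```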

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/03-ex01-solutions.py). 

In [None]:
# %load solutions/03-ex01-solutions.py

## Searching Pipelines and ColumnTransformer

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

penguins = fetch_openml(data_id=42585, as_frame=True, parser="pandas")
X, y = penguins.data, penguins.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

In [None]:
numerical_features = [
    'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
categorical_features = ['island', 'sex']

## Preprocessing

In [None]:
from sklearn.preprocessing import SplineTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import sklearn

In [None]:
sklearn.set_config(transform_output="pandas")

### Numerical features

In [None]:
num_prep = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("spline", SplineTransformer())
])

### Categorical features

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_prep = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

### ColumnTransformer

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
ct = ColumnTransformer([
    ("numerical", num_prep, numerical_features),
    ("categorical", cat_prep, categorical_features),
], verbose_feature_names_out=False)

## Pipeline

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = Pipeline([
    ("prep", ct),
    ("log_reg", LogisticRegression(solver="liblinear"))
])
log_reg

## Searching

In [None]:
log_reg.get_params()

In [None]:
params = {
    "prep__numerical__spline__degree": [3, 4, 5],
    "prep__numerical__imputer__strategy": ["mean", "median"],
    "prep__numerical__imputer__add_indicator": [True, False],
    "log_reg__penalty": ["l1", "l2"],
}
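
The keys in `params` follow the nesting of the pipeline: each step or transformer name is joined to the next with a double underscore. The sketch below rebuilds a reduced version of that nesting with hypothetical names (`demo_num`, `demo_ct`, `demo_pipe`) to show that the same keys work with `set_params` and `get_params`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Reduced stand-in for the notebook's pipeline
demo_num = Pipeline([("imputer", SimpleImputer())])
demo_ct = ColumnTransformer([("numerical", demo_num, [0, 1])])
demo_pipe = Pipeline([
    ("prep", demo_ct),
    ("log_reg", LogisticRegression(solver="liblinear")),
])

# step name -> transformer name -> step name -> parameter
demo_pipe.set_params(
    prep__numerical__imputer__strategy="median",
    log_reg__penalty="l1",
)

demo_params = demo_pipe.get_params()
print(demo_params["prep__numerical__imputer__strategy"])
print(demo_params["log_reg__penalty"])
```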

In [None]:
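# Enable the experimental halving searches before importing them
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV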

In [None]:
grid_search = HalvingGridSearchCV(
    log_reg, params, verbose=1, n_jobs=2,
)
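
The idea behind successive halving is to start many candidates on a small budget, keep only the best fraction, and give the survivors a larger budget in the next round. The function below is a simplified sketch of such a schedule, not scikit-learn's exact rule for choosing resources; `halving_schedule` is a hypothetical helper, and the starting budget of 20 samples and the cap of 500 are made-up numbers. The 24 candidates correspond to the 3 × 2 × 2 × 2 combinations in `params`.

```python
import math


def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return a list of (n_candidates, n_resources) pairs, one per round."""
    schedule = []
    n_resources = min_resources
    while True:
        schedule.append((n_candidates, n_resources))
        # Stop when few candidates remain or the budget cannot grow further
        if n_candidates <= factor or n_resources * factor > max_resources:
            break
        # Keep roughly the best 1/factor candidates and grow the budget
        n_candidates = math.ceil(n_candidates / factor)
        n_resources *= factor
    return schedule


# Illustrative numbers only
demo_schedule = halving_schedule(
    n_candidates=24, min_resources=20, max_resources=500
)
for round_idx, (n_cand, n_res) in enumerate(demo_schedule):
    print(f"round {round_idx}: {n_cand} candidates, {n_res} samples each")
```

After fitting, the schedule that scikit-learn actually used can be read from the `n_candidates_` and `n_resources_` attributes of the search object.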

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.score(X_test, y_test)

In [None]:
grid_search.best_params_

## Exercise 2

1. A `column_transformer` is provided to preprocess the penguin dataset. Call `fit_transform` on `X_train` and store the output as `X_train_transformed`.
1. Are there missing values in the transformed dataset?
1. Construct a Pipeline with the `column_transformer` and a `HistGradientBoostingClassifier`.
    - **Hint:** Set the `random_state=0` for the gradient booster.
1. Create a `HalvingGridSearchCV` that searches through the following parameters in the gradient booster:
     - `l2_regularization`: `[0.01, 0.1, 1, 10]`
     - `max_bins`: `[32, 64, 128, 255]`
     - **Hint**: Use `get_params` to get the parameter name to search through.
     - **Hint**: Set `verbose=1`
1. What are the best hyper-parameters found by this search?
1. Evaluate on the test set.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np

cat_prep = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)

column_transformer = ColumnTransformer([
    ("categorical", cat_prep, categorical_features),
    ("numerical", "passthrough", numerical_features),
])

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/03-ex02-solutions.py). 

In [None]:
# %load solutions/03-ex02-solutions.py