<i>Modified from the file written by Ahsan Khan on behalf of Alberta Machine Intelligence Institute for the AI Pathways Partnership supported by Prairies Economic Development Canada</i>

---

**Important Note:**

Please do not alter any part of this notebook outside the designated text cells that are clearly marked with "*Start student input* ↓" and "*End student input ↑*". Changes made outside these specified areas could lead to incorrect evaluations of your work, potentially affecting your lab scores.

Ensure you complete all activities within these sections, which are indicated by labels like **[A1]**, **[A2]**, **[A3]**, ... Each activity is crucial for the successful completion of this lab. Additionally, please name your variables exactly as specified in the instructions (if specified) to ensure that your answers are correctly assessed.

---


# Lab 3: Hyperparameter Tuning

In this lab, you will practice techniques for improving a model's performance by tuning its hyperparameters.

In [1]:
# Crucial data processing and analysis libraries
import numpy as np
import pandas as pd

# Loading the modules required to build and evaluate a random forest regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
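# Note: the two classification metrics below are not used in this lab;
# regressors are evaluated with the R^2 value returned by `.score()`
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report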
# Loading the randomized search function
from sklearn.model_selection import RandomizedSearchCV

# Loading the California housing dataset from sklearn
from sklearn.datasets import fetch_california_housing

##### Data Preparation

In [2]:
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()

# The dependent variable (target): the median value of owner-occupied homes, in units of $100,000.
# This is a big dataset, so to save time we use only the first 1000 examples.
y = california.target[:1000]

# Features
X = california.data[:1000]

## Discussion

You will tune the hyperparameters of a random forest regressor using scikit-learn's randomized search with cross-validation (`RandomizedSearchCV`). This function samples random combinations of hyperparameter values from the candidates you provide and evaluates each one, so that you can pick the combination that performs best. More information can be found on the scikit-learn documentation page [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).

You have not yet learned about random forest regressors, but this does not matter here. In short, a random forest is a collection of decision trees; a random forest regressor is a collection of decision tree regressors, which you have not learned about yet either. You *DO KNOW* **how scikit-learn ML models work** and you know the **concept of hyperparameters and hyperparameter tuning**. Since this lab is about tuning hyperparameters, knowing what random forest regressors are and how they work is not crucial (especially since we are not asking you to discuss the results and their impact on building random forest regressors).
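
The "collection of decision trees" idea can be checked directly. The sketch below is only an illustration on a small synthetic dataset (`make_regression` and the names `X_demo`, `y_demo` and `forest` are assumptions for this example, not part of the lab data). It shows that the forest's prediction is the average of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset (an assumption for illustration, not the lab data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A small forest made of 10 decision tree regressors
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is stored in `estimators_`; collect their individual predictions
tree_predictions = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# The forest's prediction is the mean of the individual tree predictions
forest_predictions = forest.predict(X_demo)
print(np.allclose(tree_predictions.mean(axis=0), forest_predictions))
```

Hyperparameters such as `n_estimators` and `max_depth` control how many trees are in this collection and how large each one can grow.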

##### First, we will instantiate a Random Forest Regressor and look at the parameters currently used by the default model

In [3]:
#Instantiating
rfr = RandomForestRegressor()

# Default parameters
rfr.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

##### The above output shows all of the parameters we can set to tune our model. Next, we will create several Python lists of candidate values for some of them. Below are the hyperparameters we will tune for our random forest regressor, followed by two settings for the search itself.

- **`n_estimators`** - Number of trees in the random forest
- **`max_depth`** - Maximum number of levels in each tree
- **`min_samples_split`** - Minimum number of samples required to split a node
- **`min_samples_leaf`** - Minimum number of samples required at each leaf node
- **`cross_validation`** - Number of folds used for cross-validation (again, you don't know this yet, so just use a value of 3 or 5). This is a setting of the search, not a hyperparameter of the forest.
- **`num_of_iterations`** - Number of random combinations the search will try. This is also a setting of the search.

In [4]:
#Parameters
n_estimators = [2, 3, 10, 20, 30, 100, 200, 300, 1000]

max_depth = [10, 20, 40, 60, 80, 100, None]

min_samples_split = [2, 5, 10]

min_samples_leaf = [1, 2, 3, 4]

cross_validation = 3

num_of_iterations = 20

# Create the random grid
# The randomized search will try random combinations of these values
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
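
To get a feel for how much of this grid a randomized search covers, the sketch below counts every possible combination with scikit-learn's `ParameterGrid` and compares it to the 20 iterations chosen above. The name `demo_grid` is an assumption used only to keep the example self-contained; it repeats the values from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid defined above
demo_grid = {
    'n_estimators': [2, 3, 10, 20, 30, 100, 200, 300, 1000],
    'max_depth': [10, 20, 40, 60, 80, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 3, 4],
}

# ParameterGrid enumerates every combination, so its length is the size of the full grid
total_combinations = len(ParameterGrid(demo_grid))
print(total_combinations)  # 9 * 7 * 3 * 4 = 756

# Number of random combinations the search will try
sampled = 20
print(f"Fraction of the grid explored: {sampled / total_combinations:.1%}")
```

An exhaustive grid search would have to fit all 756 combinations (times the number of folds), which is why a randomized search is the cheaper option here.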

##### Now we will be fitting our random forest regressor with the parameters we have chosen to update as defined above. Running the cell block below will take some time.

In [5]:
# Instantiating
rfr = RandomForestRegressor()

# Random search of parameters
randsearch = RandomizedSearchCV(estimator=rfr,
                                param_distributions=random_grid,
                                n_iter=num_of_iterations,
                                cv=cross_validation,
                                verbose=2,
                                n_jobs=-1)

# Fit our random search model
randsearch.fit(X, y)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


##### The randomized search samples `n_iter` random combinations from the grid, evaluates each one with cross-validation using the estimator's default scoring method (for a random forest regressor, the R² coefficient of determination), and keeps the combination with the best mean score. Note that this is the best of the sampled combinations, not necessarily the best in the whole grid. We can view the selected parameters below.

In [6]:
randsearch.best_params_

{'n_estimators': 200,
 'min_samples_split': 2,
 'min_samples_leaf': 3,
 'max_depth': 100}
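
Besides `best_params_`, a fitted search keeps the score of every sampled candidate in its `cv_results_` attribute. The following is a minimal sketch on synthetic data with a deliberately tiny grid (the names `demo_search` and `results` and all the values are assumptions for illustration). It shows that `best_score_` is simply the highest mean cross-validated score among the sampled candidates.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset (an assumption for illustration, not the lab data)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# A tiny search: 4 random candidates, each evaluated with 3-fold cross-validation
demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [5, 10, 20], 'max_depth': [2, 4, None]},
    n_iter=4,
    cv=3,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, with its mean score across the folds
results = pd.DataFrame(demo_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])

# best_score_ is the highest mean cross-validated score among the sampled candidates
print(demo_search.best_score_ == results['mean_test_score'].max())
```

Looking at the whole table, rather than only the winner, helps you see whether several candidates perform almost equally well.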

Now, we can create a random forest regressor that has the best hyperparameters:

In [7]:
best_params = randsearch.best_params_
rf2 = RandomForestRegressor(**best_params)

## Lab Activity: comparing the base model versus a tuned model

##### **[A1]**  
Split your data into training and validation sets. Use ``X_train``, ``y_train``, ``X_val`` and ``y_val`` as the assigned variables respectively.

*Start student input* ↓

In [8]:
# Put your code here.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

*End student input ↑*
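
Because `train_test_split` shuffles the data randomly, the scores in the rest of this lab can change from run to run. One common option, sketched below on toy arrays, is to pass a fixed `random_state` so that the split is reproducible. The arrays and the seed value 42 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 examples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same seed gives the same split every time
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

# The returned order is X_train, X_val, y_train, y_val
print(split_a[0].shape, split_a[1].shape)
```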

##### **[A2]**
Instantiate a [random forest regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) called `rfr` with the parameters `n_estimators=5` and `max_depth=5` (this is your base model). Fit it on your training data and evaluate it on your validation data using the [`.score()` method](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.score) of a `RandomForestRegressor` object.

*Start student input* ↓

In [10]:
# Put your code here.
rfr = RandomForestRegressor(n_estimators=5, max_depth=5)
rfr.fit(X_train, y_train)

# Evaluate the base model on the validation set (.score() returns R^2)
print("Score: ", rfr.score(X_val, y_val))

Score:  0.6924704699439685


*End student input ↑*
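
For a regressor, `.score()` returns the R² coefficient of determination (1.0 is a perfect fit), not an accuracy. The sketch below uses synthetic data and hypothetical names (`model`, `X_tr`, `X_va` and so on, all assumptions for illustration) to check that `.score()` agrees with `sklearn.metrics.r2_score` computed on the model's predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption for illustration, not the lab data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A base model with the same settings as in [A2]
model = RandomForestRegressor(n_estimators=5, max_depth=5, random_state=0)
model.fit(X_tr, y_tr)

# .score() computes R^2 between the true targets and the model's predictions
score_from_method = model.score(X_va, y_va)
score_from_metric = r2_score(y_va, model.predict(X_va))
print(np.isclose(score_from_method, score_from_metric))
```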

##### **[A3]**
Build a new random grid for the parameters you want to tune. You can tune the same parameters and/or others, but don't copy the parameter values from the grid made in the discussion above. Choose your own values for each parameter you want to test, based on your own intuition. Do try different values for `n_estimators`, `max_features` and `bootstrap` alongside different values for at least 3 more hyperparameters.

*Start student input* ↓

In [11]:
# New random grid

new_random_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.6, 0.7],
    'bootstrap': [True, False]
}


*End student input ↑*
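
Lists of fixed values are not the only option: `RandomizedSearchCV` also accepts probability distributions (any object with an `rvs` method, such as those in `scipy.stats`) and draws a fresh value from them for each candidate. The sketch below uses `ParameterSampler`, the helper that produces such random candidates, to preview a few draws. The ranges and the name `demo_distributions` are assumptions chosen only for illustration.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

demo_distributions = {
    'n_estimators': randint(10, 201),   # integers from 10 to 200 inclusive
    'max_features': uniform(0.5, 0.5),  # floats between 0.5 and 1.0
    'bootstrap': [True, False],         # plain lists are sampled uniformly
}

# Draw 5 random candidates, as a randomized search would do internally
samples = list(ParameterSampler(demo_distributions, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

A dictionary like this can be passed as `param_distributions` in place of a grid of lists.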

##### **[A4]**
Instantiate a new random forest regressor, apply the [randomized search with cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) to it using your new grid, and fit it on the same `X_train` and `y_train` data used in [A2].

*Start student input* ↓

In [12]:
# Instantiate a new random forest regressor
new_rfr = RandomForestRegressor()

# Apply randomized search with cross-validation
random_search = RandomizedSearchCV(estimator=new_rfr,
                                   param_distributions=new_random_grid,
                                   n_iter=20,
                                   cv=3,
                                   verbose=2,
                                   n_jobs=-1)

# fit the model to training data
random_search.fit(X_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


*End student input ↑*

##### **[A5]**
Evaluate your tuned model using the `.score()` method.

*Start student input* ↓

In [13]:
#evaluate tuned model
tuned_score = random_search.score(X_val, y_val)
print("Tuned Model Score:", tuned_score)

Tuned Model Score: 0.7747108673091567


*End student input ↑*
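
Calling `.score()` on a fitted search object works because, with the default `refit=True`, the search retrains the best candidate on all of the data passed to `.fit()` and stores it as `best_estimator_`. The sketch below uses synthetic data and hypothetical names (`demo_search`, `best_model`, `X_tr`, `X_va`, all assumptions for illustration) to check that scoring the search and scoring `best_estimator_` give the same result.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic dataset (an assumption for illustration, not the lab data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [5, 10], 'max_depth': [3, None]},
    n_iter=3,
    cv=3,
    random_state=0,
)
demo_search.fit(X_tr, y_tr)

# With refit=True (the default), the best candidate is retrained on all of X_tr
best_model = demo_search.best_estimator_

# Scoring the search object delegates to the refitted best estimator
print(np.isclose(demo_search.score(X_va, y_va), best_model.score(X_va, y_va)))
print(best_model.get_params()['n_estimators'] == demo_search.best_params_['n_estimators'])
```

This also means `best_estimator_` can be used directly for predictions without building and fitting a new model from `best_params_`.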

##### Note: Simple models and datasets often do not require aggressive tuning of **all** hyperparameters, and you may do fairly well by tuning only the most impactful hyperparameters of the chosen algorithm. Hyperparameter tuning becomes more important as the complexity of the problem and data increases, for example with algorithms such as XGBoost and neural networks, or with challenging datasets.