<center><img src="images/logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# ML through Application
## Module 2, Lab 5: Using Hyperparameter Tuning

Hyperparameter tuning is the process of selecting good values for the parameters and settings of an ML model. In this notebook, you will gain experience with two main types of hyperparameter tuning: grid search and randomized search. You will use a decision tree model, but hyperparameter tuning can be applied to any model type.

You will learn the following:

- How features are used with a decision tree model
- What grid search is and how to use it
- What randomized search is and how to use it

----

__Austin Animal Center Dataset__

In this lab, you will work with historical pet adoption data in the [Austin Animal Center Shelter Intakes and Outcomes dataset](https://www.kaggle.com/datasets/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes?resource=download). The target field of the dataset (**Outcome Type**) is the outcome of adoption: 1 for adopted and 0 for not adopted. Multiple features are used in the dataset.

Dataset schema:
- __Pet ID:__ Unique ID of the pet
- __Outcome Type:__ State of pet at the time of recording the outcome (0 = not placed, 1 = placed). This is the field to predict.
- __Sex upon Outcome:__ Sex of pet at outcome
- __Name:__ Name of pet 
- __Found Location:__ Found location of pet before it entered the shelter
- __Intake Type:__ Circumstances that brought the pet to the shelter
- __Intake Condition:__ Health condition of the pet when it entered the shelter
- __Pet Type:__ Type of pet
- __Sex upon Intake:__ Sex of pet when it entered the shelter
- __Breed:__ Breed of pet 
- __Color:__ Color of pet 
- __Age upon Intake Days:__ Age (days) of pet when it entered the shelter
- __Age upon Outcome Days:__ Age (days) of pet at outcome

----

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you can practice your coding skills.</p>

## Index

- [Features and the decision tree model](#Features-and-the-decision-tree-model)
- [Grid search](#Grid-search)
- [Randomized search](#Randomized-search)

---
## Features and the decision tree model

In this section, you will process the categorical and text features of the dataset, and use them to fit a simple decision tree.

In [None]:
%%capture
# Install libraries
!pip install -U -q -r requirements.txt

First, load the required libraries, read the dataset into a DataFrame, and look at that dataset.

In [None]:
import re, string
import matplotlib.pyplot as plt

%matplotlib inline
import nltk
from nltk.stem import SnowballStemmer
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint

df = pd.read_csv("data/review_dataset.csv")

print("The shape of the dataset is:", df.shape)

In [None]:
df.head()

The model will use the numerical, categorical, and text features. To use them, you need to create lists that contain the feature names:

- __Numerical features:__ Age upon Intake Days, Age upon Outcome Days
- __Categorical features:__ Sex upon Outcome, Intake Type, Intake Condition, Pet Type, Sex upon Intake, Breed, Color
- __Text features:__ Found Location
- __Target:__ Outcome Type

In [None]:
# Numerical features
numerical_features = ["Age upon Intake Days", "Age upon Outcome Days"]

# Categorical features (ID-like columns such as Pet ID and Name are left out)
categorical_features = [
    "Sex upon Outcome",
    "Intake Type",
    "Intake Condition",
    "Pet Type",
    "Sex upon Intake",
    "Breed",
    "Color",
]

# Based on exploratory data analysis (EDA), select the text features
text_features = ["Found Location"]

model_features = numerical_features + categorical_features + text_features
model_target = "Outcome Type"

In [None]:
model_features

__Note:__
* Some categories might be boolean types, False and True. Booleans will raise errors when you try to encode the categoricals with sklearn encoders because none of them accept boolean types. If you use the Pandas `get_dummies` function to one-hot encode the categoricals, you don't need to convert the Booleans. However, [get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) is more difficult to use with sklearn's [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) functions.

* One way to handle Booleans is to convert them to strings by using a mask and a map that changes only the Booleans. Another way is to change the type of all categoricals to `str`. This also affects the NaNs, effectively imputing them with a `nan` string placeholder value.

* Applying the type conversion to both categoricals and text features handles the nans in the text fields as well.
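The mask-and-map approach from the note above can be sketched as follows. This is a minimal illustration on a made-up column (`Fixed` and its values are hypothetical and not part of the lab dataset); the notebook itself uses the simpler `astype("str")` conversion.

```python
import pandas as pd

# Hypothetical toy column that mixes booleans, a string, and a missing value
toy = pd.DataFrame({"Fixed": [True, False, "Unknown", None]})

# Mask that is True only where the value is a Python bool
is_bool = toy["Fixed"].map(lambda v: isinstance(v, bool))

# Convert only the booleans to strings; all other values are left untouched
toy.loc[is_bool, "Fixed"] = toy.loc[is_bool, "Fixed"].map(
    {True: "True", False: "False"}
)

print(toy["Fixed"].tolist())
```

With this approach the missing value stays missing, so it would still need to be imputed separately before encoding.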

In [None]:
df[categorical_features + text_features] = df[
    categorical_features + text_features
].astype("str")

Now, check for missing values for the categorical features and text features.

In [None]:
print(df[categorical_features + text_features].isna().sum())

Conversion of categoricals into useful numerical features should be done **after splitting the dataset for training and testing**.

If you encode the categoricals on the whole dataset before you split it into train/validation/test sets, you will introduce bias into the training. The bias is introduced because the encoded categories will contain information about the samples that end up in your validation and/or test sets.
This is commonly called data leakage, and it is a problem because the purpose of your validation and test sets is to apply your trained model to data that it has not seen before.
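As a small sketch of the leakage-free order of operations, the encoder below learns its categories from the training rows only and is then applied to a held-out row. The toy data and the manual split are assumptions made for illustration; the lab itself uses `train_test_split` and a pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "Bird" appears only in the held-out row
toy = pd.DataFrame({"Pet Type": ["Dog", "Cat", "Dog", "Cat", "Dog", "Bird"]})
train, test = toy.iloc[:5], toy.iloc[5:]  # fixed split, for illustration only

# Learn the categories from the training rows only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

print(encoder.categories_)
# The unseen "Bird" category is encoded as all zeros instead of raising an error
print(encoder.transform(test).toarray())
```

Because the encoder never sees the held-out row during fitting, nothing about the test data influences the learned encoding.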

### Cleaning the text fields

__Note:__ The cleaning stage can take a few minutes, depending on how much text needs to be processed.

In [None]:
# Prepare cleaning functions

stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

stemmer = SnowballStemmer("english")


def preProcessText(text):
    # Lowercase text, and strip leading and trailing white space
    text = text.lower().strip()

    # Remove HTML tags
    text = re.compile("<.*?>").sub("", text)

    # Remove punctuation
    text = re.compile("[%s]" % re.escape(string.punctuation)).sub(" ", text)

    # Remove extra white space
    text = re.sub(r"\s+", " ", text)

    return text


def lexiconProcess(text, stop_words, stemmer):
    filtered_sentence = []
    words = text.split(" ")
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(stemmer.stem(w))
    text = " ".join(filtered_sentence)

    return text


def cleanSentence(text, stop_words, stemmer):
    return lexiconProcess(preProcessText(text), stop_words, stemmer)


# Clean the text features
for c in text_features:
    print("Text cleaning: ", c)
    df[c] = [cleanSentence(item, stop_words, stemmer) for item in df[c].values]

The cleaned text feature is ready to be vectorized after the dataset split. 

__Note:__ More exploratory data analysis (EDA) might reveal other important hidden attributes or relationships of the model features that are being considered. 

### Create training and test datasets

As part of data preparation, the dataset is split into training and test subsets by using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

For this notebook, you will use 90 percent of the data for the training set and 10 percent for the test set. Determine the best split based on the size of your dataset.

In [None]:
train_data, test_data = train_test_split(
    df, test_size=0.1, shuffle=True, random_state=23
)

### Process the data with a pipeline and ColumnTransformer

In this section, you will build separate pipelines to handle the numerical, categorical, and text features. Then, you will combine them into a composite pipeline along with an estimator. To do this, you will use a [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

* __Numerical processor:__ A MinMaxScaler can be used for numerical features; however, you don't need to scale features when using decision trees. You will use one here to see how to use more data transforms. If different processing is desired for different numerical features, you should build different pipelines as described for the text features pipeline. See the `numerical_processor` in the following code cell.
   
* __Categorical processor:__ Impute with a placeholder value (this won't have an effect because you already encoded the 'nan' values), and encode with sklearn's OneHotEncoder. If computing memory is an issue, it is a good idea to check the number of unique values for the categoricals to get an estimate of how many dummy features one-hot encoding will create. Note the `handle_unknown` parameter, which tells the encoder to ignore (rather than throw an error for) any unique value that might show in the validation or test set that was not present in the initial training set. See the `categorical_processor` in the following code cell.
  
* __Text processor:__ With memory usage in mind, build one pipeline for each text feature (this dataset has only one, Found Location), and try different vocabulary sizes. See the `text_processor_0` in the following code cell.

Finally, the selective preparations of the dataset features are then put together into a collective ColumnTransformer, which is used in a pipeline along with an estimator. This ensures that the transforms are performed automatically in all situations. This includes on the raw data when fitting the model, when making predictions, when evaluating the model on a validation dataset through cross-validation, or when making predictions on a test dataset in the future.
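To estimate how many dummy features one-hot encoding will create, you can count the unique values per categorical column before encoding. The following is a minimal sketch on made-up data (the toy values are assumptions); on the lab data you would apply `nunique()` to `train_data[categorical_features]`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame(
    {
        "Pet Type": ["Dog", "Cat", "Dog", "Bird"],
        "Intake Type": ["Stray", "Stray", "Owner Surrender", "Stray"],
    }
)

# Number of unique values per column, and the total across columns
unique_counts = toy.nunique()
print(unique_counts)
print("Expected one-hot columns:", unique_counts.sum())

# The encoded matrix has one column per unique value
encoded = OneHotEncoder().fit_transform(toy)
print(encoded.shape)
```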

In [None]:
### COLUMN_TRANSFORMER ###
##########################

# Preprocess the numerical features
numerical_processor = Pipeline(
    [
        (
            "num_scaler",
            MinMaxScaler(),
        )  # Shown in case it is needed. Not a must with decision trees.
    ]
)

# Preprocess the categorical features
# handle_unknown tells it to ignore (rather than throw an error for) any value
# that was not present in the initial training set.
categorical_processor = Pipeline(
    [("cat_encoder", OneHotEncoder(handle_unknown="ignore"))]
)

# Preprocess the text feature
# This text processor uses max_features=150
text_processor_0 = Pipeline(
    [("text_vect_0", CountVectorizer(binary=True, max_features=150))]
)

# Combine all data preprocessors (add more if you choose to define more)
# For each processor/step, specify: a name, the actual process, and the features to be processed
data_preprocessor = ColumnTransformer(
    [
        ("numerical_pre", numerical_processor, numerical_features),
        ("categorical_pre", categorical_processor, categorical_features),
        ("text_pre_0", text_processor_0, text_features[0]),
    ]
)

### PIPELINE ###
################

# Pipeline with all desired data transformers, along with an estimator
# Later, you can set/reach the parameters by using the names issued - for hyperparameter tuning, for example
pipeline = Pipeline(
    [
        ("data_preprocessing", data_preprocessor),
        ("dt", DecisionTreeClassifier(max_depth=5)),
    ]
)  # The initial value is chosen as max_depth=5

# Visualize the pipeline
# This will be helpful especially when building more complex pipelines,
# stringing together multiple preprocessing steps
from sklearn import set_config

set_config(display="diagram")
pipeline

Now that your pipeline is ready, you can train and print the training results.

In [None]:
# Get training data to train the pipeline
X_train = train_data[model_features]
y_train = train_data[model_target]

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Use the fitted pipeline to make predictions on the training dataset
train_predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Accuracy (training):", accuracy_score(y_train, train_predictions))

# Get testing data to test the pipeline
X_test = test_data[model_features]
y_test = test_data[model_target]

# Use the fitted pipeline to make predictions on the testing dataset
test_predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Accuracy (test):", accuracy_score(y_test, test_predictions))

---
## Grid search

When you created the pipelines, you used a few hyperparameters, such as `max_depth=5` for the decision tree and `max_features=150` for the vectorizer, without exploring other alternatives. Now that you have seen the results of the fixed values, you will use grid search to automatically look for good combinations of multiple hyperparameters.

You will use sklearn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to look for hyperparameter combinations to improve the accuracy on the testing set (and reduce the generalization gap). `GridSearchCV` does the cross-validation train-validation split internally. Because the data transformers are inside the pipeline, the data transformations are learned on the training folds only. In the same way, the learned transformations are applied to the validation fold when cross-validating, and to the test set later when running test predictions.

__Note:__ Setting pipeline step names gives easy access to hyperparameters for hyperparameter tuning while cross-validating. Parameters of the estimators in the pipeline can be accessed using the `estimator__parameter` syntax. Make sure you use double underscores to connect estimator and parameter.
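The double-underscore syntax can be tried on a small stand-in pipeline. The sketch below is an illustration only: `demo_pipeline` and its `scaler` step are hypothetical names, while the `dt` step name matches the one used in this notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical two-step pipeline used only to show the naming scheme
demo_pipeline = Pipeline(
    [("scaler", MinMaxScaler()), ("dt", DecisionTreeClassifier(max_depth=5))]
)

# Every parameter of the "dt" step is exposed as "dt__<parameter>"
dt_params = [name for name in demo_pipeline.get_params() if name.startswith("dt__")]
print(dt_params)

# Set a parameter of the step through the pipeline
demo_pipeline.set_params(dt__max_depth=10)
print(demo_pipeline.named_steps["dt"].max_depth)
```

Calling `get_params()` on your own pipeline is a quick way to check the exact names you can use as keys in a parameter grid.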

The next code block might take some time to complete because there are 9 candidate hyperparameter combinations (3 x 3). With 5-fold cross-validation, this will result in a total of 45 fits.
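The number of fits can be checked before running the search. This sketch uses sklearn's `ParameterGrid` to enumerate the combinations; `demo_grid` is a hypothetical name that mirrors the grid values used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid in the next cell (the variable name is illustrative)
demo_grid = {
    "dt__max_depth": [100, 200, 300],
    "dt__min_samples_leaf": [5, 10, 15],
}

n_candidates = len(ParameterGrid(demo_grid))  # all combinations of the listed values
n_folds = 5
print("Candidates:", n_candidates)
print("Total fits:", n_candidates * n_folds)
```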

In [None]:
### PIPELINE GRID_SEARCH ###
############################

# Parameter grid for GridSearch
param_grid = {
    "dt__max_depth": [100, 200, 300],
    "dt__min_samples_leaf": [5, 10, 15],
}

grid_search = GridSearchCV(
    pipeline,  # Base model
    param_grid,  # Parameters to try
    cv=5,  # Apply 5-fold cross validation
    verbose=1,  # Print summary
    n_jobs=-1,  # Use all available processors
)

# Fit the GridSearch to the training data
grid_search.fit(X_train, y_train)

Now that the grid search has completed, you can look at what it found to be the best hyperparameters and the corresponding highest score.

In [None]:
print(grid_search.best_params_)
print(grid_search.best_score_)

Next, you can pick the model with the best hyperparameters.

In [None]:
# Get the best model out of GridSearchCV
classifier = grid_search.best_estimator_

# Fit the best model to the training data once more
classifier.fit(X_train, y_train)

Finally, you can look at the test score.

In [None]:
# Get testing data to test the classifier
X_test = test_data[model_features]
y_test = test_data[model_target]

# Use the fitted model to make predictions on the testing dataset
# Testing data going through the pipeline is first transformed with the
# preprocessing learned from the training data (min/max scaling, one-hot
# encoding, and text vectorization), and finally used to make predictions.
test_predictions = classifier.predict(X_test)

print("Model performance on the test set:")
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Test accuracy:", accuracy_score(y_test, test_predictions))

From the results, you can compare how this tuned model performs on the testing set with how the initial pipeline (`max_depth=5`) performed. If you continue to explore different hyperparameters, you might be able to do even better.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>It's time to check your knowledge!</i></h3>
    <br>
    <p style=" text-align: center; margin: auto;">To load the question, run the following cell.</p>
    <br>
</div>

In [None]:
# Run this cell for a knowledge check question
from MLUMLA_EN_M2_Lab5_quiz_questions import *

question_1

---
## Randomized search

An alternative way to look for good hyperparameters is randomized search. This method tries only a fixed number (given by the `n_iter` parameter) of randomly chosen combinations of hyperparameter values. If at least one parameter is given as a distribution, the values are sampled with replacement; if all parameters are given as lists, sampling is done without replacement.

In the following cell, the `max_depth` and `min_samples_leaf` hyperparameters are searched. 

Sklearn's randomized search uses 10 iterations by default (`n_iter=10`), which means 10 combinations to try. With 5-fold cross-validation, this results in a total of 50 fits.
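To get a feel for what the randomized search will try, you can draw the candidate combinations yourself. This sketch uses sklearn's `ParameterSampler`; the name `demo_distributions` and the fixed `random_state=0` are assumptions made for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same search space as the next cell (the variable name is illustrative)
demo_distributions = {
    "dt__max_depth": [100, 200, 300],
    "dt__min_samples_leaf": randint(15, 35),  # integers 15 to 34; upper bound is exclusive
}

# Draw 10 random combinations, matching the default n_iter of RandomizedSearchCV
samples = list(ParameterSampler(demo_distributions, n_iter=10, random_state=0))
for sample in samples[:3]:
    print(sample)

print("Candidates:", len(samples), "| Total fits with 5 folds:", len(samples) * 5)
```

Unlike grid search, the number of candidates depends only on `n_iter`, not on how many values each hyperparameter can take.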

In [None]:
### PIPELINE RANDOMIZED_SEARCH ###
##################################
from scipy.stats import randint

# Parameter distributions for RandomizedSearchCV
param_grid = {
    "dt__max_depth": [100, 200, 300],
    # randint(15, 35) samples integers from 15 to 34 (the upper bound is exclusive)
    "dt__min_samples_leaf": randint(15, 35),
}

randomized_search = RandomizedSearchCV(
    pipeline,  # Base model
    param_grid,  # Parameters to try
    cv=5,  # Apply 5-fold cross validation
    verbose=1,  # Print summary
    n_jobs=-1,  # Use all available processors
)

# Fit the RandomizedSearch to the training data
randomized_search.fit(X_train, y_train)

When the randomized search completes, look at what it found to be the best hyperparameters and the corresponding highest score.

In [None]:
print(randomized_search.best_params_)
print(randomized_search.best_score_)

Next, you can pick the model with the best hyperparameters.

In [None]:
# Get the best model out of RandomizedSearchCV
classifier = randomized_search.best_estimator_

# Fit the best model to the training data once more
classifier.fit(X_train, y_train)

Finally, you can look at the test score.

In [None]:
# Get testing data to test the classifier
X_test = test_data[model_features]
y_test = test_data[model_target]

# Use the fitted model to make predictions on the testing dataset
# Testing data going through the pipeline is first transformed with the
# preprocessing learned from the training data (min/max scaling, one-hot
# encoding, and text vectorization), and finally used to make predictions
test_predictions = classifier.predict(X_test)

print("Model performance on the test set:")
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Test accuracy:", accuracy_score(y_test, test_predictions))

The randomized search might not give better results (because it is random, the outcome will vary). You can adjust the ranges to try and improve the results.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Add another parameter to the search.</p><br>
    <p style=" text-align: center; margin: auto;">In the following code cell, add <code>'dt__min_samples_split': randint(2, 50)</code> to the grid, and run the randomized search again.</p><br>
    <p style=" text-align: center; margin: auto;">After the training completes, check the test score as you did previously.</p>
    <br>
</div>

In [None]:
# Parameter grid for randomized search

############### CODE HERE ###############

param_grid = {
    "dt__max_depth": [100, 200, 300],
    "dt__min_samples_leaf": randint(15, 35),  # Integer from 15 to 34 (upper bound is exclusive)
    "dt__min_samples_split": randint(2, 50),  # Integer from 2 to 49 (upper bound is exclusive)
}

randomized_search = RandomizedSearchCV(pipeline, # Base model
                                       param_grid, # Parameters to try
                                       cv = 5, # Apply 5-fold cross validation
                                       verbose = 1, # Print summary
                                       n_jobs = -1 # Use all available processors
                                      )
############## END OF CODE ##############

# Fit the RandomizedSearch to the training data
randomized_search.fit(X_train, y_train)

In [None]:
print(randomized_search.best_params_)
print(randomized_search.best_score_)

# Get the best model out of RandomizedSearchCV
classifier = randomized_search.best_estimator_

# Fit the best model to the training data once more
classifier.fit(X_train, y_train)

In [None]:
# Get testing data to test the classifier
X_test = test_data[model_features]
y_test = test_data[model_target]

# Use the fitted model to make predictions on the testing dataset
# Testing data going through the pipeline is first transformed with the
# preprocessing learned from the training data (min/max scaling, one-hot
# encoding, and text vectorization), and finally used to make predictions
test_predictions = classifier.predict(X_test)

print("Model performance on the test set:")
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Test accuracy:", accuracy_score(y_test, test_predictions))

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>It's time to check your knowledge!</i></h3>
    <br>
    <p style=" text-align: center; margin: auto;">To load the question, run the following cell.</p>
    <br>
</div>

In [None]:
# Run this cell for a knowledge check question

question_2

----
## Conclusion

This notebook showed you how to tune hyperparameters to improve your model.

## Next lab

In the next lab, you will gain experience with ensemble methods to create a strong model by combining the predictions of multiple weak models (also known as weak learners or base estimators) that are built with a given dataset and a given learning algorithm.