2.1 Demo – Hyperparameter Tuning with Optuna

# Hyperparameter Tuning with Optuna

In this hands-on demo, you will learn how to leverage **Optuna**, a powerful optimization library, for efficient model tuning. We’ll guide you through the process of performing **hyperparameter optimization**, demonstrating how to define the search space, objective function, and algorithm selection. Throughout the demo, you will utilize *MLflow* to seamlessly track the model tuning process, capturing essential information such as hyperparameters, metrics, and intermediate results. By the end of the session, you will not only grasp the principles of hyperparameter optimization but also be proficient in finding the best-tuned model using various methods such as the **MLflow API** and **MLflow UI**.

By integrating Optuna and MLflow, you can efficiently optimize hyperparameters and maintain comprehensive records of your machine learning experiments, facilitating reproducibility and collaborative research.

## Learning Objectives:

*By the end of this demo, you will be able to:*

- Perform hyperparameter optimization using Optuna.  
- Track the model tuning process with MLflow.  
- Query previous runs from an experiment using the `MlflowClient`.  
- Review an MLflow Experiment for visualizing results and selecting the best run.  
- Read in the best model, make a prediction, and register the model to Unity Catalog.

---

## Requirements

Please review the following requirements before starting the lesson:

- To run this notebook, you need to use one of the following Databricks runtime(s): **16.0.x-cpu-ml-scala2.12**


## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.  
Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.
2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - In the drop-down, select **More**.
   - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.  
2. Find the triangle icon to the right of your compute cluster name and click it.  
3. Wait a few minutes for the cluster to start.  
4. Once the cluster is running, complete the steps above to select your cluster.

---

## Classroom Setup

Before starting the demo, run the two cells below. The first installs Optuna and restarts the Python process; the second runs the provided classroom setup script, which defines the configuration variables needed for the demo.


In [0]:
%pip install -U -qq optuna
dbutils.library.restartPython()


In [0]:
%run ../Includes/Classroom-Setup-2.1


### Other Conventions:

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:


In [0]:
print(f"Username:       {DA.username}")
print(f"Catalog Name:   {DA.catalog_name}")
print(f"Schema Name:    {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")


### Prepare Dataset

Before we start fitting a model, we need to prepare the dataset. First, we will load the dataset, then we will split it into train and test sets.


### Load Dataset

In this demo we will be using the CDC Diabetes dataset from the Databricks Marketplace. This dataset has been read in and written to a feature table called `diabetes` in our working schema.


In [0]:
# load data from the feature table
table_name = f"{DA.catalog_name}.{DA.schema_name}.diabetes"
diabetes_dataset = spark.read.table(table_name)
diabetes_pd = diabetes_dataset.drop('unique_id').toPandas()

# review dataset and schema
display(diabetes_pd)
print(diabetes_pd.info())


### Train/Test Split
Next, we will divide the dataset into training and testing sets.

In [0]:
from sklearn.model_selection import train_test_split

print(f"We have {diabetes_pd.shape[0]} records in our source dataset")

# split target variable into its own dataset
target_col = "Diabetes_binary"
X_all = diabetes_pd.drop(labels=target_col, axis=1)
y_all = diabetes_pd[target_col]

# test / train split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=0.95, random_state=42)

y_train = y_train.astype(float)
y_test = y_test.astype(float)

print(f"We have {X_train.shape[0]} records in our training dataset")
print(f"We have {X_test.shape[0]} records in our test dataset")


## Hyperparameter Tuning

### Define the Objective Function

An objective function in Optuna is a Python function that defines the optimization target. It takes a single argument, typically named `trial`, which is an instance of the `optuna.Trial` class. This function is responsible for:

1. Defining the hyperparameter search space  
2. Training the model with the suggested hyperparameters  
3. Evaluating the model's performance  
4. Returning a scalar value that Optuna will try to optimize (minimize or maximize)

In our case, we are working with scikit-learn's `DecisionTreeClassifier`. Start by defining the search space for the model. Our hyperparameters are:

- `criterion`: a choice between `gini` and `entropy`. Including it in the search space lets the tuner try both options and identify which split criterion works best.
- `max_depth`: an integer between 5 and 50  
- `min_samples_split`: an integer between 2 and 40  
- `min_samples_leaf`: an integer between 1 and 20  

Optuna draws values from this search space using a sampler. `TPE` (Tree-structured Parzen Estimator) is the default, though there are [other sampling methods](https://optuna.readthedocs.io/en/stable/reference/samplers.html) like `GPSampler` and `BruteForceSampler`.

The objective function also creates nested MLflow runs for logging. Each trial starts a new child run using `with mlflow.start_run(nested=True)`.  
We will also manually log metrics and the scikit-learn model within the objective function.  
Note that each configuration is evaluated with 5-fold cross-validation, and the objective returns the negative mean F1-score across the folds. Because the study minimizes the objective, minimizing the negative F1-score is equivalent to maximizing F1.
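
The sign convention can be sketched in plain Python. This is only an illustration: the per-fold F1 scores below are made-up values, and `objective_value` is a hypothetical helper, not part of Optuna or scikit-learn.

```python
from statistics import mean

def objective_value(fold_f1_scores):
    """Return the value a minimizing study would see: the negated mean F1."""
    return -mean(fold_f1_scores)

# Hypothetical per-fold F1 scores for two candidate configurations
candidate_a = [0.20, 0.30, 0.25, 0.25, 0.25]
candidate_b = [0.30, 0.30, 0.30, 0.30, 0.30]

values = {
    "a": objective_value(candidate_a),
    "b": objective_value(candidate_b),
}

# Minimizing the negated score selects the candidate with the higher mean F1
best = min(values, key=values.get)
print(f"Best candidate: {best}, mean F1: {-values[best]:.2f}")
```

Negating the objective value again recovers the mean F1-score of the selected candidate.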

- Gini impurity measures how often a randomly selected sample would be misclassified if it were labeled at random according to the current class distribution. It quantifies the probability of misclassification.

- Entropy measures the amount of uncertainty or disorder in the dataset. It quantifies how "impure" a node is in terms of its class distribution: the more disorder or uncertainty in the classification, the higher the entropy.
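
As a rough illustration of these two criteria, the sketch below computes both measures from the class counts at a single node. The helper names and the example counts are assumptions made for this illustration; scikit-learn computes these quantities internally.

```python
from math import log2

def gini_impurity(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def entropy(class_counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(class_counts)
    return -sum(
        (count / total) * log2(count / total)
        for count in class_counts
        if count > 0
    )

# Hypothetical nodes: perfectly mixed, mostly one class, and pure
for counts in ([5, 5], [9, 1], [10, 0]):
    print(counts, round(gini_impurity(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why either can serve as a split criterion.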


In [0]:
import optuna
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from mlflow.models.signature import infer_signature

# Define the objective function
def optuna_objective_function(trial):
    # Define hyperparameter search space
    params = {
        'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy']),
        'max_depth': trial.suggest_int('max_depth', 5, 50),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 40),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20)
    }

    # Start an MLflow run for logging
    with mlflow.start_run(nested=True, run_name=f"Model Tuning with Optuna – Trial {trial.number}"):
        
        # Log parameters with MLflow
        mlflow.log_params(params)

        dtc = DecisionTreeClassifier(**params)
        scoring_metrics = ['accuracy', 'precision', 'recall', 'f1']
        cv_results = cross_validate(dtc, X_train, y_train, cv=5,
                                    scoring=scoring_metrics, return_estimator=True)

        # Log cross-validation metrics to MLflow
        for metric in scoring_metrics:
            mlflow.log_metric(f"cv_{metric}", cv_results[f'test_{metric}'].mean())

        # Train the model on the full training set
        final_model = DecisionTreeClassifier(**params)
        final_model.fit(X_train, y_train)

        # Create input signature using the first row of X_train
        input_example = X_train.iloc[0:1]
        signature = infer_signature(input_example, final_model.predict(input_example))

        # Log the model with its signature and input example
        mlflow.sklearn.log_model(final_model, "decision_tree_model", signature=signature, input_example=input_example)

        # Mean F1-score across the cross-validation folds
        f1_score_mean = cv_results['test_f1'].mean()

        # The study minimizes its objective, so return the negated F1-score
        return -f1_score_mean



# Optimize the Scikit-Learn Model on Single-Machine Optuna and Log Results with MLflow

Before running the optimization, we need to perform two key steps:

1. **Initialize an Optuna Study** using `optuna.create_study()`.
   - A *study* represents an optimization process consisting of multiple trials.
   - A *trial* is a single execution of the *objective function* with a specific set of hyperparameters.

2. **Run the Optimization** using `study.optimize()`.
   - This tells Optuna how many trials to perform and allows it to explore the search space.

Each trial will be logged to MLflow, including the hyperparameters tested and their corresponding cross-validation results. Optuna chooses the hyperparameters for each new trial, using the results of the trials completed so far.
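
To make the study/trial relationship concrete, here is a conceptual stand-in written in plain Python. It is not Optuna's API and it uses uniform random search rather than TPE; `run_random_search` and `toy_objective` are hypothetical names introduced only for this sketch.

```python
import random

def run_random_search(objective, n_trials, seed=0):
    """Stand-in for a study: call the objective once per trial and
    remember the lowest value seen (plain random search, not TPE)."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    history = []
    for trial_number in range(n_trials):
        # One trial = one set of sampled hyperparameters
        params = {
            "max_depth": rng.randint(5, 50),
            "min_samples_split": rng.randint(2, 40),
        }
        value = objective(params)
        history.append((trial_number, params, value))
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value, history

def toy_objective(params):
    # Hypothetical loss with its minimum at max_depth=10, min_samples_split=20
    return (params["max_depth"] - 10) ** 2 + (params["min_samples_split"] - 20) ** 2

best_params, best_value, history = run_random_search(toy_objective, n_trials=10)
print(f"Ran {len(history)} trials; best value {best_value} with {best_params}")
```

The loop plays the role of the study and each iteration plays the role of a trial. A sampler such as TPE differs in that it uses the history of earlier trials to decide where to sample next.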

---

## Steps:

- **Set up an Optuna study** with `optuna.create_study()`.
- **Start an MLflow run** with `mlflow.start_run()` to log experiments.
- **Optimize hyperparameters** using `study.optimize()` within the MLflow context.

---

## Note on `n_jobs` in `study.optimize()`:

The `n_jobs` argument controls the **number of trials running in parallel** using multi-threading **on a single machine**.

- If `n_jobs=-1`, Optuna will use **all available CPU cores** (e.g., on a 4-core machine, it will likely use all 4 cores).
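- If `n_jobs` is **not set**, it defaults to `1` and trials run **sequentially (single-threaded)**.
- **Important**: `n_jobs` **does not** distribute trials across multiple nodes in a Spark cluster.  
  `SparkTrials` is a Hyperopt class and does not apply to Optuna. To parallelize an Optuna study across nodes, multiple worker processes must share the study through a common storage backend (for example, a relational database).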
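
The idea of thread-based parallelism on a single machine can be illustrated with Python's standard `concurrent.futures` module. This is a conceptual sketch only, not a description of how Optuna schedules trials; the `evaluate` function and the candidate values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(max_depth):
    # Hypothetical stand-in for one trial's objective value
    return (max_depth - 10) ** 2

candidates = [5, 10, 20, 40]

# Two worker threads evaluate candidates concurrently within one process
with ThreadPoolExecutor(max_workers=2) as pool:
    values = list(pool.map(evaluate, candidates))  # results keep input order

best_depth = candidates[values.index(min(values))]
print(f"Best max_depth: {best_depth}")
```

All worker threads live in the same process on the same machine, which is why this style of parallelism cannot spread work across the nodes of a cluster.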

---

## Why We Don't Use `MLflowCallback`:

Optuna provides an `MLflowCallback` for automatic logging. In this demo, however, we log to MLflow manually through the MLflow API to show how the two libraries can be integrated without the callback.


First, we will delete all previous runs to keep our workspace and experiment tidy.

In [0]:
# Set the MLflow experiment name and get the id
experiment_name = f"/Users/{DA.username}/Demo_Optuna_Experiment_{DA.schema_name}"
print(f"Experiment Name: {experiment_name}")

mlflow.set_experiment(experiment_name)
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
print(f"Experiment ID: {experiment_id}")

print("Clearing out old runs (If you want to add more runs, change the n_trials parameter in the next cell) ...")

# Get all runs
runs = mlflow.search_runs(experiment_ids=[experiment_id], output_format="pandas")

if runs.empty:
    print("No runs found in the experiment.")
else:
    # Iterate and delete each run
    for run_id in runs["run_id"]:
        mlflow.delete_run(run_id)
        print(f"Deleted run: {run_id}")
    print("All runs have been deleted.")


In [0]:
study = optuna.create_study(
    study_name="optuna_hpo",
    direction="minimize"
)

with mlflow.start_run(run_name='demo_optuna_hpo') as parent_run:
    # Run optimization
    study.optimize(
        optuna_objective_function,
        n_trials=10
    )


## Review Tuning Results

We can use the MLflow API to review the trial results.


In [0]:
import mlflow
import pandas as pd

# Reuse the experiment ID recorded on the parent run
experiment_id = parent_run.info.experiment_id

# Fetch all runs from the experiment
df_runs = mlflow.search_runs(
    experiment_ids=[experiment_id]
)

# Filter out the parent run
df_runs = df_runs[df_runs['tags.mlflow.runName'] != 'demo_optuna_hpo']

# Display the results
display(df_runs)


We can also use the Optuna study to get the best hyperparameters and the best objective value, which is the negative mean F1-score. Verify that these agree with the table in the previous cell’s output.


In [0]:
# Display the best hyperparameters and metric
print(f"Best hyperparameters: {study.best_params}")
print(f"Best negative F1 score: {study.best_value}")


## Find the Best Run Based on F1-Score

In this section, we will search for the best run in the experiment. There are a couple of ways to do this. We will show how to search runs using the MLflow API and the MLflow UI.

**The output links for using Optuna gave the best runs. Why can’t we just use that?**

You totally can! But this is the same as using the UI to navigate to the trial that was the best (which is shown below).

### Option 1: Find the Best Run – MLflow API

Using the MLflow API, you can search runs in an experiment, which returns results into a Pandas DataFrame.


In [0]:
experiment_id = parent_run.info.experiment_id
print(f"Experiment ID: {experiment_id}")

In [0]:
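from mlflow.entities import ViewType

# Return only the single active (non-deleted) run with the highest cross-validated F1-score
search_runs_pd = mlflow.search_runs(
    experiment_ids=[experiment_id],
    run_view_type=ViewType.ACTIVE_ONLY,
    order_by=["metrics.cv_f1 DESC"],
    max_results=1
)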

display(search_runs_pd)
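
What ordering by `metrics.cv_f1 DESC` with `max_results=1` amounts to can be sketched on plain Python records. The run IDs and scores below are made up for illustration, and the sketch explicitly drops runs that never logged the metric (such as the parent run) rather than relying on any particular MLflow behavior.

```python
# Hypothetical run records; the parent run logs no cv_f1 metric
runs = [
    {"run_id": "parent", "metrics": {}},
    {"run_id": "trial-0", "metrics": {"cv_f1": 0.21}},
    {"run_id": "trial-1", "metrics": {"cv_f1": 0.27}},
    {"run_id": "trial-2", "metrics": {"cv_f1": 0.24}},
]

# Keep only runs that logged the metric, then sort from highest to lowest
scored_runs = [run for run in runs if "cv_f1" in run["metrics"]]
ranked_runs = sorted(scored_runs, key=lambda run: run["metrics"]["cv_f1"], reverse=True)

# Equivalent of max_results=1: take the top-ranked run
best_run = ranked_runs[0]
print(best_run["run_id"], best_run["metrics"]["cv_f1"])
```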


### Option 2: Find the Best Run – MLflow UI

The simplest way of seeing the tuning result is to use MLflow UI.

1. Click on **Experiments** from left menu.  
2. Select the experiment created by this notebook. Its name starts with `Demo_Optuna_Experiment_` followed by your schema name; the full path is printed by the experiment setup cell above.  
3. Click on the graph icon at the top left under **Run**.  
4. Click on the parent run or manually select all 10 runs to compare. The graphs on the right of the screen will appear for inspection.


## Visualize the Hyperparameters

By now, we have determined which trial produced the best run according to the F1-score.  
Now, let's visually inspect how the other hyperparameters in the search space vary across runs alongside this metric.


In [0]:
import matplotlib.pyplot as plt

# Ensure the necessary parameters exist in the DataFrame before plotting
required_params = [
    "params.min_samples_leaf",
    "params.max_depth",
    "params.min_samples_split",
    "metrics.cv_f1",
    "tags.mlflow.runName"
]
# Work on a copy so the type conversions below do not modify a view of df_runs
df_filtered = df_runs.dropna(subset=required_params, how="any").copy()

# Convert parameters to appropriate types
df_filtered["params.min_samples_split"] = df_filtered["params.min_samples_split"].astype(float)
df_filtered["params.max_depth"] = df_filtered["params.max_depth"].astype(float)
df_filtered["metrics.cv_f1"] = df_filtered["metrics.cv_f1"].astype(float)

# Identify the best run index (assuming higher f1 is better)
best_run_index = df_filtered["metrics.cv_f1"].idxmax()
best_run_name = df_filtered.loc[best_run_index, "tags.mlflow.runName"]

# Extract run names for x-axis labels
run_names = df_filtered["tags.mlflow.runName"]

# Create a figure and axis for bar chart
fig, ax1 = plt.subplots(figsize=(12, 6))

# Bar chart for min_samples_split and max_depth
df_filtered[["params.min_samples_split", "params.max_depth"]].plot(
    kind="bar", ax=ax1, edgecolor="black"
)

ax1.set_xlabel("Run Name")
ax1.set_ylabel("Parameter Values")
ax1.set_title("Hyperparameters & cv_f1 Score per Run")
ax1.legend(["Min Samples Split", "Max Depth"])
ax1.set_xticks(range(len(df_filtered)))
ax1.set_xticklabels(run_names, rotation=45, ha="right")  # Rotate for readability

# Create a second y-axis for the cv_f1 score line chart
# (drawn on the same figure, so plt.show() is called only after both axes are populated)
ax2 = ax1.twinx()
ax2.plot(
    range(len(df_filtered)),  # X-axis indices
    df_filtered["metrics.cv_f1"],
    marker="o",
    linestyle="-",
    color="blue",
    label="cv_f1 Score"
)

# Highlight the best run with a bold marker
ax2.plot(
    df_filtered.index.get_loc(best_run_index),  # Get positional index
    df_filtered.loc[best_run_index, "metrics.cv_f1"],
    marker="o",
    markersize=10,
    color="red",
    label="Best Run"
)
# Add a vertical dashed line to indicate the best run
ax2.axvline(df_filtered.index.get_loc(best_run_index), color="red", linestyle="--")

ax2.set_ylabel("cv_f1 Score")

# Add legend
fig.legend(loc="upper left", bbox_to_anchor=(0.1, 0.9))
plt.show()

# Pie chart for criterion
plt.figure(figsize=(8, 8))
df_filtered["params.criterion"].value_counts().plot(kind="pie", autopct="%1.1f%%", startangle=90)
plt.title("Criterion Distribution")
plt.ylabel("")  # Hide y-label for better visualization
plt.show()


## Load the Best Model and Parameters

To load the model and make a prediction, let's use the result from Option 1 shown above. Run the next cell to extract the run ID and experiment ID.

### Copy and Paste Option

Alternatively, you can set the variables shown below manually. Using either the output from Option 1 or the UI from Option 2, locate the `run_id` and the `experiment_id`. In the Option 1 output, these are the values in the first two columns. In the UI, they appear in the Details table when you click on a specific run.


In [0]:
# search_runs_pd holds a single row: the best run found in Option 1
best_run_row = search_runs_pd.iloc[0]

# Get the string values of run_id and experiment_id from that row
run_id = best_run_row["run_id"]
experiment_id = best_run_row["experiment_id"]

print(f"Run ID: {run_id}")
print(f"Experiment ID: {experiment_id}")


In [0]:
import mlflow
import json
from mlflow.models import Model

# Grab an input example from the test set as a one-row DataFrame,
# so it matches the column-based signature logged with the model
input_example = X_test.iloc[[0]].copy()

model_path = f"dbfs:/databricks/mlflow-tracking/{experiment_id}/{run_id}/artifacts/decision_tree_model"

# Load the model using the run ID
loaded_model = mlflow.pyfunc.load_model(model_path)

# Retrieve model parameters
client = mlflow.tracking.MlflowClient()
params = client.get_run(run_id).data.params

# Display model parameters
print("Best Model Parameters:")
print(json.dumps(params, indent=4))


## Make a Prediction

In [0]:
# Make a prediction
test_prediction = loaded_model.predict(input_example)

# X_test is a pandas dataframe – let's add the test_prediction output as a new column
input_example['prediction'] = test_prediction

display(input_example)


## Register the Model to Unity Catalog

After running the following cell, navigate to our working catalog and schema (see course setup above) and validate the model has been registered.


In [0]:
mlflow.set_registry_uri("databricks-uc")

model_uri = f"runs:/{run_id}/decision_tree_model"

mlflow.register_model(
    model_uri=model_uri,
    name=f"{DA.catalog_name}.{DA.schema_name}.demo_optuna_model"
)


## Conclusion
In this demo, we explored how to improve your model's performance by using Optuna for hyperparameter optimization and MLflow to track the tuning process.
Using Optuna's efficient search algorithms, you learned how to tune model parameters effectively.
At the same time, MLflow logged each trial, capturing essential information such as hyperparameters, metrics, and intermediate results.
You also learned how to register the best model to Unity Catalog.
Going forward, integrating these tools into your workflow will be key to improving model performance and simplifying the tuning process.