<a href="https://colab.research.google.com/github/wengti/Reinforcement-Learning-Tutorial-/blob/main/notebooks/unit3/Hyperparameter_Optimization_with_Optuna.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

* The following notebook focuses on applying Optuna in Reinforcement Learning.
* Source / Reference of this notebook: https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb#scrollTo=4UU17YpjymPr
* To study the application of Optuna in Deep Learning, refer to: https://www.geeksforgeeks.org/hyperparameter-tuning-with-optuna-in-pytorch/


## Quick guide to the steps needed

1. Define a config.
2. Define a search space, i.e. a function that returns the sampled parameters used to build the model.
3. Define an objective function that returns the objective score for a sampled set of hyperparameters.
4. Create an optimization loop.
  - For each trial:
    - a) Sample a set of hyperparameters.
    - b) Train with those hyperparameters. At regular intervals, evaluate the model and decide whether to prune the trial.
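
The four steps above can be sketched without any library. The toy below is only an illustration under stated assumptions: it uses plain random search instead of Optuna's TPE sampler, a made-up `evaluate` function instead of real training, and a simplified median rule for pruning. All names and values in it are hypothetical.

```python
import math
import random
import statistics

# Step 1: config (illustrative values only)
N_TRIALS = 8
N_EVALUATIONS = 3

def sample_params(rng):
    """Step 2: search space, here a single log-uniform learning rate."""
    return {"learning_rate": 10 ** rng.uniform(-5, 0)}

def evaluate(params, step):
    """Step 3 stand-in: a made-up score that peaks near learning_rate = 1e-3."""
    distance = abs(math.log10(params["learning_rate"]) + 3)
    return step * (5 - distance)

rng = random.Random(0)
history = {}   # step -> intermediate scores of completed trials
results = []   # (final score, or None if pruned, params)

# Step 4: optimization loop
for trial_id in range(N_TRIALS):
    params = sample_params(rng)                     # 4a: sample
    scores, pruned = [], False
    for step in range(1, N_EVALUATIONS + 1):        # 4b: train and evaluate
        score = evaluate(params, step)
        scores.append(score)
        earlier = history.get(step, [])
        # Simplified median rule: stop if worse than the median so far
        if earlier and score < statistics.median(earlier):
            pruned = True
            break
    if not pruned:
        for step, value in enumerate(scores, start=1):
            history.setdefault(step, []).append(value)
    results.append((None if pruned else score, params))

completed = [(s, p) for s, p in results if s is not None]
best_score, best_params = max(completed, key=lambda item: item[0])
print(f"{len(completed)} of {N_TRIALS} trials completed, best score {best_score:.2f}")
```

The rest of the notebook replaces each piece with its real counterpart: Optuna does the sampling and pruning, and Stable-Baselines3 does the training and evaluation.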

## Step 0: Library Installation and Import

In [1]:
# Install optuna library
!pip install optuna

Collecting optuna
  Downloading optuna-4.3.0-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.16.1-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.3.0-py3-none-any.whl (386 kB)
Downloading alembic-1.16.1-py3-none-any.whl (242 kB)
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, alembic, optuna
Successfully installed alembic-1.16.1 colorlog-6.9.0 optuna-4.3.0


In [2]:
# Install Stable-Baselines3
!pip install stable-baselines3==2.0.0a5

Collecting stable-baselines3==2.0.0a5
  Downloading stable_baselines3-2.0.0a5-py3-none-any.whl.metadata (5.3 kB)
Collecting gymnasium==0.28.1 (from stable-baselines3==2.0.0a5)
  Downloading gymnasium-0.28.1-py3-none-any.whl.metadata (9.2 kB)
Collecting jax-jumpy>=1.0.0 (from gymnasium==0.28.1->stable-baselines3==2.0.0a5)
  Downloading jax_jumpy-1.0.0-py3-none-any.whl.metadata (15 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11->stable-baselines3==2.0.0a5)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11->stable-baselines3==2.0.0a5)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11->stable-baselines3==2.0.0a5)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (fr

In [None]:
# Install gymnasium

!pip install swig
!pip install gymnasium[box2d]

Collecting swig
  Using cached swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (3.5 kB)
Using cached swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.9 MB)
Installing collected packages: swig
Successfully installed swig-4.3.1
Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Using cached box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Collecting pygame==2.1.3 (from gymnasium[box2d])
  Using cached pygame-2.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Using cached pygame-2.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.7 MB)
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... done
  Created wheel for box2d-py: filename=box2d_py-2.3.5-cp311-cp311-linux_x86_64.whl size=2379371 sha256=c83611c5b3ce9ef831a8a0d95b62d4f87411120a87123b329864aa0fff59be42
  Stored in directory: /root/.cache/pip/wheels/

In [None]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances


from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.a2c import A2C
from stable_baselines3 import PPO

import gymnasium as gym

import torch as th
import torch.nn as nn

# DEMO: Tune an A2C agent that plays CartPole-v1

## Step 1: Create config / key parameters

* Terminology:
1. `TRIALS` - Each trial starts from a different sampled set of hyperparameters and trains the agent / model for `N_TIMESTEPS`. If needed, `N_JOBS` trials can run in parallel.

2. `EVAL_EPISODES` - During the `N_TIMESTEPS` of training, an evaluation is performed every `EVAL_FREQ` steps. Each evaluation runs `N_EVAL_EPISODES` episodes and records their mean reward, which lets the pruner decide whether to stop the trial early.


In [None]:
# Config
# Hyperparameter optimization loop
N_TRIALS = 100 # Maximum number of trials in the hyperparameter optimization loop
N_JOBS = 1 # Number of trials to run in parallel
N_STARTUP_TRIALS = 5 # Number of initial trials that use random sampling instead of the sampler's algorithm (to build the initial history)
TIMEOUT = int(60*15) # Maximum duration (in seconds) of the entire optimization loop

# Evaluation parameters for each set of hyperparameters
N_TIMESTEPS = int(2e4) # Training budget: number of timesteps in one full trial
N_EVALUATIONS = 2 # Number of evaluations performed during one full trial
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS) # Number of timesteps between two evaluations
N_EVAL_EPISODES = 10 # Number of episodes run for each evaluation


# Environment parameters
N_EVAL_ENVS = 5 # Number of environments used in parallel during evaluation

ENV_ID = "CartPole-v1" # Environment ID, as passed to gym.make()

# Algorithm parameter
ALGO_NAME = "A2C"
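
To make the budget implied by this config concrete, the arithmetic can be worked through directly. This is only a back-of-the-envelope sketch that copies the values above; `eval_steps`, `max_training_steps` and `max_eval_episodes` are names introduced here for illustration, and the totals are upper bounds that assume no trial is pruned and the timeout is never hit.

```python
N_TRIALS = 100
TIMEOUT = 60 * 15
N_TIMESTEPS = int(2e4)
N_EVALUATIONS = 2
N_EVAL_EPISODES = 10

# Timesteps between two evaluations within one trial
EVAL_FREQ = N_TIMESTEPS // N_EVALUATIONS

# Timesteps at which a trial is evaluated
eval_steps = [EVAL_FREQ * i for i in range(1, N_EVALUATIONS + 1)]

# Upper bounds over the whole study
max_training_steps = N_TRIALS * N_TIMESTEPS
max_eval_episodes = N_TRIALS * N_EVALUATIONS * N_EVAL_EPISODES

print(eval_steps)          # [10000, 20000]
print(max_training_steps)  # 2000000
print(max_eval_episodes)   # 2000
print(TIMEOUT / 60)        # 15.0 (minutes)
```

In practice the 15-minute `TIMEOUT` is usually what ends the loop: the run recorded in this notebook finished 29 trials, not 100.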

## Step 2: Define the search space

In [None]:
def sample_a2c_params(trial):

  """
  Sample a set of hyperparameters to be tried.

  Args:
    trial (optuna.Trial): An Optuna trial object.

  Returns:
    params (dict): The sampled hyperparameters. The keys match the keyword arguments used to define the model.
  """

  # See https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#example
  # for the hyperparameters that can be tuned.

  ###################
  # Discount factor #
  ###################
  # suggest_float -> sample from a continuous (float) range
  # "one_minus_gamma" -> parameter name (shown in the final plots)
  # log = True -> sample in log space
  gamma = 1 - trial.suggest_float("one_minus_gamma", 0.0001, 0.1, log = True)

  # Store the actual gamma value as a user attribute
  trial.set_user_attr("gamma", gamma)

  #######################################
  # Maximum value for gradient clipping #
  #######################################
  max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 0.5, log=True)

  ##########################################################
  # Number of steps to run for each environment per update #
  ##########################################################
  n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)

  # Store the actual n_steps value as a user attribute
  trial.set_user_attr("n_steps", n_steps)

  #################
  # Learning rate #
  #################
  learning_rate = trial.suggest_float("learning_rate", 1e-5, 1, log=True)

  ########################
  # Network architecture #
  ########################
  # https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#a2c-policies

  net_arch = trial.suggest_categorical("net_arch", ["tiny", "small"])

  # Legacy form: the dict is wrapped in a list. Since SB3 v1.8.0 the dict can be passed directly
  # (net_arch=dict(pi=..., vf=...)); the list form still runs but triggers a deprecation warning.
  net_arch = [{"pi": [64], "vf": [64]} if net_arch == 'tiny' \
              else {"pi" : [64, 64], "vf" : [64, 64]}]

  #######################
  # Activation function #
  #######################
  activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

  activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]



  # Note: The keys of this dictionary must match the keyword arguments used to define the model,
  # so the naming convention must be followed.
  params = {"n_steps": n_steps,
          "gamma": gamma,
          "learning_rate": learning_rate,
          "max_grad_norm": max_grad_norm,
          "policy_kwargs": {"net_arch": net_arch,
                            "activation_fn": activation_fn}}

  return params
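
Why sample `1 - gamma` in log space instead of `gamma` directly? The library-free sketch below mimics a log-uniform draw, which is what `suggest_float(..., log=True)` is understood to do during the random startup trials. `sample_log_uniform` is a hypothetical helper written for this illustration, not part of Optuna.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
one_minus_gamma = [sample_log_uniform(rng, 1e-4, 1e-1) for _ in range(10_000)]
gammas = [1 - x for x in one_minus_gamma]

# Share of samples with gamma above 0.99 when sampling 1 - gamma in log space.
# The range 1e-4..1e-1 spans three decades, two of which lie below 1e-2.
log_share = sum(g > 0.99 for g in gammas) / len(gammas)

# Same share if gamma were drawn uniformly from the matching range [0.9, 0.9999]
uniform_share = (0.9999 - 0.99) / (0.9999 - 0.9)

print(f"log-space sampling of 1 - gamma: {log_share:.2f} of samples have gamma > 0.99")
print(f"uniform sampling of gamma:       {uniform_share:.2f} of samples have gamma > 0.99")
```

Log-space sampling puts roughly two thirds of the draws in the region `gamma > 0.99`, where discount factors typically differ meaningfully, while uniform sampling would put only about a tenth there.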


## Step 3: Define objective

* A custom callback function is defined to report the results of periodic evaluations.

In [None]:
class TrialEvalCallback(EvalCallback):

  """
  Callback used for evaluating and reporting a trial.

  Args:
    eval_env (VecEnv or gym.Env): An evaluation environment.
    trial (optuna.Trial): An Optuna trial object.
    n_eval_episodes (int): Number of evaluation episodes for each evaluation.
    eval_freq (int): Number of callback calls between two evaluations in a trial.
    deterministic (bool): Whether the evaluation uses the deterministic (True) or the stochastic (False) policy.
    verbose (int): Verbosity level passed on to EvalCallback (0 for no output).

  Note:
    `_on_step` returns False to stop training once the trial has been pruned, and True otherwise.
  """

  def __init__(self, eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose = 0):

    super().__init__(eval_env = eval_env, n_eval_episodes = n_eval_episodes,
                    eval_freq = eval_freq, deterministic = deterministic,
                    verbose = verbose)
    self.trial = trial
    self.eval_idx = 0
    self.is_pruned = False

  def _on_step(self):
    if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
      super()._on_step()
      self.eval_idx += 1

      # Send report to optuna
      self.trial.report(self.last_mean_reward, self.eval_idx)

      # Prune trial if needed
      if self.trial.should_prune():
        self.is_pruned = True
        return False
    return True
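
The callback reports `eval_idx` (1, 2, ...) to Optuna as the step, not the number of environment timesteps. The sketch below replays the gating condition of `_on_step` to show which steps get reported, and what that means for a pruner's warmup setting. It assumes a single training environment (so `n_calls` equals timesteps) and an assumed warmup rule of the form `step >= n_warmup_steps`; `reported_steps` and `eligible_for_pruning` are hypothetical helpers, not Optuna or SB3 functions.

```python
N_TIMESTEPS = 20_000
N_EVALUATIONS = 2
EVAL_FREQ = N_TIMESTEPS // N_EVALUATIONS

def reported_steps(n_timesteps, eval_freq):
    """Return (n_calls, eval_idx) pairs for every evaluation the gate lets through."""
    reports = []
    eval_idx = 0
    for n_calls in range(1, n_timesteps + 1):
        if eval_freq > 0 and n_calls % eval_freq == 0:
            eval_idx += 1
            reports.append((n_calls, eval_idx))
    return reports

def eligible_for_pruning(step, n_warmup_steps):
    """Assumed warmup rule: pruning is only considered once step >= n_warmup_steps."""
    return step >= n_warmup_steps

reports = reported_steps(N_TIMESTEPS, EVAL_FREQ)
print(reports)  # [(10000, 1), (20000, 2)]

# Compare a warmup counted in evaluations with one counted in timesteps
for n_warmup_steps in (N_EVALUATIONS // 3, N_TIMESTEPS // 3):
    eligible = [idx for _, idx in reports if eligible_for_pruning(idx, n_warmup_steps)]
    print(n_warmup_steps, eligible)
# 0 [1, 2]
# 6666 []
```

A warmup expressed in timesteps is far larger than any step index the callback ever reports, so under this rule no evaluation would be eligible for pruning.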

* The true objective function.

In [None]:
def objective(trial):

  """
  A function that returns the objective score used to judge the quality of a set of hyperparameters.

  Args:
    trial (optuna.Trial): An Optuna trial object.

  Returns:
    objective_score (float): The score that represents the quality of this set of hyperparameters.
  """

  # Create the default keyword arguments (those not defined in the hyperparameter sampling function)
  kwargs = {"policy": "MlpPolicy",
            "env": ENV_ID}

  # Update with the inclusion of the sampled hyperparameter
  kwargs.update(sample_a2c_params(trial))

  # Create a model using the sampled hyperparameter
  model = A2C(**kwargs)

  # Create the environments
  eval_envs = make_vec_env(env_id = ENV_ID,
                           n_envs = N_EVAL_ENVS)

  # Create the call back for reporting evaluation results
  eval_callback = TrialEvalCallback(eval_env = eval_envs,
                                    trial = trial,
                                    n_eval_episodes = N_EVAL_EPISODES,
                                    eval_freq = EVAL_FREQ,
                                    deterministic = True)

  nan_encountered = False
  try:
    model.learn(N_TIMESTEPS, callback = eval_callback)
  except AssertionError as e:
    # Sometimes, randomly sampled hyperparameters can lead to NaN
    print(e)
    nan_encountered = True
  finally:
    # At the end of training, or if an error is encountered,
    # free memory
    model.env.close()
    eval_envs.close()

  # Inform the optimizer that an invalid set of hyperparameters was sampled
  if nan_encountered:
    return float('nan')

  if eval_callback.is_pruned:
    raise optuna.exceptions.TrialPruned()

  return eval_callback.last_mean_reward




## Step 4: Define Hyperparameter Optimization Loop

In [None]:
# Set PyTorch num threads to 1 for faster training
# Parallel environments already make heavy use of the CPU,
# so this line limits PyTorch to a single CPU thread.
th.set_num_threads(1)

# Select a sampler
# https://optuna.readthedocs.io/en/stable/reference/samplers/generated/optuna.samplers.TPESampler.html
# n_startup_trials -> Number of initial trials that sample hyperparameters randomly instead of using the TPE algorithm.
# This builds the initial history that the sampler learns from.
sampler = TPESampler(n_startup_trials = N_STARTUP_TRIALS)

# Select a scheduler / pruner
# https://optuna.readthedocs.io/en/stable/reference/generated/optuna.pruners.MedianPruner.html
# n_startup_trials -> Pruning is disabled for this many initial trials, while the initial history is built.
# n_warmup_steps -> Number of reported steps at the start of each trial during which pruning is disabled.
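# The callback reports the evaluation index (1, 2, ...) as the step, so the warmup is
# counted in evaluations. A value in timesteps (N_TIMESTEPS // 3) would never be reached.
pruner = MedianPruner(n_startup_trials = N_STARTUP_TRIALS,
                      n_warmup_steps = N_EVALUATIONS // 3)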

# Create a study for Hyperparameter Optimization
# https://optuna.readthedocs.io/en/stable/reference/generated/optuna.create_study.html
study = optuna.create_study(sampler = sampler,
                            pruner = pruner,
                            direction = "maximize")

try:
  # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study.optimize
  study.optimize(objective,
                 n_trials = N_TRIALS,
                 timeout = TIMEOUT,
                 n_jobs = N_JOBS)
except KeyboardInterrupt:
  pass

# Print the meta info for the hyperparameter optimization process
print(f"Number of finished trials: {len(study.trials)}")
trial = study.best_trial
print(f"Best trial: {trial.value}")

print("Params: ")
for key, value in trial.params.items():
  print(f"  {key}: {value}")

print("User Attributes: ")
for key, value in trial.user_attrs.items():
  print(f"  {key}: {value}")


# Write report
study.trials_dataframe().to_csv(f"study_result_{ALGO_NAME}_{ENV_ID}.csv")

# Show plot
fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()


[I 2025-06-12 09:35:47,739] A new study created in memory with name: no-name-813ee568-ff50-443a-8067-43acc1197b30

As shared layers in the mlp_extractor are removed since SB3 v1.8.0, you should now pass directly a dictionary and not a list (net_arch=dict(pi=..., vf=...) instead of net_arch=[dict(pi=..., vf=...)])

[I 2025-06-12 09:36:19,525] Trial 0 finished with value: 9.2 and parameters: {'one_minus_gamma': 0.012924707618275728, 'max_grad_norm': 0.34796692788707256, 'exponent_n_steps': 5, 'learning_rate': 0.02729744524579956, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 0 with value: 9.2.
[I 2025-06-12 09:36:49,605] Trial 1 finished with value: 9.2 and parameters: {'one_minus_gamma': 0.038723746654246494, 'max_grad_norm': 0.31201984520850595, 'exponent_n_steps': 5, 'learning_rate': 0.9681078693342838, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 9.2.
[I 2025-06-12 09:37:18,442] Trial 2 finished with value: 9.3 and parameters: {'one_minus_g

Number of finished trials: 29
Best trial: 500.0
Params: 
  one_minus_gamma: 0.008262752774391171
  max_grad_norm: 0.4988993250029
  exponent_n_steps: 5
  learning_rate: 0.0009036800602866176
  net_arch: small
  activation_fn: tanh
User Attributes: 
  gamma: 0.9917372472256089
  n_steps: 32
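
The printed `Params` are the raw values Optuna sampled (`one_minus_gamma`, `exponent_n_steps`), not the values the model receives. The sketch below maps the best trial reported above back to model keyword arguments. `to_a2c_kwargs` is a hypothetical helper that mirrors `sample_a2c_params`; it keeps the activation function as a name and passes `net_arch` as a plain dict, following the deprecation warning in the output.

```python
# Raw parameters of the best trial, copied from the output above
best_params = {
    "one_minus_gamma": 0.008262752774391171,
    "max_grad_norm": 0.4988993250029,
    "exponent_n_steps": 5,
    "learning_rate": 0.0009036800602866176,
    "net_arch": "small",
    "activation_fn": "tanh",
}

def to_a2c_kwargs(params):
    """Map raw Optuna parameters to the values sample_a2c_params derives from them."""
    layers = [64] if params["net_arch"] == "tiny" else [64, 64]
    return {
        "n_steps": 2 ** params["exponent_n_steps"],
        "gamma": 1 - params["one_minus_gamma"],
        "learning_rate": params["learning_rate"],
        "max_grad_norm": params["max_grad_norm"],
        "policy_kwargs": {
            "net_arch": {"pi": list(layers), "vf": list(layers)},
            # Name only: map to nn.Tanh / nn.ReLU before building a model
            "activation_fn": params["activation_fn"],
        },
    }

kwargs = to_a2c_kwargs(best_params)
print(kwargs["n_steps"])  # 32
print(kwargs["gamma"])
```

The derived `n_steps` and `gamma` agree with the `User Attributes` printed above, which is exactly what `trial.set_user_attr` is used for.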


# Practice: Tune a PPO agent that plays LunarLander-v2

## Step 1: Create a config

In [None]:

                                          ###########################
                                          # Step 1: Create a config #
                                          ###########################

# Hyperparameter optimization loop
N_TRIALS = 100 # Maximum number of trials in the hyperparameter optimization loop
N_JOBS = 1 # Number of trials to run in parallel
N_STARTUP_TRIALS = 5 # Number of initial trials that use random sampling instead of the sampler's algorithm (to build the initial history)
TIMEOUT = int(60*15) # Maximum duration (in seconds) of the entire optimization loop

# Evaluation parameters for each set of hyperparameters
N_TIMESTEPS = int(1e4) # Training budget: number of timesteps in one full trial
N_EVALUATIONS = 2 # Number of evaluations performed during one full trial
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS) # Number of timesteps between two evaluations
N_EVAL_EPISODES = 10 # Number of episodes run for each evaluation


# Environment parameters
N_EVAL_ENVS = 16 # Number of environments used in parallel during evaluation

ENV_ID = "LunarLander-v2" # Environment ID, as passed to gym.make()

# Algorithm parameter
ALGO_NAME = "PPO"

## Step 2: Define the search space

In [None]:

                            #############################################################
                            # Step 2: Define a function that samples the hyperparameter #
                            #############################################################

def sample_ppo_params(trial):

  """
  Sample a set of hyperparameters to be tried.

  Args:
    trial (optuna.Trial): An Optuna trial object.

  Returns:
    params (dict): The sampled hyperparameters. The keys match the keyword arguments used to define the model.
  """

  # See https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
  # for the hyperparameters that can be tuned.

  ###################
  # Discount factor #
  ###################
  # suggest_float -> sample from a continuous (float) range
  # "one_minus_gamma" -> parameter name (shown in the final plots)
  # log = True -> sample in log space
  gamma = 1 - trial.suggest_float("one_minus_gamma", 0.0001, 0.1, log = True)

  # Store the actual gamma value as a user attribute
  trial.set_user_attr("gamma", gamma)

  #######################################
  # Maximum value for gradient clipping #
  #######################################
  max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 0.5, log=True)

  ##########################################################
  # Number of steps to run for each environment per update #
  ##########################################################
  n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 11)

  # Store the actual n_steps value as a user attribute
  trial.set_user_attr("n_steps", n_steps)

  #################
  # Learning rate #
  #################
  learning_rate = trial.suggest_float("learning_rate", 1e-5, 1, log=True)

  ########################
  # Network architecture #
  ########################
  net_arch = trial.suggest_categorical("net_arch", ["tiny", "small"])

  # Legacy form: the dict is wrapped in a list. Since SB3 v1.8.0 the dict can be passed directly
  # (net_arch=dict(pi=..., vf=...)); the list form still runs but triggers a deprecation warning.
  net_arch = [{"pi": [64], "vf": [64]} if net_arch == 'tiny' \
              else {"pi" : [64, 64], "vf" : [64, 64]}]

  #######################
  # Activation function #
  #######################
  activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

  activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]



  # Note: The keys of this dictionary must match the keyword arguments used to define the model,
  # so the naming convention must be followed.
  params = {"n_steps": n_steps,
          "gamma": gamma,
          "learning_rate": learning_rate,
          "max_grad_norm": max_grad_norm,
          "policy_kwargs": {"net_arch": net_arch,
                            "activation_fn": activation_fn}}

  return params

## Step 3: Define objective functions

In [None]:


                                ##############################
                                # Step 3: Objective function #
                                ##############################
                                # 3.1: Callback
class TrialEvalCallback(EvalCallback):

  """
  Callback used for evaluating and reporting a trial.

  Args:
    eval_env (VecEnv or gym.Env): An evaluation environment.
    trial (optuna.Trial): An Optuna trial object.
    n_eval_episodes (int): Number of evaluation episodes for each evaluation.
    eval_freq (int): Number of callback calls between two evaluations in a trial.
    deterministic (bool): Whether the evaluation uses the deterministic (True) or the stochastic (False) policy.
    verbose (int): Verbosity level passed on to EvalCallback (0 for no output).

  Note:
    `_on_step` returns False to stop training once the trial has been pruned, and True otherwise.
  """

  def __init__(self, eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose = 0):

    super().__init__(eval_env = eval_env, n_eval_episodes = n_eval_episodes,
                    eval_freq = eval_freq, deterministic = deterministic,
                    verbose = verbose)
    self.trial = trial
    self.eval_idx = 0
    self.is_pruned = False

  def _on_step(self):
    if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
      super()._on_step()
      self.eval_idx += 1

      # Send report to optuna
      self.trial.report(self.last_mean_reward, self.eval_idx)

      # Prune trial if needed
      if self.trial.should_prune():
        self.is_pruned = True
        return False
    return True

                            # 3.2: The definition of the objective score.
def objective(trial):

  """
  A function that returns the objective score used to judge the quality of a set of hyperparameters.

  Args:
    trial (optuna.Trial): An Optuna trial object.

  Returns:
    objective_score (float): The score that represents the quality of this set of hyperparameters.
  """

  # Create the default keyword arguments (those not defined in the hyperparameter sampling function)
  kwargs = {"policy": "MlpPolicy",
            "env": ENV_ID}

  # Update with the inclusion of the sampled hyperparameter
  kwargs.update(sample_ppo_params(trial))

  # Create a model using the sampled hyperparameter
  model = PPO(**kwargs)

  # Create the environments
  eval_envs = make_vec_env(env_id = ENV_ID,
                           n_envs = N_EVAL_ENVS)

  # Create the call back for reporting evaluation results
  eval_callback = TrialEvalCallback(eval_env = eval_envs,
                                    trial = trial,
                                    n_eval_episodes = N_EVAL_EPISODES,
                                    eval_freq = EVAL_FREQ,
                                    deterministic = True)

  nan_encountered = False
  try:
    model.learn(N_TIMESTEPS, callback = eval_callback)
  except AssertionError as e:
    # Sometimes, randomly sampled hyperparameters can lead to NaN
    print(e)
    nan_encountered = True
  finally:
    # At the end of training, or if an error is encountered,
    # free memory
    model.env.close()
    eval_envs.close()

  # Inform the optimizer that an invalid set of hyperparameters was sampled
  if nan_encountered:
    return float('nan')

  if eval_callback.is_pruned:
    raise optuna.exceptions.TrialPruned()

  return eval_callback.last_mean_reward

## Step 4: Hyperparameter Optimization Loop

In [None]:

                            #############################
                            # Step 4: Optimization Loop #
                            #############################

# Set PyTorch num threads to 1 for faster training
# Parallel environments already make heavy use of the CPU,
# so this line limits PyTorch to a single CPU thread.
th.set_num_threads(1)

# Select a sampler
# https://optuna.readthedocs.io/en/stable/reference/samplers/generated/optuna.samplers.TPESampler.html
# n_startup_trials -> Number of initial trials that sample hyperparameters randomly instead of using the TPE algorithm.
# This builds the initial history that the sampler learns from.
sampler = TPESampler(n_startup_trials = N_STARTUP_TRIALS)

# Select a scheduler / pruner
# https://optuna.readthedocs.io/en/stable/reference/generated/optuna.pruners.MedianPruner.html
# n_startup_trials -> Pruning is disabled for this many initial trials, while the initial history is built.
# n_warmup_steps -> Number of reported steps at the start of each trial during which pruning is disabled.
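# The callback reports the evaluation index (1, 2, ...) as the step, so the warmup is
# counted in evaluations. A value in timesteps (N_TIMESTEPS // 3) would never be reached.
pruner = MedianPruner(n_startup_trials = N_STARTUP_TRIALS,
                      n_warmup_steps = N_EVALUATIONS // 3)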

# Create a study for Hyperparameter Optimization
# https://optuna.readthedocs.io/en/stable/reference/generated/optuna.create_study.html
study = optuna.create_study(sampler = sampler,
                            pruner = pruner,
                            direction = "maximize")

try:
  # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study.optimize
  study.optimize(objective,
                 n_trials = N_TRIALS,
                 timeout = TIMEOUT,
                 n_jobs = N_JOBS)
except KeyboardInterrupt:
  pass

# Print the meta info for the hyperparameter optimization process
print(f"Number of finished trials: {len(study.trials)}")
trial = study.best_trial
print(f"Best trial: {trial.value}")

print("Params: ")
for key, value in trial.params.items():
  print(f"  {key}: {value}")

print("User Attributes: ")
for key, value in trial.user_attrs.items():
  print(f"  {key}: {value}")


# Write report
study.trials_dataframe().to_csv(f"study_result_{ALGO_NAME}_{ENV_ID}.csv")

# Show plot
fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()




[I 2025-06-12 10:33:00,511] A new study created in memory with name: no-name-fd0bfab8-64ec-40c1-9241-687385342959

As shared layers in the mlp_extractor are removed since SB3 v1.8.0, you should now pass directly a dictionary and not a list (net_arch=dict(pi=..., vf=...) instead of net_arch=[dict(pi=..., vf=...)])

[I 2025-06-12 10:33:32,344] Trial 0 finished with value: -161.2994264 and parameters: {'one_minus_gamma': 0.0013394327939557347, 'max_grad_norm': 0.37610299115911333, 'exponent_n_steps': 7, 'learning_rate': 0.0015267266493195163, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 0 with value: -161.2994264.

You have specified a mini-batch size of 64, but because the `RolloutBuffer` is of size `n_steps * n_envs = 32`, after every 0 untruncated mini-batches, there will be a truncated mini-batch of size 32
We recommend using a `batch_size` that is a factor of `n_steps * n_envs`.
Info: (n_steps=32 and n_envs=1)

[I 2025-06-12 10:34:04,336] Trial 1 finished with value: 

Number of finished trials: 33
Best trial: -88.5928761
Params: 
  one_minus_gamma: 0.0003189053701602661
  max_grad_norm: 0.3794603696820127
  exponent_n_steps: 9
  learning_rate: 0.013170246935540999
  net_arch: tiny
  activation_fn: tanh
User Attributes: 
  gamma: 0.9996810946298398
  n_steps: 512
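
The mini-batch warning in the output above appears because the search space samples `n_steps` but leaves PPO's `batch_size` untouched. The check below replays the arithmetic from the warning for every exponent in the search space. It assumes the mini-batch size of 64 and `n_envs=1` quoted in the warning; `truncated_minibatch` is a hypothetical helper, not an SB3 function.

```python
BATCH_SIZE = 64  # mini-batch size quoted in the warning
N_ENVS = 1       # number of training environments quoted in the warning

def truncated_minibatch(n_steps, n_envs=N_ENVS, batch_size=BATCH_SIZE):
    """Return the size of the leftover mini-batch, or 0 if the buffer splits evenly."""
    return (n_steps * n_envs) % batch_size

# Exponents 3..11 are the ones sampled by sample_ppo_params
warning_exponents = []
for exponent in range(3, 12):
    n_steps = 2 ** exponent
    leftover = truncated_minibatch(n_steps)
    if leftover:
        warning_exponents.append(exponent)
    print(f"n_steps={n_steps:5d}  leftover mini-batch size={leftover}")

print(warning_exponents)  # [3, 4, 5]
```

Only `n_steps` of 8, 16 and 32 leave a truncated mini-batch. One possible remedy, offered as a suggestion only, is to raise the lower exponent bound to 6 or to sample `batch_size` together with `n_steps`.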


# Practice: Tune a DQN agent that plays an Atari game - Space Invaders

## Install Libraries

In [3]:
!pip install git+https://github.com/DLR-RM/rl-baselines3-zoo

Collecting git+https://github.com/DLR-RM/rl-baselines3-zoo
  Cloning https://github.com/DLR-RM/rl-baselines3-zoo to /tmp/pip-req-build-cf5dfh_t
  Running command git clone --filter=blob:none --quiet https://github.com/DLR-RM/rl-baselines3-zoo /tmp/pip-req-build-cf5dfh_t
  Resolved https://github.com/DLR-RM/rl-baselines3-zoo to commit 577616cb9f13341579953cb0f6111e007acc0a1d
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sb3_contrib<3.0,>=2.6.1a1 (from rl_zoo3==2.6.1a1)
  Downloading sb3_contrib-2.6.1a1-py3-none-any.whl.metadata (4.1 kB)
Collecting gymnasium<1.2.0,>=0.29.1 (from rl_zoo3==2.6.1a1)
  Downloading gymnasium-1.1.1-py3-none-any.whl.metadata (9.4 kB)
Collecting huggingface_sb3<4.0,>=3.0 (from rl_zoo3==2.6.1a1)
  Downloading huggingface_sb3-3.0-py3-none-any.whl.metadata (6.3 kB)
Collec

In [4]:
!pip install gymnasium[atari]
!pip install gymnasium[accept-rom_license]



## Create a config file

* Save the config file as `dqn.yml`.

* Note: The config file mostly serves as a placeholder. During hyperparameter optimization, the hyperparameters being tuned are sampled anew for each trial and override the values given here.

* For the type and range of the sampled hyperparameter values, see: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/rl_zoo3/hyperparams_opt.py#L222

In [None]:
# The default config file
SpaceInvadersNoFrameskip-v4:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4 # Stack 4 consecutive frames into 1 input so the model can learn the trajectories of objects
  policy: 'CnnPolicy'
  n_timesteps: !!float 1e2 # 1e6 is recommended; shortened here because this notebook is only a demo
  buffer_size: 100000
  learning_rate: !!float 1e-4
  batch_size: 32
  learning_starts: 100000
  target_update_interval: 1000
  train_freq: 4
  gradient_steps: 1
  exploration_fraction: 0.1
  exploration_final_eps: 0.01
  # If True, you need to deactivate handle_timeout_termination
  # in the replay_buffer_kwargs
  optimize_memory_usage: False

## Access the API for hyperparameter optimization (Built-in)

* Refer to the source code for usage details: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/rl_zoo3/train.py

* The full command is as follows:
`!python -m rl_zoo3.train --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ -c dqn.yml -optimize --optimization-log-path logs/optimization --eval-episodes 10 --n-eval-envs 1 --max-total-trials 100  --n-jobs 1 --sampler "tpe" --pruner "median" --n-startup-trials 10 --n-evaluations 2`
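
The command is long, so it can help to see it as a grouped argument list. The sketch below only rebuilds the same command string with the standard library and does not run anything. The flags and values are copied verbatim from the command above; the grouping comments are an interpretation added here.

```python
import shlex

args = [
    "python", "-m", "rl_zoo3.train",
    # What to train
    "--algo", "dqn",
    "--env", "SpaceInvadersNoFrameskip-v4",
    # Where to log and which config file to load
    "-f", "logs/",
    "-c", "dqn.yml",
    # Switch on hyperparameter optimization
    "-optimize",
    "--optimization-log-path", "logs/optimization",
    # Evaluation settings
    "--eval-episodes", "10",
    "--n-eval-envs", "1",
    "--n-evaluations", "2",
    # Optimization loop settings
    "--max-total-trials", "100",
    "--n-jobs", "1",
    "--sampler", "tpe",
    "--pruner", "median",
    "--n-startup-trials", "10",
]

# Join into a single shell-safe command string
command = shlex.join(args)
print(command)
```

The evaluation and optimization flags correspond to the constants used in the earlier sections: `--eval-episodes` to `N_EVAL_EPISODES`, `--n-eval-envs` to `N_EVAL_ENVS`, `--n-evaluations` to `N_EVALUATIONS`, `--max-total-trials` to `N_TRIALS`, `--n-jobs` to `N_JOBS` and `--n-startup-trials` to `N_STARTUP_TRIALS`.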

In [7]:
!python -m rl_zoo3.train --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ -c dqn.yml -optimize --optimization-log-path logs/optimization --eval-episodes 10 --n-eval-envs 1 --max-total-trials 100  --n-jobs 1 --sampler "tpe" --pruner "median" --n-startup-trials 10 --n-evaluations 2

2025-06-13 02:13:47.398080: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1749780827.659864    2022 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749780827.730596    2022 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-13 02:13:48.294532: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Seed: 3490938613
Loading hyperparameters from: dqn.yml
Default hyperparameters for environment (ones being tuned will be over

# Post Notes: How to apply this to a PyTorch model?

* A good reference link: https://www.geeksforgeeks.org/hyperparameter-tuning-with-optuna-in-pytorch/

* The main difference lies in how the objective function is defined and how the intermediate and/or final evaluation results are manually reported back to Optuna.

* The following code was generated by ChatGPT to show how to do both. (Note: it has not been tested yet.) Because this objective returns a validation loss, the study should be created with `direction="minimize"`.

In [None]:
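import optuna
import torch

# NOTE: MyModel, num_epochs, train_loader and val_loader are placeholders
# that must be defined before this function can run.
def objective(trial):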
    # Sample hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_units = trial.suggest_int("n_units", 16, 128)

    # Model
    model = MyModel(n_units)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(x_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x_val, y_val in val_loader:
                output = model(x_val)
                val_loss += criterion(output, y_val).item()

        val_loss /= len(val_loader)

        # Report intermediate result to Optuna
        trial.report(val_loss, epoch)

        # Check whether to prune
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_loss  # Final score
