# Hyperparameter Optimization

To optimize the hyperparameters of the inference and summary networks, we use [Optuna](https://www.optuna.org), a framework-agnostic hyperparameter optimization library that searches for the parameter settings that minimize a custom objective function. Our optimization process involves three steps:

1. **Data Simulation** of a large *training data set* that will be used throughout all subsequent training phases as well as a smaller *validation data set* that will be used to assess the outcome metric.

2. **Define the Objective Function and Search Space**, which takes a set of hyperparameters and evaluates the outcome of interest. In our case, the function includes a whole offline training phase on the *training data set* and computes a weighted sum of metrics capturing parameter recovery, simulation-based calibration, and posterior contraction with respect to the *validation data set*. Within the objective function, the search space has to be defined by specifying a range of possible values for each hyperparameter.


3. **Run a predefined number of trials**, during which Optuna iteratively proposes and evaluates hyperparameter sets to minimize the objective function. Each trial involves a whole offline training procedure on the same pre-simulated *training data set*. At the end of each trial, the performance of the given set of hyperparameters is assessed by evaluating the objective function. Based on this score, Optuna updates the parameter proposals for the next trial, which ideally leads to a stepwise minimization of the objective function.

After running all trials, we can identify which parameter configurations yielded the lowest objective value and visualize the optimization results. 

Before starting with these steps, we have to load all libraries:

In [None]:
import sys
sys.path.append("../../BayesFlow")
sys.path.append("../")

import os
import torch 

# optional if you have a GPU:
#print("CUDA available:", torch.cuda.is_available(), flush=True)
#print(torch.cuda.device_count(), flush=True)
#print("Using device:", torch.cuda.get_device_name(0))

if "KERAS_BACKEND" not in os.environ:
    # set this to "torch", "tensorflow", or "jax"
    os.environ["KERAS_BACKEND"] = "torch"

import numpy as np
import pickle

import keras

import optuna

import bayesflow as bf



INFO:bayesflow:Using backend 'torch'
When using torch backend, we need to disable autograd by default to avoid excessive memory usage. Use

with torch.enable_grad():
    ...

in contexts where you need gradients (e.g. custom training loops).


## Data Simulation

As described in [DMC Introduction](./dmc_introduction.ipynb), we set up the simulator to generate response time (RT) and accuracy data using the Diffusion Model for Conflict Tasks (DMC). Make sure to fix the number of trials per data set by setting `fixed_num_obs`; otherwise, the number of trials is drawn at random, although it is still constant across all data sets.

In [None]:
from dmc import DMC, dmc_helpers

simulator = DMC(prior_means=np.array([70.8, 114.71, 0.71, 332.34, 98.36, 43.36]),
                prior_sds=np.array([19.42, 40.08, 0.14, 52.07, 30.05, 9.19]),
                sdr_fixed=None,
                tmax=1500, 
                contamination_probability=None,
                fixed_num_obs=500,
                param_names=("A", "tau", "mu_c", "mu_r", "b", "sd_r"))

We are now able to simulate multiple data sets efficiently by simply running:

In [6]:
test_data = simulator.sample(10)

For the optimization algorithm, we simulated 50,000 training data sets and 1,000 validation data sets. For demonstration purposes, however, we simulate only 100 training data sets and 100 validation data sets here:

In [24]:
train_data = simulator.sample(100)

val_data = simulator.sample(100)

## Defining the Objective Function and Search Space

The core element of the Optuna algorithm is the objective function: a discrepancy measure that the algorithm minimizes over repeated training runs. As outlined in Schaefer et al. (2025), we used a weighted sum of metrics for parameter recovery, simulation-based calibration, and posterior contraction:

\begin{equation}
L = w_{1} \overline{\text{NRMSE}} + w_{2} \overline{\text{ECE}} + w_{3} \overline{\text{CF}}
\end{equation}
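
To make the arithmetic concrete, here is a minimal sketch of such a weighted sum. It assumes that the overline denotes the mean across model parameters; the function `weighted_sum_sketch` and the numbers are hypothetical placeholders, and this is not the implementation of `dmc_helpers.weighted_metric_sum`.

```python
from statistics import mean


def weighted_sum_sketch(nrmse, ece, cf, w1=1.0, w2=1.0, w3=1.0):
    """Average each metric over parameters, then combine the means with weights."""
    return w1 * mean(nrmse.values()) + w2 * mean(ece.values()) + w3 * mean(cf.values())


# Placeholder per-parameter values, NOT results from the study.
nrmse = {"A": 0.20, "tau": 0.30}
ece = {"A": 0.05, "tau": 0.15}
cf = {"A": 0.40, "tau": 0.60}

# With all weights equal to 1, L is the plain sum of the three means.
L = weighted_sum_sketch(nrmse, ece, cf)
print(round(L, 2))
```

Raising one weight relative to the others makes the optimizer prioritize that metric when comparing hyperparameter sets.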

Before defining the objective function, we can define the adapter since it will stay the same throughout all trials:

In [25]:
adapter = (
    bf.adapters.Adapter()
    .convert_dtype("float64", "float32")
    .sqrt("num_obs")
    .concatenate(["A", "tau", "mu_c", "mu_r", "b", 'sd_r'], into="inference_variables")
    .concatenate(["rt", "accuracy", "conditions"], into="summary_variables")
    .standardize(include="inference_variables")
    .rename("num_obs", "inference_conditions")
    )

In earlier training runs, we observed that the loss reaches a relatively low level after about 50 epochs, with only marginal improvements beyond that point. Because the training phase is repeated in every Optuna trial, we chose this number of epochs for the optimization. For this demonstration, the function below defaults to `epochs=10`.

In [29]:
def objective(trial, epochs=10):

    # Define Search Space
    dropout = trial.suggest_float("dropout", 0.01, 0.3)
    initial_learning_rate = trial.suggest_float("lr", 1e-4, 1e-3) 
    num_seeds = trial.suggest_int("num_seeds", 1, 8)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
    embed_dim = trial.suggest_categorical("embed_dim", [64, 128])
    summary_dim = trial.suggest_int("summary_dim", 4, 32)

    # Define inference net 
    inference_net = bf.networks.FlowMatching(coupling_kwargs=dict(subnet_kwargs=dict(dropout=dropout)))

    # Define summary net
    summary_net = bf.networks.SetTransformer(summary_dim=summary_dim, num_seeds=num_seeds, dropout=dropout, embed_dim=(embed_dim, embed_dim))
    
    workflow = bf.BasicWorkflow(
        simulator=simulator,
        adapter=adapter,
        initial_learning_rate=initial_learning_rate,
        inference_network=inference_net,
        summary_network=summary_net,
        checkpoint_filepath= '../data_complete/optuna_checkpoints',
        checkpoint_name= f'network_{round(dropout, 2)}_{round(initial_learning_rate, 2)}_{num_seeds}_{batch_size}_{embed_dim}_{summary_dim}',
        inference_variables=["A", "tau", "mu_c", "mu_r", "b", 'sd_r'])
    
    # Start Offline Training Procedure
    history = workflow.fit_offline(train_data, epochs=epochs, batch_size=batch_size, validation_data=val_data, verbose=0)
    
    # Compute metrics for parameter recovery, simulation-based calibration (SBC), and posterior contraction (PC)
    metrics_table = workflow.compute_default_diagnostics(test_data=val_data)

    # Compute the weighted sum L
    L = dmc_helpers.weighted_metric_sum(metrics_table, weight_recovery=1, weight_pc=1, weight_sbc=1)
    
    return L


We now create a study object:

In [30]:

study = optuna.create_study(direction="minimize")


[I 2025-07-18 10:30:04,645] A new study created in memory with name: no-name-71fe456a-2ee9-4b1f-be3c-32238f2048c2


Now we can run the study over multiple trials. Note that this can be computationally heavy and that the runtime varies between hardware configurations. For the optimization in Schaefer et al. (2025), we ran 50 trials on 50,000 training data sets and 1,000 validation data sets. The computation took about 5 days on an NVIDIA A100 SXM4 80 GB GPU. For demonstration purposes, we run only 10 trials:

In [31]:

study.optimize(objective, n_trials=10)

trial = study.best_trial
print("Outcome Metric: {}".format(trial.value))
print("Best hyperparameters: {}".format(trial.params))



INFO:bayesflow:Fitting on dataset instance of OfflineDataset.
INFO:bayesflow:Building on a test batch.
INFO:bayesflow:Training is now finished.
            You can find the trained approximator at '../data_complete/optuna_checkpoints/network_0.17_0.0_8_32_64_13.network_0.17_0.0_8_32_64_13.keras'.
            To load it, use approximator = keras.saving.load_model(...).
[I 2025-07-18 10:31:46,174] Trial 0 finished with value: 1.5426481802641618 and parameters: {'dropout': 0.17131354078376448, 'lr': 0.00045175605420312196, 'num_seeds': 8, 'batch_size': 32, 'embed_dim': 64, 'summary_dim': 13}. Best is trial 0 with value: 1.5426481802641618.
INFO:bayesflow:Fitting on dataset instance of OfflineDataset.
INFO:bayesflow:Building on a test batch.
INFO:bayesflow:Training is now finished.
            You can find the trained approximator at '../data_complete/optuna_checkpoints/network_0.24_0.0_8_32_64_12.network_0.24_0.0_8_32_64_12.keras'.
            To load it, use approximator = keras.saving.l

Outcome Metric: 1.4344895329407834
Best hyperparameters: {'dropout': 0.26865304748694374, 'lr': 0.0005948312875696228, 'num_seeds': 4, 'batch_size': 16, 'embed_dim': 128, 'summary_dim': 23}
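
The checkpoint names in the log above (for example `network_0.17_0.0_8_32_64_13`) show the learning rate as `0.0`, because `round(initial_learning_rate, 2)` rounds every value in the search range of 1e-4 to 1e-3 down to zero. The sketch below reproduces this with the values from trial 0 and shows one possible alternative, scientific notation, which is a suggestion and not the naming scheme used in this notebook.

```python
# Values taken from trial 0 in the log output above.
dropout = 0.17131354078376448
initial_learning_rate = 0.00045175605420312196

# Rounding to two decimals maps every learning rate in [1e-4, 1e-3] to 0.0.
rounded_name = f"network_{round(dropout, 2)}_{round(initial_learning_rate, 2)}"

# Scientific notation keeps the learning rate distinguishable
# (an alternative naming scheme, not the one used in the notebook).
scientific_name = f"network_{dropout:.2f}_{initial_learning_rate:.1e}"

print(rounded_name)
print(scientific_name)
```

Distinguishable names matter when two trials share all other rounded hyperparameters, since identical names would point to the same checkpoint file.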


After running the study, we can inspect the search space by plotting the objective value against each hyperparameter:

In [32]:
from plotly.io import show

fig = optuna.visualization.plot_slice(study, params=["dropout", "lr", 'num_seeds', 'batch_size', 'embed_dim', 'summary_dim'])
show(fig)