In [None]:
import torch
import pandas as pd
from minerva.bayesopt import BayesianOptimisation 

## Notebook 1: Running a Bayesian optimisation benchmark on a virtual benchmark dataset

In [None]:
# Set the device: use the GPU if available, otherwise fall back to the CPU
tkwargs = {
    "dtype": torch.double,
    "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
}

**This tutorial shows how to run a Bayesian optimisation benchmark on the emulated virtual benchmark datasets described in the manuscript. Each dataset is a table in which the descriptor representation of the reaction conditions is concatenated with the corresponding objective values.**

First, we show an example of an input benchmark dataset:
- The benchmark dataset is a table in which each row contains the featurised representation of one set of reaction conditions and its corresponding objective values, in this case yield and turnover
- The input features and target objectives are **not assumed to have undergone any scalarisation**
- No other columns besides the input features and objective columns are assumed to be present
- **Maximisation is assumed**, so any objective that should be minimised must be negated beforehand
- In this case, the reaction conditions consist of the choice of ligand, which is one-hot encoded, and three continuous variables: residence time, reaction temperature, and catalyst loading
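
As a rough illustration of the format described above, the sketch below builds a toy table with pandas. The column names, ligand labels, values, and the `cost` objective are all hypothetical and are not taken from the benchmark datasets; `cost` is included only to show how a minimisation objective could be negated.

```python
import pandas as pd

# Hypothetical raw reaction data; names and values are made up for illustration.
raw = pd.DataFrame(
    {
        "ligand": ["L1", "L2", "L1", "L3"],
        "residence_time": [60.0, 120.0, 300.0, 600.0],
        "temperature": [30.0, 70.0, 90.0, 110.0],
        "catalyst_loading": [0.5, 1.0, 2.0, 2.5],
        "yield": [12.0, 45.5, 80.1, 63.2],
        "cost": [3.0, 1.5, 4.2, 2.8],  # a quantity we would like to minimise
    }
)

# One-hot encode the categorical ligand choice; continuous variables are kept as-is.
features = pd.get_dummies(
    raw.drop(columns=["yield", "cost"]), columns=["ligand"], dtype=float
)

# Maximisation is assumed, so the minimisation objective is negated.
objectives = pd.DataFrame({"yield": raw["yield"], "neg_cost": -raw["cost"]})

# Concatenate descriptors and objectives into one table with no other columns.
toy_benchmark = pd.concat([features, objectives], axis=1)
print(toy_benchmark)
```

Maximising `neg_cost` is equivalent to minimising `cost`, so the sign only needs to be flipped back when the results are reported.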

In [None]:
suzuki_i_benchmark = pd.read_csv('../benchmark_datasets/olympus_suzuki/suzuki_i.csv', index_col=0)
suzuki_i_benchmark

We then define the settings for the Bayesian optimisation benchmark.

| Arguments | Explanation |
| --- | --- |
| `seed` (int) | Random seed, set using PyTorch Lightning |
| `objective_columns` (List[str]) | List of strings denoting objective columns in the benchmark dataset |
| `benchmark_path` (str) | File path of the benchmark dataset, in the format of the example shown above <br> The virtual benchmark datasets used in the manuscript are included in this repository under minerva/benchmark_datasets |
| `init_samples` (int) | Number of quasi-random Sobol samples used as initial training data <br> for the optimisation campaign |
| `batch_size` (int) | Number of experiments to suggest in parallel per iteration |
| `n_iterations` (int) | Number of iterations to run Bayesian optimisation for, <br> excluding the quasi-random initialisation |
| `search_strategy` (str) | Acquisition function to use.  <br> Available choices are `qNEHVI`, `qNParEgo`, and `TS-HVI` |
| `kernel` (str) | Kernel hyperparameters to use for the Gaussian process with a Matérn kernel. <br> Available choices are `default` and `edboplus` |
| `noise_std` (float) | Standard deviation of the Gaussian noise used to perturb the first objective value. <br> For our benchmark datasets this is yield, so the perturbed values are clamped to `[0, 100]` in this implementation |
| `noise_std_2` (float) | Standard deviation of the Gaussian noise used to perturb the second objective value. <br> For our benchmark datasets this is turnover, which is clamped between `0` and the max value for this work |
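
To make the effect of `noise_std` and `noise_std_2` more concrete, here is a minimal NumPy sketch of perturbing objective values with Gaussian noise and then clamping them. It is not the code used inside `minerva`; the helper `perturb_and_clamp` and the example values are assumptions for illustration only.

```python
import numpy as np

def perturb_and_clamp(values, noise_std, lower, upper, rng):
    """Add zero-mean Gaussian noise to `values`, then clip to [lower, upper]."""
    values = np.asarray(values, dtype=float)
    noisy = values + rng.normal(loc=0.0, scale=noise_std, size=values.shape)
    return np.clip(noisy, lower, upper)

rng = np.random.default_rng(1)
yields = np.array([0.5, 50.0, 99.5])  # hypothetical yields in percent

# With noise, values near the bounds may be pushed outside and are clipped back.
noisy_yields = perturb_and_clamp(yields, noise_std=5.0, lower=0.0, upper=100.0, rng=rng)

# With noise_std = 0 the objective values are returned unchanged.
unchanged = perturb_and_clamp(yields, noise_std=0.0, lower=0.0, upper=100.0, rng=rng)

print(noisy_yields)
print(unchanged)
```

For the second objective, the same helper could be called with `lower=0.0` and `upper` set to the chosen maximum turnover value.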

In [None]:
seed = 1
objective_columns = ['yield', 'turnover'] # objective columns in the benchmark dataframe to be optimised
benchmark_path = '../benchmark_datasets/olympus_suzuki/suzuki_i.csv' # in the same dataframe format as shown above

init_samples = 24
batch_size = 24
n_iterations = 4 # BO iterations to run after initialisation; total experiments = 24 + 24*4 = 120 in this case
search_strategy = 'qNEHVI'
kernel = 'edboplus' # use the kernel hyperparameters from edboplus
noise_std = 0
noise_std_2 = 0

In [None]:
# Initialise the BO object and load the benchmark dataset
Benchmark = BayesianOptimisation(device=tkwargs, seed=seed)
Benchmark.load_and_preprocess_benchmark(data_df_path=benchmark_path, objective_columns=objective_columns)

**Run the Bayesian optimisation**
- Output metrics are displayed as the hypervolume per iteration, from iteration 0 (Sobol initialisation) to iteration n of the Bayesian optimisation
- Reference maximum hypervolumes of the existing dataset are also displayed
- The IGD+ and IGD metrics are also calculated
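
For readers unfamiliar with these metrics, the following pure-Python sketch computes them for two maximised objectives using their textbook definitions. It is only an illustration: the function names, the reference point, and the example points are assumptions, and the implementation inside `minerva` may differ (for example in normalisation or in the choice of reference point).

```python
import math

def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (both objectives maximised)."""
    # Keep only points that improve on the reference point in both objectives.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:  # dominated points add no new area
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def igd(reference_front, obtained):
    """Mean Euclidean distance from each reference point to its nearest obtained point."""
    return sum(min(math.dist(z, a) for a in obtained) for z in reference_front) / len(reference_front)

def igd_plus(reference_front, obtained):
    """Like IGD, but only counts the shortfall of an obtained point (maximisation)."""
    def d_plus(z, a):
        return math.sqrt(sum(max(zi - ai, 0.0) ** 2 for zi, ai in zip(z, a)))
    return sum(min(d_plus(z, a) for a in obtained) for z in reference_front) / len(reference_front)

# Hypothetical observations collected per iteration (iteration 0 = initial samples).
batches = [[(1.0, 1.0)], [(2.0, 0.5), (0.5, 2.0)]]
reference_front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]

observed, hv_per_iteration = [], []
for batch in batches:
    observed.extend(batch)  # hypervolume is computed on all data seen so far
    hv_per_iteration.append(hypervolume_2d(observed, ref=(0.0, 0.0)))

print(hv_per_iteration)
print(igd(reference_front, observed), igd_plus(reference_front, observed))
```

Hypervolume should never decrease from one iteration to the next because it is computed on all observations so far, while lower IGD and IGD+ values indicate that the observed points lie closer to the reference front.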

In [None]:
results = Benchmark.run_baseline_benchmark(
    init_samples=init_samples,
    batch_size=batch_size,
    n_iterations=n_iterations,
    search_strategy=search_strategy,
    kernel=kernel,
    noise_std=noise_std,
    noise_std_2=noise_std_2,
)

In [None]:
print(results)