In [None]:
import torch
import pandas as pd
from minerva.bayesopt import BayesianOptimisation 

## Notebook 2: Running a batch-constrained Bayesian optimisation benchmark on a virtual benchmark dataset

In [None]:
# set dtype and device, use GPU if available
tkwargs = {
    "dtype": torch.double,
    "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
}

**This tutorial shows how to run a constrained Bayesian optimisation benchmark on the emulated virtual benchmark datasets from the manuscript. Each dataset is a table of concatenated descriptor representations and objective values. The workflow closely follows `1_benchmark_baseline.ipynb`.**
- Constrained benchmarks restrict the number of unique values that a specific feature may take within each batch, e.g. only 2 unique temperatures per batch.
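
To make the constraint concrete, the following minimal sketch checks whether a batch respects a cap on the number of unique values of one feature. The helper `satisfies_unique_value_constraint` and the toy batch are hypothetical illustrations and are not part of `minerva`.

```python
from typing import Sequence


def satisfies_unique_value_constraint(
    batch: Sequence[Sequence[float]],
    feature_index: int,
    n_unique_features: int,
) -> bool:
    """Return True if the feature at feature_index takes at most
    n_unique_features distinct values across the batch."""
    unique_values = {row[feature_index] for row in batch}
    return len(unique_values) <= n_unique_features


# Toy batch of four experiments; column 1 is a hypothetical temperature feature.
batch = [
    [0.5, 80.0, 1.0],
    [0.7, 80.0, 2.0],
    [0.2, 100.0, 1.5],
    [0.9, 100.0, 0.5],
]

# Column 1 holds two distinct temperatures, so a cap of 2 is respected.
print(satisfies_unique_value_constraint(batch, feature_index=1, n_unique_features=2))
# Column 0 holds four distinct values, so the same cap is violated.
print(satisfies_unique_value_constraint(batch, feature_index=0, n_unique_features=2))
```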



First, as in `1_benchmark_baseline.ipynb`, we show an example of a benchmark dataset.
- The benchmark dataset is a table of rows containing the featurised representation of the reaction conditions and their corresponding objective values, in this case yield and turnover
- The input features and target objectives are **not assumed to have undergone any scalarisation**
- No other columns besides the input features and objective columns are assumed to be present
- **Maximisation is assumed**, so minimisation objectives will have to be adjusted to their negative values
- In this case, the reaction conditions consist of the choice of ligand, which is one-hot encoded, and three continuous variables: residence time, reaction temperature, and catalyst loading
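
As a rough sketch of how a custom table could be brought into this layout, the snippet below negates a minimisation objective and separates feature columns from objective columns. The toy dataframe and all of its column names (including the `cost` objective) are hypothetical and do not come from the benchmark files.

```python
import pandas as pd

# Toy table with hypothetical column names; real benchmark files will differ.
toy = pd.DataFrame(
    {
        "ligand_A": [1, 0, 1],
        "ligand_B": [0, 1, 0],
        "temperature": [80.0, 100.0, 120.0],
        "yield": [45.0, 72.0, 60.0],
        "cost": [3.0, 5.0, 4.0],  # lower is better, so it must be negated
    }
)

objective_columns = ["yield", "cost"]
minimise_columns = ["cost"]

# Flip the sign so that maximising the stored value minimises the original quantity.
toy[minimise_columns] = -toy[minimise_columns]

# Every column that is not an objective is treated as an input feature.
feature_columns = [c for c in toy.columns if c not in objective_columns]

print(feature_columns)
print(toy[objective_columns])
```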

In [None]:
suzuki_i_benchmark = pd.read_csv('../benchmark_datasets/olympus_suzuki/suzuki_i.csv', index_col=0)
suzuki_i_benchmark

# define index of temperature column in the dataframe
temperature_index = 8
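
A hard-coded index is fragile if the columns are ever reordered. As a sketch, the index can instead be looked up by name. This assumes the feature tensor keeps the non-objective columns in dataframe order, which is an assumption about the preprocessing rather than something confirmed here. The helper `feature_index_of` and the column names are hypothetical.

```python
from typing import List


def feature_index_of(
    columns: List[str], objective_columns: List[str], feature_name: str
) -> int:
    """Position of feature_name among the non-objective columns, in original order."""
    feature_columns = [c for c in columns if c not in objective_columns]
    return feature_columns.index(feature_name)


# Hypothetical layout: seven one-hot ligand columns, three continuous
# variables, then the two objective columns.
columns = [f"ligand_{i}" for i in range(7)] + [
    "residence_time",
    "temperature",
    "catalyst_loading",
    "yield",
    "turnover",
]

# For this hypothetical layout the temperature column sits at position 8.
print(feature_index_of(columns, ["yield", "turnover"], "temperature"))
```

With a real dataframe, the first argument would be `list(df.columns)`.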

We then define the settings for the Bayesian optimisation benchmark.

| Arguments | Explanation |
| --- | --- |
| `seed` (int) | Random seed, set using PyTorch Lightning |
| `objective_columns` (List[str]) | List of strings denoting objective columns in the benchmark dataset |
| `benchmark_path` (str) | File path to read the benchmark dataset file, example shown above <br> The virtual benchmark datasets used in the manuscript are included in this repository under minerva/benchmark_datasets |
| `init_samples` (int) | Number of quasi-random Sobol samples to initialise the optimisation campaign as initial training data |
| `batch_size` (int) | Number of experiments to suggest in parallel per iteration |
| `n_iterations` (int) | Number of iterations to run Bayesian optimisation for, excluding the quasi-random initialisation |
| `search_strategy` (str) | Acquisition function to use. Available choices are `qNEHVI`, `qNParEgo`, and `TS-HVI` |
| `constrain_strategy` (str) | Strategy used to constrain the number of unique values of a feature (e.g. temperature) per batch. <br> Choices implemented are `nested` and `naive`. For a full explanation we refer the user to our manuscript. |
| `n_unique_features` (int) | Number of unique values of the constrained feature (temperature in this case) allowed per batch. |
| `feature_index` (int) | Index of the constrained feature, temperature, in the training input features **tensor**. <br> This depends on your benchmark dataset and must be changed. We have added the feature indexes for our existing datasets |
| `kernel` (str) | Kernel hyperparameters for the Matérn-kernel Gaussian process. <br> Available choices are `default` and `edboplus` |
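
The string-valued options and the experiment budget can be sanity-checked before a long run. The sketch below is a hypothetical helper, not part of `minerva`. It only encodes the choices listed in the table and the fact that `n_iterations` excludes the initialisation.

```python
# Allowed values copied from the table above.
ALLOWED = {
    "search_strategy": {"qNEHVI", "qNParEgo", "TS-HVI"},
    "constrain_strategy": {"nested", "naive"},
    "kernel": {"default", "edboplus"},
}


def check_settings(**settings: str) -> None:
    """Raise ValueError if a string-valued setting is not one of the listed choices."""
    for name, value in settings.items():
        if value not in ALLOWED[name]:
            raise ValueError(
                f"{name}={value!r} is not one of {sorted(ALLOWED[name])}"
            )


def total_experiments(init_samples: int, batch_size: int, n_iterations: int) -> int:
    """Experiments evaluated in a campaign: initial Sobol samples plus one batch per iteration."""
    return init_samples + batch_size * n_iterations


check_settings(search_strategy="qNEHVI", constrain_strategy="nested", kernel="edboplus")

# 24 initial samples plus 4 batches of 24 gives 24 + 24 * 4 = 120 experiments.
print(total_experiments(init_samples=24, batch_size=24, n_iterations=4))
```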


In [None]:
seed = 1 
objective_columns = ['yield', 'turnover'] # defining objective columns in the dataframe to be read and optimised in the benchmark dataframe
benchmark_path = '../benchmark_datasets/olympus_suzuki/suzuki_i.csv' 

init_samples = 24 
batch_size = 24 
n_iterations = 4 # iterations of BO in addition to initialisation we would like to run, total 24 + 24*4 = 120 in this case
search_strategy = 'qNEHVI' 
constrain_strategy = 'nested' # strategy to use for constraining features, can be chosen from ['nested', 'naive']
n_unique_features = 2 # number of unique values of the constrained feature (e.g. temperature) allowed per batch
kernel = 'edboplus' # choose kernel hyperparameters used in edboplus

# get the feature index at which the constrained temperature is in the training data, this is dependent on dataset
benchmark = 'olympus_suzuki'
if benchmark == 'olympus_suzuki':
    feature_index = 8 # this is equal to temperature_index
elif benchmark == 'edbo_arylation':
    feature_index = 1
# you will need to set your own feature index for your custom datasets

In [None]:
# initialisation of BO object
Benchmark = BayesianOptimisation(device=tkwargs, seed=seed)
Benchmark.load_and_preprocess_benchmark(data_df_path=benchmark_path, objective_columns=objective_columns)

**Run the Bayesian optimisation**
- Output metrics are displayed as hypervolume per iteration, from iteration 0 (the Sobol initialisation) to iteration n of the Bayesian optimisation
- The reference maximum hypervolume of the existing dataset is also displayed
- IGD+ and IGD metrics are also calculated
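
For intuition about these metrics, here is a minimal pure-Python sketch for two maximised objectives. It is not the implementation used by `minerva`, whose reference point and any scaling may differ. The function names and toy points are hypothetical. IGD averages, over a reference Pareto front, the Euclidean distance to the nearest obtained point, while IGD+ only counts the components in which the obtained point falls short of the reference point.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by points and bounded below by ref (both objectives maximised)."""
    area, best_y = 0.0, ref[1]
    # Sweep from the largest first objective downwards; each point adds the
    # strip of area that lies above everything counted so far.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if x > ref[0] and y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def igd(reference_front: Sequence[Point], obtained: Sequence[Point]) -> float:
    """Mean Euclidean distance from each reference point to its nearest obtained point."""
    return sum(
        min(math.dist(r, a) for a in obtained) for r in reference_front
    ) / len(reference_front)


def igd_plus(reference_front: Sequence[Point], obtained: Sequence[Point]) -> float:
    """Like IGD, but only shortfalls relative to the reference point count (maximisation)."""
    def d_plus(r: Point, a: Point) -> float:
        return math.sqrt(sum(max(ri - ai, 0.0) ** 2 for ri, ai in zip(r, a)))

    return sum(
        min(d_plus(r, a) for a in obtained) for r in reference_front
    ) / len(reference_front)


# Toy data: a three-point reference front and a campaign that found only one of its points.
reference_front: List[Point] = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
obtained: List[Point] = [(2.0, 2.0)]

print(hypervolume_2d(reference_front, ref=(0.0, 0.0)))  # area under the full front
print(hypervolume_2d(obtained, ref=(0.0, 0.0)))         # area found by the campaign
print(igd(reference_front, obtained))
print(igd_plus(reference_front, obtained))
```

Higher hypervolume is better, while lower IGD and IGD+ are better. Both distances reach zero when every reference point has been found.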

In [None]:
results = Benchmark.run_constrained_benchmark(
    init_samples=init_samples,
    batch_size=batch_size,
    n_iterations=n_iterations,
    search_strategy=search_strategy,
    constrain_strategy=constrain_strategy,
    n_unique_features=n_unique_features,
    feature_index=feature_index,
    kernel=kernel,
)

In [None]:
print(results)