Black Box Optimization Challenge

This repo contains the starter kit for the black box optimization challenge at NeurIPS 2020. Upload submissions here.

The submission site is open July 10, 2020 - October 15, 2020. We will be open early for practice submissions in the beta-testing phase, but the leaderboard will be reset on July 10.

The benchmark site is powered by Valohai and runs the Bayesmark package, which evaluates black-box optimization algorithms on real-world objective functions. It includes tuning (validation set) performance of standard machine learning models on real data sets. Currently, all of the problems are based on ML hyper-parameter tuning tasks. There are example problems in the starter kit that can be run locally, but the leaderboard problems are secret.

The leaderboard is determined using the optimization performance on held-out (hidden) objective functions, where the optimizer must run without human intervention.

Look in example_submissions to see examples of submissions. The examples currently contain the sub-directories:


Bayesmark output


Bayesian optimization is a popular sample-efficient approach for derivative-free optimization of objective functions that take several minutes or hours to evaluate. Bayesian optimization builds a surrogate model (often a Gaussian process) for the objective function that provides a measure of uncertainty. Using this surrogate model, an acquisition function is used to determine the most promising point to evaluate next.
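As a rough illustration of the last step, the sketch below scores candidate points with a lower confidence bound, which is one common acquisition function for minimization. The challenge does not prescribe any particular acquisition function, and the names `lower_confidence_bound`, `pick_next`, and `kappa` are illustrative assumptions. The surrogate's predicted means and standard deviations are assumed to be given.

```python
def lower_confidence_bound(mean, std, kappa=2.0):
    """Optimistic estimate of the objective for a minimization problem.

    A larger kappa (an assumed tuning knob) puts more weight on uncertainty,
    which favors exploration over exploitation.
    """
    return mean - kappa * std


def pick_next(candidates, kappa=2.0):
    """Return the candidate point with the smallest lower confidence bound.

    candidates : list of (point, predicted_mean, predicted_std) tuples,
        where the predictions would come from a surrogate model.
    """
    best = min(candidates, key=lambda c: lower_confidence_bound(c[1], c[2], kappa))
    return best[0]


if __name__ == "__main__":
    # Hypothetical surrogate predictions for three candidate points.
    candidates = [("a", 1.0, 0.1), ("b", 1.2, 0.5), ("c", 0.9, 0.0)]
    print(pick_next(candidates))           # uncertainty makes "b" attractive
    print(pick_next(candidates, kappa=0))  # pure exploitation picks lowest mean
```

With `kappa=0` the rule ignores uncertainty entirely, which is why the second call selects the point with the lowest predicted mean.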

Bayesian optimization has many applications, with hyperparameter tuning of machine learning models (e.g., deep neural networks) being one of the most popular applications. However, the choice of surrogate model and acquisition function are both problem-dependent and the goal of this challenge is to compare different approaches over a large number of different problems. This challenge focuses on the application of Bayesian optimization to tuning the hyper-parameters of machine learning models.

Instructions for local experiments

Local experimentation/benchmarking on publicly available problems can be done using Bayesmark, which is extensively documented. Note, these are not the same problems as on the leaderboard when submitting on the website, but may be useful for local iteration.

For convenience the script run_local can do local benchmarking in a single command:

> ./ ./example_submissions/pysot 3
Final score `100 x (1-loss)` for leaderboard:
pysot_0.2.3_8cae841    81.11426

It produces a lot of log output as it runs; that is normal. The first argument gives the folder of the optimizer to run, while the second argument gives the number of repeated trials for each problem. Set the number of repeated trials as large as possible within your computational budget. For finer-grained control over which experiments to run, use the Bayesmark commands directly. See the documentation for details.

More explanation on how the mean score formula works can be found here.
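The printed formula `100 x (1-loss)` can be restated in a couple of lines. This is only a sketch of the arithmetic shown in the output above; how the loss itself is normalized across problems is not covered here, and the function names are illustrative.

```python
def leaderboard_score(loss):
    """Apply the `100 x (1-loss)` formula printed by the local run."""
    return 100.0 * (1.0 - loss)


def loss_from_score(score):
    """Invert the formula to recover the loss behind a printed score."""
    return 1.0 - score / 100.0


if __name__ == "__main__":
    # The pysot example run above reported a score of 81.11426.
    print(round(loss_from_score(81.11426), 7))
```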

Instructions for submissions on the website

The quick-start instructions work as follows:

  • Place all the necessary Python files to execute the optimizer in a folder, for example, example_submissions/pysot. The site will use as the entry-point.
  • The Python environment will be Python 3.6.10 and contain all dependencies in environment.txt. All other dependencies must be placed in a requirements.txt in the same folder as

The submission can be prepared using the prepare_upload script:

> ./ ./example_submissions/pysot
For scoring, upload at address:

This will produce a zip file (e.g., that can be uploaded at the submission site. See, ./ ./example_submissions/opentuner for an example with a requirements file.

The submissions will be evaluated inside a docker container that has no internet connection. Therefore, the prepare_upload script places all dependencies from requirements.txt as wheel files inside the zip file.

The zip file can also manually be constructed by zipping the Python files (including and all necessary wheel/tar balls for installing extra dependencies. Note: the Python file should be at the top level of the zip file (and not inside a parent folder).
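A minimal sketch of building such a zip by hand with Python's standard `zipfile` module follows. The point it illustrates is the `arcname` argument, which stores each file at the top level of the archive rather than under its parent folder. The helper name `build_submission_zip` and the file names in the demo are hypothetical placeholders, not names used by the challenge.

```python
import os
import tempfile
import zipfile


def build_submission_zip(source_dir, zip_path):
    """Zip every regular file in source_dir at the top level of the archive.

    Returns the list of names stored in the archive so it can be inspected.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(source_dir)):
            full_path = os.path.join(source_dir, name)
            if os.path.isfile(full_path):
                # arcname=name drops the parent folder from the stored path.
                archive.write(full_path, arcname=name)
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "my_optimizer")  # hypothetical folder
        os.mkdir(src)
        # Hypothetical placeholder files standing in for a real submission.
        for fname in ("my_entry_point.py", "requirements.txt"):
            with open(os.path.join(src, fname), "w") as f:
                f.write("# placeholder\n")
        print(build_submission_zip(src, os.path.join(tmp, "upload.zip")))
```

Writing the zip outside `source_dir`, as the demo does, avoids the archive trying to include itself.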

Also note, our optimization problems have been randomly split into a set that will be used to determine the leaderboard, and another set that will determine the final winner once submissions are closed (October 15, 2020).

Controlling install order

The docker startup script does not necessarily install a dependency and its second-order dependency in the desired order, which can lead to errors on the submission website. To control the install order, bundle a file called install_order.txt at the top level of the zip file. In this file, list all of the wheels/tar-balls, one per line, in the order you want them installed. For example, an install_order.txt with:


will install BoTorch first and then Ax.

Debugging execution issues locally

It is possible to mock the startup install script run on the server locally using our docker image. For example, to debug the startup for locally, one can run:

> docker pull valohai/bbochallenge:20200821-57e60f9
20200821-57e60f9: Pulling from valohai/bbochallenge
d6ff36c9ec48: Pull complete
c958d65b3090: Pull complete
> docker run -it --network none valohai/bbochallenge:20200821-57e60f9 /bin/bash
root@90d3e94c5156:/# mkdir -p /valohai/inputs/optimizer
> CONTAINER=90d3e94c5156
> docker cp ./ $CONTAINER:/valohai/inputs/optimizer/
root@90d3e94c5156:/# ls /valohai/inputs/optimizer
root@90d3e94c5156:/# python /blackbox/ -i /valohai/inputs/optimizer -o /blackbox/optimizer
Bayesmark.version 01090c64b92bd5f7e138d9f80b37d4b171462ce4
Processing /blackbox/optimizer/
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.6/site-packages (from turbo==0.0.1) (1.18.5)
Requirement already satisfied: torch>=1.3.0 in /usr/local/lib/python3.6/site-packages (from turbo==0.0.1) (1.5.0)
Requirement already satisfied: gpytorch>=0.3.6 in /usr/local/lib/python3.6/site-packages (from turbo==0.0.1) (1.1.1)
Requirement already satisfied: future in /usr/local/lib/python3.6/site-packages (from torch>=1.3.0->turbo==0.0.1) (0.18.2)
Building wheels for collected packages: turbo
  Building wheel for turbo ( ... done
  Created wheel for turbo: filename=turbo-0.0.1-py3-none-any.whl size=12382 sha256=d9c90fbfe571a426ff0c82f164fd6c5a95462bc2ca219a816f1c62e30199b4fe
  Stored in directory: /root/.cache/pip/wheels/59/83/8f/7ce3dd69f1e81e4aa8adbb82a7f1b8f89473c82df67d83d675
Successfully built turbo
Installing collected packages: turbo
Successfully installed turbo-0.0.1

Commands starting with > are run on the local terminal, while those starting with root@ should be run in the docker. Any issues involved in installing dependencies should appear in the docker when calling

Note, the --network none flag is used to disable network access, which makes the docker behave like the docker used in evaluation.

Time limits

The optimizer has a total of 640 seconds of compute time for making suggestions on each problem (16 iterations with a batch size of 8), or 40 seconds per iteration. Optimizers exceeding the time limit will be cut off from making further suggestions, and the best optimum found before being killed will be used. The optimizer evaluation terminates after 16 iterations (batch size 8) or 640 seconds, whichever happens earlier. There is no way to get more iterations by being faster.
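One way to stay inside these limits is to track the time spent in the optimizer and switch to a cheaper strategy when the budget runs low. The sketch below is an assumption about how a participant might do this, not part of the challenge tooling: the `TimeBudget` class, its `safety_margin` parameter, and the idea of timing only the wrapped blocks are all illustrative. The text above does not say whether time spent in observe counts toward the limit, so wrapping both calls is the conservative choice.

```python
import time

N_ITERATIONS = 16
BATCH_SIZE = 8
TOTAL_SECONDS = 640.0

PER_ITERATION_SECONDS = TOTAL_SECONDS / N_ITERATIONS  # 40 seconds
TOTAL_EVALUATIONS = N_ITERATIONS * BATCH_SIZE  # 128 function evaluations


class TimeBudget:
    """Accumulate wall-clock time spent inside `with budget:` blocks."""

    def __init__(self, total_seconds=TOTAL_SECONDS, safety_margin=0.9,
                 clock=time.monotonic):
        # safety_margin is an assumed cushion so the hard limit is never hit.
        self.limit = total_seconds * safety_margin
        self.spent = 0.0
        self._clock = clock
        self._start = None

    def __enter__(self):
        self._start = self._clock()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.spent += self._clock() - self._start
        return False  # never swallow exceptions

    def remaining(self):
        """Seconds left before the (margin-adjusted) limit is reached."""
        return max(0.0, self.limit - self.spent)

    def can_afford(self, estimated_seconds):
        """True if a step of the given estimated cost still fits the budget."""
        return estimated_seconds <= self.remaining()


if __name__ == "__main__":
    budget = TimeBudget()
    with budget:
        time.sleep(0.01)  # stand-in for an expensive model fit
    print(budget.can_afford(PER_ITERATION_SECONDS))
```

Inside `suggest`, a check such as `budget.can_afford(last_fit_seconds)` could decide between refitting a surrogate model and falling back to random suggestions.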

Participant teams will be limited to one submission per day (we are temporarily allowing five per day).

Execution environment

The docker environment for scoring on the website has two CPU cores and no GPUs. It runs in Debian GNU/Linux 10 (buster) with Python 3.6.10 and the pre-installed packages in environment.txt. Participants have until July 31 to suggest new packages be added to environment.txt in the docker. The environment in the docker can be produced locally by creating a new Python 3.6.10 virtual environment and running:

pip install -r environment.txt

It consumes ~1.6GB of disk space to create the virtual environment.

Non-PyPI dependencies

The prepare_upload script will only fetch the wheels for packages available on PyPI. If a package is not available on PyPI, you can use a public git repo instead; see the turbo example.

If a package is only available in a private git repo, you can include the wheel in the optimizer folder manually. Wheels can be built using the command:

python3 setup.py bdist_wheel

as documented here.

Optimizer API

All optimization problems are minimization. Lower values of the black-box function are better.

Optimizer submissions should follow this template, for a suggest-observe interface, in your

from bayesmark.abstract_optimizer import AbstractOptimizer
from bayesmark.experiment import experiment_main


class NewOptimizerName(AbstractOptimizer):
    primary_import = None  # Optional, used for tracking the version of optimizer used

    def __init__(self, api_config):
        """Build wrapper class to use optimizer in benchmark.

        Parameters
        ----------
        api_config : dict-like of dict-like
            Configuration of the optimization variables. See API description.
        """
        AbstractOptimizer.__init__(self, api_config)
        # Do whatever other setup is needed
        # ...

    def suggest(self, n_suggestions=1):
        """Get suggestions from the optimizer.

        Parameters
        ----------
        n_suggestions : int
            Desired number of parallel suggestions in the output

        Returns
        -------
        next_guess : list of dict
            List of `n_suggestions` suggestions to evaluate the objective
            function. Each suggestion is a dictionary where each key
            corresponds to a parameter being optimized.
        """
        # Do whatever is needed to get the parallel guesses
        # ...
        return next_guess

    def observe(self, X, y):
        """Feed an observation back.

        Parameters
        ----------
        X : list of dict-like
            Places where the objective function has already been evaluated.
            Each suggestion is a dictionary where each key corresponds to a
            parameter being optimized.
        y : array-like, shape (n,)
            Corresponding values where objective has been evaluated
        """
        # Update the model with new objective function observations
        # ...
        # No return statement needed


if __name__ == "__main__":
    # This is the entry point for experiments, so pass the class to experiment_main to use this optimizer.
    # This statement must be included in the wrapper class file:
    experiment_main(NewOptimizerName)
You can replace NewOptimizerName with the name of your optimizer.

More details on the API can be found here. Note: do not specify kwargs in a config.json for the challenge because the online evaluation will not pass any kwargs or use the config.json.

Configuration space

The search space is defined in the api_config dictionary passed to the constructor of the optimizer (see template above). For example, if we are optimizing the hyper-parameters for the scikit-learn neural network with ADAM (sklearn.neural_network.MLPClassifier), then we could use the following configuration for api_config:

api_config = \
    {'hidden_layer_sizes': {'type': 'int', 'space': 'linear', 'range': (50, 200)},
     'alpha': {'type': 'real', 'space': 'log', 'range': (1e-5, 1e1)},
     'batch_size': {'type': 'int', 'space': 'linear', 'range': (10, 250)},
     'learning_rate_init': {'type': 'real', 'space': 'log', 'range': (1e-5, 1e-1)},
     'tol': {'type': 'real', 'space': 'log', 'range': (1e-5, 1e-1)},
     'validation_fraction': {'type': 'real', 'space': 'logit', 'range': (0.1, 0.9)},
     'beta_1': {'type': 'real', 'space': 'logit', 'range': (0.5, 0.99)},
     'beta_2': {'type': 'real', 'space': 'logit', 'range': (0.9, 1.0 - 1e-6)},
     'epsilon': {'type': 'real', 'space': 'log', 'range': (1e-9, 1e-6)}}

Each key in api_config is a variable to optimize and its description is itself a dictionary with the following entries:

  • type: {'real', 'int', 'cat', 'bool'}
  • space: {'linear', 'log', 'logit', 'bilog'}
  • values: array-like of same data type as type to specify a whitelist of guesses
  • range: (lower, upper) tuple to specify a range of allowed guesses

One can also make the following assumption on the configurations:

  • space will only be specified for types int and real
  • range will only be specified for types int and real
  • We will not specify both range and values
  • bool does not take anything extra (space, range, or values)
  • The values for cat will be strings

For observe, X is a (length n) list of dictionaries with places where the objective function has already been evaluated. Each suggestion is a dictionary where each key corresponds to a parameter being optimized. Likewise, y is a length n list of floats of corresponding objective values. The observations y can take on inf values if the objective function crashes; however, they should never be nan.
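Many surrogate models cannot be fit to infinite targets, so an optimizer may want to replace inf observations with a finite penalty inside observe. The sketch below shows one possible heuristic, not a recommendation from the challenge organizers: the function name `sanitize_observations` and the penalty rule (worst finite value plus the observed spread, at least 1.0) are arbitrary assumptions.

```python
import math


def sanitize_observations(y, history=()):
    """Replace +inf entries of y with a finite penalty value.

    y : list of float, new objective values (lower is better)
    history : previously seen objective values, used to scale the penalty
    """
    if any(math.isnan(v) for v in y):
        raise ValueError("nan observations are not expected")
    finite = [v for v in list(history) + list(y) if math.isfinite(v)]
    if finite:
        worst, best = max(finite), min(finite)
        # Assumed heuristic: clearly worse than anything seen so far.
        penalty = worst + max(worst - best, 1.0)
    else:
        penalty = 1.0  # arbitrary fallback when nothing finite was observed
    return [penalty if v == math.inf else v for v in y]


if __name__ == "__main__":
    print(sanitize_observations([0.5, math.inf, 2.0]))
```

Because the penalty is recomputed from the values seen so far, earlier penalties may need to be refreshed if much worse finite values appear later.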

For suggest, n_suggestions is simply the desired number of parallel suggestions for each guess. Also, next_guess will be a length n_suggestions array of dictionaries of guesses, in the same format as X.
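Before submitting, it can help to check locally that the output of suggest has this shape. The following sketch is a hypothetical helper, not part of Bayesmark: `validate_suggestions` returns a list of human-readable problems and only checks length, keys, ranges, and whitelisted values.

```python
def validate_suggestions(next_guess, api_config, n_suggestions):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if len(next_guess) != n_suggestions:
        problems.append(
            "expected %d suggestions, got %d" % (n_suggestions, len(next_guess))
        )
    for i, guess in enumerate(next_guess):
        missing = sorted(set(api_config) - set(guess))
        extra = sorted(set(guess) - set(api_config))
        if missing:
            problems.append("suggestion %d is missing keys %s" % (i, missing))
        if extra:
            problems.append("suggestion %d has unknown keys %s" % (i, extra))
        for name, spec in sorted(api_config.items()):
            if name not in guess:
                continue
            value = guess[name]
            if "range" in spec:
                lower, upper = spec["range"]
                if not lower <= value <= upper:
                    problems.append(
                        "suggestion %d: %s=%r outside [%r, %r]"
                        % (i, name, value, lower, upper)
                    )
            elif "values" in spec and value not in spec["values"]:
                problems.append(
                    "suggestion %d: %s=%r not in allowed values" % (i, name, value)
                )
    return problems


if __name__ == "__main__":
    demo_config = {
        "batch_size": {"type": "int", "space": "linear", "range": (10, 250)},
    }
    print(validate_suggestions([{"batch_size": 500}], demo_config, 1))
```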


  • Submission site open to public: July 10
  • Deadline for final submissions: October 15
  • Announcement of winners: November 15
  • Presentations, short papers, and code release from winners due: December 1
  • Competition track (virtual) award ceremony: December 2020

Terms and conditions

The terms and conditions for the challenge: coming soon.


Any questions can be sent to

