
add hyper-parameter optimization api #2532

@schlichtanders

Description

Dear DVC folk,

Motivation

You mention it yourself in your documentation: fully versioned hyperparameter optimization comes to mind when using DVC.

Little Research

I did some quick research, and it became apparent very quickly that this needs a DVC-specific implementation.

All the existing hyperparameter optimizers, such as Python's hyperopt,

  • assume their own hyperparameter API for how the hyperoptimization orchestration process communicates with the individual algorithm,
  • and distribute the computation using their own distribution machinery.

Suggestions for how to integrate this into DVC

It seems to me the following is needed for hyperparameter optimization to be a natural addition to DVC:

  1. each triggered hyperoptimization orchestration should have its own git branch subfolder

  2. each single hyperoptimization run should have its own subbranch under that subfolder

  3. a file-based hyperparameter API, probably based on JSON

    • i.e. the hyperparameter configurations should be stored in a file format
      • e.g. spearmint uses a custom JSON format, SMAC a completely custom file format
    • and, in addition, also the chosen parameters for a concrete run
      • everything I found so far either passes the hyperparameters Python-internally as arguments to a function, or on the command line as arguments to a script... so there is no convention to copy, but in any case it is just a dictionary with values.
    • using a common JSON format would enable easy tracking/comparing of different parameters across hyperoptimization git branches, similar to how dvc metrics already works.
    • and the final run could easily be written as a .dvc routine itself by calling dvc repro
  4. it would be unbelievably awesome not to reinvent the wheel entirely, but to provide wrappers around existing hyperoptimization packages like hyperopt or SMAC or others

    the principal idea is simple: instead of running a concrete algorithm with the specific framework, you run a wrapper which

    1. checks out a new hyperoptimization branch
    2. grabs the hyperparameters from the framework-specific API (e.g. as command-line args) and writes them into the new JSON file format
    3. runs dvc repro on a previously specified routine myalgorithm.dvc
    4. commits everything on the branch
    5. somehow determines the winner of the hyperoptimization, creates a specific branch for it, and commits everything nicely.

    wrapping existing optimization frameworks has several advantages:

    • less code to maintain, and only against stable APIs
    • monitoring (web UI etc.) for evaluating or live-inspecting the hyperoptimization may already be available
    • the community could contribute new wrappers
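    To make the wrapper idea above concrete, here is a minimal sketch in Python that only *plans* the shell commands one run would execute (the function name `plan_hyperopt_run`, the `hyperopt/run-N` branch naming scheme, and the `params.json` file name are all hypothetical; a real wrapper would execute these commands via subprocess):

    ```python
    import json

    def plan_hyperopt_run(run_id, params, dvc_file="myalgorithm.dvc"):
        """Build the shell commands one hyperoptimization run would execute.

        The wrapper receives hyperparameters from the framework-specific API,
        persists them in a common JSON file, reproduces the pipeline, and
        commits everything on a dedicated sub-branch.
        """
        branch = f"hyperopt/run-{run_id}"  # hypothetical branch naming scheme
        params_json = json.dumps(params, sort_keys=True)
        return [
            f"git checkout -b {branch}",            # 1. new hyperoptimization branch
            f"echo '{params_json}' > params.json",  # 2. write params in the JSON format
            f"dvc repro {dvc_file}",                # 3. run the previously specified routine
            "git add -A",                           # 4. stage all run outputs
            f"git commit -m 'hyperopt run {run_id}'",  # ... and commit them on the branch
        ]

    commands = plan_hyperopt_run(1, {"learning_rate": 0.01, "depth": 5})
    ```

    Keeping the command plan separate from execution also makes the wrapper easy to test without touching git or dvc.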

Of course, more details will pop up while actually implementing this, e.g. how to integrate hyperoptimization with .dvc pipeline files as neatly as possible (for instance, we may want to commit both the single run.dvc and a hyperopt.dvc to the same repository -- these need to interact seamlessly).
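The file-based parameter API from point 3 could be as simple as one flat JSON dictionary per run; a sketch (the file name params.json and the parameter keys are made-up examples, not an existing DVC format):

```python
import json

# One hyperparameter configuration per run, stored as a flat dict
# (the keys below are illustrative, not a fixed schema).
run_a = {"learning_rate": 0.01, "max_depth": 5}
run_b = {"learning_rate": 0.10, "max_depth": 5}

# Serialized form that would live in e.g. params.json on each branch.
params_json = json.dumps(run_a, indent=2, sort_keys=True)

# Comparing two runs (e.g. across hyperopt branches) is then a plain dict diff,
# analogous to how dvc metrics compares metric files across branches:
changed = {k: (run_a[k], run_b[k]) for k in run_a if run_a[k] != run_b[k]}
```

Because the format is a plain dictionary, any existing optimizer's parameter suggestions can be dumped into it without a framework-specific adapter.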

What do you think about this suggested approach?

Metadata

Labels: discussion (requires active participation to reach a conclusion), feature request (requesting a new feature)
