Dear DVC folk,
Motivation
You mention it yourself in your documentation: fully versioned hyperparameter optimization comes to mind when using DVC.
A Little Research
I did some quick research, and it quickly becomes apparent that this needs a DVC-specific implementation.
All the existing hyperparameter optimizers, like Python's hyperopt,
- assume their own hyperparameter API for how the hyperoptimization orchestration process communicates with the single algorithm
- and distribute the computation using their individual distribution machinery
Suggestions on how to integrate with DVC
It seems to me the following is needed for hyperparameter optimization to be a natural addition to DVC:
- each triggered hyperoptimization orchestration should have its own git branch subfolder
- each single hyperoptimization run should have its own sub-branch under that subfolder
- a file-based hyperparameter API, probably based on JSON
  - i.e. the hyperparameter configurations should be stored in a file format
  - and, in addition, also the chosen parameters for a concrete run
  - everything I found so far either passes the hyperparameters Python-internally as arguments to a function, or on the command line as arguments to a script... so there is no convention to copy, but in any case it is just a dictionary with values
  - using a common JSON format would enable easy tracking/comparing of parameters across hyperoptimization git branches, similar to how `dvc metrics` already works
  - and the final run could easily be written as a .dvc routine itself by calling `dvc repro`
- it would be unbelievably awesome not to reinvent the wheel entirely, but to provide wrappers around existing hyperoptimization packages like hyperopt or SMAC or others.
  The basic idea is simple: instead of running a concrete algorithm with the specific framework, you run a wrapper which
  - checks out a new hyperoptimization branch
  - grabs the hyperparameters from the framework-specific API (e.g. as command-line args) and writes them into the new JSON file format
  - runs `dvc repro myalgorithm.dvc` on a previously specified routine `myalgorithm.dvc`
  - commits everything on the branch
  - somehow finds out the winner of the hyperoptimization, creates a specific branch for it, and commits everything nicely
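The wrapper steps above can be sketched in a few lines of stdlib Python. This is only a sketch under the conventions proposed here: `write_params` and `plan_trial` are hypothetical names, and the exact git/dvc invocations are assumptions, not DVC API:

```python
import json
from pathlib import Path


def write_params(params, path="params.json"):
    """File-based hyperparameter API: one plain JSON dict per run (sketch)."""
    Path(path).write_text(json.dumps(params, indent=2, sort_keys=True))


def plan_trial(branch, stage="myalgorithm.dvc"):
    """Return the commands one hyperoptimization run would execute.

    Returning the command lists (instead of calling subprocess) keeps this
    inspectable; a real wrapper would pass each one to subprocess.run().
    """
    return [
        ["git", "checkout", "-b", branch],  # one sub-branch per run
        ["dvc", "repro", stage],            # reproduce the single-run pipeline
        ["git", "add", "-A"],
        ["git", "commit", "-m", f"hyperopt run {branch}"],
    ]
```

The point of the split is that the parameter file is the only contract between the orchestrator and the algorithm; everything else is plain git/dvc plumbing.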
Wrapping existing optimization frameworks has several advantages:
- less code to maintain, and only against stable APIs
- a monitoring web UI and other tooling for evaluating or live-inspecting the hyperoptimization may already be available
- the community could contribute new wrappers
Of course more details will pop up while actually implementing this, e.g. how to integrate hyperoptimization with .dvc pipeline files as neatly as possible (for instance, we may want to commit both the single `run.dvc` and a `hyperopt.dvc` to the same repository -- these need to interact seamlessly).
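To make the interaction concrete, here is a minimal random-search loop standing in for a framework like hyperopt or SMAC. All names are hypothetical; the `evaluate` callback is the seam where a real wrapper would check out a run branch, run `dvc repro` on the single-run pipeline, and read the metric file back:

```python
import json
import random
from pathlib import Path


def random_search(space, evaluate, max_evals=10, seed=0, workdir="."):
    """Minimal stand-in for an external optimizer (sketch).

    space    -- dict mapping parameter names to lists of candidate values
    evaluate -- callback playing the role of `dvc repro` + metric readout
    Returns (best_params, best_loss).
    """
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for i in range(max_evals):
        params = {name: rng.choice(values) for name, values in space.items()}
        # persist the chosen parameters -- the file-based hyperparameter API
        Path(workdir, f"params_run{i}.json").write_text(json.dumps(params))
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    # a real wrapper would now create the winner branch and commit it
    return best_params, best_loss
```

Because each trial is materialized as a params file before `evaluate` runs, every run is reproducible from its branch alone, independent of the optimizer's in-memory state.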
What do you think about this suggested approach?