[MRG+1] Scikit - Optimize based GridSearchCV plug-in replacement #405
Conversation
Codecov Report
@@ Coverage Diff @@
## master #405 +/- ##
==========================================
+ Coverage 85.76% 86.37% +0.61%
==========================================
Files 21 22 +1
Lines 1440 1556 +116
==========================================
+ Hits 1235 1344 +109
- Misses 205 212 +7
Continue to review full report at Codecov.
|
Also let me know if you do not like the location of the class (cuttently |
As of right now PR contains working Implementation of |
Thanks for this! Code, example, and tests all in one!
I'm not mad keen on the name. I'd make it available at the top level like |
Looks like we need to be smarter about our tests so that they can run in less time. |
skopt/wrappers/search_cv.py
Outdated
import sklearn.model_selection._search as skms | ||
|
||
import numpy as np | ||
from collections import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no stars please :)
skopt/tests/test_searchcv.py
Outdated
|
||
# Extract available surrogates, so that new ones are used automatically | ||
available_surrogates = [ | ||
getattr(sol, name) for name in sol.__all__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather be explicit and write down the full list of surrogates, instead of relying on some black magic that could easily break down as we add/update things.
skopt/tests/test_searchcv.py
Outdated
""" | ||
Tests whether the cross validation search wrapper around sklearn | ||
models runs properly with available surrogates and with single | ||
or multiple workers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation issue
skopt/wrappers/__init__.py
Outdated
@@ -0,0 +1 @@ | |||
from .search_cv import SkoptSearchCV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing line break
skopt/wrappers/search_cv.py
Outdated
dimensions = [params_space[k] for k in sorted(params_space.keys())] | ||
|
||
if self.surrogate == "auto": | ||
surrogate = GaussianProcessRegressor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that a good kernel?
I think it would be nice to solve #338 to reuse a sensible default here.
skopt/wrappers/search_cv.py
Outdated
|
||
from sklearn.base import clone | ||
from sklearn.externals.joblib import Parallel, delayed, cpu_count | ||
import sklearn.model_selection._search as skms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we find a better name than skms
?
skopt/tests/test_searchcv.py
Outdated
# Extract available surrogates, so that new ones are used automatically | ||
available_surrogates = [ | ||
getattr(sol, name) for name in sol.__all__ | ||
if "GradientBoostingQuantileRegressor" not in name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use this model?
skopt/wrappers/search_cv.py
Outdated
number of evaluations set to self.n_iter. Alternatively, if | ||
a list of (dict, int > 0) is given, the search is done for | ||
every search space for number of iterations given as a second | ||
element of tuple. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means if I give a list of three dictionaries there will be a total of 3*n_iter
iterations? What is scikit-learn's behaviour wrt this?
I thought sklearn treats the situation of having a list of dict
s as having an extra implied categorical dimension.
We should try and be as much plug'n'play as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for GridSearchCV
says that if the list of dicts is provided, then grid spanned by every dict is explored sequentially. Hence I do the same in here.
skopt/wrappers/search_cv.py
Outdated
|
||
# this dict is used in order to keep track of skopt Optimizer | ||
# instances for different search spaces (str(space) is used as key) | ||
self.optimizer = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> _optimizer
? It is 'private' to the internal process no?
skopt/wrappers/search_cv.py
Outdated
dimensions = [params_space[k] for k in sorted(params_space.keys())] | ||
|
||
if self.surrogate == "auto": | ||
surrogate = GaussianProcessRegressor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should solve #400 (comment) or we need to add the same kernel setup and space conversion as is in gp_minimize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup I use now strings for surrogate for Optimizer which cooks me estimators like a chef 😜
skopt/wrappers/search_cv.py
Outdated
|
||
# if tuple: (dict: search space, int: n_iter) | ||
if isinstance(elem, tuple): | ||
psp, n_iter = elem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
psp
-> search_space
(I think) in general we should use proper words for the variable names instead of abbreviations
skopt/wrappers/search_cv.py
Outdated
psp, n_iter = elem, self.n_iter | ||
else: | ||
raise ValueError("Unsupported type of parameter space. " | ||
"Expected tuple or dict, got " + str(elem)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switch to "got %s." % (elem)
skopt/wrappers/search_cv.py
Outdated
"Expected tuple or dict, got " + str(elem)) | ||
|
||
# do the optimization for particular search space | ||
while n_iter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool(-3) == True
so this won't stop if n_iter
isn't dividable by n_jobs
. I think we should check if there are enough iterations left to sample n_jobs
more points, if not reduce it to the number left and then stop the loop. (and add a unit test to check this works)
skopt/wrappers/search_cv.py
Outdated
} | ||
return params_dict | ||
|
||
def step(self, X, y, param_space, groups=None, n_jobs=1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have to read the code for RandomSearchCV
but doesn't it have a step function we can re-use/hook into so we don't have to duplicate all the code for recording the results?
Addressed comments by @betatim . Please let me know if there is something else. |
skopt/__init__.py
Outdated
@@ -61,7 +61,7 @@ def f(x): | |||
from .utils import load | |||
from .utils import dump | |||
from .utils import expected_minimum | |||
|
|||
from .searchcv import BayesSearchCV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this in the alphabetical order in the list above?
skopt/searchcv.py
Outdated
import sklearn.model_selection._search as sk_model_sel | ||
|
||
from skopt import Optimizer | ||
from skopt.utils import point_asdict, dimensions_aslist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those should be local imports
`pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 * | ||
n_jobs`. | ||
|
||
See Also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a short Example
section in the docstring? (showing that things are in fact simple, despite the huge list of arguments)
skopt/searchcv.py
Outdated
key = str(param_space) | ||
if key not in self.optimizer_: | ||
self.optimizer_[key] = self._make_optimizer(param_space) | ||
optimizer = self.optimizer_[key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure what is done inside self._make_optimizer
, but we should ensure there is no border effects... Your suggestion is not equivalent to @iaroslav-ai's code, in case self._make_optimizer
changes the state of self
internally.
skopt/searchcv.py
Outdated
Either estimator needs to provide a ``score`` function, | ||
or ``scoring`` must be passed. | ||
|
||
search_spaces : list of dict or tuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is nice, the 90% use case is to search over a single space. I think we should support directly providing a dict. This would also better match with the API of GridSearchCV
.
add example, add test for the multiple search spaces
skopt/searchcv.py
Outdated
"Parameter of space should be an instance of" | ||
"skopt.space.Dimension (Real, Integer, ...)," | ||
"but in subspace %s the dimension %s is %s" % | ||
(subspace, k, v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we instead use skopt.space.check_dimension
? In this way, we would have a consistent API with Optimizer/*_minimize
in terms of how dimensions are specified. (Also, this would allow much less boilerplate code for the users.)
add some exception tests
Change the order of arguments to the function to name , space to make it a bit more readable
This should address all the comments. Will make a second pass over the code later to double check. |
In the notebook, could you update the minimal example in order to use the simplest API? (directly feeding the dict, with dimensions specified as pairs) |
Yup will do so |
Updated! :) |
Thanks! This looks very nice :) One more thing though... in the last part of the notebook, it seems almost nothing is learned. Within the very first iterations, the optimizer reaches a good value and does not improve from there. Have you tried with more iterations to see if things eventually improve? |
I did not try, but my suspicion would be that it will not improve much, as the dataset is not really that complex and is used mainly as a simple example. Maybe I could see what other datasets could be used. |
skopt/tests/test_searchcv.py
Outdated
def constructor(x): BayesSearchCV(*x) | ||
assert_raises( | ||
ValueError, constructor, (SVC(), {'C':'1 ... 100.0'}) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises(ValueError):
BayesSearchCV(args_here)
https://docs.pytest.org/en/latest/assert.html#assertions-about-expected-exceptions for soem more examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahaaa so that is how you do it thx :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to spread the pytest way of life :)
skopt/searchcv.py
Outdated
"Search space should be provided as a dict or list of dict," | ||
"got %s" % search_space) | ||
|
||
def add_spaces(self, spaces, names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure it needs to be public, but if it is it needs docs. Which might motivate making it private :)
point_as_list = [ | ||
point_as_dict[k] for k in sorted(search_space.keys()) | ||
] | ||
return point_as_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a unit test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in test_utils.py
we can add a test that just tries to round trip a dict to a list and back
Feel free to merge this, I'll try to have a look next week and then make some comments if any. Not sure I will have much to say in any case. |
@betatim Was this ok with you? I think this PR is good enough to be merged in :) We can polish things further later. |
🍻 Great work @iaroslav-ai ! Thanks a lot for this, I am pretty sure this will be super useful to many :) |
🍻 and 🍰 ! This is a big new feature! Thanks for the work and patience with the many little comments spread out over days :) I think we can do a bit more work on improving the doc strings etc but let's do that in a new PR (or several). |
Awesome! Did someone do any benchmarks with how this compares to GridSearchCV in particular cases? |
Don't think so. The best we have is https://github.com/scikit-optimize/scikit-optimize/tree/master/benchmarks#ml-classification Would be great if someone could do perform some benchmarks on "realistic" data sets. I once saw some benchmarks in https://github.com/rhiever/tpot maybe we can start from there instead of having to do everything ourselves. Not related to tuning ML algos: https://github.com/sigopt/evalset but maybe faster to execute. |
A class as discussed in #78 under the tentative name of
SkoptSearchCV
(discussion on what might be a better name is welcome :)Minimalist usage example as of right now:
If something is missing in todos let me know.
skopt.wrappers
tosetup.py
RandomizedSearchCV
usingBaseSearchCV
GridSearchCV
sklearn-wrapper
cv_results_