This repository has been archived by the owner on Feb 28, 2024. It is now read-only.

[MRG+1] Scikit-Optimize-based GridSearchCV plug-in replacement #405

Merged
merged 22 commits into from Jul 6, 2017

Conversation

@iaroslav-ai (Member) commented Jun 17, 2017

A class as discussed in #78, under the tentative name SkoptSearchCV (discussion on what might be a better name is welcome :)

Minimalist usage example as of right now:

from sklearn.svm import SVC

from skopt.wrappers import SkoptSearchCV
from skopt.space import Real, Categorical, Integer

# X_train, y_train, X_test, y_test are assumed to be defined already
opt = SkoptSearchCV(
    SVC(),
    [{
        'C': Real(1e-6, 1e+6, prior='log-uniform'),
        'gamma': Real(1e-6, 1e+1, prior='log-uniform'),
        'degree': Integer(1, 8),
        'kernel': Categorical(['linear', 'poly', 'rbf']),
    }],
    n_jobs=1, n_iter=32,
)

opt.fit(X_train, y_train)
print(opt.score(X_test, y_test))

If something is missing from the todos, let me know.

  • add skopt.wrappers to setup.py
  • implement basic wrapper similar to RandomizedSearchCV using BaseSearchCV
  • support multiple search spaces as list of dicts, similar to GridSearchCV
  • support parallel search
  • add draft of example usage in sklearn-wrapper
  • add all necessary data to cv_results_
  • add tests
  • fix the docstrings and add comments
  • review and improve example usage

@iaroslav-ai iaroslav-ai added this to the 0.4 milestone Jun 17, 2017
@codecov-io commented Jun 17, 2017

Codecov Report

Merging #405 into master will increase coverage by 0.61%.
The diff coverage is 94.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #405      +/-   ##
==========================================
+ Coverage   85.76%   86.37%   +0.61%     
==========================================
  Files          21       22       +1     
  Lines        1440     1556     +116     
==========================================
+ Hits         1235     1344     +109     
- Misses        205      212       +7
Impacted Files Coverage Δ
skopt/utils.py 98.05% <100%> (+0.18%) ⬆️
skopt/__init__.py 100% <100%> (ø) ⬆️
skopt/searchcv.py 93.39% <93.39%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e95b043...a57c21d.

@iaroslav-ai (Member Author):

Also let me know if you do not like the location of the class (currently skopt.wrappers).

@iaroslav-ai changed the title from "[WIP] Scikit-Optimize-based GridSearchCV plug-in replacement" to "[MRG] Scikit-Optimize-based GridSearchCV plug-in replacement" on Jun 18, 2017
@iaroslav-ai (Member Author):

As of right now, the PR contains a working implementation of *SearchCV according to #78 and scikit-learn/scikit-learn#5491.
Could someone review? I am not sure who would be best to ask, as all of you seemed pretty active in the referenced issue and PR :)

@betatim (Member) commented Jun 19, 2017

Thanks for this! Code, example, and tests all in one!

Also let me know if you do not like the location of the class (currently skopt.wrappers)

I'm not mad keen on the name. I'd make it available at the top level like *_minimize. But then skopt.SkoptSearchCV is a bit redundant ("doppelt gemoppelt"). SMBOSearchCV is cryptic... I will think a bit and see what others have used. I think once we know a good name we can use it for the class and replace wrappers with it as well.

@betatim (Member) commented Jun 19, 2017

Looks like we need to be smarter about our tests so that they can run in less time.

import sklearn.model_selection._search as skms

import numpy as np
from collections import *
Review comment (Member):

no stars please :)


# Extract available surrogates, so that new ones are used automatically
available_surrogates = [
getattr(sol, name) for name in sol.__all__
Review comment (Member):

I would rather be explicit and write down the full list of surrogates, instead of relying on some black magic that could easily break down as we add/update things.

"""
Tests whether the cross validation search wrapper around sklearn
models runs properly with available surrogates and with single
or multiple workers.
Review comment (Member):

indentation issue

@@ -0,0 +1 @@
from .search_cv import SkoptSearchCV
Review comment (Member):

missing line break

dimensions = [params_space[k] for k in sorted(params_space.keys())]

if self.surrogate == "auto":
surrogate = GaussianProcessRegressor()
Review comment (Member):

Is that a good kernel?
I think it would be nice to solve #338 to reuse a sensible default here.


from sklearn.base import clone
from sklearn.externals.joblib import Parallel, delayed, cpu_count
import sklearn.model_selection._search as skms
Review comment (Member):

can we find a better name than skms?

# Extract available surrogates, so that new ones are used automatically
available_surrogates = [
getattr(sol, name) for name in sol.__all__
if "GradientBoostingQuantileRegressor" not in name
Review comment (Member):

Why can't we use this model?

number of evaluations set to self.n_iter. Alternatively, if
a list of (dict, int > 0) is given, the search is done for
every search space for number of iterations given as a second
element of tuple.
Review comment (Member):

This means if I give a list of three dictionaries there will be a total of 3*n_iter iterations? What is scikit-learn's behaviour wrt this?

I thought sklearn treats the situation of having a list of dicts as having an extra implied categorical dimension.

We should try and be as much plug'n'play as possible.

Reply (Member Author):

The documentation for GridSearchCV says that if a list of dicts is provided, the grid spanned by each dict is explored sequentially. Hence I do the same here.
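A minimal sketch of how such a list of search spaces could be normalized into sequential (space, budget) pairs, mirroring the behaviour described above (the helper name is hypothetical, not the PR's code):

```python
def normalize_search_spaces(search_spaces, default_n_iter):
    """Turn each entry into a (space_dict, n_iter) pair.

    Entries may be plain dicts (which get the default budget) or
    (dict, n_iter) tuples carrying a per-space budget. The spaces
    are then searched one after another, as in GridSearchCV.
    """
    normalized = []
    for elem in search_spaces:
        if isinstance(elem, tuple):
            space, n_iter = elem
        elif isinstance(elem, dict):
            space, n_iter = elem, default_n_iter
        else:
            raise ValueError(
                "Unsupported type of search space entry: %r" % (elem,))
        normalized.append((space, n_iter))
    return normalized
```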


# this dict is used in order to keep track of skopt Optimizer
# instances for different search spaces (str(space) is used as key)
self.optimizer = {}
Review comment (Member):

-> _optimizer? It is 'private' to the internal process no?

dimensions = [params_space[k] for k in sorted(params_space.keys())]

if self.surrogate == "auto":
surrogate = GaussianProcessRegressor()
Review comment (Member):

We should solve #400 (comment) or we need to add the same kernel setup and space conversion as is in gp_minimize

Reply (Member Author):

Yup, I now use strings for the surrogate passed to Optimizer, which cooks up estimators for me like a chef 😜


# if tuple: (dict: search space, int: n_iter)
if isinstance(elem, tuple):
psp, n_iter = elem
Review comment (Member):

psp -> search_space (I think). In general we should use proper words for variable names instead of abbreviations.

psp, n_iter = elem, self.n_iter
else:
raise ValueError("Unsupported type of parameter space. "
"Expected tuple or dict, got " + str(elem))
Review comment (Member):

switch to "got %s." % (elem)

"Expected tuple or dict, got " + str(elem))

# do the optimization for particular search space
while n_iter:
Review comment (Member):

bool(-3) == True, so this won't stop if n_iter isn't divisible by n_jobs. I think we should check whether there are enough iterations left to sample n_jobs more points; if not, reduce it to the number left and then stop the loop (and add a unit test to check this works).

}
return params_dict

def step(self, X, y, param_space, groups=None, n_jobs=1):
Review comment (Member):

I'd have to read the code for RandomizedSearchCV, but doesn't it have a step function we can re-use/hook into so we don't have to duplicate all the code for recording the results?

@iaroslav-ai (Member Author):

Addressed comments by @betatim. Please let me know if there is anything else.

@@ -61,7 +61,7 @@ def f(x):
from .utils import load
from .utils import dump
from .utils import expected_minimum

from .searchcv import BayesSearchCV
Review comment (Member):

Can you put this in the alphabetical order in the list above?

import sklearn.model_selection._search as sk_model_sel

from skopt import Optimizer
from skopt.utils import point_asdict, dimensions_aslist
Review comment (Member):

those should be local imports

`pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 *
n_jobs`.

See Also
Review comment (Member):

Could we add a short Example section in the docstring? (showing that things are in fact simple, despite the huge list of arguments)

key = str(param_space)
if key not in self.optimizer_:
self.optimizer_[key] = self._make_optimizer(param_space)
optimizer = self.optimizer_[key]
Review comment (Member):

Hmm, not sure what is done inside self._make_optimizer, but we should ensure there are no side effects... Your suggestion is not equivalent to @iaroslav-ai's code, in case self._make_optimizer changes the state of self internally.

Either estimator needs to provide a ``score`` function,
or ``scoring`` must be passed.

search_spaces : list of dict or tuple
Review comment (Member):

While this is nice, the 90% use case is to search over a single space. I think we should support directly providing a dict. This would also better match the API of GridSearchCV.
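Internally, this single-dict convenience could be handled by a small normalization step; a sketch with a hypothetical helper name, not the merged implementation:

```python
def as_list_of_spaces(search_spaces):
    """Accept either a single search-space dict (the common case)
    or a list of them, always returning a list of spaces."""
    if isinstance(search_spaces, dict):
        return [search_spaces]
    if isinstance(search_spaces, list):
        return search_spaces
    raise ValueError(
        "Search space should be provided as a dict or list of dicts, "
        "got %s" % search_spaces)
```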

add example, add test for the multiple search spaces
"Parameter of space should be an instance of"
"skopt.space.Dimension (Real, Integer, ...),"
"but in subspace %s the dimension %s is %s" %
(subspace, k, v)
Review comment (Member):

Could we instead use skopt.space.check_dimension? In this way, we would have a consistent API with Optimizer/*_minimize in terms of how dimensions are specified. (Also, this would allow much less boilerplate code for the users.)

Change the order of arguments to the function to (name, space) to make it a bit more readable
@iaroslav-ai (Member Author):

This should address all the comments. Will make a second pass over the code later to double check.

@glouppe (Member) commented Jul 3, 2017

In the notebook, could you update the minimal example in order to use the simplest API? (directly feeding the dict, with dimensions specified as pairs)

@iaroslav-ai (Member Author):

Yup will do so

@iaroslav-ai (Member Author):

Updated! :)

@glouppe (Member) commented Jul 3, 2017

Thanks! This looks very nice :)

One more thing though... in the last part of the notebook, it seems almost nothing is learned. Within the very first iterations, the optimizer reaches a good value and does not improve from there. Have you tried with more iterations to see if things eventually improve?

@iaroslav-ai (Member Author):

I did not try, but my suspicion would be that it will not improve much, as the dataset is not really that complex and is used mainly as a simple example. Maybe I could see what other datasets could be used.

def constructor(x): BayesSearchCV(*x)
assert_raises(
ValueError, constructor, (SVC(), {'C':'1 ... 100.0'})
)
Review comment (Member):

with pytest.raises(ValueError):
  BayesSearchCV(args_here)

https://docs.pytest.org/en/latest/assert.html#assertions-about-expected-exceptions for some more examples

Reply (Member Author):

ahaaa so that is how you do it thx :)

Review comment (Member):

Trying to spread the pytest way of life :)

"Search space should be provided as a dict or list of dict,"
"got %s" % search_space)

def add_spaces(self, spaces, names):
Review comment (Member):

Not sure it needs to be public, but if it is it needs docs. Which might motivate making it private :)

point_as_list = [
point_as_dict[k] for k in sorted(search_space.keys())
]
return point_as_list
Review comment (Member):

Needs a unit test

Review comment (Member):

Maybe in test_utils.py we can add a test that just tries to round trip a dict to a list and back
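Such a round-trip test could be built on simplified stand-ins like the following (a sketch mirroring the sorted-key convention used by the wrapper; the real skopt.utils helpers also handle skopt Dimension objects):

```python
def point_aslist(search_space, point_as_dict):
    """Order parameter values by sorted dimension name, matching
    how the wrapper builds its dimension list."""
    return [point_as_dict[k] for k in sorted(search_space.keys())]

def point_asdict(search_space, point_as_list):
    """Inverse mapping: pair sorted dimension names with values."""
    return dict(zip(sorted(search_space.keys()), point_as_list))
```

A round trip should then be the identity: point_asdict(space, point_aslist(space, p)) == p.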

@MechCoder (Member):

Feel free to merge this, I'll try to have a look next week and then make some comments if any. Not sure I will have much to say in any case.

@glouppe (Member) commented Jul 6, 2017

@betatim Was this ok with you? I think this PR is good enough to be merged in :) We can polish things further later.

@betatim betatim merged commit 10fde17 into scikit-optimize:master Jul 6, 2017
@glouppe (Member) commented Jul 6, 2017

🍻 Great work @iaroslav-ai ! Thanks a lot for this, I am pretty sure this will be super useful to many :)

@betatim (Member) commented Jul 6, 2017

🍻 and 🍰 !

This is a big new feature! Thanks for the work and patience with the many little comments spread out over days :)

I think we can do a bit more work on improving the doc strings etc but let's do that in a new PR (or several).

@iaroslav-ai (Member Author):

A huge thank you to @glouppe and @betatim for taking your time to review my code! Indeed quite some PR it is :)
I am also testing this internally for some of my things, seems quite useful so far! :)

@amueller (Contributor) commented Jul 6, 2017

Awesome! Did someone do any benchmarks with how this compares to GridSearchCV in particular cases?

@betatim (Member) commented Jul 6, 2017

Don't think so. The best we have is https://github.com/scikit-optimize/scikit-optimize/tree/master/benchmarks#ml-classification

It would be great if someone could perform some benchmarks on "realistic" data sets.

I once saw some benchmarks in https://github.com/rhiever/tpot maybe we can start from there instead of having to do everything ourselves. Not related to tuning ML algos: https://github.com/sigopt/evalset but maybe faster to execute.
