This repository has been archived by the owner on Feb 28, 2024. It is now read-only.

[MRG+1] Scikit-Optimize-based GridSearchCV plug-in replacement #405

Merged
merged 22 commits into from Jul 6, 2017

Conversation

@iaroslav-ai (Member) commented Jun 17, 2017

A class as discussed in #78, under the tentative name SkoptSearchCV (discussion on what might be a better name is welcome :)

Minimalist usage example as of right now:

from sklearn.svm import SVC

from skopt.wrappers import SkoptSearchCV
from skopt.space import Real, Categorical, Integer

# X_train, y_train, X_test, y_test are assumed to be defined already
opt = SkoptSearchCV(
    SVC(),
    [{
        'C': Real(1e-6, 1e+6, prior='log-uniform'),
        'gamma': Real(1e-6, 1e+1, prior='log-uniform'),
        'degree': Integer(1, 8),
        'kernel': Categorical(['linear', 'poly', 'rbf']),
    }],
    n_jobs=1, n_iter=32,
)

opt.fit(X_train, y_train)
print(opt.score(X_test, y_test))

If something is missing from the todos, let me know.

  • add skopt.wrappers to setup.py
  • implement basic wrapper similar to RandomizedSearchCV using BaseSearchCV
  • support multiple search spaces as list of dicts, similar to GridSearchCV
  • support parallel search
  • add draft of example usage in sklearn-wrapper
  • add all necessary data to cv_results_
  • add tests
  • fix the docstrings and add comments
  • review and improve example usage

@iaroslav-ai iaroslav-ai added this to the 0.4 milestone Jun 17, 2017
@codecov-io commented Jun 17, 2017

Codecov Report

Merging #405 into master will increase coverage by 0.61%.
The diff coverage is 94.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #405      +/-   ##
==========================================
+ Coverage   85.76%   86.37%   +0.61%     
==========================================
  Files          21       22       +1     
  Lines        1440     1556     +116     
==========================================
+ Hits         1235     1344     +109     
- Misses        205      212       +7
Impacted Files Coverage Δ
skopt/utils.py 98.05% <100%> (+0.18%) ⬆️
skopt/__init__.py 100% <100%> (ø) ⬆️
skopt/searchcv.py 93.39% <93.39%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e95b043...a57c21d.

@iaroslav-ai (Member Author):

Also let me know if you do not like the location of the class (currently skopt.wrappers).

@iaroslav-ai changed the title from "[WIP] Scikit-Optimize-based GridSearchCV plug-in replacement" to "[MRG] Scikit-Optimize-based GridSearchCV plug-in replacement" on Jun 18, 2017
@iaroslav-ai (Member Author):

As of right now, the PR contains a working implementation of *SearchCV according to #78 and scikit-learn/scikit-learn#5491.
Could someone review? I am not sure who would be best to ask, as all of you seemed pretty active in the referenced issue and PR :)

@betatim (Member) commented Jun 19, 2017

Thanks for this! Code, example, and tests all in one!

Also let me know if you do not like the location of the class (currently skopt.wrappers)

I'm not mad keen on the name. I'd make it available at the top level like *_minimize. But then skopt.SkoptSearchCV is a bit redundant ("doppelt gemoppelt"). SMBOSearchCV is cryptic... I will think a bit and see what others have used. I think once we know a good name we can use it for the class and replace wrappers with it as well.

@betatim (Member) commented Jun 19, 2017

Looks like we need to be smarter about our tests so that they can run in less time.

import sklearn.model_selection._search as skms

import numpy as np
from collections import *
Review comment (Member):

no stars please :)


# Extract available surrogates, so that new ones are used automatically
available_surrogates = [
getattr(sol, name) for name in sol.__all__
Review comment (Member):

I would rather be explicit and write down the full list of surrogates, instead of relying on some black magic that could easily break down as we add/update things.

"""
Tests whether the cross validation search wrapper around sklearn
models runs properly with available surrogates and with single
or multiple workers.
Review comment (Member):

indentation issue

@@ -0,0 +1 @@
from .search_cv import SkoptSearchCV
Review comment (Member):

missing line break

dimensions = [params_space[k] for k in sorted(params_space.keys())]

if self.surrogate == "auto":
surrogate = GaussianProcessRegressor()
Review comment (Member):

Is that a good kernel?
I think it would be nice to solve #338 to reuse a sensible default here.


from sklearn.base import clone
from sklearn.externals.joblib import Parallel, delayed, cpu_count
import sklearn.model_selection._search as skms
Review comment (Member):

can we find a better name than skms?

# Extract available surrogates, so that new ones are used automatically
available_surrogates = [
getattr(sol, name) for name in sol.__all__
if "GradientBoostingQuantileRegressor" not in name
Review comment (Member):

Why can't we use this model?

number of evaluations set to self.n_iter. Alternatively, if
a list of (dict, int > 0) is given, the search is done for
every search space for number of iterations given as a second
element of tuple.
Review comment (Member):

This means if I give a list of three dictionaries there will be a total of 3*n_iter iterations? What is scikit-learn's behaviour wrt this?

I thought sklearn treats the situation of having a list of dicts as having an extra implied categorical dimension.

We should try and be as much plug'n'play as possible.

Reply (Member Author):

The documentation for GridSearchCV says that if a list of dicts is provided, the grid spanned by each dict is explored sequentially. Hence I do the same here.
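A minimal sketch of how such a list of search spaces could be normalized into sequential (space, budget) pairs, mirroring the behaviour described above (the helper name is hypothetical, not the PR's code):

```python
def normalize_search_spaces(search_spaces, default_n_iter):
    """Turn each entry into a (space_dict, n_iter) pair.

    Entries may be plain dicts (which get the default budget) or
    (dict, n_iter) tuples carrying a per-space budget. The spaces
    are then searched one after another, as in GridSearchCV.
    """
    normalized = []
    for elem in search_spaces:
        if isinstance(elem, tuple):
            space, n_iter = elem
        elif isinstance(elem, dict):
            space, n_iter = elem, default_n_iter
        else:
            raise ValueError(
                "Unsupported type of search space entry: %r" % (elem,))
        normalized.append((space, n_iter))
    return normalized
```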


# this dict is used in order to keep track of skopt Optimizer
# instances for different search spaces (str(space) is used as key)
self.optimizer = {}
Review comment (Member):

-> _optimizer? It is 'private' to the internal process no?

dimensions = [params_space[k] for k in sorted(params_space.keys())]

if self.surrogate == "auto":
surrogate = GaussianProcessRegressor()
Review comment (Member):

We should solve #400 (comment) or we need to add the same kernel setup and space conversion as is in gp_minimize

Reply (Member Author):

Yup, I now use strings for the surrogate passed to Optimizer, which cooks up estimators for me like a chef 😜


# if tuple: (dict: search space, int: n_iter)
if isinstance(elem, tuple):
psp, n_iter = elem
Review comment (Member):

psp -> search_space (I think). In general we should use proper words for variable names instead of abbreviations.

psp, n_iter = elem, self.n_iter
else:
raise ValueError("Unsupported type of parameter space. "
"Expected tuple or dict, got " + str(elem))
Review comment (Member):

switch to "got %s." % (elem)

"Expected tuple or dict, got " + str(elem))

# do the optimization for particular search space
while n_iter:
Review comment (Member):

bool(-3) == True, so this won't stop if n_iter isn't divisible by n_jobs. I think we should check whether there are enough iterations left to sample n_jobs more points; if not, reduce it to the number left and then stop the loop (and add a unit test to check this works).

}
return params_dict

def step(self, X, y, param_space, groups=None, n_jobs=1):
Review comment (Member):

I'd have to read the code for RandomizedSearchCV, but doesn't it have a step function we can re-use/hook into so we don't have to duplicate all the code for recording the results?

@iaroslav-ai (Member Author):

Addressed comments by @betatim. Please let me know if there is anything else.

@@ -61,7 +61,7 @@ def f(x):
from .utils import load
from .utils import dump
from .utils import expected_minimum

from .searchcv import BayesSearchCV
Review comment (Member):

Can you put this in the alphabetical order in the list above?

import sklearn.model_selection._search as sk_model_sel

from skopt import Optimizer
from skopt.utils import point_asdict, dimensions_aslist
Review comment (Member):

those should be local imports

`pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 *
n_jobs`.

See Also
Review comment (Member):

Could we add a short Example section in the docstring? (showing that things are in fact simple, despite the huge list of arguments)

key = str(param_space)
if key not in self.optimizer_:
self.optimizer_[key] = self._make_optimizer(param_space)
optimizer = self.optimizer_[key]
Review comment (Member):

Hmm, not sure what is done inside self._make_optimizer, but we should ensure there are no side effects... Your suggestion is not equivalent to @iaroslav-ai's code, in case self._make_optimizer changes the state of self internally.

Either estimator needs to provide a ``score`` function,
or ``scoring`` must be passed.

search_spaces : list of dict or tuple
Review comment (Member):

While this is nice, the 90% use case is to search over a single space. I think we should support directly providing a dict. This would also better match the API of GridSearchCV.
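Internally, this single-dict convenience could be handled by a small normalization step; a sketch with a hypothetical helper name, not the merged implementation:

```python
def as_list_of_spaces(search_spaces):
    """Accept either a single search-space dict (the common case)
    or a list of them, always returning a list of spaces."""
    if isinstance(search_spaces, dict):
        return [search_spaces]
    if isinstance(search_spaces, list):
        return search_spaces
    raise ValueError(
        "Search space should be provided as a dict or list of dicts, "
        "got %s" % search_spaces)
```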

add example, add test for the multiple search spaces
"Parameter of space should be an instance of"
"skopt.space.Dimension (Real, Integer, ...),"
"but in subspace %s the dimension %s is %s" %
(subspace, k, v)
Review comment (Member):

Could we instead use skopt.space.check_dimension? In this way, we would have a consistent API with Optimizer/*_minimize in terms of how dimensions are specified. (Also, this would allow much less boilerplate code for the users.)

Change the order of arguments to the function to (name, space) to make it a bit more readable
@iaroslav-ai (Member Author):

This should address all the comments. Will make a second pass over the code later to double check.

@glouppe (Member) commented Jul 3, 2017

In the notebook, could you update the minimal example in order to use the simplest API? (directly feeding the dict, with dimensions specified as pairs)

@iaroslav-ai (Member Author):

Yup will do so

@iaroslav-ai (Member Author):

Updated! :)

@glouppe (Member) commented Jul 3, 2017

Thanks! This looks very nice :)

One more thing though... in the last part of the notebook, it seems almost nothing is learned. Within the very first iterations, the optimizer reaches a good value and does not improve from there. Have you tried with more iterations to see if things eventually improve?

@iaroslav-ai (Member Author):

I did not try, but my suspicion would be that it will not improve much, as the dataset is not really that complex and is used mainly as a simple example. Maybe I could see what other datasets could be used.

def constructor(x): BayesSearchCV(*x)
assert_raises(
ValueError, constructor, (SVC(), {'C':'1 ... 100.0'})
)
Review comment (Member):

with pytest.raises(ValueError):
  BayesSearchCV(args_here)

https://docs.pytest.org/en/latest/assert.html#assertions-about-expected-exceptions for some more examples

Reply (Member Author):

ahaaa so that is how you do it thx :)

Review comment (Member):

Trying to spread the pytest way of life :)

"Search space should be provided as a dict or list of dict,"
"got %s" % search_space)

def add_spaces(self, spaces, names):
Review comment (Member):

Not sure it needs to be public, but if it is it needs docs. Which might motivate making it private :)

point_as_list = [
point_as_dict[k] for k in sorted(search_space.keys())
]
return point_as_list
Review comment (Member):

Needs a unit test

Review comment (Member):

Maybe in test_utils.py we can add a test that just tries to round trip a dict to a list and back
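Such a round-trip test could be built on simplified stand-ins like the following (a sketch mirroring the sorted-key convention used by the wrapper; the real skopt.utils helpers also handle skopt Dimension objects):

```python
def point_aslist(search_space, point_as_dict):
    """Order parameter values by sorted dimension name, matching
    how the wrapper builds its dimension list."""
    return [point_as_dict[k] for k in sorted(search_space.keys())]

def point_asdict(search_space, point_as_list):
    """Inverse mapping: pair sorted dimension names with values."""
    return dict(zip(sorted(search_space.keys()), point_as_list))
```

A round trip should then be the identity: point_asdict(space, point_aslist(space, p)) == p.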

@MechCoder (Member):

Feel free to merge this, I'll try to have a look next week and then make some comments if any. Not sure I will have much to say in any case.

@glouppe (Member) commented Jul 6, 2017

@betatim Was this ok with you? I think this PR is good enough to be merged in :) We can polish things further later.

@betatim betatim merged commit 10fde17 into scikit-optimize:master Jul 6, 2017
@glouppe (Member) commented Jul 6, 2017

🍻 Great work @iaroslav-ai ! Thanks a lot for this, I am pretty sure this will be super useful to many :)

@betatim (Member) commented Jul 6, 2017

🍻 and 🍰 !

This is a big new feature! Thanks for the work and patience with the many little comments spread out over days :)

I think we can do a bit more work on improving the doc strings etc but let's do that in a new PR (or several).

@iaroslav-ai (Member Author):

A huge thank you to @glouppe and @betatim for taking your time to review my code! Indeed quite some PR it is :)
I am also testing this internally for some of my things, seems quite useful so far! :)

@amueller (Contributor) commented Jul 6, 2017

Awesome! Did someone do any benchmarks with how this compares to GridSearchCV in particular cases?

@betatim (Member) commented Jul 6, 2017

Don't think so. The best we have is https://github.com/scikit-optimize/scikit-optimize/tree/master/benchmarks#ml-classification

It would be great if someone could perform some benchmarks on "realistic" data sets.

I once saw some benchmarks in https://github.com/rhiever/tpot maybe we can start from there instead of having to do everything ourselves. Not related to tuning ML algos: https://github.com/sigopt/evalset but maybe faster to execute.
