[MRG] extending BaseSearchCV with a custom search strategy #9599

Merged
merged 32 commits into from Aug 5, 2018

Conversation

@jnothman
Member

jnothman commented Aug 21, 2017

Fixes #9499 through inheritance, or by providing a new AdaptiveSearchCV

The idea is that something like skopt.BayesSearchCV can be implemented as:

class MySearchCV(BaseSearchCV):
    def __init__(....):
        ...

    def _run_search(self, evaluate_candidates):
        results = evaluate_candidates(initial_candidates)
        while ...:
            results = evaluate_candidates(more_candidates)

Note that it should be possible to use scipy.optimize minimizers in this framework.

I've also provided AdaptiveSearchCV, which accepts a search parameter through a public interface rather than relying on inheritance, but I'm undecided about it.
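
The sketch above can be made concrete with a toy, self-contained version of the protocol. Here `evaluate_candidates` is only a stand-in for the real callable (which fits and cross-validates each candidate); the names and the `cv_results_`-like shape are illustrative, not the merged API:

```python
def make_evaluate_candidates(score_fn):
    # Closure records a cumulative history, shaped loosely like cv_results_.
    results = {"params": [], "mean_test_score": []}

    def evaluate_candidates(candidates):
        for params in candidates:
            results["params"].append(params)
            results["mean_test_score"].append(score_fn(params))
        return results  # cumulative results, as later settled on in this thread

    evaluate_candidates.results = results
    return evaluate_candidates


def run_search(evaluate_candidates):
    # Coarse first pass over a handful of candidates.
    results = evaluate_candidates([{"C": c} for c in (0.1, 1.0, 10.0)])
    # Refine around the best candidate seen so far.
    best = max(zip(results["mean_test_score"], results["params"]))[1]
    evaluate_candidates([{"C": best["C"] * f} for f in (0.5, 2.0)])


# Toy objective with its peak at C=2, standing in for cross-validation.
evaluate = make_evaluate_candidates(lambda p: -(p["C"] - 2.0) ** 2)
run_search(evaluate)
```

A real subclass would do the same from inside `_run_search`, with `evaluate_candidates` provided by `BaseSearchCV.fit`.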

TODO:

  • Get other core devs to comment on broad API
  • Probably remove AdaptiveSearchCV and leave a private/protected API
  • Decide how we document an API for inheritance, as this is a bit new to Scikit-learn
  • Decide whether this is "experimental" API or something that we want to assure stability on
  • Should _run_search be allowed to change the cv_results_ dict? Should we put (in)formal requirements (or an API) on how cv_results_ can be changed?
@jnothman

Member

jnothman commented Aug 21, 2017

Ping @betatim

jnothman added some commits Aug 22, 2017

@jnothman jnothman changed the title from [WIP] *SearchCV can now provide candidates through a coroutine to [MRG] *SearchCV can now provide candidates through a coroutine Aug 22, 2017

@jnothman

Member

jnothman commented Aug 22, 2017

Decided to make this MRG. Why not?

@jnothman jnothman changed the title from [MRG] *SearchCV can now provide candidates through a coroutine to [WIP] *SearchCV can now provide candidates through a coroutine Aug 22, 2017

@jnothman

Member

jnothman commented Aug 22, 2017

This is messy but, I hope, fixed.

@jnothman jnothman changed the title from [WIP] *SearchCV can now provide candidates through a coroutine to [MRG] *SearchCV can now provide candidates through a coroutine Aug 22, 2017

@betatim

Contributor

betatim commented Aug 22, 2017

This is exactly what I tried to describe with words :) 👍

To be overridden by implementors.
It can iteratively generate a list of candidate parameter dicts, and is

@NelleV

NelleV Aug 22, 2017

Member

I think there is a grammatical mistake with this sentence (or I just don't understand it)

@jnothman

jnothman Aug 22, 2017

Member

I think it parses. But it's not easy :)

How about "As in the following snippet, implementations yield lists of candidates, where each candidate is a dict of parameter settings. The yield expression then returns the results dict corresponding to those candidates."?

@@ -555,6 +559,22 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    def _generate_candidates(self):
        """Generates lists of candidates to search as a coroutine

@betatim

betatim Aug 22, 2017

Contributor

-> "Generate lists of candidate parameters to evaluate."?

Having our conversation in mind I still had to think twice to grok this, so I'll try and make a suggestion for the doc string that would have helped me.

@jnothman

jnothman Aug 22, 2017

Member

I use "candidate" to mean a full parameter setting, i.e. multiple parameters and their values.

@betatim

betatim Aug 22, 2017

Contributor

kk, my main issue was the "coroutine". I think that as a reader/user here I don't need to know about coroutines (and the fifteen different ways people define them). IMHO the important bit is "this has to be a generator that can have stuff sent into it". There aren't that many of those out in the wild I'd guess.
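
The "generator that can have stuff sent into it" pattern can be sketched in a few lines. This is illustrative only (the coroutine protocol was later dropped from the PR), and the list-of-(params, score)-pairs result shape is an assumption for the example:

```python
def generate_candidates():
    # Yield a first batch of candidate dicts; the yield expression
    # evaluates to whatever the caller sends back as results.
    results = yield [{"alpha": 0.1}, {"alpha": 1.0}]
    # results is assumed here to be a list of (params, score) pairs.
    best_params, _ = max(results, key=lambda pair: pair[1])
    # Propose a follow-up batch refined from the best result so far.
    yield [{"alpha": best_params["alpha"] / 2}]


gen = generate_candidates()
first_batch = next(gen)  # prime the generator, get the first batch
# Toy scoring in place of cross-validation:
scored = [(p, -abs(p["alpha"] - 0.3)) for p in first_batch]
second_batch = gen.send(scored)  # results go in, next batch comes out
```

The caller drives the loop: `next` to get the first batch, then `send` to hand results back and receive the next batch.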

"""Base class for hyper parameter search with cross-validation."""
"""Abstract base class for hyper parameter search with cross-validation.
Implementers should provide ``__init__`` and either ``_get_param_iterator``

@betatim

betatim Aug 22, 2017

Contributor

for my education: why do we need to provide __init__ if it is empty (like in the test)? Doesn't python call the base __init__ automatically?

@jnothman

jnothman Aug 22, 2017

Member

Only because the current code marks it as an abstractmethod.
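
A minimal reproduction of this behaviour, using only the standard library: when the base class marks `__init__` as an abstractmethod, a subclass must override it, even with an empty body, or Python refuses to instantiate it.

```python
from abc import ABC, abstractmethod


class Base(ABC):
    @abstractmethod
    def __init__(self):
        pass


class WithoutInit(Base):
    pass  # does not override __init__


class WithInit(Base):
    def __init__(self):  # a trivial override is enough
        pass


WithInit()  # instantiates fine
try:
    WithoutInit()
except TypeError as exc:
    print(exc)  # can't instantiate abstract class WithoutInit ...
```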

@jnothman

Member

jnothman commented Aug 22, 2017

All this result munging makes a mess of the code, admittedly. We could perhaps make it a little cleaner if _generate_candidates was sent a list of dicts rather than a dict of lists as results.

_generate_candidates could also be renamed to _generate_candidate_lists if that's clearer.

@jnothman

Member

jnothman commented Aug 22, 2017

@betatim, do you have a preference as to whether you would rather receive a dict of arrays, as in cv_results_ or a list of dicts?

@jnothman

Member

jnothman commented Aug 22, 2017

Now "coroutine" free.

@betatim

Contributor

betatim commented Aug 22, 2017

As it is now (dict of arrays) seems fine.

Why not pass all results back instead of just the last iteration's? In skopt we would end up doing that, as you need the whole history.

@jnothman

Member

jnothman commented Aug 22, 2017

@betatim

Contributor

betatim commented Aug 22, 2017

One dict per parameter setting tried? Sounds like a lot of generators will write the for loop code to collect all scores together, all parameter values etc :-/

An already-merged dict of arrays seems best, I think.

@jnothman

Member

jnothman commented Aug 22, 2017

@betatim

Contributor

betatim commented Aug 24, 2017

Yes. At least for skopt we need the "complete" history and I would guess most other approaches are also interested in the whole history so far to make their stopping decision.

@jnothman

Member

jnothman commented Sep 11, 2017

@betatim, I've updated this to provide cumulative history to the coroutine. As a downside, it repeats some work on aggregating results, but I suppose that could be made more efficient at a later point if it became an issue, and at least it's not a function of the dataset size except where that determines the number of splits.

WDYT?

@jnothman

Member

jnothman commented Sep 11, 2017

And now I'm wondering whether we should just add a warm_start parameter to GridSearchCV, allowing users to change the parameter grid and also, perhaps, the number of splits, and for this to be included in the same results listing and ranking.

@amueller

Member

amueller commented Jul 23, 2018

@stsievert can you maybe provide a bit of context on your usage, just to inform our decision making / out of curiosity?

@stsievert

stsievert commented Jul 23, 2018

@amueller I'm implementing an adaptive hyperparameter search called Hyperband in dask/dask-ml#221. Hyperband treats optimization as iterative and computation as a scarce resource, so it relies on partial_fit (though it doesn't have to be that way; it can be applied to black-box models too).

This PR cannot be used directly unless evaluate can be customized to support partial_fit (see #11266 for related work). I think this PR caught my interest because I study adaptive methods in graduate school.

@jnothman jnothman changed the title from [WIP] extending BaseSearchCV with a custom search strategy to [MRG] extending BaseSearchCV with a custom search strategy Jul 24, 2018

@jnothman

Member

jnothman commented Jul 24, 2018

@amueller

Member

amueller commented Jul 24, 2018

When I looked at it I felt #11354 wasn't that widely applicable. I think @janvanrji had argued to me that it would be useful for successive halving, but I argued that's easy to implement with the current code. Need to revisit the motivation.

@jnothman

Member

jnothman commented Jul 25, 2018

Another review of this as an experimental private interface? @lesteve? @glemaitre?

@glemaitre

LGTM with tiny nitpicks.

The changes are minimal and they simplify things for contrib projects, so I would be +1.

if attr[0].islower() and attr[-1:] == '_' and \
        attr not in {'cv_results_', 'best_estimator_',
                     'refit_time_'}:
    assert_equal(getattr(gscv, attr), getattr(mycv, attr),

@glemaitre

glemaitre Jul 25, 2018

Contributor

assert

def test_custom_run_search():
    def check_results(results, gscv):
        exp_results = gscv.cv_results_
        assert_equal(sorted(results.keys()), sorted(exp_results))

@glemaitre

glemaitre Jul 25, 2018

Contributor

plain assert

    assert_array_equal(exp_results[k], results[k],
                       err_msg='Checking ' + k)
else:
    assert_almost_equal(exp_results[k], results[k],

@glemaitre

glemaitre Jul 25, 2018

Contributor

assert_allclose

@@ -577,6 +578,30 @@ def classes_(self):
        self._check_is_fitted("classes_")
        return self.best_estimator_.classes_

    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls evaluate_candidates to conduct a search

@glemaitre

glemaitre Jul 25, 2018

Contributor

backticks around evaluate_candidates

@glemaitre

glemaitre Jul 25, 2018

Contributor

and a full stop

@jnothman

Member

jnothman commented Jul 25, 2018

@jnothman

Member

jnothman commented Jul 25, 2018

Now, do I add a what's new along the lines of:

- `BaseSearchCV` now has an experimental private interface to support
  customized parameter search strategies, through its ``_run_search``
  method.  See the implementations in :class:`model_selection.GridSearchCV`
  and :class:`model_selection.RandomizedSearchCV` and please provide feedback
  if you use this. Note that we do not assure the stability of this API
  beyond version 0.20. :issue:`9599` by `Joel Nothman`_

?

@glemaitre

Contributor

glemaitre commented Jul 25, 2018

A what's new entry would be nice, I think, even though these are not public changes.

jnothman added some commits Jul 26, 2018

@ogrisel

ogrisel approved these changes Aug 3, 2018

LGTM. Please see the following comments.

::
def _run_search(self):

@ogrisel

ogrisel Aug 3, 2018

Member

typo:

def _run_search(self, evaluate_candidates):
    ...
    @abstractmethod
    def _run_search(self, evaluate_candidates):
        """Repeatedly calls `evaluate_candidates` to conduct a search.

@ogrisel

ogrisel Aug 3, 2018

Member

Please add some motivation for the intent of this abstract method. For instance:

This method, implemented in sub-classes, makes it possible to customize the
scheduling of evaluations: GridSearchCV and RandomizedSearchCV schedule
evaluations for their whole parameter search space at once, but other, more
sequential approaches are also possible: for instance, it is possible to
iteratively schedule evaluations for new regions of the parameter search
space based on previously collected evaluation results. This makes it
possible to implement Bayesian optimization, or more generally sequential
model-based optimization, by deriving from the BaseSearchCV abstract base
class.

@jnothman

jnothman Aug 5, 2018

Member

Very nice text!

@ogrisel

Member

ogrisel commented Aug 3, 2018

Maybe for another PR, I think it might also be interesting to add:

class BaseSearchCV(...):

    @staticmethod
    def _fit_and_score(base_estimator, X, y, ...):
        """Make it possible to override model evaluation in subclasses"""
        return _fit_and_score(base_estimator, X, y, ...)

This would make it possible to override the default implementation in subclasses: for instance, to snapshot the fitted models to disk and later build an ensemble of the top performers, or to save each model evaluation in a DB to feed a dashboard.
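
A sketch of how such an overridable hook could work, using simplified stand-in classes (hypothetical API, not what was merged; the real `_fit_and_score` takes an estimator, data, and CV splits):

```python
class Searcher:
    @staticmethod
    def _fit_and_score(params):
        # Stand-in for the real fit-and-score computation.
        return -abs(params["C"] - 1.0)

    def search(self, grid):
        # Dispatches through the instance/class, so subclasses can intercept.
        return [self._fit_and_score(p) for p in grid]


class LoggingSearcher(Searcher):
    log = []

    @staticmethod
    def _fit_and_score(params):
        score = Searcher._fit_and_score(params)
        # Record every evaluation, e.g. to write to a DB or dashboard.
        LoggingSearcher.log.append((params, score))
        return score


scores = LoggingSearcher().search([{"C": 0.5}, {"C": 2.0}])
```

Because `search` calls `self._fit_and_score`, the subclass's wrapper is used without touching the search logic itself.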

jnothman added some commits Aug 5, 2018

@jnothman

Member

jnothman commented Aug 5, 2018

I'm not sure about allowing users to customise _fit_and_score... It seems too powerful/essential, and something we need to be able to change on a whim. If they want to save estimators or log to a DB, scoring can do that in a hacky way. I've proposed before having another callback for storing diagnostics.

@jnothman

Member

jnothman commented Aug 5, 2018

And: let's merge this when green... Sorry it's taken a while, @betatim!

@jnothman jnothman merged commit 477c921 into scikit-learn:master Aug 5, 2018

7 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 95.3%)
codecov/project Absolute coverage decreased by -2.21% but relative coverage increased by +4.69% compared to 71ee6d2
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

scikit-learn 0.20 automation moved this from PRs tagged to Done Aug 5, 2018

@saddy001

saddy001 commented Sep 26, 2018

The method BaseSearchCV._run_search is now mandatory in derived classes, because it's abstract.
Classes that inherit from BaseSearchCV and don't implement the new method fail with

TypeError: Can't instantiate abstract class <class_name> with abstract methods _run_search

This breaks external libraries, for example skopt.BayesSearchCV. Is this on purpose?

@amueller

Member

amueller commented Sep 26, 2018

@saddy001 that's unfortunate. Ideally this refactor would make BayesSearchCV easier to implement. Maybe we should not make this abstract but just raise a NotImplementedError, in case the derived class overrides fit (which I guess is what BayesSearchCV does). If they override fit, what exactly does the class use from BaseSearchCV right now?
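
The NotImplementedError alternative can be sketched with simplified stand-in classes (illustrative, not scikit-learn code): the base raises only when its own fit path actually reaches _run_search, so a subclass that overrides fit entirely keeps working and instantiating.

```python
class BaseSearch:
    def _run_search(self, evaluate_candidates):
        # Raised only if the default fit path is used without an override.
        raise NotImplementedError("_run_search is not implemented.")

    def fit(self):
        # The default path requires _run_search.
        self._run_search(lambda candidates: None)


class LegacySearch(BaseSearch):
    def fit(self):
        # Overrides fit without ever touching _run_search,
        # as an external subclass like BayesSearchCV might.
        return "legacy path"


LegacySearch().fit()  # works; no abstract-method error at instantiation
```

With @abstractmethod instead, `LegacySearch()` would raise TypeError at instantiation even though its fit never needs _run_search.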

@saddy001

saddy001 commented Sep 26, 2018

It seems to implement the following methods:

dir(BayesSearchCV)
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_is_fitted', '_check_search_space', '_estimator_type', '_fit', '_fit_best_model', '_format_results', '_get_param_names', '_make_optimizer', '_run_search', '_step', 'best_params_', 'best_score_', 'classes_', 'decision_function', 'fit', 'get_params', 'inverse_transform', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 'total_iterations', 'transform']

The code is here: https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/searchcv.py
Maybe we should 'ping' the developers?

@amueller

Member

amueller commented Sep 26, 2018

not sure what the dir is supposed to be telling me. My question was what do they inherit, and I think the answer is score?

@saddy001

saddy001 commented Sep 26, 2018

Yep sorry, dir is also listing inherited attributes. I think this is what you want:

BaseSearchCV.__dict__.keys() - BayesSearchCV.__dict__.keys()
{'score', '_run_search', '_check_is_fitted', 'predict_log_proba', 'inverse_transform', 'predict', 'classes_', '_format_results', 'predict_proba', 'transform', 'decision_function', '_estimator_type'}