
[MRG] Successive halving for faster parameter search #13900

Merged
merged 124 commits, Sep 9, 2020

Conversation

@NicolasHug (Member) commented May 17, 2019

Closes #12538

This implements hyper-parameter search with successive halving.

This builds upon #13145, whose changes are required.

This is a port of what we implemented in dabl with @amueller.

  • main functional tests
  • examples
  • user guide
  • better integration into user guide to avoid redundancy
  • a few more doc here and there
  • some more thorough tests about input checking, etc

Still WIP but well advanced; I would appreciate some feedback before I start tackling the last few bullet points, so I'll mark it as MRG.

ping @ogrisel ;)
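For readers landing here later: a minimal usage sketch with the API as finally merged (class name `HalvingGridSearchCV` rather than the `GridSuccessiveHalving` used in the benchmarks below; the experimental import is required as of the 0.24 release):

```python
# Minimal sketch of the merged successive-halving search API.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

# Each iteration trains the surviving candidates on more samples and
# keeps only the top 1/factor of them.
search = HalvingGridSearchCV(SVC(), param_grid, factor=2,
                             resource="n_samples", random_state=0)
search.fit(X, y)
print(search.best_params_)
```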


Benchmarks

EDIT: more recent benchmarks here

Please check out dabl benchmarks for source.

  • On the 20_news_group dataset:
                    training time   test score   best CV score
---------------------------------------------------------------
GridSearchCV             19984.9 s      0.8567          0.9262
GridSuccessiveHalving      598.4 s      0.8514          0.8811
---------------------------------------------------------------
Best Params GridSuccessiveHalving
{'clf__C': 1000.0, 'vect': TfidfVectorizer(), 'vect__ngram_range': (1, 1)}
Best Params GridSearchCV
{'clf__C': 1000.0,
  'vect': TfidfVectorizer(ngram_range=(1, 2)),
  'vect__ngram_range': (1, 2)}
  • On the digits dataset:

                      training time   test score   best params
-----------------------------------------------------------------------
SuccessiveHalving             3.12 s       0.9911   {'C': 100.0, 'gamma': 0.1}
GridSearchCV                 39.43 s       0.9911   {'C': 10.0, 'gamma': 0.1}
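A rough sketch to reproduce the digits comparison with the merged class names; the exact grid and split of the original run aren't shown above, so those details here are assumptions, and timings vary by machine:

```python
from time import time

from sklearn.datasets import load_digits
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import (GridSearchCV, HalvingGridSearchCV,
                                     train_test_split)
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumed grid -- the original benchmark's exact grid is not shown above.
param_grid = {"C": [1.0, 10.0, 100.0],
              "gamma": [0.0001, 0.001, 0.01, 0.1]}

for cls in (HalvingGridSearchCV, GridSearchCV):
    tic = time()
    search = cls(SVC(), param_grid).fit(X_train, y_train)
    print(f"{cls.__name__}: {time() - tic:.1f} s, "
          f"test score {search.score(X_test, y_test):.4f}, "
          f"best params {search.best_params_}")
```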

@NicolasHug NicolasHug changed the title [MRG] Successive halving [MRG] Successive halving for faster parameter search May 17, 2019
@jnothman (Member) left a comment

I'm looking forward to this, but it is a big review task!

@amueller (Member)

Most of it is docs, though ;)

One thing that should be noted is that there is no literature on successive halving with a fixed number of configurations as far as I'm aware, or at least not with a fixed number of iterations and a variable budget. So some of the solutions for applying this for grids are solutions Nicolas and I came up with.
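The fixed-configuration variant being described can be sketched in a few lines; the function below is illustrative (the name and the exact keep-top-1/ratio rule are a paraphrase, not the PR's code):

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, ratio=3):
    """Return (n_candidates, resources_per_candidate) for each iteration:
    every round keeps roughly the top 1/ratio of candidates and multiplies
    the per-candidate budget by ratio, until one candidate survives or the
    budget cap is hit."""
    schedule = []
    n, r = n_candidates, min_resources
    while n >= 1 and r <= max_resources:
        schedule.append((n, r))
        if n == 1:
            break
        n = math.ceil(n / ratio)
        r *= ratio
    return schedule

print(halving_schedule(16, 20, 160, ratio=2))
# -> [(16, 20), (8, 40), (4, 80), (2, 160)]
```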

expensive and is not strictly required to select the parameters that
yield the best generalization performance.

max_budget : int, optional(default='auto')
Member

this is actually r_max, not the maximum total budget, right? I'm not sure if it makes sense to call this r_max because that's kind of obscure, but the naming is inconsistent with r_min because they refer to the same thing (per-estimator per-iteration budget constraints).

Member Author

Yes this is the maximum value that r_i is allowed to take.
I think we agreed to call it max_budget? That makes sense to me, especially when budget_on='n_samples', but it might not completely follow the Hyperband paper.

Member

Why this syntax of "optional("?

@thomasjpfan (Member) left a comment

First pass

@@ -724,7 +731,8 @@ def evaluate_candidates(candidate_params):

return self

def _format_results(self, candidate_params, scorers, n_splits, out):
def _format_results(self, candidate_params, scorers, n_splits, out,
more_results={}):
Member

more_results can be a regular argument (not a keyword).

Member Author

We'd have to change GridSearchCV and RandomizedSearchCV then. Would you prefer it?

Member

yeah, I also prefer the private API to have as few params with default values as possible. I'd change GridSearchCV and RandomizedSearchCV.

Member Author

I also prefer the private API to have as few params with default values as possible

Why? I'm not sold on this

I like to not break backward compatibility when we can

Member Author

It can also typically make writing tests annoying (see e.g. _find_binning_thresholds in sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py).

Member

I'd still be happier if we made more_results have no default value. How easy or hard it is to write tests shouldn't affect the API. You can also use functools.partial if you want to handle that in tests.

Member Author

Could you please explain why you don't like defaults for private utilities? I haven't heard anything on that yet ;)

I believe the API would be weird for GridSearchCV and RandomizedSearchCV which would have to pass an empty dict to this function now.

Another advantage of adding a keyword is that we don't further break the current API, even if it's private. I believe @jnothman cares about it

Member

I avoid defaults for private utilities because I have seen a history of bugs caused by parameters not being passed when they should be, and that bug not being identified, thanks to a default value. But in this case, I'm fairly ambivalent, because it's the kind of thing that would break the calling context if it was not passed. But actually, this only affects BaseSearchCV, which is its only caller.

A different nitpick: mutable default values are generally to be avoided, i.e. don't have ={}, instead have =None and change it to {} internally.
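The pitfall behind that nitpick, shown on a toy function (not the real `_format_results`):

```python
# A mutable default is created once, at function definition time, and
# shared across every call.
def format_bad(results, extra={}):
    extra.setdefault("n_calls", 0)
    extra["n_calls"] += 1
    return extra

format_bad([])
print(format_bad([]))   # {'n_calls': 2} -- state leaked between calls!

# The usual fix, as suggested: default to None, create the dict inside.
def format_good(results, extra=None):
    if extra is None:
        extra = {}
    extra.setdefault("n_calls", 0)
    extra["n_calls"] += 1
    return extra

print(format_good([]))  # {'n_calls': 1} on every call
```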

Member Author

I avoid defaults for private utilities because I have seen a history of bugs caused by parameters not being passed when they should be, and that bug not being identified, thanks to a default value

I'm not sure how this specifically applies to private utilities? It seems to me that this holds for any function, whether it's private or public.

Would we all be happy with this? (adding the kwonly part)

    def _format_results(self, candidate_params, *, scorers, n_splits, out,
                        more_results=None):

Member

It applies to private utilities because there is little usability benefit to providing default values when the utility is private. Yes, I'm fine with that in this case.

classifier=is_classifier(self.estimator))
n_splits = cv.get_n_splits(X, y, groups)

# please see https://gph.is/1KjihQe for a justification
Member

This is very well justified.

@NicolasHug (Member Author)

@jnothman do you have any comment/concern regarding the fact that I pulled the changes from #13145 (which you reviewed)?

@jnothman (Member)

@jnothman do you have any comment/concern regarding the fact that I pulled the changes from #13145 (which you reviewed)?

Are you just asking whether I am happy with those API changes? I appreciate that some kind of protected (in the Java sense) API change is necessary, and that it's hard to design the right change. All in all, I'd like to take a look at this PR, but I've not had a lot of time for review.

@jnothman (Member)

Should we consider just calling the estimator HalvingSearchCV?

Comment on lines +364 to +365
:class:`model_selection.GridSearchCV`. :pr:`13900` by `Nicolas Hug`_, `Joel
Nothman`_ and `Andreas Müller`_.
Member Author

@jnothman @amueller I added you both here, considering the amount of design work you did

@amueller (Member) left a comment

some comments on docs and examples, I'll try to do the code tomorrow (??), though I think I've looked at most of it before.


As illustrated in the figure below, only a small subset of candidates
'survive' until the last iteration. These are the candidates that have
consistently ranked among the best candidates across all iterations. Each
Member

"best" is not true, I think, though I might be nitpicking here. It could have just made the cutoff in each round. Maybe "was in the subset of winners" or was in the top half, though we haven't explained the halving yet here... hm

Member Author

I'll replace "best" by "top scoring":

These are the candidates that have
consistently ranked among the top-scoring candidates across all iterations

LMK if that's not OK

Member

sounds good.

search in our implementation. ``ratio`` effectively controls the number of
iterations in :class:`HalvingGridSearchCV` and the number of candidates (if
'auto') and iterations in :class:`HalvingRandomSearchCV`.
``aggressive_elimination=True`` can also be used if the number of available
Member

Not sure if it makes sense to mention parameters without explaining them. Maybe just say "there are several more parameters that are explained in detail below"?

Member Author

Joel suggested having a paragraph like this one... it's hard to satisfy every reviewer when it comes to docs :p

We do end this paragraph a few lines below with

Each parameter and their interactions are
described in more details below.

Do you think we should say that earlier?

Member

don't worry about it.

# We now plot heatmaps for both search estimators.


def make_heatmap(ax, gs, is_sh=False, make_cbar=False):
Member

I assume we can't easily reuse any of the confusion matrix plot? I've been nagging @thomasjpfan to do a grid-search visualizer ;) But I guess pandas out is nice, too.

max_resources='auto', # max_resources=n_samples
n_candidates='exhaust',
cv=5,
ratio=2,
Member

What's the reason to use ratio=2? I thought ratio=3 was the more common choice.

Member Author

Indeed. In this case ratio=2 makes the plot look nicer (longer) and makes it easier to illustrate the SH process in the narrative docs.
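The "longer plot" is just arithmetic: each round keeps about 1/ratio of the candidates, so the number of elimination rounds grows with the log base ratio of the candidate count. A quick check (plain Python, not sklearn code):

```python
n_candidates = 16
for ratio in (2, 3):
    n_iterations, n = 1, n_candidates
    # keep ceil(n / ratio) candidates each round until one remains
    while n > 1:
        n = -(-n // ratio)  # ceiling division
        n_iterations += 1
    print(f"ratio={ratio}: {n_iterations} iterations")
# ratio=2 -> 5 iterations, ratio=3 -> 4: a smaller ratio gives more,
# smaller elimination steps.
```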

@amueller (Member) left a comment

LGTM, nice work!

@amueller (Member) commented Sep 8, 2020

Good to merge from my side. Re naming, @jnothman @NicolasHug how do you feel about factor instead of ratio?

@NicolasHug (Member Author)

I'm happy with factor

@jnothman (Member) commented Sep 9, 2020 via email

@NicolasHug (Member Author)

+2 and green... Let's get this in before CI randomly fails again please ^^

@adrinjalali (Member) commented Sep 9, 2020

Never got to review most of the code, but went through the docs a few times and I'm very excited to have this :) Thanks a ton @NicolasHug

@adrinjalali adrinjalali merged commit 0a5af0d into scikit-learn:master Sep 9, 2020
5 checks passed
@amueller (Member) commented Sep 9, 2020

YAAAY! This is so exciting!

@uhoenig commented Oct 11, 2020

Amazing work! I'm using the version from dabl until 0.24 gets released, though. Any info on the approximate release date? Thanks again for implementing the feature!

default, this is set to:
default, this is set to the highest possible value
satisfying the constraint `force_exhaust_resources=True` (which is
the default). Otherwise this is set to:

- ``n_splits * 2`` when ``resource='n_samples'`` for a regression
problem
- ``n_classes * n_splits * 2`` when ``resource='n_samples'`` for a
regression problem

not sure, but should this perhaps say 'classification' instead of regression?

@NicolasHug (Member Author)

We try to release twice a year. We don't have a definite date, but it should happen by the end of the year.

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
* More flexible grid search interface

* added info dict parameter

* Put back removed test

* renamed info into more_results

* Passed grroups as well since we need n_to use get_n_splits(X, y, groups)

* port

* pep8

* dabl -> sklearn

* add _required_parameters

* skipping check in rst file if pandas not installed

* Update sklearn/model_selection/_search_successive_halving.py

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

* renamed into GridHalvingSearchCV and RandomHalvingSearchCV

* Addressed thomas' comments

* repr

* removed passing group as a parameter to evaluate_candidates

* Joels comments

* pep8

* reorganized user user guide

* renaming

* update user guide

* remove groups support + pass fit_params

* parameter renaming

* pep8

* r_i -> resource_iter

* fixed r_i issues

* examples + removed use of word budget

* Added inpute checking tests

* added cv_resutlts_ user guide

* minor title change

* fixed doc layout

* Addressed some comments

* properly pass down fit_params

* change default value of force_exhaust_resources and update doc

* should fix doc

* Used check_fit_params

* Update section about min_resources and number of candidates

* Clarified ratio section

* Use ~ to refer to classes

* fixed doc checks

* Apply suggestions from code review

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

* Addressed easy comments from Joel

* missed some

* updated docstring of run_search

* Used f strings instead of format

* remove candidate duplication checks

* fix example

* Addressed easy comments

* rotate ticks labels

* Added discussion in the intro as suggested by Joel

* Split examples into sections

* minor changes

* remove force_exhaust_budget and introduce min_resources=exhaust

* some minor validation

* Added a n_resources_ attribute

* update examples

* Addressed comments

* passing CV instead of X,y

* minor revert for handling fit_params

* updated docs

* fix len

* whatsnew

* Add test for sampling when all_list

* minor change to top-k

* Force CV splits to be consistent across calls

* reorder parameters

* reduced diff

* added tests for top_k

* put back doc for groups

* not sure what went wrong

* put import at its place

* some comment

* Addressed comments

* Added tests for cv_results_ and base estimator inputs

* pep8

* avoid monkeypatching

* rename df

* use Joel's suggestions for testing masks

* Made it experimental

* Should fix docs

* whats new entry

* Apply suggestions from code review

Co-authored-by: Andreas Mueller <t3kcit@gmail.com>

* Addressed comments to docs

* Addressed comments in examples

* minor doc update

* minor renaming in UG

* forgot some

* some sad note about splitter statefulness :'(

* Addressed comments

* ratio -> factor

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Andreas Mueller <t3kcit@gmail.com>
Successfully merging this pull request may close these issues.

Add successive halving for search?
9 participants