[MRG] Successive halving for faster parameter search #13900
Conversation
I'm looking forward to this, but it is a big review task!
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
…rn into successive_halving
Most of it is docs, though ;) One thing that should be noted is that there is no literature on successive halving with a fixed number of configurations as far as I'm aware, or at least not with a fixed number of iterations and a variable budget. So some of the solutions for applying this for grids are solutions Nicolas and I came up with.
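To make the scheme being discussed concrete, here is a minimal standalone sketch of a successive-halving schedule with a fixed starting number of configurations: each iteration keeps roughly 1/ratio of the candidates and multiplies the per-candidate budget by ratio. This is an illustration of the idea, not the PR's implementation; the function name and signature are invented for this example.

```python
def halving_schedule(n_candidates, min_resources, max_resources, ratio=3):
    """Sketch of a successive-halving schedule.

    At each iteration the surviving candidate pool is divided by `ratio`
    while the per-candidate resource budget is multiplied by `ratio`,
    capped at `max_resources`.
    """
    schedule = []
    n, r = n_candidates, min_resources
    while n >= 1 and r <= max_resources:
        schedule.append((n, r))
        if n == 1:
            break
        n = max(1, n // ratio)
        r = min(max_resources, r * ratio)
    return schedule

# 16 starting candidates, budgets growing from 10 to at most 1000, ratio=2:
print(halving_schedule(16, 10, 1000, ratio=2))
# [(16, 10), (8, 20), (4, 40), (2, 80), (1, 160)]
```

Note that if the budget hits `max_resources` before the pool shrinks to one candidate, the loop stops early, which is the situation `aggressive_elimination` is meant to address.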
expensive and is not strictly required to select the parameters that
yield the best generalization performance.

max_budget : int, optional(default='auto')
this is actually r_max, not the maximum total budget, right? I'm not sure if it makes sense to call this r_max because that's kind of obscure, but the naming is inconsistent with r_min because they refer to the same thing (per-estimator per-iteration budget constraints).
Yes this is the maximum value that r_i is allowed to take.
I think we agreed to call it max_budget? This makes sense to me, especially when budget_on='n_samples'. But it might not be completely following the hyperband paper
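The relationship under discussion (r_i growing geometrically from r_min and capped by the maximum) can be written out in a few lines. The helper below is hypothetical, using the thread's working names rather than the final API:

```python
def resource_per_iteration(r_min, ratio, n_iterations, r_max):
    # r_i grows geometrically from r_min and is clipped at r_max:
    # r_max is the maximum value r_i is allowed to take, not the
    # total budget summed over iterations.
    return [min(r_min * ratio ** i, r_max) for i in range(n_iterations)]

print(resource_per_iteration(r_min=20, ratio=3, n_iterations=5, r_max=1000))
# [20, 60, 180, 540, 1000]
```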
Why this syntax of "optional("?
First pass
sklearn/model_selection/_search.py
Outdated
@@ -724,7 +731,8 @@ def evaluate_candidates(candidate_params):

         return self

-    def _format_results(self, candidate_params, scorers, n_splits, out):
+    def _format_results(self, candidate_params, scorers, n_splits, out,
+                        more_results={}):
more_results can be a regular argument (not a keyword).
We'd have to change GridSearchCV and RandomizedSearchCV then. Would you prefer it?
yeah, I also prefer the private API to have as few params with default values as possible. I'd change GridSearchCV and RandomizedSearchCV.
I also prefer the private API to have as few params with default values as possible
Why? I'm not sold on this
I like to not break backward compatibility when we can
It can also typically make writing tests annoying (see e.g. _find_binning_thresholds in sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py).
I'd still be happier if we gave more_results no default value. How easy or hard writing tests would be shouldn't affect the API. You can also use partial if you want to handle that in tests.
Could you please explain why you don't like defaults for private utilities? I haven't heard anything on that yet ;)
I believe the API would be weird for GridSearchCV and RandomizedSearchCV which would have to pass an empty dict to this function now.
Another advantage of adding a keyword is that we don't further break the current API, even if it's private. I believe @jnothman cares about it
I avoid defaults for private utilities, because I have seen a history of bugs caused by parameters not being passed when they should be, and that bug not being identified thanks to a default value. But in this case, I'm fairly ambivalent, because it's the kind of thing that would break the calling context if it was not passed. But actually, this only affects BaseSearchCV, which is its only caller.
A different nitpick: mutable default values are generally to be avoided, i.e. don't have ={}, instead have =None and change it to {} internally.
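The mutable-default pitfall mentioned here is easy to demonstrate: a `={}` default is evaluated once, at function definition time, so all calls share the same dict. The function names below are purely illustrative:

```python
def append_bad(value, results={}):   # one dict shared across *all* calls
    results[value] = True
    return results

def append_good(value, results=None):  # fresh dict per call
    if results is None:
        results = {}
    results[value] = True
    return results

print(append_bad('a'))   # {'a': True}
print(append_bad('b'))   # {'a': True, 'b': True}  <- 'a' leaked in
print(append_good('b'))  # {'b': True}
```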
I avoid defaults for private utilities, because I have seen a history of bugs caused by parameters not being passed when they should be, and that bug not being identified thanks to a default value
I'm not sure how this specifically applies to private utilities? It seems to me that this holds for any function, whether it's private or public.
Would we all be happy with this? (adding the kwonly part)
def _format_results(self, candidate_params, *, scorers, n_splits, out,
more_results=None):
It applies to private utilities because there is little usability benefit to providing default values when the utility is private. Yes, I'm fine with that in this case.
    classifier=is_classifier(self.estimator))
n_splits = cv.get_n_splits(X, y, groups)

# please see https://gph.is/1KjihQe for a justification
This is very well justified.
Are you just asking whether I am happy with those API changes? I appreciate that some kind of protected (in the Java sense) API change is necessary, and that it's hard to design the right change. All in all, I'd like to take a look at this PR, but I've not had a lot of time for review.
Should we consider just calling the estimator
:class:`model_selection.GridSearchCV`. :pr:`13900` by `Nicolas Hug`_, `Joel
Nothman`_ and `Andreas Müller`_.
some comments on docs and examples, I'll try to do the code tomorrow (??), though I think I've looked at most of it before.
doc/modules/grid_search.rst
Outdated
As illustrated in the figure below, only a small subset of candidates
'survive' until the last iteration. These are the candidates that have
consistently ranked among the best candidates across all iterations. Each
"best" is not true, I think, though I might be nitpicking here. It could have just made the cutoff in each round. Maybe "was in the subset of winners" or was in the top half, though we haven't explained the halving yet here... hm
I'll replace "best" by "top scoring":
These are the candidates that have
consistently ranked among the top-scoring candidates across all iterations
LMK if that's not OK
sounds good.
doc/modules/grid_search.rst
Outdated
search in our implementation. ``ratio`` effectively controls the number of
iterations in :class:`HalvingGridSearchCV` and the number of candidates (if
'auto') and iterations in :class:`HalvingRandomSearchCV`.
``aggressive_elimination=True`` can also be used if the number of available
Not sure if it makes sense to mention parameters without explaining them. Maybe just say "there are several more parameters that are explained in detail below"?
Joel suggested having a paragraph like this one... it's hard to satisfy every reviewer when it comes to docs :p
We do end this paragraph a few lines below with:
Each parameter and their interactions are
described in more details below.
Do you think we should say that earlier?
don't worry about it.
# We now plot heatmaps for both search estimators.


def make_heatmap(ax, gs, is_sh=False, make_cbar=False):
I assume we can't easily reuse any of the confusion matrix plot? I've been nagging @thomasjpfan to do a grid-search visualizer ;) But I guess pandas output is nice, too.
    max_resources='auto',  # max_resources=n_samples
    n_candidates='exhaust',
    cv=5,
    ratio=2,
What's the reason to use ratio=2? I thought ratio=3 was the more common choice.
Indeed. In this case ratio=2 makes the plot look nicer (longer) and makes it easier to illustrate the SH process in the narrative docs.
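The effect of the ratio on plot length is just the number of halving rounds it takes to get down to one candidate. A quick illustrative helper (not part of the PR), assuming the pool is reduced by `ratio` with ceiling division each round:

```python
def n_iterations(n_candidates, ratio):
    # Count rounds until one candidate remains, reducing the pool
    # by `ratio` (ceiling division) each round.
    n, rounds = n_candidates, 1
    while n > 1:
        n = -(-n // ratio)   # ceiling division
        rounds += 1
    return rounds

print(n_iterations(16, ratio=2))  # 5 rounds: 16 -> 8 -> 4 -> 2 -> 1
print(n_iterations(16, ratio=3))  # 4 rounds: 16 -> 6 -> 2 -> 1
```

With ratio=2 the same 16 candidates yield one extra iteration, hence the longer, easier-to-narrate plot.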
Co-authored-by: Andreas Mueller <t3kcit@gmail.com>
LGTM, nice work!
Good to merge from my side. Re naming, @jnothman @NicolasHug how do you feel about
I'm happy with
factor is excellent. I had spent some time failing to come up with the right name!
Excited to see this hit the road!!
+2 and green... Let's get this in before CI randomly fails again please ^^
Never got to review most of the code, but went through the docs a few times and I'm very excited to have this :) Thanks a ton @NicolasHug
YAAAY! This is so exciting!
Amazing work! Using the version from dabl until 0.24 gets released, though. Any info on the approximate release date? Thanks again for implementing the feature!
-        default, this is set to:
+        default, this is set to the highest possible value
+        satisfying the constraint `force_exhaust_resources=True` (which is
+        the default). Otherwise this is set to:

         - ``n_splits * 2`` when ``resource='n_samples'`` for a regression
           problem
         - ``n_classes * n_splits * 2`` when ``resource='n_samples'`` for a
           regression problem
not sure, but should this perhaps say 'classification' instead of regression?
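Assuming the second bullet above is indeed meant to read 'classification', the quoted defaults amount to the following hypothetical helper (names invented for illustration): two samples per CV split for regression, and two samples per class per split for classification, so every training split can contain each class.

```python
def default_min_resources(n_splits, n_classes=None):
    # Regression: at least two samples per CV split.
    if n_classes is None:
        return n_splits * 2
    # Classification: at least two samples per class per CV split.
    return n_classes * n_splits * 2

print(default_min_resources(n_splits=5))               # 10 (regression)
print(default_min_resources(n_splits=5, n_classes=3))  # 30 (classification)
```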
We try to release twice a year. We don't have a definite date but it should happen by the end of the year.
* More flexible grid search interface
* added info dict parameter
* Put back removed test
* renamed info into more_results
* Passed grroups as well since we need n_to use get_n_splits(X, y, groups)
* port
* pep8
* dabl -> sklearn
* add _required_parameters
* skipping check in rst file if pandas not installed
* Update sklearn/model_selection/_search_successive_halving.py (Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>)
* renamed into GridHalvingSearchCV and RandomHalvingSearchCV
* Addressed thomas' comments
* repr
* removed passing group as a parameter to evaluate_candidates
* Joels comments
* pep8
* reorganized user guide
* renaming
* update user guide
* remove groups support + pass fit_params
* parameter renaming
* pep8
* r_i -> resource_iter
* fixed r_i issues
* examples + removed use of word budget
* Added input checking tests
* added cv_results_ user guide
* minor title change
* fixed doc layout
* Addressed some comments
* properly pass down fit_params
* change default value of force_exhaust_resources and update doc
* should fix doc
* Used check_fit_params
* Update section about min_resources and number of candidates
* Clarified ratio section
* Use ~ to refer to classes
* fixed doc checks
* Apply suggestions from code review (Co-authored-by: Joel Nothman <joel.nothman@gmail.com>)
* Addressed easy comments from Joel
* missed some
* updated docstring of run_search
* Used f strings instead of format
* remove candidate duplication checks
* fix example
* Addressed easy comments
* rotate ticks labels
* Added discussion in the intro as suggested by Joel
* Split examples into sections
* minor changes
* remove force_exhaust_budget and introduce min_resources=exhaust
* some minor validation
* Added a n_resources_ attribute
* update examples
* Addressed comments
* passing CV instead of X,y
* minor revert for handling fit_params
* updated docs
* fix len
* whatsnew
* Add test for sampling when all_list
* minor change to top-k
* Force CV splits to be consistent across calls
* reorder parameters
* reduced diff
* added tests for top_k
* put back doc for groups
* not sure what went wrong
* put import at its place
* some comment
* Addressed comments
* Added tests for cv_results_ and base estimator inputs
* pep8
* avoid monkeypatching
* rename df
* use Joel's suggestions for testing masks
* Made it experimental
* Should fix docs
* whats new entry
* Apply suggestions from code review (Co-authored-by: Andreas Mueller <t3kcit@gmail.com>)
* Addressed comments to docs
* Addressed comments in examples
* minor doc update
* minor renaming in UG
* forgot some
* some sad note about splitter statefulness :'(
* Addressed comments
* ratio -> factor

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Andreas Mueller <t3kcit@gmail.com>
Closes #12538
This implements hyperparameter search with successive halving.
This builds upon #13145, whose changes are required. This is a port of what we implemented in dabl with @amueller.
Still WIP but very advanced; I would appreciate some feedback before I start tackling the last few bullet points, so I'll mark as MRG. ping @ogrisel ;)
Benchmarks
EDIT: more recent benchmarks here
Please check out dabl benchmarks for source.