
RFC BayesSearchCV in scikit-learn #26170

Open
adrinjalali opened this issue Apr 13, 2023 Discussed in #26141 · 14 comments
Labels
module:model_selection Needs Decision - Include Feature Requires decision regarding including feature

Comments

@adrinjalali
Member

adrinjalali commented Apr 13, 2023

Discussed in #26141

Originally posted by earlev4 April 11, 2023
Hi! First off, thanks so much to the excellent work done by all the scikit-learn contributors! The project is truly a gift and your work is greatly appreciated!

I still consider myself a novice when it comes to scikit-learn. I typically attempt to use GridSearchCV when searching for the best parameters. However, depending on the search space, GridSearchCV might not always be the best option, and it can be computationally expensive and time-consuming in some scenarios. RandomizedSearchCV can be an alternative in these situations, but it does not always seem to find parameters as good as those from scikit-optimize's BayesSearchCV. In my humble opinion, scikit-optimize's BayesSearchCV seems to be a nice compromise between GridSearchCV and RandomizedSearchCV, providing good parameters in a reasonable time.

Unfortunately, scikit-optimize's BayesSearchCV seems to be no longer supported; the last commit was in 2021. As of NumPy 1.24, importing it results in an error unless the np.int = int workaround is applied. This is just one example; there are numerous issues that have not been touched since 2021. It would be a shame to lose a project such as scikit-optimize's BayesSearchCV, and I humbly ask the contributors of scikit-learn whether a similar version could be implemented in scikit-learn.
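For context, the np.int breakage mentioned above comes from NumPy 1.24 removing the long-deprecated np.int alias for the built-in int. A minimal sketch of the workaround, guarded so it is a no-op on older NumPy versions (run it before importing skopt):

```python
import numpy as np

# NumPy 1.24 removed the deprecated np.int alias; restore it so that
# libraries still referencing np.int (such as older scikit-optimize
# releases) keep working. No-op on NumPy versions that still have it.
if not hasattr(np, "int"):
    np.int = int
```

This is a stopgap, not a fix; it papers over the deprecation rather than updating the library itself.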

Thanks so much for your consideration! Looking forward to the discussion.


I thought scikit-optimize was a maintained project, but it seems it isn't, and I agree it'd be nice for the community to have a maintained BayesSearchCV available.

I'm not sure if we should include it here, or scikit-learn-extra, but would be nice to have it.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Apr 13, 2023
@betatim
Member

betatim commented Apr 13, 2023

As one of the scikit-optimize creators: if someone who has experience maintaining open-source projects wants to take over maintenance of scikit-optimize, please get in touch!


Given the additional complexity of using a tool like BayesSearchCV (you trade tuning your estimator for tuning the regression model used by BayesSearchCV) and the cheapness of additional compute: how does RandomizedSearchCV/HalvingRandomSearchCV with more CPU and RAM compare to Bayes search?
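For readers unfamiliar with the two alternatives being compared, a minimal sketch of running both on toy data (the dataset and parameter range here are illustrative, not a benchmark):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import HalvingRandomSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)
param_distributions = {"C": loguniform(1e-3, 1e3)}

# Plain random search: evaluate a fixed number of sampled candidates.
random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000), param_distributions,
    n_iter=20, random_state=0,
).fit(X, y)

# Successive halving: start many candidates on few samples, keep the
# best fraction at each round with progressively more resources.
halving_search = HalvingRandomSearchCV(
    LogisticRegression(max_iter=1000), param_distributions,
    random_state=0,
).fit(X, y)

print(random_search.best_score_, halving_search.best_score_)
```

Note that the halving searches are still experimental in scikit-learn and require the enable_halving_search_cv import.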

The last time we (the scikit-optimize authors) paid attention to the literature/benchmarks related to this topic our conclusion was that renting more CPU+RAM was way less complicated.

@jiawei-zhang-a
Contributor

@betatim Thank you so much for your effort on scikit-optimize! If the scikit-learn maintainers decide to include BayesSearchCV as mentioned above, @Charlie-XIAO, @ROMEEZHOU, and I would like to help move that feature into scikit-learn. We are also willing to help find and solve the issues that have arisen since 2021. Are there specific smaller tasks that you recommend starting with?

@thomasjpfan thomasjpfan added module:model_selection Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Triage Issue requires triage labels Apr 14, 2023
@ogrisel
Member

ogrisel commented Apr 14, 2023

+1 for a community-maintained implementation, either in scikit-optimize, scikit-learn-extra, or a dedicated single-estimator package under the scikit-learn-contrib organization.

However, as @betatim said, I think this is too much complexity to maintain as part of the main scikit-learn repo.

@adrinjalali
Member Author

The last time we (the scikit-optimize authors) paid attention to the literature/benchmarks related to this topic our conclusion was that renting more CPU+RAM was way less complicated.

That basically means putting more money behind it, and I don't think we can expect everybody to be able to do that.

I've heard over and over that people want BayesSearchCV in scikit-learn, so from that perspective I'd be in favor of having it in the core library. Maybe a draft PR wouldn't hurt, to see how complex its inclusion in scikit-learn would be?

@betatim
Member

betatim commented Jul 6, 2023

Together with a PR (or maybe even before) it would be interesting to see a comparison to random search or sequential halving to evaluate how big an improvement you get (in terms of time spent to find a "nearly optimal point").

That way, no matter what the outcome, we can either add it to scikit-learn or write a blog post + docs to point out why it is not needed.

@Charlie-XIAO
Contributor

Charlie-XIAO commented Jul 6, 2023

Hi, have the maintainers decided to work on this, or is it open to the public? @jiawei-zhang-a and I have been interested in this for a while; if possible, we would like to help by drafting a PR.

@adrinjalali
Member Author

@Charlie-XIAO we would really appreciate it if you could start with what @betatim suggests and a draft PR.

@betatim
Member

betatim commented Jul 6, 2023

Depending on your prior about the outcome of a benchmark, it might be wise to start with that instead of a PR.

Something I don't know how to benchmark/measure is the "you trade tuning your estimator for tuning the regression model used by BayesSearchCV" part. As a user of BayesSearchCV I have to make choices about the regression model used. In particular, if you use Gaussian processes (GPs) as your model, my impression is that you quickly end up in a world that is unfamiliar to the average person. It is a fascinating world if you are into that kind of stuff, but it is by no means "free" the way picking points at random is.

A key problem to think about and try to solve is how to keep this cost under control, or even reduce it so much that the average user doesn't have to think about it. I don't know how clever we can be about picking the right kernels for each hyper-parameter, etc. Part of the reason I spent time learning about and using trees as the regressor instead of GPs in scikit-optimize is that they seem to perform well and are quite easy to reason about. There is a group at the University of Freiburg in Germany, led by Frank Hutter, that is very active in the AutoML world and has used decision-tree-based models to win a lot of competitions.
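To make the "tuning the surrogate" trade-off concrete, here is a rough, hypothetical sketch of a single step of sequential model-based optimization with a tree ensemble as the surrogate. This is not scikit-optimize's actual implementation; the objective function, candidate grid, and the crude lower-confidence-bound acquisition are all illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def objective(c):
    # Hypothetical noisy validation loss as a function of one
    # hyper-parameter C (stands in for a real cross-validation score).
    return (np.log10(c) - 1.0) ** 2 + rng.normal(scale=0.05)

# Past evaluations: a handful of log-spaced candidate values of C.
evaluated_c = 10.0 ** rng.uniform(-3, 3, size=8)
losses = np.array([objective(c) for c in evaluated_c])

# Fit the tree-based surrogate on (hyper-parameter, loss) pairs.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(np.log10(evaluated_c).reshape(-1, 1), losses)

# Pick the next candidate: lowest predicted mean minus an exploration
# bonus from the spread across trees (a crude lower-confidence bound).
candidates = np.linspace(-3, 3, 200).reshape(-1, 1)
per_tree = np.stack([t.predict(candidates) for t in surrogate.estimators_])
lcb = per_tree.mean(axis=0) - per_tree.std(axis=0)
next_c = 10.0 ** candidates[np.argmin(lcb), 0]
```

The appeal of trees here is exactly what the comment describes: no kernel choices, and the per-tree spread gives a cheap uncertainty estimate.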

Another line of thought is to investigate what the popular optimisers/models are in tools like https://optuna.org/. Optuna seems to be a popular choice. I don't think it contains any "secret sauce" that people couldn't get elsewhere, so understanding why people choose it and which parts of it they use could give you an unfair advantage in terms of making a hyper-parameter optimiser that is simple to maintain, simple to use, and used a lot.

But this moves the goal from "create a BayesSearchCV" to "create a simple and powerful hyper-parameter optimiser". Whether this is a good thing or not depends on people's priors (I guess).

A (very) old paper comparing methods https://proceedings.neurips.cc/paper_files/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf

@Charlie-XIAO
Contributor

Thanks for the information, @betatim. I'll definitely take a look.

@thomasjpfan
Member

The original Hyperband paper, which uses successive halving, has benchmarks against Bayesian methods. For experimentation, Optuna already has a SearchCV wrapper for using their strategies. For reference, here is a recent paper comparing Optuna and HyperOpt.

Given the third-party options out there, I am not too excited about adding a Bayes Search CV. I would rather push people to use successive halving and further improve it. For example, we can take advantage of warm starting: #15125

@amueller may have more insights on this topic

@Charlie-XIAO
Contributor

Charlie-XIAO commented Jul 6, 2023

As for benchmarking, the Optuna developers seem to use their own kurobako, and there is also HPOBench.

I would rather push people to use successive halving and further improve it.

This does seem reasonable. For instance, Hyperband (based on successive halving) seems to perform better than Bayesian optimization with small-to-medium budgets (because Bayesian optimization behaves like random search initially and only stands out with larger budgets). However, there is also BOHB, which combines Bayesian optimization and Hyperband and in turn outperforms Hyperband on most tasks (though the BOHB repo is no longer maintained). Here is an interesting post that might be worth reading.

I have also seen DEHB, based on differential evolution and Hyperband, which claims to beat BOHB. The repo is here.

I'm not really sure what scikit-learn wants to have. An algorithm that has been quite popular, a state-of-the-art algorithm, or something else?

Given the third party options out there

I think scikit-learn is still the first choice nowadays for machine-learning problems, especially for beginners (and beginners may not even know about Optuna).

Regardless, I don't think I'm as familiar with this topic as the maintainers (I'm just providing some information and personal thoughts here). Maybe I should first wait for you to reach an agreement?

@thomasjpfan
Member

Regardless, I don't think I'm as familiar with this topic as the maintainers (I'm just providing some information and personal thoughts here). Maybe I should first wait for you to reach an agreement?

Yeah, from what I've seen, people have moved away from Bayesian approaches toward approaches that take training time into consideration.

I'm not really sure what scikit-learn wants to have. An algorithm that has been quite popular,

From memory, I think successive halving is close enough to Hyperband, and Hyperband is better than Bayesian approaches unless you have a lot of time to search the space. At the end of the day, I'd always recommend HalvingRandomSearchCV even if a BayesSearchCV existed. I'd rather people get a good-enough model quickly and move on to evaluation than spend a long time searching for hyper-parameters.

@adrinjalali
Member Author

Seems like this is becoming more of a documentation issue, then. We should probably make sure we refer people to good examples from all the SearchCV classes, so that they follow best practices.

@betatim
Member

betatim commented Jul 12, 2023

In the pointers/docs we add, can we include "keywords" like "hyperopt" and the like to help people (including me; I always have to look up the paper to remind myself that it is basically the same) realise that the somewhat obscurely named HalvingSearchCV is what they are looking for if they want "hyperopt"? Or is it hyperband?

👍 improved docs and examples and references within the docs.
