
[WIP] Add ensemble selection algorithm #6540

Closed

Conversation

yenchenlin
Contributor

@yenchenlin commented Mar 14, 2016

This PR is an implementation of ensemble selection, as discussed in #6329; however, the API of this module is not yet determined.

Check List:

  • Implement main algorithm
  • Add example code
  • Determine API
  • Test
  • Detailed documentation

Reference:
R. Caruana et al., "Ensemble Selection from Libraries of Models", ICML 2004.
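
For readers skimming the thread, here is a minimal sketch of the core loop described in the paper: greedy forward selection with replacement over the held-out ("hillclimb") predictions of a library of pre-fitted models. The function name, signature, and the choice of log loss as the metric are illustrative assumptions, not the API proposed in this PR.

```python
import numpy as np
from sklearn.metrics import log_loss

def ensemble_selection(predictions, y_true, n_iter=50):
    """Greedy forward selection with replacement (Caruana et al., 2004).

    predictions : list of arrays of shape (n_samples, n_classes) holding the
        hillclimb-set probability predictions of each pre-fitted candidate.
    Returns integer counts (weights), one per candidate model.
    """
    weights = np.zeros(len(predictions), dtype=int)
    ensemble_sum = np.zeros_like(predictions[0])

    for _ in range(n_iter):
        # Score the ensemble obtained by adding each candidate one more time.
        scores = [
            log_loss(y_true, (ensemble_sum + p) / (weights.sum() + 1))
            for p in predictions
        ]
        best = int(np.argmin(scores))   # keep the candidate that helps most
        weights[best] += 1              # selection is *with replacement*
        ensemble_sum += predictions[best]

    return weights
```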

@x3n0cr4735

Awesome, thanks. Will take a look today.

@yenchenlin
Contributor Author

Hello @x3n0cr4735,
you can test the implementation with the following code snippet:
https://gist.github.com/yenchenlin1994/9fe63d5a5f481b6256eb

NOTE: you should download the Adult dataset and replace this line with the path to the Adult data on your computer.

Also, you may need pandas.
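
For anyone trying the snippet, here is a minimal sketch of loading the UCI Adult data with pandas; the path, column names, and preprocessing are placeholders, not the gist's exact code.

```python
import pandas as pd

# Hypothetical path; point this at the UCI "adult.data" file on your machine.
ADULT_PATH = "/path/to/adult.data"

# The raw file has no header row; these names follow the dataset description.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]
df = pd.read_csv(ADULT_PATH, header=None, names=columns, skipinitialspace=True)

# One-hot encode the categorical columns and binarize the target.
X = pd.get_dummies(df.drop("income", axis=1))
y = (df["income"] == ">50K").astype(int)
```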

@x3n0cr4735

Ok, sounds good.

These estimators must be fitted.

max_bag_estimators : integer, optional (default=50)
The maximum number of estimators at each bag ensemble selection
Member

I think there's a relative pronoun missing here. Rewrite for clarity.

@x3n0cr4735

Hey @yenchenlin1994, I ran the demo and it worked. I looked at your EnsembleSelectionClassifier class and have a couple of comments/questions:

  1. I think the code is very readable and your comments are very useful.
  2. I wonder whether it might be worth making another class that feeds the EnsembleSelectionClassifier with the estimators it will be fitting and selecting over - something similar to a GridSearch mechanism. In your demo you give the models explicitly, but it might be useful to be able to specify a param grid for each model, or perhaps one param grid for all models at once - as in the MajorityVotingClassifier. It seems like a general enough problem that such a feature would be useful, although I'm not sure whether you want to include it here or have it as a complementary feature. Let me know your thoughts.
  3. Are you OK with only having binomial predictions at first or are you aiming to add multi-label?
  4. It might also be useful to be able to serialize the models, along with a dictionary of their scores (a rough sketch of what I mean follows below). Maybe this is getting too specific for the general purpose - I don't know.
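
For point 4, a rough sketch of the kind of serialization I have in mind, using joblib; all names here are placeholders, and at the time of this PR joblib also ships as sklearn.externals.joblib.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)
fitted_estimators = [("tree", DecisionTreeClassifier(random_state=0).fit(X, y))]
scores = {"tree": 1.0}  # placeholder validation scores, keyed by estimator name

# Persist the fitted models together with their scores...
joblib.dump({"estimators": fitted_estimators, "scores": scores},
            "ensemble_library.joblib")

# ...and reload the library later to hand it to the ensemble selector.
library = joblib.load("ensemble_library.joblib")
```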

@x3n0cr4735

P.S. It would be nice to have some notes at the start of each method, like what you did for the class. I'm not sure if this is standard for sklearn. If not, ignore my suggestion.

@jnothman
Member

If we're offering estimators that take pre-fitted sub-estimators (I know there's a calibration setting that does this too), then we need to at least document the impossibility of clone-based metaestimators around them, if not programmatically forbid it.

----------
estimators : list of (string, estimator) tuples
The estimators from which the ensemble selection classifier is built.
These estimators must be fitted.
Member

Please explain that the strings are arbitrary, unique names. I'm not sure they're well motivated, actually, if we're only dealing with fitted estimators: the model can refer to estimators by index.

Contributor Author

I think it's more consistent with VotingClassifier, though the input to VotingClassifier is unfitted estimators.
Another benefit is that when users want to see the result of ensemble selection, we can output the string corresponding to the estimator and its weight, which is clearer.
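
To make the naming argument concrete, a hypothetical construction with named, pre-fitted estimators; the EnsembleSelectionClassifier call is commented out because its signature is not final, and the hillclimb data names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

# Arbitrary, unique names paired with already-fitted estimators,
# mirroring the (name, estimator) convention of VotingClassifier.
estimators = [
    ("logreg", LogisticRegression().fit(X, y)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)),
]

# Hypothetical usage only:
# clf = EnsembleSelectionClassifier(estimators=estimators).fit(X_hillclimb, y_hillclimb)
# The selection result could then be reported per name, e.g. {"forest": 3, "logreg": 1}.
```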

@jnothman
Member

Please work in this order:

  1. make this Python 3-compatible and keep it that way
  2. add tests
  3. make it idiomatic numpy: use arrays of weights, and arrays of indices, not dicts and counters and names.
  4. allow arbitrary scoring parameter
  5. examples
  6. narrative documentation

We'll deal with further API issues along the way.
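
Regarding point 3, a small sketch (with made-up data) of what the array-based bookkeeping could look like, as opposed to dicts and Counters keyed by estimator name:

```python
import numpy as np

rng = np.random.RandomState(0)
n_estimators, n_samples, n_classes = 6, 10, 2

# Stand-in for the stacked hillclimb-set probability predictions of all candidates.
all_proba = rng.dirichlet(np.ones(n_classes), size=(n_estimators, n_samples))

# Selection history as an integer index array rather than names in a Counter;
# bincount turns it into per-estimator weights.
selected = np.array([2, 0, 2, 5, 2])
weights = np.bincount(selected, minlength=n_estimators)

# Weighted average of predictions in one vectorized step: shape (n_samples, n_classes).
avg_proba = np.tensordot(weights, all_proba, axes=1) / weights.sum()
```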

@yenchenlin
Contributor Author

@x3n0cr4735 @jnothman sorry for my poor PR 😢
Some of these are for debugging purposes and I forgot to remove them.

And super super super thanks for the review, I will modify all these right away.

@jnothman
Member

sorry for my poor PR 😢

Who said it was poor?

@jnothman
Member

Alternatively: who said making a perfect PR was easy? It takes practice and feedback to make a good PR, which is why we encourage people to start with something small, submit WIPs early, and not take on hundreds of issues at once.

@jnothman
Member

And the whole point of getting core developers, and not some arbitrary person, to review is that we're familiar with what is idiomatic for this project, what is consistent, and what corner cases need to be handled for integration and usability that you may not have thought of by just translating a paper into code. So until you are equally familiar with those things, you should anticipate that we will identify issues that you had not foreseen.

@x3n0cr4735

Hey @yenchenlin1994, I agree with @jnothman that you should not feel bad about the PR. I haven't contributed much to open source, but I'm guessing it's like the peer-reviewed publication process, which I'm familiar with. It's rough and sometimes seems arbitrary, but it will be helpful for producing the best output. @jnothman could you please comment on points 2 and 3 in my previous post?

@jnothman
Member

(2) was somewhat decided against in #6329.
Re (3), the current solution doesn't even handle binary problems yet; let's get some test cases in before we worry about such limitations.

@x3n0cr4735

@jnothman

I see now that you already had the same idea. What if we built something analogous to GridSearchCV, where we treat the ensemble selector as the estimator argument for the 2000-model builder? If GridSearchCV can handle a meta-estimator like a stacking estimator, why would it not be able to handle the ensemble selector? I'm sure there may be some caveats (like whether we should make it able to handle base estimators instead of this specific meta-estimator, or other meta-estimators), but it might be useful to at least make something to build the portfolio of models, even if it doesn't end up being part of sklearn.

@jnothman
Member

The argument was that given the limitations of scikit-learn parallelism, we would rather suggest users provide pre-fitted estimators. However, we can certainly use a ParameterGrid or a ParameterSampler to derive the estimators in an example.
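
For example, something along these lines could go in such an example. This is a sketch only: depending on the scikit-learn version, ParameterGrid is importable from sklearn.model_selection or the older sklearn.grid_search, and all other names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_hillclimb, y_train, y_hillclimb = train_test_split(X, y, random_state=0)

# Derive a small library of pre-fitted candidates from a parameter grid.
grid = ParameterGrid({"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]})
library = [
    ("tree_%d" % i,
     DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train))
    for i, params in enumerate(grid)
]
# `library` can then be handed to the ensemble selection estimator, which
# would pick and weight models against the hillclimb split.
```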

@x3n0cr4735

So you're thinking it would be better to have the first step executed with something like Hadoop, where the models are saved and then picked up by the ensemble selector. Would it be beneficial to get the ball rolling with an implementation in scikit-learn and then, from there, write a map-reduce solution? Maybe the Datastax people can take the scikit-learn version and write a Spark implementation. They seem to do a lot of that.

@x3n0cr4735

I'm sure this has been discussed among core developers, but I'm not educated on the topic. If IPython uses ZeroMQ instead of joblib, is it possible to leverage a message queue in scikit-learn? Has anyone discussed integrating functionality with pika, for instance?

@jnothman
Member

I don't really see the relevance of map-reduce over any old job scheduler. GNU Parallel would be sufficient for distributing the fitting of estimators to then be used in ensemble selection. It's not that we're thinking of any particular thing, except that if you give the user an interface where they can construct the estimators here, the user feels compelled to use that interface, not DIY... then you need to handle both the sampling and grid approaches, combinations of multiple samplers, parallelism options, etc., and it becomes a monolith quite beyond the interesting part of this algorithm. Suddenly the extra usability becomes a burden: better to give an example of how to load this thing up and let the users' imagination run wild.

@jnothman
Member

That's quite off topic, but I suspect the answer is something like: scikit-learn has no intention to support sophisticated parallelism, including message passing etc.; however, I think an IPython backend for joblib.parallel is a good way for us to consider extending parallelism to multi-machine clusters. It's not something I've discussed with the scikit-learn (or joblib) devs.

@x3n0cr4735

Yeah, it's off-topic, but still an important one. What would be involved in creating a backend for joblib.parallel based on ipython? That sounds like a huge task.

@yenchenlin
Contributor Author

Hey @jnothman , may I ask a quick question?
I don't quite get the following paragraph; would you please elaborate on clone-based metaestimators?

If we're offering estimators that take pre-fitted sub-estimators (I know there's a calibration setting that does this too), then we need to at least document the impossibility of clone-based metaestimators around them, if not programmatically forbid it.

@jnothman
Member

sklearn.base.clone is how scikit-learn duplicates estimators so that they can be fit with different data or parameters. It is used by GridSearchCV among others. It reconstructs its argument, retaining estimator constructor parameters but deleting other attributes; however it retains them by recursively cloning them. This means that if an estimator or collection of estimators is a parameter, any fitted model will be removed in the clone operation.

Thus clone(EnsembleSelectionClassifier([('a', my_fitted_estimator)])) will return an EnsembleSelectionClassifier with unfitted estimators in its ensemble, and fitting it (which is the next thing a GridSearchCV would do after cloning and setting parameters) will, I take it, fail.
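
To illustrate the behaviour being described, a minimal example of clone() discarding fitted state (assuming current scikit-learn behaviour):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
fitted = LogisticRegression().fit(X, y)
print(hasattr(fitted, "coef_"))   # True: the estimator carries its fitted state

# clone() rebuilds the estimator from its constructor parameters only,
# so everything learned in fit() is discarded.
copy = clone(fitted)
print(hasattr(copy, "coef_"))     # False: the clone is unfitted
```

Since the (name, estimator) tuples are themselves constructor parameters, cloning the meta-estimator recursively clones them and strips their fitted state in the same way.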

@yenchenlin
Contributor Author

Great, thanks for your detailed explanation @jnothman!
I got it.

@x3n0cr4735

Hey @yenchenlin1994, how are you doing with this? If you need help addressing some of the comments let me know.

@yenchenlin
Contributor Author

Thanks! I am currently modifying code.
Will let you know if I need help @x3n0cr4735
Thanks a lot 😄

@amueller
Member

This looks great. Any update?

@x3n0cr4735

Haven't heard anything back from @yenchenlin

@yenchenlin
Contributor Author

Sorry for my late reply; I'm relatively busy right now. I will ping you guys back when I have substantial progress.

@ajing

ajing commented Jun 26, 2017

Any progress on this topic?

@amueller removed this from PR phase in Andy's pets Jul 21, 2017
@realfaker

Any updates?

@adrinjalali
Member

Since the OP hasn't had time to work on this and the code is in Python 2, closing this; a fresh start might be better.
