
[WIP] Add ensemble selection algorithm #6540

Closed

Conversation

yenchenlin
Contributor

@yenchenlin commented Mar 14, 2016

This PR is an implementation of ensemble selection, as discussed in #6329; however, the API of this module is not yet determined.

Check List:

  • Implement main algorithm
  • Add example code
  • Determine API
  • Test
  • Detailed documentation

Reference:
R. Caruana et al., "Ensemble Selection from Libraries of Models", ICML 2004.
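
For readers skimming the thread, here is a minimal sketch of the core loop described in the paper: greedy forward selection with replacement over the held-out ("hillclimb") predictions of a library of pre-fitted models. The function name, signature, and the choice of log loss as the metric are illustrative assumptions, not the API proposed in this PR.

```python
import numpy as np
from sklearn.metrics import log_loss

def ensemble_selection(predictions, y_true, n_iter=50):
    """Greedy forward selection with replacement (Caruana et al., 2004).

    predictions : list of arrays of shape (n_samples, n_classes) holding the
        hillclimb-set probability predictions of each pre-fitted candidate.
    Returns integer counts (weights), one per candidate model.
    """
    weights = np.zeros(len(predictions), dtype=int)
    ensemble_sum = np.zeros_like(predictions[0])

    for _ in range(n_iter):
        # Score the ensemble obtained by adding each candidate one more time.
        scores = [
            log_loss(y_true, (ensemble_sum + p) / (weights.sum() + 1))
            for p in predictions
        ]
        best = int(np.argmin(scores))   # keep the candidate that helps most
        weights[best] += 1              # selection is *with replacement*
        ensemble_sum += predictions[best]

    return weights
```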

@x3n0cr4735

Awesome, thanks. Will take a look today.

@yenchenlin
Contributor Author

Hello @x3n0cr4735,
you can test the implementation with the following code snippet:
https://gist.github.com/yenchenlin1994/9fe63d5a5f481b6256eb

NOTE: you should download the Adult dataset and replace this line with the path to the Adult data on your computer.

Also, you may need pandas.
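
For anyone trying the snippet, here is a minimal sketch of loading the UCI Adult data with pandas; the path, column names, and preprocessing are placeholders, not the gist's exact code.

```python
import pandas as pd

# Hypothetical path; point this at the UCI "adult.data" file on your machine.
ADULT_PATH = "/path/to/adult.data"

# The raw file has no header row; these names follow the dataset description.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]
df = pd.read_csv(ADULT_PATH, header=None, names=columns, skipinitialspace=True)

# One-hot encode the categorical columns and binarize the target.
X = pd.get_dummies(df.drop("income", axis=1))
y = (df["income"] == ">50K").astype(int)
```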

@x3n0cr4735

Ok, sounds good.

These estimators must be fitted.

max_bag_estimators : integer, optional (default=50)
The maximum number of estimators at each bag ensemble selection
Member

I think there's a relative pronoun missing here. Rewrite for clarity.

@x3n0cr4735

Hey @yenchenlin1994, I ran the demo and it worked. I looked at your EnsembleSelectionClassifier class and have a couple of comments/questions:

  1. I think the code is very readable and your comments are very useful.
  2. I wonder whether it might be worth making another class that feeds the EnsembleSelectionClassifier with the estimators it will be fitting and selecting over - something similar to a GridSearch mechanism. In your demo you give the models explicitly, but it might be useful to be able to specify a param grid for each model, or perhaps one param grid for all models at once - as in the MajorityVotingClassifier. It seems like a general enough problem that such a feature would be useful, although I'm not sure whether you want to include it here or have it as a complementary feature. Let me know your thoughts.
  3. Are you OK with only having binomial predictions at first or are you aiming to add multi-label?
  4. It might also be useful to be able to serialize the models, along with a dictionary of their scores (a rough sketch of what I mean follows below). Maybe this is getting too specific for the general purpose - I don't know.
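
For point 4, a rough sketch of the kind of serialization I have in mind, using joblib; all names here are placeholders, and at the time of this PR joblib also ships as sklearn.externals.joblib.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)
fitted_estimators = [("tree", DecisionTreeClassifier(random_state=0).fit(X, y))]
scores = {"tree": 1.0}  # placeholder validation scores, keyed by estimator name

# Persist the fitted models together with their scores...
joblib.dump({"estimators": fitted_estimators, "scores": scores},
            "ensemble_library.joblib")

# ...and reload the library later to hand it to the ensemble selector.
library = joblib.load("ensemble_library.joblib")
```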

@x3n0cr4735

P.S. It would be nice to have some notes at the start of each method, like what you did for the class. I'm not sure if this is standard for sklearn. If not, ignore my suggestion.

@jnothman
Member

If we're offering estimators that take pre-fitted sub-estimators (I know there's a calibration setting that does this too), then we need to at least document the impossibility of clone-based metaestimators around them, if not programmatically forbid it.

----------
estimators : list of (string, estimator) tuples
The estimators from which the ensemble selection classifier is built.
These estimators must be fitted.
Member

Please explain that the strings are arbitrary, unique names. I'm not sure they're well motivated, actually, if we're only dealing with fitted estimators: the model can refer to estimators by index.

Contributor Author

I think it's more consistent with VotingClassifier, though the input to VotingClassifier is unfitted estimators.
Another benefit is that when users want to see the result of ensemble selection, we can output the string corresponding to the estimator and its weight, which is clearer.
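
To make the naming argument concrete, a hypothetical construction with named, pre-fitted estimators; the EnsembleSelectionClassifier call is commented out because its signature is not final, and the hillclimb data names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

# Arbitrary, unique names paired with already-fitted estimators,
# mirroring the (name, estimator) convention of VotingClassifier.
estimators = [
    ("logreg", LogisticRegression().fit(X, y)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)),
]

# Hypothetical usage only:
# clf = EnsembleSelectionClassifier(estimators=estimators).fit(X_hillclimb, y_hillclimb)
# The selection result could then be reported per name, e.g. {"forest": 3, "logreg": 1}.
```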

@jnothman
Member

Please work in this order:

  1. make this Python 3-compatible and keep it that way
  2. add tests
  3. make it idiomatic numpy: use arrays of weights, and arrays of indices, not dicts and counters and names.
  4. allow arbitrary scoring parameter
  5. examples
  6. narrative documentation

We'll deal with further API issues along the way.
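
Regarding point 3, a small sketch (with made-up data) of what the array-based bookkeeping could look like, as opposed to dicts and Counters keyed by estimator name:

```python
import numpy as np

rng = np.random.RandomState(0)
n_estimators, n_samples, n_classes = 6, 10, 2

# Stand-in for the stacked hillclimb-set probability predictions of all candidates.
all_proba = rng.dirichlet(np.ones(n_classes), size=(n_estimators, n_samples))

# Selection history as an integer index array rather than names in a Counter;
# bincount turns it into per-estimator weights.
selected = np.array([2, 0, 2, 5, 2])
weights = np.bincount(selected, minlength=n_estimators)

# Weighted average of predictions in one vectorized step: shape (n_samples, n_classes).
avg_proba = np.tensordot(weights, all_proba, axes=1) / weights.sum()
```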

@yenchenlin
Contributor Author

@x3n0cr4735 @jnothman sorry for my poor PR 😢
Some of these are for debugging purposes and I forgot to remove them.

And super super super thanks for the review, I will modify all these right away.

@jnothman
Member

sorry for my poor PR 😢

Who said it was poor?

@jnothman
Member

Alternatively: who said making a perfect PR was easy? It takes practice and feedback to make a good PR, which is why we encourage people to start with something small, submit WIPs early, and not take on hundreds of issues at once.

@jnothman
Member

And the whole point of getting core developers, and not some arbitrary person, to review is that we're familiar with what is idiomatic for this project, what is consistent, and what corner cases need to be handled for integration and usability that you may not have thought of by just translating a paper into code. So until you are equally familiar with those things, you should anticipate that we will identify issues that you had not foreseen.

@x3n0cr4735

Hey @yenchenlin1994, I agree with @jnothman that you should not feel bad about the PR. I haven't contributed much to open source, but I'm guessing it's like the peer-reviewed publication process, which I'm familiar with. It's rough and sometimes seems arbitrary, but it will be helpful for producing the best output. @jnothman could you please comment on points 2 and 3 in my previous post?

@jnothman
Member

(2) was somewhat decided against in #6329.
Re (3), the current solution doesn't even handle binary problems yet; let's get some test cases in before we worry about such limitations.

@x3n0cr4735

@jnothman

I see now that you already had the same idea. What if we built something analogous to GridSearchCV, where we treat the ensemble selector as the estimator argument for the 2000-model builder? If GridSearchCV can handle a meta-estimator like a stacking estimator, why would it not be able to handle the ensemble selector? I'm sure there may be some caveats (like whether we should make it able to handle base estimators instead of this specific meta-estimator, or other meta-estimators), but it might be useful to at least make something to build the portfolio of models, even if it doesn't end up being part of sklearn.

@jnothman
Member

The argument was that given the limitations of scikit-learn parallelism, we would rather suggest users provide pre-fitted estimators. However, we can certainly use a ParameterGrid or a ParameterSampler to derive the estimators in an example.
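
For example, something along these lines could go in such an example. This is a sketch only: depending on the scikit-learn version, ParameterGrid is importable from sklearn.model_selection or the older sklearn.grid_search, and all other names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_hillclimb, y_train, y_hillclimb = train_test_split(X, y, random_state=0)

# Derive a small library of pre-fitted candidates from a parameter grid.
grid = ParameterGrid({"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]})
library = [
    ("tree_%d" % i,
     DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train))
    for i, params in enumerate(grid)
]
# `library` can then be handed to the ensemble selection estimator, which
# would pick and weight models against the hillclimb split.
```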

@x3n0cr4735

So you're thinking it would be better to have the first step executed with something like Hadoop, where the models are saved and then picked up by the ensemble selector. Would it be beneficial to get the ball rolling with an implementation in scikit-learn and then, from there, write a map-reduce solution? Maybe the Datastax people can take the scikit-learn version and write a Spark implementation. They seem to do a lot of that.

@x3n0cr4735

I'm sure this has been discussed among core developers, but I'm not educated on the topic. If IPython uses ZeroMQ instead of joblib, is it possible to leverage a message queue in scikit-learn? Has anyone discussed integrating functionality with pika, for instance?

@jnothman
Member

I don't really see the relevance of map-reduce over any old job scheduler. GNU Parallel would be sufficient for distributing the fitting of estimators to then be used in ensemble selection. It's not that we're thinking of any particular thing, except that if you give the user an interface where they can construct the estimators here, the user feels compelled to use that interface, not DIY... then you need to handle both the sampling and grid approaches, combinations of multiple samplers, parallelism options, etc., and it becomes a monolith quite beyond the interesting part of this algorithm. Suddenly the extra usability becomes a burden: better to give an example of how to load this thing up and let the users' imagination run wild.

@jnothman
Member

That's quite off topic, but I suspect the answer is something like: scikit-learn has no intention to support sophisticated parallelism, including message passing etc.; however, I think an IPython backend for joblib.parallel is a good way for us to consider extending parallelism to multi-machine clusters. It's not something I've discussed with the scikit-learn (or joblib) devs.

@x3n0cr4735

Yeah, it's off-topic, but still an important one. What would be involved in creating a backend for joblib.parallel based on ipython? That sounds like a huge task.

@yenchenlin
Contributor Author

Hey @jnothman , may I ask a quick question?
I don't quite get the following paragraph; would you please elaborate on clone-based metaestimators?

If we're offering estimators that take pre-fitted sub-estimators (I know there's a calibration setting that does this too), then we need to at least document the impossibility of clone-based metaestimators around them, if not programmatically forbid it.

@jnothman
Member

sklearn.base.clone is how scikit-learn duplicates estimators so that they can be fit with different data or parameters. It is used by GridSearchCV among others. It reconstructs its argument, retaining estimator constructor parameters but deleting other attributes; however it retains them by recursively cloning them. This means that if an estimator or collection of estimators is a parameter, any fitted model will be removed in the clone operation.

Thus clone(EnsembleSelectionClassifier([('a', my_fitted_estimator)])) will return an EnsembleSelectionClassifier with unfitted estimators in its ensemble, and fitting it (which is the next thing a GridSearchCV would do after cloning and setting parameters) will, I take it, fail.
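
To illustrate the behaviour being described, a minimal example of clone() discarding fitted state (assuming current scikit-learn behaviour):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
fitted = LogisticRegression().fit(X, y)
print(hasattr(fitted, "coef_"))   # True: the estimator carries its fitted state

# clone() rebuilds the estimator from its constructor parameters only,
# so everything learned in fit() is discarded.
copy = clone(fitted)
print(hasattr(copy, "coef_"))     # False: the clone is unfitted
```

Since the (name, estimator) tuples are themselves constructor parameters, cloning the meta-estimator recursively clones them and strips their fitted state in the same way.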

@yenchenlin
Contributor Author

Great, thanks for your detailed explanation @jnothman!
I got it.

@x3n0cr4735

Hey @yenchenlin1994, how are you doing with this? If you need help addressing some of the comments let me know.

@yenchenlin
Contributor Author

Thanks! I am currently modifying code.
Will let you know if I need help @x3n0cr4735
Thanks a lot 😄

@amueller
Member

This looks great. Any update?

@x3n0cr4735

Haven't heard anything back from @yenchenlin

@yenchenlin
Contributor Author

Sorry for my late reply; I'm relatively busy right now. I will ping you guys back when I have substantial progress.

@ajing

ajing commented Jun 26, 2017

Any progress on this topic?

@amueller removed this from PR phase in Andy's pets Jul 21, 2017
@realfaker

Any updates?

@adrinjalali
Member

Since the OP hasn't had time to work on this and the code is in Python 2, closing this; a fresh start might be better.
