-
-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add ensemble selection algorithm #6540
Conversation
Awesome, thanks. Will take a look today. |
Hello @x3n0cr4735 , NOTE: you should download adult dataset and substitute this line with adult data path on your computer. Also, you may need |
Ok, sounds good. |
These estimators must be fitted. | ||
|
||
max_bag_estimators : integer, optional (default=50) | ||
The maximum number of estimators at each bag ensemble selection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a relative pronoun missing here. Rewrite for clarity.
Hey @yenchenlin1994, I ran the demo and it worked. Looked at your EnsembleSelectionClassifier class and have a couple comments/questions:
|
P.S. It would be nice to have some notes at the start of each method, like what you did for the class. I'm not sure if this is standard for sklearn. If not, ignore my suggestion. |
If we're offering estimators that take pre-fitted sub-estimators (I know there's a calibration setting that does this too), then we need to at least document the impossibility of |
---------- | ||
estimators : list of (string, estimator) tuples | ||
The estimators from which the ensemble selection classifier is built. | ||
These estimators must be fitted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain that the strings are arbitrary, unique names. I'm not sure they're well motivated, actually, if we're only dealing with fitted estimators: the model can refer to estimators by index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more consistent to VotingClassifiet
, though input to VotingClassifiet
is unfitted estimators.
Another benefit is that when user want to see the result of ensemble selection, we can output the string corresponding to the estimator and it's weight, which is more clear.
Please work in this order:
We'll deal with further API issues along the way. |
@x3n0cr4735 @jnothman sorry for my poor PR 😢 And super super super thanks for the review, I will modify all these right away. |
Who said it was poor? |
Alternatively: who said making a perfect PR was easy? It takes practice and feedback to make a good PR, which is why we encourage people to start with something small, submit WIPs early, and not take on hundreds of issues at once. |
And the whole point of getting core developers, and not some arbitrary person, to review is that we're familiar with what is idiomatic for this project, what is consistent, and what corner cases need to be handled for integration and usability that you may not have thought of by just translating a paper into code. So until you are equally familiar with those things, you should anticipate that we will identify issues that you had not foreseen. |
Hey @yenchenlin1994, I agree with @jnothman that you should not feel bad about the PR. I haven't contributed much to open source but I'm guessing it's like the peer reviewed publication process, which I'm familiar with. It's rough and sometimes seems arbitrary but will be helpful for producing the best output. @jnothman could you please comment on points 2 and 3 in my previous post? |
(2) was somewhat decided against in #6329. |
I see now that you already had the same idea. What if we build something analogous to GridSearchCV where we treat the ensemble selector as the estimator argument for the 2000 model builder. If GridSearchCV can handle a meta-estimator like a stacking estimator why would it not be able to handle the ensemble selector? I'm sure there may be some caveats (like whether we should make it be able to handle base estimators instead of this specific meta-estimator, or other meta-estimators) but it might be useful to at least make something to build the portfolio of models. Even if it doesn't end up being part of sklearn. |
The argument was that given the limitations of scikit-learn parallelism, we On 17 March 2016 at 00:57, Jan-Samuel Wagner notifications@github.com
|
So you're thinking it would be better to have the first step executed with something like Hadoop where the models are saved and then picked up by the ensemble selector. Would it be beneficial to get the ball rolling with an implementation in scikit-learn and then from there writing a map-reduce solution? Maybe the datastax people can take the scikit-learn version and write a Spark implementation. They seem to do a lot of that. |
I'm sure this has been discussed among core developers but I'm not educated on the topic. If ipython uses zeromq instead of joblib is it possible to leverage mq in scikit-learn? Has anyone discussed integrating functionality with pika for instance? |
I don't really see the relevance of map-reduce over any old job scheduler, On 17 March 2016 at 01:16, Jan-Samuel Wagner notifications@github.com
|
That's quite off topic, but I suspect the answer is something like: On 17 March 2016 at 01:27, Jan-Samuel Wagner notifications@github.com
|
Yeah, it's off-topic, but still an important one. What would be involved in creating a backend for joblib.parallel based on ipython? That sounds like a huge task. |
Hey @jnothman , may I ask a quick question?
|
Thus |
Great thanks for your detailed explanation @jnothman ! |
Hey @yenchenlin1994, how are you doing with this? If you need help addressing some of the comments let me know. |
Thanks! I am currently modifying code. |
This looks great. Any update? |
Haven't heard anything back from @yenchenlin |
Sorry for my late reply, I'm relatively busy right now, will ping back you guys when I have substantial progress. |
Any progress on this topic? |
any updates? |
Since OP hasn't had time to work on this and the code is in python2, closing as a fresh start might be better. |
This PR is an implementation of ensemble selection which is discussed in #6329 ,
however, the API of this module is not yet determined.
Check List:
Reference:
Ensemble Selection from Libraries of Models R. Caruana et.al. ICML 2004