
[MRG] FEA: Stacking estimator for classification and regression #11047

Merged
merged 127 commits into scikit-learn:master from glemaitre:is/4816 on Sep 18, 2019

Conversation

@glemaitre (Contributor) commented May 1, 2018

Reference Issues/PRs

closes #4816
closes #8960
closes #7427
closes #6674

requires #14305 to be merged.

What does this implement/fix? Explain your changes.

Implements two meta-estimators to perform stacking: one for classification and one for regression.
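
A minimal usage sketch of the two new estimators, StackingClassifier (shown below) and its StackingRegressor counterpart; the dataset and base estimators are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The cross-validated predictions of the base estimators become the
# training features of the final estimator.
clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", LinearSVC(random_state=42)),
    ],
    final_estimator=LogisticRegression(),
)
print(clf.fit(X_train, y_train).score(X_test, y_test))
```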

Any other comments?

@glemaitre (Contributor, Author) commented May 1, 2018

@jnothman @amueller @GaelVaroquaux This is what I had in mind.

@glemaitre (Contributor, Author) commented May 1, 2018

self.estimators = estimators
self.meta_estimator = meta_estimator
self.cv = cv
self.method_estimators = method_estimators

@chkoar (Contributor) commented May 6, 2018

I believe that method_estimators is not an obvious name. output_confidence, metafeatures, or something like that could be more targeted. In the case of classification, if output_confidence is True then the level 1 estimator should be trained on the (cross-validated or not) outputs of predict_proba or decision_function of the level 0 estimators; otherwise the level 1 estimator should be trained on the prediction outputs. Another option could be a parameter like metafeatures with several options like:

  • predictions
  • confidences
  • cv_predictions
  • cv_confidences

Of course we could use both metafeatures and the original space to train the level 1 estimator.

As for the confidence outputs, an option could be to aggregate the outputs using a descriptive statistic like the mean or median, so that the dimensionality of the metafeatures would be the same as the number of classes.

@glemaitre (Author, Contributor) commented May 6, 2018

What is the use case in which we would want to avoid the CV when fitting the final estimator?
method_estimators is the closest keyword that I found to be consistent with the method keyword in other estimators (in grid-search, I think).

@jnothman (Member) commented Mar 11, 2019

How about predict_method?
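
For reference, the parameter was eventually merged under the name stack_method; it selects which prediction method of each base estimator produces the meta-features. A minimal sketch with illustrative estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=42)

# Train the final estimator on decision_function outputs instead of
# predict_proba; both base estimators expose decision_function.
stack = StackingClassifier(
    estimators=[("svc", LinearSVC(random_state=42)), ("lr", LogisticRegression())],
    stack_method="decision_function",
).fit(X, y)
```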

@chkoar (Contributor) commented May 6, 2018

It is very nice that you started this @glemaitre. If you need any help don't hesitate to ping me.

glemaitre changed the title from "[WIP] Stacking estimator for classification and regression" to "[MRG] Stacking estimator for classification and regression" on May 6, 2018
@glemaitre (Contributor, Author) commented May 6, 2018

> It is very nice that you started this @glemaitre. If you need any help don't hesitate to ping me.

Actually, there were some good PRs which allowed us to fix some issues. I would appreciate it if you could make a review :)

@glemaitre (Contributor, Author) commented May 6, 2018

@amueller I drafted the documentation as well as an example. I hope that it will motivate you to give the review a go ;)

NB: we should improve the example once the ColumnTransformer is merged. There is a major use case in which we apply different classifiers to different columns and use those as the estimators. This is actually the use case that I am most used to.

@glemaitre (Contributor, Author) commented May 7, 2018

Artifacts for the documentation:

I see that there are some glitches to be corrected.

glemaitre and others added 2 commits May 7, 2018
@TomDLT (Member) left a comment

I directly pushed minor improvements in the example.

Your implementation does not allow passing X unchanged to the final estimator, does it?

@glemaitre (Contributor, Author) commented May 23, 2018

> Your implementation does not allow passing X unchanged to the final estimator, does it?

It does not. What do you propose? I see two solutions:

  • Create a keyword "passthrough=True/False" and stack X accordingly
  • Create a strategy "input" for the DummyRegressor which would return the input and could be added to the estimators list

I think the first one is easier for a user to discover. WDYT?
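
For reference, the first option is essentially what was eventually merged as the passthrough parameter; a minimal sketch of that behaviour, with illustrative estimators:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(random_state=0)

# With passthrough=True the final estimator is trained on the base
# estimators' predictions stacked together with the original features X.
stack = StackingRegressor(
    estimators=[("ridge", Ridge()), ("tree", DecisionTreeRegressor(random_state=0))],
    passthrough=True,
).fit(X, y)
```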

@TomDLT (Member) commented May 23, 2018

> I think the first one is easier for a user to discover.

I agree.

@TomDLT (Member) commented May 23, 2018

Don't forget the .. versionadded:: 0.20 label in both new classes' docstrings.
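
A minimal sketch of how the directive sits in a class docstring (the 0.20 target mentioned here slipped; the feature ultimately shipped in 0.22):

```python
class StackingClassifier:
    """Stack of estimators with a final classifier.

    .. versionadded:: 0.22
    """
```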

@rth (Member) commented May 23, 2018

> Your implementation does not allow passing X unchanged to the final estimator, does it?

I'm not using stacking extensively, but IMHO this PR is complex enough; since passing X to the final estimator may not be the primary reason people would use stacking, maybe this feature can be proposed in a separate PR?

@glemaitre (Contributor, Author) commented Sep 18, 2019

I am not convinced that not dropping is a good default.

@glemaitre (Contributor, Author) commented Sep 18, 2019

I added the drop parameter.

@glemaitre (Contributor, Author) commented Sep 18, 2019

I still have doubts regarding the dropping. We are discussing which estimators would or would not benefit from it. I am still convinced that dropping would be best because, apart from trees which might use this info (and I am not convinced about that, as mentioned in #11047 (comment)), it is really likely that users have linear and non-linear models in estimators and therefore dropping will be better than not dropping (this is really wrong)

@qinhanmin2014 (Member) commented Sep 18, 2019

I still prefer not to drop the first column by default, but since we've introduced the drop parameter and the default final_estimator is a linear model, I think that's acceptable.

@qinhanmin2014 (Member) commented Sep 18, 2019

Actually, apart from tree-based models, I'm not sure whether collinear features are useful for regularized linear models (they're certainly not useful for unregularized linear models).
But I tend to believe that the influence is small, which is why I approve this PR.

@jnothman (Member) left a comment

I'm really not sure about the importance of this dropping business. Without evidence to the contrary, I'd say YAGNI: dropping one of two columns in binary classification is obviously justified, for any final estimator, but otherwise it's unclear to me that we are providing the user with a useful parameter.

@glemaitre writes that "it is really likely that users have linear and non-linear models in estimators and therefore dropping will be better than not dropping". I don't get why having a particular type of estimator in estimators matters. Isn't the dropping all about what the final estimator does?

@glemaitre (Contributor, Author) commented Sep 18, 2019

> Isn't the dropping all about what the final estimator does?

Completely true. Wrong reasoning on my side.

@jnothman (Member) commented Sep 18, 2019

@ogrisel (Member) commented Sep 18, 2019

+1 for always dropping for binary classifiers and never dropping for multiclass to keep things simple. The other cases are YAGNI in my opinion. No need for an option.
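
A minimal numpy sketch of why dropping one of the two columns is always safe for binary classifiers: the two predict_proba columns sum to one, so one of them carries no extra information for the final estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
proba = LogisticRegression().fit(X, y).predict_proba(X)

# The two columns are perfectly collinear: they sum to one row-wise,
# so keeping only the second column loses no information.
assert np.allclose(proba.sum(axis=1), 1.0)
meta_features = proba[:, 1:]
```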

@jnothman (Member) commented Sep 18, 2019

@qinhanmin2014 (Member) commented Sep 18, 2019

> +1 for always dropping for binary classifiers and never dropping for multiclass to keep things simple. The other cases are YAGNI in my opinion. No need for an option.

I think this is a better solution. So there's +3, maybe enough?

###############################################################################
# Stack of predictors on a single data set
###############################################################################
# It is sometimes tedious to find the model which will best perform on a given

@qinhanmin2014 (Member) commented Sep 18, 2019

Also, I somehow think the logic here is strange. I'm still not sure whether it's good to throw "bad" estimators at a stacker. It's true that a stacker can somehow filter out "bad" estimators (e.g., by giving them lower weight), but I think these "bad" estimators will still influence the performance of the stacker.
I think one should use cross-validation to check the performance of all the estimators and give some "good" estimators to the stacker to produce a "better" estimator.
But I'm not sure, and I think this example is acceptable.

@ogrisel (Member) commented Sep 18, 2019

I agree. Stacking is mostly useful to slightly improve accuracy by combining the strengths of good models (hopefully diverse enough that their errors are not too correlated).

Throwing in bad models is likely to negatively impact the performance of the ensemble (and it's computationally wasteful).

@glemaitre (Author, Contributor) commented Sep 18, 2019

OK, so I replaced AdaBoostRegressor and KNeighborsRegressor with HistGradientBoostingRegressor and RandomForestRegressor to keep only strong learners, and changed the conclusion slightly. I will add a bit of timing as well to show that training a stacked learner is computationally expensive.
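
A minimal sketch in the spirit of the revised example, assuming a scikit-learn version where HistGradientBoostingRegressor no longer needs the experimental import; the dataset and final estimator here are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    HistGradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, noise=10, random_state=42)

# Stack two strong learners; their out-of-fold predictions are the
# training features for the RidgeCV final estimator.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(random_state=42)),
        ("hgb", HistGradientBoostingRegressor(random_state=42)),
    ],
    final_estimator=RidgeCV(),
)
print(cross_val_score(stack, X, y, cv=3).mean())
```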

@glemaitre (Contributor, Author) commented Sep 18, 2019

Works for me

glemaitre force-pushed the glemaitre:is/4816 branch from de5316c to 8599450 on Sep 18, 2019
glemaitre added 5 commits Sep 18, 2019
@jnothman (Member) commented Sep 18, 2019

Let's do this... Thanks to @caioaao and @glemaitre both for great conception of these competing APIs and for persistence in finding a solution we all can accept!!

jnothman merged commit bab5926 into scikit-learn:master on Sep 18, 2019
18 checks passed

LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python 1 new alert
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc artifact Link to 0/doc/_changed.html
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
codecov/patch 97.85% of diff hit (target 96.72%)
codecov/project 96.72% (+<.01%) compared to e5be3ad
scikit-learn.scikit-learn Build #20190918.46 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) succeeded
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) succeeded
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl) succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) succeeded
glemaitre moved this from WAITING FOR REVIEW to TO BE MERGED in Guillaume's pet on Sep 18, 2019
glemaitre moved this from TO BE MERGED to MERGED in Guillaume's pet on Sep 18, 2019
@GaelVaroquaux (Member) commented Oct 3, 2019

@wderose commented Nov 23, 2019

@caioaao @glemaitre: Thank you for this awesome feature. Should the stack_method="auto" priority match that of CalibratedClassifierCV?

decision_function -> proba -> predict

Instead of

proba -> decision_function -> predict
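
For context, a minimal sketch of what stack_method="auto" resolves to per base estimator after fitting, with illustrative estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("svc", LinearSVC(random_state=0))]
).fit(X, y)

# "auto" prefers predict_proba, then decision_function, then predict, so
# LogisticRegression uses probabilities and LinearSVC its decision values.
print(stack.stack_method_)  # ['predict_proba', 'decision_function']
```
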
@jnothman (Member) commented Nov 23, 2019

@wderose commented Nov 24, 2019

Raised the predict-function alignment issue in #15711.
