
[MRG+1] Stacking classifier with pipelines API #8960

Closed
wants to merge 100 commits

Conversation

@caioaao (Contributor) commented May 31, 2017

See discussion on #7427 for more info.

TODO:

  • Add support for methods other than predict_proba;
  • Improve tests

Improvements over #7427

  • Works with several kinds of estimators (single-/multi-label classification, for both input and output, regressors, etc.);
  • Less entropy, as it reuses the pipeline API;
  • Just tested it in a Kaggle competition (look for Caio Oliveira);
  • Support for stacking multiple layers;
  • Blending each estimator separately makes it easier to pretrain models, cache/share blending results, etc (see one awesome application for this here);
  • More flexibility on using transformed data.

Possible improvements

Improvements from review:

  • Rewrite the docs (maybe the tests too?) to stop mixing classification and regression problems;
  • Document "simple usage";
  • Document "advanced usage";
  • Write tests for StackLayer;
  • Write examples for interesting use cases, including: efficient hyperparam optimization techniques; blending without CV; pre-training base estimators.

Relevant discussions

Why use StackingTransformer instead of implementing classes

Summary: #8960 (comment)

Other discussions:

Is having more than two layers worth it?

Factory methods vs Class inheritance

@caioaao (Author) commented May 31, 2017

Forgot to mention: huge thanks to @MLWave for the Kaggle ensembling guide and the initial implementation.

Should fix broken tests
- Make it generic (works with classifiers and regressors);
- Better naming;
Also improved testing by using a faster CV
@caioaao changed the title from "Stacking classifier with pipelines API" to "[MRG] Stacking classifier with pipelines API" on Jun 1, 2017
@jnothman (Member) left a comment


This looks quite elegant, though -- unaware of the literature in this space -- I feel like an estimator matrix rather than a single stacked layer is probably excessive. I certainly don't see why one should want the matrix to be rectangular. Specifying a single layer short-hand as a flat list should be possible.

You should add narrative documentation to doc/modules/ensemble and an example.

class BlendedEstimator(BaseEstimator, MetaEstimatorMixin, TransformerMixin):
"""Transformer to turn estimators into blended estimators

This is used for stacking models. Blending an estimator prevents data leaks
@jnothman (Member):

I think you should define what blending is. It's a simple concept, but shouldn't be presumed knowledge.
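To make the definition concrete, here is a minimal sketch of what "blending" means in this context, using scikit-learn's `cross_val_predict` (the dataset and estimator are illustrative only, not the PR's code):

```python
# Sketch: "blending" an estimator means replacing its training-set
# predictions with out-of-fold (cross-validated) predictions, so the
# next layer never sees predictions made on data the model was fit on.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=100, random_state=0)
est = LogisticRegression()

# Leaky: predictions on the same data the model was trained on.
leaky = est.fit(X, y).predict_proba(X)

# Blended: each row is predicted by a model that never saw that row.
blended = cross_val_predict(est, X, y, cv=5, method="predict_proba")
print(blended.shape)  # (100, 2)
```

The blended matrix is what gets fed to the next stacking layer in place of the raw predictions.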

self.n_jobs = n_jobs

def fit(self, *args, **kwargs):
self.base_estimator = self.base_estimator.fit(*args, **kwargs)
@jnothman (Member):

I think we should raise NotImplementedError if this is not going to behave the same as fit_transform.

def _identity_transformer():
"""Contructs a transformer that does nothing"""
return FunctionTransformer(lambda x: x)

@jnothman (Member):

PEP8: extra blank line


Parameters
----------
estimators_matrix: 2D matrix with base estimators. Each row will be
@jnothman (Member):

spaces before colons are required


Returns
-------
p: Pipeline
@jnothman (Member):

can just say Pipeline here.

for estimator in base_estimators]
if restacking:
estimators.append(_identity_transformer())
return make_union(*estimators)
@jnothman (Member):

This is not a good idea. It means that the estimators will be named BlendedEstimator1, BlendedEstimator2, etc. in get_params. You should use FeatureUnion directly to give better prefixes; and you will need a way for the user to specify names themselves, like what FeatureUnion provides.
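The naming concern can be seen directly with plain scikit-learn transformers; a small sketch (the `StandardScaler`/`MinMaxScaler` stand-ins are for illustration only):

```python
# Sketch of the naming concern: make_union auto-generates parameter
# prefixes from lowercased class names, while FeatureUnion lets the
# user name each step explicitly.
from sklearn.pipeline import FeatureUnion, make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

auto = make_union(StandardScaler(), MinMaxScaler())
# auto-generated prefix: lowercased class name
assert "standardscaler__with_mean" in auto.get_params()

named = FeatureUnion([("scale", StandardScaler()),
                      ("minmax", MinMaxScaler())])
# user-chosen prefix, much friendlier in grid searches
assert "scale__with_mean" in named.get_params()
```

With auto-generated names, every blended estimator of the same class would get prefixes like `blendedestimator-1__...`, which is what the reviewer objects to.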

@caioaao (Author):

I agree the names are horrible and should be fixed. About the second part, FeatureUnion already exists if someone wants to specify the names; I don't see a reason to create a new API for that. The only arguable point here is that, if someone wants to create a stack layer by hand using FeatureUnion, the restacking will have to be implemented by hand as well. I'll give this some thought, but I can't see a clean solution yet (other than creating a second function that accepts named estimators instead of unnamed ones).

@caioaao (Author) commented Jun 2, 2017

@jnothman I'm not too familiar with the literature either, but this shows how some big Kaggle players used a stacked model with three layers. I also didn't understand what you meant by "rectangular matrix".

I'll fix the rest of the comments. If you think having the stack factory is too much, I'm not opposed to deleting the function.

EDIT:
I think I understood the "rectangular matrix" issue. It was a bad name choice on my part, reinforced by the stack_estimators example code. estimators_matrix is actually a 2D array, not a matrix. Hopefully it's clearer now.

EDIT2:
I started working in the documentation, but I'd rather wait for us to agree on the final API before going any further.

@caioaao (Author) commented Jun 2, 2017

I have one doubt: both cross_val_predict and FeatureUnion accept an n_jobs param. Which one should I pipe the param to?

@jnothman (Member) commented Jun 10, 2018 via email

@caioaao (Author) commented Jun 11, 2018

@jnothman I'm sorry, I didn't mean modifying the original Pipeline. What I meant was to extract the common code from Pipeline to a new _BasePipeline class and inherit from it to build both Pipeline (with the same behavior) and StackingPipeline. I tried doing it today but I can't make the tests pass on my machine (even on the master branch). I'll open a PR as soon as I fix my env

@GaelVaroquaux (Member) commented Jun 11, 2018 via email

@MLWave commented Jun 11, 2018

As a data point, I do not think that the nested stacking concerns a lot of users. I am not sure that we should focus our API designs on it.

  • Nested stacking is a pretty fundamental part of the original paper, Stacked Generalization (1992), and perhaps as relevant as providing bootstrap functionality for RandomForestClassifier:

For example, we can consider the case where each level has a single learning set, and all such learning sets feed serially into the set one level above, all according to the exact same rules. (Such a structure is a multi-layer net, where each node is a learning set, there exists one node per layer, and information is fed from one node to the next via the N generalizers.) For such a scenario the successive levels act upon the learning set like successive iterations of an iterated map. Therefore the usual non-linear analysis questions apply: when does one get periodic behavior? when does one get chaotic behavior? what are the dimensions of the attractors? etc. Once answered, such questions would presumably help determine how many levels to stack such a system.

  • All stacking research (and AutoML research) eventually encounters multiple layers. Instead of writing your own code for this, a multi-layer Super Learner (van der Laan et al., 2007) would capture all of that. You can create general stacked estimators, much like multi-layer stacked auto-encoders. You can create a multi-layer net of MLPs. You can do Super Boosting, by stacking GBMs 6 layers deep. You can optimize an entire multi-layer net of estimators, to find the best combinations of accuracy and deployability.

  • Stacking to obtain SotA results (like in Kaggle competitions) has been multi-layer for years now. Though such solutions all make use of scikit-learn, the multi-layer code is all manual glue code (and poorly battle-tested).

  • Multi-layer stacking has commercial applications. Having code in Scikit-Learn, all with decent quality and tests, makes commercial application more feasible.

  • Finally, the amount of flexibility of combining the recent column transformer with a stacking transformer, means you can build complete recipes for tackling a problem (next to pipelining the transformations, it pipelines the modeling parts too). This, to me, is very attractive, and would see heavy use.

@caioaao (Author) commented Jun 11, 2018

@GaelVaroquaux IMO the discussion of choosing between this PR and #11047 boils down to this: scikit-learn's API doesn't currently support a flexible stacking implementation, so we have these choices so far:

  1. Break sklearn's contracts (fit().transform() != fit_transform());
  2. Extend sklearn's API (increasing its vocabulary by adding something like blend() and making a custom pipeline for this type of model)
  3. Restrict the stacking implementation to a simple use case that can be handled by sklearn's current API

As you can see, I'm strongly against making the stacking implementation less flexible, thus leaving two choices, one of which you seem to be strongly against too.

Also, I don't see how making the Pipeline/FeatureUnion code more reusable would be a bad thing. Even #11047 would benefit from that, as it would remove most of the code there. I may be a little pushy on this subject, but my background/professional experience is in software engineering and code quality is really important to me. Having a class that does a lot of stuff (combining blending and pipeline logic) is not good in any case, especially if you already have another class that does exactly what you want for half of the problem.

@jnothman (Member) commented Jun 11, 2018

Just to make sure it's clear, there's nothing stopping a StackingClassifier of #11047 from having multiple layers. That's just a matter of nesting:

StackingClassifier(
    [('x', XClassifier()), ('y', YClassifier()), ('z', ZClassifier())],
    StackingClassifier(
        [('x', XClassifier()), ('y', YClassifier()), ('z', ZClassifier())],
        DecisionTreeClassifier()
    )
)

compared to:

Pipeline([
    ('layer1', make_stack_layer(XClassifier(), YClassifier(), ZClassifier())),
    ('layer2', make_stack_layer(XClassifier(), YClassifier(), ZClassifier())),
    ('predict', DecisionTreeClassifier())
])

I think the former example is a bit less intuitive as an analogy to layers (for people not coming from the Lisp world!). But it does not break any conventions we might want to hold on to. I definitely think that if we go down the StackingClassifier path, an example illustrating usage with multiple layers is necessary.

@jorisvandenbossche, I think get_params/set_params names get lengthy in any case. A tool like searchgrid, which makes search params an attribute of the object, stops that being an issue, though (as long as it handles non-grid-search cases, which it doesn't atm).

@glemaitre glemaitre modified the milestones: 0.20, 0.21 Jun 13, 2018
@caioaao (Author) commented Jun 14, 2018

@jnothman this is the refactor I wanted to do on Pipeline: caioaao@92ec504
It's a bit hacky but I thought it'd be the least intrusive way of making pipeline behavior extendable.

This way we can just override the properties that return functions used for caching and I'll be able to create a stacking-specific pipeline

@jnothman (Member) commented Jun 14, 2018 via email

@caioaao (Author) commented Jun 25, 2018

@jnothman I think I wasn't clear about my approach. I started a PoC that you can check here:
caioaao/scikit-learn@0b58da2...5b6116b
Basically what I'm doing is extending Pipeline to use another method for blending. This way the API stays close to the Pipeline API and also reuses a lot of its code. It's a less hacky way of doing stacks with multiple layers.

@caioaao (Author) commented Jun 25, 2018

I just realized I moved a lot of code around and it may not be easy to see where I'm going with this implementation, so here it is.

The idea is that we should be able to access and manipulate the stack layers. There are some reasons for this:

  • As discussed, we want to be able to create multi-layer models;
  • There are several approaches to blending the outputs of each layer, one of which is using k-fold predictions (implemented here with CV). The choice of blending type usually involves trade-offs between training performance, data leakage, and data usage in subsequent layers. Marios Michailidis (@kaz-Anova) does a great job of explaining the different strategies in this lecture;
  • Ease of distribution. Training/tuning a stacked ensemble is a CPU-intensive process, and being able to distribute the load is valuable (even more so when you think about the dynamics of teams in competitive data science competitions).

What I'm aiming for in this patch is a simple API for the base case (the one implemented in #11047) that hides the new blend verb from the average user, while still allowing experienced users to access the flexibility of the framework.

The base case would be like this:

from sklearn.ensemble import make_stacked_ensemble
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR


layer0 = [LinearRegression(), LinearSVR(random_state=RANDOM_SEED)]
layer1 = [LinearRegression(), LinearSVR(random_state=RANDOM_SEED)]
final_estimator = LinearRegression()

# this will return a StackingPipeline
final_clf = make_stacked_ensemble(layer0, layer1, final_estimator)

final_clf.fit(Xtrain, ytrain)
ypreds = final_clf.predict(Xtest)

And for accessing the internal layers:

from sklearn.ensemble import StackingPipeline, make_stack_layer
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR


layer0 = make_stack_layer(LinearRegression(), LinearSVR(random_state=RANDOM_SEED))
layer1 = make_stack_layer(LinearRegression(), LinearSVR(random_state=RANDOM_SEED))
final_estimator = LinearRegression()

final_clf = StackingPipeline([layer0, layer1, final_estimator])

final_clf.fit(Xtrain, ytrain)
ypreds = final_clf.predict(Xtest)

Also, this implementation would solve the problem with fit().transform() being different from fit_transform():

from sklearn.ensemble import StackableTransformer, StackingLayer
from sklearn.linear_model import LinearRegression


stackable_t = StackableTransformer(LinearRegression())
# stackable_t.fit().transform() === stackable_t.fit_transform()

layer = StackingLayer([stackable_t])
# layer.fit().transform() === layer.fit_transform()
# the stackable transformer passed to StackingLayer must implement `blend`

stackable_t.blend() #=> CV prediction
layer.blend() #=> CV predictions

This way we can just swap StackableTransformer (which I'll rename to CVStackableTransformer) with another transformer that implements blend to use another method of ensembling (train-test split, for instance).
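To illustrate the swap, here is a rough sketch of what a holdout-based blendable transformer could look like. The class name, `blend` signature, and holdout strategy are my assumptions for illustration; they do not match the PR's actual code:

```python
# Hypothetical sketch: a transformer exposing a `blend` method that uses
# a single holdout split instead of cross-validation. Any transformer
# implementing `blend` could be dropped into a stack layer in its place.
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.model_selection import train_test_split


class HoldoutStackableTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, estimator, test_size=0.5, random_state=None):
        self.estimator = estimator
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # Fit on all data; used for predictions on genuinely new data.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        return self.estimator_.predict(X).reshape(-1, 1)

    def blend(self, X, y):
        """Predictions for the holdout half, from a model fit on the rest."""
        X_fit, X_hold, y_fit, _ = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        model = clone(self.estimator).fit(X_fit, y_fit)
        return model.predict(X_hold).reshape(-1, 1)
```

A CV-based variant would implement `blend` with `cross_val_predict` instead; the stacking pipeline only needs the `blend` contract to hold.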

TLDR:

  • Has a sane and useful simple use case with a simple API;
  • Exposes an API that supports most of the more advanced use cases;
  • Relies on a good chunk of the Pipeline/FeatureUnion code, making the code easier to maintain.

@caioaao (Author) commented Jul 5, 2018

Since this PR has been dragging on for a long time and we have some conflicting opinions, I'm thinking about doing this in a separate library. I'd like to merge my pipeline refactor here so I can reuse the code from sklearn; is that OK with you? If not, I can just copy the pipeline code and modify it, but I don't know if that's OK under sklearn's license.

@glemaitre (Member) commented Jul 5, 2018 via email

@caioaao (Author) commented Jul 5, 2018

@glemaitre Anyway, I think it's best to create another library for this, as it'll give me more freedom to do what I want. We can discuss it again and, if doable, integrate the library back into scikit-learn in the future.

@jnothman (Member) commented Jul 6, 2018 via email

@caioaao (Author) commented Jul 10, 2018

Great! Then all I need at the moment is for this PR to be merged: #11446. It's quite simple and it'll help a lot with reusing the pipeline API.

@caioaao (Author) commented Jul 27, 2018

I just released a beta version of the stacking framework based on this PR. If you want to check it out, here's the link: http://github.com/caioaao/wolpert

I'm closing this PR for now and will focus on evolving the library instead. Thank you all for your time and comments :)

@caioaao closed this Jul 27, 2018
@jnothman (Member) commented Jul 29, 2018 via email

@ngoix (Contributor) commented Oct 9, 2018

Just one comment about fit_transform != fit.transform:

I believe we can circumvent this problem the same way we did for LocalOutlierFactor, with fit().predict() vs. fit_predict (see #10700). Two different behaviours are supported depending on how the class is initialised (the transductive behaviour disables predict, while the non-transductive one disables fit_predict).

In other words, StackableTransformer could override TransformerMixin.fit_transform the same way LocalOutlierFactor overrides OutlierMixin.fit_predict.

Then, make_stack_layer could just use StackingTransformer initialised in its transductive behaviour.
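A minimal sketch of this idea (class name, flag name, and behavior are my assumptions, mirroring LocalOutlierFactor's novelty flag rather than any actual PR code):

```python
# Hypothetical sketch: a transductive mode in which fit_transform
# returns out-of-fold predictions and plain transform is disabled,
# so fit().transform() can never silently diverge from fit_transform().
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.model_selection import cross_val_predict


class SketchStackableTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, estimator, transductive=False, cv=3):
        self.estimator = estimator
        self.transductive = transductive
        self.cv = cv

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        if self.transductive:
            # mirrors LocalOutlierFactor disabling predict unless novelty=True
            raise AttributeError(
                "transform is not available in transductive mode")
        return self.estimator_.predict(X).reshape(-1, 1)

    def fit_transform(self, X, y=None):
        if self.transductive:
            # out-of-fold predictions, never predictions on training folds
            return cross_val_predict(
                self.estimator, X, y, cv=self.cv).reshape(-1, 1)
        return self.fit(X, y).transform(X)
```

A stack-layer factory could then always construct the transformer in transductive mode, keeping the scikit-learn contract intact for users who never touch the flag.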

@jnothman (Member) commented Oct 9, 2018 via email

@jnothman (Member) commented Oct 9, 2018 via email

@caioaao (Author) commented Oct 10, 2018

I agree with @jnothman here, but I feel using fit_resample would be hacky. The training predictions have different semantics here, and it makes sense for them to get their own verb.

@jnothman (Member) commented Oct 10, 2018 via email

@caioaao (Author) commented Oct 10, 2018

Yeah, I agreed with you about it not being the same situation as LocalOutlierFactor. My argument about fit_resample is that it does more than just resample the data: it also transforms it, and potentially doesn't resample it at all, so it'd be misleading to use resample for this as well.

@GaelVaroquaux (Member) commented Oct 11, 2018 via email
