[RFC] Allow transformers on y #4552

Closed · wants to merge 2 commits into base: master
@amueller (Member) commented Apr 8, 2015

Shot at #4143.

Adds a new function to the interface, with the uninspired name pipe,
with signature

def pipe(self, X=None, y=None):
    ...
    return Xt, yt

example:

from sklearn.preprocessing.label import _LabelTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression


class TargetScaler(StandardScaler, _LabelTransformer):
    pass

X, y = make_regression()
Xt, yt = TargetScaler().fit_pipe(X, y)
print(y.mean())
print(yt.mean())

15.5246170559
-1.11022302463e-18

Another (unnecessary) example:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]
pipe = make_pipeline(LabelEncoder(), DecisionTreeClassifier()).fit(X, y)
print(pipe.score(X, y))
print(pipe.predict(X))

1.0
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

(the labels got mapped to numbers again)

@amueller (Member) commented Apr 8, 2015

This doesn't seem very useful to me at the moment. I think I'd like to add the "resampling". I don't care if FeatureUnion breaks, I think that is on the user.
What I do care about is a meaningful distinction between training and prediction behavior.

Are there any other good use-cases for transforming y?

@ogrisel (Member) commented Apr 11, 2015

Why not make transform return (X_transformed, y_transformed) whenever y is not None, instead of introducing a new method? I am a bit worried about introducing a bunch of new methods. On the other hand, this approach is more explicit.

@ogrisel (Member) commented Apr 11, 2015

I should have read the description of the linked issue before commenting...

@amueller (Member) commented Apr 19, 2015

Pinging #3855, which discusses changing n_samples in a pipeline, for visibility.

@amueller amueller changed the title from [WIP] Allow transformers on y to [RFC] Allow transformers on y May 1, 2015

@amueller (Member) commented Jun 11, 2015

@GaelVaroquaux after thinking about it a bit more, I think we shouldn't deprecate transform. In 90% of use-cases, you don't want to return y. We had a hard time coming up with any use-cases apart from data loading and resampling.

I can add the resampling as a use-case here, but otherwise I think this is actually good to go.
I have no strong feelings about the name. How about transform_Xy?

@amueller amueller changed the title from [RFC] Allow transformers on y to [MRG] Allow transformers on y Jun 11, 2015

Parameters
----------
X : iterable
    Data to inverse transform. Must fulfill output requirements of the

@jnothman (Member) Jun 12, 2015

"Data to inverse transform"?

@amueller (Member) Jun 12, 2015

Damn, the docstrings are not finished yet, it seems. I'll fix them now. Also add some narrative.

@jnothman (Member) commented Jun 12, 2015

  • pipe needs to have capacity for altering other args (e.g. sample_weight).
  • What case is the method pipe for, wherein the target variable is observed, but we're not fitting the model, in a pipeline context?
  • this doesn't help the case where the number of samples should change (although sample_weight support would allow a subset of such changes)
  • it needs to be clear that the pipeline will be scored in terms of the initial y space, although I'm not certain that this is always what's desired.

I think we need a clear set of use-cases (described in terms of what happens at train time, what happens at test), preferably supported by research papers where these techniques are used, to design a careful solution to generalised transform operations. With only fairly trivial use-cases to motivate this PR, I don't think this is it, sadly.

@amueller (Member) commented Jun 12, 2015

  1. I think adding other args would be simplest in the context of sample_props. It could for the moment return X, y, fit_params.
  2. You mean when do you need to do pipe(X, y)? For scoring, I think. Otherwise not sure.
  3. Why not? [at least without sample weights]
  4. I was operating under the assumption that scoring is done on the transformed y, with the use-case of preprocessing, loading or generating y in mind. What would be a case where you want the original space?

Why is subsampling a trivial use-case? It is a somewhat trivial operation that we can't currently do.

@amueller (Member) commented Jun 12, 2015

I agree, though, that the examples I gave at the top are not very inspiring. I should have added the subsampling from the beginning, and maybe a loading example.

@jnothman (Member) commented Jun 13, 2015

Perhaps I misunderstood and this is more widely applicable than I thought. But unless I'm much mistaken, it's not finished.

How do I use a custom scorer on a Pipeline containing one of these? Currently the scorers call predict(X) or decision_function(X), etc. They would need to call the not-yet-extant predict_pipe(X, y) etc. (Maybe the fact that this is only applicable to Pipelines is a reason that custom scorers should be going through an estimator's score function, not be external to it.)

What is correct behaviour when a Piper appears in a FeatureUnion?

@@ -272,6 +278,22 @@ def inverse_transform(self, X):
return Xt
@if_delegate_has_method(delegate='_final_estimator')

@jnothman (Member) Jun 13, 2015

No, it's applicable if any of the estimators supports pipe and the last supports transform or pipe.

@AlexisMignon (Contributor) commented Jun 15, 2015

Has adding a specific "TargetTransformer" mixin been considered? It would require transformers to implement "inverse_transform" so that pipelines can apply the transformation back when "predict" is called. It would not be so hard to modify the pipelines to detect this mixin and apply the specific transforms needed. I see a few reasons to add this new mixin:

  1. Within a pipeline, targets should be treated differently from 'X' data, since when "predict" is called on the pipeline the inverse transformations need to be applied to the prediction. Having a specific mixin makes it easy to detect such transformers and apply ad hoc treatment.

  2. Everything else is kept unchanged. Only the pipeline has to be changed, and specific transformers can easily be derived from standard ones by subclassing and adding the TargetTransformer mixin.

Another possibility would be to add extra information when building the pipeline:

p = Pipline([("scaleX", StandardScaler()), ("scaleY", StandardScaler(), "y"),
                   ("the_model", MyModel())])

or even

p = Pipline([("scaleXY", StandardScaler(), "Xy"),("the_model", MyModel())])

In this case there is no need to change the API; only the pipeline needs to be changed. Again, "inverse_transform" needs to be implemented on transformers applied to targets.

What do you think? Or is the design already chosen?

@amueller (Member) commented Jun 15, 2015

@AlexisMignon The design is definitely not chosen yet. I think your use-case might be better handled with meta-estimators, though.
This PR is mostly to support subsampling and creating labels on the fly.
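For illustration, a minimal sketch of what such a target-transforming meta-estimator could look like (the class name and all details here are hypothetical, not existing scikit-learn API):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class TargetTransformingRegressor(BaseEstimator, RegressorMixin):
    # Hypothetical meta-estimator: fit the regressor on a transformed y,
    # then map predictions back to the original y space.
    def __init__(self, regressor, transformer):
        self.regressor = regressor
        self.transformer = transformer

    def fit(self, X, y):
        self.transformer_ = clone(self.transformer)
        # most transformers expect 2D input, so reshape y to a column
        yt = self.transformer_.fit_transform(np.asarray(y).reshape(-1, 1))
        self.regressor_ = clone(self.regressor).fit(X, yt.ravel())
        return self

    def predict(self, X):
        # map predictions back through the same fitted transformer
        pred = self.regressor_.predict(X).reshape(-1, 1)
        return self.transformer_.inverse_transform(pred).ravel()

Because the same fitted transformer is used in both directions, the transformation and its inverse are guaranteed to match.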

@amueller (Member) commented Jun 15, 2015

@jnothman thank you for your comments, I'll work on them. I have to think about what happens with a custom scorer.

@amueller (Member) commented Jun 22, 2015

Meditating on this a bit longer, I'm not sure whether subsampling isn't better done with a meta-estimator.

@amueller (Member) commented Oct 22, 2015

@ogrisel @GaelVaroquaux @agramfort @arjoly opening again in light of our discussion.

@amueller amueller reopened this Oct 22, 2015

@jnothman (Member) commented Oct 22, 2015

That sounds ominous.


@amueller (Member) commented Oct 22, 2015

@jnothman it's too bad you are not here. We are having a very animated discussion (mostly me and Gael). To give you a very short summary of what we discussed:

  • This is the way forward (there are many use-cases, some of which I agree with ;).
  • It is really important for pipelines.
  • The way it will work is adding a new type of object that has fit_pipe and transform_pipe (names up for discussion) and no other methods. This will make sure that something different can happen during fit and during transform in a consistent way (see the sketch after this list).
  • There will probably not be a fit method: for undersampling, say, only fit_pipe will do the undersampling, so calling fit on its own is pretty much useless.
  • We could add these methods to the existing transformers, but we will not for the moment (because of the explosion of methods).
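As a rough illustration of what such an object could look like (the method names follow the summary above; the undersampling logic itself is a hypothetical example, not code from this PR):

import numpy as np


class RandomUndersampler:
    # Hypothetical 'pipe' object: changes (X, y) during fit_pipe,
    # passes data through untouched during transform_pipe.
    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_pipe(self, X, y):
        # at training time, keep an equal number of samples per class
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_keep = counts.min()
        keep = np.concatenate([
            rng.permutation(np.where(y == c)[0])[:n_keep] for c in classes])
        return X[keep], y[keep]

    def transform_pipe(self, X, y=None):
        # at prediction/scoring time, the data passes through unchanged
        return X, y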

Also, there will now probably be "scikit-learn advancement proposals" (slap) in the spirit of PEPs that have user stories for API changes. @GaelVaroquaux will open a new repo which will store RST files in a minute.

@amueller (Member) commented Oct 22, 2015

This is actually a very minimal surgery, as it only has some changes to the pipeline that are relatively easy to understand.

Why do Scorers need to know about the new method? They don't know about transformers, right?

@jnothman (Member) commented Oct 22, 2015

Oy will I give you a slap.

;)

I'm a bit full up this week, and not able to think about the proposal in detail, but hope to at some later point.


@amueller (Member) commented Oct 22, 2015

cool, that would be great :)

@arjoly arjoly changed the title from [MRG] Allow transformers on y to [RFC] Allow transformers on y Oct 22, 2015

@versatran01 commented Feb 1, 2016

Guys, any updates on this issue?

I was trying to use Pipeline for an object detection application: my X is a color image and y is a bunch of binary images that label which pixel is which. Not being able to transform y is very inconvenient for me; I ended up having to subclass Pipeline and return (Xt, yt) from each transform function.

@EelcoHoogendoorn commented Feb 14, 2016

I'd like to second @versatran01. I am also working on object detection. The logical approach is to convert images and their annotations to feature vectors of sliding windows and labels. Of course such processing can be done outside of sklearn, but there are many hyperparameters involved in these transformations, which impact the y-vector as well; and since we would like to perform a grid search over them, we would want these transforms to be part of the pipeline.

Subclassing Pipeline to pass through y seems to be the preferred solution; but is there a good reason not to include this behavior in the general pipeline class? That is, allow transforms to return an (X, y) tuple, and if a modified y is returned, propagate that through the pipeline? (A rough sketch of this idea follows.)
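A minimal sketch of that idea (hypothetical, not scikit-learn API; cloning, parameter handling, and validation omitted):

class XyPipeline:
    # Hypothetical pipeline whose intermediate steps may return (Xt, yt)
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for name, step in self.steps[:-1]:
            result = step.fit_transform(X, y)
            if isinstance(result, tuple):  # the step also transformed y
                X, y = result
            else:
                X = result
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for name, step in self.steps[:-1]:
            result = step.transform(X)
            # at predict time there is no y, so keep only the X part
            X = result[0] if isinstance(result, tuple) else result
        return self.steps[-1][1].predict(X)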

@amueller (Member) commented Oct 8, 2016

@EelcoHoogendoorn yeah, because it breaks the current API and therefore possibly existing user code. There is no transformer in scikit-learn that changes y, so that addition is very unnatural and totally breaks the API contract.

The object detection case is certainly valid, but I think our pipeline is not very well suited for that. Feel free to do a PR to scikit-learn contrib with your transformers that change y and a pipeline that can work with that. There's actually already one in imblearn: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py

@EelcoHoogendoorn @versatran01 do your transformers change y during fit_transform, during transform or both?
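For reference, usage of the imblearn pipeline linked above looks roughly like this (a sketch only; exact class locations may differ between imbalanced-learn versions):

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# the sampler resamples (X, y) during fit only;
# at predict/score time the data passes through unchanged
pipe = Pipeline([("undersample", RandomUnderSampler(random_state=0)),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)
print(pipe.score(X, y))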

@versatran01 commented Oct 8, 2016

@amueller Thanks for the pointer to imbalanced-learn.

I have a class ImagePipeline that subclasses Pipeline, and I modified both fit_transform and transform so that as long as y is not None, they apply the transformers to y (whether a given transformer actually changes y depends on its implementation).

I agree with you that the current API is not well suited for this particular case and should not be changed for compatibility reasons. And I'm happy with my current solution. I'd like to see scikit-learn have better support for image-related learning tasks.

@sv3ndk commented Oct 18, 2016

Hi all,

Based on your comments above, I coded two very small Pipeline sub-classes that let a transformer behave as a predictor and vice-versa. This is probably very similar to the ImagePipeline that @versatran01 mentioned.

This seems to do the trick for including a post-processor as part of a cross-validated pipeline, and might even allow coding stacked ensembles as scikit-learn pipelines (not tested yet).

As far as I can tell many people are interested in this subject, so even if my solution is neither groundbreaking nor new, I blogged about it. I assume it's OK to share the link here? Here it is:

Using model post-processor within scikit-learn pipelines

@amueller (Member) commented Oct 18, 2016

@svendx4f thanks! So one thing with your example is that you "only" process after the predictions. Often, for example if you want to fit a log-linear model, you want to transform y before training and transform it back after prediction. That would be possible with your pipeline, but there would be no way to ensure that the transformation and inverse transformation match.
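To make that concrete, a minimal hand-rolled version of the transform-before-fit / inverse-after-predict pattern (illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
y = np.exp(y / 100.0)  # pretend the target is log-normal

model = LinearRegression().fit(X, np.log(y))  # transform y before training
pred = np.exp(model.predict(X))               # invert after prediction

Nothing here enforces that np.log and np.exp stay in sync; coordinating the two is exactly what a post-processing-only pipeline cannot do.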

Also this might make your code more readable: #7608

@amueller (Member) commented Oct 18, 2016

@svendx4f Actually, I kinda don't like your solution, because you change the meaning of X and y within the nested pipelines, and that's easy to get wrong. It's pretty subtle to see which steps are applied to X and which to y. But I acknowledge that we don't have a nice solution right now, and any work-around is better than no solution ;) And you managed to not add a new method to the API, so that's a benefit.

@sv3ndk commented Oct 19, 2016

Hi @amueller, thanks for the feedback.

I acknowledge my solution probably works only in some specific cases; it's working for me atm at least (so far so good).

Maybe I fail to grasp some subtlety between X and y? To me the output of any predictor/transformer is "its y", which becomes the X of the next one. What is specific to, say, linear regression, such that its transformation of X is called y, whereas the output of PCA or a standard scaler is still called X? In both cases they need to be trained beforehand, and in both cases the dimension of the output is potentially different from that of the input.

I understand the difference from an ML point of view, of course, but why is it relevant from a pipeline "plumbing" point of view? Can't the pipeline ignore the semantics of its components and see them all as thingies that need to be trained and can then transform inputs into outputs?

I like your operators, they remove tons of parentheses :)

@jnothman (Member) commented Oct 19, 2016

I think y is characterised by being unseen at prediction time but necessary for model evaluation.

Yes, some predictions can be used as features, and some transformed feature spaces as predictions.


@amueller (Member) commented Oct 19, 2016

@svendx4f I found it hard to follow the code because it was implicit whether a transformation was applied to X or y. It's not wrong from a programming point of view, but I find the API confusing and I think it makes it easy to write nonsensical code ;)

@gminorcoles commented Mar 28, 2017

Is there an implementation of this anywhere that is considered OK? I do not find it convincing that the reason X,y transforms are bad is that all the existing code only knows about X transforms. I need to support arbitrary X,y transformations, and I feel that this is a superset of the features required by X-only transforms. I would prefer not to maintain my own XY pipeline, which is what I am doing now.

