
SLEP005: Resampler API #15

Open · wants to merge 27 commits into master
Conversation

@glemaitre commented Mar 1, 2019

Here is the SLEP regarding the Outlier Rejection API

@amueller (Member) commented Mar 1, 2019

I was expecting "the fit_resample slep" to be about imbalanced data. I think subsampling is a much more common issue than trying to automatically remove outliers.

Do you have a real-world application where this would help? ;)

@GaelVaroquaux (Member) commented Mar 1, 2019

Indeed, as I was telling Guillaume IRL, I think that this SLEP should be about fit_resample rather than outlier rejection, and outlier rejection should be listed as an application. I am not convinced that proposing a new mixin calls for a SLEP in itself. It's the addition of the method fit_resample that calls for the SLEP.

@glemaitre (Author) commented Mar 2, 2019

OK, so I will modify the SLEP to make it about resamplers. Basically, I should keep the part about the implementation of the pipeline with its limitations. @orausch made a good suggestion.

@glemaitre (Author) commented Mar 2, 2019

Do you have a real-world application where this would help? ;)

@amueller This is indeed a really good point which should be demonstrated in the outlier PR.

@chkoar commented Mar 2, 2019

Will resamplers be applied in transform? I got confused.

@glemaitre (Author) commented Mar 2, 2019

Will resamplers be applied in transform? I got confused.

No. Resamplers will implement fit_resample and will not implement fit_transform or transform.
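A resampler under this contract might look like the following minimal sketch (the class name and the NaN-dropping criterion are made up for illustration):

```python
import numpy as np

class NaNRowDropper:
    """Hypothetical resampler: drops training rows of X (and the matching y)
    that contain NaN. It implements fit_resample only -- no transform or
    fit_transform -- so a pipeline cannot apply it at prediction time."""

    def fit_resample(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        keep = ~np.isnan(X).any(axis=1)  # boolean mask of rows without NaN
        return X[keep], y[keep]

Xr, yr = NaNRowDropper().fit_resample(
    [[1.0, 2.0], [float("nan"), 3.0], [4.0, 5.0]], [0, 1, 0]
)
# the NaN row and its label were dropped together, keeping X and y aligned
```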

@chkoar commented Mar 2, 2019

@orausch commented Mar 2, 2019

I wonder because of this 4ecc51b#diff-fa5683f6b9de871ce2af02905affbdaaR80

Sorry, that is a mistake on my part. They should not be applied on transform.

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

@chkoar commented Mar 2, 2019

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

Yes, this makes sense.

@orausch commented Mar 2, 2019

Currently, my PR applies resamplers on transform (and so does imblearn). It seems that it is not trivial to change the behavior so that, for fit_transform, resamplers are only applied during the fit, and not during the transform. Currently, the pipeline is efficient in the sense that each transformer is only called once during a fit_transform call.

To get the behavior described above, a naive implementation would have to call each transformer after the first resampler twice: once for the fit path, where we apply resamplers, and once for the transform path, where we don't. It seems to me that, in order to do it efficiently, we would need some knowledge of what samples were added/removed by the resamplers in the pipeline.

If we want to make transform not apply resamplers, we would have to address this problem in the pipeline.

This brings me to the next thought: does it even make sense to have resamplers in a transformer-only pipeline? Is there a good use case? One choice would be to simply disallow this behavior (similarly to how we disallow resamplers for fit_predict).

@glemaitre (Author) commented Mar 2, 2019

So I updated the SLEP toward a discussion more focused on fit_resample.

I realised (even if it is obvious) that outlier rejection is unsupervised while resampling for balancing is supervised (and for binary/multiclass classification), AFAIK. In the latter case, resamplers will need to validate the targets and to define an API for driving the resampling (i.e. sampling_strategy in imblearn).

Should this API choice be discussed within the SLEP as well, or is it specific to one type of resampler and better handled later on?

@glemaitre (Author) commented Mar 2, 2019

Currently, my PR applies resamplers on transform (and so does imblearn).

https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py#L487

We are skipping the resampler during transform.

@glemaitre (Author) commented Mar 2, 2019

@orausch I am a bit confused by your comment.

When calling fit_transform on a pipeline, we will call fit_transform or fit_resample on all estimators but the last. The last will then call fit followed by transform, or fit_transform.

For transform on a pipeline, we will call transform on all estimators but the last, skipping the ones with fit_resample. Finally, we call transform on the last estimator.

Could you explain where you think we are applying the resampling exactly? Maybe I am missing something.
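The fit/transform asymmetry under discussion can be made concrete with a toy sketch; every name below is hypothetical and the logic only sketches the intended pipeline semantics:

```python
class Double:
    """Toy transformer: multiplies every feature by two."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [[2 * v for v in row] for row in X]
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class DropFirst:
    """Hypothetical resampler: removes the first training sample."""
    def fit_resample(self, X, y):
        return X[1:], y[1:]

def pipeline_fit_transform(steps, X, y):
    """Sketch: resamplers run on the fit path, transformers run as usual."""
    for est in steps:
        if hasattr(est, "fit_resample"):
            X, y = est.fit_resample(X, y)  # changes the sample set
        else:
            X = est.fit_transform(X, y)    # changes the features
    return X, y

def pipeline_transform(steps, X):
    """Sketch: at transform time every resampler is skipped entirely."""
    for est in steps:
        if hasattr(est, "fit_resample"):
            continue
        X = est.transform(X)
    return X

steps = [DropFirst(), Double()]
Xt, yt = pipeline_fit_transform(steps, [[1], [2], [3]], [0, 1, 1])
```

With these toys, fit_transform returns [[4], [6]] while transform on the same input returns [[2], [4], [6]] — exactly the inconsistency discussed in this thread.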

@orausch commented Mar 2, 2019

We are skipping the resampler during transform.

Sorry, I missed that. I think there is a problem regardless.

When calling fit_transform on a pipeline, we will call all fit_transform or fit_resample of all estimators but the last. The last will then call fit.transform or fit_transform.

If we do this, we have inconsistent behavior between pipe.fit_transform(X, y) and pipe.fit(X, y).transform(X). For example, consider:

X = [A, B, C]  # A, B, C are feature vectors
y = [a, b, c]
pipe = make_pipeline(removeB, mult2)  # removeB is a resampler that removes B; mult2 is a transformer

Then pipe.fit_transform(X, y) gives [2A, 2C], but pipe.fit(X, y).transform(X) gives [2A, 2B, 2C].

@chkoar commented Mar 2, 2019

At least in the imbalanced setting, the last step will usually be either a classifier or a resampler, not a transformer. I suppose the same happens in outlier rejection. In your example it makes sense to resample in the pipeline's transform too, right? But what if you transform again with some new X2? Does it make sense to resample X2 when you only need the transformation? It seems like you need to resample in transform only when you pass the same fitted data again, in order not to break the contract. Thoughts?

@glemaitre glemaitre changed the title SLEP005: Outlier Rejection API SLEP005: Resampler API Mar 3, 2019
@orausch commented Mar 4, 2019

Some options to address this:

  • Forbid calling fit_transform on a pipeline containing resamplers (like we do with fit_predict).
  • Define the API so that transform applies resamplers (then the approach from imblearn works).

Or we can implement the behavior that, for fit_transform, we apply resamplers on fit but not on transform. Here I can also see two options:

  • Implement the behavior, but do it inefficiently (simply call fit and then transform).
  • Do it efficiently. Like I mentioned earlier, it seems to me that to do this we need some knowledge of which samples were added/removed by resamplers. Perhaps return_idx could be an optional kwarg to fit_resample: if the resampler supports returning indices, we do it efficiently; if not, we call fit, then transform.

Let me know if I missed something.
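The return_idx idea might look roughly like this (a toy sketch; the class, the keep-criterion, and the kwarg handling are all assumptions):

```python
class PositiveOnly:
    """Hypothetical resampler keeping rows whose first feature is >= 0;
    it can optionally report which row indices it kept."""

    def fit_resample(self, X, y, return_idx=False):
        kept = [i for i, row in enumerate(X) if row[0] >= 0]
        Xr = [X[i] for i in kept]
        yr = [y[i] for i in kept]
        if return_idx:
            # with the indices, a pipeline could reuse a single transform
            # pass: fit path = rows in `kept`, transform path = all rows
            return Xr, yr, kept
        return Xr, yr

Xr, yr, idx = PositiveOnly().fit_resample(
    [[-1], [2], [3]], [0, 1, 1], return_idx=True
)
```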

@glemaitre (Author) commented Mar 4, 2019
@adrinjalali (Member) commented Mar 5, 2019

I don't think there's one solution which works for all usecases, here are two real world examples (and I hope they're convincing):

  1. In the context of FAT-ML, assume the following pipeline:
  • resample to tackle class imbalance
  • mutate the data for the purpose of a more "fair" data (which may or may not touch y)
  • usual transformers
  • an estimator

In the above pipeline, during fit, I'd like the first two steps to be on, and during predict, I'd like them to be off, which we can do if the first two steps are resamplers and they're automatically out during the predict.

  2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts, and I know from domain knowledge that I can and should safely ignore it. The pipeline would look like:
  • exclude outliers
  • usual transformers
  • predict faulty behavior

In contrast to the previous use case, now I'd like the first step to be always on, because I need to ignore those data points in my analysis.

Also, we should consider that once we have this third type of model, we'd have at least three types of pipelines, i.e. estimator, transformer, and resampler pipelines. I feel like this fact, plus the above two use cases, would justify a parameter like resample_on_transform for the pipeline, to tune its behavior regarding resampler steps. I'm not sure it completely solves our issues, but it may.

For instance, if the user wants to mix these behaviors, they can put different resamplers into different pipelines, set the resample_on_transform of each pipeline appropriately, and include those pipelines in their main pipeline.

I haven't managed to completely think these different cases through, but I hope I could convey the idea.
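A rough sketch of the resample_on_transform idea, with toy classes (all names hypothetical):

```python
class MiniPipeline:
    """Toy pipeline with a hypothetical resample_on_transform switch:
    False -> resamplers run only at fit time (use case 1, class imbalance);
    True  -> resamplers also run outside of fit (use case 2, outliers that
    must always be excluded)."""

    def __init__(self, steps, resample_on_transform=False):
        self.steps = steps
        self.resample_on_transform = resample_on_transform

    def fit_resample(self, X, y):
        for est in self.steps:
            X, y = est.fit_resample(X, y)
        return X, y

    def transform_resample(self, X, y):
        if not self.resample_on_transform:
            return X, y  # resamplers are "off" outside of fit
        return self.fit_resample(X, y)

class DropLabelZero:
    """Hypothetical resampler: removes samples whose label is 0."""
    def fit_resample(self, X, y):
        keep = [i for i, label in enumerate(y) if label != 0]
        return [X[i] for i in keep], [y[i] for i in keep]

always_on = MiniPipeline([DropLabelZero()], resample_on_transform=True)
fit_only = MiniPipeline([DropLabelZero()], resample_on_transform=False)
```

Nesting such pipelines with different flags, as suggested above, would then let a user mix "always on" and "fit only" resamplers within one outer pipeline.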

@amueller (Member) commented Mar 5, 2019

@judahrand commented Mar 29, 2020

@jnothman I have and may end up going down that route. But I thought it was worth exploring how vanilla Sklearn wanted it implemented as it seems to be a feature which has been in discussions for quite some time!

@jnothman (Member) commented Apr 4, 2020

@amueller (Member) commented Apr 7, 2020

I think @GaelVaroquaux and @glemaitre would have to weigh in on their thoughts on SLEP 1 vs SLEP 5. Right now I'm leaning towards SLEP 1. Option A is basically saying "we don't need to address this issue", i.e. that the current API is sufficient. I think this is unlikely to be the solution of choice; it has been the default solution because we couldn't agree on any other so far.

@jnothman (Member) commented Apr 19, 2020

Right now I'm leaning towards Slep1

Does that mean, @amueller, that you prefer its text/discussion, but perhaps not its proposals? I'm confused by your comment.

I think the core of the current proposal(s) is that there is a new kind of estimator (resampler) that can be plugged into some kind of meta-estimator, which is either a Pipeline extended to handle such things, or a specialised meta-estimator class.

@amueller (Member) commented Apr 20, 2020

As I said in #15 (comment), I think there are two core issues: distinguishing the training and test phases, and returning more complex objects. Neither of these seems closely related to sampling, so I think the framing of this SLEP is confusing; it doesn't even mention these two core changes explicitly.
So I guess I like the discussion in SLEP 1 more. But I also think that adding a new method and a specialized meta-estimator seems like the worst of both worlds. I think adding a new method that does both (is only called during fitting and returns a tuple) and supporting it wherever necessary would be best (i.e. modify Pipeline and other meta-estimators).

If we don't call it fit_resample but instead call it fit_modify, then we could simplify our stacking meta-estimator, which would be sweet (that requires only the distinction of the fitting and prediction phases, not returning more complex objects; the same goes for impact/target encoding, which is basically a variant of stacking).

While there might be a use case for trans_modify as proposed in SLEP 1, I think that's a bit confusing, as its only use would be to have a more complex return value; it would otherwise behave as transform.

@jnothman (Member) commented Apr 21, 2020

@amueller (Member) commented May 2, 2020

I think the semantics of having something in a pipeline that does something different during training and prediction are not that complicated, but the current scikit-learn API makes it really hard to reason about because it doesn't distinguish training-time and test-time transformations.

I wasn't suggesting to change the semantics, but I don't think you need anything more general for stacking. The current proposal wants things not to be a resampler and a transformer at the same time, but I don't see why this would be necessary.

I can send along a paper I recently worked on that compares several pipelining languages; as I said in other places, for MLR this is no problem at all, and I think the semantics are perfectly clear. Mostly because they differentiate between training-time and test-time transformations, and because they don't have fit_transform, which has some pretty weird semantics tbh.

If we were not mixing the two issues of returning targets and distinguishing training and test transformation, I would suggest we remove fit_transform and only have fit_modify, but it would probably be too annoying to always return y and sample weights etc.

So regarding a minimal proposal: I think a pretty minimal one would be to add fit_modify, which is called in a pipeline during fit if it exists and returns X, y, ..., and we'd prohibit fit_transform or fit_predict on the pipeline. That would allow the implementation of stacking with a transformer (with the fit_modify API) and it seems more minimal than this SLEP.
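A minimal sketch of the fit_modify proposal, with made-up toy estimators (none of these names exist in scikit-learn):

```python
class SignLabeler:
    """Hypothetical estimator exposing fit_modify: the pipeline calls it
    during fit and it returns a modified (X, y) tuple; at prediction time
    plain transform is used instead."""

    def fit_modify(self, X, y):
        # a stacking step would return base-model predictions here; this
        # toy version just rewrites y from the sign of the first feature
        return X, [int(row[0] > 0) for row in X]

    def transform(self, X):
        return X  # ordinary test-time behavior (identity for this toy)

def pipeline_fit(steps, X, y, final):
    """Sketch: call fit_modify where available, then fit the final step."""
    for est in steps:
        if hasattr(est, "fit_modify"):
            X, y = est.fit_modify(X, y)
    final.fit(X, y)
    return final

class Memorize:
    """Toy final estimator that just records its training labels."""
    def fit(self, X, y):
        self.y_ = list(y)

final = pipeline_fit([SignLabeler()], [[-2], [3]], [9, 9], Memorize())
# the final step was trained on the modified y, not the original [9, 9]
```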

@jnothman (Member) commented Jun 14, 2020

I can send along a paper I recently worked on that is comparing several pipelining languages

I'd be interested. I agree that these issues come from the design of fit_transform, fit returning self, transform returning only Xt, etc.

So re minimal proposal, I think a pretty minimal proposal would be to add fit_modify that is called in a pipeline during fit if it exists and returns X, y, ... and we'd prohibit fit_transform or fit_predict on the pipeline. That would allow the implementation of stacking with a transformer (with the fit_modify API) and it seems more minimal than this SLEP.

And at test time any estimators with fit_modify are treated as the identity transform (and hence fit_modify must return X with the same columns as the input, which is one reason why fit_resample was a decent name)? I don't see this as being very different from the current proposal. As well as needing a mechanism to disable fit_transform, you still need to deal with the interpretation of props returned by fit_modify and the routing of props to be used or modified by this Modifier. But I'd be happy to see an implementation of this change and see if we can get this one rolling.

@amueller (Member) commented Jun 15, 2020

And at test-time any estimators with fit_modify are treated as the identity transform

No, transform (or modify) can still do arbitrary things. For stacking it would not be the identity, it would be predict_proba of the base-estimator.

I don't see this as being very different to the current proposal. As well as needing a mechanism to disable fit_transform, you still need to deal with the interpretation of props returned by fit_modify and the routing of props to be used or modified by this Modifier.

Yes, I agree, you still need to do all of these things. The thing is really awkward because the question is whether you'd also have a modify method, which would be the same as transform but without the y or sample props. And you only need to disable fit_transform on meta-estimators that contain fit_modify, which doesn't seem so hard?

In an alternate universe, we could have transform just return an xarray dataset, we'd break the fit().transform() == fit_transform() convention and everything would be fine ;)

Not sure if I'll get around to prototype this, I still have a book to write this week, or something.

@jnothman (Member) commented Jun 16, 2020

@orausch's implementation of Pipeline handling fit_resample, close to your proposal, is https://github.com/scikit-learn/scikit-learn/pull/13269/files/61fce479352f82b7f3d1136aac24d5598748ec73#diff-dd32b47cafa79252b520b030724ddda9. I think some things there in the implementation could be simplified. There are some things there not yet handled such as resamplers returning kwargs/props.

A remaining question in your design is whether "modifiers" (please could we stick to "resamplers" for now and later propose the name change to modify?) must implement transform to be used at test time. E.g. whether the ResamplerMixin should implement an identity transform as a default?

@amueller (Member) commented Jun 16, 2020

Yes I would say they should. I don't see why they wouldn't. Sure, we could special-case it and have the pipeline and other meta-estimators handle it, but I don't see why it would be better to put the logic there instead of putting it in the estimator.

One thing that is a bit strange right now is the relation of a "resampler" to estimators that only have fit_transform but no transform. That means it doesn't apply to the test set, so there is no point in distinguishing training and test set, even though conceptually what they do is closer to transforming the training set, so fit_modify might be more appropriate. I guess in theory you could come up with a case where you want to return X, y, and sample props for something that only applies to the training set, and then fit_modify would make sense to have without a transform.

So a "resampler" to you is something that returns X and y and sample props and behaves differently during training and prediction? I'm fine using that terminology as long as we agree what it means.

@amueller (Member) commented Jun 16, 2020

The downside of that proposal is that you can't use a TargetEncoder or stacking and then use TSNE, because we don't allow calling fit_transform because it wouldn't be the same as fit().transform().

If TSNE implemented fit_resample it would actually work (though with the resample name it would be weird). For using DBSCAN I don't see how to fix it unless we also add a new verb for predicting on the training set.

@jnothman (Member) commented Jun 16, 2020

The downside of that proposal is that you can't use a TargetEncoder or stacking and then use TSNE, because we don't allow calling fit_transform because it wouldn't be the same as fit().transform().

You can call fit and rely on the stored embeddings_. But I don't really understand why you'd use stacking as a transformer before TSNE..?

I think it is appropriate not to use a Pipeline when you want predictions over something other than the input data. For TSNE and DBSCAN after resampling, you'd likely want to either: keep a copy of the data in the resampled space (i.e. break the pipeline before TSNE/DBSCAN), or have a way to impute the TSNE/DBSCAN embeddings back onto the original full dataset, via something like inverse_transform, or both.

@amueller (Member) commented Jun 17, 2020

You can call fit and rely on the stored embeddings_. But I don't really understand why you'd use stacking as a transformer before TSNE..?

Using the attribute works, though having to change the code when you add something with fit_resample (or whatever it's called) to the pipeline seemed a bit strange. It wasn't really a realistic example ;) I was trying to come up with an example of a pipeline with something that transforms differently on the training data, followed by something that can only produce output on the training data.

I'm not sure if it would actually ever be an issue in practice, but conceptually it seems a bit ugly, and I find the mental model hard to describe and communicate.
In hindsight, I think it was a bad idea to use fit_transform and fit_predict for things that only apply on the training set if we want to keep the fit_transform == fit().transform().

I think something like @orausch's implementation probably solves most issues in some way. I guess I was hoping for something that results in a simple and easy to understand overall API but that would probably require substantial backward incompatible changes and so maybe that's more appropriate for sklearn 2.0 ;)

What part do you think could be simplified? The changes to Pipeline are pretty small, right?
Oh, an interesting question is: what do the other meta-estimators do? Say, ColumnTransformer? The default case for target encoding is within a column transformer, but for resampling it doesn't make sense. I guess anything that changes y doesn't make sense in a column transformer, but things that differ between training and testing do make sense.

@amueller (Member) commented Jun 29, 2020

Notes from dev meeting: @jnothman felt that my concern was unrelated to the resampling and relates only to fit_transform != fit().transform(), as it doesn't change the sample alignment between input and output.

I felt like resampling is "returning y" + "doing something else on the training set". Can you maybe say how the alignment comes in?

@jnothman (Member) commented Jun 29, 2020

I felt like resampling is "returning y" + "doing something else on the training set". Can you maybe say how the alignment comes in?

A core assumption in [fit_]transform, [fit_]predict, predict_proba, etc. is that the return value has samples that are mapped directly to the input X. This allows the predictions to then be scored against some held out y. This is true for the transformation in stacking and target encoding, but not in resampling, sample reduction, sample rejection or dataset augmentation, which are the focus of this PR.

One reason I like the verb "resample" here is because we are literally changing the set of samples that downstream estimators receive features of. (I'm intentionally distinguishing a "sample" from its features, to focus on the identity of the sample rather than the observation of it.) The fact that we need to be able to modify y, and the fact that we cannot perform this operation at test time are corollaries of changing the set of samples we are operating on.

I hope this makes sense.

@jnothman (Member) commented Jun 29, 2020

Also: As discussed at today's meeting, I proposed the meta-estimator, ResampledTrainer, because while it seems natural that a resampler should occur in a Pipeline, the fact that it changes the sample set makes its semantics a little hard to reason about (at least in theory). It also complicates the implementation and testing of myriad Pipeline setups.
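A toy sketch of what a ResampledTrainer could look like (the actual proposed API is in the SLEP text; everything here is illustrative):

```python
class ResampledTrainer:
    """Sketch of the proposed meta-estimator: resample once at fit time,
    train the predictor on the resampled data, and leave prediction-time
    data untouched."""

    def __init__(self, resampler, predictor):
        self.resampler = resampler
        self.predictor = predictor

    def fit(self, X, y):
        Xr, yr = self.resampler.fit_resample(X, y)
        self.predictor.fit(Xr, yr)
        return self

    def predict(self, X):
        return self.predictor.predict(X)  # resampler plays no role here

class DropFirstRow:
    """Hypothetical resampler: removes the first training sample."""
    def fit_resample(self, X, y):
        return X[1:], y[1:]

class MajorityVote:
    """Toy predictor: always predicts the most common training label."""
    def fit(self, X, y):
        ys = list(y)
        self.label_ = max(set(ys), key=ys.count)
    def predict(self, X):
        return [self.label_] * len(X)

model = ResampledTrainer(DropFirstRow(), MajorityVote()).fit(
    [[1], [2], [3]], [0, 1, 1]
)
```

Because the resampling is confined to fit, the scope question that makes a resampler-in-Pipeline confusing does not arise here.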

@jnothman (Member) commented Aug 17, 2020

An open question here is: is there any value in requiring resamplers to have fit? They're only useful as resamplers if fit_resample is called.

@amueller (Member) commented Aug 17, 2020

I like your distinction above and requiring resamplers not to change the meaning of features and basically operate on sample identities. That makes sense as a distinct logical concept to me.

However, in terms of API, I'm not sure if we're working on too narrow a solution here, since it will still not let us modify y in other areas and have distinct training and test transformations, two things that I think we are relatively certain we want.

On the one hand it makes sense to me to tackle one problem after the other and have incremental solutions. However, these issues seem connected to me; as you said:

The fact that we need to be able to modify y, and the fact that we cannot perform this operation at test time are corollaries of changing the set of samples we are operating on.

So if we solve this particular case but not the more general case, do we keep tacking things on to cover the other cases, and might we come up against issues that cannot be solved (like the ones I tried to construct above)?

As you said, it might be fine not to use a pipeline in some cases, I guess.

@jnothman (Member) commented Aug 17, 2020

However, in terms of API, I'm not sure if we're working on too narrow a solution here, since it will still not let us modify y in other areas and have distinct training and test transformations, two things that I think we are relatively certain we want.

I'm still not very aware of categories of use case for modifying y where it's not about:

  • changing the sample identity, which doesn't make sense on inductive models at test time, and so should be doable with fit_resample.
    • In some cases, such as retrieving both X and y from some external data source, this involves modifying the feature space as well as the sample space. I don't mind us extending the current proposal to allow for changing feature identity, which requires transform at test time, but does not involve transforming y at test time.
  • transforming the target, such that the prediction is inverse-transformed (a generalisation of TransformedTargetRegressor)

If there are other cases, but they are not a cohesive category, I see no reason not to tackle them with custom meta-estimators.

I like that this solution circumscribes a group of tools that do similar things, i.e. changing the set of samples used at training time; and it protects the user from doing something inappropriate with test samples.

If we forbid resamplers having transform methods, then it is extensible to later allowing feature spaces to be modified.

And I think this is separate from the need for non-equivalence between fit_transform and transform.

@amueller (Member) commented Aug 17, 2020

Yeah, I have tried to come up with other examples of modifying y that doesn't fit in these categories but couldn't really think of any. @GaelVaroquaux might have had some use-cases in mind?

And I agree, it's probably separate from having non-equivalence between fit_transform and transform.

@drorata commented Aug 18, 2020

I posted a SO question today and @adrinjalali suggested that my concern is related to this SLEP. After a brief chat with him, I was convinced that I can contribute a use case where the Resampler could be helpful.

Assume that we start with a dataset like so, denoted by X:

| feat_1 | feat_2 | feat_3 |
|--------|--------|--------|
| f_1_1  | f_1_2  | f_1_3  |
| f_2_1  | f_2_2  | f_2_3  |
| f_3_1  | f_3_2  | f_3_3  |
| f_4_1  | f_4_2  | f_4_3  |

etc..

I would like to construct a pipeline which performs the following stages (each stage may consist of multiple steps):

  1. Feature engineering: create derived features from those available in the original set.
  2. Extract targets from the dataset. When saying "extract", I mean applying some logic on the data and creating a new feature that will be the target of the predictions. For example, think of time-series data where the target for some day d_i is the number of transactions 10 days later. The extracted target should become y, and the rest of the pipeline should respect it.
  3. The last phase is no longer special. It gets from the previous stages (probably modified) X and targets y.

I hope this is indeed on-topic for the SLEP. I'll be happy to elaborate and discuss further as you need.
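This use case could be served by a resampler along these lines (a hypothetical sketch; the class name, horizon parameter, and target rule are made up):

```python
class TargetExtractor:
    """Hypothetical resampler for the time-series use case above: derive
    the target from the data itself (the first feature `horizon` rows
    ahead) and drop the trailing rows for which no target exists yet."""

    def __init__(self, horizon=2):
        self.horizon = horizon

    def fit_resample(self, X, y=None):
        n = len(X) - self.horizon
        new_y = [X[i + self.horizon][0] for i in range(n)]
        return X[:n], new_y

# five rows in, the last `horizon` rows have no target and are dropped
Xr, yr = TargetExtractor(horizon=2).fit_resample([[1], [2], [3], [4], [5]])
```

Downstream pipeline steps would then receive the shortened X together with the freshly derived y, which is exactly what fit_resample's (X, y) return value allows.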

@jnothman (Member) commented Aug 30, 2020

@glemaitre did you see my proposed amendments in glemaitre#4?

@jnothman (Member) commented Dec 20, 2020

@glemaitre, should I just be pushing the changes from glemaitre#4 to this branch?

I see the proposal here as reasonable to implement, without touching much else in the library.

The contrast between the Pipeline and the ResampledTrainer choice is still a bit irksome. Pipeline seems more natural (i.e. what users expect), but the scope of the resampling can be confusing, so I prefer the explicit ResampledTrainer.

But once you've got a specialised metaestimator like ResampledTrainer, why not just allow each resampler to be its own metaestimator, e.g. Oversampled(predictor, n_samples) or NanRemoved(predictor) or OutlierRemover(predictor, outlier_detector), all inheriting from a common base class? This would get rid of the need for a new fit_resample verb, but would consequently make it hard to use imblearn's resamplers directly.
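The resampler-as-meta-estimator alternative might be sketched like this (all class names hypothetical):

```python
class ResamplingPredictor:
    """Sketch of the alternative above: a common base class, with each
    resampler being its own meta-estimator instead of a fit_resample verb."""

    def __init__(self, predictor):
        self.predictor = predictor

    def _resample(self, X, y):  # overridden by each concrete resampler
        raise NotImplementedError

    def fit(self, X, y):
        Xr, yr = self._resample(X, y)
        self.predictor.fit(Xr, yr)
        return self

    def predict(self, X):
        return self.predictor.predict(X)

class NanRemoved(ResamplingPredictor):
    """Hypothetical concrete resampler: drops rows containing NaN."""
    def _resample(self, X, y):
        keep = [i for i, row in enumerate(X) if not any(v != v for v in row)]
        return [X[i] for i in keep], [y[i] for i in keep]

class Memorize:
    """Toy predictor that just records its training labels."""
    def fit(self, X, y):
        self.y_ = list(y)
    def predict(self, X):
        return self.y_[: len(X)]

model = NanRemoved(Memorize()).fit([[1.0], [float("nan")], [3.0]], [0, 1, 2])
```

As noted above, this avoids the new verb entirely, at the cost of not being able to reuse imblearn's existing fit_resample-based resamplers directly.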
