
SLEP005: Resampler API #15

Open · wants to merge 4 commits into base: master

Conversation

@glemaitre commented Mar 1, 2019

Here is the SLEP regarding the Outlier Rejection API

@@ -26,6 +26,7 @@
slep002/proposal
slep003/proposal
slep004/proposal
slep005/proposal

@amueller (Member) Mar 1, 2019

This shouldn't be under "delayed review"; it should be under "under review", I think.

@amueller (Member) commented Mar 1, 2019

I was expecting "the fit_resample SLEP" to be about imbalanced data. I think subsampling is a much more common issue than trying to automatically remove outliers.

Do you have a real-world application where this would help? ;)


The new mixin is implemented as::
class OutlierRejectionMixin:

@GaelVaroquaux (Member) Mar 1, 2019

I don't think that a SLEP should give that level of implementation detail (I find myself wanting to comment on the implementation details, which is not the point); it should rather focus on the API and its consequences (which may be reflected in the implementation).

@GaelVaroquaux (Member) commented Mar 1, 2019

Indeed, as I was telling Guillaume IRL, I think that this SLEP should be about fit_resample rather than outlier rejection, and outlier rejection should be listed as an application. I am not convinced that proposing a new mixin calls for a SLEP in itself; it is the addition of the fit_resample method that calls for the SLEP.

To handle outlier rejectors in ``Pipeline``, we enforce the following:

* an estimator cannot implement both ``fit_resample(X, y)`` and
``fit_transform(X)`` / ``transform(X)``.

@GaelVaroquaux (Member) Mar 1, 2019

That's the important point (the crucial one). The SLEP should mention the rationale.

@glemaitre (Author) commented Mar 2, 2019

OK, so I will modify the SLEP to make it about resamplers. Basically, I should keep the part about the pipeline implementation and its limitations. @orausch made a good suggestion.

@glemaitre (Author) commented Mar 2, 2019

Do you have a real-world application where this would help? ;)

@amueller This is indeed a really good point which should be demonstrated in the outlier PR.

Update slep005/proposal.rst
Co-Authored-By: glemaitre <g.lemaitre58@gmail.com>
@chkoar commented Mar 2, 2019

Will resamplers be applied in transform? I got confused.

@glemaitre (Author) commented Mar 2, 2019

Will resamplers be applied in transform? I got confused.

Nope, resamplers will implement fit_resample and will not implement fit_transform or transform.
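
For readers following along, here is a minimal sketch of what such a resampler could look like; the class name and the use of IsolationForest are placeholders for illustration only, not part of the proposal:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import IsolationForest


class NaiveOutlierRejector(BaseEstimator):
    """Hypothetical resampler: it exposes only fit_resample and
    deliberately has no transform/fit_transform."""

    def fit_resample(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Flag inliers (+1) vs. outliers (-1) with an off-the-shelf detector.
        keep = IsolationForest(random_state=0).fit_predict(X) == 1
        # Return the resampled dataset: X and y shrink together.
        return X[keep], y[keep]
```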

@chkoar commented Mar 2, 2019

@orausch commented Mar 2, 2019

I wonder because of this 4ecc51b#diff-fa5683f6b9de871ce2af02905affbdaaR80

Sorry, that is a mistake on my part. They should not be applied on transform.

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

@chkoar commented Mar 2, 2019

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

Yes, this makes sense.

@orausch commented Mar 2, 2019

Currently, my PR applies resamplers on transform (and so does imblearn). It seems that it is not trivial to change the behavior so that, for fit_transform, resamplers are only applied during the fit, and not during the transform. Currently, the pipeline is efficient in the sense that each transformer is only called once during a fit_transform call.

To get the behavior described above, a naive implementation would have to call each transformer after the first resampler twice: once for the fit path, where we apply resamplers, and once for the transform path, where we don't. It seems to me that, in order to do it efficiently, we would need some knowledge of what samples were added/removed by the resamplers in the pipeline.

If we want to make transform not apply resamplers, we would have to address this problem in the pipeline.

This brings me to the next thought: does it even make sense to have resamplers in a transformer-only pipeline? Is there a good use case? One choice would be to simply disallow this behavior (similarly to how we disallow resamplers for fit_predict).

@glemaitre (Author) commented Mar 2, 2019

So I updated the SLEP to focus more on the fit_resample discussion.

I realised (even if it is obvious) that outlier rejection is unsupervised, while resampling for balancing is supervised (and for binary/multiclass classification) AFAIK. In the latter case, resamplers will need to validate the targets and define an API for driving the resampling (i.e. sampling_strategy in imblearn).

Should this API choice be discussed within the SLEP as well, or is it more specific to one type of resampler and can be handled later on?

@glemaitre (Author) commented Mar 2, 2019

Currently, my PR applies resamplers on transform (and so does imblearn).

https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py#L487

We are skipping the resampler during transform.

@glemaitre (Author) commented Mar 2, 2019

@orausch I am a bit confused by your comment.

When calling fit_transform on a pipeline, we will call fit_transform or fit_resample on all estimators but the last. The last one will then call fit(...).transform(...) or fit_transform(...).

For transform on a pipeline, we will call transform on all estimators but the last, skipping those that implement fit_resample. Finally, we call transform on the last estimator.

Could you explain where exactly you think we are applying the resampling? Maybe I am missing something.

@orausch commented Mar 2, 2019

We are skipping the resampler during transform.

Sorry, I missed that. I think there is a problem regardless.

When calling fit_transform on a pipeline, we will call fit_transform or fit_resample on all estimators but the last. The last one will then call fit(...).transform(...) or fit_transform(...).

If we do this, we have inconsistent behavior between pipe.fit_transform(X, y) and pipe.fit(X, y).transform(X). For example consider:

X = [A, B, C] # A, B, C are feature vectors 
y = [a, b, c]
pipe = make_pipeline(removeB, mult2) # removeB is a resampler that will remove B, mult2 is a transformer

Then pipe.fit_transform(X, y) gives [2A,2C] but pipe.fit(X, y).transform(X) gives [2A, 2B, 2C]
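
A tiny hand-rolled sketch of that inconsistency (plain functions instead of estimators, since the stock Pipeline does not accept resamplers; remove_b and mult2 are stand-ins for the hypothetical steps above):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])  # rows stand in for A, B, C
y = np.array([0, 1, 0])

def remove_b(X, y):
    # resampler: drops the second sample at fit time
    keep = np.array([True, False, True])
    return X[keep], y[keep]

def mult2(X):
    # transformer: multiplies every feature by 2
    return 2 * X

# fit_transform path: the resampler runs, then the transformer
X_res, y_res = remove_b(X, y)
print(mult2(X_res))  # [[2.] [6.]]        -> "2A, 2C"

# fit(X, y).transform(X) path: the resampler is skipped at transform time
print(mult2(X))      # [[2.] [4.] [6.]]   -> "2A, 2B, 2C"
```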

@chkoar commented Mar 2, 2019

At least in the imbalanced-setting use case, you will usually have either a classifier or a resampler as the last step, not a transformer. I suppose the same happens with outlier rejection. In your example it makes sense to resample in the pipeline's transform as well, right? But what if you transform again with a new X2? Does it make sense to resample X2 when you only need the transformation? It seems like you only need to resample in transform when you pass the same fitted data again, in order not to break the contract. Thoughts?

@glemaitre changed the title from "SLEP005: Outlier Rejection API" to "SLEP005: Resampler API" on Mar 3, 2019

@orausch commented Mar 4, 2019

Some options to address this:

  • Forbid calling fit_transform on a pipeline containing resamplers (like we do with fit_predict).
  • Define the API so that transform applies resamplers (then the approach from imblearn works).

Or we can implement the behavior that, for fit_transform, we apply resamplers on fit but not on transform. Here I can also see two options:

  • Implement the behavior, but do it inefficiently (simply call fit and then transform).
  • Do it efficiently. Like I mentioned earlier, it seems to me that to do this, we need some knowledge of which samples were added/removed by the resamplers. Perhaps something like return_idx as an optional kwarg to fit_resample (see the sketch below). If the resampler supports returning indices, we do it efficiently; if not, we call fit, then transform.

Let me know if I missed something.
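
As referenced in the last bullet, here is one purely illustrative shape the return_idx idea could take; ToyUnderSampler is a made-up name (not imblearn's class), and nothing here is an agreed API:

```python
import numpy as np


class ToyUnderSampler:
    """Illustrative resampler that can optionally report which rows it kept."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y, return_idx=False):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        # Simple random under-sampling: keep n_min samples of every class.
        idx = np.sort(np.concatenate([
            rng.choice(np.flatnonzero(y == c), n_min, replace=False)
            for c in classes
        ]))
        if return_idx:
            # A pipeline could use idx to know which rows survived,
            # e.g. to reuse already-transformed data on the fit path.
            return X[idx], y[idx], idx
        return X[idx], y[idx]
```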

@glemaitre (Author) commented Mar 4, 2019

@orausch referenced this pull request on Mar 5, 2019: [WIP] Outlier Rejection #13269 (Open)
Update slep005/proposal.rst
Co-Authored-By: glemaitre <g.lemaitre58@gmail.com>
@adrinjalali (Member) commented Mar 5, 2019

I don't think there's one solution that works for all use cases; here are two real-world examples (and I hope they're convincing):

  1. In the context of FAT-ML, assume the following pipeline:
  • resample to tackle class imbalance
  • mutate the data to make it more "fair" (which may or may not touch y)
  • usual transformers
  • an estimator

In the above pipeline, during fit, I'd like the first two steps to be on, and during predict, I'd like them to be off, which we can do if the first two steps are resamplers and they're automatically skipped during predict.

  2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts and I know from the domain knowledge that I can and should safely ignore them. The pipeline would look like:
  • exclude outliers
  • usual transformers
  • predict faulty behavior

In contrast to the previous use case, I'd now like the first step to be always on, because I need to exclude those data points from my analysis.

Also, we should consider the fact that once we have this third type of model, we'd have at least three types of pipelines, i.e. estimator, transformer, and resampler pipelines. I feel like this fact, plus the above two use cases, would justify a parameter like resample_on_transform for the pipeline, to tune its behavior with respect to the resampler steps. I'm not sure if it completely solves our issues, but it may.

For instance, if the user wants to mix these behaviors, they can put different resamplers into different pipelines, set the resample_on_transform of each pipeline appropriately, and include those pipelines in their main pipeline.

I haven't managed to completely think these different cases through, but I hope I could convey the idea.

@amueller (Member) commented Mar 5, 2019

@amueller (Member) commented Mar 5, 2019

@orausch commented Mar 5, 2019

I'm not sure if we want to support this use-case - in the end you could always just add another class to the main problem that is "outlier".

Although sklearn already includes novelty detection models, it would be impossible to support this under the current API either way. This is because fit_resample is stateless (as in, it fits and resamples on the same input data). Supporting this correctly would require a separate resample verb.

@amueller (Member) commented Mar 5, 2019

@orausch sorry I don't follow. The simple solution to the situation I described is just a multi-class problem and there is no resampling necessary.

@orausch commented Mar 5, 2019

Sorry, I was referring to implementing novelty rejection under the resampler API (use case 2 of @adrinjalali's comment).

@orausch commented Mar 5, 2019

Let me rephrase to clarify:

2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts and I know from the domain knowledge that I can and should safely ignore them.

This is not possible under the proposed resampler API. fit_resample is currently stateless, in that it fits and resamples on the same data and returns the resampled result. If I understand this use case correctly, this "novelty rejection" would require you to be able to fit the resampler on the train set separately and call resample whenever you predict. Since we only have a fit_resample method, this is not possible.
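
To make the contrast concrete, here is a rough sketch of the two shapes being discussed; the fit/resample split in the second class is exactly the hypothetical extra verb, not something the SLEP proposes, and IsolationForest is only a convenient stand-in detector:

```python
import numpy as np
from sklearn.ensemble import IsolationForest


class StatelessRejector:
    """Shape proposed in the SLEP: fit and resample the same data in one call."""

    def fit_resample(self, X, y):
        keep = IsolationForest(random_state=0).fit_predict(X) == 1
        return np.asarray(X)[keep], np.asarray(y)[keep]


class StatefulRejector:
    """Hypothetical alternative needed for novelty rejection at predict time."""

    def fit(self, X, y=None):
        self.detector_ = IsolationForest(random_state=0).fit(X)
        return self

    def resample(self, X, y=None):
        # Applies the *already fitted* detector to new, unseen data.
        keep = self.detector_.predict(X) == 1
        X = np.asarray(X)[keep]
        return X if y is None else (X, np.asarray(y)[keep])
```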

@jnothman (Member) commented Mar 7, 2019

I think the predictive maintenance example is one for hierarchical classification with outlier detectors, not merely for outlier removal during training.

I think the SLEP should go through the series of methods on a Pipeline and note the proposed behaviour and the implications for Resampler/Pipeline design.

* ``sample_weight`` can be applied at both fit and predict time;
* ``sample_weight`` needs to be passed and modified within a
``Pipeline``.

@jnothman (Member) Mar 7, 2019

Like everything, a meta-estimator could be implemented to provide this functionality. However, this is difficult to use, unintuitive, and would have to implement all the methods and attributes appropriate to the contexts where it may be used (regression, decomposition, etc.).


* sample rebalancing to correct the bias toward classes with large cardinality;
* outlier rejection to fit a clean dataset.

@jnothman (Member) Mar 7, 2019

You should mention sample reduction e.g. representation of a dataset by its kmeans centroids.

I also note that the former two cases can be handled merely by the ability to modify sample_weight in a Pipeline.
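
A rough sketch of that kind of sample reduction (a made-up reducer for illustration only, not a proposed estimator; it assumes non-negative integer class labels):

```python
import numpy as np
from sklearn.cluster import KMeans


class KMeansReducer:
    """Illustrative resampler: replaces the dataset by k cluster centroids."""

    def __init__(self, n_clusters=10, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit_resample(self, X, y):
        km = KMeans(n_clusters=self.n_clusters,
                    random_state=self.random_state).fit(X)
        y = np.asarray(y)
        # Label each centroid with the majority class of its cluster
        # (np.bincount requires non-negative integer labels).
        y_red = np.array([np.bincount(y[km.labels_ == k]).argmax()
                          for k in range(self.n_clusters)])
        return km.cluster_centers_, y_red
```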

perform resampling. However, the current limitations are:

* ``sample_weight`` is not available for all estimators;
* ``sample_weight`` will implement only sample reductions;

@jnothman (Member) Mar 7, 2019

sample reductions -> simple resampling


* ``sample_weight`` is not available for all estimators;
* ``sample_weight`` will implement only sample reductions;
* ``sample_weight`` can be applied at both fit and predict time;

@jnothman (Member) Mar 7, 2019

not sure about this...
