
Feature Request: Pipelining Outlier Removal #9630

Open

datajanko opened this issue Aug 26, 2017 · 29 comments · May be fixed by #13269

Comments

@datajanko
Contributor

datajanko commented Aug 26, 2017

I wonder if we could make outlier removal available in pipelines.

I tried implementing it, for example using IsolationForest, but so far I haven't been able to solve it, and I know why.

The problem boils down to fit_transform only returning a transformed X. This suffices in the vast majority of cases, since we typically only throw away columns (think of a PCA). For outlier removal in a pipeline, however, we need to throw away rows of both X and y during training and do nothing during testing. This is not supported so far. Essentially, we would need to turn the predict function into some kind of transform function during training.

Investigating the pipeline implementation shows that fit_transform is called, if present, during the fitting part of the pipeline, rather than fit(X, y).transform(X). In particular, during cross-validation fit_transform is only called on the training folds, which would be perfect for outlier removal. What remains is to do nothing in the test step, and for that we could simply implement a "do-nothing" transform function.

Unfortunately, the most direct way to implement this would be an API change to the TransformerMixin class.
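For concreteness, a rough sketch of what currently goes wrong (untested; DropFirstRow is just an illustrative name): a transformer whose transform drops rows leaves y untouched, so the next step receives mismatched X and y.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class DropFirstRow(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # drops a row of X, but the pipeline has no way to drop the matching entry of y
        return X[1:]

X = np.random.rand(10, 3)
y = np.arange(10) % 2
make_pipeline(DropFirstRow(), LogisticRegression()).fit(X, y)
# raises a ValueError: X now has 9 samples while y still has 10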

So my questions are:

Would it be interesting to support this kind of sample removal in pipelines?
Are there other, more suitable ideas for implementing this feature in a pipeline?

If the content of this question is somehow inappropriate (e.g. since I'm only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.

@jnothman
Member

This is the right place. Sorry I've not fished out issue numbers, but certainly things like this have been raised before, particularly as regards resampling in pipelines. A few points:

  1. It's not hard to implement something like this as your own metaestimator (and maybe it makes sense to have one specialised for outlier removal)
  2. We'd rather avoid allowing transform to resize the data, but introducing a new method (I think imblearn uses resample) is something I would like to see happen.

@datajanko
Contributor Author

Hey, I skimmed through some issues but didn't find a related topic. Sorry for that.

I fully understand that an API change is suboptimal. Resample is of course the right method name for imblearn; reduce might be good in this case, but it's a completely overloaded name ;) What about condense?
But would this really belong directly in sklearn, or would a "contrib" package be preferred?

Concerning the meta-estimator: how would such an estimator remove rows? I just don't see it at the moment, and a small(!) hint would be appreciated.

@datajanko
Contributor Author

datajanko commented Aug 26, 2017

I think I found one of the related issues

And I think I understand how to use a meta-estimator. Compared to bagging, the random subsampling is replaced by the outlier-free set, and the many estimators are replaced by the single estimator we are interested in.

However, this means that we can only remove outliers in the final step (or build further meta-estimators out of transformers, which seems awkward). In that case, I think a new method might be a better choice.

@jnothman
Member

Yes, #3855, #4143, #4552 and scikit-learn/enhancement_proposals#2 all relate.

I don't see why resample is entirely inappropriate here.

Meta-estimator (almost; untested):

from sklearn.base import BaseEstimator, ClassifierMixin, clone

class WithoutOutliersClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, outlier_detector, classifier):
        self.outlier_detector = outlier_detector
        self.classifier = classifier

    def fit(self, X, y):
        # flag outliers on the training data only (detectors label inliers as 1)
        self.outlier_detector_ = clone(self.outlier_detector)
        mask = self.outlier_detector_.fit_predict(X, y) == 1
        # fit the wrapped classifier on the inliers only
        self.classifier_ = clone(self.classifier).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # predict for all samples, including would-be outliers
        return self.classifier_.predict(X)
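For example, it could be used roughly like this (also untested; assumes the class above and any outlier detector with fit_predict):

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
clf = WithoutOutliersClassifier(IsolationForest(random_state=0), LogisticRegression())
# outliers are dropped only when fitting on each training fold;
# predictions are still made for every test sample
scores = cross_val_score(clf, X, y, cv=5)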

@datajanko
Contributor Author

For me, (re)sampling always implies a strong random component, which is not really the case when outliers are to be removed.

Thanks for the meta-estimator hint; I was thinking in a similar direction. Additionally, the classifier can itself be a pipeline, so my previous comment about removing outliers only in the last step is wrong in some sense.

@jnothman
Member

jnothman commented Aug 28, 2017 via email

@amueller
Member

I think we should close this as duplicate, given all the other open issues and the enhancement proposal.

Resampling doesn't necessarily mean random. In signal processing at least, it is not ;)

The meta-estimator is a good solution imho. Interesting question: it looks like you don't want to remove outliers at test time. Why is that? That seems counter-intuitive to me. I would refuse to make a prediction on points that I would have removed from the training set.

@jnothman
Member

jnothman commented Aug 28, 2017 via email

@glemaitre
Member

Just passing by this issue. When I started reading, I thought that imblearn could implement a FunctionSampler in which an outlier detection estimator could be used to remove the samples considered as outliers.

That said, I share the feeling of @amueller regarding "not removing the outliers at test time". Using the imblearn pipeline with such a FunctionSampler, you would get exactly this behaviour, but I am really not sure it is the right thing to do.
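For illustration, the kind of usage I have in mind would be roughly the following (untested sketch against the imblearn API; outlier_rejection is just an example function):

from imblearn import FunctionSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def outlier_rejection(X, y):
    # fit the detector on the training fold only and keep the inliers
    mask = IsolationForest(random_state=0).fit_predict(X) == 1
    return X[mask], y[mask]

pipe = make_pipeline(FunctionSampler(func=outlier_rejection), LogisticRegression())
# the sampler is applied during fit only; predict/score see the test data unchanged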

@jnothman @amueller if it would make sense to have such a feature, you could delegate it to imblearn.

NB:
If at some point scikit-learn is interested in getting the most-used samplers from imblearn (random + SMOTE), we are willing to contribute those. I think our API is starting to become more or less stable. We are still missing how to handle regression and categorical data for SMOTE.

@datajanko
Contributor Author

@amueller, thanks for the remark on signal processing and "sampling".

I support the view of @jnothman concerning outlier removal only in training. In particular, I see the test set as "real" data that can of course be noisy. In that scenario, I typically want a prediction for every data point (contrary to @amueller). During training, however, I want to reduce overfitting a bit by removing outliers, which can/could improve model quality.

Maybe marking points as outliers and providing predictions could be a compromise.

@jdwittenauer
Contributor

Just ran into this exact scenario and came looking to see if anything was in-progress, so I'll echo my support for the feature (for whatever that's worth =)). Regarding removing outliers from the test set, one could argue both ways and it's likely too dependent on the specifics of the problem to have a right answer, but the scenario @datajanko described seems to be more common (i.e. remove from training to improve generalization).

@jnothman
Member

jnothman commented Nov 9, 2017 via email

@glemaitre
Member

It would be great ... I think that @chkoar would agree with that.

By the way, we started a PR for a FunctionSampler, as previously stated. We still have to document it, but it should allow wrapping an outlier detection algorithm inside it to later do the sampling.

Regarding the Pipeline itself, I wanted to get back to #8350, and it could be the occasion to introduce SamplerMixin support at the same time. @jnothman WDYT?

@jnothman
Member

I'd be tempted to just add a selection method to outlier detectors directly rather than create FunctionSampler, but maybe I've not understood its benefits.

@jnothman
Member

jnothman commented Nov 10, 2017 via email

@glemaitre
Member

I'd be tempted to just add a selection method to outlier detectors directly rather than create FunctionSampler, but maybe I've not understood its benefits.

In the meantime, we made a quick (stupid) example:
https://840-36019880-gh.circle-artifacts.com/0/home/ubuntu/imbalanced-learn/doc/_build/html/auto_examples/plot_outlier_rejections.html#sphx-glr-auto-examples-plot-outlier-rejections-py

We are still working on it so any feedback is welcomed.

@john-gillam

Hi,

I have been working on a similar pipeline that bypasses specific data. In my case I don't want it to resample the data but rather to refuse to classify (i.e. either set NaNs as the classification outcome or use a masked array as output) those subjects that meet a specific condition. I'm not sure if this is helpful, or similar to the FunctionSampler in imblearn, or might be appropriate for #3855.
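For what it's worth, a minimal (untested) sketch of the idea, with BypassClassifier as a purely illustrative name and assuming numeric class labels:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class BypassClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, outlier_detector, classifier):
        self.outlier_detector = outlier_detector
        self.classifier = classifier

    def fit(self, X, y):
        self.outlier_detector_ = clone(self.outlier_detector).fit(X)
        self.classifier_ = clone(self.classifier).fit(X, y)
        return self

    def predict(self, X):
        # refuse to classify samples flagged by the detector: return NaN for them
        pred = self.classifier_.predict(X).astype(float)
        pred[self.outlier_detector_.predict(X) == -1] = np.nan
        return pred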

In any case, if anyone is interested, I have made a branch available with the pipeline included.

@jnothman
Member

Link to your branch here? But I don't really think this should be a branch modifying Pipeline so much as a separate estimator: it sounds like hierarchical (i.e. coarse then fine) classification which I think is hard to develop a generic tool for.

@john-gillam

john-gillam commented Jan 24, 2018

Sorry, I'm a bit new to this. The branch is at:
https://github.com/john-gillam/scikit-learn/tree/BypassPipeline
I'm still working on/testing it though.

@averri

averri commented Nov 6, 2018

If detecting and removing outliers is part of the ML workflow, why has it been left out of the pipeline processing? I came here because I want to do exactly the same thing as the author of this issue.

Scikit-learn is a fantastic framework. GridSearchCV is very powerful and useful, and I would like to integrate different strategies for detecting and removing outliers into the pipeline and let GridSearchCV figure out which one is best.

It's clear that removing rows from X inside the pipeline breaks the relationship between X and y, because y does not see the same changes. I don't know whether this is a design flaw or something the designers considered from the beginning, but outlier detection/removal seems to be a reasonable use case for the pipeline, for the reason I explained in the paragraph above.

@jnothman
Member

jnothman commented Nov 6, 2018 via email

@amueller
Member

amueller commented Nov 6, 2018

@averri it's very easy to implement this as a meta-estimator btw, which would allow using all the grid-search and cross-validation tools.

pre_outlier_pipeline = make_pipeline(...)
post_outlier_pipeline = make_pipeline(..., SomeClassifier())
outlier_detector = ...
full_model = make_pipeline(pre_outlier_pipeline,
                           OutlierMeta(post_outlier_pipeline, outlier_detector))

Where OutlierMeta implements the outlier detection (I'm not gonna give an implementation of that here, but it should be less than 20 lines).

That does break the nice linear flow of the pipeline, though.

Also, it would be interesting if you could share actual real-world examples of data where using an outlier detection algorithm is helpful.

@jnothman
Member

jnothman commented Nov 7, 2018

I have a meta-estimator coded above: #9630 (comment)

@amueller
Member

amueller commented Nov 7, 2018

Great, so our two snippets together nearly make one complete, untested solution ;) I guess I should have re-read the rest of the thread.
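Combining them, the full (still untested) sketch would look roughly like this, with the concrete steps (StandardScaler, PCA, IsolationForest, LogisticRegression) as placeholders:

from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pre_outlier_pipeline = make_pipeline(StandardScaler())
post_outlier_pipeline = make_pipeline(PCA(n_components=2), LogisticRegression())
full_model = make_pipeline(
    pre_outlier_pipeline,
    # WithoutOutliersClassifier from the earlier comment plays the role of OutlierMeta,
    # taking the detector first and the wrapped (pipeline) classifier second
    WithoutOutliersClassifier(IsolationForest(random_state=0), post_outlier_pipeline),
)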

@datajanko
Contributor Author

Would it make sense to put this somewhere in the documentation?

@jnothman
Member

jnothman commented Nov 8, 2018 via email

@orausch linked a pull request Feb 26, 2019 that will close this issue
@amueller
Member

Can anyone provide an example where this was done in practice? Or any paper evaluating the use of automatic outlier removal for supervised learning?
My intuition is that it would be a bad idea, and I have never heard of anyone doing it in practice.
So any references to papers or applications would be great.

@adrinjalali
Member

  • I was on a project where they used a random forest to automatically detect outliers before fitting a model on the remaining samples. In practice it worked pretty well.
  • I have seen this in settings where the high dimensionality of the data doesn't allow for the usual simple outlier removal. For instance, fit a one-class SVM, remove the outliers, then continue with the job.
  • Not doing it in a pipeline sounds like a bad idea. I always remove my outliers after I split into train/test. In a cross-validation/grid-search scenario, this means I always do that part manually, because I can't have it in the pipeline; I never want to calculate any statistics about the data when the test data is included in it (see the sketch after this list).
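For reference, the manual version of that workflow, outside any Pipeline, looks roughly like this (illustrative sketch, assuming X and y are already loaded):

from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the detector on the training split only, so no test-set statistics leak in
detector = IsolationForest(random_state=0).fit(X_train)
mask = detector.predict(X_train) == 1

clf = LogisticRegression().fit(X_train[mask], y_train[mask])
score = clf.score(X_test, y_test)  # the test split keeps all of its samples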

@ajcallegari

This change must be implemented because the current version encourages data hygiene violations. When using IsolationForest outside of a pipeline, users will typically define outliers based on data in both the training and test set, which causes leakage.

Vaggelis-Malandrakis added a commit to lduruo10/sml-group-10-mini-project that referenced this issue Nov 25, 2021