
Feature Request: Pipelining Outlier Removal #9630

Open

datajanko opened this issue Aug 26, 2017 · 29 comments · May be fixed by #13269

Comments

@datajanko
Contributor

datajanko commented Aug 26, 2017

I wonder if we could make outlier removal available in pipelines.

I tried implementing it, for example using IsolationForest, but so far I haven't been able to solve it, and I know why.

The problem boils down to fit_transform only returning a transformed X. This suffices in the vast majority of cases, since we typically only throw away columns (think of a PCA). For outlier removal in a pipeline, however, we need to throw away rows of both X and y during training and do nothing during testing. This is not supported so far. Essentially, we would need to turn the predict function into some kind of transform function during training.

Investigating the pipeline implementation shows that fit_transform is called, if present, during the fitting part of the pipeline, rather than fit(X, y).transform(X). In particular, during cross-validation fit_transform is only called on the training folds, which would be perfect for outlier removal. What remains is to do nothing in the test step, and for that we could simply implement a "do-nothing" transform function.

Unfortunately, the most direct way to implement this would be an API change to the TransformerMixin class.
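For concreteness, a rough sketch of what currently goes wrong (untested; DropFirstRow is just an illustrative name): a transformer whose transform drops rows leaves y untouched, so the next step receives mismatched X and y.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class DropFirstRow(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # drops a row of X, but the pipeline has no way to drop the matching entry of y
        return X[1:]

X = np.random.rand(10, 3)
y = np.arange(10) % 2
make_pipeline(DropFirstRow(), LogisticRegression()).fit(X, y)
# raises a ValueError: X now has 9 samples while y still has 10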

So my questions are:

Would it be interesting to support this kind of sample removal in pipelines?
Are there other, more suitable ideas for implementing this feature in a pipeline?

If the content of this question is somehow inappropriate (e.g. since I'm only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.

@jnothman
Member

This is the right place. Sorry I've not fished out issue numbers, but certainly things like this have been raised before, particularly as regards resampling in pipelines. A few points:

  1. It's not hard to implement something like this as your own metaestimator (and maybe it makes sense to have one specialised for outlier removal)
  2. We'd rather avoid allowing transform to resize the data, but introducing a new method (I think imblearn uses resample) is something I would like to see happen.

@datajanko
Contributor Author

Hey, I skimmed through some issues but didn't find a related topic. Sorry for that.

I fully understand that an API change is suboptimal. Resample is of course the right method name for imblearn; reduce might be good in this case, but it's a completely overloaded name ;) What about condense?
But would this really belong directly in sklearn, or would a "contrib" package be preferred?

Concerning the meta-estimator: how would such an estimator remove rows? I just don't see it at the moment, and a small(!) hint would be appreciated.

@datajanko
Contributor Author

datajanko commented Aug 26, 2017

I think I found one of the related issues

And I think I understand how to use a meta-estimator. Compared to bagging, the random subsampling is replaced by the outlier-free set, and the many estimators are replaced by the single estimator we are interested in.

However, this means that we can only remove outliers in the final step (or build further meta-estimators out of transformers, which seems awkward). In that case, I think a new method might be a better choice.

@jnothman
Member

Yes, #3855, #4143, #4552 and scikit-learn/enhancement_proposals#2 all relate.

I don't see why resample is entirely inappropriate here.

Meta-estimator (almost; untested):

from sklearn.base import BaseEstimator, ClassifierMixin, clone

class WithoutOutliersClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, outlier_detector, classifier):
        self.outlier_detector = outlier_detector
        self.classifier = classifier

    def fit(self, X, y):
        # flag outliers on the training data only (detectors label inliers as 1)
        self.outlier_detector_ = clone(self.outlier_detector)
        mask = self.outlier_detector_.fit_predict(X, y) == 1
        # fit the wrapped classifier on the inliers only
        self.classifier_ = clone(self.classifier).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # predict for all samples, including would-be outliers
        return self.classifier_.predict(X)
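For example, it could be used roughly like this (also untested; assumes the class above and any outlier detector with fit_predict):

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
clf = WithoutOutliersClassifier(IsolationForest(random_state=0), LogisticRegression())
# outliers are dropped only when fitting on each training fold;
# predictions are still made for every test sample
scores = cross_val_score(clf, X, y, cv=5)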

@datajanko
Contributor Author

For me, (re)sampling always implies a strong random component, which is not really the case when outliers are to be removed.

Thanks for the meta-estimator hint; I was thinking in a similar direction. Additionally, the classifier can itself be a pipeline, so my previous comment about removing outliers only in the last step is wrong in some sense.

@jnothman
Member

jnothman commented Aug 28, 2017 via email

@amueller
Member

I think we should close this as duplicate, given all the other open issues and the enhancement proposal.

Resampling doesn't necessarily mean random. In signal processing at least, it is not ;)

The meta-estimator is a good solution imho. Interesting question: it looks like you don't want to remove outliers at test time. Why is that? That seems counter-intuitive to me. I would refuse to make a prediction on points that I would have removed from the training set.

@jnothman
Member

jnothman commented Aug 28, 2017 via email

@glemaitre
Member

Just passing by this issue. When I started reading, I thought that imblearn could implement a FunctionSampler in which an outlier detection estimator could be used to remove the samples considered as outliers.

That said, I share the feeling of @amueller regarding "not removing the outliers at test time". Using the imblearn pipeline with such a FunctionSampler, you would get exactly this behaviour, but I am really not sure it is the right thing to do.
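For illustration, the kind of usage I have in mind would be roughly the following (untested sketch against the imblearn API; outlier_rejection is just an example function):

from imblearn import FunctionSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def outlier_rejection(X, y):
    # fit the detector on the training fold only and keep the inliers
    mask = IsolationForest(random_state=0).fit_predict(X) == 1
    return X[mask], y[mask]

pipe = make_pipeline(FunctionSampler(func=outlier_rejection), LogisticRegression())
# the sampler is applied during fit only; predict/score see the test data unchanged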

@jnothman @amueller if it would make sense to have such a feature, you could delegate it to imblearn.

NB:
If at some point scikit-learn is interested in getting the most-used samplers from imblearn (random + SMOTE), we are willing to contribute those. I think our API is starting to become more or less stable. We are still missing how to handle regression and categorical data for SMOTE.

@datajanko
Contributor Author

@amueller, thanks for the remark on signal processing and "sampling".

I support the view of @jnothman concerning outlier removal only in training. In particular, I see the test set as "real" data that can of course be noisy. In that scenario, I typically want a prediction for every data point (contrary to @amueller). During training, however, I want to reduce overfitting a bit by removing outliers, which can/could improve model quality.

Maybe marking points as outliers and providing predictions could be a compromise.

@jdwittenauer
Contributor

Just ran into this exact scenario and came looking to see if anything was in-progress, so I'll echo my support for the feature (for whatever that's worth =)). Regarding removing outliers from the test set, one could argue both ways and it's likely too dependent on the specifics of the problem to have a right answer, but the scenario @datajanko described seems to be more common (i.e. remove from training to improve generalization).

@jnothman
Member

jnothman commented Nov 9, 2017 via email

@glemaitre
Member

It would be great ... I think that @chkoar would agree with that.

By the way, we started a PR for a FunctionSampler, as previously stated. We still have to document it, but it should allow wrapping an outlier detection algorithm inside it to later do the sampling.

Regarding the Pipeline itself, I wanted to get back to #8350, and it could be the occasion to introduce SamplerMixin support at the same time. @jnothman WDYT?

@jnothman
Member

I'd be tempted to just add a selection method to outlier detectors directly rather than create FunctionSampler, but maybe I've not understood its benefits.

@jnothman
Member

jnothman commented Nov 10, 2017 via email

@glemaitre
Member

I'd be tempted to just add a selection method to outlier detectors directly rather than create FunctionSampler, but maybe I've not understood its benefits.

In the meantime, we made a quick (stupid) example:
https://840-36019880-gh.circle-artifacts.com/0/home/ubuntu/imbalanced-learn/doc/_build/html/auto_examples/plot_outlier_rejections.html#sphx-glr-auto-examples-plot-outlier-rejections-py

We are still working on it so any feedback is welcomed.

@john-gillam

Hi,

I have been working on a similar pipeline that bypasses specific data. In my case I don't want it to resample the data but rather to refuse to classify (i.e. either set NaNs as the classification outcome or use a masked array as output) those subjects that meet a specific condition. I'm not sure if this is helpful, or similar to the FunctionSampler in imblearn, or might be appropriate for #3855.
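For what it's worth, a minimal (untested) sketch of the idea, with BypassClassifier as a purely illustrative name and assuming numeric class labels:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class BypassClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, outlier_detector, classifier):
        self.outlier_detector = outlier_detector
        self.classifier = classifier

    def fit(self, X, y):
        self.outlier_detector_ = clone(self.outlier_detector).fit(X)
        self.classifier_ = clone(self.classifier).fit(X, y)
        return self

    def predict(self, X):
        # refuse to classify samples flagged by the detector: return NaN for them
        pred = self.classifier_.predict(X).astype(float)
        pred[self.outlier_detector_.predict(X) == -1] = np.nan
        return pred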

In any case, if anyone is interested, I have made a branch available with the pipeline included.

@jnothman
Member

Link to your branch here? But I don't really think this should be a branch modifying Pipeline so much as a separate estimator: it sounds like hierarchical (i.e. coarse then fine) classification which I think is hard to develop a generic tool for.

@john-gillam

john-gillam commented Jan 24, 2018

Sorry, I'm a bit new to this. The branch is at:
https://github.com/john-gillam/scikit-learn/tree/BypassPipeline
I'm still working on/testing it though.

@averri

averri commented Nov 6, 2018

If detecting and removing outliers is part of the ML workflow, why has it been left out of the pipeline processing? I came here because I want to do exactly the same thing as the author of this issue.

Scikit-learn is a fantastic framework. GridSearchCV is very powerful and useful, and I would like to integrate different strategies for detecting and removing outliers into the pipeline and let GridSearchCV figure out which one is best.

It's clear that removing rows from X inside the pipeline breaks the relationship between X and y, because y does not see the same changes. I don't know whether this is a design flaw or something the designers considered from the beginning, but outlier detection/removal seems to be a reasonable use case for the pipeline, for the reason I explained in the paragraph above.

@jnothman
Member

jnothman commented Nov 6, 2018 via email

@amueller
Member

amueller commented Nov 6, 2018

@averri it's very easy to implement this as a meta-estimator btw, which would allow using all the grid-search and cross-validation tools.

pre_outlier_pipeline = make_pipeline(...)
post_outlier_pipeline = make_pipeline(..., SomeClassifier())
outlier_detector = ...
full_model = make_pipeline(pre_outlier_pipeline,
                           OutlierMeta(post_outlier_pipeline, outlier_detector))

Where OutlierMeta implements the outlier detection (I'm not gonna give an implementation of that here, but it should be less than 20 lines).

That does break the nice linear flow of the pipeline, though.

Also, it would be interesting if you could share actual real-world examples of data where using an outlier detection algorithm is helpful.

@jnothman
Member

jnothman commented Nov 7, 2018

I have a meta-estimator coded above: #9630 (comment)

@amueller
Member

amueller commented Nov 7, 2018

Great, so our two snippets together nearly make one complete, untested solution ;) I guess I should have re-read the rest of the thread.
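Combining them, the full (still untested) sketch would look roughly like this, with the concrete steps (StandardScaler, PCA, IsolationForest, LogisticRegression) as placeholders:

from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pre_outlier_pipeline = make_pipeline(StandardScaler())
post_outlier_pipeline = make_pipeline(PCA(n_components=2), LogisticRegression())
full_model = make_pipeline(
    pre_outlier_pipeline,
    # WithoutOutliersClassifier from the earlier comment plays the role of OutlierMeta,
    # taking the detector first and the wrapped (pipeline) classifier second
    WithoutOutliersClassifier(IsolationForest(random_state=0), post_outlier_pipeline),
)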

@datajanko
Contributor Author

Would it make sense to put this somewhere in the documentation?

@jnothman
Member

jnothman commented Nov 8, 2018 via email

@orausch linked a pull request Feb 26, 2019 that will close this issue
@amueller
Member

Can anyone provide an example where this was done in practice? Or any paper evaluating the use of automatic outlier removal for supervised learning?
My intuition is that it would be a bad idea, and I have never heard of anyone doing it in practice.
So any references to papers or applications would be great.

@adrinjalali
Member

  • I was on a project where they used a random forest to automatically detect outliers before fitting a model on the remaining samples. In practice it worked pretty well.
  • I have seen this in settings where the high dimensionality of the data doesn't allow for the usual simple outlier removal. For instance, fit a one-class SVM, remove the outliers, then continue with the job.
  • Not doing it in a pipeline sounds like a bad idea. I always remove my outliers after I split into train/test. In a cross-validation/grid-search scenario, this means I always do that part manually, because I can't have it in the pipeline; I never want to calculate any statistics about the data when the test data is included in it (see the sketch after this list).
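For reference, the manual version of that workflow, outside any Pipeline, looks roughly like this (illustrative sketch, assuming X and y are already loaded):

from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the detector on the training split only, so no test-set statistics leak in
detector = IsolationForest(random_state=0).fit(X_train)
mask = detector.predict(X_train) == 1

clf = LogisticRegression().fit(X_train[mask], y_train[mask])
score = clf.score(X_test, y_test)  # the test split keeps all of its samples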

@ajcallegari

This change must be implemented because the current version encourages data hygiene violations. When using IsolationForest outside of a pipeline, users will typically define outliers based on data in both the training and test set, which causes leakage.

Vaggelis-Malandrakis added a commit to lduruo10/sml-group-10-mini-project that referenced this issue Nov 25, 2021