
Resampler estimators that change the sample size in fitting #3855

Open · jnothman opened this issue Nov 15, 2014 · 68 comments
Labels: API, help wanted, Moderate, New Feature

@jnothman (Member)

Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time, but leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others model reuse after fitting seems less applicable.)

As noted elsewhere, transformers that change the number of samples are not currently supported, certainly in the context of Pipelines where a transformation is applied both at fit and predict time (although a hack might abuse fit_transform to make this not so). Pipelines of Transformers also would not cope with changes in the sample size at fit time for supervised problems because Transformers do not return a modified y, only X.

To handle this class of problems, I propose introducing a new category of estimator, called a Resampler. It must define at least a fit_resample method, which Pipeline will call at fit time, passing the data unchanged at other times. (For this reason, a Resampler cannot also be a Transformer, or else we need to define their precedence.)

For many models, fit_resample needs only return sample_weight. For sample compression approaches (e.g. that in BIRCH), this is not sufficient as the representative centroids are modified from the input samples. Hence I think fit_resample should return altered data directly, in the form of a dict with keys X, y, sample_weight as required. (It still might be appropriate for many Resamplers to only modify sample_weight; if necessary, another Resampler can be chained that realises the weights as replicated or deleted entries in X and y.)
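For concreteness, here is a minimal sketch of a resampler under the proposed contract. The class name and the oversampling strategy are hypothetical; only the dict-returning fit_resample is taken from the proposal above:

import numpy as np

class NaiveRandomOversampler:
    """Hypothetical resampler: fit_resample returns a dict with any of
    the keys X, y, sample_weight, per the proposal above."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.RandomState(self.random_state)
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        indices = []
        for klass in classes:
            idx = np.flatnonzero(y == klass)
            # oversample each class (with replacement) up to the majority size
            indices.append(rng.choice(idx, size=counts.max(), replace=True))
        indices = np.concatenate(indices)
        return {'X': X[indices], 'y': y[indices]}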

@agramfort (Member)

I hear this positively, having discussed this very same problem with @MechCoder.

Can you write a few lines of code showing the way you would like to pipe something
like Birch with an estimator that supports sample_weight?

@jnothman (Member Author)

I'm not sure about piping Birch with sample weights, but BIRCH could be implemented as

make_pipeline(BirchResampler(), PredictorToResampler(SomeClusterer()), KNeighborsClassifier())

Not that it's so neat, but it gives an example of the power of the approach. (PredictorToResampler simply takes the predictions of a model and returns them as the y for the input X.)
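To make that concrete, a hedged sketch of what the hypothetical PredictorToResampler might look like, assuming the dict-returning fit_resample contract from the opening comment and a wrapped estimator that implements predict (a clusterer such as MiniBatchKMeans would do; AgglomerativeClustering would not):

from sklearn.base import BaseEstimator, clone

class PredictorToResampler(BaseEstimator):

    def __init__(self, estimator):
        self.estimator = estimator

    def fit_resample(self, X, y=None):
        # fit the wrapped model, then use its predictions as the new y
        # for the unchanged input X
        self.estimator_ = clone(self.estimator).fit(X, y)
        return {'X': X, 'y': self.estimator_.predict(X)}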

@agramfort (Member)

I think we should list a few use cases to come up with an API that does the
job. The code seems a bit too generic to justify from a single use case,
though again I acknowledge its relevance given our work on Birch.

@GaelVaroquaux (Member)

I think that this issue is a core API issue, and a blocker for 1.0.
Thanks for bringing the debate.

To handle this class of problems, I propose introducing a new category of
estimator, called a Resampler. It must define at least a fit_resample method,
which Pipeline will call at fit time, passing the data unchanged at other
times. (For this reason, a Resampler cannot also be a Transformer, or else we
need to define their precedence.)

Why conflate fit and resample? I can see use cases for a separate fit
and resample.

Also, IMHO, the fact that transform does not modify y is a design failure
(mine). I would be happier to define a new method, similar to transform,
that modifies y (I am looking for a good name), and to progressively
phase out 'transform'.

That way we avoid introducing a new class of object, and a new concept.
The more concepts and classes of objects there are in a library, the
harder it is to understand.

Finally, I don't really like the name 'resample'. I find that it is too
specific, and that there are other use cases for the method than resampling
(semi-supervised learning to propagate labels to unlabelled data, for
instance).

Here are suggestions of names:

  • apply
  • change
  • modify
  • mutate
  • convert

The name transform is just too good, IMHO. In the long run, we could come
back to it, after a couple of years of deprecation of the old behavior.
The new behavior would be that it always returns the same number of arrays
as it is given (and raises an error if only X is given for a supervised
method that needs y).

@jnothman (Member Author)

Modifying y is not the fundamental issue here. Yes, that's something else that needs to be handled. The issue here is that the set of samples passed out of resample is not necessarily the set passed in. This sort of operation (of which resampling is emblematic, but I am happy to find a better name for it) is frequently required for training, and is rarely the right thing to do at test time when you want the predictions to correspond to the inputs.

It is not just the "mostly happens [in a pipeline context] at fit time" behaviour (and yes, as above, there are cases where a fit model will be reapplied, especially outlier detection) that sets this apart from transformers, which must apply equally at fit and prediction time, but also the idea that the sample size can change.

So never mind modifying y. A transformer that allows the sample size to change cannot be used in a FeatureUnion. A transformer that allows the sample size to change cannot be used in a Pipeline unless it also modifies y, because score will break; but even then it seems a strange definition of scoring a dataset that has been modified in this way.
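A two-line illustration of the FeatureUnion point: the union horizontally stacks the outputs of its branches, so every branch must return the same number of rows. The shapes below stand in for a hypothetical branch that dropped ten outliers:

import numpy as np

X_identity = np.ones((100, 3))       # branch that keeps all samples
X_filtered = np.ones((90, 3))        # branch that dropped 10 samples
np.hstack([X_identity, X_filtered])  # raises ValueError: mismatched dimensions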

So as much as redesigning the transformer API may be desirable, there is value IMO in a distinct type of estimator that: (a) has effect in a Pipeline during training and none otherwise; (b) is allowed to change the sample size, where Transformers or their successors should continue not to.

The idea of the name "resample" is that the most important job of this class of estimators is to change the sample size in some way, by oversampling, otherwise re-weighting, compressing, or incorporating unlabelled instances from elsewhere.

@GaelVaroquaux (Member)

A transformer that allows the sample size to change cannot be used in a
FeatureUnion.

That's the argument that I was missing. Thanks! Are there other cases?

The idea of the name "resample" is that the most important job of this
class of estimators is to change the sample size in some way, by
oversampling, otherwise re-weighting, compressing, or incorporating
unlabelled instances from elsewhere.

Based on your arguments justifying the need for the new class, I've been
thinking about the name. And indeed, it should revolve around the notion
of sample, and maybe even the term "sample", as this is what we use in
scikit-learn. The most explicit term would be "transform_samples", but I
think that this is too long (we might need things like
"fit_transform_samples").

One thing that I am worried about, however, is that if we introduce a
"resample" and keep the old "transform" method, it will be ambiguous what
a pipeline means. Of course, we can introduce an argument to the
pipeline, or create a new pipeline variant. However, I am worried that
the added complexity for users and developers does not justify creating
the extra object compared to the Transformers. In other tmers, I think
that we would be better off saying that some transformers change the
number of samples (and we can create an extra sub-class for that).

@jnothman (Member Author)

jnothman commented Nov 17, 2014

And would this subclass of transformers also only operate at fit time? I think this is different enough to motivate a different family of estimators, but I might be wrong.

This type of estimator pipelining can also be easily modelled as meta-estimators. The only real problem there is the uncomfortable nesting of param names (although I did once play with some magic that allows a nested structure to be wrapped so that parameters can be renamed, or their values tied), and that flat is better than nested.

@GaelVaroquaux (Member)

And would this subclass of transformers also only operate at fit time?

Is there a reason why a fit_transform wouldn't solve that problem?

@jnothman (Member Author)

fit_transform solves that component if fit_transform and fit().transform are allowed to have different results. I think transformers are confusing enough to many users even while more-or-less promising the functional equivalence of fit_transform and fit().transform.

@GaelVaroquaux (Member)

fit_transform solves that component if fit_transform and
fit().transform are allowed to have different results. I think
transformers are confusing enough to many users even while more-or-less
promising the functional equivalence of fit_transform and
fit().transform.

Quite clearly I agree with you that breaking this equivalence would be a
very bad idea.

But I am not sure why it would be necessary (although I am starting to
get your point, I am not yet convinced that it is not possible to
implement a transform method that has the logic necessary to have the
match between fit().transform and fit_transform).

@jnothman (Member Author)

I'm preparing some examples so that we have something to point at in
discussion, but it takes longer than writing quick responses!


@GaelVaroquaux (Member)

I'm preparing some examples so that we have something to point at in
discussion, but it takes longer than writing quick responses!

Thank you. This is very useful!

@amueller (Member)

Thanks for restarting the discussion on this.
So with implementing something that, say, resamples the classes to equal sizes during training, there are three distinct problems:

  1. It changes y.
  2. It resamples during training, but we want to have predictions for all samples during test time.
  3. This estimator could not be FeatureUnion'ed with anything else.

The first one might be solved by changing the behavior of transformers; for the other two it is not as obvious what to do.
I think we might still get away with the transformer interface, though.
I would not worry too much about 3): raising a sensible error when someone tries that would be fine, and should be pretty easy to detect.

  1. is maybe the most tricky one, as it will definitely require some new mechanism, and we should be careful whether it is worth adding this complexity.
    How about adding transform(X, y, fit=False) or transform(X, y, filter=False), or some such keyword that controls whether dropping samples is allowed (see the sketch after this list)?
    In a pipeline the option could then depend on whether someone called fit or not.
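A hedged sketch of that keyword idea; the class, the outlier statistic, and the keyword name are all made up for illustration:

import numpy as np
from sklearn.base import BaseEstimator

class MADOutlierFilter(BaseEstimator):

    def fit(self, X, y=None):
        self.median_ = np.median(X, axis=0)
        dist = np.abs(X - self.median_).sum(axis=1)
        self.threshold_ = np.percentile(dist, 90)  # trim the top 10%
        return self

    def transform(self, X, y=None, filter=False):
        if filter:  # a Pipeline would pass filter=True only inside fit
            dist = np.abs(X - self.median_).sum(axis=1)
            mask = dist <= self.threshold_
            return X[mask], (y[mask] if y is not None else None)
        return X, y  # identity behaviour at predict/score time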

That makes me think: what are the cases when we want different behavior during fitting and predicting? Do we always want to resample during learning, but not during prediction? What if we want to do some visualization and want to filter out outliers for both?

@MechCoder (Member)

@jnothman

As far as I understand this discussion (sorry if I missed something, I just quickly skimmed through, especially just the parts that say Birch :P), you mean to subclass Birch (and other instance reduction methods) from a new class of estimators, called Resamplers, whose fit_resample method we call during the fit of Pipeline, right? Some naive questions for starters.

  1. In a way (like in Birch), one can view the centroids obtained from MBKMeans (and some other clusterers) as instance reduction, especially when n_clusters is large enough. How do we draw a line between whether a fit_resample or a fit_transform should be called?
  2. What are the transform methods of transformers typically used for in a pipeline? It seems to me that using brc.subcluster_centers_ might be much more useful than transforming the input data into the subcluster_centers_ space, especially when piped with AgglomerativeClustering et al., which is what is being done internally.

@jnothman (Member Author)

@MechCoder

Firstly, I'm not sure that reimplementing BIRCH is what I intend here. It's more that this type of algorithm can be framed as a pipeline of reduction, clustering, etc. There should be a right way to cobble together estimators into these sorts of things in scikit-learn, to whatever extent it is facilitated by the API. As for reimplementing BIRCH itself, the resampler could be pulled out as a separate component, and the full clusterer can be offered as well.

Yes, using MBKMeans for the instance reduction is equally applicable; the fact that it happens to define transform with different semantics means that, however it is wrapped as a resampler, it needs to appear as a separate class (somewhat like how WardAgglomeration and Ward are distinct classes).

Classifiers or clusterers or regressors that happen to implement transform are a little problematic in general because, as you suggest, the semantics of the associated transformation are not necessarily inherent to the predictor, are not necessarily described in the same reference texts as the predictor, etc. For instance, despite suggesting in #2160 that for consistency all estimators with coef_ or feature_importances_ should also have _LearntSelectorMixin to act as a feature selector, I later thought the approach of the now-stale #3011 would be more appropriate: there we replace this mixin with a way to wrap a classifier/regressor so that it acts as a feature selector; alternatively, a method of a classifier/regressor like .as_feature_selector() could perform the same magic. The idea is to separate model and function more clearly.

@amueller

  1. is maybe the most tricky one

Did you mean (2)?

what are the cases when we want different behavior during fitting and predicting? Do we always want to resample during learning, but not during prediction? What if we want to do some visualization and want to filter out outliers for both?

I think this is a key question. Certainly there must be a way to reapply the fitted resampling where appropriate; visualisation is a good example of such. Yet perhaps this is no big deal to expect users to do without the pipeline magic.

@amueller (Member)

@jnothman yes, I meant (2).
Sorry, I'm not sure I understand your reply.
What do you mean by "without the pipeline magic"? That users should not be able to use Pipeline in this case? Or that the heuristic of not applying resampling for predict, score or transform should be the default, but there should be an option to not use this heuristic?

Btw, this heuristic gives me no option to compute the score on the training set that was used, which is a bit odd.

@jnothman (Member Author)

I'm not entirely happy with it, but I've mocked up some examples (not plots, just usage code) at https://gist.github.com/jnothman/274710f945e311697466

What do you mean by "without the pipeline magic"? That users should not be able to use Pipeline in this case?

I mean that currently there are cases where Pipeline can't reasonably be used. It's particularly useful for grid searches, etc., where cloning and parameter setting are involved, while requiring that the visualisation of inliers be done without a Pipeline object probably doesn't hurt the user that much.

I agree it's a bit upsetting that this model would not provide a way to compute the training score.

@amueller (Member)

To summarize a discussion with @GaelVaroquaux, we both thought that breaking the equivalence of fit().transform() and fit_transform might be a viable way forward. fit_transform would subsample, but fit().transform() would not.
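A toy sketch of what that would look like; the class is hypothetical and only illustrates the broken equivalence being proposed:

import numpy as np
from sklearn.base import BaseEstimator

class RandomSubsampler(BaseEstimator):

    def __init__(self, frac=0.5, random_state=0):
        self.frac = frac
        self.random_state = random_state

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy example

    def fit_transform(self, X, y=None):
        # subsample at fit time only
        rng = np.random.RandomState(self.random_state)
        idx = rng.choice(len(X), size=int(len(X) * self.frac), replace=False)
        return X[idx] if y is None else (X[idx], y[idx])

    def transform(self, X):
        return X  # identity: no subsampling outside of fitting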

@jnothman (Member Author)

jnothman commented Jan 14, 2018

I think it's time to resolve this. We are already breaking fit_transform and transform equivalence elsewhere.

But are you sure we want to allow fit_transform to return (X, y, props) sometimes and only X at others? Do we then require transform to return only X, or is it also allowed to change y? (I think we should not allow it to change y; it is a bad idea for evaluation.)

We also have a small problem in pipeline's handling of fit_params: any fit_params downstream of a resampler cannot be used and should raise an error. (Any props returned by the resampler need to be interpreted with the pipeline's routing strategy.) Indeed maybe it is a design fault in pipeline, but the handling of sample props and y there assumes that fit_transform's output is aligned with the input, sample for sample.

I find these arguments together compelling to suggest that this deserves a separate method, e.g. fit_resample, not just an option for a transformer to return a tuple that results in very different handling. I do not, however, think we should have a corresponding sample method (and find imblearn's Pipeline.sample method quite problematic). At test time, transform should be called, or else we could consider all resamplers to perform the identity transform at test time. (On objects supporting fit_resample, fit_transform should be forbidden.)
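A hedged sketch of the dispatch a Pipeline could perform at fit time; the helper name is made up, and the tuple return convention is simplified (no props routing):

def _fit_pipeline_steps(steps, X, y):
    for name, step in steps[:-1]:
        if hasattr(step, 'fit_resample'):
            X, y = step.fit_resample(X, y)  # only ever called during fit
        else:
            X = step.fit_transform(X, y)    # ordinary transformer path
    steps[-1][1].fit(X, y)                  # final predictor

At predict/score time, steps having fit_resample would simply be skipped (the identity transform), and transform would never be called on them.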

Let's make this happen.

@jnothman (Member Author)

I think for now we should forbid resamplers from implementing transform, as the common use cases are identity transforms, and allowing transform is then possible in the future without breaking backwards compatibility.

@jnothman (Member Author)

jnothman commented Jan 16, 2018

Proposal of work on this and #9630:

Implementation

  1. Create an OutlierDetectorMixin as a concrete example of an estimator with fit_resample. fit_resample is defined to return (X, y, props) corresponding to only the inliers. This way outlier detectors will act as outlier removers in a Pipeline once the rest of the work is complete (see Feature Request: Pipelining Outlier Removal #9630). They are here only as a tangible example of a resampler. props is merely a dict of params that would be passed to fit: {'sample_weight': [...]} or {} most often. (A sketch follows this list.)
  2. Inherit from OutlierDetectorMixin where appropriate. Test it.
  3. Extend common tests to cover resamplers. They should:
    • check the output of fit_resample is of correct structure
    • check the output of fit_resample has consistent lengths
    • check the output of fit_resample is consistent for repeated calls
    • assert that having fit_resample means no fit_transform or transform
  4. handle fit_resample in Pipeline.fit, making sure that props are handled correctly (no props should already be set for downstream pipeline steps; returned props should be interpreted as Pipeline.fit's fit_params are)
  5. perform identity transform for resamplers in Pipeline.{transform,predict,...}
  6. ? add fit_resample method to Pipelines whose last step has fit_resample
  7. perhaps implement other resamplers (e.g. oversample), perhaps based on MRG+1: Add resample to preprocessing. #1454
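A hedged sketch of item 1 above, assuming only the scikit-learn convention that outlier detectors implement fit_predict returning +1 for inliers and -1 for outliers; the props handling here is simplified:

import numpy as np

class OutlierDetectorMixin:

    def fit_resample(self, X, y=None, **props):
        # keep only the samples the detector considers inliers
        inliers = self.fit_predict(X) == 1
        X_out = np.asarray(X)[inliers]
        y_out = np.asarray(y)[inliers] if y is not None else None
        props_out = {k: np.asarray(v)[inliers] for k, v in props.items()}
        return X_out, y_out, props_out

A detector such as IsolationForest could then mix this in and act as an outlier remover within a Pipeline.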

Documentation

  1. add/modify example of outlier removal
  2. discuss outlier removal in outlier detection docs
  3. discuss resamplers in Pipeline docs
  4. add OutlierDetectorMixin to classes.rst
  5. mention OutlierDetectorMixin in glossary under outlier detector
  6. entry for resampler in glossary
  7. describe resamplers in developer docs

I'm happy to open this to a contributor (or a GSoC) if others think this is the right way to go.

@GaelVaroquaux (Member)

GaelVaroquaux commented Dec 12, 2018 via email

@glemaitre (Member)

+1 for only having fit_resample defined in the Mixin. I don't recall any use case having only resample.

Regarding the Pipeline implementation, I think that the changes made in the imblearn implementation should do the trick, or at least be a good start. The issue of props handling will remain.

Regarding the handling of sample_props in the resampler itself, it looks like @dmbee would go in the same direction as what we thought for handling sample_weight:
https://github.com/scikit-learn-contrib/imbalanced-learn/pull/463/files

@dmbee, do not hesitate to reuse some of the code/tests from imblearn. You can also ping me to review the PR. I'm going to become active again in scikit-learn from next week.

@dmbee (Contributor)

dmbee commented Dec 12, 2018

I don't know any use case that requires a resample method, really, and it certainly complicates things in Pipeline: why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?

Above, it was not my intent to support resampling at test time. I understand that is out of scope here / niche.

Allowing separate fit / resample methods (instead of just fit_resample) does not affect how the resampler is used in a pipeline or the complexity of the pipeline, in my view (see my pipeline code above). It came to mind that it may be useful to separate the fit and resample methods for some potential use cases (e.g. generative resampling).

However, if this capability is not desirable, we can use an API very similar to imblearn's, adding handling of sample_props, which is straightforward if the resampler is just indexing the data. See the rough example below.

Let me know your thoughts.

import numpy as np
from abc import abstractmethod

from sklearn.base import BaseEstimator


class ResamplerMixin(object):

    def fit_resample(self, X, y, props=None, **fit_params):  # gets called by pipeline._fit()
        self._check_data(X, y, props)
        return self._fit_resample(X, y, props, **fit_params)

    @abstractmethod  # must be implemented in derived classes
    def _fit_resample(self, X, y, props=None, **fit_params):
        return X, y, props

    def _check_data(self, X, y, props):  # to be expanded upon
        if props is not None:
            if not np.all(np.array([len(props[k]) for k in props]) == len(y)):
                raise ValueError("all props entries must have length n_samples")

    def _resample_from_indices(self, X, y, props, indices):
        if props is not None:
            return X[indices], y[indices], {k: props[k][indices] for k in props}
        return X[indices], y[indices], None


class TakeOneSample(BaseEstimator, ResamplerMixin):

    def __init__(self, index=0):
        self.index = index

    def _fit_resample(self, X, y, props=None, **fit_params):
        # wrap the scalar index in a list so X stays 2-D and props values
        # stay arrays after fancy indexing
        return self._resample_from_indices(X, y, props, [self.index])
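For illustration, a quick usage check of the sketch above, on made-up data:

import numpy as np

X = np.arange(12).reshape(4, 3)
y = np.array([0, 1, 0, 1])
props = {'sample_weight': np.array([1.0, 2.0, 3.0, 4.0])}

X_r, y_r, props_r = TakeOneSample(index=2).fit_resample(X, y, props)
print(X_r)      # [[6 7 8]]
print(y_r)      # [0]
print(props_r)  # {'sample_weight': array([3.])}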

@dmbee (Contributor)

dmbee commented Dec 12, 2018

@glemaitre - thank you. I certainly have no desire to replicate what has been done in imblearn (I like and use that package, by the way!). It seems the main thing lacking from imblearn at the moment is the pipeline / resampler changes required to support sample_props. Otherwise, it seems very compatible with sklearn.

- David

@amueller (Member)

I don't really like that we're committing to a format for the sample props here, in a sense, but I guess it's not that different from the handling of fit_params in the cross-validation code right now. So I think it should be good to go.

@glemaitre (Member)

I certainly have no desire to replicate what has been done in imblearn

Actually, you should take whatever works for scikit-learn from imblearn. Our idea is to contribute whatever is good upstream and remove it from our code base. We are at a stage where the API is starting to be more stable, and we recently made some changes to reflect some discussions with @jnothman.

Bottom line, take whatever is beneficial for scikit-learn ;)

@GaelVaroquaux (Member)

GaelVaroquaux commented Dec 12, 2018 via email

@dmbee (Contributor)

dmbee commented Dec 13, 2018

Actually, you should take whatever works for scikit-learn from imblearn. Our idea is to contribute whatever is good upstream and remove it from our code base.

OK, good to know - thanks. I should have said re-implement rather than replicate. It seems most things from imblearn can be readily ported.

Well, in theory, we would need to work on the SLEPs.

Not sure what a SLEP is....

@jnothman (Member Author)

A SLEP is an (under-used) Scikit-learn enhancement proposal. https://github.com/scikit-learn/enhancement_proposals/

Regarding the API proposal in #3855 (comment), yes, that approach to resampling looks good... Not all estimators supporting fit_resample can do so from indices, but I think you are aware of that; sample reduction techniques will not, for instance.

@dmbee (Contributor)

dmbee commented Dec 13, 2018

Regarding the API proposal in #3855 (comment), yes, that approach to resampling looks good... Not all estimators supporting fit_resample can do so from indices, but I think you are aware of that; sample reduction techniques will not, for instance.

OK, good stuff. Yes, I am aware that not all resamplers will sample from indices. Those that do not will have to either implement their own method of dealing with props (if there is a sensible option; hard to know for sure without knowing what will be in props) or otherwise raise a NotImplementedError if props is not None.

@orausch

orausch commented Feb 25, 2019

Is anyone in Paris working on this? I'd be happy to help (and the API would be useful for fitting semi-supervised classifiers, as discussed in this review).

@glemaitre (Member)

glemaitre commented Feb 25, 2019 via email

@orausch

orausch commented Feb 25, 2019

Sure, where can I find you?

@dmbee (Contributor)

dmbee commented Feb 25, 2019

I started working on this (I'm not in Paris), but got too busy over the last couple of months. I am happy to share what I have done already, and continue working on it.

@orausch

orausch commented Feb 25, 2019

I started working on this (I'm not in Paris), but got too busy over the last couple of months. I am happy to share what I have done already, and continue working on it.

I'm starting work on this now for the sprint. If you have existing work that you think could be useful, I'd be more than happy to build on what you've done.

@dmbee (Contributor)

dmbee commented Feb 25, 2019

Here it is: look at the resample folder in sklearn:
https://github.com/dmbee/scikit-learn/tree/dmbee-resampling
