Resampler estimators that change the sample size in fitting #3855
I hear this positively, after discussing this very same problem with @MechCoder. Can you write a few lines of code showing the way you would like to pipe something? |
I'm not sure about piping BIRCH with sample weights, but BIRCH could be implemented as follows.
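(The snippet that originally followed is not preserved in this copy of the thread. Below is only a plausible reconstruction of the idea: BIRCH framed as an instance-reduction resampler followed by a global clusterer. The `Birch` attributes used are real, but `BirchReducer` and the `fit_resample` convention are hypothetical.)

```python
import numpy as np
from sklearn.cluster import Birch, AgglomerativeClustering

class BirchReducer:
    """Hypothetical resampler: replace X with BIRCH subcluster centroids,
    weighting each centroid by the number of samples it absorbed."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit_resample(self, X, y=None):
        birch = Birch(threshold=self.threshold, n_clusters=None).fit(X)
        centroids = birch.subcluster_centers_
        # predict() maps each original sample to its subcluster, so a
        # bincount gives the weight each centroid should carry downstream.
        weights = np.bincount(birch.predict(X), minlength=len(centroids))
        return centroids, weights

X = np.random.RandomState(0).rand(1000, 2)
centroids, weights = BirchReducer(threshold=0.1).fit_resample(X)
# The "global clustering" phase of BIRCH, run on the reduced data; a
# weight-aware clusterer could also make use of `weights` here.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(centroids)
```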
Not that it's so neat, but it gives an example of the power of the approach. |
I think we should list a few use cases to come up with an API that does the job. |
I think that this issue is a core API issue, and a blocker for 1.0.
Why conflate fit and resample? I can see use cases for a separate fit and resample. Also, IMHO, the fact that transform does not modify y is a design failure that we could fix instead; that way we avoid introducing a new class of object, and a new concept. Finally, I don't really like the name 'resample'. Here are suggestions of names:
The name transform is just too good, IMHO. In the long run, we could come |
Not just the "mostly happens [in a pipeline context] at fit time" behaviour (and yes, as above, there are cases where a fitted model will be reapplied, especially outlier detection) sets this apart from transformers, which must equally apply at fit and runtime; the idea that the sample size can change does too. So as much as redesigning the transformer API may be desirable, there is value IMO in a distinct type of estimator that (a) has effect in a pipeline only at fit time, and (b) may change the sample size. The idea of the name "resample" is that the most important job of this class of estimators is to change the sample size in some way: by oversampling, otherwise re-weighting, compressing, or incorporating unlabelled instances from elsewhere. |
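To make (a) and (b) concrete, a rough sketch of how a pipeline could treat such an estimator; this is not scikit-learn code, and the `fit_resample` name and `(X, y)` return convention are assumptions from this discussion:

```python
# Hypothetical pipeline behaviour: resamplers act only at fit time.

def pipeline_fit(steps, X, y):
    *head, (final_name, final_est) = steps
    for name, step in head:
        if hasattr(step, "fit_resample"):
            X, y = step.fit_resample(X, y)   # may change the number of samples
        else:
            X = step.fit_transform(X, y)     # must keep samples aligned
    final_est.fit(X, y)
    return steps

def pipeline_predict(steps, X):
    *head, (final_name, final_est) = steps
    for name, step in head:
        if not hasattr(step, "fit_resample"):
            X = step.transform(X)            # resamplers are skipped: identity
    return final_est.predict(X)
```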
That's the argument that I was missing. Thanks! Are there other cases?
Based on your arguments justifying the need for the new class, I've been wondering whether it could be a subclass of transformers. One thing that I am worried about, however, is that if we introduce a new class of objects, we also introduce a new concept for users to learn. |
And would this subclass of transformers also only operate at fit time? I think this is different enough to motivate a different family of estimators, but I might be wrong. This type of estimator pipelining can also be easily modelled as meta-estimators. The only real problem there is the uncomfortable nesting of param names (although I did once play with some magic that allows a nested structure to be wrapped so that parameters can be renamed, or their values tied), and that flat is better than nested. |
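As a sketch of the meta-estimator alternative mentioned here (the name `ResampledTrainer` and the `fit_resample` method are assumptions, not existing scikit-learn API):

```python
from sklearn.base import BaseEstimator, clone

class ResampledTrainer(BaseEstimator):
    """Hypothetical meta-estimator: resample the training data, then fit
    `estimator` on the result. Prediction bypasses the resampler."""

    def __init__(self, resampler, estimator):
        self.resampler = resampler
        self.estimator = estimator

    def fit(self, X, y=None):
        Xr, yr = clone(self.resampler).fit_resample(X, y)
        self.estimator_ = clone(self.estimator).fit(Xr, yr)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

# Grid search reaches the nested parameters via names such as
# "estimator__C" or "resampler__threshold" -- the nesting in question.
```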
Is there a reason why a fit_transform wouldn't solve that problem? |
|
Quite clearly, I agree with you that breaking this equivalence would be a big deal. But I am not sure why it would be necessary (although I am starting to see the argument). |
I'm preparing some examples so that we have something to point at in this discussion.
|
Thank you. This is very useful! |
Thanks for restarting the discussion on this.
The first one might be solved by changing the behavior of transformers; for the other two, it is not as obvious what to do.
That makes me think: what are the cases when we want different behavior during fitting and predicting? Do we always want to resample during learning, but not during prediction? What if we want to do some visualization and want to filter out outliers for both? |
As far as I understand this discussion (sorry if I missed something, I just quickly skimmed through, especially the parts that say Birch :P), you mean to derive Birch (and other instance reduction methods) from a new class of estimators, called Resamplers, whose fit_resample would return the reduced dataset?
|
Firstly, I'm not sure that reimplementing BIRCH is what I intend here. It's more that this type of algorithm can be framed as a pipeline of reduction, clustering, etc. There should be a right way to cobble together estimators into these sorts of things in scikit-learn, to whatever extent the API facilitates it. As for reimplementing BIRCH itself, the resampler could be pulled out as a separate component, and the full clusterer can be offered as well. Yes, using MiniBatchKMeans for the instance reduction is equally applicable; the fact that it happens to define its own prediction method is incidental. Classifiers, clusterers, or regressors that happen to implement resampling are a separate question.
Did you mean (2)?
I think this is a key question. Certainly there must be a way to reapply the fitted resampling where appropriate; visualisation is a good example. Yet perhaps it is no big deal to expect users to do that without the pipeline magic. |
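For the visualisation case, reapplying a fitted filter without any pipeline support is indeed short with the existing outlier-detection API; for example (IsolationForest is just one choice of detector):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2), rng.uniform(-6, 6, size=(10, 2))])

detector = IsolationForest(random_state=0).fit(X)
inliers = X[detector.predict(X) == 1]  # reapply the fitted model as a filter
# plt.scatter(*inliers.T)  # visualise only the inliers
```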
@jnothman yes, I meant (2). Btw, this heuristic gives me no option to compute the score on the training set that was used, which is a bit odd. |
I'm not entirely happy with it, but I've mocked up some examples (not plots, just usage code) at https://gist.github.com/jnothman/274710f945e311697466
I mean that currently there are cases where Pipeline can't reasonably be used. It's particularly useful for grid searches, etc., where cloning and parameter setting is involved, while requiring the visualisation of inliers to not use a Pipeline object probably doesn't hurt the user that much. I agree it's a bit upsetting that this model would not provide a way to compute the training score. |
To summarize a discussion with @GaelVaroquaux: we both thought that breaking the equivalence of fit_transform and fit followed by transform would be acceptable here. |
I think it's time to resolve this. We are already breaking fit_transform and transform equivalence elsewhere. But are you sure we want to allow fit_transform to return (X, y, props) sometimes and only X at others? Do we then require transform to return only X, or is it also allowed to change y? (I think we should not allow it to change y; it is a bad idea for evaluation.)

We also have a small problem in pipeline's handling of fit_params: any fit_params downstream of a resampler cannot be used and should raise an error. (Any props returned by the resampler need to be interpreted with the pipeline's routing strategy.) Indeed, maybe it is a design fault in pipeline, but the handling of sample props and y there assumes that fit_transform's output is aligned with the input, sample for sample.

Together, I find these arguments compelling enough to suggest that this deserves a separate method, e.g. fit_resample, not just an option for a transformer to return a tuple that results in very different handling. I do not, however, think we should have a corresponding sample method (and I find imblearn's Pipeline.sample method quite problematic). At test time, transform should be called, or else we could consider all resamplers to perform the identity transform at test time. (On objects supporting fit_resample, fit_transform should be forbidden.)

Let's make this happen. |
I think for now we should forbid resamplers from implementing transform, as the common use cases are identity transforms, and allowing transform is then possible in the future without breaking backwards compatibility. |
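Putting the contract converging above into code form, as a sketch only, not an agreed API:

```python
# Hypothetical mixin summarising the proposed constraints.
class ResamplerMixin:
    """An estimator that may change the sample size, at fit time only."""

    def fit_resample(self, X, y=None, **props):
        # Returns resampled (X, y, props); the output need not be aligned
        # sample-for-sample with the input.
        raise NotImplementedError

    # Deliberately absent, per the discussion above:
    # - no `sample` method (imblearn's Pipeline.sample is seen as problematic)
    # - no `transform`/`fit_transform`: in a pipeline the resampler acts as
    #   the identity at predict time, and forbidding `transform` for now
    #   leaves room to allow it later without breaking backwards compatibility
```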
Proposal of work on this and #9630:
- Implementation
- Documentation
I'm happy to open this to a contributor (or a GSoC) if others think this is the right way to go. |
> why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?

Yes, changing the number of samples at test time is a semantic that is unclear to me. I would rather shy away from it. |
+1 for only having fit_resample. @dmbee, do not hesitate to retake some of the code/tests of imblearn. |
Above, it was not my intent to support resampling at test time; I understand that is out of scope here / niche. Allowing separate fit / resample methods (instead of just fit_resample) does not affect how the resampler is used in a pipeline, or the complexity of the pipeline, in my view (see my pipeline code above). It came to mind that it may be useful to separate the fit and resample methods for some potential use cases (e.g. generative resampling). However, if this capability is not desirable, we can use an API very similar to imblearn's, adding handling of sample_props, which is straightforward if the resampler is just indexing the data. See the rough example below. Let me know your thoughts.
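(The rough example referred to here is not preserved in this copy of the thread; the following is only a plausible sketch of an indexing resampler with separate fit and resample steps, assuming NumPy-array inputs and illustrative names.)

```python
import numpy as np
from sklearn.base import BaseEstimator

class RandomUnderSampler(BaseEstimator):
    """Illustrative indexing resampler: fit selects which sample indices
    to keep; resample applies that selection to X, y and sample props."""

    def __init__(self, ratio=0.5, random_state=None):
        self.ratio = ratio
        self.random_state = random_state

    def fit(self, X, y=None):
        rng = np.random.RandomState(self.random_state)
        n_keep = int(self.ratio * len(X))
        self.sample_indices_ = rng.choice(len(X), size=n_keep, replace=False)
        return self

    def resample(self, X, y=None, props=None):
        idx = self.sample_indices_
        # Props are assumed to be per-sample arrays, so indexing is enough.
        props_out = {k: np.asarray(v)[idx] for k, v in (props or {}).items()}
        y_out = None if y is None else y[idx]
        return X[idx], y_out, props_out

    def fit_resample(self, X, y=None, props=None):
        return self.fit(X, y).resample(X, y, props)
```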
|
@glemaitre - thank you. I certainly have no desire to replicate what has been done in imblearn.
|
I don't really like that we're committing to a format for the sample props here, in a sense, but I guess it's not that different from the handling of sample_weight. |
Actually, you should take whatever works for scikit-learn from imblearn. Our idea is to contribute whatever is good upstream and remove it from our code base. We are at a stage where the API is starting to be more stable, and we recently made some changes to reflect some discussions with @jnothman. Bottom line: take whatever is beneficial for scikit-learn ;) |
> So I think it should be good to go.

Well, in theory, we would need to work on the SLEPs. However, it may be a good exercise to move this issue forward, in order to be able to write a concise SLEP. |
OK, good to know - thanks. I should have said re-implement rather than replicate. It seems most things from imblearn will carry over.
Not sure what a SLEP is.... |
A SLEP is an (under-used) Scikit-learn enhancement proposal: https://github.com/scikit-learn/enhancement_proposals/

Regarding the API proposal in #3855 (comment): yes, that approach to resampling looks good... Not all estimators supporting fit_resample will sample from existing indices, though. |
OK, good stuff. Yes - I am aware that not all resamplers will sample from indices. Those that do not will have to either implement their own method of dealing with props (if there is a sensible option - hard to know for sure without knowing what will be in props) or otherwise raise NotImplementedError if props is not None. |
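A sketch of that fallback for a resampler that generates new samples rather than indexing existing ones (the class name and behaviour are illustrative only):

```python
import numpy as np

class JitterOversampler:
    """Illustrative generative resampler: appends noisy copies of X, so
    per-sample props have no well-defined value for the new samples."""

    def __init__(self, noise=0.01, random_state=None):
        self.noise = noise
        self.random_state = random_state

    def fit_resample(self, X, y=None, props=None):
        if props:
            # No sensible mapping from original samples to synthetic ones.
            raise NotImplementedError(
                "cannot propagate sample props through generative resampling")
        rng = np.random.RandomState(self.random_state)
        X_new = X + self.noise * rng.randn(*X.shape)
        y_out = None if y is None else np.concatenate([y, y])
        return np.vstack([X, X_new]), y_out, {}
```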
Is anyone in Paris working on this? I'd be happy to help (and the API would be useful for fitting semi-supervised classifiers, as discussed in this review). |
Yep, I started to port the imblearn implementation. Do you want to take over, and I can help with the review instead?
|
Sure, where can I find you? |
I started working on this (I'm not in Paris), but got too busy over the last couple of months. I am happy to share what I have done already, and continue working on it. |
I'm starting work on this now for the sprint. If you have existing work that you think could be useful, I'd be more than happy to build on what you've done. |
Here it is - look at the resample folder in sklearn. |
Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time, but leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others model reuse after fitting seems less applicable.)

As noted elsewhere, transformers that change the number of samples are not currently supported, certainly in the context of `Pipeline`s, where a transformation is applied both at `fit` and `predict` time (although a hack might abuse `fit_transform` to make this not so). `Pipeline`s of `Transformer`s also would not cope with changes in the sample size at fit time for supervised problems, because `Transformer`s do not return a modified `y`, only `X`.

To handle this class of problems, I propose introducing a new category of estimator, called a `Resampler`. It must define at least a `fit_resample` method, which `Pipeline` will call at fit time, passing the data unchanged at other times. (For this reason, a `Resampler` cannot also be a `Transformer`, or else we need to define their precedence.)

For many models, `fit_resample` needs only return `sample_weight`. For sample compression approaches (e.g. that in BIRCH), this is not sufficient, as the representative centroids are modified from the input samples. Hence I think `fit_resample` should return altered data directly, in the form of a dict with keys `X`, `y`, `sample_weight` as required. (It still might be appropriate for many `Resampler`s to only modify `sample_weight`; if necessary, another `Resampler` can be chained that realises the weights as replicated or deleted entries in `X` and `y`.)
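As an illustration of that last parenthetical, a minimal sketch of a chained resampler that realises (near-)integer weights as replicated or deleted rows; the class name is hypothetical, and the dict return format follows the proposal above:

```python
import numpy as np

class WeightRealiser:
    """Hypothetical chained resampler: turn integer-valued sample weights
    into physically replicated or deleted rows of X and y."""

    def fit_resample(self, X, y=None, sample_weight=None):
        if sample_weight is None:
            return {"X": X, "y": y}
        counts = np.round(sample_weight).astype(int)
        idx = np.repeat(np.arange(len(X)), counts)  # weight 0 drops a row
        out = {"X": X[idx]}
        if y is not None:
            out["y"] = y[idx]
        return out

# A row with weight 2 appears twice downstream; weight 0 disappears:
X = np.array([[0.0], [1.0], [2.0]])
out = WeightRealiser().fit_resample(X, sample_weight=np.array([2, 0, 1]))
# out["X"] == [[0.0], [0.0], [2.0]]
```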