Resampler estimators that change the sample size in fitting #3855

@jnothman

Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time, but leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others reusing the model after fitting seems less applicable.)

As noted elsewhere, transformers that change the number of samples are not currently supported, certainly not in the context of Pipelines, where a transformation is applied at both fit and predict time (although a hack might abuse fit_transform to avoid this). Pipelines of Transformers also could not cope with changes in the sample size at fit time for supervised problems, because Transformers return only a modified X, not a modified y.
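To make the mismatch concrete, here is a minimal sketch. `DropOutliers` is a hypothetical transformer-style class written for this illustration (it is not part of scikit-learn), and the z-score rule and threshold are assumptions chosen only to keep the example short.

```python
import numpy as np


class DropOutliers:
    """Hypothetical transformer-style class that removes rows from X."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Remember per-feature location and scale of the training data.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Keep rows whose features all lie within `threshold` standard deviations.
        z = np.abs((X - self.mean_) / self.std_)
        return X[(z <= self.threshold).all(axis=1)]


# Ten ordinary samples and one obvious outlier.
X = np.concatenate([np.zeros((10, 1)), [[100.0]]])
y = np.arange(len(X))

Xt = DropOutliers().fit(X).transform(X)

# transform returned only X, so y still has its original length and a
# downstream supervised estimator would receive misaligned inputs.
print(Xt.shape[0], y.shape[0])
```

The row count of `Xt` no longer matches `y`, which is the situation a Pipeline of Transformers has no way to repair.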

To handle this class of problems, I propose introducing a new category of estimator, called a Resampler. It must define at least a fit_resample method, which Pipeline will call at fit time, passing the data unchanged at other times. (For this reason, a Resampler cannot also be a Transformer, or else we need to define their precedence.)
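A rough sketch of the intended fit-time versus predict-time behaviour follows. Everything here is hypothetical: `DropOutliersResampler`, `MeanRegressor` and `ResamplingPipeline` are illustrative names, not existing scikit-learn classes, and the choice to return a dict with keys X and y from fit_resample is an assumption made for this sketch.

```python
import numpy as np


class DropOutliersResampler:
    """Hypothetical Resampler: drops rows far from the feature means."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold

    def fit_resample(self, X, y):
        z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
        keep = (z <= self.threshold).all(axis=1)
        # X and y are filtered with the same mask, so they stay aligned.
        return {"X": X[keep], "y": y[keep]}


class MeanRegressor:
    """Toy final estimator: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


class ResamplingPipeline:
    """Hypothetical pipeline in which resamplers run at fit time only."""

    def __init__(self, resamplers, estimator):
        self.resamplers = resamplers
        self.estimator = estimator

    def fit(self, X, y):
        for resampler in self.resamplers:
            out = resampler.fit_resample(X, y)
            X, y = out["X"], out["y"]
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # Resamplers are skipped here: the data passes through unchanged.
        return self.estimator.predict(X)


# Ten ordinary samples and one outlier with an extreme target.
X = np.concatenate([np.zeros((10, 1)), [[100.0]]])
y = np.concatenate([np.ones(10), [1000.0]])

pipe = ResamplingPipeline([DropOutliersResampler()], MeanRegressor()).fit(X, y)
pred = pipe.predict(X)  # one prediction per input row, outlier included
```

The outlier is excluded when the estimator is fitted, yet `predict` still returns a prediction for every row it is given, because the resampling step is bypassed outside of fitting.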

For many models, fit_resample need only return sample_weight. For sample compression approaches (e.g. that in BIRCH), this is not sufficient, because the representative centroids are not a subset of the input samples but are modified from them. Hence I think fit_resample should return the altered data directly, in the form of a dict with keys X, y, sample_weight as required. (It still might be appropriate for many Resamplers to only modify sample_weight; if necessary, another Resampler can be chained that realises the weights as replicated or deleted entries in X and y.)
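The chaining idea in the parenthetical could look something like the sketch below. Both classes are hypothetical names invented for illustration, the sample_weight keyword argument is an assumed signature, and the sketch assumes non-negative integer weights so that they can be read as replication counts (a weight of 0 deletes the row).

```python
import numpy as np


class BalanceWeights:
    """Hypothetical Resampler that leaves X and y alone and only sets sample_weight."""

    def fit_resample(self, X, y, sample_weight=None):
        classes, counts = np.unique(y, return_counts=True)
        # Integer weight per class: how many times smaller it is than the largest class.
        per_class = counts.max() // counts
        weight = per_class[np.searchsorted(classes, y)]
        return {"X": X, "y": y, "sample_weight": weight}


class RealiseWeights:
    """Hypothetical Resampler that turns integer weights into repeated or dropped rows."""

    def fit_resample(self, X, y, sample_weight=None):
        if sample_weight is None:
            return {"X": X, "y": y}
        counts = np.asarray(sample_weight, dtype=int)
        # Repeat each row index `counts[i]` times; a count of 0 removes the row.
        idx = np.repeat(np.arange(len(X)), counts)
        return {"X": X[idx], "y": y[idx]}


X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # six samples of class 0, two of class 1

out = BalanceWeights().fit_resample(X, y)
out = RealiseWeights().fit_resample(**out)  # dict keys map onto keyword arguments
```

Returning a dict makes the chaining straightforward: the output of one Resampler can be unpacked into the next, and a Resampler that has nothing to say about sample_weight simply omits that key.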

Labels: API, Moderate, New Feature, help wanted
