Today, the sklearn.feature_selection.VarianceThreshold.inverse_transform method fills in zeros for features that were removed for having too-small variance.
This is certainly predictable, easy to implement and easy to explain.
However, filling in zeros without regard to the data passed to fit means that the reconstruction error can become arbitrarily large. For example, suppose that one of the features in your data always takes the value 10**6. This clearly has zero variance, since it always takes the same value; however, filling in zeros for that feature when the data is passed through transform and inverse_transform will produce an output that differs dramatically from the input.
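To make the failure mode concrete, here is a minimal reproduction sketch of the current behaviour (the data and the use of the default threshold are purely illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Second column is constant at 1e6, so it has zero variance and is dropped.
X = np.array([[1.0, 1e6],
              [2.0, 1e6],
              [3.0, 1e6]])

selector = VarianceThreshold()
X_t = selector.fit_transform(X)           # only the first column survives
X_back = selector.inverse_transform(X_t)  # dropped column is filled with zeros

print(X_back)
# [[1. 0.]
#  [2. 0.]
#  [3. 0.]]
print(np.abs(X - X_back).max())           # reconstruction error of 1e6
```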
Instead, I think it would make sense to fill in the mean of each removed column when using inverse_transform. The means can be computed and stored from the data passed to fit. This would make the reconstruction of the transformed data via inverse_transform more closely reflect the data that was passed to fit, because any column removed for having variance below the threshold must, by definition, be tightly grouped about its sample mean.
Naturally, in the special cases where the sample means of input features passed to fit are already zero, the proposed inverse_transform method will function the same way it does today, as it will fill in zero values for those features.
In terms of code, this just means keeping an array of column means in addition to the indices of the removed columns.
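For illustration, here is a minimal sketch of what that could look like, written as a subclass rather than a patch to the library. The class name MeanFillVarianceThreshold and the means_ attribute are hypothetical, and the sketch ignores sparse input and input validation:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

class MeanFillVarianceThreshold(VarianceThreshold):
    """Hypothetical variant that fills removed columns with their fitted means."""

    def fit(self, X, y=None):
        super().fit(X, y)
        # Keep the per-column means of the training data alongside the
        # support mask that VarianceThreshold already stores.
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def inverse_transform(self, X):
        support = self.get_support()  # boolean mask of retained columns
        Xt = np.empty((X.shape[0], support.shape[0]), dtype=float)
        # Removed columns get their training means instead of zeros.
        Xt[:, ~support] = self.means_[~support]
        Xt[:, support] = X
        return Xt
```

With the example above, inverse_transform would then reconstruct the constant column as 1e6 rather than 0, reducing the reconstruction error for that feature to zero.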
Of course, it's possible that I have missed an important subtlety, or there is a competing concern which outweighs the argument that I've outlined here. If that's the case, I'd like to know what I've missed!