Default Value for SimpleImputer with all missing values #21639

Open
vsocrates opened this issue Nov 12, 2021 · 3 comments

Comments

@vsocrates

Describe the workflow you want to enable

In the Notes of the SimpleImputer, it says that:

Columns which only contained missing values at fit are discarded upon transform if strategy is not "constant".

It'd be nice to be able to set this to a default value in transform instead of dropping those columns if there is no valid summary statistic in fit.

I recognize that it may only be a couple of lines, but it seems useful to have the logic captured within the class instead of checking each column beforehand to see whether it contains only np.nan values.

Describe your proposed solution

Near the lines

        else:
            # same as np.isnan but also works for object dtypes
            invalid_mask = _get_mask(statistics, np.nan)
            valid_mask = np.logical_not(invalid_mask)
            valid_statistics = statistics[valid_mask]
            valid_statistics_indexes = np.flatnonzero(valid_mask)
            if invalid_mask.any():
                missing = np.arange(X.shape[1])[invalid_mask]
                if self.verbose != "deprecated" and self.verbose:
                    warnings.warn(
                        "Skipping features without observed values: %s" % missing
                    )
                X = X[:, valid_statistics_indexes]

We could add something like

            invalid_mask = _get_mask(statistics, np.nan)
            valid_mask = np.logical_not(invalid_mask)
            valid_statistics = statistics[valid_mask]
            valid_statistics_indexes = np.flatnonzero(valid_mask)

            invalid_statistics_indexes = np.flatnonzero(invalid_mask) 
            ...
            X[:, invalid_statistics_indexes] = {default_value}
            # X = X[:, valid_statistics_indexes]
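
To make the intended behaviour concrete, here is a minimal standalone sketch using plain NumPy rather than the actual SimpleImputer internals. The function name `transform_keep_empty` and the `default_value` parameter are hypothetical and only illustrate the idea: columns whose fitted statistic is NaN are overwritten with a default instead of being dropped, and the remaining NaNs are imputed from the valid statistics.

```python
import numpy as np


def transform_keep_empty(X, statistics, default_value=0.0):
    """Hypothetical sketch: impute X column-wise, keeping columns
    whose fitted statistic is NaN by filling them with default_value."""
    X = np.array(X, dtype=float)  # work on a copy
    statistics = np.asarray(statistics, dtype=float)

    # Columns that had no observed values at fit time.
    invalid_mask = np.isnan(statistics)
    invalid_statistics_indexes = np.flatnonzero(invalid_mask)

    # Overwrite those columns instead of dropping them.
    X[:, invalid_statistics_indexes] = default_value

    # Any NaN left is in a column with a valid statistic.
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = statistics[cols]
    return X


X = np.array([[1.0, np.nan, np.nan],
              [3.0, np.nan, 4.0]])
statistics = np.array([2.0, np.nan, 4.0])  # e.g. per-column means from fit
out = transform_keep_empty(X, statistics)
```

With this approach the output keeps the same number of columns as the input, which is the property requested above.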

Describe alternatives you've considered, if relevant

An alternative is to do this logic before running the SimpleImputer.
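
For reference, that workaround might look roughly like the following. This is a sketch under assumptions, not part of scikit-learn: `fill_empty_columns` and `fill_value` are hypothetical names, and the helper only handles float arrays with np.nan as the missing marker.

```python
import numpy as np


def fill_empty_columns(X, fill_value=0.0):
    """Hypothetical helper: replace columns that are entirely NaN with
    fill_value so a later imputer has no reason to drop them."""
    X = np.array(X, dtype=float)  # work on a copy
    empty = np.isnan(X).all(axis=0)  # True where a column is all NaN
    X[:, empty] = fill_value
    return X, np.flatnonzero(empty)


X = np.array([[np.nan, 1.0],
              [np.nan, np.nan]])
X_filled, empty_columns = fill_empty_columns(X)
# X_filled can now be passed to an imputer; column 0 is constant,
# column 1 still has a NaN for the imputer to handle.
```

The returned indexes would need to be remembered and applied again at transform time, which is the bookkeeping the request hopes to move inside the class.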

Additional context

No response

@glemaitre
Member

It'd be nice to be able to set this to a default value in transform instead of dropping those columns if there is no valid summary statistic in fit.

So instead of dropping the columns, you would like to have some columns with constant values? If this is the case, what is the reason in practice?

@vsocrates
Author

The use case I came across was when creating a ByGroupImputer that would eventually concatenate all the groups together. Each group is relatively small and likely to have different columns that are entirely NaN, so the imputed arrays end up with different numbers of columns. This would be solved by being able to fill those columns with 0 (or any other fill value) instead of dropping them.

In general, having the input and output array sizes be the same after transform seems useful though.
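
A rough illustration of the by-group case, written with plain NumPy instead of SimpleImputer so it stays self-contained. The names `impute_group_mean` and `impute_by_group` are hypothetical, and mean imputation is just an assumed strategy for the example.

```python
import numpy as np


def impute_group_mean(X, fill_value=0.0):
    """Mean-impute one group; all-NaN columns get fill_value."""
    X = np.array(X, dtype=float)  # work on a copy
    empty = np.isnan(X).all(axis=0)
    means = np.full(X.shape[1], fill_value, dtype=float)
    means[~empty] = np.nanmean(X[:, ~empty], axis=0)
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = means[cols]
    return X


def impute_by_group(X, groups, fill_value=0.0):
    """Impute each group separately and write results back in place,
    so the output has the same shape as the input."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(X)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = impute_group_mean(X[mask], fill_value)
    return out


X = np.array([[1.0, np.nan],
              [np.nan, np.nan],   # group "a": column 1 is all NaN
              [np.nan, 5.0],
              [np.nan, 7.0]])     # group "b": column 0 is all NaN
groups = np.array(["a", "a", "b", "b"])
result = impute_by_group(X, groups)
```

If the all-NaN columns were dropped per group instead, group "a" would keep only column 0 and group "b" only column 1, and the pieces could no longer be stacked back into one array.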

@glemaitre
Member

But in terms of training a predictive model, you cannot do anything with these constant features, can you?
