Default Value for SimpleImputer with all missing values #21639

vsocrates · 2021-11-12T04:29:25Z

Describe the workflow you want to enable

In the Notes of the SimpleImputer, it says that:

Columns which only contained missing values at fit are discarded upon transform if strategy is not "constant".

It'd be nice to be able to set this to a default value in transform instead of dropping those columns if there is no valid summary statistic in fit.

I recognize that it may only be a couple of lines, but it seems useful to have this logic captured within the class instead of checking each column beforehand for being entirely np.nan.

Describe your proposed solution

Near the lines

scikit-learn/sklearn/impute/_base.py

Lines 512 to 525 in 74bf394

    
        else:
            # same as np.isnan but also works for object dtypes
            invalid_mask = _get_mask(statistics, np.nan)
            valid_mask = np.logical_not(invalid_mask)
            valid_statistics = statistics[valid_mask]
            valid_statistics_indexes = np.flatnonzero(valid_mask)

            if invalid_mask.any():
                missing = np.arange(X.shape[1])[invalid_mask]
                if self.verbose != "deprecated" and self.verbose:
                    warnings.warn(
                        "Skipping features without observed values: %s" % missing
                    )
                X = X[:, valid_statistics_indexes]

We could add something like

            invalid_mask = _get_mask(statistics, np.nan)
            valid_mask = np.logical_not(invalid_mask)
            valid_statistics = statistics[valid_mask]
            valid_statistics_indexes = np.flatnonzero(valid_mask)

            invalid_statistics_indexes = np.flatnonzero(invalid_mask) 
            ...
            X[:, invalid_statistics_indexes] = {default_value}
            # X = X[:, valid_statistics_indexes]
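
To make the intended behaviour concrete, here is a minimal NumPy-only sketch of mean imputation that keeps every column and fills all-NaN columns with a default. It is not scikit-learn's implementation: the function name `impute_mean_keep_columns` and the `default_value` argument are hypothetical and only illustrate the idea behind the proposal above.

```python
import numpy as np


def impute_mean_keep_columns(X, default_value=0.0):
    """Mean-impute NaNs column by column without dropping any column.

    Hypothetical helper, not part of scikit-learn. Columns with no
    observed value receive `default_value` instead of being removed.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Per-column mean over observed entries, computed without np.nanmean
    # so that all-NaN columns do not trigger a RuntimeWarning.
    counts = (~missing).sum(axis=0)
    sums = np.where(missing, 0.0, X).sum(axis=0)
    statistics = np.where(counts > 0, sums / np.maximum(counts, 1), default_value)

    # Broadcast the per-column statistic into the missing positions.
    return np.where(missing, statistics, X)


X = np.array([[1.0, np.nan, np.nan],
              [np.nan, np.nan, 4.0]])
X_imputed = impute_mean_keep_columns(X, default_value=0.0)
```

With this sketch the output always has the same shape as the input, which is the property the request is after.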

Describe alternatives you've considered, if relevant

An alternative is to do this logic before running the SimpleImputer.
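
For reference, the pre-processing alternative could look roughly like the sketch below. The helper name `fill_empty_columns` and its `default_value` argument are assumptions made for illustration; the idea is to overwrite all-NaN columns before the array reaches the imputer, so that no column lacks an observed value at fit time.

```python
import numpy as np


def fill_empty_columns(X, default_value=0.0):
    """Return a copy of X with all-NaN columns set to `default_value`.

    Hypothetical pre-processing step to run before imputation. Also
    returns the boolean mask of the columns that were entirely missing.
    """
    X = np.array(X, dtype=float, copy=True)
    empty = np.isnan(X).all(axis=0)
    X[:, empty] = default_value
    return X, empty


X = np.array([[1.0, np.nan, np.nan],
              [3.0, np.nan, 4.0]])
X_filled, empty = fill_empty_columns(X, default_value=0.0)
```

Only the fully missing column is touched; partially missing columns still contain NaN and are left for the imputer to handle.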

Additional context

No response


glemaitre · 2021-11-15T08:46:20Z

It'd be nice to be able to set this to a default value in transform instead of dropping those columns if there is no valid summary statistic in fit.

So instead of dropping the columns, you would like to have some columns with constant values? If this is the case, what is the reason in practice?

vsocrates · 2021-11-15T18:57:33Z

The use case I came across was creating a ByGroupImputer that eventually concatenates all the groups together. Each group is relatively small and likely to have a different set of all-NaN columns, so the imputed arrays end up with different numbers of columns. This would be solved by being able to fill those columns with 0 (or any other fill value) instead of dropping them.

In general, having the input and output array sizes be the same after transform seems useful though.
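
A rough sketch of the by-group scenario is shown below, assuming mean imputation within each group. The function `impute_by_group` and its `fill_value` argument are hypothetical and written with plain NumPy rather than SimpleImputer; the point is that filling empty columns keeps every group's block the same width, so the results can be written back into one array.

```python
import numpy as np


def impute_by_group(X, groups, fill_value=0.0):
    """Mean-impute each group separately, keeping all columns.

    Hypothetical illustration. A column that is entirely NaN within a
    group is filled with `fill_value` rather than dropped, so every
    group yields a block with the same number of columns.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(X)

    for g in np.unique(groups):
        rows = groups == g
        block = X[rows]
        missing = np.isnan(block)

        # Group-wise column means over observed values only.
        counts = (~missing).sum(axis=0)
        sums = np.where(missing, 0.0, block).sum(axis=0)
        means = np.where(counts > 0, sums / np.maximum(counts, 1), fill_value)

        out[rows] = np.where(missing, means, block)

    return out


X = np.array([[1.0, np.nan],
              [3.0, np.nan],
              [np.nan, 2.0],
              [np.nan, 4.0]])
groups = np.array(["a", "a", "b", "b"])
X_by_group = impute_by_group(X, groups, fill_value=0.0)
```

Here group "a" has no observed value in the second column and group "b" has none in the first. If those columns were dropped, the two blocks would each keep a different single column and could not be stacked meaningfully.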

glemaitre · 2021-11-15T19:05:32Z

But in terms of training a predictive model, you cannot do anything with these constant features?

vsocrates added the New Feature label Nov 12, 2021

cmarmo added the module:impute label Sep 14, 2022
