SimpleImputer to Crash on Constant Imputation with string value when dataset is encoded Numerically #12306

janvanrijn · 2018-10-05T17:06:52Z

Description

The title more or less describes it. The behaviour may be logical, but I am reporting it because it took me a while to realize and debug what exactly happened.

The SimpleImputer can impute missing values with a constant. If the data is categorical, it is possible to impute with a string value. However, when fetching a dataset from OpenML (and from many other sources), categorical data is automatically encoded as numeric. Applying the SimpleImputer with a string fill value to such data makes scikit-learn raise a ValueError. I assume there's not a lot that can be done about this, as everything behaves exactly as you would expect once you dive into the code, but maybe the documentation can be extended a little (probably on the SimpleImputer side, or maybe on the side of the data sources).

What do you think?

Steps/Code to Reproduce

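import numpy as np
import sklearn.datasets
import sklearn.compose
import sklearn.tree
import sklearn.impute
import sklearn.pipeline       # used below for make_pipeline
import sklearn.preprocessing  # used below for StandardScaler and OneHotEncoder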

X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True)

numeric_imputer = sklearn.impute.SimpleImputer(strategy='mean')
numeric_scaler = sklearn.preprocessing.StandardScaler()

nominal_imputer = sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing')
nominal_encoder = sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')

numeric_idx = [1, 2, 7, 10, 13]
nominal_idx = [0, 3, 4, 5, 6, 8, 9, 11, 12]

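# Note: ~np.isnan(...) selects the non-NaN entries, so despite the labels
# these counts are the values that are present, not the missing ones.
print('missing numeric vals:', np.count_nonzero(~np.isnan(X[:, numeric_idx])))
print('missing nominal vals:', np.count_nonzero(~np.isnan(X[:, nominal_idx])))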


clf_nom = sklearn.pipeline.make_pipeline(nominal_imputer, nominal_encoder)
clf_nom.fit(X[:, nominal_idx], y)

Expected Results

A fitted classifier? Depending on how you write the documentation, the current error could also be the expected result.

Actual Results

missing numeric vals: 3450
missing nominal vals: 6210
Traceback (most recent call last):
  File "/home/janvanrijn/projects/sklearn-bot/testjan.py", line 23, in <module>
    clf_nom.fit(X[:, nominal_idx], y)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/base.py", line 465, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/impute.py", line 241, in fit
    "data".format(fill_value))
ValueError: 'fill_value'=missing is invalid. Expected a numerical value when imputing numerical data

Versions

Python=3.6.0
numpy==1.15.2
scikit-learn==0.20.0
scipy==1.1.0

jorisvandenbossche · 2018-10-05T18:37:24Z

As you also mention, I think this is the expected behaviour (whether it is the expected behaviour of the fetch_openml to encode categorical data, that is something else :-))

I personally find the error message of "'fill_value'=missing is invalid. Expected a numerical value when imputing numerical data" rather clear on what the problem is.
Do you have a suggestion to further improve that?

The SimpleImputer docstring explanation of fill_value could certainly be expanded to mention that the fill_value needs to be compatible with the data (same data type).
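
A minimal sketch of that compatibility requirement, using small made-up arrays rather than the OpenML data (`X_num` and `X_obj` are hypothetical names introduced here). It only illustrates that a string `fill_value` is rejected for a float array and accepted for an object array; the exact error text may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Nominal column that arrives numerically encoded (float dtype), one value missing.
X_num = np.array([[0.0], [1.0], [np.nan]])

# The same kind of column stored as strings in an object array.
X_obj = np.array([["a"], ["b"], [np.nan]], dtype=object)

imputer = SimpleImputer(strategy="constant", fill_value="missing")

# A string fill_value is not compatible with numeric data.
try:
    imputer.fit_transform(X_num)
    numeric_failed = False
except ValueError as exc:
    numeric_failed = True
    print("numeric input rejected:", exc)

# With object dtype, the same imputer fills the gap with the string.
X_filled = imputer.fit_transform(X_obj)
print(X_filled.ravel().tolist())
```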

amueller · 2018-10-05T18:45:12Z

I agree, the error message seems pretty clear. What's unclear about it?

janvanrijn · 2018-10-05T19:08:16Z

The error message is rather clear. However, having personal knowledge about the datasets that I am working with (and knowing that there could only be nominal values in this part of the pipeline), I assumed the mistake was somewhere else in the pipeline.

I did not expect the following code to crash, until I investigated:

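import numpy as np
import sklearn.datasets
import sklearn.compose
import sklearn.tree
import sklearn.impute
import sklearn.pipeline       # used below for make_pipeline
import sklearn.preprocessing  # used below for StandardScaler and OneHotEncoder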

X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True)

numeric_imputer = sklearn.impute.SimpleImputer(strategy='mean')
numeric_scaler = sklearn.preprocessing.StandardScaler()

nominal_imputer = sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing')
nominal_encoder = sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')

numeric_idx = [1, 2, 7, 10, 13]
nominal_idx = [0, 3, 4, 5, 6, 8, 9, 11, 12]  # JvR: usually I get this part from the OpenML package

numeric_transformer = sklearn.pipeline.make_pipeline(numeric_imputer, numeric_scaler)
nominal_transformer = sklearn.pipeline.make_pipeline(nominal_imputer, nominal_encoder)
transformer = sklearn.compose.ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_idx),
        ('nominal', nominal_transformer, nominal_idx)],  # JvR: In this part, I kind of assume only nominal values (which is wrong on some level).  
    remainder='passthrough')

clf = sklearn.pipeline.make_pipeline(transformer, sklearn.tree.DecisionTreeClassifier())

clf.fit(X, y)

My personal suggestion is to add a line to the SimpleImputer documentation, making it clearer that a string value is only allowed if the data is also encoded as non-numeric; but if you think this is not necessary, feel free to close the issue.
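
One possible workaround when the nominal columns stay numerically encoded is to impute with a numeric sentinel instead of a string. The sketch below uses made-up data; `X_nominal` and the sentinel `-1` are assumptions for illustration, and the sentinel must be a value that is not already a category code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal columns that arrive numerically encoded (float dtype).
X_nominal = np.array([[0.0, 2.0], [1.0, np.nan], [np.nan, 2.0], [1.0, 3.0]])

# -1 is an assumed sentinel standing in for the 'missing' category.
nominal_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=-1),
    OneHotEncoder(handle_unknown="ignore"),
)

encoded = nominal_transformer.fit_transform(X_nominal)

# The sentinel shows up as its own category in each column.
encoder = nominal_transformer.steps[-1][1]
categories = [c.tolist() for c in encoder.categories_]
print(encoded.shape, categories)
```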

thomasjpfan · 2020-01-15T17:53:35Z

With the as_frame keyword option, the above script will not fail if X is loaded as a dataframe:

X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True, as_frame=True)
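
A small offline sketch of why a dataframe input can behave differently. It assumes, without checking the actual 'Australian' data, that nominal columns come back with pandas' category dtype and string categories; the frame `df` and column name `A1` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for a nominal column loaded with as_frame=True:
# string categories with one missing entry.
df = pd.DataFrame({"A1": pd.Categorical(["0", "1", None, "1"])})

imputer = SimpleImputer(strategy="constant", fill_value="missing")

# The column is not numeric, so a string fill_value is accepted.
filled = imputer.fit_transform(df)
print(np.asarray(filled).ravel().tolist())
```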

uzdik · 2022-11-02T09:49:08Z

Check your data: I had the same problem. When I checked my data types, some of the columns were of type object; after converting them to float, the problem disappeared.

amueller mentioned this issue Jul 17, 2019

Support standard data science use-case #10603

Open

cmarmo added the module:impute label Feb 5, 2022

thomasjpfan mentioned this issue Nov 30, 2022

DOC Clarify fill_value behavior in SimpleImputer #25081

Merged

jjerphan closed this as completed in #25081 Dec 1, 2022

jjerphan pushed a commit that referenced this issue Dec 1, 2022

DOC Clarify fill_value behavior in SimpleImputer (#25081)

ba5f9e0

Fixes #12306
