SimpleImputer to Crash on Constant Imputation with string value when dataset is encoded Numerically #12306

Closed
janvanrijn opened this issue Oct 5, 2018 · 5 comments · Fixed by #25081

Comments

@janvanrijn
Contributor

Description

The title kind of describes it. It might be pretty logical, but just putting it out here as it took a while for me to realize and debug what exactly happened.

The SimpleImputer can impute missing values with a constant. If the data is categorical, it is possible to impute with a string value. However, when fetching a dataset from OpenML (or from many other sources), the categorical data is automatically encoded as numeric. Applying the SimpleImputer with a string fill value then makes scikit-learn raise an error. I assume there's not a lot that can be done about this, as everything behaves exactly as you would expect once you dive deep into the code, but maybe the documentation can be extended a little (probably on the SimpleImputer side, or maybe on the side of the data sources).

What do you think?

Steps/Code to Reproduce

import numpy as np
import sklearn.datasets
import sklearn.compose
import sklearn.tree
import sklearn.impute
import sklearn.pipeline
import sklearn.preprocessing

X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True)

numeric_imputer = sklearn.impute.SimpleImputer(strategy='mean')
numeric_scaler = sklearn.preprocessing.StandardScaler()

nominal_imputer = sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing')
nominal_encoder = sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')

numeric_idx = [1, 2, 7, 10, 13]
nominal_idx = [0, 3, 4, 5, 6, 8, 9, 11, 12]

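# Note: ~np.isnan(...) selects the non-NaN entries, so these counts are of present values, not missing ones.
print('missing numeric vals:', np.count_nonzero(~np.isnan(X[:, numeric_idx])))
print('missing nominal vals:', np.count_nonzero(~np.isnan(X[:, nominal_idx])))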


clf_nom = sklearn.pipeline.make_pipeline(nominal_imputer, nominal_encoder)
clf_nom.fit(X[:, nominal_idx], y)

Expected Results

A fitted classifier? Depending on how you write the documentation, the current error could also be the expected result.

Actual Results

missing numeric vals: 3450
missing nominal vals: 6210
Traceback (most recent call last):
  File "/home/janvanrijn/projects/sklearn-bot/testjan.py", line 23, in <module>
    clf_nom.fit(X[:, nominal_idx], y)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/base.py", line 465, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/impute.py", line 241, in fit
    "data".format(fill_value))
ValueError: 'fill_value'=missing is invalid. Expected a numerical value when imputing numerical data
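
The same error can be reproduced without downloading a dataset. This is a minimal sketch, assuming a small hand-made float array (the hypothetical `X_small`) stands in for the numerically encoded OpenML data. The exact wording of the message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hand-made stand-in for numerically encoded nominal columns (float dtype).
X_small = np.array([[0.0, 1.0], [np.nan, 2.0], [1.0, np.nan]])

imputer = SimpleImputer(strategy="constant", fill_value="missing")

error_message = None
try:
    imputer.fit(X_small)
except ValueError as exc:
    # The string fill_value is rejected because X_small holds floats.
    error_message = str(exc)

print(error_message)
```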

Versions

Python=3.6.0
numpy==1.15.2
scikit-learn==0.20.0
scipy==1.1.0
@jorisvandenbossche
Member

As you also mention, I think this is the expected behaviour (whether it is the expected behaviour of the fetch_openml to encode categorical data, that is something else :-))

I personally find the error message of "'fill_value'=missing is invalid. Expected a numerical value when imputing numerical data" rather clear on what the problem is.
Do you have a suggestion to further improve that?

The SimpleImputer docstring explanation of fill_value could certainly be expanded to mention that the fill_value needs to be compatible with the data (same data type).
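
One way to make the data compatible with a string fill_value is to cast the array to object dtype before imputing. This is a minimal sketch under that assumption, using a hand-made array rather than the OpenML data; the names `X_float` and `X_object` are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Float array with missing entries, as a numerically encoded dataset would be.
X_float = np.array([[0.0, 1.0], [np.nan, 2.0]])

# Casting to object dtype lets the imputer accept a string fill value.
X_object = X_float.astype(object)

imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_filled = imputer.fit_transform(X_object)

print(X_filled)
```

Note that the imputed column then mixes floats and a string, which some downstream encoders may reject, so a further conversion (for example to strings) may be needed depending on the pipeline.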

@amueller
Member

amueller commented Oct 5, 2018

I agree, the error message seems pretty clear. What's unclear about it?

@janvanrijn
Contributor Author

The error message is rather clear. However, knowing the datasets I am working with (and knowing that this part of the pipeline could only receive nominal values), I assumed the mistake was somewhere else in the pipeline.

I did not expect the following code to crash, until I investigated:

import numpy as np
import sklearn.datasets
import sklearn.compose
import sklearn.tree
import sklearn.impute
import sklearn.pipeline
import sklearn.preprocessing

X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True)

numeric_imputer = sklearn.impute.SimpleImputer(strategy='mean')
numeric_scaler = sklearn.preprocessing.StandardScaler()

nominal_imputer = sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing')
nominal_encoder = sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')

numeric_idx = [1, 2, 7, 10, 13]
nominal_idx = [0, 3, 4, 5, 6, 8, 9, 11, 12]  # JvR: usually I get this part from the OpenML package

numeric_transformer = sklearn.pipeline.make_pipeline(numeric_imputer, numeric_scaler)
nominal_transformer = sklearn.pipeline.make_pipeline(nominal_imputer, nominal_encoder)
transformer = sklearn.compose.ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_idx),
        ('nominal', nominal_transformer, nominal_idx)],  # JvR: In this part, I kind of assume only nominal values (which is wrong on some level).  
    remainder='passthrough')

clf = sklearn.pipeline.make_pipeline(transformer, sklearn.tree.DecisionTreeClassifier())

clf.fit(X, y)

My personal suggestion is to add a line to the SimpleImputer documentation, making it clearer that a string fill value is only allowed if the data is also encoded as non-numeric, but if you think this is not necessary, feel free to close the issue.
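
Until the documentation says so explicitly, a pipeline can guard against the mismatch by checking the dtype before choosing a fill value. This is a sketch only: `pick_constant_fill` and its default values are hypothetical and not part of scikit-learn.

```python
import numpy as np


def pick_constant_fill(X, numeric_fill=-1, string_fill="missing"):
    """Hypothetical helper: choose a fill value matching the dtype kind of X."""
    kind = np.asarray(X).dtype.kind
    # 'i', 'u' and 'f' are NumPy's signed int, unsigned int and float kinds.
    if kind in ("i", "u", "f"):
        return numeric_fill
    return string_fill


numeric_choice = pick_constant_fill(np.array([[1.0, np.nan]]))
object_choice = pick_constant_fill(np.array([["a", None]], dtype=object))
print(numeric_choice, object_choice)
```

The returned value could then be passed as `fill_value` to `SimpleImputer(strategy='constant', ...)`.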

@thomasjpfan
Member

With the as_frame keyword option, the above script will not fail if X is loaded as a dataframe:

X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True, as_frame=True)
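
To illustrate why a dataframe input can behave differently, here is a minimal sketch that does not call fetch_openml. It assumes a hand-made DataFrame with a hypothetical object-dtype column `color`, which is not how the Australian dataset is necessarily loaded.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hand-made nominal column stored as Python objects rather than floats.
df = pd.DataFrame({"color": ["red", np.nan, "blue"]}, dtype=object)

imputer = SimpleImputer(strategy="constant", fill_value="missing")

# With non-numeric input, the string fill value is accepted.
filled = np.asarray(imputer.fit_transform(df))

print(filled)
```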

@uzdik

uzdik commented Nov 2, 2022

Check your data. I had the same problem: when I checked my data types, some of the columns were of type object. After converting them to float, the problem disappeared.
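
The check and conversion described above could look roughly like this. It is a sketch with a hand-made DataFrame; the column names `a` and `b` are hypothetical, and `errors="coerce"` (which turns unparseable entries into NaN) is an assumption about what is acceptable for the data.

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.Series(["1.5", None, "3"], dtype=object),  # numbers stored as objects
    "b": [1, 2, 3],
})

# Find the columns whose dtype is object.
object_columns = [col for col in df.columns if df[col].dtype == object]

# Convert each of them to a numeric dtype; unparseable entries become NaN.
for col in object_columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```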
