SimpleImputer crashes on constant imputation with a string value when the dataset is encoded numerically #12306
Comments
As you also mention, I think this is the expected behaviour (whether it is the expected behaviour of the I personally find the error message of "'fill_value'=missing is invalid. Expected a numerical value when imputing numerical data" rather clear on what the problem is. The SimpleImputer docstring explanation of
I agree, the error message seems pretty clear. What's unclear about it?
The error message is rather clear. However, knowing the datasets I am working with (and knowing that there could only be nominal values in this part of the pipeline), I assumed the mistake was somewhere else in the pipeline. I did not expect the following code to crash until I investigated:
My personal suggestion is to add a line to the SimpleImputer documentation, making it clearer that a string fill value is only allowed if the data is also encoded as non-numeric. If you think this is not necessary, feel free to close the issue.
With the X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True, as_frame=True)
Check your data. I had the same problem: when I checked my data types, some of the columns had type object. After converting them to float, the problem disappeared.
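The dtype check suggested in the last comment could look roughly like the sketch below. The DataFrame and its column names are made up for illustration, and using `pd.to_numeric` with `errors="coerce"` is one possible way to do the conversion, not necessarily what the commenter did.

```python
import pandas as pd

# Hypothetical data: column "a" holds numbers stored as Python objects
# (strings and None), column "b" is already float.
df = pd.DataFrame({
    "a": pd.Series(["1.0", "2.5", None], dtype=object),
    "b": [1.0, None, 3.0],
})

# Inspect the dtypes first; object columns are the ones to look at.
dtypes_before = df.dtypes.astype(str).to_dict()

# Convert every object column to a numeric dtype. Values that cannot be
# parsed become NaN because of errors="coerce".
object_cols = df.select_dtypes(include="object").columns
df[object_cols] = df[object_cols].apply(pd.to_numeric, errors="coerce")

dtypes_after = df.dtypes.astype(str).to_dict()
```

Note that after such a conversion the columns are numeric, so a constant `fill_value` for `SimpleImputer` would have to be numeric as well.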
Description
The title kind of describes it. It might be pretty logical, but just putting it out here as it took a while for me to realize and debug what exactly happened.
The SimpleImputer can impute missing values with a constant. If the data is categorical, it is possible to impute with a string value. However, when fetching a dataset from OpenML (or from many other sources), the data is automatically encoded as numeric. When the SimpleImputer is then applied with a string fill value, scikit-learn raises an error. I assume there's not a lot that can be done about this, as everything behaves exactly as you would expect once you dive deep into the code, but maybe the documentation can be extended a little bit (probably on the SimpleImputer side, or maybe on the side of the data sources).
What do you think?
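As an illustration of the behaviour described above, here is a minimal sketch. It is not the original reproduction code: it uses small hand-made arrays instead of an OpenML dataset, and the helper `try_constant_imputation` is a hypothetical name introduced only for this example. The exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

def try_constant_imputation(X, fill_value):
    """Return the imputed array, or the error message if fitting fails."""
    imputer = SimpleImputer(strategy="constant", fill_value=fill_value)
    try:
        return imputer.fit_transform(X)
    except ValueError as exc:
        return str(exc)

# Toy stand-in for a nominal feature that a loader has encoded as floats.
X_numeric = np.array([[0.0], [1.0], [np.nan], [2.0]])

# A string fill value on numeric data is rejected with a ValueError.
string_result = try_constant_imputation(X_numeric, "missing")

# A numeric fill value on the same data works.
numeric_result = try_constant_imputation(X_numeric, -1)

# The same feature kept as strings (object dtype) accepts a string fill value.
X_object = np.array([["a"], ["b"], [np.nan], ["c"]], dtype=object)
object_result = try_constant_imputation(X_object, "missing")
```

In this sketch only the first call fails, which matches the point of the report: whether a string `fill_value` is accepted depends on the dtype of the data, not on whether the feature is conceptually categorical.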
Steps/Code to Reproduce
Expected Results
A fitted classifier? Depending on how you write the documentation, the current error could also be the expected result.
Actual Results
Versions