Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] FIX: raise error with inconsistent dtype X and missing_values #11391

Merged
merged 7 commits into from Jul 16, 2018

Conversation

glemaitre
Copy link
Member

closes #11390

Implement an additional check to avoid raising NotImplemented from np.equal when the dtype of X and missing_values are inconsistent for a comparison.

not isinstance(missing_values, numbers.Real)):
raise TypeError("The data type of 'missing_values' and 'X' are "
"not compatible. 'missing_values' data type is "
"{} and 'X' is {}."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the rest of the code, we raise a ValueError for this kind of error, there is a discussion in #11211.

# for X and missing_values was not raising a proper error.
X = np.random.randn(1000, 10)
X[0, 0] = X_missing_value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure 1000 is necessary here :)

[SimpleImputer, ChainedImputer])
@pytest.mark.parametrize(
"missing_values, X_missing_value, err_msg",
[("NaN", np.nan, "contains"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give more context for the contains error message? I don't know what the expected error is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the error raised by check_array when X contains NaN or inf and `force_all_finite=True:

Input contains NaN, infinity or a value too large for dtype('float64').

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. would be easier to follow if it was "Input contains NaN" ...

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But aren't we still supporting missing_values='NaN'?

@jeremiedbb
Copy link
Member

No, in SimpleImputer you have to do missing_values=np.nan. If you put missing_values='NaN' it will try to match the string 'NaN'. We did that in order to support categorical inputs in the SimpleImputer. It's possible to have a dataset with missing values encoded as 'NaN'.

@jnothman
Copy link
Member

jnothman commented Jun 30, 2018 via email

@glemaitre
Copy link
Member Author

@qinhanmin2014 @jnothman DO you have some remarks in order to move on on the MissingIndicator

@glemaitre glemaitre added this to the 0.20 milestone Jul 11, 2018
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart for a few minor comments.

@@ -40,6 +40,17 @@
]


def _check_inputs_dtype(X, missing_values):
"""Check that the dtype of X is in accordance with the one of
missing_values."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would phrase this in the other order
"""
Check that the type of missing_values is in accordance with the dtype of X.
"""
as that's what's we are checking right? I mean when we get this error, the user should change the value of missing_value not the other way around, right?

def _check_inputs_dtype(X, missing_values):
"""Check that the dtype of X is in accordance with the one of
missing_values."""
if (X.dtype.kind in ("f", "i", "u") and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! I didn't know about this.

err_msg):
# regression test for issue #11390. Comparison between incoherent dtype
# for X and missing_values was not raising a proper error.
X = np.random.randn(10, 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use a random state (even if we don't care about the actual values)

Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments.

@@ -40,6 +40,18 @@
]


def _check_inputs_dtype(X, missing_values):
"""Check that the type of missing_values is in accordance with the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this docstring, doesn't add much information or document it fully if you insist this function needs to be documented.

"""
if (X.dtype.kind in ("f", "i", "u") and
not isinstance(missing_values, numbers.Real)):
raise ValueError("The data type of 'missing_values' and 'X' are "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add what is expected? In general it's better when the error message says something like: "this is what was expected. Got this instead."

@pytest.mark.parametrize("imputer_constructor",
[SimpleImputer, ChainedImputer])
@pytest.mark.parametrize(
"missing_values, X_missing_value, err_msg",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X_missing_value may not be a great name since it seems to be a number

@agramfort agramfort merged commit d986354 into scikit-learn:master Jul 16, 2018
@agramfort
Copy link
Member

thx @glemaitre

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cryptic error in imputers due to missing checking in _get_mask
7 participants