Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check arguments of `MissingIndicator` imputer when handling sparse arrays #13240

Merged
merged 15 commits into from Feb 26, 2019
@@ -1109,6 +1109,7 @@ class MissingIndicator(BaseEstimator, TransformerMixin):

def __init__(self, missing_values=np.nan, features="missing-only",
sparse="auto", error_on_new=True):

This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 25, 2019

Contributor

You don't need to add this blank line

This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 25, 2019

Contributor

Could you remove this line

self.missing_values = missing_values
self.features = features
self.sparse = sparse
@@ -1207,6 +1208,18 @@ def fit(self, X, y=None):
raise ValueError("'features' has to be either 'missing-only' or "
"'all'. Got {} instead.".format(self.features))

if self.sparse is True and self.missing_values == 0:
This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jnothman

jnothman Feb 25, 2019

Member

I don't think the sparsity of the output is the problem, only that of the input.

This comment has been minimized.

Copy link
@btel

btel Feb 25, 2019

Author Contributor

Ok, so in this case we will pass a sparse array without densifying it with missing_value=0 (which is implicit in sparse array)?

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 25, 2019

Contributor

The output is only ones and zeros. It has nothing to do with the value of missing_values. We only want to raise an error when the input is a sparse matrix and missing values == 0

raise ValueError("'missing_values' can not be 0 "
"when 'sparse' is True")

This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 25, 2019

Contributor

I don't think we want this case to raise an error.

This comment has been minimized.

Copy link
@btel

btel Feb 25, 2019

Author Contributor

ok, I removed this case. Thanks!

if sparse.issparse(X):
# missing_values = 0 not allowed with sparse data as it would
This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 26, 2019

Contributor

Since fit and transform both call _validate_imput, we can move this test into this function to avoid redundant code.

This comment has been minimized.

Copy link
@btel

btel Feb 26, 2019

Author Contributor

Good idea. Thanks!

# force densification
This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 25, 2019

Contributor

You also need to raise the same error in transform

if self.missing_values == 0:
raise ValueError("Imputation not possible when missing_values "
"== 0 and input is sparse. Provide a dense "
This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 26, 2019

Contributor

It's not Imputation here. Maybe something like "Sparse input with missing_values=0 is not supported."

This comment has been minimized.

Copy link
@btel

btel Feb 26, 2019

Author Contributor

Ok, I changed it.

"array instead.")

if not ((isinstance(self.sparse, str) and
self.sparse == "auto") or isinstance(self.sparse, bool)):
raise ValueError("'sparse' has to be a boolean or 'auto'. "
@@ -923,6 +923,9 @@ def test_iterative_imputer_early_stopping():
(np.array([[-1, 1], [1, 2]]), np.array([[-1, 1], [1, 2]]),
{'features': 'all', 'sparse': 'random'},
"'sparse' has to be a boolean or 'auto'"),
(np.array([[-1, 1], [1, 2]]), np.array([[-1, 1], [1, 2]]),
{'missing_values': 0, 'sparse': True},
"'missing_values' can not be 0 when 'sparse' is True"),
This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 25, 2019

Contributor

We should accept this case

(np.array([['a', 'b'], ['c', 'a']], dtype=str),
np.array([['a', 'b'], ['c', 'a']], dtype=str),
{}, "MissingIndicator does not support data with dtype")]
@@ -934,6 +937,7 @@ def test_missing_indicator_error(X_fit, X_trans, params, msg_err):
indicator.fit(X_fit).transform(X_trans)



This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 25, 2019

Contributor

Remove this line

This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 25, 2019

Contributor

Remove this line

@pytest.mark.parametrize(
"missing_values, dtype",
[(np.nan, np.float64),
@@ -965,6 +969,12 @@ def test_missing_indicator_new(missing_values, arr_type, dtype, param_features,
indicator = MissingIndicator(missing_values=missing_values,
features=param_features,
sparse=False)

This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 25, 2019

Contributor

Remove this line

if sparse.issparse(X_fit) and missing_values == 0:
with pytest.raises(ValueError):
X_fit_mask_sparse = indicator.fit_transform(X_fit)
return

X_fit_mask = indicator.fit_transform(X_fit)
X_trans_mask = indicator.transform(X_trans)

@@ -981,6 +991,11 @@ def test_missing_indicator_new(missing_values, arr_type, dtype, param_features,
assert isinstance(X_trans_mask, np.ndarray)

indicator.set_params(sparse=True)
if missing_values == 0:
with pytest.raises(ValueError):
X_fit_mask_sparse = indicator.fit_transform(X_fit)
return

X_fit_mask_sparse = indicator.fit_transform(X_fit)
X_trans_mask_sparse = indicator.transform(X_trans)

@@ -1009,6 +1024,17 @@ def test_missing_indicator_sparse_param(arr_type, missing_values,

indicator = MissingIndicator(missing_values=missing_values,
sparse=param_sparse)

if param_sparse is True and missing_values == 0:
This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 25, 2019

Contributor
Suggested change
if param_sparse is True and missing_values == 0:
if param_sparse and missing_values == 0:

This comment has been minimized.

Copy link
@btel

btel Feb 25, 2019

Author Contributor

wouldn't it also cover the case when "param_sparse=='auto'", which should be an allowed case

with pytest.raises(ValueError):
X_fit_mask = indicator.fit_transform(X_fit)
return
This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 25, 2019

Contributor

We should accept this case

This comment has been minimized.

Copy link
@btel

btel Feb 25, 2019

Author Contributor

ok, done!


if sparse.issparse(X_fit) and missing_values == 0:
with pytest.raises(ValueError):
X_fit_mask = indicator.fit_transform(X_fit)
return

X_fit_mask = indicator.fit_transform(X_fit)
X_trans_mask = indicator.transform(X_trans)

ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.