Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check arguments of `MissingIndicator` imputer when handling sparse arrays #13240

Merged
merged 15 commits into from Feb 26, 2019
Copy path View file
@@ -82,7 +82,7 @@ Support for Python 3.4 and below has been officially dropped.
- |Fix| Fixed a bug in :class:`decomposition.NMF` where `init = 'nndsvd'`,
`init = 'nndsvda'`, and `init = 'nndsvdar'` are allowed when
`n_components < n_features` instead of
`n_components <= min(n_samples, n_features)`.
`n_components <= min(n_samples, n_features)`.
:issue:`11650` by :user:`Hossein Pourbozorg <hossein-pourbozorg>` and
:user:`Zijie (ZJ) Poh <zjpoh>`.

@@ -134,6 +134,10 @@ Support for Python 3.4 and below has been officially dropped.
:issue:`12177` by :user:`Sergey Feldman <sergeyf>` :user:`Ben Lawson
<benlawson>`.

- |Fix| In :class:`impute.MissingIndicator` avoid implicit densification by
raising an exception if input is sparse add `missing_values` property
is set to 0. :issue:`13240` by :user:`Bartosz Telenczuk <btel>`.

:mod:`sklearn.linear_model`
...........................

@@ -167,7 +171,7 @@ Support for Python 3.4 and below has been officially dropped.

- |Fix| Fixed a bug in :class:`linear_model.LassoLarsIC`, where user input
``copy_X=False`` at instance creation would be overridden by default
parameter value ``copy_X=True`` in ``fit``.
parameter value ``copy_X=True`` in ``fit``.

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 26, 2019

Contributor

Please avoid fixing pep8 issues not related to your PR. It makes reviews harder to follow

This comment has been minimized.

Copy link
@btel

btel Feb 26, 2019

Author Contributor

Ok. I will avoid it for future PRs, but I guess I should keep this one.

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 26, 2019

Contributor

we usually ask contributors to revert it :/

:issue:`12972` by :user:`Lucio Fernandez-Arjona <luk-f-a>`

:mod:`sklearn.manifold`
Copy path View file
@@ -1134,7 +1134,7 @@ def _get_missing_features_info(self, X):
The features containing missing values.
"""
if sparse.issparse(X) and self.missing_values != 0:
if sparse.issparse(X):
mask = _get_mask(X.data, self.missing_values)

# The imputer mask will be constructed with the same sparse format
@@ -1157,10 +1157,6 @@ def _get_missing_features_info(self, X):
elif imputer_mask.format == 'csr':
imputer_mask = imputer_mask.tocsc()
else:
if sparse.issparse(X):
# case of sparse matrix with 0 as missing values. Implicit and
# explicit zeros are considered as missing values.
X = X.toarray()
imputer_mask = _get_mask(X, self.missing_values)
features_with_missing = np.flatnonzero(imputer_mask.sum(axis=0))

@@ -1184,6 +1180,14 @@ def _validate_input(self, X):
"categorical data represented either as an array "
"with integer dtype or an array of string values "
"with an object dtype.".format(X.dtype))

if sparse.issparse(X) and self.missing_values == 0:
# missing_values = 0 not allowed with sparse data as it would
# force densification
raise ValueError("Sparse input with missing_values=0 is "
"not supported. Provide a dense "
"array instead.")

return X

def fit(self, X, y=None):
Copy path View file
@@ -935,14 +935,21 @@ def test_missing_indicator_error(X_fit, X_trans, params, msg_err):


@pytest.mark.parametrize(
"missing_values, dtype",
[(np.nan, np.float64),
(0, np.int32),
(-1, np.int32)])
@pytest.mark.parametrize(
"arr_type",
[np.array, sparse.csc_matrix, sparse.csr_matrix, sparse.coo_matrix,
sparse.lil_matrix, sparse.bsr_matrix])
"missing_values, dtype, arr_type",
[(np.nan, np.float64, np.array),
(0, np.int32, np.array),
(-1, np.int32, np.array),
(np.nan, np.float64, sparse.csc_matrix),
(-1, np.int32, sparse.csc_matrix),
(np.nan, np.float64, sparse.csr_matrix),
(-1, np.int32, sparse.csr_matrix),
(np.nan, np.float64, sparse.coo_matrix),
(-1, np.int32, sparse.coo_matrix),
(np.nan, np.float64, sparse.lil_matrix),
(-1, np.int32, sparse.lil_matrix),
(np.nan, np.float64, sparse.bsr_matrix),
(-1, np.int32, sparse.bsr_matrix)
])
@pytest.mark.parametrize(
"param_features, n_features, features_indices",
[('missing-only', 2, np.array([0, 1])),
@@ -992,11 +999,42 @@ def test_missing_indicator_new(missing_values, arr_type, dtype, param_features,
assert_allclose(X_trans_mask_sparse.toarray(), X_trans_mask)


@pytest.mark.parametrize("param_sparse", [True, False, 'auto'])
@pytest.mark.parametrize("missing_values", [np.nan, 0])
@pytest.mark.parametrize(
"arr_type",
[np.array, sparse.csc_matrix, sparse.csr_matrix, sparse.coo_matrix])
[sparse.csc_matrix, sparse.csr_matrix, sparse.coo_matrix,
sparse.lil_matrix, sparse.bsr_matrix])
def test_missing_indicator_raise_on_sparse_with_missing_0(arr_type):
# test for sparse input and missing_value == 0

missing_values = 0
X_fit = np.array([[missing_values, missing_values, 1],
[4, missing_values, 2]])
X_trans = np.array([[missing_values, missing_values, 1],
[4, 12, 10]])

# convert the input to the right array format
X_fit_sparse = arr_type(X_fit)
X_trans_sparse = arr_type(X_trans)

indicator = MissingIndicator(missing_values=missing_values)

with pytest.raises(ValueError, match="Sparse input with missing_values=0"):
indicator.fit_transform(X_fit_sparse)

indicator.fit_transform(X_fit)
with pytest.raises(ValueError, match="Sparse input with missing_values=0"):
indicator.transform(X_trans_sparse)

This conversation was marked as resolved by btel

This comment has been minimized.

Copy link
@jeremiedbb

jeremiedbb Feb 26, 2019

Contributor

I'm not sure the second part raises the right error. I think the error raised is a non fitted estimator error since the fit should have failed above. To be sure, you can add a match parameter to pytest.raises, with a part of the error message you want to catch.

You'll then need to fit it on proper data to check the transform.

This comment has been minimized.

Copy link
@btel

btel Feb 26, 2019

Author Contributor

Good catch! As you said, it was raising "instance not fitted" exception instead of "sparse with missing_values=0". I fixed the test.


@pytest.mark.parametrize("param_sparse", [True, False, 'auto'])
@pytest.mark.parametrize("missing_values, arr_type",
[(np.nan, np.array),
(0, np.array),
(np.nan, sparse.csc_matrix),
(np.nan, sparse.csr_matrix),
(np.nan, sparse.coo_matrix),
(np.nan, sparse.lil_matrix)
])
def test_missing_indicator_sparse_param(arr_type, missing_values,
param_sparse):
# check the format of the output with different sparse parameter
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.