Check arguments of `MissingIndicator` imputer when handling sparse arrays #13240

Merged
merged 15 commits on Feb 26, 2019
@@ -82,7 +82,7 @@ Support for Python 3.4 and below has been officially dropped.
- |Fix| Fixed a bug in :class:`decomposition.NMF` where `init = 'nndsvd'`,
`init = 'nndsvda'`, and `init = 'nndsvdar'` are allowed when
`n_components < n_features` instead of
`n_components <= min(n_samples, n_features)`.
`n_components <= min(n_samples, n_features)`.
:issue:`11650` by :user:`Hossein Pourbozorg <hossein-pourbozorg>` and
:user:`Zijie (ZJ) Poh <zjpoh>`.

@@ -133,6 +133,10 @@ Support for Python 3.4 and below has been officially dropped.
function of other features in a round-robin fashion. :issue:`8478` and
:issue:`12177` by :user:`Sergey Feldman <sergeyf>` :user:`Ben Lawson
<benlawson>`.

- |Fix| In :class:`impute.MissingIndicator` avoid implicit densification by
raising an exception if input is sparse and `missing_values` property
is set to 0. :issue:`12750` by :user:`Bartosz Telenczuk <btel>`.
This conversation was marked as resolved by btel

jeremiedbb Feb 26, 2019
Contributor

Counter-intuitive, but the issue number should be your PR number :)

btel Feb 26, 2019
Author Contributor

Fixed. Thanks!
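For context, here is a minimal sketch of the behaviour this changelog entry describes. It assumes the public `MissingIndicator` API from `sklearn.impute`; the exact error message is an assumption (see the wording discussion further down).

```python
import numpy as np
from scipy import sparse
from sklearn.impute import MissingIndicator

X = np.array([[0, 1, 3],
              [4, 0, 2]])

# missing_values=0 on sparse input would force densification, so it is
# now rejected explicitly instead of silently densifying the matrix.
try:
    MissingIndicator(missing_values=0).fit_transform(sparse.csr_matrix(X))
except ValueError as exc:
    print(exc)  # assumed message: sparse input with missing_values=0 not supported

# Dense input with missing_values=0 still works as before.
mask = MissingIndicator(missing_values=0).fit_transform(X)
print(mask)
```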


:mod:`sklearn.linear_model`
...........................
@@ -167,7 +171,7 @@ Support for Python 3.4 and below has been officially dropped.

- |Fix| Fixed a bug in :class:`linear_model.LassoLarsIC`, where user input
``copy_X=False`` at instance creation would be overridden by default
parameter value ``copy_X=True`` in ``fit``.
parameter value ``copy_X=True`` in ``fit``.

jeremiedbb Feb 26, 2019
Contributor

Please avoid fixing pep8 issues not related to your PR. It makes reviews harder to follow

btel Feb 26, 2019
Author Contributor

Ok. I will avoid it for future PRs, but I guess I should keep this one.

jeremiedbb Feb 26, 2019
Contributor

we usually ask contributors to revert it :/

:issue:`12972` by :user:`Lucio Fernandez-Arjona <luk-f-a>`

:mod:`sklearn.manifold`
@@ -1207,6 +1207,14 @@ def fit(self, X, y=None):
            raise ValueError("'features' has to be either 'missing-only' or "
                             "'all'. Got {} instead.".format(self.features))

        if sparse.issparse(X):
            # missing_values = 0 not allowed with sparse data as it would
This conversation was marked as resolved by btel

jeremiedbb Feb 26, 2019
Contributor

Since fit and transform both call _validate_input, we can move this check into that function to avoid redundant code.

btel Feb 26, 2019
Author Contributor

Good idea. Thanks!

            # force densification
This conversation was marked as resolved by btel

jeremiedbb Feb 25, 2019
Contributor

You also need to raise the same error in transform

            if self.missing_values == 0:
                raise ValueError("Imputation not possible when missing_values "
                                 "== 0 and input is sparse. Provide a dense "
This conversation was marked as resolved by btel

jeremiedbb Feb 26, 2019
Contributor

It's not Imputation here. Maybe something like "Sparse input with missing_values=0 is not supported."

btel Feb 26, 2019
Author Contributor

Ok, I changed it.

                                 "array instead.")

        if not ((isinstance(self.sparse, str) and
                self.sparse == "auto") or isinstance(self.sparse, bool)):
            raise ValueError("'sparse' has to be a boolean or 'auto'. "
@@ -1240,6 +1248,14 @@ def transform(self, X):
            raise ValueError("X has a different number of features "
                             "than during fitting.")

        if sparse.issparse(X):
            # missing_values = 0 not allowed with sparse data as it would
            # force densification
            if self.missing_values == 0:
                raise ValueError("Imputation not possible when missing_values "
                                 "== 0 and input is sparse. Provide a dense "
                                 "array instead.")

        imputer_mask, features = self._get_missing_features_info(X)

        if self.features == "missing-only":
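As a side note on jeremiedbb's suggestion above: a rough, hedged sketch of what moving the check into the shared `_validate_input` step could look like. `MissingIndicatorSketch` is a simplified stand-in, not the actual scikit-learn implementation, and the error message follows the wording proposed in the review.

```python
from scipy import sparse
from sklearn.utils import check_array


class MissingIndicatorSketch:
    """Simplified stand-in illustrating one shared validation path."""

    def __init__(self, missing_values=0):
        self.missing_values = missing_values

    def _validate_input(self, X):
        # Single place for the sparse + missing_values == 0 check, so that
        # fit and transform do not duplicate it.
        X = check_array(X, accept_sparse=('csc', 'csr'),
                        force_all_finite='allow-nan')
        if sparse.issparse(X) and self.missing_values == 0:
            raise ValueError("Sparse input with missing_values=0 is not "
                             "supported. Provide a dense array instead.")
        return X

    def fit(self, X, y=None):
        X = self._validate_input(X)
        # ... record which features contain missing values ...
        return self

    def transform(self, X):
        X = self._validate_input(X)
        # ... build and return the boolean indicator mask ...
        return X
```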
@@ -935,14 +935,21 @@ def test_missing_indicator_error(X_fit, X_trans, params, msg_err):


@pytest.mark.parametrize(
    "missing_values, dtype",
    [(np.nan, np.float64),
     (0, np.int32),
     (-1, np.int32)])
@pytest.mark.parametrize(
    "arr_type",
    [np.array, sparse.csc_matrix, sparse.csr_matrix, sparse.coo_matrix,
     sparse.lil_matrix, sparse.bsr_matrix])
    "missing_values, dtype, arr_type",
    [(np.nan, np.float64, np.array),
     (0, np.int32, np.array),
     (-1, np.int32, np.array),
     (np.nan, np.float64, sparse.csc_matrix),
     (-1, np.int32, sparse.csc_matrix),
     (np.nan, np.float64, sparse.csr_matrix),
     (-1, np.int32, sparse.csr_matrix),
     (np.nan, np.float64, sparse.coo_matrix),
     (-1, np.int32, sparse.coo_matrix),
     (np.nan, np.float64, sparse.lil_matrix),
     (-1, np.int32, sparse.lil_matrix),
     (np.nan, np.float64, sparse.bsr_matrix),
     (-1, np.int32, sparse.bsr_matrix)
     ])
@pytest.mark.parametrize(
    "param_features, n_features, features_indices",
    [('missing-only', 2, np.array([0, 1])),
@@ -992,11 +999,40 @@ def test_missing_indicator_new(missing_values, arr_type, dtype, param_features,
assert_allclose(X_trans_mask_sparse.toarray(), X_trans_mask)


@pytest.mark.parametrize("param_sparse", [True, False, 'auto'])
@pytest.mark.parametrize("missing_values", [np.nan, 0])
@pytest.mark.parametrize(
    "arr_type",
    [np.array, sparse.csc_matrix, sparse.csr_matrix, sparse.coo_matrix])
    [sparse.csc_matrix, sparse.csr_matrix, sparse.coo_matrix,
     sparse.lil_matrix, sparse.bsr_matrix])
def test_missing_indicator_raise_on_sparse_with_missing_0(arr_type):
    # test for sparse input and missing_value == 0

    missing_values = 0
    X_fit = np.array([[missing_values, missing_values, 1],
                      [4, missing_values, 2]])
    X_trans = np.array([[missing_values, missing_values, 1],
                        [4, 12, 10]])

    # convert the input to the right array format
    X_fit = arr_type(X_fit)
    X_trans = arr_type(X_trans)

    indicator = MissingIndicator(missing_values=missing_values)

    with pytest.raises(ValueError):
        indicator.fit_transform(X_fit)
    with pytest.raises(ValueError):
        indicator.transform(X_trans)

This conversation was marked as resolved by btel

jeremiedbb Feb 26, 2019
Contributor

I'm not sure the second part raises the right error. I think the error raised is a not-fitted-estimator error, since the fit above should have failed. To be sure, you can add a match parameter to pytest.raises with part of the error message you want to catch.

You'll then need to fit it on proper data to check the transform.

btel Feb 26, 2019
Author Contributor

Good catch! As you said, it was raising an "instance not fitted" exception instead of the "sparse with missing_values=0" one. I fixed the test.
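A hedged sketch of what the fixed test could look like after this exchange: pass `match=` so the sparse/missing_values error is the one being asserted, and fit on proper dense data before exercising `transform`. The match string, test name, and variable names are assumptions, not necessarily the final code in the PR.

```python
import numpy as np
import pytest
from scipy import sparse
from sklearn.impute import MissingIndicator


def test_raise_on_sparse_with_missing_0_sketch():
    X_fit = np.array([[0, 0, 1], [4, 0, 2]])
    X_trans = np.array([[0, 0, 1], [4, 12, 10]])

    indicator = MissingIndicator(missing_values=0)

    # match= ensures we catch the sparse/missing_values error rather than an
    # unrelated "estimator not fitted" error; "missing_values" appears in both
    # the old and the proposed message wording.
    with pytest.raises(ValueError, match="missing_values"):
        indicator.fit_transform(sparse.csr_matrix(X_fit))

    # Fit on dense data first, so that transform itself is what raises.
    indicator.fit(X_fit)
    with pytest.raises(ValueError, match="missing_values"):
        indicator.transform(sparse.csr_matrix(X_trans))
```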


@pytest.mark.parametrize("param_sparse", [True, False, 'auto'])
@pytest.mark.parametrize("missing_values, arr_type",
                         [(np.nan, np.array),
                          (0, np.array),
                          (np.nan, sparse.csc_matrix),
                          (np.nan, sparse.csr_matrix),
                          (np.nan, sparse.coo_matrix),
                          (np.nan, sparse.lil_matrix)
                          ])
def test_missing_indicator_sparse_param(arr_type, missing_values,
                                        param_sparse):
    # check the format of the output with different sparse parameter
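For reference, a short sketch of what the `sparse` parameter exercised by this test controls, based on the `MissingIndicator` documentation (the return types described in the comments are the documented ones, not verified against this specific commit):

```python
import numpy as np
from scipy import sparse
from sklearn.impute import MissingIndicator

X_dense = np.array([[np.nan, 1.0],
                    [2.0, np.nan]])
X_sparse = sparse.csr_matrix(X_dense)

# sparse=False: the indicator mask is always a dense ndarray.
mask_dense = MissingIndicator(sparse=False).fit_transform(X_sparse)

# sparse=True: the mask is always returned as a sparse matrix.
mask_sparse = MissingIndicator(sparse=True).fit_transform(X_dense)

# sparse='auto' (default): the mask mirrors the input container type.
mask_auto = MissingIndicator(sparse='auto').fit_transform(X_sparse)
```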