
[MRG] Ignore and pass-through NaNs in RobustScaler and robust_scale #11308

Merged

16 commits merged into scikit-learn:master from glemaitre:nan_robust_scaler on Jul 5, 2018

6 participants
@glemaitre
Contributor

glemaitre commented Jun 18, 2018

Reference Issues/PRs

Toward #10404

Merge #11011 first, since it adds better coverage in the common tests.

What does this implement/fix? Explain your changes.

TODO:

  • RobustScaler should handle sparse matrices better.

Any other comments?

@glemaitre glemaitre force-pushed the glemaitre:nan_robust_scaler branch from 37f7cb5 to b1c5a21 Jun 18, 2018

glemaitre added some commits Jun 18, 2018

fix

@glemaitre glemaitre changed the title [WIP] Ignore and pass-through NaNs in RobustScaler and robust_scale [MRG] Ignore and pass-through NaNs in RobustScaler and robust_scale Jun 18, 2018

@jnothman jnothman referenced this pull request Jun 18, 2018

Closed

Disregard NaNs in preprocessing #10404

6 of 7 tasks complete

@glemaitre glemaitre added this to the 0.20 milestone Jun 18, 2018

@jnothman
Member

jnothman left a comment

Your sparse adaptation doesn't work. The qth quantile of non-zero values is not the qth quantile of values including (majority) zeros.
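A minimal numpy illustration of the reviewer's point (the values are made up): for a mostly-zero column, the quantile computed over only the stored non-zero entries can be very different from the quantile over the full column.

```python
import numpy as np

# A sparse-ish column: 8 implicit zeros and 2 stored non-zero values.
column = np.array([0, 0, 0, 0, 0, 0, 0, 0, 5.0, 7.0])
stored = column[column != 0]

# The median over all 10 values (zeros included) is 0 ...
full_median = np.percentile(column, 50)    # 0.0
# ... but the median of only the stored values is very different.
stored_median = np.percentile(stored, 50)  # 6.0
```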

@glemaitre

Contributor Author

glemaitre commented Jun 18, 2018

Your sparse adaptation doesn't work. The qth quantile of non-zero values is not the qth quantile of values including (majority) zeros.

Oh, so I should materialize the zeros.

@glemaitre

Contributor Author

glemaitre commented Jun 18, 2018

I think just leave it as not handling the sparse case. The IQR is only likely to be > 0 once density > 20%. So for many sparse matrices, quantiles are not relevant.

In this case, it should be mentioned that robust_scale does not work on sparse matrices.
It was actually my original issue :)

@glemaitre

Contributor Author

glemaitre commented Jun 18, 2018

I think just leave it as not handling the sparse case. The IQR is only likely to be > 0 once density > 20%. So for many sparse matrices, quantiles are not relevant.

Actually, I have second thoughts. More specifically, we should deprecate transform support for sparse matrices. I don't know whether that is worthwhile, or whether we should instead compute the IQR properly (in case some people use sparse matrices with a large enough density).

glemaitre added some commits Jun 18, 2018

@@ -905,6 +951,20 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


def test_robust_scaler_equivalence_dense_sparse():
    # Check the equivalence of the fitting with dense and sparse matrices
    X_sparse = sparse.rand(1000, 5, density=0.5).tocsc()

@jnothman

jnothman Jun 18, 2018

Member

To avoid regressions, please make sure this works with columns that are all-positive, all-negative, all-zero, a mix of positive, negative and zero, and all-nonzero. Check with a couple of densities and also integer dtypes.

@glemaitre

glemaitre Jun 21, 2018

Author Contributor

NaNs with integer dtype should not work.
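As a quick illustration of why (a sketch, not code from the PR): NaN is a floating-point value, so a numpy array with an integer dtype cannot hold it.

```python
import numpy as np

x = np.array([1, 2, 3])   # integer dtype: no bit pattern represents NaN
try:
    x[0] = np.nan         # numpy rejects the assignment
    nan_assigned = True
except ValueError as exc:
    nan_assigned = False
    print(exc)            # e.g. "cannot convert float NaN to integer"
```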

@jorisvandenbossche
Contributor

jorisvandenbossche left a comment

Reminder: update transform docstring

@sklearn-lgtm


sklearn-lgtm commented Jun 21, 2018

This pull request introduces 1 alert when merging de147a4 into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@glemaitre

Contributor Author

glemaitre commented Jun 21, 2018

@jnothman I added a test for different cases of density and signs of the matrix.

@sklearn-lgtm


sklearn-lgtm commented Jun 21, 2018

This pull request introduces 1 alert when merging f884532 into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

for feature_idx in range(X.shape[1]):
    if sparse.issparse(X):
        column_nnz_data = X.data[X.indptr[feature_idx]:
                                 X.indptr[feature_idx + 1]]

@jnothman

jnothman Jun 21, 2018

Member

I assume you realise that there should be a more asymptotically efficient way to handle the sparse case, as it should be easy to work out whether a percentile is zero, positive or negative, then adjust the quantile parameter...

But this is fine in the first instance.
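The more efficient approach hinted at above can be sketched as follows; the function name and the use of the 'lower' percentile method are illustrative choices, not anything from the PR. The idea is to sort only the stored values, count the implicit zeros, and locate the target rank among (negatives | zeros | positives) without ever materializing the zeros.

```python
import numpy as np

def sparse_column_percentile(stored, n_rows, q):
    """Percentile of a sparse column without materializing its zeros.

    stored: the stored (non-zero) values of one column.
    n_rows: total number of rows (stored values + implicit zeros).
    q:      percentile in [0, 100].

    Uses the 'lower' method (take the floor rank) for simplicity;
    real code would interpolate between neighboring ranks.
    """
    stored = np.asarray(stored, dtype=float)
    neg = np.sort(stored[stored < 0])
    pos = np.sort(stored[stored > 0])
    n_zeros = n_rows - neg.size - pos.size   # implicit + stored zeros

    # Rank of the target element in the (virtual) sorted full column,
    # which is laid out as: negatives, then zeros, then positives.
    rank = int(np.floor(q / 100.0 * (n_rows - 1)))
    if rank < neg.size:
        return neg[rank]
    if rank < neg.size + n_zeros:
        return 0.0
    return pos[rank - neg.size - n_zeros]
```

For a 90% sparse column this sorts only the stored 10% of the values instead of all n_rows entries.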

@glemaitre

glemaitre Jun 22, 2018

Author Contributor

To be honest, I don't know about it.

@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])

@jnothman

jnothman Jun 21, 2018

Member

This is now a bit excessive.

    # Check the equivalence of the fitting with dense and sparse matrices
    X_sparse = sparse.rand(1000, 5, density=density).tocsc()
    if strictly_signed == 'positif':
        X_sparse.data += X_sparse.min()

@jnothman

jnothman Jun 21, 2018

Member

This won't make it positive. Subtracting the min will. So will abs.
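A quick check of this point (illustrative only; variable names are made up): adding the minimum to data with mixed signs keeps negative entries negative, while subtracting the minimum, or taking the absolute value, yields non-negative data.

```python
import numpy as np
from scipy import sparse

X = sparse.rand(10, 3, density=0.5, random_state=0).tocsc()
X.data -= 0.5                 # force a mix of signs in the stored values

bad = X.copy()
bad.data += bad.data.min()    # still contains negative values

good = X.copy()
good.data -= good.data.min()  # now all stored values are >= 0

also_good = np.abs(X.data)    # abs works too
```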

@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
@pytest.mark.parametrize("strictly_signed",
                         ['positif', 'negatif', 'zeros', None])

@ogrisel

ogrisel Jun 22, 2018

Member

typo: positive, negative

@sklearn-lgtm


sklearn-lgtm commented Jun 22, 2018

This pull request introduces 1 alert when merging 7f22b3f into 93382cc - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@sklearn-lgtm


sklearn-lgtm commented Jun 25, 2018

This pull request introduces 1 alert when merging 05a23f6 into eec7649 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@jnothman
Member

jnothman left a comment

Otherwise LGTM

@@ -274,6 +274,10 @@ Preprocessing
- :class:`preprocessing.PowerTransformer` and
  :func:`preprocessing.power_transform` ignore and pass-through NaN values.
  :issue:`11306` by :user:`Guillaume Lemaitre <glemaitre>`.
- :class:`preprocessing.RobustScaler` and :func:`preprocessing.robust_scale`
  ignore and pass-through NaN values. In addition, the scaler can now be fitted

@jnothman

jnothman Jun 26, 2018

Member

I think the second comment is a separate enhancement and should be reported separately.

I would like to see the "ignore and pass through NaN values" what's new entries merged into one (not necessarily in this PR).

        self.scale_ = (q[1] - q[0])
        quantiles.append(
            nanpercentile(column_data, self.quantile_range)
            if column_data.size else [0, 0])

@jnothman

jnothman Jun 26, 2018

Member

I think you no longer need this if column_data.size check?

@glemaitre

glemaitre Jun 26, 2018

Author Contributor

Nope, because it returns NaN otherwise (which is consistent with the numpy function):

In [1]: import numpy as np

In [2]: from sklearn.utils.fixes import nanpercentile

In [3]: nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]: nan

In [4]: np.nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[4]: nan

@jnothman

jnothman Jun 26, 2018

Member

But when would you encounter 0 samples after check_array?

@glemaitre

glemaitre Jun 27, 2018

Author Contributor

With a sparse matrix that has an all-zero column:

In [1]: from scipy import sparse

In [3]: from sklearn.utils import check_array

In [4]: X = sparse.rand(10, 2, density=0).tocsr()

In [5]: check_array(X, accept_sparse=True)
Out[5]: 
<10x2 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [6]: X.data
Out[6]: array([], dtype=float64)

@jnothman

jnothman Jun 27, 2018

Member

Yes, but you're now only ever passing in something with shape X.shape[0]

@glemaitre

glemaitre Jun 27, 2018

Author Contributor

Oh sorry, I see it now:

column_data = np.zeros(shape=X.shape[0], dtype=X.dtype)

I missed this line; I was thinking that we were passing the non-zero data, which can be an empty array.

Making the changes then.
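The approach converged on in this thread can be sketched end to end; the function name and layout below are illustrative, not the PR's actual code. Each column's stored values are scattered into a zero-initialized buffer of length n_rows so that nanpercentile sees the implicit zeros, and the per-column results are stacked.

```python
import numpy as np
from scipy import sparse

def column_quantiles(X, quantile_range=(25.0, 75.0)):
    """Per-column quantiles of a sparse matrix, materializing implicit zeros.

    Sketch only: real code would also handle dense input and NaN-aware
    backports of nanpercentile for old numpy versions.
    """
    X = X.tocsc()
    quantiles = []
    for feature_idx in range(X.shape[1]):
        # Stored (non-zero) values of this column.
        column_nnz_data = X.data[X.indptr[feature_idx]:
                                 X.indptr[feature_idx + 1]]
        # Materialize the implicit zeros in a full-length buffer.
        column_data = np.zeros(shape=X.shape[0], dtype=X.dtype)
        column_data[:len(column_nnz_data)] = column_nnz_data
        quantiles.append(np.nanpercentile(column_data, quantile_range))
    return np.transpose(quantiles)

# An all-zero column and a fully dense column.
X = sparse.csc_matrix(np.array([[0., 1.], [0., 2.], [0., 3.], [0., 4.]]))
q = column_quantiles(X)
scale = q[1] - q[0]   # per-column IQR; zero for the all-zero column
```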


quantiles = np.transpose(quantiles)

self.scale_ = (quantiles[1] - quantiles[0])

@jnothman

jnothman Jun 26, 2018

Member

rm parentheses

@glemaitre glemaitre force-pushed the glemaitre:nan_robust_scaler branch from 05a23f6 to eecb39a Jun 26, 2018

@sklearn-lgtm


sklearn-lgtm commented Jun 26, 2018

This pull request introduces 1 alert when merging eecb39a into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

glemaitre added some commits Jun 27, 2018

@sklearn-lgtm


sklearn-lgtm commented Jun 27, 2018

This pull request introduces 1 alert when merging 86cf707 into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@jnothman

Member

jnothman commented Jun 27, 2018

Have you checked that unused import?

@jnothman

Member

jnothman commented Jun 28, 2018

@ogrisel, good to merge?

@ogrisel ogrisel merged commit f8adfa2 into scikit-learn:master Jul 5, 2018

6 checks passed

LGTM analysis: Python 1 new alert
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ogrisel

Member

ogrisel commented Jul 5, 2018

Merged! Thanks, @glemaitre!

        check_is_fitted(self, 'center_', 'scale_')
        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                        estimator=self, dtype=FLOAT_DTYPES,
                        force_all_finite='allow-nan')

@mrocklin

mrocklin Jul 5, 2018

FYI this affected dask-ml. Previously the logic in this transformer was equally applicable to numpy and dask arrays. Now it auto-converts dask arrays to numpy arrays.
