
[MRG] ENH: Ignore NaNs in StandardScaler and scale #11206

Merged: 29 commits into scikit-learn:master, Jun 21, 2018

Conversation

@glemaitre
Contributor

glemaitre commented Jun 5, 2018

Reference Issues/PRs

Towards #10404

Supersedes and closes #10457. Supersedes and closes #10618

What does this implement/fix? Explain your changes.

Any other comments?

TODO:

  • Handle the dense case
  • Handle the sparse case
  • Additional tests for the scale function
  • Additional tests for partial_fit
  • Unit testing of the helper functions in sparsefuncs
  • Backward compatibility of the attribute n_samples_seen_ -> #11235
  • Address initial comments from @ogrisel @jnothman
  • Address the NaN/infinity error message
  • Optional benchmark
@jnothman


Member

jnothman commented Jun 6, 2018

Yay! Why WIP? A todo list?

@jnothman

The things missing, as far as I can tell, are tests:

  • partial_fit / _incremental_mean_and_var with NaN
  • equivalence of scale and StandardScaler in the presence of NaNs

Perhaps also a quick benchmark of _incremental_mean_and_var where there are no NaNs. (I wonder, for instance, whether it's worth rewriting _incremental_mean_and_var in cython with nogil.)

@@ -656,7 +658,7 @@ def partial_fit(self, X, y=None):
         # First pass
         if not hasattr(self, 'n_samples_seen_'):
             self.mean_ = .0
-            self.n_samples_seen_ = 0
+            self.n_samples_seen_ = np.zeros(X.shape[1], dtype=np.int32)


@jnothman

jnothman Jun 6, 2018

Member

Am I being overly cautious if I worry that this changes public API? We could squeeze this back to a single value if np.ptp(n_samples_seen_) == 0 after each partial_fit


@ogrisel

ogrisel Jun 6, 2018

Member

I agree it's better to try to preserve backward compat, but maybe instead of using a data-dependent np.ptp check we could do the squeeze only when self.force_all_finite != 'allow-nan'. This way the user is more in control and it's more explicit.


@jnothman

jnothman Jun 6, 2018

Member

In other feature-wise preprocessing, we're allowing NaNs through by default, under the assumption that extra processing cost is negligible and that downstream estimators will deal with or complain about the presence of NaN. Thus self.force_all_finite does not exist.


@ogrisel

ogrisel Jun 11, 2018

Member

Indeed, I misread the code snippet. We don't have much choice but to use the np.ptp trick then.
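
As a minimal sketch of the np.ptp squeeze being discussed (a hypothetical standalone helper, not the code that was actually merged), the per-feature counts can be collapsed back to a single integer whenever every feature has seen the same number of samples:

    import numpy as np

    def maybe_squeeze_counts(n_samples_seen):
        # Hypothetical helper: keep backward compatibility by returning an int
        # when all per-feature counts are equal (np.ptp == max - min == 0),
        # and the per-feature array otherwise.
        n_samples_seen = np.asarray(n_samples_seen)
        if np.ptp(n_samples_seen) == 0:
            return int(n_samples_seen[0])
        return n_samples_seen

    # No NaNs anywhere: every column saw all 100 samples -> back-compatible int
    assert maybe_squeeze_counts(np.array([100, 100, 100])) == 100
    # NaNs in the second column: keep the per-feature counts
    assert maybe_squeeze_counts(np.array([100, 98, 100])).shape == (3,)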

@@ -25,6 +26,7 @@ def _get_valid_samples_by_column(X, col):
 @pytest.mark.parametrize(
     "est, support_sparse",
     [(MinMaxScaler(), False),
+     (StandardScaler(), False),
      (QuantileTransformer(n_quantiles=10, random_state=42), True)]
 )
 def test_missing_value_handling(est, support_sparse):


@jnothman

jnothman Jun 6, 2018

Member

Perhaps this should be extended for the partial_fit case?

     X = [[np.inf, 5, 6, 7, 8]]
     assert_raises_regex(ValueError,
-                        "Input contains NaN, infinity or a value too large",
+                        "Input contains infinity or a value too large",


@ogrisel

ogrisel Jun 6, 2018

Member

This is actually a weird message: what is a "value too large" if it's not infinity?


@ogrisel

ogrisel Jun 6, 2018

Member

Is it for when the values vary too much and computing the scale is not numerically possible (overflow)? If so, we should add a specific test for this.


@ogrisel

ogrisel Jun 6, 2018

Member

This is not the reason:

>>> from sklearn.preprocessing import scale
>>> import numpy as np
>>> data = np.array([np.finfo(np.float32).max, np.finfo(np.float32).min])
>>> scale(data)
/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/_methods.py:116: RuntimeWarning: overflow encountered in multiply
  x = um.multiply(x, x, out=x)
array([ 0., -0.], dtype=float32)

So I would just change the message to "Input contains infinity".


@glemaitre

glemaitre Jun 16, 2018

Contributor

Hmm, this is actually the error message from check_array. Changing it would require touching quite a lot of unrelated tests. Would it be wiser to do that in another PR?

glemaitre added some commits Jun 8, 2018

@jnothman


Member

jnothman commented Jun 9, 2018

Tests failing. Ping when you want reviews!

@glemaitre


Contributor

glemaitre commented Jun 9, 2018

@jnothman


Member

jnothman commented Jun 10, 2018

Tests pass on your system at 4d60bfe? I'm getting 4 failures in test_data.py

@jnothman


Member

jnothman commented Jun 10, 2018

(but no segfault)

                                        * X.shape[0])
                print(self.n_samples_seen_)
                counts_nan = sparse.csr_matrix(
                    (np.isnan(X.data), X.indices, X.indptr)).sum(


@jnothman

jnothman Jun 10, 2018

Member

this needs a shape kwarg to correctly infer the number of columns. That's the source of the errors for me.
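
A small illustration of why the shape kwarg matters here (illustrative code, not the PR's exact snippet): without it, csr_matrix infers the number of columns from max(indices) + 1, so trailing columns that happen to have no stored entries are silently dropped.

    import numpy as np
    from scipy import sparse

    # Column 2 holds only zeros, so it has no stored entries in CSR format.
    X = sparse.csr_matrix(np.array([[np.nan, 1.0, 0.0],
                                    [2.0,    0.0, 0.0]]))

    # Without shape, the NaN-mask matrix loses the empty trailing column.
    mask = sparse.csr_matrix((np.isnan(X.data), X.indices, X.indptr))
    print(mask.shape)                # (2, 2) -- one column short

    # Passing shape=X.shape keeps all columns, so the per-column NaN counts
    # line up with the features of X.
    mask = sparse.csr_matrix((np.isnan(X.data), X.indices, X.indptr),
                             shape=X.shape)
    print(mask.sum(axis=0))          # [[1 0 0]]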

glemaitre added some commits Jun 11, 2018

@jnothman

It would be good to see basic benchmarks of this. Should we be worried about runtime on scaling?

-                self.n_samples_seen_ = X.shape[0]
+                self.n_samples_seen_ = (np.ones(X.shape[1], dtype=np.int32)
+                                        * X.shape[0])
                 sparse_constr = (sparse.csr_matrix if X.format == 'csr'


@jnothman

jnothman Jun 11, 2018

Member

I wish we could just use X._with_data


@glemaitre

glemaitre Jun 11, 2018

Contributor

Yep, it could be nice to have something like that publicly exposed.

-        new_sample_count = X.shape[0]
+        new_sample_count = np.nansum(~np.isnan(X), axis=0)


@jnothman

jnothman Jun 11, 2018

Member

nansum of a bool array?
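
For illustration (not the PR code): the mask ~np.isnan(X) is boolean and cannot itself contain NaN, so a plain np.sum gives the same per-column counts as np.nansum.

    import numpy as np

    X = np.array([[1.0, np.nan],
                  [2.0, 3.0],
                  [np.nan, 4.0]])
    mask = ~np.isnan(X)
    # Both count the non-NaN samples in each column: [2 2]
    assert np.array_equal(np.sum(mask, axis=0), np.nansum(mask, axis=0))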

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

> It would be good to see basic benchmarks of this. Should we be worried about runtime on scaling?

I will do one now since both sparse and dense are supported. It will give us the big picture.

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

Dense matrices

[benchmark plots: fit, transform]

ratio fit time master / 0.19

n_samples  n_features
1000       10            2.160415
           100           1.570363
10000      10            1.440275
           100           2.281825
100000     10            2.154921
           100           1.279070
500000     10            1.517103
           100           1.133352
dtype: float64

Sparse matrices

[benchmark plots: fit_sparse, transform_sparse]

ratio fit time master / 0.19

n_samples  n_features
1000       10            2.856238
           100           3.638280
10000      10            2.764379
           100           1.899592
100000     10            2.425454
           100           1.826168
500000     10            2.237013
           100           1.264190
dtype: float64
@glemaitre


Contributor

glemaitre commented Jun 11, 2018

@jnothman Do you think that the regression for a low number of samples/features is a problem?

@jnothman


Member

jnothman commented Jun 11, 2018

why is fit faster? does that gain disappear with 1000 features? why is transform slower? isn't it unchanged?

@jnothman


Member

jnothman commented Jun 11, 2018

Actually I'm confused about your comparing 0.19 to master. Which is this PR?

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

> why is fit faster? does that gain disappear with 1000 features? why is transform slower? isn't it unchanged?

fit is actually slower due to the additional NaN check. As an example:

In [6]: X = np.random.random(10000000)

In [7]: %timeit np.mean(X)
5.54 ms ± 67.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit np.nanmean(X)
62.2 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The fluctuation could be due to the sample size.

transform could be faster since we avoid this check

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

> Actually I'm confused about your comparing 0.19 to master. Which is this PR?

Sorry, I did not notice my mistake:

  • master refers to this PR
  • 0.19 refers to the last release
@jnothman


Member

jnothman commented Jun 12, 2018

Sorry, I misread the plots too, because the legend and bars are in opposite order (top to bottom vs bottom to top). So fit time goes from 6.5 s to 9 s on 5e5 samples with 100 features.

numpy.nanmean is currently implemented (inlining _replace_nan) as:

    arr = np.array(arr, subok=True, copy=True)

    if arr.dtype == np.object_:
        # object arrays do not support `isnan` (gh-9009), so make a guess
        mask = arr != arr
    elif issubclass(arr.dtype.type, np.inexact):
        mask = np.isnan(arr)
    else:
        mask = None

    if mask is not None:
        np.copyto(arr, 0, where=mask)
    else:
        return np.mean(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims)

    if dtype is not None:
        dtype = np.dtype(dtype)
    if dtype is not None and not issubclass(dtype.type, np.inexact):
        raise TypeError("If a is inexact, then dtype must be inexact")
    if out is not None and not issubclass(out.dtype.type, np.inexact):
        raise TypeError("If a is inexact, then out must be inexact")

    cnt = np.sum(~mask, axis=axis, dtype=np.intp, keepdims=keepdims)
    tot = np.sum(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
    avg = _divide_by_count(tot, cnt, out=out)

So this is designed to make the no-NaNs case a bit faster, but it involves a necessary copy and at least two passes (isnan; copyto) over the n_samples * n_features array before np.mean is performed. Then that work is duplicated for nanvar. I don't think mean is optimised beyond the use of np.add.reduce. Would it be worth considering a cython version of nansum/mean/var that only copies for computing the squared diff in var? Perhaps not for this PR, but later?
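
To make the duplicated work concrete, here is a rough numpy sketch (not the suggested Cython routine, and not code from this PR) of computing the per-column count, mean and variance while building the NaN mask only once, instead of letting np.nanmean and np.nanvar each repeat the isnan and copy passes:

    import numpy as np

    def nan_mean_var(X):
        # Share a single NaN mask between the count, mean and variance.
        mask = np.isnan(X)
        cnt = (~mask).sum(axis=0)
        Xz = np.where(mask, 0.0, X)                # one copy with NaNs zeroed
        mean = Xz.sum(axis=0) / cnt
        sq = np.where(mask, 0.0, (X - mean) ** 2)  # NaN slots contribute 0
        var = sq.sum(axis=0) / cnt
        return mean, var

    X = np.array([[1.0, np.nan], [2.0, 2.0], [3.0, 4.0]])
    mean, var = nan_mean_var(X)
    assert np.allclose(mean, np.nanmean(X, axis=0))
    assert np.allclose(var, np.nanvar(X, axis=0))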

@glemaitre


Contributor

glemaitre commented Jun 16, 2018

@jnothman I slightly touched the Cython code so that we support 64-bit indices, as done in #9663.
I tried to avoid any Python interaction by forcing some types when using fused types.
The only thing remaining in order to release the GIL would be to avoid calling zeros and zeros_like, so we could make the allocations in the public function instead of inside the private one. I am not sure this is worth it, however.

I still have to write a test for partial_fit, but I think this is almost ready for a full review.

@glemaitre glemaitre changed the title from [WIP] EHN: Ignore NaNs in StandardScaler and scale to [MRG] EHN: Ignore NaNs in StandardScaler and scale Jun 16, 2018

@glemaitre


Contributor

glemaitre commented Jun 16, 2018

An update on the benchmark

Dense matrices

Fit time ratio: PR / 0.19.1

n_samples  n_features
1000       10            1.554382
           100           1.093015
10000      10            1.410582
           100           2.049729
100000     10            1.942390
           100           1.113770
500000     10            1.288468
           100           1.032234
dtype: float64

Sparse matrices

Fit time ratio: PR / 0.19.1

n_samples  n_features
1000       10            2.141932
           100           1.681234
10000      10            1.610512
           100           1.461307
100000     10            1.469542
           100           1.306887
500000     10            1.345844
           100           1.528964
dtype: float64
@jnothman


Member

jnothman commented Jun 16, 2018

Benchmarks seem reasonable. Especially as absolute differences in fit time are small.

@jnothman jnothman changed the title from [MRG] EHN: Ignore NaNs in StandardScaler and scale to [MRG] ENH: Ignore NaNs in StandardScaler and scale Jun 16, 2018

@jnothman

Please also make sure to test the new n_samples_seen_ shape directly. Otherwise LGTM
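
For reference, a minimal sketch of the kind of direct shape check being asked for (the test name and exact assertions are my own, based on the documented behaviour below, not the test that ended up in the PR):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    def test_n_samples_seen_shape():
        X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

        # No missing values: n_samples_seen_ stays a backward-compatible integer.
        scaler = StandardScaler().fit(X)
        assert isinstance(scaler.n_samples_seen_, (int, np.integer))

        # With NaNs, it becomes one count per feature, shape (n_features,).
        X_nan = X.copy()
        X_nan[0, 0] = np.nan
        scaler = StandardScaler().fit(X_nan)
        assert scaler.n_samples_seen_.shape == (X.shape[1],)
        assert list(scaler.n_samples_seen_) == [2, 3]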

        new calls to fit, but increments across ``partial_fit`` calls.

    n_samples_seen_ : int or array, shape (n_features,)
        The number of samples processed by the estimator for each feature.
        If there is not missing samples, the ``n_samples_seen`` will be an


@jnothman

jnothman Jun 16, 2018

Member

is not -> are no

@jnothman jnothman added this to the 0.20 milestone Jun 16, 2018

@jnothman jnothman referenced this pull request Jun 16, 2018

Closed

Disregard NaNs in preprocessing #10404

6 of 7 tasks complete

glemaitre added some commits Jun 16, 2018

@glemaitre


Contributor

glemaitre commented Jun 18, 2018

@jorisvandenbossche

Not very familiar with this code, but I went through the diff and, apart from a few minor comments, it looks good to me.

@@ -800,6 +836,9 @@ class MaxAbsScaler(BaseEstimator, TransformerMixin):
     Notes
     -----
     NaNs are treated as missing values: disregarded in fit, and maintained in
     transform.


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

I don't think you updated MaxAbsScaler in this PR?

        # avoid division by 0
        non_zero_idx = last_sample_count > 0
        updated_unnormalized_variance[~non_zero_idx] = \
            new_unnormalized_variance[~non_zero_idx]


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

I think it would be more readable to just do the calculation on the full array, and then in the end do updated_unnormalized_variance[non_zero_idx] = 0

                last_over_new_count / updated_sample_count *
                (last_sum / last_over_new_count - new_sum) ** 2)
        new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count
        last_over_new_count = last_sample_count / new_sample_count


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

This can also already divide by zero, which will then give a RuntimeWarning from numpy; do we need to ignore this with an errstate?
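
A small illustration of the errstate guard being suggested (hypothetical values, not the PR's final code): silence the divide/invalid warnings for the element-wise division, since the zero-count entries are patched up afterwards anyway (cf. the non_zero_idx handling shown earlier).

    import numpy as np

    last_sample_count = np.array([2.0, 5.0, 4.0])
    new_sample_count = np.array([3.0, 0.0, 3.0])   # middle column: all-NaN batch

    with np.errstate(divide='ignore', invalid='ignore'):
        last_over_new_count = last_sample_count / new_sample_count

    print(last_over_new_count)   # [0.667  inf  1.333], no RuntimeWarning raised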

@@ -664,7 +664,7 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
     last_variance : array-like, shape: (n_features,)

-    last_sample_count : int
+    last_sample_count : array-like, shape (n_features,)


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

The default is still 0, which will actually never work.
So I would simply remove all the defaults and make them all positional arguments (they are always used like that, so that shouldn't change anything for the rest).

@ogrisel

LGTM once @jorisvandenbossche's comments and the following are addressed.

        else:
            self.mean_, self.var_, self.n_samples_seen_ = \
                _incremental_mean_and_var(X, self.mean_, self.var_,
                                          self.n_samples_seen_)

        # for back-compatibility, reduce n_samples_seen_ to an integer if the


@ogrisel

ogrisel Jun 21, 2018

Member

typo: backward-compatibility

glemaitre added some commits Jun 21, 2018

@glemaitre


Contributor

glemaitre commented Jun 21, 2018

@ogrisel This is ready to be merged. The previous failure was PEP8.

@ogrisel ogrisel merged commit 5718466 into scikit-learn:master Jun 21, 2018

3 of 6 checks passed:

  • LGTM analysis: Python - running analyses for revisions
  • continuous-integration/appveyor/pr - waiting for AppVeyor build to complete
  • continuous-integration/travis-ci/pr - Travis CI build in progress
  • ci/circleci: deploy - tests passed on CircleCI
  • ci/circleci: python2 - tests passed on CircleCI
  • ci/circleci: python3 - tests passed on CircleCI
@ogrisel


Member

ogrisel commented Jun 21, 2018

Merged! Thanks @glemaitre 🍻
