
[MRG] ENH: Ignore NaNs in StandardScaler and scale #11206

Merged: 29 commits into scikit-learn:master, Jun 21, 2018

Conversation

@glemaitre
Contributor

glemaitre commented Jun 5, 2018

Reference Issues/PRs

Towards #10404

Supersedes and closes #10457. Supersedes and closes #10618

What does this implement/fix? Explain your changes.

Any other comments?

TODO:

  • Handle the dense case
  • Handle the sparse case
  • Additional tests for the scale function
  • Additional tests for partial_fit
  • Unit testing of the helper functions in sparsefuncs
  • Backward compatibility of the attribute n_samples_seen_ -> #11235
  • Address initial comments from @ogrisel @jnothman
  • Address the NaN/infinity error message
  • Optional benchmark
@jnothman


Member

jnothman commented Jun 6, 2018

Yay! Why WIP? A todo list?

@jnothman

The things missing, as far as I can tell, are tests:

  • partial_fit / _incremental_mean_and_var with NaN
  • equivalence of scale and StandardScaler in the presence of NaNs

Perhaps also a quick benchmark of _incremental_mean_and_var where there are no NaNs. (I wonder, for instance, whether it's worth rewriting _incremental_mean_and_var in cython with nogil.)

@@ -656,7 +658,7 @@ def partial_fit(self, X, y=None):
         # First pass
         if not hasattr(self, 'n_samples_seen_'):
             self.mean_ = .0
-            self.n_samples_seen_ = 0
+            self.n_samples_seen_ = np.zeros(X.shape[1], dtype=np.int32)


@jnothman

jnothman Jun 6, 2018

Member

Am I being overly cautious if I worry that this changes public API? We could squeeze this back to a single value if np.ptp(n_samples_seen_) == 0 after each partial_fit


@ogrisel

ogrisel Jun 6, 2018

Member

I agree it's better to try to preserve backward compat, but maybe instead of using a data-dependent np.ptp check we could do the squeeze only when self.force_all_finite != 'allow-nan'. This way the user is more in control and it's more explicit.


@jnothman

jnothman Jun 6, 2018

Member

In other feature-wise preprocessing, we're allowing NaNs through by default, under the assumption that extra processing cost is negligible and that downstream estimators will deal with or complain about the presence of NaN. Thus self.force_all_finite does not exist.


@ogrisel

ogrisel Jun 11, 2018

Member

Indeed, I misread the code snippet. We don't have much choice but to use the np.ptp trick then.
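
As a minimal sketch of the np.ptp squeeze being discussed (a hypothetical standalone helper, not the code that was actually merged), the per-feature counts can be collapsed back to a single integer whenever every feature has seen the same number of samples:

    import numpy as np

    def maybe_squeeze_counts(n_samples_seen):
        # Hypothetical helper: keep backward compatibility by returning an int
        # when all per-feature counts are equal (np.ptp == max - min == 0),
        # and the per-feature array otherwise.
        n_samples_seen = np.asarray(n_samples_seen)
        if np.ptp(n_samples_seen) == 0:
            return int(n_samples_seen[0])
        return n_samples_seen

    # No NaNs anywhere: every column saw all 100 samples -> back-compatible int
    assert maybe_squeeze_counts(np.array([100, 100, 100])) == 100
    # NaNs in the second column: keep the per-feature counts
    assert maybe_squeeze_counts(np.array([100, 98, 100])).shape == (3,)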

@@ -25,6 +26,7 @@ def _get_valid_samples_by_column(X, col):
 @pytest.mark.parametrize(
     "est, support_sparse",
     [(MinMaxScaler(), False),
+     (StandardScaler(), False),
      (QuantileTransformer(n_quantiles=10, random_state=42), True)]
 )
 def test_missing_value_handling(est, support_sparse):


@jnothman

jnothman Jun 6, 2018

Member

Perhaps this should be extended for the partial_fit case?

     X = [[np.inf, 5, 6, 7, 8]]
     assert_raises_regex(ValueError,
-                        "Input contains NaN, infinity or a value too large",
+                        "Input contains infinity or a value too large",


@ogrisel

ogrisel Jun 6, 2018

Member

This is actually a weird message: what is a "value too large" if it's not infinity?


@ogrisel

ogrisel Jun 6, 2018

Member

Is it for when the values vary too much and computing the scale is not numerically possible (overflow)? If so, we should add a specific test for this.


@ogrisel

ogrisel Jun 6, 2018

Member

This is not the reason:

>>> from sklearn.preprocessing import scale
>>> import numpy as np
>>> data = np.array([np.finfo(np.float32).max, np.finfo(np.float32).min])
>>> scale(data)
/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/_methods.py:116: RuntimeWarning: overflow encountered in multiply
  x = um.multiply(x, x, out=x)
array([ 0., -0.], dtype=float32)

So I would just change the message to "Input contains infinity".


@glemaitre

glemaitre Jun 16, 2018

Contributor

Hmm, this is actually the error message from check_array. Changing it would require touching quite a lot of unrelated tests. Would it be wiser to do that in another PR?

glemaitre added some commits Jun 8, 2018

@jnothman


Member

jnothman commented Jun 9, 2018

Tests failing. Ping when you want reviews!

@glemaitre


Contributor

glemaitre commented Jun 9, 2018

@jnothman


Member

jnothman commented Jun 10, 2018

Tests pass on your system at 4d60bfe? I'm getting 4 failures in test_data.py

@jnothman


Member

jnothman commented Jun 10, 2018

(but no segfault)

                                        * X.shape[0])
                print(self.n_samples_seen_)
                counts_nan = sparse.csr_matrix(
                    (np.isnan(X.data), X.indices, X.indptr)).sum(


@jnothman

jnothman Jun 10, 2018

Member

this needs a shape kwarg to correctly infer the number of columns. That's the source of the errors for me.
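
A small illustration of why the shape kwarg matters here (illustrative code, not the PR's exact snippet): without it, csr_matrix infers the number of columns from max(indices) + 1, so trailing columns that happen to have no stored entries are silently dropped.

    import numpy as np
    from scipy import sparse

    # Column 2 holds only zeros, so it has no stored entries in CSR format.
    X = sparse.csr_matrix(np.array([[np.nan, 1.0, 0.0],
                                    [2.0,    0.0, 0.0]]))

    # Without shape, the NaN-mask matrix loses the empty trailing column.
    mask = sparse.csr_matrix((np.isnan(X.data), X.indices, X.indptr))
    print(mask.shape)                # (2, 2) -- one column short

    # Passing shape=X.shape keeps all columns, so the per-column NaN counts
    # line up with the features of X.
    mask = sparse.csr_matrix((np.isnan(X.data), X.indices, X.indptr),
                             shape=X.shape)
    print(mask.sum(axis=0))          # [[1 0 0]]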

glemaitre added some commits Jun 11, 2018

@jnothman

It would be good to see basic benchmarks of this. Should we be worried about runtime on scaling?

-                self.n_samples_seen_ = X.shape[0]
+                self.n_samples_seen_ = (np.ones(X.shape[1], dtype=np.int32)
+                                        * X.shape[0])
                 sparse_constr = (sparse.csr_matrix if X.format == 'csr'


@jnothman

jnothman Jun 11, 2018

Member

I wish we could just use X._with_data


@glemaitre

glemaitre Jun 11, 2018

Contributor

Yep, it could be nice to have something like that publicly exposed.

-        new_sample_count = X.shape[0]
+        new_sample_count = np.nansum(~np.isnan(X), axis=0)


@jnothman

jnothman Jun 11, 2018

Member

nansum of a bool array?
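
For illustration (not the PR code): the mask ~np.isnan(X) is boolean and cannot itself contain NaN, so a plain np.sum gives the same per-column counts as np.nansum.

    import numpy as np

    X = np.array([[1.0, np.nan],
                  [2.0, 3.0],
                  [np.nan, 4.0]])
    mask = ~np.isnan(X)
    # Both count the non-NaN samples in each column: [2 2]
    assert np.array_equal(np.sum(mask, axis=0), np.nansum(mask, axis=0))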

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

> It would be good to see basic benchmarks of this. Should we be worried about runtime on scaling?

I will do one now since both sparse and dense are supported. It will give us the big picture.

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

Dense matrices

[benchmark plots: fit, transform]

ratio fit time master / 0.19

n_samples  n_features
1000       10            2.160415
           100           1.570363
10000      10            1.440275
           100           2.281825
100000     10            2.154921
           100           1.279070
500000     10            1.517103
           100           1.133352
dtype: float64

Sparse matrices

[benchmark plots: fit_sparse, transform_sparse]

ratio fit time master / 0.19

n_samples  n_features
1000       10            2.856238
           100           3.638280
10000      10            2.764379
           100           1.899592
100000     10            2.425454
           100           1.826168
500000     10            2.237013
           100           1.264190
dtype: float64
@glemaitre


Contributor

glemaitre commented Jun 11, 2018

@jnothman Do you think that the regression for a low number of samples/features is a problem?

@jnothman


Member

jnothman commented Jun 11, 2018

why is fit faster? does that gain disappear with 1000 features? why is transform slower? isn't it unchanged?

@jnothman


Member

jnothman commented Jun 11, 2018

Actually I'm confused about your comparing 0.19 to master. Which is this PR?

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

> why is fit faster? does that gain disappear with 1000 features? why is transform slower? isn't it unchanged?

fit is actually slower due to the additional NaN check. As an example:

In [6]: X = np.random.random(10000000)

In [7]: %timeit np.mean(X)
5.54 ms ± 67.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit np.nanmean(X)
62.2 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The fluctuation could be due to the sample size.

transform could be faster since we avoid this check

@glemaitre


Contributor

glemaitre commented Jun 11, 2018

> Actually I'm confused about your comparing 0.19 to master. Which is this PR?

Sorry, I did not notice my mistake:

  • master refers to this PR
  • 0.19 refers to the last release
@jnothman


Member

jnothman commented Jun 12, 2018

Sorry, I misread the plots too, because the legend and bars are in opposite order (top to bottom vs bottom to top). So fit time goes from 6.5 s to 9 s on 5e5 samples with 100 features.

numpy.nanmean is currently implemented (inlining _replace_nan) as:

    arr = np.array(arr, subok=True, copy=True)

    if arr.dtype == np.object_:
        # object arrays do not support `isnan` (gh-9009), so make a guess
        mask = arr != arr
    elif issubclass(arr.dtype.type, np.inexact):
        mask = np.isnan(arr)
    else:
        mask = None

    if mask is not None:
        np.copyto(arr, 0, where=mask)
    else:
        return np.mean(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims)

    if dtype is not None:
        dtype = np.dtype(dtype)
    if dtype is not None and not issubclass(dtype.type, np.inexact):
        raise TypeError("If a is inexact, then dtype must be inexact")
    if out is not None and not issubclass(out.dtype.type, np.inexact):
        raise TypeError("If a is inexact, then out must be inexact")

    cnt = np.sum(~mask, axis=axis, dtype=np.intp, keepdims=keepdims)
    tot = np.sum(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
    avg = _divide_by_count(tot, cnt, out=out)

So this is designed to make the no-NaNs case a bit faster, but it involves a necessary copy and at least two passes (isnan; copyto) over the n_samples * n_features array before np.mean is performed. Then that work is duplicated for nanvar. I don't think mean is optimised beyond the use of np.add.reduce. Would it be worth considering a cython version of nansum/mean/var that only copies for computing the squared diff in var? Perhaps not for this PR, but later?
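
To make the duplicated work concrete, here is a rough numpy sketch (not the suggested Cython routine, and not code from this PR) of computing the per-column count, mean and variance while building the NaN mask only once, instead of letting np.nanmean and np.nanvar each repeat the isnan and copy passes:

    import numpy as np

    def nan_mean_var(X):
        # Share a single NaN mask between the count, mean and variance.
        mask = np.isnan(X)
        cnt = (~mask).sum(axis=0)
        Xz = np.where(mask, 0.0, X)                # one copy with NaNs zeroed
        mean = Xz.sum(axis=0) / cnt
        sq = np.where(mask, 0.0, (X - mean) ** 2)  # NaN slots contribute 0
        var = sq.sum(axis=0) / cnt
        return mean, var

    X = np.array([[1.0, np.nan], [2.0, 2.0], [3.0, 4.0]])
    mean, var = nan_mean_var(X)
    assert np.allclose(mean, np.nanmean(X, axis=0))
    assert np.allclose(var, np.nanvar(X, axis=0))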

@glemaitre


Contributor

glemaitre commented Jun 16, 2018

@jnothman I slightly touched the Cython code so that we support 64-bit indices, as done in #9663.
I tried to avoid any Python interaction by forcing some types when using fused types.
The only thing remaining in order to release the GIL would be to avoid calling zeros and zeros_like, so we could make the allocations in the public function instead of inside the private one. I am not sure this is worth it, however.

I still have to write a test for partial_fit, but I think this is almost ready for a full review.

@glemaitre glemaitre changed the title from [WIP] EHN: Ignore NaNs in StandardScaler and scale to [MRG] EHN: Ignore NaNs in StandardScaler and scale Jun 16, 2018

@glemaitre


Contributor

glemaitre commented Jun 16, 2018

An update on the benchmark

Dense matrices

Fit time ratio: PR / 0.19.1

n_samples  n_features
1000       10            1.554382
           100           1.093015
10000      10            1.410582
           100           2.049729
100000     10            1.942390
           100           1.113770
500000     10            1.288468
           100           1.032234
dtype: float64

Sparse matrices

Fit time ratio: PR / 0.19.1

n_samples  n_features
1000       10            2.141932
           100           1.681234
10000      10            1.610512
           100           1.461307
100000     10            1.469542
           100           1.306887
500000     10            1.345844
           100           1.528964
dtype: float64
@jnothman


Member

jnothman commented Jun 16, 2018

Benchmarks seem reasonable. Especially as absolute differences in fit time are small.

@jnothman jnothman changed the title from [MRG] EHN: Ignore NaNs in StandardScaler and scale to [MRG] ENH: Ignore NaNs in StandardScaler and scale Jun 16, 2018

@jnothman

Please also make sure to test the new n_samples_seen_ shape directly. Otherwise LGTM
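
For reference, a minimal sketch of the kind of direct shape check being asked for (the test name and exact assertions are my own, based on the documented behaviour below, not the test that ended up in the PR):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    def test_n_samples_seen_shape():
        X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

        # No missing values: n_samples_seen_ stays a backward-compatible integer.
        scaler = StandardScaler().fit(X)
        assert isinstance(scaler.n_samples_seen_, (int, np.integer))

        # With NaNs, it becomes one count per feature, shape (n_features,).
        X_nan = X.copy()
        X_nan[0, 0] = np.nan
        scaler = StandardScaler().fit(X_nan)
        assert scaler.n_samples_seen_.shape == (X.shape[1],)
        assert list(scaler.n_samples_seen_) == [2, 3]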

        new calls to fit, but increments across ``partial_fit`` calls.

    n_samples_seen_ : int or array, shape (n_features,)
        The number of samples processed by the estimator for each feature.
        If there is not missing samples, the ``n_samples_seen`` will be an


@jnothman

jnothman Jun 16, 2018

Member

is not -> are no

@jnothman jnothman added this to the 0.20 milestone Jun 16, 2018

@jnothman jnothman referenced this pull request Jun 16, 2018

Closed

Disregard NaNs in preprocessing #10404

6 of 7 tasks complete

glemaitre added some commits Jun 16, 2018

@glemaitre


Contributor

glemaitre commented Jun 18, 2018

@jorisvandenbossche

Not very familiar with this code, but I went through the diff and, apart from a few minor comments, it looks good to me.

@@ -800,6 +836,9 @@ class MaxAbsScaler(BaseEstimator, TransformerMixin):
     Notes
     -----
     NaNs are treated as missing values: disregarded in fit, and maintained in
     transform.


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

I don't think you updated MaxAbsScaler in this PR?

        # avoid division by 0
        non_zero_idx = last_sample_count > 0
        updated_unnormalized_variance[~non_zero_idx] = \
            new_unnormalized_variance[~non_zero_idx]


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

I think it would be more readable to just do the calculation on the full array, and then in the end do updated_unnormalized_variance[non_zero_idx] = 0

                last_over_new_count / updated_sample_count *
                (last_sum / last_over_new_count - new_sum) ** 2)
        new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count
        last_over_new_count = last_sample_count / new_sample_count


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

This can also already divide by zero, which will then give a RuntimeWarning from numpy; do we need to ignore this with an errstate?
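
A small illustration of the errstate guard being suggested (hypothetical values, not the PR's final code): silence the divide/invalid warnings for the element-wise division, since the zero-count entries are patched up afterwards anyway (cf. the non_zero_idx handling shown earlier).

    import numpy as np

    last_sample_count = np.array([2.0, 5.0, 4.0])
    new_sample_count = np.array([3.0, 0.0, 3.0])   # middle column: all-NaN batch

    with np.errstate(divide='ignore', invalid='ignore'):
        last_over_new_count = last_sample_count / new_sample_count

    print(last_over_new_count)   # [0.667  inf  1.333], no RuntimeWarning raised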

@@ -664,7 +664,7 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
     last_variance : array-like, shape: (n_features,)

-    last_sample_count : int
+    last_sample_count : array-like, shape (n_features,)


@jorisvandenbossche

jorisvandenbossche Jun 19, 2018

Contributor

The default is still 0, which will actually never work.
So I would simply remove all the defaults and make them all positional arguments (they are always used like that, so that shouldn't change anything for the rest).

@ogrisel

LGTM once @jorisvandenbossche's comments and the following are addressed.

        else:
            self.mean_, self.var_, self.n_samples_seen_ = \
                _incremental_mean_and_var(X, self.mean_, self.var_,
                                          self.n_samples_seen_)

        # for back-compatibility, reduce n_samples_seen_ to an integer if the


@ogrisel

ogrisel Jun 21, 2018

Member

typo: backward-compatibility

glemaitre added some commits Jun 21, 2018

@glemaitre


Contributor

glemaitre commented Jun 21, 2018

@ogrisel This is ready to be merged. The previous failure was PEP8.

@ogrisel ogrisel merged commit 5718466 into scikit-learn:master Jun 21, 2018

3 of 6 checks passed:

  • LGTM analysis: Python - running analyses for revisions
  • continuous-integration/appveyor/pr - waiting for AppVeyor build to complete
  • continuous-integration/travis-ci/pr - Travis CI build in progress
  • ci/circleci: deploy - tests passed on CircleCI
  • ci/circleci: python2 - tests passed on CircleCI
  • ci/circleci: python3 - tests passed on CircleCI
@ogrisel


Member

ogrisel commented Jun 21, 2018

Merged! Thanks @glemaitre 🍻
