
[MRG] ENH: Ignore NaNs in StandardScaler and scale #11206

Merged
merged 29 commits into from Jun 21, 2018

Conversation

@glemaitre (Member) commented Jun 5, 2018

Reference Issues/PRs

Towards #10404

Supersedes and closes #10457. Supersedes and closes #10618

What does this implement/fix? Explain your changes.

Any other comments?

TODO:

@jnothman (Member) commented Jun 6, 2018

Yay! Why WIP? A todo list?

@jnothman (Member) left a comment:

The things missing, as far as I can tell, are tests:

  • partial_fit / _incremental_mean_and_var with NaN
  • equivalence of scale and StandardScaler in the presence of NaNs

Perhaps also a quick benchmark of _incremental_mean_and_var where there are no NaNs. (I wonder, for instance, whether it's worth rewriting _incremental_mean_and_var in Cython with nogil.)
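The suggested equivalence test might look roughly like this (a sketch with made-up data, not the PR's actual test code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Sketch of the suggested test: scale() and StandardScaler should agree
# when NaNs are present, and NaNs should pass through untouched.
rng = np.random.RandomState(42)
X = rng.randn(20, 3)
X[2, 1] = np.nan  # inject a missing value

X_func = scale(X)
X_est = StandardScaler().fit_transform(X)

assert np.allclose(X_func, X_est, equal_nan=True)
assert np.isnan(X_est[2, 1])  # NaN is maintained in transform
```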

@@ -656,7 +658,7 @@ def partial_fit(self, X, y=None):
# First pass
if not hasattr(self, 'n_samples_seen_'):
self.mean_ = .0
self.n_samples_seen_ = 0
self.n_samples_seen_ = np.zeros(X.shape[1], dtype=np.int32)
Member:

Am I being overly cautious if I worry that this changes public API? We could squeeze this back to a single value if np.ptp(n_samples_seen_) == 0 after each partial_fit
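The proposed squeeze could be sketched as follows (illustrative values; n_samples_seen_ is the per-feature count array introduced by this diff):

```python
import numpy as np

# Sketch of the proposed backward-compat squeeze: np.ptp (max - min)
# being 0 means every feature saw the same number of samples, so the
# per-feature array can collapse back to the old scalar form.
n_samples_seen_ = np.array([150, 150, 150], dtype=np.int32)
if np.ptp(n_samples_seen_) == 0:
    n_samples_seen_ = int(n_samples_seen_[0])

print(n_samples_seen_, type(n_samples_seen_))  # 150 <class 'int'>
```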

@ogrisel (Member) commented Jun 6, 2018:

I agree it's better to try to preserve backward compat but maybe instead of using a data dependent np.ptp check we could do the squeeze only when self.force_all_finite != 'allow-nan'. This way the user is more in control and it's more explicit.

Member:

In other feature-wise preprocessing, we're allowing NaNs through by default, under the assumption that extra processing cost is negligible and that downstream estimators will deal with or complain about the presence of NaN. Thus self.force_all_finite does not exist.

Member:

Indeed, I misread the code snippet. We don't have much choice but to use the np.ptp trick then.

@@ -25,6 +26,7 @@ def _get_valid_samples_by_column(X, col):
@pytest.mark.parametrize(
"est, support_sparse",
[(MinMaxScaler(), False),
(StandardScaler(), False),
(QuantileTransformer(n_quantiles=10, random_state=42), True)]
)
def test_missing_value_handling(est, support_sparse):
Member:

Perhaps this should be extended for the partial_fit case?
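Such a partial_fit extension might be sketched like this (made-up data and batching; not the PR's test):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sketch of the suggested extension: fitting NaN-containing data in
# batches via partial_fit should match a single fit on the full array.
rng = np.random.RandomState(0)
X = rng.randn(40, 3)
X[rng.randint(0, 40, 5), rng.randint(0, 3, 5)] = np.nan

est_full = StandardScaler().fit(X)
est_batch = StandardScaler()
for batch in np.array_split(X, 4):
    est_batch.partial_fit(batch)

assert np.allclose(est_full.mean_, est_batch.mean_)
assert np.allclose(est_full.var_, est_batch.var_)
```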

X = [[np.inf, 5, 6, 7, 8]]
assert_raises_regex(ValueError,
"Input contains NaN, infinity or a value too large",
"Input contains infinity or a value too large",
Member:

This is actually a weird message: what is a "value too large" if it's not infinity?

Member:

If the values vary too much and computing the scale is not numerically possible (overflow)? If so we should add a specific test for this.

Member:

This is not the reason:

>>> from sklearn.preprocessing import scale
>>> import numpy as np
>>> data = np.array([np.finfo(np.float32).max, np.finfo(np.float32).min])
>>> scale(data)
/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/_methods.py:116: RuntimeWarning: overflow encountered in multiply
  x = um.multiply(x, x, out=x)
array([ 0., -0.], dtype=float32)

So I would just change the message to "Input contains infinity".

Member Author:

Uhm, this is actually the error message of check_array. Changing this part would require touching quite a lot of tests that are not really related. Would it be wiser to do it in another PR?

@jnothman (Member) commented Jun 9, 2018

Tests failing. Ping when you want reviews!

@glemaitre (Member Author) commented Jun 9, 2018 via email

@jnothman (Member):

Tests pass on your system at 4d60bfe? I'm getting 4 failures in test_data.py

@jnothman (Member):

(but no segfault)

* X.shape[0])
print(self.n_samples_seen_)
counts_nan = sparse.csr_matrix(
(np.isnan(X.data), X.indices, X.indptr)).sum(
Member:

this needs a shape kwarg to correctly infer the number of columns. That's the source of the errors for me.
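The fix being pointed out can be sketched like this (illustrative matrix; the trailing all-zero column is what breaks the shape inference):

```python
import numpy as np
from scipy import sparse

# Counting NaNs per column of a CSR matrix by reusing its indices and
# indptr with a boolean data array. Without shape=X.shape, a trailing
# all-zero column would make the constructor infer too few columns.
X = sparse.csr_matrix(np.array([[np.nan, 1.0, 0.0],
                                [2.0, np.nan, 0.0]]))
counts_nan = sparse.csr_matrix(
    (np.isnan(X.data), X.indices, X.indptr),
    shape=X.shape).sum(axis=0)

print(counts_nan)  # [[1 1 0]] -- three columns, not two
```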

@jnothman (Member) left a comment:

It would be good to see basic benchmarks of this. Should we be worried about runtime on scaling?

self.n_samples_seen_ = X.shape[0]
self.n_samples_seen_ = (np.ones(X.shape[1], dtype=np.int32)
* X.shape[0])
sparse_constr = (sparse.csr_matrix if X.format == 'csr'
Member:

I wish we could just use X._with_data

Member Author:

Yep, it could be nice to have something like that publicly exposed.


new_sample_count = X.shape[0]
new_sample_count = np.nansum(~np.isnan(X), axis=0)
Member:

nansum of a bool array?
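The point being made: ~np.isnan(X) is already a boolean mask, so a plain np.sum suffices. A small illustration:

```python
import numpy as np

# ~np.isnan(X) is boolean, so np.sum (not np.nansum) is enough to
# count the non-NaN samples per feature.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])
new_sample_count = np.sum(~np.isnan(X), axis=0)
print(new_sample_count)  # [2 2]
```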

@glemaitre (Member Author):

It would be good to see basic benchmarks of this. Should we be worried about runtime on scaling?

I will do one now since both sparse and dense are supported. It will give us the big picture.

@glemaitre (Member Author) commented Jun 11, 2018

Dense matrices

[benchmark plots: fit and transform]

ratio fit time master / 0.19

n_samples  n_features
1000       10            2.160415
           100           1.570363
10000      10            1.440275
           100           2.281825
100000     10            2.154921
           100           1.279070
500000     10            1.517103
           100           1.133352
dtype: float64

Sparse matrices

[benchmark plots: fit_sparse and transform_sparse]

ratio fit time master / 0.19

n_samples  n_features
1000       10            2.856238
           100           3.638280
10000      10            2.764379
           100           1.899592
100000     10            2.425454
           100           1.826168
500000     10            2.237013
           100           1.264190
dtype: float64

@glemaitre (Member Author):

@jnothman Do you think that the regression for a low number of samples/features is a problem?

@jnothman (Member) commented Jun 11, 2018 via email

@jnothman (Member):

Actually I'm confused about your comparing 0.19 to master. Which one is this PR?

@glemaitre (Member Author):

why is fit faster? does that gain disappear with 1000 features? why is transform slower? isn't it unchanged?

fit is actually slower due to the additional NaN check. As an example:

In [6]: X = np.random.random(10000000)

In [7]: %timeit np.mean(X)
5.54 ms ± 67.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit np.nanmean(X)
62.2 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The fluctuation could be due to the sample size.

transform could be faster since we avoid this check.

@glemaitre (Member Author) commented Jun 11, 2018

Actually I'm confused about your comparing 0.19 to master. Which is this pr?

Sorry, I did not see my mistake:

  • master refers to this PR
  • 0.19 refers to the last release

@glemaitre (Member Author):

@jnothman I slightly touched the Cython code so that we support 64-bit indices, as done in #9663.
I tried to avoid any Python interaction by forcing some types using fused types.
The only thing remaining to be able to release the GIL would be to avoid calling zeros and zeros_like. We could do the allocation outside of the private function and always call the public one instead. I am not sure it is worth it, however.

I still have to make a test for partial_fit but I think that this is almost ready for a full review.

@glemaitre glemaitre changed the title [WIP] EHN: Ignore NaNs in StandardScaler and scale [MRG] EHN: Ignore NaNs in StandardScaler and scale Jun 16, 2018
@glemaitre (Member Author):

An update on the benchmark

Dense matrices

Fit time ratio: PR / 0.19.1

n_samples  n_features
1000       10            1.554382
           100           1.093015
10000      10            1.410582
           100           2.049729
100000     10            1.942390
           100           1.113770
500000     10            1.288468
           100           1.032234
dtype: float64

Sparse matrices

Fit time ratio: PR / 0.19.1

n_samples  n_features
1000       10            2.141932
           100           1.681234
10000      10            1.610512
           100           1.461307
100000     10            1.469542
           100           1.306887
500000     10            1.345844
           100           1.528964
dtype: float64

@jnothman (Member):

Benchmarks seem reasonable. Especially as absolute differences in fit time are small.

@jnothman jnothman changed the title [MRG] EHN: Ignore NaNs in StandardScaler and scale [MRG] ENH: Ignore NaNs in StandardScaler and scale Jun 16, 2018
@jnothman (Member) left a comment:

Please also make sure to test the new n_samples_seen_ shape directly. Otherwise LGTM

new calls to fit, but increments across ``partial_fit`` calls.
n_samples_seen_ : int or array, shape (n_features,)
The number of samples processed by the estimator for each feature.
If there is not missing samples, the ``n_samples_seen`` will be an
Member:

is not -> are no

@jnothman jnothman added this to the 0.20 milestone Jun 16, 2018
@jnothman jnothman mentioned this pull request Jun 16, 2018
7 tasks
@glemaitre (Member Author):

@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
I would appreciate reviews.

@jorisvandenbossche (Member) left a comment:

Not very familiar with this code, but I went through the diff and, apart from a few minor comments, it looks good to me.

@@ -800,6 +836,9 @@ class MaxAbsScaler(BaseEstimator, TransformerMixin):

Notes
-----
NaNs are treated as missing values: disregarded in fit, and maintained in
transform.
Member:

I don't think you updated MaxAbsScaler in this PR?

# avoid division by 0
non_zero_idx = last_sample_count > 0
updated_unnormalized_variance[~non_zero_idx] =\
new_unnormalized_variance[~non_zero_idx]
Member:

I think it would be more readable to just do the calculation on the full array, and then in the end do updated_unnormalized_variance[non_zero_idx] = 0

last_over_new_count / updated_sample_count *
(last_sum / last_over_new_count - new_sum) ** 2)
new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count
last_over_new_count = last_sample_count / new_sample_count
Member:

This can also divide by zero, which will then give a RuntimeWarning from numpy; do we need to ignore this with an errstate?
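The suggested errstate guard could be sketched like this (array values are made up for illustration):

```python
import numpy as np

# Sketch of the suggested guard: suppress numpy's divide-by-zero and
# invalid-value RuntimeWarnings when some counts are zero.
last_sample_count = np.array([0, 5, 10])
new_sample_count = np.array([3, 0, 2])

with np.errstate(divide='ignore', invalid='ignore'):
    last_over_new_count = last_sample_count / new_sample_count

# 0/3 -> 0.0, 5/0 -> inf (warning suppressed), 10/2 -> 5.0
print(last_over_new_count)
```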

@@ -664,7 +664,7 @@ def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,

last_variance : array-like, shape: (n_features,)

last_sample_count : int
last_sample_count : array-like, shape (n_features,)
Member:

The default is still 0, which will actually never work.
So I would simply remove all defaults and make it all positional keywords (they are always used like that, so that shouldn't change anything for the rest)

@ogrisel (Member) left a comment:

LGTM once @jorisvandenbossche comments and the following are addressed.

else:
self.mean_, self.var_, self.n_samples_seen_ = \
_incremental_mean_and_var(X, self.mean_, self.var_,
self.n_samples_seen_)

# for back-compatibility, reduce n_samples_seen_ to an integer if the
Member:

typo: backward-compatibility

@glemaitre (Member Author):

@ogrisel This is ready to be merged. The previous failure was PEP8.

@ogrisel ogrisel merged commit 5718466 into scikit-learn:master Jun 21, 2018
@ogrisel (Member) commented Jun 21, 2018

Merged! Thanks @glemaitre 🍻
