
MAINT: stats: replace np.var with _moment(..., 2) to warn on constant input #16055

Merged
merged 1 commit into scipy:main on Apr 30, 2022

Conversation

@mdhaber (Contributor) commented on Apr 27, 2022

Reference issue

Closes gh-14418
Follow-up to gh-15905

What does this implement/fix?

gh-14418 reported that ttest_ind gives unreliable results when the input arrays are constant (i.e., they contain only one unique value). Ultimately, this is due to catastrophic cancellation when calculating the variance. This PR resolves the issue by using _moment to calculate the variance; _moment warns when the input data are identical or nearly identical.

>>> from scipy.stats import ttest_ind
>>> ttest_ind([0.04]*10, [0.04]*30)
C:\Users\matth\AppData\Local\Temp\ipykernel_50616\217009264.py:1: RuntimeWarning: Precision loss occurred in 
moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results
may be unreliable.
  ttest_ind([0.04]*10, [0.04]*30)
Ttest_indResult(statistic=-5.338539126015656, pvalue=4.591739995593669e-06)
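For reference, the precision loss on main can be reproduced directly with NumPy. A minimal sketch (exact values are platform-dependent):

```python
import numpy as np

x = np.full(10, 0.04)
# 0.04 is not exactly representable in binary floating point, and the computed
# mean need not equal the stored value exactly, so the squared deviations can
# be tiny but nonzero instead of exactly zero.
print(np.mean(x) == x[0])  # may print False
print(np.var(x))           # may print a tiny nonzero value rather than 0.0
```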

Additional information

Do we need new unit tests, or is the addition of pytest.warns to the existing unit tests OK?

I went ahead and applied the change not only to the t-tests but also to describe.

_moment is slower than np.var. For the example above:

%timeit ttest_ind([0.04]*10, [0.04]*30)  # 85.1 µs ± 472 ns per loop on main, 166 µs ± 1.52 µs per loop in PR

For a random array of size 10000, ttest_1samp takes ~67 µs on main and ~114 µs with this PR.
For a random array of size 100000, ttest_1samp takes ~216 µs on main and ~335 µs with this PR.

It might be possible to recover some of the time by performing the data constancy check separately, then using np.var to calculate the variance. We could also optimize a bit by using the mean, which already needs to be calculated outside of _var. (The difficulty with that is that we need the mean with keepdims=True.)
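For illustration, a rough sketch of that alternative; the helper name, the tolerance eps, and the warning text are all placeholders, not SciPy's actual implementation:

```python
import warnings
import numpy as np

def _var_with_check(a, axis=0, ddof=0, eps=1e-14):
    # Hypothetical sketch: check for nearly identical data along `axis`
    # first, then defer the actual computation to np.var.
    a = np.asarray(a)
    mean = np.mean(a, axis=axis, keepdims=True)  # keepdims for broadcasting
    with np.errstate(divide='ignore', invalid='ignore'):
        # largest relative deviation from the mean along the axis
        rel_dev = (np.max(np.abs(a - mean), axis=axis)
                   / np.abs(np.squeeze(mean, axis=axis)))
    if np.any(rel_dev < eps):
        warnings.warn("Precision loss occurred in moment calculation due to "
                      "catastrophic cancellation.", RuntimeWarning)
    return np.var(a, axis=axis, ddof=ddof)
```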

@mdhaber added the defect (a clear bug or issue that prevents SciPy from being installed or used as expected) and scipy.stats labels on Apr 27, 2022
@mdhaber changed the title from “MAINT: stats: replace np.var with _var to warn on constant input” to “MAINT: stats: replace np.var with _moment(..., 2) to warn on constant input” on Apr 27, 2022
@tirthasheshpatel (Member) commented

> Do we need new unit tests, or is the addition of pytest.warns to the existing unit tests OK?

I think it's OK to just add pytest.warns in the existing tests.

@@ -1339,7 +1349,7 @@ def describe(a, axis=0, ddof=1, bias=True, nan_policy='propagate'):
     n = a.shape[axis]
     mm = (np.min(a, axis=axis), np.max(a, axis=axis))
     m = np.mean(a, axis=axis)
-    v = np.var(a, axis=axis, ddof=ddof)
+    v = _var(a, axis=axis, ddof=ddof)
@tirthasheshpatel (Member) commented on Apr 28, 2022

> We could also optimize a bit by using the mean, which already needs to be calculated outside of _var. (The difficulty with that is that we need the mean with keepdims=True.)

I guess we can just do:

Suggested change:
-v = _var(a, axis=axis, ddof=ddof)
+v = _var(a, mean=np.expand_dims(m, axis=axis), axis=axis, ddof=ddof)

@tirthasheshpatel (Member)

It doesn't make a big difference though.

@mdhaber (Contributor, Author)

I guess I can just add the check like there is in _moment, then calculate the variance with np.var. Do you think that's the way to go?

@tirthasheshpatel (Member)

Yes, sounds good!

@mdhaber (Contributor, Author) commented on Apr 29, 2022

That doesn't help :/ The check itself is what takes most of the time, and I don't see how to speed it up. (Taking the maximum first is indeed faster than doing the division for all elements; see the snippet below.)
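That is, the two mathematically equivalent forms of the relative-tolerance check, shown here for the 1-D case with hypothetical data and a placeholder tolerance eps:

```python
import numpy as np

a = np.random.default_rng(0).normal(size=10_000)
mean, eps = a.mean(), 1e-14

# Divides every element before comparing: one division per element.
check_all = np.all(np.abs(a - mean) / np.abs(mean) < eps)

# Takes the maximum first: a single division after the O(n) reduction.
check_max = np.max(np.abs(a - mean)) / np.abs(mean) < eps
```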

@tirthasheshpatel (Member)

We can leave it as it is. I don't think the slowdown is too bad.

@tirthasheshpatel (Member) left a review comment

LGTM! I can merge if you don't mind the slowdown.

@mdhaber (Contributor, Author) commented on Apr 29, 2022

@tupui @chrisb83 what do you think?

gh-15905 added a reasonable (I think) way to fix problems reported for several stats functions when all data along an axis slice are identical or nearly identical (e.g. gh-15554, gh-14418, gh-13245, gh-11086, gh-10896). It is not a complex calculation, but for simple functions, the overhead is substantial: stats.moment takes ~50% longer, and the t-tests take ~80% longer. For data sets that fit in memory, the time to perform these functions is still quite small (e.g. for an ~1 GB array of doubles, the functions still take only ~1.25 s to run on my machine). This could have a practical impact for resampling methods, though.

Update: We could probably speed things up a bit by avoiding the division by the mean; instead, the tolerance eps could be multiplied by the mean. (And of course, we could reuse the pre-calculated mean as discussed above, but as noted, that isn't the time-consuming part.) None of this is helping much, so I think I want to leave this as-is; squeezing out a bit more performance would be a follow-up.
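The rearrangement mentioned above, continuing the earlier 1-D sketch (names and tolerance are illustrative):

```python
import numpy as np

a = np.full(10, 0.04)
mean, eps = a.mean(), 1e-14

# Division-free form: multiply the tolerance by the mean rather than
# dividing every deviation (or even the max) by it.
degenerate = np.max(np.abs(a - mean)) < eps * np.abs(mean)
```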

Do we value performance at the expense of quietly returning a bogus answer for degenerate data?

@tupui this reminds me of your suggestion for some sort of standard way to bypass input validation and such e.g. warp_speed=True : P

@mdhaber (Contributor, Author) commented on Apr 30, 2022

@tirthasheshpatel I'm thinking we should go for it. If we really want to improve the performance, a Pythran version of the precision loss check with an explicit loop could (usually) quickly detect when data is not degenerate. (If any of the differences between the data and the mean exceed the tolerance, execution can continue.)
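A hypothetical sketch of such a Pythran kernel; the name _data_is_degenerate and the tolerance handling are illustrative, not part of this PR:

```python
# pythran export _data_is_degenerate(float64[:], float64, float64)
def _data_is_degenerate(a, mean, eps):
    # Explicit loop with early exit: as soon as one element deviates from
    # the mean by more than the tolerance, the data are known not to be
    # degenerate and the scan can stop.
    tol = eps * abs(mean)
    for x in a:
        if abs(x - mean) > tol:
            return False
    return True
```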

@tirthasheshpatel (Member) commented

> If we really want to improve the performance, a Pythran version of the precision loss check with an explicit loop could (usually) quickly detect when data is not degenerate. (If any of the differences between the data and the mean exceed the tolerance, execution can continue.)

We can do that in a follow-up.

> I'm thinking we should go for it.

👍, let's get this in.

@tirthasheshpatel tirthasheshpatel merged commit 418fbdb into scipy:main Apr 30, 2022
@mdhaber mdhaber added this to the 1.9.0 milestone Jun 6, 2022
Labels: defect, scipy.stats
Successfully merging this pull request may close these issues.

ttest_ind for two sampled distributions with the same single unique value