
MAINT: stats: replace np.var with _moment(..., 2) to warn on constant input #16055

Merged
merged 1 commit into scipy:main on Apr 30, 2022

Conversation

@mdhaber (Contributor) commented on Apr 27, 2022

Reference issue

Closes gh-14418
Follow-up to gh-15905

What does this implement/fix?

gh-14418 reported that ttest_ind gives unreliable results when the input arrays are constant (i.e., they contain only one unique value). Ultimately, this is due to catastrophic cancellation when calculating the variance. This PR resolves the issue by using _moment to calculate the variance; _moment warns when the input data are identical or nearly identical.

>>> from scipy.stats import ttest_ind
>>> ttest_ind([0.04]*10, [0.04]*30)
C:\Users\matth\AppData\Local\Temp\ipykernel_50616\217009264.py:1: RuntimeWarning: Precision loss occurred in 
moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results
may be unreliable.
  ttest_ind([0.04]*10, [0.04]*30)
Ttest_indResult(statistic=-5.338539126015656, pvalue=4.591739995593669e-06)
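For reference, the precision loss on main can be reproduced directly with NumPy. A minimal sketch (exact values are platform-dependent):

```python
import numpy as np

x = np.full(10, 0.04)
# 0.04 is not exactly representable in binary floating point, and the computed
# mean need not equal the stored value exactly, so the squared deviations can
# be tiny but nonzero instead of exactly zero.
print(np.mean(x) == x[0])  # may print False
print(np.var(x))           # may print a tiny nonzero value rather than 0.0
```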

Additional information

Do we need new unit tests, or is the addition of pytest.warns to the existing unit tests OK?

I went ahead and applied the change not only to the t-tests but also to describe.

_moment is slower than np.var. For the example above:

%timeit ttest_ind([0.04]*10, [0.04]*30)  # 85.1 µs ± 472 ns per loop on main, 166 µs ± 1.52 µs per loop in PR

For a random array of size 10000, ttest_1samp takes ~67 µs on main and ~114 µs with this PR.
For a random array of size 100000, ttest_1samp takes ~216 µs on main and ~335 µs with this PR.

It might be possible to recover some of the time by performing the data constancy check separately, then using np.var to calculate the variance. We could also optimize a bit by using the mean, which already needs to be calculated outside of _var. (The difficulty with that is that we need the mean with keepdims=True.)
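For illustration, a rough sketch of that alternative; the helper name, the tolerance eps, and the warning text are all placeholders, not SciPy's actual implementation:

```python
import warnings
import numpy as np

def _var_with_check(a, axis=0, ddof=0, eps=1e-14):
    # Hypothetical sketch: check for nearly identical data along `axis`
    # first, then defer the actual computation to np.var.
    a = np.asarray(a)
    mean = np.mean(a, axis=axis, keepdims=True)  # keepdims for broadcasting
    with np.errstate(divide='ignore', invalid='ignore'):
        # largest relative deviation from the mean along the axis
        rel_dev = (np.max(np.abs(a - mean), axis=axis)
                   / np.abs(np.squeeze(mean, axis=axis)))
    if np.any(rel_dev < eps):
        warnings.warn("Precision loss occurred in moment calculation due to "
                      "catastrophic cancellation.", RuntimeWarning)
    return np.var(a, axis=axis, ddof=ddof)
```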

@mdhaber added the defect (a clear bug or issue that prevents SciPy from being installed or used as expected) and scipy.stats labels on Apr 27, 2022
@mdhaber changed the title from “MAINT: stats: replace np.var with _var to warn on constant input” to “MAINT: stats: replace np.var with _moment(..., 2) to warn on constant input” on Apr 27, 2022
@tirthasheshpatel (Member) commented

> Do we need new unit tests, or is the addition of pytest.warns to the existing unit tests OK?

I think it's OK to just add pytest.warns in the existing tests.

@@ -1339,7 +1349,7 @@ def describe(a, axis=0, ddof=1, bias=True, nan_policy='propagate'):
     n = a.shape[axis]
     mm = (np.min(a, axis=axis), np.max(a, axis=axis))
     m = np.mean(a, axis=axis)
-    v = np.var(a, axis=axis, ddof=ddof)
+    v = _var(a, axis=axis, ddof=ddof)
@tirthasheshpatel (Member) commented on Apr 28, 2022

> We could also optimize a bit by using the mean, which already needs to be calculated outside of _var. (The difficulty with that is that we need the mean with keepdims=True.)

I guess we can just do:

Suggested change:
-v = _var(a, axis=axis, ddof=ddof)
+v = _var(a, mean=np.expand_dims(m, axis=axis), axis=axis, ddof=ddof)

@tirthasheshpatel (Member)

It doesn't make a big difference though.

@mdhaber (Contributor, Author)

I guess I can just add the check like there is in _moment, then calculate the variance with np.var. Do you think that's the way to go?

@tirthasheshpatel (Member)

Yes, sounds good!

@mdhaber (Contributor, Author) commented on Apr 29, 2022

That doesn't help :/ The check itself is what takes most of the time, and I don't see how to speed it up. (Taking the maximum first is indeed faster than doing the division for all elements; see the snippet below.)
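That is, the two mathematically equivalent forms of the relative-tolerance check, shown here for the 1-D case with hypothetical data and a placeholder tolerance eps:

```python
import numpy as np

a = np.random.default_rng(0).normal(size=10_000)
mean, eps = a.mean(), 1e-14

# Divides every element before comparing: one division per element.
check_all = np.all(np.abs(a - mean) / np.abs(mean) < eps)

# Takes the maximum first: a single division after the O(n) reduction.
check_max = np.max(np.abs(a - mean)) / np.abs(mean) < eps
```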

@tirthasheshpatel (Member)

We can leave it as it is. I don't think the slowdown is too bad.

@tirthasheshpatel (Member) left a review comment

LGTM! I can merge if you don't mind the slowdown.

@mdhaber (Contributor, Author) commented on Apr 29, 2022

@tupui @chrisb83 what do you think?

gh-15905 added a reasonable (I think) way to fix problems reported for several stats functions when all data along an axis slice are identical or nearly identical (e.g. gh-15554, gh-14418, gh-13245, gh-11086, gh-10896). It is not a complex calculation, but for simple functions, the overhead is substantial: stats.moment takes ~50% longer, and the t-tests take ~80% longer. For data sets that fit in memory, the time to perform these functions is still quite small (e.g. for an ~1 GB array of doubles, the functions still take only ~1.25 s to run on my machine). This could have a practical impact for resampling methods, though.

Update: We could probably speed things up a bit by avoiding the division by the mean; instead, the tolerance eps could be multiplied by the mean. (And of course, we could reuse the pre-calculated mean as discussed above, but as noted, that isn't the time-consuming part.) None of this is helping much, so I think I want to leave this as-is; squeezing out a bit more performance would be a follow-up.
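The rearrangement mentioned above, continuing the earlier 1-D sketch (names and tolerance are illustrative):

```python
import numpy as np

a = np.full(10, 0.04)
mean, eps = a.mean(), 1e-14

# Division-free form: multiply the tolerance by the mean rather than
# dividing every deviation (or even the max) by it.
degenerate = np.max(np.abs(a - mean)) < eps * np.abs(mean)
```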

Do we value performance at the expense of quietly returning a bogus answer for degenerate data?

@tupui this reminds me of your suggestion for some sort of standard way to bypass input validation and such e.g. warp_speed=True : P

@mdhaber (Contributor, Author) commented on Apr 30, 2022

@tirthasheshpatel I'm thinking we should go for it. If we really want to improve the performance, a Pythran version of the precision loss check with an explicit loop could (usually) quickly detect when data is not degenerate. (If any of the differences between the data and the mean exceed the tolerance, execution can continue.)
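A hypothetical sketch of such a Pythran kernel; the name _data_is_degenerate and the tolerance handling are illustrative, not part of this PR:

```python
# pythran export _data_is_degenerate(float64[:], float64, float64)
def _data_is_degenerate(a, mean, eps):
    # Explicit loop with early exit: as soon as one element deviates from
    # the mean by more than the tolerance, the data are known not to be
    # degenerate and the scan can stop.
    tol = eps * abs(mean)
    for x in a:
        if abs(x - mean) > tol:
            return False
    return True
```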

@tirthasheshpatel (Member) commented

> If we really want to improve the performance, a Pythran version of the precision loss check with an explicit loop could (usually) quickly detect when data is not degenerate. (If any of the differences between the data and the mean exceed the tolerance, execution can continue.)

We can do that in a follow-up.

> I'm thinking we should go for it.

👍, let's get this in.

@tirthasheshpatel tirthasheshpatel merged commit 418fbdb into scipy:main Apr 30, 2022
@mdhaber mdhaber added this to the 1.9.0 milestone Jun 6, 2022
Labels: defect, scipy.stats
Successfully merging this pull request may close these issues.

ttest_ind for two sampled distributions with the same single unique value