MAINT: stats: replace `np.var` with `_moment(..., 2)` to warn on constant input #16055
Conversation
I think it's OK to just add
```diff
@@ -1339,7 +1349,7 @@ def describe(a, axis=0, ddof=1, bias=True, nan_policy='propagate'):
     n = a.shape[axis]
     mm = (np.min(a, axis=axis), np.max(a, axis=axis))
     m = np.mean(a, axis=axis)
-    v = np.var(a, axis=axis, ddof=ddof)
+    v = _var(a, axis=axis, ddof=ddof)
```
We could also optimize a bit by using the mean, which already needs to be calculated outside of `_var`. (The difficulty with that is that we need the mean with `keepdims=True`.)
I guess we can just do:

```diff
-v = _var(a, axis=axis, ddof=ddof)
+v = _var(a, mean=np.expand_dims(m, axis=axis), axis=axis, ddof=ddof)
```
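To illustrate the `keepdims` difficulty mentioned above: `describe` computes the mean without `keepdims`, so it must be re-expanded along the reduced axis before it can broadcast against the data. A minimal sketch (made-up array shapes, not SciPy code):

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)
axis = 0

m = np.mean(a, axis=axis)              # shape (4,), as in describe()
m_keep = np.expand_dims(m, axis=axis)  # shape (1, 4): broadcasts against a

# centering with the re-expanded mean reproduces np.var's result
v_centered = np.mean((a - m_keep) ** 2, axis=axis)
print(np.allclose(v_centered, np.var(a, axis=axis)))
```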
It doesn't make a big difference though.
I guess I can just add the check like there is in `_moment`, then calculate the variance with `np.var`. Do you think that's the way to go?
Yes, sounds good!
That doesn't help :/ The check itself is what is taking most of the time, and I don't see how to speed it up. (Taking the maximum first is indeed faster than doing the division for all elements.)
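For context, the ordering point is that the reduction can happen before a single division, instead of dividing every element first. A sketch with made-up data, where `m` stands for the slice mean:

```python
import numpy as np

a = np.array([1.0, 2.0, 4.0, 8.0])
m = a.mean()

# divide every element, then reduce:
r_slow = np.max(np.abs(a - m) / np.abs(m))
# reduce first, then divide once -- same result, one division total
r_fast = np.max(np.abs(a - m)) / np.abs(m)

print(np.isclose(r_slow, r_fast))
```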
We can leave it as it is. I don't think the slowdown is too bad.
LGTM! I can merge if you don't mind the slowdown.
@tupui @chrisb83 what do you think? gh-15905 added a reasonable (I think) way to fix problems reported for several stats functions when all data along an axis slice is identical or nearly identical (e.g. gh-15554, gh-14418, gh-13245, gh-11086, gh-10896). It is not a complex calculation, but for simple functions, the overhead is substantial:
Do we value performance at the expense of quietly returning a bogus answer for degenerate data? @tupui this reminds me of your suggestion for some sort of standard way to bypass input validation and such e.g.
@tirthasheshpatel I'm thinking we should go for it. If we really want to improve the performance, a Pythran version of the precision loss check with an explicit loop could (usually) quickly detect when data is not degenerate. (If any of the differences between the data and the mean exceed the tolerance, execution can continue.)
We can do that in a follow-up.
👍, let's get this in.
Reference issue
Closes gh-14418
Follow-up to gh-15905
What does this implement/fix?
gh-14418 reported that `ttest_ind` gives unreliable results when input arrays are constant (i.e. contain only one unique value). Ultimately, this is due to catastrophic cancellation when calculating the variance. This PR resolves the issue by using `_moment` to calculate the variance (and `_moment` warns when the input data are identical or nearly identical).

Additional information
Do we need new unit tests, or is the addition of `pytest.warns` to the existing unit tests OK?

I went ahead and applied the change not only to the tests but also to `describe`.

`_moment` is slower than `np.var`. For the example above, on `main`:

- For a random array of size 10000, `ttest_1samp` takes ~67 µs in `main` and 114 µs in the PR.
- For a random array of size 100000, `ttest_1samp` takes ~216 µs in `main` and 335 µs in the PR.

It might be possible to recover some of the time by performing the data constancy check separately, then using `np.var` to calculate the variance. We could also optimize a bit by using the mean, which already needs to be calculated outside of `_var`. (The difficulty with that is that we need the mean with `keepdims=True`.)
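The catastrophic cancellation behind all this can be reproduced outside SciPy with the textbook one-pass variance formula `E[x**2] - E[x]**2`, which subtracts two nearly equal large numbers. The data below is illustrative, not taken from the linked issues:

```python
import numpy as np

x = 1e8 + np.array([1.0, 2.0, 3.0, 4.0])  # large values, tiny spread

# two-pass (centered) variance is exact for this data: 1.25
v_good = np.var(x)

# one-pass formula: both terms round to even integers near 1e16, so
# their difference cannot resolve the true variance of 1.25 at all
v_naive = np.mean(x**2) - np.mean(x)**2

print(v_good, v_naive)
```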