ENH: stats: add masked array, axis tuple, and nan policy support to trimmed statistics #19425
Just |
Doesn't |
The decorator doesn't always pass individual slices. If |
Looks close! Sorry for the nits!
These are so much cleaner than before. Nice work.
I included a few nits inline. You're welcome to push back if you disagree.
It's a shame that there is some unavoidable inconsistency here:
- default `axis` of `tmean` is `None`; it's `0` for the rest
- different behaviors when there are no values within the limits (depending on the function and particular circumstances)

but this PR seems to maintain those "features". (In SciPy 2.0, I think we should pick a common axis default (`-1` if it weren't for NumPy, but I guess `0` to follow suit) and really rethink when we raise vs returning NaN)
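The differing `axis` defaults are easy to see on a small 2-D array. The sketch below is only an illustration: the array `x` and the variable names are made up for this example, and it assumes a SciPy version in which `tmean` defaults to `axis=None` and `tvar` to `axis=0`, as described above.

```python
import numpy as np
from scipy import stats

# Illustrative data (not from the PR): 3 rows, 4 columns.
x = np.arange(12.0).reshape(3, 4)

# tmean defaults to axis=None, so the whole array is reduced to one value.
m = stats.tmean(x)

# tvar defaults to axis=0, so one variance is returned per column.
v = stats.tvar(x)

print(np.ndim(m), np.shape(v))
```

Passing `axis` explicitly to every trimmed statistic sidesteps the difference.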
I also like that this PR reduces the number of warnings that get generated in edge cases. For example, `stats.tvar([])` used to produce:
/Users/matthaberland/Desktop/scipy/scipy/stats/_stats_py.py:677: RuntimeWarning: Degrees of freedom <= 0 for slice
return a.var(ddof=ddof, axis=axis)
/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.11/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in divide
arrmean = um.true_divide(arrmean, div, out=arrmean,
/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.11/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
Now we only get the first one, which is consistent with NumPy `var` (although perhaps it should be changed there).
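To check which warnings an edge case emits, they can be recorded instead of read off the console. The sketch below does this for NumPy's `var` on an empty array, the behavior the comment compares against. The variable names are illustrative, and the sketch makes no claim about how many warnings a given NumPy or SciPy version produces.

```python
import warnings

import numpy as np

# Record every warning raised inside the block instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.var(np.array([]), ddof=1)

# Collect the messages so they can be inspected or asserted on in a test.
messages = [str(w.message) for w in caught]
for msg in messages:
    print(msg)
print(result)
```

The same pattern can wrap a call such as `stats.tvar([])` to compare warning output before and after a change.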
Co-authored-by: Matt Haberland <mdhaberla@calpoly.edu>
Thanks @tirthasheshpatel. Just a few last thoughts.
Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
Thanks, Tirth! |
… support to trimmed statistics (scipy#19425) * ENH: stats: simplify and add masked array, axis tuple, and nan policy support to trimmed statistics
@tirthasheshpatel I thought I'd motivate what I wrote about
Given that NumPy tends toward row-based ordering,
import numpy as np
rng = np.random.default_rng(734572435824)
n = 10_000_000
x = rng.random(size=(n, 2))
y = x.T.copy()
# %timeit np.mean(x, axis=0)
# 54.9 ms ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# %timeit np.mean(y, axis=-1)
# 3.83 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Calculations are consistent when working on entire arrays vs operating on individual slices:
print(np.mean(x, axis=0) - [np.mean(x[:, 0]), np.mean(x[:, 1])])
# [-1.62092562e-14 3.67483821e-14]
print(np.mean(y, axis=-1) - [np.mean(y[0, :]), np.mean(y[1, :])])
# [0. 0.]
With higher dimensional arrays, in particular, it's also nice to have the independent slices along
z = np.arange(3*5*3).reshape((3, 5, 3))
# array([[[ 0, 1, 2],
# [ 3, 4, 5],
# [ 6, 7, 8],
# [ 9, 10, 11],
# [12, 13, 14]],
#
# [[15, 16, 17],
# [18, 19, 20],
# [21, 22, 23],
# [24, 25, 26],
# [27, 28, 29]],
#
# [[30, 31, 32],
# [33, 34, 35],
# [36, 37, 38],
# [39, 40, 41],
# [42, 43, 44]]])
# vs
z.T
# array([[[ 0, 15, 30],
# [ 3, 18, 33],
# [ 6, 21, 36],
# [ 9, 24, 39],
# [12, 27, 42]],
#
# [[ 1, 16, 31],
# [ 4, 19, 34],
# [ 7, 22, 37],
# [10, 25, 40],
# [13, 28, 43]],
#
# [[ 2, 17, 32],
# [ 5, 20, 35],
# [ 8, 23, 38],
# [11, 26, 41],
# [14, 29, 44]]])
And a reason why lb, ub = y |
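A likely contributor to the timing gap measured earlier in this comment is memory layout: in a C-ordered array, slices along the last axis are contiguous, while slices along the first axis are strided. The sketch below inspects that directly. The array size is arbitrary and the variable names `col` and `row` are made up for illustration.

```python
import numpy as np

x = np.zeros((1000, 2))   # C-ordered: each row is stored contiguously
y = x.T.copy()            # shape (2, 1000), again C-ordered

col = x[:, 0]             # one slice along axis 0 of x
row = y[0, :]             # the same data as a slice along the last axis of y

# A strided view skips over the other column's elements in memory.
print(col.flags.c_contiguous, col.strides)
# A last-axis slice is one contiguous run of float64 values.
print(row.flags.c_contiguous, row.strides)
```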
Reference issue
Towards #14651
What does this implement/fix?
Adds masked array, axis tuple, and nan policy support to trimmed statistics functions: `stats.tmean`, `stats.tvar`, `stats.tstd`, `stats.tmin`, `stats.tmax`, and `stats.tsem`.
Additional information
@mdhaber I will remove the `axis` and `nan_policy` arguments in the next commit. I have left them in for now so you can run the dtype consistency tests.
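For readers unfamiliar with the three features named in the description, here is a minimal NumPy-only sketch of what they mean for a trimmed mean. `trimmed_mean_sketch` is a hypothetical helper written for this illustration. It is not SciPy's implementation and does not use the decorator discussed in the review. It assumes inclusive limits and always skips NaNs, which is roughly the effect of `nan_policy='omit'`.

```python
import numpy as np


def trimmed_mean_sketch(a, limits=(None, None), axis=None):
    """Hypothetical helper: mean of the values inside `limits` (inclusive).

    Masked entries and NaNs are skipped, and `axis` may be an int,
    a tuple of ints, or None.
    """
    # Split the input into plain data and a boolean "keep" mask; this
    # works for lists, ndarrays, and masked arrays alike.
    data = np.asarray(np.ma.filled(a, 0), dtype=float)
    keep = ~np.ma.getmaskarray(a) & ~np.isnan(data)

    lower, upper = limits
    if lower is not None:
        keep &= data >= lower
    if upper is not None:
        keep &= data <= upper

    # Sum and count only the kept values along the requested axes.
    total = np.sum(np.where(keep, data, 0.0), axis=axis)
    count = np.sum(keep, axis=axis)
    return total / count  # NaN (with a warning) where nothing is kept


x = np.array([[1.0, 2.0, np.nan], [4.0, 100.0, 6.0]])
r_all = trimmed_mean_sketch(x, limits=(0, 10), axis=(0, 1))   # axis tuple
r_cols = trimmed_mean_sketch(x, limits=(0, 10), axis=0)       # per column

xm = np.ma.masked_array([1.0, 2.0, 50.0, 4.0],
                        mask=[False, True, False, False])
r_masked = trimmed_mean_sketch(xm, limits=(0, 10))            # masked input

print(r_all, r_cols, r_masked)
```

When no value falls within the limits, the sketch returns NaN. As the review notes, the real functions differ among themselves on that point.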