ENH: stats: add masked array, axis tuple, and nan policy support to trimmed statistics #19425

tirthasheshpatel · 2023-10-22T23:02:58Z

Reference issue

What does this implement/fix?

Adds masked array, axis tuple, and nan policy support to trimmed statistics functions: stats.tmean, stats.tvar, stats.tstd, stats.tmin, stats.tmax, and stats.tsem.
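To illustrate the semantics being added, here is a NumPy-only sketch of a trimmed mean that accepts a tuple of axes and a NaN policy. The helper `trimmed_mean_sketch` is hypothetical and is not how the SciPy functions are implemented. It assumes inclusive limits and returns NaN when no values lie within them, which may differ from what the individual SciPy functions do.

```python
import numpy as np


def trimmed_mean_sketch(a, limits=(None, None), axis=None, nan_policy="propagate"):
    """Hypothetical helper: mean of the values within inclusive limits."""
    a = np.asarray(a, dtype=float)
    lower, upper = limits
    is_nan = np.isnan(a)

    # Mark values inside the (inclusive) limits; NaNs never count as inside.
    keep = ~is_nan
    with np.errstate(invalid="ignore"):
        if lower is not None:
            keep &= a >= lower
        if upper is not None:
            keep &= a <= upper

    # Count and sum only the kept values; `axis` may be an int, a tuple, or None.
    count = keep.sum(axis=axis)
    total = np.where(keep, a, 0.0).sum(axis=axis)
    with np.errstate(invalid="ignore", divide="ignore"):
        result = total / count  # NaN (in this sketch) where nothing is within the limits

    if nan_policy == "propagate":
        # Any NaN in a slice makes that slice's result NaN.
        result = np.where(is_nan.any(axis=axis), np.nan, result)
    elif nan_policy != "omit":
        raise ValueError("this sketch only handles 'propagate' and 'omit'")
    return result


x = np.array([[1.0, 2.0, 100.0],
              [np.nan, 4.0, 5.0]])
omit_all = trimmed_mean_sketch(x, limits=(0, 10), axis=(0, 1), nan_policy="omit")
per_row = trimmed_mean_sketch(x, limits=(0, 10), axis=1)
print(omit_all, per_row)
```

With `nan_policy="omit"` the NaN is dropped along with the out-of-limits value 100; with the default `"propagate"` the second row's result becomes NaN.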

Additional information

@mdhaber I will remove the axis and nan_policy arguments in the next commit. I have left them in for now so you can run the dtype consistency tests.

scipy/stats/_stats_py.py

mdhaber · 2023-10-23T19:57:11Z

> I will remove the axis and nan_policy arguments in the next commit

Just nan_policy, right? axis should stay.

tirthasheshpatel · 2023-10-23T20:03:12Z

> Just nan_policy, right? axis should stay.

Doesn't _axis_nan_policy_factory add the axis argument (and just pass slices of inputs from input axis)? Why do we need to keep it?

mdhaber · 2023-10-23T20:17:15Z

The decorator doesn't always pass individual slices. If axis behavior is defined by the function, the decorator uses it when it can (e.g. when there are no NaNs) for efficiency. It doesn't use the function's existing nan_policy because I lost trust in the existing nan_policy implementations, and in many cases the _axis_nan_policy approach to nan_policy turns out to be faster.
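A rough sketch of that dispatch idea, for readers unfamiliar with it. The decorator name `omit_nan_sketch` and its structure are assumptions made for illustration; this is not the actual `_axis_nan_policy_factory` code, and it only covers the "omit" behavior for a single integer axis.

```python
import functools

import numpy as np


def omit_nan_sketch(func):
    """Hypothetical decorator: vectorized fast path, per-slice fallback."""
    @functools.wraps(func)
    def wrapper(a, axis=0):
        a = np.asarray(a, dtype=float)
        if not np.isnan(a).any():
            # Fast path: no NaNs, so rely on the function's own axis handling.
            return func(a, axis=axis)
        # Slow path: move the reduction axis last and treat each 1-D slice
        # separately after dropping its NaNs.
        moved = np.moveaxis(a, axis, -1)
        out = np.empty(moved.shape[:-1])
        for idx in np.ndindex(*moved.shape[:-1]):
            sl = moved[idx]
            sl = sl[~np.isnan(sl)]
            out[idx] = func(sl, axis=0) if sl.size else np.nan
        return out
    return wrapper


@omit_nan_sketch
def mean_omit(a, axis=0):
    return np.mean(a, axis=axis)


clean = np.array([[1.0, 2.0], [3.0, 4.0]])
dirty = np.array([[1.0, np.nan], [3.0, 4.0]])
print(mean_omit(clean, axis=0), mean_omit(dirty, axis=0))
```

This is why the wrapped function still needs its own `axis` argument: the fast path calls it directly on the whole array.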

scipy/stats/_stats_py.py

mdhaber

Looks close! Sorry for the nits!

scipy/stats/_stats_py.py

mdhaber

These are so much cleaner than before. Nice work.

I included a few nits inline. You're welcome to push back if you disagree.

It's a shame that there is some unavoidable inconsistency here:

- the default axis of tmean is None; it's 0 for the rest
- different behaviors when there are no values within the limits (depending on the function and particular circumstances)

but this PR seems to maintain those "features". (In SciPy 2.0, I think we should pick a common axis default (-1 if it weren't for NumPy, but I guess 0 to follow suit) and really rethink when we raise vs. return NaN.)
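A quick way to see the axis-default difference, assuming SciPy is installed and the defaults are as described above (`axis=None` for tmean, `axis=0` for tvar):

```python
import numpy as np
from scipy import stats

x = np.arange(6.0).reshape(2, 3)

# tmean defaults to axis=None: the whole array is reduced to one number.
m = stats.tmean(x)
# tvar defaults to axis=0: one value per column.
v = stats.tvar(x)

print(np.shape(m), np.shape(v))
```

Passing `axis` explicitly to both functions sidesteps the inconsistency.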

I also like that this PR reduces the number of warnings that get generated in edge cases. For example, stats.tvar([]) used to produce:

/Users/matthaberland/Desktop/scipy/scipy/stats/_stats_py.py:677: RuntimeWarning: Degrees of freedom <= 0 for slice
  return a.var(ddof=ddof, axis=axis)
/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.11/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean,
/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.11/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)

Now we only get the first one, which is consistent with NumPy var (although perhaps it should be changed there).

scipy/stats/_stats_py.py

Co-authored-by: Matt Haberland <mdhaberla@calpoly.edu>

mdhaber

Thanks @tirthasheshpatel. Just a few last thoughts.

scipy/stats/_stats_py.py

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

mdhaber · 2023-11-07T18:52:55Z

Thanks, Tirth!

… support to trimmed statistics (scipy#19425) * ENH: stats: simplify and add masked array, axis tuple, and nan policy support to trimmed statistics

mdhaber · 2023-11-10T07:33:45Z

@tirthasheshpatel I thought I'd motivate what I wrote about axis=-1 in #19425 (review).

Because NumPy arrays default to row-major (C) order, reducing along axis=-1 operates on contiguous memory, which has a big performance benefit:

import numpy as np
rng = np.random.default_rng(734572435824)

n = 10_000_000
x = rng.random(size=(n, 2))
y = x.T.copy()

# %timeit np.mean(x, axis=0)
# 54.9 ms ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# %timeit np.mean(y, axis=-1)
# 3.83 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
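The layout difference behind those timings can be inspected through the arrays' strides. This is a small sketch with a reduced array size; the seed and size are arbitrary choices, not taken from the benchmark above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(size=(1000, 2))  # C-ordered: each row of 2 values is contiguous
y = x.T.copy()                  # shape (2, 1000): each row of 1000 values is contiguous

# Strides are the byte steps between consecutive elements along each axis.
print(x.strides)  # stepping along axis 0 skips over the other column
print(y.strides)  # stepping along axis -1 moves to the adjacent float64
print(y.flags["C_CONTIGUOUS"])
```

Reducing `x` along axis 0 strides past the other column at every step, while reducing `y` along its last axis walks through adjacent elements.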

With axis=-1, results in this example are also identical whether computed on the entire array or on individual slices, whereas with axis=0 they differ by rounding error:

print(np.mean(x, axis=0) - [np.mean(x[:, 0]), np.mean(x[:, 1])])
# [-1.62092562e-14  3.67483821e-14]

print(np.mean(y, axis=-1) - [np.mean(y[0, :]), np.mean(y[1, :])])
# [0. 0.]
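One plausible explanation is summation order: the NumPy documentation for `np.sum` notes that pairwise summation is only used when summing along the fast axis in memory, so the two layouts accumulate rounding error differently. The sketch below compares both against `math.fsum` as a reference; the array size and seed are arbitrary, and the size of the differences will depend on the data and NumPy version.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(size=(100_000, 2))
y = x.T.copy()

# math.fsum tracks exact partial sums, so it serves as an accurate reference.
reference = np.array([math.fsum(row) for row in y]) / y.shape[1]

err_axis0 = np.abs(np.mean(x, axis=0) - reference)
err_last = np.abs(np.mean(y, axis=-1) - reference)
print(err_axis0, err_last)
```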

With higher dimensional arrays, in particular, it's also nice to have the independent slices along axis=-1 because they stay contiguous when printed. To illustrate, imagine our slices are consecutive numbers:

z = np.arange(3*5*3).reshape((3, 5, 3))
# array([[[ 0,  1,  2],
#         [ 3,  4,  5],
#         [ 6,  7,  8],
#         [ 9, 10, 11],
#         [12, 13, 14]],
#
#        [[15, 16, 17],
#         [18, 19, 20],
#         [21, 22, 23],
#         [24, 25, 26],
#         [27, 28, 29]],
#
#        [[30, 31, 32],
#         [33, 34, 35],
#         [36, 37, 38],
#         [39, 40, 41],
#         [42, 43, 44]]])

# vs

z.T
# array([[[ 0, 15, 30],
#         [ 3, 18, 33],
#         [ 6, 21, 36],
#         [ 9, 24, 39],
#         [12, 27, 42]],
#        
#        [[ 1, 16, 31],
#         [ 4, 19, 34],
#         [ 7, 22, 37],
#         [10, 25, 40],
#         [13, 28, 43]],
#        
#        [[ 2, 17, 32],
#         [ 5, 20, 35],
#         [ 8, 23, 38],
#         [11, 26, 41],
#         [14, 29, 44]]])

Another reason why the default axis should not be 0 is that axis 0 is convenient for unpacking arrays. I find myself unpacking independent slices into separate variables much more often than elements in the same position within their slice.

lb, ub = y
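A slightly fuller version of that unpacking pattern, using made-up data (the `bounds` array is hypothetical):

```python
import numpy as np

# Hypothetical data: row 0 holds lower bounds, row 1 holds upper bounds.
bounds = np.array([[0.0, 1.0, 2.0],
                   [0.5, 3.0, 2.5]])

lb, ub = bounds                   # iterating an array yields slices along axis 0
widths = ub - lb
row_means = bounds.mean(axis=-1)  # one statistic per unpacked variable
print(widths, row_means)
```

With observations along the last axis, the same array both unpacks cleanly and reduces along contiguous memory.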

ENH: stats: add masked array, axis tuple, and nan policy support to trimmed statistics

9ef54e1

tirthasheshpatel added the scipy.stats and enhancement labels Oct 22, 2023

tirthasheshpatel requested a review from mdhaber October 22, 2023 23:02

BUG: fix type promotion issues in tstd, tvar, and tsem

f51259b

mdhaber reviewed Oct 23, 2023

mdhaber reviewed Oct 24, 2023

MAINT: stats: remove nan_policy, use axis in tsem

4c72a8a

mdhaber reviewed Oct 26, 2023

tirthasheshpatel added 4 commits October 27, 2023 19:38

MAINT: add nan_policy argument back

49500d2

BUG: stats: resolve the tmin/tmax garbage value problem

c298113

MAINT: stats: use np.nan instead of _get_nan in tmean and tvar

6f209b9

MAINT: stats: avoid converting to masked arrays

21737e4

mdhaber reviewed Oct 29, 2023

tirthasheshpatel and others added 5 commits November 7, 2023 05:22

Address review comments

40d6822

Co-authored-by: Matt Haberland <mdhaberla@calpoly.edu>

Merge branch 'main' of github.com:scipy/scipy into tstats-anp

9bb6414

Cast dtype back to original if no nans present in result

37df4cb

Merge branch 'main' of github.com:scipy/scipy into tstats-anp

e555f0c

Resolve merge artifacts

ed5c436

mdhaber reviewed Nov 7, 2023

tirthasheshpatel and others added 2 commits November 7, 2023 08:42

Apply suggestions from code review

6f8008a

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

Update scipy/stats/_stats_py.py

e7f162b

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

mdhaber merged commit 4d50bca into scipy:main Nov 7, 2023
21 of 23 checks passed

tirthasheshpatel deleted the tstats-anp branch November 7, 2023 18:55

j-bowhay added this to the 1.12.0 milestone Nov 7, 2023

tirthasheshpatel mentioned this pull request Jan 30, 2024

ENH: stats: add axis/nan_policy support to f_oneway and alexandergovern #19980

Merged
