Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: add masked array, axis tuple, and nan policy support to trimmed statistics #19425

Merged
merged 14 commits into from Nov 7, 2023

Conversation

tirthasheshpatel
Copy link
Member

Reference issue

Towards #14651

What does this implement/fix?

Adds masked array, axis tuple, and nan policy support to trimmed statistics functions: stats.tmean, stats.tvar, stats.tstd, stats.tmin, stats.tmax, and stats.tsem.

Additional information

@mdhaber I will remove the axis and nan_policy arguments in the next commit. I have left them right now so you can run the dtype consistency tests.

@tirthasheshpatel tirthasheshpatel added scipy.stats enhancement A new feature or improvement labels Oct 22, 2023
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
@mdhaber
Copy link
Contributor

mdhaber commented Oct 23, 2023

I will remove the axis and nan_policy arguments in the next commit

Just nan_policy, right? axis should stay.

@tirthasheshpatel
Copy link
Member Author

Just nan_policy, right? axis should stay.

Doesn't _axis_nan_policy_factory add the axis argument (and just pass slices of inputs from input axis)? Why do we need to keep it?

@mdhaber
Copy link
Contributor

mdhaber commented Oct 23, 2023

The decorator doesn't always pass individual slices. If axis behavior is defined by the function, the decorator uses it when it can (e.g. no NaNs) for efficiency. It doesn't use the existing nan_policy because I lost trust in existing nan_policy implementations and in many cases the _axis_nan_policy approach to nan_policy turns out faster.

Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks close! Sorry for the nits!

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are so much cleaner than before. Nice work.

I included a few nits inline. You're welcome to push back on if you disagree.

It's a shame that there is some unavoidable inconsistency here:

  • default axis of tmean is None; it's 0 for the rest
  • different behaviors when there are no values within the limits (depending on the function and particular circumstances)

but this PR seems to maintain those "features". (In SciPy 2.0, I think we should pick a common axis default (-1 if it weren't for NumPy, but I guess 0 to follow suit) and really rethink when we raise vs returning NaN)

I also like that this PR reduces the number of warnings that get generated in edge cases. For example, stats.tvar([]) used to produce:

/Users/matthaberland/Desktop/scipy/scipy/stats/_stats_py.py:677: RuntimeWarning: Degrees of freedom <= 0 for slice
  return a.var(ddof=ddof, axis=axis)
/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.11/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean,
/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.11/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)

Now we only get the first one, which is consistent with NumPy var (although perhaps it should be changed there).

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tirthasheshpatel. Just a few last thoughts.

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
tirthasheshpatel and others added 2 commits November 7, 2023 08:42
Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@mdhaber mdhaber merged commit 4d50bca into scipy:main Nov 7, 2023
21 of 23 checks passed
@mdhaber
Copy link
Contributor

mdhaber commented Nov 7, 2023

Thanks, Tirth!

@tirthasheshpatel tirthasheshpatel deleted the tstats-anp branch November 7, 2023 18:55
@j-bowhay j-bowhay added this to the 1.12.0 milestone Nov 7, 2023
perimosocordiae pushed a commit to perimosocordiae/scipy that referenced this pull request Nov 8, 2023
… support to trimmed statistics (scipy#19425)

* ENH: stats: simplify and add masked array, axis tuple, and nan policy support to trimmed statistics
@mdhaber
Copy link
Contributor

mdhaber commented Nov 10, 2023

@tirthasheshpatel I thought I'd motivate what I wrote about axis=-1 in #19425 (review).

Given that NumPy tends toward row-based ordering, axis=-1 has a big performance benefit:

import numpy as np
rng = np.random.default_rng(734572435824)

n = 10_000_000
x = rng.random(size=(n, 2))
y = x.T.copy()

# %timeit np.mean(x, axis=0)
# 54.9 ms ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# %timeit np.mean(y, axis=-1)
# 3.83 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Calculations are consistent when working on entire arrays vs operating on individual slices:

print(np.mean(x, axis=0) - [np.mean(x[:, 0]), np.mean(x[:, 1])])
# [-1.62092562e-14  3.67483821e-14]

print(np.mean(y, axis=-1) - [np.mean(y[0, :]), np.mean(y[1, :])])
# [0. 0.]

With higher dimensional arrays, in particular, it's also nice to have the independent slices along axis=-1 because they stay contiguous when printed. To illustrate, imagine our slices are consecutive numbers:

z = np.arange(3*5*3).reshape((3, 5, 3))
# array([[[ 0,  1,  2],
#         [ 3,  4,  5],
#         [ 6,  7,  8],
#         [ 9, 10, 11],
#         [12, 13, 14]],
#
#        [[15, 16, 17],
#         [18, 19, 20],
#         [21, 22, 23],
#         [24, 25, 26],
#         [27, 28, 29]],
#
#        [[30, 31, 32],
#         [33, 34, 35],
#         [36, 37, 38],
#         [39, 40, 41],
#         [42, 43, 44]]])

# vs

z.T
# array([[[ 0, 15, 30],
#         [ 3, 18, 33],
#         [ 6, 21, 36],
#         [ 9, 24, 39],
#         [12, 27, 42]],
#        
#        [[ 1, 16, 31],
#         [ 4, 19, 34],
#         [ 7, 22, 37],
#         [10, 25, 40],
#         [13, 28, 43]],
#        
#        [[ 2, 17, 32],
#         [ 5, 20, 35],
#         [ 8, 23, 38],
#         [11, 26, 41],
#         [14, 29, 44]]])

And a reason why axis should not be 0 is that axis 0 is useful for convenient unpacking of arrays. I find myself unpacking independent slices into separate variables much more often than elements in the same position within their slice.

lb, ub = y

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants