ENH: stats: Add 'alternative' and confidence interval to pearsonr #15586

mdhaber · 2022-02-13T16:20:32Z

Reference issue

What does this implement/fix?

This PR adds one-sided alternatives and a confidence interval to pearsonr. This is an updated version of gh-12609, which stalled over a year ago, and I couldn't push to the original repo.

Why is this important?

The PR is about much more than pearsonr. Besides adding the alternative parameter to the last stats function that obviously needs it, by making the required changes to _make_tuple_bunch, this PR paves the way to add standard errors and confidence intervals in several other essential places (e.g. the t-tests). These are long-overdue improvements to scipy.stats.

Suggestions for review

Please compare against gh-12609, which I did not write. Here, I've just renamed attributes for consistency with other functions/classes, made _alternative and _n private attributes of the result object, and enabled the rendering of the result object documentation. I would have merged the original (if I could push changes) except for the changes to _bunch.py, so I'd suggest focusing on that and double-checking the interface changes.

The confidence interval is computed using Fisher's z transform.

scipy/stats/_stats_py.py

…rval

mdhaber · 2022-02-13T21:57:50Z

scipy/_lib/_bunch.py

+        raise AttributeError("can't set attribute %r of class %r"
+                             % (key, self.__class__.__name__))
+    else:
+        self.__dict__[key] = val


This change is needed to allow setting new attributes. ~~Someone who understands the bunch code should assess whether this is OK.@larsoner, can you help out here?~~

Based on this comment, it sounds like the only reason this was not allowed before was that there was not an immediate need.

I had namedtuple on the brain, and so was thinking this object should have an API that is locked down. You can't add attributes to a tuple or namedtuple. For now, I think I'd like to keep it locked down. By doing so, we aren't removing an existing feature, since this is only a replacement for objects that are tuples or namedtuples. If it turns out that there is demand from users to assign new attributes to one of these object, or to allow the extra attributes to be modified, we can enable that later.

Now there is a need: objects that inherit from those created using _make_tuple_bunch can have methods (like confidence_interval, so they should also be able to have private attributes to maintain state.

Based on that, I'm more confident it's OK. It seems that most of this code was just adapted from namedtuple and it's not absolutely essential to hold the bunch to the same restrictions.

@larsoner, I'd still really appreciate your thoughts here on the proposed changes to the bunch object:

Allowing setting new attributes (those that were not part of the original tuple)

Allowing new attribute names to begin with an underscore

Elimination of the default documentation of properties to make it easier to render correctly (see here).

but detailed review of the whole PR is not needed.

This change looks good to me. If we want to be more restrictive, we can just allow private attributes to be set (the one that start with _). Just a small thing though.

I didn't really see the need to lock this down originally. Sound ok to leave it open as-is?

why we are careful to ensure that the extra fields are not mutable and that none can be added?

I guess to be consistent with namedtuple.

Sound ok to leave it open as-is?

Yup, I don't see any problem with it.

scipy/stats/_stats_py.py

scipy/stats/_result_classes.py

scipy/stats/_stats_py.py

scipy/stats/tests/test_stats.py

scipy/stats/_stats_py.py

[skip azp][skip actions] Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>

[skip actions][skip azp]

[skip azp] [skip actions]

scipy/stats/_stats_py.py

Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>

tirthasheshpatel · 2022-02-28T06:34:43Z

scipy/stats/_stats_py.py

+        result = PearsonRResult(statistic=np.nan, pvalue=np.nan, n=n,
+                                alternative=alternative)


One thing I am worried about is adding _axis_nan_policy_factory to these result tuples. First, how do we retrieve the n and alternative attributes when the decorator is applied? And if we manage to do that, confidence_interval is not vectorized to handle multidimensional inputs.

Do we want to add _axis_nan_policy_factory to pearsonr? I see it is listed in the spreadsheet but not sure if all the functions in the sheet need the decorator.

gh-15632 shows how to get attributes out of these bunches. I need to merge main now that the keepdims PR has merged; gh-15632 will be simpler after that because it made several similar changes.

The confidence interval question is a really good one. I need to think about that. One quick answer is that I don't see the decorator as a forever-solution. Ideally, in the (perhaps very) long run, calculations will be natively vectorized, NumPy and SciPy functions will offer better support for omitting NaNs on their own, etc., so that maybe we can get the desired behavior easily without a decorator. But yeah, I don't like the idea of having to choose between an object with methods like confidence_interval and a vectorized result. It can certainly be made to work; but I'll need to think more about a clean way.

@tirthasheshpatel I think this is pretty simple, actually. It's just a matter of updating our definition of the PearsonRResult class.

There are two ways we interact with a PearsonRResult object inside the decorator:

We extract the results of calls to the pearsonr function from them. Their confidence_interval method doesn't matter at this point. I show how that can work in ENH: stats.wilcoxon: return z-statistic (as requested) #15632.

PearsonRResult is the tuple_to_result callable. All the hard stuff (omitting nans, tuple axis, etc.) has already been taken care of by the decorator before tuple_to_result gets called. It gets passed some arguments: an array of statistics, an array of ns, etc. The confidence_interval method just needs to be vectorized to act on each element of these arrays, not along an axis, so it's not so bad.

Does that sound right?

an array of statistics, an array of ns, etc.

The decorator won't pass an array of ns and alternatives to tuple_to_result object since the result returned by pearsonr doesn't contain them as iterables. Even if it did, alternative is a string and the decorator treats each output as a float64. I don't know how would one immediately solve this problem.

The confidence_interval method just needs to be vectorized to act on each element, not along an axis, so it's not so bad.

Yeah, this is correct.

since the result returned by pearsonr doesn't contain them as iterables

Again, see gh-15632, which shows the little change that lets us do that. I'll need to think about alternative some more. But it's code. There will be a way.

First thing that comes to mind that would work without change to the existing structure: result_to_tuple can turn alternative into a float corresponding with one of the three alternatives (e.g. -1 for less, 0 for two-sided, 1 for greater). tuple_to_result can look at any element of the resulting array (they will all be the same) and use that to set the alternative field of the PearsonRResult object to the appropriate string. (Since they will all be the same, no sense making it an array of strings. But n could, in general, vary if there are masked elements or omitted NaNs).

So there are definitely ways, and we could probably make it cleaner if we allow changes to the decorator. I say we think about it in detail when we get there.

First thing that comes to mind that would work without change to the existing structure: result_to_tuple can turn alternative into a float corresponding with one of the three alternatives (e.g. -1 for less, 0 for two-sided, 1 for greater). tuple_to_result can look at any element of the resulting array (they will all be the same) and use that to set the alternative field of the PearsonRResult object to the appropriate string.

That could work! One question: IIUC, the decorator could return an array of nan as result (i.e. alternative=nan could be passed to tuple_to_result). In that case, how does one decode the value of alternative? (I guess it doesn't matter since the result is just going to be nan anyways.)

Yeah I don't think it matters.
And I intended for us to allow that behavior - returning an all-nan array when nan_policy='propagage' - to be overridden. When @_axis_nan_policy(....) is called, we could pass a callable,
But yeah, we don't really need to.

tirthasheshpatel

Implementation and docs looks good. I also verified with R and mpmath and values match. It would be good to add a test for confidence interval when n=2 (test_length_two_pos1) and n=3 (test_length3_r_exactly_negative_one).

mdhaber · 2022-02-28T07:29:22Z

It would be good to add a test for confidence interval when n=2 (test_length_two_pos1) and n=3 (test_length3_r_exactly_negative_one).

Good idea. I was doing this but then thought that the confidence interval for n=2 could probably just be -1, 1 instead of nan, nan. I think it's OK to do that if we document it, and it's more user-friendly than NaNS. What do you think?

tirthasheshpatel · 2022-02-28T07:30:47Z

Makes sense. -1 and +1 confidence intervals for n<=3 sound good to me.

mdhaber · 2022-03-01T07:28:01Z

Anything else you wanted to see here?

tirthasheshpatel

LGTM now! I looked at the tests and doc note for n<=3 and they look good. Thanks @mdhaber! I will merge tomorrow if there are no other comments.

tirthasheshpatel · 2022-03-01T08:05:14Z

Really happy to see this will close long-standing gh-12506 once merged!!

tirthasheshpatel · 2022-03-02T10:21:50Z

Failures are unrelated. Merging. Thanks @mdhaber @WarrenWeckesser!

WarrenWeckesser and others added 4 commits November 6, 2020 04:16

ENH: stats: Add 'alternative' and confidence interval to pearsonr.

55111c5

The confidence interval is computed using Fisher's z transform.

Apply suggestions from code review

f917342

Merge remote-tracking branch 'upstream/main' into pearsonr-ci

3ef0ea1

MAINT: stats.pearsonr: move back to stats_py.py for easier review

c9487c8

mdhaber added scipy.stats enhancement A new feature or improvement labels Feb 13, 2022

mdhaber commented Feb 13, 2022

View reviewed changes

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved

mdhaber commented Feb 13, 2022

View reviewed changes

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved

Apply suggestions from code review

bdc7dc3

mdhaber mentioned this pull request Feb 13, 2022

ENH: stats: Add 'alternative' and confidence interval to pearsonr. #12609

Closed

mdhaber added 3 commits February 13, 2022 13:12

MAINT: stats.pearsonr: standardize name of statistic, confidence_inte…

300ab50

…rval

ENH: stats: remove alternative and n from repr of PearsonRResult

2a2176a

DOC: stats.pearsonr: render PearsonRResult in documentation

bc02329

mdhaber commented Feb 13, 2022

View reviewed changes

DOC: stats.pearsonr: improve doctest

72116d9

mdhaber commented Feb 13, 2022

View reviewed changes

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved

mdhaber commented Feb 13, 2022

View reviewed changes

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved

mdhaber commented Feb 13, 2022

View reviewed changes

scipy/stats/_result_classes.py Show resolved Hide resolved

Apply suggestions from code review

8f9efaf