Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: Add 'alternative' and confidence interval to pearsonr #15586

Merged
merged 18 commits into from
Mar 2, 2022

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Feb 13, 2022

Reference issue

supersedes gh-12609
closes gh-12506

What does this implement/fix?

This PR adds one-sided alternatives and a confidence interval to pearsonr. This is an updated version of gh-12609, which stalled over a year ago, and I couldn't push to the original repo.

Why is this important?

The PR is about much more than pearsonr. Besides adding the alternative parameter to the last stats function that obviously needs it, by making the required changes to _make_tuple_bunch, this PR paves the way to add standard errors and confidence intervals in several other essential places (e.g. the t-tests). These are long-overdue improvements to scipy.stats.

Suggestions for review

Please compare against gh-12609, which I did not write. Here, I've just renamed attributes for consistency with other functions/classes, made _alternative and _n private attributes of the result object, and enabled the rendering of the result object documentation. I would have merged the original (if I could push changes) except for the changes to _bunch.py, so I'd suggest focusing on that and double-checking the interface changes.

@mdhaber mdhaber added scipy.stats enhancement A new feature or improvement labels Feb 13, 2022
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
raise AttributeError("can't set attribute %r of class %r"
% (key, self.__class__.__name__))
else:
self.__dict__[key] = val
Copy link
Contributor Author

@mdhaber mdhaber Feb 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is needed to allow setting new attributes. Someone who understands the bunch code should assess whether this is OK.@larsoner, can you help out here?

Based on this comment, it sounds like the only reason this was not allowed before was that there was not an immediate need.

I had namedtuple on the brain, and so was thinking this object should have an API that is locked down. You can't add attributes to a tuple or namedtuple. For now, I think I'd like to keep it locked down. By doing so, we aren't removing an existing feature, since this is only a replacement for objects that are tuples or namedtuples. If it turns out that there is demand from users to assign new attributes to one of these object, or to allow the extra attributes to be modified, we can enable that later.

Now there is a need: objects that inherit from those created using _make_tuple_bunch can have methods (like confidence_interval, so they should also be able to have private attributes to maintain state.

Based on that, I'm more confident it's OK. It seems that most of this code was just adapted from namedtuple and it's not absolutely essential to hold the bunch to the same restrictions.

@larsoner, I'd still really appreciate your thoughts here on the proposed changes to the bunch object:

  • Allowing setting new attributes (those that were not part of the original tuple)
  • Allowing new attribute names to begin with an underscore
  • Elimination of the default documentation of properties to make it easier to render correctly (see here).

but detailed review of the whole PR is not needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me. If we want to be more restrictive, we can just allow private attributes to be set (the one that start with _). Just a small thing though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really see the need to lock this down originally. Sound ok to leave it open as-is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we are careful to ensure that the extra fields are not mutable and that none can be added?

I guess to be consistent with namedtuple.

Sound ok to leave it open as-is?

Yup, I don't see any problem with it.

scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
scipy/stats/_stats_py.py Outdated Show resolved Hide resolved
Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>
Comment on lines +4264 to +4265
result = PearsonRResult(statistic=np.nan, pvalue=np.nan, n=n,
alternative=alternative)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I am worried about is adding _axis_nan_policy_factory to these result tuples. First, how do we retrieve the n and alternative attributes when the decorator is applied? And if we manage to do that, confidence_interval is not vectorized to handle multidimensional inputs.

Do we want to add _axis_nan_policy_factory to pearsonr? I see it is listed in the spreadsheet but not sure if all the functions in the sheet need the decorator.

Copy link
Contributor Author

@mdhaber mdhaber Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gh-15632 shows how to get attributes out of these bunches. I need to merge main now that the keepdims PR has merged; gh-15632 will be simpler after that because it made several similar changes.

The confidence interval question is a really good one. I need to think about that. One quick answer is that I don't see the decorator as a forever-solution. Ideally, in the (perhaps very) long run, calculations will be natively vectorized, NumPy and SciPy functions will offer better support for omitting NaNs on their own, etc., so that maybe we can get the desired behavior easily without a decorator. But yeah, I don't like the idea of having to choose between an object with methods like confidence_interval and a vectorized result. It can certainly be made to work; but I'll need to think more about a clean way.

Copy link
Contributor Author

@mdhaber mdhaber Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tirthasheshpatel I think this is pretty simple, actually. It's just a matter of updating our definition of the PearsonRResult class.

There are two ways we interact with a PearsonRResult object inside the decorator:

  • We extract the results of calls to the pearsonr function from them. Their confidence_interval method doesn't matter at this point. I show how that can work in ENH: stats.wilcoxon: return z-statistic (as requested) #15632.
  • PearsonRResult is the tuple_to_result callable. All the hard stuff (omitting nans, tuple axis, etc.) has already been taken care of by the decorator before tuple_to_result gets called. It gets passed some arguments: an array of statistics, an array of ns, etc. The confidence_interval method just needs to be vectorized to act on each element of these arrays, not along an axis, so it's not so bad.

Does that sound right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an array of statistics, an array of ns, etc.

The decorator won't pass an array of ns and alternatives to tuple_to_result object since the result returned by pearsonr doesn't contain them as iterables. Even if it did, alternative is a string and the decorator treats each output as a float64. I don't know how would one immediately solve this problem.

The confidence_interval method just needs to be vectorized to act on each element, not along an axis, so it's not so bad.

Yeah, this is correct.

Copy link
Contributor Author

@mdhaber mdhaber Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the result returned by pearsonr doesn't contain them as iterables

Again, see gh-15632, which shows the little change that lets us do that. I'll need to think about alternative some more. But it's code. There will be a way.

Copy link
Contributor Author

@mdhaber mdhaber Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First thing that comes to mind that would work without change to the existing structure: result_to_tuple can turn alternative into a float corresponding with one of the three alternatives (e.g. -1 for less, 0 for two-sided, 1 for greater). tuple_to_result can look at any element of the resulting array (they will all be the same) and use that to set the alternative field of the PearsonRResult object to the appropriate string. (Since they will all be the same, no sense making it an array of strings. But n could, in general, vary if there are masked elements or omitted NaNs).

So there are definitely ways, and we could probably make it cleaner if we allow changes to the decorator. I say we think about it in detail when we get there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First thing that comes to mind that would work without change to the existing structure: result_to_tuple can turn alternative into a float corresponding with one of the three alternatives (e.g. -1 for less, 0 for two-sided, 1 for greater). tuple_to_result can look at any element of the resulting array (they will all be the same) and use that to set the alternative field of the PearsonRResult object to the appropriate string.

That could work! One question: IIUC, the decorator could return an array of nan as result (i.e. alternative=nan could be passed to tuple_to_result). In that case, how does one decode the value of alternative? (I guess it doesn't matter since the result is just going to be nan anyways.)

Copy link
Contributor Author

@mdhaber mdhaber Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think it matters.
And I intended for us to allow that behavior - returning an all-nan array when nan_policy='propagage' - to be overridden. When @_axis_nan_policy(....) is called, we could pass a callable,
But yeah, we don't really need to.

Copy link
Member

@tirthasheshpatel tirthasheshpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation and docs looks good. I also verified with R and mpmath and values match. It would be good to add a test for confidence interval when n=2 (test_length_two_pos1) and n=3 (test_length3_r_exactly_negative_one).

@mdhaber
Copy link
Contributor Author

mdhaber commented Feb 28, 2022

It would be good to add a test for confidence interval when n=2 (test_length_two_pos1) and n=3 (test_length3_r_exactly_negative_one).

Good idea. I was doing this but then thought that the confidence interval for n=2 could probably just be -1, 1 instead of nan, nan. I think it's OK to do that if we document it, and it's more user-friendly than NaNS. What do you think?

@tirthasheshpatel
Copy link
Member

Makes sense. -1 and +1 confidence intervals for n<=3 sound good to me.

@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 1, 2022

Anything else you wanted to see here?

Copy link
Member

@tirthasheshpatel tirthasheshpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now! I looked at the tests and doc note for n<=3 and they look good. Thanks @mdhaber! I will merge tomorrow if there are no other comments.

@tirthasheshpatel
Copy link
Member

Really happy to see this will close long-standing gh-12506 once merged!!

@tirthasheshpatel
Copy link
Member

Failures are unrelated. Merging. Thanks @mdhaber @WarrenWeckesser!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: stats: one-sided p-values for statistical tests
3 participants