-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: stats: Add 'alternative' and confidence interval to pearsonr #15586
Conversation
The confidence interval is computed using Fisher's z transform.
raise AttributeError("can't set attribute %r of class %r" | ||
% (key, self.__class__.__name__)) | ||
else: | ||
self.__dict__[key] = val |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is needed to allow setting new attributes. Someone who understands the bunch code should assess whether this is OK.@larsoner, can you help out here?
Based on this comment, it sounds like the only reason this was not allowed before was that there was not an immediate need.
I had
namedtuple
on the brain, and so was thinking this object should have an API that is locked down. You can't add attributes to a tuple or namedtuple. For now, I think I'd like to keep it locked down. By doing so, we aren't removing an existing feature, since this is only a replacement for objects that are tuples or namedtuples. If it turns out that there is demand from users to assign new attributes to one of these object, or to allow the extra attributes to be modified, we can enable that later.
Now there is a need: objects that inherit from those created using _make_tuple_bunch
can have methods (like confidence_interval
, so they should also be able to have private attributes to maintain state.
Based on that, I'm more confident it's OK. It seems that most of this code was just adapted from namedtuple
and it's not absolutely essential to hold the bunch
to the same restrictions.
@larsoner, I'd still really appreciate your thoughts here on the proposed changes to the bunch object:
- Allowing setting new attributes (those that were not part of the original tuple)
- Allowing new attribute names to begin with an underscore
- Elimination of the default documentation of properties to make it easier to render correctly (see here).
but detailed review of the whole PR is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks good to me. If we want to be more restrictive, we can just allow private attributes to be set (the one that start with _
). Just a small thing though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really see the need to lock this down originally. Sound ok to leave it open as-is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we are careful to ensure that the extra fields are not mutable and that none can be added?
I guess to be consistent with namedtuple
.
Sound ok to leave it open as-is?
Yup, I don't see any problem with it.
[skip azp][skip actions] Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>
[skip actions][skip azp]
[skip azp] [skip actions]
Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>
result = PearsonRResult(statistic=np.nan, pvalue=np.nan, n=n, | ||
alternative=alternative) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I am worried about is adding _axis_nan_policy_factory
to these result tuples. First, how do we retrieve the n
and alternative
attributes when the decorator is applied? And if we manage to do that, confidence_interval
is not vectorized to handle multidimensional inputs.
Do we want to add _axis_nan_policy_factory
to pearsonr
? I see it is listed in the spreadsheet but not sure if all the functions in the sheet need the decorator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gh-15632 shows how to get attributes out of these bunches. I need to merge main
now that the keepdims
PR has merged; gh-15632 will be simpler after that because it made several similar changes.
The confidence interval question is a really good one. I need to think about that. One quick answer is that I don't see the decorator as a forever-solution. Ideally, in the (perhaps very) long run, calculations will be natively vectorized, NumPy and SciPy functions will offer better support for omitting NaNs on their own, etc., so that maybe we can get the desired behavior easily without a decorator. But yeah, I don't like the idea of having to choose between an object with methods like confidence_interval
and a vectorized result. It can certainly be made to work; but I'll need to think more about a clean way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tirthasheshpatel I think this is pretty simple, actually. It's just a matter of updating our definition of the PearsonRResult
class.
There are two ways we interact with a PearsonRResult
object inside the decorator:
- We extract the results of calls to the
pearsonr
function from them. Theirconfidence_interval
method doesn't matter at this point. I show how that can work in ENH: stats.wilcoxon: return z-statistic (as requested) #15632. PearsonRResult
is thetuple_to_result
callable. All the hard stuff (omitting nans, tuple axis, etc.) has already been taken care of by the decorator beforetuple_to_result
gets called. It gets passed some arguments: an array of statistics, an array ofn
s, etc. Theconfidence_interval
method just needs to be vectorized to act on each element of these arrays, not along an axis, so it's not so bad.
Does that sound right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an array of statistics, an array of
n
s, etc.
The decorator won't pass an array of n
s and alternative
s to tuple_to_result
object since the result returned by pearsonr
doesn't contain them as iterables. Even if it did, alternative
is a string and the decorator treats each output as a float64. I don't know how would one immediately solve this problem.
The
confidence_interval
method just needs to be vectorized to act on each element, not along an axis, so it's not so bad.
Yeah, this is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the result returned by pearsonr doesn't contain them as iterables
Again, see gh-15632, which shows the little change that lets us do that. I'll need to think about alternative
some more. But it's code. There will be a way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First thing that comes to mind that would work without change to the existing structure: result_to_tuple
can turn alternative
into a float corresponding with one of the three alternatives (e.g. -1 for less, 0 for two-sided, 1 for greater). tuple_to_result
can look at any element of the resulting array (they will all be the same) and use that to set the alternative
field of the PearsonRResult
object to the appropriate string. (Since they will all be the same, no sense making it an array of strings. But n
could, in general, vary if there are masked elements or omitted NaNs).
So there are definitely ways, and we could probably make it cleaner if we allow changes to the decorator. I say we think about it in detail when we get there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First thing that comes to mind that would work without change to the existing structure:
result_to_tuple
can turnalternative
into a float corresponding with one of the three alternatives (e.g. -1 for less, 0 for two-sided, 1 for greater).tuple_to_result
can look at any element of the resulting array (they will all be the same) and use that to set thealternative
field of thePearsonRResult
object to the appropriate string.
That could work! One question: IIUC, the decorator could return an array of nan
as result (i.e. alternative=nan
could be passed to tuple_to_result
). In that case, how does one decode the value of alternative
? (I guess it doesn't matter since the result is just going to be nan anyways.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't think it matters.
And I intended for us to allow that behavior - returning an all-nan array when nan_policy='propagage' - to be overridden. When @_axis_nan_policy(....)
is called, we could pass a callable,
But yeah, we don't really need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation and docs looks good. I also verified with R and mpmath and values match. It would be good to add a test for confidence interval when n=2 (test_length_two_pos1
) and n=3 (test_length3_r_exactly_negative_one
).
Good idea. I was doing this but then thought that the confidence interval for |
Makes sense. -1 and +1 confidence intervals for n<=3 sound good to me. |
Anything else you wanted to see here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now! I looked at the tests and doc note for n<=3 and they look good. Thanks @mdhaber! I will merge tomorrow if there are no other comments.
Really happy to see this will close long-standing gh-12506 once merged!! |
Failures are unrelated. Merging. Thanks @mdhaber @WarrenWeckesser! |
Reference issue
supersedes gh-12609
closes gh-12506
What does this implement/fix?
This PR adds one-sided alternatives and a confidence interval to
pearsonr
. This is an updated version of gh-12609, which stalled over a year ago, and I couldn't push to the original repo.Why is this important?
The PR is about much more than
pearsonr
. Besides adding thealternative
parameter to the last stats function that obviously needs it, by making the required changes to_make_tuple_bunch
, this PR paves the way to add standard errors and confidence intervals in several other essential places (e.g. the t-tests). These are long-overdue improvements toscipy.stats
.Suggestions for review
Please compare against gh-12609, which I did not write. Here, I've just renamed attributes for consistency with other functions/classes, made
_alternative
and_n
private attributes of the result object, and enabled the rendering of the result object documentation. I would have merged the original (if I could push changes) except for the changes to_bunch.py
, so I'd suggest focusing on that and double-checking the interface changes.