Corrected definition of pearsonr for masked arrays #5512
Conversation
The given formula was incorrect. The t-statistic will only follow a Student's t distribution if you ignore each pair of values (x_i, y_i) for which at least one of x_i, y_i is masked. This was not done in the denominator, so returning p-values based on that formula is misleading.

Let x' denote the (smaller) array obtained by removing masked values from x, and y' denote y with masked values removed. Further let x'' be x with values removed if they are masked in either x or y, and similarly y''. The old formula, which equated to

( len(x'') / sqrt(len(x') * len(y')) ) * cov(x'', y'') / (var(x') * var(y')),

does not relate to any other definition of correlation. This version is simply

cov(x'', y'') / (std(x'') * std(y'')),

which gives the correctly distributed t-statistic.
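The difference between the two formulas can be sketched numerically. This is a hypothetical example, not the patch itself; the old denominator is written here with std(x') * std(y') for dimensional consistency with the description above:

```python
import numpy as np
import numpy.ma as ma

x = ma.masked_invalid([1.0, 2.0, np.nan, 4.0, 5.0])
y = ma.masked_invalid([2.0, np.nan, 3.0, 5.0, 4.0])

# x', y': each array with its own masked values removed
x1, y1 = x.compressed(), y.compressed()
# x'', y'': pairs removed if masked in EITHER array
m = ma.mask_or(ma.getmask(x), ma.getmask(y))
x2, y2 = x.data[~m], y.data[~m]

cov = ((x2 - x2.mean()) * (y2 - y2.mean())).mean()

# Old formula: pairwise deletion in the numerator, per-array deletion in the denominator
r_old = (len(x2) / np.sqrt(len(x1) * len(y1))) * cov / np.sqrt(x1.var() * y1.var())
# New formula: plain Pearson r on the jointly unmasked pairs
r_new = cov / np.sqrt(x2.var() * y2.var())

# The new formula matches the standard definition of correlation
assert np.isclose(r_new, np.corrcoef(x2, y2)[0, 1])
```

On this toy input the two formulas give visibly different values, which is exactly why the old p-values were misleading.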
+ # Mask arrays to make same size
+ (xm, ym) = (ma.MaskedArray(x, m), ma.MaskedArray(y, m))
  (xm, ym) = (x-mx, y-my)
this doesn't do anything, since it's not used
Not sure what you mean it's not used - it will change the calculation of r_den on line 391 (not quite sure how to add that to this comment).
You reassign to (xm, ym) in the next line (388) based on the original x, y
I guess you meant to use xm and ym in line 388
Ah yeah, sorry that should be xm and ym in line 388 - just added an edit to fix it.
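The fix being discussed can be sketched as a standalone function. This is a hypothetical sketch mirroring the jointly masked xm/ym that the diff introduces, not the actual scipy source:

```python
import numpy as np
import numpy.ma as ma

def masked_pearsonr_sketch(x, y):
    """Pearson r using only pairs (x_i, y_i) where neither value is masked."""
    x, y = ma.masked_invalid(x), ma.masked_invalid(y)
    # joint mask: drop a pair if either element is masked
    m = ma.mask_or(ma.getmask(x), ma.getmask(y))
    xm = ma.MaskedArray(x, m)
    ym = ma.MaskedArray(y, m)
    xd, yd = xm - xm.mean(), ym - ym.mean()
    r_num = ma.add.reduce(xd * yd)
    # denominator now uses the same jointly masked deviations as the numerator
    r_den = ma.sqrt(ma.dot(xd, xd) * ma.dot(yd, yd))
    return r_num / r_den
```

With this version, the result agrees with `np.corrcoef` applied to the jointly unmasked pairs.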
My guess is that you are replacing pairwise deletion by listwise deletion; both are appropriate but different. Pandas has both, I think.

Correction: the logical or masks elementwise, so it's still only pairwise deletion, AFAICS.
We have discussed this issue here: #3645 (Sorry, I don't have time at the moment to review this PR.)
(And the function says 1-D, but then ravels, i.e. uses axis=None.)

I don't have a (preformed) opinion about using the full or the jointly masked variance. Both are possible. My guess is that using a full variance estimate would be more common if we calculate a correlation matrix (for many pairs). But I never checked what pandas or R are doing.

We use all available information, for example, in calculating the autocorrelation function: the variance is based on the full sample, while the covariance at lag j has j fewer pairs.
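The pairwise vs. listwise distinction mentioned above can be seen with pandas on hypothetical data: `DataFrame.corr` drops missing values per column pair (pairwise), while calling `dropna()` first discards any row with a missing value (listwise):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(10, 3)), columns=["a", "b", "c"])
df.loc[0, "a"] = np.nan
df.loc[1, "b"] = np.nan

# Pairwise deletion: each pair of columns uses rows complete for that pair only
pairwise = df.corr()

# Listwise deletion: drop every row containing any missing value, then correlate
listwise = df.dropna().corr()

# The (b, c) correlation differs: pairwise keeps row 0, listwise drops it
print(pairwise.loc["b", "c"], listwise.loc["b", "c"])
```

For a 2-column frame the two coincide, which is why the distinction only matters once more columns (or a correlation matrix) are involved.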
Thanks for the link, Warren. I thought it sounded familiar, but didn't remember the context.
The old method used (implicitly) pairwise deletion in the numerator and something else in the denominator. This one uses pairwise deletion in both. Having tested, this method agrees with Pandas' method for a 2-column DataFrame. As mentioned in that bug report, this change also gives the same behaviour as

np.random.seed(12345)
a, b = np.random.normal(0, 1, [2, 10])
a[0] = np.nan
a, b = ma.masked_invalid(a), ma.masked_invalid(b)
m = ma.mask_or(ma.getmask(a), ma.getmask(b))
df = pd.DataFrame(dict(a=a, b=b))

Now
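The comparison in the comment above appears to be truncated; a runnable sketch of the check it describes, assuming (as the comment states) that the intent is agreement with pandas, might look like this (variable names adapted to avoid reassignment):

```python
import numpy as np
import numpy.ma as ma
import pandas as pd
from scipy import stats

np.random.seed(12345)
a, b = np.random.normal(0, 1, [2, 10])
a[0] = np.nan

# masked-array view for scipy, NaN view for pandas
am, bm = ma.masked_invalid(a), ma.masked_invalid(b)
df = pd.DataFrame(dict(a=a, b=b))

# pandas drops the incomplete pair; mstats.pearsonr (post-fix) should agree
r_pd = df["a"].corr(df["b"])
r_ma = stats.mstats.pearsonr(am, bm)[0]
assert np.isclose(r_pd, r_ma)
```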
This issue seems to have been fixed by gh-13169.

import numpy as np
from scipy import stats

np.random.seed(0)
x = np.random.rand(100)
y = np.random.rand(100)
m1 = np.random.rand(100) > 0.9
m2 = np.random.rand(100) > 0.9
m = m1 | m2
assert not np.all(m1 == m2)  # arrays are not exactly aligned - passes
xm = np.ma.masked_array(x, m1)
ym = np.ma.masked_array(y, m2)
np.testing.assert_equal(stats.mstats.pearsonr(xm, ym),
                        stats.pearsonr(x[~m], y[~m]))  # passes strict equality check