Corrected definition of pearsonr for masked arrays #5512

Closed
wants to merge 2 commits

Conversation

olliemath

The given formula was incorrect. The t-statistic only follows a Student's t distribution if you ignore each PAIR of values (x_i, y_i) for which at least one of x_i, y_i is masked. This was not done in the denominator, so returning p-values based on that formula is misleading.

Let x' denote the (smaller) array obtained by removing masked values from x, and y' denote y with masked values removed. Further, let x'' be x with values removed if they are masked in either x or y, and similarly y''. The old formula, which equated to

    (len(x'') / sqrt(len(x') * len(y'))) * cov(x'', y'') / (var(x') * var(y')),

does not relate to any other definition of correlation. This version is simply

    cov(x'', y'') / (std(x'') * std(y'')),

which gives the correctly distributed t-statistic.
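A minimal numerical sketch of the corrected definition (the data and variable names here are illustrative, not taken from the patch): drop every pair where either coordinate is masked, then correlate what remains.

```python
import numpy as np
import numpy.ma as ma

# Toy data: the first value of x is missing, y is fully observed.
x = ma.masked_invalid([np.nan, 1.0, 2.0, 3.0, 4.0])
y = ma.masked_array([5.0, 4.0, 2.0, 1.0, 0.0])

# Joint mask: a pair is dropped if EITHER coordinate is masked.
m = ma.mask_or(ma.getmask(x), ma.getmask(y))
xjj = ma.masked_array(x, m)  # x''
yjj = ma.masked_array(y, m)  # y''

# Corrected definition: cov(x'', y'') / (std(x'') * std(y''))
xc, yc = xjj - xjj.mean(), yjj - yjj.mean()
r = float(ma.sum(xc * yc) / ma.sqrt(ma.sum(xc * xc) * ma.sum(yc * yc)))

# Agrees with plain corrcoef on the compressed (valid-pairs-only) data.
r_ref = np.corrcoef(xjj.compressed(), yjj.compressed())[0, 1]
```

With these toy values both computations give the same (negative) correlation, which is the point: the masked-array formula reduces to the ordinary one on the valid pairs.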

(xm, ym) = (x-mx, y-my)

# Mask arrays to make same size
(xm, ym) = (ma.MaskedArray(x, m), ma.MaskedArray(y, m))
Member

this doesn't do anything, since it's not used

Author

Not sure what you mean by "it's not used": it will change the calculation of r_den on line 391 (not quite sure how to reference that line in this comment).

Member

You reassign to (xm, ym) in the next line (388) based on the original x, y. I guess you meant to use xm and ym in line 388.

Author

Ah yeah, sorry, that should be xm and ym in line 388. Just added an edit to fix it.

@josef-pkt
Member

My guess is that you are replacing pairwise deletion by listwise deletion, both are appropriate but different. Pandas has both, I think.

Correction: the logical or masks elementwise, so it's still only pairwise deletion, AFAICS.
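For reference, the two deletion schemes only differ once more than two columns are involved; a pandas sketch (illustrative data, not part of this PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, 1.0, 6.0, 8.0, 10.0],
    "c": [np.nan, 1.0, 2.0, 3.0, 4.0],
})

# Pairwise deletion (pandas default): the a-b correlation uses every
# row where BOTH a and b are observed, regardless of c.
r_pairwise = df.corr().loc["a", "b"]

# Listwise deletion: drop any row with a missing value in ANY column,
# then correlate the complete cases.
r_listwise = df.dropna().corr().loc["a", "b"]
```

Here row 0 (where only c is missing) is used by the pairwise a-b correlation but discarded under listwise deletion, so the two results differ.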

@WarrenWeckesser
Member

We have discussed this issue here: #3645

(Sorry, I don't have time at the moment to review this PR.)

@josef-pkt
Member

(and the function says 1-D, but then ravels, i.e. uses axis=None)

I don't have a (preformed) opinion about using the full or the jointly masked variance. Both are possible. My guess is that using a full variance estimate would be more common if we calculate a correlation matrix (for many pairs). But I never checked what pandas or R are doing.

We use all available information for example in calculating the autocorrelation function, variance based on full sample, covariance with lag j has j fewer pairs.
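That autocorrelation convention can be sketched as follows (an illustrative helper, not statsmodels code): the lag-j covariance uses n - j pairs, while the denominator always uses the full sample.

```python
import numpy as np

def acf_full_variance(x, nlags):
    """Sample ACF normalised by the full-sample variance: the lag-j
    covariance sums over n - j pairs, but the denominator always uses
    all n observations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc * xc)  # full-sample sum of squares
    return np.array([np.sum(xc[j:] * xc[:n - j]) / denom
                     for j in range(nlags + 1)])
```

By Cauchy-Schwarz this normalisation keeps every lag's value in [-1, 1], which is one practical argument for using the full-sample variance.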

@josef-pkt
Member

Thanks for the link Warren, I thought it sounded familiar, but didn't remember the context.

@olliemath
Author

The old method used (implicitly) pairwise deletion in the numerator and something else in the denominator. This one uses pairwise deletion in both.

I have tested that this method agrees with pandas' method for a 2-column DataFrame. As mentioned in that bug report, this change also gives the same behaviour as np.corrcoef:

import numpy as np
import numpy.ma as ma
import pandas as pd
from scipy import stats

np.random.seed(12345)
a, b = np.random.normal(0, 1, [2, 10])
a[0] = np.nan
a, b = ma.masked_invalid(a), ma.masked_invalid(b)
m = ma.mask_or(ma.getmask(a), ma.getmask(b))

df = pd.DataFrame(dict(a=a, b=b))

Now df.corr().loc['a', 'b'] == ma.corrcoef(a, b)[0, 1] == stats.pearsonr(a[~m], b[~m])[0] is True, with each having a value of around 0.397, whereas the value given by stats.mstats.pearsonr(a, b)[0] is around 0.383. Moreover, the general difference in stated p-values was big enough to cause a colleague problems in meteorological applications (which is what led me to examine the source code in the first place).

@olliemath
Author

p.s. I do get the argument about using all information, but for calculating a p-value you need the statistic to be t-distributed, which means only using valid pairs for the whole calculation (rather than just in the numerator); otherwise the old method overweights the centre (see below).
[Figure: histograms of t-values from randomly masked data under the old and new methods, compared with the Student's t pdf]

Code:

import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt
import scipy.stats


def RandomTVals(N):
    # Create random a and b sets: N independent datasets of 100 points
    a, b = np.random.normal(0, 1, [2, 100, N])

    # First 50 values of a are lost
    m = np.array([[True] * N] * 50 + [[False] * N] * 50)
    a = ma.MaskedArray(a, m)
    b = ma.MaskedArray(b, None)

    # Masked copy of b with m is useful
    bm = ma.MaskedArray(b, m)
    df = 50 - 2     # n - 2

    # Centre each dataset (column) by its own mean
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    bm = bm - bm.mean(axis=0)

    # Old calculation, vectorised for speed: pairwise deletion in the
    # numerator only, full b in the denominator
    o_rs = ma.sum(a * b, axis=0) / ma.sqrt(ma.sum(a * a, axis=0) * ma.sum(b * b, axis=0))
    ot_vals = ma.sqrt(df / (1.0 - o_rs ** 2)) * o_rs

    # New calculation, vectorised for speed: pairwise deletion throughout
    n_rs = ma.sum(a * bm, axis=0) / ma.sqrt(ma.sum(a * a, axis=0) * ma.sum(bm * bm, axis=0))
    nt_vals = ma.sqrt(df / (1.0 - n_rs ** 2)) * n_rs

    return ot_vals, nt_vals


ot_vals, nt_vals = RandomTVals(10000)

plt.hist(ot_vals, density=True, color='r', label='Old method: t-values from random masked data sets')
plt.hist(nt_vals, density=True, color='b', label='New method: t-values from random masked data sets')
rv = scipy.stats.t(48)
x = np.linspace(-3, 3, 100)
plt.plot(x, rv.pdf(x), color='k', label="Actual Student's t distribution")
plt.legend()
plt.show()

@rgommers added the scipy.stats and maintenance labels on Nov 22, 2015
@mdhaber
Contributor

mdhaber commented Feb 20, 2022

This issue seems to have been fixed by gh-13169. mstats.pearsonr now ignores "each PAIR of values (x_i, y_i) for which at least one of x_i, y_i is masked".

import numpy as np
from scipy import stats

np.random.seed(0)
x = np.random.rand(100)
y = np.random.rand(100)
m1 = np.random.rand(100) > 0.9
m2 = np.random.rand(100) > 0.9
m = m1 | m2
assert not np.all(m1 == m2)  # arrays are not exactly aligned - passes

xm = np.ma.masked_array(x, m1)
ym = np.ma.masked_array(y, m2)

np.testing.assert_equal(stats.mstats.pearsonr(xm, ym), stats.pearsonr(x[~m], y[~m]))  # passes strict equality check

@mdhaber mdhaber closed this Feb 20, 2022