Corrected definition of pearsonr for masked arrays #5512

Closed
wants to merge 2 commits

Conversation

olliemath

The given formula was incorrect. The t-statistic only follows a Student's t distribution if you ignore each PAIR of values (x_i, y_i) for which at least one of x_i, y_i is masked. This was not done in the denominator, so returning p-values based on that formula is misleading.

Let x' denote the (smaller) array obtained by removing masked values from x, and y' denote y with masked values removed. Further, let x'' be x with values removed if they are masked in either x or y, and similarly y''. The old formula, which equated to

    (len(x'') / sqrt(len(x') * len(y'))) * cov(x'', y'') / (var(x') * var(y')),

does not relate to any other definition of correlation. This version is simply

    cov(x'', y'') / (std(x'') * std(y'')),

which gives the correctly distributed t-statistic.
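A minimal numerical sketch of the corrected definition (the data and variable names here are illustrative, not taken from the patch): drop every pair where either coordinate is masked, then correlate what remains.

```python
import numpy as np
import numpy.ma as ma

# Toy data: the first value of x is missing, y is fully observed.
x = ma.masked_invalid([np.nan, 1.0, 2.0, 3.0, 4.0])
y = ma.masked_array([5.0, 4.0, 2.0, 1.0, 0.0])

# Joint mask: a pair is dropped if EITHER coordinate is masked.
m = ma.mask_or(ma.getmask(x), ma.getmask(y))
xjj = ma.masked_array(x, m)  # x''
yjj = ma.masked_array(y, m)  # y''

# Corrected definition: cov(x'', y'') / (std(x'') * std(y''))
xc, yc = xjj - xjj.mean(), yjj - yjj.mean()
r = float(ma.sum(xc * yc) / ma.sqrt(ma.sum(xc * xc) * ma.sum(yc * yc)))

# Agrees with plain corrcoef on the compressed (valid-pairs-only) data.
r_ref = np.corrcoef(xjj.compressed(), yjj.compressed())[0, 1]
```

With these toy values both computations give the same (negative) correlation, which is the point: the masked-array formula reduces to the ordinary one on the valid pairs.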

(xm, ym) = (x-mx, y-my)

# Mask arrays to make same size
(xm, ym) = (ma.MaskedArray(x, m), ma.MaskedArray(y, m))
Member

this doesn't do anything, since it's not used

Author

Not sure what you mean by "it's not used": it will change the calculation of r_den on line 391 (not quite sure how to reference that line in this comment).

Member

You reassign to (xm, ym) in the next line (388) based on the original x, y. I guess you meant to use xm and ym in line 388.

Author

Ah yeah, sorry, that should be xm and ym in line 388. Just added an edit to fix it.

@josef-pkt
Member

My guess is that you are replacing pairwise deletion by listwise deletion, both are appropriate but different. Pandas has both, I think.

Correction: the logical or masks elementwise, so it's still only pairwise deletion, AFAICS.
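For reference, the two deletion schemes only differ once more than two columns are involved; a pandas sketch (illustrative data, not part of this PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, 1.0, 6.0, 8.0, 10.0],
    "c": [np.nan, 1.0, 2.0, 3.0, 4.0],
})

# Pairwise deletion (pandas default): the a-b correlation uses every
# row where BOTH a and b are observed, regardless of c.
r_pairwise = df.corr().loc["a", "b"]

# Listwise deletion: drop any row with a missing value in ANY column,
# then correlate the complete cases.
r_listwise = df.dropna().corr().loc["a", "b"]
```

Here row 0 (where only c is missing) is used by the pairwise a-b correlation but discarded under listwise deletion, so the two results differ.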

@WarrenWeckesser
Member

We have discussed this issue here: #3645

(Sorry, I don't have time at the moment to review this PR.)

@josef-pkt
Member

(and the function says 1-D, but then ravels, i.e. uses axis=None)

I don't have a (preformed) opinion about using the full or the jointly masked variance. Both are possible. My guess is that using a full variance estimate would be more common if we calculate a correlation matrix (for many pairs). But I never checked what pandas or R are doing.

We use all available information for example in calculating the autocorrelation function, variance based on full sample, covariance with lag j has j fewer pairs.
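That autocorrelation convention can be sketched as follows (an illustrative helper, not statsmodels code): the lag-j covariance uses n - j pairs, while the denominator always uses the full sample.

```python
import numpy as np

def acf_full_variance(x, nlags):
    """Sample ACF normalised by the full-sample variance: the lag-j
    covariance sums over n - j pairs, but the denominator always uses
    all n observations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc * xc)  # full-sample sum of squares
    return np.array([np.sum(xc[j:] * xc[:n - j]) / denom
                     for j in range(nlags + 1)])
```

By Cauchy-Schwarz this normalisation keeps every lag's value in [-1, 1], which is one practical argument for using the full-sample variance.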

@josef-pkt
Member

Thanks for the link Warren, I thought it sounded familiar, but didn't remember the context.

@olliemath
Author

The old method used (implicitly) pairwise deletion in the numerator and something else in the denominator. This one uses pairwise deletion in both.

I have tested that this method agrees with pandas' method for a 2-column DataFrame. As mentioned in that bug report, this change also gives the same behaviour as np.corrcoef:

import numpy as np
import numpy.ma as ma
import pandas as pd
from scipy import stats

np.random.seed(12345)
a, b = np.random.normal(0, 1, [2, 10])
a[0] = np.nan
a, b = ma.masked_invalid(a), ma.masked_invalid(b)
m = ma.mask_or(ma.getmask(a), ma.getmask(b))

df = pd.DataFrame(dict(a=a, b=b))

Now df.corr().loc['a', 'b'] == ma.corrcoef(a, b)[0, 1] == stats.pearsonr(a[~m], b[~m])[0] is True, with each having a value of around 0.397, whereas the value given by stats.mstats.pearsonr(a, b)[0] is around 0.383. Moreover, the general difference in stated p-values was big enough to cause a colleague problems in meteorological applications (which is what led me to examine the source code in the first place).

@olliemath
Author

p.s. I do get the argument about using all information, but for calculating a p-value you need the statistic to be t-distributed, which means only using valid pairs for the whole calculation (rather than just in the numerator); otherwise the old method overweights the centre (see below).
[Figure: histograms of t-values from randomly masked data under the old and new methods, compared with the Student's t pdf]

Code:

import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt
import scipy.stats


def RandomTVals(N):
    # Create random a and b sets: N independent datasets of 100 points
    a, b = np.random.normal(0, 1, [2, 100, N])

    # First 50 values of a are lost
    m = np.array([[True] * N] * 50 + [[False] * N] * 50)
    a = ma.MaskedArray(a, m)
    b = ma.MaskedArray(b, None)

    # Masked copy of b with m is useful
    bm = ma.MaskedArray(b, m)
    df = 50 - 2     # n - 2

    # Centre each dataset (column) by its own mean
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    bm = bm - bm.mean(axis=0)

    # Old calculation, vectorised for speed: pairwise deletion in the
    # numerator only, full b in the denominator
    o_rs = ma.sum(a * b, axis=0) / ma.sqrt(ma.sum(a * a, axis=0) * ma.sum(b * b, axis=0))
    ot_vals = ma.sqrt(df / (1.0 - o_rs ** 2)) * o_rs

    # New calculation, vectorised for speed: pairwise deletion throughout
    n_rs = ma.sum(a * bm, axis=0) / ma.sqrt(ma.sum(a * a, axis=0) * ma.sum(bm * bm, axis=0))
    nt_vals = ma.sqrt(df / (1.0 - n_rs ** 2)) * n_rs

    return ot_vals, nt_vals


ot_vals, nt_vals = RandomTVals(10000)

plt.hist(ot_vals, density=True, color='r', label='Old method: t-values from random masked data sets')
plt.hist(nt_vals, density=True, color='b', label='New method: t-values from random masked data sets')
rv = scipy.stats.t(48)
x = np.linspace(-3, 3, 100)
plt.plot(x, rv.pdf(x), color='k', label="Actual Student's t distribution")
plt.legend()
plt.show()

@rgommers added the scipy.stats and maintenance labels on Nov 22, 2015
@mdhaber
Contributor

mdhaber commented Feb 20, 2022

This issue seems to have been fixed by gh-13169. mstats.pearsonr now ignores "each PAIR of values (x_i, y_i) for which at least one of x_i, y_i is masked".

import numpy as np
from scipy import stats

np.random.seed(0)
x = np.random.rand(100)
y = np.random.rand(100)
m1 = np.random.rand(100) > 0.9
m2 = np.random.rand(100) > 0.9
m = m1 | m2
assert not np.all(m1 == m2)  # arrays are not exactly aligned - passes

xm = np.ma.masked_array(x, m1)
ym = np.ma.masked_array(y, m2)

np.testing.assert_equal(stats.mstats.pearsonr(xm, ym), stats.pearsonr(x[~m], y[~m]))  # passes strict equality check

@mdhaber mdhaber closed this Feb 20, 2022