
BUG: stats: mstats.pearsonr calculation is wrong if the masks are not aligned #3645

Closed · WarrenWeckesser opened this issue May 12, 2014 · 12 comments · Fixed by #13169
Labels: defect (A clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats

@WarrenWeckesser
Member

See http://stackoverflow.com/questions/23601150/why-dont-scipy-stats-mstats-pearsonr-results-agree-with-scipy-stats-pearsonr

The desired behavior isn't explicitly stated in the docstring, but it looks to me like the intent is that

result = mstats.pearsonr(x, y)

should give the same result as

m = x.mask | y.mask  # common mask
result = stats.pearsonr(x[~m].data, y[~m].data)

Currently it does not:

In [86]: x = np.ma.masked_array([1,2,3,4,5],mask=[0,1,0,0,0])

In [87]: y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0,0,1,0, 0])

In [88]: mstats.pearsonr(x, y)
Out[88]: 
(-0.93541434669348533, masked_array(data = 0.230053456163,
              mask = False,
        fill_value = 1e+20))

In [89]: m = x.mask | y.mask  # common mask

In [90]: stats.pearsonr(x[~m].data, y[~m].data)
Out[90]: (-1.0, 0.0)

At least one problem in the implementation (pointed out in my answer on stackoverflow) is that the mstats version does not use the common mask when it computes the means of x and y.
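To make the mean problem concrete, here is a small check using the arrays from the example above; it contrasts the per-variable means (what the current mstats code effectively uses) with the means taken over the common mask (what complete case deletion would use). The variable names are just for illustration:

```python
import numpy as np

x = np.ma.masked_array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])

# Per-variable means: x averages 4 values (drops only index 1),
# y averages a different 4 values (drops only index 2).
mean_x, mean_y = x.mean(), y.mean()            # 3.25 and 7.0

# Means over the common mask: both use only the 3 positions
# where neither value is masked.
m = x.mask | y.mask
mean_xc, mean_yc = x[~m].mean(), y[~m].mean()  # 10/3 and 20/3
```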

@josef-pkt
Member

I was just working through several examples for this.
The missing common mask has only a small effect and can be justified by not doing complete case deletion.
The main bug is that there is no correction for the different number of observations in the calculation of covariance and standard deviation.

I still have a small discrepancy in my numbers whose source I haven't figured out yet. (A df correction?)

@josef-pkt
Member

prio-high because it looks like a serious bug if the missing values are far from being aligned.

@WarrenWeckesser
Member Author

The missing common mask has only a small effect and can be justified by not doing complete case deletion.

Could you elaborate on this? "A small effect" is still incorrect, isn't it?

Is my description of the equivalent implementation using stats.pearsonr the correct interpretation of the desired behavior?

@josef-pkt
Member

Here's my stackoverflow answer:

We have (at least) two options for missing value handling, complete case deletion and pairwise deletion.

In your use of scipy.stats.pearsonr you completely drop cases where there is a missing value in any of the variables.

numpy.ma.corrcoef gives the same results.

Checking the source of scipy.stats.mstats.pearsonr, it doesn't do complete case deletion when calculating the variance or the mean.

    >>> xm = x - x.mean(0)
    >>> ym = y - y.mean(0)
    >>> np.ma.dot(xm, ym) / np.sqrt(np.ma.dot(xm, xm) * np.ma.dot(ym, ym))
    0.7731167378113557

    >>> scipy.stats.mstats.pearsonr(x,y)[0]
    0.77311673781135637

However, the difference between complete and pairwise case deletion in the means and standard deviations is small.

The main discrepancy seems to come from the missing correction for the different numbers of non-missing elements. Ignoring degrees of freedom corrections, I get

    >>> np.ma.dot(xm, ym) / bothok.sum() / \
        np.sqrt(np.ma.dot(xm, xm) / (~xm.mask).sum() * np.ma.dot(ym, ym) / (~ym.mask).sum())
    0.85855728319303393

which is close to the complete case deletion case.
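For reference, the same two-step comparison can be reproduced end to end on the arrays from this issue (the stackoverflow data itself is not shown in the thread, so the numbers differ from those above):

```python
import numpy as np

x = np.ma.masked_array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])
bothok = ~(x.mask | y.mask)

xm = x - x.mean()
ym = y - y.mean()

# No count correction: np.ma.dot silently drops masked terms, so the
# numerator sums 3 products while the denominator sums over 4 each.
r_uncorrected = np.ma.dot(xm, ym) / np.sqrt(np.ma.dot(xm, xm) * np.ma.dot(ym, ym))
# -0.9354..., matching the Out[88] value quoted earlier in this issue

# Pairwise-style correction: divide each dot product by the number of
# terms that actually went into it.
r_pairwise = (np.ma.dot(xm, ym) / bothok.sum()) / np.sqrt(
    np.ma.dot(xm, xm) / (~x.mask).sum() * np.ma.dot(ym, ym) / (~y.mask).sum()
)
# On this tiny example the pairwise estimate even falls outside [-1, 1],
# one argument for preferring complete case deletion as the default.
```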

@josef-pkt
Member

Could you elaborate on this? "A small effect" is still incorrect, isn't it?

We need to estimate the mean, standard deviation and covariance. For the mean and standard deviation only one variable is involved, so we can estimate each using all of that variable's available values.

With complete case deletion (what we get when we use stats.pearsonr on the non-missing values based on the joint mask), we also use only the reduced number of observations to calculate the mean and standard deviation.

Both can be justified; it's a difference in definition, not really "incorrect".

However, if we use different observations for mean and standard deviation, then we have to calculate those correctly: dot product divided by the number of observations actually used minus ddof.

This is the discrepancy that we get from using different observations in the mean and standard deviation calculations. (Except: I'm not sure which mean np.ma.cov is using, so there might be a small additional source of difference.)

>>> np.ma.cov(x, y)[0,1] / np.ma.std(x, ddof=1) / np.ma.std(y, ddof=1)
0.8585811978161848
>>> scipy.stats.pearsonr(x[bothok].data,y[bothok].data)[0]
0.85864344658604019
>>> np.ma.corrcoef(x, y)[0,1]
0.85864344658603986

@josef-pkt
Member

http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html

The usual discussion is that if we have several variables, the covariance or correlation between x1 and x2 need not have the same missing values as x1 and x3, or x2 and x3.

For the covariance matrix the same applies to the diagonal, i.e. the variances.
I never checked what R or other packages do for the mean().

I think we had a discussion for numpy about what a nan-aware np.cov or np.corrcoef should do. The best would be to offer several options.

np.ma.cov has an allow_masked keyword. mstats.pearsonr could be rewritten to use np.ma.corrcoef with options.

@WarrenWeckesser
Member Author

(@josef-pkt, I didn't notice your last comment while I was editing mine, so I deleted my previous comment.)

Perhaps a future enhancement is to provide a keyword argument that selects one of several possible behaviors. What should the current (and in the future, the default) behavior be? My reading of the code led me to the description I gave above (where I specify the behavior of the mstats version in terms of the stats version). If that's not a good default, then I don't know what it should be.

The usual discussion is if we have several variables, ...

In the current API, we have just two 1-D arrays, x and y.

@josef-pkt
Member

>>> np.corrcoef(xx, yy)[0, 1] - scipy.stats.pearsonr(xx, yy)[0]
0.0

I think the default should be the same as the np.ma function np.ma.corrcoef.
Whether that's the same as complete case deletion, as in stats.pearsonr(x[~m].data, y[~m].data), is secondary.

I think there is no reason to fix this bug without refactoring to reuse np.ma.corrcoef. All we need in addition is the p-value calculation (including degrees of freedom, which is the only work left, I guess).
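Concretely, the fix could be as small as compressing by the joint mask and delegating to the plain stats function. A minimal sketch (pearsonr_common_mask is a hypothetical name, not the actual mstats implementation):

```python
import numpy as np
from scipy import stats

def pearsonr_common_mask(x, y):
    # Complete case deletion: build the joint mask, drop those positions
    # from both arrays, then reuse stats.pearsonr for both the
    # correlation coefficient and the p-value.
    x = np.ma.asarray(x)
    y = np.ma.asarray(y)
    m = np.ma.getmaskarray(x) | np.ma.getmaskarray(y)
    return stats.pearsonr(x[~m].data, y[~m].data)

x = np.ma.masked_array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])
r, p = pearsonr_common_mask(x, y)  # gives (-1.0, 0.0), as in Out[90] above
```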

@josef-pkt
Member

mstats.spearmanr applies the joint mask.

It looks like np.ma.cov has some inconsistency depending on whether we pass one array or two arrays:

>>> np.ma.cov(x, y, bias=True).data
array([[ 0.98802039,  0.99511658],
       [ 0.99511658,  1.35942827]])
>>> np.ma.cov([x, y], bias=True).data
array([[ 0.98816536,  0.99510108],
       [ 0.99510108,  1.35945933]])
>>> np.ma.dot(x-x[bothok].mean(), y-y[bothok].mean()) / bothok.sum()
0.99511657902222872
>>> np.ma.dot(x-x.mean(), y - y.mean()) / bothok.sum()
0.99510108399067376

@WarrenWeckesser
Member Author

(@josef-pkt: This is a follow-up to earlier comments, not your most recent comment.)

I just want to be sure that before any changes are made, the desired behavior is unambiguously defined. It looks like np.ma.corrcoef(x, y)[0, 1] corresponds to "complete case deletion". For example, with x = [1, --, 0, 4, 5, 6], y = [9, 8, --, --, 5, 2]:

In [150]: x
Out[150]: 
masked_array(data = [1 -- 0 4 5 6],
             mask = [False  True False False False False],
       fill_value = 999999)

In [151]: y
Out[151]: 
masked_array(data = [9 8 -- -- 5 2],
             mask = [False False  True  True False False],
       fill_value = 999999)

In [152]: np.ma.corrcoef(x, y)[0, 1]
Out[152]: -0.96861960450113671

Apply np.corrcoef and stats.pearsonr to the "effective" data sets associated with complete case removal:

In [153]: np.corrcoef([1, 5, 6], [9, 5, 2])[0, 1]
Out[153]: -0.96861960450113671

In [154]: stats.pearsonr([1, 5, 6], [9, 5, 2])
Out[154]: (-0.96861960450113671, 0.15990668218800944)

For the p-value, I'd go with the number shown above in the result of stats.pearsonr. That is, "complete case removal" means values that have been removed are not counted at all. (Again, if that's not a reasonable default, I don't know what is.)
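For completeness, that p-value follows from the standard exact t-distribution formula with n - 2 degrees of freedom, where n counts only the complete pairs. A sketch (pearson_pvalue is a hypothetical helper, not scipy API):

```python
import numpy as np
from scipy import stats

def pearson_pvalue(r, n):
    # Two-sided p-value for Pearson's r from n complete pairs:
    # under the null of zero correlation, t = r * sqrt((n - 2) / (1 - r**2))
    # follows a t distribution with n - 2 degrees of freedom.
    df = n - 2
    t = r * np.sqrt(df / (1.0 - r * r))
    return 2 * stats.t.sf(abs(t), df)

# With the three complete pairs above, n = 3, so df = 1:
r, p = stats.pearsonr([1, 5, 6], [9, 5, 2])
# pearson_pvalue(r, 3) agrees with p (the 0.1599... in Out[154] above)
```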

@josef-pkt
Member

Your latest example:

>>> np.ma.corrcoef(x, y, bias=True).data
array([[ 1.       , -0.9686196],
       [-0.9686196,  1.       ]])
>>> np.ma.corrcoef([x, y], bias=True).data
array([[ 1.        , -1.05471912],
       [-1.05471912,  1.        ]])

Your earlier example:

>>> xw = np.ma.masked_array([1,2,3,4,5],mask=[0,1,0,0,0])
>>> yw = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0,0,1,0, 0])

>>> np.ma.cov(xw, yw, bias=True).data
array([[ 2.88888889, -2.88888889],
       [-2.88888889,  2.88888889]])
>>> np.ma.cov([xw, yw], bias=True).data
array([[ 2.1875    , -2.91666667],
       [-2.91666667,  2.5       ]])


>>> np.ma.corrcoef(xw, yw, bias=True).data
array([[ 1., -1.],
       [-1.,  1.]])
>>> np.ma.corrcoef([xw, yw], bias=True).data
array([[ 1.        , -1.00961538],
       [-1.00961538,  1.        ]])

This makes me think that we want complete case deletion as the default. It's the safest for users, because they will always get a proper correlation matrix back, although we are throwing away information and might have to mask more results.
That means we also remove observations according to the joint mask for the mean and standard deviation.

pairwise deletion can be added as an option later.

mstats.spearmanr masks based on the joint mask. There are two or three lines that are missing from pearsonr.

As you say, with complete case deletion we can just get the same result as stats.pearsonr on the arrays that are compressed.

If we keep complete case deletion as the default for the future, then we can do just a bugfix here.
Given the current behavior of np.ma.corrcoef, I take back my suggestion that mstats should match it.

@WarrenWeckesser
Member Author

WarrenWeckesser commented Jul 17, 2018

WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Nov 29, 2020
After compressing the masked arrays using a common mask for x and
y, the data is passed to stats.pearsonr to do the calculation.

Closes scipygh-3645.
@tylerjereddy added this to the 1.6.0 milestone Dec 1, 2020