
BUG: stats: mstats.pearsonr calculation is wrong if the masks are not aligned #3645

Closed · WarrenWeckesser opened this issue May 12, 2014 · 12 comments · Fixed by #13169
Labels: defect (A clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats

@WarrenWeckesser
Member

See http://stackoverflow.com/questions/23601150/why-dont-scipy-stats-mstats-pearsonr-results-agree-with-scipy-stats-pearsonr

The desired behavior isn't explicitly stated in the docstring, but it looks to me like the intent is that

result = mstats.pearsonr(x, y)

should give the same result as

m = x.mask | y.mask  # common mask
result = stats.pearsonr(x[~m].data, y[~m].data)

Currently it does not:

In [86]: x = np.ma.masked_array([1,2,3,4,5],mask=[0,1,0,0,0])

In [87]: y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0,0,1,0, 0])

In [88]: mstats.pearsonr(x, y)
Out[88]: 
(-0.93541434669348533, masked_array(data = 0.230053456163,
              mask = False,
        fill_value = 1e+20))

In [89]: m = x.mask | y.mask  # common mask

In [90]: stats.pearsonr(x[~m].data, y[~m].data)
Out[90]: (-1.0, 0.0)

At least one problem in the implementation (pointed out in my answer on stackoverflow) is that the mstats version does not use the common mask when it computes the means of x and y.
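To make the mean problem concrete, here is a small check using the arrays from the example above; it contrasts the per-variable means (what the current mstats code effectively uses) with the means taken over the common mask (what complete case deletion would use). The variable names are just for illustration:

```python
import numpy as np

x = np.ma.masked_array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])

# Per-variable means: x averages 4 values (drops only index 1),
# y averages a different 4 values (drops only index 2).
mean_x, mean_y = x.mean(), y.mean()            # 3.25 and 7.0

# Means over the common mask: both use only the 3 positions
# where neither value is masked.
m = x.mask | y.mask
mean_xc, mean_yc = x[~m].mean(), y[~m].mean()  # 10/3 and 20/3
```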

@josef-pkt
Member

I was just working through several examples for this.
The missing common mask has only a small effect and can be justified by not doing complete case deletion.
The main bug is that there is no correction for the different number of observations in the calculation of covariance and standard deviation.

I still have a small discrepancy in my numbers whose source I haven't figured out yet. (A df correction?)

@josef-pkt
Member

prio-high because it looks like a serious bug if the missing values are far from being aligned.

@WarrenWeckesser
Member Author

The missing common mask has only a small effect and can be justified by not doing complete case deletion.

Could you elaborate on this? "A small effect" is still incorrect, isn't it?

Is my description of the equivalent implementation using stats.pearsonr the correct interpretation of the desired behavior?

@josef-pkt
Member

Here's my stackoverflow answer:

We have (at least) two options for missing value handling, complete case deletion and pairwise deletion.

In your use of scipy.stats.pearsonr you completely drop cases where there is a missing value in any of the variables.

numpy.ma.corrcoef gives the same results.

Checking the source of scipy.stats.mstats.pearsonr, it doesn't do complete case deletion when calculating the variance or the mean.

    >>> xm = x - x.mean(0)
    >>> ym = y - y.mean(0)
    >>> np.ma.dot(xm, ym) / np.sqrt(np.ma.dot(xm, xm) * np.ma.dot(ym, ym))
    0.7731167378113557

    >>> scipy.stats.mstats.pearsonr(x,y)[0]
    0.77311673781135637

However, the difference between complete and pairwise case deletion in the means and standard deviations is small.

The main discrepancy seems to come from the missing correction for the different numbers of non-missing elements. Ignoring degrees of freedom corrections, I get

    >>> np.ma.dot(xm, ym) / bothok.sum() / \
        np.sqrt(np.ma.dot(xm, xm) / (~xm.mask).sum() * np.ma.dot(ym, ym) / (~ym.mask).sum())
    0.85855728319303393

which is close to the complete case deletion case.
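For reference, the same two-step comparison can be reproduced end to end on the arrays from this issue (the stackoverflow data itself is not shown in the thread, so the numbers differ from those above):

```python
import numpy as np

x = np.ma.masked_array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])
bothok = ~(x.mask | y.mask)

xm = x - x.mean()
ym = y - y.mean()

# No count correction: np.ma.dot silently drops masked terms, so the
# numerator sums 3 products while the denominator sums over 4 each.
r_uncorrected = np.ma.dot(xm, ym) / np.sqrt(np.ma.dot(xm, xm) * np.ma.dot(ym, ym))
# -0.9354..., matching the Out[88] value quoted earlier in this issue

# Pairwise-style correction: divide each dot product by the number of
# terms that actually went into it.
r_pairwise = (np.ma.dot(xm, ym) / bothok.sum()) / np.sqrt(
    np.ma.dot(xm, xm) / (~x.mask).sum() * np.ma.dot(ym, ym) / (~y.mask).sum()
)
# On this tiny example the pairwise estimate even falls outside [-1, 1],
# one argument for preferring complete case deletion as the default.
```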

@josef-pkt
Member

Could you elaborate on this? "A small effect" is still incorrect, isn't it?

We need to estimate the mean, standard deviation and covariance. For the mean and standard deviation only one variable is involved, so we can estimate each using all of that variable's available values.

With complete case deletion (what we get when we use stats.pearsonr on the non-missing values based on the joint mask), we also use only the reduced number of observations to calculate the mean and standard deviation.

Both can be justified; it's a difference in definition, not really "incorrect".

However, if we use different observations for mean and standard deviation, then we have to calculate those correctly: dot product divided by the number of observations actually used minus ddof.

This is the discrepancy that we get from using different observations in the mean and standard deviation calculations. (Except: I'm not sure which mean np.ma.cov is using, so there might be a small additional source of difference.)

>>> np.ma.cov(x, y)[0,1] / np.ma.std(x, ddof=1) / np.ma.std(y, ddof=1)
0.8585811978161848
>>> scipy.stats.pearsonr(x[bothok].data,y[bothok].data)[0]
0.85864344658604019
>>> np.ma.corrcoef(x, y)[0,1]
0.85864344658603986

@josef-pkt
Member

http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html

The usual discussion is that if we have several variables, the covariance or correlation between x1 and x2 need not have the same missing values as x1 and x3, or x2 and x3.

For the covariance matrix the same applies to the diagonal, i.e. the variances.
I never checked what R or other packages do for the mean().

I think we had a discussion for numpy about what a nan-aware np.cov or np.corrcoef should do. The best would be to offer several options.

np.ma.cov has an allow_masked keyword. mstats.pearsonr could be rewritten to use np.ma.corrcoef with options.

@WarrenWeckesser
Member Author

(@josef-pkt, I didn't notice your last comment while I was editing mine, so I deleted my previous comment.)

Perhaps a future enhancement is to provide a keyword argument that selects one of several possible behaviors. What should the current (and in the future, the default) behavior be? My reading of the code led me to the description I gave above (where I specify the behavior of the mstats version in terms of the stats version). If that's not a good default, then I don't know what it should be.

The usual discussion is if we have several variables, ...

In the current API, we have just two 1-D arrays, x and y.

@josef-pkt
Member

>>> np.corrcoef(xx, yy)[0, 1] - scipy.stats.pearsonr(xx, yy)[0]
0.0

I think the default should be the same as the np.ma function np.ma.corrcoef.
Whether that's the same as complete case deletion, as in stats.pearsonr(x[~m].data, y[~m].data), is secondary.

I think there is no reason to fix this bug without refactoring to reuse np.ma.corrcoef. All we need in addition is the p-value calculation (including degrees of freedom, which is the only work left, I guess).
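Concretely, the fix could be as small as compressing by the joint mask and delegating to the plain stats function. A minimal sketch (pearsonr_common_mask is a hypothetical name, not the actual mstats implementation):

```python
import numpy as np
from scipy import stats

def pearsonr_common_mask(x, y):
    # Complete case deletion: build the joint mask, drop those positions
    # from both arrays, then reuse stats.pearsonr for both the
    # correlation coefficient and the p-value.
    x = np.ma.asarray(x)
    y = np.ma.asarray(y)
    m = np.ma.getmaskarray(x) | np.ma.getmaskarray(y)
    return stats.pearsonr(x[~m].data, y[~m].data)

x = np.ma.masked_array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])
r, p = pearsonr_common_mask(x, y)  # gives (-1.0, 0.0), as in Out[90] above
```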

@josef-pkt
Member

mstats.spearmanr applies the joint mask.

It looks like np.ma.cov has some inconsistency depending on whether we pass one array or two arrays:

>>> np.ma.cov(x, y, bias=True).data
array([[ 0.98802039,  0.99511658],
       [ 0.99511658,  1.35942827]])
>>> np.ma.cov([x, y], bias=True).data
array([[ 0.98816536,  0.99510108],
       [ 0.99510108,  1.35945933]])
>>> np.ma.dot(x-x[bothok].mean(), y-y[bothok].mean()) / bothok.sum()
0.99511657902222872
>>> np.ma.dot(x-x.mean(), y - y.mean()) / bothok.sum()
0.99510108399067376

@WarrenWeckesser
Member Author

(@josef-pkt: This is a follow-up to earlier comments, not your most recent comment.)

I just want to be sure that before any changes are made, the desired behavior is unambiguously defined. It looks like np.ma.corrcoef(x, y)[0, 1] corresponds to "complete case deletion". For example, with x = [1, --, 0, 4, 5, 6], y = [9, 8, --, --, 5, 2]:

In [150]: x
Out[150]: 
masked_array(data = [1 -- 0 4 5 6],
             mask = [False  True False False False False],
       fill_value = 999999)

In [151]: y
Out[151]: 
masked_array(data = [9 8 -- -- 5 2],
             mask = [False False  True  True False False],
       fill_value = 999999)

In [152]: np.ma.corrcoef(x, y)[0, 1]
Out[152]: -0.96861960450113671

Apply np.corrcoef and stats.pearsonr to the "effective" data sets associated with complete case removal:

In [153]: np.corrcoef([1, 5, 6], [9, 5, 2])[0, 1]
Out[153]: -0.96861960450113671

In [154]: stats.pearsonr([1, 5, 6], [9, 5, 2])
Out[154]: (-0.96861960450113671, 0.15990668218800944)

For the p-value, I'd go with the number shown above in the result of stats.pearsonr. That is, "complete case removal" means values that have been removed are not counted at all. (Again, if that's not a reasonable default, I don't know what is.)
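For completeness, that p-value follows from the standard exact t-distribution formula with n - 2 degrees of freedom, where n counts only the complete pairs. A sketch (pearson_pvalue is a hypothetical helper, not scipy API):

```python
import numpy as np
from scipy import stats

def pearson_pvalue(r, n):
    # Two-sided p-value for Pearson's r from n complete pairs:
    # under the null of zero correlation, t = r * sqrt((n - 2) / (1 - r**2))
    # follows a t distribution with n - 2 degrees of freedom.
    df = n - 2
    t = r * np.sqrt(df / (1.0 - r * r))
    return 2 * stats.t.sf(abs(t), df)

# With the three complete pairs above, n = 3, so df = 1:
r, p = stats.pearsonr([1, 5, 6], [9, 5, 2])
# pearson_pvalue(r, 3) agrees with p (the 0.1599... in Out[154] above)
```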

@josef-pkt
Member

Your latest example:

>>> np.ma.corrcoef(x, y, bias=True).data
array([[ 1.       , -0.9686196],
       [-0.9686196,  1.       ]])
>>> np.ma.corrcoef([x, y], bias=True).data
array([[ 1.        , -1.05471912],
       [-1.05471912,  1.        ]])

Your earlier example:

>>> xw = np.ma.masked_array([1,2,3,4,5],mask=[0,1,0,0,0])
>>> yw = np.ma.masked_array([9, 8, 7, 6, 5], mask=[0,0,1,0, 0])

>>> np.ma.cov(xw, yw, bias=True).data
array([[ 2.88888889, -2.88888889],
       [-2.88888889,  2.88888889]])
>>> np.ma.cov([xw, yw], bias=True).data
array([[ 2.1875    , -2.91666667],
       [-2.91666667,  2.5       ]])


>>> np.ma.corrcoef(xw, yw, bias=True).data
array([[ 1., -1.],
       [-1.,  1.]])
>>> np.ma.corrcoef([xw, yw], bias=True).data
array([[ 1.        , -1.00961538],
       [-1.00961538,  1.        ]])

This makes me think that we want complete case deletion as the default. It's the safest for users, because they will always get a proper correlation matrix back, although we are throwing away information and might have to mask more results.
That means we also remove observations according to the joint mask for the mean and standard deviation.

pairwise deletion can be added as an option later.

mstats.spearmanr masks based on the joint mask. There are two or three lines that are missing from pearsonr.

As you say, with complete case deletion we can just get the same result as stats.pearsonr on the arrays that are compressed.

If we keep complete case deletion as the default for the future, then we can do just a bugfix here.
Given the current behavior of np.ma.corrcoef, I take back my suggestion that mstats should match it.

@WarrenWeckesser
Member Author

WarrenWeckesser commented Jul 17, 2018

WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Nov 29, 2020
After compressing the masked arrays using a common mask for x and
y, the data is passed to stats.pearsonr to do the calculation.

Closes scipygh-3645.
@tylerjereddy added this to the 1.6.0 milestone Dec 1, 2020