BUG: stats: mstats.pearsonr calculation is wrong if the masks are not aligned #3645
Comments
I was just working through several examples for this. I still have a small discrepancy in my numbers whose source I haven't figured out yet. (df correction?) |
|
Could you elaborate on this? "A small effect" is still incorrect, isn't it? Is my description of the equivalent implementation using |
Here's my stackoverflow answer. We have (at least) two options for missing value handling: complete case deletion and pairwise deletion. In your use of
Checking the source of
However, the difference between complete and pairwise case deletion on mean and standard deviations is small. The main discrepancy seems to come from the missing correction for the different number of non-missing elements. Ignoring degrees of freedom corrections, I get
which is close to the complete case deletion case. |
We need to estimate the mean, standard deviation and covariance. For the mean and standard deviation only one variable is involved, so we can estimate them using all available values. With complete case deletion, what we get when we use Both can be justified; it's a difference in definition, not really "incorrect". However, if we use different observations for mean and standard deviation, then we have to calculate those correctly: dot product divided by the number of observations actually used minus ddof. This is the discrepancy that we get from using different observations in the mean and standard deviation calculations. (Except: I'm not sure which mean np.ma.cov is using, so there might be a small additional source of difference.)
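To make the two definitions concrete, here is a minimal sketch in plain NumPy. The function names `corr_complete_case` and `corr_pairwise` are made up for illustration, and the pairwise variant is only one possible reading of the description above (each mean and variance from all available values of that variable, the covariance from the jointly observed pairs, each sum divided by the number of observations actually used minus ddof). It is not claimed to match what np.ma.cov or R does.

```python
import numpy as np


def corr_complete_case(x, y):
    """Pearson r using only the pairs where both x and y are unmasked."""
    x = np.ma.asarray(x, dtype=float)
    y = np.ma.asarray(y, dtype=float)
    joint = np.ma.getmaskarray(x) | np.ma.getmaskarray(y)
    xc = x.data[~joint]
    yc = y.data[~joint]
    xd = xc - xc.mean()
    yd = yc - yc.mean()
    return (xd @ yd) / np.sqrt((xd @ xd) * (yd @ yd))


def corr_pairwise(x, y, ddof=1):
    """Pearson r under one possible pairwise-deletion definition (assumption).

    Means and variances use every available value of each variable;
    the covariance uses only the jointly observed pairs. Each sum is
    divided by the number of observations actually used, minus ddof.
    """
    x = np.ma.asarray(x, dtype=float)
    y = np.ma.asarray(y, dtype=float)
    mx = np.ma.getmaskarray(x)
    my = np.ma.getmaskarray(y)
    joint = mx | my

    xa = x.data[~mx]  # all available x values
    ya = y.data[~my]  # all available y values
    mean_x, mean_y = xa.mean(), ya.mean()
    var_x = ((xa - mean_x) ** 2).sum() / (xa.size - ddof)
    var_y = ((ya - mean_y) ** 2).sum() / (ya.size - ddof)

    n_joint = int((~joint).sum())
    cov = ((x.data[~joint] - mean_x) * (y.data[~joint] - mean_y)).sum()
    cov /= n_joint - ddof
    return cov / np.sqrt(var_x * var_y)
```

With no masked values the two functions agree; once the masks differ they generally do not, and the pairwise value is not guaranteed to lie in [-1, 1].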
|
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html The usual discussion is that, if we have several variables, the covariance or correlation between x1 and x2 need not have the same missing values as that between x1 and x3 or x2 and x3. For the covariance matrix the same will apply for the diagonal, i.e. the variances. I think we had a discussion for numpy about what a nan-aware np.cov or np.corrcoef should do. The best would be to offer several options. np.ma.cov has a |
(@josef-pkt, I didn't notice your last comment while I was editing mine, so I deleted my previous comment.) Perhaps a future enhancement is to provide a keyword argument that selects one of several possible behaviors. What should the current (and in the future, the default) behavior be? My reading of the code led me to the description I gave above (where I specify the behavior of the
In the current API, we have just two 1-D arrays, |
I think the default should be the same as the np.ma function np.ma.corrcoef. I think there is no reason here to fix this bug without refactoring to reuse np.ma.corrcoef. All we need is the additional p-value calculation. (including degrees of freedom which is the only work left, I guess) |
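For reference, the p-value step mentioned here could look roughly like the sketch below. It uses the standard two-sided t-test for a Pearson correlation with n - 2 degrees of freedom; the helper name `pearson_pvalue` is hypothetical, and which n to pass in when the masks differ is exactly the open question in this thread (the sketch assumes the number of jointly observed pairs).

```python
import math

from scipy import stats


def pearson_pvalue(r, n):
    """Two-sided p-value for a Pearson correlation r from n paired observations.

    Assumes the usual t-test with df = n - 2; n is taken to be the number
    of pairs actually used to compute r.
    """
    df = n - 2
    if df < 1:
        raise ValueError("need at least 3 paired observations")
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation: the t statistic is infinite
    t = r * math.sqrt(df / (1.0 - r * r))
    return 2.0 * stats.t.sf(abs(t), df)
```

On unmasked data this should agree with the p-value returned by stats.pearsonr up to rounding.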
mstats.spearmanr applies the joint mask. It looks like np.ma.cov has some inconsistency depending on whether we use one array or two arrays:
|
(@josef-pkt: This is a follow-up to earlier comments, not your most recent comment.) I just want to be sure that before any changes are made, the desired behavior is unambiguously defined. It looks like
Apply
For the p-value, I'd go with the number shown above in the result of |
your latest example
your earlier example
This makes me think that we want to go for complete case deletion as the default. It's the safest for users, because they will always get a proper correlation matrix back. Pairwise deletion can be added as an option later. mstats.spearmanr is masking based on the joint mask. There are two or three lines that are missing from pearsonr. As you say, with complete case deletion we can just get the same result as stats.pearsonr on the arrays that are compressed. If we keep complete case deletion as the default for the future, then we can do just a bugfix here. |
Another related question on stackoverflow, about |
After compressing the masked arrays using a common mask for x and y, the data is passed to stats.pearsonr to do the calculation. Closes scipygh-3645.
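A minimal sketch of the approach described in that commit message, written as a standalone function rather than as the actual patch (the name `pearsonr_common_mask` is made up for illustration):

```python
import numpy as np
from scipy import stats


def pearsonr_common_mask(x, y):
    """Complete case deletion: drop every pair where x or y is masked,
    then hand the remaining plain arrays to stats.pearsonr."""
    x = np.ma.asarray(x, dtype=float).ravel()
    y = np.ma.asarray(y, dtype=float).ravel()
    if x.size != y.size:
        raise ValueError("x and y must have the same length")

    # A pair is kept only if it is unmasked in both arrays.
    keep = ~(np.ma.getmaskarray(x) | np.ma.getmaskarray(y))

    r, p = stats.pearsonr(x.data[keep], y.data[keep])
    return float(r), float(p)
```

Because both arrays are compressed with the same mask, the means, standard deviations and covariance are all computed from the same observations, and the degrees of freedom for the p-value follow from the number of pairs kept.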
See http://stackoverflow.com/questions/23601150/why-dont-scipy-stats-mstats-pearsonr-results-agree-with-scipy-stats-pearsonr
The desired behavior isn't explicitly stated in the docstring, but it looks to me like the intent is that
should give the same result as
Currently it does not:
At least one problem in the implementation (pointed out in my answer on stackoverflow) is that the mstats version does not use the common mask when it computes the means of x and y.