
Skipped correlation gives different results than the Pernet implementation in Matlab #164

Closed · adamnarai opened this issue Mar 30, 2021 · 11 comments
Assignee: raphaelvallat
Label: invalid 🚩 This doesn't seem right

@adamnarai (Contributor)

I think this line in correlation.py:

dis[i, :] = np.linalg.norm(B * B2[i, :] / bot[i], axis=1)

is functionally different from the Matlab implementation and from the equation in Wilcox, R. (2004).

@raphaelvallat self-assigned this on Mar 30, 2021
@raphaelvallat added the "invalid 🚩 This doesn't seem right" label on Mar 30, 2021
@raphaelvallat (Owner)

Hi @adamnarai,

I coded this function a long time ago, but from what I remember it is essentially an (optimized) translation of the Matlab code from the robust correlation toolbox by Cyril Pernet and Guillaume Rousselet. Furthermore, the output of the function is tested against the original toolbox:

# Compare with robust corr toolbox (x and y are defined earlier in the test)
import numpy as np
from pingouin import corr

stats = corr(x, y, method='skipped')
assert np.round(stats['r'].to_numpy(), 3) == 0.512
assert stats['outliers'].to_numpy() == 2

That said, if you have any idea for a better implementation, and/or if you find inconsistent results with the robust corr toolbox, I'm all for reworking this function.

Thanks!
Raphael

@adamnarai (Contributor, Author)

Hi @raphaelvallat,

This test yields the same result as the robust correlation toolbox; however, I can construct test data for which that is not the case. For example, the data generated by replacing line 17 with the following:

x[3], y[5] = 7, 2.6

yields r=0.474, n_outl=0 in pingouin and r=0.512, n_outl=2 in the robust correlation toolbox.

I would suggest replacing this line:

dis[i, :] = np.linalg.norm(B * B2[i, :] / bot[i], axis=1)

with

dis[i, :] = np.linalg.norm(B.dot(B[i, :, None]) * B[i, :] / bot[i], axis=1)

By applying this change, I found the results consistent with the robust correlation toolbox.
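For context, a minimal sketch of what the corrected line computes, assuming B = X - center and bot = (B**2).sum(axis=1) as in correlation.py (the random data below is only a stand-in): for each point i, every centered point is projected onto the direction of B[i], and the length of that projection is stored.

import numpy as np

rng = np.random.default_rng(42)
B = rng.normal(size=(10, 2))  # stand-in for the centered data X - center
bot = (B**2).sum(axis=1)      # squared norms ||B[i]||**2
n = B.shape[0]
dis = np.zeros((n, n))
for i in range(n):
    if bot[i] != 0:  # skip degenerate (zero-length) directions
        # (B @ B[i, :, None]) holds the dot products of every row with B[i];
        # scaling B[i] by them and dividing by ||B[i]||**2 gives the
        # orthogonal projection of each row of B onto the direction of B[i].
        proj = (B @ B[i, :, None]) * B[i, :] / bot[i]
        dis[i, :] = np.linalg.norm(proj, axis=1)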

Best,
Adam

@raphaelvallat (Owner)

Hi @adamnarai,

Thanks for looking into that! You were right; I have now updated the line and the unit tests in this commit: ce7545d

I will release a new version of Pingouin in the next couple of weeks.
Raphael

@adamnarai (Contributor, Author)

Hi @raphaelvallat,

I still found some differences in the number of outliers between the two implementations when applied to real data (as many as 5 vs. 1 outliers in one case). As an example, using the following values in the unit test (line 18):

x[3], y[5] = 20, 3.5

I get r=0.512, n_outl=2 in pingouin and r=0.418, n_outl=1 in the robust correlation toolbox.
I suspect the difference lies in the MCD estimator implementation, since the center values are also different, but I couldn't reproduce the robust correlation toolbox results by trying different parametrizations of the MinCovDet class.
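For anyone reproducing this, a minimal sketch of how to inspect the two centers that scikit-learn exposes; the data below is made up for illustration, and only the two modified values come from the unit test:

import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(42)
x = rng.normal(loc=4, size=30)  # made-up data, for illustration only
y = x + rng.normal(size=30)
x[3], y[5] = 20, 3.5            # the modified test values
X = np.column_stack((x, y))

mcd = MinCovDet(random_state=42).fit(X)
print(mcd.raw_location_)  # raw (non-reweighted) MCD center
print(mcd.location_)      # reweighted center (the value discussed above)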

Best,
Adam

@raphaelvallat (Owner)

Hi @adamnarai,

Re-opening this issue for visibility.

You're right. I just checked the output of MCD with the example you provided: Matlab returns center = [3.8123, 5.7939], while in Python MinCovDet().fit(X).location_ = [3.78007729, 5.88216227]. As explained in scikit-learn/scikit-learn#4682, this is because sklearn scales the covariance by N and not by N-1 (the latter arguably being more adequate).
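To make the scaling difference concrete (arbitrary data, for illustration only): sklearn's empirical_covariance normalizes by N, whereas np.cov, like Matlab's cov, normalizes by N - 1 by default.

import numpy as np
from sklearn.covariance import empirical_covariance

X = np.array([[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]])
print(empirical_covariance(X))   # divides by N
print(np.cov(X, rowvar=False))   # divides by N - 1 (the Matlab convention)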

I think we could implement the following fix:

[screenshot of the proposed code change]

which gives r=0.418, n_outl=1 (although the center values are still not exactly identical to Matlab's).

Let me know what you think!
Raphael

@raphaelvallat (Owner)

I just pushed a commit with these changes on the develop branch if you want to try it out on real-world data: f2deb57

@adamnarai (Contributor, Author)

Hi @raphaelvallat,

I think the normalization only affects the covariance, not the location, so something else must be causing this difference. I will look into it more if my time allows.

By using the fast_mcd() function directly, we basically get the estimate without re-weighting, which is the same as MinCovDet().fit(X).raw_location_ = [3.81101474, 5.81656632].
As an example, using your commit and x[3], y[5] = 10, 3.5, I got r=0.512, n_outl=2 in Pingouin and r=0.418, n_outl=1 in Matlab.

Best,
Adam

@raphaelvallat (Owner) commented Jul 18, 2021

@adamnarai Ah, you're absolutely right, my bad. I've just tried the following (inspired by sklearn):

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import fast_mcd

X = np.column_stack((x, y))
# Raw MCD estimate, with the covariance scaled by N - 1 instead of N
center, cov, support, dist = fast_mcd(
    X, cov_computation_method=lambda x: np.cov(x, rowvar=False))
# Consistency correction and reweighting, as in sklearn's MinCovDet
correction = np.median(dist) / chi2(X.shape[1]).isf(0.5)
dist /= correction
mask = dist < chi2(2).isf(0.025)
print(X[mask].mean(0))  # reweighted center

But I still get the same center values as sklearn.MinCovDet, meaning that indeed the normalization has no effect on the location. I'm reverting the commit now.

I've tried playing around with different values of support_fraction (the h parameter in Matlab), but I can never get the same results as Matlab, so my guess is that the Matlab/LIBRA algorithm differs from scikit-learn's. I don't think re-implementing the entire MCD algorithm is worth it, so my recommendation would be to add a warning in the skipped correlation function / documentation, along the lines of the sketch below. Let me know what you think.
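Something like this, purely as a sketch (hypothetical wording, not final text):

import warnings

# Hypothetical wording -- a sketch of the kind of warning proposed here.
warnings.warn(
    "The skipped correlation relies on scikit-learn's MinCovDet, which may "
    "return a different robust center (and thus different outliers) than "
    "the Matlab robust correlation toolbox."
)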

Best,
Raphael

@adamnarai (Contributor, Author) commented Jul 19, 2021

I agree; at this point that's the best we can realistically do.

For future reference:
The fast_mcd() function seems to give the same center values as the Matlab mcdcov() raw.center output; however, the other outputs differ, even when using the same covariance normalization and the same support_fraction parameter. This difference means different re-weighting and ultimately different corrected center values.

Best,
Adam

@raphaelvallat (Owner)

Warning added in e56df01. This will be included in the next stable release of Pingouin.

Thanks again for looking into this,
Raphael

@raphaelvallat (Owner)

The warning has been included in the new stable version of Pingouin (v0.4.0). Please make sure to upgrade with pip install --upgrade pingouin.

Thanks,
Closing the issue.
