
Skipped correlation gives different results than the Pernet implementation in Matlab #164

Closed · adamnarai opened this issue Mar 30, 2021 · 11 comments
Assignee: raphaelvallat
Label: invalid 🚩 This doesn't seem right

@adamnarai (Contributor)

I think this line in correlation.py:

dis[i, :] = np.linalg.norm(B * B2[i, :] / bot[i], axis=1)

is functionally different from the Matlab implementation and from the equation in Wilcox, R. (2004).

@raphaelvallat self-assigned this on Mar 30, 2021
@raphaelvallat added the "invalid 🚩 This doesn't seem right" label on Mar 30, 2021
@raphaelvallat (Owner)

Hi @adamnarai,

I coded this function a long time ago, but from what I remember it is essentially an (optimized) translation of the Matlab code from the robust correlation toolbox by Cyril Pernet and Guillaume Rousselet. Furthermore, the output of the function is tested against the original toolbox:

# Compare with robust corr toolbox (x and y are defined earlier in the test)
import numpy as np
from pingouin import corr

stats = corr(x, y, method='skipped')
assert np.round(stats['r'].to_numpy(), 3) == 0.512
assert stats['outliers'].to_numpy() == 2

That said, if you have any idea for a better implementation, and/or if you find inconsistent results with the robust corr toolbox, I'm all for reworking this function.

Thanks!
Raphael

@adamnarai (Contributor, Author)

Hi @raphaelvallat,

This test yields the same result as the robust correlation toolbox; however, I can construct test data for which that is not the case. For example, the data generated by replacing line 17 with the following:

x[3], y[5] = 7, 2.6

yields r=0.474, n_outl=0 in pingouin and r=0.512, n_outl=2 in the robust correlation toolbox.

I would suggest replacing this line:

dis[i, :] = np.linalg.norm(B * B2[i, :] / bot[i], axis=1)

with

dis[i, :] = np.linalg.norm(B.dot(B[i, :, None]) * B[i, :] / bot[i], axis=1)

By applying this change, I found the results consistent with the robust correlation toolbox.
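For context, a minimal sketch of what the corrected line computes, assuming B = X - center and bot = (B**2).sum(axis=1) as in correlation.py (the random data below is only a stand-in): for each point i, every centered point is projected onto the direction of B[i], and the length of that projection is stored.

import numpy as np

rng = np.random.default_rng(42)
B = rng.normal(size=(10, 2))  # stand-in for the centered data X - center
bot = (B**2).sum(axis=1)      # squared norms ||B[i]||**2
n = B.shape[0]
dis = np.zeros((n, n))
for i in range(n):
    if bot[i] != 0:  # skip degenerate (zero-length) directions
        # (B @ B[i, :, None]) holds the dot products of every row with B[i];
        # scaling B[i] by them and dividing by ||B[i]||**2 gives the
        # orthogonal projection of each row of B onto the direction of B[i].
        proj = (B @ B[i, :, None]) * B[i, :] / bot[i]
        dis[i, :] = np.linalg.norm(proj, axis=1)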

Best,
Adam

@raphaelvallat (Owner)

Hi @adamnarai,

Thanks for looking into that! You were right; I have now updated the line and the unit tests in this commit: ce7545d

I will release a new version of Pingouin in the next couple of weeks.
Raphael

@adamnarai (Contributor, Author)

Hi @raphaelvallat,

I still found some differences in the number of outliers between the two implementations when applied to real data (as many as 5 vs. 1 outliers in one case). As an example, using the following values in the unit test (line 18):

x[3], y[5] = 20, 3.5

I get r=0.512, n_outl=2 in pingouin and r=0.418, n_outl=1 in the robust correlation toolbox.
I suspect the difference lies in the MCD estimator implementation, since the center values are also different, but I couldn't reproduce the robust correlation toolbox results by trying different parametrizations of the MinCovDet class.
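For anyone reproducing this, a minimal sketch of how to inspect the two centers that scikit-learn exposes; the data below is made up for illustration, and only the two modified values come from the unit test:

import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(42)
x = rng.normal(loc=4, size=30)  # made-up data, for illustration only
y = x + rng.normal(size=30)
x[3], y[5] = 20, 3.5            # the modified test values
X = np.column_stack((x, y))

mcd = MinCovDet(random_state=42).fit(X)
print(mcd.raw_location_)  # raw (non-reweighted) MCD center
print(mcd.location_)      # reweighted center (the value discussed above)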

Best,
Adam

@raphaelvallat (Owner)

Hi @adamnarai,

Re-opening this issue for visibility.

You're right. I just checked the output of MCD with the example you provided: Matlab returns center = [3.8123, 5.7939], while in Python MinCovDet().fit(X).location_ = [3.78007729, 5.88216227]. As explained in scikit-learn/scikit-learn#4682, this is because sklearn scales the covariance by N and not by N-1 (the latter arguably being more adequate).
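To make the scaling difference concrete (arbitrary data, for illustration only): sklearn's empirical_covariance normalizes by N, whereas np.cov, like Matlab's cov, normalizes by N - 1 by default.

import numpy as np
from sklearn.covariance import empirical_covariance

X = np.array([[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]])
print(empirical_covariance(X))   # divides by N
print(np.cov(X, rowvar=False))   # divides by N - 1 (the Matlab convention)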

I think we could implement the following fix:

[screenshot of the proposed code change]

which gives r=0.418, n_outl=1 (although the center values are still not exactly identical to Matlab's).

Let me know what you think!
Raphael

@raphaelvallat (Owner)

I just pushed a commit with these changes on the develop branch if you want to try it out on real-world data: f2deb57

@adamnarai (Contributor, Author)

Hi @raphaelvallat,

I think the normalization only affects the covariance, not the location, so something else must be causing this difference. I will look into it more if my time allows.

By using the fast_mcd() function directly, we basically get the estimate without re-weighting, which is the same as MinCovDet().fit(X).raw_location_ = [3.81101474, 5.81656632].
As an example, using your commit and x[3], y[5] = 10, 3.5, I got r=0.512, n_outl=2 in Pingouin and r=0.418, n_outl=1 in Matlab.

Best,
Adam

@raphaelvallat (Owner) commented Jul 18, 2021

@adamnarai Ah, you're absolutely right, my bad. I've just tried the following (inspired by sklearn):

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import fast_mcd

X = np.column_stack((x, y))
# Raw MCD estimate, with the covariance scaled by N - 1 instead of N
center, cov, support, dist = fast_mcd(
    X, cov_computation_method=lambda x: np.cov(x, rowvar=False))
# Consistency correction and reweighting, as in sklearn's MinCovDet
correction = np.median(dist) / chi2(X.shape[1]).isf(0.5)
dist /= correction
mask = dist < chi2(2).isf(0.025)
print(X[mask].mean(0))  # reweighted center

But I still get the same center values as sklearn.MinCovDet, meaning that indeed the normalization has no effect on the location. I'm reverting the commit now.

I've tried playing around with different values of support_fraction (the h parameter in Matlab), but I can never get the same results as Matlab, so my guess is that the Matlab/LIBRA algorithm differs from scikit-learn's. I don't think re-implementing the entire MCD algorithm is worth it, so my recommendation would be to add a warning in the skipped correlation function / documentation, along the lines of the sketch below. Let me know what you think.
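Something like this, purely as a sketch (hypothetical wording, not final text):

import warnings

# Hypothetical wording -- a sketch of the kind of warning proposed here.
warnings.warn(
    "The skipped correlation relies on scikit-learn's MinCovDet, which may "
    "return a different robust center (and thus different outliers) than "
    "the Matlab robust correlation toolbox."
)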

Best,
Raphael

@adamnarai (Contributor, Author) commented Jul 19, 2021

I agree; at this point that's the best we can realistically do.

For future reference:
The fast_mcd() function seems to give the same center values as the Matlab mcdcov() raw.center output; however, the other outputs differ, even when using the same covariance normalization and the same support_fraction parameter. This difference means different re-weighting and ultimately different corrected center values.

Best,
Adam

@raphaelvallat (Owner)

Warning added in e56df01. This will be included in the next stable release of Pingouin.

Thanks again for looking into this,
Raphael

@raphaelvallat (Owner)

The warning has been included in the new stable version of Pingouin (v0.4.0). Please make sure to upgrade with pip install --upgrade pingouin.

Thanks,
Closing the issue.
