Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior of canonical correlation analysis (CCA) #14169

Open
shauli-ravfogel opened this issue Jun 23, 2019 · 3 comments
Open

Unexpected behavior of canonical correlation analysis (CCA) #14169

shauli-ravfogel opened this issue Jun 23, 2019 · 3 comments

Comments

@shauli-ravfogel
Copy link

Description

Given two view matrices X and Y, CCA should return the n_components linear combinations of the elements in X's and Y's vectors with the largest correlation, i.e the first component has the largest correlation, the second one has the second-largest correlation in a direction that is uncorrelated with the first one, etc. However, in the following code, the second columns are clearly less correlated than the third columns:

Steps/Code to Reproduce

from sklearn.cross_decomposition import CCA
import numpy as np
from scipy import stats

X = [[0.006061109337620652, 0.0392466675239141, -0.04312459861840733], [-0.6652995467596429, 0.2076410843346226, -0.7817536882379651], [-1.4060868112838942, -0.055029478343267685, -1.006415484608637], [-2.170613455952169, -0.15770102997315535, -1.5223958036356375], [-2.895702070412092, -0.20498481413822175, -1.8232022285963847], [-3.687452614812402, -0.543003880524402, -2.2952030829468533], [-4.206168972149556, -0.3365693935548624, -2.841946351795423], [-5.231288268781064, -0.8024321344988571, -3.40532581256557], [-6.095805742721522, -0.9381144689340173, -3.593752186094848], [-6.297988264542059, -0.7231985020991631, -3.9435579269998406], [-7.10897027952524, -0.8639925998765747, -4.264992629284153], [-8.116238092376772, -1.0123970020855437, -4.96858178622968], [-8.969468878952105, -1.0235782019578692, -5.617282941713933], [-9.839359511108077, -1.2819621078971968, -5.8901943190245625], [-10.181936322525571, -0.9904671991812529, -6.240811384647836]]
Y = [[0.032927114749911154, 0.21320841666565743, -0.23427536580450153], [1.431742605643286, 0.23963850202268067, 0.8438745303679628], [2.908798834568648, 0.7357229001312737, 1.325345683629048], [4.438824821921929, 0.9473643810538429, 2.35038560647864], [5.887201894166226, 1.0302756424934638, 2.964806513433767], [7.409049064480012, 1.3070946380395154, 4.347473875547982], [8.51501831350366, 1.3380108570442941, 4.9533251686263275], [10.57244384646805, 2.31627294094068, 6.028949244604159], [12.22872203222364, 2.1165257564864675, 6.923464021607424], [12.664660419747504, 1.8911363532121173, 7.398432173930664], [14.29235367239137, 2.2098221962551343, 8.000538342827351], [16.327977920399373, 2.643183255720207, 9.257671785118596], [18.081288169620517, 2.968898443090926, 10.221747267811098], [19.754046559146662, 3.051682253577557, 11.244435627784393], [20.466418131910004, 2.644933083198568, 11.752014917896375]]

cca = CCA(n_components=3)
cca.fit(X, Y)
X_transformed, Y_transformed = cca.transform(X, Y)

print(X_transformed)

[[ 1.64277244  0.08237031  0.11724683]
 [ 1.41457457 -0.11600721  0.07162219]
 [ 1.18686358  0.00666119  0.08822118]
 [ 0.94070119 -0.02504267 -0.0112954 ]
 [ 0.71314666  0.02919558  0.25990473]
 [ 0.46246267  0.05607036 -0.16424275]
 [ 0.28625435 -0.09546609 -0.02850206]
 [-0.03644528 -0.03407977 -0.55790786]
 [-0.30127324  0.12266269 -0.12622283]
 [-0.37581414 -0.01941656 -0.0343278 ]
 [-0.62900674  0.05973748  0.13448604]
 [-0.95400947  0.0082079  -0.05487306]
 [-1.23214839 -0.07548718 -0.08864002]
 [-1.50031366  0.05776429  0.00665019]
 [-1.61776455 -0.0571703   0.38788062]]

print(Y_transformed)

[[ 1.64131294  0.01428169  0.11343087]
 [ 1.41330957 -0.06169376  0.06929115]
 [ 1.18580457  0.04946041  0.0853499 ]
 [ 0.9398609   0.01613582 -0.01092778]
 [ 0.71251056  0.06356946  0.25144578]
 [ 0.46205587 -0.05371303 -0.15889725]
 [ 0.28599564 -0.05298207 -0.02757442]
 [-0.03641563  0.02220818 -0.53975   ]
 [-0.30099795  0.01004611 -0.12211474]
 [-0.37547807 -0.063071   -0.03321056]
 [-0.62844374  0.02679924  0.13010901]
 [-0.95315959  0.0185694  -0.05308714]
 [-1.23105629  0.03726576 -0.08575511]
 [-1.49897395  0.01483407  0.00643375]
 [-1.61632484 -0.04171028  0.37525653]]

Expected Results

The correlation between the third columns of X_transfomed and Y_transformed is approximately 1, while the correlation between the second columns is only 0.389. Therefore, I expected the third column to appear in the place of the second column.

Actual Results

When I run CCA with n_components=2 I get the first and second columns, while I should have gotten the first and the third.

Versions

Linux-4.13.0-16-generic-x86_64-with-Ubuntu-17.10-artful
Python 3.6.3 (default, Oct 3 2017, 21:45:48)
NumPy 1.14.3
SciPy 1.2.0
Scikit-Learn 0.19.1

@Qoobis
Copy link

Qoobis commented Jun 25, 2019

I don't mean to be rude but what is the purpose of this. Why are you doing this?

@shauli-ravfogel
Copy link
Author

I am doing this in order to get the shared components between two views of some data.

@amitcs010
Copy link

how to set a threshold value using cca

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants