New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Incorrect implementation of explained_variance_ in PCA #9105

Merged
merged 9 commits into from Jun 21, 2017

Conversation

Projects
None yet
4 participants
@qinhanmin2014
Member

qinhanmin2014 commented Jun 11, 2017

Reference Issue

Fixes #7699
Mentioned in #8541 (the 5th comment by IainStrachan on 11 May)
Mentioned in #8544 (the 6th comment by IainStrachan on 11 May)

What does this implement/fix? Explain your changes.

PCA.explained_variance_ is incorrectly implemented. The test is also wrong so that the mistake is not detected.

The result of explained_variance_ is wrong.

result from python:
04
result from R:
01
result according to the definition:
02

Reason for the incorrect implementation.

(1)In the code, explained_variance_ = (S ** 2) / n_samples should be explained_variance_ = (S ** 2) / (n_samples - 1). Because when using SVD to calculate PCA, we get the eigenvalue of A'A(A' means the transpose of A), but what we need is the eigenvalue of A'A/(n_samples - 1) , which is equivalent to the covariance matrix of A.
(2)In the test, we simply use np.var to calculate the variance. That's incorrect. We should use np.cor and pick the elements on the diagonal or set ddof=1 .

Any other comments?

I provide two ways for the new test, one based on current version, another based on the definition.
Please take some time to consider it. Thanks.

qinhanmin2014 added some commits Jun 11, 2017

@qinhanmin2014 qinhanmin2014 changed the title from [WIP] Incorrect implementation of explained_variance_ in PCA to [MRG] Incorrect implementation of explained_variance_ in PCA Jun 11, 2017

qinhanmin2014 added some commits Jun 12, 2017

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Jun 12, 2017

Member

+1 for MRG when CIs are happy

Member

agramfort commented Jun 12, 2017

+1 for MRG when CIs are happy

@agramfort agramfort changed the title from [MRG] Incorrect implementation of explained_variance_ in PCA to [MRG+1] Incorrect implementation of explained_variance_ in PCA Jun 12, 2017

@jnothman jnothman added this to the 0.19 milestone Jun 18, 2017

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Jun 19, 2017

Member

This is issue #7699 right? I thought we were ok with the current implementation? I guess we changed our opinion? I don't have a strong opinion.

Member

amueller commented Jun 19, 2017

This is issue #7699 right? I thought we were ok with the current implementation? I guess we changed our opinion? I don't have a strong opinion.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Jun 19, 2017

Member

@qinhanmin2014 I think when looking at #7699 I checked the definition of PCA, and I don't think any of the books I looked at mentioned the bessel correction. In particular Elements of Statistical Learning, which is my standard reference. Can you provide a reference with your definition, other than prcomp?

Member

amueller commented Jun 19, 2017

@qinhanmin2014 I think when looking at #7699 I checked the definition of PCA, and I don't think any of the books I looked at mentioned the bessel correction. In particular Elements of Statistical Learning, which is my standard reference. Can you provide a reference with your definition, other than prcomp?

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 19, 2017

Member

@amueller Thanks for your instruction, this issue is indeed related to #7699

(1)I think the problem is not whether to use bessel correction. The main point is that the variance is supposed to be equal to the eigenvalue of the covariance matrix. I propose an additional test to prove my implementation.
(2)It seems that Elements of Statistical Learning simply use SVD to calculate PCA and say nothing about the covariance matrix and the calculation of variance.
(3)R, Matlab(princomp) implemented in the same way as the pull request.
(4)Many papers and books also implemented in the same way as the pull request.(e.g. https://arxiv.org/abs/1404.1100 (P11-P12), Machine Learning in Action (https://github.com/pbharrin/machinelearninginaction/blob/master/Ch13/pca.py)). but I currently can't find one which implemented in the same way as sklearn.

What's more, from my perspective, even if we dont't find any strong evidence, it may be better to ensure that sklearn behave the same as others.

Member

qinhanmin2014 commented Jun 19, 2017

@amueller Thanks for your instruction, this issue is indeed related to #7699

(1)I think the problem is not whether to use bessel correction. The main point is that the variance is supposed to be equal to the eigenvalue of the covariance matrix. I propose an additional test to prove my implementation.
(2)It seems that Elements of Statistical Learning simply use SVD to calculate PCA and say nothing about the covariance matrix and the calculation of variance.
(3)R, Matlab(princomp) implemented in the same way as the pull request.
(4)Many papers and books also implemented in the same way as the pull request.(e.g. https://arxiv.org/abs/1404.1100 (P11-P12), Machine Learning in Action (https://github.com/pbharrin/machinelearninginaction/blob/master/Ch13/pca.py)). but I currently can't find one which implemented in the same way as sklearn.

What's more, from my perspective, even if we dont't find any strong evidence, it may be better to ensure that sklearn behave the same as others.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Jun 20, 2017

Member

Oh wow, I misread the covariance test. Never mind, that test is clearly correct.
LGTM. Could you please add an entry to whats_new.rst in the bugfix section?

Member

amueller commented Jun 20, 2017

Oh wow, I misread the covariance test. Never mind, that test is clearly correct.
LGTM. Could you please add an entry to whats_new.rst in the bugfix section?

@amueller amueller changed the title from [MRG+1] Incorrect implementation of explained_variance_ in PCA to [MRG+2] Incorrect implementation of explained_variance_ in PCA Jun 20, 2017

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Jun 20, 2017

Member

@qinhanmin2014 please update what's new and let's merge

Member

agramfort commented Jun 20, 2017

@qinhanmin2014 please update what's new and let's merge

qinhanmin2014 added some commits Jun 20, 2017

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 21, 2017

Member

@amueller @agramfort Finished. I also revert #7843 since these statements in the document are no longer needed. Thanks.

Member

qinhanmin2014 commented Jun 21, 2017

@amueller @agramfort Finished. I also revert #7843 since these statements in the document are no longer needed. Thanks.

@amueller amueller merged commit 2a36ff1 into scikit-learn:master Jun 21, 2017

5 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.27%)
Details
codecov/project 96.3% (+0.03%) compared to 689f412
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Jun 21, 2017

Member

thanks :)

Member

amueller commented Jun 21, 2017

thanks :)

@qinhanmin2014 qinhanmin2014 deleted the qinhanmin2014:my-feature-1 branch Jun 21, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+2] Incorrect implementation of explained_variance_ in PCA (#9105)
* fix pca explained_variance_

* fix fit_transform

* fix test_whitening

* fix IncrementalPCA

* uncomment the test

* improve test

* make CI green

* revert #7843 and add what's new

* fix what's new

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+2] Incorrect implementation of explained_variance_ in PCA (#9105)
* fix pca explained_variance_

* fix fit_transform

* fix test_whitening

* fix IncrementalPCA

* uncomment the test

* improve test

* make CI green

* revert #7843 and add what's new

* fix what's new

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+2] Incorrect implementation of explained_variance_ in PCA (#9105)
* fix pca explained_variance_

* fix fit_transform

* fix test_whitening

* fix IncrementalPCA

* uncomment the test

* improve test

* make CI green

* revert #7843 and add what's new

* fix what's new

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+2] Incorrect implementation of explained_variance_ in PCA (#9105)
* fix pca explained_variance_

* fix fit_transform

* fix test_whitening

* fix IncrementalPCA

* uncomment the test

* improve test

* make CI green

* revert #7843 and add what's new

* fix what's new

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG+2] Incorrect implementation of explained_variance_ in PCA (#9105)
* fix pca explained_variance_

* fix fit_transform

* fix test_whitening

* fix IncrementalPCA

* uncomment the test

* improve test

* make CI green

* revert #7843 and add what's new

* fix what's new

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+2] Incorrect implementation of explained_variance_ in PCA (#9105)
* fix pca explained_variance_

* fix fit_transform

* fix test_whitening

* fix IncrementalPCA

* uncomment the test

* improve test

* make CI green

* revert #7843 and add what's new

* fix what's new

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+2] Incorrect implementation of explained_variance_ in PCA (#9105)
* fix pca explained_variance_

* fix fit_transform

* fix test_whitening

* fix IncrementalPCA

* uncomment the test

* improve test

* make CI green

* revert #7843 and add what's new

* fix what's new
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment