New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Incorrent implementation of noise_variance_ in PCA._fit_truncated #9108

Merged
merged 17 commits into from Aug 6, 2017

Conversation

Projects
None yet
4 participants
@qinhanmin2014
Member

qinhanmin2014 commented Jun 11, 2017

Reference Issue

Fixes #7568
00
Fixes #8541
Fixes #8544
05

What does this implement/fix? Explain your changes.

noise_variance_ is correctly implemented in PCA._fit_full and IncrementalPCA.partial_fit, but incorrectly implemented in PCA._fit_truncated.
After the modification, in function get_precision(decomposition/base.py), we will not get any 0 in exp_var_diff(line 72), thus will not get any inf in precision(line 74), which is the main reason of #8544 .

Any other comments?

@amueller @lesteve Thanks for your precious time.

@qinhanmin2014 qinhanmin2014 changed the title from [WIP] Incorrent implementation of noise_variance_ in PCA._fit_truncated to [MRG] Incorrent implementation of noise_variance_ in PCA._fit_truncated Jun 11, 2017

@wallygauze

I believe it should be min(n_features, n_samples) instead of n_features (on line 503 as well)

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 12, 2017

Member

@wallygauze Thanks. I agree with you and modify my code. For the current issue, it's enough, but scikit-learn seems to handle situations when n_samples < n_features strangely (See the example below). Maybe some other modifications is needed.
00
@amueller @lesteve

Member

qinhanmin2014 commented Jun 12, 2017

@wallygauze Thanks. I agree with you and modify my code. For the current issue, it's enough, but scikit-learn seems to handle situations when n_samples < n_features strangely (See the example below). Maybe some other modifications is needed.
00
@amueller @lesteve

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 14, 2017

Member

@jnothman @amueller Since the pull request fixes 3 issues, could you please take some time to consider it?

Member

qinhanmin2014 commented Jun 14, 2017

@jnothman @amueller Since the pull request fixes 3 issues, could you please take some time to consider it?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 14, 2017

Member

Could you try to write tests to show that it fixes three issues?

Member

jnothman commented Jun 14, 2017

Could you try to write tests to show that it fixes three issues?

@jnothman jnothman added this to the 0.19 milestone Jun 14, 2017

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 14, 2017

Member

@jnothman Thanks. I construct a regression test based on #7568 and it takes 0.2s on my PC.

Member

qinhanmin2014 commented Jun 14, 2017

@jnothman Thanks. I construct a regression test based on #7568 and it takes 0.2s on my PC.

@qinhanmin2014 qinhanmin2014 deleted the qinhanmin2014:my-feature-2 branch Jun 15, 2017

@qinhanmin2014 qinhanmin2014 restored the qinhanmin2014:my-feature-2 branch Jun 15, 2017

@qinhanmin2014 qinhanmin2014 reopened this Jun 15, 2017

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 15, 2017

Member

Sorry, I delete the branch without noticing it. Reopening.

Member

qinhanmin2014 commented Jun 15, 2017

Sorry, I delete the branch without noticing it. Reopening.

@wallygauze

This comment has been minimized.

Show comment
Hide comment
@wallygauze

wallygauze Jun 16, 2017

Contributor

regression test:
You are right that this code fails with master and not with the fix. However, the test design (and comment) does not reflect its purpose.

Regarding the comments, I doubt (I'm not sure) relying on giving the links to the issues in the file is convenient/best practice, especially since they are three of them and going through them could be long. A reliable summary should be included.

Regarding the test design, the scores calculated by the different solvers being equal is not what we want - actually, if you test the same code with the iris dataset instead in the failing master, the score method is run without raising an error and yet the 3 scores are equal. The noise_variance formula for '_fit_truncated' being incorrect is not so directly connected with the score values.

My understanding is that any lack of clarity could be dangerous in case someone considers changing the dataset in the future for test speed or something similar. I said so in #8544, but get_precision() raises no error in test_arpack_pca_solver and test_pca_randomized_solver, and that is actually not because the iris dataset is an unlikely exception. It suffices that the explained variance ratio of a dataset be imbalanced enough for it to happen.

Contributor

wallygauze commented Jun 16, 2017

regression test:
You are right that this code fails with master and not with the fix. However, the test design (and comment) does not reflect its purpose.

Regarding the comments, I doubt (I'm not sure) relying on giving the links to the issues in the file is convenient/best practice, especially since they are three of them and going through them could be long. A reliable summary should be included.

Regarding the test design, the scores calculated by the different solvers being equal is not what we want - actually, if you test the same code with the iris dataset instead in the failing master, the score method is run without raising an error and yet the 3 scores are equal. The noise_variance formula for '_fit_truncated' being incorrect is not so directly connected with the score values.

My understanding is that any lack of clarity could be dangerous in case someone considers changing the dataset in the future for test speed or something similar. I said so in #8544, but get_precision() raises no error in test_arpack_pca_solver and test_pca_randomized_solver, and that is actually not because the iris dataset is an unlikely exception. It suffices that the explained variance ratio of a dataset be imbalanced enough for it to happen.

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 16, 2017

Member

Thanks @wallygauze . If we only focus on the issue, we may simply remove the assertions and only calculate the score(the name of the function may be something like test_pca_score_no_error). Since it is a bug which is repeatedly reported, it may be better to mark the issue numbers or add it in what's new. Could you please give me some suggestions about the test? @jnothman

Member

qinhanmin2014 commented Jun 16, 2017

Thanks @wallygauze . If we only focus on the issue, we may simply remove the assertions and only calculate the score(the name of the function may be something like test_pca_score_no_error). Since it is a bug which is repeatedly reported, it may be better to mark the issue numbers or add it in what's new. Could you please give me some suggestions about the test? @jnothman

@wallygauze

This comment has been minimized.

Show comment
Hide comment
@wallygauze

wallygauze Jun 16, 2017

Contributor

(My bad, I commented again just before you answered.)
Yes, I see why we'd want to include the issue reference (as long as we include a good summary).

Contributor

wallygauze commented Jun 16, 2017

(My bad, I commented again just before you answered.)
Yes, I see why we'd want to include the issue reference (as long as we include a good summary).

@wallygauze

This comment has been minimized.

Show comment
Hide comment
@wallygauze

wallygauze Jun 16, 2017

Contributor

And I do think calculating the score (or, again, calling get_precision) without any kind of comparison is enough as a test.
If it's too ad-hoc for a regression test , do consider modifying or extending test_arpack_pca_solver and test_pca_randomized_solver so that they do pick the bug with get_precision

Contributor

wallygauze commented Jun 16, 2017

And I do think calculating the score (or, again, calling get_precision) without any kind of comparison is enough as a test.
If it's too ad-hoc for a regression test , do consider modifying or extending test_arpack_pca_solver and test_pca_randomized_solver so that they do pick the bug with get_precision

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 16, 2017

Member

@wallygauze Another reason for me to design this test is to ensure that in extreme situations, the scores are not only calculated but also correctly calculated using different svd_solver. Maybe you are right, so let's wait for the suggestions from the core developers.

Member

qinhanmin2014 commented Jun 16, 2017

@wallygauze Another reason for me to design this test is to ensure that in extreme situations, the scores are not only calculated but also correctly calculated using different svd_solver. Maybe you are right, so let's wait for the suggestions from the core developers.

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 19, 2017

Member

The previous test failure is because of #7893. I use an empty commit to make CI green. Could you please take some time to consider this pull request? Thanks. @jnothman

Member

qinhanmin2014 commented Jun 19, 2017

The previous test failure is because of #7893. I use an empty commit to make CI green. Could you please take some time to consider this pull request? Thanks. @jnothman

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Jun 22, 2017

Member

ping @jnothman Could you please have a look at it? Thanks.

Member

qinhanmin2014 commented Jun 22, 2017

ping @jnothman Could you please have a look at it? Thanks.

@jnothman

Thanks for the pings.

This doesn't appear to test the min(n_features, n_samples) logic.

Are we able to describe the noise_variance_ (and explained_variance_) parameter and its scaling more clearly in the docstring?

Show outdated Hide outdated sklearn/decomposition/tests/test_pca.py
Show outdated Hide outdated sklearn/decomposition/tests/test_pca.py
@jnothman

Otherwise, +1 for merge when green.

Show outdated Hide outdated sklearn/decomposition/tests/test_pca.py
Show outdated Hide outdated sklearn/decomposition/tests/test_pca.py
@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Aug 6, 2017

Member

@jnothman Thanks. I update the comment. CIs are green.

Member

qinhanmin2014 commented Aug 6, 2017

@jnothman Thanks. I update the comment. CIs are green.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 6, 2017

Member

Thanks!

Member

jnothman commented Aug 6, 2017

Thanks!

@jnothman jnothman merged commit 3f53d78 into scikit-learn:master Aug 6, 2017

5 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.18%)
Details
codecov/project Absolute coverage decreased by -0.01% but relative coverage increased by +3.81% compared to 0dd19de
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@qinhanmin2014 qinhanmin2014 deleted the qinhanmin2014:my-feature-2 branch Aug 6, 2017

jnothman added a commit that referenced this pull request Aug 6, 2017

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Aug 6, 2017

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Aug 6, 2017

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Aug 7, 2017

Member

@qinhanmin2014 thanks a lot for the fix!

Member

lesteve commented Aug 7, 2017

@qinhanmin2014 thanks a lot for the fix!

@qinhanmin2014

This comment has been minimized.

Show comment
Hide comment
@qinhanmin2014

qinhanmin2014 Aug 7, 2017

Member

@lesteve Thanks a lot too for your patient instruction and great update. :)

Member

qinhanmin2014 commented Aug 7, 2017

@lesteve Thanks a lot too for your patient instruction and great update. :)

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Aug 12, 2017

Merge tag '0.19.0' into releases
Release 0.19.0

* tag '0.19.0': (99 commits)
  DOC one more version issue in doc
  skip docstring tests because not useful to users and has some issues
  deprecation of n_components happened in 0.19 not 0.18 (#9527)
  sync whatsnew with master so I'm less confused
  DOC more navigation links
  DOC a note on data leakage and pipeline (#9510)
  DOC set release date to Friday
  DOC Update news and menu for 0.19 release
  DOC list of contributors to 0.19
  DOC Change release date to Thursday
  DOC Remove some whitespace from what's new
  Update what's new for recent changes
  Use base.is_classifier instead instead of isinstance (#9482)
  Fix safe_indexing with read-only indices (#9507)
  [MRG+1] add scorer based on explained_variance_score (#9259)
  fix wrong assert in test_validation (#9480)
  [MRG+1] FIX Add missing mixins to ClassifierChain (#9473)
  Bring last code block in line with the image. (#9488)
  FIX Pass sample_weight as kwargs in VotingClassifier (#9493)
  FIX Incorrent implementation of noise_variance_ in PCA._fit_truncated (#9108)
  ...

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Aug 12, 2017

Merge branch 'releases' into dfsg
* releases: (99 commits)
  DOC one more version issue in doc
  skip docstring tests because not useful to users and has some issues
  deprecation of n_components happened in 0.19 not 0.18 (#9527)
  sync whatsnew with master so I'm less confused
  DOC more navigation links
  DOC a note on data leakage and pipeline (#9510)
  DOC set release date to Friday
  DOC Update news and menu for 0.19 release
  DOC list of contributors to 0.19
  DOC Change release date to Thursday
  DOC Remove some whitespace from what's new
  Update what's new for recent changes
  Use base.is_classifier instead instead of isinstance (#9482)
  Fix safe_indexing with read-only indices (#9507)
  [MRG+1] add scorer based on explained_variance_score (#9259)
  fix wrong assert in test_validation (#9480)
  [MRG+1] FIX Add missing mixins to ClassifierChain (#9473)
  Bring last code block in line with the image. (#9488)
  FIX Pass sample_weight as kwargs in VotingClassifier (#9493)
  FIX Incorrent implementation of noise_variance_ in PCA._fit_truncated (#9108)
  ...

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Aug 12, 2017

Merge branch 'dfsg' into debian
* dfsg: (99 commits)
  DOC one more version issue in doc
  skip docstring tests because not useful to users and has some issues
  deprecation of n_components happened in 0.19 not 0.18 (#9527)
  sync whatsnew with master so I'm less confused
  DOC more navigation links
  DOC a note on data leakage and pipeline (#9510)
  DOC set release date to Friday
  DOC Update news and menu for 0.19 release
  DOC list of contributors to 0.19
  DOC Change release date to Thursday
  DOC Remove some whitespace from what's new
  Update what's new for recent changes
  Use base.is_classifier instead instead of isinstance (#9482)
  Fix safe_indexing with read-only indices (#9507)
  [MRG+1] add scorer based on explained_variance_score (#9259)
  fix wrong assert in test_validation (#9480)
  [MRG+1] FIX Add missing mixins to ClassifierChain (#9473)
  Bring last code block in line with the image. (#9488)
  FIX Pass sample_weight as kwargs in VotingClassifier (#9493)
  FIX Incorrent implementation of noise_variance_ in PCA._fit_truncated (#9108)
  ...

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment