[MRG+1] Incorrect implementation of noise_variance_ in PCA._fit_truncated #9108
What does this implement/fix? Explain your changes.
Any other comments?
Regarding the comments, I doubt (though I'm not sure) that relying on links to the issues in the file is convenient or best practice, especially since there are three of them and reading through them all could take a while. A reliable summary should be included instead.
Regarding the test design, the scores calculated by the different solvers being equal is not what we want to check. In fact, if you run the same test code on the failing master with the iris dataset instead, the score method completes without raising an error and yet the three scores are equal. The noise_variance_ formula in _fit_truncated being incorrect is not so directly connected to the score values.
My understanding is that any lack of clarity here could be dangerous if someone later considers changing the dataset, e.g. for test speed. As I said in #8544, get_precision() raises no error in test_arpack_pca_solver and test_pca_randomized_solver, and that is actually not because the iris dataset is an unlikely exception: it suffices that a dataset's explained variance ratio be imbalanced enough for this to happen.
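To make the failure mode concrete, here is a minimal sketch of the kind of smoke test being discussed: simply fitting with each solver and calling score() (which goes through get_precision() and hence noise_variance_) without comparing the returned values. The dataset, shapes, and loop are illustrative, not the actual test from the PR.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 30)  # illustrative data, not the dataset used in the PR

for svd_solver in ["full", "arpack", "randomized"]:
    pca = PCA(n_components=10, svd_solver=svd_solver, random_state=0)
    pca.fit(X)
    # get_precision() inverts a matrix built from noise_variance_; on the
    # buggy code path it could fail for sufficiently imbalanced spectra.
    precision = pca.get_precision()
    log_likelihood = pca.score(X)  # should complete without raising
```

The point of such a test is only that these calls do not raise, not that the scores agree.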
Thanks @wallygauze. If we only focus on the issue, we could simply remove the assertions and only compute the score (the function could be named something like test_pca_score_no_error). Since this is a bug that has been reported repeatedly, it may be better to record the issue numbers or add an entry in what's new. Could you please give me some suggestions about the test? @jnothman
And I do think calculating the score (or, again, calling get_precision) without any kind of comparison is enough as a test.
@wallygauze Another reason for me to design this test was to ensure that in extreme situations the scores are not only calculated but also calculated correctly across the different svd_solver options. Maybe you are right, though, so let's wait for suggestions from the core developers.
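For contrast, the stricter variant proposed here would also compare the scores across solvers. A hedged sketch of that idea, with an illustrative dataset and tolerances (not the PR's actual test):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(50, 10)  # illustrative data

scores = {}
for svd_solver in ["full", "arpack", "randomized"]:
    pca = PCA(n_components=3, svd_solver=svd_solver, random_state=0)
    scores[svd_solver] = pca.fit(X).score(X)

# All solvers fit the same probabilistic PCA model, so with a correct
# noise_variance_ their average log-likelihoods should agree closely.
assert np.allclose(scores["full"], scores["arpack"], rtol=1e-3)
assert np.allclose(scores["full"], scores["randomized"], rtol=1e-3)
```

The objection above is that such equality can hold even on the buggy code for some datasets, so agreement alone does not pin down the noise_variance_ fix.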
This was referenced Jun 19, 2017
Thanks for the pings.
This doesn't appear to test the min(n_features, n_samples) logic.
Are we able to describe the explained_variance_ parameter and its scaling more clearly in the docstring?
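The min(n_features, n_samples) point can be sketched in pure NumPy. Assuming (from the discussion) that noise_variance_ is meant to be the mean of the eigenvalues PCA discards, the key detail is that at most min(n_samples, n_features) eigenvalues of the empirical covariance can be nonzero, so the divisor must use that count rather than n_features:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features, n_components = 20, 50, 5  # n_samples < n_features
X = rng.randn(n_samples, n_features)
Xc = X - X.mean(axis=0)

# Eigenvalues of the empirical covariance via SVD of the centered data.
s = np.linalg.svd(Xc, compute_uv=False)
evals = s ** 2 / (n_samples - 1)

k = min(n_samples, n_features)  # at most k eigenvalues can be nonzero
noise_variance = evals[n_components:k].mean()

# Averaging over n_features - n_components terms instead would silently
# dilute the estimate here, since the trailing n_features - k eigenvalues
# are identically zero.
```

This is an illustrative derivation of the intended quantity, not the code from the PR.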