[MRG+1] Incorrect implementation of noise_variance_ in PCA._fit_truncated #9108
Changes from 10 commits
Changes to the `PCA` docstring (`explained_variance_`):

@@ -201,6 +201,9 @@ class PCA(_BasePCA):
     explained_variance_ : array, shape (n_components,)
         The amount of variance explained by each of the selected components.
 
+        Equal to n_components largest eigenvalues
+        of the covariance matrix of X.
+
         .. versionadded:: 0.18
 
     explained_variance_ratio_ : array, shape (n_components,)
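The added docstring sentence can be checked numerically; a minimal sketch on a small random matrix (any dataset works), comparing `explained_variance_` against the eigenvalues of the unbiased sample covariance matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5)

pca = PCA(n_components=2, svd_solver='full').fit(X)

# Eigenvalues of the unbiased sample covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]

# explained_variance_ holds the n_components largest eigenvalues.
print(np.allclose(pca.explained_variance_, eigvals[:2]))  # → True
```

This works because `explained_variance_` is `S**2 / (n_samples - 1)`, which matches `np.cov`'s default `ddof=1` normalization.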
Changes to the `PCA` docstring (`noise_variance_`):

@@ -232,6 +235,9 @@ class PCA(_BasePCA):
         http://www.miketipping.com/papers/met-mppca.pdf. It is required to
         computed the estimated data covariance and score samples.
 
+        Equal to the average of (n_features - n_components)
+        smallest eigenvalues of the covariance matrix of X.
+
     References
     ----------
     For n_components == 'mle', this class uses the method of `Thomas P. Minka:
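The new `noise_variance_` sentence can likewise be cross-checked against `_fit_full`, where the quantity is computed exactly; a minimal sketch on a tall random matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10)

pca = PCA(n_components=3, svd_solver='full').fit(X)

# All 10 eigenvalues of the sample covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]

# noise_variance_ is the mean of the (n_features - n_components)
# discarded (smallest) eigenvalues.
print(np.isclose(pca.noise_variance_, eigvals[3:].mean()))  # → True
```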
The fix in `_fit_truncated`:

@@ -494,9 +500,10 @@ def _fit_truncated(self, X, n_components, svd_solver):
         self.explained_variance_ratio_ = \
             self.explained_variance_ / total_var.sum()
         self.singular_values_ = S.copy()  # Store the singular values.
-        if self.n_components_ < n_features:
+        if self.n_components_ < min(n_features, n_samples):
             self.noise_variance_ = (total_var.sum() -
                                     self.explained_variance_.sum())
+            self.noise_variance_ /= min(n_features, n_samples) - n_components
         else:
             self.noise_variance_ = 0.

Review comments on this hunk:

> Could you add a test to make sure that …

> @lesteve Thanks. Will add in a few days.

> I have just pushed a change to do this. To be honest, I think we should look at the validation code that validates n_components vs n_samples and n_features, and use a helper function where appropriate. Also in incremental_pca.py, as noted in https://github.com/scikit-learn/scikit-learn/pull/9303/files#r131103304.
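The switch from `n_features` to `min(n_features, n_samples)` matters for wide data: a centered n_samples × n_features matrix has at most min(n_samples, n_features) nonzero covariance eigenvalues, so dividing the residual variance by `n_features - n_components` would underestimate the noise. A sketch (assuming a scikit-learn build that includes this fix) checking that the full and arpack solvers agree on a wide matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(30, 100)  # n_samples < n_features

pca_full = PCA(n_components=10, svd_solver='full').fit(X)
pca_arpack = PCA(n_components=10, svd_solver='arpack').fit(X)

# With the fix, both solvers divide the residual variance by
# min(n_features, n_samples) - n_components = 30 - 10 = 20.
print(np.isclose(pca_full.noise_variance_, pca_arpack.noise_variance_))
```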
Changes to the PCA tests:

@@ -529,6 +529,26 @@ def test_pca_score3():
     assert_true(ll.argmax() == 1)
 
 
+def test_pca_score4():

Review comments on this test:

> You can keep this test but I feel there must be a simpler test. I am going to think about it a bit more and come back to you. A simpler test more related to your fix would be to check that …

> @lesteve Thanks. The reason for the current test is that noise_variance_ is not correctly implemented in PCA._fit_truncated but is correctly implemented in PCA._fit_full, which is the main purpose of the pull request. So we can compare the scores calculated with the different solvers to construct the test.

> I think a simpler test is: in terms of eigenvalues of the covariance matrix of X, each of the n_components largest eigenvalues should be bigger than the average of the smaller eigenvalues.

> @lesteve Thanks. But I think it may not be suitable since the small eigenvalues are usually very small. Their sum may still be smaller than the big eigenvalues, making it not a good regression test.

> "May" is the term indeed. On the digits dataset this is not what is happening though. The reason we get infinite values and an error in master when calling …
The test body continues:

+    # Ensure that the scores are correctly calculated in extreme situations

> Not an "extreme situation" at all, it is actually a very common situation ...

+    # Specially designed for issue #7568, #8541, #8544
+    digits = datasets.load_digits()
+    X_digits = digits.data
+
+    pca1 = PCA(n_components=30, svd_solver='full')
+    pca1.fit(X_digits)
+    score1 = pca1.score(X_digits)
+    pca2 = PCA(n_components=30, svd_solver='arpack')
+    pca2.fit(X_digits)
+    score2 = pca2.score(X_digits)
+    pca3 = PCA(n_components=30, svd_solver='randomized')

> Use the …

+    pca3.fit(X_digits)
+    score3 = pca3.score(X_digits)
+
+    assert_almost_equal(score1, score2, 12)
+    assert_almost_equal(score1, score3, 2)
+
+
 def test_svd_solver_auto():
     rng = np.random.RandomState(0)
     X = rng.uniform(size=(1000, 50))
Further review comments:

> I guess this should be `min(n_features, n_samples) - n_components`.

> Rather than trying to describe a formula with words, maybe we could follow http://www.miketipping.com/papers/met-mppca.pdf and say something like: "this can be interpreted as the average variance 'lost' per discarded dimension".
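The suggestion above matches how Tipping & Bishop present the quantity: with $\lambda_1 \geq \dots \geq \lambda_d$ the eigenvalues of the sample covariance matrix and $q$ retained components, the maximum-likelihood noise estimate is

$$\sigma^2_{\mathrm{ML}} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j,$$

i.e. exactly the average variance "lost" per discarded dimension (with $d$ replaced by $\min(n_{\mathrm{samples}}, n_{\mathrm{features}})$ when the sample covariance is rank deficient, which is what this patch corrects).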