
[MRG+1] Incorrect implementation of noise_variance_ in PCA._fit_truncated #9108

Merged
merged 17 commits into from Aug 6, 2017
3 changes: 3 additions & 0 deletions doc/whats_new.rst
@@ -239,6 +239,9 @@ Decomposition, manifold learning and clustering
``singular_values_``, like in :class:`decomposition.IncrementalPCA`.
:issue:`7685` by :user:`Tommy Löfstedt <tomlof>`

- Fixed the implementation of ``noise_variance_`` in :class:`decomposition.PCA`.
:issue:`9108` by `Hanmin Qin <https://github.com/qinhanmin2014>`_.

- :class:`decomposition.NMF` now faster when ``beta_loss=0``.
:issue:`9277` by :user:`hongkahjun`.

9 changes: 8 additions & 1 deletion sklearn/decomposition/pca.py
@@ -201,6 +201,9 @@ class PCA(_BasePCA):
explained_variance_ : array, shape (n_components,)
The amount of variance explained by each of the selected components.

Equal to the n_components largest eigenvalues
of the covariance matrix of X.

.. versionadded:: 0.18

explained_variance_ratio_ : array, shape (n_components,)
@@ -232,6 +235,9 @@ class PCA(_BasePCA):
http://www.miketipping.com/papers/met-mppca.pdf. It is required to
compute the estimated data covariance and score samples.

Equal to the average of (n_features - n_components)

Member:

I guess this should be min(n_features, n_samples) - n_components.

Member:

Rather than trying to describe a formula with words, maybe we could follow http://www.miketipping.com/papers/met-mppca.pdf and say something like: "this can be interpreted as the average variance ‘lost’ per discarded dimension"

smallest eigenvalues of the covariance matrix of X.
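
The interpretation suggested above can be sketched numerically. This is a hedged, numpy-only illustration (variable names are mine, not scikit-learn's): noise_variance_ corresponds to the mean of the eigenvalues of the sample covariance matrix that were discarded by keeping only n_components.

```python
import numpy as np

# Sketch: noise_variance_ as the average variance "lost" per discarded
# dimension, i.e. the mean of the discarded covariance eigenvalues.
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
n_components = 3

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)                    # ddof=1, like PCA
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending order

explained_variance = eigvals[:n_components]
noise_variance = eigvals[n_components:].mean()
```

Note that np.linalg.eigvalsh returns eigenvalues in ascending order, hence the explicit sort and reversal.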

References
----------
For n_components == 'mle', this class uses the method of `Thomas P. Minka:
@@ -494,9 +500,10 @@ def _fit_truncated(self, X, n_components, svd_solver):
self.explained_variance_ratio_ = \
self.explained_variance_ / total_var.sum()
self.singular_values_ = S.copy() # Store the singular values.
if self.n_components_ < n_features:
if self.n_components_ < min(n_features, n_samples):

Member:

Could you add a test to make sure that noise_variance_ is 0 when either n_components > n_features or n_components > n_samples?

Member Author:

@lesteve Thanks. Will add in a few days.

Member:

I have just pushed a change to do this.

To be honest I think we should look at the code that validates n_components against n_samples and n_features, and use a helper function where appropriate. Also in incremental_pca.py, as noted in https://github.com/scikit-learn/scikit-learn/pull/9303/files#r131103304.

self.noise_variance_ = (total_var.sum() -
self.explained_variance_.sum())
self.noise_variance_ /= min(n_features, n_samples) - n_components
else:
self.noise_variance_ = 0.
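
As a standalone sanity check of the corrected branch, the logic can be sketched as follows (the helper name is mine, not scikit-learn's): the residual variance is averaged over the min(n_features, n_samples) - n_components discarded dimensions, and falls back to 0 when nothing is discarded.

```python
import numpy as np

# Hedged sketch of the fixed noise_variance_ computation in
# _fit_truncated: divide the residual variance by the number of
# discarded dimensions, min(n_features, n_samples) - n_components.
def truncated_noise_variance(total_var, explained_variance,
                             n_samples, n_features, n_components):
    n_discarded = min(n_features, n_samples) - n_components
    if n_discarded > 0:
        return (total_var - explained_variance.sum()) / n_discarded
    return 0.  # nothing discarded: noise_variance_ must be 0

# e.g. eigenvalues [4, 2, 1, 0.5, 0.5], keeping 2 components:
# residual = 8 - 6 = 2, spread over 3 discarded dims -> 2/3
```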

20 changes: 20 additions & 0 deletions sklearn/decomposition/tests/test_pca.py
@@ -529,6 +529,26 @@ def test_pca_score3():
assert_true(ll.argmax() == 1)


def test_pca_score4():

Member:

You can keep this test but I feel there must be a simpler test. I am going to think about it a bit more and come back to you.

A simpler test more related to your fix would be to check that noise_variance_ is 0 when either n_components > n_features or n_components > n_samples.

Member Author:

@lesteve Thanks. The reason for the current test is that noise_variance_ is not correctly implemented in PCA._fit_truncated but is correctly implemented in PCA._fit_full, which is the main point of this pull request. So we can compare the scores calculated with the different methods to construct the test.

Member:

I am going to think about it a bit more and come back to you.

I think a simpler test is
assert np.all((self.explained_variance_ - self.noise_variance_) >= 0)

In terms of the eigenvalues of the covariance matrix of X, each of the n_components largest eigenvalues should be larger than the average of the smaller eigenvalues.

Member Author:

@lesteve Thanks. But I think it may not be suitable since the small eigenvalues are usually very small. Their sum may still be smaller than the big eigenvalues, making it not a good regression test.

Member:

Their sum may still be smaller than the big eigenvalues, making it not a good regression test.

"May" is the term indeed. On the digits dataset this is not what is happening though. The reason we get infinite values and an error in master when calling get_precision is exactly because np.all((self.explained_variance_ - self.noise_variance_) >= 0) is False (because we took the sum of the eigenvalues rather than the mean).

# Ensure that the scores are correctly calculated in extreme situations

Member:

Not an "extreme situation" at all, it is actually a very common situation ...

# Specially designed for issue #7568, #8541, #8544
digits = datasets.load_digits()
X_digits = digits.data

pca1 = PCA(n_components=30, svd_solver='full')
pca1.fit(X_digits)
score1 = pca1.score(X_digits)
pca2 = PCA(n_components=30, svd_solver='arpack')
pca2.fit(X_digits)
score2 = pca2.score(X_digits)
pca3 = PCA(n_components=30, svd_solver='randomized')

Member:

Use random_state and set it to anything you like, e.g. 0, to ensure determinism.

pca3.fit(X_digits)
score3 = pca3.score(X_digits)

assert_almost_equal(score1, score2, 12)
assert_almost_equal(score1, score3, 2)


def test_svd_solver_auto():
rng = np.random.RandomState(0)
X = rng.uniform(size=(1000, 50))