
BUG take n_components strictly greater than fraction of explained variance in PCA #15669


Merged
13 commits merged into scikit-learn:master on Jan 22, 2020

Conversation

krishnachaitanya7 (Contributor)

So according to the documentation:
If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. [Link]
Let's take a look at an example:

>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA 
>>> X, y = load_iris(return_X_y=True)

>>> print(X.shape)
(150, 4)
# Above line shows that there are 4 features in the dataset

>>> pca = PCA().fit(X)   
>>> print(pca.explained_variance_ratio_.cumsum())
[0.92461872 0.97768521 0.99478782 1.        ]
# Above line prints the cumulative sum of the explained variance ratios.
# The printed values are rounded; setting a breakpoint and inspecting the
# exact values of the cumsum gives:
# [0.924618723201727, 0.9776852063187949, 0.9947878161267246, 1.0]
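
For reference, one way to see these exact values without a debugger is tolist(), since the repr of a Python float prints full precision:

>>> print(pca.explained_variance_ratio_.cumsum().tolist())
[0.924618723201727, 0.9776852063187949, 0.9947878161267246, 1.0]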

Now, if you try to input something like:

>>> pca = PCA(n_components=0.9947878161267246); pca.fit(X)
>>> print(pca.n_components_)
3

Shouldn't the above code have printed 4 according to the documentation? The first 3 of the 4 components explain exactly 0.9947878161267246 of the variance, but according to the documentation, the selected components should explain more variance than the specified fraction.

I agree that this scenario looks somewhat artificial, but it can still happen.

So there are two ways of solving it:

  • Change the documentation to say "explains the variance greater than or equal to"
  • Or change the code, as done in this pull request, so that it is on par with the documentation

I leave the choice of which step to take to the maintainers.

Changes I made to the code:
First, I compute n_components_index = np.searchsorted(ratio_cumsum, n_components) and check whether n_components == ratio_cumsum[n_components_index]. If they are equal, searchsorted has landed on a cumulative value exactly equal to the requested fraction, so I use n_components_index + 2: one extra component so that strictly more variance is explained (matching the documentation), plus one more to convert the zero-based index into a number of components. This cannot go out of bounds: the float n_components is always strictly between 0 and 1, and ratio_cumsum has length equal to the total number of features n, so the largest index at which equality can occur is n - 2 (the penultimate value).

In normal cases, I just do np.searchsorted(ratio_cumsum, n_components) + 1, which is exactly what the stable code has right now.
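
A minimal sketch of this logic (assuming ratio_cumsum holds the cumulative explained-variance ratios and n_components is the requested fraction, as above):

import numpy as np

idx = np.searchsorted(ratio_cumsum, n_components)
if n_components == ratio_cumsum[idx]:
    # Exact match: move one component further so that strictly more
    # variance is explained, then +1 to turn the index into a count.
    n_components_ = idx + 2
else:
    # Normal case: +1 turns the zero-based index into a count.
    n_components_ = idx + 1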

Thank you for your time.

Reference Issues/PRs

Fixes my previous faulty understanding and commit #15663

What does this implement/fix? Explain your changes.

Any other comments?

Before, n_components was updated like this:
n_components = np.searchsorted(ratio_cumsum, n_components) + 1
There is a catch in numpy's searchsorted function: it returns an index i such that a[i-1] < v <= a[i]. So i can be returned with v < a[i], which is exactly what is described in the documentation:
If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
so adding 1 here is redundant.

But if v = a[i] then 1 should be added.

Hence I feel that using np.searchsorted for this task isn't correct, and it should be replaced by an alternative such as:
n_components = np.nonzero(ratio_cumsum > n_components)[0][0]

Now, n_components is just the exact total number of principal components required.
Convert `n_components = np.nonzero(ratio_cumsum > n_components)[0][0]` to `n_components = np.argmax(ratio_cumsum > n_components)`
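
For illustration, a small sketch of what these two expressions return on the (rounded) iris cumulative ratios; np.argmax on a boolean array yields the same first-True index without materialising the full index array:

import numpy as np

ratio_cumsum = np.array([0.9246, 0.9777, 0.9948, 1.0])
threshold = 0.9948  # exactly equal to ratio_cumsum[2]

# Index of the first cumulative value strictly greater than the threshold.
print(np.nonzero(ratio_cumsum > threshold)[0][0])  # 3
# Equivalent: argmax of a boolean array is the index of the first True.
print(np.argmax(ratio_cumsum > threshold))         # 3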
@rth (Member) commented Nov 20, 2019

Thanks! It would be better to re-open the original PR #15663 than to create a new one (so that reviewers have the discussion history). cc @glemaitre

@glemaitre (Member)

Your analysis is still not entirely correct. Your example is dealing with a floating-point precision error. Just to illustrate:

>>> 1.1 + 2.2 == 3.3                                                             
False
>>> 1.1 + 2.2                                                                    
3.3000000000000003

In your case, when giving:

pca = PCA(n_components=0.9947878161267246); pca.fit(X)

The number given will be smaller. To be correct, you should have

pca = PCA().fit(X)
n_components = pca.explained_variance_ratio_.cumsum()[2]
pca = PCA(n_components=n_components)

and the equality between floats will still be tricky. However, in case of equality we are taking the index on the left and not on the right, which is an issue if we look at the documentation. The right fix is to change the line of code to:

n_components = np.searchsorted(ratio_cumsum, n_components, side='right') + 1

or to update the documentation with:

If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than or equal to the percentage specified by n_components.

I don't really have a strong opinion on the matter.
ping @agramfort @jnothman
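
For illustration, a minimal sketch (with rounded iris numbers assumed for the example) of how the two side options differ when the requested fraction exactly equals a cumulative value:

import numpy as np

ratio_cumsum = np.array([0.9246, 0.9777, 0.9948, 1.0])
v = ratio_cumsum[2]  # request exactly a value present in the array

# side='left' (the default) returns the index of the matching element,
# so only 3 components are kept: variance equal to, not greater than, v.
print(np.searchsorted(ratio_cumsum, v) + 1)                # 3
# side='right' returns the index just past the matching element, so 4
# components are kept and the explained variance strictly exceeds v.
print(np.searchsorted(ratio_cumsum, v, side='right') + 1)  # 4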

@agramfort (Member)

I would go with:

n_components = np.searchsorted(ratio_cumsum, n_components, side='right') + 1

as, if an integer number of components exactly matches the expected explained variance, we should get this behaviour.

@glemaitre (Member)

So we will need a regression test, because this solution will not make any existing test fail: it is an untested edge case. A possible regression test using load_iris:

pca = PCA().fit(X)
n_components = pca.explained_variance_ratio_.cumsum()[2]
pca = PCA(n_components=n_components).fit(X)
assert pca.n_components_ == 4

@krishnachaitanya7 (Contributor, Author) commented Nov 20, 2019

Thanks, @rth, right call. I will remember it next time. @glemaitre, thank you for the detailed explanation. I agree with what you are saying. Your and @agramfort's solution is much more elegant than mine; I wish I had come up with it myself. If it's okay with you, I will add those lines and push a new commit. Thank you!

@glemaitre changed the title from "Update on my previous pull request #15663" to "BUG take n_components strictly greater than fraction of explained variance in PCA" on Jan 13, 2020
@glemaitre (Member) left a comment:

LGTM apart from rephrasing the entry in what's new.

:mod:`sklearn.decomposition`
............................

- |Fix| :class:`decomposition.PCA` with a float `n_components` parameter, now

Suggested change
- |Fix| :class:`decomposition.PCA` with a float `n_components` parameter, now
- |Fix| :class:`decomposition.PCA` with a float `n_components` parameter will

............................

- |Fix| :class:`decomposition.PCA` with a float `n_components` parameter, now
currently exclusively chooses the components that explains the variance
Suggested change
currently exclusively chooses the components that explains the variance
exclusively choose the components that explain the variance

@krishnachaitanya7 (Contributor, Author)

In my view, the edits look great! Thank you!

@thomasjpfan changed the title from "BUG take n_components strictly greater than fraction of explained variance in PCA" to "BUG take n_components strictly greater than fraction of explai…" on Jan 22, 2020
@thomasjpfan merged commit fd12d56 into scikit-learn:master on Jan 22, 2020
@thomasjpfan (Member)

Thank you @KrishnaChaitanya9 !

@krishnachaitanya7 (Contributor, Author)

Thank you so much for your time @thomasjpfan @glemaitre @agramfort @rth. Thank you!

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020