BUG take n_components strictly greater than fraction of explai… #15669
Conversation
Before the fix, `n_components` was updated with something like this:

`n_components = np.searchsorted(ratio_cumsum, n_components) + 1`

There is a catch in numpy's `searchsorted` function: it returns the index `i` where `a[i-1] < v <= a[i]`. So `i` can be returned such that `v < a[i]`, which is exactly what is mentioned in the documentation:

If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

In that case, adding 1 here is redundant. But if `v == a[i]`, then 1 should be added. Hence I feel using `np.searchsorted` for this task isn't correct, and it should be replaced by an alternative like:

`n_components = np.nonzero(ratio_cumsum > n_components)[0][0]`

Now `n_components` is exactly the total number of principal components required.
Convert `n_components = np.nonzero(ratio_cumsum > n_components)[0][0]` to `n_components = np.argmax(ratio_cumsum > n_components)`
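Both expressions find the position of the first `True` in the boolean mask, so they agree; a quick numpy sketch (the cumulative ratios below are hypothetical, not from the PR) illustrating the equivalence:

```python
import numpy as np

# Hypothetical cumulative explained-variance ratios for a 4-feature dataset
ratio_cumsum = np.array([0.6, 0.85, 0.95, 1.0])
target = 0.9  # requested fraction of explained variance

via_nonzero = np.nonzero(ratio_cumsum > target)[0][0]
via_argmax = np.argmax(ratio_cumsum > target)  # first True in the boolean mask

print(via_nonzero, via_argmax)  # both give index 2
```

`np.argmax` avoids building the intermediate index array that `np.nonzero` returns, which is presumably why it was suggested.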
Thanks! It would be better to re-open the original PR #15663 rather than creating a new one (so that reviewers have the discussion history). cc @glemaitre
Your analysis is still not entirely correct. Your example is dealing with floating-point precision error. Just to illustrate:

>>> 1.1 + 2.2 == 3.3
False
>>> 1.1 + 2.2
3.3000000000000003

In your case, when giving `pca = PCA(n_components=0.9947878161267246); pca.fit(X)`, the number given will be smaller. To be exact, you would need:

n_components = pca.explained_variance_ratio_.cumsum()[2]
pca = PCA(n_components=n_components)

and equality between floats will still be tricky. However, in case of equality, we are taking the index on the left and not on the right, which might be an issue if we look at the documentation. The right fix is to change the line of code to:

`n_components = np.searchsorted(ratio_cumsum, n_components, side='right') + 1`

or to update the documentation accordingly.
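The effect of `side` only shows up at exact equality; a minimal numpy sketch (with hypothetical, exactly representable values) of the behaviour being discussed:

```python
import numpy as np

a = np.array([0.5, 0.75, 0.875, 1.0])  # hypothetical cumulative ratios

# When the value falls strictly between entries, both sides agree
between_left = np.searchsorted(a, 0.8)
between_right = np.searchsorted(a, 0.8, side='right')

# When the value exactly equals an entry, side='right' moves past it
exact_left = np.searchsorted(a, 0.875)                 # index on the left: 2
exact_right = np.searchsorted(a, 0.875, side='right')  # index on the right: 3

print(between_left, between_right, exact_left, exact_right)
```

With `side='right'` plus the trailing `+ 1`, an exact match selects one extra component, i.e. strictly more variance than requested, matching the documented behaviour.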
I don't really have a strong opinion on the matter.
I would go with:
as if you have an integer number of components that matches the expected explained variance, we should get this.
So we will need a regression test, because this solution will not otherwise lead to any failure: this is an untested edge case. A possible regression test:

n_components = pca.explained_variance_ratio_.cumsum()[2]
pca = PCA(n_components=n_components)
assert pca.n_components == 4
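The selection rule that this test exercises can be sketched with plain numpy, without fitting an actual PCA (the per-component ratios below are hypothetical dyadic fractions, chosen so the cumulative sums are exact in floating point):

```python
import numpy as np

def n_components_from_fraction(explained_variance_ratio, fraction):
    """Mimic the fixed rule: smallest number of components whose cumulative
    explained variance is strictly greater than `fraction`."""
    ratio_cumsum = np.cumsum(explained_variance_ratio)
    return np.searchsorted(ratio_cumsum, fraction, side='right') + 1

# Hypothetical per-component ratios for a 5-feature dataset
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])

# Asking for exactly the variance explained by the first 3 components
# must select 4 components, not 3
fraction = ratios[:3].sum()  # 0.875, exactly ratio_cumsum[2]
assert n_components_from_fraction(ratios, fraction) == 4
```

The real regression test goes through `PCA.fit`, of course; this only isolates the arithmetic at the heart of the fix.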
Thanks, @rth, right call. Will remember it next time. @glemaitre, thank you for the detailed explanation. I agree with what you are saying. Yours and @agramfort's solution is much more elegant than mine; I wish I had come up with it myself. If it's okay with you, I will add those lines and push a new commit. Thank you!
LGTM apart from rephrasing the entry in what's new
doc/whats_new/v0.23.rst (outdated)

:mod:`sklearn.decomposition`
............................

- |Fix| :class:`decomposition.PCA` with a float `n_components` parameter, now
Suggested change:
Old: - |Fix| :class:`decomposition.PCA` with a float `n_components` parameter, now
New: - |Fix| :class:`decomposition.PCA` with a float `n_components` parameter will
doc/whats_new/v0.23.rst (outdated)

............................

- |Fix| :class:`decomposition.PCA` with a float `n_components` parameter, now
  currently exclusively chooses the components that explains the variance
Suggested change:
Old: currently exclusively chooses the components that explains the variance
New: exclusively choose the components that explain the variance
In my view, the edits look great! Thank you!
Thank you @KrishnaChaitanya9!
Thank you so much for your time @thomasjpfan @glemaitre @agramfort @rth. Thank you!
So, according to the documentation:

If `0 < n_components < 1` and `svd_solver == 'full'`, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. [Link]

Let's take a look at an example:

Now, if you try to input something like:

Shouldn't the above code have printed 4, according to the documentation? The 3 features out of 4 explain exactly 0.9947878161267246 of the variance, but according to the documentation, strictly more variance should have been explained.
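The scenario can be sketched with hypothetical dyadic variance ratios (exactly representable in floating point, unlike the original dataset's values) to show the pre-fix selection stopping at the exact match:

```python
import numpy as np

# Hypothetical cumulative explained-variance ratios for a 4-feature dataset
ratio_cumsum = np.array([0.5, 0.75, 0.875, 1.0])
requested = 0.875  # exactly the variance explained by the first 3 components

# Pre-fix selection: searchsorted with the default side='left', plus 1
n_selected = np.searchsorted(ratio_cumsum, requested) + 1
print(n_selected)  # 3 components, explaining exactly (not more than) 0.875
```

Under the documented "greater than" rule, 4 components should have been selected here.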
I agree that this scenario looks somewhat artificial, but it can still happen.
So there are two ways of solving it:
Which step to take I leave in the hands of the maintainers.
Changes I made to the code:
First I calculate `n_components_index`, which is equal to `np.searchsorted(ratio_cumsum, n_components)`. I then check whether `n_components == ratio_cumsum[n_components_index]`; if they are equal, I use `np.searchsorted(ratio_cumsum, n_components) + 2`. In that case `searchsorted` has returned an index whose value in `ratio_cumsum` is exactly equal to `n_components`, so I add 1 to move to the next component (explaining strictly more variance, to be at par with the documentation) and another 1 to convert the index into a number of features, AKA `n_components`. This approach will not exceed the array bounds, because the float `n_components` is always strictly between 0 and 1, and the length of `ratio_cumsum` equals the total number of features (call it n); if the penultimate cumulative value is passed as `n_components`, its index is n-2, so this method cannot go out of bounds.
In normal cases, I just do `np.searchsorted(ratio_cumsum, n_components) + 1`, which is exactly what the stable code has right now.
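The two branches described above can be sketched as follows (a hypothetical helper name, and test values chosen to be exactly representable; this is the author's proposed approach, not the final merged fix):

```python
import numpy as np

def select_n_components(ratio_cumsum, n_components):
    """Proposed rule: add one extra component when the requested fraction
    exactly equals a cumulative-sum entry."""
    idx = np.searchsorted(ratio_cumsum, n_components)
    if n_components == ratio_cumsum[idx]:
        # Exact match: take one more component so strictly more
        # variance than requested is explained, then convert to a count
        return idx + 2
    # Normal case: identical to the pre-fix stable code
    return idx + 1

cumsum = np.array([0.5, 0.75, 0.875, 1.0])
print(select_n_components(cumsum, 0.875))  # exact match -> 4
print(select_n_components(cumsum, 0.8))    # normal case -> 3
```

The `side='right'` one-liner adopted later in the thread collapses both branches into a single `searchsorted` call.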
Thank you for your time.
Reference Issues/PRs
Fixes my previous faulty understanding and commit #15663
What does this implement/fix? Explain your changes.
Any other comments?