Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] MNT Fixes for PCA with n_components='mle' #16841

Merged
merged 6 commits into from Apr 7, 2020

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Apr 4, 2020

Fixes #16730
Closes #16546

  • infer_dimension() now returns a value in [1, n_features - 1]. rank = 0 isn't possible anymore (it's a bug fix, as this would transform into an empty array).
  • rank as passed to _assess_dimension() is actually the rank, not an index as previously. This fixes the infamous off-by-one error. Note also that the method isn't defined if we pass it rank == n_features (division by zero)
  • a few fixes in the formula

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug added this to the 0.23 milestone Apr 4, 2020
@lschwetlick
Copy link
Contributor

This looks good to me, thanks for fixing!

@NicolasHug
Copy link
Member Author

Thanks for checking @lschwetlick . If you have time / feel like it, it'd be great if you could check with your notebook if the results make sense now? (but no worries otherwise)

@lschwetlick
Copy link
Contributor

Well... not sure if this is maybe quite edge case-y but

b = np.ones((9, 6))
print("rank=", np.linalg.matrix_rank(b))
u2, s2, vh2 = np.linalg.svd(b, full_matrices=True)
ll = np.empty_like(s2)
ll[0] = -np.inf
for r in range(1,s2.shape[0]):
    ll[r] = (_assess_dimension(np.asarray(s2), r, 9))
print(ll)

-> [-inf nan -inf -inf -inf -inf]

but then funnily enough it gives the right number of dimensions (1) because apparently argmax treats nan>-inf. Its giving nan because if all subsequent values in the spectrum are 0 then in line 76 v is also 0...

@lschwetlick
Copy link
Contributor

I'm also still slightly confused to be honest: we dont test full rank because the method isn't defined for testing full rank. But what if the number of dimensions is full rank?

@NicolasHug
Copy link
Member Author

NicolasHug commented Apr 7, 2020

Thanks @lschwetlick , I fixed that and added a test (though indeed it's an unlikely case I think)

But what if the number of dimensions is full rank?

I guess users use mle when they want some dimensionality reduction, but just don't know by how much. So that's fine not to consider rank==n_features (we can't anyway). There's always the solution to directly provide n_components=n_features.

@lschwetlick
Copy link
Contributor

Alright cool, other than that my notebook and I are happy :)

@agramfort agramfort merged commit a655de5 into scikit-learn:master Apr 7, 2020
@agramfort
Copy link
Member

thanks a lot @NicolasHug and @lschwetlick for the team work !

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
* Fixed off by one in MLE and better handling of small eigenvalues

* light update tests

* pep8

* Added test + threhsold on small log
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: MLE for PCA mis-estimates rank Off-By-One Error in _pca with 'mle'
3 participants