Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract randomized PCA impl in a dedicated toplevel class #30

Merged
7 commits merged into from Dec 12, 2010

Conversation

ogrisel
Copy link
Member

@ogrisel ogrisel commented Dec 10, 2010

I wanted to make PCA able to handle sparse data (scipy.sparse matrices) using the fast_svd implementation. Since computing PCA for sparse dataset (with big n_features) is only feasable with truncated SVD the API of the existing PCA module (with automated "mle" strategy for finding n_components) is not very well suited to such an evolution.

I hence decided to wrap the fast_svd method in a dedicated RandomizedPCA that makes it explicit that it is able to handle both sparse and dense input provided that you are willing to truncate the singular spectrum to an arbitrary level.

Here is a branch that does just that along with updated docstring, tests and examples along with a renaming of n_comp to n_components to be consistent with the overall scikit-learn naming conventions for dimension parameters.

@GaelVaroquaux
Copy link
Member

I wonder if I like the explicit subclass or not. I guess I tend to like subclasses in general, but here they seem so similar, that I wonder if, in the interest of the user, it would not be interesting to keep it in a single class.

Apart from this, no comments other that I like your renaming of n_comp to n_components, I am sorry, I am having a glance at your code in between talks.

@ogrisel
Copy link
Member Author

ogrisel commented Dec 10, 2010

I first tried to keep a single class able to handle both sparse and dense data on the one hand, mle and explicit n_components init on the other hand and linalg.svd and and fast_svd on the third hand. The interactions between those three dimensions made the code unreadable and the initialization logic too hard to document with too many parameters in the docstring. The new splitted code is much more readable and not that much code duplication in practice.

@agramfort
Copy link
Member

+1 for merge

@ogrisel
Copy link
Member Author

ogrisel commented Dec 11, 2010

Thanks agramfort, I'll wait for a second vote before merging.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants