Extract randomized PCA impl in a dedicated toplevel class #30

ogrisel · 2010-12-10T21:16:22Z

I wanted to make PCA able to handle sparse data (scipy.sparse matrices) using the fast_svd implementation. Since computing PCA for sparse datasets (with large n_features) is only feasible with truncated SVD, the API of the existing PCA module (with its automated "mle" strategy for finding n_components) is not well suited to such an evolution.

I hence decided to wrap the fast_svd method in a dedicated RandomizedPCA class that makes it explicit that it can handle both sparse and dense input, provided that you are willing to truncate the singular spectrum to an arbitrary level.
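To illustrate why a randomized truncated SVD can accept both sparse and dense input, here is a minimal sketch of the general technique. It is not the fast_svd code from this branch: the function name `randomized_truncated_svd` and the parameters `n_oversamples`, `n_iter` and `seed` are assumptions made for the example. The point is that the input is only ever touched through matrix products, which scipy.sparse matrices support as well as NumPy arrays do.

```python
import numpy as np
import scipy.sparse as sp


def randomized_truncated_svd(X, n_components, n_oversamples=10, n_iter=2, seed=0):
    """Approximate the leading n_components singular triplets of X.

    Hypothetical helper for illustration only. X may be a NumPy array or a
    scipy.sparse matrix, because X is only used in matrix products.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Sample a few extra directions to improve the approximation.
    k = min(n_components + n_oversamples, n_samples, n_features)

    # Random projection: the columns of Q approximately span the range of X.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((n_features, k)))

    # Power iterations sharpen the subspace when the spectrum decays slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)

    # Exact SVD of the small dense matrix B = Q^T X, of shape (k, n_features).
    B = (X.T @ Q).T
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small

    # Truncate the singular spectrum to the requested level.
    return U[:, :n_components], s[:n_components], Vt[:n_components]


# Usage: the same call works for a dense array and its sparse counterpart.
rng = np.random.default_rng(42)
X_dense = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))  # rank 5
X_sparse = sp.csr_matrix(X_dense)

U_d, s_d, Vt_d = randomized_truncated_svd(X_dense, n_components=5)
U_s, s_s, Vt_s = randomized_truncated_svd(X_sparse, n_components=5)
```

Note that this sketch does not center the data. Subtracting the per-feature mean would make a sparse matrix dense, so the sketch corresponds to a plain truncated SVD rather than a full PCA with centering.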

Here is a branch that does just that, with updated docstrings, tests and examples, and a renaming of n_comp to n_components to be consistent with the overall scikit-learn naming conventions for dimension parameters.

GaelVaroquaux · 2010-12-10T22:37:40Z

I wonder whether I like the explicit subclass or not. I tend to like subclasses in general, but here the two seem so similar that I wonder whether, in the interest of the user, it would be better to keep them in a single class.

Apart from this, no comments other than that I like your renaming of n_comp to n_components. Sorry, I am only glancing at your code in between talks.

ogrisel · 2010-12-10T23:02:21Z

I first tried to keep a single class able to handle both sparse and dense data on the one hand, mle and explicit n_components init on the other hand, and linalg.svd and fast_svd on the third hand. The interactions between those three dimensions made the code unreadable and the initialization logic too hard to document, with too many parameters in the docstring. The new split code is much more readable, and there is not much code duplication in practice.

agramfort · 2010-12-11T16:37:14Z

+1 for merge

ogrisel · 2010-12-11T20:12:06Z

Thanks agramfort, I'll wait for a second vote before merging.

ogrisel added 3 commits December 10, 2010 19:56

extract the randomized SVD implementation as a toplevel class able to…

25b7bff

… handle sparse data as well

consistently rename n_comp to n_components

d7609b3

Merge branch 'master' into sparse-pca

b6dfcc0

ogrisel added 3 commits December 11, 2010 00:23

FIX: typo s/mean/mean_/g in RandomizedPCA

01be9a5

Merge branch 'master' into sparse-pca

e39f23b

sed -i "s/\<n_componentsonents\>/n_components/g"

604e8b7

merging master

0573d9f

ZHCSOFT mentioned this pull request Mar 10, 2022

random Segfaults on distance_transform_edt with Intel 12 Alder lake (E-Core enabled) #22744

Closed

This pull request was closed.
