Use irlba's truncated SVD to speed up step_pca #82

Closed
dgrtwo opened this issue Jun 4, 2021 · 7 comments · Fixed by #174
Labels
feature a feature request or enhancement

Comments

@dgrtwo

dgrtwo commented Jun 4, 2021

step_pca is very useful, but it is slow and memory-intensive when run on more than a few hundred features, even if num_comp is much smaller than p (the number of features). In my experience this makes tuning num_comp especially time-consuming, since tuning requires running the SVD preparation step many times.

As a solution, this step could use the irlba package for truncated SVD, which is much faster and more memory efficient when the number of components is small compared to p.

I could imagine the step either automatically using irlba when num_comp is far smaller than p, or doing so only when the user requests something like truncated = TRUE, but in any case it would be very helpful!
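To make the two options concrete, here is a minimal sketch of the selection logic outside of recipes. The helper name pca_auto(), its truncated argument, and the 0.25 cutoff are all hypothetical placeholders, not an existing API; prcomp() and irlba::prcomp_irlba() are the real functions being dispatched to.

```r
library(irlba)

# Hypothetical helper (not part of recipes): choose a PCA backend from the
# ratio of requested components to available dimensions.
pca_auto <- function(x, num_comp, truncated = NULL, max_ratio = 0.25) {
  x <- as.matrix(x)
  if (is.null(truncated)) {
    # No explicit request: use the truncated solver only when few
    # components are requested. The 0.25 cutoff is an arbitrary assumption.
    truncated <- num_comp < max_ratio * min(dim(x))
  }
  if (truncated) {
    # prcomp_irlba() computes only the first `n` components.
    irlba::prcomp_irlba(x, n = num_comp, center = TRUE, scale. = FALSE)
  } else {
    # prcomp() computes every component; `rank.` only limits what is returned.
    stats::prcomp(x, center = TRUE, scale. = FALSE, rank. = num_comp)
  }
}

set.seed(1)
x <- matrix(rnorm(500 * 50), nrow = 500)
fit_trunc <- pca_auto(x, num_comp = 5)   # truncated path: 5 < 0.25 * 50
fit_full  <- pca_auto(x, num_comp = 20)  # full path
```

Passing truncated = TRUE or truncated = FALSE overrides the automatic choice, which corresponds to the opt-in variant described above.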


Reproducible example, if we were trying to build a model to identify which Jane Austen book a line of text came from:

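library(janeaustenr)
library(recipes)      # also attaches dplyr, which provides filter() and select()
library(textrecipes)
library(irlba)        # needed for the irlba() call below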

# Train a model to match a single line to one of Jane Austen's books 
books <- austen_books() %>%
  filter(text != "")

rec <- recipe(book ~ text, books) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 300) %>%
  step_tfidf(text)

# This is slow (~40s for me), and uses so much memory that it's hard to terminate
rec %>%
  step_pca(starts_with("tfidf"), num_comp = 5) %>%
  prep() %>%
  juice()

# But this is fast (~3.5s)
rec %>%
  prep() %>%
  juice() %>%
  select(-book) %>%
  as.matrix() %>%
  irlba(nv = 5)
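As a sanity check that the fast path is not trading away accuracy, the sketch below compares irlba() against base svd() on a small simulated matrix. The data are made up for illustration (a rank-5 signal plus noise) and stand in for the tf-idf matrix; nothing here is taken from recipes.

```r
library(irlba)

set.seed(42)
# Simulated stand-in for a tf-idf matrix: rank-5 signal plus small noise.
signal <- matrix(rnorm(2000 * 5), nrow = 2000) %*% matrix(rnorm(5 * 100), nrow = 5)
m <- signal + 0.1 * matrix(rnorm(2000 * 100), nrow = 2000)

full_svd  <- svd(m)            # all 100 singular triplets
trunc_svd <- irlba(m, nv = 5)  # only the 5 largest

# The leading singular values should agree up to the solver tolerance.
max_diff <- max(abs(full_svd$d[1:5] - trunc_svd$d))

# Component scores for the rows, analogous to what a PCA step returns.
scores <- trunc_svd$u %*% diag(trunc_svd$d)
```

Note that irlba() computes a plain SVD and only centers columns when its center argument is supplied, so any centering or scaling has to be handled explicitly when comparing against a PCA.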
@juliasilge juliasilge transferred this issue from tidymodels/recipes Jun 7, 2021
@juliasilge
Member

Related to #73

We are definitely interested in functionality like this! This is mostly implemented already so we'll get a draft PR ready and would love some feedback on it and/or more contributions. We are fairly sure we want to include this in embed, along with a Bayesian implementation of sparse PCA.

@juliasilge juliasilge added the feature a feature request or enhancement label Jun 7, 2021
@alexpghayes

In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests something like num_comp > p / 4 or something like that. It would also be nice to have a randomized SVD implementation (perhaps the rsvd package) for larger datasets, perhaps as step_pca_approximate().
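For readers unfamiliar with randomized SVD, here is a minimal base-R sketch of the usual recipe (random range finder, a few power iterations, then an exact SVD of a small projected matrix). The function name randomized_svd() and its oversample and n_iter arguments are illustrative assumptions, not the rsvd package's interface; a packaged implementation will be more careful about numerics.

```r
# Minimal randomized SVD sketch, base R only.
# `oversample` and `n_iter` are illustrative tuning knobs.
randomized_svd <- function(a, k, oversample = 10, n_iter = 2) {
  l <- min(k + oversample, ncol(a))
  # Random test matrix; a %*% omega samples the column space of `a`.
  omega <- matrix(rnorm(ncol(a) * l), nrow = ncol(a))
  q <- qr.Q(qr(a %*% omega))
  # Power iterations sharpen the basis when singular values decay slowly.
  for (i in seq_len(n_iter)) {
    z <- qr.Q(qr(crossprod(a, q)))
    q <- qr.Q(qr(a %*% z))
  }
  # Exact SVD of the small l x p projection, then map back to n rows.
  small <- svd(crossprod(q, a))
  list(
    d = small$d[seq_len(k)],
    u = (q %*% small$u)[, seq_len(k), drop = FALSE],
    v = small$v[, seq_len(k), drop = FALSE]
  )
}

set.seed(7)
# Simulated data: rank-5 signal plus small noise.
signal <- matrix(rnorm(1000 * 5), nrow = 1000) %*% matrix(rnorm(5 * 80), nrow = 5)
a <- signal + 0.01 * matrix(rnorm(1000 * 80), nrow = 1000)

approx <- randomized_svd(a, k = 5)
exact  <- svd(a, nu = 5, nv = 5)
rel_err <- max(abs(approx$d - exact$d[1:5]) / exact$d[1:5])
```

The approximation is only as good as the spectrum allows: it works well here because the simulated matrix is close to rank 5, and it would need more oversampling or iterations for data whose singular values decay slowly.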

@alexpghayes

Also cc @topepo: https://github.com/DataSlingers/MoMA is a high-quality sparse PCA implementation by Michael Weylandt (who also wrote the high-quality glmnet replacement implementation).

@EmilHvitfeldt
Member

EmilHvitfeldt commented Mar 30, 2022

Did this issue get resolved in #83, or should it be kept open for more step variants?

@dgrtwo
Author

dgrtwo commented Mar 30, 2022

> Did this issue get resolved in #83, or should it be kept open for more step variants?

I don't think this is resolved, since step_pca still uses full PCA by default, and the above reprex (getting 5 principal components from a dataset with 62k observations) is still slow. I agree with Alex that step_pca can be made much faster in the common use case by making truncated SVD the default:

> In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests something like num_comp > p / 4 or something like that

But maybe this issue belongs in the recipes package, since that's where step_pca lives?

@topepo
Member

topepo commented Apr 19, 2022

I'd add an alternate PCA step here. Those package dependencies are a pita and I'd keep them here.
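For illustration, this is roughly how a separate truncated step would be used from a recipe. The sketch assumes an embed release that exports step_pca_truncated() with a num_comp argument and the default "PC" column prefix; check the installed version's documentation before relying on it. mtcars is used only as a small stand-in dataset.

```r
library(recipes)
library(embed)

# Assumes an embed release that exports step_pca_truncated().
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca_truncated(all_numeric_predictors(), num_comp = 2)

# Estimate the step on the training data and return the transformed data.
pcs <- rec %>%
  prep() %>%
  bake(new_data = NULL)
```

Keeping the step in embed means recipes itself does not need to depend on a truncated SVD package.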

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023