Use irlba's truncated SVD to speed up step_pca #82

dgrtwo · 2021-06-04T17:37:54Z

step_pca is very useful, but is slow and memory-intensive when run on more than a few hundred features, even if num_comp is much smaller than p. (In my experience this makes it especially time-intensive to tune the num_comp tuning parameter, which requires rerunning the SVD preparation step many times.)

As a solution, this step could use the irlba package for truncated SVD, which is much faster and more memory efficient when the number of components is small compared to p.

I could imagine the step either automatically using irlba when num_comp is far smaller than p, or doing so only when the user requests something like truncated = TRUE, but in any case it would be very helpful!

Reproducible example, if we were trying to build a model to identify which Jane Austen book a line of text came from:

library(janeaustenr)
library(recipes)
library(textrecipes)
library(irlba)  # provides irlba(), used in the fast version below

# Train a model to match a single line to one of Jane Austen's books 
books <- austen_books() %>%
  filter(text != "")

rec <- recipe(book ~ text, books) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 300) %>%
  step_tfidf(text)

# This is slow (~40s for me), and uses so much memory that it's hard to terminate
rec %>%
  step_pca(starts_with("tfidf"), num_comp = 5) %>%
  prep() %>%
  juice()

# But this is fast (~3.5s)
rec %>%
  prep() %>%
  juice() %>%
  select(-book) %>%
  as.matrix() %>%
  irlba(nv = 5)
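As a sanity check on the claim that the truncated decomposition recovers the same leading components, here is a minimal sketch on synthetic data (not the Austen data above). The matrix sizes, the noise level, and the names `signal`, `full` and `trunc` are assumptions chosen for illustration; only `svd()` from base R and `irlba()` from the irlba package are used.

```r
library(irlba)

set.seed(42)
n <- 2000  # observations (illustrative)
p <- 100   # features (illustrative)
k <- 5     # number of leading components to compare

# Low-rank signal plus a little noise, so the leading singular values
# are well separated from the rest
signal <- matrix(rnorm(n * k), n, k) %*% matrix(rnorm(k * p), k, p)
x <- signal + 0.1 * matrix(rnorm(n * p), n, p)

# Full SVD: computes all p singular values (vectors skipped here)
full <- svd(x, nu = 0, nv = 0)

# Truncated SVD: computes only the k leading singular triplets
trunc <- irlba(x, nv = k)

# The leading singular values should agree up to a small tolerance
same_leading <- isTRUE(all.equal(full$d[1:k], trunc$d, tolerance = 1e-4))
same_leading
```

Note that `irlba()` on the raw matrix is an SVD, not a PCA, because the columns are not centered; `irlba::prcomp_irlba()` is the centered counterpart.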


juliasilge · 2021-06-07T21:10:00Z

Related to #73

We are definitely interested in functionality like this! This is mostly implemented already so we'll get a draft PR ready and would love some feedback on it and/or more contributions. We are fairly sure we want to include this in embed, along with a Bayesian implementation of sparse PCA.

alexpghayes · 2021-06-07T21:11:12Z

In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests a large share of the components, e.g. num_comp > p / 4. It would also be nice to have a randomized SVD implementation (e.g. the rsvd package) for larger datasets, perhaps as step_pca_approximate().
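The heuristic above could be sketched roughly as follows. This is only an illustration: `pca_auto()` and its `threshold` argument are hypothetical names, not part of recipes or embed, and the p / 4 cutoff is the ballpark figure from the comment rather than a tuned value. It relies on `stats::prcomp()` for the full decomposition and `irlba::prcomp_irlba()` for the truncated one.

```r
library(irlba)

# Hypothetical helper: pick the solver from the ratio of requested
# components to the number of columns p
pca_auto <- function(x, num_comp, threshold = 1 / 4) {
  x <- as.matrix(x)
  p <- ncol(x)
  if (num_comp > p * threshold) {
    # Many components requested: use the full decomposition and keep
    # the first num_comp components
    fit <- stats::prcomp(x, center = TRUE, scale. = FALSE, rank. = num_comp)
    method <- "full"
  } else {
    # Few components requested: compute only the leading ones
    fit <- irlba::prcomp_irlba(x, n = num_comp, center = TRUE, scale. = FALSE)
    method <- "truncated"
  }
  list(
    method = method,
    rotation = fit$rotation,
    sdev = fit$sdev[seq_len(num_comp)]
  )
}

set.seed(1)
x <- matrix(rnorm(500 * 40), nrow = 500)

few <- pca_auto(x, num_comp = 3)    # 3 <= 40 / 4, so truncated
many <- pca_auto(x, num_comp = 20)  # 20 > 40 / 4, so full
c(few$method, many$method)
```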

alexpghayes · 2021-06-07T21:12:59Z

Also cc @topepo: https://github.com/DataSlingers/MoMA is a high-quality sparse PCA implementation by Michael Weylandt (who also wrote the high-quality glmnet replacement implementation)

EmilHvitfeldt · 2022-03-30T03:56:54Z

Did this issue get resolved in #83 or should it be kept open for more step variants?

dgrtwo · 2022-03-30T13:21:38Z

> Did this issue get resolved in #83 or should it be kept open for more step variants?

I don't think this is resolved, since step_pca still uses full PCA by default, and the above reprex (getting 5 principal components from a dataset with 62k observations) is still slow. I agree with Alex that the common use case can be made much faster by making truncated SVD the default:

> In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests something like num_comp > p / 4 or something like that

But maybe this issue belongs in the recipes package, since that's where step_pca lives?

topepo · 2022-04-19T21:32:18Z

I'd add an alternate PCA step here. Those package dependencies are a pita and I'd keep them here.

github-actions · 2023-04-12T01:19:55Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

juliasilge transferred this issue from tidymodels/recipes Jun 7, 2021

juliasilge added the "feature" (a feature request or enhancement) label Jun 7, 2021

juliasilge mentioned this issue Jun 8, 2021

Sparse pca steps #83

Merged

jonthegeek mentioned this issue Apr 26, 2022

Chapter 12 Matrix Completion EmilHvitfeldt/ISLR-tidymodels-labs#33

Open

EmilHvitfeldt mentioned this issue Mar 28, 2023

Add step_pca_truncated() #174

Merged

EmilHvitfeldt closed this as completed in #174 Mar 29, 2023
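
For readers landing here, a minimal usage sketch of the step added in #174, assuming it is exported from the embed package as `step_pca_truncated()` and accepts `num_comp` like `step_pca()`. The `mtcars` data and the choice of 2 components are placeholders, not the Austen example from the original report.

```r
library(recipes)
library(embed)

# Placeholder data: 10 numeric predictors of mpg in mtcars
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  # Compute only the first 2 principal components
  step_pca_truncated(all_numeric_predictors(), num_comp = 2)

pcs <- rec %>%
  prep() %>%
  bake(new_data = NULL)

dim(pcs)
```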

github-actions bot locked and limited conversation to collaborators Apr 12, 2023
