PERFORMANCE: Providing pre-generated index as an argument #1316

Closed
HenrikBengtsson opened this issue Oct 15, 2021 · 3 comments · Fixed by tidyverse/vroom#421
Labels
feature a feature request or enhancement

Comments

@HenrikBengtsson

I have a huge, static 70 GB tab-delimited file that I want to slice and dice now and then and in the future. Because it's so big, indexing alone will take 10-15 minutes when calling read_tsv(). Since the file is static, this index will be identical every time I try to parse this file, e.g. in a future R session. Is it possible to save the index to file so that it can be imported quickly next time? Something like:

## Indexing takes 10-15 minutes
idxs <- index_file(pathname)

## Cache to file
saveRDS(idxs, sprintf("%s.readr_index", pathname))

In a future R session:

## Read cached file index
idxs <- readRDS(sprintf("%s.readr_index", pathname))

## Quickly read 100 entries starting from row 10 million.
data <- read_tsv(pathname, index = idxs, skip = 10e6-1, n_max = 100)
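
The caching idea above can be approximated outside readr with base R alone. The following is a minimal sketch under stated assumptions, not readr or vroom functionality: `build_line_index()` and `read_rows_at()` are hypothetical helper names, the index is simply a numeric vector of the byte offsets at which lines start, and no field is assumed to contain an embedded newline. R's documentation also discourages `seek()` on Windows, so treat this as an illustration of the idea rather than a portable solution.

```r
## Hypothetical helper (not part of readr or vroom): scan the file once and
## record the byte offset at which every line starts.
build_line_index <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  offsets <- list(0)   # line 1 starts at byte 0
  pos <- 0             # bytes consumed so far (a double, so > 2 GB is fine)
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0L) break
    newline <- which(bytes == as.raw(0x0a))
    ## The line following each "\n" starts one byte past it
    offsets[[length(offsets) + 1L]] <- pos + newline
    pos <- pos + length(bytes)
  }
  offsets <- unlist(offsets)
  ## Drop an offset that points past the end of the file (trailing newline)
  offsets[offsets < pos]
}

## Hypothetical helper: jump straight to a data row using the cached index.
read_rows_at <- function(path, index, first_row, n_rows) {
  header <- strsplit(readLines(path, n = 1L), "\t", fixed = TRUE)[[1]]
  con <- file(path, open = "rb")
  on.exit(close(con))
  ## index[1] is the header line, so data row k starts at index[k + 1]
  seek(con, where = index[first_row + 1L], origin = "start")
  lines <- readLines(con, n = n_rows)
  read.delim(text = lines, header = FALSE, sep = "\t", col.names = header)
}

## Demo on a small temporary file standing in for the real one
pathname <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", paste(1:1000, sqrt(1:1000), sep = "\t")), pathname)

## One-off indexing pass, cached next to the data file
idxs <- build_line_index(pathname)
saveRDS(idxs, sprintf("%s.line_index", pathname))

## Later, e.g. in a new R session: reload the index and read 100 rows
idxs <- readRDS(sprintf("%s.line_index", pathname))
data <- read_rows_at(pathname, idxs, first_row = 500, n_rows = 100)
```

The index costs about 8 bytes per line and is only valid while the file is byte-for-byte unchanged, so storing the file size or modification time alongside it would be a sensible safeguard.
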
@jimhester
Collaborator

jimhester commented Oct 15, 2021

Not currently possible. However, this use case would be more compelling if we had support for memory-mapped indexes (tidyverse/vroom#51).

I have experimented with this in branches with good results, but it would still require more work to fully implement.

@rdmorin

rdmorin commented May 2, 2024

This issue is marked as resolved, but I can't find anything in the read_delim or vroom documentation about supplying an existing index to enhance read performance. Could someone point me in the right direction?

@jennybc
Member

jennybc commented May 2, 2024

I think this is closed here because, if it happened, it would happen in vroom, which is the default backend for readr now.

And the issue tracking it in vroom is the one linked above (tidyverse/vroom#51).

There's no immediate plan to work on this because the effort needed exceeds the developer time dedicated to readr/vroom right now: #1316.
