Read subset of rows #370

Closed
rgayler opened this issue Apr 24, 2018 · 9 comments · Fixed by #468

rgayler commented Apr 24, 2018

The motivation for this is similar to #248

People send me large SAS files (say, 5M rows x 1k columns) that are too large to read into R on my computer. I end up having to get the sender to chop the file into smaller sets of rows because I don't have access to SAS to do this myself.

If I could specify an arbitrary contiguous set of rows to read I could split the file myself without having to go back to the sender.

Thanks

hadley (Member) commented Apr 25, 2018

I think haven could provide nrow and skip parameters.

@evanmiller is there an existing ReadStat API to skip rows? I didn't see anything obvious.

evanmiller (Collaborator) commented Apr 25, 2018

@hadley Nothing in there currently.

rgayler (Author) commented Apr 25, 2018

This comment was marked as off-topic.

hadley (Member) commented Apr 25, 2018

The command-line readstat tool is built on top of the ReadStat API, so that won't help.

rgayler (Author) commented Apr 25, 2018

This comment was marked as off-topic.

hadley (Member) commented Jun 20, 2018

Waiting on WizardMac/ReadStat#141

hadley changed the title from "[feature request] read subset of rows" to "Read subset of rows" on Jun 20, 2018

mikmart (Contributor) commented Jul 10, 2019

@hadley in the meantime, do you think it would be useful to add skip with a workaround? Combined with #440, the essential row/col skipping API would be in place.

It'd be simple to just tell ReadStat to read skip + n_max rows and then ignore the first skip rows on haven's side (mikmart@39bdac2). That is of course much slower than skipping in the ReadStat parser directly, but it would cut down memory use, making it possible to handle large files (albeit slowly):

library(haven) # mikmart/haven@39bdac

write_sas(manycars <- purrr::map_df(seq_len(1e5), ~ mtcars), tf <- tempfile())

read_sas(tf, n_max = 5)
#> # A tibble: 5 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2

read_sas(tf, n_max = 5, skip = 2)
#> # A tibble: 5 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 2  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 3  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 4  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
#> 5  14.3     8   360   245  3.21  3.57  15.8     0     0     3     4

n <- nrow(manycars)
bench::mark(
  check = FALSE,
  read_sas(tf),
  read_sas(tf, n_max = 1000),
  read_sas(tf, skip = n - 1000),
  read_sas(tf, n_max = n - 1000),
  read_sas(tf, skip = 1000)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 x 6
#>   expression                          min   median `itr/sec` mem_alloc
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 read_sas(tf)                      6.55s    6.55s     0.153   268.6MB
#> 2 read_sas(tf, n_max = 1000)     406.54ms 453.23ms     2.21     88.9KB
#> 3 read_sas(tf, skip = n - 1000)     2.51s    2.51s     0.399    88.9KB
#> 4 read_sas(tf, n_max = n - 1000)    6.33s    6.33s     0.158   268.5MB
#> 5 read_sas(tf, skip = 1000)         6.63s    6.63s     0.151   268.5MB
#> # ... with 1 more variable: `gc/sec` <dbl>

Created on 2019-07-10 by the reprex package (v0.3.0.9000)

Once ReadStat supports it properly, it'd be a small internal change for a big speed-up.
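
To make the semantics of the workaround concrete, here is a minimal sketch of "read skip + n_max rows, then drop the first skip rows". It is only an illustration in plain R, not the code in mikmart@39bdac2: `read_with_skip()` and `head_reader()` are hypothetical names, and the stand-in reader slices `mtcars` rather than parsing a SAS file. Unlike the actual workaround, which discards rows on haven's side while parsing, this R-level version would still hold all skip + n_max rows in memory at once.

```r
# Hypothetical helper: emulate `skip` on top of a reader that only
# supports `n_max`, by over-reading and then dropping the leading rows.
read_with_skip <- function(reader, skip = 0, n_max = Inf) {
  rows <- reader(n_max = skip + n_max)
  # Logical indexing stays correct when skip = 0 (negative indexing would not).
  rows[seq_len(nrow(rows)) > skip, , drop = FALSE]
}

# Stand-in reader (assumption): returns the first `n_max` rows of mtcars.
head_reader <- function(n_max) {
  mtcars[seq_len(min(n_max, nrow(mtcars))), , drop = FALSE]
}

out <- read_with_skip(head_reader, skip = 2, n_max = 5)
print(out$mpg)
```

With `skip = 2` and `n_max = 5` this selects the same five rows of `mtcars` as the `read_sas(tf, n_max = 5, skip = 2)` call above.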

hadley (Member) commented Jul 11, 2019

It doesn't seem that worthwhile to me, and I'm worried that it will be a misleading API if it doesn't offer any speed advantages.

evanmiller (Collaborator) commented Aug 27, 2019

WizardMac/ReadStat#141 is now closed thanks to @mikmart. The new readstat_set_row_offset API is O(1) for uncompressed, seekable files.
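
For the original use case (splitting a file that is too large to load at once), a chunked read on top of `skip` and `n_max` might look like the sketch below. It assumes a haven version that includes the fix from #468, where `read_sas()` accepts `skip` and `n_max`. `read_in_chunks()` and `demo_reader()` are hypothetical names, and the stand-in reader slices `mtcars` so the example runs without a SAS file.

```r
# Hypothetical helper: call `reader(skip, n_max)` repeatedly until a short
# or empty chunk signals the end of the data, passing each chunk to `on_chunk`.
read_in_chunks <- function(reader, chunk_size, on_chunk) {
  skip <- 0
  repeat {
    chunk <- reader(skip = skip, n_max = chunk_size)
    if (nrow(chunk) == 0) break
    on_chunk(chunk, skip)
    if (nrow(chunk) < chunk_size) break  # last, partial chunk
    skip <- skip + chunk_size
  }
  invisible(skip)
}

# Stand-in reader (assumption): mimics skip/n_max row selection on mtcars.
demo_reader <- function(skip, n_max) {
  idx <- seq_len(nrow(mtcars))
  mtcars[idx > skip & idx <= skip + n_max, , drop = FALSE]
}

sizes <- integer(0)
read_in_chunks(demo_reader, chunk_size = 10, on_chunk = function(chunk, skip) {
  sizes <<- c(sizes, nrow(chunk))  # record how many rows each chunk had
})
print(sizes)

# With haven, the reader could be (file path is a placeholder):
# sas_reader <- function(skip, n_max) {
#   haven::read_sas("big_file.sas7bdat", skip = skip, n_max = n_max)
# }
```

The loop does not need the total row count up front, which matters because the count is not known without reading the file. For compressed or non-seekable files the offset is presumably not O(1), so later chunks may take longer to reach.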

hadley closed this in #468 on Nov 6, 2019, with a commit that referenced this issue ("Fixes #370").