Read subset of rows #370
Comments
I think haven could provide this. @evanmiller, is there an existing ReadStat API to skip rows? I didn't see anything obvious.
@hadley Nothing in there currently.
Thanks for looking. Assuming command-line readstat doesn't try to keep the whole file in RAM, a possible workaround would be to use command-line readstat to transcode from SAS to CSV, then use command-line tools to chop the CSV file into smaller chunks.
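A minimal sketch of that workaround with standard tools. The actual transcode step needs the WizardMac/ReadStat CLI (shown commented out, as an assumption about its invocation); a small stand-in CSV is fabricated here so the chopping step is self-contained:

```shell
# Step 1 (assumed ReadStat CLI invocation; requires the readstat binary):
#   readstat big.sas7bdat big.csv

# Stand-in CSV so the chopping step below runs on its own:
printf 'mpg,cyl\n21,6\n21,6\n22.8,4\n21.4,6\n18.7,8\n' > big.csv

# Step 2: set the header aside, split the data rows into 2-row chunks,
# then prepend the header to each chunk so every piece is a valid CSV.
head -n 1 big.csv > header.csv
tail -n +2 big.csv | split -l 2 - rows_
for f in rows_*; do
  cat header.csv "$f" > "chunk_$f.csv"
  rm "$f"
done
ls chunk_rows_*.csv
```

Each resulting `chunk_rows_*.csv` then fits in memory on its own and can be read back with `read_csv()`.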
Command-line readstat is built on top of the readstat API, so that won't help.
Use readstat to convert to CSV, then use text tools to chop that up.
Waiting on WizardMac/ReadStat#141
@hadley in the meantime, do you think it would be useful to add these arguments anyway? It'd be simple to just tell ReadStat to read everything and keep only the requested rows:

library(haven) # mikmart/haven@39bdac
write_sas(manycars <- purrr::map_df(seq_len(1e5), ~ mtcars), tf <- tempfile())
read_sas(tf, n_max = 5)
#> # A tibble: 5 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
read_sas(tf, n_max = 5, skip = 2)
#> # A tibble: 5 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 5 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
n <- nrow(manycars)
bench::mark(
check = FALSE,
read_sas(tf),
read_sas(tf, n_max = 1000),
read_sas(tf, skip = n - 1000),
read_sas(tf, n_max = n - 1000),
read_sas(tf, skip = 1000)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 read_sas(tf) 6.55s 6.55s 0.153 268.6MB
#> 2 read_sas(tf, n_max = 1000) 406.54ms 453.23ms 2.21 88.9KB
#> 3 read_sas(tf, skip = n - 1000) 2.51s 2.51s 0.399 88.9KB
#> 4 read_sas(tf, n_max = n - 1000) 6.33s 6.33s 0.158 268.5MB
#> 5 read_sas(tf, skip = 1000) 6.63s 6.63s 0.151 268.5MB
#> # ... with 1 more variable: `gc/sec` <dbl>

Created on 2019-07-10 by the reprex package (v0.3.0.9000)

Once ReadStat supports it properly, it'd be a small internal change for a big speed-up.
It doesn't seem that worthwhile to me, and I'm worried that it will be a misleading API if it doesn't offer any speed advantages.
WizardMac/ReadStat#141 is now closed thanks to @mikmart. The new
The motivation for this is similar to #248
People send me large SAS files (say, 5M rows x 1k columns) that are too large to read into R on my computer. I end up having to get the sender to chop the file into smaller sets of rows because I don't have access to SAS to do this myself.
If I could specify an arbitrary contiguous set of rows to read I could split the file myself without having to go back to the sender.
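For comparison, once a file has been transcoded to CSV, pulling an arbitrary contiguous row range is straightforward with standard tools. A sketch (file names and the tiny CSV are illustrative):

```shell
# Fabricate a small transcoded CSV for illustration:
printf 'id,val\n1,a\n2,b\n3,c\n4,d\n5,e\n' > big.csv

# Extract data rows 2..4 (counted after the header), keeping the header:
head -n 1 big.csv > subset.csv
tail -n +2 big.csv | sed -n '2,4p' >> subset.csv
cat subset.csv
# id,val
# 2,b
# 3,c
# 4,d
```

Since `sed` streams the file, this never holds more than the selected rows in memory, which is exactly the property wanted here.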
Thanks