Import subset of columns #248
This is a relatively complicated problem because you need to figure out some good way of describing which columns you want to import. It might be possible to borrow readr's `cols_only()` interface. It would be helpful if you could provide some examples of the problems you are trying to solve. |
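For context, this is the readr interface being referenced; a quick example on one of readr's bundled sample files (this uses readr's actual API, nothing hypothetical):

```r
library(readr)

# cols_only() builds a column specification that drops every column
# not explicitly named, so only mpg and cyl are read here.
read_csv(
  readr_example("mtcars.csv"),
  col_types = cols_only(mpg = col_double(), cyl = col_integer())
)
```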
As an example: I work in a data access center where data is kept exclusively in (large) SAS files. I develop Shiny apps, which have to read an entire SAS file, which takes a while, even though the app only uses a small subset of the variables in the file. Obviously SQL is the solution, but in the world of public health SAS/Stata/SPSS is ubiquitous and SQL is rare (for now). My hope is that I can specify a subset of the columns when I read in the SAS files and speed up the reading process. Honestly, I haven't looked at the source code enough to know whether that is even feasible. Regards |
Could you give me an idea of the size of the files? (in terms of megabytes, rows x cols, and how long they take to load?) Actually importing selected columns will be relatively straightforward, but I'm not sure if it will improve performance by that much. |
For a one-year Adult CHIS file you will find around 21,000 rows and 2,400 columns, so a little over half a gigabyte in size. It takes a little under 20 seconds to load. Sometimes I just need one variable from two separate years, so I end up loading essentially a whole gigabyte of data into R just to work with a really small subset of that data. Happy to chat about this with you at CSP next month if you aren't too busy. |
Ah, I was thinking about speed, but memory is also an issue for that much data. Even if it was only marginally faster, it would still be helpful to pull in only selected columns. @evanmiller have you thought about this at all? Obviously I could filter based on the variable index, but maybe if haven knew which variables I was interested in it could skip even more work? |
Skipping columns will likely involve the same amount of I/O but could cut down on CPU significantly. My guess is the performance gain will depend on whether the workload is I/O-bound or CPU-bound. Doing the filtering in ReadStat instead of in haven will save some additional per-value overhead on top of that. |
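As a rough way to see which regime a given file falls into, one can compare CPU time against wall-clock time from R (a sketch; the file path is the one used in the benchmarks later in this thread):

```r
# system.time() reports CPU time ("user") and wall-clock time ("elapsed").
# A ratio near 1 suggests the read is CPU-bound, so skipping columns should
# help a lot; a ratio well below 1 suggests the process is waiting on I/O.
t <- system.time(haven::read_sas("~/Desktop/all.sas7bdat"))
t[["user.self"]] / t[["elapsed"]]
```

(The 12.5s user vs 13.3s elapsed figures in the benchmarks below point at a CPU-bound workload, which is consistent with the speedups observed there.)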
Ok, I might have a go at hacking together a solution where you can just supply a vector of column names. Then I can do a little benchmarking to see if it's worth doing at a lower level. @carlganz any chance those big SAS files are publicly available? |
It's HIPAA data, so no, unfortunately. I can send you an equivalent SAS file to work with tomorrow when I have access to SAS again, if you'd like. |
@carlganz that would be great if you could host them somewhere publicly. It's possible I'll get caught up with other projects and won't be able to get back to this for a few months. |
@hadley Column subsetting isn't too hard to implement on the ReadStat side, so I'll take a crack at it today. Thinking of adding a return code that the variable handler can use to tell the parser to skip a column. |
@hadley Check out WizardMac/ReadStat@fc2fafc |
@evanmiller just to confirm - the variable index still refers to the column's position in the original file, even when columns are skipped? |
@hadley Correct, currently the value of index is unchanged. I could potentially add a second index that takes into account skipping. |
@evanmiller that would make me quite happy 😄 |
@hadley The SAS file I generated isn't quite as large as the one I was dealing with originally, but it should work fine. I took the CHIS public use file, which is much smaller, and I concatenated it with itself multiple times to get a larger file. Here is a link |
@hadley Try this: `int readstat_variable_get_index_after_skipping(const readstat_variable_t *variable);` |
Initial benchmarks look pretty promising:

```r
library(haven)

system.time(df <- read_sas("~/Desktop/all.sas7bdat"))
#>    user  system elapsed
#>  12.548   0.662  13.282

system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = c("AA5C", "YRUS_P1_three")))
#>    user  system elapsed
#>   0.938   0.227   1.175

system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = names(df)[1:20]))
#>    user  system elapsed
#>   0.999   0.240   1.252

system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = names(df)[1:100]))
#>    user  system elapsed
#>   1.222   0.252   1.485
```

It looks like there's around a second of overhead, but filtering down to only the variables you're interested in makes things a lot faster. The R interface will take some thinking about, so I'll probably leave it until the next release of haven. |
I'd love to be able to select specific columns by name with `read_spss()`! There are also variants with the metadata (variable and value labels) in Polish in case you want to test for potential encoding issues (just switch the language to Polish on www.diagnoza.com); column names in the data files should be pure ASCII, though. As the column names usually have a structured format, it would be great to be able to use select helpers like `starts_with()`. It is a question of design, but perhaps (as I speculated in #90) this could be done with R syntax like

```r
spss_src("file.sav") %>% select(starts_with("q1"))
```

and so on, depending on whether or not a full dplyr data source interface is in scope. |
Can you do something similar for haven's read_dta please? I have serious memory constraints. |
If there were a way to surface the column names from the data to R before reading the file, maybe the column selection could be specified via tidyselect? That would allow for a more flexible selection interface familiar to many users, as @mbojan suggests, while being simpler to implement than a full dplyr data source interface. |
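A minimal sketch of that two-pass idea, assuming the experimental `n_max` and `cols_only` arguments explored elsewhere in this thread (`read_sas_selected()` is a hypothetical helper, not part of haven):

```r
library(rlang)
library(tidyselect)

# Hypothetical helper: surface the column names with a zero-row read,
# resolve a tidyselect expression against them, then re-read only the
# matching columns.
read_sas_selected <- function(path, selection) {
  header <- haven::read_sas(path, n_max = 0)  # names + types only
  keep <- tidyselect::eval_select(rlang::enquo(selection), header)
  haven::read_sas(path, cols_only = names(keep))
}

# e.g. read_sas_selected("~/all.sas7bdat", starts_with("AH"))
```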
I figured one way to read the column names would be to implement `n_max`:

```r
library(haven) # mikmart/haven@818eff6
library(tidyselect)
vars <- dplyr::vars

bench::system_time(df <- read_sas("~/all.sas7bdat"))
#> process    real
#>     19s   19.1s

bench::system_time(read_sas("~/all.sas7bdat", n_max = 1L))
#> process    real
#>   141ms   150ms

nm <- names(df)

bench::mark(
  read_sas("~/all.sas7bdat", cols_only = 1:10),
  read_sas("~/all.sas7bdat", cols_only = nm[1:10])
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression min   mean  median max   `itr/sec` mem_alloc n_gc  n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "read_sas~ 922ms 922ms 922ms  922ms      1.08    4.44MB     1     1
#> 2 "read_sas~ 719ms 719ms 719ms  719ms      1.39    3.21MB     1     1
#> # ... with 1 more variable: total_time <bch:tm>

bench::mark(
  read_sas("~/all.sas7bdat", cols_only = vars(starts_with("AH"))),
  read_sas("~/all.sas7bdat", cols_only = nm[startsWith(nm, "AH")])
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression min   mean  median max   `itr/sec` mem_alloc n_gc  n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "read_sas~ 2.01s 2.01s 2.01s  2.01s     0.497    56.2MB    15     1
#> 2 "read_sas~ 1.79s 1.79s 1.79s  1.79s     0.560    55.3MB    14     1
#> # ... with 1 more variable: total_time <bch:tm>
```

Created on 2019-03-08 by the reprex package (v0.2.1) |
This looks like a promising approach! You might try |
Great! At the moment ReadStat doesn't support that. I'll work on turning this into a PR. I think there are at least two issues that need to be addressed: |
Awesome @mikmart! I really look forward to this! I will finally be able to dump my spaghetti code for converting huge SPSS files into SQLite databases. Will this work for other formats beyond SAS too? |
Yeah, I think once the final design has been worked out it should be straightforward to add it for both SPSS and Stata as well. |
Very nice. Metadata like column names, types, and lengths are often more useful to programmers than the data itself. I have not examined the C code, but SAS IEEE 754 doubles do not require any 'conversion code' on Windows. In fact, a large back-to-back SAS array of binary IEEE doubles can be read into R from SAS extremely quickly. |
Good point. You already get the types along with the names when you use `n_max = 0`:

```r
library(haven)

tf <- tempfile()
write_sas(mtcars, tf)

read_sas(tf, n_max = 0)
#> # A tibble: 0 x 11
#> # ... with 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>,
#> #   drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>
```

and with 83c0676 you can get a 0-column df that still contains info on the number of rows:

```r
read_sas(tf, cols_only = character())
#> # A tibble: 32 x 0
```

Created on 2019-03-09 by the reprex package (v0.2.1) |
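As a small follow-up, those two header-only reads combine into a cheap way to get a file's dimensions without loading any data (`dims()` is a hypothetical helper; `n_max` and `cols_only` are still the experimental arguments discussed above):

```r
# Hypothetical helper: a zero-row read carries the column names and types,
# while a zero-column read still carries the row count.
dims <- function(path) {
  n_col <- ncol(haven::read_sas(path, n_max = 0))
  n_row <- nrow(haven::read_sas(path, cols_only = character()))
  c(rows = n_row, cols = n_col)
}

dims(tf) # using the mtcars file from the example above
#> rows cols
#>   32   11
```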
@mikmart I'll definitely take a look next time I'm working on haven. I now think it should match the vroom interface, i.e. it should be called `col_select`. |
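For readers arriving later: this is the interface haven eventually shipped. Assuming a reasonably recent haven release (2.2.0 or later), `col_select` accepts tidyselect expressions, matching vroom:

```r
library(haven)

# col_select resolves tidyselect expressions against the file's column
# names, so only the requested columns are parsed.
df1 <- read_sas("~/all.sas7bdat", col_select = starts_with("AH"))
df2 <- read_sas("~/all.sas7bdat", col_select = c(AA5C, YRUS_P1_three))
```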
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
Similar to `readr::cols_only`, it would be very nice if I didn't have to load entire SAS datasets into R just to use one or two variables. Let me know if I can help. Regards