Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import subset of columns #248

Closed
carlganz opened this issue Dec 1, 2016 · 36 comments
Closed

Import subset of columns #248

carlganz opened this issue Dec 1, 2016 · 36 comments
Labels
feature a feature request or enhancement

Comments

@carlganz
Copy link

carlganz commented Dec 1, 2016

Similar to readr::cols_only, it would be very nice if I didn't have to load entire SAS datasets into R just to use one or two variables. Let me know if I can help.

Regards

@hadley
Copy link
Member

hadley commented Jan 25, 2017

This is a relatively complicated problem because you need to figure out some good way of describing which columns you want to important. It might be possible to borrow readr's cols() specification, but that feels like a bit of an abuse to me, because haven doesn't have to do any coercion.

It would be helpful if you could provide some examples of the problems you are trying to solve.

@hadley hadley added the feature a feature request or enhancement label Jan 25, 2017
@carlganz
Copy link
Author

An example is I work in a data access center where data is kept exclusively in (large) SAS files. I develop Shiny apps, which have to read the entire SAS file, which takes a while, even though the app is only using a small subset of the variables in the file. Obviously SQL is the solution, but in the world of public health SAS/Stata/SPSS is ubiquitous, and SQL is rare (for now).

My hope is that I can specify a subset of the columns when I read in the SAS files, and speed up the reading process. Honestly, I haven't looked at the source code enough to know if that is even feasible.

Regards

@hadley
Copy link
Member

hadley commented Jan 25, 2017

Could you give me an idea of the size of the files? (in terms megabytes, rows x cols, and how long it takes to load?). Actually importing selected columns will be relatively straightforward, but I'm not sure if it will improve performance by that much.

@carlganz
Copy link
Author

For a one year Adult CHIS file you will find around 21000 rows and 2400 columnn, so a little over half a terabyte in size. It takes a little under 20 seconds to load.

Sometimes I just need one variable from two separate years so I end up loading essentially a whole terabyte of data into R just to work with a really small subset of that data.

Happy to chat about this with you at CSP next month if you aren't too busy.

@hadley
Copy link
Member

hadley commented Jan 25, 2017

Ah, I was thinking about speed, but memory is also an issue for that much data. Even if it was only marginally faster, it would still be helpful to pull in only selected columns.

@evanmiller have you thought about this at all? Obviously I could filter based on the variable index, but maybe if haven knew which variables I was interested in it could skip even more work?

@evanmiller
Copy link
Collaborator

Skipping columns will likely be the same amount of I/O but could cut down on CPU significantly. My guess is the perf gain will depend on whether the workload is I/O-bound or CPU-bound.

Doing the filtering in ReadStat instead of in haven will save a short memcpy and at least function call per value (i.e. invoking the value handler)... more with text values that require UTF-8 conversion. Because of the iconv stuff I could imagine significant performance improvements for data sets that are mostly non-UTF-8 text... but this is just speculation.

@hadley
Copy link
Member

hadley commented Jan 25, 2017

Ok, I might have a go at hacking together a solution where you can just supply a vector of column names. Then I can do a little benchmarking to see if it's worth doing at a lower level.

@carlganz any chance those big SAS files are publicly available?

@carlganz
Copy link
Author

HIPAA data so no unfortunately. I can send you an equivalent SAS file to work with tomorrow when I have access to SAS again if you'd like.

@hadley
Copy link
Member

hadley commented Jan 26, 2017

@carlganz that would be great if you could host them somewhere publicly. It's possible I'll get caught up with other projects and won't be able to get back to this for a few months.

@evanmiller
Copy link
Collaborator

@hadley Column subsetting isn't too hard to implement on the ReadStat side, so I'll take a crack at it today. Thinking of adding a SKIP return value from the variable handler.

@evanmiller
Copy link
Collaborator

@hadley Check out WizardMac/ReadStat@fc2fafc

@hadley
Copy link
Member

hadley commented Jan 26, 2017

@evanmiller just to confirm - the variable index will still be the position in the original file, right? i.e. I'll need to maintain my own map from input column to output column.

@evanmiller
Copy link
Collaborator

evanmiller commented Jan 26, 2017 via email

@hadley
Copy link
Member

hadley commented Jan 26, 2017

@evanmiller that would make me quite happy 😄

@carlganz
Copy link
Author

@hadley The SAS file I generated isn't quite as large as the one I was dealing with originally, but it should work fine. I took the CHIS public use file, which is much smaller, and I concatenated it with itself multiple times to get a larger file. Here is a link

@evanmiller
Copy link
Collaborator

@hadley Try this:

int readstat_variable_get_index_after_skipping(const readstat_variable_t *variable);

WizardMac/ReadStat@9072ac4

@hadley
Copy link
Member

hadley commented Jan 30, 2017

Initial benchmarks look pretty promising:

library(haven)
system.time(df <- read_sas("~/Desktop/all.sas7bdat"))
#>    user  system elapsed 
#>  12.548   0.662  13.282

system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = c("AA5C", "YRUS_P1_three")))
#>    user  system elapsed 
#>   0.938   0.227   1.175
system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = names(df)[1:20]))
#>    user  system elapsed 
#>   0.999   0.240   1.252
system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = names(df)[1:100]))
#>    user  system elapsed 
#>   1.222   0.252   1.485

It looks like there's around a second of overhead, but filtering down to only the variable you're interested in makes things a lot faster.

The R interface will take some thinking about so I'll probably leave until the next release of haven.

@mbojan
Copy link

mbojan commented Apr 18, 2017

I'd love to be able to select specific columns by name with read_spss (was #90)! A public example of the type of files I work with is "Social Diagnosis" survey (http://www.diagnoza.com/). There are two files for download:

There are also variants with the metadata (variable and value labels) in Polish in case you want to test for potential encoding issues (just switch language to Polish on www.diagnoza.com). Column names in data files should be pure ASCII though.

As the column names usually have some structured format it would be great to be able to use dplyr features like starts_with, matches and so on to quickly select blocks of columns that come from specific questionnaire items and so on (e.g. q1a up to q1k and so on).

It is a question of design, but perhaps (as I speculated in #90) this could be done with R syntax like

spss_src("file.sav") %>% select(starts_with("q1"))

and so on, depending whether or not read_spss should encapsulate both reading and selecting/filtering.

@rogerjdeangelis

This comment has been minimized.

@lao8n
Copy link

lao8n commented Aug 8, 2017

Can you do something similar for haven's read_dta please? I have serious memory constraints.

@hadley hadley changed the title [Feature Request] Import subset of columns Import subset of columns Feb 15, 2018
@ADam-Z514

This comment has been minimized.

@mbojan

This comment has been minimized.

@ADam-Z514

This comment has been minimized.

@maraab23

This comment has been minimized.

@mikmart
Copy link
Contributor

mikmart commented Mar 7, 2019

If there was a way to surface the column names from the data to R before reading the file, maybe the column selection could be specified via tidyselect? That would allow for a more flexible selection interface familiar to many users as @mbojan suggests, while being simpler to implement than a full dplyr data source interface.

@mikmart
Copy link
Contributor

mikmart commented Mar 8, 2019

I figured one way to read the column names would be to implement n_max and read a single row to extract the names: I made a branch that implements that and then uses tidyselect for cols_only (I don't tidyselect with a character cols_only here for the sake of benchmarking). I tested with the data posted earlier, and the additional read does add a bit of overhead, but it doesn't seem too bad:

library(haven) # mikmart/haven@818eff6
library(tidyselect)
vars <- dplyr::vars

bench::system_time(df <- read_sas("~/all.sas7bdat"))
#> process    real 
#>     19s   19.1s
bench::system_time(read_sas("~/all.sas7bdat", n_max = 1L))
#> process    real 
#>   141ms   150ms

nm <- names(df)

bench::mark(
  read_sas("~/all.sas7bdat", cols_only = 1:10),
  read_sas("~/all.sas7bdat", cols_only = nm[1:10])
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "read_sas~ 922ms 922ms  922ms 922ms      1.08    4.44MB     1     1
#> 2 "read_sas~ 719ms 719ms  719ms 719ms      1.39    3.21MB     1     1
#> # ... with 1 more variable: total_time <bch:tm>

bench::mark(
  read_sas("~/all.sas7bdat", cols_only = vars(starts_with("AH"))),
  read_sas("~/all.sas7bdat", cols_only = nm[startsWith(nm, "AH")])
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "read_sas~ 2.01s 2.01s  2.01s 2.01s     0.497    56.2MB    15     1
#> 2 "read_sas~ 1.79s 1.79s  1.79s 1.79s     0.560    55.3MB    14     1
#> # ... with 1 more variable: total_time <bch:tm>

Created on 2019-03-08 by the reprex package (v0.2.1)

@hadley
Copy link
Member

hadley commented Mar 8, 2019

This looks like a promising approach! You might try n_max = 0 (if ReadStat) supports it, and we'll need to make a few tweaks to how cols_only works, but overall it looks good and I'd be happy to review a PR.

@mikmart
Copy link
Contributor

mikmart commented Mar 8, 2019

Great! At the moment ReadStat doesn't support n_max = 0, as it uses 0 as the sentinel for all rows.

I'll work on turning this into a PR. I think there's at least two issues that need to be addressed:

  • cols_only = character() should probably return no columns instead of all columns, as that's what would be expected for e.g. cols_only = vars(one_of("non-existent-col"))
  • n_max = 0 should return a 0-row df even if it doesn't happen automatically via ReadStat's row_limit

@mbojan
Copy link

mbojan commented Mar 9, 2019

Awesome @mikmart ! I really look forward to this! I will finally be able to dump my spaghetti-code converting huge SPSS files into SQLite databases. Will this work for other formats beyond SAS too?

@mikmart
Copy link
Contributor

mikmart commented Mar 9, 2019

Yeah, I think once the final design has been worked out it should be straightforward to add both n_max and cols_only to other formats, too.

@rogerjdeangelis
Copy link

Very nice

Meta data like column names, type and length are often more useful for programmers than the data.
You might add type and length along with names.

I have not examined the C code but SAS IEEE 754 doubles do not require any 'conversion code' in windows.

In fact a large SAS back to back array of binary IEEE doubles can be read extremely quickly

Into R
float <- readBin(read.from, n=1000000, "double");

From SAS
filename bin "d:/bin/have.bin" lrecl=32000 recfm=f;
data null;
array tmp[1000000] temporary;
do n=1 to 1000000;
tmp[n] =ceil(1000000*rand('uniform'));
file bin ;
put tmp[n] rb8. @ ;
end;
stop;
run;quit;

@mikmart
Copy link
Contributor

mikmart commented Mar 9, 2019

Good point. You already get the types along with the names when you use n_max = 0:

library(haven)

tf <- tempfile()
write_sas(mtcars, tf)

read_sas(tf, n_max = 0)
#> # A tibble: 0 x 11
#> # ... with 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>,
#> #   drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>

and with 83c0676 you can get a 0-column df that still contains info on the number of rows:

read_sas(tf, cols_only = character())
#> # A tibble: 32 x 0

Created on 2019-03-09 by the reprex package (v0.2.1)

@rogerjdeangelis

This comment has been minimized.

@mikmart
Copy link
Contributor

mikmart commented Jul 7, 2019

@hadley I think #440 would be ready for review when you have a chance. If the approach looks okay I'd be happy to implement it for other formats, too.

@hadley
Copy link
Member

hadley commented Jul 7, 2019

@mikmart I'll definitely take a look next time I'm working on haven. I now think it should match the vroom interface — i.e. it should be called cols_select

@hadley hadley closed this as completed in 7c24314 Jul 17, 2019
@lock
Copy link

lock bot commented Jan 13, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

9 participants