Import subset of columns #248

Open
carlganz opened this Issue Dec 1, 2016 · 24 comments

carlganz commented Dec 1, 2016

Similar to readr::cols_only, it would be very nice if I didn't have to load entire SAS datasets into R just to use one or two variables. Let me know if I can help.

Regards

Member

hadley commented Jan 25, 2017

This is a relatively complicated problem because you need to figure out some good way of describing which columns you want to import. It might be possible to borrow readr's cols() specification, but that feels like a bit of an abuse to me, because haven doesn't have to do any coercion.

It would be helpful if you could provide some examples of the problems you are trying to solve.

hadley added the feature label Jan 25, 2017

carlganz commented Jan 25, 2017

For example, I work in a data access center where data is kept exclusively in (large) SAS files. I develop Shiny apps that have to read the entire SAS file, which takes a while, even though each app uses only a small subset of the variables in the file. Obviously SQL is the solution, but in the world of public health SAS/Stata/SPSS are ubiquitous, and SQL is rare (for now).

My hope is that I can specify a subset of the columns when I read in the SAS files, and speed up the reading process. Honestly, I haven't looked at the source code enough to know if that is even feasible.

Regards

Member

hadley commented Jan 25, 2017

Could you give me an idea of the size of the files (in terms of megabytes, rows x cols, and how long they take to load)? Actually importing selected columns will be relatively straightforward, but I'm not sure if it will improve performance by that much.

carlganz commented Jan 25, 2017

A one-year Adult CHIS file has around 21,000 rows and 2,400 columns, so it is a little over half a gigabyte in size. It takes a little under 20 seconds to load.

Sometimes I just need one variable from two separate years, so I end up loading essentially a whole gigabyte of data into R just to work with a really small subset of that data.

Happy to chat about this with you at CSP next month if you aren't too busy.
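
As a rough sanity check on the sizes discussed above, the in-memory footprint can be estimated from the row and column counts. This is only a back-of-envelope sketch: it assumes every column is stored as an 8-byte double, which the thread does not state, and character columns or file-format overhead would change the numbers.

```r
# Back-of-envelope estimate; assumes every column is an 8-byte double.
n_rows <- 21000
n_cols <- 2400
bytes_per_value <- 8  # assumption: all columns are numeric (double)

# Footprint of the full table
total_bytes <- n_rows * n_cols * bytes_per_value
total_mib <- total_bytes / 1024^2

# Footprint of just two columns, for comparison
subset_bytes <- n_rows * 2 * bytes_per_value
subset_kib <- subset_bytes / 1024

cat(sprintf("all columns: %.1f MiB\n", total_mib))
cat(sprintf("two columns: %.1f KiB\n", subset_kib))
```

Under these assumptions the gap between the full table and a two-column subset is roughly three orders of magnitude, which is why memory matters here as well as load time.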

Member

hadley commented Jan 25, 2017

Ah, I was thinking about speed, but memory is also an issue for that much data. Even if it was only marginally faster, it would still be helpful to pull in only selected columns.

@evanmiller have you thought about this at all? Obviously I could filter based on the variable index, but maybe if haven knew which variables I was interested in it could skip even more work?

Contributor

evanmiller commented Jan 25, 2017

Skipping columns will likely be the same amount of I/O but could cut down on CPU significantly. My guess is the perf gain will depend on whether the workload is I/O-bound or CPU-bound.

Doing the filtering in ReadStat instead of in haven will save a short memcpy and at least one function call per value (i.e. invoking the value handler)... more with text values that require UTF-8 conversion. Because of the iconv stuff I could imagine significant performance improvements for data sets that are mostly non-UTF-8 text... but this is just speculation.

Member

hadley commented Jan 25, 2017

Ok, I might have a go at hacking together a solution where you can just supply a vector of column names. Then I can do a little benchmarking to see if it's worth doing at a lower level.

@carlganz any chance those big SAS files are publicly available?

carlganz commented Jan 26, 2017

HIPAA data so no unfortunately. I can send you an equivalent SAS file to work with tomorrow when I have access to SAS again if you'd like.

Member

hadley commented Jan 26, 2017

@carlganz that would be great if you could host them somewhere publicly. It's possible I'll get caught up with other projects and won't be able to get back to this for a few months.

Contributor

evanmiller commented Jan 26, 2017

@hadley Column subsetting isn't too hard to implement on the ReadStat side, so I'll take a crack at it today. Thinking of adding a SKIP return value from the variable handler.
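
The idea of a SKIP return value from the variable handler can be illustrated with a toy reader written in R. This is a sketch of the concept only: `read_toy`, `HANDLER_OK`, and `HANDLER_SKIP` are hypothetical names made up for this example and are not ReadStat's actual API or constants.

```r
# Hypothetical return codes, loosely modelled on the idea in this thread;
# these are NOT ReadStat's real constants.
HANDLER_OK   <- 0L
HANDLER_SKIP <- 1L

# Toy reader: asks the variable handler once per column, then calls the
# value handler only for columns that were not skipped.
read_toy <- function(data, variable_handler, value_handler) {
  keep <- vapply(
    names(data),
    function(nm) identical(variable_handler(nm), HANDLER_OK),
    logical(1)
  )
  for (row in seq_len(nrow(data))) {
    for (col in which(keep)) {
      value_handler(row, col, data[[col]][row])
    }
  }
  invisible(keep)
}

toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)
wanted <- c("a", "c")
calls <- 0L  # counts how often the value handler is invoked

kept <- read_toy(
  toy,
  variable_handler = function(name) {
    if (name %in% wanted) HANDLER_OK else HANDLER_SKIP
  },
  value_handler = function(row, col, value) calls <<- calls + 1L
)

cat("value handler calls:", calls, "\n")
```

With three rows and two kept columns the value handler runs six times instead of nine. In a real reader, each avoided call would also avoid the per-value copy and any text conversion mentioned earlier in the thread.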

Contributor

evanmiller commented Jan 26, 2017

Member

hadley commented Jan 26, 2017

@evanmiller just to confirm - the variable index will still be the position in the original file, right? i.e. I'll need to maintain my own map from input column to output column.
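
A caller-side map from input column to output column, as described here, could look roughly like the sketch below. The column names `AA5C` and `YRUS_P1_three` appear later in this thread; `id`, `age`, and `weight` are hypothetical, and positions are 1-based as is usual in R.

```r
# Column order as it appears in the (hypothetical) file.
file_cols <- c("id", "AA5C", "age", "YRUS_P1_three", "weight")
wanted    <- c("AA5C", "YRUS_P1_three")

# TRUE for every input column that should be kept.
kept <- file_cols %in% wanted

# For each input column: its position in the output, or NA if skipped.
out_index <- ifelse(kept, cumsum(kept), NA_integer_)
names(out_index) <- file_cols

print(out_index)
```

Note that in this sketch the output order follows the order of the columns in the file, not the order in which they were requested.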

Contributor

evanmiller commented Jan 26, 2017

Member

hadley commented Jan 26, 2017

@evanmiller that would make me quite happy 😄

carlganz commented Jan 26, 2017

@hadley The SAS file I generated isn't quite as large as the one I was dealing with originally, but it should work fine. I took the CHIS public use file, which is much smaller, and I concatenated it with itself multiple times to get a larger file. Here is a link

Contributor

evanmiller commented Jan 26, 2017

@hadley Try this:

int readstat_variable_get_index_after_skipping(const readstat_variable_t *variable);

WizardMac/ReadStat@9072ac4

Member

hadley commented Jan 30, 2017

Initial benchmarks look pretty promising:

library(haven)
system.time(df <- read_sas("~/Desktop/all.sas7bdat"))
#>    user  system elapsed 
#>  12.548   0.662  13.282

system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = c("AA5C", "YRUS_P1_three")))
#>    user  system elapsed 
#>   0.938   0.227   1.175
system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = names(df)[1:20]))
#>    user  system elapsed 
#>   0.999   0.240   1.252
system.time(read_sas("~/Desktop/all.sas7bdat", cols_only = names(df)[1:100]))
#>    user  system elapsed 
#>   1.222   0.252   1.485

It looks like there's around a second of fixed overhead, but filtering down to only the variables you're interested in makes things a lot faster.

The R interface will take some thinking about so I'll probably leave until the next release of haven.
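
The elapsed times reported above can be turned into speedup factors with a few lines of R. The numbers are copied from the benchmark output in this comment; the labels in the vector are just names chosen for this example.

```r
# Elapsed seconds copied from the benchmark output above.
elapsed <- c(
  all       = 13.282,  # full file
  two_cols  = 1.175,   # two named columns
  first_20  = 1.252,   # first 20 columns
  first_100 = 1.485    # first 100 columns
)

# Speedup of each column-subset read relative to reading everything.
speedup <- elapsed[["all"]] / elapsed[-1]
print(round(speedup, 1))

# Extra elapsed time when going from 2 to 100 columns.
extra_seconds <- elapsed[["first_100"]] - elapsed[["two_cols"]]
print(extra_seconds)
```

On these figures the subset reads are roughly nine to eleven times faster than reading the whole file, and going from 2 to 100 columns adds only about a third of a second.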

hadley added a commit that referenced this issue Jan 30, 2017

mbojan commented Apr 18, 2017

I'd love to be able to select specific columns by name with read_spss (was #90)! A public example of the type of files I work with is "Social Diagnosis" survey (http://www.diagnoza.com/). There are two files for download:

There are also variants with the metadata (variable and value labels) in Polish in case you want to test for potential encoding issues (just switch language to Polish on www.diagnoza.com). Column names in data files should be pure ASCII though.

As the column names usually have some structured format, it would be great to be able to use dplyr features like starts_with and matches to quickly select blocks of columns that come from specific questionnaire items (e.g. q1a up to q1k).

It is a question of design, but perhaps (as I speculated in #90) this could be done with R syntax like

spss_src("file.sav") %>% select(starts_with("q1"))

and so on, depending whether or not read_spss should encapsulate both reading and selecting/filtering.
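
Until something like this exists, name-based selection can be approximated in base R once the column names are known. The sketch below uses made-up questionnaire-style column names; it shows only the matching step, and the resulting character vector would then be handed to whatever column filter the reader ends up offering.

```r
# Hypothetical column names from a questionnaire-style file.
file_cols <- c("id", "q1a", "q1b", "q1k", "q2a", "weight_2015")

# Prefix match, similar in spirit to starts_with("q1").
q1_cols <- file_cols[startsWith(file_cols, "q1")]

# Regular-expression match, similar in spirit to matches("^q[0-9]+a$").
a_items <- file_cols[grepl("^q[0-9]+a$", file_cols)]

print(q1_cols)
print(a_items)
```

This keeps the selection logic separate from the reading step, which is one possible answer to the design question raised above, though not necessarily the one the package will adopt.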

This comment was marked as off-topic.

rogerjdeangelis commented Apr 18, 2017

lao8n commented Aug 8, 2017

Can you do something similar for haven's read_dta please? I have serious memory constraints.

hadley changed the title from "[Feature Request] Import subset of columns" to "Import subset of columns" on Feb 15, 2018

This comment was marked as off-topic.

ADam-Z514 commented Jul 4, 2018

Hi,
I have imported some SAS data into R using the sas7bdat package. I have some nominal variables with some missing values.
R is creating a new level, the empty string "". When I tabulate the variable, this new level is shown with a frequency of 0.
I want to get rid of this level and have my file imported correctly.
Do you have any hints to help solve this problem?
Best,
Adam
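
For readers who land here with the same question, an unused empty-string level can usually be removed in base R with `droplevels()`. The sketch below rebuilds the situation with a made-up factor; it assumes the "" level really is unused (frequency 0), as described above, and is not specific to the sas7bdat package.

```r
# Made-up factor with an unused "" level and one missing value.
x <- factor(c("yes", "no", NA, "yes"), levels = c("", "no", "yes"))

# The "" level is listed even though no observation uses it.
print(table(x))

# droplevels() removes levels that do not occur in the data.
x_clean <- droplevels(x)
print(levels(x_clean))
```

If the empty string actually occurs as a value, it would first need to be recoded to `NA` before dropping the level.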

This comment was marked as resolved.

mbojan commented Jul 4, 2018

@ADam-Z514 please don't clutter this issue with unrelated posts, but open a new one if needed.

This comment was marked as off-topic.

ADam-Z514 commented Jul 5, 2018

maraab23 commented Sep 19, 2018

Can you do something similar for haven's read_dta please? I have serious memory constraints.

Is there any news on this matter?
