Feasibility of read_dta? #15

Open
buscandoaverroes opened this issue Jul 20, 2021 · 6 comments

@buscandoaverroes

Hello and thank you all so much for creating this package! I love using the metadata file to help track changes in labels and variable names over time.

I'm trying to use read_surveys() with .dta files, but it doesn't seem to read them properly unless I first convert them from .dta to .rds with haven and then import them with read_rds():

# importing .dta files doesn't form a proper survey_list object
survey_waves <- read_surveys(rounds, .f="read_dta")    # this runs
document_waves(survey_waves)   # this produces the error
## Error: is.survey(df = survey_list[[1]]) is not TRUE

# the same process with the .dta files exported as .rds files presents no problem
survey_waves_rds <- read_surveys(rounds_rds, .f="read_rds")    # this runs
document_waves(survey_waves_rds)   # no error

Since it looks like retroharmonize::read_spss() is a wrapper for haven::read_spss(), to what extent could a retroharmonize::read_dta() be created by substituting the equivalent haven function? Conceptually, what other changes to the code might you think necessary? Again, thanks so much for retroharmonize and for any help or input you could provide.
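The "is not TRUE" message above has the form that stopifnot() prints when a check fails, which suggests the imported object does not pass the package's survey class test. Here is a minimal base R sketch of that idea. The names `as_survey_sketch` and `is_survey_sketch`, the `survey_sketch` class, and the attributes are hypothetical stand-ins for illustration, not retroharmonize's real functions or internals.

```r
# Hypothetical stand-ins; these are NOT retroharmonize's real functions.
as_survey_sketch <- function(df, id, filename) {
  stopifnot(is.data.frame(df))
  # Attach identifying metadata and an extra class on top of the data frame.
  attr(df, "id") <- id
  attr(df, "filename") <- filename
  class(df) <- c("survey_sketch", class(df))
  df
}

is_survey_sketch <- function(x) inherits(x, "survey_sketch")

# A plain data frame, as a generic file reader would return it.
raw_wave <- data.frame(rowid = 1:3, trust = c(1, 2, 9999))

# The plain data frame fails the class check ...
plain_ok <- is_survey_sketch(raw_wave)

# ... while the wrapped version passes it and is still a data frame.
wrapped_wave <- as_survey_sketch(raw_wave, id = "wave1", filename = "wave1.dta")
wrapped_ok <- is_survey_sketch(wrapped_wave)
```

Under this assumption, a format-specific reader only needs to end with a step like `as_survey_sketch()`. The harder part is how the format's metadata is mapped into R.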

@antaldaniel
Collaborator

antaldaniel commented Jul 20, 2021

The retroharmonize read_spss() is indeed a wrapper, but the package needed a new class inherited from haven (which depends on labelled), because haven does not handle SPSS files correctly: it does not write back missing values, and it often has problems with the missing value range. In other words, the challenge is not to make the reader work but to map a different file format into R. SPSS handles metadata in a particular way, and translating that metadata into R terms was a big challenge.

If there is an interesting use case, we can do more extensive testing with .dta files. It may turn out to be a very simple task to solve, or a very difficult one. Can you provide a retrospective harmonization example with at least two publicly accessible .dta files? Or make a small subsample for republication? I would gladly take a look.

@antaldaniel antaldaniel added the enhancement New feature or request label Jul 20, 2021
@antaldaniel
Collaborator

Please check the 0.1.19 development version with devtools::install_github("rOpenGov/retroharmonize"). It would be great if you could provide a reproducible example for testing missing values. I tested on two .dta files, but only far enough to confirm that they import into the survey class.

@antaldaniel antaldaniel self-assigned this Jul 20, 2021
@buscandoaverroes
Author

Wow, thank you so much! I'll test and return with a reprex -- I personally don't often run into .dta files with extended missing values (.a, .b, etc.), but I'll include these as well to cover all potential cases.

@antaldaniel
Collaborator

Writing a wrapper is not a big deal if there are no special metadata issues. The thing with SPSS files is that the user can record otherwise valid values (such as 9999) as a numeric code for "Do not know", etc. Such a code can either be translated into a category of a factor, or it should be omitted (as NA) when calculating averages. If Stata files do not have similar issues, then I do not think you'll run into trouble.
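To make the 9999 case concrete, here is a small base R sketch of the two treatments described above. The `age` vector and the choice of 9999 as the "Do not know" code are made-up illustration data, not taken from any real survey file.

```r
# Hypothetical survey item: age in years, with 9999 recorded as "Do not know".
age <- c(34, 51, 9999, 27, 9999, 45)
dont_know_code <- 9999  # assumed code; real files declare their own

# Numeric use: recode the special value to NA so it does not distort the mean.
age_numeric <- ifelse(age == dont_know_code, NA_real_, age)
mean_age <- mean(age_numeric, na.rm = TRUE)

# Categorical use: keep "Do not know" as its own category in a factor.
age_status <- factor(
  ifelse(age == dont_know_code, "Do not know", "Valid answer"),
  levels = c("Valid answer", "Do not know")
)
status_counts <- table(age_status)
```

Taking `mean(age)` on the raw vector would silently include the two 9999 values. This is why a reader has to carry the missing-value metadata along with the data instead of dropping it.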

@buscandoaverroes
Author

I see now, yes the same does happen in Stata. I'm finishing up an .Rmd to share, where I'm trying to work through some of these "don't know/ incomplete" value issues from the .dta files.

As a side note, one thing I noticed is that in both the read_dta() and read_spss() functions I couldn't figure out how to pass the "encoding" argument on to haven. I'm not sure how important this is for .sav files, but for .dta files older than version 14 (surprisingly common), haven apparently needs the encoding specified sometimes; the help file for haven's read_dta() explains this. Otherwise read_surveys() reads in an empty survey with a warning. Anyway, just letting you know, and I'll share the markdown when complete.

@buscandoaverroes
Author

buscandoaverroes commented Aug 18, 2021

I'm linking to an .Rmd file here that walks through my attempt to try out read_dta(). Basically, my only suggestion would be to allow read_dta() to pass an "encoding" argument to haven, because a lot of .dta files are saved in versions that require it to be specified. Or maybe this is already possible and I'm not specifying it correctly. Otherwise it seems to work great -- thanks a lot!
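One way such a pass-through could look is sketched below. It only assumes that the wrapper forwards extra arguments through `...`. `read_dta_sketch`, its `.reader` parameter, and `stub_reader` are hypothetical names for illustration and are not how retroharmonize's read_dta() is actually written. haven::read_dta() does take an `encoding` argument; "windows-1252" is just an example value.

```r
# Hypothetical wrapper: everything in `...` (for example `encoding`) is
# forwarded unchanged to the underlying reader. The default reader is only
# looked up when it is actually needed, because R evaluates defaults lazily.
read_dta_sketch <- function(file, ..., .reader = haven::read_dta) {
  .reader(file, ...)
}

# Stand-in reader so the sketch runs without a .dta file or haven installed.
# It simply reports which arguments it received.
stub_reader <- function(file, encoding = NULL) {
  list(file = file, encoding = encoding)
}

result <- read_dta_sketch("wave1.dta", encoding = "windows-1252",
                          .reader = stub_reader)
```

With haven installed and a real file, `read_dta_sketch("wave1.dta", encoding = "windows-1252")` would end up calling haven::read_dta() with that encoding. Whether read_surveys() forwards extra arguments in the same way is not confirmed in this thread.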
