Feasibility of read_dta? #15

buscandoaverroes · 2021-07-20T15:40:46Z

Hello and thank you all so much for creating this package! I love using the metadata file to help track changes in labels and variable names over time.

I'm trying to use read_surveys() with .dta files, but it doesn't seem to read them properly unless I first convert them from .dta to .rds with haven and then import them with read_rds():

# importing .dta files doesn't form a proper survey_list object
survey_waves <- read_surveys(rounds, .f="read_dta")    # this runs
document_waves(survey_waves)   # this produces the error
## Error: is.survey(df = survey_list[[1]]) is not TRUE

# the same process with the .dta files exported as .rds files presents no problem
survey_waves_rds <- read_surveys(rounds_rds, .f="read_rds")    # this runs
document_waves(survey_waves_rds)   # no error

Since it looks like retroharmonize::read_spss() is a wrapper for haven::read_spss(), to what extent could a retroharmonize::read_dta() be created by substituting the equivalent haven function? Conceptually, what other changes to the code might you think necessary? Again, thanks so much for retroharmonize and for any help or input you could provide.


antaldaniel · 2021-07-20T21:43:04Z

The retroharmonize read_spss is indeed a wrapper, but the package needed a new, inherited class from haven (which depends on labelled), because haven does not correctly handle SPSS files: it does not write back missing values, and it often has problems with the missing value range. In other words, the challenge is not to make the reader work, but to map the metadata of a different file format into R. SPSS handles metadata in a particular way, and translating that metadata into R terms was a big challenge.

If there is an interesting use case, we can do somewhat more extensive testing with dta files. It may turn out to be a very simple task to solve, or a very difficult one. Can you provide a retrospective harmonization example with at least two .dta files that are publicly accessible? Or make a small subsample for republication? I'd gladly take a look.

antaldaniel · 2021-07-20T22:52:49Z

Please check the 0.1.19 development version with devtools::install_github("rOpenGov/retroharmonize"). It would be great if you could provide a reproducible example for testing missing values. I tested on two dta files, but only to the extent that they import into the survey class.

buscandoaverroes · 2021-07-21T16:00:22Z

Wow, thank you so much! I'll test and return with a reprex -- I personally don't run into .dta files with extended missing values often (.a .b etc) but I'll include these as well to include all potential cases.
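
For readers unfamiliar with Stata's extended missing values, here is a minimal sketch of how haven represents them in R, assuming haven is installed. The `income` vector and its tags are made-up illustration data, not taken from any survey in this thread.

```r
library(haven)

# Stata's extended missing values (.a, .b, ...) arrive in R as "tagged" NAs.
# The values below are invented purely for illustration.
income <- c(1200, tagged_na("a"), 850, tagged_na("b"), NA)

# Tagged values still count as missing for ordinary R functions.
n_missing <- sum(is.na(income))

# na_tag() recovers the tag letter, or NA where there is no tag.
tags <- na_tag(income)

# is_tagged_na() picks out one specific kind of missingness.
is_a <- is_tagged_na(income, "a")
```

Whether a harmonization step should keep these tags apart or collapse them into a plain NA depends on the surveys involved.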

antaldaniel · 2021-07-22T15:21:01Z

Writing a wrapper is not a big deal if there are no special metadata issues. The thing with SPSS files is that the user can record otherwise valid values (such as 9999) as a numeric code for "Do not know", etc. These can either be translated into a category of a factor, or should be omitted (as NA) when calculating averages. If Stata files do not have similar issues, then I do not think you'll run into trouble.
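
The 9999 problem can be illustrated with haven's own `labelled_spss()` class. This is only a sketch under the assumption that haven is installed. The `age` vector is invented, and the code shows haven's generic tools, not how retroharmonize handles the case internally.

```r
library(haven)

# Invented example: 9999 is a user-defined code for "Do not know".
age <- labelled_spss(
  c(25, 40, 9999, 31),
  labels = c("Do not know" = 9999),
  na_values = 9999
)

# Treating the raw codes as numbers lets 9999 distort the average.
naive_mean <- mean(as.numeric(unclass(age)))

# zap_missing() turns user-defined missing values into regular NA,
# and zap_labels() drops the value labels so only the numbers remain.
valid <- as.numeric(zap_labels(zap_missing(age)))
valid_mean <- mean(valid, na.rm = TRUE)
```

Here `naive_mean` includes the 9999 code while `valid_mean` is computed over the three real answers only.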

buscandoaverroes · 2021-07-22T20:08:18Z

I see now, yes the same does happen in Stata. I'm finishing up an .Rmd to share, where I'm trying to work through some of these "don't know/ incomplete" value issues from the .dta files.

As a side note, one thing I noticed is that in both the read_dta and read_spss functions I couldn't figure out how to pass the "encoding" argument on to haven. Not sure how important this is for .sav files, but for .dta files older than version 14 (surprisingly common) haven apparently needs the encoding specified sometimes; the help file for haven's read_dta explains this. Otherwise read_surveys() reads in an empty survey with a warning. Anyway, just letting you know, and I'll share the markdown when complete.
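
One way the encoding could be passed through is sketched below. `read_dta_passthrough()` is a hypothetical helper written for this illustration and is not part of retroharmonize. It only assumes that haven is installed and relies on the documented `encoding` argument of `haven::read_dta()`. The temporary file and the "latin1" encoding are stand-ins for a real pre-version-14 survey file.

```r
library(haven)

# Hypothetical helper (not retroharmonize's API): forwards any extra
# arguments, such as `encoding`, unchanged to haven::read_dta().
read_dta_passthrough <- function(file, ...) {
  haven::read_dta(file, ...)
}

# Write a small Stata 13 file to a temporary path so the sketch is
# self-contained; a real use case would point at an existing .dta file.
path <- tempfile(fileext = ".dta")
toy <- data.frame(
  id = 1:3,
  answer = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)
write_dta(toy, path, version = 13)

# The encoding is supplied by the caller and reaches haven via `...`.
wave <- read_dta_passthrough(path, encoding = "latin1")
```

The same `...` pattern would let a caller forward other haven arguments without the wrapper having to name each one.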

buscandoaverroes · 2021-08-18T16:40:37Z

I'm linking to an .Rmd file here that walks through my attempt to try out read_dta(). Basically my only suggestion would be to allow read_dta() to pass an "encoding" argument to haven, because a lot of .dta files are saved in versions that require haven to specify it. Or maybe this is already possible and I'm not specifying it correctly. Otherwise it seems to work great -- thanks a lot!

antaldaniel added the enhancement New feature or request label Jul 20, 2021

antaldaniel self-assigned this Jul 20, 2021
