Is it possible for `haven` to import dataset attributes from Stata? #186

Closed
jjchern opened this Issue Jun 17, 2016 · 12 comments

jjchern commented Jun 17, 2016

Background

Stata has various dataset attributes, such as

  • dataset label,
  • last saved date,
  • short notes that let the dataset maintainer include other metadata: data source, sample, maintainer email, etc.

Take the dataset nlsw88 for example. Once loaded, it prints out a short dataset label:

[Screenshot: webuse output]

The next thing a typical Stata user would do is probably run the describe command to get a basic idea of the dataset. As shown in the following screenshot, the d command prints a header that shows the data label and the last saved date, and indicates that the dataset has additional notes.

[Screenshot: describe output]

The notes command then prints these notes:

[Screenshot: notes output]

The codebook, header command shows similar dataset attributes:

[Screenshot: codebook, header output]

Feature Request

Right now haven does not preserve these dataset-level attributes. I was wondering if it's possible to keep them, as the foreign package does:

library(magrittr)  # provides the %>% pipe used below
foreign::read.dta("http://www.stata-press.com/data/r12/nlsw88.dta") %>% str()
#> 'data.frame': 2246 obs. of  17 variables:
#> ...
#> - attr(*, "datalabel")= chr "NLSW, 1988 extract"
#> - attr(*, "time.stamp")= chr "1 May 2011 22:52"
#> - attr(*, "formats")= chr  "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
#> - attr(*, "types")= int  252 251 251 251 251 251 251 251 251 251 ...
#> - attr(*, "val.labels")= chr  "" "" "racelbl" "marlbl" ...
#> - attr(*, "var.labels")= chr  "NLS id" "age in current year" "race" "married" ...
#> - attr(*, "expansion.fields")=List of 7
#> ..$ : chr  "_dta" "note1" "1988 data, extracted from National Longitudinal of Young Woman"
#> ..$ : chr  "_dta" "note2" "who were ages 14-24 in 1968 (NLSW)."
#> ..$ : chr  "_dta" "note3" "This dataset is the result of extraction and processing by various"
#> ..$ : chr  "_dta" "note4" "people at various times."
#> ..$ : chr  "_dta" "note5" "For more information on the NLS, see"
#> ..$ : chr  "_dta" "note0" "6"
#> ..$ : chr  "_dta" "note6" "http://www.bls.gov/nls/"
#> - attr(*, "version")= int 12

foreign does not handle "formats", "types", "val.labels", and "var.labels" quite correctly, since these attributes belong to the corresponding variables rather than to the dataset, but I believe it would be worthwhile to keep the others at the dataset level.
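To make the distinction between dataset-level and variable-level attributes concrete, here is a minimal base R sketch. It assumes a data frame shaped like the foreign::read.dta() result above; the data frame is a two-column mock, and move_variable_attrs() is a hypothetical helper written for illustration, not a function in foreign or haven.

```r
# Mock of a data frame as returned by foreign::read.dta(), trimmed to two
# columns. Attribute values are copied from the str() output above; the
# data values themselves are made up for illustration.
df <- data.frame(idcode = c(1L, 2L), age = c(37L, 42L))
attr(df, "datalabel")  <- "NLSW, 1988 extract"
attr(df, "time.stamp") <- "1 May 2011 22:52"
attr(df, "formats")    <- c("%8.0g", "%8.0g")
attr(df, "var.labels") <- c("NLS id", "age in current year")

# Hypothetical helper (not part of foreign or haven): move per-variable
# attributes from the data frame onto the individual columns.
move_variable_attrs <- function(data,
                                per_variable = c("formats", "types",
                                                 "val.labels", "var.labels")) {
  for (name in per_variable) {
    values <- attr(data, name, exact = TRUE)
    if (is.null(values)) next            # attribute absent: nothing to move
    for (i in seq_along(data)) {
      attr(data[[i]], name) <- values[[i]]
    }
    attr(data, name) <- NULL             # drop the dataset-level copy
  }
  data
}

tidy <- move_variable_attrs(df)
attr(tidy$age, "var.labels")   # the label now lives on the column
attr(tidy, "datalabel")        # the dataset label stays on the data frame
```

After this step only "datalabel" and "time.stamp" remain on the data frame itself, which is the split the request is asking for.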

Moreover, it might also be useful for tibble to print some of these attributes and indicate when more detailed notes are available.

Probably related: hadley/tibble#90

Contributor

evanmiller commented Jun 25, 2016

@hadley The ReadStat build is currently broken, but I've added a new "metadata handler" that reports the file label, timestamp, and format version. Looks like there are some strftime portability issues though so don't integrate just yet.

Contributor

evanmiller commented Jun 25, 2016

Okay, I've fixed the strftime issues. Should be ready to integrate, although I don't have a solution yet for Stata's "notes" feature.

Member

hadley commented Jul 29, 2016

@jjchern could you please create a simple Stata file that I can include in haven and use for testing?

jjchern commented Jul 29, 2016

@hadley Sure. How about the iris dataset? Here's the dataset in three Stata versions. Just in case.

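# Build the download URLs and local file names for Stata 12, 13, and 14
urls <- paste0("http://www.stata-press.com/data/r", 12:14, "/iris.dta")
fils <- paste0("iris_v", 12:14, ".dta")
# Download each file once; mode = "wb" keeps the binary .dta intact on Windows
purrr::walk2(urls, fils, ~ if (!file.exists(.y)) download.file(.x, .y, mode = "wb"))
list.files(pattern = "iris_")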
#> [1] "iris_v12.dta" "iris_v13.dta" "iris_v14.dta"

And here's a screenshot that shows the dataset attributes:

[Screenshot: dataset attributes shown in Stata]

Member

hadley commented Jul 29, 2016

Can you make your own really simple dataset? I want to include it in haven for testing.

jjchern commented Jul 29, 2016

Sorry about that. Here's a test dataset:

test_metadata.dta.zip

The Stata code that generates the dataset (just for the record):

clear
set seed 13
set obs 5
gen id = _n
gen female = runiform() < .6
gen treatment = runiform() < .5
gen outcome = 1 + 2 * treatment + (rnormal() < .5)
label data "This is a test dataset."
notes: Reference: https://github.com/hadley/haven/issues/186
saveold "~/Desktop/test_metadata.dta", replace v(12)

And a screenshot of how the metadata looks in Stata's output window:

[Screenshot: metadata of the test dataset in Stata's output window]

As always, thanks a lot for the attention.

Contributor

evanmiller commented Aug 4, 2016

Just to provide a quick update on the ReadStat side of things, I am working on adding read/write support for Stata notes, as well as the SPSS document record. You can track the progress of that effort over here:

WizardMac/ReadStat#73

Member

hadley commented Aug 9, 2016

@evanmiller when I load that "notes.dta" I get two notes: one is "1" and the other is "Reference: ...". Is that what you expect?

Contributor

evanmiller commented Aug 9, 2016

That is what I see in the data file.

The "1" note is labeled "notes0" internally (handler receives note_index=0), so maybe it has some private meaning, e.g. number of notes or first note index. This stuff is completely undocumented so I'm flying a little blind.
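One way to probe the "number of notes" guess is to check it against the nlsw88 expansion fields that foreign::read.dta() printed in the opening comment. The base R sketch below treats the note0 entry as a declared count and compares it with the number of remaining notes. That interpretation is only the hypothesis raised in this comment, since the format is undocumented, and split_dataset_notes() is a hypothetical helper, not part of ReadStat, haven, or foreign.

```r
# Expansion fields copied from the foreign::read.dta() output for nlsw88.dta
# in the opening comment; each entry is c(owner, key, value).
fields <- list(
  c("_dta", "note1", "1988 data, extracted from National Longitudinal of Young Woman"),
  c("_dta", "note2", "who were ages 14-24 in 1968 (NLSW)."),
  c("_dta", "note3", "This dataset is the result of extraction and processing by various"),
  c("_dta", "note4", "people at various times."),
  c("_dta", "note5", "For more information on the NLS, see"),
  c("_dta", "note0", "6"),
  c("_dta", "note6", "http://www.bls.gov/nls/")
)

# Hypothetical helper: separate the note0 entry from the real notes and
# return the real notes ordered by their index.
# Assumption: note0 holds the note count (a guess, not documented behavior).
split_dataset_notes <- function(fields) {
  owner <- vapply(fields, `[`, character(1), 1)
  key   <- vapply(fields, `[`, character(1), 2)
  value <- vapply(fields, `[`, character(1), 3)

  # Keep only dataset-level entries whose key looks like note<N>
  is_note <- owner == "_dta" & grepl("^note[0-9]+$", key)
  index   <- as.integer(sub("^note", "", key[is_note]))
  text    <- value[is_note]

  real <- index > 0
  list(
    declared_count = as.integer(text[index == 0]),
    notes          = text[real][order(index[real])]
  )
}

result <- split_dataset_notes(fields)
# Compare the note0 value with the number of real notes
length(result$notes) == result$declared_count
```

For this file the two numbers agree, and the test dataset in this thread (one note, note0 equal to "1") is consistent with the same reading, but two files are not enough to confirm it.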

Member

hadley commented Aug 9, 2016

@evanmiller also I get an "invalid timestamp" for http://www.stata-press.com/data/r12/nlsw88.dta

Member

hadley commented Aug 9, 2016

Looking at the notes in http://www.stata-press.com/data/r14/iris.dta, I'm guessing that "empty" notes are given a default value based on their position.

hadley closed this in db28efc Aug 9, 2016

Contributor

evanmiller commented Aug 9, 2016

@hadley I'll open an issue for the invalid timestamp.

If you come to a better understanding of notes and how ReadStat should handle empty ones, please file an issue.

lock bot locked and limited conversation to collaborators Jun 26, 2018
