write_dta needs to check for valid Stata variable names #132

Closed
wbuchanan opened this issue Nov 9, 2015 · 4 comments

@wbuchanan

d3urls <- list("Selections" = "https://github.com/mbostock/d3/wiki/Selections",
         "Transitions" = "https://github.com/mbostock/d3/wiki/Transitions",
         "Arrays" = "https://github.com/mbostock/d3/wiki/Arrays",
         "Requests" = "https://github.com/mbostock/d3/wiki/Requests",
         "Formatting" = "https://github.com/mbostock/d3/wiki/Formatting",
         "Localization" = "https://github.com/mbostock/d3/wiki/Localization",
         "Colors" = "https://github.com/mbostock/d3/wiki/Colors",
         "Namespaces" = "https://github.com/mbostock/d3/wiki/Namespaces",
         "Math" = "https://github.com/mbostock/d3/wiki/Math", 
         "Internals" = "https://github.com/mbostock/d3/wiki/Internals",
         "Behaviors - Drag" = "https://github.com/mbostock/d3/wiki/Drag-Behavior",
         "Behaviors - Zoom" = "https://github.com/mbostock/d3/wiki/Zoom-Behavior",
         "Geo - Paths" = "https://github.com/mbostock/d3/wiki/Geo-Paths", 
         "Geo - Projections" = "https://github.com/mbostock/d3/wiki/Geo-Projections", 
         "Geo - Streams" = "https://github.com/mbostock/d3/wiki/Geo-Streams",
         "Geom - Voronoi" = "https://github.com/mbostock/d3/wiki/Voronoi-Geom", 
         "Geom - Hull" = "https://github.com/mbostock/d3/wiki/Hull-Geom",
         "Geom - Polygon" = "https://github.com/mbostock/d3/wiki/Polygon-Geom", 
         "Geom - Quadtree" = "https://github.com/mbostock/d3/wiki/Quadtree-Geom", 
         "Layouts - Bundle" = "https://github.com/mbostock/d3/wiki/Bundle-Layout", 
         "Layouts - Chord" = "https://github.com/mbostock/d3/wiki/Chord-Layout", 
         "Layouts - Cluster" = "https://github.com/mbostock/d3/wiki/Cluster-Layout", 
         "Layouts - Force" = "https://github.com/mbostock/d3/wiki/Force-Layout", 
         "Layouts - Hierarchy" = "https://github.com/mbostock/d3/wiki/Hierarchy-Layout", 
         "Layouts - Histogram" = "https://github.com/mbostock/d3/wiki/Histogram-Layout", 
         "Layouts - Pack" = "https://github.com/mbostock/d3/wiki/Pack-Layout", 
         "Layouts - Partition" = "https://github.com/mbostock/d3/wiki/Partition-Layout", 
         "Layouts - Pie" = "https://github.com/mbostock/d3/wiki/Pie-Layout", 
         "Layouts - Stack" = "https://github.com/mbostock/d3/wiki/Stack-Layout", 
         "Layouts - Tree" = "https://github.com/mbostock/d3/wiki/Tree-Layout", 
         "Layouts - Treemap" = "https://github.com/mbostock/d3/wiki/Treemap-Layout", 
         "Scales - Quantitative" = "https://github.com/mbostock/d3/wiki/Quantitative-Scales",
         "Scales - Ordinal" = "https://github.com/mbostock/d3/wiki/Ordinal-Scales",
         "Scales - Timeseries" = "https://github.com/mbostock/d3/wiki/Time-Scales",
         "SVG - Shapes" = "https://github.com/mbostock/d3/wiki/SVG-Shapes",
         "SVG - Axes" = "https://github.com/mbostock/d3/wiki/SVG-Axes",
         "SVG - Controls" = "https://github.com/mbostock/d3/wiki/SVG-Controls",
         "Time - Formatting" = "https://github.com/mbostock/d3/wiki/Time-Formatting",
         "Time - Scales" = "https://github.com/mbostock/d3/wiki/Time-Scales",
         "Time - Intervals" = "https://github.com/mbostock/d3/wiki/Time-Intervals") 

library(magrittr)

# Column names for the combined data frame come from the list names
colnm <- names(d3urls)

# First page: keep only the paragraphs whose text starts with "#"
d3x <- xml2::read_html(d3urls[[1]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
d3x <- d3x[grepl("^#.*", d3x)]
d3x <- gsub("# ", "", d3x)
r <- seq_along(d3x)  # row id used as the join key
d3x <- as.data.frame(cbind(r, d3x), stringsAsFactors = FALSE)
names(d3x) <- c("id", colnm[1])

# Remaining pages: same extraction, then join each new column on the row id
for (i in seq_along(d3urls)[-1]) {
    x <- xml2::read_html(d3urls[[i]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
    x <- x[grepl("^#.*", x)]
    x <- gsub("# ", "", x)
    r <- seq_along(x)
    x <- as.data.frame(cbind(r, x), stringsAsFactors = FALSE)
    names(x) <- c("id", colnm[i])
    d3x <- dplyr::full_join(d3x, x, by = "id")
}

rm(x, r)

haven::write_dta(d3x, "~/Desktop/d3Methods.dta")

Then I load the file in Stata 14.1MP8 using:

use ~/Desktop/d3Methods.dta, clear

The problem occurs when I run the Stata command 'compress', which optimizes on-disk storage by downcasting each variable to the smallest type that loses no precision (for example, 1.00000000000000000000000 would be stored as a 1-byte integer rather than a float/double). In this case, I think the problem lies in the writing functions and how they insert binary zeros around the strings in the data frame. Stata pads string columns with binary zeros so that every record in a string column reserves the same number of bits for storage.

If I write the same data out to a csv:

write.csv(d3x, "~/Desktop/d3Methods.csv", row.names = FALSE)

Then load the same data in Stata:

. import delimited using ~/Desktop/d3Methods.csv, delim(",") varn(1) clear 
(41 vars, 102 obs)

. compress
  (0 bytes saved)

The issue goes away. I couldn't capture the other error since it crashed Stata each time. I can post the .dta files in version 13 and 14 formats if you'd like to compare them to the output from haven.

@wbuchanan
Author

Figured out the issue here. It seems that nothing in the underlying C library checks for valid Stata names, so the file is written with variable (column) names like "Behaviors - Drag", which are illegal in Stata. To follow Stata conventions, any delimiters should be replaced by a single underscore and names converted to lowercase. "Behaviors - Drag" is fine as a variable label, but not as a variable name.
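
A minimal sketch of the renaming convention described above, in case it helps as a workaround before calling write_dta. The helper name `to_stata_names` is hypothetical and is not part of haven or ReadStat. It assumes the usual ASCII rules for Stata names (letters, digits and underscores only, first character a letter or underscore, at most 32 characters), so any non-ASCII characters are treated as delimiters.

```r
# Hypothetical helper (not part of haven): turn arbitrary column names
# into conventional Stata variable names.
to_stata_names <- function(x) {
  x <- tolower(x)
  # Any run of characters other than a-z, 0-9 or "_" becomes one underscore
  x <- gsub("[^a-z0-9_]+", "_", x)
  # Drop leading/trailing underscores (this also strips ones already present)
  x <- gsub("^_+|_+$", "", x)
  # Names must start with a letter or underscore
  bad_start <- !grepl("^[a-z_]", x)
  x[bad_start] <- paste0("_", x[bad_start])
  # Assumed limit of 32 characters per name
  x <- substr(x, 1, 32)
  # De-duplicate; note the added suffix can push a name past 32 characters
  make.unique(x, sep = "_")
}

to_stata_names(c("Behaviors - Drag", "Geo - Paths", "id"))
# "behaviors_drag" "geo_paths" "id"
```

With the data frame from the first comment, this would be applied as `names(d3x) <- to_stata_names(names(d3x))` before writing the file. The original names could be kept as variable labels instead, which Stata does allow.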

@hadley changed the title from "write_dta file causing issues when loaded into Stata" to "write_dta needs to check for valid Stata variable names" on May 30, 2016
@hadley
Member

hadley commented May 30, 2016

Could you please point me to the rules for determining valid Stata variable names?

@evanmiller
Collaborator

See also WizardMac/ReadStat#46

@hadley closed this as completed in c28c02c on May 30, 2016
@hadley
Member

hadley commented May 30, 2016

I've been burnt too many times with R's helpful auto-renaming rules, so I've opted to be strict here and throw an error.
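
For readers who want the same strict behaviour in their own scripts, here is a rough sketch of a check that fails rather than renames. It is not the code from the commit above. The function name `check_stata_names` and the exact rule (ASCII letters, digits and underscores, first character a letter or underscore, at most 32 characters) are assumptions made for illustration.

```r
# Hypothetical strict check (an illustration, not haven's implementation):
# stop with an error listing every column name Stata would reject.
check_stata_names <- function(df) {
  nms <- names(df)
  # Assumed rule: letter or underscore first, then up to 31 more
  # letters, digits or underscores
  ok <- grepl("^[A-Za-z_][A-Za-z0-9_]{0,31}$", nms)
  if (!all(ok)) {
    stop(
      "Invalid Stata variable names: ",
      paste0("`", nms[!ok], "`", collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

# check.names = FALSE keeps the illegal name instead of letting R rename it
bad <- data.frame("Behaviors - Drag" = 1, id = 2, check.names = FALSE)
tryCatch(check_stata_names(bad), error = function(e) conditionMessage(e))
```

Failing loudly leaves the choice of new names to the caller, which avoids the silent renaming surprises mentioned above.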

@lock (bot) locked and limited conversation to collaborators on Jun 27, 2018