write_dta needs to check for valid Stata variable names #132

Closed
wbuchanan opened this Issue Nov 9, 2015 · 4 comments

wbuchanan commented Nov 9, 2015

d3urls <- list("Selections" = "https://github.com/mbostock/d3/wiki/Selections",
         "Transitions" = "https://github.com/mbostock/d3/wiki/Transitions",
         "Arrays" = "https://github.com/mbostock/d3/wiki/Arrays",
         "Requests" = "https://github.com/mbostock/d3/wiki/Requests",
         "Formatting" = "https://github.com/mbostock/d3/wiki/Formatting",
         "Localization" = "https://github.com/mbostock/d3/wiki/Localization",
         "Colors" = "https://github.com/mbostock/d3/wiki/Colors",
         "Namespaces" = "https://github.com/mbostock/d3/wiki/Namespaces",
         "Math" = "https://github.com/mbostock/d3/wiki/Math", 
         "Internals" = "https://github.com/mbostock/d3/wiki/Internals",
         "Behaviors - Drag" = "https://github.com/mbostock/d3/wiki/Drag-Behavior",
         "Behaviors - Zoom" = "https://github.com/mbostock/d3/wiki/Zoom-Behavior",
         "Geo - Paths" = "https://github.com/mbostock/d3/wiki/Geo-Paths", 
         "Geo - Projections" = "https://github.com/mbostock/d3/wiki/Geo-Projections", 
         "Geo - Streams" = "https://github.com/mbostock/d3/wiki/Geo-Streams",
         "Geom - Voronoi" = "https://github.com/mbostock/d3/wiki/Voronoi-Geom", 
         "Geom - Hull" = "https://github.com/mbostock/d3/wiki/Hull-Geom",
         "Geom - Polygon" = "https://github.com/mbostock/d3/wiki/Polygon-Geom", 
         "Geom - Quadtree" = "https://github.com/mbostock/d3/wiki/Quadtree-Geom", 
         "Layouts - Bundle" = "https://github.com/mbostock/d3/wiki/Bundle-Layout", 
         "Layouts - Chord" = "https://github.com/mbostock/d3/wiki/Chord-Layout", 
         "Layouts - Cluster" = "https://github.com/mbostock/d3/wiki/Cluster-Layout", 
         "Layouts - Force" = "https://github.com/mbostock/d3/wiki/Force-Layout", 
         "Layouts - Hierarchy" = "https://github.com/mbostock/d3/wiki/Hierarchy-Layout", 
         "Layouts - Histogram" = "https://github.com/mbostock/d3/wiki/Histogram-Layout", 
         "Layouts - Pack" = "https://github.com/mbostock/d3/wiki/Pack-Layout", 
         "Layouts - Partition" = "https://github.com/mbostock/d3/wiki/Partition-Layout", 
         "Layouts - Pie" = "https://github.com/mbostock/d3/wiki/Pie-Layout", 
         "Layouts - Stack" = "https://github.com/mbostock/d3/wiki/Stack-Layout", 
         "Layouts - Tree" = "https://github.com/mbostock/d3/wiki/Tree-Layout", 
         "Layouts - Treemap" = "https://github.com/mbostock/d3/wiki/Treemap-Layout", 
         "Scales - Quantitative" = "https://github.com/mbostock/d3/wiki/Quantitative-Scales",
         "Scales - Ordinal" = "https://github.com/mbostock/d3/wiki/Ordinal-Scales",
         "Scales - Timeseries" = "https://github.com/mbostock/d3/wiki/Time-Scales",
         "SVG - Shapes" = "https://github.com/mbostock/d3/wiki/SVG-Shapes",
         "SVG - Axes" = "https://github.com/mbostock/d3/wiki/SVG-Axes",
         "SVG - Controls" = "https://github.com/mbostock/d3/wiki/SVG-Controls",
         "Time - Formatting" = "https://github.com/mbostock/d3/wiki/Time-Formatting",
         "Time - Scales" = "https://github.com/mbostock/d3/wiki/Time-Scales",
         "Time - Intervals" = "https://github.com/mbostock/d3/wiki/Time-Intervals") 

library(magrittr)

# Column names for the result come from the names of the URL list.
colnm <- names(d3urls)

# Scrape the first page and keep only the paragraphs that start with "#".
d3x <- xml2::read_html(d3urls[[1]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
d3x <- d3x[grepl("^#", d3x)]
d3x <- gsub("# ", "", d3x)
r <- seq_along(d3x)  # row id; safe even if no paragraph matched
d3x <- as.data.frame(cbind(r, d3x), stringsAsFactors = FALSE)
names(d3x) <- c("id", colnm[1])

# Repeat for the remaining pages and join each result on the row id.
for (i in seq_along(d3urls)[-1]) {
    x <- xml2::read_html(d3urls[[i]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
    x <- x[grepl("^#", x)]
    x <- gsub("# ", "", x)
    r <- seq_along(x)
    x <- as.data.frame(cbind(r, x), stringsAsFactors = FALSE)
    names(x) <- c("id", colnm[i])
    d3x <- dplyr::full_join(d3x, x, by = "id")
}

rm(x, r)

haven::write_dta(d3x, "~/Desktop/d3Methods.dta")

Then I load the file in Stata 14.1MP8 using:

use ~/Desktop/d3Methods.dta, clear

The problem occurs when I run the Stata command 'compress', which optimizes on-disk storage by downcasting each variable to the smallest type that holds its values without losing precision (e.g., 1.00000000000000000000000 would be stored as a 1-byte integer rather than a float/double). In this case, I think the problem is in the writing functions and how they insert binary zeros around the strings in the data frame. Stata uses binary zeros to pad string columns so that every record reserves the same number of bytes for a given string column.

If I write the same data out to a csv:

write.csv(d3x, "~/Desktop/d3Methods.csv", row.names = FALSE)

Then load the same data in Stata:

. import delimited using ~/Desktop/d3Methods.csv, delim(",") varn(1) clear 
(41 vars, 102 obs)

. compress
  (0 bytes saved)

The issue goes away. I couldn't capture the other error since it crashed Stata each time. I can post the .dta files in version 13 and 14 if you'd like to compare it to the output from Haven.

wbuchanan commented Nov 9, 2015

Figured out the issue here. It seems that nothing in the underlying C library checks for valid Stata variable names. So the file is being written with variable (column) names like "Behaviors - Drag", which are illegal in Stata. To follow Stata conventions, any delimiters should be replaced by a single underscore and names converted to lowercase. It is fine to have "Behaviors - Drag" as a variable label, but not as a variable name.
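A minimal sketch of the renaming described above, for anyone who wants to clean the names on the R side before calling write_dta. `to_stata_names` is a hypothetical helper (not part of haven or the underlying C library). It assumes the usual Stata rules: at most 32 characters, only letters, digits, and underscores, and no leading digit. It does not check for reserved words or for duplicates created by truncation.

```r
# Hypothetical helper, not part of haven: convert arbitrary column names
# into lowercase, underscore-delimited names that Stata should accept.
to_stata_names <- function(x) {
  x <- tolower(x)
  # Replace each run of characters that are not allowed with one underscore.
  x <- gsub("[^a-z0-9_]+", "_", x)
  # Names may not start with a digit, so prefix those with an underscore.
  x <- ifelse(grepl("^[0-9]", x), paste0("_", x), x)
  # Assumed limit of 32 characters per name.
  substr(x, 1, 32)
}

df <- data.frame("Behaviors - Drag" = c("a", "b"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Keep the original name as the variable label, then rename the column.
for (j in seq_along(df)) {
  attr(df[[j]], "label") <- names(df)[j]
}
names(df) <- to_stata_names(names(df))
names(df)  # "behaviors_drag"
```

The original text survives as the "label" attribute, which haven treats as the variable label, so nothing is lost by renaming.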

hadley changed the title from "write_dta file causing issues when loaded into Stata" to "write_dta needs to check for valid Stata variable names" on May 30, 2016

Member

hadley commented May 30, 2016

Could you please point me to the rules for determining valid Stata variable names?

Contributor

evanmiller commented May 30, 2016

hadley closed this in c28c02c on May 30, 2016

Member

hadley commented May 30, 2016

I've been burnt too many times by R's helpful auto-renaming rules, so I've opted to be strict here and throw an error.
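For illustration only, a strict check of this kind might look like the sketch below. This is not the code from the commit above. `check_stata_names` is a hypothetical function, and the pattern assumes ASCII names of 1 to 32 characters that start with a letter or underscore and ignores Stata's reserved words.

```r
# Hypothetical strict validator: error out instead of silently renaming.
check_stata_names <- function(x) {
  # Assumed rule: letter or underscore first, then up to 31 more
  # letters, digits, or underscores.
  ok <- grepl("^[A-Za-z_][A-Za-z0-9_]{0,31}$", x)
  if (!all(ok)) {
    stop(
      "Invalid Stata variable names: ",
      paste0("`", x[!ok], "`", collapse = ", "),
      call. = FALSE
    )
  }
  invisible(x)
}

check_stata_names(c("id", "selections"))  # passes silently

# A name with spaces and a hyphen is rejected with an error.
res <- try(check_stata_names(c("id", "Behaviors - Drag")), silent = TRUE)
inherits(res, "try-error")  # TRUE
```

Failing loudly leaves the choice of new names to the caller, which avoids the surprises that come with automatic renaming.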

lock bot locked and limited conversation to collaborators on Jun 27, 2018
