Importing from Stata: Encoding issues #71

Closed
kwenzig opened this Issue May 26, 2015 · 11 comments

kwenzig commented May 26, 2015

There seem to be some encoding issues with read_dta.
I'm working with version 0.2.0.9000. Non-ASCII characters, such as German umlauts (äöüß), come out broken. It looks like the import function treats them as UTF-8, but they are actually Latin-1 (or something similar). I can correct the strings with:

iconv(names(attr(*, "labels")), from="L1", to="UTF-8")
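The one-liner above can be wrapped so that it applies to a whole column. This is only a minimal sketch: `fix_label_names` is a hypothetical helper, the sample vector is a hand-built stand-in for an imported column, and it assumes the label names really are Latin-1 (it uses the name "latin1" where the line above uses the alias "L1").

```r
# Hypothetical stand-in for an imported column: an integer vector whose
# "labels" attribute has names stored as Latin-1 bytes.
labs <- c(1L, 2L)
# First name is "Gr", u-umlaut (byte 0xFC in Latin-1), "n"; second is plain ASCII.
names(labs) <- c(rawToChar(as.raw(c(0x47, 0x72, 0xfc, 0x6e))), "Rot")
x <- structure(c(1L, 2L, 1L), labels = labs)

# Hypothetical helper: re-encode only the label names, leave the data alone.
fix_label_names <- function(v, from = "latin1") {
  labs <- attr(v, "labels")
  if (!is.null(labs)) {
    names(labs) <- iconv(names(labs), from = from, to = "UTF-8")
    attr(v, "labels") <- labs
  }
  v
}

x_fixed <- fix_label_names(x)
fixed_names <- names(attr(x_fixed, "labels"))
charToRaw(fixed_names[1])  # inspect the bytes after conversion
```

To apply it to every column of a data frame, something like `df[] <- lapply(df, fix_label_names)` should work, assuming all columns use the same source encoding.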

PS: Stata 14 introduced Unicode support for the first time: http://www.stata.com/stata14/unicode/
PPS: I can provide a Stata 13 file containing umlauts. (I don't have access to Stata 14 yet.)

hadley commented May 26, 2015

@evanmiller any thoughts?

evanmiller commented May 26, 2015

It looks like prior to version 14 Stata used "extended ASCII", i.e. whatever the default code page on the computer was. ReadStat currently assumes it is regular ASCII, without support for umlauts and such. I'll see about adding support for specifying the encoding, since it's not stored in the DTA file itself.
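To illustrate why the encoding has to be supplied from outside the file, here is a small sketch in base R. It assumes the file was written on a machine using Latin-1; other code pages would map the same byte to a different character. The commented-out `read_dta()` call relies on the `encoding` argument that later haven releases provide, so treat it as an assumption about your installed version.

```r
# The single byte 0xE4 means a-umlaut in Latin-1, but on its own it is not
# a valid UTF-8 sequence, so a reader has to be told which encoding to assume.
b <- rawToChar(as.raw(0xe4))
validUTF8(b)  # is the byte sequence valid UTF-8?

# Declaring the source encoding lets iconv() produce proper UTF-8.
as_utf8 <- iconv(b, from = "latin1", to = "UTF-8")
charToRaw(as_utf8)

# Assumption: with a haven version that has the `encoding` argument,
# the same declaration can be made at import time:
# haven::read_dta("auto_umlaut.dta", encoding = "latin1")
```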

evanmiller commented May 26, 2015

hadley closed this in cd623bb Jun 19, 2015

hadley commented Jun 19, 2015

@kwenzig please try with the dev version, and let us know if it still fails

kwenzig commented Jun 24, 2015

I'm afraid there are still encoding issues. In this Stata dta file (version 13) there are umlauts in the variable label of make ("Mäke and Mödel") and in the contents of this string variable (first row: "ÄMC Concörd"):
[Screenshot: stata]
I import the data using haven and write this information to a text file with:

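library(haven)  # provides read_dta()

test <- read_dta("C:/Users/kwenzig/Desktop/auto_umlaut.dta")
t1 <- test$make[1]              # first value of the string variable
t2 <- attr(test$make, "label")  # variable label
t12 <- paste0(t1, "-", t2)
# Write the combined string through a connection declared as UTF-8
cat(t12,
    file = (con <- file("C:/temp/haven.txt", "w", encoding = "UTF-8")),
    sep = "", fill = FALSE, labels = NULL, append = FALSE)
close(con)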

Then I open haven.txt in RStudio in UTF-8 mode and get:

<c4>MC Conc<f6>rd-M<e4>ke and M<f6>del
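The escapes in that output can be read as hex byte values. As a rough diagnostic sketch (the string is retyped here from the text above, not read from the actual file), comparing the raw bytes of the UTF-8 and Latin-1 forms shows which encoding the `<c4>` and `<f6>` bytes belong to:

```r
# The expected first value, written with \u escapes so it is marked as UTF-8.
s <- "\u00c4MC Conc\u00f6rd"
Encoding(s)
utf8_bytes <- charToRaw(s)      # A-umlaut is two bytes here: c3 84

# The same text converted to Latin-1: every character is a single byte.
l1 <- iconv(s, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(l1)   # A-umlaut is the single byte c4

utf8_bytes[1:2]
latin1_bytes[c(1, 9)]           # positions of A-umlaut and o-umlaut
```

If the bytes in the text file match the Latin-1 form, the string was written without being converted to UTF-8 somewhere between import and display.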

You can find the Stata file here:
https://oc.diw.de/index.php/s/LM0dPQCrsMeJDJW
HTH.

kwenzig changed the title from "Imprting from Stata: Encoding issues" to "Importing from Stata: Encoding issues" Jun 24, 2015

hadley reopened this Jun 24, 2015

evanmiller commented Jun 24, 2015

ReadStat correctly converts these strings to UTF-8, so it is most likely an issue on the haven end.

hadley commented Jun 24, 2015

It seems fine to me:

df <- read_dta("auto_umlaut.dta")
df[1,1]
#> [1] "ÄMC Concörd"
Encoding(df[1, 1])
#> [1] "UTF-8"

I'd suspect something is going wrong elsewhere in your code that modifies, writes, and reads the text file.
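One way to rule out the write and read steps is a byte-level round trip that does not depend on the session's native encoding. This is only a sketch: the string is retyped from the report above, and the temporary file path is a placeholder for the real output file.

```r
# Expected combined string, written with \u escapes so it is marked as UTF-8.
s <- "\u00c4MC Conc\u00f6rd-M\u00e4ke and M\u00f6del"

path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")                 # binary mode: no newline translation
writeLines(enc2utf8(s), con, useBytes = TRUE)  # write the UTF-8 bytes unchanged
close(con)

# Read the line back and declare it as UTF-8.
back <- readLines(path, encoding = "UTF-8")
identical(charToRaw(back), charToRaw(enc2utf8(s)))
```

If the bytes survive this round trip but the text still looks wrong in an editor or console, the problem is more likely in how that tool displays the string than in the import.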

hadley closed this Jun 24, 2015

kwenzig commented Jun 25, 2015

Thanks for trying to replicate this. I tried your code in RGui, and haven works properly there. Could this be an RStudio issue that I should report there?
[Screenshot: rgui]
[Screenshot: rstudio]

hadley commented Jun 25, 2015

It works for me in RStudio. Do you have the latest version?

kwenzig commented Jun 25, 2015

We have version 0.99.441 (for Windows) here; I will ask for v0.99.447.

kwenzig commented Jul 1, 2015

Thx. v0.99.447 has arrived here, and I can report that it works as expected. Great.

lock bot locked and limited conversation to collaborators Jun 27, 2018
