read_dta can't read ´(\xb4) character #325

Closed
cimentadaj opened this Issue Dec 15, 2017 · 10 comments

cimentadaj commented Dec 15, 2017

Reading the .dta files from the European Social Survey with read_dta produces label strings that R cannot handle: the ´ (\xb4) character triggers an "invalid multibyte string" error. read_spss, however, reads the same data fine. To download the data, briefly sign up with your email here. The script below downloads the data and saves it to tempdir(). Remember to replace your_email with your registered email in the ess_rounds calls.

library(haven)
devtools::install_github("cimentadaj/ess", ref = "dev_branch") # To download ess data from R
library(ess)

# Replace email with your registered email
ess_rounds(7, your_email = your_email, only_download = TRUE, output_dir = tempdir(), format = "stata")
ess_rounds(7, your_email = your_email, only_download = TRUE, output_dir = tempdir(), format = "spss")

stata_path <- list.files(tempdir(), recursive = TRUE, full.names = TRUE, pattern = "\\.dta$")
spss_path <- list.files(tempdir(), recursive = TRUE, full.names = TRUE, pattern = "\\.sav$")

seven_dta <- read_dta(stata_path)
str(seven_dta$prtvtcfi)
#> Class 'labelled'  atomic [1:40185] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Party voted for in last national election, Finland"
#>   ..- attr(*, "format.stata")= chr "%10.0g"
#>   ..- attr(*, "labels")= Named num [1:22] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. ..- attr(*, "names")=
#> Error in strtrim(encodeString(object, quote = "\"", na.encode = FALSE), : invalid multibyte string at '<b4>s P<61>rty (SPP)"'

seven_spss <- read_spss(spss_path)
str(seven_spss$prtvtcfi)
#> Class 'labelled'  atomic [1:40185] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Party voted for in last national election, Finland"
#>   ..- attr(*, "format.spss")= chr "F2.0"
#>   ..- attr(*, "display_width")= int 10
#>   ..- attr(*, "labels")= Named num [1:22] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. ..- attr(*, "names")= chr [1:22] "The National Coalition Party" "The Swedish People´s Party (SPP)" "The Centre Party" "True Finns" ...

read_dta has problems reading "´"; more specifically, the name of attr(seven_dta$prtvtcfi, "labels")[2].
In case you're wondering, ess_rounds simply downloads the .dta file; nothing is done to the data. The problem has also been raised by other users here.

Member

hadley commented Jan 7, 2018

Have you tried specifying the encoding of the file, per the note in the documentation of read_dta()?

If that doesn't help, it would be extremely helpful if you could produce a reprex that doesn't require me to install a package and sign up for an account.

hadley added the reprex label Jan 7, 2018

cimentadaj commented Jan 8, 2018

Their website has no information on the encoding for that wave, so I'm not sure whether it differs from UTF-8. I try Win-1252 in the reprex. Download the data from here

library(haven)
#> Warning: package 'haven' was built under R version 3.4.1

ex <- read_dta("st_ex.dta")
str(ex$prtvtcfi)
#> Class 'labelled'  atomic [1:40185] NA NA NA NA NA NA NA NA NA NA ...
#>   ..- attr(*, "label")= chr "Party voted for in last national election, Finland"
#>   ..- attr(*, "format.stata")= chr "%10.0g"
#>   ..- attr(*, "labels")= Named num [1:22] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. ..- attr(*, "names")=
#> Error in strtrim(encodeString(object, quote = "\"", na.encode = FALSE), : invalid multibyte string at '<b4>s P<61>rty (SPP)"'


ex <- read_dta("st_ex.dta", encoding = "Win 1252")
#> Error in df_parse_dta_file(spec, encoding): Failed to parse /Users/cimentadaj/Downloads/try/st_ex.dta: File has an unsupported character set.
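
One possible reason for the "unsupported character set" error is that "Win 1252" (with a space) is not a name the platform's iconv recognises. The sketch below checks a few candidate spellings against iconvlist(); the helper is_known_encoding() is hypothetical, the candidate names are only guesses, and iconvlist() does not work on every platform.

```r
# Hypothetical helper: TRUE if the platform's iconv lists this encoding name.
# The comparison is case-insensitive because iconv names usually are.
is_known_encoding <- function(name) {
  toupper(name) %in% toupper(iconvlist())
}

# Candidate spellings (assumptions, not taken from the haven documentation)
candidates <- c("Win 1252", "windows-1252", "CP1252", "latin1")
known <- vapply(candidates, is_known_encoding, logical(1))
print(known)
```

Any name reported as TRUE should be a reasonable thing to try as the encoding argument of read_dta().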

cimentadaj commented Jan 9, 2018

I think this might be an encoding problem, as you suggest. In Stata, however, there is no error; the label is shown as "The Swedish People¥s Party (SPP)", with "´" replaced by "¥". I have tried many different encodings, and used readr::guess_encoding on the variable, but the problem persists.
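
A quick way to narrow this down is to decode the offending byte under a few candidate encodings. This is a minimal sketch that assumes the platform's iconv knows the names "ISO-8859-1", "windows-1252" and "macintosh"; the byte sequence is a shortened stand-in for the real label, and the connection to what Stata displays is only a guess.

```r
# Bytes of "People?s", where the apostrophe-like character is the single byte 0xb4
x <- rawToChar(as.raw(c(0x50, 0x65, 0x6f, 0x70, 0x6c, 0x65, 0xb4, 0x73)))

# A lone 0xb4 byte is not valid UTF-8, which is consistent with the
# "invalid multibyte string" error from str()
print(validUTF8(x))

# Decode the same bytes under each candidate encoding
candidates <- c("ISO-8859-1", "windows-1252", "macintosh")
decoded <- vapply(
  candidates,
  function(enc) iconv(x, from = enc, to = "UTF-8"),
  character(1)
)
print(decoded)
```

In ISO-8859-1 and Windows-1252 the byte 0xb4 is "´", while in Mac OS Roman it is "¥", which may be why Stata on a Mac shows "¥" here.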

Member

hadley commented Jan 16, 2018

@evanmiller more evidence that I don't understand the encoding API:

library(haven)

# Default encoding
ex <- read_dta("~/Desktop/st_ex.dta")
bad <- names(attr(ex$prtvtcfi, "labels"))[[2]]

bad
#> [1] "The Swedish People\xb4s Party (SPP)"
iconv(bad, from = "ISO-8859-1", to = "UTF-8")
#> [1] "The Swedish People´s Party (SPP)"

# Supply ISO-8859-1
ex <- read_dta("~/Desktop/st_ex.dta", encoding = "ISO-8859-1")
bad <- names(attr(ex$prtvtcfi, "labels"))[[2]]

# This value doesn't seem to change
bad
#> [1] "The Swedish People\xb4s Party (SPP)"
iconv(bad, from = "ISO-8859-1", to = "UTF-8")
#> [1] "The Swedish People´s Party (SPP)"

I'm setting the parser encoding with readstat_set_file_character_encoding(), and I mark the value returned from readstat_string_value() with Rf_mkCharCE(str_value, CE_UTF8) so that R knows it's UTF-8.
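
As a stopgap while the encoding argument has no effect, the iconv() call above can be applied to the label names after reading. The sketch below is not part of haven: reencode_label_names() is a hypothetical helper, the default source encoding "windows-1252" is an assumption about the file, and a plain numeric vector with a "labels" attribute stands in for a column returned by read_dta().

```r
# Hypothetical helper: re-encode the names of a "labels" attribute to UTF-8.
# `from` is an assumption about the encoding the file was really written in.
reencode_label_names <- function(x, from = "windows-1252") {
  labels <- attr(x, "labels")
  if (!is.null(labels) && !is.null(names(labels))) {
    names(labels) <- iconv(names(labels), from = from, to = "UTF-8")
    attr(x, "labels") <- labels
  }
  x
}

# Stand-in column: one label name contains the raw byte 0xb4
bad_name <- rawToChar(as.raw(c(0x50, 0x65, 0x6f, 0x70, 0x6c, 0x65, 0xb4, 0x73)))
labels <- c(1, 2)
names(labels) <- c("The Centre Party", bad_name)
col <- structure(c(1, 2, NA), labels = labels)

fixed <- reencode_label_names(col)
print(names(attr(fixed, "labels")))
```

For a whole data frame, something like `ex[] <- lapply(ex, reencode_label_names)` would apply the same repair to every column; columns without labels are returned unchanged.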

hadley added bug and removed reprex labels Jan 16, 2018

Contributor

evanmiller commented Jan 16, 2018

This appears to be a bug in ReadStat's DTA parser. I'll get it fixed soon.

Contributor

evanmiller commented Jan 16, 2018

Should be fixed in WizardMac/ReadStat@607d159

Member

hadley commented Jan 16, 2018

Awesome - thanks! I'm going to roll this up into a release. Any products of SAS investigation can go to a future release.

hadley closed this in 1857acf Jan 16, 2018

Contributor

evanmiller commented Jan 16, 2018

@hadley Did you verify the fix? I think the file is supposed to be UTF-8 but is actually Win-1252 (i.e. encoding must be specified even with the patch)

Member

hadley commented Jan 16, 2018

Yup - I did, and you're right that it's necessary to manually override the encoding.

lock bot commented Jul 15, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock bot locked and limited conversation to collaborators Jul 15, 2018
