"Special" characters encoding issues with write_* and read_* #697

Closed
dpprdan opened this Issue Jul 18, 2017 · 7 comments

dpprdan commented Jul 18, 2017

EDIT: Skip to my third post; everything else is due to bugs in base R.

There is something wrong in the way readr's write_* and read_* functions deal with "special" characters (on Windows).

x <- c("€", "–", "¼", "⅛", "℅", "‰", "ö")
Encoding(x)

# "latin1" "latin1" "latin1" "UTF-8"  "UTF-8"  "latin1" "latin1"

df <- data.frame(x, stringsAsFactors = FALSE)

print(x)
# "€" "–" "¼" "⅛" "℅" "‰" "ö"

print(df)
#          x
# 1        €
# 2        –
# 3        ¼
# 4 <U+215B>
# 5 <U+2105>
# 6        ‰
# 7        ö

So apparently print() cannot deal with "⅛" and "℅" when they are in a data.frame?

Anyway, this is what readr does

library("readr")
write_csv(df, "df_readr.csv")
read_csv("df_readr.csv")

# Parsed with column specification:
# cols(
#   x = col_character()
# )
# # A tibble: 7 x 1
#          x
#      <chr>
# 1 "\u0080"
# 2 "\u0096"
# 3        ¼
# 4 <U+215B>
# 5 <U+2105>
# 6 "\u0089"
# 7        ö

Interesting. Even more so when we look at the output of write_lines()

write_lines(df$x, "df_readr_lines.txt")
read_lines("df_readr_lines.txt")
# "\u0080" "\u0096" "¼"      "⅛"      "℅"      "\u0089" "ö"     

This is equivalent to what write_lines() does and what I see in Notepad++ (well kinda).

So both write functions can deal with things like "⅛" and "℅" but not with the Euro symbol or the en dash (U+2013).

Finally, the output of format_csv is rubbish.

cat(format_csv(df))

# x
# Â€
# Â–
# Â¼
# â…›
# â„…
# Â‰
# Ã¶

For comparison now utils

write.csv(df, "df_utils.csv", fileEncoding = "UTF-8", row.names = FALSE)
read.csv("df_utils.csv", fileEncoding = "UTF-8")
#          x
# 1        €
# 2        –
# 3        ¼
# 4 <U+215B>
# 5 <U+2105>
# 6        ‰
# 7        ö

This looks the same in Notepad++.

So utils::write.csv() writes the same as what print(df) shows in the console. I'd expect this kind of consistency from readr, too.

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.1 (2017-06-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2017-07-18
#> Packages -----------------------------------------------------------------
#>  package   * version    date       source                          
#>  backports   1.1.0      2017-05-22 CRAN (R 3.4.0)                  
#>  base      * 3.4.1      2017-06-30 local                           
#>  compiler    3.4.1      2017-06-30 local                           
#>  datasets  * 3.4.1      2017-06-30 local                           
#>  devtools    1.13.2     2017-06-02 CRAN (R 3.4.0)                  
#>  digest      0.6.12     2017-01-27 CRAN (R 3.3.2)                  
#>  evaluate    0.10.1     2017-06-24 CRAN (R 3.4.0)                  
#>  graphics  * 3.4.1      2017-06-30 local                           
#>  grDevices * 3.4.1      2017-06-30 local                           
#>  hms         0.3        2016-11-22 CRAN (R 3.3.2)                  
#>  htmltools   0.3.6      2017-04-28 CRAN (R 3.4.0)                  
#>  knitr       1.16       2017-05-18 CRAN (R 3.4.0)                  
#>  magrittr    1.5        2014-11-22 CRAN (R 3.3.0)                  
#>  memoise     1.1.0      2017-05-29 Github (hadley/memoise@e372cde) 
#>  methods   * 3.4.1      2017-06-30 local                           
#>  R6          2.2.2      2017-06-17 CRAN (R 3.4.0)                  
#>  Rcpp        0.12.12    2017-07-15 CRAN (R 3.4.1)                  
#>  readr     * 1.1.1.9000 2017-07-18 Github (tidyverse/readr@3ea8199)
#>  rlang       0.1.1      2017-05-18 CRAN (R 3.4.0)                  
#>  rmarkdown   1.6        2017-06-15 CRAN (R 3.4.0)                  
#>  rprojroot   1.2        2017-01-16 CRAN (R 3.3.2)                  
#>  stats     * 3.4.1      2017-06-30 local                           
#>  stringi     1.1.5      2017-04-07 CRAN (R 3.3.3)                  
#>  stringr     1.2.0      2017-02-18 CRAN (R 3.3.3)                  
#>  tibble      1.3.3      2017-05-28 CRAN (R 3.4.0)                  
#>  tools       3.4.1      2017-06-30 local                           
#>  utils     * 3.4.1      2017-06-30 local                           
#>  withr       1.0.2      2016-06-20 CRAN (R 3.3.1)                  
#>  yaml        2.1.14     2016-11-12 CRAN (R 3.3.2)

dpprdan commented Aug 1, 2017

Upon further investigation, most of these issues are with base R and not with readr.
First, it seems to me that "⅛" is printed as "<U+215B>" by print.data.frame() and print.tbl() because both call format().

x <- c("⅛", "℅")
format(x)
# [1] "<U+215B>" "<U+2105>"

See e.g. this blog post and this question on SO.

Second, I believe that "€", "–", and "‰" being printed (or rather written by write_csv) as "\u0080", "\u0096", and "\u0089" is due to a bug in enc2utf8(), see my post to r-devel. In short, enc2utf8 assigns the wrong Unicode characters to cp1252 characters in the 80 to 9F range.

y <- c("€", "–", "‰")
print(y)
# [1] "€" "–" "‰"
enc2utf8(y)
# [1] "\u0080" "\u0096" "\u0089"
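For comparison, the expected mapping can be sketched with iconv() and an explicit source encoding, which sidesteps the native-encoding path in enc2utf8(). This is only an illustration: it assumes the platform's iconv recognises the name "CP1252", and it builds the input from raw bytes so that it does not depend on the session locale.

```r
# Bytes 0x80, 0x96 and 0x89 are "€", "–" and "‰" in Windows-1252.
cp1252 <- vapply(
  c(0x80, 0x96, 0x89),
  function(b) rawToChar(as.raw(b)),
  character(1)
)

# Convert with an explicit source encoding instead of relying on the locale.
fixed <- iconv(cp1252, from = "CP1252", to = "UTF-8")

# Code points of the converted strings.
code_points <- vapply(fixed, utf8ToInt, integer(1), USE.NAMES = FALSE)
sprintf("U+%04X", code_points)
#> [1] "U+20AC" "U+2013" "U+2030"
```

These are the code points one would expect from enc2utf8() as well, rather than U+0080, U+0096 and U+0089.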

The only thing remaining is the strange output from cat(format_csv()). I can reproduce this with enc2utf8() followed by iconv(to = "UTF-8"). I don't know whether this is really the problem here, though.

z <- c("€", "–", "¼", "⅛", "℅", "‰", "ö")
z_df <- data.frame(z, stringsAsFactors = FALSE)
z_utf8 <- enc2utf8(z)
z_utf8 <- iconv(z_utf8, to = "UTF-8")
print(z_utf8)
# [1] "Â€"  "Â–"  "Â¼"  "â…›" "â„…" "Â‰"  "Ã¶"
cat(readr::format_csv(z_df))
# z
# Â€
# Â–
# Â¼
# â…›
# â„…
# Â‰
# Ã¶

dpprdan commented Oct 10, 2017

The problem with format_csv seems to be that the output is UTF-8 encoded but R does not know about it: the output's Encoding() is "unknown", so R interprets the bytes as if they were in the native encoding.

Reminder: Output of format_csv is garbled.

x <- c("ä", "ö", "ü", "⅛", "℅")
x_df <- data.frame(x, stringsAsFactors = FALSE)
x_fcsv <- readr::format_csv(x_df)
cat(x_fcsv)
#> x
#> Ã¤
#> Ã¶
#> Ã¼
#> â…›
#> â„…

format_csv encodes the output as "UTF-8", but without declaring the encoding.

Encoding(x_fcsv)
#> [1] "unknown"

Which is equivalent to this

Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown"
y <- enc2utf8(x)
Encoding(y) <- "unknown"
y
#> [1] "Ã¤"  "Ã¶"  "Ã¼"  "â…›" "â„…"

Fix: Declare encoding and everything looks fine

Encoding(x_fcsv) <- "UTF-8"
cat(x_fcsv)
#> x
#> ä
#> ö
#> ü
#> ⅛
#> ℅
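
The same point can be shown without readr: the declared encoding is only a mark on the string, and changing it leaves the bytes untouched. A minimal base R sketch (the variable names are illustrative, not readr internals):

```r
# "\u215b" is "⅛"; its UTF-8 representation is the three bytes e2 85 9b.
marked <- enc2utf8("\u215b")
bytes <- charToRaw(marked)

# rawToChar() returns the same bytes without an encoding mark,
# similar to what format_csv() returned here.
unmarked <- rawToChar(bytes)
Encoding(unmarked)
#> [1] "unknown"

# Declaring the encoding changes the mark only, not the bytes.
Encoding(unmarked) <- "UTF-8"
identical(charToRaw(unmarked), bytes)
#> [1] TRUE
```

So the workaround does not re-encode anything; it just tells R how to interpret what is already there.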

@jimhester jimhester added the reprex label Dec 7, 2017

Member

jimhester commented Dec 7, 2017

I unfortunately cannot reproduce your results. It would be helpful to know what locale you are running this under and ideally produce a locale independent example.

dpprdan commented Dec 11, 2017

Sorry, I had a session_info() in my first post, but then I said "skip to the third post", so see a current session_info() at the end.

I am on Windows with a cp1252 locale, but this seems to extend to other locales on platforms with non-UTF-8 native encodings as well.

Sys.setlocale(, "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
x <- c("ä", "ö", "ü", "⅛", "℅")
x_df <- data.frame(x, stringsAsFactors = FALSE)
x_fcsv <- readr::format_csv(x_df)
cat(x_fcsv)
#> x
#> 盲
#> 枚
#> 眉
#> 鈪?鈩?
Encoding(x_fcsv) <- "UTF-8"
cat(x_fcsv)
#> x
#> ä
#> ö
#> ü
#> ⅛
#> ℅

(Note that I edited this reprex manually, since chars which are not in the current locale's code page are rendered as escapes, e.g. "℅" becomes "<U+2105>".)

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                         
#>  version  R version 3.4.2 (2017-09-28)  
#>  system   x86_64, mingw32               
#>  ui       RTerm                         
#>  language (EN)                          
#>  collate  Chinese (Simplified)_China.936
#>  tz       Europe/Berlin                 
#>  date     2017-12-11
#> Packages -----------------------------------------------------------------
#>  package   * version    date       source                          
#>  backports   1.1.1      2017-09-25 CRAN (R 3.4.1)                  
#>  base      * 3.4.2      2017-09-28 local                           
#>  compiler    3.4.2      2017-09-28 local                           
#>  datasets  * 3.4.2      2017-09-28 local                           
#>  devtools    1.13.4     2017-11-09 CRAN (R 3.4.2)                  
#>  digest      0.6.12     2017-01-27 CRAN (R 3.4.2)                  
#>  evaluate    0.10.1     2017-06-24 CRAN (R 3.4.0)                  
#>  graphics  * 3.4.2      2017-09-28 local                           
#>  grDevices * 3.4.2      2017-09-28 local                           
#>  hms         0.4.0      2017-11-23 CRAN (R 3.4.2)                  
#>  htmltools   0.3.6      2017-04-28 CRAN (R 3.4.0)                  
#>  knitr       1.17       2017-08-10 CRAN (R 3.4.1)                  
#>  magrittr    1.5        2014-11-22 CRAN (R 3.4.2)                  
#>  memoise     1.1.0      2017-10-16 Github (hadley/memoise@d63ae9c) 
#>  methods   * 3.4.2      2017-09-28 local                           
#>  pkgconfig   2.0.1      2017-03-21 CRAN (R 3.4.0)                  
#>  R6          2.2.2      2017-06-17 CRAN (R 3.4.0)                  
#>  Rcpp        0.12.14    2017-11-23 CRAN (R 3.4.2)                  
#>  readr       1.1.1.9000 2017-12-11 Github (tidyverse/readr@2b87368)
#>  rlang       0.1.4      2017-11-05 CRAN (R 3.4.2)                  
#>  rmarkdown   1.8        2017-11-17 CRAN (R 3.4.2)                  
#>  rprojroot   1.2        2017-01-16 CRAN (R 3.4.2)                  
#>  stats     * 3.4.2      2017-09-28 local                           
#>  stringi     1.1.6      2017-11-17 CRAN (R 3.4.2)                  
#>  stringr     1.2.0      2017-02-18 CRAN (R 3.4.2)                  
#>  tibble      1.3.4      2017-08-22 CRAN (R 3.4.1)                  
#>  tools       3.4.2      2017-09-28 local                           
#>  utils     * 3.4.2      2017-09-28 local                           
#>  withr       2.1.0      2017-11-01 CRAN (R 3.4.2)                  
#>  yaml        2.1.15     2017-12-01 CRAN (R 3.4.3)

Member

yutannihilation commented Dec 11, 2017

Hi, I could reproduce this issue on my Windows with CP932 (Shift_JIS). So, this is probably true:

this seems to extend to other locales on platforms with non-UTF-8 native encodings as well.

IIUC, this is reproducible on Appveyor:

Sys.setlocale(locale = "English_United States.1252")
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

# c.f. https://github.com/tidyverse/dplyr/blob/fc663425ce19e5b70ae6367704218d2f4152ed78/tests/testthat/helper-encoding.R#L3
x <- enc2native(c("a", "Gl\u00fcck"))

x
#> [1] "a"     "Glück"
Encoding(x)
#> [1] "unknown" "latin1"

x_df <- data.frame(x, stringsAsFactors = FALSE)

x_fcsv <- readr::format_csv(x_df)
cat(x_fcsv)
#> x
#> a
#> GlÃ¼ck

Encoding(x_fcsv) <- "UTF-8"
cat(x_fcsv)
#> x
#> a
#> Glück
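
A locale-independent way to see the same garbling, assuming nothing beyond base R and an iconv that knows "CP1252": take the UTF-8 bytes of "Glück", drop the encoding mark, and convert them as if they were Windows-1252. This only mimics what a cp1252 console does with unmarked UTF-8 output; it is not how readr works internally.

```r
# UTF-8 bytes of "Glück": 47 6c c3 bc 63 6b
utf8_bytes <- charToRaw(enc2utf8("Gl\u00fcck"))

# Treat the UTF-8 bytes as Windows-1252, as a cp1252 console would.
garbled <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")

# Five characters have become six: "ü" is now "Ã" (195) plus "¼" (188).
nchar(garbled, type = "chars")
#> [1] 6
utf8ToInt(garbled)
#> [1]  71 108 195 188  99 107
```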

@jimhester jimhester closed this in d6a4a22 Dec 11, 2017

Member

jimhester commented Dec 11, 2017

Ok we now explicitly mark the output as UTF-8 encoded, which should fix this.

@jimhester jimhester added the bug label Dec 11, 2017

lock bot commented Sep 25, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018
