Characters garbled from sink() on Windows #59

Open
yihui opened this issue Dec 12, 2015 · 10 comments

Comments

@yihui
Collaborator

yihui commented Dec 12, 2015

Some examples:

Sys.setlocale(, 'English')  # can also try 'German_Austria'
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"\u009a\"\n"

Sys.setlocale(, 'Chinese')
# [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"<U+0161>\"\n"

Originally reported at http://stackoverflow.com/q/34096239/559676

With only sink() and textConnection():

sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = '\u0161'
  y = character()
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  y  
}

sink_test()
# [1] "[1] \"š\""

In this reduced example, the only problem is that the result is marked with the wrong encoding:

z = sink_test()
Encoding(z)
# [1] "latin1"

iconv(z, to = 'UTF-8')
# [1] "[1] \"š\""
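
To illustrate the point about the encoding mark: the byte behind the garbled output is presumably 0x9a, which is "š" in Windows code page 1252 but the control character U+009A in strict ISO-8859-1. The sketch below is not part of the original report. It rebuilds that byte by hand, so it does not depend on the session locale, and converts it with an explicit `from` encoding. The variable names are illustrative only.

```r
# The single byte 0x9a, presumably what sink() captured in the 'English' example
z <- rawToChar(as.raw(0x9a))

# Read as Windows code page 1252, 0x9a is U+0161 (s with caron)
as_cp1252 <- iconv(z, from = "CP1252", to = "UTF-8")

# Read as strict ISO-8859-1, the same byte is the control character U+009A
as_latin1 <- iconv(z, from = "ISO-8859-1", to = "UTF-8")

# Compare the resulting code points
utf8ToInt(as_cp1252) == 0x0161
utf8ToInt(as_latin1) == 0x009a
```

If this reading is right, it would be consistent with evaluate() reporting "\u009a" above: the bytes are intact and only their interpretation differs. That is an assumption on my part, not something verified on Windows.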
@yutannihilation
Contributor

I found this issue while investigating hadley/emo#7.

Emojis are still garbled by sink_test():

sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = emo::ji('japanese_goblin')
  y = character()
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  y
}

sink_test()
#> [1] "<f0><U+009F><U+0091><U+00BA> "

Apparently, we need a better sink(), one with an option like useBytes in writeLines(). But I see little hope...

output <- character(0L)
outputCon <- textConnection('output', 'wr')
writeLines(emo::ji('japanese_goblin'), outputCon, useBytes = TRUE)
close(outputCon)
output
#> [1] "村"
`Encoding<-`(output, 'UTF-8')
#> [1] "\xf0\u009f\u0091º"
cat(`Encoding<-`(output, 'UTF-8'))
#> 👺
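
For anyone who wants to try the byte-wise approach without installing emo, here is a minimal sketch of the same idea wrapped in a function. The helper name `write_utf8_lines` is hypothetical (it is not part of evaluate or base R), and the emoji is written as the escape "\U0001F47A", which I am assuming is the same goblin character. Note that this only covers writeLines(); it does nothing for output that goes through sink().

```r
# Hypothetical helper: pass UTF-8 text through a text connection byte by byte
write_utf8_lines <- function(x) {
  x <- enc2utf8(x)                         # make sure the input is UTF-8
  out <- character()
  con <- textConnection("out", open = "w", local = TRUE)
  writeLines(x, con, useBytes = TRUE)      # write raw bytes, no re-encoding
  close(con)                               # flushes the lines into `out`
  Encoding(out) <- "UTF-8"                 # re-mark the bytes as UTF-8
  out
}

res <- write_utf8_lines(c("\U0001F47A", "\u0161"))
utf8ToInt(res[1]) == 0x1F47A
utf8ToInt(res[2]) == 0x0161
```

The design choice here is to never let R translate the string: useBytes = TRUE skips the conversion on the way in, and the `Encoding<-` call only changes the mark on the way out.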

@yihui
Collaborator Author

yihui commented May 13, 2017

I think base R needs better support for UTF-8. I'm counting on @krlmlr to save the world: http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-td4733523.html

@krlmlr
Member

krlmlr commented May 13, 2017

Working on it with @dmurdoch ;-)

@yutannihilation
Contributor

Oh, @krlmlr, you are always our UTF-8 hero! Cool. Thanks for the information 👍

@vnijs

vnijs commented Sep 19, 2018

Not sure, but perhaps this is also related to tidyverse/readr#884.

@yutannihilation
Contributor

No, I'm quite sure it's not. In that case, R does things right, but Boost doesn't :(

@kevinushey

FWIW I filed a bug report with R and unfortunately it sounds like it will be too expensive for them to fix: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17503

@yihui
Collaborator Author

yihui commented Nov 16, 2018

Thanks @kevinushey! Then I wonder if it is possible to write a custom connection that supports UTF-8 instead of the native encoding. I have no idea about how connections in R work, but I remember Simon Urbanek gave a talk in 2013, in which he showed a custom connection based on 0MQ: https://github.com/s-u/zmqc

@krlmlr
Member

krlmlr commented Nov 16, 2018

It seems that base R translates strings into the native encoding even before they reach the connection. Perhaps we really do need a fix in base R for sink(), but I'm not sure.

Perhaps Windows will support UTF-8 as native encoding at some point. The "April 2018 insider build" of Windows seems to have some of it: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
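
A quick way to check whether a given session would be affected is sketched below. It asks R whether the native encoding is UTF-8 and uses enc2native() as a stand-in for the kind of translation described above. Whether sink() goes through exactly the same code path is an assumption, and the variable names are illustrative.

```r
# Does this session use UTF-8 as its native encoding?
info <- l10n_info()
native_is_utf8 <- isTRUE(info[["UTF-8"]])

# Translate a UTF-8 string to the native encoding and back.
# If the character cannot be represented natively, the round trip changes it.
x <- "\u0161"
survives <- identical(enc2utf8(enc2native(x)), x)

native_is_utf8
survives
```

In a UTF-8 session both values should be TRUE. In a session where the native encoding cannot represent the character, `survives` is expected to be FALSE.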

@yihui
Collaborator Author

yihui commented Nov 16, 2018

I see. If base R does the translation, I guess there is nothing we can do about it. That is really unfortunate...
