Characters garbled from sink() on Windows #59

Open
yihui opened this issue Dec 12, 2015 · 10 comments

@yihui
Collaborator

@yihui yihui commented Dec 12, 2015

Some examples:

Sys.setlocale(, 'English')  # can also try 'German_Austria'
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"\u009a\"\n"

Sys.setlocale(, 'Chinese')
# [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"<U+0161>\"\n"

Originally reported at http://stackoverflow.com/q/34096239/559676

A reduced example that uses only sink() and textConnection():

sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = '\u0161'
  y = character()
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  y  
}

sink_test()
# [1] "[1] \"歕""

In this reduced example, the only problem is that the result is marked with the wrong encoding:

z = sink_test()
Encoding(z)
# [1] "latin1"

iconv(z, to = 'UTF-8')
# [1] "[1] \"š\""
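
The iconv() call above relies on the session's native encoding being the right source encoding. As a minimal sketch, assuming the captured byte was produced under the Windows-1252 code page (an assumption for illustration; the name "CP1252" must also be known to the platform's iconv), the source encoding can be named explicitly so the result does not depend on the current locale or on the encoding mark:

```r
# Byte 0x9a is "š" in Windows-1252 (assumed here to be the code page
# that produced the captured output).
captured <- rawToChar(as.raw(0x9a))

# Name the source encoding explicitly instead of relying on the
# encoding mark or on the session's native encoding.
fixed <- iconv(captured, from = "CP1252", to = "UTF-8")

Encoding(fixed)             # encoding mark of the converted string
identical(fixed, "\u0161")  # compare with the intended character
```

This only repairs text that is still representable in the native code page; characters that were already replaced by escapes such as <U+0161> cannot be recovered this way.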
@yutannihilation
Contributor

@yutannihilation yutannihilation commented May 13, 2017

I found this issue while investigating hadley/emo#7.

Emoji are still garbled when passed through sink_test():

sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = emo::ji('japanese_goblin')
  y = character()
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  y  
}

sink_test()
#> [1] "<f0><U+009F><U+0091><U+00BA> "

Apparently, we need a better sink(), one with an option like useBytes in writeLines(). But I see little hope...

output <- character(0L)
outputCon <- textConnection('output', 'wr')
writeLines(emo::ji('japanese_goblin'), outputCon, useBytes = TRUE)
close(outputCon)
output
#> [1] "村"
`Encoding<-`(output, 'UTF-8')
#> [1] "\xf0\u009f\u0091�"
cat(`Encoding<-`(output, 'UTF-8'))
#> 👺
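
A related sketch, assuming the text to capture is already available as a UTF-8 string (so it does not help with output produced by print() under sink()): writing the bytes to a temporary file with useBytes = TRUE and reading them back with an explicit encoding avoids re-marking the result by hand. The variable names are illustrative only.

```r
# The goblin emoji, written as an escape so the source file stays ASCII.
goblin <- "\U0001F47A"

# Write the UTF-8 bytes untouched; binary mode avoids newline translation.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(goblin, con, useBytes = TRUE)
close(con)

# Read the line back and declare it as UTF-8.
output <- readLines(tmp, encoding = "UTF-8")
unlink(tmp)

identical(output, goblin)
```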

@yihui
Collaborator Author

@yihui yihui commented May 13, 2017

I think base R needs better support for UTF-8. I'm counting on @krlmlr to save the world: http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-td4733523.html

@krlmlr
Member

@krlmlr krlmlr commented May 13, 2017

Working on it with @dmurdoch ;-)

@yutannihilation
Contributor

@yutannihilation yutannihilation commented May 13, 2017

Oh, @krlmlr, you are always our UTF-8 hero! Cool. Thanks for the information 👍

@vnijs

@vnijs vnijs commented Sep 19, 2018

Not sure, but perhaps this is also related to tidyverse/readr#884.

@yutannihilation
Contributor

@yutannihilation yutannihilation commented Sep 19, 2018

No, I'm quite sure it's not. In that case, R does things right, but Boost doesn't :(

@kevinushey

@kevinushey kevinushey commented Nov 15, 2018

FWIW I filed a bug report with R and unfortunately it sounds like it will be too expensive for them to fix: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17503

@yihui
Collaborator Author

@yihui yihui commented Nov 16, 2018

Thanks @kevinushey! Then I wonder if it is possible to write a custom connection that supports UTF-8 instead of the native encoding. I have no idea how connections in R work, but I remember that Simon Urbanek gave a talk in 2013 in which he showed a custom connection based on 0MQ: https://github.com/s-u/zmqc

@krlmlr
Member

@krlmlr krlmlr commented Nov 16, 2018

It seems that strings are translated by base R into the native encoding even before they reach the connection. Perhaps we really need a fix in base R for sink(), but I'm not sure.

Perhaps Windows will support UTF-8 as native encoding at some point. The "April 2018 insider build" of Windows seems to have some of it: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
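
To illustrate why translation into the native encoding is lossy, here is a rough sketch that simulates it with iconv(). It is not what sink() does internally; the choice of "CP1252" as the native code page is an assumption, and the encoding name must be supported by the platform's iconv. It also shows how to check whether the current session already uses UTF-8 natively, in which case the problem described in this thread should not arise.

```r
# TRUE when the session's native encoding is UTF-8.
is_utf8_native <- isTRUE(l10n_info()[["UTF-8"]])

goblin <- "\U0001F47A"

# Simulated translation to a Windows-1252 native encoding: the emoji has
# no representation there, so iconv() returns NA by default.
lossy <- iconv(goblin, from = "UTF-8", to = "CP1252")
is.na(lossy)

# With sub = "byte", unconvertible bytes are shown as <xx> escapes instead.
escaped <- iconv(goblin, from = "UTF-8", to = "CP1252", sub = "byte")

# Staying in UTF-8 keeps the string intact.
kept <- iconv(goblin, from = "UTF-8", to = "UTF-8")
identical(kept, goblin)
```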

@yihui
Collaborator Author

@yihui yihui commented Nov 16, 2018

I see. If base R does the translation, I guess there is nothing we can do about it. That is really unfortunate...
