Characters garbled from sink() on Windows #59

Open
yihui opened this issue Dec 12, 2015 · 10 comments

@yihui
Collaborator

commented Dec 12, 2015

Some examples:

Sys.setlocale(, 'English')  # can also try 'German_Austria'
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"\u009a\"\n"

Sys.setlocale(, 'Chinese')
# [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"<U+0161>\"\n"

Originally reported at http://stackoverflow.com/q/34096239/559676

With only sink() and textConnection():

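sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = '\u0161'
  y = character()
  # capture printed output into the local variable y
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  close(con)  # release the connection; y keeps the captured lines
  y
}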

sink_test()
# [1] "[1] \"\x9a\""

In this reduced example, the only problem is that the result is marked with the wrong encoding:

z = sink_test()
Encoding(z)
# [1] "latin1"

iconv(z, to = 'UTF-8')
# [1] "[1] \"š\""
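
To see why the encoding mark matters, here is a small locale-independent sketch. It only illustrates how one byte decodes under two different declared encodings using iconv(); it is an assumption that this mirrors what happens to the sink() output, not a statement about sink()'s internals. The variable names are made up for the illustration.

```r
# U+0161 (š) is a single byte in Windows code page 1252
bytes_cp1252 <- iconv("\u0161", from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]

# Decode that same byte with the right and with the wrong source encoding
right <- iconv(list(bytes_cp1252), from = "CP1252", to = "UTF-8")
wrong <- iconv(list(bytes_cp1252), from = "ISO-8859-1", to = "UTF-8")

# Compare the resulting code points
sprintf("U+%04X", utf8ToInt(right))
sprintf("U+%04X", utf8ToInt(wrong))
```

Decoded as CP1252 the byte comes back as U+0161, while decoded as ISO-8859-1 it becomes the control character U+009A, which would be consistent with the "\u009a" in the evaluate() output above.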
@yutannihilation

Contributor

commented May 13, 2017

I found this issue while investigating hadley/emo#7.

Emoji are still garbled by sink_test():

sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = emo::ji('japanese_goblin')
  y = character()
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  y  
}

sink_test()
#> [1] "<f0><U+009F><U+0091><U+00BA> "

Apparently we need a better sink(), one with an option like useBytes in writeLines(). But I see little hope...

output <- character(0L)
outputCon <- textConnection('output', 'wr')
writeLines(emo::ji('japanese_goblin'), outputCon, useBytes = TRUE)
close(outputCon)
output
#> [1] "村"
`Encoding<-`(output, 'UTF-8')
#> [1] "\xf0\u009f\u0091�"
cat(`Encoding<-`(output, 'UTF-8'))
#> 👺
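
A variant of the same idea, as a rough sketch: write the UTF-8 bytes to an in-memory raw connection and mark the result as UTF-8 when reading it back. capture_utf8() is a hypothetical helper name, not something from base R or evaluate, and it only helps when you control the write yourself; it does not capture what print() sends through sink().

```r
# Hypothetical helper: round-trip strings through a raw connection as UTF-8
capture_utf8 <- function(x) {
  con <- rawConnection(raw(0), open = "w")
  on.exit(close(con))
  # write the UTF-8 bytes untouched, without translation to the native encoding
  writeLines(enc2utf8(x), con, useBytes = TRUE)
  out <- rawToChar(rawConnectionValue(con))
  # declare the bytes to be UTF-8 before splitting them back into lines
  Encoding(out) <- "UTF-8"
  strsplit(out, "\n", fixed = TRUE)[[1]]
}

res <- capture_utf8("\U0001F47A")  # the japanese_goblin emoji
Encoding(res)
```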
@yihui

Collaborator Author

commented May 13, 2017

I think base R needs better support for UTF-8. I'm counting on @krlmlr to save the world: http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-td4733523.html

@krlmlr

Member

commented May 13, 2017

Working on it with @dmurdoch ;-)

@yutannihilation

Contributor

commented May 13, 2017

Oh, @krlmlr, you are always our UTF-8 hero! Cool. Thanks for the information 👍

@vnijs


commented Sep 19, 2018

Not sure, but perhaps this is also related to tidyverse/readr#884

@yutannihilation

Contributor

commented Sep 19, 2018

No, I'm quite sure it's not. In that case, R does things right, but boost doesn't :(

@kevinushey


commented Nov 15, 2018

FWIW I filed a bug report with R and unfortunately it sounds like it will be too expensive for them to fix: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17503

@yihui

Collaborator Author

commented Nov 16, 2018

Thanks @kevinushey! Then I wonder if it is possible to write a custom connection that supports UTF-8 instead of the native encoding. I have no idea how connections in R work, but I remember Simon Urbanek gave a talk in 2013 in which he showed a custom connection based on 0MQ: https://github.com/s-u/zmqc

@krlmlr

Member

commented Nov 16, 2018

It seems that strings are translated by base R into the native encoding before they even reach the connection. Perhaps we really require a fix in base for sink(), but I'm not sure.
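
A rough sketch of why such an early translation cannot be undone later. It uses iconv() with ASCII as a stand-in for any target encoding that lacks the character, and assumes a reasonably recent R in which iconv() accepts sub = "Unicode". It illustrates the information loss only and is not a trace of what sink() does.

```r
x <- "\u0161"

# Translate to an encoding that cannot represent the character;
# sub = "Unicode" asks iconv() to substitute an escape for it
lossy <- iconv(x, from = "UTF-8", to = "ASCII", sub = "Unicode")

# Translating back to UTF-8 cannot recover the original character
back <- iconv(lossy, from = "ASCII", to = "UTF-8")
identical(back, x)
```

On typical builds the substituted escape should have the same form as the <U+0161> seen in the Chinese-locale output above. Once the string has been replaced that way, nothing downstream of the connection can restore it.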

Perhaps Windows will support UTF-8 as native encoding at some point. The "April 2018 insider build" of Windows seems to have some of it: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8

@yihui

Collaborator Author

commented Nov 16, 2018

I see. If base R does the translation, I guess there is nothing we can do about it. That is really unfortunate...
