Encoding issue: the results with non-ASCII symbols were not reproduced #197

Closed
GegznaV opened this issue Jul 14, 2018 · 6 comments
@GegznaV

GegznaV commented Jul 14, 2018

I use RStudio on Windows 10. I selected these lines from my .Rmd file:

Sys.setlocale(locale = "Lithuanian")
df <- data.frame(x = 1:5, y = c("Ą", "Č", "Ę", "ū", "ž"))

Sys.setlocale(locale = "Chinese")
capture.output(skimr::skim(df))

I then called the reprex RStudio add-in (which renders the selected R code for posting on GitHub) and got this result (note the non-ASCII symbols such as Ą):

Sys.setlocale(locale = "Lithuanian")
#> [1] "LC_COLLATE=Lithuanian_Lithuania.1257;LC_CTYPE=Lithuanian_Lithuania.1257;LC_MONETARY=Lithuanian_Lithuania.1257;LC_NUMERIC=C;LC_TIME=Lithuanian_Lithuania.1257"
df <- data.frame(x = 1:5, y = c("<U+0104>", "<U+010C>", "<U+0118>", "¨±", "<U+017E>"))

Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
capture.output(skimr::skim(df))
#>  [1] "Skim summary statistics"                                                              
#>  [2] " n obs: 5 "                                                                           
#>  [3] " n variables: 2 "                                                                     
#>  [4] ""                                                                                     
#>  [5] "-- Variable type:factor -------------------------------------------------------------"
#>  [6] " variable missing complete n n_unique                     top_counts"                 
#>  [7] "        y       0        5 5        5 <U+: 1, <U+: 1, <U+: 1, <U+: 1"                 
#>  [8] " ordered"                                                                             
#>  [9] "   FALSE"                                                                             
#> [10] ""                                                                                     
#> [11] "-- Variable type:integer ------------------------------------------------------------"
#> [12] " variable missing complete n mean   sd p0 p25 p50 p75 p100     hist"                  
#> [13] "        x       0        5 5    3 1.58  1   2   3   4    5 <U+00D8>~<U+00D8>~<U+00D8>x<U+00D8>~<U+00D8>x<U+00D8>~<U+00D8>x<U+00D8>~"

Created on 2018-07-14 by the reprex package (v0.2.0).

The expected result was an error:

[1] "LC_COLLATE=Lithuanian_Lithuania.1257;LC_CTYPE=Lithuanian_Lithuania.1257;LC_MONETARY=Lithuanian_Lithuania.1257;LC_NUMERIC=C;LC_TIME=Lithuanian_Lithuania.1257"
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
Error in substr(names(x), 1, options$formats$.levels$max_char) : invalid multibyte string at '<c0>'

This behavior may be related to this issue.
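
One possible reading of the bytes involved, offered as a sketch rather than a confirmed diagnosis: in Windows-1257 (the Lithuanian code page shown in the output above) "Ą" is the single byte `c0`, which matches the byte named in the error message, and in GBK (code page 936) "ū" is the byte pair `a8 b1`, which reads as "¨±" when interpreted as Latin-1. The snippet assumes the installed iconv knows the encoding names `CP1257` and `GBK`; how reprex and R actually shuffle these bytes between locales is not established here.

```r
# Escapes keep this script pure ASCII, so it does not depend on the locale.
a_ogonek <- "\u0104"  # LATIN CAPITAL LETTER A WITH OGONEK
u_macron <- "\u016b"  # LATIN SMALL LETTER U WITH MACRON

# A-ogonek encoded in Windows-1257, the Lithuanian code page.
bytes_cp1257 <- charToRaw(iconv(a_ogonek, from = "UTF-8", to = "CP1257"))

# u-macron encoded in GBK, the code page of the Chinese locale.
bytes_gbk <- charToRaw(iconv(u_macron, from = "UTF-8", to = "GBK"))

# The same two GBK bytes, reinterpreted as Latin-1 and converted to UTF-8.
misread <- iconv(rawToChar(bytes_gbk), from = "latin1", to = "UTF-8")

print(bytes_cp1257)
print(bytes_gbk)
# Compare against DIAERESIS followed by PLUS-MINUS SIGN.
print(misread == "\u00a8\u00b1")
```

If that reading is right, the echoed source line was re-encoded under the Chinese locale, while the data still carried bytes from the Lithuanian one.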

@batpigandme
Contributor

Possibly related? #82

@yutannihilation
Member

This issue lies in R core and is hard to fix on the package side...😢

r-lib/evaluate#59

@GegznaV
Author

GegznaV commented Jul 15, 2018

Is there any chance that the issue will be solved by the R Core Team?

@isteves

isteves commented Jul 27, 2018

Related issue (with a Hebrew example): tidyverse/tibble#433

@dpprdan

dpprdan commented Nov 14, 2018

When changing the locale with Sys.setlocale(), the last Sys.setlocale() call seems to determine the locale for the whole reprex session, including the code that runs before that call.
First, with only one call to Sys.setlocale():

Sys.setlocale(locale = "English")
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
print("ö")
#> [1] "ö"

Now with a second call (I “reprexed” the part above and then everything from here on down separately):

Sys.setlocale(locale = "English")
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
print("v")
#> [1] "<f6>"
Sys.setlocale(locale = "C")
#> [1] "C"

I think this is what causes @GegznaV’s problem. I am not sure, though, whether this is really related to the other issues mentioned above.

BTW, IMO “ö” should be converted to “U+00F6” rather than to “v”, though that might be a bug in base R.
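
A small sketch of where the two renderings might come from. "ö" is code point U+00F6, and its Latin-1 / Windows-1252 encoding is the single byte `f6`, which matches the "<f6>" escape in the output. Clearing the high bit of that byte gives `76`, which is ASCII "v". That a 7-bit reduction is what actually produced the "v" in the echoed source is an assumption, not something confirmed in this thread.

```r
# "ö" written as an escape so the script itself stays pure ASCII.
o_umlaut <- "\u00f6"

# Code point of the character (U+00F6 as an integer).
code_point <- utf8ToInt(o_umlaut)

# Its single-byte Latin-1 representation.
latin1_bytes <- charToRaw(iconv(o_umlaut, from = "UTF-8", to = "latin1"))

# Assumption: dropping the high bit (0xF6 & 0x7F) is the source of the "v".
stripped <- rawToChar(as.raw(bitwAnd(code_point, 0x7FL)))

print(code_point)
print(latin1_bytes)
print(stripped)
```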

@jennybc
Member

jennybc commented May 18, 2019

I think between:

reprex is handling encoding as well as its dependencies allow (most especially given the difficulties around encoding on Windows in R itself). I'm closing this. If anyone has a new challenging example, especially one that fails with dev reprex + knitr v1.23, please add it to #262.

I don't consider these reprexes that change the locale mid-reprex to be fair game, unless one can show that rmarkdown/knitr handles that code correctly, but reprex does not.

jennybc closed this as completed May 18, 2019