New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect package encoding when parsing source code #605

Closed
jeroen opened this Issue Jun 24, 2017 · 7 comments

Comments

Projects
None yet
3 participants
@jeroen
Member

jeroen commented Jun 24, 2017

There is some support for utf8 however by default test_check() sources all test files as native. I think that if DESCRIPTION contains Encoding: UTF-8 all test files should be sourced as UTF-8.

@gaborcsardi

This comment has been minimized.

Member

gaborcsardi commented Jun 24, 2017

@hadley

This comment has been minimized.

Member

hadley commented Oct 2, 2017

We should pull across the code from klutometis/roxygen#649.

Unless you feel strongly, I think it's better to simply read in everything as UTF-8, and warn the user that testthat doesn't support other encodings.

@hadley hadley added the feature label Oct 2, 2017

@jeroen

This comment has been minimized.

Member

jeroen commented Oct 2, 2017

I don't feel strongly, though if the package declares Encoding: latin1 the correct thing would be to assume that. OTOH, people should just switch to UTF8 to write portable code, so perhaps requiring this in testthat is a good thing.

@hadley hadley closed this in e23e7c4 Oct 3, 2017

@gaborcsardi

This comment has been minimized.

Member

gaborcsardi commented Nov 1, 2017

@hadley There are still problems with this. Reading the files in UTF-8 is one thing, but you also need you to supply the encoding to parse(), otherwise it will mark convert the input strings to the native encoding. So parse() here needs an encoding = "UTF-8" argument:

exprs <- parse(text = lines, n = -1, srcfile = srcfile)

And maybe in test_example as well.

This said, I am not convinced that defaulting to UTF-8 is best. UTF-8 is still a bit painful on Windows, as I have just experienced. It is also not the default, of course.

@gaborcsardi

This comment has been minimized.

Member

gaborcsardi commented Nov 1, 2017

The old testthat actually (by default) keeps UTF-8 files as UTF-8, because it marks them as "unknown" and then parse() does not convert the strings. (Although they are marked as unknown.)

The new behavior is worse, because you end up with stings recoded into the native encoding, typically latin1 on windows, and this conversion loses information, so it is not even possible to convert it back to UTF-8.

Can I fix the parse() calls?

@hadley

This comment has been minimized.

Member

hadley commented Nov 1, 2017

Yes please!

@hadley hadley reopened this Nov 1, 2017

@gaborcsardi

This comment has been minimized.

Member

gaborcsardi commented Nov 1, 2017

Not so easy. :( parse seems to be buggy and always converts to the native encoding, if text= is used. Luckily, with textConnection it works fine. Hopefully this will not make the parsing much slower:

## Need to run this in a latin1 locale
old_locale <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "en_US.ISO8859-1")

## UTF-8 string, quoted, so we can parse it
lines <- as.raw(c(0x22, 0xc3, 0xa1, 0x72, 0x76, 0xc3, 0xad, 0x7a, 0x74,
  0xc5, 0xb1, 0x72, 0xc5, 0x91, 0x20, 0x74, 0xc3, 0xbc, 0x6b, 0xc3,
  0xb6, 0x72, 0x66, 0xc3, 0xba, 0x72, 0xc3, 0xb3, 0x67, 0xc3, 0xa9,
  0x70, 0x22))
lines <- rawToChar(lines)
Encoding(lines) <- "UTF-8"
stringi::stri_enc_isutf8(lines)
# > [1] TRUE

## Parse it, keep it UTF-8
expr <- parse(text = lines, encoding = "UTF-8")[[1]]
Encoding(expr)
#> [1] "UTF-8"

## Ooops
stringi::stri_enc_isutf8(expr)
#> [1] FALSE

## It was recoded into latin1 (the native encoding) :(
stringi::stri_enc_isutf8(iconv(expr, "latin1", "UTF-8"))
#> [1] TRUE

## With a text connection it is OK. Phew!
expr2 <- parse(textConnection(lines, encoding = "UTF-8"), encoding = "UTF-8")[[1]]
stringi::stri_enc_isutf8(expr2)
#> [1] TRUE

Sys.setlocale("LC_CTYPE", old_locale)

gaborcsardi added a commit that referenced this issue Nov 2, 2017

gaborcsardi added a commit that referenced this issue Nov 2, 2017

@hadley hadley closed this in a4cfc61 Nov 2, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment