strip_style() returns garbled output in non-UTF-8 locale when UTF-8 characters are present #138

Closed
january3 opened this issue Sep 28, 2022 · 2 comments

january3 commented Sep 28, 2022

If the colorized string contains UTF-8 characters, the string returned by strip_style() is no longer marked as "UTF-8"; its encoding is "unknown". The reason is that strip_style() calls gsub() with useBytes = TRUE. The manual for gsub() states:

     The main effect of ‘useBytes = TRUE’ is to avoid errors/warnings
     about invalid inputs and spurious matches in multibyte locales,
     but for ‘regexpr’ it changes the interpretation of the output.  It
     inhibits the conversion of inputs with marked encodings, and is
     forced if any input is found which is marked as ‘"bytes"’ (see
     ‘Encoding’).

If R is running under a locale which is non-UTF-8 (i.e. the value of l10n_info()[["UTF-8"]] is FALSE), this may lead to various "interesting" side effects.
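A quick way to tell whether a given session is exposed to this is to inspect the locale directly. This is only a minimal sketch; the variable name is_utf8_locale is made up for illustration.

```r
# Check whether the current session uses a UTF-8 locale.
# FALSE means the side effects described above can occur.
is_utf8_locale <- isTRUE(l10n_info()[["UTF-8"]])
is_utf8_locale

# Show the character-type locale the session actually runs under.
Sys.getlocale("LC_CTYPE")
```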

Demonstration:

library(crayon)
foo <- paste0('\033[3m', '\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500', '\033[23m')
Encoding(foo)
Encoding(strip_style(foo))

The result is:

[1] "UTF-8"
[1] "unknown"

Now, consider this code:

nchar(foo)
nchar(strip_style(foo))
strip_style(foo)

The result depends on the locale:

Under a UTF-8-enabled locale (e.g. en_US.UTF-8), the results are:

[1] 17
[1] 8
[1] "────────"

Under another locale, e.g. en_US.ISO-8859-15, the results are:

[1] 17
[1] 24
[1] "�\224\200�\224\200�\224\200�\224\200�\224\200�\224\200�\224\200�\224\200"

The expected output under en_US.ISO-8859-15 would be:

[1] 17
[1] 8
[1] "<U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500>"

Why is that important?

CRAN uses an en_US.ISO-8859-15 locale on one of its check platforms. Any package that uses UTF-8 characters in combination with crayon may hit unexpected, hard-to-track bugs when checked by CRAN. This very situation happened to me (package colorDF), and the bug was infuriating to reproduce and track down.
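One possible workaround on the calling side is to restore the encoding mark after the byte-wise substitution. The sketch below is not crayon's implementation: strip_style_utf8() is a hypothetical helper, and its regular expression is a simplified assumption that only covers SGR sequences of the form ESC [ ... m.

```r
# Hypothetical helper (not part of crayon): strip SGR escape sequences
# byte-wise, then re-mark the result as UTF-8.
strip_style_utf8 <- function(x) {
  # Make sure the input bytes are UTF-8 before matching on bytes.
  x <- enc2utf8(x)
  # Simplified SGR pattern; an assumption, not crayon's actual regex.
  out <- gsub("\033\\[[0-9;]*m", "", x, perl = TRUE, useBytes = TRUE)
  # useBytes = TRUE does not keep the "UTF-8" mark, so set it again.
  # Pure-ASCII elements are left as "unknown" by Encoding<-.
  Encoding(out) <- "UTF-8"
  out
}

foo <- paste0("\033[3m", "\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500", "\033[23m")
res <- strip_style_utf8(foo)
Encoding(res)
nchar(res)
```

Because the pattern is pure ASCII and the ESC byte never occurs inside a multibyte UTF-8 sequence, matching on bytes should be safe here; re-marking the result is what keeps nchar() and printing consistent across locales.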

@gaborcsardi
Member

I think this is fixed in dev crayon:

library(crayon)
foo <- paste0('\033[3m', '\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500', '\033[23m')
Encoding(foo)
#> [1] "UTF-8"
Encoding(strip_style(foo))
#> [1] "UTF-8"

nchar(foo)
#> [1] 17
nchar(strip_style(foo))
#> [1] 8
strip_style(foo)
#> [1] "────────"

Sys.setlocale("LC_ALL", "C")
#> [1] "C/C/C/C/C/en_US.UTF-8"

nchar(foo)
#> [1] 17
nchar(strip_style(foo))
#> [1] 8
strip_style(foo)
#> [1] "<U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500>"

Created on 2022-09-28 with reprex v2.0.2

By the way, I also suggest using cli instead of crayon. cli handles UTF-8 strings correctly and does not rely on base R's nchar(), which is incorrect in many cases. It also gives you RGB colors, bright variants, lots of ANSI string operations, etc.

s <- "👷🏿"
nchar(s)
#> [1] 2
nchar(s, "width")
#> [1] 4

cli::utf8_nchar(s)
#> [1] 1
cli::utf8_nchar(s, "width")
#> [1] 2

Created on 2022-09-28 with reprex v2.0.2
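For the string from the original report, the cli equivalent of strip_style() would look roughly like this. It is a sketch that assumes the cli package is installed; ansi_strip() and ansi_nchar() are cli functions, and the variable names are only illustrative.

```r
library(cli)

foo <- paste0("\033[3m", "\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500", "\033[23m")

# Remove the ANSI styling and inspect the encoding of the result.
plain <- ansi_strip(foo)
Encoding(plain)

# Count characters while ignoring the ANSI sequences.
n <- ansi_nchar(foo)
n
```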

@january3
Author

Indeed, it appears to be fixed. Thank you for the hint about cli!
