strip_style() returns garbled output in non-UTF-8 locale when UTF-8 characters are present #138

january3 · 2022-09-28T08:35:41Z

If the colorized string contains UTF-8 characters, the string returned by strip_style() is no longer marked as "UTF-8" but as "unknown". The reason is that strip_style() uses gsub() with the parameter useBytes=TRUE. The manual for gsub() states:

     The main effect of ‘useBytes = TRUE’ is to avoid errors/warnings
     about invalid inputs and spurious matches in multibyte locales,
     but for ‘regexpr’ it changes the interpretation of the output.  It
     inhibits the conversion of inputs with marked encodings, and is
     forced if any input is found which is marked as ‘"bytes"’ (see
     ‘Encoding’).

If R is running under a non-UTF-8 locale (i.e. the value of l10n_info()[["UTF-8"]] is FALSE), this may lead to various "interesting" side effects.
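The effect of useBytes can be seen with base R alone. The following is a minimal sketch, not crayon's actual implementation: the pattern `ansi_sgr` is a simplified, assumed stand-in for an ANSI style matcher. It strips the same string with and without useBytes = TRUE and inspects the encoding marks. The mark on the useBytes result may differ between R versions, so no output is shown here.

```r
# Simplified, assumed pattern for ANSI SGR sequences such as "\033[3m";
# this is NOT the regular expression crayon itself uses.
ansi_sgr <- "\033\\[[0-9;]*m"

foo <- paste0('\033[3m', '\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500', '\033[23m')

# Byte-wise matching: input encoding marks are not used
with_bytes <- gsub(ansi_sgr, "", foo, perl = TRUE, useBytes = TRUE)

# Character-wise matching: the UTF-8 mark of the input is honoured
without_bytes <- gsub(ansi_sgr, "", foo, perl = TRUE)

Encoding(foo)
Encoding(with_bytes)
Encoding(without_bytes)

# FALSE here means the session is running under a non-UTF-8 locale
l10n_info()[["UTF-8"]]
```

Comparing the two Encoding() results on the stripped strings shows whether the mark survived the substitution.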

Demonstration:

library(crayon)
foo <- paste0('\033[3m', '\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500', '\033[23m')
Encoding(foo)
Encoding(strip_style(foo))

The result is:

[1] "UTF-8"
[1] "unknown"

Now, consider this code:

nchar(foo)
nchar(strip_style(foo))
strip_style(foo)

The result depends on the locale:

Under a UTF-8-enabled locale (e.g. en_US.UTF-8), the results are:

[1] 17
[1] 8
[1] "────────"

Under another locale, e.g. en_US.ISO-8859-15, the results are:

[1] 17
[1] 24
[1] "�\224\200�\224\200�\224\200�\224\200�\224\200�\224\200�\224\200�\224\200"

The expected output under ISO-8859-15 would be:

[1] 17
[1] 8
[1] "<U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500>"

Why is that important?

CRAN uses an en_US.ISO-8859-15 locale on one of its check platforms. Any package that uses UTF-8 characters in combination with crayon may have unexpected and hard-to-track bugs when tested by CRAN. This very situation happened to me (package colorDF), and the bug was infuriating to replicate and track down.
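One way to approximate such a check locally is to switch LC_CTYPE temporarily and restore it afterwards. This is only a sketch: `with_c_ctype()` is a hypothetical helper, not part of crayon or base R, and it uses the "C" locale as a stand-in because en_US.ISO-8859-15 may not be installed on a given machine.

```r
# Hypothetical helper: evaluate `expr` under LC_CTYPE = "C" and
# restore the previous setting afterwards, even on error.
with_c_ctype <- function(expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  expr  # lazy evaluation: the argument is only evaluated here
}

foo <- paste0('\033[3m', '\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500', '\033[23m')

before <- Sys.getlocale("LC_CTYPE")
res <- with_c_ctype(list(
  utf8 = l10n_info()[["UTF-8"]],  # is the session locale UTF-8?
  n    = nchar(foo)               # characters in the marked string
))
res
```

Wrapping a call such as nchar(strip_style(foo)) in the same helper would show whether the stripped string still behaves as UTF-8 outside a UTF-8 locale.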


gaborcsardi · 2022-09-28T09:14:25Z

I think this is fixed in dev crayon:

library(crayon)
foo <- paste0('\033[3m', '\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500', '\033[23m')
Encoding(foo)
#> [1] "UTF-8"
Encoding(strip_style(foo))
#> [1] "UTF-8"

nchar(foo)
#> [1] 17
nchar(strip_style(foo))
#> [1] 8
strip_style(foo)
#> [1] "────────"

Sys.setlocale("LC_ALL", "C")
#> [1] "C/C/C/C/C/en_US.UTF-8"

nchar(foo)
#> [1] 17
nchar(strip_style(foo))
#> [1] 8
strip_style(foo)
#> [1] "<U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500><U+2500>"

Created on 2022-09-28 with reprex v2.0.2

Btw. I also suggest using cli instead of crayon. cli handles UTF-8 strings correctly and does not rely on base R's nchar(), which is incorrect in many cases. It also gives you RGB colors, bright variants, lots of ANSI string operations, etc.

s <- "👷🏿"
nchar(s)
#> [1] 2
nchar(s, "width")
#> [1] 4

cli::utf8_nchar(s)
#> [1] 1
cli::utf8_nchar(s, "width")
#> [1] 2

Created on 2022-09-28 with reprex v2.0.2
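For the string from the original report, the cli equivalents of strip_style() and nchar() would look roughly like the sketch below. It assumes the cli package is installed and uses cli::ansi_strip() and cli::ansi_nchar(); the guard around the calls is only there so the snippet does not fail where cli is missing.

```r
foo <- paste0('\033[3m', '\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500', '\033[23m')

if (requireNamespace("cli", quietly = TRUE)) {
  # Remove the ANSI styling and inspect the encoding mark of the result
  stripped <- cli::ansi_strip(foo)
  print(Encoding(stripped))

  # Count characters while ignoring the ANSI sequences
  print(cli::ansi_nchar(foo))
} else {
  message("cli is not installed; skipping the example")
}
```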

january3 · 2022-09-28T09:41:50Z

Indeed, it appears to be fixed. Thank you for the hint about cli!

gaborcsardi closed this as completed Sep 28, 2022
