
Suggestion: Replace non-standard Unicode characters with entities for HTML output #1506

Closed
martinmodrak opened this issue Feb 12, 2018 · 16 comments

@martinmodrak
Contributor

commented Feb 12, 2018

On Windows, the locale sometimes messes up Unicode characters in HTML output from knitr. While this can be avoided with a proper locale, for HTML output it can also be avoided by using HTML entities. HTML entities might even be somewhat preferable to raw Unicode (I am not sure about this, really). So my suggestion is to perform this transformation by default.

If you agree, I am ready to implement this, but I am not sure whether this transformation should be implemented in knitr or by modifying the escape_html function in highr.

This came up while working on an issue in skimr (ropensci/skimr#278), where kable can mess up histograms built from Unicode characters.

@yihui

Owner

commented Feb 12, 2018

I'm afraid this is not simple to fix, and knitr is probably not the best place to fix it (it goes back to the evaluate package and base R). There have been long-standing issues like yours (e.g. r-lib/evaluate#59).

@martinmodrak

Contributor Author

commented Feb 12, 2018

I understand that the issue is more complex. But I know and have tested that taking a UTF-8 string, converting it to HTML entities, and then passing it to knitr::kable (with escape = FALSE) is a feasible workaround (for HTML output). The problem is that setting escape = FALSE introduces other problems, which is why it would make sense to implement the workaround in knitr itself.

I understand if you consider this an ugly hack or dislike the penalty of increased output size, but I believe this workaround could solve some real user problems (though I may be oblivious to other downsides).
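A minimal sketch of the workaround described above, assuming a single character column; the helper name to_entities is mine, not an existing knitr function:

```r
# Hypothetical helper: replace non-ASCII characters with decimal HTML
# entities, leaving plain ASCII untouched.
to_entities <- function(s) {
  codes <- utf8ToInt(s)
  paste0(ifelse(codes < 128,
                vapply(codes, intToUtf8, character(1)),
                paste0("&#", codes, ";")),
         collapse = "")
}

d <- data.frame(x = "\u2583\u2585", stringsAsFactors = FALSE)
d$x <- vapply(d$x, to_entities, character(1))
# escape = FALSE is needed so kable does not escape the '&' in the entities
knitr::kable(d, format = "html", escape = FALSE)
```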

@yihui

Owner

commented Feb 12, 2018

Is the hack going to be something like gsub("<U+2583>", "&#x2583;", x, fixed = TRUE)? :)

@martinmodrak

Contributor Author

commented Feb 13, 2018

For HTML, you can be a little cleaner. I noticed that when using kable (at least on my machine), UTF-8 strings arrive in kable intact. So, when the format is HTML, you could transform UTF-8 characters to entities before R gets to mangle them. This is (arguably) more correct than just replacing sequences of the form <U+XXXX> with entities, since the user may legitimately put literal <U+XXXX> strings in the data they want processed. The code could look something like:

unicode_to_html_entities <- function(char) {
  codes <- utf8ToInt(char)
  paste0(vapply(codes, encode_as_htmlentity, character(1)), collapse = "")
}

encode_as_htmlentity <- function(char_code) {
  if (char_code < 128) {
    intToUtf8(char_code)           # plain ASCII: keep as-is
  } else {
    paste0("&#", char_code, ";")   # decimal numeric character reference
  }
}

But thinking about it some more, it might also make sense to do all the processing as it is done now, and then replace the <U+XXXX> escapes in the final output with the actual UTF-8 characters. As long as the output string has its encoding marked as UTF-8, this could be OK (at least in RStudio and for file output; I don't know what it does to the base R terminal), and it would work for all output formats.
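A rough sketch of such a post-processing step; the function name is mine, and it assumes the mangled form is always <U+XXXX> with exactly four hex digits, which need not hold for code points above U+FFFF:

```r
# Replace literal <U+XXXX> escapes in the final output with the actual
# UTF-8 characters they denote.
restore_unicode <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    vapply(esc,
           function(e) intToUtf8(strtoi(substr(e, 4, 7), base = 16L)),
           character(1))
  })
  x
}

restore_unicode("start <U+2508> end")  # "start ┈ end"
```

Note that this would also rewrite any literal <U+XXXX> text the user intended to keep verbatim, which is the main drawback of this direction.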

If that's of interest, I did a brief examination of where exactly the string gets messed up: it is line 157 in table.R.
In particular, when executing:

data.frame(x = intToUtf8(strtoi("0x2508")), stringsAsFactors = FALSE) %>% kable(format = "html")

Breakpoint at line 157:

Browse[2]> x$x
[1] "┈"
Browse[2]> grepl("U",x$x)
[1] FALSE

contents of line 157:

x = replace_na(base::format(as.matrix(x), trim = TRUE, justify = 'none'), is.na(x))

after executing the line

Browse[2]> x[1,1]
         x 
"<U+2508>" 
Browse[2]> grepl("U",x)
[1] TRUE

This also means I was mistaken earlier: the escape_html function is called after the string has already been mangled, so it cannot be the place for a workaround.

@yihui

Owner

commented Feb 13, 2018

Many thanks for the careful debugging! That was amazing.

So the actual culprit was line 157. I wonder whether it was base::format(), as.matrix(), or replace_na() that introduced the problem.

@martinmodrak

Contributor Author

commented Feb 14, 2018

OK, so the culprit is base::format. I noticed that base::format is only strictly necessary when the input is purely numeric; if there are character columns, numbers are already converted to text within as.matrix (which preserves UTF-8). I wrote a function that converts non-character matrices to character matrices (see the associated pull request). This seems to resolve the issue for me. I am unable to get the test framework running on my machine, so we'll see if I broke something.
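The idea might look roughly like this; the function name is hypothetical and this is a simplification, not the actual pull request:

```r
# Only run base::format() on non-character input; character matrices are
# passed through untouched, so their UTF-8 content is never re-encoded.
format_matrix <- function(x) {
  if (is.character(x)) return(x)
  base::format(x, trim = TRUE, justify = "none")
}

m <- as.matrix(data.frame(x = "\u2508", stringsAsFactors = FALSE))
format_matrix(m)[1, 1]  # UTF-8 preserved, base::format() never called
```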

@martinmodrak

Contributor Author

commented Feb 14, 2018

Well, actually, the issue is not fully resolved :-) kable now produces correct output, but the UTF-8 is mangled somewhere further down the pipeline, which brings us back to the original suggestion of converting to HTML entities :-)

@martinmodrak

Contributor Author

commented Feb 14, 2018

I am not really sure whether I should open a separate issue for this, but it turns out my problems are unrelated to HTML output: I am using blogdown, which (I just learned) does not knit to HTML; instead, it uses knitr to write Markdown, which is transformed to HTML afterwards. Stepping through knitr, all UTF-8 from kable is (after my fix) preserved up until a call to writeLines at line 261 of output.R. This is the line:

    writeLines(if (encoding == '') res else native_encode(res, to = encoding),
               con = output, useBytes = encoding != '')

Interestingly, if I force useBytes = TRUE, the resulting markdown preserves the UTF-8. I don't know the history behind the choices present in the code, but maybe it would be reasonable to have encoding default to "UTF-8"?

I only hit the bug in evaluate when I don't use kable. E.g. intToUtf8(2585) produces ## [1] "<U+0A19>", but intToUtf8(2585) %>% kable() produces

|x  |
|:--|
|ਙ  |

The UTF-8 is also preserved when using structure(intToUtf8(2585), format = "", class = "knitr_kable").

Which IMHO also means that it might be possible to work around the evaluate bug; RStudio is definitely able to do this (in the RStudio console on my system, intToUtf8(2585) produces [1] "ਙ").

@yihui yihui added this to the v1.20 milestone Feb 19, 2018

yihui added a commit that referenced this issue Feb 19, 2018

@yihui

Owner

commented Feb 19, 2018

The encoding issues are always messy. Now I know much more about character encodings than five years ago, but I guess I'll need some substantial time to clean up the relevant code I wrote before.

For blogdown, .Rmd is compiled to .md through knitr, and .md is converted to .html through Pandoc. What sounds odd to me is that blogdown calls rmarkdown::render(..., encoding = "UTF-8") to render .Rmd files, which implies knitr::knit(..., encoding = "UTF-8"). In that case, encoding != '' should already be TRUE for you, yet you said forcing useBytes = TRUE was what worked when you were debugging this issue.

An object of class knitr_kable is special in knitr: it is not printed in the usual way (i.e. not via print()) but simply stored verbatim in a list. It is print() (or cat()) that can mangle characters on Windows, and that is the tricky part in the two evaluate issues I mentioned.

@martinmodrak

Contributor Author

commented Feb 20, 2018

So, actually, I was wrong in my assessment: the UTF-8 only survived that long because I was debugging the code and something got evaluated in a different context (I reproduced it several times, but I don't think I completely understand it; it has a very black-magic feel to it). Nevertheless, in the default regime I am now quite sure UTF-8 is lost in a call to evaluate. Within knit_handlers (utils.R, line 784) the UTF-8 is OK, but by the end of the in_dir function it is already lost; in between there are (AFAIK) only system calls.

@yihui yihui modified the milestones: v1.20, v1.21 Feb 20, 2018

@yihui yihui removed this from the v1.21 milestone Dec 9, 2018

@FelixErnst


commented Dec 11, 2018

Hi,

I was about to ask a question on Stack Overflow about an encoding issue I am facing that is specific to Windows and knitr, but I found this issue already open and it sounds very much related.

My issue is a difference between chunk output in knitr and the normal R console. I am trying to knit the following file to an HTML document on Windows.

---
title: "test"
date: "11 December 2018"
output: html_document
---

```{r}
x <- readLines(file("test.txt"), encoding = "UTF-8")
x
z <- knitr::kable(data.frame(test = x))
z
z[2:3]
```

The test.txt file contains just two lines with the following content:

≈Ω

This works fine on Linux, but on Windows the following output is returned for lines 2 and 5 of the chunk:
## [1] "\230O". Line 4 returns a small table, but with the correct output. In the R console in RStudio everything works fine: I get the character returned correctly all three times.

I also found this Stack Overflow question, which deals with the same problem:
https://stackoverflow.com/questions/43936536/knitr-generating-utf-8-output-from-chunks

Is this connected to this issue? If not, any suggestion on how to solve it? (I don't want to hijack this issue; let me know if I should open a separate one.)

Thanks for any help or suggestions.

@yihui

Owner

commented Dec 11, 2018

@FelixErnst If test.txt is encoded in UTF-8 (you didn't provide the file, so we don't know), the correct way to read it is readLines("test.txt", encoding = "UTF-8"). Don't use the file() connection. If you do want to use a connection, the correct way is readLines(file("test.txt", encoding = "UTF-8")), but this is really unnecessary.

@FelixErnst


commented Dec 11, 2018

@yihui Sorry, I forgot that uploading the files is also an option. Both are UTF-8 encoded. Removing the call to file() does not change the outcome.

test.txt
vignettes.zip

@yihui

Owner

commented Dec 11, 2018

@FelixErnst Okay, in that case I guess it is simply impossible to solve the problem. See my first reply above (r-lib/evaluate#59): #1506 (comment). The limitation comes from base R, which I cannot modify.

@FelixErnst


commented Dec 11, 2018

@yihui Thanks for the quick answer. I followed the links down to the explanation. I am certainly not grasping every aspect of it, but that is an answer I can live with.

Let's hope, then, that Windows 1903 gets a step closer to full UTF-8 support.

@yihui

Owner

commented Dec 11, 2018

Haha. Fingers crossed! 🙏

@yihui yihui added this to the v1.22 milestone Mar 8, 2019

@yihui yihui added the Won't fix label Mar 8, 2019

@yihui yihui closed this Mar 8, 2019
