
HTML notebook rendering error when plotting data with Unicode Chinese characters in ggplotly #1762

Closed
everdark opened this issue Jan 18, 2020 · 12 comments · Fixed by #1799


everdark commented Jan 18, 2020

A minimal reproducible Rmd file:

---
title: "test"
output: html_notebook
---

```{r}
library(ggplot2)
library(plotly)

df <- data.frame(x=1, y=1, t="我我我我")
p <- ggplot(df, aes(x=x, y=y, fill=t)) +
  geom_point()
ggplotly(p)
```

Then run:

```r
rmarkdown::render("test.Rmd", output_format = "html_notebook")
```

Results in the following error:



processing file: test.Rmd
  |..............                                                        |  20%
  ordinary text without R code

  |............................                                          |  40%
label: unnamed-chunk-1
  |..........................................                            |  60%
  ordinary text without R code

  |........................................................              |  80%
label: unnamed-chunk-2
  |......................................................................| 100%
  ordinary text without R code


output file: test.knit.md

Error in extract(input_str) : Invalid nesting of html_preserve directives
Calls: <Anonymous> ... <Anonymous> -> base -> base -> extract_preserve_chunks -> extract
Execution halted

Interestingly, everything is fine in the RStudio preview.
I've also found that the length matters:
"我我我我" will cause the error, but "我我我" will NOT.

Here is my system info:

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
rmarkdown 2.0
pandoc 2.7.3

I encounter the same error on my Ubuntu 16.04 machine.
I really have no idea what's going on here, and it took me quite some time to pin down the root cause as the UTF-8 Chinese characters. Still no clue why this happens. :(


By filing an issue to this repo, I promise that

  • I have fully read the issue guide at https://yihui.org/issue/.
  • I have provided the necessary information about my issue.
    • If I'm asking a question, I have already asked it on Stack Overflow or RStudio Community, waited for at least 24 hours, and included a link to my question there.
    • If I'm filing a bug report, I have included a minimal, self-contained, and reproducible example, and have also included xfun::session_info('rmarkdown'). I have upgraded all my packages to their latest versions (e.g., R, RStudio, and R packages), and also tried the development version: remotes::install_github('rstudio/rmarkdown').
    • If I have posted the same issue elsewhere, I have also mentioned it in this issue.
  • I have learned the Github Markdown syntax, and formatted my issue correctly.

I understand that my issue may be closed if I don't fulfill my promises.

@everdark (Author)

Btw since the problem lies in the function htmltools::extractPreserveChunks (specifically here), my current workaround is to overwrite that function (removing the sanity check block) using assignInNamespace in my rendering script.

@caimiao0714

> Btw since the problem lies in the function htmltools::extractPreserveChunks (specifically here), my current workaround is to overwrite that function (removing the sanity check block) using assignInNamespace in my rendering script.

I am having the same issue, probably due to the Chinese characters in the data. Could you provide your solution to this issue? Thank you!


cderv commented Mar 19, 2020

Unfortunately, I can't reproduce this on my end with the latest dev version. I get a plotly graph with Chinese characters in the legend. But I may not have the correct encoding...

@caimiao0714 do you have another minimal reproducible example?

@everdark if you think the issue is in htmltools::extractPreserveChunks, a minimal reprex in htmltools would be useful.

Not sure what and where to fix this yet... 🤔

@debdagybra

Hello,

I think I have the same problem with French characters.

I get an error when I try to rmarkdown::render() a DT::datatable() of a data.frame that contains at least 21 accented letters (e.g. "é", which is very common in French).

The RStudio preview is fine; the error occurs only with render().

A minimal reproducible Rmd file:

---
title: test
output: html_notebook
---
```{r test_chunk}
library(DT)
test <- data.frame(
  V1=c("ééééé ééééé ééééé ééééé é"),
  V2=1,
  stringsAsFactors = FALSE)
DT::datatable(
  test
)
```

then:

```r
rmarkdown::render(
  input = "test.Rmd",
  output_format = "html_notebook"
)
```

returns:

processing file: test.Rmd
  |........................................                      |  50%
  ordinary text without R code
  |.....................................................................| 100%
label: test_chunk

output file: test.knit.md
Error in extract(input_str) : Invalid nesting of html_preserve directives

In my case, it only happens when the number of accented characters is >= 21 (globally in the data.frame; they don't have to be in one cell or in one variable).
If we replace any one of the "é"s with an "e", there is no error, since the count drops below 21.

> xfun::session_info('rmarkdown')
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362), RStudio 1.2.5033

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
Locale:
  LC_COLLATE=French_Belgium.1252  LC_CTYPE=French_Belgium.1252    LC_MONETARY=French_Belgium.1252 LC_NUMERIC=C                   
  LC_TIME=French_Belgium.1252    

Package version:
  base64enc_0.1.3 digest_0.6.25   evaluate_0.14   glue_1.4.0      graphics_3.6.3  grDevices_3.6.3 highr_0.8      
  htmltools_0.4.0 jsonlite_1.6.1  knitr_1.28.2    magrittr_1.5    markdown_1.1    methods_3.6.3   mime_0.9       
  Rcpp_1.0.4      rlang_0.4.5     rmarkdown_2.1   stats_3.6.3     stringi_1.4.6   stringr_1.4.0   tinytex_0.21   
  tools_3.6.3     utils_3.6.3     xfun_0.12       yaml_2.2.1     

Pandoc version: 2.9.2.1



cderv commented Apr 6, 2020

Thanks @debdagybra for the reprex. It helped me find the issue, which I had previously missed in @everdark's post. Sorry!

There is indeed an issue with htmltools::extractPreserveChunks, because an additional closing <!--/html_preserve--> is added after the htmlwidget without an opening <!--html_preserve-->. This causes the extraction to fail.

We now need to find out why the additional closing part is added without an opening one, in order to locate where the issue is. Still not sure where to fix it.
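
For reference, a stray closing marker is enough to reproduce the error with htmltools alone (a minimal sketch, assuming the htmltools 0.4.x behaviour discussed in this thread):

```r
# A closing marker without a matching opening marker triggers the sanity check:
htmltools::extractPreserveChunks("<div>widget</div><!--/html_preserve-->")
#> Error: Invalid nesting of html_preserve directives
```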

The threshold of at least 21 characters is also odd, but with one fewer é it works. 🤷‍♂

Thanks for the report, everyone!
I'll keep looking - if someone finds a new hint, please share. Thanks.


everdark commented Apr 7, 2020

> > Btw since the problem lies in the function htmltools::extractPreserveChunks (specifically here), my current workaround is to overwrite that function (removing the sanity check block) using assignInNamespace in my rendering script.
>
> I am having the same issue, probably due to the Chinese characters in the data. Could you provide your solution to this issue? Thank you!

@caimiao0714,

To work around it, comment out the sanity check block of extractPreserveChunks.
To be specific, redefine the function in your script as follows:

```r
extractPreserveChunks <- function(strval) {

  # Literal start/end marker text. Case sensitive.
  startmarker <- "<!--html_preserve-->"
  endmarker <- "<!--/html_preserve-->"
  # Start and end marker length MUST be different, it's how we tell them apart
  startmarker_len <- nchar(startmarker)
  endmarker_len <- nchar(endmarker)
  # Pattern must match both start and end markers
  pattern <- "<!--/?html_preserve-->"

  # It simplifies string handling greatly to collapse multiple char elements
  if (length(strval) != 1)
    strval <- paste(strval, collapse = "\n")

  # matches contains the index of all the start and end markers
  matches <- gregexpr(pattern, strval)[[1]]
  lengths <- attr(matches, "match.length", TRUE)

  # No markers? Just return.
  if (matches[[1]] == -1)
    return(list(value = strval, chunks = character(0)))

  # If TRUE, it's a start; if FALSE, it's an end
  boundary_type <- lengths == startmarker_len

  # Positive number means we're inside a region, zero means we just exited to
  # the top-level, negative number means error (an end without matching start).
  # For example:
  # boundary_type - TRUE TRUE FALSE FALSE TRUE FALSE
  # preserve_level - 1 2 1 0 1 0
  preserve_level <- cumsum(ifelse(boundary_type, 1, -1))

  # Sanity check.
  if (any(preserve_level < 0) || tail(preserve_level, 1) != 0) {
    #stop("Invalid nesting of html_preserve directives")
  }

  # Identify all the top-level boundary markers. We want to find all of the
  # elements of preserve_level whose value is 0 and preceding value is 1, or
  # whose value is 1 and preceding value is 0. Since we know that preserve_level
  # values can only go up or down by 1, we can simply shift preserve_level by
  # one element and add it to preserve_level; in the result, any value of 1 is a
  # match.
  is_top_level <- 1 == (preserve_level + c(0, preserve_level[-length(preserve_level)]))

  preserved <- character(0)

  top_level_matches <- matches[is_top_level]
  # Iterate backwards so string mutation doesn't screw up positions for future
  # iterations
  for (i in seq.int(length(top_level_matches) - 1, 1, by = -2)) {
    start_outer <- top_level_matches[[i]]
    start_inner <- start_outer + startmarker_len
    end_inner <- top_level_matches[[i+1]]
    end_outer <- end_inner + endmarker_len

    id <- htmltools:::withPrivateSeed(
      paste("preserve", paste(
        format(as.hexmode(sample(256, 8, replace = TRUE)-1), width=2),
        collapse = ""),
        sep = "")
    )

    preserved[id] <- gsub(pattern, "", substr(strval, start_inner, end_inner-1))

    strval <- paste(
      substr(strval, 1, start_outer - 1),
      id,
      substr(strval, end_outer, nchar(strval)),
      sep="")
    substr(strval, start_outer, end_outer-1) <- id
  }

  list(value = strval, chunks = preserved)
}
```

Then to overwrite the function exported from the package, run this line (after the above function) before you render the rmarkdown file:

```r
assignInNamespace("extractPreserveChunks", extractPreserveChunks, "htmltools")
```


debdagybra commented Apr 8, 2020

Here's another clue.

I have extracted the argument strval in 5 scenarios: no_Accent, 1_Accent, 20_Accents, 21_Accents, 25_Accents.

Every time, there is a value (nr 35 in my example) that starts with <script type="application/json" and should end with </script>.

But when there is any accent, the value ends with </script> + substr("<!--/html_preserve-->", 1, n), where n is the number of accented characters.
The limit before the error is 21 because nchar("<!--/html_preserve-->") = 21.

So, I still don't know where the error is, but it's before extractPreserveChunks().

Edit: My guess is that the function that writes this string always has "<!--/html_preserve-->" at the end and does a substr() somewhere based on a count of characters, but the accented characters aren't counted correctly.
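
A quick check of the character/byte mismatch for the test value (my own sketch, assuming the string is UTF-8 encoded):

```r
x <- "ééééé ééééé ééééé ééééé é"      # the 21-accent test value
nchar(x)                              # 25 characters
nchar(x, type = "bytes")              # 46 bytes: each é takes 2 bytes in UTF-8
nchar(x, type = "bytes") - nchar(x)   # 21 extra bytes
nchar("<!--/html_preserve-->")        # 21, the length of the closing marker
```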

OK_noAccent
<script type="application/json" data-for="htmlwidget-ef1c39e51614aa5dc818">{"x":{"filter":"none","data":[["1"],["eeeee eeeee eeeee eeeee e"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

OK_1Accent
<script type="application/json" data-for="htmlwidget-8e970aac4d9fbd2b6ff2">{"x":{"filter":"none","data":[["1"],["eeeee eeeee eeeee eeeee é"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><

OK_20Accents
<script type="application/json" data-for="htmlwidget-a92f35549e8511516e57">{"x":{"filter":"none","data":[["1"],["ééééé ééééé ééééé ééééé e"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><!--/html_preserve--

ERR_21Accents
<script type="application/json" data-for="htmlwidget-03a8af2c94f2a2b043ee">{"x":{"filter":"none","data":[["1"],["ééééé ééééé ééééé ééééé é"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><!--/html_preserve-->

ERR_25Accents
<script type="application/json" data-for="htmlwidget-ead922db46c4b7e5f311">{"x":{"filter":"none","data":[["1"],["ééééé ééééé ééééé ééééé ééééé"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><!--/html_preserve-->


cderv commented Apr 8, 2020

Thanks @debdagybra for your analysis!

I have now identified the issue, and you were right about the substring based on a count of characters. In fact it is a count of bytes, and the issue is here:

```r
# TODO: add htmlUnpreserve function to htmlwidgets?
unpreserved <- substring(
  output,
  n_bytes("<!--html_preserve-->") + 1,
  n_bytes(output) - n_bytes("<!--/html_preserve-->")
)
```

It may be time to use htmltools::extractPreserveChunks here instead of this trick with byte counts, which does not work well with UTF-8 characters encoded on two (or more) bytes, such as accented characters.
substring() expects a number of characters, I guess.
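
To illustrate the off-by-bytes behaviour outside of rmarkdown (a standalone sketch mirroring the byte-based arithmetic above, not the actual rmarkdown code):

```r
output <- paste0("<!--html_preserve-->",
                 "ééééé ééééé ééééé ééééé é",   # 21 accented characters
                 "<!--/html_preserve-->")

# Using a byte count for the end position leaves the closing marker behind:
substring(output,
          nchar("<!--html_preserve-->") + 1,
          nchar(output, type = "bytes") - nchar("<!--/html_preserve-->"))
#> [1] "ééééé ééééé ééééé ééééé é<!--/html_preserve-->"
```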

> The limit before the error is 21 because nchar("<!--/html_preserve-->") = 21.

Yes, this is it! nchar("<!--/html_preserve-->", type = "bytes") = 21
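
For completeness, the same mismatch explains the Chinese characters in the original report, since each CJK character takes 3 bytes in UTF-8 (a quick check, assuming UTF-8 strings):

```r
nchar("é")                    # 1 character
nchar("é", type = "bytes")    # 2 bytes in UTF-8
nchar("我")                   # 1 character
nchar("我", type = "bytes")   # 3 bytes in UTF-8
```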

The fix is to be made in rmarkdown now.

Thanks all for the investigation!

@cderv

This comment has been minimized.

@debdagybra

Yes, it works perfectly, with the example and with my real data.
Thanks!


cderv commented Apr 18, 2020

I've now added the changes in the PR; it's waiting to be merged now.
@everdark can you confirm this works for you too now?


github-actions bot commented Nov 3, 2020

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.

github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2020