
HTML notebook rendering error when plotting data with Unicode Chinese characters in ggplotly #1762

Closed
everdark opened this issue Jan 18, 2020 · 12 comments · Fixed by #1799


everdark commented Jan 18, 2020

A minimal reproducible Rmd file:

---
title: "test"
output: html_notebook
---

```{r}
library(ggplot2)
library(plotly)

df <- data.frame(x=1, y=1, t="我我我我")
p <- ggplot(df, aes(x=x, y=y, fill=t)) +
  geom_point()
ggplotly(p)
```

Then run:

```r
rmarkdown::render("test.Rmd", output_format = "html_notebook")
```

Results in the following error:



processing file: test.Rmd
  |..............                                                        |  20%
  ordinary text without R code

  |............................                                          |  40%
label: unnamed-chunk-1
  |..........................................                            |  60%
  ordinary text without R code

  |........................................................              |  80%
label: unnamed-chunk-2
  |......................................................................| 100%
  ordinary text without R code


output file: test.knit.md

Error in extract(input_str) : Invalid nesting of html_preserve directives
Calls: <Anonymous> ... <Anonymous> -> base -> base -> extract_preserve_chunks -> extract
Execution halted

Interestingly, everything is fine in the RStudio preview.
I've also found that the length matters:
"我我我我" will cause the error, but "我我我" will NOT.

Here is my system info:

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
rmarkdown 2.0
pandoc 2.7.3

I encounter the same error on my Ubuntu 16.04 machine.
I really have no idea what's going on here, and it took me quite some time to pin down the root cause as the UTF-8 Chinese characters. Still no clue why this happens. :(


By filing an issue to this repo, I promise that

  • I have fully read the issue guide at https://yihui.org/issue/.
  • I have provided the necessary information about my issue.
    • If I'm asking a question, I have already asked it on Stack Overflow or RStudio Community, waited for at least 24 hours, and included a link to my question there.
    • If I'm filing a bug report, I have included a minimal, self-contained, and reproducible example, and have also included xfun::session_info('rmarkdown'). I have upgraded all my packages to their latest versions (e.g., R, RStudio, and R packages), and also tried the development version: remotes::install_github('rstudio/rmarkdown').
    • If I have posted the same issue elsewhere, I have also mentioned it in this issue.
  • I have learned the Github Markdown syntax, and formatted my issue correctly.

I understand that my issue may be closed if I don't fulfill my promises.

@everdark (Author)

Btw since the problem lies in the function htmltools::extractPreserveChunks (specifically here), my current workaround is to overwrite that function (removing the sanity check block) using assignInNamespace in my rendering script.

@caimiao0714

> Btw since the problem lies in the function htmltools::extractPreserveChunks (specifically here), my current workaround is to overwrite that function (removing the sanity check block) using assignInNamespace in my rendering script.

I am having the same issue, probably due to the Chinese characters in the data. Could you provide your solution to this issue? Thank you!


cderv commented Mar 19, 2020

Unfortunately, I can't reproduce this on my end with the latest dev version. I get a plotly graph with Chinese characters in the legend. But I may not have the correct encoding...

@caimiao0714 do you have another minimal reproducible example?

@everdark if you think the issue is in htmltools::extractPreserveChunks, a minimal reprex in htmltools would be useful.

Not sure what and where to fix this yet... 🤔

@debdagybra

Hello,

I think I have the same problem with French characters.

I get an error when I try to rmarkdown::render() a DT::datatable() of a data.frame that contains at least 21 accented letters (e.g. "é", which is very common in French).

The RStudio preview is fine; the error occurs only with render().

A minimal reproducible Rmd file:

---
title: test
output: html_notebook
---
```{r test_chunk}
library(DT)
test <- data.frame(
  V1=c("ééééé ééééé ééééé ééééé é"),
  V2=1,
  stringsAsFactors = FALSE)
DT::datatable(
  test
)
```

then:

```r
rmarkdown::render(
  input = "test.Rmd",
  output_format = "html_notebook"
)
```

returns:

processing file: test.Rmd
  |........................................                      |  50%
  ordinary text without R code
  |.....................................................................| 100%
label: test_chunk

output file: test.knit.md
Error in extract(input_str) : Invalid nesting of html_preserve directives

In my case, it only happens when the number of accented characters is >= 21 (globally in the data.frame; they don't have to be in one cell or in one variable).
If we replace any one of the "é"s with an "e", there is no error, since the count drops below 21.

> xfun::session_info('rmarkdown')
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362), RStudio 1.2.5033

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
Locale:
  LC_COLLATE=French_Belgium.1252  LC_CTYPE=French_Belgium.1252    LC_MONETARY=French_Belgium.1252 LC_NUMERIC=C                   
  LC_TIME=French_Belgium.1252    

Package version:
  base64enc_0.1.3 digest_0.6.25   evaluate_0.14   glue_1.4.0      graphics_3.6.3  grDevices_3.6.3 highr_0.8      
  htmltools_0.4.0 jsonlite_1.6.1  knitr_1.28.2    magrittr_1.5    markdown_1.1    methods_3.6.3   mime_0.9       
  Rcpp_1.0.4      rlang_0.4.5     rmarkdown_2.1   stats_3.6.3     stringi_1.4.6   stringr_1.4.0   tinytex_0.21   
  tools_3.6.3     utils_3.6.3     xfun_0.12       yaml_2.2.1     

Pandoc version: 2.9.2.1



cderv commented Apr 6, 2020

Thanks @debdagybra for the reprex. It helped me find the issue, which I had previously missed in @everdark's post. Sorry!

There is indeed an issue with htmltools::extractPreserveChunks, because an additional closing <!--/html_preserve--> is added after the htmlwidget without an opening <!--html_preserve-->. This causes the extraction to fail.

We now need to find out why the additional closing part is added without an opening one, in order to locate where the issue is. Still not sure where to fix it.
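
For reference, a stray closing marker is enough to reproduce the error with htmltools alone (a minimal sketch, assuming the htmltools 0.4.x behaviour discussed in this thread):

```r
# A closing marker without a matching opening marker triggers the sanity check:
htmltools::extractPreserveChunks("<div>widget</div><!--/html_preserve-->")
#> Error: Invalid nesting of html_preserve directives
```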

The threshold of at least 21 characters is also odd, but with one fewer é it works. 🤷‍♂

Thanks for the report, everyone!
I'll keep looking - if someone finds a new hint, please share. Thanks.


everdark commented Apr 7, 2020

> > Btw since the problem lies in the function htmltools::extractPreserveChunks (specifically here), my current workaround is to overwrite that function (removing the sanity check block) using assignInNamespace in my rendering script.
>
> I am having the same issue, probably due to the Chinese characters in the data. Could you provide your solution to this issue? Thank you!

@caimiao0714,

To work around it, comment out the sanity check block of extractPreserveChunks.
To be specific, redefine the function in your script as follows:

```r
extractPreserveChunks <- function(strval) {

  # Literal start/end marker text. Case sensitive.
  startmarker <- "<!--html_preserve-->"
  endmarker <- "<!--/html_preserve-->"
  # Start and end marker length MUST be different, it's how we tell them apart
  startmarker_len <- nchar(startmarker)
  endmarker_len <- nchar(endmarker)
  # Pattern must match both start and end markers
  pattern <- "<!--/?html_preserve-->"

  # It simplifies string handling greatly to collapse multiple char elements
  if (length(strval) != 1)
    strval <- paste(strval, collapse = "\n")

  # matches contains the index of all the start and end markers
  matches <- gregexpr(pattern, strval)[[1]]
  lengths <- attr(matches, "match.length", TRUE)

  # No markers? Just return.
  if (matches[[1]] == -1)
    return(list(value = strval, chunks = character(0)))

  # If TRUE, it's a start; if FALSE, it's an end
  boundary_type <- lengths == startmarker_len

  # Positive number means we're inside a region, zero means we just exited to
  # the top-level, negative number means error (an end without matching start).
  # For example:
  # boundary_type - TRUE TRUE FALSE FALSE TRUE FALSE
  # preserve_level - 1 2 1 0 1 0
  preserve_level <- cumsum(ifelse(boundary_type, 1, -1))

  # Sanity check.
  if (any(preserve_level < 0) || tail(preserve_level, 1) != 0) {
    #stop("Invalid nesting of html_preserve directives")
  }

  # Identify all the top-level boundary markers. We want to find all of the
  # elements of preserve_level whose value is 0 and preceding value is 1, or
  # whose value is 1 and preceding value is 0. Since we know that preserve_level
  # values can only go up or down by 1, we can simply shift preserve_level by
  # one element and add it to preserve_level; in the result, any value of 1 is a
  # match.
  is_top_level <- 1 == (preserve_level + c(0, preserve_level[-length(preserve_level)]))

  preserved <- character(0)

  top_level_matches <- matches[is_top_level]
  # Iterate backwards so string mutation doesn't screw up positions for future
  # iterations
  for (i in seq.int(length(top_level_matches) - 1, 1, by = -2)) {
    start_outer <- top_level_matches[[i]]
    start_inner <- start_outer + startmarker_len
    end_inner <- top_level_matches[[i+1]]
    end_outer <- end_inner + endmarker_len

    id <- htmltools:::withPrivateSeed(
      paste("preserve", paste(
        format(as.hexmode(sample(256, 8, replace = TRUE)-1), width=2),
        collapse = ""),
        sep = "")
    )

    preserved[id] <- gsub(pattern, "", substr(strval, start_inner, end_inner-1))

    strval <- paste(
      substr(strval, 1, start_outer - 1),
      id,
      substr(strval, end_outer, nchar(strval)),
      sep="")
    substr(strval, start_outer, end_outer-1) <- id
  }

  list(value = strval, chunks = preserved)
}
```

Then to overwrite the function exported from the package, run this line (after the above function) before you render the rmarkdown file:

```r
assignInNamespace("extractPreserveChunks", extractPreserveChunks, "htmltools")
```


debdagybra commented Apr 8, 2020

Here's another clue.

I have extracted the argument strval in 5 scenarios: no_Accent, 1_Accent, 20_Accents, 21_Accents, 25_Accents.

Every time, there is a value (nr 35 in my example) that starts with <script type="application/json" and should end with </script>.

But when there is any accent, the value ends with </script> + substr("<!--/html_preserve-->", 1, n), where n is the number of accented characters.
The limit before the error is 21 because nchar("<!--/html_preserve-->") = 21.

So, I still don't know where the error is, but it's before extractPreserveChunks().

Edit: My guess is that the function that writes this string always has "<!--/html_preserve-->" at the end and does a substr() somewhere based on a count of characters, but the accented characters aren't counted correctly.
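
A quick check of the character/byte mismatch for the test value (my own sketch, assuming the string is UTF-8 encoded):

```r
x <- "ééééé ééééé ééééé ééééé é"      # the 21-accent test value
nchar(x)                              # 25 characters
nchar(x, type = "bytes")              # 46 bytes: each é takes 2 bytes in UTF-8
nchar(x, type = "bytes") - nchar(x)   # 21 extra bytes
nchar("<!--/html_preserve-->")        # 21, the length of the closing marker
```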

OK_noAccent
<script type="application/json" data-for="htmlwidget-ef1c39e51614aa5dc818">{"x":{"filter":"none","data":[["1"],["eeeee eeeee eeeee eeeee e"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

OK_1Accent
<script type="application/json" data-for="htmlwidget-8e970aac4d9fbd2b6ff2">{"x":{"filter":"none","data":[["1"],["eeeee eeeee eeeee eeeee é"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><

OK_20Accents
<script type="application/json" data-for="htmlwidget-a92f35549e8511516e57">{"x":{"filter":"none","data":[["1"],["ééééé ééééé ééééé ééééé e"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><!--/html_preserve--

ERR_21Accents
<script type="application/json" data-for="htmlwidget-03a8af2c94f2a2b043ee">{"x":{"filter":"none","data":[["1"],["ééééé ééééé ééééé ééééé é"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><!--/html_preserve-->

ERR_25Accents
<script type="application/json" data-for="htmlwidget-ead922db46c4b7e5f311">{"x":{"filter":"none","data":[["1"],["ééééé ééééé ééééé ééééé ééééé"],[1]],"container":"<table class=\\"display\\">\\n <thead>\\n <tr>\\n <th> <\\/th>\\n <th>V1<\\/th>\\n <th>V2<\\/th>\\n <\\/tr>\\n <\\/thead>\\n<\\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script><!--/html_preserve-->


cderv commented Apr 8, 2020

Thanks @debdagybra for your analysis!

I have now identified the issue, and you were right about the substring based on a count of characters. In fact it is a count of bytes, and the issue is here:

```r
# TODO: add htmlUnpreserve function to htmlwidgets?
unpreserved <- substring(
  output,
  n_bytes("<!--html_preserve-->") + 1,
  n_bytes(output) - n_bytes("<!--/html_preserve-->")
)
```

It may be time to use htmltools::extractPreserveChunks here instead of this trick with byte counts, which does not work well with UTF-8 characters encoded on two (or more) bytes, such as accented characters.
substring() expects a number of characters, I guess.
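
To illustrate the off-by-bytes behaviour outside of rmarkdown (a standalone sketch mirroring the byte-based arithmetic above, not the actual rmarkdown code):

```r
output <- paste0("<!--html_preserve-->",
                 "ééééé ééééé ééééé ééééé é",   # 21 accented characters
                 "<!--/html_preserve-->")

# Using a byte count for the end position leaves the closing marker behind:
substring(output,
          nchar("<!--html_preserve-->") + 1,
          nchar(output, type = "bytes") - nchar("<!--/html_preserve-->"))
#> [1] "ééééé ééééé ééééé ééééé é<!--/html_preserve-->"
```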

> The limit before the error is 21 because nchar("<!--/html_preserve-->") = 21.

Yes, this is it! nchar("<!--/html_preserve-->", type = "bytes") = 21
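
For completeness, the same mismatch explains the Chinese characters in the original report, since each CJK character takes 3 bytes in UTF-8 (a quick check, assuming UTF-8 strings):

```r
nchar("é")                    # 1 character
nchar("é", type = "bytes")    # 2 bytes in UTF-8
nchar("我")                   # 1 character
nchar("我", type = "bytes")   # 3 bytes in UTF-8
```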

The fix is to be made in rmarkdown now.

Thanks all for the investigation!

@cderv

This comment has been minimized.

@debdagybra

Yes, it works perfectly, with the example and with my real data.
Thanks!


cderv commented Apr 18, 2020

I've now added the changes in the PR; it's waiting to be merged now.
@everdark can you confirm this works for you too now?


github-actions bot commented Nov 3, 2020

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.

github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2020