Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON parser error in resolving PR #492

Closed
cderv opened this issue Apr 18, 2020 · 5 comments
Closed

JSON parser error in resolving PR #492

cderv opened this issue Apr 18, 2020 · 5 comments
Labels
bug an unexpected problem or unintended behavior reprex needs a minimal reproducible example

Comments

@cderv
Copy link
Contributor

cderv commented Apr 18, 2020

I've recently got a weird issue

withr::with_temp_libpaths(
  remotes::install_github("rstudio/rmarkdown#1799")
)
#> Using github PAT from envvar GITHUB_PAT
#> Error in github_resolve_ref.github_pull(meta$ref %||% ref, meta, host = host, : Cannot find GitHub pull request rstudio/rmarkdown#1799
#> JSON: EXPECTED value GOT "

Created on 2020-04-18 by the reprex package (v0.3.0)

The internal remote json parser errors and that prevent remotes to determine the pull request url.

Here my analysis to help:

After digging a bit why we have this " token alone ( the json token to parse is "\""), I think this is related to emoji in PR body. The returned JSON string by curl contains some special character that will break the current json parser. For this particular PR, the emoji will break the parsing of the body: instead of a paragraph as all, it will be letter by letter. From browsing inside the parser:

Browse[2]> tokens[120:140]
 [1] "}"        ","        "\"body\"" ":"        "\""       "T"        "h"       
 [8] "i"        "s"        "f"        "i"        "x"        "e"        "s"       
[15] "#"        "1762"     "\\"       "r"        "\\"       "n"        "\\"  

The faulty token is the 4th one, but it should not exist as it should be parsed as a group I think

"\"This fixes #1762 (...)\""

The special characters that breaks the parser I think are

Browse[2]> tokens[538:557]
 [1] "h"  "i"  "s"  "t"  "o"  "o"  "â"  "˜"  "º"  "ï"  "¸"  "\u008f" "\\" "r"  "\\"
[16] "n"  "\\" "r"  "\\" "n" 

corresponding to the emoji in the PR body

I tried to add a test by simplifying the code I used to come up with the solution. Not sure if this is the simplest unit test. Also I added skip_on_cran() because the other test had this too ☺️

Let's note that obviously it works when removing the emoji et it works using a different emoji.

I guess there is an encoding issue in the return response content of the curl request, and it could be OS related too. I am on windows 10, European locale. (So my system is not UTF8 by default).

I am not sure yet how it should be fixed but if other encounter the same, the issue is now documented.

@jimhester jimhester added the bug an unexpected problem or unintended behavior label Jun 5, 2020
@jimhester jimhester added the reprex needs a minimal reproducible example label Jul 15, 2020
@jimhester
Copy link
Member

I currently cannot reproduce this, I get a different error that does not include the JSON issue.

So I will close this for now, if you encounter another case where this happens please open a new issue, thanks!

withr::with_temp_libpaths(
  remotes::install_github("rstudio/rmarkdown#1799")
)
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo cderv/rmarkdown@html-unpreserve
#> Error in utils::download.file(url, path, method = method, quiet = quiet,  : 
#>   cannot open URL 'https://api.github.com/repos/cderv/rmarkdown/tarball/html-unpreserve'

Created on 2020-07-15 by the reprex package (v0.3.0)

@cderv
Copy link
Contributor Author

cderv commented Jul 15, 2020

You got this error because I deleted the branch. Sorry My bad 😞 . I restored it.

I still get this error on my end, with last CRAN version of packages and R 4.0.2.

If you can't reproduce, it may be an OS and/or encoding issue as I am on a French Windows 10 computer.

Here is a smaller reprex

host <- "api.github.com"
token <- remotes:::github_pat()
path <- "repos/rstudio/rmarkdown/pulls/1799"
remotes:::github_GET(path, host, token)
#> Error: JSON: EXPECTED value GOT "

Created on 2020-07-15 by the reprex package (v0.3.0.9001)

and a more low level one to show what I retrieve from the github issue body, and the weird chars I got that fail the internal json parser. I assume you have a github PAT set

url <- "https://api.github.com/repos/rstudio/rmarkdown/pulls/1799"
headers <- c(Authorization = paste0("token ", token))
#> Error in paste0("token ", token): objet 'token' introuvable
h <- curl::new_handle()
curl::handle_setheaders(h, .list = headers)
#> Error in curl::handle_setheaders(h, .list = headers): objet 'headers' introuvable
res <- curl::curl_fetch_memory(url, handle = h)
res <- rawToChar(res$content)
body <- jsonlite::fromJSON(res)$body
# only to print on several lines on GH issue
cat(strwrap(body, width = 100), sep = "\n")
#> This fixes #1762   The issue was in the unpreserving html widget code for UFT8 special character
#> encoded on two byte. (see https://github.com/rstudio/rmarkdown/issues/1762#issuecomment-611189889).
#>   Using the `htmltools` functions to deal with preserved code extraction seems now the way to do
#> it.   I tried to add a test by simplifying the code I used to come up with the solution. Not sure
#> if this is the simplest unit test. Also I added `skip_on_cran()` because the other test had this
#> too ☺�   In addition, I think this could also have been resolved by making sure to use
#> `bytes` Encoding before using `substrings` like in 
#> https://github.com/rstudio/rmarkdown/blob/1c859f47b990c1b6d3a1fbc4b877fbaf5ff18512/R/util.R#L283-L288
#> However, `htmltools` functions seemed dedicated to this usage.

remotes:::json$parse(res)
#> Error: JSON: EXPECTED value GOT "

Created on 2020-07-15 by the reprex package (v0.3.0.9001)

I can investigate further if you want. Seems like an edge case but an sign of something wrong somewhere maybe 🤔

@jimhester jimhester reopened this Jul 15, 2020
@cderv
Copy link
Contributor Author

cderv commented Jul 15, 2020

I don't have this issue on linux ubuntu 18.04 (tested on WSL on my computer). So it is possible it is a windows only issue, something to do with incorrect encoding, maybe when converting from raw to char.

Here is the minimal reprex I got for this I believe to be a conversion issue in rawToChar with encoding

On windows

raw_st <- as.raw(c(0x74, 0x6f, 0x6f, 0x20, 0xe2, 0x98, 0xba, 0xef, 0xb8, 
                   0x8f, 0x20, 0x5c, 0x72, 0x5c, 0x6e))
rawToChar(raw_st)
#> [1] "too â\230ºï¸\217 \\r\\n"

On Linux

raw_st <- as.raw(c(0x74, 0x6f, 0x6f, 0x20, 0xe2, 0x98, 0xba, 0xef, 0xb8,
                   0x8f, 0x20, 0x5c, 0x72, 0x5c, 0x6e))
rawToChar(raw_st)
#> [1] "too ☺️\\r\\n"

Created on 2020-07-15 by the reprex package (v0.3.0)

So in remotes, the issue happens here

json$parse(rawToChar(res$content))

as rawToChar gives incorrect results (due to encoding issue ?), the parser fails.

Hope it helps

@jimhester
Copy link
Member

Thank you for the reprex, it made it relatively painless to figure out a fix. rawToChar() doesn't mark the strings with any encoding, so R interprets this as the native locale. In this case GitHub always returns character data in UTF-8, so we needed to set the encoding of the strings to UTF-8 after using rawToChar().

After doing this both your reprex and the original issue are fixed.

@cderv
Copy link
Contributor Author

cderv commented Jul 16, 2020

It seems I was close to the fix after digging this to the bottom. I was missing a small piece... 😄

Anyway, glad it helps ! I am used to dig windows stuff now 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior reprex needs a minimal reproducible example
Projects
None yet
Development

No branches or pull requests

2 participants