lintr fails when parser generates "error: unexpected input" from unicode character token #815
Comments
R also can't parse this:
What's your expected behavior here?
I can repro this on Arch Linux with R 4.1.0. @MichaelChirico I think the bug report is about lintr crashing instead of relaying the parse error message.
@MichaelChirico, as @AshesITR has pointed out, the expected behavior is that
@AshesITR, I initially thought that it was an encoding issue as I was getting the same
> debugonce(lintr:::fix_column_numbers)
> lint(text = "α <- 169 – 144")
# debugging in: fix_column_numbers(fix_tab_indentations(source_file))
# (...)
Browse[2]> content # fix_column_numbers' argument
#   line1 col1 line2 col2 id parent       token terminal text
# 7     1    1     1    8  7      0        expr    FALSE
# 1     1    1     1    1  1      3      SYMBOL     TRUE    α  # alpha character correctly parsed as a name
# 3     1    1     1    1  3      7        expr    FALSE
# 2     1    3     1    4  2      7 LEFT_ASSIGN     TRUE   <-
# 4     1    6     1    8  4      5   NUM_CONST     TRUE  169
# 5     1    6     1    8  5      7        expr    FALSE
# 6     1   10     1   10  6      0        \xe2    FALSE \xe2  # first byte of UTF-8 encoded en-dash character
This table is the object returned by
What should be in the
However, the last row in the table has a
I'm sending the PR.
Fixes the failure that occurs when a syntax error interrupts parsing at a Unicode character and, as a consequence, the parsed content returned by 'getParseData(source_file)' includes an incomplete (and therefore invalid) UTF-8 multibyte representation of that character. See the GitHub issue discussion for details. The proposed solution merely "coerces" the 'text' column of the getParseData result to what it is supposed to be according to its documentation. An alternative would be to actually fix the table before use:
> content[!content$terminal, "text"] <- ""
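To make the alternative concrete, here is a minimal sketch on a hand-built table. The data frame below only imitates a few columns of the getParseData output shown in the debugging session; it is an assumption for illustration, not lintr's actual code or the change made in the PR.

```r
# Hypothetical stand-in for the parse data table; the last row carries the
# lone first byte (0xe2) of the UTF-8 encoded en dash as its text.
content <- data.frame(
  token    = c("expr", "SYMBOL", "NUM_CONST", "\\xe2"),
  terminal = c(FALSE, TRUE, TRUE, FALSE),
  text     = c("", "\u03b1", "169", rawToChar(as.raw(0xe2))),
  stringsAsFactors = FALSE
)

# Before the fix: the last element is not valid UTF-8.
valid_before <- validUTF8(content$text)

# The alternative from the PR description: blank out text of non-terminal rows.
content[!content$terminal, "text"] <- ""

# After the fix: every element is valid, and terminal tokens are untouched.
valid_after <- validUTF8(content$text)
```

This relies on the offending row being marked as non-terminal, as it is in the table from the debugging session.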
Reproducible example:
The problem seems to be that, when the R parser finds an unexpected token that starts with or is a multibyte Unicode character, it halts the parse (fine) and returns just the first byte of that character as the text for the offending token (bad). The first byte of a multibyte UTF-8 character alone is an invalid character and triggers the error.
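The claim about the lone first byte can be checked in base R without involving the parser. This is only a sketch of the encoding side of the problem; the variable names are made up for illustration, and it does not show which call inside lintr actually raises the error.

```r
# En dash (U+2013), written as an escape so the script does not depend on
# the encoding of the file it is saved in.
en_dash <- "\u2013"

# Its UTF-8 representation is three bytes long.
bytes <- charToRaw(en_dash)  # e2 80 93

# Keep only the first byte, which is what ends up in the parse data.
first_byte <- rawToChar(bytes[1])

validUTF8(en_dash)     # TRUE: the complete character is valid
validUTF8(first_byte)  # FALSE: a lead byte without its continuation bytes
```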
I'm still investigating this bug. The source seems to be at least in utils::getParseData() or deeper, depending on what the expected behavior is. Anyway, whether or not there is an issue in base functions, I'll submit a pull request with a workaround tomorrow.