lintr fails when parser generates "error: unexpected input" from unicode character token #815

Closed
leogama opened this issue Jun 26, 2021 · 3 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@leogama
Contributor

leogama commented Jun 26, 2021

Reproducible example:

> lintr::lint(text="x <- 2 × 3")
# Error in nchar(content$text, "chars") : 
#   invalid multibyte string, element 7

The problem seems to be that, when the R parser finds an unexpected token that starts with or is a multibyte Unicode character, it halts the parse (fine) and returns just the first byte of that character as the text for the offending token (bad). The first byte of a multibyte UTF-8 character alone is an invalid character and triggers the error.
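To make the "first byte alone is invalid" point concrete, here is a minimal sketch in base R. It is only an illustration of the encoding issue, not code taken from lintr; the variable names are hypothetical, and the string is explicitly marked as UTF-8 so that the result should not depend on the session locale.

```r
# U+00D7 (multiplication sign), written as an escape so the example
# does not depend on the encoding of this file.
times <- "\u00d7"

# In UTF-8 this character takes two bytes.
bytes <- charToRaw(enc2utf8(times))

# Keep only the first byte, mimicking the truncated token text
# described above, and mark it as UTF-8.
first_byte <- rawToChar(bytes[1])
Encoding(first_byte) <- "UTF-8"

# A lone lead byte is not a valid UTF-8 sequence.
is_valid <- validUTF8(first_byte)

# With allowNA = TRUE, nchar() reports NA instead of stopping with
# "invalid multibyte string".
n_chars <- nchar(first_byte, type = "chars", allowNA = TRUE)
```

Calling `nchar(first_byte, "chars")` without `allowNA = TRUE` is the variant that stops with an error, which matches the failure shown in the reproducible example.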

I'm still investigating this bug. The source seems to be in utils::getParseData() or deeper, depending on what the expected behavior is. Whether or not there is an issue in base functions, I'll submit a pull request with a workaround tomorrow.

@MichaelChirico
Collaborator

R also can't parse this:

parse(text="x <- 2 × 3")
# Error in parse(text = "x <- 2 × 3") : <text>:1:8: unexpected input
# 1: x <- 2 ×
#            ^

What's your expected behavior here?

@AshesITR
Collaborator

I can repro this on Arch Linux with R 4.1.0.
This is an issue that should be fixed with #782 once that is ready to merge.

@MichaelChirico I think the bug report is about lintr crashing instead of relaying the parse error message.

@AshesITR AshesITR added the bug an unexpected problem or unintended behavior label Jun 26, 2021
@leogama
Contributor Author

leogama commented Jun 26, 2021

@MichaelChirico, as @AshesITR has pointed out, the expected behavior is that lintr returns the error generated by the parser. Currently it gets the error but fails while trying to format it (inside fix_column_numbers).
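As a rough illustration of "returning the error generated by the parser", the sketch below captures the parser's message with `tryCatch()` instead of letting it propagate. The helper `relay_parse_error()` is hypothetical and is not how lintr is implemented; it only shows the behavior being asked for.

```r
# Hypothetical helper: return the parser's error message as a string,
# or NA if the code parses cleanly.
relay_parse_error <- function(code) {
  tryCatch(
    {
      parse(text = code, keep.source = TRUE)
      NA_character_
    },
    error = function(e) conditionMessage(e)
  )
}

# The multiplication sign (U+00D7) is not a valid R token.
msg_bad <- relay_parse_error("x <- 2 \u00d7 3")

# Ordinary code parses without an error.
msg_ok <- relay_parse_error("x <- 2 * 3")
```

Here `msg_bad` holds the parser's own message, which is the kind of text a linter could surface as an error lint rather than crashing.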

@AshesITR, I initially thought that it was an encoding issue, as I was getting the same nchar() error that occurs when a file is read with the wrong encoding. But if you use non-ASCII characters in allowed locations, such as an alphabetic character in a variable name, it works. In the example below, what looks like a 'minus' operator is actually an 'en-dash', and it triggers the error:

> debugonce(lintr:::fix_column_numbers)
> lint(text = "α <- 169 – 144")
# debugging in: fix_column_numbers(fix_tab_indentations(source_file))
# (...)
Browse[2]> content  # fix_column_numbers' argument
#   line1 col1 line2 col2 id parent       token terminal text
# 7     1    1     1    8  7      0        expr    FALSE
# 1     1    1     1    1  1      3      SYMBOL     TRUE    α  # alpha character correctly parsed as a name
# 3     1    1     1    1  3      7        expr    FALSE
# 2     1    3     1    4  2      7 LEFT_ASSIGN     TRUE   <-
# 4     1    6     1    8  4      5   NUM_CONST     TRUE  169
# 5     1    6     1    8  5      7        expr    FALSE
# 6     1   10     1   10  6      0        \xe2    FALSE \xe2  # first byte of UTF-8 encoded en-dash character

This table is the object returned by getParseData(source_file). So, there are two problems:

  1. It returns a string containing an incomplete character, even though the input text is valid and the encoding is correct.
  2. It returns it as valid text.

What should be in the text column according to documentation:

If includeText is TRUE, the text of all tokens; if it is NA (the default), the text of terminal tokens. If includeText == FALSE, this column is not included. Very long strings (with source of 1000 characters or more) will not be stored; a message giving their length and delimiter will be included instead.

However, the last row in the table has terminal = FALSE, and its broken text value is returned anyway.
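One possible workaround, sketched below, is to blank the text of every non-terminal row before measuring it, so the table matches what the documentation describes. The data frame is a hand-built stand-in with only the two relevant columns (a real getParseData() result has more), and this is not necessarily the change made in the PR.

```r
# Hand-built stand-in for the parse data: the last row is a
# non-terminal whose text is a lone UTF-8 lead byte (0xE2).
content <- data.frame(
  terminal = c(TRUE, FALSE, FALSE),
  text = c("169", "", rawToChar(as.raw(0xe2))),
  stringsAsFactors = FALSE
)

# Non-terminal rows should carry no text, so clear them.
content[!content$terminal, "text"] <- ""

# Measuring the text no longer touches the invalid byte.
widths <- nchar(content$text, type = "chars")
```

After the assignment, only the terminal token "169" still has text, so `nchar()` has nothing invalid left to trip over.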

I'm sending the PR.

leogama added a commit to leogama/lintr that referenced this issue Jun 26, 2021
Fixes the failure that occurs when a syntax error interrupts the
parsing at a Unicode character and, as a consequence, the parsed content
returned by 'getParseData(source_file)' includes an incomplete UTF-8
multibyte representation of that character (which is invalid). See the
GitHub issue discussion for details.

The proposed solution merely "coerces" the getParseData's 'text'
column to what it is supposed to be according to its documentation.

An alternative would be to actually fix the table before use:
> content[!content$terminal, "text"] <- ""
leogama added a commit to leogama/lintr that referenced this issue Jun 27, 2021
leogama added a commit to leogama/lintr that referenced this issue Jun 27, 2021