
CFamilyLexer fails to tokenize spaces before preprocessor macro #1820

Closed
henkkuli opened this issue May 26, 2021 · 6 comments · Fixed by #1830
@henkkuli
Contributor

The following code fails to tokenize even though it is valid C++:

; 
 #define A 0

Note that there are spaces around the line break. I would expect #define A 0 to be tokenized as a macro, but # produces an error token. The tokens produced are:

  • print(list(CFamilyLexer().get_tokens("; \n #define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, ' \n '), (Token.Error, '#'), (Token.Name, 'define'), (Token.Text, ' '), (Token.Name, 'A'), (Token.Text, ' '), (Token.Literal.Number.Integer, '0'), (Token.Text, '\n')]
    

What I have found

I have done some debugging and found that the spaces around the line break are important. Removing either of the spaces produces correct tokenization.

  • print(list(CFamilyLexer().get_tokens("; \n#define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, ' \n'), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
    
  • print(list(CFamilyLexer().get_tokens(";\n #define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, '\n'), (Token.Text, ' '), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
    

As far as I understand, the relevant parts of the tokenizer are the following definitions:

'whitespace': [
    # preprocessor directives: without whitespace
    ('^#', Comment.Preproc, 'macro'),
    # or with whitespace
    ('^(' + _ws1 + ')(#)',
     bygroups(using(this), Comment.Preproc), 'macro'),
    (r'\n', Text),
    (r'\s+', Text),
],

Here, in the failing case, the ";" first matches as punctuation. Then '\s+' matches " \n ". Now we would want either '^#' or '^(' + _ws1 + ')(#)' to match the "#", but neither does, because ^ matches only at the start of a line, and we have already consumed the space at the start of the line. Hence the tokenizer produces an error.
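This anchoring behaviour can be reproduced with plain re, which Pygments compiles its rules with (a minimal sketch; the '[ \t]*' group is a simplified stand-in for the actual _ws1 pattern, not its real definition):

```python
import re

s = "; \n #define A 0"

# Pygments compiles lexer rules with re.MULTILINE, so '^' matches at the
# start of the string or immediately after a '\n'.
hash_pat = re.compile(r'^#', re.MULTILINE)

# After '\s+' has consumed ' \n ' the lexer sits at index 4, the '#'.
# Index 4 is preceded by a space, not a newline, so '^' cannot match:
print(hash_pat.match(s, 4))  # None

# At index 3, right after the newline, '^' does match, so the
# whitespace-then-'#' rule would have fired had the lexer stopped there:
ws_hash = re.compile(r'^([ \t]*)(#)', re.MULTILINE)
print(ws_hash.match(s, 3))  # a match object
```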

If we remove the space after the line break, '\s+' matches " \n". As there is no space at the start of the line, '^#' matches "#" and the lexer begins to tokenize a macro.

If we remove the space before the line break, '\n' comes before '\s+' in the list of regexes and consumes "\n". At this point the space is still unmatched, but '^(' + _ws1 + ')(#)' matches it because it is at the start of the line. Hence the lexer again begins to tokenize a macro.

My analysis

I fail to see why ^ at the start of the Comment.Preproc regexes would be necessary. Maybe that could be removed? I also think that \s+ should not match line breaks, but I'm not sure how easy that would be, as \s is locale-aware.

Ps. The above example is a minimal example I have found. I have encountered this bug in the wild, for example the following code also fails to tokenize:

int main() {
  prepare(); 
  #pragma omp parallel for
  for (int i = 0; i < 10; i++) {}
}
@amitkummer
Contributor

Thanks a lot for diving into the lexer and giving a detailed explanation and example. I think your suggested fix of removing the ^'s is probably the best solution here. Would you consider making a pull request for this?

Also, is there a reason you think matching line breaks with the \s+ regex here would be problematic? If the ^'s are removed, it might be possible to remove this line and let \s+ match newlines:

(r'\n', Text),

@henkkuli
Contributor Author

henkkuli commented Jun 2, 2021

Thanks a lot for diving into the lexer and giving a detailed explanation and example. I think your suggested fix of removing the ^'s is probably the best solution here. Would you consider making a pull request for this?

That seems to be the minimum change needed. I can create a pull request.

Also, is there a reason you think matching line breaks with the \s+ regex here would be problematic? If the ^'s are removed, it might be possible to remove this line and let \s+ match newlines:

(r'\n', Text),

Now that I read my message again, I see that what I wrote was not what I meant. What I wanted to say is that another, complementary fix would be to make the \s+ regex stop at line breaks. This way there would be a split at the line break, so ^ would again be able to match the beginning of the line.
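The complementary fix can be sketched with plain re: '[^\S\n]+' means "whitespace that is not a newline", so a run of whitespace is split at each line break and '^' gets a chance to match again afterwards:

```python
import re

# '\s+' swallows the newline together with the surrounding spaces:
greedy_ws = re.compile(r'\s+')
print(greedy_ws.match(" \n #define").group())  # ' \n '

# '[^\S\n]+' matches horizontal whitespace only, so it stops at the
# line break and leaves '^' able to fire on the next rule lookup:
horizontal_ws = re.compile(r'[^\S\n]+')
print(horizontal_ws.match(" \n #define").group())  # ' '
```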

The proposed change (i.e. removing ^) causes the lexer to accept some invalid C; for example, the following would tokenize even though it is not valid C:

foo(); #define FOO 0

I'm not familiar with how invalid code should be tokenized. In my opinion, tokenizing the #define FOO 0 here as preprocessor macro seems feasible, though others may disagree and argue an error should be produced in this case.
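The effect of the anchor on this example can be illustrated in isolation (a sketch with plain re, not the full lexer):

```python
import re

s = 'foo(); #define FOO 0'

# With the '^' anchor (and re.MULTILINE, as Pygments uses), a '#' in the
# middle of a line is not recognised as the start of a directive:
anchored = re.compile(r'^#', re.MULTILINE)
print(anchored.search(s))  # None

# Without the anchor, the rule fires at index 7, so the lexer would enter
# the 'macro' state and highlight '#define FOO 0' as a directive:
unanchored = re.compile(r'#')
print(unanchored.search(s).start())  # 7
```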

@amitkummer
Contributor

Now that I read my message again, I see that what I wrote was not what I meant. What I wanted to say is that another, complementary fix would be to make the \s+ regex stop at line breaks. This way there would be a split at the line break, so ^ would again be able to match the beginning of the line.

Sounds very good!

I'm not familiar with how invalid code should be tokenized. In my opinion, tokenizing the #define FOO 0 here as preprocessor macro seems feasible, though others may disagree and argue an error should be produced in this case.

It's kind of tough to say what should be done with invalid syntax, but according to the discussion in #1468 and after looking at the Python lexer, I would say -- unless it's easy for you to do, don't bother with outputting an Error token or with providing 'correct' highlighting for invalid code.

@henkkuli
Contributor Author

henkkuli commented Jun 2, 2021

I tested both of my proposals, and it seems that changing \s+ to [^\S\n]+ (i.e. whitespace without line breaks) is better than removing ^. If I remove ^ from the preprocessor regexes, some C-derived languages, such as SWIG, fail to tokenize correctly. For example, the following is valid SWIG:

return "std::vector<" #_Tp "," #_Alloc " >";

and only #_Tp and #_Alloc should be tokenized as preprocessor macros, while the rest should be tokenized as regular C++. However, removing ^ from the regexes causes the whole #_Tp "," #_Alloc " >"; to be tokenized as a preprocessor macro.

On the other hand, changing \s+ to [^\S\n]+ seems to do the right thing. I ran the tests on tests/examplefiles and tests/snippets, and the only differences are changes from

' \n\t'       Text

to

' '           Text
'\n'          Text
'\t'          Text

as expected.
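The reported diff is what the regexes predict: with \s+ replaced by [^\S\n]+ (plus the separate '\n' rule), the single ' \n\t' token splits into three (sketch in plain re):

```python
import re

text = " \n\t"

# Old behaviour: one rule consumes the whole run as a single Text token.
old_rule = re.compile(r'\s+')
print(repr(old_rule.match(text).group()))  # ' \n\t'

# New behaviour: horizontal whitespace and '\n' are matched by separate
# rules, so the same run is emitted as three Text tokens.
new_rules = re.compile(r'[^\S\n]+|\n')
print([m.group() for m in new_rules.finditer(text)])  # [' ', '\n', '\t']
```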

I'll open a PR for this.

@amitkummer
Contributor

Wow, that's awesome to see! And thank you again for your work, I am hopeful your PR will be accepted (pretty) swiftly.

@Anteru
Collaborator

Anteru commented Jun 2, 2021

Me too. I'll have to take a day off soon to work through the backlog. Thanks, everyone, for your contributions!

@Anteru Anteru added the changelog-update Items which need to get mentioned in the changelog label Jul 18, 2021
@Anteru Anteru self-assigned this Jul 18, 2021
@Anteru Anteru added this to the 2.10 milestone Jul 18, 2021