
CFamilyLexer fails to tokenize spaces before preprocessor macro #1820

Closed
henkkuli opened this issue May 26, 2021 · 6 comments · Fixed by #1830
@henkkuli
Contributor

The following code fails to tokenize even though it is valid C++:

; 
 #define A 0

Note that there are spaces around the line break. I would expect #define A 0 to be tokenized as a macro, but # produces an error token. The tokens produced are:

  • print(list(CFamilyLexer().get_tokens("; \n #define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, ' \n '), (Token.Error, '#'), (Token.Name, 'define'), (Token.Text, ' '), (Token.Name, 'A'), (Token.Text, ' '), (Token.Literal.Number.Integer, '0'), (Token.Text, '\n')]
    

What I have found

I have done some debugging and found that the spaces around the line break are important. Removing either of the spaces produces correct tokenization.

  • print(list(CFamilyLexer().get_tokens("; \n#define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, ' \n'), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
    
  • print(list(CFamilyLexer().get_tokens(";\n #define A 0"))):
    [(Token.Punctuation, ';'), (Token.Text, '\n'), (Token.Text, ' '), (Token.Comment.Preproc, '#'), (Token.Comment.Preproc, 'define A 0'), (Token.Comment.Preproc, '\n')]
    

As far as I understand, the relevant parts of the tokenizer are the following definitions:

'whitespace': [
    # preprocessor directives: without whitespace
    ('^#', Comment.Preproc, 'macro'),
    # or with whitespace
    ('^(' + _ws1 + ')(#)',
     bygroups(using(this), Comment.Preproc), 'macro'),
    (r'\n', Text),
    (r'\s+', Text),
],

Here, in the failing case, the ";" first matches as punctuation. Then '\s+' matches " \n ". Now we would want either '^#' or '^(' + _ws1 + ')(#)' to match the "#", but neither does, because ^ matches only at the start of a line, and we have already consumed the space at the start of the line. Hence the tokenizer produces an error.
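This anchoring behaviour can be reproduced with plain re, which Pygments compiles its rules with (a minimal sketch; the '[ \t]*' group is a simplified stand-in for the actual _ws1 pattern, not its real definition):

```python
import re

s = "; \n #define A 0"

# Pygments compiles lexer rules with re.MULTILINE, so '^' matches at the
# start of the string or immediately after a '\n'.
hash_pat = re.compile(r'^#', re.MULTILINE)

# After '\s+' has consumed ' \n ' the lexer sits at index 4, the '#'.
# Index 4 is preceded by a space, not a newline, so '^' cannot match:
print(hash_pat.match(s, 4))  # None

# At index 3, right after the newline, '^' does match, so the
# whitespace-then-'#' rule would have fired had the lexer stopped there:
ws_hash = re.compile(r'^([ \t]*)(#)', re.MULTILINE)
print(ws_hash.match(s, 3))  # a match object
```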

If we remove the space after the line break, '\s+' matches " \n". As there is no space at the start of the line, '^#' matches "#" and the lexer begins to tokenize a macro.

If we remove the space before the line break, '\n' comes before '\s+' in the list of regexes and consumes "\n". At this point the space is still unmatched, but '^(' + _ws1 + ')(#)' matches it because it is at the start of the line. Hence the lexer again begins to tokenize a macro.

My analysis

I fail to see why ^ at the start of the Comment.Preproc regexes would be necessary. Maybe that could be removed? I also think that \s+ should not match line breaks, but I'm not sure how easy that would be, as \s is locale-aware.

Ps. The above example is a minimal example I have found. I have encountered this bug in the wild, for example the following code also fails to tokenize:

int main() {
  prepare(); 
  #pragma omp parallel for
  for (int i = 0; i < 10; i++) {}
}
@amitkummer
Contributor

Thanks a lot for diving into the lexer and giving a detailed explanation and example. I think your suggested fix of removing the ^'s is probably the best solution here. Would you consider making a pull request for this?

Also, is there a reason you think matching line breaks with the \s+ regex here would be problematic? If the ^'s are removed, it might be possible to remove this line and let \s+ match newlines:

(r'\n', Text),

@henkkuli
Contributor Author

henkkuli commented Jun 2, 2021

Thanks a lot for diving into the lexer and giving a detailed explanation and example. I think your suggested fix of removing the ^'s is probably the best solution here. Would you consider making a pull request for this?

That seems to be the minimum change needed. I can create a pull request.

Also, is there a reason you think matching line breaks with the \s+ regex here would be problematic? If the ^'s are removed, it might be possible to remove this line and let \s+ match newlines:

(r'\n', Text),

Now that I read my message again, I see that what I wrote was not what I meant. What I wanted to say is that another, complementary fix would be to make the \s+ regex stop at line breaks. This way there would be a split at the line break, so ^ would again be able to match the beginning of the line.
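The complementary fix can be sketched with plain re: '[^\S\n]+' means "whitespace that is not a newline", so a run of whitespace is split at each line break and '^' gets a chance to match again afterwards:

```python
import re

# '\s+' swallows the newline together with the surrounding spaces:
greedy_ws = re.compile(r'\s+')
print(greedy_ws.match(" \n #define").group())  # ' \n '

# '[^\S\n]+' matches horizontal whitespace only, so it stops at the
# line break and leaves '^' able to fire on the next rule lookup:
horizontal_ws = re.compile(r'[^\S\n]+')
print(horizontal_ws.match(" \n #define").group())  # ' '
```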

The proposed change (i.e. removing ^) causes the lexer to accept some invalid C; for example, the following would tokenize even though it is not valid C:

foo(); #define FOO 0

I'm not familiar with how invalid code should be tokenized. In my opinion, tokenizing the #define FOO 0 here as preprocessor macro seems feasible, though others may disagree and argue an error should be produced in this case.
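The effect of the anchor on this example can be illustrated in isolation (a sketch with plain re, not the full lexer):

```python
import re

s = 'foo(); #define FOO 0'

# With the '^' anchor (and re.MULTILINE, as Pygments uses), a '#' in the
# middle of a line is not recognised as the start of a directive:
anchored = re.compile(r'^#', re.MULTILINE)
print(anchored.search(s))  # None

# Without the anchor, the rule fires at index 7, so the lexer would enter
# the 'macro' state and highlight '#define FOO 0' as a directive:
unanchored = re.compile(r'#')
print(unanchored.search(s).start())  # 7
```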

@amitkummer
Contributor

Now that I read my message again, I see that what I wrote was not what I meant. What I wanted to say is that another, complementary fix would be to make the \s+ regex stop at line breaks. This way there would be a split at the line break, so ^ would again be able to match the beginning of the line.

Sounds very good!

I'm not familiar with how invalid code should be tokenized. In my opinion, tokenizing the #define FOO 0 here as preprocessor macro seems feasible, though others may disagree and argue an error should be produced in this case.

It's kind of tough to say what should be done with invalid syntax, but according to the discussion in #1468 and after looking at the Python lexer, I would say -- unless it's easy for you to do, don't bother with outputting an Error token or with providing 'correct' highlighting for invalid code.

@henkkuli
Contributor Author

henkkuli commented Jun 2, 2021

I tested both of my proposals, and it seems that changing \s+ to [^\S\n]+ (i.e. whitespace without line breaks) is better than removing ^. If I remove ^ from the preprocessor regexes, some C-derived languages, such as SWIG, fail to tokenize correctly. For example, the following is valid SWIG:

return "std::vector<" #_Tp "," #_Alloc " >";

and only #_Tp and #_Alloc should be tokenized as preprocessor macros, while the rest should be tokenized as regular C++. However, removing ^ from the regexes causes the whole #_Tp "," #_Alloc " >"; to be tokenized as a preprocessor macro.

On the other hand, changing \s+ to [^\S\n]+ seems to do the right thing. I ran the tests on tests/examplefiles and tests/snippets, and the only differences are changes from

' \n\t'       Text

to

' '           Text
'\n'          Text
'\t'          Text

as expected.
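The reported diff is what the regexes predict: with \s+ replaced by [^\S\n]+ (plus the separate '\n' rule), the single ' \n\t' token splits into three (sketch in plain re):

```python
import re

text = " \n\t"

# Old behaviour: one rule consumes the whole run as a single Text token.
old_rule = re.compile(r'\s+')
print(repr(old_rule.match(text).group()))  # ' \n\t'

# New behaviour: horizontal whitespace and '\n' are matched by separate
# rules, so the same run is emitted as three Text tokens.
new_rules = re.compile(r'[^\S\n]+|\n')
print([m.group() for m in new_rules.finditer(text)])  # [' ', '\n', '\t']
```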

I'll open a PR for this.

@amitkummer
Contributor

Wow, that's awesome to see! And thank you again for your work, I am hopeful your PR will be accepted (pretty) swiftly.

@Anteru
Collaborator

Anteru commented Jun 2, 2021

Me too. I'll have to take a day off soon to work through the backlog. Thanks, everyone, for your contributions!

@Anteru Anteru added the changelog-update Items which need to get mentioned in the changelog label Jul 18, 2021
@Anteru Anteru self-assigned this Jul 18, 2021
@Anteru Anteru added this to the 2.10 milestone Jul 18, 2021