CFamilyLexer fails to tokenize spaces before preprocessor macro #1820
Comments
Thanks a lot for diving into the lexer and giving a detailed explanation and example. I think your suggested fix of removing the `^` anchor sounds reasonable. Also, is there a reason you think matching line breaks with `\s+` (pygments/pygments/lexers/c_cpp.py, line 58 at e8803c5) is a problem?
That seems to be the minimum change needed. I can create a pull request.
Now that I read my message again, I see that what I wrote was not what I meant. What I wanted to say is that another, complementary fix would be to make `\s+` not match line breaks. The proposed change (i.e. removing the `^` anchor) would also affect invalid code such as:

```cpp
foo(); #define FOO 0
```

I'm not familiar with how invalid code should be tokenized. In my opinion, tokenizing the `#define` here as a macro would be acceptable.
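To make the anchor's effect on such input concrete, here is a small sketch using only the standard `re` module (not pygments itself); the index 7 is simply the hand-computed position of the `#`, standing in for the lexer's current position:

```python
import re

# Invalid-but-plausible input from the discussion above: a '#' that is
# not at the start of a line.
text = "foo(); #define FOO 0"
pos = 7  # index of '#', i.e. where a lexer would be after "foo(); "

anchored = re.compile(r'^#', re.MULTILINE)
unanchored = re.compile(r'#')

# With the '^' anchor, a mid-line '#' is rejected: under re.MULTILINE,
# '^' only succeeds at the string start or right after a '\n', and the
# preceding character here is a space.
print(anchored.match(text, pos))    # -> None

# Without the anchor, the same '#' is accepted, so the '#define' would
# be handed off to the macro rules even though it is mid-line.
print(unanchored.match(text, pos))  # -> a match object for '#'
```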
Sounds very good!
It's kind of tough to say what should be done with invalid syntax, but according to the discussion in #1468 and after looking at the Python lexer, I would say -- unless it's easy for you to do -- don't bother with outputting an `Error` token.
I tested both of my proposals, and it seems that changing `\s+` so that it does not match line breaks breaks the tokenization of other code, for example stringification in macros such as:

```cpp
return "std::vector<" #_Tp "," #_Alloc " >";
```

On the other hand, changing the `^`-anchored `Comment.Preproc` regexes so that they no longer require the match to start at the beginning of a line makes the failing example tokenize as expected. I'll open a PR for this.
Wow, that's awesome to see! And thank you again for your work, I am hopeful your PR will be accepted (pretty) swiftly.
Me too. I'll have to take a day off soon to work through the backlog. Thanks, everyone, for your contributions!
The following string fails to tokenize even though it is valid C++ code:

```
"; \n #define A 0"
```

Note that there are spaces around the line break. I would expect the

```cpp
#define A 0
```

to be tokenized as a macro, but `#` produces an error token. The tokens produced by

```python
print(list(CFamilyLexer().get_tokens("; \n #define A 0")))
```

show this `Error` token.

What I have found
I have done some debugging and found that the spaces around the line break are important. Removing either of the spaces produces correct tokenization:

```python
print(list(CFamilyLexer().get_tokens("; \n#define A 0")))
print(list(CFamilyLexer().get_tokens(";\n #define A 0")))
```
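The difference can be reproduced with the standard `re` module alone: pygments's `RegexLexer` tries each rule at the current position, and under `re.MULTILINE` a leading `^` only succeeds at the start of the string or immediately after a `\n`. A sketch, with the positions chosen by hand to mirror where the lexer ends up:

```python
import re

hash_rule = re.compile(r'^#', re.MULTILINE)

# Failing input: after ';' and the whitespace run " \n " are consumed,
# the lexer sits at index 4, one character past the line start.
failing = "; \n #define A 0"
print(hash_rule.match(failing, 4))  # -> None (preceding char is ' ')

# Working input: the whitespace run is only " \n", so the lexer sits
# at index 3, exactly at the line start.
working = "; \n#define A 0"
print(hash_rule.match(working, 3))  # -> a match object for '#'
```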
As far as I understand, the relevant parts of the tokenizer are the following definitions: the preprocessor rules `'^#'` and `'^(' + _ws1 + ')(#)'`, together with the whitespace rules `'\n'` and `'\s+'`.
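The snippet below is a simplified, self-contained imitation of those definitions, not the real thing: the patterns are paraphrased from the discussion, the `_ws1` here is an assumption (the real one also skips block comments), and the rule loop is a bare-bones stand-in for pygments's `RegexLexer`. It nevertheless reproduces the reported behavior on all three inputs:

```python
import re

_ws1 = r'\s*'  # simplified stand-in; the real _ws1 also skips /* ... */

# Paraphrased subset of CFamilyLexer's rules, in their original order:
# the '^'-anchored preprocessor rules come before the whitespace rules.
RULES = [
    (re.compile(r'^#', re.MULTILINE), 'Comment.Preproc'),
    (re.compile(r'^(' + _ws1 + r')(#)', re.MULTILINE), 'Comment.Preproc'),
    (re.compile(r'\n'), 'Text'),
    (re.compile(r'\s+'), 'Text'),
    (re.compile(r';'), 'Punctuation'),
    (re.compile(r'\w+'), 'Name'),
]

def simulate(text):
    """Match rules at the current position, like pygments' RegexLexer;
    emit an Error token when no rule matches."""
    pos, tokens = 0, []
    while pos < len(text):
        for pattern, ttype in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((ttype, m.group()))
                pos = m.end()
                break
        else:
            tokens.append(('Error', text[pos]))  # no rule matched
            pos += 1
    return tokens

for s in ("; \n #define A 0", "; \n#define A 0", ";\n #define A 0"):
    print(repr(s), '->', simulate(s))
```

Only the first input yields an `('Error', '#')` token; in the other two the `#` is matched by one of the `^`-anchored rules.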
Here, in the failing case, the `";"` first matches as punctuation. Then `'\s+'` matches `" \n "`. Now we would want either `'^#'` or `'^(' + _ws1 + ')(#)'` to match the `"#"`, but this isn't the case, because `^` matches only at the start of the line, and here we have already consumed the space at the start of the line. Hence the tokenizer produces an error.

In the case where we remove the space after the line break, `'\s+'` matches `" \n"`. As there is no space at the start of the line, `'^#'` matches `"#"` and the lexer begins to tokenize a macro.

In the case where we remove the space before the line break, `'\n'` comes before `'\s+'` in the list of regexes and consumes `"\n"`. At this point we still haven't matched the space, but it gets matched by `'^(' + _ws1 + ')(#)'`, as it is at the start of the line. Hence the lexer again begins to tokenize a macro.

My analysis

I fail to see why the `^` at the start of the `Comment.Preproc` regexes is necessary. Maybe it could be removed? I also think that `\s+` should not match line breaks, but I'm not sure how easy that change would be, as `\s` is locale-aware.

Ps. The above is the minimal example I have found. I have encountered this bug in the wild; for example, the following code also fails to tokenize: