Fix CFamilyLexer preprocessor tokenization errors #1830

Merged: 1 commit, Jun 20, 2021

@henkkuli henkkuli commented Jun 2, 2021

CFamilyLexer fails to tokenize preprocessor macros when they are preceded by a line break surrounded by spaces. This happens because the preprocessor regex rule expects to start at the beginning of a line, but the space regex rule also matches the whitespace after the line break, so the lexer is already past the line start when it reaches the `#`. The space rule has now been refined so that it does not match the line break. As a result, the preprocessor regex rule correctly matches preprocessor tokens even when they are preceded by spaces, at the cost of adding some more tokens to the token stream in some cases.
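To make the failure mode concrete, here is a minimal sketch using only Python's `re` module. It is not the Pygments code: the `tokenize` helper, the rule names, and the patterns (including the `+` quantifiers and the separate `\n` rule) are simplified assumptions chosen to mirror the description above.

```python
import re

def tokenize(text, rules):
    """Toy scanner: try each rule in order at the current position.

    Emits an ("Error", char) token when no rule matches, loosely
    imitating how a regex lexer reports untokenizable input.
    """
    compiled = [(name, re.compile(pattern, re.MULTILINE)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        for name, regex in compiled:
            match = regex.match(text, pos)
            if match and match.end() > pos:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            tokens.append(("Error", text[pos]))
            pos += 1
    return tokens

# Hypothetical, simplified rules (assumptions, not the real c_cpp.py rules).
PREPROC = ("Preproc", r"^[^\S\n]*#[^\n]*")   # must start at the beginning of a line
OTHER = ("Other", r"[^\s#]+")

# Old behavior: the whitespace rule also swallows the line break and the
# indentation after it, so the scanner is no longer at a line start at '#'.
OLD_RULES = [PREPROC, ("Text", r"\s+"), OTHER]

# New behavior: whitespace stops at the line break; the break is its own token.
NEW_RULES = [PREPROC, ("Text", r"[^\S\n]+"), ("Text", r"\n"), OTHER]

source = "x;  \n  #define A\n"
old_tokens = tokenize(source, OLD_RULES)
new_tokens = tokenize(source, NEW_RULES)

print(old_tokens)
print(new_tokens)
```

With the old rules the `#` ends up as an error token, because the line-anchored rule is tried only after the indentation has already been consumed. With the new rules the scanner stops right after the line break, so the anchored rule sees the indentation and the directive together.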

The main change is in pygments/lexers/c_cpp.py. The generic whitespace rule \s+ has been changed to [^\S\n] to avoid matching line breaks. As a consequence of this, many files under tests/examplefiles changed. All of the changes seem to be of the form

'      \n      ' Text

changed to

'      '      Text
'\n'          Text
'      '      Text
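The same split can be reproduced with plain regular expressions. This is only an illustration with Python's `re`; the alternation `[^\S\n]+|\n` is an assumed stand-in for "narrowed whitespace rule plus a line-break rule", not the exact pattern used in the lexer.

```python
import re

# Old-style rule: one pattern that matches any whitespace, line breaks included.
OLD_WHITESPACE = re.compile(r"\s+")

# Assumed stand-in for the new rules: whitespace without line breaks,
# or a single line break as its own match.
NEW_WHITESPACE = re.compile(r"[^\S\n]+|\n")

run = "      \n      "

old_pieces = OLD_WHITESPACE.findall(run)   # one piece spanning the line break
new_pieces = NEW_WHITESPACE.findall(run)   # indentation, break, indentation

print(old_pieces)
print(new_pieces)
```

The text covered is identical in both cases; only the number of tokens changes, which is why the expected outputs under tests/examplefiles had to be regenerated.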

In addition to these, the PR adds three new tests under tests/snippets which test the behavior of the preprocessor tokenizer in different situations. The test tests/snippets/c/test_preproc_file5.txt may be controversial, as it tests the behavior in a situation where the code is invalid and hence the output contains an error token. I'll let the maintainers decide whether it should be kept or removed.

Fixes #1820.

CFamilyLexer failed to tokenize preprocessor macros when they were
preceded by a line break surrounded by spaces. This was the case because
the preprocessor regex rule expected to start at the beginning of the
line, but the space regex rule also matched the whitespace after the line
break. Now the space rule has been refined not to match the line break.
Because of this, the preprocessor regex rule correctly matches
preprocessor tokens even when they are preceded by whitespace, at the
cost of adding some more tokens to the token stream in some cases. This
change preserves the behavior of invalid preprocessor usage failing to
tokenize.
@Anteru Anteru added the changelog-update Items which need to get mentioned in the changelog label Jun 20, 2021
@Anteru Anteru self-assigned this Jun 20, 2021
@Anteru Anteru merged commit fea1fbc into pygments:master Jun 20, 2021
Anteru commented Jun 20, 2021

Merged, thanks!

@Anteru Anteru added this to the 2.10 milestone Jul 18, 2021
@Anteru Anteru added A-lexing area: changes to individual lexers and removed changelog-update Items which need to get mentioned in the changelog labels Aug 8, 2021
Labels
A-lexing area: changes to individual lexers
Development

Successfully merging this pull request may close these issues.

CFamilyLexer fails to tokenize spaces before preprocessor macro
2 participants