markdown lexer fails on fenced code blocks of type raw #1616

Closed
gerner opened this issue Nov 28, 2020 · 3 comments


@gerner gerner commented Nov 28, 2020

for example, the following produces the indicated error:

$ echo '```raw
hi
```' | pygmentize -l md

*** Error while highlighting:
TypeError: cannot use a bytes pattern on a string-like object
   (file "/home/nick/src/pygments/pygments/lexers/special.py", line 87, in get_tokens_unprocessed)
*** If this is a bug you want to report, please rerun with -v.

I believe this happens because the markdown lexer parses the fenced code block, sees the language "raw", and looks up a lexer by that name, which resolves to RawTokenLexer. I don't think RawTokenLexer is meant to be used like a normal lexer (see https://github.com/pygments/pygments/blob/master/pygments/lexers/special.py#L45-L49).

You can see this in the wild using this README.md:

https://github.com/netdata/netdata/blob/master/collectors/perf.plugin/README.md

This repo has nearly 50k stars and over 4.5k forks so it seems pretty reasonable to handle this kind of issue.


@gerner gerner commented Nov 28, 2020

Perhaps the fix here is to not have an alias for RawTokenLexer? (https://github.com/pygments/pygments/blob/master/pygments/lexers/special.py#L58)

I'm not familiar with why this exists, but it's not clear why you'd want to expose RawTokenLexer for filename matching, and letting it be exposed in this way allows it to match as a lexer in a variety of contexts.

Or perhaps it shouldn't have a binary regex? What's the motivation for switching from the string input that the lexer interface takes to bytes? To me (naive about what's going on here), that seems like a violation of the interface.
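To make the string/bytes mismatch concrete, here is a minimal sketch that uses only the standard library `re` module. `LINE_RE` and `match_line` are hypothetical names made up for illustration, not Pygments code; the point is only that a pattern compiled from bytes rejects `str` input with the same kind of TypeError shown in the report.

```python
import re

# Hypothetical stand-in for a lexer's line pattern, compiled from bytes.
LINE_RE = re.compile(rb".*?\n")


def match_line(data):
    """Return the first matched line, or the error text on a str/bytes mismatch."""
    try:
        m = LINE_RE.match(data)
    except TypeError as exc:
        # Raised when a bytes pattern is applied to a str.
        return f"TypeError: {exc}"
    return m.group(0) if m else None


text = "hi\n"
print(match_line(text))                  # str input: the bytes pattern refuses it
print(match_line(text.encode("utf-8")))  # bytes input: the pattern matches
```

Under this sketch, a component with a bytes pattern either needs bytes input or has to convert at its boundary, which is why handing it the `str` that other lexers receive fails.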


@gerner gerner commented Nov 28, 2020

As a workaround, I tried a monkey patch in my application that seemed to work:

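from pygments.lexers._mapping import LEXERS

# Each entry is (module, name, aliases, filenames, mimetypes); clear the aliases.
old_rtl = LEXERS["RawTokenLexer"]
new_rtl = (old_rtl[0], old_rtl[1], (), old_rtl[3], old_rtl[4])
LEXERS["RawTokenLexer"] = new_rtl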

I'm not suggesting this is the fix, but it shows that this does indeed avoid the issue. I believe this makes MarkdownLexer treat the raw fenced code block case the same as it would an unrecognized language.
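A rough, self-contained model of why clearing the aliases helps, assuming the lookup works by scanning each entry's alias tuple. `TOY_LEXERS` and `resolve_fence` are invented for illustration and are not the Pygments implementation; only the five-field tuple layout mirrors the entries patched above.

```python
# Toy registry using the (module, name, aliases, filenames, mimetypes) layout.
# The entries and the helper are illustrative assumptions, not Pygments code.
TOY_LEXERS = {
    "PythonLexer": ("toy.python", "Python", ("python", "py"), ("*.py",), ()),
    "RawTokenLexer": ("toy.special", "Raw token data", ("raw",), (), ()),
}


def resolve_fence(info_string, registry):
    """Return the registry key whose aliases contain info_string, else None."""
    wanted = info_string.strip().lower()
    for key, (_module, _name, aliases, _filenames, _mimetypes) in registry.items():
        if wanted in aliases:
            return key
    return None  # a caller would fall back to plain, unhighlighted text


before = resolve_fence("raw", TOY_LEXERS)

# Apply the same kind of patch as the workaround: empty the alias tuple.
patched = dict(TOY_LEXERS)
old = patched["RawTokenLexer"]
patched["RawTokenLexer"] = (old[0], old[1], (), old[3], old[4])
after = resolve_fence("raw", patched)

print(before, after)
```

In this model the "raw" fence resolves to the special lexer before the patch and to nothing afterwards, so it takes the same path as any unrecognized language.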

I'm not familiar enough with RawTokenLexer to know why it needs to be recognized with the "raw" alias. It seems like this is a pygments internal utility, not meant to be used for general language parsing.


@birkenfeld birkenfeld commented Dec 24, 2020

Thanks for the report. I'm fixing the string/bytes issue for the next 2.7 release, and for 2.8 we'll remove the alias.
