Denial of service with malformed file #1586

Google-Autofuzz opened this issue Oct 26, 2020 · 2 comments

@Google-Autofuzz Google-Autofuzz commented Oct 26, 2020

Running the following code with the latest git version of Pygments on the attached input results in 100% CPU consumption for an arbitrarily long time:

import sys

import pygments
import pygments.formatters
import pygments.lexers

with open(sys.argv[1], 'rb') as f:
    data = f.read()
    lexer = pygments.lexers.guess_lexer(str(data))
    pygments.highlight(str(data), lexer, pygments.formatters.HtmlFormatter())


@Anteru Anteru self-assigned this Oct 26, 2020
@Anteru Anteru added this to the 2.7.3 milestone Oct 26, 2020

@kurtmckee kurtmckee commented Nov 8, 2020

The sample input file causes Pygments to guess that this should be parsed by the SspLexer.

The SspLexer is a delegating lexer that combines two lexers: XmlLexer (which does not choke on the input file) and JspRootLexer. JspRootLexer includes regex patterns from the JavaLexer (which, run on its own, also does not choke on the input file). However, when the JspRootLexer hands text off to the JavaLexer, the quotes appear to be mismatched, and the JavaLexer runs into catastrophic backtracking in its string literal regex.
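To illustrate what catastrophic backtracking in a string literal regex can look like, here is a minimal stdlib-only sketch. The patterns `AMBIGUOUS` and `LINEAR` and the helpers `unterminated` and `time_match` are assumptions made up for this example; they are not quoted from the JavaLexer. In the ambiguous pattern a backslash can be consumed either as half of an escaped pair or by the catch-all `[^"]`, so a run of n backslashes with no closing quote can be split in a number of ways that grows like the Fibonacci numbers, and a backtracking engine tries them all before giving up.

```python
import re
import time

# Illustrative pattern only (an assumption, not quoted from JavaLexer):
# a backslash can match as part of `\\\\` or on its own via `[^"]`.
AMBIGUOUS = re.compile(r'"(\\\\|\\"|[^"])*"')

# Every character can be consumed in exactly one way, so a failed
# match is abandoned after a single linear pass.
LINEAR = re.compile(r'"(?:[^"\\]|\\.)*"')


def unterminated(n):
    """Return an opening quote followed by n backslashes and no closing quote."""
    return '"' + '\\' * n


def time_match(pattern, text):
    """Return (match object or None, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the ambiguous pattern slows down quickly as n grows.
    for n in (16, 20, 24):
        _, slow = time_match(AMBIGUOUS, unterminated(n))
        _, fast = time_match(LINEAR, unterminated(n))
        print(f"n={n}: ambiguous {slow:.4f}s, linear {fast:.6f}s")
```

Both patterns accept the same well-formed literals; they differ only in how much work a failed match costs.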

I used this code to determine where in the file the JspRootLexer is choking up, and it's happening at line 115, right after these tokens:

(3690, Token.Name, 'o')
(3691, Token.Literal.String, '"; print $$0 "')
(3705, Token.Name, 'c')

The code I used was:

import pygments.lexers.templates

with open('timeout-9a00111e78b5cd0979a370fc9a5cd22e39a249e4.txt', 'rb') as f:
    data = f.read()

lexer = pygments.lexers.templates.JspRootLexer()

for i, t, v in lexer.get_tokens_unprocessed(str(data)):
    print((i, t, v))
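    if i == 3705:
        # Assumption: the body of this branch is missing from the source;
        # a debugger breakpoint fits the stepping described below.
        breakpoint()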

After stepping forward in the code for a while, I discovered that everything was hanging at pygments.lexer.RegexLexer.get_tokens_unprocessed():625. I added a print() statement just before that line and re-ran the code above, which helped me identify that it's the regex for string literals in the JavaLexer.

I've expanded that single-line regex into a new lexer state named "string", which resolves the catastrophic backtracking and allows the code provided by the reporter to run without hanging.
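As a rough sketch of the idea of replacing one regex with a dedicated state, here is a stdlib-only toy tokenizer whose rule table loosely imitates the shape of a Pygments RegexLexer. Everything in it (`STATES`, `tokenize`, the token names, and the individual rules) is an assumption for illustration and is not the code that was merged. Each rule in the "string" state consumes a small, unambiguous piece of input, so there is no repeated alternation for the engine to backtrack through.

```python
import re

# Hypothetical state table in the style of a RegexLexer; the rules are
# assumptions for illustration, not the ones from the actual fix.
STATES = {
    "root": [
        (re.compile(r'"'), "String", "string"),    # opening quote: enter "string"
        (re.compile(r'[^"]+'), "Text", None),      # anything up to the next quote
    ],
    "string": [
        (re.compile(r'[^\\"]+'), "String", None),  # run of ordinary characters
        (re.compile(r'\\.'), "String", None),      # backslash plus one character
        (re.compile(r'\\'), "String", None),       # bare backslash (end of input or before a newline)
        (re.compile(r'"'), "String", "#pop"),      # closing quote: leave the state
    ],
}


def tokenize(text):
    """Yield (index, token_name, value) tuples using the state table above."""
    stack = ["root"]
    pos = 0
    while pos < len(text):
        for pattern, token, action in STATES[stack[-1]]:
            m = pattern.match(text, pos)
            if m:
                yield pos, token, m.group()
                pos = m.end()
                if action == "#pop":
                    stack.pop()
                elif action is not None:
                    stack.append(action)
                break
        else:
            # No rule matched: emit one character as an error and move on.
            yield pos, "Error", text[pos]
            pos += 1


if __name__ == "__main__":
    for item in tokenize('x = "a\\"b";'):
        print(item)
```

Because every rule advances by at least one character and none of them nests a quantifier around an alternation, an unterminated string full of backslashes is tokenized in a single pass instead of hanging.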

I'm working on a unit test for this and then I can submit a PR to close this issue.

kurtmckee added a commit to kurtmckee/pygments that referenced this issue Nov 9, 2020
@Anteru Anteru closed this in #1594 Nov 9, 2020
Anteru pushed a commit that referenced this issue Nov 9, 2020
* JavaLexer: Demonstrate a catastrophic backtracking bug

* JavaLexer: Fix a catastrophic backtracking bug

Closes #1586

@Anteru Anteru commented Nov 9, 2020

Thanks a lot for the fix!
