
extremely poor performance on certain markdown file #1617

Closed
gerner opened this issue Nov 28, 2020 · 4 comments


@gerner gerner commented Nov 28, 2020

This particular markdown takes a long time (21 seconds on my laptop) to parse:
https://github.com/date-fns/date-fns/blob/a9fc0c7b715883349555bfb94daa1059430eda52/src/locale/en-US/snapshot.md

$ time pygmentize -f terminal -l md -o /dev/null /tmp/snapshot.md

real    0m21.318s
user    0m21.306s
sys     0m0.012s

I've seen slow-ish parsing performance on markdown with tables before; however, I'm not certain that tables are causing the issue here.


@gerner gerner commented Nov 28, 2020

I do see Error tokens showing up for '\n' on its own line, which I think is because the rule here doesn't match a bare newline:

https://github.com/pygments/pygments/blob/master/pygments/lexers/markup.py#L601

That seems like a separate issue. If I add a rule to match r'\n', most of the errors go away, but that doesn't speed up processing at all.
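The Error-token behavior described above can be sketched with a toy lexer loop (an illustration, not Pygments' actual RegexLexer): when no rule matches the character at the current position, the lexer emits an Error token and advances one character, so a bare '\n' that no rule covers shows up as Error.

```python
import re

def lex(text, rules):
    """Toy lexer loop: try each rule at the current position; if none
    matches, emit an Error token for one character and move on."""
    compiled = [(re.compile(p), tok) for p, tok in rules]
    pos, out = 0, []
    while pos < len(text):
        for pattern, token in compiled:
            m = pattern.match(text, pos)
            if m:
                out.append((token, m.group()))
                pos = m.end()
                break
        else:
            out.append(('Error', text[pos]))  # no rule matched this char
            pos += 1
    return out

without_nl = [(r'[^\n]+', 'Text')]               # newline falls through
with_nl = [(r'\n', 'Text'), (r'[^\n]+', 'Text')]  # added r'\n' rule

tokens_bad = lex('one\n\ntwo\n', without_nl)
tokens_ok = lex('one\n\ntwo\n', with_nl)
print(tokens_bad)  # newlines come out as Error tokens
print(tokens_ok)   # everything is Text
```

As the comment notes, adding the r'\n' rule only changes which token type the newlines get; it doesn't change how much text is scanned, which is why it has no effect on speed.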


@gerner gerner commented Nov 28, 2020

I think tables are a red herring. I pulled out the table rules and performance didn't change.

However, this rule looks expensive:

# strikethrough
(r'([^~]*)(~~[^~]+~~)', bygroups(Text, Generic.Deleted)),

I don't know why, but if I comment it out the file is processed in 400ms, down from 21s. Note that this file doesn't contain the character "~" anywhere.

It looks like the leading ([^~]*) group is the culprit. Changing the rule to this gets the same improvement in performance (400ms, down from 21s):

# strikethrough
(r'(~~[^~]+~~)', bygroups(Generic.Deleted)),

I don't think we need that group, since there's already a catch-all single-character rule that will eat up text characters that don't match any other rule.
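The quadratic behavior can be sketched with a simplified stand-in for the lexer loop (an assumption about the mechanism, not the actual Pygments code): the lexer retries its rules at every position, and at each position the leading [^~]* scans all the way to the end of a tilde-free text before the rule can fail, giving roughly O(n²) work overall.

```python
import re
import time

slow = re.compile(r'([^~]*)(~~[^~]+~~)')  # original strikethrough rule
fast = re.compile(r'(~~[^~]+~~)')         # simplified rule

text = 'a' * 5000  # tilde-free input, like snapshot.md

def scan_time(pattern):
    """Try the rule at every position, advancing one character on each
    failure (what the catch-all single-character rule effectively does)."""
    start = time.perf_counter()
    pos = 0
    while pos < len(text):
        if pattern.match(text, pos):
            break
        pos += 1
    return time.perf_counter() - start

t_slow = scan_time(slow)  # [^~]* scans to end of text at every position
t_fast = scan_time(fast)  # fails immediately at every position
print(f'slow rule: {t_slow:.4f}s, simplified rule: {t_fast:.4f}s')
```

On input that does contain "~~", both patterns still capture the same strikethrough span, which is consistent with the unchanged pygmentize output noted below.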

Also, why the heavy use of bygroups throughout?


@gerner gerner commented Nov 28, 2020

Note: if I change the rule as suggested above, the output of pygmentize doesn't change and all test cases still pass. It just takes 400ms instead of 21s.

@gerner gerner changed the title extremely poor performance on table-heavy markdown file extremely poor performance on certain markdown file Nov 28, 2020

@gerner gerner commented Jan 6, 2021

This is fixed by #1623

@gerner gerner closed this Jan 6, 2021
@Anteru Anteru added this to the 2.7.4 milestone Jan 6, 2021