
Speed up JSON and reduce HTML formatter consumption #1569

Merged
8 commits merged into pygments:master on Oct 26, 2020

Conversation

@kurtmckee (Contributor) commented Oct 11, 2020

In #1425, the author of an app that depends on Pygments reported slowness when running Pygments against large JSON files.

I investigated by generating a 118MB JSON file as input, using the inputs reported by the author of that app. I found that the regex-based parser's .get_tokens_unprocessed() took ~63 seconds to lex the entire file. I then rewrote the parser as hand-written Python code and found that the time dropped to ~22 seconds.
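The rewritten lexer itself isn't shown in the comment, but the general technique of replacing a regex state machine with a hand-written character scanner can be sketched like this (a simplified, hypothetical scanner for a tiny subset of JSON, not the actual Pygments code; token names are illustrative):

```python
# Hypothetical sketch: a hand-written scanner for a small subset of JSON,
# illustrating the style of lexer that can replace a regex state machine.
# Structure and token names are simplified; this is NOT the Pygments code.

def get_tokens_unprocessed(text):
    """Yield (index, token_name, value) tuples by scanning characters directly."""
    length = len(text)
    pos = 0
    while pos < length:
        char = text[pos]
        if char in ' \t\r\n':
            # Coalesce a whole run of whitespace into one token.
            start = pos
            while pos < length and text[pos] in ' \t\r\n':
                pos += 1
            yield start, 'Whitespace', text[start:pos]
        elif char in '{}[]:,':
            # Coalesce adjacent punctuation as well, reducing yields.
            start = pos
            while pos < length and text[pos] in '{}[]:,':
                pos += 1
            yield start, 'Punctuation', text[start:pos]
        elif char == '"':
            # Scan a string, honoring backslash escapes.
            start = pos
            pos += 1
            while pos < length:
                if text[pos] == '\\':
                    pos += 2
                elif text[pos] == '"':
                    pos += 1
                    break
                else:
                    pos += 1
            yield start, 'String', text[start:pos]
        else:
            # Everything else (numbers, true/false/null) lumped together here.
            start = pos
            while pos < length and text[pos] not in ' \t\r\n{}[]:,"':
                pos += 1
            yield start, 'Other', text[start:pos]
```

Because each character is examined at most a couple of times, a loop like this avoids the per-token regex dispatch overhead that dominates on very large inputs.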

I also found that Pygments was consuming ~3GB of memory when formatting the 118MB JSON file as HTML. It appears that the buffered file I/O caches everything in memory and only writes to the file all at once as Pygments exits. While I wasn't able to stop the buffering from holding the entire file in memory, I was able to shave off an entire gigabyte of memory consumption by caching the opening span tags that are generated per token (like <span class="s2"> or <span style="whatever">). I didn't find similar memory consumption issues in the Terminal256 formatter.
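The span-caching idea described above can be sketched as a dictionary keyed by CSS class, so the markup string is built once per class rather than once per token (names here are hypothetical, not the real HtmlFormatter internals):

```python
# Hypothetical sketch of caching opening <span> markup per CSS class.
# Without a cache, a 118MB input builds millions of short, identical
# strings; with one, each distinct class is built exactly once.
# Illustrative only -- not the actual Pygments HtmlFormatter code.

class SpanCache:
    def __init__(self):
        self._cache = {}

    def open_span(self, css_class):
        try:
            return self._cache[css_class]
        except KeyError:
            markup = '<span class="%s">' % css_class
            self._cache[css_class] = markup
            return markup
```

Since lexers emit a small, fixed set of token classes, the cache stays tiny while every repeated token reuses the same string object.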

For the Terminal256 formatter, this patch cuts the total runtime almost in half, dropping from 2:02 to 1:09 total processing time (as measured by PowerShell's Measure-Command).

For the HTML formatter, the gains are more significant:

  • Memory consumption drops by ~33%
  • The HTML formatter with default options drops from 1:56 to 1:04 total processing time.
  • The HTML formatter with the noclasses option drops from 2:59 to 1:03 total processing time.

All output from the HTML and Terminal256 formatters is 100% byte-for-byte identical between master branch and this branch.
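A byte-for-byte comparison like the one claimed above can be verified with the standard library (a generic sketch; the comment doesn't say how the author actually compared the outputs):

```python
# Generic sketch: verify two formatter output files are byte-for-byte
# identical. Not necessarily the method the author used.
import filecmp

def outputs_identical(path_a, path_b):
    # shallow=False forces a real content comparison rather than
    # trusting os.stat() metadata alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```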

kurtmckee added 5 commits Oct 11, 2020
Changes in this patch:

* Update the JSON-LD URL to HTTPS
* Update the list of JSON-LD keywords
* Make the JSON-LD parser less dependent on the JSON lexer implementation
* Add unit tests for the JSON-LD lexer
This includes:

* Testing valid literals
* Testing valid string escapes
* Testing that object keys are tokenized differently from string values
Related to pygments#1425

Included in this change:

* The JSON parser is rewritten
* The JSON bare object parser no longer requires additional code
* `get_tokens_unprocessed()` returns as much as it can to reduce yields
  (for example, side-by-side punctuation is not returned separately)
* The unit tests were updated
* Add unit tests based on Hypothesis test results
Related to pygments#1425
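The "reduce yields" point in the commit message — returning side-by-side punctuation as one token instead of several — amounts to merging adjacent tokens of the same type. A generic sketch of that technique (not the lexer's actual code):

```python
# Hypothetical sketch: merge adjacent tokens of the same type into a
# single (index, token_type, value) tuple, so e.g. "}]" is yielded once
# rather than as two separate punctuation tokens. Illustrative only.
from itertools import groupby

def merge_adjacent(tokens):
    """tokens: iterable of (index, token_type, value) tuples."""
    for token_type, group in groupby(tokens, key=lambda t: t[1]):
        group = list(group)
        yield group[0][0], token_type, ''.join(t[2] for t in group)
```

Fewer yields means fewer generator round-trips and fewer per-token formatter calls, which is where much of the time goes on a 118MB input.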

Tested on a 118MB JSON file. Memory consumption tops out at ~3GB before
this patch and drops to only ~2GB with this patch. These were the command
lines used:

python -m pygments -l json -f html -o .\new-code-classes.html .\jc-output.txt
python -m pygments -l json -f html -O "noclasses" -o .\new-code-styles.html .\jc-output.txt
…ting

For a 118MB JSON input file, this reduces memory consumption by ~500MB
and reduces formatting time by ~15 seconds.
@kurtmckee (Contributor, PR author) commented Oct 11, 2020

I added an LRU cache that further reduces the HTML formatter's memory consumption and processing time. The total run-time for the 118MB JSON test has dropped to ~51 seconds and the memory consumption has dropped to ~1.7GB (down from ~3.2GB).
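The LRU-cache approach can be sketched with the standard library's functools.lru_cache; the decorator is real, but the function and its name below are illustrative, not the formatter's actual code:

```python
# Hypothetical sketch of memoizing span markup with functools.lru_cache.
# The bounded cache means memory stays flat even if an adversarial input
# produced many distinct classes; typical inputs reuse a handful of them.
from functools import lru_cache

@lru_cache(maxsize=100)
def span_for_class(css_class):
    return '<span class="%s">' % css_class
```

As @birkenfeld notes later in the thread, this pays off most for languages with huge numbers of small, similar tokens (braces, commas, quotes), where nearly every lookup is a cache hit.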

@Anteru (Collaborator) left a review comment:

Ok, that's a quite impressive JSON parser there :) Well done -- my only concern is reduced test coverage. Yes, the parser is much less prone to some problems, and running those tests won't yield any new insights, but it ensures we don't regress. Unless a test is invalid, it should be kept around.

Review thread on tests/test_data.py (resolved)
@Anteru Anteru added this to the 2.8 milestone Oct 24, 2020
@Anteru Anteru self-assigned this Oct 24, 2020
@kurtmckee kurtmckee requested a review from Anteru Oct 25, 2020
@Anteru (Collaborator) approved these changes Oct 25, 2020 and left a comment:

I wasn't aware of pytest-timeout, but it does indeed sound very useful, given backtracking is typically going to time out. @birkenfeld Any concerns with adding pytest-timeout as a dependency?
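For context, pytest-timeout fails any test that exceeds a time limit, which is how a catastrophic-backtracking regression would surface as a test failure instead of a hung CI run. A minimal configuration might look like this (the `timeout` option is provided by the plugin; the value is just an example):

```ini
; pytest.ini -- illustrative pytest-timeout configuration
[pytest]
timeout = 60
```

Individual tests can override the global limit with the plugin's `@pytest.mark.timeout(...)` marker.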

Review thread on pygments/lexers/data.py (resolved)
@Anteru Anteru removed this from the 2.8 milestone Oct 26, 2020
@Anteru Anteru added this to the 2.7.3 milestone Oct 26, 2020
@Anteru Anteru merged commit 164dcb5 into pygments:master Oct 26, 2020
13 checks passed
@Anteru (Collaborator) commented Oct 26, 2020

Thanks a lot, excellent work!

@kurtmckee (Contributor, PR author) commented Oct 26, 2020

@birkenfeld (Contributor) commented Oct 28, 2020

@kurtmckee thanks a lot! I am surprised that the lru_cache made such a big difference. I guess it's kind of specific for some languages with lots of similar small tokens like braces etc.

@gerner (Contributor) commented Nov 20, 2020

@kurtmckee nice work. I'm looking forward to using the parser and I'm glad that the backtracking test is still in there :) I can't wait for this to get released.
