Speed up JSON and reduce HTML formatter consumption #1569
Conversation
Changes in this patch:

* Update the JSON-LD URL to HTTPS
* Update the list of JSON-LD keywords
* Make the JSON-LD parser less dependent on the JSON lexer implementation
* Add unit tests for the JSON-LD lexer
This includes:

* Testing valid literals
* Testing valid string escapes
* Testing that object keys are tokenized differently from string values
Related to pygments#1425. Included in this change:

* The JSON parser is rewritten
* The JSON bare object parser no longer requires additional code
* `get_tokens_unprocessed()` returns as much text as possible in each yield to reduce the number of yields (for example, side-by-side punctuation is not returned separately; see the sketch after this list)
* The unit tests were updated
* Unit tests based on Hypothesis test results were added
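To illustrate the batching idea, here is a minimal sketch of a hand-written `get_tokens_unprocessed()` that yields runs of adjacent characters as single tokens. This is not the actual Pygments implementation; the real lexer also handles strings, numbers, keywords, and whitespace separately.

```python
from pygments.token import Punctuation, Text

PUNCTUATION = set('{}[],:')

def get_tokens_unprocessed(text):
    # Sketch only: batch consecutive punctuation (and consecutive
    # non-punctuation) into one yield each, instead of yielding
    # one token per character as a naive loop would.
    pos = 0
    while pos < len(text):
        start = pos
        if text[pos] in PUNCTUATION:
            while pos < len(text) and text[pos] in PUNCTUATION:
                pos += 1
            yield start, Punctuation, text[start:pos]
        else:
            while pos < len(text) and text[pos] not in PUNCTUATION:
                pos += 1
            yield start, Text, text[start:pos]
```

For input like `}],` this yields a single `Punctuation` token rather than three, cutting the generator overhead roughly in proportion to how much adjacent punctuation the input contains.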
Related to pygments#1425. Tested on a 118MB JSON file: memory consumption tops out at ~3GB before this patch and drops to only ~2GB with this patch. These were the command lines used:

```
python -m pygments -l json -f html -o .\new-code-classes.html .\jc-output.txt
python -m pygments -l json -f html -O "noclasses" -o .\new-code-styles.html .\jc-output.txt
```
…ting
For a 118MB JSON input file, this reduces memory consumption by ~500MB and reduces formatting time by ~15 seconds.
I added an LRU cache that further reduces the HTML formatter's memory consumption and processing time. The total run-time for the 118MB JSON test has dropped to ~51 seconds and the memory consumption has dropped to ~1.7GB (down from ~3.2GB).
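As a rough sketch of the idea (with assumed names, not the actual patch), the formatter can memoize the opening `<span>` tag it generates for each token class, so the millions of identical small tokens in a large JSON file share one cached string instead of rebuilding it every time:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def span_open(css_class):
    # Hypothetical helper: build the opening tag once per distinct
    # class and reuse the cached string for every later occurrence.
    return '<span class="%s">' % css_class
```

Because a large JSON document is dominated by a handful of token classes (punctuation, strings, numbers), the cache stays tiny while eliminating most of the repeated string construction.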
Ok, that's quite an impressive JSON parser there :) Well done -- my only concern is reduced test coverage. Yes, the new parser is much less prone to some problems, and running those tests won't yield any new insights, but keeping them ensures we don't regress. Unless a test is invalid, it should be kept around.
I wasn't aware of `pytest-timeout`, but it does indeed sound very useful, given that backtracking is typically going to time out. @birkenfeld Any concerns with adding `pytest-timeout` as a dependency?
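For reference, `pytest-timeout` works by marking individual tests (or setting a global `--timeout` option). A hedged sketch of how a backtracking regression test might use it -- the input here is illustrative, not the actual test case:

```python
import pytest
from pygments.lexers import JsonLexer

@pytest.mark.timeout(10)  # fail the test if it runs longer than 10 seconds
def test_pathological_input_does_not_hang():
    # Illustrative pathological input; the real regression test differs.
    text = '{"key": "' + '\\u' * 5000
    list(JsonLexer().get_tokens(text))
```

With the old regex-based parser, catastrophic backtracking on input like this would simply hang the suite; the timeout turns that hang into a clean test failure.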
Thanks a lot, excellent work!
Thanks! I really had a blast working on this. =)
@kurtmckee thanks a lot! I am surprised that the lru_cache made such a big difference. I guess it's kind of specific to languages with lots of similar small tokens, like braces etc.
@kurtmckee nice work. I'm looking forward to using the parser and I'm glad that the backtracking test is still in there :) I can't wait for this to get released.
In #1425, the author of an app that depends on Pygments reported slowness when running Pygments against large JSON files.
I investigated by generating a 118MB JSON file as input, using the inputs reported by the author of that app. I found that the regex parser's `.get_tokens_unprocessed()` took ~63 seconds to lex the entire file. I then rewrote the parser in Python and found that the time was reduced to only ~22 seconds.

I also found that Pygments was consuming ~3GB of memory when formatting the 118MB JSON file as HTML. It appears that the buffered file I/O is caching everything in memory before finally writing to the file all at once as Pygments exits. While I wasn't able to stop the buffering from waiting until the entire file was in memory, I was able to shave off an entire gigabyte of memory consumption by caching the opening span tags that are generated per token (like `<span class="s2">` or `<span style="whatever">`). I didn't find similar memory consumption issues in the Terminal256 formatter.

For the Terminal256 formatter, this patch cuts the total runtime almost in half, dropping from 2:02 to 1:09 total processing time (as measured by PowerShell's `Measure-Command`).
For the HTML formatter, the gains are more significant: the `noclasses` option drops from 2:59 to 1:03 total processing time.

All output from the HTML and Terminal256 formatters is 100% byte-for-byte identical between the master branch and this branch.
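One simple way to check the byte-for-byte claim (an assumed workflow, not part of the patch) is to hash the output produced on each branch and compare the digests; the file names here are hypothetical:

```python
import hashlib

def sha256sum(path):
    # Hash the file in chunks so a 118MB output doesn't need to fit in memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Outputs generated by running the same command on each branch.
assert sha256sum('master-output.html') == sha256sum('branch-output.html')
```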