ML-ULex's memoization causes massive performance penalties for JSON parsing #284
Closed
Labels
fixed-in-110.99.5 (issues that will be fixed in the 110.99.5 version)
json-lib (issue with JSON component of SML/NJ Library)
ml-ulex
performance-bug (something works, but is very slow)
Version: 110.99.4 (Latest)
Operating System:
OS Version: No response
Processor:
System Component: SML/NJ Library
Severity: Minor
Description
The lexer generated by the ml-ulex tool provides a memoization feature. The JSONParser structure uses a lexer generated in this way in its JSON parsing interface. However, this feature incurs a massive performance penalty, which is especially unreasonable in cases where the memoization is never used (such as in a one-pass JSON parser).
Because of this feature, large JSON files cannot be parsed in a reasonable amount of time.
Removing the memoization code produced a 24-54x speedup in JSON parsing in my testing.
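To make the overhead concrete, the generated lexer follows roughly the pattern sketched below. This is a simplified illustration, not the actual ml-ulex output: the token type, the stream representation, and the innerLex / lex functions here are placeholders.

```sml
(* Simplified sketch of the memoizing-lexer pattern; the token type,
 * stream representation, and innerLex below are placeholders, not the
 * code that ml-ulex actually generates.
 *)
datatype token = LBRACE | RBRACE | COMMA | STRING of string | EOF

(* every stream node carries a ref cell that caches the token lexed at
 * that position, so that re-lexing from the same position is cheap
 *)
datatype strm = STRM of char list * (token * strm) option ref

fun mkStream cs = STRM (cs, ref NONE)

(* stand-in for the real DFA-driven inner lexer *)
fun innerLex [] = (EOF, [])
  | innerLex (#"{" :: rest) = (LBRACE, rest)
  | innerLex (#"}" :: rest) = (RBRACE, rest)
  | innerLex (#"," :: rest) = (COMMA, rest)
  | innerLex (_ :: rest) = (STRING "...", rest)

(* memoizing wrapper: on a cache miss, lex the next token, allocate a
 * fresh node with its own cache cell, and record the result; on a hit,
 * return the cached result
 *)
fun lex (STRM (cs, memo)) = (case !memo
       of SOME result => result
        | NONE => let
            val (tok, rest) = innerLex cs
            val result = (tok, mkStream rest)
            in
              memo := SOME result;
              result
            end
      (* end case *))
```

A one-pass parser such as JSONParser reads each position exactly once, so the per-token allocation and mutation for the cache is pure overhead.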
Transcript
Expected Behavior
The parsing of large JSON files should be competitive with other systems.
For example, the JSON parsing library that ships with Python 3.8 parses the file in 1.14 seconds on my machine, whereas the same file took 1656.23 seconds in SML with memoization enabled and 67.68 seconds with memoization disabled.
Steps to Reproduce
1. Generate a large JSON file from an existing one, for example:
   jq '{a: ., b: ., c: ., d: .}' data.json > huge.json
2. Call the JSONParser.parseFile function on it (see the sketch after these steps).
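Something along these lines can be used from the SML/NJ REPL to reproduce the measurement (huge.json is the file produced above; the timeParse helper is ad hoc, not part of the library):

```sml
(* load the SML/NJ Library's JSON component *)
CM.make "$/json-lib.cm";

(* parse a file and report the elapsed wall-clock time *)
fun timeParse file = let
      val timer = Timer.startRealTimer ()
      val value = JSONParser.parseFile file
      in
        print (concat [
            file, ": ", Time.toString (Timer.checkRealTimer timer), " seconds\n"
          ]);
        value
      end;

val _ = timeParse "huge.json";
```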
Additional Information
There is an easy modification that can be made to the generated lexer code.
However, the json.lex.sml file is auto-generated, so this cannot be the ultimate fix to the problem. Possible solutions would be:
- an ml-ulex flag that disables memoization
- a separate ml-ulex template file for uses like this one, where the lexer is never backtracked
- in any case, the use of IntInf.int here is not a good idea

The modified code (this only affects the final ~30 lines of the generated file):
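As a rough illustration of the idea only (expressed in terms of the simplified sketch above, not the actual generated code; STRM, innerLex, and mkStream are the placeholders from that sketch, and lexNoMemo is hypothetical), the change amounts to bypassing the memo cell and calling the inner lexer directly:

```sml
(* sketch of the idea only, not the generated code: call the inner
 * lexer directly; the memo cell is never read or written
 *)
fun lexNoMemo (STRM (cs, _)) = let
      val (tok, rest) = innerLex cs
      in
        (tok, mkStream rest)
      end
```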
Email address
skyler DOT soss AT gmail.com