ML-ULex's memoization causes massive performance penalties for JSON parsing #284

Skyb0rg007 · 2023-10-10T17:11:46Z

Version

110.99.4 (Latest)

Operating System

OS Version

No response

Processor

System Component

SML/NJ Library

Severity

Minor

Description

The lexer generated by the ml-ulex tool provides a memoization feature.
The JSONParser structure uses a lexer generated in this way in its JSON parsing interface.
However this feature incurs a massive performance penalty, which is especially unreasonable in cases where the memoization is not used (such as a one-pass JSON parser).

Because of this feature, large JSON files are unable to be parsed in a reasonable amount of time.
Removing the memoization code caused a 24-54x speedup in JSON parsing my testing.

Transcript

$ CM.make "$/json-lib.cm";
$ val t = Timer.startCPUTimer (); val _ = JSONParser.parseFile "huge.json"; Timer.checkCPUTimer t;
val it = {sys=TIME {usec=191548620},usr=TIME {usec=1656225856}} : {sys:Time.time,usr:Time.time}

$ CM.make "fixed-json-lib.cm";
$ val t = Timer.startCPUTimer (); val _ = JSONParser.parseFile "huge.json"; Timer.checkCPUTimer t;
val it = {sys=TIME {usec=7443630},usr=TIME {usec=67678371}} : {sys:Time.time,usr:Time.time}

Expected Behavior

The parsing of large JSON files should be competitive with other systems.
For example, the JSON parsing library that ships with Python 3.8 finishes in 1.14 seconds on my machine.
The same file took 1656.23 seconds in SML with memoization enabled, and 67.68 seconds with memoization disabled.

Steps to Reproduce

Generate a big JSON file
- I downloaded a large file from here
- To make it even larger for testing, I ran: jq '{a: ., b: ., c: ., d: .}' data.json > huge.json
Run the JSONParser.parseFile function on it
Wait an exorbitant amount of time (took ~30 min on my machine)

Additional Information

There is an easy modification that can be made to the generated lexer code.
However, the json.lex.sml file is auto-generated, so this cannot be the ultimate fix to the problem.
Possible solutions would be:

Convert the lexer into a hand-written one
Add a ml-ulex flag which disabled memoization
Write a custom ml-ulex template file to be used for these kinds of uses, where the lexer is never backtracked
Release a new version of the JSON library which addresses this issue
- there are other issues such as 271 to fix here, which may cause incompatibilities
- Unconditionally parsing integers into IntInf.int is not a good idea
  - That behavior was made default in Python 3.11 to address DOS attacks, and has been available in OCaml's Yojson
- The JSON parsing infrastructure is still very slow even after this change, and there are likely more opportunities to further improve performance

The modified code (this only affects the final ~30 lines of the generated file):

type strm = ULexBuffer.stream * yystart_state

fun lex sm (yystrm, ss) = let
   val (tok, span, yystrm', ss') = innerLex (yystrm, ss, sm)
   in (tok, span, (yystrm', ss'))
   end

(* Just remove references to `STRM` and `ref`s *)
fun streamify input = (yystreamify' 0 input, INITIAL)
fun getPos (strm, _) = ULexBuffer.getpos strm

Email address

skyler DOT soss AT gmail.com

The text was updated successfully, but these errors were encountered:

JohnReppy · 2024-03-14T16:15:47Z

Rewrote the JSON parsers to work directly on the input source (instead of using a ML-ulex lexer. This change speeds up parsing by about a factor of eight on the data.json example, which is a 13.7Mb file. The speedup appears to be greater for huge.json.

Skyb0rg007 added the bug Something isn't working label Oct 10, 2023

JohnReppy added ml-ulex json-lib Issue with JSON component of SML/NJ Library performance-bug something works, but is very slow and removed bug Something isn't working labels Oct 11, 2023

JohnReppy self-assigned this Oct 11, 2023

JohnReppy changed the title ~~ML-ULex's memozation causes massive performance penalties~~ ML-ULex's memozation causes massive performance penalties for JSON parsing Mar 14, 2024

JohnReppy added the fixed-in-110.99.5 issues that will be fixed in the 110.99.5 version label Mar 14, 2024

JohnReppy closed this as completed Mar 14, 2024

JohnReppy changed the title ~~ML-ULex's memozation causes massive performance penalties for JSON parsing~~ ML-ULex's memoization causes massive performance penalties for JSON parsing Mar 14, 2024

Skyb0rg007 mentioned this issue Jul 12, 2024

JSONParser: fix to unicode escape handling #318

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ML-ULex's memoization causes massive performance penalties for JSON parsing #284

ML-ULex's memoization causes massive performance penalties for JSON parsing #284

Skyb0rg007 commented Oct 10, 2023 •

edited

Loading

JohnReppy commented Mar 14, 2024

ML-ULex's memoization causes massive performance penalties for JSON parsing #284

ML-ULex's memoization causes massive performance penalties for JSON parsing #284

Comments

Skyb0rg007 commented Oct 10, 2023 • edited Loading

Version

Operating System

OS Version

Processor

System Component

Severity

Description

Transcript

Expected Behavior

Steps to Reproduce

Additional Information

Email address

JohnReppy commented Mar 14, 2024

Skyb0rg007 commented Oct 10, 2023 •

edited

Loading