Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ML-ULex's memoization causes massive performance penalties for JSON parsing #284

Closed
2 of 12 tasks
Skyb0rg007 opened this issue Oct 10, 2023 · 1 comment
Closed
2 of 12 tasks
Assignees
Labels
fixed-in-110.99.5 issues that will be fixed in the 110.99.5 version json-lib Issue with JSON component of SML/NJ Library ml-ulex performance-bug something works, but is very slow

Comments

@Skyb0rg007
Copy link
Contributor

Skyb0rg007 commented Oct 10, 2023

Version

110.99.4 (Latest)

Operating System

  • Any
  • Linux
  • macOS
  • Windows
  • Other Unix

OS Version

No response

Processor

  • Any
  • Arm (using Rosetta)
  • PowerPC
  • Sparc
  • x86 (32-bit)
  • x86-64 (64-bit)
  • Other

System Component

SML/NJ Library

Severity

Minor

Description

The lexer generated by the ml-ulex tool provides a memoization feature.
The JSONParser structure uses a lexer generated in this way in its JSON parsing interface.
However this feature incurs a massive performance penalty, which is especially unreasonable in cases where the memoization is not used (such as a one-pass JSON parser).

Because of this feature, large JSON files are unable to be parsed in a reasonable amount of time.
Removing the memoization code caused a 24-54x speedup in JSON parsing my testing.

Transcript

$ CM.make "$/json-lib.cm";
$ val t = Timer.startCPUTimer (); val _ = JSONParser.parseFile "huge.json"; Timer.checkCPUTimer t;
val it = {sys=TIME {usec=191548620},usr=TIME {usec=1656225856}} : {sys:Time.time,usr:Time.time}

$ CM.make "fixed-json-lib.cm";
$ val t = Timer.startCPUTimer (); val _ = JSONParser.parseFile "huge.json"; Timer.checkCPUTimer t;
val it = {sys=TIME {usec=7443630},usr=TIME {usec=67678371}} : {sys:Time.time,usr:Time.time}

Expected Behavior

The parsing of large JSON files should be competitive with other systems.
For example, the JSON parsing library that ships with Python 3.8 finishes in 1.14 seconds on my machine.
The same file took 1656.23 seconds in SML with memoization enabled, and 67.68 seconds with memoization disabled.

Steps to Reproduce

  1. Generate a big JSON file
    • I downloaded a large file from here
    • To make it even larger for testing, I ran: jq '{a: ., b: ., c: ., d: .}' data.json > huge.json
  2. Run the JSONParser.parseFile function on it
  3. Wait an exorbitant amount of time (took ~30 min on my machine)

Additional Information

There is an easy modification that can be made to the generated lexer code.
However, the json.lex.sml file is auto-generated, so this cannot be the ultimate fix to the problem.
Possible solutions would be:

  • Convert the lexer into a hand-written one
  • Add a ml-ulex flag which disabled memoization
  • Write a custom ml-ulex template file to be used for these kinds of uses, where the lexer is never backtracked
  • Release a new version of the JSON library which addresses this issue
    • there are other issues such as 271 to fix here, which may cause incompatibilities
    • Unconditionally parsing integers into IntInf.int is not a good idea
    • The JSON parsing infrastructure is still very slow even after this change, and there are likely more opportunities to further improve performance

The modified code (this only affects the final ~30 lines of the generated file):

type strm = ULexBuffer.stream * yystart_state

fun lex sm (yystrm, ss) = let
   val (tok, span, yystrm', ss') = innerLex (yystrm, ss, sm)
   in (tok, span, (yystrm', ss'))
   end

(* Just remove references to `STRM` and `ref`s *)
fun streamify input = (yystreamify' 0 input, INITIAL)
fun getPos (strm, _) = ULexBuffer.getpos strm

Email address

skyler DOT soss AT gmail.com

@Skyb0rg007 Skyb0rg007 added the bug Something isn't working label Oct 10, 2023
@JohnReppy JohnReppy added ml-ulex json-lib Issue with JSON component of SML/NJ Library performance-bug something works, but is very slow and removed bug Something isn't working labels Oct 11, 2023
@JohnReppy JohnReppy self-assigned this Oct 11, 2023
@JohnReppy JohnReppy changed the title ML-ULex's memozation causes massive performance penalties ML-ULex's memozation causes massive performance penalties for JSON parsing Mar 14, 2024
@JohnReppy JohnReppy added the fixed-in-110.99.5 issues that will be fixed in the 110.99.5 version label Mar 14, 2024
@JohnReppy
Copy link
Contributor

Rewrote the JSON parsers to work directly on the input source (instead of using a ML-ulex lexer. This change speeds up parsing by about a factor of eight on the data.json example, which is a 13.7Mb file. The speedup appears to be greater for huge.json.

@JohnReppy JohnReppy changed the title ML-ULex's memozation causes massive performance penalties for JSON parsing ML-ULex's memoization causes massive performance penalties for JSON parsing Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed-in-110.99.5 issues that will be fixed in the 110.99.5 version json-lib Issue with JSON component of SML/NJ Library ml-ulex performance-bug something works, but is very slow
Projects
None yet
Development

No branches or pull requests

2 participants