Second try at specialized JSON: RapidJSON + custom assembly #1165

Merged
merged 19 commits into from Nov 25, 2021

jpivarski
Member

No description provided.

@codecov

codecov bot commented Nov 24, 2021

Codecov Report

Merging #1165 (036e9a3) into main (e4b6640) will decrease coverage by 0.03%.
The diff coverage is 87.85%.

Impacted Files Coverage Δ
...ward/_v2/operations/convert/ak_from_json_schema.py 85.54% <87.85%> (-0.51%) ⬇️

@jpivarski
Member Author

This replaces #1162.

Forth may be good for binary formats (especially unruly ones like ROOT, where you really need the Turing completeness!), but parsing JSON just required too many instructions (at 7 ns per instruction). This PR implements a very simple language for encoding the JSON → Awkward rules, given a JSONSchema, and lets RapidJSON do the JSON parsing. These instructions are invoked by RapidJSON's SAX interface. It's called "JSON assembly" because it's basically a virtual assembly language.
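
To make the event-driven idea concrete, here is a minimal Python sketch of SAX-style events filling flat offset and content buffers. It is not the PR's instruction set or its C++ code: `sax_events` and `fill_buffers` are hypothetical names, the events are produced by walking an already-parsed value (a stand-in for RapidJSON's SAX callbacks, which fire during parsing), and the rules are hard-coded for the one data shape used in the benchmark below rather than read from an instruction list.

```python
import json


def sax_events(value):
    """Yield SAX-like (kind, payload) events by walking a parsed JSON value.

    Hypothetical stand-in for a real SAX parser, which would emit these
    events while scanning the text instead of after building Python objects.
    """
    if isinstance(value, list):
        yield ("start_array", None)
        for item in value:
            yield from sax_events(item)
        yield ("end_array", None)
    elif isinstance(value, dict):
        yield ("start_object", None)
        for key, item in value.items():
            yield ("key", key)
            yield from sax_events(item)
        yield ("end_object", None)
    else:
        yield ("number", value)


def fill_buffers(events):
    """Fill flat buffers for data shaped like [[{"x": float, "y": [int]}]].

    Assumption: the shape is known in advance (as it would be from a schema),
    so each event can be routed to a buffer without inspecting the data.
    """
    outer_offsets = [0]  # boundaries of each inner list, counted in records
    x = []               # all "x" values, flattened
    y_offsets = [0]      # boundaries of each "y" list, counted in integers
    y = []               # all "y" values, flattened
    depth = 0            # array nesting depth (objects are not counted)
    key = None           # most recently seen object key

    for kind, payload in events:
        if kind == "start_array":
            depth += 1
        elif kind == "end_array":
            if depth == 3:    # a "y" list just closed
                y_offsets.append(len(y))
            elif depth == 2:  # an inner list of records just closed
                outer_offsets.append(len(x))
            depth -= 1
        elif kind == "key":
            key = payload
        elif kind == "number":
            if key == "x":
                x.append(float(payload))
            else:
                y.append(int(payload))

    return {"outer_offsets": outer_offsets, "x": x, "y_offsets": y_offsets, "y": y}


sample = b'[[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": []}]]'
buffers = fill_buffers(sax_events(json.loads(sample)))
print(buffers)
```

For this sample, `outer_offsets` comes out as `[0, 2, 2, 3]` and `y_offsets` as `[0, 1, 3, 3]`: nested lists become pairs of offsets into flat content, and no per-record Python objects are kept.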

Testing again with this example:

MULTIPLIER = int(10e6)
json_string = b"[" + b", ".join([
    b'[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],' +
    b'[],' +
    b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]'
] * MULTIPLIER) + b"]"

we have

  • 70.0 seconds in Python's json module (projected from a 1/10th dataset, since the full dataset would produce 20 GB of Python lists and dicts)
  • 38.6 seconds with untyped ArrayBuilder—the state of the art if you have no schema
  • 31.5 seconds with schema-specialized Forth code (trading ArrayBuilder navigation costs for Forth navigation costs)
  • 10.4 seconds with RapidJSON + this simple language → Awkward Arrays
  • 6.0 seconds for RapidJSON with no output

So just iterating over this simple language and writing out the output costs 4.4 seconds (10.4 − 6.0), which is a little less than the 6.0 seconds RapidJSON spends parsing the input. Good enough.
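
The arithmetic behind these figures can be checked from the timings listed above. The snippet below is only a worked calculation; the dictionary keys are labels invented here, and the numbers are copied from the list rather than re-measured.

```python
# Timings in seconds, copied from the list above (labels are illustrative).
timings = {
    "python_json": 70.0,         # projected from a 1/10th dataset
    "array_builder": 38.6,       # untyped ArrayBuilder
    "forth": 31.5,               # schema-specialized Forth
    "rapidjson_assembly": 10.4,  # RapidJSON + the simple language
    "rapidjson_only": 6.0,       # RapidJSON with no output
}

# Cost of interpreting the instructions and writing output, beyond parsing.
overhead = round(timings["rapidjson_assembly"] - timings["rapidjson_only"], 1)
print(f"instruction + output overhead: {overhead} s")

# Speedup of the new approach relative to each alternative.
speedups = {
    name: timings[name] / timings["rapidjson_assembly"]
    for name in ("python_json", "array_builder", "forth")
}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x slower than RapidJSON + assembly")
```

The ratios work out to roughly 6.7× against Python's json module, 3.7× against ArrayBuilder, and 3.0× against the Forth version.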

For users who have a JSONSchema but aren't using Awkward Array yet (they might be using Dask's "bag" API now), the value proposition is that using Awkward Array means they can load their data 7× faster than with Python's json module, use 10× less memory, and then do fast vectorized operations on what has been loaded, rather than Python iteration.

Now I need to find out why Windows is failing.

@jpivarski jpivarski marked this pull request as ready for review November 25, 2021 03:29
@jpivarski
Member Author

Fixed the "Windows bug" (it was revealing a bug that was merely hidden on the other platforms) and tried a few things to get a bit more speed, such as merging identical switch cases and merging the two classes into a single class (that mattered for the Forth implementation...). No changes in speed. I'm leaving the switch cases merged (it's easier to read), but keeping the classes separate (I want to hide the RapidJSON dependency from the header files).

@jpivarski jpivarski enabled auto-merge (squash) November 25, 2021 03:31
@jpivarski
Member Author

Another thing to try: each type of instruction should only be writing to one type of output buffer. Replace the single std::vector of abstract ForthOutputBuffers with a std::vector for each specialized subclass, which ought to avoid vtable indirection.

The output dtype, currently an argument for each assembly instruction, would have to become fixed. (An easy way to do that would be to just add assertions in the constructor. That way, the Python code can be unchanged.)

Also, replace std::shared_ptr with std::unique_ptr, as we already know that makes a difference.

@jpivarski jpivarski merged commit e50a2ed into main Nov 25, 2021
@jpivarski jpivarski deleted the jpivarski/ak-from_json_schema-2 branch November 25, 2021 04:24