Second try at specialized JSON: RapidJSON + custom assembly #1165
This replaces #1162. Forth may be good for binary formats (especially unruly ones like ROOT, where you really need the Turing completeness!), but parsing JSON just required too many instructions (at 7 ns per instruction). This PR implements a very simple language for encoding the JSON → Awkward rules, given a JSONSchema, and lets RapidJSON do the JSON parsing. These instructions are invoked by RapidJSON's SAX interface. It's called "JSON assembly" because it's basically a virtual assembly language.

Testing again with this example:

```python
MULTIPLIER = int(10e6)
json_string = b"[" + b", ".join([
    b'[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],' +
    b'[],' +
    b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]'
] * MULTIPLIER) + b"]"
```

we have
So just iterating over this simple language and writing out the output costs 4.4 seconds, which is a little less than the time RapidJSON spends parsing it. Good enough.

For users who have JSON with a schema but aren't using Awkward Array (who might be using Dask's "bag" API now), the value proposition is that using Awkward Array means they can load their data 7× faster, use 10× less memory, and then do fast vectorized operations on what has been loaded, rather than Python iteration.

Now I need to find out why Windows is failing.
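To make the idea concrete, here is a minimal pure-Python sketch of a schema-specific instruction list driven by SAX-style events and filling columnar buffers. It is only an illustration under stated assumptions: the instruction names (`"list"`, `"record"`, `"number"`), the buffer names, and the `sax_events` generator (which stands in for RapidJSON's SAX callbacks) are all hypothetical and are not the instruction set implemented in this PR.

```python
import json

def sax_events(value):
    """Yield SAX-style events for a decoded JSON value (stand-in for a real SAX parser)."""
    if isinstance(value, list):
        yield ("start_array", None)
        for item in value:
            yield from sax_events(item)
        yield ("end_array", None)
    elif isinstance(value, dict):
        yield ("start_object", None)
        for key, item in value.items():
            yield ("key", key)
            yield from sax_events(item)
        yield ("end_object", None)
    else:
        yield ("number", value)  # every scalar is treated as a number in this sketch

# Hypothetical instructions for: array of arrays of {"x": number, "y": array of integers}.
# ("list", offsets_buffer, child_index), ("record", {key: child_index}), ("number", buffer)
INSTRUCTIONS = [
    ("list", "outer_offsets", 1),
    ("list", "inner_offsets", 2),
    ("record", {"x": 3, "y": 4}),
    ("number", "x"),
    ("list", "y_offsets", 5),
    ("number", "y"),
]

def run(instructions, events):
    # Offsets buffers start at 0; content buffers start empty.
    buffers = {}
    for ins in instructions:
        if ins[0] == "list":
            buffers[ins[1]] = [0]
        elif ins[0] == "number":
            buffers[ins[1]] = []
    counts = [0] * len(instructions)  # cumulative item count per list instruction
    stack = []                        # indices of the containers we are inside
    expect = 0                        # instruction the next value must match

    def value_done():
        nonlocal expect
        expect = None                 # inside a record, wait for the next key
        if stack and instructions[stack[-1]][0] == "list":
            counts[stack[-1]] += 1
            expect = instructions[stack[-1]][2]

    for kind, payload in events:
        if kind == "key":
            expect = instructions[stack[-1]][1][payload]
        elif kind in ("end_array", "end_object"):
            done = stack.pop()
            if kind == "end_array":
                buffers[instructions[done][1]].append(counts[done])
            value_done()
        else:
            ins = instructions[expect]
            if kind == "start_array":
                assert ins[0] == "list"
                stack.append(expect)
                expect = ins[2]
            elif kind == "start_object":
                assert ins[0] == "record"
                stack.append(expect)
                expect = None
            else:
                assert ins[0] == "number"
                buffers[ins[1]].append(payload)
                value_done()
    return buffers

small = b'[[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": [1, 2, 3]}]]'
buffers = run(INSTRUCTIONS, sax_events(json.loads(small)))
```

Each event does a constant amount of work (one table lookup plus at most one append), which is the property that keeps the per-event overhead small compared with interpreting a general-purpose language.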
Fixed the "Windows bug" (it was revealing a bug that was merely hidden on the other platforms) and tried a few things to get a bit more speed, such as merging identical
Another thing to try: each type of instruction should only be writing to one type of output buffer. Replace the single The output dtype, currently an argument for each assembly instruction, would have to become fixed. (Easy way to do that would be to just add assertions in the constructor. That way, the Python code can be unchanged.) Also, replace
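The suggestion to pin each instruction type to one output dtype and enforce that with constructor assertions could look roughly like the sketch below. Everything here is an assumption made for illustration: the opcode names, the `REQUIRED_DTYPE` table, and the `Program` class are hypothetical, and Python's `array` module stands in for the typed growable buffers.

```python
from array import array

# Hypothetical opcodes, each allowed to write exactly one dtype.
REQUIRED_DTYPE = {
    "fill_offsets": "int64",
    "fill_integer": "int64",
    "fill_number": "float64",
    "fill_boolean": "uint8",
}

# array-module typecodes used to back each dtype in this sketch.
TYPECODE = {"int64": "q", "float64": "d", "uint8": "B"}

class Program:
    def __init__(self, instructions):
        # One container of buffers per dtype, instead of a single mixed container.
        self.buffers = {dtype: {} for dtype in TYPECODE}
        self.instructions = []
        for opcode, output_name, dtype in instructions:
            if opcode not in REQUIRED_DTYPE:
                raise ValueError(f"unknown opcode: {opcode!r}")
            # The caller still passes a dtype, so calling code is unchanged,
            # but a dtype that doesn't match the opcode is rejected up front.
            if dtype != REQUIRED_DTYPE[opcode]:
                raise TypeError(
                    f"{opcode} must write {REQUIRED_DTYPE[opcode]}, not {dtype}"
                )
            self.buffers[dtype].setdefault(output_name, array(TYPECODE[dtype]))
            self.instructions.append((opcode, output_name))

program = Program([
    ("fill_offsets", "offsets", "int64"),
    ("fill_number", "x", "float64"),
])
program.buffers["float64"]["x"].append(1.1)
```

Because the check happens once at construction time, the per-event code path no longer needs to branch on dtype.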