The backend could parse the numbers as well.
I've been thinking about how best to implement a JavaScript library for this. Having worked at Figma for a few months (which uses C++ WASM interfaced with JavaScript), and read the JS documentation, I have a possible direction to take.
We could make simdjson a WASM module implementing the tokenize() function (stage 1). Each instance of the module would have six exports (wasm's method of allocation and data sharing):
- `tokenize(jsonLength: number): number`, which runs stage 1 on the input JSON and produces output indices. The input JSON and output indices are placed in `json` and `token_indices` on the instance, similar to how the current parser works in C++, storing data in the parser itself.
- `json`: the input JSON (WebAssembly.Memory, formatted as UTF-8 bytes)
- `token_indices`: the output indices (WebAssembly.Memory, array of output indices)
- `json_len`: the length of the input JSON in bytes (u32)
- `token_indices_len`: the number of parsed indices (u32)
- `error`: the error of the last parse (if any)

Then a JSON iterator in JavaScript would let you traverse the JSON. You could iterate at a higher level of arrays/objects/etc., much like with On Demand. This would allocate a bit, but iterator allocation can be escape-analyzed and inlined, and I'd hope JS engines can do that much.
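To make the export list above a bit more concrete, here is a rough TypeScript sketch of how the JS side might see such an instance. Everything in it is an assumption for illustration: `TokenizerExports`, `createMockTokenizer` and `tokenOffsets` are hypothetical names, the meaning of the `tokenize()` return value (an error code, 0 for success) is my guess, and the tokenizer is a plain scalar stand-in written in TypeScript, not simdjson's stage 1 and not a real WASM module. It only records the offset of each structural character, opening quote, and first byte of a primitive.

```typescript
// Hypothetical shape of the six exports; in a real module `json` and
// `token_indices` would be typed-array views over WebAssembly.Memory.
interface TokenizerExports {
  json: Uint8Array;            // input JSON as UTF-8 bytes
  token_indices: Uint32Array;  // output: byte offset of each token
  json_len: number;            // length of the last input, in bytes
  token_indices_len: number;   // number of offsets written
  error: number;               // assumed: 0 = ok, non-zero = failure
  tokenize(jsonLength: number): number;
}

// Scalar stand-in for the WASM instance (NOT simdjson's algorithm).
function createMockTokenizer(capacity: number): TokenizerExports {
  const instance: TokenizerExports = {
    json: new Uint8Array(capacity),
    token_indices: new Uint32Array(capacity),
    json_len: 0,
    token_indices_len: 0,
    error: 0,
    tokenize(jsonLength: number): number {
      const { json, token_indices } = instance;
      let count = 0;
      let inString = false;
      let inPrimitive = false; // inside a number / true / false / null
      for (let i = 0; i < jsonLength; i++) {
        const c = json[i];
        if (inString) {
          if (c === 0x5c) i++;                     // backslash: skip escaped byte
          else if (c === 0x22) inString = false;   // closing quote
          continue;
        }
        const isWhitespace = c === 0x20 || c === 0x0a || c === 0x0d || c === 0x09;
        const isStructural =
          c === 0x7b || c === 0x7d || c === 0x5b || c === 0x5d || c === 0x3a || c === 0x2c;
        if (c === 0x22) {
          token_indices[count++] = i;              // opening quote starts a string token
          inString = true;
          inPrimitive = false;
        } else if (isStructural) {
          token_indices[count++] = i;
          inPrimitive = false;
        } else if (isWhitespace) {
          inPrimitive = false;
        } else {
          if (!inPrimitive) token_indices[count++] = i; // first byte of a primitive
          inPrimitive = true;
        }
      }
      instance.json_len = jsonLength;
      instance.token_indices_len = count;
      instance.error = inString ? 1 : 0;           // 1 = unterminated string (made-up code)
      return instance.error;
    },
  };
  return instance;
}

// Caller side: copy the input into `json`, run stage 1, read the offsets back.
function tokenOffsets(instance: TokenizerExports, text: string): number[] {
  const bytes = new TextEncoder().encode(text);
  instance.json.set(bytes);                        // throws RangeError if it does not fit
  instance.tokenize(bytes.length);
  return Array.from(instance.token_indices.subarray(0, instance.token_indices_len));
}
```

With this shape, a JS iterator would only need `json`, `token_indices` and `token_indices_len` to walk the document, and the instance can be reused across parses without reallocating either buffer.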
String production is an open question. It would suck to allocate a ton of little strings. I'm not sure if string.slice() is efficient (for example, does it avoid copies of strings that never get modified, and how much tax do you pay for that behavior?). There is a stringview library that lets you produce `StringView` objects referencing byte arrays, but I don't know if those can really be a reasonable substitute for regular strings.
UTF-8 validation is another question. It's possible to forgo it, obviously, if a JS `string` is the input. But even if we take Uint8Array as input, I haven't found a way to forgo UTF-8 validation when producing output strings, so I'm not sure there will be a benefit. (simdjson's tokenizer is naturally UTF-8 validating for everything except strings, as it rejects all non-whitespace, non-structural bytes outside of strings, and those are all in the single-byte ASCII range.)
Even if we didn't go the WASM way, I think this is a pretty good boundary for simdjson, and treats allocation and iteration very similarly to its C++ counterpart.
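To illustrate the string and validation questions, here is a small sketch of producing a JS string from a token offset with the standard `TextDecoder`. The helper name `decodeStringToken` is hypothetical, and it assumes the string contains no escape sequences, which a real implementation would have to handle. The point is that `subarray()` avoids copying the bytes, but `decode()` still allocates a string and still inspects the bytes: with `fatal: false` it replaces malformed sequences with U+FFFD, with `fatal: true` it throws. As far as I can tell there is no mode that skips the check.

```typescript
// Decode the string token whose opening quote sits at `quoteOffset`.
// Assumption for brevity: the string has no backslash escapes.
function decodeStringToken(json: Uint8Array, quoteOffset: number, strict: boolean): string {
  let end = quoteOffset + 1;
  // UTF-8 continuation bytes are always >= 0x80, so scanning for 0x22 is safe.
  while (end < json.length && json[end] !== 0x22) end++;
  const view = json.subarray(quoteOffset + 1, end); // shares the buffer, no byte copy
  // fatal: true throws on malformed UTF-8; fatal: false substitutes U+FFFD.
  return new TextDecoder("utf-8", { fatal: strict }).decode(view);
}

// Usage: a well-formed string decodes the same way in both modes.
const wellFormed = new TextEncoder().encode('"héllo"');
const decoded = decodeStringToken(wellFormed, 0, true);

// A lone 0xff byte is never valid UTF-8; lenient mode replaces it.
const malformed = new Uint8Array([0x22, 0xff, 0x22]);
const replaced = decodeStringToken(malformed, 0, false);
```

So even with a Uint8Array input that stage 1 has already looked at, each produced string pays for a second pass over its bytes, which is the cost I'm unsure we can avoid.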