WASM / JavaScript thoughts #1912

jkeiser · 2022-10-13T19:52:03Z

jkeiser
Oct 13, 2022
Maintainer

I've been thinking about how best to implement a JavaScript library for this. Having worked at Figma for a few months (which uses C++ WASM interfaced with JavaScript), and read the JS documentation, I have a possible direction to take.

We could make simdjson a WASM module implementing the tokenize() function (stage 1). Each instance of the module would have six exports (wasm's method of allocation and data sharing):

tokenize(jsonLength: number): number, which runs stage 1 on the input JSON and produces output indices. The input JSON and output indices are placed in json and token_indices on the instance, similar to how the current parser works in C++, storing data in the parser itself.
json: the input JSON (WebAssembly.Memory, formatted as UTF-8 bytes)
token_indices: the output indices (WebAssembly.Memory, array of output indices)
json_len: the length of the input JSON in bytes (u32)
token_indices_len: The number of parsed indices (u32)
error: The error of the last parse (if any).
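As a rough illustration of how the JavaScript side might drive those exports, here is a minimal TypeScript sketch. Everything in it is an assumption layered on the list above: the `Stage1Exports` and `runStage1` names are hypothetical, `json_len`, `token_indices_len` and `error` are modelled as objects with a numeric `value` (the shape of a `WebAssembly.Global`), an `error` value of 0 is taken to mean success, and the indices are assumed to be 32-bit values starting at offset 0 of `token_indices`.

```typescript
// Structural stand-ins for WebAssembly.Memory and WebAssembly.Global, so the
// sketch does not depend on any particular set of typings.
interface MemoryLike { buffer: ArrayBuffer }
interface GlobalLike { value: number }

// Hypothetical export shape mirroring the six exports listed above.
interface Stage1Exports {
  tokenize(jsonLength: number): number;
  json: MemoryLike;
  token_indices: MemoryLike;
  json_len: GlobalLike;
  token_indices_len: GlobalLike;
  error: GlobalLike;
}

// Copies the input into the module's json memory, runs stage 1, and returns
// a view (not a copy) over the indices the module produced.
function runStage1(exports: Stage1Exports, input: Uint8Array): Uint32Array {
  const jsonBytes = new Uint8Array(exports.json.buffer);
  if (input.length > jsonBytes.length) {
    throw new RangeError("input does not fit in the json memory");
  }
  jsonBytes.set(input);
  exports.tokenize(input.length);
  // Assumption: 0 means "no error".
  if (exports.error.value !== 0) {
    throw new Error("stage 1 failed with error code " + exports.error.value);
  }
  return new Uint32Array(
    exports.token_indices.buffer,
    0,
    exports.token_indices_len.value
  );
}
```

The returned array aliases the module's memory, so it would be overwritten by the next call to tokenize(); a caller that needs to keep the indices would have to copy them.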

Then a JSON iterator in JavaScript would let you traverse the JSON. You could iterate at a higher level of arrays/objects/etc., much like with On Demand. This would allocate a bit, but iterator allocation can be escape-analyzed and inlined, and I'd hope JS engines can do that much.
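To make the iterator idea concrete, here is a small TypeScript sketch of walking the elements of an array using only the input bytes and the stage 1 indices. It assumes the index layout used by the C++ stage 1 (one index per structural character, per opening quote, and per first byte of a scalar); the `ArrayCursor` and `skipValue` names are hypothetical, and error handling is kept to a minimum.

```typescript
const OPEN_BRACKET = 0x5b;  // [
const CLOSE_BRACKET = 0x5d; // ]
const OPEN_BRACE = 0x7b;    // {
const CLOSE_BRACE = 0x7d;   // }
const COMMA = 0x2c;         // ,

// Returns the token position just past the value starting at token position i.
// Scalars and strings occupy one token; containers are skipped by depth.
function skipValue(json: Uint8Array, tokens: Uint32Array, i: number): number {
  let depth = 0;
  while (i < tokens.length) {
    const c = json[tokens[i++]];
    if (c === OPEN_BRACKET || c === OPEN_BRACE) depth++;
    else if (c === CLOSE_BRACKET || c === CLOSE_BRACE) depth--;
    if (depth <= 0) break;
  }
  return i;
}

// Cursor over the elements of the array whose '[' is at token position start.
class ArrayCursor {
  private json: Uint8Array;
  private tokens: Uint32Array;
  private pos: number;
  private done: boolean;

  constructor(json: Uint8Array, tokens: Uint32Array, start: number) {
    if (json[tokens[start]] !== OPEN_BRACKET) {
      throw new TypeError("token is not the start of an array");
    }
    this.json = json;
    this.tokens = tokens;
    this.pos = start + 1;
    this.done = json[tokens[this.pos]] === CLOSE_BRACKET; // empty array
  }

  // Returns the token position of the next element, or -1 when exhausted.
  next(): number {
    if (this.done) return -1;
    const element = this.pos;
    const after = skipValue(this.json, this.tokens, element);
    const c = this.json[this.tokens[after]];
    if (c === COMMA) this.pos = after + 1;
    else if (c === CLOSE_BRACKET) this.done = true;
    else throw new SyntaxError("expected ',' or ']' after array element");
    return element;
  }
}
```

Returning a plain number rather than a `{ value, done }` object sidesteps one allocation per step, which is the kind of cost the escape-analysis remark above is concerned with.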

String production is an open question. It would suck to allocate a ton of little strings. I'm not sure whether string.slice() is efficient (for example, does it avoid copying strings that never get modified, and how much tax do you pay for that behavior?). There is a stringview library that lets you produce StringView objects referencing byte arrays, but I don't know if those can really be a reasonable substitute for regular strings.
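For reference, one way to turn a byte range into a regular JS string is the standard `TextDecoder`, sketched below in TypeScript. This does not answer the allocation question (each call still allocates a string) and is only meant to show the baseline that any alternative would be measured against. The `readString` helper is hypothetical; it assumes `openQuote` is the byte offset of a string's opening quote, and it falls back to `JSON.parse` when the string contains escapes.

```typescript
// fatal: true makes decode() throw a TypeError on invalid UTF-8 instead of
// silently substituting U+FFFD.
const decoder = new TextDecoder("utf-8", { fatal: true });

const QUOTE = 0x22;     // "
const BACKSLASH = 0x5c; // \

// Decodes the JSON string whose opening quote is at byte offset openQuote.
function readString(json: Uint8Array, openQuote: number): string {
  let end = openQuote + 1;
  let hasEscape = false;
  while (end < json.length && json[end] !== QUOTE) {
    if (json[end] === BACKSLASH) {
      hasEscape = true;
      end++; // skip the escaped byte so an escaped quote does not end the scan
    }
    end++;
  }
  if (end >= json.length) throw new SyntaxError("unterminated string");
  if (!hasEscape) {
    // Fast path: the bytes between the quotes are the string's UTF-8 content.
    return decoder.decode(json.subarray(openQuote + 1, end));
  }
  // Slow path: let the built-in parser handle the escape sequences.
  return JSON.parse(decoder.decode(json.subarray(openQuote, end + 1)));
}
```

`subarray` creates a view without copying the bytes, so the only copy on the fast path is the one the decoder makes when it builds the string.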

UTF-8 validation is another question. It's possible to forgo it, obviously, if a JS string is the input. But even if we take Uint8Array as input, I haven't found a way to forgo UTF-8 validation when producing output strings, so I'm not sure there will be a benefit. (simdjson's tokenizer is naturally UTF-8 validating for everything except strings, as it rejects all non-whitespace, non-structural bytes outside of strings, and those are all in the single-byte ASCII range.)

Even if we didn't go the WASM way, I think this is a pretty good boundary for simdjson, and treats allocation and iteration very similarly to its C++ counterpart.

lemire · 2022-10-13T20:40:29Z

lemire
Oct 13, 2022
Maintainer

Strings are a big problem but you don't allude to all of them. For example, parsing an integer requires that you (formally) create a string before you can call the relevant function... https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt Will the JIT be able to "escape analysis" its way out of producing a temporary string? One difficulty with JavaScript is that you do not get to pick your own compiler.
The path you describe is very good. Note that On Demand may do without an index... However, with the approach you describe, the index would be almost 'free'.
I am unconcerned with UTF-8 validation. It is assuredly a negligible cost.
I have wanted to write a JavaScript On-Demand for a long time, so I would be willing to contribute to such a project. (I have written a fair number of high-performance JavaScript libraries.) However, a prerequisite seems to be having some way to dig into how the JIT deals with the code; otherwise, it will just be super frustrating.

1 reply

jkeiser Oct 14, 2022
Maintainer Author

Damn good point about number parsing. At least the parseNumber() function isn't recursive, and it is relatively straightforward.
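As a hedged illustration of avoiding the temporary string, here is a TypeScript sketch that reads an integer straight out of the byte array. It is not a port of simdjson's parseNumber(): the `parseIntegerAt` name is hypothetical, and the sketch handles only an optional minus sign followed by decimal digits, with no fractions, exponents, or leading-zero checks.

```typescript
const MINUS = 0x2d; // -
const ZERO = 0x30;  // 0
const MAX_SAFE = 9007199254740991; // 2^53 - 1, largest exactly representable integer

// Parses an integer starting at byte offset `offset` without creating a string.
function parseIntegerAt(json: Uint8Array, offset: number): number {
  let i = offset;
  let negative = false;
  if (json[i] === MINUS) {
    negative = true;
    i++;
  }
  const firstDigit = i;
  let value = 0;
  while (i < json.length) {
    const digit = json[i] - ZERO;
    if (digit < 0 || digit > 9) break; // stop at the first non-digit byte
    value = value * 10 + digit;
    i++;
  }
  if (i === firstDigit) throw new SyntaxError("expected a digit");
  // Beyond this point a JS number can no longer hold the integer exactly.
  if (value > MAX_SAFE) throw new RangeError("integer exceeds 2^53 - 1");
  return negative ? -value : value;
}
```

A complete number parser would also need to handle floating-point values, which is where most of the complexity lies.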

lemire · 2023-04-11T20:54:06Z

lemire
Apr 11, 2023
Maintainer

The backend could parse the numbers as well.

oven-sh/bun#2570

0 replies
