Very poor Rust/WASM performance vs JavaScript #1119

psiphi75 · 2018-12-20T03:47:14Z

I've implemented the same ray tracing algorithm in JavaScript and Rust/WASM. The results are below:

JavaScript around 15 frames per second (fps)
Rust/WASM around 0.3 fps (compiling with the --release option)

I'm using Web Workers, the tests are done using 8 workers. I have released the demo and the code.

I've reviewed this issue, but there is nothing outstanding.

Running the native Rust version on the console I get around 14.4 fps with only one thread. So in theory native performance should be able to reach around 57 fps. Hence WASM is running around 200 times slower than native.

UPDATE: The JavaScript version utilises around 85% of each core on my 4 core (with hyperthreading) CPU, while the WASM version uses around 70 to 100% on one CPU and around 10 to 30% on the other CPUs.

Any ideas?

chinedufn · 2018-12-20T13:10:26Z

From a quick 15 second fly by of your setup (haven't looked at the code) I noticed two things that may or may not improve the numbers a tad.

Looks like you're optimizing for size, have you tried optimizing for speed and were the results comparable?

It looks like you aren't using wasm-opt to optimize your wasm binary?

Haven't looked at the code yet, but the first thing I'd check is whether you're cloning a bunch of data.

chinedufn · 2018-12-20T13:10:51Z

Also, I would try looking at your browser's devtools to see what's going on.

alexcrichton · 2018-12-20T16:34:53Z

Thanks for the report @psiphi75! (and the source to poke around!)

I've done some poking around and it definitely looks like nothing obvious is missing (like --release or something like that). I think though that the main cause of slowdown here isn't the wasm itself but perhaps the architecture of the application? It looks like the wasm implementation is calling toObject on a pretty large Uint8ClampedArray which is causing (at least in Firefox) a lot of memmove/memcpy time to be spent. That in turn could cause a huge amount of memory traffic which may explain the low core utilization.

I wasn't able to dig much further though, I think the perf tools in Chrome/Firefox still have a ways to go with wasm!

In any case, can you detail a bit more about what the "each unit of work" function is on the JS/wasm implementations? I couldn't quite follow what it was and how the JS version differed.

FWIW the profilers showed that very little time was spent in wasm itself, so at least that part is fast here!

psiphi75 · 2018-12-20T19:41:26Z

Thanks @chinedufn and @alexcrichton, I tried the optimisation and removing the opt-level = 's', I presume that means it's -O3 by default on a --release build. But that didn't make a difference.

Thanks for the memory tip @alexcrichton, I replaced the following lines with a static Uint8ClampedArray buffer, and it shot up to 20 fps.

      workUnit.message.buffer = new Uint8ClampedArray(
        wasm.memory.buffer,
        cellsPtr,
        constants.SQUARE_SIZE * constants.WIDTH * 4
      );

I'll see how I can optimise this part, and keep you posted.

Yes, the Chrome dev tools are pretty limited for profiling, both for WASM and Web Workers.

alexcrichton · 2018-12-20T20:03:01Z

Oh nice!

FWIW I've found that Firefox's perf.html addon is excellent for profiling, but it has a lot of information that isn't always easy to decipher. I was able to figure out that memmove/memcpy were taking up a lot of time for this example, but I couldn't figure out directly why that was being called or what else was slowing things down.

alexcrichton · 2018-12-20T20:03:22Z

Once you've got that committed/deployed as well I can try to help poking around some more!

psiphi75 · 2018-12-20T20:25:41Z

@alexcrichton, thanks. I'm investigating two options. The first is SharedArrayBuffer, which is currently disabled in some browsers due to the Spectre bug and also requires atomics/mutexes, which have no support in WASM yet (I believe), and the JavaScript component is too atomic for it to be useful.

The other option is transferable message passing. I believe this could work well, but it would require a bit of a refactor.

alexcrichton · 2018-12-20T21:06:12Z

Sounds reasonable to me! If you haven't seen it already we've actually got an example of a parallel raytracer, although it's using SharedArrayBuffer and a whole slew of unstable wasm features so it's only really demo quality! There though the messages between threads are just notifications and all the main chunks of data live in the original SharedArrayBuffer shared between workers.

psiphi75 · 2018-12-20T22:40:49Z

This has been fixed, and it was never an issue caused by wasm-bindgen. It's now running at more than 27 fps in Firefox and around 20 fps in Chrome! The demo has been updated.

I have to say I don't understand the reason, but doing a copy from wasm.memory.buffer into a new Uint8ClampedArray buffer took a very long time.

In a nutshell my JavaScript code changed from:

      const cellsPtr = rt.render(workUnit.message.stripId);
      workUnit.message.buffer = new Uint8ClampedArray(
        wasm.memory.buffer,
        cellsPtr,
        constants.SQUARE_SIZE * constants.WIDTH * 4
      );
      self.postMessage(workUnit.toObject());

to:

      workUnit.message.buffer = new Uint8Array(constants.SQUARE_SIZE * constants.WIDTH * 4);
      rt.render(workUnit.message.stripId, workUnit.message.buffer);
      self.postMessage(workUnit.toObject(), [workUnit.message.buffer.buffer]);

There are two aspects here. The main one, I believe, was creating the Uint8Array upfront, passing it to the WASM render function, and writing to the buffer directly. The other was to use a transferable buffer to send the data back to the main thread.
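A hedged illustration of why the first version may have been slow: structured cloning a typed-array view serializes the whole underlying ArrayBuffer, not just the viewed range, so a small view onto wasm.memory.buffer can drag the entire wasm memory along with it. The sketch below assumes a runtime that provides structuredClone, which uses the same algorithm as postMessage. The names fakeWasmMemory, STRIP_BYTES and STRIP_OFFSET are hypothetical stand-ins, not values from the demo.

```javascript
// Hypothetical stand-in for wasm.memory.buffer (one 64 KiB wasm page).
const fakeWasmMemory = new ArrayBuffer(64 * 1024);
const STRIP_BYTES = 16;    // hypothetical strip size
const STRIP_OFFSET = 1024; // hypothetical pointer returned by render()

// A small view onto the big buffer, like the original worker code created.
const view = new Uint8ClampedArray(fakeWasmMemory, STRIP_OFFSET, STRIP_BYTES);
view[0] = 200;

// Cloning the view (what postMessage does without a transfer list)
// carries the whole backing buffer along, not just the strip.
const clonedView = structuredClone(view);
const bytesCarriedByView = clonedView.buffer.byteLength;

// Copying the strip into its own small buffer first, then transferring
// that buffer, moves only the strip's bytes.
const strip = view.slice();
const movedStrip = structuredClone(strip, { transfer: [strip.buffer] });
const bytesCarriedByStrip = movedStrip.buffer.byteLength;

console.log({ bytesCarriedByView, bytesCarriedByStrip });
```

The change described above avoids even the slice() copy by having render write straight into a separately allocated buffer, but the effect on message size is the same: only the strip travels to the main thread.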

I believe a SharedArrayBuffer will work even better, but is not well supported on various browsers.

Thanks for your help.

chinedufn · 2018-12-20T23:47:21Z

I'd bet that a lot of people will be poking around the issues looking for performance tips.

Some potential different ideas:

- A performance tag for issues
- A FAQ section in the guide for common performance issues / tips / approaches / things to check
  - I like this one
- Something else...?

alexcrichton · 2018-12-21T16:21:41Z

Glad to hear @psiphi75! FWIW I still can't manage to get good wasm stacks in perf.html, but Chrome's developer tools report that the workers are spending 30% of their time in RayTracer::trace and another 30% in Object::intersect. That at least sounds like a plausible profile to me!

It looks like a lot of events are happening in the workers rather than long contiguous blocks of work, so maybe a tweaked architecture with fewer messages between workers would help more? Sort of just shooting in the dark!

@chinedufn I definitely agree! https://rustwasm.github.io/book/game-of-life/time-profiling.html and https://rustwasm.github.io/book/reference/time-profiling.html are hopefully at least a start on documentation, but expanding that and/or adding an FAQ here sounds great!

psiphi75 · 2019-06-10T20:00:56Z

Last night I demonstrated this to a few people, and the performance issue is caused by the following line:

self.postMessage(workUnit.toObject());

Apparently this serialises/deserialises the object when it's sent from the worker to the main thread.

Hence, it's not related to wasm-bindgen.

Pauan · 2019-06-11T04:11:33Z

Yes, postMessage always serializes the object. However, you can avoid the serialization if it is a Transferable object and you pass it in the transfer argument of postMessage. This causes the object to be transferred in a zero-copy way, so it's very fast.
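A minimal sketch of the copy versus transfer difference, assuming a runtime with structuredClone (same structured clone algorithm as postMessage, but without needing a worker). The message shape and STRIP_BYTES are hypothetical and only loosely modelled on the workUnit object above.

```javascript
const STRIP_BYTES = 64; // hypothetical buffer size
const message = { stripId: 3, buffer: new Uint8Array(STRIP_BYTES) };
message.buffer[0] = 42;

// No transfer list: the bytes are copied and the sender keeps its buffer.
const copied = structuredClone(message);
const sourceLengthAfterCopy = message.buffer.byteLength;

// With a transfer list: the ArrayBuffer is moved instead of copied,
// and the sender's typed array is left detached (zero length).
const moved = structuredClone(message, { transfer: [message.buffer.buffer] });
const sourceLengthAfterTransfer = message.buffer.byteLength;

console.log({ sourceLengthAfterCopy, sourceLengthAfterTransfer });
```

Because the sender loses access to a transferred buffer, a worker using this pattern needs a fresh buffer for each message, which is what allocating a new Uint8Array per work unit achieves.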

psiphi75 closed this as completed Dec 20, 2018

alexcrichton added the speed Issues related to runtime performance label Dec 21, 2018

chinedufn mentioned this issue Dec 21, 2018

Collection of FAQ / links for speed / perf #1123

Closed
