Very poor Rust/WASM performance vs JavaScript #1119

Closed
psiphi75 opened this issue Dec 20, 2018 · 13 comments
Labels
speed Issues related to runtime performance

Comments

@psiphi75

psiphi75 commented Dec 20, 2018

I've implemented the same ray tracing algorithm in JavaScript and Rust/WASM. The results are below:

  • JavaScript around 15 frames per second (fps)
  • Rust/WASM around 0.3 fps (compiling with the --release option)

I'm using Web Workers; the tests were run with 8 workers. I have released the demo and the code.

I've reviewed this issue, but there is nothing outstanding.

Running the native Rust version on the console, I get around 14.4 fps with only one thread. So in theory, native performance on 4 cores should be able to reach around 57 fps (4 × 14.4). Hence WASM is running around 200 times slower than native.
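The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes ideal linear scaling across the 4 cores mentioned in the update, ignores hyperthreading, and uses illustrative variable names.

```javascript
// Figures quoted in this report.
const nativeSingleThreadFps = 14.4;
const wasmFps = 0.3;

// Assumption: 4 physical cores with perfect linear scaling.
const cores = 4;

const nativeIdealFps = nativeSingleThreadFps * cores;
const slowdown = nativeIdealFps / wasmFps;

console.log(`ideal native fps: ${nativeIdealFps.toFixed(1)}`);
console.log(`wasm slowdown factor: ${Math.round(slowdown)}`);
```

The factor comes out a little under 200, which is consistent with the "around 200 times" figure.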

UPDATE: The JavaScript version utilises around 85% of each core on my 4-core (with hyperthreading) CPU, while the WASM version uses around 70 to 100% of one core and around 10 to 30% of the others.

Any ideas?

@chinedufn
Contributor

From a quick 15 second fly by of your setup (haven't looked at the code) I noticed two things that may or may not improve the numbers a tad.

  1. Looks like you're optimizing for size, have you tried optimizing for speed and were the results comparable?

[image]

  2. It looks like you aren't using wasm-opt to optimize your wasm binary?

Haven't looked at the code yet, but the first thing I'd check is whether you're cloning a bunch of data.

@chinedufn
Contributor

chinedufn commented Dec 20, 2018

Also, I would try looking at your browser's devtools to see what's going on.

@alexcrichton
Contributor

Thanks for the report @psiphi75! (and the source to poke around!)

I've done some poking around and it definitely looks like nothing obvious is missing (like --release or something like that). I think though that the main cause of slowdown here isn't the wasm itself but perhaps the architecture of the application? It looks like the wasm implementation is calling toObject on a pretty large Uint8ClampedArray which is causing (at least in Firefox) a lot of memmove/memcpy time to be spent. That in turn could cause a huge amount of memory traffic which may explain the low core utilization.

I wasn't able to dig much farther though, I think the perf tools in Chrome/Firefox still have a ways to go with wasm!

In any case, can you detail a bit more about what the "each unit of work" function is in the JS/wasm implementations? I couldn't quite follow what it was and how the JS version differed.

FWIW the profilers showed that very little time was spent in wasm itself, so at least that part is fast here!

@psiphi75
Author

Thanks @chinedufn and @alexcrichton. I tried the optimisation and removed the opt-level = 's' setting; I presume that means it's -O3 by default on a --release build. But that didn't make a difference.

Thanks for the memory tip @alexcrichton, I replaced the following lines with a static Uint8ClampedArray buffer, and it shot up to 20 fps.

      workUnit.message.buffer = new Uint8ClampedArray(
        wasm.memory.buffer,
        cellsPtr,
        constants.SQUARE_SIZE * constants.WIDTH * 4
      );

I'll see how I can optimise this part, and keep you posted.

Yes, the Chrome dev tools are pretty limited for profiling, both for WASM and Web Workers.

@alexcrichton
Contributor

Oh nice!

FWIW I've found that Firefox's perf.html addon is excellent for profiling, but it has a lot of information that isn't always easy to decipher. I was able to figure out that memmove/memcpy were taking up a lot of time for this example, but I couldn't figure out directly why that was being called or what else was slowing things down.

@alexcrichton
Contributor

Once you've got that committed/deployed as well I can try to help poking around some more!

@psiphi75
Author

@alexcrichton, thanks. I'm investigating two options. The first option is SharedArrayBuffer, which is currently disabled in some browsers due to the Spectre bug. It also requires atomics/mutexes, which have no support in WASM yet (I believe), and the JavaScript component is too atomic for it to be useful.

The other option is transferable message passing. I believe this could work well, but it would require a bit of a refactor.

@alexcrichton
Contributor

Sounds reasonable to me! If you haven't seen it already we've actually got an example of a parallel raytracer, although it's using SharedArrayBuffer and a whole slew of unstable wasm features so it's only really demo quality! There though the messages between threads are just notifications and all the main chunks of data live in the original SharedArrayBuffer shared between workers.

@psiphi75
Author

psiphi75 commented Dec 20, 2018

This has been fixed, and the problem was never caused by wasm-bindgen. It's now running at more than 27 fps in Firefox and around 20 fps in Chrome! The demo has been updated.

I have to say I don't understand the reason, but doing a copy from wasm.memory.buffer into a new Uint8ClampedArray buffer took a very long time.

In a nutshell my JavaScript code changed from:

      const cellsPtr = rt.render(workUnit.message.stripId);
      workUnit.message.buffer = new Uint8ClampedArray(
        wasm.memory.buffer,
        cellsPtr,
        constants.SQUARE_SIZE * constants.WIDTH * 4
      );
      self.postMessage(workUnit.toObject());

to:

      workUnit.message.buffer = new Uint8Array(constants.SQUARE_SIZE * constants.WIDTH * 4);
      rt.render(workUnit.message.stripId, workUnit.message.buffer);
      self.postMessage(workUnit.toObject(), [workUnit.message.buffer.buffer]);

There are two aspects here. The main one, I believe, was creating the Uint8Array upfront, passing it to the WASM render function, and writing to the buffer directly. The other was to use a transferable buffer to send the data back to the main thread.

I believe a SharedArrayBuffer will work even better, but is not well supported on various browsers.

Thanks for your help.

@chinedufn
Contributor

chinedufn commented Dec 20, 2018

I'd bet that a lot of people will be poking around the issues looking for performance tips.

Some potential different ideas:

  • A performance tag for issues
  • A FAQ section in the guide for common performance issues / tips / approaches / things to check
    • I like this one
  • Something else...?

@alexcrichton
Contributor

Glad to hear @psiphi75! FWIW I still can't manage to get good wasm stacks in perf.html, but Chrome's developer tools report that the workers are spending 30% of their time in RayTracer::trace and another 30% in Object::intersect. That at least sounds like a plausible profile to me!

It looks like a lot of events are happening in the workers rather than long contiguous blocks of work, so maybe a tweaked architecture with fewer messages between workers would help more? Sort of just shooting in the dark!

@chinedufn I definitely agree! https://rustwasm.github.io/book/game-of-life/time-profiling.html and https://rustwasm.github.io/book/reference/time-profiling.html are hopefully at least a start on documentation, but expanding that and/or adding an FAQ here sounds great!
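One way to cut down on messages, sketched here purely as an illustration: hand each worker one contiguous range of strips per frame instead of one strip per message. The thread does not show how the demo schedules its work units, so `partitionStrips` and its parameters are hypothetical.

```javascript
// Split `stripCount` strips into at most `workerCount` contiguous ranges,
// so each worker can receive a single message per frame.
// Hypothetical helper; not taken from the demo's code.
function partitionStrips(stripCount, workerCount) {
  const ranges = [];
  const base = Math.floor(stripCount / workerCount);
  const extra = stripCount % workerCount;
  let start = 0;
  for (let w = 0; w < workerCount; w++) {
    // The first `extra` workers take one additional strip.
    const size = base + (w < extra ? 1 : 0);
    if (size === 0) break; // more workers than strips
    ranges.push({ start, end: start + size }); // `end` is exclusive
    start += size;
  }
  return ranges;
}

console.log(partitionStrips(30, 8));
```

Larger ranges mean fewer messages but coarser load balancing, so the best range size would need measuring.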

alexcrichton added the speed (Issues related to runtime performance) label Dec 21, 2018

@psiphi75
Author

psiphi75 commented Jun 10, 2019

Last night I demonstrated this to a few people, and it turns out the performance issue is caused by the following line:

self.postMessage(workUnit.toObject());

Apparently this serialises/deserialises the object when it's sent from the worker to the main thread.

Hence, it's not related to wasm-bindgen.
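A plausible explanation for why that serialisation was so expensive, offered as a hedged sketch rather than a confirmed diagnosis: when a typed-array view is structured-cloned, the whole ArrayBuffer behind the view is copied, not just the bytes the view covers. A view created over wasm.memory.buffer would then carry the entire wasm memory with every message. The example below uses a plain ArrayBuffer as a stand-in for wasm memory and `structuredClone` (Node 17+ and current browsers) as a stand-in for `postMessage`; the sizes and offsets are made up.

```javascript
// Stand-in for wasm.memory.buffer (assumed size: 16 MiB).
const MEMORY_BYTES = 16 * 1024 * 1024;
const memory = new ArrayBuffer(MEMORY_BYTES);

// Hypothetical strip: 1 KiB of pixel data at an arbitrary offset.
const STRIP_OFFSET = 4096;
const STRIP_BYTES = 1024;
const view = new Uint8ClampedArray(memory, STRIP_OFFSET, STRIP_BYTES);
view.fill(200);

// structuredClone applies the same serialisation that postMessage uses
// when no transfer list is given.
const clonedView = structuredClone(view);

// Copying the strip into its own small buffer first keeps the clone small.
const compact = view.slice();
const clonedCompact = structuredClone(compact);

console.log(clonedView.buffer.byteLength);    // bytes copied for the view
console.log(clonedCompact.buffer.byteLength); // bytes copied for the slice
```

If this is what happened in the demo, it would also fit the earlier profile showing most of the time in memmove/memcpy.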

@Pauan
Contributor

Pauan commented Jun 11, 2019

Yes, postMessage always serializes the object. However, you can avoid the serialization if it is a Transferable object, and you pass it as the transfer argument for postMessage. This causes the object to be transferred in a zero-copy way, so it's very fast.
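A minimal sketch of the transfer behaviour described above. It uses `structuredClone` with its `transfer` option (Node 17+ and current browsers) as a single-threaded stand-in for `postMessage(message, [buffer])`; the strip size and message shape are assumptions, not the demo's actual values.

```javascript
// Hypothetical strip size in bytes (RGBA pixels).
const STRIP_BYTES = 256 * 4;

const pixels = new Uint8Array(STRIP_BYTES);
pixels.fill(255);
const message = { stripId: 7, buffer: pixels };

// In a worker this would be: self.postMessage(message, [pixels.buffer]);
const received = structuredClone(message, { transfer: [pixels.buffer] });

console.log(pixels.byteLength);          // sender's view after the transfer
console.log(received.buffer.byteLength); // receiver's view of the same bytes
```

After the transfer the sender's buffer is detached, so the worker must allocate a fresh buffer for the next strip rather than reuse the old one.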
