WebGPU Performance Issues #5689

Closed
vladmandic opened this issue Oct 3, 2021 · 14 comments
Labels: type:bug Something isn't working

@vladmandic
Contributor

vladmandic commented Oct 3, 2021

I just tried the new tfjs-backend-webgpu 0.0.1-alpha.8 on tfjs 3.9.0
Environment: Chrome 96 Canary on Windows 11

First, great job on adding tons of new ops - from the perspective of supported kernel ops, WebGPU is becoming usable!

However, the switch to WGSL is anything but useful so far - it comes with a major performance degradation.

Overall, WebGPU has become slower than WebGL
(and WebGL itself has become significantly slower since tfjs 3.4.0 - this is discussed separately in several open issues)

Not to mention that the new work that has gone into WebGL to make it manageable (enabling uniforms) has no effect on WebGPU.

Comparing warmup times
(FYI, my app by default uses 8 simple models running in parallel - the total model size is actually tiny, below 30 MB):

  • webgl (default settings)

    14 sec (double the value with uniforms enabled)

  • webgl with WEBGL_PACK_DEPTHWISECONV=false and WEBGL_USE_SHAPES_UNIFORMS=true (flag setup sketched after this list)

    7 sec (pretty good)

  • webgpu (default settings)

    25 sec (this is incredibly slow)

  • webgpu with WEBGPU_USE_GLSL=true

    15 sec (already slower than webgl)

  • wasm (no real warmup, included for reference only)

    2 sec
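
For reference, a minimal sketch of how the flagged WebGL variant above can be set up; the model handle and input shape below are placeholders, not the actual models from my app:

  // sketch: configure the flags before initializing the backend (model and input shape are placeholders)
  tf.env().set('WEBGL_PACK_DEPTHWISECONV', false);
  tf.env().set('WEBGL_USE_SHAPES_UNIFORMS', true);
  await tf.setBackend('webgl');
  await tf.ready();
  const t0 = performance.now();
  const res = await model.executeAsync(tf.zeros([1, 224, 224, 3])); // dummy warmup inference
  tf.dispose(res);
  console.log('warmup ms:', Math.round(performance.now() - t0));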

IMO, when developing a new backend, the goal should be that it's better than the previous one - not just that it passes unit tests.
If WebGPU is not significantly improved, it will be d.o.a. once released.

cc @qjia7 and @xhcao due to their work on webgpu
cc @pyu10055 as assignee on the webgl performance degradation issue

vladmandic added the type:bug (Something isn't working) label on Oct 3, 2021
@gyagp
Contributor

gyagp commented Oct 4, 2021

@vladmandic Thanks for the good comments and data, as always!
Chrome 94 was released on Sep 21 with WebGPU Origin Trial support. This means that in addition to Chrome Canary, we can now use Chrome Stable (you still need the --enable-unsafe-webgpu option) for WebGPU experiments. Unfortunately, Chrome decided not to support GLSL for WebGPU anymore (the change happened in master, so all release channels are impacted, including Canary and Stable), so WGSL is the only language that can be consumed now. We always align closely with WebGPU development (my team also heavily contributes to the WebGPU spec, CTS and the Chrome implementation) and started the TFJS GLSL-to-WGSL transition in June. After fixing many critical perf issues in Chrome (e.g., the workgroup memory initialization perf regression) together with Google, and working around perf issues in TFJS (e.g., hardware limits), we finished the transition after 3+ months of work.
Internally we track performance daily against almost all the workloads defined in the TFJS e2e benchmarks. Before switching to WGSL, we double-checked that there was no performance regression in warmup time or run time. For sure, due to limited resources, we can only cover very limited platforms (actually only Intel Coffee Lake and Tiger Lake are under daily test) and very limited workloads. We'd like to hear more details from your side (e.g., hardware configuration) to understand the regression. We'll investigate right after our holidays (we are off from Oct 1 to 7 for the National Day holidays).
BTW,

  1. The uniform idea was already implemented in the WebGPU backend. Google thought it was great, so we're bringing it to the WebGL backend.
  2. Compared with WebGL, compiled shaders can't be cached by Chrome for WebGPU. We have already raised this implementation issue with Chrome, and it's going to take a while to land (it's not easy).

Thanks again for your valuable feedback. We hope to hear more details from your side about the warmup regression (e.g., hardware configuration), and we look forward to more collaboration in the future!

@vladmandic
Contributor Author

Thank you for the notes, here are the full details.
I've created an automated test so it's easy to check all scenarios...

Performance Testing

Environment: tfjs 3.9.0 and tfjs-backend-webgpu 0.0.1-alpha.8
Hardware: Notebook with Intel Coffee Lake i7-8750 and NVIDIA GTX 1050 Ti

Notes

  • WebGPU GLSL code has been recently removed and cannot be compared with the new WGSL
  • WebGL warmup has a massive benefit of ~80% from browser shader caching
  • WebGPU warmup has a small benefit of ~12% from browser shader caching
  • WebGPU is much faster on inference compared to WebGL
  • WebGPU is faster to warm up than WebGL in most cases
    Except when WebGL shaders are cached in the browser cross-session and uniforms are enabled
    WebGL is then 2x faster than WebGPU, showing the necessity of caching support
  • The WebGL performance benefit of uniforms is massive at 2x, and I don't see any side-effects
    Will this be enabled by default in the future?
  • WebGL packing caused a massive performance regression in TFJS 3.4.0 (3.3.0 is the last unaffected version)
    There are several open issues, but no progress?
  • Using tf.zeros as input is convenient, but does not produce realistic results
    Test using a real input image to exercise the real-world model execution path

Test Results

{ message: 'initial', warmup: 3134, inference: 2638, tfjs: '3.9.0', backend: 'wasm', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 3119, inference: 2618, tfjs: '3.9.0', backend: 'wasm', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 11836, inference: 61, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 2665, inference: 60, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 6128, inference: 54, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'cached', warmup: 1202, inference: 67, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'initial', warmup: 5018, inference: 23, tfjs: '3.9.0', backend: 'webgpu', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 4454, inference: 22, tfjs: '3.9.0', backend: 'webgpu', tensors: 304, agent: 'Chrome/94', env: [] }

Issues

Using the WebGPU backend causes a lot of warnings, although execution seems to work:

> warning Binding size bigger than maximum uniform buffer binding size: binding 0 given 146313216 bytes, maximum is 16384 bytes
    at ValidateBufferBinding (../../third_party/dawn/src/dawn_native/BindGroup.cpp:114)
    at ValidateBindGroupDescriptor (../../third_party/dawn/src/dawn_native/BindGroup.cpp:290)
    at CreateBindGroup (../../third_party/dawn/src/dawn_native/Device.cpp:1043)

Reproduction

A fully automated test in Node.js using Puppeteer, reproducible anytime.
Code available at https://gist.github.com/vladmandic/fbdcaf7fe2e2add5c33b98936d4d5740
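
The actual harness is in the gist above; as a rough outline, it does something like this (the local page URL and the benchmarkDone flag here are placeholders):

  const puppeteer = require('puppeteer');
  (async () => {
    const browser = await puppeteer.launch({
      headless: false, // run headed: headless gpu support can be limited
      args: ['--enable-unsafe-webgpu'],
    });
    const page = await browser.newPage();
    page.on('console', (msg) => console.log(msg.text())); // collect the benchmark result lines
    await page.goto('http://localhost:10010/test.html', { waitUntil: 'networkidle0' });
    await page.waitForFunction('window.benchmarkDone === true', { timeout: 120000 });
    await browser.close();
  })();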

@vladmandic
Contributor Author

vladmandic commented Oct 7, 2021

The above post uses a single model (it can be re-tested using any model; I used Inception v4 trained on ImageNet 1k).

However, when I try the WebGPU backend in my demo app, it runs at ~3 FPS average while WebGL runs at ~9 FPS -
that is a 3x drop in inference performance!

My best guess is that some ops get executed on the CPU, causing a major slowdown.

You can try using the following URLs:

@qjia7
Contributor

qjia7 commented Oct 8, 2021

@vladmandic Can you put the Inception v4 model somewhere that I can access? It seems that http://wyse:10010/models/imagenet/inception-v4/model.json is on your local server. The webgpu warning looks like a bug in our implementation.

And for your demo app, I can reproduce the bad performance for webgpu. Thanks for reporting it. I will take a look.

@vladmandic
Contributor Author

vladmandic commented Oct 8, 2021

> Can you put the Inception v4 model somewhere that I can access?

To keep it reproducible with a readily available public model, you can use any mid-complexity model;
here's an example with EfficientNet-B5 from TF Hub: https://tfhub.dev/google/efficientnet/b5/classification/1
(just convert it from a TF SavedModel to a TFJS GraphModel using tensorflowjs_converter)

{ message: 'initial', warmup: 2645, inference: 1908, tfjs: '3.9.0', backend: 'wasm', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 2330, inference: 1808, tfjs: '3.9.0', backend: 'wasm', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 20148, inference: 107, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 5374, inference: 105, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 7428, inference: 119, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'cached', warmup: 2053, inference: 103, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'initial', warmup: 5087, inference: 64, tfjs: '3.9.0', backend: 'webgpu', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 4427, inference: 70, tfjs: '3.9.0', backend: 'webgpu', tensors: 394, agent: 'Chrome/94', env: [] }

As you can see, the data is pretty much the same as with the Inception v4 model
(an even bigger impact of WebGL packing and uniforms, but the numbers tell the same story).

> And for your demo app, I can reproduce the bad performance for webgpu

I've traced it down - there are a couple of places where WebGPU is a touch slower than WebGL,
but by far the biggest issue is tf.image.nonMaxSuppressionAsync.

WebGL runs NMS in ~25 ms while WebGPU runs it in ~135 ms (over 5x slower).

FYI, the NMS function params are:

boxes.shape = [896, 4]
scores.shape = [896]
maxOutputSize = 1
iouThreshold = 0.1
scoreThreshold = 0.2
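
For reference, a minimal sketch of that call with dummy tensors standing in for the real detector outputs:

  // dummy boxes/scores with the shapes above; the real ones come from the detector model
  const boxes = tf.zeros([896, 4]);
  const scores = tf.zeros([896]);
  const t0 = performance.now();
  const indices = await tf.image.nonMaxSuppressionAsync(boxes, scores, 1, 0.1, 0.2);
  console.log('nms ms:', Math.round(performance.now() - t0));
  tf.dispose([boxes, scores, indices]);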

Also, it seems like WebGPU has some additional execution latency?
In more complex models that is not visible, since overall execution time is faster than WebGL,
but with very simple models that execute in near-real-time, WebGPU is slower than WebGL.

For example, running a requestAnimationFrame loop on BlazeFace model.execute():

  • WebGL: 12 ms / frame
  • WebGPU: 20 ms / frame
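
A minimal sketch of the measurement loop (the model handle and the input shape here are placeholders, not the actual BlazeFace inputs):

  async function frame() {
    const t0 = performance.now();
    const input = tf.zeros([1, 128, 128, 3]); // placeholder input shape
    const out = model.execute(input); // synchronous graph execution
    const first = Array.isArray(out) ? out[0] : out;
    await first.data(); // force the gpu queue to flush so the frame time is real
    tf.dispose([input, out]);
    console.log(tf.getBackend(), 'frame ms:', Math.round(performance.now() - t0));
    requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);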

@qjia7
Contributor

qjia7 commented Oct 11, 2021

@vladmandic Thanks for the detailed information. I can run your benchmarks using EfficientNet now. Some comments below:

  1. I didn't hit the warning Binding size bigger than maximum uniform buffer binding size: binding 0 given 146313216 bytes, maximum is 16384 bytes using EfficientNet. Maybe it's Inception v4 specific. If you can further narrow down which op with what kind of input shapes introduces this warning, that would be helpful for us.
  2. Shader caching is completely unimplemented in the Chrome browser for WebGPU. See gpuweb:2111, dawn:549. So we have to wait for the browser to support it. At the TFJS level, we will see if we can further reduce the shader variants to reduce the warmup time.
  3. For tf.image.nonMaxSuppressionAsync, it may not be the culprit. Currently, nonMaxSuppressionAsync only runs on the CPU; there is no GPU kernel for it. So I guess the slowness is caused by the ops executed before nonMaxSuppressionAsync - calling it forces all pending GPU ops to finish execution. What kind of model are you executing before nonMaxSuppressionAsync?
  4. For small models, webgpu does have a performance issue. We have noticed that the current conv2d/matmul is not efficient for irregular inputs, e.g. where M and N are small and K is very large, or where the input height/width is smaller than the filter height/width (see the sketch below). We are working on optimizations for these kinds of shapes. Will update here once we have progress. Thanks.
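
For concreteness, a small sketch of the kind of "irregular" shape meant in point 4 (the sizes are made up for illustration):

  // illustrative "irregular" matmul: M and N are tiny while K is huge (sizes are made up)
  const a = tf.randomNormal([2, 65536]); // M=2, K=65536
  const b = tf.randomNormal([65536, 4]); // K=65536, N=4
  const t0 = performance.now();
  const c = tf.matMul(a, b);
  await c.data(); // wait for the gpu to finish before reading the clock
  console.log('matmul ms:', Math.round(performance.now() - t0));
  tf.dispose([a, b, c]);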

@vladmandic
Contributor Author

vladmandic commented Oct 11, 2021

> Shader caching is completely unimplemented in the Chrome browser for WebGPU. See gpuweb:2111, dawn:549. So we have to wait for the browser to support it.

Thanks.
I took a look, and the current approach by the Chrome team doesn't seem very encouraging, and the thread on the spec itself has been idle for 8 months :(

> At the TFJS level, we will see if we can further reduce the shader variants to reduce the warmup time

Much appreciated!

> For small models, webgpu does have a performance issue. We have noticed that the current conv2d/matmul is not efficient for irregular inputs, e.g. where M and N are small and K is very large, or where the input height/width is smaller than the filter height/width. We are working on optimizations for these kinds of shapes. Will update here once we have progress. Thanks.

Thanks for confirming

> For tf.image.nonMaxSuppressionAsync, it may not be the culprit. Currently, nonMaxSuppressionAsync only runs on the CPU; there is no GPU kernel for it. So I guess the slowness is caused by the ops executed before nonMaxSuppressionAsync - calling it forces all pending GPU ops to finish execution. What kind of model are you executing before nonMaxSuppressionAsync?

You're right - the perf problem is basically ANY first TF operation executed in JS code (outside of the model); there is a massive latency penalty.
In my previous post, nonMaxSuppressionAsync just happened to be that operation.

Simple reproduction:

  const numIterations = 50;
  const arr = new Uint8Array(imageData?.data.buffer); // input data in my case is 4k imageData, but can be any dataset
  const t0 = performance.now();
  for (let i = 0; i < numIterations; i++) {
    const rgba = tf.tensor(arr, [imageData.width, imageData.height, 4], 'int32'); // create rgba tensor
    const rgb = tf.slice3d(rgba, [0, 0, 0], [-1, -1, 3]); // strip alpha channel
    const tensor = tf.expandDims(rgb, 0); // create standard image tensor [1, height, width, 3]
    // const data = await tensor.array(); // download data from gpu
    tf.dispose([rgba, rgb, tensor]); // just dispose everything
  }
  const t1 = performance.now();
  const avgTime = Math.round((t1 - t0) / numIterations);
  console.log({ backend: tf.getBackend(), average: avgTime });

This loop in WebGPU is about 3x slower than in WebGL

Setting tf.ENV.set('WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE', 0) reduces the latency by 50%, but it's still slower than WebGL.
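
For reference, the equivalent call through the current env API (my reading of the flag name is that 0 disables command batching):

  // sketch: set the flag before running the loop above; 0 appears to submit the gpu
  // queue immediately instead of batching commands (assumption based on the flag name)
  tf.env().set('WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE', 0);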

Note that when the commented-out line that downloads data back from the GPU is enabled, both WebGL and WebGPU slow down a lot, since downloading data is slow (expected).
BUT - overall execution becomes faster for WebGPU than WebGL. WebGPU itself is fast; it's just the initial latency that's the killer.

So running models in WebGPU is faster than WebGL, but preparing inputs and processing outputs adds a huge penalty at the moment.

> I didn't hit the warning ...
> Maybe it's Inception v4 specific.
> If you can further narrow down which op with what kind of input shapes introduces this warning, that would be helpful for us.

I'm getting the warning even when running just the simple code above, with no model execution at all.

The message is slightly different in Chrome 97 vs 94, and the maximum binding size is much bigger, but the error is pretty much the same:

warning Binding size (146313216) is larger than the maximum binding size (134217728).
 - While validating entries[0] as a Buffer
 - While validating [BindGroupDescriptor] against [BindGroupLayout]
 - While calling CreateBindGroup([BindGroupDescriptor]).
DATA:  warning [BindGroup] is an error.
    at ValidateObject (../../third_party/dawn/src/dawn_native/Device.cpp:473)
    at ValidateSetBindGroup (../../third_party/dawn/src/dawn_native/ProgrammablePassEncoder.cpp:116)
    at operator() (../../third_party/dawn/src/dawn_native/ComputePassEncoder.cpp:184)
    at FinishInternal (../../third_party/dawn/src/dawn_native/CommandEncoder.cpp:1035)

There are no errors logged in about://gpu in Chrome

vladmandic changed the title from "regression: tfjs-backend-webgpu major performance degradation" to "WebGPU Performance Issues" on Oct 11, 2021
@vladmandic
Contributor Author

@qjia7 did you have a chance to look at the WebGPU latency issues I mentioned? The tests above are from 22 days ago.

@qjia7
Contributor

qjia7 commented Nov 3, 2021

@vladmandic Sorry, I can't reproduce the latency issue you mentioned. From the code snippet you pasted, it basically does nothing and executes instantaneously on my side.

  const numIterations = 50;
  const arr = new Uint8Array(imageData?.data.buffer); // input data in my case is 4k imageData, but can be any dataset
  const t0 = performance.now();
  for (let i = 0; i < numIterations; i++) {
    const rgba = tf.tensor(arr, [imageData.width, imageData.height, 4], 'int32'); // create rgba tensor
    const rgb = tf.slice3d(rgba, [0, 0, 0], [-1, -1, 3]); // strip alpha channel
    const tensor = tf.expandDims(rgb, 0); // create standard image tensor [1, height, width, 3]
    // const data = await tensor.array(); // download data from gpu
    tf.dispose([rgba, rgb, tensor]); // just dispose everything
  }
  const t1 = performance.now();
  const avgTime = Math.round((t1 - t0) / numIterations);
  console.log({ backend: tf.getBackend(), average: avgTime });

Some updates on the warmup time:

  • At the TFJS level, we are first reducing some shader variants, starting with binary ops (webgpu: Reduce binary ops shader variants #5791). We will move on to other ops when we get each operator's exact shader compilation time. Some challenges for us are that 1) it's hard to measure each operator's shader compilation time (details can be found here, here), and 2) browser devtools also lack this support. Fortunately, Chrome just added it to chrome://tracing. We can continue this work once it's ready in Canary.

  • At the WebGPU level, a Google developer has started work on caching pipelines in memory.

@vladmandic
Contributor Author

@qjia7 thanks for the update!

And the description of the Chromium queue handling sounds like it could be the same root cause for the extreme latency issues I'm seeing.

For reproduction, I'm guessing your test failed because imageData was empty, but you can even use tf.zeros to reproduce -
no model needed, nothing - just a trivial loop.
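
For example, a variant of the earlier loop with tf.zeros standing in for the 4k imageData (the dimensions are my assumption for a 4k frame):

  const numIterations = 50;
  const t0 = performance.now();
  for (let i = 0; i < numIterations; i++) {
    const rgba = tf.zeros([3840, 2160, 4], 'int32'); // stand-in for 4k rgba image data
    const rgb = tf.slice3d(rgba, [0, 0, 0], [-1, -1, 3]); // strip alpha channel
    const tensor = tf.expandDims(rgb, 0); // standard image tensor
    tf.dispose([rgba, rgb, tensor]); // just dispose everything
  }
  const avgTime = Math.round((performance.now() - t0) / numIterations);
  console.log({ backend: tf.getBackend(), average: avgTime });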

here's a live link: https://vladmandic.github.io/tfjs-utils/src/latency-issue.html

and output on my notebook:

user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4689.0 Safari/537.36
tf version: 3.11.0-20211102
backend: webgl | total time: 4556 ms | average time: 91 ms
backend: webgpu | total time: 13058 ms | average time: 261 ms

Basically, for any "real" work, WebGPU is really fast - but for simple stuff done in JS code outside of the model, the latency is a killer - it's about 3x slower than WebGL.
Unfortunately, in the real world any model inference is followed by some post-processing in JS, and that is where this impact becomes a showstopper.

Setting tf.ENV.set('WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE', 0) reduces the latency by 50%,
but it's still nowhere near as fast as WebGL.

@qjia7
Contributor

qjia7 commented Nov 4, 2021

@vladmandic Thanks for your live test case. I can reproduce it now. After debugging, I found that the time is mainly spent in queue.writeBuffer. It seems that this API needs to be optimized in the browser for large data uploads. I reported a bug to Chromium: https://bugs.chromium.org/p/chromium/issues/detail?id=1266727.

@qjia7
Contributor

qjia7 commented Dec 13, 2021

@vladmandic Jiawei on our team has fixed the queue.writeBuffer issue in Chromium. You can retest your example https://vladmandic.github.io/tfjs-utils/src/latency-issue.html with the latest Chrome Canary (--enable-unsafe-webgpu). The webgpu backend becomes much faster than before; on my machine, the latency is not that obvious now. We are also trying to use mapAsync instead of writeBuffer, which shows better perf than webgl with your example, but it brings another issue. It's still under discussion, but we hope we can provide the most performant solution soon.

And for the long warmup time, we drafted a prototype for parallel compilation, showing almost a 4x speedup in warmup time. Currently, we are discussing how to expose this capability uniformly between webgl and webgpu. Will keep you updated.

@vladmandic
Contributor Author

Thanks @qjia7!

I've tested with Chrome 99 and the latency is gone - WebGPU now performs on par with WebGL.

I'm looking forward to the other proposed changes (once the issue is resolved) for mapAsync in #5928.
And I'm guessing you're still discussing how to align the changes from #5826 with the changes from #5815?

Anyhow, I'm closing this issue as resolved...


qjia7 added a commit to qjia7/tfjs that referenced this issue Dec 17, 2021
qjia7 added a commit that referenced this issue Dec 22, 2021
PERF
BUG: #5689

* webgpu: Use mapAsync instead of writeBuffer for uploading
* Correct test cases
* Ignore the promise rejection
* Fix buffer was not provided error
* Fix bots failure
* Recover some tests
* Remove unnecessary early-return and reset

* add benchmark test