WebGPU Performance Issues #5689

Closed
vladmandic opened this issue Oct 3, 2021 · 14 comments
Labels: type:bug Something isn't working

@vladmandic
Contributor

vladmandic commented Oct 3, 2021

I just tried the new tfjs-backend-webgpu 0.0.1-alpha.8 on tfjs 3.9.0
Environment: Chrome 96 Canary on Windows 11

First, great job on adding tons of new ops - from the perspective of supported kernel ops, WebGPU is becoming usable!

However, the switch to WGSL is anything but useful so far - it comes with a major performance degradation.

Overall, WebGPU has become slower than WebGL
(and WebGL itself has become significantly slower since tfjs 3.4.0 - this is discussed separately in several open issues)

Not to mention that the new work that has gone into WebGL to make it manageable (enabling uniforms) has no effect on WebGPU.

Comparing warmup times
(FYI, my app by default uses 8 simple models running in parallel - the total model size is actually tiny, below 30 MB):

  • webgl (default settings)

    14 sec (double the value with uniforms enabled)

  • webgl with WEBGL_PACK_DEPTHWISECONV=false and WEBGL_USE_SHAPES_UNIFORMS=true (flag setup sketched after this list)

    7 sec (pretty good)

  • webgpu (default settings)

    25 sec (this is incredibly slow)

  • webgpu with WEBGPU_USE_GLSL=true

    15 sec (already slower than webgl)

  • wasm (no real warmup, included for reference only)

    2 sec
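
For reference, a minimal sketch of how the flagged WebGL variant above can be set up; the model handle and input shape below are placeholders, not the actual models from my app:

  // sketch: configure the flags before initializing the backend (model and input shape are placeholders)
  tf.env().set('WEBGL_PACK_DEPTHWISECONV', false);
  tf.env().set('WEBGL_USE_SHAPES_UNIFORMS', true);
  await tf.setBackend('webgl');
  await tf.ready();
  const t0 = performance.now();
  const res = await model.executeAsync(tf.zeros([1, 224, 224, 3])); // dummy warmup inference
  tf.dispose(res);
  console.log('warmup ms:', Math.round(performance.now() - t0));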

IMO, when developing a new backend, the goal should be that it's better than the previous one - not just that it passes unit tests.
If WebGPU is not significantly improved, it will be d.o.a. once released.

cc @qjia7 and @xhcao due to their work on webgpu
cc @pyu10055 as assignee on the webgl performance degradation issue

vladmandic added the type:bug (Something isn't working) label on Oct 3, 2021
@gyagp
Contributor

gyagp commented Oct 4, 2021

@vladmandic Thanks for the good comments and data, as always!
Chrome 94 was released on Sep 21 with WebGPU Origin Trial support. This means that in addition to Chrome Canary, we can now use Chrome Stable (you still need the --enable-unsafe-webgpu option) for WebGPU experiments. Unfortunately, Chrome decided not to support GLSL for WebGPU anymore (the change happened in master, so all release channels are impacted, including Canary and Stable), so WGSL is the only language that can be consumed now. We always align closely with WebGPU development (my team also heavily contributes to the WebGPU spec, CTS and the Chrome implementation) and started the TFJS GLSL-to-WGSL transition in June. After fixing many critical perf issues in Chrome (e.g., the workgroup memory initialization perf regression) together with Google, and working around perf issues in TFJS (e.g., hardware limits), we finished the transition after 3+ months of work.
Internally we track performance daily against almost all the workloads defined in the TFJS e2e benchmarks. Before switching to WGSL, we double-checked that there was no performance regression in warmup time or run time. For sure, due to limited resources, we can only cover very limited platforms (actually only Intel Coffee Lake and Tiger Lake are under daily test) and very limited workloads. We'd like to hear more details from your side (e.g., hardware configuration) to understand the regression. We'll investigate right after our holidays (we are off from Oct 1 to 7 for the National Day holidays).
BTW,

  1. The uniform idea was already implemented in the WebGPU backend. Google thought it was great, so we're bringing it to the WebGL backend.
  2. Compared with WebGL, compiled shaders can't be cached by Chrome for WebGPU. We have already raised this implementation issue with Chrome, and it's going to take a while to land (it's not easy).

Thanks again for your valuable feedback. We hope to hear more details from your side about the warmup regression (e.g., hardware configuration), and we look forward to more collaboration in the future!

@vladmandic
Contributor Author

Thank you for the notes, here are the full details.
I've created an automated test so it's easy to check all scenarios...

Performance Testing

Environment: tfjs 3.9.0 and tfjs-backend-webgpu 0.0.1-alpha.8
Hardware: Notebook with Intel Coffee Lake i7-8750 and NVIDIA GTX 1050 Ti

Notes

  • WebGPU GLSL code has been recently removed and cannot be compared with the new WGSL
  • WebGL warmup has a massive benefit of ~80% from browser shader caching
  • WebGPU warmup has a small benefit of ~12% from browser shader caching
  • WebGPU is much faster on inference compared to WebGL
  • WebGPU is faster to warm up than WebGL in most cases
    Except when WebGL shaders are cached in the browser cross-session and uniforms are enabled
    WebGL is then 2x faster than WebGPU, showing the necessity of caching support
  • The WebGL performance benefit of uniforms is massive at 2x, and I don't see any side-effects
    Will this be enabled by default in the future?
  • WebGL packing caused a massive performance regression in TFJS 3.4.0 (3.3.0 is the last unaffected version)
    There are several open issues, but no progress?
  • Using tf.zeros as input is convenient, but does not produce realistic results
    Test using a real input image to exercise the real-world model execution path

Test Results

{ message: 'initial', warmup: 3134, inference: 2638, tfjs: '3.9.0', backend: 'wasm', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 3119, inference: 2618, tfjs: '3.9.0', backend: 'wasm', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 11836, inference: 61, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 2665, inference: 60, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 6128, inference: 54, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'cached', warmup: 1202, inference: 67, tfjs: '3.9.0', backend: 'webgl', tensors: 304, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'initial', warmup: 5018, inference: 23, tfjs: '3.9.0', backend: 'webgpu', tensors: 304, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 4454, inference: 22, tfjs: '3.9.0', backend: 'webgpu', tensors: 304, agent: 'Chrome/94', env: [] }

Issues

Using the WebGPU backend causes a lot of warnings, although execution seems to work:

> warning Binding size bigger than maximum uniform buffer binding size: binding 0 given 146313216 bytes, maximum is 16384 bytes
    at ValidateBufferBinding (../../third_party/dawn/src/dawn_native/BindGroup.cpp:114)
    at ValidateBindGroupDescriptor (../../third_party/dawn/src/dawn_native/BindGroup.cpp:290)
    at CreateBindGroup (../../third_party/dawn/src/dawn_native/Device.cpp:1043)

Reproduction

A fully automated test in Node.js using Puppeteer, reproducible anytime.
Code available at https://gist.github.com/vladmandic/fbdcaf7fe2e2add5c33b98936d4d5740
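
The actual harness is in the gist above; as a rough outline, it does something like this (the local page URL and the benchmarkDone flag here are placeholders):

  const puppeteer = require('puppeteer');
  (async () => {
    const browser = await puppeteer.launch({
      headless: false, // run headed: headless gpu support can be limited
      args: ['--enable-unsafe-webgpu'],
    });
    const page = await browser.newPage();
    page.on('console', (msg) => console.log(msg.text())); // collect the benchmark result lines
    await page.goto('http://localhost:10010/test.html', { waitUntil: 'networkidle0' });
    await page.waitForFunction('window.benchmarkDone === true', { timeout: 120000 });
    await browser.close();
  })();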

@vladmandic
Contributor Author

vladmandic commented Oct 7, 2021

The above post uses a single model (it can be re-tested using any model; I used Inception v4 trained on ImageNet 1k).

However, when I try the WebGPU backend in my demo app, it runs at ~3 FPS average while WebGL runs at ~9 FPS -
that is a 3x drop in inference performance!

My best guess is that some ops get executed on the CPU, causing a major slowdown.

You can try using the following URLs:

@qjia7
Contributor

qjia7 commented Oct 8, 2021

@vladmandic Can you put the Inception v4 model somewhere that I can access? It seems that http://wyse:10010/models/imagenet/inception-v4/model.json is on your local server. The webgpu warning looks like a bug in our implementation.

And for your demo app, I can reproduce the bad performance for webgpu. Thanks for reporting it. I will take a look.

@vladmandic
Contributor Author

vladmandic commented Oct 8, 2021

> Can you put the Inception v4 model somewhere that I can access?

To keep it reproducible with a readily available public model, you can use any mid-complexity model;
here's an example with EfficientNet-B5 from TF Hub: https://tfhub.dev/google/efficientnet/b5/classification/1
(just convert it from a TF SavedModel to a TFJS GraphModel using tensorflowjs_converter)

{ message: 'initial', warmup: 2645, inference: 1908, tfjs: '3.9.0', backend: 'wasm', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 2330, inference: 1808, tfjs: '3.9.0', backend: 'wasm', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 20148, inference: 107, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 5374, inference: 105, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'initial', warmup: 7428, inference: 119, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'cached', warmup: 2053, inference: 103, tfjs: '3.9.0', backend: 'webgl', tensors: 394, agent: 'Chrome/94', env: [ { WEBGL_PACK_DEPTHWISECONV: false }, { WEBGL_USE_SHAPES_UNIFORMS: true } ] }
{ message: 'initial', warmup: 5087, inference: 64, tfjs: '3.9.0', backend: 'webgpu', tensors: 394, agent: 'Chrome/94', env: [] }
{ message: 'cached', warmup: 4427, inference: 70, tfjs: '3.9.0', backend: 'webgpu', tensors: 394, agent: 'Chrome/94', env: [] }

As you can see, the data is pretty much the same as with the Inception v4 model
(an even bigger impact of WebGL packing and uniforms, but the numbers tell the same story).

> And for your demo app, I can reproduce the bad performance for webgpu

I've traced it down - there are a couple of places where WebGPU is a touch slower than WebGL,
but by far the biggest issue is tf.image.nonMaxSuppressionAsync.

WebGL runs NMS in ~25 ms while WebGPU runs it in ~135 ms (over 5x slower).

FYI, the NMS function params are:

boxes.shape = [896, 4]
scores.shape = [896]
maxOutputSize = 1
iouThreshold = 0.1
scoreThreshold = 0.2
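
For reference, a minimal sketch of that call with dummy tensors standing in for the real detector outputs:

  // dummy boxes/scores with the shapes above; the real ones come from the detector model
  const boxes = tf.zeros([896, 4]);
  const scores = tf.zeros([896]);
  const t0 = performance.now();
  const indices = await tf.image.nonMaxSuppressionAsync(boxes, scores, 1, 0.1, 0.2);
  console.log('nms ms:', Math.round(performance.now() - t0));
  tf.dispose([boxes, scores, indices]);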

Also, it seems like WebGPU has some additional execution latency?
In more complex models that is not visible, since overall execution time is faster than WebGL,
but with very simple models that execute in near-real-time, WebGPU is slower than WebGL.

For example, running a requestAnimationFrame loop on BlazeFace model.execute():

  • WebGL: 12 ms / frame
  • WebGPU: 20 ms / frame
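
A minimal sketch of the measurement loop (the model handle and the input shape here are placeholders, not the actual BlazeFace inputs):

  async function frame() {
    const t0 = performance.now();
    const input = tf.zeros([1, 128, 128, 3]); // placeholder input shape
    const out = model.execute(input); // synchronous graph execution
    const first = Array.isArray(out) ? out[0] : out;
    await first.data(); // force the gpu queue to flush so the frame time is real
    tf.dispose([input, out]);
    console.log(tf.getBackend(), 'frame ms:', Math.round(performance.now() - t0));
    requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);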

@qjia7
Contributor

qjia7 commented Oct 11, 2021

@vladmandic Thanks for the detailed information. I can run your benchmarks using EfficientNet now. Some comments below:

  1. I didn't hit the warning Binding size bigger than maximum uniform buffer binding size: binding 0 given 146313216 bytes, maximum is 16384 bytes using EfficientNet. Maybe it's Inception v4 specific. If you can further narrow down which op with what kind of input shapes introduces this warning, that would be helpful for us.
  2. Shader caching is completely unimplemented in the Chrome browser for WebGPU. See gpuweb:2111, dawn:549. So we have to wait for the browser to support it. At the TFJS level, we will see if we can further reduce the shader variants to reduce the warmup time.
  3. For tf.image.nonMaxSuppressionAsync, it may not be the culprit. Currently, nonMaxSuppressionAsync only runs on the CPU; there is no GPU kernel for it. So I guess the slowness is caused by the ops executed before nonMaxSuppressionAsync - calling it forces all pending GPU ops to finish execution. What kind of model are you executing before nonMaxSuppressionAsync?
  4. For small models, webgpu does have a performance issue. We have noticed that the current conv2d/matmul is not efficient for irregular inputs, e.g. where M and N are small and K is very large, or where the input height/width is smaller than the filter height/width (see the sketch below). We are working on optimizations for these kinds of shapes. Will update here once we have progress. Thanks.
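
For concreteness, a small sketch of the kind of "irregular" shape meant in point 4 (the sizes are made up for illustration):

  // illustrative "irregular" matmul: M and N are tiny while K is huge (sizes are made up)
  const a = tf.randomNormal([2, 65536]); // M=2, K=65536
  const b = tf.randomNormal([65536, 4]); // K=65536, N=4
  const t0 = performance.now();
  const c = tf.matMul(a, b);
  await c.data(); // wait for the gpu to finish before reading the clock
  console.log('matmul ms:', Math.round(performance.now() - t0));
  tf.dispose([a, b, c]);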

@vladmandic
Contributor Author

vladmandic commented Oct 11, 2021

> Shader caching is completely unimplemented in the Chrome browser for WebGPU. See gpuweb:2111, dawn:549. So we have to wait for the browser to support it.

Thanks.
I took a look, and the current approach by the Chrome team doesn't seem very encouraging, and the thread on the spec itself has been idle for 8 months :(

> At the TFJS level, we will see if we can further reduce the shader variants to reduce the warmup time

Much appreciated!

> For small models, webgpu does have a performance issue. We have noticed that the current conv2d/matmul is not efficient for irregular inputs, e.g. where M and N are small and K is very large, or where the input height/width is smaller than the filter height/width. We are working on optimizations for these kinds of shapes. Will update here once we have progress. Thanks.

Thanks for confirming

> For tf.image.nonMaxSuppressionAsync, it may not be the culprit. Currently, nonMaxSuppressionAsync only runs on the CPU; there is no GPU kernel for it. So I guess the slowness is caused by the ops executed before nonMaxSuppressionAsync - calling it forces all pending GPU ops to finish execution. What kind of model are you executing before nonMaxSuppressionAsync?

You're right - the perf problem is basically ANY first TF operation executed in JS code (outside of the model); there is a massive latency penalty.
In my previous post, nonMaxSuppressionAsync just happened to be that operation.

Simple reproduction:

  const numIterations = 50;
  const arr = new Uint8Array(imageData?.data.buffer); // input data in my case is 4k imageData, but can be any dataset
  const t0 = performance.now();
  for (let i = 0; i < numIterations; i++) {
    const rgba = tf.tensor(arr, [imageData.width, imageData.height, 4], 'int32'); // create rgba tensor
    const rgb = tf.slice3d(rgba, [0, 0, 0], [-1, -1, 3]); // strip alpha channel
    const tensor = tf.expandDims(rgb, 0); // create standard image tensor [1, height, width, 3]
    // const data = await tensor.array(); // download data from gpu
    tf.dispose([rgba, rgb, tensor]); // just dispose everything
  }
  const t1 = performance.now();
  const avgTime = Math.round((t1 - t0) / numIterations);
  console.log({ backend: tf.getBackend(), average: avgTime });

This loop in WebGPU is about 3x slower than in WebGL

Setting tf.ENV.set('WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE', 0) reduces the latency by 50%, but it's still slower than WebGL.
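
For reference, the equivalent call through the current env API (my reading of the flag name is that 0 disables command batching):

  // sketch: set the flag before running the loop above; 0 appears to submit the gpu
  // queue immediately instead of batching commands (assumption based on the flag name)
  tf.env().set('WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE', 0);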

Note that when the commented-out line that downloads data back from the GPU is enabled, both WebGL and WebGPU slow down a lot, since downloading data is slow (expected).
BUT - overall execution becomes faster for WebGPU than WebGL. WebGPU itself is fast; it's just the initial latency that's the killer.

So running models in WebGPU is faster than WebGL, but preparing inputs and processing outputs adds a huge penalty at the moment.

> I didn't hit the warning ...
> Maybe it's Inception v4 specific.
> If you can further narrow down which op with what kind of input shapes introduces this warning, that would be helpful for us.

I'm getting the warning even when running just the simple code above, with no model execution at all.

The message is slightly different in Chrome 97 vs 94, and the maximum binding size is much bigger, but the error is pretty much the same:

warning Binding size (146313216) is larger than the maximum binding size (134217728).
 - While validating entries[0] as a Buffer
 - While validating [BindGroupDescriptor] against [BindGroupLayout]
 - While calling CreateBindGroup([BindGroupDescriptor]).
DATA:  warning [BindGroup] is an error.
    at ValidateObject (../../third_party/dawn/src/dawn_native/Device.cpp:473)
    at ValidateSetBindGroup (../../third_party/dawn/src/dawn_native/ProgrammablePassEncoder.cpp:116)
    at operator() (../../third_party/dawn/src/dawn_native/ComputePassEncoder.cpp:184)
    at FinishInternal (../../third_party/dawn/src/dawn_native/CommandEncoder.cpp:1035)

There are no errors logged in about://gpu in Chrome

vladmandic changed the title from "regression: tfjs-backend-webgpu major performance degradation" to "WebGPU Performance Issues" on Oct 11, 2021
@vladmandic
Contributor Author

@qjia7 did you have a chance to look at the WebGPU latency issues I mentioned? The tests above are from 22 days ago.

@qjia7
Contributor

qjia7 commented Nov 3, 2021

@vladmandic Sorry, I can't reproduce the latency issue you mentioned. From the code snippet you pasted, it basically does nothing and executes instantaneously on my side.

  const numIterations = 50;
  const arr = new Uint8Array(imageData?.data.buffer); // input data in my case is 4k imageData, but can be any dataset
  const t0 = performance.now();
  for (let i = 0; i < numIterations; i++) {
    const rgba = tf.tensor(arr, [imageData.width, imageData.height, 4], 'int32'); // create rgba tensor
    const rgb = tf.slice3d(rgba, [0, 0, 0], [-1, -1, 3]); // strip alpha channel
    const tensor = tf.expandDims(rgb, 0); // create standard image tensor [1, height, width, 3]
    // const data = await tensor.array(); // download data from gpu
    tf.dispose([rgba, rgb, tensor]); // just dispose everything
  }
  const t1 = performance.now();
  const avgTime = Math.round((t1 - t0) / numIterations);
  console.log({ backend: tf.getBackend(), average: avgTime });

Some updates on the warmup time:

  • At the TFJS level, we are first reducing some shader variants, starting with binary ops (webgpu: Reduce binary ops shader variants #5791). We will move on to other ops when we get each operator's exact shader compilation time. Some challenges for us are that 1) it's hard to measure each operator's shader compilation time (details can be found here, here), and 2) browser devtools also lack this support. Fortunately, Chrome just added it to chrome://tracing. We can continue this work once it's ready in Canary.

  • At the WebGPU level, a Google developer has started work on caching pipelines in memory.

@vladmandic
Contributor Author

@qjia7 thanks for the update!

And the description of the Chromium queue handling sounds like it could be the same root cause for the extreme latency issues I'm seeing.

For reproduction, I'm guessing your test failed because imageData was empty, but you can even use tf.zeros to reproduce -
no model needed, nothing - just a trivial loop.
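
For example, a variant of the earlier loop with tf.zeros standing in for the 4k imageData (the dimensions are my assumption for a 4k frame):

  const numIterations = 50;
  const t0 = performance.now();
  for (let i = 0; i < numIterations; i++) {
    const rgba = tf.zeros([3840, 2160, 4], 'int32'); // stand-in for 4k rgba image data
    const rgb = tf.slice3d(rgba, [0, 0, 0], [-1, -1, 3]); // strip alpha channel
    const tensor = tf.expandDims(rgb, 0); // standard image tensor
    tf.dispose([rgba, rgb, tensor]); // just dispose everything
  }
  const avgTime = Math.round((performance.now() - t0) / numIterations);
  console.log({ backend: tf.getBackend(), average: avgTime });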

here's a live link: https://vladmandic.github.io/tfjs-utils/src/latency-issue.html

and output on my notebook:

user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4689.0 Safari/537.36
tf version: 3.11.0-20211102
backend: webgl | total time: 4556 ms | average time: 91 ms
backend: webgpu | total time: 13058 ms | average time: 261 ms

Basically, for any "real" work, WebGPU is really fast - but for simple stuff done in JS code outside of the model, the latency is a killer - it's about 3x slower than WebGL.
Unfortunately, in the real world any model inference is followed by some post-processing in JS, and that is where this impact becomes a showstopper.

Setting tf.ENV.set('WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE', 0) reduces the latency by 50%,
but it's still nowhere near as fast as WebGL.

@qjia7
Contributor

qjia7 commented Nov 4, 2021

@vladmandic Thanks for your live test case. I can reproduce it now. After debugging, I found that the time is mainly spent in queue.writeBuffer. It seems that this API needs to be optimized in the browser for large data uploads. I reported a bug to Chromium: https://bugs.chromium.org/p/chromium/issues/detail?id=1266727.

@qjia7
Contributor

qjia7 commented Dec 13, 2021

@vladmandic Jiawei on our team has fixed the queue.writeBuffer issue in Chromium. You can retest your example https://vladmandic.github.io/tfjs-utils/src/latency-issue.html with the latest Chrome Canary (--enable-unsafe-webgpu). The webgpu backend becomes much faster than before; on my machine, the latency is not that obvious now. We are also trying to use mapAsync instead of writeBuffer, which shows better perf than webgl with your example, but it brings another issue. It's still under discussion, but we hope we can provide the most performant solution soon.

And for the long warmup time, we drafted a prototype for parallel compilation, showing almost a 4x speedup in warmup time. Currently, we are discussing how to expose this capability uniformly between webgl and webgpu. Will keep you updated.

@vladmandic
Contributor Author

Thanks @qjia7!

I've tested with Chrome 99 and the latency is gone - WebGPU now performs on par with WebGL.

I'm looking forward to the other proposed changes (once the issue is resolved) for mapAsync in #5928.
And I'm guessing you're still discussing how to align the changes from #5826 with the changes from #5815?

Anyhow, I'm closing this issue as resolved...


qjia7 added a commit to qjia7/tfjs that referenced this issue Dec 17, 2021
qjia7 added a commit that referenced this issue Dec 22, 2021
PERF
BUG: #5689

* webgpu: Use mapAsync instead of writeBuffer for uploading
* Correct test cases
* Ignore the promise rejection
* Fix buffer was not provided error
* Fix bots failure
* Recover some tests
* Remove unnecessary early-return and reset

* add benchmark test