[webgpu] Use numbers directly instead of const variables #7193

haoyunfeix · 2022-12-20T17:48:39Z

FIXES BUG: #6746
Deno uses Naga for WGSL compilation, but Naga currently uses let for global constants (this will be fixed in gfx-rs/naga#1829).
This PR helps WebGPU to run pose-detection models on Deno by removing global constants in shaders.
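
To make the change concrete, here is a minimal TypeScript sketch of the before/after shape of a generated shader. The helper names (shaderWithGlobalConst, shaderWithInlinedLiteral, getOutputIndex) and the exact form of the original module-scope declaration are assumptions for illustration, not the actual tfjs-backend-webgpu code.

```typescript
// Hypothetical helpers for illustration only; not the real tfjs-backend-webgpu source.

// Before (assumed form): the shader text declares a module-scope constant
// and refers to it by name.
function shaderWithGlobalConst(workgroupSizeX: number): string {
  return `
    const workgroupSizeX = ${workgroupSizeX}u;
    fn getOutputIndex(index : i32) -> i32 {
      return index / i32(workgroupSizeX);
    }`;
}

// After: the number is interpolated at the point of use, so the shader
// text contains no module-scope constant declaration at all.
function shaderWithInlinedLiteral(workgroupSizeX: number): string {
  return `
    fn getOutputIndex(index : i32) -> i32 {
      return index / i32(${workgroupSizeX}u);
    }`;
}
```

With this approach the WGSL front end never sees a global constant, so it does not matter whether it expects let or const at module scope.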

haoyunfeix · 2022-12-21T02:08:25Z

@qjia7 @xhcao @axinging @gyagp PTAL

qjia7

Overall, LGTM. Just some nits.
Please also add the ID of the bug you fixed to the description, e.g. Fixed BUG: #xxx, along with the reason for this change.

qjia7 · 2022-12-21T02:42:26Z

tfjs-backend-webgpu/src/argminmax_webgpu.ts

@@ -102,14 +103,14 @@ export class ArgMinMaxProgram implements WebGPUProgram {
      ${sharedMemorySnippet}

      ${main('index')} {
-        let outputIndex = index / i32(workgroupSizeX);
+        let outputIndex = index / i32(${workgroupSizeX}u);


index / i32(${workgroupSizeX}u) -> index / ${workgroupSizeX}.
Since ${workgroupSizeX} is already an int, you don't need to make it a uint and then cast it back to int.
The same applies in other places.
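
A small TypeScript sketch of the two spellings discussed here, assuming workgroupSizeX is a plain positive integer on the JS side (the function names are hypothetical):

```typescript
// Hypothetical helpers; they only build the one shader line under discussion.

// Form in the PR: emit an unsigned literal, then cast it back to i32.
function outputIndexWithCast(workgroupSizeX: number): string {
  return `let outputIndex = index / i32(${workgroupSizeX}u);`;
}

// Suggested form: emit the unsuffixed integer literal directly.
function outputIndexDirect(workgroupSizeX: number): string {
  return `let outputIndex = index / ${workgroupSizeX};`;
}
```

For workgroupSizeX = 64, the first yields `index / i32(64u)` and the second yields `index / 64`. The second is shorter and avoids the u32-to-i32 round trip.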

qjia7 · 2022-12-21T02:48:47Z

tfjs-backend-webgpu/src/matmul_packed_webgpu.ts

-            let BCached0 = mm_Bsub[k * innerElementSize][tileCol];
-            let BCached1 = mm_Bsub[k * innerElementSize + 1][tileCol];
-            let BCached2 = mm_Bsub[k * innerElementSize + 2][tileCol];
+        for (var k = 0; k < ${tileInner} / ${innerElementSize}; k = k + 1) {


k < ${tileInner} / ${innerElementSize}
->
k < ${ tileInner / innerElementSize}

In fact, I prefer ${tileInner} / ${innerElementSize}, because I view ${} in WGSL as the equivalent of macros in C/C++.

Exactly. for (var k = 0; k < 32 / 4; k = k + 1) is more meaningful than for (var k = 0; k < 8; k = k + 1), in case someone wants to read the generated WGSL for some reason. However, this may cost an additional division in this loop.
WDYT, @qjia7?

Personally, I think it's odd to use 32 / 4 directly rather than 8 in the shader. From the loop body as a whole, we can easily infer that each iteration processes 4 elements. I also don't see why we need exact equivalence between the WGSL and the JS code. Why not just save this division in the shader?

Either one is OK with me. This will not add runtime cost, since Tint and the shader compiler can optimize the division away.
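
The two options in this thread differ only in where the division happens. Here is a hedged TypeScript sketch, with hypothetical function names and the assumption that both sizes are positive integers:

```typescript
// Hypothetical helpers; they only build the loop header under discussion.

// Option 1: interpolate both operands, so the division appears in the WGSL text.
function loopBoundAsExpression(tileInner: number, innerElementSize: number): string {
  return `for (var k = 0; k < ${tileInner} / ${innerElementSize}; k = k + 1) {`;
}

// Option 2: divide in TypeScript and interpolate the result.
// JS division is floating point, so Math.floor keeps the emitted literal an
// integer when the sizes do not divide evenly (assumes positive sizes).
function loopBoundFolded(tileInner: number, innerElementSize: number): string {
  return `for (var k = 0; k < ${Math.floor(tileInner / innerElementSize)}; k = k + 1) {`;
}
```

For tileInner = 32 and innerElementSize = 4, option 1 emits `k < 32 / 4` and option 2 emits `k < 8`. One practical difference is that a bare ${tileInner / innerElementSize} would emit a fractional literal if the sizes were not evenly divisible, which is why the sketch floors the result.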

qjia7 · 2022-12-21T02:59:13Z

tfjs-backend-webgpu/src/matmul_packed_webgpu.ts

        mm_Asub[tileCol] = vec4<f32>(${readVectorASnippet(transposeA)});
        workgroupBarrier();

        // Compute acc values for a single thread.
-        for (var k = 0; k < tileSize / 4; k = k + 1) {
-          let rowB = t * tileSize + k * 4;
+        for (var k = 0; k < ${tileSize} / 4; k = k + 1) {


k < ${tileSize} / 4
->
k < ${tileSize / 4}

qjia7 · 2022-12-21T03:00:51Z

tfjs-backend-webgpu/src/matmul_reduce_webgpu.ts

@@ -34,15 +34,16 @@ export function makeMatMulReduceSource(): string {
      let col = coords[2];
      var sum = 0.0;
      let Length = uniforms.dimInner;
-      for (var k = i32(localId.x); k < Length; k = k + i32(workgroupSizeX)) {
+      for (var k = i32(localId.x); k < Length; k = k + i32(${
+      workgroupSizeX}u)) {


k = k + i32(${workgroupSizeX}u)
->
k = k + ${workgroupSizeX}

qjia7 · 2022-12-21T03:01:36Z

tfjs-backend-webgpu/src/matmul_reduce_webgpu.ts

        let dataA = mm_readA(batchA, row, k);
        let dataB = mm_readB(batchB, k, col);
        sum = sum + dataA * dataB;
      }
      sumValues[localId.x] = sum;
      workgroupBarrier();

-      for(var currentSize = workgroupSizeX / 2u; currentSize > 1u;
+      for(var currentSize = ${workgroupSizeX}u / 2u; currentSize > 1u;


${workgroupSizeX}u / 2u
->
${workgroupSizeX / 2}u

qjia7 · 2022-12-21T03:02:51Z

tfjs-backend-webgpu/src/reduce_webgpu.ts

@@ -97,20 +98,20 @@ export class ReduceProgram implements WebGPUProgram {
          return offset;
       }
       ${main('index')} {
-         let outputIndex = index / i32(workgroupSizeX);
+         let outputIndex = index / i32(${workgroupSizeX}u);


i32(${workgroupSizeX}u)
->
${workgroupSizeX}

qjia7 · 2022-12-21T03:03:36Z

tfjs-backend-webgpu/src/reduce_webgpu.ts

         for (var k = i32(localId.x); k < Length && outputIndex < uniforms.size;
-             k = k + i32(workgroupSizeX)) {
+             k = k + i32(${workgroupSizeX})) {


haoyunfeix

Done.

Reviewable status: 0 of 1 approvals obtained (waiting on @gyagp, @qjia7, and @xhcao)

tfjs-backend-webgpu/src/matmul_packed_webgpu.ts line 216 at r4 (raw file):