
webgpu: Support depthwise conv2d with nchw format #6084

Merged: 16 commits merged into tensorflow:master on Jun 13, 2022

Conversation

qjia7 (Collaborator) commented Jan 27, 2022

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.



Linchenn (Collaborator) commented Jun 7, 2022

@qjia7 Hi Jiajia, from my perspective, computing through NCHW may have better I/O locality, which brings potential performance gains. I am curious why you did not merge this?

qjia7 (Collaborator, Author) commented Jun 8, 2022

> @qjia7 Hi Jiajia, from my perspective, computing through NCHW may have better I/O locality, which brings potential performance gains. I am curious why you did not merge this?

Yes, you are right. This PR does have a perf gain for depthwise conv2d. But at that time, I hadn't figured out how to optimize conv2d using NCHW, so this PR introduces several extra transposes before and after the depthwise conv2d, which is not good for the whole model. But now we have a clear target to use NCHW for the whole model. I will revisit this PR soon after I finish some code cleanup/refactoring. Thanks.
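For context, the layout round-trip an NHWC model pays around each such op looks roughly like this (a minimal sketch; xNhwc and yNchw are hypothetical tensors, and tf is the TensorFlow.js namespace):

    // NHWC -> NCHW before the kernel, and back afterwards; each
    // tf.transpose is an extra pass over the data.
    const xNchw = tf.transpose(xNhwc, [0, 3, 1, 2]);  // NHWC -> NCHW
    // ... run the NCHW depthwise conv2d kernel, producing yNchw ...
    const yNhwc = tf.transpose(yNchw, [0, 2, 3, 1]);  // NCHW -> NHWC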

Linchenn (Collaborator) commented Jun 8, 2022

Got it, and thank you for explaining this! 👍🏻

@qjia7 qjia7 requested review from Linchenn, xhcao and lina128 June 9, 2022 09:52
@qjia7 qjia7 changed the title webgpu: Optimize depthwise conv2d with nchw format webgpu: Support depthwise conv2d with nchw format Jun 9, 2022
Linchenn (Collaborator) left a comment

Thank you Jiajia! The implementation of NCHW is very impressive!



tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts line 59 at r11 (raw file):

      convInfo.strideHeight === 1 && convInfo.strideWidth === 1 &&
      convInfo.dilationWidth === 1 && convInfo.dilationHeight === 1 &&
      convInfo.inChannels === convInfo.outChannels) {

Since the workGroupSize is fixed as [8, 8, 1], if we get into this branch with filterSize > 64, I think the assert of filterSize <= workGroupSize will be triggered?
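For concreteness, a minimal sketch of the arithmetic behind this concern (the 9x9 filter is a hypothetical example, not taken from the PR):

    // A work group of [8, 8, 1] has 8 * 8 * 1 = 64 invocations.
    const workGroupSize = 8 * 8 * 1;
    // A hypothetical 9x9 depthwise filter:
    const filterSize = 9 * 9;  // 81
    // 81 > 64, so an assert of filterSize <= workGroupSize would fire.
    console.assert(filterSize <= workGroupSize, 'filterSize exceeds workGroupSize');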


tfjs-backend-webgpu/src/depthwise_conv2d_nchw_shared_webgpu.ts line 145 at r11 (raw file):

        // Load one tile of X into local memory.
        mm_Asub[localRow][localCol] = readX(batch, d1, inputRowStart, inputColStart);

I think your logic here is right, but could you help me understand it?

It looks like all the workers will read the same data, because the parameters in readX(batch, d1, inputRowStart, inputColStart) are constant for all workers, independent of the workers' positions?

Also, why do we read 4 times for mm_Asub, and why don't we just do something like:

          let wRow = localIndex / ${this.filterWidth};
          let wCol = localIndex % ${this.filterWidth};
          mm_Asub[wRow][wCol] = getW(batch, d1, inputRowStart + wRow, inputColStart + wCol);

Linchenn (Collaborator) left a comment



tfjs-backend-webgpu/src/depthwise_conv2d_webgpu.ts line 111 at r11 (raw file):

            uniforms.dilation[1];

        // Convolve x(?, ?, d1) with w(:, :, d1, q) to get y(yR, yC, d2).

Could you update this comment for NCHW?

qjia7 (Collaborator, Author) left a comment



tfjs-backend-webgpu/src/depthwise_conv2d_webgpu.ts line 111 at r11 (raw file):

Previously, Linchenn wrote…

Could you update this comment for NCHW?

Done


tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts line 59 at r11 (raw file):

Previously, Linchenn wrote…

Since the workGroupSize is fixed as [8, 8, 1], if we get into this branch with filterSize > 64, I think the assert of filterSize <= workGroupSize will be triggered?

Yes, you are right. But I have removed this limitation in the latest commit by using a for loop to allow some workers to load more data to fill mm_Bsub.
I also changed the workGroupSize to a larger one, [16, 16, 1], for more data reuse within a work group.


tfjs-backend-webgpu/src/depthwise_conv2d_nchw_shared_webgpu.ts line 145 at r11 (raw file):

> It looks like all the workers will read the same data

In fact, they read different data. inputRowStart and inputColStart have a 1:1 mapping to localRow and localCol, and each worker in a work group has a unique localRow and localCol. All workers in a work group are expected to cooperatively load a tile of the input x. If the tile size of mm_Asub were the same as the work group size, each worker would only need to load one element. However, the current tile size [16][16] is 4 times the original work group size [8, 8], which means each worker needs to load 4 elements. That's why you see we read 4 times for mm_Asub.

As for why we need a larger tile size: we need to make sure each output element can get all the input data it needs. So the right tile size is [workGroupSizeX + filterWidth - 1][workGroupSizeY + filterHeight - 1]. In the first version, I just set it to [16][16] to avoid the bounds checking, but that may waste some shared memory. In the latest commit, I changed it to the size that is really needed.

The purpose of using shared memory is to reuse data and reduce the number of accesses to global memory, since accessing shared memory is much faster than accessing global memory.
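To picture the loading pattern described above, here is a sketch of the idea rather than the PR's exact shader code (tileRowStart and tileColStart are assumed names for the tile's origin in the input):

    // With a 16x16 tile and an 8x8 work group, each invocation
    // fills four tile slots at stride-8 offsets.
    for (var offsetRow = 0; offsetRow < 2; offsetRow = offsetRow + 1) {
      for (var offsetCol = 0; offsetCol < 2; offsetCol = offsetCol + 1) {
        let row = localRow + offsetRow * 8;
        let col = localCol + offsetCol * 8;
        mm_Asub[row][col] = readX(batch, d1, tileRowStart + row, tileColStart + col);
      }
    }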

Linchenn (Collaborator) left a comment

Cool! I guess this will have an obvious performance gain for large filters. Thank you Jiajia so much for the detailed explanation!



tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts line 59 at r11 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

Yes, you are right. But I have removed this limitation in the latest commit by using a for loop to allow some workers to load more data to fill mm_Bsub.
I also changed the workGroupSize to a larger one, [16, 16, 1], for more data reuse within a work group.

Great!

qjia7 (Collaborator, Author) commented Jun 10, 2022

@Linchenn I pushed another commit to limit DepthwiseConv2DNCHWSharedProgram to the nchw format only. Although DepthwiseConv2DNCHWSharedProgram itself has good performance, the transposes before and after add non-negligible overhead for nhwc. After testing, #6521 provides better performance overall for the 5x5 nhwc filter.

Linchenn (Collaborator) commented

Great catch and thank you!

xhcao (Collaborator) left a comment

LGTM with a question, thank you.

${
filterSize < workGroupSize ?
`if (wIndex < ${filterSize})` :
`for(; wIndex < ${filterSize}; wIndex = wIndex + ${workGroupSize})`}
xhcao (Collaborator):

Is the if statement needed here? Isn't it already a subset of the for?

qjia7 (Collaborator, Author):

It may be slightly better for perf to use if instead of for, since it doesn't need to execute wIndex = wIndex + ${workGroupSize} and another wIndex < ${filterSize} check when filterSize is smaller than workGroupSize.
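As an illustration of the two generated variants, with hypothetical sizes (a 5x5 filter, filterSize = 25, versus a 17x17 filter, filterSize = 289, and a [16, 16, 1] work group, workGroupSize = 256):

    // filterSize < workGroupSize: one guarded load, no loop bookkeeping.
    if (wIndex < 25) { /* load one weight */ }

    // filterSize >= workGroupSize: invocations stride through the filter,
    // paying the extra add and compare on each iteration.
    for (; wIndex < 289; wIndex = wIndex + 256) { /* load weights */ }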

xhcao (Collaborator):

Got it, thank you.

@qjia7 qjia7 merged commit bc1059b into tensorflow:master Jun 13, 2022