
webgpu: Support depthwise conv2d with nchw format #6084

Merged: 16 commits merged into tensorflow:master on Jun 13, 2022

Conversation

qjia7 (Collaborator) commented Jan 27, 2022

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.



Linchenn (Collaborator) commented Jun 7, 2022

@qjia7 Hi Jiajia, from my perspective, computing through NCHW may have better I/O locality, which brings potential performance gains. I am curious why you did not merge this?

qjia7 (Collaborator, Author) commented Jun 8, 2022

> @qjia7 Hi Jiajia, from my perspective, computing through NCHW may have better I/O locality, which brings potential performance gains. I am curious why you did not merge this?

Yes, you are right. This PR does have a perf gain for depthwise conv2d. But at that time, I hadn't figured out how to optimize conv2d using NCHW, so this PR introduces several extra transposes before and after the depthwise conv2d, which is not good for the whole model. But now we have a clear target to use NCHW for the whole model. I will revisit this PR soon after I finish some code cleanup/refactoring. Thanks.
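For context, the layout round-trip an NHWC model pays around each such op looks roughly like this (a minimal sketch; xNhwc and yNchw are hypothetical tensors, and tf is the TensorFlow.js namespace):

    // NHWC -> NCHW before the kernel, and back afterwards; each
    // tf.transpose is an extra pass over the data.
    const xNchw = tf.transpose(xNhwc, [0, 3, 1, 2]);  // NHWC -> NCHW
    // ... run the NCHW depthwise conv2d kernel, producing yNchw ...
    const yNhwc = tf.transpose(yNchw, [0, 2, 3, 1]);  // NCHW -> NHWC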

Linchenn (Collaborator) commented Jun 8, 2022

Got it, and thank you for explaining this! 👍🏻

@qjia7 qjia7 requested review from Linchenn, xhcao and lina128 June 9, 2022 09:52
@qjia7 qjia7 changed the title webgpu: Optimize depthwise conv2d with nchw format webgpu: Support depthwise conv2d with nchw format Jun 9, 2022
Linchenn (Collaborator) left a comment

Thank you Jiajia! The implementation of NCHW is very impressive!



tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts line 59 at r11 (raw file):

      convInfo.strideHeight === 1 && convInfo.strideWidth === 1 &&
      convInfo.dilationWidth === 1 && convInfo.dilationHeight === 1 &&
      convInfo.inChannels === convInfo.outChannels) {

Since the workGroupSize is fixed as [8, 8, 1], if we get into this branch with filterSize > 64, I think the assert of filterSize <= workGroupSize will be triggered?
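For concreteness, a minimal sketch of the arithmetic behind this concern (the 9x9 filter is a hypothetical example, not taken from the PR):

    // A work group of [8, 8, 1] has 8 * 8 * 1 = 64 invocations.
    const workGroupSize = 8 * 8 * 1;
    // A hypothetical 9x9 depthwise filter:
    const filterSize = 9 * 9;  // 81
    // 81 > 64, so an assert of filterSize <= workGroupSize would fire.
    console.assert(filterSize <= workGroupSize, 'filterSize exceeds workGroupSize');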


tfjs-backend-webgpu/src/depthwise_conv2d_nchw_shared_webgpu.ts line 145 at r11 (raw file):

        // Load one tile of X into local memory.
        mm_Asub[localRow][localCol] = readX(batch, d1, inputRowStart, inputColStart);

I think your logic here is right, but could you help me understand it?

It looks like all the workers will read the same data, because the parameters in readX(batch, d1, inputRowStart, inputColStart) are constant for all workers, independent of the workers' positions?

Also, why do we read 4 times for mm_Asub, and why don't we just do something like:

          let wRow = localIndex / ${this.filterWidth};
          let wCol = localIndex % ${this.filterWidth};
          mm_Asub[wRow][wCol] = getW(batch, d1, inputRowStart + wRow, inputColStart + wCol);

Linchenn (Collaborator) left a comment



tfjs-backend-webgpu/src/depthwise_conv2d_webgpu.ts line 111 at r11 (raw file):

            uniforms.dilation[1];

        // Convolve x(?, ?, d1) with w(:, :, d1, q) to get y(yR, yC, d2).

Could you update this comment for NCHW?

qjia7 (Collaborator, Author) left a comment



tfjs-backend-webgpu/src/depthwise_conv2d_webgpu.ts line 111 at r11 (raw file):

Previously, Linchenn wrote…

Could you update this comment for NCHW?

Done


tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts line 59 at r11 (raw file):

Previously, Linchenn wrote…

Since the workGroupSize is fixed as [8, 8, 1], if we get into this branch with filterSize > 64, I think the assert of filterSize <= workGroupSize will be triggered?

Yes, you are right. But I have removed this limitation in the latest commit by using a for loop to allow some workers to load more data to fill mm_Bsub.
I also changed the workGroupSize to a larger one, [16, 16, 1], for more data reuse within a work group.


tfjs-backend-webgpu/src/depthwise_conv2d_nchw_shared_webgpu.ts line 145 at r11 (raw file):

> It looks like all the workers will read the same data

In fact, they read different data. inputRowStart and inputColStart have a 1:1 mapping to localRow and localCol, and each worker in a work group has a unique localRow and localCol. All workers in a work group are expected to cooperatively load a tile of the input x. If the tile size of mm_Asub were the same as the work group size, each worker would only need to load one element. However, the current tile size [16][16] is 4 times the original work group size [8, 8], which means each worker needs to load 4 elements. That's why you see we read 4 times for mm_Asub.

As for why we need a larger tile size: we need to make sure each output element can get all the input data it needs. So the right tile size is [workGroupSizeX + filterWidth - 1][workGroupSizeY + filterHeight - 1]. In the first version, I just set it to [16][16] to avoid the bounds checking, but that may waste some shared memory. In the latest commit, I changed it to the size that is really needed.

The purpose of using shared memory is to reuse data and reduce the number of accesses to global memory, since accessing shared memory is much faster than accessing global memory.
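To picture the loading pattern described above, here is a sketch of the idea rather than the PR's exact shader code (tileRowStart and tileColStart are assumed names for the tile's origin in the input):

    // With a 16x16 tile and an 8x8 work group, each invocation
    // fills four tile slots at stride-8 offsets.
    for (var offsetRow = 0; offsetRow < 2; offsetRow = offsetRow + 1) {
      for (var offsetCol = 0; offsetCol < 2; offsetCol = offsetCol + 1) {
        let row = localRow + offsetRow * 8;
        let col = localCol + offsetCol * 8;
        mm_Asub[row][col] = readX(batch, d1, tileRowStart + row, tileColStart + col);
      }
    }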

Linchenn (Collaborator) left a comment

Cool! I guess this will have an obvious performance gain for large filters. Thank you Jiajia so much for the detailed explanation!



tfjs-backend-webgpu/src/kernels/DepthwiseConv2dNative.ts line 59 at r11 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

Yes, you are right. But I have removed this limitation in the latest commit by using a for loop to allow some workers to load more data to fill mm_Bsub.
I also changed the workGroupSize to a larger one, [16, 16, 1], for more data reuse within a work group.

Great!

qjia7 (Collaborator, Author) commented Jun 10, 2022

@Linchenn I pushed another commit to limit DepthwiseConv2DNCHWSharedProgram to the nchw format only. Although DepthwiseConv2DNCHWSharedProgram itself has good performance, the transposes before and after add non-negligible overhead for nhwc. After testing, #6521 provides better performance overall for the 5x5 nhwc filter.

Linchenn (Collaborator) commented

Great catch and thank you!

xhcao (Collaborator) left a comment

LGTM with a question, thank you.

${
filterSize < workGroupSize ?
`if (wIndex < ${filterSize})` :
`for(; wIndex < ${filterSize}; wIndex = wIndex + ${workGroupSize})`}
xhcao (Collaborator):

Is the if statement needed here? Isn't it already a subset of the for?

qjia7 (Collaborator, Author):

It may be slightly better for perf to use if instead of for, since it doesn't need to execute wIndex = wIndex + ${workGroupSize} and another wIndex < ${filterSize} check when filterSize is smaller than workGroupSize.
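As an illustration of the two generated variants, with hypothetical sizes (a 5x5 filter, filterSize = 25, versus a 17x17 filter, filterSize = 289, and a [16, 16, 1] work group, workGroupSize = 256):

    // filterSize < workGroupSize: one guarded load, no loop bookkeeping.
    if (wIndex < 25) { /* load one weight */ }

    // filterSize >= workGroupSize: invocations stride through the filter,
    // paying the extra add and compare on each iteration.
    for (; wIndex < 289; wIndex = wIndex + 256) { /* load weights */ }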

xhcao (Collaborator):

Got it, thank you.

@qjia7 qjia7 merged commit bc1059b into tensorflow:master Jun 13, 2022