webgpu: Optimize depthwise conv2d #5209

Merged
merged 2 commits into from Jun 15, 2021

@qjia7 qjia7 commented Jun 11, 2021

PERF



qjia7 commented Jun 11, 2021

@kainino0x @lina128 @jinjingforever Please take a look, thanks.

In our last meeting, I mentioned that depthwiseConv2d on webgpu was almost 2x slower than on webgl on Intel GPUs. With this change, webgpu and webgl perform about the same.
hand_detector:

DepthwiseConv2dNative 13.88

becomes

DepthwiseConv2dNative 8.28

This optimization mainly targets a 3x3 filter with a 1x1 stride, a configuration that is widely used in all the models we tested.
For this kind of depthwise conv2d, because the stride is only 1x1, neighboring outputs in each channel tile read much of the same input data. If each invocation computes a single output value, it reads 3x3 values from x and 3x3 values from the filter, so producing a 4x4 block of outputs takes (3x3 + 3x3) * (4x4) = 288 memory accesses. If each invocation instead computes several outputs at once, so that it reads a 3x6 patch of x and the 3x3 filter, the same 4x4 block needs only (3x6) * 4 reads of x and (3x3) * 4 reads of the filter. That is (3x6 + 3x3) * 4 = 108 accesses, or 37.5% of the previous count, which is less than half.
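The access counts above can be checked with a small CPU sketch. This is not the WebGPU shader from this PR. It is a single-channel, stride-1 illustration that assumes the 3x6 patch corresponds to one row of four outputs (a 3x3 filter sliding over 6 columns yields 4 outputs). The names `loadX`, `loadW`, `onePerInvocation`, and `fourPerInvocation` are hypothetical and exist only for this example.

```typescript
// Hypothetical CPU sketch: single channel, stride 1, no padding.
const K = 3;              // filter is K x K
const OUT = 4;            // output block is OUT x OUT
const IN = OUT + K - 1;   // input block is 6 x 6

let reads = 0;            // counts every element fetched from x or the filter
const x: number[] = Array.from({length: IN * IN}, (_, i) => i % 7);
const w: number[] = Array.from({length: K * K}, (_, i) => i - 4);
const loadX = (r: number, c: number): number => { reads++; return x[r * IN + c]; };
const loadW = (r: number, c: number): number => { reads++; return w[r * K + c]; };

// One "invocation" per output: 3x3 reads of x plus 3x3 reads of the filter each.
function onePerInvocation(): number[] {
  const out: number[] = new Array(OUT * OUT).fill(0);
  for (let r = 0; r < OUT; r++)
    for (let c = 0; c < OUT; c++)
      for (let i = 0; i < K; i++)
        for (let j = 0; j < K; j++)
          out[r * OUT + c] += loadX(r + i, c + j) * loadW(i, j);
  return out;
}

// One "invocation" per row of four outputs: load a 3x6 patch of x and the
// 3x3 filter once, then reuse the local copies for all four outputs.
function fourPerInvocation(): number[] {
  const out: number[] = new Array(OUT * OUT).fill(0);
  for (let r = 0; r < OUT; r++) {
    const patch: number[][] = [];
    const filt: number[][] = [];
    for (let i = 0; i < K; i++) {
      patch.push(Array.from({length: IN}, (_, c) => loadX(r + i, c)));
      filt.push(Array.from({length: K}, (_, j) => loadW(i, j)));
    }
    for (let c = 0; c < OUT; c++)
      for (let i = 0; i < K; i++)
        for (let j = 0; j < K; j++)
          out[r * OUT + c] += patch[i][c + j] * filt[i][j];
  }
  return out;
}

reads = 0;
const naiveOut = onePerInvocation();
const naiveReads = reads;   // (3x3 + 3x3) * (4x4)

reads = 0;
const stripOut = fourPerInvocation();
const stripReads = reads;   // (3x6 + 3x3) * 4

console.log(naiveReads, stripReads, stripReads / naiveReads);
```

Both functions produce the same output block; only the number of loads differs. On a GPU the actual saving also depends on caching and on how invocations are laid out across channels, which this sketch does not model.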

@lina128 lina128 left a comment


Thank you for the explanation and great perf improvement!

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @jinjingforever and @kainino0x)

@qjia7 qjia7 merged commit 64f3ff1 into tensorflow:master Jun 15, 2021
@qjia7 qjia7 deleted the depthwise_opt branch May 5, 2023 07:22