[ET-VK] Optimize conv2d s1p0 #14187
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14187
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure
As of commit 235536e with merge base 44972ad. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Overall LGTM. Just some questions about things tried and some small requests for stylistic changes. Thanks for working on this!
```glsl
    outputTexel[3] += dot(inputVec, vec4(weight1OutputChannelPacked[3], weight2OutputChannelPacked[3], weight3OutputChannelPacked[3], weight4OutputChannelPacked[3]));
}

imageStore(t_out, ivec3(xIdx, yIdx, gid1), op(outputTexel, out_min, out_max));
```
iiuc, previously the shader calculated 4 output texels but the new one only calculates one. Have you experimented with computing a bigger output tile? Might be something that can get us a further boost. Note that I am also ok with landing the shader in its current form though, given the perf improvement.
I tried a bigger output tile, and on my end it's slightly slower for MobileNet's case. I believe this is due to the excessive vector register usage when we increase the tile size.
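To illustrate the register-pressure trade-off being discussed, here is a minimal GLSL sketch (not the shader in this PR; `t_in`, `inChannelTexels`, and the loop structure are hypothetical stand-ins) of what a 2-wide output tile would look like:

```glsl
// Illustrative sketch only: widening the output tile from 1 to 2 texels
// doubles the live accumulator and input registers held across the loop.
vec4 outputTexel0 = bias;  // 4 accumulator registers
vec4 outputTexel1 = bias;  // +4 more for a 2-wide tile
for (int inputC = 0; inputC < inChannelTexels; ++inputC) {
    vec4 inputVec0 = texelFetch(t_in, ivec3(xIdx,     yIdx, inputC), 0);
    vec4 inputVec1 = texelFetch(t_in, ivec3(xIdx + 1, yIdx, inputC), 0);
    // The four weight texels are shared across the tile, so arithmetic
    // intensity improves, but each extra output texel adds more live
    // registers, which can reduce occupancy on register-limited GPUs.
    // ... accumulate into outputTexel0 and outputTexel1 with the shared weights
}
```

Whether the reuse win outweighs the occupancy loss depends on the GPU's register budget, which would be consistent with the tile size helping on some devices and hurting on others.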
```glsl
weight4OutputChannelPacked = texelFetch(t_kernel, ivec2(inputC * 4 + 3, gid1), 0);

const vec4 bias = texelFetch(t_bias, ivec2(out_pos_z, 0), 0);
outputTexel[0] += dot(inputVec, vec4(weight1OutputChannelPacked[0], weight2OutputChannelPacked[0], weight3OutputChannelPacked[0], weight4OutputChannelPacked[0]));
```
In my experience, computing matmul-like operations with `dot` was not as fast as `fma`. Did you try computing with `fma` as well? Curious to know if you have had a different experience. Btw you can take a look at the infographic in the old shader's comments to see how `fma` can be used instead of `dot`.
Just tried it on my end. For me it's essentially the same using `fma` instead of `dot`. Might be some compiler magic happening that makes them end up as the same operations.
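For reference, the two formulations being compared can be sketched as follows (a GLSL sketch using the identifiers from the quoted diff; this is illustrative, not the PR's actual code):

```glsl
// (1) dot form: one dot per output channel, gathering lane j of each of
//     the four channel-packed weight texels.
outputTexel[0] += dot(inputVec, vec4(weight1OutputChannelPacked[0],
                                     weight2OutputChannelPacked[0],
                                     weight3OutputChannelPacked[0],
                                     weight4OutputChannelPacked[0]));
// ... likewise for outputTexel[1..3]

// (2) fma form: broadcast one input lane across all four output channels
//     and accumulate with fused multiply-adds.
outputTexel = fma(vec4(inputVec.x), weight1OutputChannelPacked, outputTexel);
outputTexel = fma(vec4(inputVec.y), weight2OutputChannelPacked, outputTexel);
outputTexel = fma(vec4(inputVec.z), weight3OutputChannelPacked, outputTexel);
outputTexel = fma(vec4(inputVec.w), weight4OutputChannelPacked, outputTexel);
```

Both compute the same 4x4 matrix-vector product per input texel, so a shader compiler lowering `dot` to an FMA chain would explain the identical timings observed above.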
@pytorchbot cherry-pick --onto release/1.0 -c fixnewfeature
This change improves the execution of the pointwise conv2d s1p0 shader. It does so through a more GEMM-like implementation and more explicit loop unrolling. cc @SS-JIA @manuelcandales @cbilgin (cherry picked from commit b265324)
Cherry picking #14187
The cherry pick PR is at #14724 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team
Raised by workflow job
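The "GEMM-like" structure mentioned in the description can be sketched as follows: for a 1x1 conv with stride 1 and padding 0, each output pixel is an independent matrix-vector product `out[:, y, x] = W * in[:, y, x] + bias`. A minimal GLSL sketch (identifiers such as `numInputTexels` are hypothetical; this is not the PR's actual shader):

```glsl
// Pointwise s1p0 conv as a per-pixel matrix-vector product.
// Each inner iteration consumes a 4x4 weight block: 4 packed input
// channels x 4 packed output channels.
vec4 acc = texelFetch(t_bias, ivec2(gid1, 0), 0);
for (int inputC = 0; inputC < numInputTexels; ++inputC) {  // Cin / 4
    vec4 inputVec = texelFetch(t_in, ivec3(xIdx, yIdx, inputC), 0);
    vec4 w0 = texelFetch(t_kernel, ivec2(inputC * 4 + 0, gid1), 0);
    vec4 w1 = texelFetch(t_kernel, ivec2(inputC * 4 + 1, gid1), 0);
    vec4 w2 = texelFetch(t_kernel, ivec2(inputC * 4 + 2, gid1), 0);
    vec4 w3 = texelFetch(t_kernel, ivec2(inputC * 4 + 3, gid1), 0);
    acc = fma(vec4(inputVec.x), w0, acc);
    acc = fma(vec4(inputVec.y), w1, acc);
    acc = fma(vec4(inputVec.z), w2, acc);
    acc = fma(vec4(inputVec.w), w3, acc);
}
imageStore(t_out, ivec3(xIdx, yIdx, gid1), acc);
```

Because the kernel window is a single pixel with no padding, there is no halo handling, so the loop body unrolls cleanly, which is what makes the explicit-unrolling strategy pay off here.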