alexdean08 (Collaborator) commented Sep 11, 2025

This change improves the performance of the pointwise conv2d s1p0 shader. It does so through a more GEMM-like implementation and more explicit loop unrolling.
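
For readers less familiar with the framing: a pointwise (1x1) convolution with stride 1 and padding 0 is a matrix multiply over channels, applied independently at each pixel. The sketch below is a plain-Python CPU model of that math, not the shader from this PR; the function name `pointwise_conv2d` and the nested-list layouts are assumptions made for illustration.

```python
def pointwise_conv2d(inp, weight, bias):
    """CPU model of a 1x1, stride-1, padding-0 convolution.

    inp:    [C_in][H][W] nested lists (hypothetical layout)
    weight: [C_out][C_in]
    bias:   [C_out]
    returns [C_out][H][W]
    """
    c_in = len(inp)
    h = len(inp[0])
    w = len(inp[0][0])

    # Flatten the spatial dims: each pixel becomes one column of the GEMM,
    # holding that pixel's value in every input channel.
    pixels = [[inp[c][y][x] for c in range(c_in)]
              for y in range(h) for x in range(w)]

    out = [[[0.0] * w for _ in range(h)] for _ in weight]
    for oc, row in enumerate(weight):
        for p, px in enumerate(pixels):
            acc = bias[oc]
            # Reduction over input channels: the inner GEMM loop.
            for ic in range(c_in):
                acc += row[ic] * px[ic]
            out[oc][p // w][p % w] = acc
    return out


if __name__ == "__main__":
    print(pointwise_conv2d([[[1.0, 2.0]], [[3.0, 4.0]]],
                           [[1.0, 2.0], [3.0, 4.0]],
                           [0.5, -1.0]))
```

Judging from the diff context quoted in the review, the shader walks the input-channel reduction four channels at a time, with the four steps written out explicitly; in this model that corresponds to unrolling the `ic` loop by four.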

cc @SS-JIA @manuelcandales @cbilgin

alexdean08 requested a review from SS-JIA as a code owner Sep 11, 2025 00:06
alexdean08 added the labels "module: vulkan" (issues related to the Vulkan delegate and code under backends/vulkan/) and "release notes: vulkan" (changes to the Vulkan backend delegate) Sep 11, 2025

pytorch-bot bot commented Sep 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14187

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 235536e with merge base 44972ad:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Sep 11, 2025
@facebook-github-bot
Contributor

@SS-JIA has imported this pull request. If you are a Meta employee, you can view this in D82254889.

Contributor

@SS-JIA SS-JIA left a comment

Overall LGTM. Just some questions about things tried and some small requests for stylistic changes. Thanks for working on this!

outputTexel[3] += dot(inputVec, vec4(weight1OutputChannelPacked[3], weight2OutputChannelPacked[3], weight3OutputChannelPacked[3], weight4OutputChannelPacked[3]));
}

imageStore(t_out, ivec3(xIdx, yIdx, gid1), op(outputTexel, out_min, out_max));
Contributor

iiuc, previously the shader calculated 4 output texels but the new one only calculates one. Have you experimented with computing a bigger output tile? Might be something that can get us a further boost. Note that I am also ok with landing the shader in its current form though, given the perf improvement.

Collaborator Author

I tried a bigger output tile, and on my end it's slightly slower for the MobileNet case. I believe this is due to the extra vector registers needed when we increase the tile size.
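
As a rough back-of-the-envelope model of that register pressure (an assumption, not a measurement): if a thread computes `tile_texels` output texels that share the same four weight texels per inner-loop step, it keeps one accumulator and one input texel alive per output texel, plus the weights. The helper name and the counting model below are hypothetical; actual register allocation is decided by the driver's compiler.

```python
def live_vec4_estimate(tile_texels, weight_texels_per_step=4):
    """Estimate how many vec4 values are live in the inner loop.

    Assumed model (not taken from the PR): one accumulator and one
    input texel per output texel, plus the weight texels shared by
    the whole tile.
    """
    accumulators = tile_texels
    inputs = tile_texels
    return accumulators + inputs + weight_texels_per_step


if __name__ == "__main__":
    for tile in (1, 2, 4):
        print(tile, live_vec4_estimate(tile))
```

Under this model the live-value count grows linearly with the tile size, which would fit the observation that a larger tile did not help here.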

weight4OutputChannelPacked = texelFetch(t_kernel, ivec2(inputC * 4 + 3, gid1), 0);

const vec4 bias = texelFetch(t_bias, ivec2(out_pos_z, 0), 0);
outputTexel[0] += dot(inputVec, vec4(weight1OutputChannelPacked[0], weight2OutputChannelPacked[0], weight3OutputChannelPacked[0], weight4OutputChannelPacked[0]));
Contributor

In my experience, computing matmul-like operations with dot was not as fast as with fma. Did you try computing with fma as well? Curious to know if you have had a different experience. Btw, you can take a look at the infographic in the old shader's comments to see how fma can be used instead of dot.

Collaborator Author

Just tried it on my end. For me, using fma instead of dot gives essentially the same performance. There might be some compiler magic happening that turns both into the same operations.
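
The two formulations are mathematically the same sum, just grouped differently, which is at least consistent with seeing no difference in practice. A small Python model of one 4x4 block, assuming `w[k][j]` holds the weight for input channel k and output channel j (hypothetical names; the "fma" variant uses a plain multiply-add here, not a true fused operation):

```python
def accumulate_dot(out, in_vec, w):
    """dot style: each output lane j is one 4-wide dot product
    between the input texel and the j-th lane of the four weights."""
    return [out[j] + sum(in_vec[k] * w[k][j] for k in range(4))
            for j in range(4)]


def accumulate_fma(out, in_vec, w):
    """fma style: broadcast one input channel at a time and
    multiply-add it against a whole weight texel."""
    acc = list(out)
    for k in range(4):
        acc = [in_vec[k] * w[k][j] + acc[j] for j in range(4)]
    return acc


if __name__ == "__main__":
    bias = [0.5, 0.5, 0.5, 0.5]
    in_vec = [1.0, 2.0, 3.0, 4.0]
    w = [[1.0, 0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [2.0, 2.0, 2.0, 2.0]]
    print(accumulate_dot(bias, in_vec, w))
    print(accumulate_fma(bias, in_vec, w))
```

This only shows that the arithmetic agrees; whether a given driver lowers both shader variants to the same instructions is not something the sketch can confirm.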

@SS-JIA SS-JIA merged commit b265324 into pytorch:main Oct 1, 2025
129 of 130 checks passed
@SS-JIA
Contributor

SS-JIA commented Oct 1, 2025

@pytorchbot cherry-pick --onto release/1.0 -c fixnewfeature

pytorchbot pushed a commit that referenced this pull request Oct 1, 2025
This change improves the performance of the pointwise conv2d s1p0 shader.
It does so through a more GEMM-like implementation and more explicit
loop unrolling.

cc @SS-JIA @manuelcandales @cbilgin

(cherry picked from commit b265324)
@pytorchbot
Collaborator

Cherry picking #14187

The cherry pick PR is at #14724 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job
