[ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing#18300

Merged
SS-JIA merged 2 commits into gh/SS-JIA/491/orig from gh/SS-JIA/492/orig on Mar 18, 2026
Conversation

@pytorchbot (Collaborator) commented:

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #18292 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/492/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/492/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/491/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/492/orig
Differential Revision: D96756792
@diff-train-skip-merge

Commit 1: [ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing

Pull Request resolved: #18292

Profiling EdgeTAM on Adreno shows pointwise 1×1 convolutions are a dominant
bottleneck. This diff re-implements the stride=1, padding=0 pointwise path
using the same tiled matmul approach as the recently landed linear shader
rewrite.
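The core observation behind the tiled-matmul approach: a 1×1 convolution with stride=1 and padding=0 is exactly a matrix multiply between the flattened-spatial input (H·W rows, C_in columns) and the (C_out, C_in) weight. A minimal NumPy sketch of this equivalence (function and variable names are illustrative, not the shader's API):

```python
import numpy as np

def pointwise_conv_as_matmul(x, w):
    """1x1 conv, stride=1, padding=0, expressed as a single matmul.

    x: input activations, shape (C_in, H, W)
    w: pointwise weights, shape (C_out, C_in)
    returns: output activations, shape (C_out, H, W)
    """
    c_in, h, wd = x.shape
    c_out = w.shape[0]
    # Flatten spatial dims: each pixel becomes one row of a (H*W, C_in) matrix.
    x_mat = x.reshape(c_in, h * wd).T
    # The whole conv is then one (H*W, C_in) x (C_in, C_out) matmul,
    # which is what the tiled shader computes block by block.
    out_mat = x_mat @ w.T
    return out_mat.T.reshape(c_out, h, wd)

# Sanity check against an explicit per-output-channel reduction.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 5))
w = rng.standard_normal((16, 8))
ref = np.einsum("oc,chw->ohw", w, x)
assert np.allclose(pointwise_conv_as_matmul(x, w), ref)
```

The tiled shader processes this matmul in fixed-size output tiles rather than all at once, which is what lets it share the linear shader's infrastructure.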

The new `conv2d_pw_tiled` shader reuses the shared linear tiled infrastructure
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight, packed
weight tile loading) with custom input/output tile load/store functions that
map flat spatial indices to channels-packed texture3d coordinates.
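The index mapping can be sketched as follows. In a channels-packed texture3d layout, each texel is a vec4 holding 4 consecutive channels, with x spanning width, y spanning height, and z spanning channel groups. This is an illustrative model of that addressing, not ET-VK's exact code:

```python
def flat_index_to_texel(flat_idx, width, channel, num_channels):
    """Map a flat spatial index plus a channel to channels-packed
    texture3d coordinates (illustrative sketch of the layout).

    Each texel packs 4 consecutive channels: x spans width, y spans
    height, z spans ceil(num_channels / 4) channel groups.
    Returns (x, y, z, component) where component selects the vec4 lane.
    """
    x = flat_idx % width
    y = flat_idx // width
    z = channel // 4          # channel group along the texture depth axis
    component = channel % 4   # lane within the vec4 texel
    return x, y, z, component

# Pixel 7 of a width-5 image, channel 6 of 8:
# x = 2, y = 1, channel group 1, lane 2.
assert flat_index_to_texel(7, 5, 6, 8) == (2, 1, 1, 2)
```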

Weight packing uses the same 4OC×4IC blocked format as linear via the
`pack_fp_linear_weight` shader. Dispatch uses DynamicDispatchNode for correct
workgroup size updates during graph resizing.
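A blocked 4OC×4IC layout tiles the weight matrix into 4×4 blocks so a shader thread can fetch one block as four contiguous vec4 loads. A NumPy sketch of this kind of packing, under the assumption that blocks are stored block-row-major (the exact `pack_fp_linear_weight` layout may differ):

```python
import numpy as np

def pack_weight_4oc_4ic(w):
    """Pack a (C_out, C_in) pointwise weight into 4OCx4IC blocks.

    Illustrative sketch: the matrix is tiled into 4x4 blocks and
    reordered so each block's 16 floats are contiguous, letting a
    shader load one block as four vec4s. Assumes dims divisible by 4.
    """
    c_out, c_in = w.shape
    assert c_out % 4 == 0 and c_in % 4 == 0
    # Split both axes into (block index, within-block offset) ...
    blocks = w.reshape(c_out // 4, 4, c_in // 4, 4)
    # ... then move the block indices to the front: block-major storage,
    # shape (c_out//4, c_in//4, 4, 4).
    return blocks.transpose(0, 2, 1, 3).copy()

w = np.arange(64, dtype=np.float32).reshape(8, 8)
packed = pack_weight_4oc_4ic(w)
# Block (1, 0), local (oc=2, ic=3) corresponds to w[6, 3].
assert packed[1, 0, 2, 3] == w[6, 3]
```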

Only the stride=1, padding=0 pointwise path is changed; the general conv2d_pw
shader for arbitrary stride/padding is left unchanged.

EdgeTAM first frame on Samsung S25 (Adreno 830): 208 ms → 196 ms (~6%).

Authored with Claude.
ghstack-source-id: 353941147
@exported-using-ghexport

Differential Revision: [D96756792](https://our.internmc.facebook.com/intern/diff/D96756792/)
@pytorchbot pytorchbot requested a review from SS-JIA as a code owner March 18, 2026 18:43
pytorch-bot bot commented Mar 18, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18300


@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 18, 2026
Commit 2: …device-based tile selection

Pull Request resolved: #18293

Profiling showed depthwise conv2d is 5–15× slower on Mali GPUs than on Adreno due to
register pressure from the 4×2 output tile (17 vec4 registers per thread).
Benchmarking confirmed that reducing the tile to 1×1 (7 vec4 registers) gives a
4–15× speedup on Mali with no regression on Adreno.

This change extracts depthwise conv2d dispatch logic from Convolution.cpp into a
new Conv2dDW.cpp (following the Conv2dPW.cpp pattern), and adds device-based
tile size selection: b1x1 on Mali, b4x2 (current default) on Adreno.
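The selection policy described above amounts to branching on the GPU vendor at dispatch time. A minimal sketch (device-name matching is an assumption for illustration; the real code presumably queries Vulkan device properties):

```python
def select_dw_tile(device_name):
    """Pick the depthwise conv2d output tile for a GPU, per the policy
    above: 1x1 on Mali to relieve register pressure, 4x2 elsewhere.

    Illustrative sketch only; substring matching on the device name
    stands in for a proper Vulkan vendor/device-ID query.
    """
    if "Mali" in device_name:
        return (1, 1)   # ~7 vec4 registers per thread
    return (4, 2)       # ~17 vec4 registers per thread (current default)

assert select_dw_tile("Mali-G715") == (1, 1)
assert select_dw_tile("Adreno 830") == (4, 2)
```

Keeping the 4×2 tile as the fallback preserves the existing behavior on Adreno while only Mali takes the low-register path.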
ghstack-source-id: 353940602
@exported-using-ghexport

Differential Revision: [D97058158](https://our.internmc.facebook.com/intern/diff/D97058158/)
@SS-JIA SS-JIA merged commit 69b3f8f into gh/SS-JIA/491/orig Mar 18, 2026
24 of 25 checks passed
@SS-JIA SS-JIA deleted the gh/SS-JIA/492/orig branch March 18, 2026 19:35