[ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing#18300

Merged
SS-JIA merged 2 commits into gh/SS-JIA/491/orig from gh/SS-JIA/492/orig on Mar 18, 2026
Conversation

@pytorchbot (Collaborator) commented:

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #18292 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/492/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/492/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/491/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/492/orig
Differential Revision: D96756792
@diff-train-skip-merge

Commit 1: [ET-VK][conv2d] Re-implement pointwise conv2d with tiled compute and blocked weight packing

Pull Request resolved: #18292

Profiling EdgeTAM on Adreno shows pointwise 1×1 convolutions are a dominant
bottleneck. This diff re-implements the stride=1, padding=0 pointwise path
using the same tiled matmul approach as the recently landed linear shader
rewrite.
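The core observation behind the tiled-matmul approach: a 1×1 convolution with stride=1 and padding=0 is exactly a matrix multiply between the flattened-spatial input (H·W rows, C_in columns) and the (C_out, C_in) weight. A minimal NumPy sketch of this equivalence (function and variable names are illustrative, not the shader's API):

```python
import numpy as np

def pointwise_conv_as_matmul(x, w):
    """1x1 conv, stride=1, padding=0, expressed as a single matmul.

    x: input activations, shape (C_in, H, W)
    w: pointwise weights, shape (C_out, C_in)
    returns: output activations, shape (C_out, H, W)
    """
    c_in, h, wd = x.shape
    c_out = w.shape[0]
    # Flatten spatial dims: each pixel becomes one row of a (H*W, C_in) matrix.
    x_mat = x.reshape(c_in, h * wd).T
    # The whole conv is then one (H*W, C_in) x (C_in, C_out) matmul,
    # which is what the tiled shader computes block by block.
    out_mat = x_mat @ w.T
    return out_mat.T.reshape(c_out, h, wd)

# Sanity check against an explicit per-output-channel reduction.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 5))
w = rng.standard_normal((16, 8))
ref = np.einsum("oc,chw->ohw", w, x)
assert np.allclose(pointwise_conv_as_matmul(x, w), ref)
```

The tiled shader processes this matmul in fixed-size output tiles rather than all at once, which is what lets it share the linear shader's infrastructure.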

The new `conv2d_pw_tiled` shader reuses the shared linear tiled infrastructure
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight, packed
weight tile loading) with custom input/output tile load/store functions that
map flat spatial indices to channels-packed texture3d coordinates.
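The index mapping can be sketched as follows. In a channels-packed texture3d layout, each texel is a vec4 holding 4 consecutive channels, with x spanning width, y spanning height, and z spanning channel groups. This is an illustrative model of that addressing, not ET-VK's exact code:

```python
def flat_index_to_texel(flat_idx, width, channel, num_channels):
    """Map a flat spatial index plus a channel to channels-packed
    texture3d coordinates (illustrative sketch of the layout).

    Each texel packs 4 consecutive channels: x spans width, y spans
    height, z spans ceil(num_channels / 4) channel groups.
    Returns (x, y, z, component) where component selects the vec4 lane.
    """
    x = flat_idx % width
    y = flat_idx // width
    z = channel // 4          # channel group along the texture depth axis
    component = channel % 4   # lane within the vec4 texel
    return x, y, z, component

# Pixel 7 of a width-5 image, channel 6 of 8:
# x = 2, y = 1, channel group 1, lane 2.
assert flat_index_to_texel(7, 5, 6, 8) == (2, 1, 1, 2)
```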

Weight packing uses the same 4OC×4IC blocked format as linear via the
`pack_fp_linear_weight` shader. Dispatch uses DynamicDispatchNode for correct
workgroup size updates during graph resizing.
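A blocked 4OC×4IC layout tiles the weight matrix into 4×4 blocks so a shader thread can fetch one block as four contiguous vec4 loads. A NumPy sketch of this kind of packing, under the assumption that blocks are stored block-row-major (the exact `pack_fp_linear_weight` layout may differ):

```python
import numpy as np

def pack_weight_4oc_4ic(w):
    """Pack a (C_out, C_in) pointwise weight into 4OCx4IC blocks.

    Illustrative sketch: the matrix is tiled into 4x4 blocks and
    reordered so each block's 16 floats are contiguous, letting a
    shader load one block as four vec4s. Assumes dims divisible by 4.
    """
    c_out, c_in = w.shape
    assert c_out % 4 == 0 and c_in % 4 == 0
    # Split both axes into (block index, within-block offset) ...
    blocks = w.reshape(c_out // 4, 4, c_in // 4, 4)
    # ... then move the block indices to the front: block-major storage,
    # shape (c_out//4, c_in//4, 4, 4).
    return blocks.transpose(0, 2, 1, 3).copy()

w = np.arange(64, dtype=np.float32).reshape(8, 8)
packed = pack_weight_4oc_4ic(w)
# Block (1, 0), local (oc=2, ic=3) corresponds to w[6, 3].
assert packed[1, 0, 2, 3] == w[6, 3]
```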

Only the stride=1, padding=0 pointwise path is changed; the general conv2d_pw
shader for arbitrary stride/padding is left unchanged.

EdgeTAM first frame on Samsung S25 (Adreno 830): 208 ms → 196 ms (~6%).

Authored with Claude.
ghstack-source-id: 353941147
@exported-using-ghexport

Differential Revision: [D96756792](https://our.internmc.facebook.com/intern/diff/D96756792/)
@pytorchbot pytorchbot requested a review from SS-JIA as a code owner March 18, 2026 18:43
pytorch-bot bot commented Mar 18, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18300


@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 18, 2026
Commit 2: …device-based tile selection

Pull Request resolved: #18293

Profiling showed depthwise conv2d is 5–15× slower on Mali GPUs than on Adreno due to
register pressure from the 4×2 output tile (17 vec4 registers per thread).
Benchmarking confirmed that reducing the tile to 1×1 (7 vec4 registers) gives a
4–15× speedup on Mali with no regression on Adreno.

This change extracts depthwise conv2d dispatch logic from Convolution.cpp into a
new Conv2dDW.cpp (following the Conv2dPW.cpp pattern), and adds device-based
tile size selection: b1x1 on Mali, b4x2 (current default) on Adreno.
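The selection policy described above amounts to branching on the GPU vendor at dispatch time. A minimal sketch (device-name matching is an assumption for illustration; the real code presumably queries Vulkan device properties):

```python
def select_dw_tile(device_name):
    """Pick the depthwise conv2d output tile for a GPU, per the policy
    above: 1x1 on Mali to relieve register pressure, 4x2 elsewhere.

    Illustrative sketch only; substring matching on the device name
    stands in for a proper Vulkan vendor/device-ID query.
    """
    if "Mali" in device_name:
        return (1, 1)   # ~7 vec4 registers per thread
    return (4, 2)       # ~17 vec4 registers per thread (current default)

assert select_dw_tile("Mali-G715") == (1, 1)
assert select_dw_tile("Adreno 830") == (4, 2)
```

Keeping the 4×2 tile as the fallback preserves the existing behavior on Adreno while only Mali takes the low-register path.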
ghstack-source-id: 353940602
@exported-using-ghexport

Differential Revision: [D97058158](https://our.internmc.facebook.com/intern/diff/D97058158/)
@SS-JIA SS-JIA merged commit 69b3f8f into gh/SS-JIA/491/orig Mar 18, 2026
24 of 25 checks passed
@SS-JIA SS-JIA deleted the gh/SS-JIA/492/orig branch March 18, 2026 19:35