[ez][ET-VK][q8ta_conv2d_pw] Halve accumulator to lift Adreno occupancy #19396
meta-codesync[bot] merged 2 commits into gh/SS-JIA/527/base
Conversation
The pointwise quantized conv shader allocated `ivec4 out_accum[4][2]`, i.e. 32 int32 accumulators per thread, which on Adreno 740 pinned 28 full-precision registers per thread and capped ALU fiber occupancy at 37%. AOC reported 26.7% exposed long-latency stalls, evidence that occupancy was too low to hide texture and SSBO latency.

Halve the accumulator to 16 ints by reducing `TILE_N4` from 2 to 1: each thread now covers 4 widths × 4 output channels, a single 4×4 output block. The compensating dispatch change is in `pick_q8ta_conv2d_pw_global_wg_size`: `global_wg.x` doubles, since each thread now covers half as many output-channel blocks as before.

Each thread still loads 1 input ivec4 (4 widths) per K-iteration, preserving the natural int8x4 packing alignment, so arithmetic intensity drops only 25% (2.67 → 2.0 MAC/B), in contrast to the variant where `TILE_M` is halved, which drops AI by 50%.

Differential Revision: [D103770023](https://our.internmc.facebook.com/intern/diff/D103770023/)
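To make the dispatch compensation concrete, here is a minimal, self-contained C++ sketch of the work-group sizing math. It is not the actual ExecuTorch code: `WgSize`, `div_up`, `pick_global_wg_size`, the x/y/z axis mapping, and the example tensor shape are illustrative stand-ins for the real `pick_q8ta_conv2d_pw_global_wg_size` and its types.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative stand-ins only; the real shader and dispatch code live in the
// ExecuTorch Vulkan backend and use their own types and signatures.
struct WgSize {
  uint32_t x, y, z;
};

constexpr uint32_t div_up(uint32_t n, uint32_t d) { return (n + d - 1) / d; }

// Per-thread output tile. Before this change: TILE_M = 4, TILE_N4 = 2, i.e.
// ivec4 out_accum[4][2] = 32 int32 accumulators. After: TILE_N4 = 1, i.e.
// ivec4 out_accum[4][1] = 16 int32 accumulators (a single 4x4 output block).
constexpr uint32_t TILE_M = 4;   // output widths per thread
constexpr uint32_t TILE_N4 = 1;  // groups of 4 output channels per thread

// Hypothetical analogue of pick_q8ta_conv2d_pw_global_wg_size, assuming x
// walks output-channel blocks, y walks width tiles, z walks height x batch.
WgSize pick_global_wg_size(uint32_t OC, uint32_t W, uint32_t H, uint32_t N) {
  WgSize wg;
  wg.x = div_up(OC, 4 * TILE_N4);  // doubles when TILE_N4 drops from 2 to 1
  wg.y = div_up(W, TILE_M);        // unchanged: still 4 widths per thread
  wg.z = H * N;
  return wg;
}

int main() {
  // Example: 64 output channels, 56x56 spatial output, batch 1.
  const WgSize wg = pick_global_wg_size(/*OC=*/64, /*W=*/56, /*H=*/56, /*N=*/1);
  std::cout << wg.x << " x " << wg.y << " x " << wg.z << "\n";  // 16 x 14 x 56
  return 0;
}
```

One per-K-iteration cost model that reproduces the quoted figures (an interpretation, assuming int8 operands and one group of 4 input channels per iteration): the old 4-wide × 8-channel tile does 4 × 8 × 4 = 128 MACs against 16 B of input plus 32 B of weights, i.e. 128 / 48 ≈ 2.67 MAC/B; the new 4×4 tile does 64 MACs against 16 B + 16 B = 32 B, i.e. 2.0 MAC/B. Halving TILE_M instead would still force a full 16 B int8x4 input load per iteration, giving 64 / 48 ≈ 1.33 MAC/B, the 50% drop mentioned above.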
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19396
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures: as of commit 635b28a with merge base c564936.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…no occupancy" The pointwise quantized conv shader allocated ivec4 out_accum[4][2] = 32 int32 accumulators per thread, which on Adreno 740 pinned 28 full-precision registers per thread and capped ALU fiber occupancy at 37%. AOC reported 26.7% exposed long-latency stalls, evidence that occupancy was too low to hide texture and SSBO latency. Halve the accumulator to 16 ints by reducing TILE_N4 from 2 to 1 (each thread now covers 4 widths × 4 output channels = a single 4×4 output block). The compensating dispatch change is in pick_q8ta_conv2d_pw_global_wg_size: global_wg.x doubles since each thread covers half as many output channel blocks as before. Each thread still loads 1 input ivec4 (4 widths) per K-iter, preserving the natural int8x4 packing alignment, so arithmetic intensity drops only 25% (2.67 → 2.0 MAC/B, in contrast to the variant where TILE_M is halved which drops AI by 50%). Differential Revision: [D103770023](https://our.internmc.facebook.com/intern/diff/D103770023/) [ghstack-poisoned]
Pull Request resolved: #19396
ghstack-source-id: 379519735
@exported-using-ghexport
Differential Revision: [D103770023](https://our.internmc.facebook.com/intern/diff/D103770023/)
Stack from ghstack (oldest at bottom):
Differential Revision: D103770023