[ET-VK][q8_ops] Add int8x4_buffer_to_nchw shader and refactor Int8x4Staging #17708
Merged
SS-JIA merged 1 commit into gh/SS-JIA/450/base on Feb 25, 2026
Conversation
Renames Q8taStaging.cpp/h to Int8x4Staging.cpp/h and expands it to cover the full staging lifecycle for kInt8x4 buffer tensors.

**Rename and split of the old prepack function:** The old `add_staging_to_int8x4_buffer_node` (which used a static dispatch node for prepacking TensorRef data into a packed int8x4 buffer) is renamed to `add_prepack_int8x4_buffer_node` to clarify its role. Two new runtime staging functions are added alongside it:

- `add_staging_to_int8x4_buffer_node`: reads NCHW data from a staging buffer into a kInt8x4 buffer tensor at execute time, using a `DynamicDispatchNode` wrapping the existing `nchw_to_int8x4_buffer` shader.
- `add_int8x4_buffer_to_staging_node`: writes packed int8x4 data back from a kInt8x4 buffer tensor to a contiguous NCHW staging buffer at execute time, using a new `int8x4_buffer_to_nchw` shader.

**New shader (int8x4_buffer_to_nchw.glsl):** Implements the reverse of `nchw_to_int8x4_buffer`. One thread handles one output int32 in the NCHW staging buffer: it decodes 4 NCHW-ordered element indices, looks up each element's position in the packed int8x4 buffer via `tensor4d_idx_to_buf_idx`, extracts the packed byte, and assembles the 4 bytes into a single output int32. Works for any GPUMemoryLayout.

**Staging.cpp dispatch:** `add_staging_to_tensor_node` and `add_tensor_to_staging_node` now both dispatch to the int8x4-specific functions when the tensor dtype is kInt8x4. `prepack_op` is updated to call `add_prepack_int8x4_buffer_node`.

**TestQ8taBinary.cpp** is updated to include Int8x4Staging.h and call `add_prepack_int8x4_buffer_node`.

Differential Revision: [D94364640](https://our.internmc.facebook.com/intern/diff/D94364640/)

[ghstack-poisoned]
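The per-thread logic of `int8x4_buffer_to_nchw` can be modeled on the host for intuition. This is a sketch, not the shader or the ExecuTorch API: `pack_int8x4`, `unpack_to_nchw`, and `buf_idx_of` are hypothetical names, `buf_idx_of` stands in for `tensor4d_idx_to_buf_idx` with the simplest (identity) layout, and little-endian byte order within each 32-bit word is an assumption.

```cpp
// Host-side model of the int8x4_buffer_to_nchw unpacking step. All names
// here are hypothetical stand-ins; the real shader uses
// tensor4d_idx_to_buf_idx and supports arbitrary GPUMemoryLayouts.
#include <cassert>
#include <cstdint>
#include <vector>

// Maps an NCHW element index to its position in the packed int8x4 buffer.
// Identity models the simplest layout; real layouts permute indices.
static std::size_t buf_idx_of(std::size_t nchw_idx) { return nchw_idx; }

// Pack int8 values four per 32-bit word (int32 on the GPU side),
// little-endian byte order within each word (an assumption of this sketch).
std::vector<uint32_t> pack_int8x4(const std::vector<int8_t>& elems) {
  std::vector<uint32_t> packed((elems.size() + 3) / 4, 0);
  for (std::size_t i = 0; i < elems.size(); ++i) {
    const std::size_t b = buf_idx_of(i);
    packed[b / 4] |= uint32_t(uint8_t(elems[i])) << (8 * (b % 4));
  }
  return packed;
}

// One "thread" per output word: decode 4 consecutive NCHW element indices,
// fetch each element's byte from the packed buffer, and assemble the 4
// bytes into one output word, mirroring what the shader does per invocation.
std::vector<uint32_t> unpack_to_nchw(const std::vector<uint32_t>& packed,
                                     std::size_t numel) {
  std::vector<uint32_t> out((numel + 3) / 4, 0);
  for (std::size_t t = 0; t < out.size(); ++t) {  // t = thread index
    uint32_t word = 0;
    for (std::size_t j = 0; j < 4; ++j) {
      const std::size_t nchw_idx = 4 * t + j;
      if (nchw_idx >= numel) break;  // tail handling when numel % 4 != 0
      const std::size_t b = buf_idx_of(nchw_idx);
      const uint32_t byte = (packed[b / 4] >> (8 * (b % 4))) & 0xFFu;
      word |= byte << (8 * j);
    }
    out[t] = word;
  }
  return out;
}
```

With the identity layout, packing followed by unpacking reproduces the packed words exactly, which is a convenient round-trip sanity check for the byte extraction and assembly.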
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17708
Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure as of commit 1c319ca with merge base 63f9724.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This was referenced Feb 25, 2026
manuelcandales approved these changes on Feb 25, 2026
SS-JIA pushed a commit that referenced this pull request on Feb 25, 2026
ghstack-source-id: 344667754
Pull Request resolved: #17708
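The Staging.cpp dtype dispatch described in the PR can be illustrated with a minimal stand-alone sketch. Everything below is hypothetical: `DType` and the string-returning stubs merely record which path was selected, standing in for the real ComputeGraph node-adding functions.

```cpp
// Minimal sketch of the dtype-based dispatch added to Staging.cpp. DType
// and the stubs are hypothetical stand-ins for the real ExecuTorch Vulkan
// API; each stub returns a tag naming the path that was taken.
#include <cassert>
#include <string>

enum class DType { kFloat, kHalf, kInt8x4 };

// Stubs for the int8x4-specific and generic staging node builders.
std::string add_staging_to_int8x4_buffer_node() { return "int8x4"; }
std::string add_generic_staging_node() { return "generic"; }

// add_staging_to_tensor_node (and, symmetrically, add_tensor_to_staging_node)
// now routes kInt8x4 buffer tensors to the int8x4-specific function and
// everything else to the generic staging path.
std::string add_staging_to_tensor_node(DType dtype) {
  return dtype == DType::kInt8x4 ? add_staging_to_int8x4_buffer_node()
                                 : add_generic_staging_node();
}
```

The same branch shape applies to the tensor-to-staging direction, with `add_int8x4_buffer_to_staging_node` as the int8x4 target.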