[ET-VK][q8_ops] Add int8x4_buffer_to_nchw shader and refactor Int8x4Staging #17708
Merged
SS-JIA merged 1 commit into gh/SS-JIA/450/base on Feb 25, 2026
Conversation
Renames Q8taStaging.cpp/h to Int8x4Staging.cpp/h and expands it to cover the full staging lifecycle for kInt8x4 buffer tensors.

**Rename and split of the old prepack function:** The old `add_staging_to_int8x4_buffer_node` (which used a static dispatch node for prepacking TensorRef data into a packed int8x4 buffer) is renamed to `add_prepack_int8x4_buffer_node` to clarify its role. Two new runtime staging functions are added alongside it:

- `add_staging_to_int8x4_buffer_node`: reads NCHW data from a staging buffer into a kInt8x4 buffer tensor at execute time, using a `DynamicDispatchNode` wrapping the existing `nchw_to_int8x4_buffer` shader.
- `add_int8x4_buffer_to_staging_node`: writes packed int8x4 data back from a kInt8x4 buffer tensor to a contiguous NCHW staging buffer at execute time, using a new `int8x4_buffer_to_nchw` shader.

**New shader (int8x4_buffer_to_nchw.glsl):** Implements the reverse of `nchw_to_int8x4_buffer`. One thread handles one output int32 in the NCHW staging buffer: it decodes 4 NCHW-ordered element indices, looks up each element's position in the packed int8x4 buffer via `tensor4d_idx_to_buf_idx`, extracts the packed byte, and assembles the 4 bytes into a single output int32. Works for any GPUMemoryLayout.

**Staging.cpp dispatch:** `add_staging_to_tensor_node` and `add_tensor_to_staging_node` now both dispatch to the int8x4-specific functions when the tensor dtype is kInt8x4. `prepack_op` is updated to call `add_prepack_int8x4_buffer_node`.

**TestQ8taBinary.cpp** is updated to include Int8x4Staging.h and call `add_prepack_int8x4_buffer_node`.

Differential Revision: [D94364640](https://our.internmc.facebook.com/intern/diff/D94364640/)

[ghstack-poisoned]
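The per-thread logic of `int8x4_buffer_to_nchw` can be modeled on the host for intuition. This is a sketch, not the shader or the ExecuTorch API: `pack_int8x4`, `unpack_to_nchw`, and `buf_idx_of` are hypothetical names, `buf_idx_of` stands in for `tensor4d_idx_to_buf_idx` with the simplest (identity) layout, and little-endian byte order within each 32-bit word is an assumption.

```cpp
// Host-side model of the int8x4_buffer_to_nchw unpacking step. All names
// here are hypothetical stand-ins; the real shader uses
// tensor4d_idx_to_buf_idx and supports arbitrary GPUMemoryLayouts.
#include <cassert>
#include <cstdint>
#include <vector>

// Maps an NCHW element index to its position in the packed int8x4 buffer.
// Identity models the simplest layout; real layouts permute indices.
static std::size_t buf_idx_of(std::size_t nchw_idx) { return nchw_idx; }

// Pack int8 values four per 32-bit word (int32 on the GPU side),
// little-endian byte order within each word (an assumption of this sketch).
std::vector<uint32_t> pack_int8x4(const std::vector<int8_t>& elems) {
  std::vector<uint32_t> packed((elems.size() + 3) / 4, 0);
  for (std::size_t i = 0; i < elems.size(); ++i) {
    const std::size_t b = buf_idx_of(i);
    packed[b / 4] |= uint32_t(uint8_t(elems[i])) << (8 * (b % 4));
  }
  return packed;
}

// One "thread" per output word: decode 4 consecutive NCHW element indices,
// fetch each element's byte from the packed buffer, and assemble the 4
// bytes into one output word, mirroring what the shader does per invocation.
std::vector<uint32_t> unpack_to_nchw(const std::vector<uint32_t>& packed,
                                     std::size_t numel) {
  std::vector<uint32_t> out((numel + 3) / 4, 0);
  for (std::size_t t = 0; t < out.size(); ++t) {  // t = thread index
    uint32_t word = 0;
    for (std::size_t j = 0; j < 4; ++j) {
      const std::size_t nchw_idx = 4 * t + j;
      if (nchw_idx >= numel) break;  // tail handling when numel % 4 != 0
      const std::size_t b = buf_idx_of(nchw_idx);
      const uint32_t byte = (packed[b / 4] >> (8 * (b % 4))) & 0xFFu;
      word |= byte << (8 * j);
    }
    out[t] = word;
  }
  return out;
}
```

With the identity layout, packing followed by unpacking reproduces the packed words exactly, which is a convenient round-trip sanity check for the byte extraction and assembly.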
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17708
Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure as of commit 1c319ca with merge base 63f9724.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This was referenced Feb 25, 2026
manuelcandales approved these changes on Feb 25, 2026
SS-JIA pushed a commit that referenced this pull request on Feb 25, 2026
ghstack-source-id: 344667754
Pull Request resolved: #17708
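The Staging.cpp dtype dispatch described in the PR can be illustrated with a minimal stand-alone sketch. Everything below is hypothetical: `DType` and the string-returning stubs merely record which path was selected, standing in for the real ComputeGraph node-adding functions.

```cpp
// Minimal sketch of the dtype-based dispatch added to Staging.cpp. DType
// and the stubs are hypothetical stand-ins for the real ExecuTorch Vulkan
// API; each stub returns a tag naming the path that was taken.
#include <cassert>
#include <string>

enum class DType { kFloat, kHalf, kInt8x4 };

// Stubs for the int8x4-specific and generic staging node builders.
std::string add_staging_to_int8x4_buffer_node() { return "int8x4"; }
std::string add_generic_staging_node() { return "generic"; }

// add_staging_to_tensor_node (and, symmetrically, add_tensor_to_staging_node)
// now routes kInt8x4 buffer tensors to the int8x4-specific function and
// everything else to the generic staging path.
std::string add_staging_to_tensor_node(DType dtype) {
  return dtype == DType::kInt8x4 ? add_staging_to_int8x4_buffer_node()
                                 : add_generic_staging_node();
}
```

The same branch shape applies to the tensor-to-staging direction, with `add_int8x4_buffer_to_staging_node` as the int8x4 target.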