[ET-VK][ez] Fix global workgroup size underflow for sub-4D tensors in block config dispatch#17707
Merged
Conversation
… block config dispatch

`pick_linear_global_wg_with_block_config` used `sizes[ndim - 1 - inner_dim]` to index tensor dimensions, which underflows when block config dimensions (WHCN ordering) reference indices >= ndim. For example, a 2D tensor [1, 144] with inner_dim=2 (C) would compute sizes[2-1-2] = sizes[-1], reading 0 and producing global_wg=(0,1,1). DispatchNode::encode() skips dispatches with any zero workgroup component, so the shader never executes and the output stays all zeros.

This caused the skin segmentation model to produce bbox_iou=0 on Android, since its 2D intermediate tensors (keypoints, bbox) were never computed.

Fix by using `utils::val_at` with negative indices to safely access WHCN dimensions, consistent with `pick_extents_global_wg_with_block_config`. `val_at` returns 1 for out-of-bounds indices, correctly handling tensors with fewer than 4 dimensions.

Also adds 1D, 2D, and 3D tensor shapes to q8ta_binary test cases to prevent regression.

Differential Revision: [D94364639](https://our.internmc.facebook.com/intern/diff/D94364639/)
ghstack-source-id: 344667756
Pull Request resolved: #17707
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17707.

❌ 1 New Failure, 115 Pending, as of commit dc6a0d8 with merge base 63f9724.
added 3 commits on February 25, 2026 at 07:26:
…taging

Renames Q8taStaging.cpp/h to Int8x4Staging.cpp/h and expands it to cover the full staging lifecycle for kInt8x4 buffer tensors.

**Rename and split of the old prepack function:** The old `add_staging_to_int8x4_buffer_node` (which used a static dispatch node for prepacking TensorRef data into a packed int8x4 buffer) is renamed to `add_prepack_int8x4_buffer_node` to clarify its role. Two new runtime staging functions are added alongside it:

- `add_staging_to_int8x4_buffer_node`: reads NCHW data from a staging buffer into a kInt8x4 buffer tensor at execute time, using a `DynamicDispatchNode` wrapping the existing `nchw_to_int8x4_buffer` shader.
- `add_int8x4_buffer_to_staging_node`: writes packed int8x4 data back from a kInt8x4 buffer tensor to a contiguous NCHW staging buffer at execute time, using a new `int8x4_buffer_to_nchw` shader.

**New shader (int8x4_buffer_to_nchw.glsl):** Implements the reverse of `nchw_to_int8x4_buffer`. One thread per output int32 in the NCHW staging buffer. For each thread it decodes 4 NCHW-ordered element indices, looks up each element's position in the packed int8x4 buffer via `tensor4d_idx_to_buf_idx`, extracts the packed byte, and assembles 4 bytes into a single output int32. Works for any GPUMemoryLayout.

**Staging.cpp dispatch:** `add_staging_to_tensor_node` and `add_tensor_to_staging_node` now both dispatch to the int8x4-specific functions when the tensor dtype is kInt8x4. `prepack_op` is updated to call `add_prepack_int8x4_buffer_node`.

**TestQ8taBinary.cpp** is updated to include Int8x4Staging.h and call `add_prepack_int8x4_buffer_node`.

Differential Revision: [D94364640](https://our.internmc.facebook.com/intern/diff/D94364640/)
ghstack-source-id: 344667754
Pull Request resolved: #17708
Implement several missing Vulkan operators needed to reduce graph fragmentation in the skin segmentation and EdgeTAM models.

**Skin segmentation ops:**

- aten.where.self: already had C++ and GLSL implementations but was missing the Python partitioner registration.
- aten.bitwise_and.Tensor: added as a new binary_op shader variant operating on uint8 (bool) tensors.

**EdgeTAM partitioning fixes:**

- Comparison ops (eq, lt, le, gt, ge): were registered under the generic BinaryOp features, which inherited FP_INT_T as the output dtype set. The partitioner correctly rejected these because their outputs are bool tensors. Split them into a dedicated register_comparison_ops registration with outputs_dtypes=BOOL_T. The binary_op.glsl shader already handles bool output via the IS_COMPARISON_OP path (uint8 storage), so no shader changes are needed.
- aten.copy.default: not in the op registry, causing a subgraph break in the first-frame model. This op appears when valid_num_points.to() is called with a matching dtype (a no-op cast). Add it to RemoveRedundantOpsTransform so it is eliminated before the partitioner runs. Also register it as an ephemeral op as a fallback. The removal logic requires a _src_arg1_ops set to handle the copy.default(self, src) argument order, where the replacement target is args[1] (src) rather than args[0] (self) as in all other redundant ops.

Differential Revision: [D94364641](https://our.internmc.facebook.com/intern/diff/D94364641/)
ghstack-source-id: 344667759
Pull Request resolved: #17709
Implements `aten.index.Tensor` for the Vulkan backend, supporting 1D self tensors with exactly one non-None index tensor. Includes buffer and texture GLSL shaders, C++ operator registration, and correctness tests.

Also extends the op test code generators to handle `c10::List<::std::optional<at::Tensor>>` (`Tensor?[]` in ATen), which is the C++ type for the `indices` argument of `aten.index.Tensor`.

Differential Revision: [D94364638](https://our.internmc.facebook.com/intern/diff/D94364638/)
ghstack-source-id: 344667758
Pull Request resolved: #17710
manuelcandales approved these changes on Feb 25, 2026.
Force-pushed 588f2a6 to dc6a0d8.
SS-JIA pushed a commit that referenced this pull request on Feb 25, 2026.