[ET-VK][ez] Fix global workgroup size underflow for sub-4D tensors in block config dispatch #17707

Merged
SS-JIA merged 4 commits into main from gh/SS-JIA/449/head
Feb 25, 2026

Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Feb 25, 2026

Stack from ghstack (oldest at bottom):

pick_linear_global_wg_with_block_config used sizes[ndim - 1 - inner_dim]
to index tensor dimensions, which underflows when block config dimensions
(WHCN ordering) reference indices >= ndim. For example, a 2D tensor [1, 144]
with inner_dim=2 (C) would compute sizes[2-1-2] = sizes[-1], reading 0 and
producing global_wg=(0,1,1). DispatchNode::encode() skips dispatches with any
zero workgroup component, so the shader never executes and the output stays
all zeros.

This caused the skin segmentation model to produce bbox_iou=0 on Android,
since its 2D intermediate tensors (keypoints, bbox) were never computed.

Fix by using utils::val_at with negative indices to safely access WHCN
dimensions, consistent with pick_extents_global_wg_with_block_config.
val_at returns 1 for out-of-bounds indices, correctly handling tensors with
fewer than 4 dimensions.

Also adds 1D, 2D, and 3D tensor shapes to q8ta_binary test cases to prevent
regression.

Differential Revision: D94364639

@pytorch-bot

pytorch-bot Bot commented Feb 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17707

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 115 Pending

As of commit dc6a0d8 with merge base 63f9724:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Feb 25, 2026
ssjia added 3 commits February 25, 2026 07:26
…taging

Renames Q8taStaging.cpp/h to Int8x4Staging.cpp/h and expands it to cover
the full staging lifecycle for kInt8x4 buffer tensors.

**Rename and split of the old prepack function:**
The old `add_staging_to_int8x4_buffer_node` (which used a static dispatch
node for prepacking TensorRef data into a packed int8x4 buffer) is renamed
to `add_prepack_int8x4_buffer_node` to clarify its role. Two new runtime
staging functions are added alongside it:

- `add_staging_to_int8x4_buffer_node`: reads NCHW data from a staging buffer
  into a kInt8x4 buffer tensor at execute time, using a `DynamicDispatchNode`
  wrapping the existing `nchw_to_int8x4_buffer` shader.
- `add_int8x4_buffer_to_staging_node`: writes packed int8x4 data back from a
  kInt8x4 buffer tensor to a contiguous NCHW staging buffer at execute time,
  using a new `int8x4_buffer_to_nchw` shader.

**New shader (int8x4_buffer_to_nchw.glsl):**
Implements the reverse of `nchw_to_int8x4_buffer`. One thread per output
int32 in the NCHW staging buffer. For each thread it decodes 4 NCHW-ordered
element indices, looks up each element's position in the packed int8x4 buffer
via `tensor4d_idx_to_buf_idx`, extracts the packed byte, and assembles 4
bytes into a single output int32. Works for any GPUMemoryLayout.

**Staging.cpp dispatch:**
`add_staging_to_tensor_node` and `add_tensor_to_staging_node` now both
dispatch to the int8x4-specific functions when the tensor dtype is kInt8x4.
`prepack_op` is updated to call `add_prepack_int8x4_buffer_node`.

**TestQ8taBinary.cpp** is updated to include Int8x4Staging.h and call
`add_prepack_int8x4_buffer_node`.

Differential Revision: [D94364640](https://our.internmc.facebook.com/intern/diff/D94364640/)

ghstack-source-id: 344667754
Pull Request resolved: #17708
Implement several missing Vulkan operators needed to reduce graph
fragmentation in the skin segmentation and EdgeTAM models.

**Skin segmentation ops:**

- aten.where.self: already had C++ and GLSL implementations but was
  missing the Python partitioner registration.
- aten.bitwise_and.Tensor: added as a new binary_op shader variant
  operating on uint8 (bool) tensors.

**EdgeTAM partitioning fixes:**

- Comparison ops (eq, lt, le, gt, ge): were registered under the
  generic BinaryOp features which inherited FP_INT_T as the output
  dtype set. The partitioner correctly rejected these because their
  outputs are bool tensors. Split them into a dedicated
  register_comparison_ops registration with outputs_dtypes=BOOL_T. The
  binary_op.glsl shader already handles bool output via the
  IS_COMPARISON_OP path (uint8 storage), so no shader changes are
  needed.
- aten.copy.default: not in the op registry, causing a subgraph break
  in the first-frame model. This op appears when valid_num_points.to()
  is called with matching dtype (a no-op cast). Add it to
  RemoveRedundantOpsTransform so it is eliminated before the partitioner
  runs. Also register it as an ephemeral op as a fallback. The removal
  logic requires a _src_arg1_ops set to handle the copy.default(self,
  src) argument order, where the replacement target is args[1] (src)
  rather than args[0] (self) as in all other redundant ops.

Differential Revision: [D94364641](https://our.internmc.facebook.com/intern/diff/D94364641/)

ghstack-source-id: 344667759
Pull Request resolved: #17709
Implements `aten.index.Tensor` for the Vulkan backend, supporting 1D
self tensors with exactly one non-None index tensor. Includes buffer
and texture GLSL shaders, C++ operator registration, and correctness
tests.

Also extends the op test code generators to handle
`c10::List<::std::optional<at::Tensor>>` (`Tensor?[]` in ATen), which
is the C++ type for the `indices` argument of `aten.index.Tensor`.

Differential Revision: [D94364638](https://our.internmc.facebook.com/intern/diff/D94364638/)

ghstack-source-id: 344667758
Pull Request resolved: #17710
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@SS-JIA SS-JIA changed the base branch from gh/SS-JIA/449/base to main February 25, 2026 19:12
@SS-JIA SS-JIA force-pushed the gh/SS-JIA/449/head branch from 588f2a6 to dc6a0d8 Compare February 25, 2026 19:16
@SS-JIA SS-JIA merged commit 96351b8 into main Feb 25, 2026
167 of 170 checks passed
SS-JIA pushed a commit that referenced this pull request Feb 25, 2026
@SS-JIA SS-JIA deleted the gh/SS-JIA/449/head branch February 25, 2026 19:32
