[ET-VK][ez] Fix global workgroup size underflow for sub-4D tensors in block config dispatch #17707

Merged
SS-JIA merged 4 commits into main from gh/SS-JIA/449/head
Feb 25, 2026

Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Feb 25, 2026

Stack from ghstack (oldest at bottom):

pick_linear_global_wg_with_block_config used sizes[ndim - 1 - inner_dim]
to index tensor dimensions, which underflows when block config dimensions
(WHCN ordering) reference indices >= ndim. For example, a 2D tensor [1, 144]
with inner_dim=2 (C) would compute sizes[2-1-2] = sizes[-1], reading 0 and
producing global_wg=(0,1,1). DispatchNode::encode() skips dispatches with any
zero workgroup component, so the shader never executes and the output stays
all zeros.

This caused the skin segmentation model to produce bbox_iou=0 on Android,
since its 2D intermediate tensors (keypoints, bbox) were never computed.

Fix by using utils::val_at with negative indices to safely access WHCN
dimensions, consistent with pick_extents_global_wg_with_block_config.
val_at returns 1 for out-of-bounds indices, correctly handling tensors with
fewer than 4 dimensions.

Also adds 1D, 2D, and 3D tensor shapes to q8ta_binary test cases to prevent
regression.

Differential Revision: D94364639

@pytorch-bot

pytorch-bot Bot commented Feb 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17707

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 115 Pending

As of commit dc6a0d8 with merge base 63f9724:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Feb 25, 2026
ssjia added 3 commits February 25, 2026 07:26
…taging

Renames Q8taStaging.cpp/h to Int8x4Staging.cpp/h and expands it to cover
the full staging lifecycle for kInt8x4 buffer tensors.

**Rename and split of the old prepack function:**
The old `add_staging_to_int8x4_buffer_node` (which used a static dispatch
node for prepacking TensorRef data into a packed int8x4 buffer) is renamed
to `add_prepack_int8x4_buffer_node` to clarify its role. Two new runtime
staging functions are added alongside it:

- `add_staging_to_int8x4_buffer_node`: reads NCHW data from a staging buffer
  into a kInt8x4 buffer tensor at execute time, using a `DynamicDispatchNode`
  wrapping the existing `nchw_to_int8x4_buffer` shader.
- `add_int8x4_buffer_to_staging_node`: writes packed int8x4 data back from a
  kInt8x4 buffer tensor to a contiguous NCHW staging buffer at execute time,
  using a new `int8x4_buffer_to_nchw` shader.

**New shader (int8x4_buffer_to_nchw.glsl):**
Implements the reverse of `nchw_to_int8x4_buffer`. One thread per output
int32 in the NCHW staging buffer. For each thread it decodes 4 NCHW-ordered
element indices, looks up each element's position in the packed int8x4 buffer
via `tensor4d_idx_to_buf_idx`, extracts the packed byte, and assembles 4
bytes into a single output int32. Works for any GPUMemoryLayout.

**Staging.cpp dispatch:**
`add_staging_to_tensor_node` and `add_tensor_to_staging_node` now both
dispatch to the int8x4-specific functions when the tensor dtype is kInt8x4.
`prepack_op` is updated to call `add_prepack_int8x4_buffer_node`.

**TestQ8taBinary.cpp** is updated to include Int8x4Staging.h and call
`add_prepack_int8x4_buffer_node`.

Differential Revision: [D94364640](https://our.internmc.facebook.com/intern/diff/D94364640/)

ghstack-source-id: 344667754
Pull Request resolved: #17708
Implement several missing Vulkan operators needed to reduce graph
fragmentation in the skin segmentation and EdgeTAM models.

**Skin segmentation ops:**

- aten.where.self: already had C++ and GLSL implementations but was
  missing the Python partitioner registration.
- aten.bitwise_and.Tensor: added as a new binary_op shader variant
  operating on uint8 (bool) tensors.

**EdgeTAM partitioning fixes:**

- Comparison ops (eq, lt, le, gt, ge): were registered under the
  generic BinaryOp features which inherited FP_INT_T as the output
  dtype set. The partitioner correctly rejected these because their
  outputs are bool tensors. Split them into a dedicated
  register_comparison_ops registration with outputs_dtypes=BOOL_T. The
  binary_op.glsl shader already handles bool output via the
  IS_COMPARISON_OP path (uint8 storage), so no shader changes are
  needed.
- aten.copy.default: not in the op registry, causing a subgraph break
  in the first-frame model. This op appears when valid_num_points.to()
  is called with matching dtype (a no-op cast). Add it to
  RemoveRedundantOpsTransform so it is eliminated before the partitioner
  runs. Also register it as an ephemeral op as a fallback. The removal
  logic requires a _src_arg1_ops set to handle the copy.default(self,
  src) argument order, where the replacement target is args[1] (src)
  rather than args[0] (self) as in all other redundant ops.

Differential Revision: [D94364641](https://our.internmc.facebook.com/intern/diff/D94364641/)

ghstack-source-id: 344667759
Pull Request resolved: #17709
Implements `aten.index.Tensor` for the Vulkan backend, supporting 1D
self tensors with exactly one non-None index tensor. Includes buffer
and texture GLSL shaders, C++ operator registration, and correctness
tests.

Also extends the op test code generators to handle
`c10::List<::std::optional<at::Tensor>>` (`Tensor?[]` in ATen), which
is the C++ type for the `indices` argument of `aten.index.Tensor`.

Differential Revision: [D94364638](https://our.internmc.facebook.com/intern/diff/D94364638/)

ghstack-source-id: 344667758
Pull Request resolved: #17710
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@SS-JIA SS-JIA changed the base branch from gh/SS-JIA/449/base to main February 25, 2026 19:12
@SS-JIA SS-JIA force-pushed the gh/SS-JIA/449/head branch from 588f2a6 to dc6a0d8 Compare February 25, 2026 19:16
@SS-JIA SS-JIA merged commit 96351b8 into main Feb 25, 2026
167 of 170 checks passed
SS-JIA pushed a commit that referenced this pull request Feb 25, 2026
@SS-JIA SS-JIA deleted the gh/SS-JIA/449/head branch February 25, 2026 19:32
