[ET-VK][qconv] Add flexible layout impl for quantized pointwise conv #17221

Open

SS-JIA wants to merge 1 commit into gh/SS-JIA/410/base from gh/SS-JIA/410/head

Conversation

Contributor

SS-JIA commented Feb 4, 2026

Stack from ghstack (oldest at bottom):

This commit adds a flexible memory layout implementation for quantized pointwise
(1x1) convolution in the ExecuTorch Vulkan backend. The key changes introduce a
new operator (etvk.q8ta_conv2d_pw) that can handle multiple int8 tensor memory
layouts, rather than being restricted to a single fixed layout.

Key Components Added

1. Two New GLSL Compute Shaders

- q8ta_conv2d_pw.glsl: The primary flexible-layout shader that uses
  BufferMetadata UBOs and layout specialization constants to support multiple
  memory layouts (kPackedInt8_4C1W, kPackedInt8_4W4C, kPackedInt8_4C). Uses
  scalar array indexing for output writes to handle different stride patterns.
- q8ta_conv2d_pw_4w4c_ref.glsl: A reference implementation specifically for the
  4W4C layout that uses simpler ivec4 indexing; currently not enabled in
  production (gated by if (false) in C++). The two indexing styles are
  contrasted in the sketch after this list.
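As an aside, here is a minimal host-side C++ sketch of the two output-write styles mentioned above. It is an illustration only: write_block_scalar, write_block_vec4, and the outer_stride parameter are hypothetical names, not code from the shaders in this PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Flexible path: each of the 4 packed int8 words produced for an output block
// is written through its own stride-derived index, so any supported layout's
// stride pattern can be handled by the same code.
void write_block_scalar(
    std::vector<int32_t>& out_buf,
    const std::array<int32_t, 4>& packed_block, // 4 words, each holding 4 int8 lanes
    size_t base_idx,
    size_t outer_stride) { // distance between consecutive words; layout-dependent
  for (int i = 0; i < 4; ++i) {
    out_buf[base_idx + i * outer_stride] = packed_block[i];
  }
}

// Reference 4W4C-style path: the 4 words are contiguous, so they can be stored
// as a single 4-wide vector (the ivec4 store in the reference shader).
void write_block_vec4(
    std::vector<int32_t>& out_buf,
    const std::array<int32_t, 4>& packed_block,
    size_t base_idx) {
  for (int i = 0; i < 4; ++i) {
    out_buf[base_idx + i] = packed_block[i]; // contiguous, vectorizable store
  }
}
```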

Both shaders use:

- 4×8 output tiling (TILE_M=4 widths × TILE_N=8 channels per thread)
- dotPacked4x8AccSatEXT for efficient int8 dot products (a plain C++ reference
  for this operation follows this list)
- Texture2D for weight storage, buffers for input/output
- Per-channel weight quantization with symmetric int8 weights
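For reference, the following is a plain C++ emulation of the operation a dotPacked4x8AccSatEXT call performs on signed data, written from the documented behaviour of the Vulkan integer dot product extension rather than from this PR; pack4xInt8 and dot_packed_4x8_acc_sat are hypothetical helper names, and the byte/lane order is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Pack four signed 8-bit values into one 32-bit word.
// Lane order (component 0 in the least significant byte) is an assumption.
uint32_t pack4xInt8(int8_t v0, int8_t v1, int8_t v2, int8_t v3) {
  return uint32_t(uint8_t(v0)) | (uint32_t(uint8_t(v1)) << 8) |
         (uint32_t(uint8_t(v2)) << 16) | (uint32_t(uint8_t(v3)) << 24);
}

// 4-element signed int8 dot product with accumulate and saturation to int32,
// i.e. the per-call work that the packed dot product instruction replaces.
int32_t dot_packed_4x8_acc_sat(uint32_t a, uint32_t b, int32_t acc) {
  int64_t sum = acc;
  for (int i = 0; i < 4; ++i) {
    const int8_t ai = int8_t((a >> (8 * i)) & 0xFF);
    const int8_t bi = int8_t((b >> (8 * i)) & 0xFF);
    sum += int32_t(ai) * int32_t(bi);
  }
  return int32_t(std::clamp<int64_t>(
      sum,
      std::numeric_limits<int32_t>::min(),
      std::numeric_limits<int32_t>::max()));
}
```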

2. C++ Operator Implementation (Q8taConv2dPW.cpp)

- prepack_quantized_conv2d_pw_weight(): Prepacks int8 weights into texture2D
  format optimized for the shader's access pattern
- add_q8ta_conv2d_pw_node(): Dispatches the flexible-layout shader with buffer
  metadata UBOs
- add_q8ta_conv2d_pw_4w4c_node(): Dispatches the 4W4C-specific reference shader
- q8ta_conv2d_pw(): High-level operator that handles argument parsing, weight
  prepacking, and kernel selection (the prepack-and-dispatch flow is sketched
  after this list)
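The overall flow of the operator can be pictured with the schematic below. Only the function names come from this PR; ComputeGraph, ValueRef, and every argument list are placeholders, and the real signatures in Q8taConv2dPW.cpp will differ.

```cpp
struct ComputeGraph {}; // stand-in for the Vulkan compute graph type
using ValueRef = int;   // stand-in for a graph value handle

// Stubs standing in for the helpers named above; real implementations live in
// Q8taConv2dPW.cpp with different signatures.
ValueRef prepack_quantized_conv2d_pw_weight(ComputeGraph&, ValueRef w) { return w; }
void add_q8ta_conv2d_pw_node(ComputeGraph&, ValueRef, ValueRef, ValueRef, ValueRef) {}
void add_q8ta_conv2d_pw_4w4c_node(ComputeGraph&, ValueRef, ValueRef, ValueRef, ValueRef) {}

void q8ta_conv2d_pw_sketch(
    ComputeGraph& graph,
    ValueRef input,
    ValueRef weight,
    ValueRef weight_scales,
    ValueRef output) {
  // 1. Prepack int8 weights into the texture2D format the shader reads.
  ValueRef packed_weight = prepack_quantized_conv2d_pw_weight(graph, weight);

  // 2. Kernel selection: the 4W4C reference shader is currently disabled, so
  //    the flexible-layout shader is always dispatched.
  if (false) { // gated off in this revision, per the PR description
    add_q8ta_conv2d_pw_4w4c_node(graph, input, packed_weight, weight_scales, output);
  } else {
    add_q8ta_conv2d_pw_node(graph, input, packed_weight, weight_scales, output);
  }
}
```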

3. Test Infrastructure Updates

- TestQ8taConv2d.cpp: Added test_q8ta_conv2d_pw() test operator that wraps
  quantize → conv2d_pw → dequantize for end-to-end testing (a schematic of this
  pattern follows this list)
- test_q8ta_conv2d_pw.cpp: Comprehensive test suite with:
  - Multiple input sizes (3→32, 32→64, 64→96, 7→13, 40→80 channels, etc.)
  - Performance test cases (480→160, 48→22, 128→128, 576→64 channels)
  - Tests across 3 memory layouts: kPackedInt8_4C1W, kPackedInt8_4W4C,
    kPackedInt8_4C
  - Both texture and buffer storage types for floating-point tensors
  - Reference implementation comparison for correctness validation
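The quantize → conv2d_pw → dequantize pattern that the test operator wraps can be illustrated with a standalone CPU reference like the one below. This is a sketch of the pattern, not code from TestQ8taConv2d.cpp; the quantize and conv2d_pw_reference helpers, the channels-first single-image layout, and the scale handling are all assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Affine quantization of a single float value to int8.
int8_t quantize(float v, float scale, int32_t zero_point) {
  const int32_t q = int32_t(std::lround(v / scale)) + zero_point;
  return int8_t(std::min(127, std::max(-128, q)));
}

// Pointwise (1x1) conv on a single image stored channels-first (C x HW),
// with per-channel symmetric int8 weights (weight zero point = 0).
std::vector<float> conv2d_pw_reference(
    const std::vector<int8_t>& x, float x_scale, int32_t x_zp,
    const std::vector<int8_t>& w, const std::vector<float>& w_scales,
    int C_in, int C_out, int HW) {
  std::vector<float> y(size_t(C_out) * HW, 0.f);
  for (int oc = 0; oc < C_out; ++oc) {
    for (int p = 0; p < HW; ++p) {
      int32_t acc = 0;
      for (int ic = 0; ic < C_in; ++ic) {
        const int32_t xv = int32_t(x[size_t(ic) * HW + p]) - x_zp;
        const int32_t wv = int32_t(w[size_t(oc) * C_in + ic]);
        acc += xv * wv;
      }
      // Dequantize the int32 accumulator back to float for comparison against
      // a floating-point reference.
      y[size_t(oc) * HW + p] = float(acc) * x_scale * w_scales[oc];
    }
  }
  return y;
}
```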

Architecture

The shader handles layout flexibility via:

1. Layout specialization constants (outp_layout, inp_layout) passed from C++
2. BufferMetadata UBOs providing runtime strides for input/output tensors
3. compute_outp_buffer_idx() function that computes correct buffer indices based
   on layout (see the stride-indexing sketch after this list)
4. get_outer_packed_dim_block_size() from block_indexing.glslh to determine
   stride patterns
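A hypothetical illustration of the stride-driven indexing idea: per-dimension strides come from a UBO, so one shader body can address any of the supported packed-int8 layouts. BufferMetadataSketch and buffer_word_idx below are invented stand-ins, not the actual BufferMetadata or compute_outp_buffer_idx from this PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct BufferMetadataSketch {
  std::array<uint32_t, 4> sizes;   // logical extent of each dimension, in words
  std::array<uint32_t, 4> strides; // stride of each dimension, in 32-bit words
};

// Linear word index of element (w, h, c, n). Which dimension is "packed"
// (i.e. carries 4 int8 lanes per 32-bit word) is implied by how sizes and
// strides were populated on the host, so the same function serves every layout.
size_t buffer_word_idx(const BufferMetadataSketch& meta,
                       uint32_t w, uint32_t h, uint32_t c, uint32_t n) {
  return size_t(w) * meta.strides[0] + size_t(h) * meta.strides[1] +
         size_t(c) * meta.strides[2] + size_t(n) * meta.strides[3];
}
```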

Differential Revision: [D92307253](https://our.internmc.facebook.com/intern/diff/D92307253/)


pytorch-bot bot commented Feb 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17221

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Pending, 1 Unrelated Failure

As of commit 8724d37 with merge base 477867a:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

SS-JIA pushed a commit that referenced this pull request Feb 4, 2026
ghstack-source-id: 338324595
Pull Request resolved: #17221
meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 4, 2026

github-actions bot commented Feb 4, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Labels

CLA Signed, fb-exported, meta-exported
