[ET-VK][qconv] Add flexible layout impl for quantized pointwise conv #17221

Open

SS-JIA wants to merge 1 commit into gh/SS-JIA/410/base from gh/SS-JIA/410/head

Conversation

Contributor

SS-JIA commented Feb 4, 2026

Stack from ghstack (oldest at bottom):

This commit adds a flexible memory layout implementation for quantized pointwise
(1x1) convolution in the ExecuTorch Vulkan backend. The key changes introduce a
new operator (etvk.q8ta_conv2d_pw) that can handle multiple int8 tensor memory
layouts, rather than being restricted to a single fixed layout.

Key Components Added

1. Two New GLSL Compute Shaders

- q8ta_conv2d_pw.glsl: The primary flexible-layout shader that uses
  BufferMetadata UBOs and layout specialization constants to support multiple
  memory layouts (kPackedInt8_4C1W, kPackedInt8_4W4C, kPackedInt8_4C). Uses
  scalar array indexing for output writes to handle different stride patterns.
- q8ta_conv2d_pw_4w4c_ref.glsl: A reference implementation specifically for the
  4W4C layout that uses simpler ivec4 indexing; currently not enabled in
  production (gated by if (false) in C++). The two indexing styles are
  contrasted in the sketch after this list.
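As an aside, here is a minimal host-side C++ sketch of the two output-write styles mentioned above. It is an illustration only: write_block_scalar, write_block_vec4, and the outer_stride parameter are hypothetical names, not code from the shaders in this PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Flexible path: each of the 4 packed int8 words produced for an output block
// is written through its own stride-derived index, so any supported layout's
// stride pattern can be handled by the same code.
void write_block_scalar(
    std::vector<int32_t>& out_buf,
    const std::array<int32_t, 4>& packed_block, // 4 words, each holding 4 int8 lanes
    size_t base_idx,
    size_t outer_stride) { // distance between consecutive words; layout-dependent
  for (int i = 0; i < 4; ++i) {
    out_buf[base_idx + i * outer_stride] = packed_block[i];
  }
}

// Reference 4W4C-style path: the 4 words are contiguous, so they can be stored
// as a single 4-wide vector (the ivec4 store in the reference shader).
void write_block_vec4(
    std::vector<int32_t>& out_buf,
    const std::array<int32_t, 4>& packed_block,
    size_t base_idx) {
  for (int i = 0; i < 4; ++i) {
    out_buf[base_idx + i] = packed_block[i]; // contiguous, vectorizable store
  }
}
```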

Both shaders use:

- 4×8 output tiling (TILE_M=4 widths × TILE_N=8 channels per thread)
- dotPacked4x8AccSatEXT for efficient int8 dot products (a plain C++ reference
  for this operation follows this list)
- Texture2D for weight storage, buffers for input/output
- Per-channel weight quantization with symmetric int8 weights
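For reference, the following is a plain C++ emulation of the operation a dotPacked4x8AccSatEXT call performs on signed data, written from the documented behaviour of the Vulkan integer dot product extension rather than from this PR; pack4xInt8 and dot_packed_4x8_acc_sat are hypothetical helper names, and the byte/lane order is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Pack four signed 8-bit values into one 32-bit word.
// Lane order (component 0 in the least significant byte) is an assumption.
uint32_t pack4xInt8(int8_t v0, int8_t v1, int8_t v2, int8_t v3) {
  return uint32_t(uint8_t(v0)) | (uint32_t(uint8_t(v1)) << 8) |
         (uint32_t(uint8_t(v2)) << 16) | (uint32_t(uint8_t(v3)) << 24);
}

// 4-element signed int8 dot product with accumulate and saturation to int32,
// i.e. the per-call work that the packed dot product instruction replaces.
int32_t dot_packed_4x8_acc_sat(uint32_t a, uint32_t b, int32_t acc) {
  int64_t sum = acc;
  for (int i = 0; i < 4; ++i) {
    const int8_t ai = int8_t((a >> (8 * i)) & 0xFF);
    const int8_t bi = int8_t((b >> (8 * i)) & 0xFF);
    sum += int32_t(ai) * int32_t(bi);
  }
  return int32_t(std::clamp<int64_t>(
      sum,
      std::numeric_limits<int32_t>::min(),
      std::numeric_limits<int32_t>::max()));
}
```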

2. C++ Operator Implementation (Q8taConv2dPW.cpp)

- prepack_quantized_conv2d_pw_weight(): Prepacks int8 weights into texture2D
  format optimized for the shader's access pattern
- add_q8ta_conv2d_pw_node(): Dispatches the flexible-layout shader with buffer
  metadata UBOs
- add_q8ta_conv2d_pw_4w4c_node(): Dispatches the 4W4C-specific reference shader
- q8ta_conv2d_pw(): High-level operator that handles argument parsing, weight
  prepacking, and kernel selection (the prepack-and-dispatch flow is sketched
  after this list)
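The overall flow of the operator can be pictured with the schematic below. Only the function names come from this PR; ComputeGraph, ValueRef, and every argument list are placeholders, and the real signatures in Q8taConv2dPW.cpp will differ.

```cpp
struct ComputeGraph {}; // stand-in for the Vulkan compute graph type
using ValueRef = int;   // stand-in for a graph value handle

// Stubs standing in for the helpers named above; real implementations live in
// Q8taConv2dPW.cpp with different signatures.
ValueRef prepack_quantized_conv2d_pw_weight(ComputeGraph&, ValueRef w) { return w; }
void add_q8ta_conv2d_pw_node(ComputeGraph&, ValueRef, ValueRef, ValueRef, ValueRef) {}
void add_q8ta_conv2d_pw_4w4c_node(ComputeGraph&, ValueRef, ValueRef, ValueRef, ValueRef) {}

void q8ta_conv2d_pw_sketch(
    ComputeGraph& graph,
    ValueRef input,
    ValueRef weight,
    ValueRef weight_scales,
    ValueRef output) {
  // 1. Prepack int8 weights into the texture2D format the shader reads.
  ValueRef packed_weight = prepack_quantized_conv2d_pw_weight(graph, weight);

  // 2. Kernel selection: the 4W4C reference shader is currently disabled, so
  //    the flexible-layout shader is always dispatched.
  if (false) { // gated off in this revision, per the PR description
    add_q8ta_conv2d_pw_4w4c_node(graph, input, packed_weight, weight_scales, output);
  } else {
    add_q8ta_conv2d_pw_node(graph, input, packed_weight, weight_scales, output);
  }
}
```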

3. Test Infrastructure Updates

- TestQ8taConv2d.cpp: Added test_q8ta_conv2d_pw() test operator that wraps
  quantize → conv2d_pw → dequantize for end-to-end testing (a schematic of this
  pattern follows this list)
- test_q8ta_conv2d_pw.cpp: Comprehensive test suite with:
  - Multiple input sizes (3→32, 32→64, 64→96, 7→13, 40→80 channels, etc.)
  - Performance test cases (480→160, 48→22, 128→128, 576→64 channels)
  - Tests across 3 memory layouts: kPackedInt8_4C1W, kPackedInt8_4W4C,
    kPackedInt8_4C
  - Both texture and buffer storage types for floating-point tensors
  - Reference implementation comparison for correctness validation
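The quantize → conv2d_pw → dequantize pattern that the test operator wraps can be illustrated with a standalone CPU reference like the one below. This is a sketch of the pattern, not code from TestQ8taConv2d.cpp; the quantize and conv2d_pw_reference helpers, the channels-first single-image layout, and the scale handling are all assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Affine quantization of a single float value to int8.
int8_t quantize(float v, float scale, int32_t zero_point) {
  const int32_t q = int32_t(std::lround(v / scale)) + zero_point;
  return int8_t(std::min(127, std::max(-128, q)));
}

// Pointwise (1x1) conv on a single image stored channels-first (C x HW),
// with per-channel symmetric int8 weights (weight zero point = 0).
std::vector<float> conv2d_pw_reference(
    const std::vector<int8_t>& x, float x_scale, int32_t x_zp,
    const std::vector<int8_t>& w, const std::vector<float>& w_scales,
    int C_in, int C_out, int HW) {
  std::vector<float> y(size_t(C_out) * HW, 0.f);
  for (int oc = 0; oc < C_out; ++oc) {
    for (int p = 0; p < HW; ++p) {
      int32_t acc = 0;
      for (int ic = 0; ic < C_in; ++ic) {
        const int32_t xv = int32_t(x[size_t(ic) * HW + p]) - x_zp;
        const int32_t wv = int32_t(w[size_t(oc) * C_in + ic]);
        acc += xv * wv;
      }
      // Dequantize the int32 accumulator back to float for comparison against
      // a floating-point reference.
      y[size_t(oc) * HW + p] = float(acc) * x_scale * w_scales[oc];
    }
  }
  return y;
}
```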

Architecture

The shader handles layout flexibility via:

1. Layout specialization constants (outp_layout, inp_layout) passed from C++
2. BufferMetadata UBOs providing runtime strides for input/output tensors
3. compute_outp_buffer_idx() function that computes correct buffer indices based
   on layout (see the stride-indexing sketch after this list)
4. get_outer_packed_dim_block_size() from block_indexing.glslh to determine
   stride patterns
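A hypothetical illustration of the stride-driven indexing idea: per-dimension strides come from a UBO, so one shader body can address any of the supported packed-int8 layouts. BufferMetadataSketch and buffer_word_idx below are invented stand-ins, not the actual BufferMetadata or compute_outp_buffer_idx from this PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct BufferMetadataSketch {
  std::array<uint32_t, 4> sizes;   // logical extent of each dimension, in words
  std::array<uint32_t, 4> strides; // stride of each dimension, in 32-bit words
};

// Linear word index of element (w, h, c, n). Which dimension is "packed"
// (i.e. carries 4 int8 lanes per 32-bit word) is implied by how sizes and
// strides were populated on the host, so the same function serves every layout.
size_t buffer_word_idx(const BufferMetadataSketch& meta,
                       uint32_t w, uint32_t h, uint32_t c, uint32_t n) {
  return size_t(w) * meta.strides[0] + size_t(h) * meta.strides[1] +
         size_t(c) * meta.strides[2] + size_t(n) * meta.strides[3];
}
```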

Differential Revision: [D92307253](https://our.internmc.facebook.com/intern/diff/D92307253/)


pytorch-bot bot commented Feb 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17221

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Pending, 1 Unrelated Failure

As of commit 8724d37 with merge base 477867a:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

SS-JIA pushed a commit that referenced this pull request Feb 4, 2026
ghstack-source-id: 338324595
Pull Request resolved: #17221
meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 4, 2026

github-actions bot commented Feb 4, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Labels

CLA Signed, fb-exported, meta-exported
