[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup #13003

pytorchbot · 2025-07-30T16:15:26Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #12577 by @ahmtox
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/45/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/45/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/44/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/45/orig
@diff-train-skip-merge

cc @SS-JIA @manuelcandales @cbilgin

Pull Request resolved: #12574 # Changes * Introduce `linear_qta8a_qga4w` custom operator in `custom_ops_lib.py` to handle dynamic activation + grouped weight quantized linear operations * Add pattern matching and fusion logic in `FuseQuantizedOpsTransform` to detect and replace dequant + dequant + linear sequences with the new fused operator * Implement comprehensive test coverage in `test_vulkan_passes.py` for the QTA8A_QGA4W fusion pattern validation * Add 4-bit weight packing utilities and grouped quantization support for efficient memory usage # Motivation The existing quantization workflow in Vulkan backend processes dynamic activation + grouped weight quantized linear operations as separate quantize/dequantize/linear steps, which creates performance overhead through: * Multiple kernel dispatches instead of a single fused operation * Intermediate tensor allocations for dequantized weights and activations * Suboptimal memory bandwidth utilization The new `linear_qta8a_qga4w` operator fuses the entire sequence into a single operation that: * Directly processes 8-bit quantized activations with per-token scales/zero-points * Handles 4-bit grouped quantized weights with configurable group sizes * Eliminates intermediate dequantization steps by performing dequantization inline * Reduces memory footprint through packed 4-bit weight storage This aligns with the broader goal of optimizing quantized model inference in the Vulkan backend by leveraging graph-level transformations to improve computational efficiency while maintaining numerical accuracy. ghstack-source-id: 299473616 @exported-using-ghexport Differential Revision: [D78291269](https://our.internmc.facebook.com/intern/diff/D78291269/)

Pull Request resolved: #12575 # Changes * Implement `torchao.quantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support * Add block-wise quantization mode in `quantize_texture.glsl` and `quantize_buffer.glsl` shaders for configurable tensor block quantization * Introduce comprehensive test suite in `affine_test.cpp` with multi-dimensional tensor validation and reference implementation * Extend quantization infrastructure in `Quantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters BE: Improved the documentation in the shader logic which is more detailed and clear NOTE: I delegated the quantize_affine and future affine operators through a new custom test file denoted as `affine_test.cpp` as the other quantization testing framework was getting a little large, and it makes more sense to separate the namespace between torchao and quantized_decomposed. I believe the _decomposed namespace is getting phased out in favor of this affine operator so deprecation will be easier in the future. # Motivation The existing Vulkan quantization infrastructure lacked support for the `torchao.quantize_affine` operator, which is essential for enabling dynamic quantization efficiently. The `quantize_affine` operator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling: * **Block-wise Quantization**: Applies quantization parameters to configurable tensor blocks rather than entire tensors, improving quantization accuracy for heterogeneous data distributions * **Affine Transformation**: Uses the formula `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` for precise floating-point to integer mapping # Operator Description The `quantize_affine` operator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision. The quantization formula is: `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` **Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping. # Block-wise Quantization Implementation Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in `Quantize.cpp`: * **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks) * **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()` * **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec` * **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w` # Shader Algorithm Overview ## Texture Storage Implementation (`quantize_texture.glsl`) **Workgroup Configuration**: - **Global WG Size**: Default sizing based on texture dimensions - **Local WG Size**: Default with special handling for batch dimension quantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`) **Block-wise Mode Algorithm**: The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing. For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension. The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. Each component is quantized using its corresponding block's parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])` and written to the output texel. ## Buffer Storage Implementation (`quantize_buffer.glsl`) **Workgroup Configuration**: - **Global WG Size**: Default sizing based on buffer element count - **Local WG Size**: Default sizing without special constraints **Block-wise Mode Algorithm**: The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations. For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. The element value is loaded using the corresponding input buffer index: `value = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Quantization applies the block-specific parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`. **Future Improvements**: Dynamic workgroup sizing based on block dimensions, there is likely a better method to making it better than what it is currently. ghstack-source-id: 299473617 @exported-using-ghexport Differential Revision: [D78302195](https://our.internmc.facebook.com/intern/diff/D78302195/)

…anup Pull Request resolved: #12576 # Changes * Implement `torchao.dequantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support * Add block-wise dequantization mode in `dequantize_texture.glsl` and `dequantize_buffer.glsl` shaders for configurable tensor block dequantization * Extend dequantization infrastructure in `Dequantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters * Support integer-to-floating-point conversion with precise reconstruction of original values BE: Improved the documentation in the shader logic which is more detailed and clear # Motivation The existing Vulkan quantization infrastructure lacked support for the `torchao.dequantize_affine` operator, which is essential for completing the quantization-dequantization cycle in dynamic quantization workflows. The `dequantize_affine` operator provides flexible block-wise dequantization that reconstructs floating-point values from quantized integer blocks, enabling: * **Block-wise Dequantization**: Reconstructs floating-point values from configurable tensor blocks using separate scale and zero-point parameters, enabling precise recovery of original data distributions * **Affine Transformation**: Uses the formula `value = (qvalue - zero_point) * scale` for accurate integer-to-floating-point mapping # Operator Description The `dequantize_affine` operator converts n-bit integer tensor values back to floating-point representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise dequantization divides tensors into blocks and applies separate dequantization parameters to each block, allowing fine-grained reconstruction of the original floating-point precision. The dequantization formula is: `value = (qvalue - zero_point) * scale` **Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping. Input tensors must be integer types (kByte, kChar, kInt). # Block-wise Dequantization Implementation Block-wise dequantization enables fine-grained reconstruction by dividing tensors into blocks and applying separate dequantization parameters to each block. The implementation uses the same key data structures computed in `Dequantize.cpp`: * **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks) * **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()` * **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec` * **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w` # Shader Algorithm Overview ## Texture Storage Implementation (`dequantize_texture.glsl`) **Workgroup Configuration**: - **Global WG Size**: Default sizing based on texture dimensions - **Local WG Size**: Default with special handling for batch dimension dequantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`) **Block-wise Mode Algorithm**: The shader processes 3D texture positions where each position represents a texel containing 4 width-packed integer components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing. For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension. The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. Each integer component is dequantized using its corresponding block's parameters: `value = dequantize_val(qvalue, t_scale[block_id], t_zero_point[block_id])` where `dequantize_val()` applies the formula `(qvalue - zero_point) * scale`. The reconstructed floating-point values are written to the output texel with proper type handling for double precision outputs. ## Buffer Storage Implementation (`dequantize_buffer.glsl`) **Workgroup Configuration**: - **Global WG Size**: Default sizing based on buffer element count - **Local WG Size**: Default sizing without special constraints **Block-wise Mode Algorithm**: The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations. For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. The quantized integer value is loaded using the corresponding input buffer index: `qvalue = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Dequantization applies the block-specific parameters: `value = dequantize_val(qvalue, t_scale[block_id], t_zero_point[block_id])` to reconstruct the original floating-point value. **Future Improvements**: Dynamic workgroup sizing based on block dimensions ghstack-source-id: 299473614 @exported-using-ghexport Differential Revision: [D78435552](https://our.internmc.facebook.com/intern/diff/D78435552/)

Pull Request resolved: #12577 # Changes * Implement `torchao.choose_qparams_affine` operator in Vulkan backend with comprehensive buffer storage support * Add block-wise quantization parameter computation in `choose_qparams_buffer.glsl` shader for configurable tensor block analysis * Extend quantization parameter infrastructure in `ChooseQParams.cpp` to handle affine transformations with configurable block sizes and multiple mapping types * Support three quantization mapping strategies: ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR for optimal parameter selection * Consolidated the logic for choosing scale and zero point between affine cases and regular quantized_decomposed cases. BE: Improved the documentation in the shader logic which is more detailed and clear # Motivation The existing Vulkan quantization infrastructure lacked support for the `torchao.choose_qparams_affine` operator, which is essential for computing optimal quantization parameters in dynamic quantization workflows. The `choose_qparams_affine` operator provides flexible block-wise parameter computation that analyzes statistical distributions within tensor blocks, enabling: * **Block-wise Parameter Computation**: Analyzes configurable tensor blocks to compute optimal scale and zero-point values, improving quantization accuracy for heterogeneous data distributions * **Multiple Mapping Types**: Supports ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR quantization strategies for different precision-performance trade-offs # Operator Description The `choose_qparams_affine` operator computes optimal quantization parameters (scale and zero_point) from floating-point tensor blocks using statistical analysis of data distributions. Block-wise parameter computation divides tensors into blocks and analyzes each block independently to determine the best quantization mapping for subsequent quantization operations. The parameter calculation varies by mapping type: - **ASYMMETRIC**: `scale = (max - min) / (quant_max - quant_min)`, `zero_point = quant_min - round(min / scale)` - **SYMMETRIC**: `scale = max_abs / ((quant_max - quant_min) / 2)`, `zero_point = midpoint` - **SYMMETRIC_NO_CLIPPING_ERR**: `scale = max(abs(min)/abs(quant_min), max/quant_max)`, `zero_point = midpoint` **Storage Requirements**: Input tensors must be floating-point (kFloat) with width-packed layout. Output scale/zero_point tensors use buffer storage. NOTE: Texture storage implementation is not supported due to complexity of block-wise coordinate mapping in 3D texture space. This will likely be necessary for better efficiency in the future. # Block-wise Parameter Computation Implementation Block-wise parameter computation enables fine-grained quantization analysis by dividing tensors into blocks and computing separate scale/zero_point parameters for each block. The implementation uses several key data structures computed in `ChooseQParams.cpp`: * **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks) * **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()` * **`num_blocks_vec`**: Number of blocks per dimension calculated as `ceil(tensor_size_whcn / block_size_vec)` to handle non-divisible dimensions * **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation * **`mapping_type`**: Integer encoding of quantization strategy (0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR) The block coordinate calculation uses: `block_coord = block_id_to_coord(block_id)` which converts linear block IDs back to 4D WHCN coordinates, then computes element ranges: `t0 = block_coord * blockSize` and `tEnd = t0 + blockSize` for nested loop iteration. # Shader Algorithm Overview ## Buffer Storage Implementation (`choose_qparams_buffer.glsl`) **Workgroup Configuration**: - **Global WG Size**: `{nBlocks, 1u, 1u}` where `nBlocks = total number of blocks` computed from `ceil(tensor_size / block_size)` for each dimension - **Local WG Size**: `{1u, 1u, 1u}` (single thread per block for simplicity, though could be optimized for larger blocks) **Block-wise Mode Algorithm**: The shader uses a sophisticated multi-level nested approach to process tensor blocks efficiently. Each thread is assigned multiple blocks using strided access: `for (uint block_id = gl_GlobalInvocationID.x; block_id < TOTAL_BLOCKS; block_id += STRIDE)` where `STRIDE = gl_WorkGroupSize.x * gl_NumWorkGroups.x`. For each assigned block, the algorithm performs several key steps: **1. Block Coordinate Conversion**: The `block_id_to_coord(block_id)` function converts linear block IDs to 4D WHCN coordinates using modular arithmetic. **2. Element Range Calculation**: Computes the inclusive start coordinate `t0 = bc * blockSize` and exclusive end coordinate `tEnd = t0 + blockSize` to define the block's element boundaries in tensor space. **3. Nested Loop Min/Max Scan**: Uses four nested loops to iterate through all elements within the block: `for (int n = t0.w; n < tEnd.w; ++n) for (int c = t0.z; c < tEnd.z; ++c) for (int h = t0.y; h < tEnd.y; ++h) for (int w = t0.x; w < tEnd.x; ++w)` Each element is accessed using `tidx_to_bufi(ivec4(w,h,c,n), t_in_strides)` to convert 4D tensor coordinates to linear buffer indices with proper stride handling. **4. Parameter Calculation**: Calls `calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps, scale, zp)` which implements the three mapping strategies: * **ASYMMETRIC (mapping_type=0)**: Maps full range [min, max] to [quant_min, quant_max] preserving data distribution * **SYMMETRIC (mapping_type=1)**: Centers around zero using `max_abs = max(abs(min), abs(max))` for balanced quantization * **SYMMETRIC_NO_CLIPPING_ERR (mapping_type=2)**: Computes separate scales for positive/negative ranges and uses the maximum to prevent clipping **Future Improvements**: Implement workgroup-level reduction for large blocks, optimize memory access patterns for better cache utilization, and explore texture storage implementation with simplified block alignment constraints. ghstack-source-id: 299473615 @exported-using-ghexport Differential Revision: [D78436638](https://our.internmc.facebook.com/intern/diff/D78436638/)

pytorch-bot · 2025-07-30T16:15:35Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13003

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

morelos added 4 commits July 29, 2025 23:43

pytorchbot requested a review from SS-JIA as a code owner July 30, 2025 16:15

pytorch-bot bot added the module: vulkan Issues related to the Vulkan delegate and code under backends/vulkan/ label Jul 30, 2025

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 30, 2025

ahmtox added 4 commits July 30, 2025 09:17

Merge branch 'gh/ahmtox/41/orig' into gh/ahmtox/42/orig

b70f056

Merge branch 'gh/ahmtox/42/orig' into gh/ahmtox/43/orig

0bf16d4

Merge branch 'gh/ahmtox/43/orig' into gh/ahmtox/44/orig

feeea45

Merge branch 'gh/ahmtox/44/orig' into gh/ahmtox/45/orig

f45409e

Base automatically changed from gh/ahmtox/44/orig to gh/ahmtox/43/orig July 30, 2025 17:03

Base automatically changed from gh/ahmtox/43/orig to gh/ahmtox/42/orig July 30, 2025 17:09

Base automatically changed from gh/ahmtox/42/orig to gh/ahmtox/41/orig July 30, 2025 17:09

Merge branch 'gh/ahmtox/41/orig' into gh/ahmtox/45/orig

59f2a23

ahmtox merged commit 328836f into gh/ahmtox/41/orig Jul 30, 2025
20 of 22 checks passed

ahmtox deleted the gh/ahmtox/45/orig branch July 30, 2025 17:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup #13003

[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup #13003

Uh oh!

pytorchbot commented Jul 30, 2025 •

edited by pytorch-bot bot

Loading

Uh oh!

pytorch-bot bot commented Jul 30, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup #13003

[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup #13003

Uh oh!

Conversation

pytorchbot commented Jul 30, 2025 • edited by pytorch-bot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13003

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pytorchbot commented Jul 30, 2025 •

edited by pytorch-bot bot

Loading

pytorch-bot bot commented Jul 30, 2025 •

edited

Loading