Support aoti_torch_cuda__weight_int4pack_mm #15030

base: gh/desertfire/1/base
Conversation
Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code calls aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, replacing at::Tensor and the relevant utility functions with their ExecuTorch equivalents.

Using the Voxtral runner as an example, with the bfloat16 format, here are the generated ptd file size and latency:

```
aoti_cuda_blob.ptd: 9.0 GB
Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128
```

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder parts:

```
aoti_cuda_blob.ptd: 3.7 GB
Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590
```

ghstack-source-id: a543a05
Pull Request resolved: #15030
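For orientation, here is a minimal sketch of what the ported shim's interface looks like after those replacements, pieced together from the diff hunks and the doc comment further down. The type aliases are assumptions shown only for readability; the exact declaration in the PR may differ slightly.

```cpp
#include <cstdint>

// Sketch only: headers and aliases assumed, not copied from the PR.
using Tensor = executorch::backends::aoti::Tensor;  // ET tensor, replaces at::Tensor
using AOTITorchError = executorch::runtime::Error;  // assumed: ET error-code type

extern "C" AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,            // [m, k] BFloat16 activations
    Tensor* mat2,            // packed Int32 tensor-core-layout INT4 weights
    int64_t qGroupSize,      // 32, 64, 128, or 256
    Tensor* qScaleAndZeros,  // [k / qGroupSize, n, 2] BFloat16 scales and zero points
    Tensor** ret0);          // output [m, n], allocated by the shim, freed by the caller
```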
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15030

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV — there is 1 currently active SEV. If your PR is affected, please view it below.

❌ 3 New Failures, 3 Cancelled Jobs — as of commit 892aa33 with merge base 9b03c13:
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
cuda_shim_cpp_unittest("aoti_torch__reinterpret_tensor") | ||
cuda_shim_cpp_unittest("aoti_torch_copy_") | ||
cuda_shim_cpp_unittest("aoti_torch_cuda_guard") | ||
cuda_shim_cpp_unittest("aoti_torch_cuda__weight_int4pack_mm") |
@larryliu0820, I didn't find a CMakeLists.txt for all these unit tests. I suppose we can only test them in fbcode?
See inline for additional documentation (I used Claude Code to generate the docs).
This is great, thank you!
// This is a clone of aten/src/ATen/native/cuda/int4mm.cu from PyTorch,
// with at::Tensor replaced with ETensor and aten utility functions/macros
// replaced with their executorch equivalents.
//
Let's add these lines for future reference
// This file is a port of PyTorch's int4mm.cu kernel implementation
// (aten/src/ATen/native/cuda/int4mm.cu) adapted for the ExecuTorch runtime.
//
// PORTING NOTES:
// --------------
// 1. KERNEL CODE (lines 36-1067): Identical to PyTorch - preserved 100%
// - All utility templates, vector types, and conversion logic unchanged
// - Tensor core kernels (tinygemm_m16n8k16_chunk_kernel) byte-for-byte identical
// - Same inline PTX assembly for mma.sync.aligned instructions
// - Identical performance characteristics and register allocation
//
// 2. API ADAPTATIONS:
// - Replaced at::Tensor with executorch::backends::aoti::Tensor
// - Changed from C++ API to extern "C" for AOTI dlsym() compatibility
// - Output returned via pointer-to-pointer instead of by-value
// - Error handling uses ET return codes instead of exceptions
//
// 3. REMOVED FEATURES:
// - _convert_weight_to_int4pack_cuda(): Weight conversion happens offline
// during model export via optimum-executorch. Runtime only consumes
// pre-packed weights.
// - isCDNA2orLater() runtime check: Removed dependency on ATen GPU detection
// hooks. ROCm support relies on compile-time guards only.
//
// 4. INFRASTRUCTURE CHANGES:
// - Removed c10::cuda::CUDAGuard: Device management handled by AOTI backend
// - Removed at::cuda::getCurrentCUDAStream(): Stream passed explicitly
// - Input validation simplified: AOTI pre-validates tensors during export
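As a side note on item 2 above (extern "C" for AOTI dlsym() compatibility): AOTI-compiled code resolves fallback ops by their unmangled symbol names, so a lookup along these lines only succeeds if the shim has C linkage. This is purely an illustration, not code from the PR; the function-pointer type is simplified with opaque pointers.

```cpp
#include <dlfcn.h>
#include <cstdint>

// Simplified signature: void* stands in for the real Tensor type so the
// sketch stays self-contained.
using WeightInt4PackMmFn = int32_t (*)(
    void* self, void* mat2, int64_t qGroupSize, void* qScaleAndZeros, void** ret0);

WeightInt4PackMmFn resolve_weight_int4pack_mm() {
  void* handle = dlopen(nullptr, RTLD_NOW);  // search symbols in the current process
  return handle == nullptr
      ? nullptr
      : reinterpret_cast<WeightInt4PackMmFn>(
            dlsym(handle, "aoti_torch_cuda__weight_int4pack_mm"));
}
```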
    ret0 != nullptr,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");
ET_CHECK_OR_RETURN_ERROR(
qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 || qGroupSize == 256,
InvalidArgument,
"aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
static_cast<long long>(qGroupSize));
#endif
AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
should check whether self is bfloat16?
AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
    Tensor* mat2,
check whether mat2 is int32
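Taking the two suggestions above together, here is a minimal sketch of what such dtype checks could look like, in the same style as the ET_CHECK_OR_RETURN_ERROR suggestion earlier. It assumes the ET Tensor's scalar_type() accessor is reachable through this Tensor alias.

```cpp
ET_CHECK_OR_RETURN_ERROR(
    self->scalar_type() == executorch::aten::ScalarType::BFloat16,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: self must be BFloat16");
ET_CHECK_OR_RETURN_ERROR(
    mat2->scalar_type() == executorch::aten::ScalarType::Int,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: mat2 must be Int32");
```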
#ifdef __cplusplus
extern "C" {
#endif
/**
* Performs quantized INT4 matrix multiplication.
*
* INT4 weights are stored in a packed tensor core layout optimized for
* NVIDIA Ampere+ GPUs (sm_80+) using m16n8k16 tensor core tiles.
*
* HARDWARE REQUIREMENTS:
* - CUDA Compute Capability >= 8.0 (Ampere or later)
* - BFloat16 support (native on sm_80+)
*
* TENSOR REQUIREMENTS:
* @param self Input activation matrix [m, k]
* - Must be BFloat16 dtype
* - Must be 2D
* - Must be on CUDA device
* - Row-major layout (contiguous)
*
* @param mat2 Quantized weight matrix in packed tensor core layout
* - Must be Int32 dtype (contains packed INT4 values)
* - Must be 4D: [n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]
* where InnerKTiles = 2, 4, or 8
* - Each Int32 contains 8 packed INT4 values
* - Layout optimized for tensor core access patterns
* - Must be on CUDA device
*
* @param qGroupSize Quantization group size (number of values sharing scale/zero)
* - Must be one of: 32, 64, 128, or 256
* - Smaller groups = higher accuracy but more metadata
* - Must evenly divide k dimension
*
* @param qScaleAndZeros Dequantization parameters [k/qGroupSize][n][2]
* - Must be BFloat16 dtype
* - Must be 3D
* - [:, :, 0] contains scales
* - [:, :, 1] contains zero points
* - Must be on CUDA device
*
* @param ret0 Output parameter for result matrix [m, n]
* - Allocated by this function as BFloat16
* - Must not be null
* - Caller is responsible for freeing via aoti_torch_delete_tensor_object()
*
* @return AOTITorchError error code:
* - Error::Ok: Success
* - Error::InvalidArgument: Null pointer, wrong dtype, wrong dimensions,
* or invalid qGroupSize
* - Error::Internal: CUDA kernel launch failure
*/
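To make the packed layout above concrete, here is a small worked example with hypothetical sizes (not taken from the PR):

```cpp
// Hypothetical sizes: k = 4096, n = 4096, InnerKTiles = 8, qGroupSize = 128.
//
//   self           : [m, 4096]                                    BFloat16
//   mat2           : [n/8, k/(InnerKTiles*16), 32, InnerKTiles/2]
//                  = [512, 32, 32, 4]                             Int32
//   qScaleAndZeros : [k/qGroupSize, n, 2] = [32, 4096, 2]         BFloat16
//   ret0           : [m, 4096]                                    BFloat16
//
// Packing sanity check: 512 * 32 * 32 * 4 Int32 elements, with 8 INT4 values
// per Int32, gives 16,777,216 INT4 values = 4096 * 4096 weights.
```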
// Enum for supported data types in et-cuda backend
enum class SupportedDTypes : int32_t {
  INT32 = 3, // PyTorch's int64 dtype code
PyTorch's int32 dtype code
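For reference, the relevant numeric codes in PyTorch's (c10) ScalarType enum, which these values mirror; stated from memory, so worth double-checking against c10/core/ScalarType.h:

```cpp
// PyTorch (c10) ScalarType codes relevant here:
//   Int (int32)  = 3   // the value used for INT32 above
//   Long (int64) = 4
//   BFloat16     = 15
```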
Wait, why is the "Run latency" slower with int4? cc @swolchok
Stack from ghstack (oldest at bottom):

Differential Revision: D84395275