[Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm #26092
Conversation
Code Review
This pull request adds support for GPTQv2 format checkpoints by introducing a new CUDA kernel, gptq_gemm_v2. The changes are well-structured and include necessary updates to the build system, Python bindings, and tests. My main feedback is regarding the implementation of the new CUDA kernel, which is a near-duplicate of the existing q_gemm.cu. This creates a significant maintenance overhead. I've suggested a refactoring approach to merge the v1 and v2 logic into a single file using template parameters to avoid code duplication.
csrc/quantization/gptq/q_gemm_v2.cu
Outdated
```
/*
Adapted from `q_gemm.cu`, which is adapted from
https://github.com/turboderp/exllamav2 and
https://github.com/qwopqwop200/GPTQ-for-LLaMa.
This supports GPTQ v2 format checkpoints (checkpoint_format: 'gptq_v2'),
by removing the v1-specific "zero + 1" logic during dequantization.
Specifically, GPTQ v1 format checkpoints store (zero - 1), and need to add 1 back
at runtime during dequantization. GPTQ v2 format checkpoints store the zero point
as is, and don't require the + 1 at runtime. For more details, please refer to
ModelCloud/GPTQModel:
https://github.com/ModelCloud/GPTQModel/blob/020ac04b74f6263f22491e6a6a034cb4fa5bf181/gptqmodel/utils/model.py#L625
*/
```
This new file q_gemm_v2.cu is almost a complete duplicate of q_gemm.cu, with the only difference being the handling of the zero point (+1 for v1, no offset for v2). This introduces a significant maintenance burden, as any future bug fixes or performance improvements in q_gemm.cu would need to be manually ported to this file.
To avoid this code duplication, I suggest merging the logic into the existing q_gemm.cu file and using a template parameter to differentiate between the v1 and v2 formats.
For example, you could template the kernels on a boolean V2_FORMAT:
```
template <bool V2_FORMAT, bool first_block, int m_count>
__global__ void gemm_half_q_half_gptq_4bit_kernel(...) {
  // ...
  const int zero_point_offset = V2_FORMAT ? 0 : 1;
  // ...
  dequant_4bit_8_prep_zero(zeros[0] + zero_point_offset, z1z16[0], y1y16[0]);
  // ...
}
```

The pick_..._kernel functions can then be updated to select the correct template instantiation based on a new is_v2 boolean parameter. This approach would apply to all duplicated kernels (gemm_..., reconstruct_..., etc.), consolidating the logic into a single, more maintainable file. The author has already mentioned being open to this.
I agree that this improves maintainability, if it is expected that this fallback GPTQ kernel will be updated (not likely, since vllm already has several optimized GPTQ kernels like Marlin).
For now, maybe it is also OK to keep the duplicated code? I can continue to work on templating the functions if the reviewers think it's necessary.
This pull request has merge conflicts that must be resolved before it can be merged.
Could you have this format just use the gptq_marlin kernels? These are much more performant, should already support zero point, and are widely used by default for >sm80.
Thanks for your time : ) In my case, I am serving some low-bit (e.g., 2/3-bit) models. Since the marlin kernels do not support these bit widths, such models have to fall back to `gptq_gemm`.

Although 4/8-bit models are more popular, it is good to support 2/3-bit ones for efficiency considerations. On my machines, the 2-bit models run faster than 4-bit ones. Some benchmark results (single A100 80GB):

```
# 2-bit GPTQv2
vllm bench latency --model BitDistiller/Qwen-8B-w2g64-gptq --input-len 8192 --output-len 1 --batch-size 16 --dtype float16
Avg latency: 8.880313996163506 seconds

# 4-bit GPTQ
vllm bench latency --model JunHowie/Qwen3-8B-GPTQ-Int4 --input-len 8192 --output-len 1 --batch-size 16
Avg latency: 12.523354894698908 seconds

# 4-bit AWQ
vllm bench latency --model Qwen/Qwen3-8B-AWQ --input-len 8192 --output-len 1 --batch-size 16
Avg latency: 12.61037957224374 seconds
```
Ah, I did not consider that gptq_gemm supports lower precisions than 4-bit. Okay @xxxxyu, I think this is a valid case if we can keep the code and binary size impact of this format addition down. It also gives some justification for keeping around gptq_gemm for the long term.
I've merged GPTQv2 format support into the existing `gptq_gemm` kernel. Now I will check the calling logic of 4-bit models.
Hi @mgoin, some update on the marlin override: both `gptq_marlin` and `gptq_bitblas` can override the plain `gptq` method, see:

vllm/vllm/model_executor/layers/quantization/gptq_marlin.py (lines 93 to 97 in f377333)
vllm/vllm/model_executor/layers/quantization/gptq_bitblas.py (lines 52 to 56 in f377333)

Other quantization/linear code to consider:
Now 2/3/8-bit GPTQv2 models can run correctly, but 4-bit still has some issues with the plain `gptq` path. With vllm v0.10.1.1 from pypi, when running https://huggingface.co/JunHowie/Qwen3-8B-GPTQ-Int4 (GPTQv1 format) and forcing `quantization="gptq"`, the output is also incorrect. Usually this won't be triggered, as "gptq_marlin" is the default quantization for 4-bit. My proposal to deal with this:
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @mgoin, I think I've addressed the concerns in #26092 (comment). Do you have some time to review?

After reading the source code and doing some tests, I've figured out the current state of vLLM's support for GPTQ(v2) format models.

In summary, this PR completes vLLM's GPTQ format support by adding GPTQv2 format support for 2/3-bit or asymmetric quantization. It maintains vLLM's original override priorities (gptq_marlin > gptq_bitblas > gptq), and won't cause errors with marlin/bitblas.
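A hedged usage sketch of the resulting behavior (illustrative only, not a snippet from the PR; the model name is the 2-bit checkpoint benchmarked above): a GPTQv2-format checkpoint can be loaded without any extra flag, since the format is detected from `quantize_config.json`.

```python
from vllm import LLM

# No quantization flag is needed: the gptq_v2 checkpoint format is read from the
# model's quantize_config.json and routed to the adapted gptq kernels.
llm = LLM(model="BitDistiller/Qwen-8B-w2g64-gptq", dtype="float16")
```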
LGTM, sorry for the delay @xxxxyu. I'd just like to prune the test time down a bit if possible.
tests/quantization/test_gptq_v2.py
Outdated
```python
MODELS = [
    ("BitDistiller/Qwen-8B-w2g64-gptq", "gptq_v2", True),
    ("BitDistiller/Llama-3.1-8B-Instruct-w2g64-gptq", "gptq_v2", False),
]
```
Can you use smaller models? I would appreciate 1B params at most, for CI cost.
Also, do you need both? It seems that enable_thinking isn't related to quantization, so it's not important to test here.
I can try quantizing a dummy model < 1B for the test. One is enough.
Qwen3-1.7B can generate normal text after simple 2-bit quantization, but Qwen3-0.6B cannot. So I uploaded https://huggingface.co/XXXXyu/Qwen3-1.7B-w2g64-gptq_v2 for testing. Model size is 1.0 GB; should that be OK?
I can switch to 0.6B 2-bit too, but the checking logic would be simplified to only check whether `!!!!` exists in the output.
Hi @mgoin, for deprecating the buggy 4-bit path, I've added an explicit error in the latest commit. Does that look OK?
@xxxxyu Let's just leave the 4-bit error change out for now; we can address it in a follow-up PR.
@mgoin I've replaced the 4-bit error message with a warning. Is this ready to merge?
Purpose
Add support for GPTQv2 format model checkpoints, by adapting the `gptq_gemm` CUDA kernel with correct zero point handling. The condition is determined by `"checkpoint_format": "gptq_v2"` in `quantize_config.json`.

Currently, vllm treats GPTQv2 and GPTQv1 (the default GPTQ) format checkpoints equally, which causes gibberish output (repeated `!!!!`) with GPTQv2 format checkpoints (details in #26343).
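As a quick illustration (a standalone sketch, not vLLM's actual detection code), the condition boils down to reading that field from the checkpoint's `quantize_config.json`:

```python
import json

# Read the quantization config shipped with the checkpoint.
with open("quantize_config.json") as f:
    cfg = json.load(f)

# A checkpoint is treated as GPTQv2 format when this field is present and set:
is_gptq_v2 = cfg.get("checkpoint_format") == "gptq_v2"
```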
What is GPTQv2?

GPTQv2 is both a quantization algorithm and a checkpoint format. This PR adds support for the GPTQv2 checkpoint format, not the quantization algorithm. Specifically:

- GPTQModel implements the GPTQv2 quantization algorithm (enabled with `v2=True`).
- GPTQModel can also save checkpoints in the GPTQv2 format (with `format='gptq_v2'`). Checkpoints in GPTQv2 format will show `"checkpoint_format": "gptq_v2"` in `quantize_config.json`.

Also, for more details, please refer to ModelCloud/GPTQModel:
https://github.com/ModelCloud/GPTQModel/blob/020ac04b74f6263f22491e6a6a034cb4fa5bf181/gptqmodel/utils/model.py#L625
Why support GPTQv2 format?
GPTQv2, as a checkpoint format, provides higher accuracy for asymmetrically (especially low-bit) quantized models.

By default, GPTQModel uses the GPTQv2 format internally, but converts to the GPTQv1 format when storing the quantized checkpoint, mainly for compatibility purposes. However, this conversion is not lossless, and it potentially harms model accuracy, especially for low-bit quantized models.
Specifically, the v2 -> v1 conversion requires subtracting 1 from the zero point (which ranges over [0, 2^b]), causing both `zero=0` and `zero=1` to be stored as the same value. For example, in INT2 quantization, this reduces the actual range of zero point values from {0,1,2,3} to {1,2,3} (both are stored within {0,1,2} in GPTQv1 format).
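A minimal numeric sketch of that lossy round trip (standalone Python for illustration, assuming the stored zero point is clamped to the b-bit unsigned range; this is not GPTQModel's actual conversion code):

```python
BITS = 2
MAX_STORED = 2**BITS - 1  # 3 for INT2

def v2_to_v1(zero: int) -> int:
    # GPTQv1 checkpoints store (zero - 1), clamped to the packed b-bit range.
    return max(0, min(zero - 1, MAX_STORED))

def v1_dequant_zero(stored: int) -> int:
    # gptq_gemm adds 1 back at runtime when dequantizing a v1 checkpoint.
    return stored + 1

for zero in range(2**BITS):  # v2 zero points {0, 1, 2, 3}
    stored = v2_to_v1(zero)
    print(f"{zero} -> stored {stored} -> dequantized with zero {v1_dequant_zero(stored)}")

# zero=0 and zero=1 both round-trip to 1, so zero=0 can no longer be represented.
```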
How to support GPTQv2 format?

The only difference between the GPTQv1 and GPTQv2 formats is how they store the zero points. Specifically, GPTQv1 format subtracts 1 from the zero points, and GPTQv2 format does not.

The current `gptq_gemm` kernel restores the original zero points by adding 1 back when dequantizing weights. For example: vllm/csrc/quantization/gptq/q_gemm.cu, line 412 in be22bb6.

To support GPTQv2 format, in the newly added `gptq_gemm_v2` kernel, I simply removed all the `zero + 1` or `-zero - 1` logic and use the zero points as is.
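To make the arithmetic difference concrete, a schematic sketch (plain Python, not the CUDA kernel itself) for a single quantized value `q` with scale `s` and stored zero point `z`:

```python
def dequant_v1(q: int, s: float, z: int) -> float:
    # GPTQv1: the checkpoint stores (zero - 1), so add 1 back before dequantizing.
    return s * (q - (z + 1))

def dequant_v2(q: int, s: float, z: int) -> float:
    # GPTQv2: the checkpoint stores the zero point as is.
    return s * (q - z)
```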
Test Plan

I added/updated 2 tests in the codebase:

- tests/kernels/quantization/test_gptq.py, by adding an opcheck of the newly added `gptq_gemm_v2` kernel.
- tests/quantization/test_gptq_v2.py, to test whether vllm correctly handles GPTQv2 format checkpoints and generates correct outputs.

For this PR specifically, I am testing with the following code:
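(The block below is a minimal sketch of such a test, not necessarily the exact snippet used for the PR, assuming vLLM's offline `LLM` API and the 2-bit GPTQv2 checkpoint benchmarked earlier.)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="BitDistiller/Qwen-8B-w2g64-gptq", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the capital of France?"], params)
text = outputs[0].outputs[0].text
print(text)

# With the GPTQv2 format handled correctly, the output should be coherent text
# rather than the repeated "!!!!" described in #26343.
assert "!!!!" not in text
```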
Test Result
Running the above test code, I get:
(Screenshot of the generated outputs.)
The output now is correct.