[Bug]: GPTQ Marlin with cpu-offload-gb fails on 0.5.4
#7204
Comments
It appears it's broken for quantization in general, even without CPU offload.
https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct
I will look into this, but are you sure you are using 0.5.4? In your logs and collect_env output, it mentions 0.5.3.post1.
Shoot, some of this was with a prerelease wheel. There seem to be two separate issues here:
GPTQ does work by itself. Note that this is on A100s.
Okay, I confirmed that dynamic FP8 works fine on H100 but fails on A100. This is an issue with the dynamic FP8 Marlin backend.
It does work fine with models that are already quantized to FP8 on A100:
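Something along these lines (a sketch using vLLM's offline `LLM` API; the FP8 checkpoint name is an illustrative placeholder, not taken from the report):

```python
from vllm import LLM, SamplingParams

# Checkpoint whose weights are already quantized to FP8 (placeholder name);
# this path loads and runs fine on A100.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8")

# By contrast, on-the-fly ("dynamic") FP8 quantization of a bf16 checkpoint,
# e.g. LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", quantization="fp8"),
# is the path that reportedly fails on A100 via the FP8 Marlin backend.

print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```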
I opened a tracking issue here: #7216
0.5.4
Verified that forcing GPTQ with cpu offload works:
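Roughly like this (a sketch; the GPTQ checkpoint name and offload size are placeholders, not from the original report):

```python
from vllm import LLM

# Force the plain "gptq" backend instead of the auto-selected GPTQ Marlin kernel,
# while offloading part of the weights to CPU memory.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",  # placeholder GPTQ checkpoint
    quantization="gptq",
    cpu_offload_gb=20,  # placeholder offload size in GB
)
```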
The issue is specifically with GPTQ Marlin:
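The same invocation, but with the GPTQ Marlin backend selected, is the failing case (again a sketch with placeholder values):

```python
from vllm import LLM

# Same setup, but explicitly requesting the GPTQ Marlin backend; per this thread,
# this combination with cpu_offload_gb fails on 0.5.4.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",  # placeholder GPTQ checkpoint
    quantization="gptq_marlin",
    cpu_offload_gb=20,  # placeholder offload size in GB
)
```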
@w013nad, ditto for the GPTQ Marlin fix linked above. Thank you very much for reporting these issues, and my apologies for letting them slip through this release. I added explicit tests for both of these cases so they will be caught in automation going forward.
Sorry, I'm not able to build from source. I'm stuck using your nightly PyPI packages or Docker images because this is a closed environment.
Looking forward to seeing this fix released! (I am seeing the same problem.)
Your current environment
🐛 Describe the bug
I'm running vLLM 0.5.4 and was trying to run a GPTQ model with CPU offloading. This should have been fixed by #6960, but it appears it was not.
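A sketch of the kind of invocation described here (the model name and offload amount are assumptions; vLLM auto-selects the GPTQ Marlin backend for GPTQ checkpoints by default):

```python
from vllm import LLM

# GPTQ checkpoint with CPU offload; on 0.5.4 the auto-selected GPTQ Marlin backend
# is reported to fail in this configuration.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",  # placeholder GPTQ checkpoint
    cpu_offload_gb=20,  # placeholder offload size in GB
)
```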