Add GPTQ support #916
Conversation
Great to see! Someone ping me when it's merged and I'll mention support in my GPTQ readmes.
@chu-tianxiang Thanks for your efforts! 👍 It would be better to add a use_safetensors arg in the API endpoint files. 😄 I tried this PR; everything works fine for me.
From my testing with offline inference of Open-Orca/OpenOrca-Platypus2-13B vs. the quantized version, quantized is about 3% faster. The test was a batch of 500 prompts with max_tokens=512. The biggest difference is that I can now run inference on both of my 3090s individually, so it's effectively twice as fast.
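For reference, a rough sketch of how such an offline throughput comparison can be run with vLLM's offline API (the prompts here are placeholders, not the actual test data):

```python
from vllm import LLM, SamplingParams

# Placeholder prompts; the test above used a batch of 500 prompts.
prompts = ["Summarize the plot of a novel about space exploration."] * 500
sampling_params = SamplingParams(max_tokens=512)

# Swap the model path between the fp16 and the GPTQ checkpoint to compare runs.
llm = LLM(model="Open-Orca/OpenOrca-Platypus2-13B")
outputs = llm.generate(prompts, sampling_params)
print(f"Generated {len(outputs)} completions")
```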
@chu-tianxiang Does this support the latest Falcon-180B-GPTQ? 🙃
Can you merge this with the latest main branch commit? There is some performance testing I want to do using code-llama, support for which was just added in main.
I just merged the main branch. Safetensors is enabled by default in the main branch now, so there should be no need for a separate use_safetensors arg. @esmeetu: I haven't tested Falcon-180B-GPTQ yet. I'll give it a try.
Thanks for this. A noob question: I tried to install from pip, but it seems broken and didn't install. Can you tell me how I can install from your branch?
I'm not sure what caused the error; this branch can be installed from source with pip.
Besides, you need to install huggingface/optimum and AutoGPTQ to use GPTQ models. There was a bug in AutoGPTQ that was not fixed until recently, so it probably has to be installed from source too.
Hi @chu-tianxiang, thanks a million for the PR. We are in the process of merging the AWQ PR (#1032), which includes a common interface for different quantization methods. We will get back to this PR after that.
An initial implementation of GPTQ compatible with the AWQ interface has been added to the gptq_compat branch. Only LLaMA 4-bit quantization is supported. Following the practice of AWQ, the kernels are copied into vLLM, so there's no dependency on AutoGPTQ. However, AutoGPTQ implements many different kernels that are imported dynamically; for simplicity, only the default 4-bit kernel code is copied.
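A minimal usage sketch, assuming the interface mirrors AWQ's quantization argument (the model name is only an example):

```python
from vllm import LLM, SamplingParams

# 4-bit LLaMA GPTQ checkpoint; only LLaMA is supported on the gptq_compat branch.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-GPTQ", quantization="gptq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```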
Now that AWQ is in, will this PR get merged? Unfortunately, I cannot use AWQ due to its Ampere requirement. Would love to use GPTQ though!
Until it is integrated into vLLM, you can use TGI; it supports GPTQ.
force-pushed from c77bfb5 to 612d7b1
Thanks for your great work! I did some experiments on a bigcode model (starcoder); the generation speed is almost twice as slow as gptq_hf_old. Have you done similar experiments?
Thanks for your experiment. It turns out to be related to the fact that vLLM by default pads the sequence length to a multiple of 8 here, which is exactly the kernel-switch threshold for GPTQ. I disabled the padding in the old branch but forgot to do the same in the new one. It seems the padding has a slight negative effect on AWQ as well. I'll do more tests and fix it later.
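A toy illustration (not vLLM code) of why that padding matters: rounding up to a multiple of 8 pushes short inputs right onto the kernel-switch threshold.

```python
def pad_to_multiple(length: int, multiple: int = 8) -> int:
    # Round a sequence length up to the next multiple of `multiple`.
    return (length + multiple - 1) // multiple * multiple

# A length-1 input gets padded to 8, exactly the threshold where the
# GPTQ kernel switches to its large-batch code path.
assert pad_to_multiple(1) == 8
assert pad_to_multiple(9) == 16
```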
Hi @chu-tianxiang, please consider adding 'gptq' as a possible choice in the quantization argument.
@chu-tianxiang When I use a GPTQ model, there is no throughput difference between batch size 1 and 2.
Another error I found is:
vllm/model_executor/models/opt.py (Outdated)
@@ -266,10 +286,11 @@ def forward(
 class OPTForCausalLM(nn.Module):
-    def __init__(self, config):
+    def __init__(self, config, quant_config):
quant_config should be optional, or have a default value of None.
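A minimal sketch of the suggested change (hypothetical, not the final code in this PR):

```python
from torch import nn

class OPTForCausalLM(nn.Module):
    def __init__(self, config, quant_config=None):
        # Defaulting quant_config to None keeps unquantized models working
        # for callers that construct the model without a quantization config.
        super().__init__()
        self.config = config
        self.quant_config = quant_config
```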
@pweglik Thank you very much! The issue with loading the model that you mentioned should have been fixed earlier. Regarding the other two problems, I have added gptq to the quantization argument and fixed the bug with the OPT model. You're more than welcome to create a PR for any additional problems. @esmeetu I'm not sure what you mean by the throughput at batch size 2; the performance of GPTQ kernels is more likely to be affected by batch size than that of regular fp16 models.
While refactoring the GPTQ kernel code, I benchmarked most currently available kernels for 4-bit GEMM. The results are below. Some kernels don't natively support act-order models, so I insert an extra reorder operation for them. Benchmarked methods include exllamav2 (exllamav2 switches to dequant & matmul when the batch size is above a threshold; that behavior is disabled here), gpt-fast, GPTQ act-order (with a slight modification of the block number), GPTQ sequential, GPTQ triton, AWQ GEMM, AWQ GEMV, and dequant & matmul. When the batch size is small, exllamav2 is the fastest, while gpt-fast is competitive enough (especially considering an extra reorder operation is inserted). When the batch is large enough, simple dequant & matmul is such a strong baseline that it beats all the other custom kernels. While the kernels differ a lot in implementation details (say, some use tensor cores while some don't), AWQ GEMM and GPTQ triton launch far fewer blocks and do more computation per thread; I think that may be part of the reason their performance is better at large batch sizes.
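For context, a minimal timing-harness sketch (not the actual benchmark script) for this kind of comparison; the fp16 matmul below plays the role of the dequant & matmul baseline, and any quantized kernel could be passed in the same way:

```python
import torch

def benchmark(matmul_fn, batch_sizes=(1, 8, 64, 512), in_dim=4096, iters=100):
    """Return average milliseconds per call for each batch size."""
    results = {}
    for bs in batch_sizes:
        x = torch.randn(bs, in_dim, dtype=torch.float16, device="cuda")
        for _ in range(10):  # warm-up so startup cost doesn't skew the timings
            matmul_fn(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            matmul_fn(x)
        end.record()
        torch.cuda.synchronize()
        results[bs] = start.elapsed_time(end) / iters
    return results

# fp16 matmul as the baseline; a quantized kernel would be wrapped the same way.
weight = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
print(benchmark(lambda x: x @ weight))
```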
This MR doesn't support 8 bits? I got the following log:
@chu-tianxiang LGTM! Thanks a million for this great work, and apologies for the very delayed reviews. The GPTQ support would not have been possible without your continuous updates on the PR. Also, thanks for the clean and high-quality code; we really appreciate it. Thanks again for this amazing work!
name = name.replace(weight_name, param_name)
# Skip loading extra bias for GPTQ models.
if name.endswith(".bias") and name not in params_dict:
    continue
param = params_dict[name]
While it doesn't need to be addressed right now, can we somehow factor out this part? This seems error-prone.
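One possible way to factor it out, as a rough sketch (the helper name is hypothetical):

```python
def map_gptq_param(params_dict, loaded_name, weight_name, param_name):
    """Map a checkpoint weight name to a model parameter, returning None for
    extra biases that GPTQ checkpoints contain but the model does not."""
    name = loaded_name.replace(weight_name, param_name)
    if name.endswith(".bias") and name not in params_dict:
        return None
    return params_dict[name]
```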
Hey, I just noticed the most recent push fails on ROCm (Line 222 in b81a6a6).
Can you add this line to setup.py along with the AWQ quantization, like so (Line 228 in b81a6a6)?
If this feature is supposed to support ROCm, can you also check it there, as hipify is not able to handle #include <hipblas.h>? Thank you.
@hex-plex Thanks for reporting the issue! We will temporarily set it as a CUDA-only feature and add ROCm back once it is tested.
Hi, @chu-tianxiang. I have a performance-related question. The speed is faster with group_size = -1 than with group_size = 32. Model (group size = 32): https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GPTQ/tree/gptq-4bit-32g-actorder_True. The group_size = -1 model is about 2x faster than the group_size = 32 one.
Good job
Encountering the same. Did you find any solution to this?
Add GPTQ support to vllm.
Update (09/25): the previous implementation is hardly compatible with the recently introduced AWQ interface, so I moved it to the gptq_hf_old branch and reimplemented most parts. Kernel files are copied from AutoGPTQ, which is itself modified from exllama-v1.
Unfortunately, the new implementation only supports 4-bit quantization. If you want to use a 3-bit model, please refer to the old branch.
Note:
Similar to huggingface TGI, a separate kernel is now used when desc_act=True and group_size != -1 and world_size > 1. (Previously, row parallel layers weren't partitioned in that case.) Honestly, I haven't found an elegant way to deal with act order without giving up exllama's reorder trick for acceleration.
For the old branch:
Models can be loaded in the same way as normal.
Currently tested models:
- TheBloke/Llama-2-7b-Chat-GPTQ, TheBloke/Llama-2-13b-Chat-GPTQ, TheBloke/Llama-2-70b-Chat-GPTQ
- Qwen/Qwen-7B-Chat-Int4
- TheBloke/baichuan-7B-GPTQ
- mlabonne/gpt2-GPTQ-4bit
- PanEa/dolly-v2-gptj-enhanced-auto-gptq
- TheBloke/stablecode-instruct-alpha-3b-GPTQ
- TheBloke/BLOOMChat-176B-v1-GPTQ
- TheBloke/falcon-7b-instruct-GPTQ, TheBloke/Falcon-180B-Chat-GPTQ
- cczhong/internlm-chat-7b-4bit-gptq
- TheBloke/starcoderplus-GPTQ
- casperhansen/mpt-7b-8k-chat-gptq
There are various configurations for GPTQ, and the code has not been thoroughly tested yet.