How to do Model Quantization? #9
It should work. If something went wrong, feel free to post here.
It seems that quantizing part of the model will show an error like this during inference:
Yeah, I got it. This is a bug in cpm_kernels, which we actually cannot control... You can avoid this by changing
But
Emm... Could you post more detailed error info?
This is all the error info from my terminal:
$ python web_demo.py --from_pretrained cogvlm-chat --version chat --english --fp16 --quant 8
[2023-10-11 10:58:42,487] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-11 10:58:47,433] [INFO] building CogVLMModel model ...
[2023-10-11 10:58:47,437] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-11 10:58:47,438] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-11 10:59:00,976] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376
[2023-10-11 10:59:08,248] [INFO] [RANK 0] global rank 0 is loading checkpoint /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:53,805] [INFO] [RANK 0] > successfully loaded /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:55,647] [INFO] [RANK 0] > Quantizing model weight to 8 bits
[2023-10-11 11:00:55,699] [INFO] [RANK 0] > Quantized 5033164800 parameters in total.
web_demo.py:168: GradioDeprecationWarning: 'scale' value should be an integer. Using 4.5 will cause issues.
with gr.Column(scale=4.5):
web_demo.py:182: GradioDeprecationWarning: 'scale' value should be an integer. Using 5.5 will cause issues.
with gr.Column(scale=5.5):
web_demo.py:183: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
result_text = gr.components.Chatbot(label='Multi-round conversation History', value=[("", "Hi, What do you want to know about this image?")]).style(height=550)
3.47.1
3.47.1
Running on local URL: http://0.0.0.0:8080
To create a public link, set `share=True` in `launch()`.
history []
error message 'NoneType' object has no attribute 'read'
history []
Floating point exception (core dumped)
I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.
It works on two 3090s with model parallelism.
So may I just remove the quantization-related code right now? Or should I wait for some progress?
@aisensiy Try adding a line of code to web_demo.py:
Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.
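For anyone who wants to try that route, here is a minimal, untested sketch (not code from this repo) of the common bitsandbytes pattern for swapping an fp16 nn.Linear for an 8-bit layer; the helper name and threshold value are illustrative, and it is not guaranteed to work for this model.

```python
# Illustrative sketch only -- not part of CogVLM. Assumes bitsandbytes is
# installed and the model's linear layers already hold fp16 weights.
import torch.nn as nn
import bitsandbytes as bnb

def to_int8_linear(layer: nn.Linear) -> bnb.nn.Linear8bitLt:
    """Replace a single nn.Linear with a bitsandbytes 8-bit linear layer."""
    qlayer = bnb.nn.Linear8bitLt(
        layer.in_features,
        layer.out_features,
        bias=layer.bias is not None,
        has_fp16_weights=False,  # store weights as int8 for inference
        threshold=6.0,           # outlier threshold used by LLM.int8()
    )
    qlayer.weight = bnb.nn.Int8Params(
        layer.weight.data, requires_grad=False, has_fp16_weights=False
    )
    if layer.bias is not None:
        qlayer.bias = layer.bias
    return qlayer.cuda()  # moving to CUDA triggers the actual int8 conversion
```

You would then need to walk the model and swap each eligible nn.Linear; whether this works end to end for CogVLM runs into the same kernel-coverage questions discussed above.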
You mean quantizing the full model is possible? (I know little about this stuff)
Yes, but it depends on CUDA kernel support. cpm_kernels is missing some implementations for bf16 and for slicing fp16. I'm not sure whether bitsandbytes works. Theoretically, quantizing everything is possible, but in practice some packages may have bugs.
How do I set it?
or
It is caused by the PyTorch version.
While I wait for quantization support, I would like to try bitsandbytes.
Interestingly, when using has_fp16_weights=False, not only does the caption quality deteriorate a lot, but the time taken to caption images also increases. has_fp16_weights=True takes almost the same time as a normal nn.Linear layer.
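For reference, a rough way to reproduce that timing comparison on a single layer (an illustrative sketch with made-up sizes, not a benchmark of the model itself):

```python
# Illustrative micro-benchmark of the two has_fp16_weights modes; the layer
# size, batch size, and iteration count are arbitrary.
import time
import torch
import bitsandbytes as bnb

def time_layer(layer, x, iters=50):
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(iters):
            layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

x = torch.randn(16, 4096, dtype=torch.float16, device="cuda")
for flag in (True, False):
    layer = bnb.nn.Linear8bitLt(
        4096, 4096, bias=False, has_fp16_weights=flag, threshold=6.0
    ).half().cuda()
    print(f"has_fp16_weights={flag}: {time_layer(layer, x) * 1000:.3f} ms/call")
```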
We now support 4-bit quantization! See the README for more details.
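If the command-line interface matches the earlier attempt in this thread, the invocation would presumably just change the flag value, e.g. (assumed here, not copied from the README):
$ python web_demo.py --from_pretrained cogvlm-chat --version chat --english --fp16 --quant 4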
Is this OK?