
How to do Model Quantization? #9

Closed
miandai opened this issue Oct 11, 2023 · 20 comments

Comments

@miandai

miandai commented Oct 11, 2023

  1. update web_demo.py:CogVLMModel.from_pretrained(...,device=f'cpu',...)
  2. python web_demo.py --version chat --english --quant 4

Is this OK?

@1049451037
Member

It should work. If something goes wrong, feel free to post here.

@aisensiy
Contributor

It seems that quantizing part of the model produces an error like this during inference:

error message expected scalar type BFloat16 but found Half
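
For reference, the same class of error can be reproduced in a few lines of plain PyTorch, just to illustrate the dtype mismatch (this is only an illustration, not the repo's code path):

import torch

a = torch.randn(4, 4, dtype=torch.bfloat16)
b = torch.randn(4, 4, dtype=torch.float16)
# Mixing bf16 and fp16 tensors in one op raises a RuntimeError along the lines of
# "expected scalar type BFloat16 but found Half"
torch.matmul(a, b)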

@1049451037
Member

Yeah, I got it. This is a bug in cpm_kernels, which we cannot control... You can avoid it by changing --bf16 to --fp16 when running the code.

@aisensiy
Contributor

But --fp16 shows another error message and the process crashes:

Floating point exception (core dumped)

@1049451037
Member

emm... Could you post more detailed error info?

@aisensiy
Contributor

This is all the error info from my terminal:

$ python web_demo.py --from_pretrained cogvlm-chat --version chat --english --fp16 --quant 8

[2023-10-11 10:58:42,487] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-11 10:58:47,433] [INFO] building CogVLMModel model ...
[2023-10-11 10:58:47,437] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-11 10:58:47,438] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-11 10:59:00,976] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 17639685376
[2023-10-11 10:59:08,248] [INFO] [RANK 0] global rank 0 is loading checkpoint /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:53,805] [INFO] [RANK 0] > successfully loaded /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:55,647] [INFO] [RANK 0] > Quantizing model weight to 8 bits
[2023-10-11 11:00:55,699] [INFO] [RANK 0] > Quantized 5033164800 parameters in total.
web_demo.py:168: GradioDeprecationWarning: 'scale' value should be an integer. Using 4.5 will cause issues.
  with gr.Column(scale=4.5):
web_demo.py:182: GradioDeprecationWarning: 'scale' value should be an integer. Using 5.5 will cause issues.
  with gr.Column(scale=5.5):
web_demo.py:183: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  result_text = gr.components.Chatbot(label='Multi-round conversation History', value=[("", "Hi, What do you want to know about this image?")]).style(height=550)
3.47.1
3.47.1
Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
history []
error message 'NoneType' object has no attribute 'read'
history []
Floating point exception (core dumped)

@1049451037
Member

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

@aisensiy
Contributor

OK, and quantizing only the LM part doesn't seem to shrink memory usage much... so that is acceptable...

Here is a screenshot using bf16:

[screenshot of GPU memory usage with bf16]

This is really huge memory usage... is it possible to make it work on a 4090 in the future?

@1049451037
Member

It works on two 3090s with model parallelism.

@aisensiy
Contributor

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

So may I just remove the quantization-related code for now, or wait for some progress?

@miandai
Author

miandai commented Oct 12, 2023

@aisensiy Try adding a line of code to web_demo.py:

[two screenshots showing the suggested change to web_demo.py]

@1049451037
Member

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

@aisensiy
Contributor

aisensiy commented Oct 12, 2023

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

You mean quantizing the full model is possible? (I know little about this stuff)

@1049451037
Member

Yes, but it depends on CUDA kernel support. cpm_kernels is missing some implementations for bf16 and for slicing fp16, and I'm not sure whether bitsandbytes works.

Theoretically, quantizing everything is possible. But in practice, some packages may have bugs.
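
In case someone wants to try, here is a minimal sketch of the bitsandbytes route (an assumption of how it could be wired up, not an official recipe from this repo): recursively swap each nn.Linear for an 8-bit layer before loading the checkpoint, then move the model to the GPU.

import bitsandbytes as bnb
import torch.nn as nn

def replace_linear_with_8bit(module: nn.Module, threshold: float = 6.0) -> nn.Module:
    # Walk the module tree and swap plain Linear layers for Linear8bitLt.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,  # keep int8 weights for inference
                threshold=threshold,     # outlier threshold used by LLM.int8()
            ))
        else:
            replace_linear_with_8bit(child, threshold)
    return module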

@Blankit

Blankit commented Oct 16, 2023

It works on two 3090s with model parallelism.

How do I set that up?

@1049451037
Member

torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

@Blankit

Blankit commented Oct 16, 2023

torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

@Blankit

Blankit commented Oct 20, 2023

torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

It was caused by the PyTorch version.
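
Older torchrun releases only accept the underscore spelling of that flag, so with an older PyTorch the same command should look something like:

torchrun --standalone --nnodes=1 --nproc_per_node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16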

@rahimentezari

rahimentezari commented Nov 2, 2023

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

While I wait for quantization support, I would like to try bitsandbytes.
Is this correct, based on what I read from bitsandbytes?
In cogvlm_model.py, change the GLU class linear layers to:

import bitsandbytes as bnb
import torch.nn as nn

class GLU(nn.Module):
    def __init__(self, args, in_features):
        super().__init__()
        # Original full-precision layers:
        # self.linear_proj = nn.Linear(in_features, args.hidden_size, bias=False)
        # self.norm1 = nn.LayerNorm(args.hidden_size)
        # self.act1 = nn.GELU()
        # self.act2 = nn.functional.silu
        # self.dense_h_to_4h = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.gate_proj = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.dense_4h_to_h = nn.Linear(args.inner_hidden_size, args.hidden_size, bias=False)

        # Linear layers replaced with bitsandbytes 8-bit layers (LLM.int8(), outlier threshold 6.0):
        self.linear_proj = bnb.nn.Linear8bitLt(in_features, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.norm1 = nn.LayerNorm(args.hidden_size)
        self.act1 = nn.GELU()
        self.act2 = nn.functional.silu
        self.dense_h_to_4h = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.gate_proj = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.dense_4h_to_h = bnb.nn.Linear8bitLt(args.inner_hidden_size, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)

    def forward(self, x):
        x = self.linear_proj(x)
        x = self.act1(self.norm1(x))
        # SwiGLU-style gating: silu(gate_proj(x)) * dense_h_to_4h(x)
        x = self.act2(self.gate_proj(x)) * self.dense_h_to_4h(x)
        x = self.dense_4h_to_h(x)
        return x

Interestingly, when using has_fp16_weights=False, not only does the caption quality deteriorate a lot, but the time taken to caption images also increases. With has_fp16_weights=True it takes almost the same time as the normal nn.Linear layers.
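
One more note (an assumption about bitsandbytes behaviour, not something specific to this repo): with has_fp16_weights=False, Linear8bitLt only quantizes its weights to int8 when the module is moved to a CUDA device, so the usual order is to build the model, load the fp16 checkpoint, and then call .cuda(), for example:

# Hypothetical usage sketch; `args`, `in_features` and `fp16_state_dict`
# stand in for whatever the demo script already provides.
glu = GLU(args, in_features)
glu.load_state_dict(fp16_state_dict)
glu = glu.half().cuda()  # int8 quantization of the bnb layers happens here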

@1049451037
Member

1049451037 commented Dec 7, 2023

We now support 4-bit quantization! See the README for more details.
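
For example (a guess based on the flags used earlier in this thread; the README is the authoritative reference):

python cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16 --quant 4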
