
How to do Model Quantization? #9

Closed
miandai opened this issue Oct 11, 2023 · 20 comments

Comments

@miandai

miandai commented Oct 11, 2023

  1. update web_demo.py:CogVLMModel.from_pretrained(...,device=f'cpu',...)
  2. python web_demo.py --version chat --english --quant 4

Is this OK?

@1049451037
Member

It should work. If something goes wrong, feel free to post here.

@aisensiy
Contributor

It seems that quantizing part of the model produces an error like this during inference:

error message expected scalar type BFloat16 but found Half
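
For reference, the same class of error can be reproduced in a few lines of plain PyTorch, just to illustrate the dtype mismatch (this is only an illustration, not the repo's code path):

import torch

a = torch.randn(4, 4, dtype=torch.bfloat16)
b = torch.randn(4, 4, dtype=torch.float16)
# Mixing bf16 and fp16 tensors in one op raises a RuntimeError along the lines of
# "expected scalar type BFloat16 but found Half"
torch.matmul(a, b)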

@1049451037
Member

Yeah, I got it. This is a bug in cpm_kernels, which we cannot control... You can avoid it by changing --bf16 to --fp16 when running the code.

@aisensiy
Contributor

But --fp16 shows another error message and the process crashes:

Floating point exception (core dumped)

@1049451037
Member

emm... Could you post more detailed error info?

@aisensiy
Contributor

This is all the error info from my terminal:

$ python web_demo.py --from_pretrained cogvlm-chat --version chat --english --fp16 --quant 8

[2023-10-11 10:58:42,487] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-11 10:58:47,433] [INFO] building CogVLMModel model ...
[2023-10-11 10:58:47,437] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-11 10:58:47,438] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-11 10:59:00,976] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 17639685376
[2023-10-11 10:59:08,248] [INFO] [RANK 0] global rank 0 is loading checkpoint /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:53,805] [INFO] [RANK 0] > successfully loaded /output/sat/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-11 11:00:55,647] [INFO] [RANK 0] > Quantizing model weight to 8 bits
[2023-10-11 11:00:55,699] [INFO] [RANK 0] > Quantized 5033164800 parameters in total.
web_demo.py:168: GradioDeprecationWarning: 'scale' value should be an integer. Using 4.5 will cause issues.
  with gr.Column(scale=4.5):
web_demo.py:182: GradioDeprecationWarning: 'scale' value should be an integer. Using 5.5 will cause issues.
  with gr.Column(scale=5.5):
web_demo.py:183: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  result_text = gr.components.Chatbot(label='Multi-round conversation History', value=[("", "Hi, What do you want to know about this image?")]).style(height=550)
3.47.1
3.47.1
Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
history []
error message 'NoneType' object has no attribute 'read'
history []
Floating point exception (core dumped)

@1049451037
Member

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

@aisensiy
Contributor

OK, and quantizing only the LM part doesn't seem to shrink memory usage much... so that is acceptable...

Here is a screenshot using bf16:

[screenshot of GPU memory usage with bf16]

This is really huge memory usage... is it possible to make it work on a 4090 in the future?

@1049451037
Member

It works on two 3090s with model parallelism.

@aisensiy
Contributor

I tested... It's also because of cpm_kernels. They do not support some operations in our model... Therefore, quantization with cpm_kernels is not supported for now.

So may I just remove the quantization-related code for now, or wait for some progress?

@miandai
Author

miandai commented Oct 12, 2023

@aisensiy Try adding a line of code to web_demo.py:

[two screenshots showing the suggested change to web_demo.py]

@1049451037
Member

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

@aisensiy
Contributor

aisensiy commented Oct 12, 2023

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

You mean quantizing the full model is possible? (I know little about this stuff)

@1049451037
Member

Yes, but it depends on CUDA kernel support. cpm_kernels is missing some implementations for bf16 and for slicing fp16, and I'm not sure whether bitsandbytes works.

Theoretically, quantizing everything is possible. But in practice, some packages may have bugs.
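
In case someone wants to try, here is a minimal sketch of the bitsandbytes route (an assumption of how it could be wired up, not an official recipe from this repo): recursively swap each nn.Linear for an 8-bit layer before loading the checkpoint, then move the model to the GPU.

import bitsandbytes as bnb
import torch.nn as nn

def replace_linear_with_8bit(module: nn.Module, threshold: float = 6.0) -> nn.Module:
    # Walk the module tree and swap plain Linear layers for Linear8bitLt.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,  # keep int8 weights for inference
                threshold=threshold,     # outlier threshold used by LLM.int8()
            ))
        else:
            replace_linear_with_8bit(child, threshold)
    return module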

@Blankit

Blankit commented Oct 16, 2023

It works on two 3090s with model parallelism.

How do I set that up?

@1049451037
Member

torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

@Blankit

Blankit commented Oct 16, 2023

torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

@Blankit

Blankit commented Oct 20, 2023

torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

or

torchrun --standalone --nnodes=1 --nproc-per-node=2 web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16

torchrun: error: unrecognized arguments: --nproc-per-node=2

It was caused by the PyTorch version.
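
Older torchrun releases only accept the underscore spelling of that flag, so with an older PyTorch the same command should look something like:

torchrun --standalone --nnodes=1 --nproc_per_node=2 cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16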

@rahimentezari

rahimentezari commented Nov 2, 2023

Yes, I think you can remove the quantization-related code in this repo. If quantization is necessary for you, maybe you can try bitsandbytes to quantize our model.

While I wait for quantization support, I would like to try bitsandbytes.
Is this correct, based on what I read from bitsandbytes?
In cogvlm_model.py, change the GLU class linear layers to:

import bitsandbytes as bnb
import torch.nn as nn

class GLU(nn.Module):
    def __init__(self, args, in_features):
        super().__init__()
        # Original full-precision layers:
        # self.linear_proj = nn.Linear(in_features, args.hidden_size, bias=False)
        # self.norm1 = nn.LayerNorm(args.hidden_size)
        # self.act1 = nn.GELU()
        # self.act2 = nn.functional.silu
        # self.dense_h_to_4h = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.gate_proj = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
        # self.dense_4h_to_h = nn.Linear(args.inner_hidden_size, args.hidden_size, bias=False)

        # Linear layers replaced with bitsandbytes 8-bit layers (LLM.int8(), outlier threshold 6.0):
        self.linear_proj = bnb.nn.Linear8bitLt(in_features, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.norm1 = nn.LayerNorm(args.hidden_size)
        self.act1 = nn.GELU()
        self.act2 = nn.functional.silu
        self.dense_h_to_4h = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.gate_proj = bnb.nn.Linear8bitLt(args.hidden_size, args.inner_hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)
        self.dense_4h_to_h = bnb.nn.Linear8bitLt(args.inner_hidden_size, args.hidden_size, bias=False, has_fp16_weights=False, threshold=6.0)

    def forward(self, x):
        x = self.linear_proj(x)
        x = self.act1(self.norm1(x))
        # SwiGLU-style gating: silu(gate_proj(x)) * dense_h_to_4h(x)
        x = self.act2(self.gate_proj(x)) * self.dense_h_to_4h(x)
        x = self.dense_4h_to_h(x)
        return x

Interestingly, when using has_fp16_weights=False, not only does the caption quality deteriorate a lot, but the time taken to caption images also increases. With has_fp16_weights=True it takes almost the same time as the normal nn.Linear layers.
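
One more note (an assumption about bitsandbytes behaviour, not something specific to this repo): with has_fp16_weights=False, Linear8bitLt only quantizes its weights to int8 when the module is moved to a CUDA device, so the usual order is to build the model, load the fp16 checkpoint, and then call .cuda(), for example:

# Hypothetical usage sketch; `args`, `in_features` and `fp16_state_dict`
# stand in for whatever the demo script already provides.
glu = GLU(args, in_features)
glu.load_state_dict(fp16_state_dict)
glu = glu.half().cuda()  # int8 quantization of the bnb layers happens here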

@1049451037
Member

1049451037 commented Dec 7, 2023

We now support 4-bit quantization! See the README for more details.
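
For example (a guess based on the flags used earlier in this thread; the README is the authoritative reference):

python cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16 --quant 4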
