
33B llama quantization post-inference time #56

Closed
ACoder-J opened this issue Mar 25, 2024 · 3 comments
ACoder-J commented Mar 25, 2024

Why is inference roughly 2× slower after I quantize than with the original model?

Quantize command:

```shell
python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --offload_activations --wandb --save $SAVE_PATH --dtype float32
```

Quantization results: (screenshot attached in the original issue)

In addition, the GPU usage suggests the model is running as a pipeline: I load the model across four GPUs, but only one GPU is at 100% utilization at any given time.
(GPU utilization screenshot attached in the original issue)

Have you run into this problem as well? Is the inference speed of your quantized models normal?

BlackSamorez (Collaborator) commented:
Hi @ACoder-J !
The inference speed is indeed bottlenecked by transformers, for multiple reasons (transformers itself is slow, and the many small kernel launches take a long time).
One way to achieve a proper speedup is to compile your model into a CUDA graph. This is a feature added to PyTorch in 2.0.0 and refined in 2.2.1. The transformers team has also been working hard to properly support and document it. To see what you can already do with it, please refer to this AQLM demo.
Another way to properly serve AQLM is the vLLM integration. It's not complete yet, but it might already be usable (not sure).


This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Apr 25, 2024

github-actions bot commented May 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 9, 2024