
33B llama quantization post-inference time #56

Closed
ACoder-J opened this issue Mar 25, 2024 · 3 comments
ACoder-J commented Mar 25, 2024

Why is inference roughly 2× slower after I quantize than with the original model?

Quantize command:

```shell
python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --offload_activations --wandb --save $SAVE_PATH --dtype float32
```

Quantization results: (screenshot attached in the original issue)

In addition, the GPU usage suggests the model is running as a pipeline: I load the model across four GPUs, but only one GPU is at 100% utilization at any given time.
(GPU utilization screenshot attached in the original issue)

Have you run into this problem as well? Is the inference speed of your quantized models normal?

BlackSamorez (Collaborator) commented:
Hi @ACoder-J !
The inference speed is indeed bottlenecked by transformers, for multiple reasons (transformers itself is slow, and the many small kernel launches take a long time).
One way to achieve a proper speedup is to compile your model into a CUDA graph. This is a feature added to PyTorch in 2.0.0 and refined in 2.2.1. The transformers team has also been working hard to properly support and document it. To see what you can already do with it, please refer to this AQLM demo.
Another way to properly serve AQLM is the vLLM integration. It's not complete yet, but it might already be usable (not sure).


This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Apr 25, 2024

github-actions bot commented May 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 9, 2024