Issues while attempting LLaMA-3 Quantization #78
Comments
This was fixed by editing your code to use AutoTokenizer instead of LlamaTokenizer.
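For reference, a minimal sketch of the tokenizer swap described above (assuming the stock transformers API and the public Llama-3 repo id; adjust the path to your own checkpoint):

```python
# Sketch: AutoTokenizer resolves the correct (fast) tokenizer class for Llama-3,
# whereas LlamaTokenizer expects the older SentencePiece-based format and fails.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(type(tokenizer).__name__)  # expected to be a fast tokenizer class
```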
Now it runs out of memory:
Running on a big server fixed that issue.
Documenting my run here: https://github.com/catid/AQLM/blob/main/catid_readme.md
Please suggest any improvements based on your experience.
Not sure if AutoTokenizer is ready for Llama 3 models, because they changed the template and the tokenizer itself, so take a look at https://github.com/meta-llama/llama-recipes to avoid surprises in the final quality of the quantized model.
Model is up here if you want to check it: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft
Using a smaller microbatch_size fixed that.
Hi, @catid. We are also running Llama-3 quantization at the moment. Concerning the tokenizer issue, we have a fix for this and will soon merge it into the main branch. The cause of the issue is that Llama-3 uses FastTokenizer instead of the default one. About the OOM - I guess it is hard to fit microbatch > 1 into 80 GB of VRAM.
@catid In addition, you may reduce memory usage via
Final model here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm
Currently running lm-eval on it.
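In case it helps others reproduce the numbers, here is a rough sketch of such an lm-eval run via the harness's Python API (the task list, batch size, and dtype are illustrative, and the exact API may vary across lm-evaluation-harness versions):

```python
# Sketch of an lm-eval run on the quantized checkpoint. Requires lm-evaluation-harness
# >= 0.4 and the aqlm package so the custom quantized layers can be loaded.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=catid/cat-llama-3-8b-instruct-aqlm,trust_remote_code=True,dtype=float16",
    tasks=["arc_challenge", "winogrande", "gsm8k"],
    batch_size=8,
)
print(results["results"])  # per-task metric dictionary
```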
I'd like to do the 70B model at 3 bits, which seems like it would cost about $5k on runpod. Do you have access to cheaper compute, or are you already doing it, @Godofnothing?
Evaluation results without global fine-tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft
Evaluation results with global fine-tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm
The results are exactly the same, so I think the fine-tuned model did not get saved out properly? I think the weights are the same for both =(
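One way to test the "weights are the same" suspicion is a quick diff of the two hub checkpoints (a sketch only, assuming both load with trust_remote_code and the aqlm package installed, and that enough RAM is available to hold both models):

```python
# Sketch: load both uploaded checkpoints and count tensors that differ.
# If nothing differs, the fine-tuned weights were probably never written out.
import torch
from transformers import AutoModelForCausalLM

noft = AutoModelForCausalLM.from_pretrained(
    "catid/cat-llama-3-8b-instruct-aqlm-noft", trust_remote_code=True, torch_dtype=torch.float16
)
ft = AutoModelForCausalLM.from_pretrained(
    "catid/cat-llama-3-8b-instruct-aqlm", trust_remote_code=True, torch_dtype=torch.float16
)

sd_a, sd_b = noft.state_dict(), ft.state_dict()
diffs = [k for k in sd_a if k in sd_b and not torch.equal(sd_a[k], sd_b[k])]
print(f"{len(diffs)} tensors differ out of {len(sd_a)}")
```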
Honestly a bit disappointed in the performance of AQLM. It loses about 2.5% accuracy on arc_challenge, and the inference is very slow. Worried the 70B version will be disappointing as well.
@catid Unfortunately, 2-bit quantization at the moment doesn't offer
We have just published quantized
We are currently running 70B models anyway; I would suggest you not spend your money. Concerning the slow inference - did you run the conversion script?
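If helpful for checking the slow-inference point, here is a rough timing sketch (an illustration only, assuming `pip install aqlm[gpu]`, a CUDA GPU, and the checkpoint linked earlier; the repo id and prompt are placeholders):

```python
# Sketch: time greedy generation on the quantized checkpoint. Very low tokens/s
# usually suggests the optimized AQLM kernels are not being picked up.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "catid/cat-llama-3-8b-instruct-aqlm"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.time() - start):.1f} tokens/s")
```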
Cool, I hope you do a 3-bit version so that we can do longer context or speculative decoding with the 8B model.
Ah, mine is bigger because I did a 4-bit quant, not a 2-bit quant. IMHO 2 bits is too small for an 8B model.
Oof, your GSM8k (8-shot) is really bad: 74% vs 34%. Maybe something is broken in the quantization or your scripts?
Maybe we should fix that before spending the $$ on a 70B model.
@catid We observed earlier on Llama-2 that the quality on GSM8k drops much more dramatically compared to other tasks. Perplexity evaluation, hellaswag, winogrande, arc-easy/challenge, and similar tasks provide an overly optimistic estimate of the model's performance. We have measurements on the 1x15 AQLM quant and 2-bit QuIP# Llama-2-7b, and the drops in either case are quite severe. Specifically, fp16 llama-2-7b has ~14.7% accuracy on GSM8k, whereas AQLM and QuIP# (the finetuned version) yield ~6.2% and 5.4%, respectively. Our conjecture is that the calibration set used to find the optimal model configuration doesn't involve math, therefore this task is in some sense OOD for the resulting model.
@catid The 2x16 kernel at the moment is not as efficient as the 1x16 and 2x8 ones, unfortunately, so I would not recommend using it.
@catid, we added more evaluations for our 1x16 model on the hub. The drops on the five 0-shots are significant, but not catastrophic.
Checking to see if this repo works for the new L3 models. Running this script:
I see: