Issues while attempting LLaMA-3 Quantization #78

Closed

catid opened this issue Apr 20, 2024 · 22 comments

catid commented Apr 20, 2024

Checking to see if this repo works for the new L3 models. Running this script:

export CUDA_VISIBLE_DEVICES=0,1   # or e.g. 0,1,2,3
export MODEL_PATH=/home/catid/models/Meta-Llama-3-8B-Instruct
export DATASET_PATH=pajama
export SAVE_PATH=/home/catid/models/cat-llama-3-8b-instruct-aqlm
export WANDB_PROJECT=aqlm
export WANDB_NAME=aqlm8

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH

I see:

============ Load model... ============
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.65it/s]
Loading pretrained model ...
Model loaded sucсessfully ...

============ Quantizing model... ============
Loading data ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 41, in quantize_model
    data = get_loaders(
  File "/home/catid/sources/AQLM/src/datautils.py", line 226, in get_loaders
    tokenizer = LlamaTokenizer.from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

catid commented Apr 20, 2024

This was fixed by editing your code to use AutoTokenizer instead of LlamaTokenizer.
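
For reference, a minimal sketch of that change (the exact keyword arguments in src/datautils.py may differ; the path below is the one from the script above):

from transformers import AutoTokenizer

# AutoTokenizer resolves to Llama-3's PreTrainedTokenizerFast instead of the
# SentencePiece-based LlamaTokenizer, which is what raised "TypeError: not a string".
model_path = "/home/catid/models/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)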


catid commented Apr 20, 2024

Now it runs out of memory:

Loaded data from pajama; len(data)=1024 sequences

Starting AQ quantization ...
catching layer inputs from data
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 59, in quantize_model
    results = quantize_aq(model, train_data, val_data, args)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/catid/sources/AQLM/main.py", line 169, in quantize_aq
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
  File "/home/catid/sources/AQLM/main.py", line 169, in <listcomp>
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


catid commented Apr 20, 2024

Running on a big server fixed that issue


catid commented Apr 20, 2024

Documenting my run here: https://github.com/catid/AQLM/blob/main/catid_readme.md

Please suggest any improvements based on your experience

Mayorc1978 commented

Not sure if AutoTokenizer is ready for Llama 3 models, since they changed both the chat template and the tokenizer itself. Take a look at https://github.com/meta-llama/llama-recipes to avoid surprises in the final quality of the quantized model.
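
One quick way to check what the tokenizer resolves to and whether the Llama-3 chat template is picked up (a hedged sketch using the model path from earlier in the thread):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/home/catid/models/Meta-Llama-3-8B-Instruct")
print(type(tok))  # expect PreTrainedTokenizerFast for Llama 3
# Render the chat template without tokenizing to eyeball the special tokens.
print(tok.apply_chat_template([{"role": "user", "content": "Hello"}], tokenize=False))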


catid commented Apr 21, 2024

Model is up here if you want to check https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft


catid commented Apr 21, 2024

Trying to figure out how to get the finetune script to work. Currently it runs out of memory on the big server.
[screenshot of the out-of-memory error during fine-tuning]


catid commented Apr 21, 2024

Using a smaller microbatch_size fixed that
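
For anyone hitting the same wall: shrinking the microbatch is plain gradient accumulation. Below is a generic, self-contained illustration (not the AQLM fine-tune script) of why peak memory drops while the effective batch size stays the same:

import torch
from torch import nn

# Toy stand-in for the model; the point is the accumulation pattern, not the model.
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

effective_batch, microbatch = 32, 4          # e.g. finetune batch size vs. microbatch size
accum_steps = effective_batch // microbatch

data = torch.randn(effective_batch, 16)
target = torch.randn(effective_batch, 1)

opt.zero_grad()
for i in range(accum_steps):
    xb = data[i * microbatch:(i + 1) * microbatch]
    yb = target[i * microbatch:(i + 1) * microbatch]
    # Scale the loss so the accumulated gradients match a single full-batch step.
    loss = nn.functional.mse_loss(model(xb), yb) / accum_steps
    loss.backward()
opt.step()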


Godofnothing commented Apr 21, 2024

Hi, @catid. We are also running Llama-3 quantization at the moment.

Concerning the tokenizer issue, we have a fix and will soon merge it into the main branch. The cause of the issue is that Llama-3 uses a FastTokenizer instead of the default one.

About the OOM: I guess it is hard to fit a microbatch > 1 into 80 GB of VRAM.

Godofnothing (Collaborator) commented

@catid In addition, you may reduce memory usage via --finetune_dtype=bfloat16.
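
For example, a trimmed version of the command from the top of the thread with that flag added (several flags omitted here for brevity):

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --finetune_batch_size=32 \
 --finetune_dtype=bfloat16 \
 --offload_activations \
 --save $SAVE_PATH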


catid commented Apr 21, 2024

Final model here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm

Currently running lm-eval stuff on it
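
For context, a typical lm-evaluation-harness invocation for a checkpoint like this might look as follows (the task list and batch size are illustrative, and the AQLM CUDA kernels need `pip install aqlm[gpu]` in the same environment):

lm_eval --model hf \
  --model_args pretrained=catid/cat-llama-3-8b-instruct-aqlm,trust_remote_code=True,dtype=float16 \
  --tasks arc_challenge,winogrande,hellaswag \
  --batch_size 4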


catid commented Apr 21, 2024

I'd like to do the 70B model at 3 bits, which seems like it would cost about $5k on RunPod. Do you have access to cheaper compute, or are you already doing it, @Godofnothing?


catid commented Apr 21, 2024

Evaluation results without global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft

Evaluation results with global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm

The results are exactly the same, so I think the fine-tuned model did not get saved out properly? It looks like the weights are identical for both =(
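
A quick way to check whether the two uploaded checkpoints actually differ is to hash their weight files. A sketch, assuming local clones of the two repos in the working directory (paths are hypothetical):

import hashlib
from pathlib import Path

def digest_weights(repo_dir: str) -> dict:
    """SHA-256 of every weight shard in a local model directory."""
    shards = sorted(Path(repo_dir).glob("*.safetensors")) + sorted(Path(repo_dir).glob("*.bin"))
    return {f.name: hashlib.sha256(f.read_bytes()).hexdigest() for f in shards}

noft = digest_weights("cat-llama-3-8b-instruct-aqlm-noft")
ft = digest_weights("cat-llama-3-8b-instruct-aqlm")
print("identical weights" if noft == ft else "weights differ")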


catid commented Apr 21, 2024

Honestly, I'm a bit disappointed in the performance of AQLM. It loses about 2.5% accuracy on arc_challenge, and inference is very slow. I'm worried the 70B version will be disappointing as well.


Godofnothing commented Apr 22, 2024

@catid Unfortunately, 2-bit quantization is not lossless at the moment. In addition, it seems that very accurate models are harder to compress without noticeable degradation relative to the floating-point model.

We have just published a quantized Meta-Llama-3-8B-Instruct with 1x16 quantization to the hub. Drops are more pronounced on the more challenging MMLU and GSM8k tasks.

We are currently running 70B models anyway; I would suggest you not spend your money.

Concerning the slow inference: did you run the conversion script convert_to_hf.py? It seems like the checkpoint is in an improper format, as it is larger than the one posted in our quantized model repo.
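
For reference, the conversion step looks roughly like the following; the positional arguments (original model path, quantized checkpoint path, output directory) are an assumption based on memory, so check the AQLM README for the exact usage:

# CONVERTED_CHECKPOINT_PATH is a hypothetical output directory for the HF-format model.
python convert_to_hf.py $MODEL_PATH $SAVE_PATH $CONVERTED_CHECKPOINT_PATH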


catid commented Apr 22, 2024

Cool, I hope you do a 3-bit version so that we can do longer context or speculative decoding with the 8B model.

catid closed this as completed Apr 22, 2024

catid commented Apr 22, 2024

Ah, mine is bigger because I did a 4-bit quant, not a 2-bit quant. IMHO 2 bits is too small for an 8B model.


catid commented Apr 22, 2024

Oof, your GSM8k (8-shot) result is really bad: 74% vs. 34%. Maybe something is broken in the quantization or your scripts?


catid commented Apr 22, 2024

Maybe that should be fixed before spending the $$ on a 70B model.


Godofnothing commented Apr 22, 2024

@catid We observed earlier on Llama-2 that quality on GSM8k drops much more dramatically than on other tasks. Perplexity evaluation, hellaswag, winogrande, arc-easy/challenge and similar tasks give an overly optimistic estimate of model performance.

We have measurements for a 1x15 AQLM quant and 2-bit QuIP# on Llama-2-7b, and the drops in either case are quite severe. Specifically, fp16 llama-2-7b has ~14.7% accuracy on GSM8k, whereas AQLM and QuIP# (the fine-tuned version) yield ~6.2% and ~5.4%, respectively.

Our conjecture is that the calibration set used to find the optimal model configuration doesn't involve math; therefore this task is in some sense OOD for the resulting model.

Godofnothing (Collaborator) commented

@catid The 2x16 kernel is at the moment not as efficient as the 1x16 and 2x8 ones, unfortunately, so I would not recommend using it.

Godofnothing (Collaborator) commented

@catid, we added more evaluations for our 1x16 model on the hub. Drops on the five 0-shot tasks are significant, but not catastrophic.
