Issues while attempting LLaMA-3 Quantization #78

Closed

catid opened this issue Apr 20, 2024 · 22 comments

catid commented Apr 20, 2024

Checking to see if this repo works for the new L3 models. Running this script:

export CUDA_VISIBLE_DEVICES=0,1   # or e.g. 0,1,2,3
export MODEL_PATH=/home/catid/models/Meta-Llama-3-8B-Instruct
export DATASET_PATH=pajama
export SAVE_PATH=/home/catid/models/cat-llama-3-8b-instruct-aqlm
export WANDB_PROJECT=aqlm
export WANDB_NAME=aqlm8

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH

I see:

============ Load model... ============
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.65it/s]
Loading pretrained model ...
Model loaded sucсessfully ...

============ Quantizing model... ============
Loading data ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 41, in quantize_model
    data = get_loaders(
  File "/home/catid/sources/AQLM/src/datautils.py", line 226, in get_loaders
    tokenizer = LlamaTokenizer.from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

catid commented Apr 20, 2024

This was fixed by editing your code to use AutoTokenizer instead of LlamaTokenizer.
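
For reference, a minimal sketch of that change (the exact keyword arguments in src/datautils.py may differ; the path below is the one from the script above):

from transformers import AutoTokenizer

# AutoTokenizer resolves to Llama-3's PreTrainedTokenizerFast instead of the
# SentencePiece-based LlamaTokenizer, which is what raised "TypeError: not a string".
model_path = "/home/catid/models/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)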


catid commented Apr 20, 2024

Now it runs out of memory:

Loaded data from pajama; len(data)=1024 sequences

Starting AQ quantization ...
catching layer inputs from data
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 59, in quantize_model
    results = quantize_aq(model, train_data, val_data, args)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/catid/sources/AQLM/main.py", line 169, in quantize_aq
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
  File "/home/catid/sources/AQLM/main.py", line 169, in <listcomp>
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


catid commented Apr 20, 2024

Running on a big server fixed that issue


catid commented Apr 20, 2024

Documenting my run here: https://github.com/catid/AQLM/blob/main/catid_readme.md

Please suggest any improvements based on your experience

Mayorc1978 commented

Not sure if AutoTokenizer is ready for Llama 3 models, since they changed both the chat template and the tokenizer itself. Take a look at https://github.com/meta-llama/llama-recipes to avoid surprises in the final quality of the quantized model.
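
One quick way to check what the tokenizer resolves to and whether the Llama-3 chat template is picked up (a hedged sketch using the model path from earlier in the thread):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/home/catid/models/Meta-Llama-3-8B-Instruct")
print(type(tok))  # expect PreTrainedTokenizerFast for Llama 3
# Render the chat template without tokenizing to eyeball the special tokens.
print(tok.apply_chat_template([{"role": "user", "content": "Hello"}], tokenize=False))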


catid commented Apr 21, 2024

Model is up here if you want to check https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft


catid commented Apr 21, 2024

Trying to figure out how to get the finetune script to work. Currently it runs out of memory on the big server.
[screenshot of the out-of-memory error during fine-tuning]


catid commented Apr 21, 2024

Using a smaller microbatch_size fixed that
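
For anyone hitting the same wall: shrinking the microbatch is plain gradient accumulation. Below is a generic, self-contained illustration (not the AQLM fine-tune script) of why peak memory drops while the effective batch size stays the same:

import torch
from torch import nn

# Toy stand-in for the model; the point is the accumulation pattern, not the model.
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

effective_batch, microbatch = 32, 4          # e.g. finetune batch size vs. microbatch size
accum_steps = effective_batch // microbatch

data = torch.randn(effective_batch, 16)
target = torch.randn(effective_batch, 1)

opt.zero_grad()
for i in range(accum_steps):
    xb = data[i * microbatch:(i + 1) * microbatch]
    yb = target[i * microbatch:(i + 1) * microbatch]
    # Scale the loss so the accumulated gradients match a single full-batch step.
    loss = nn.functional.mse_loss(model(xb), yb) / accum_steps
    loss.backward()
opt.step()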


Godofnothing commented Apr 21, 2024

Hi, @catid. We are also running Llama-3 quantization at the moment.

Concerning the tokenizer issue, we have a fix and will soon merge it into the main branch. The cause of the issue is that Llama-3 uses a FastTokenizer instead of the default one.

About the OOM: I guess it is hard to fit a microbatch > 1 into 80 GB of VRAM.

Godofnothing (Collaborator) commented

@catid In addition, you may reduce memory usage via --finetune_dtype=bfloat16.
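
For example, a trimmed version of the command from the top of the thread with that flag added (several flags omitted here for brevity):

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --finetune_batch_size=32 \
 --finetune_dtype=bfloat16 \
 --offload_activations \
 --save $SAVE_PATH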


catid commented Apr 21, 2024

Final model here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm

Currently running lm-eval stuff on it
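
For context, a typical lm-evaluation-harness invocation for a checkpoint like this might look as follows (the task list and batch size are illustrative, and the AQLM CUDA kernels need `pip install aqlm[gpu]` in the same environment):

lm_eval --model hf \
  --model_args pretrained=catid/cat-llama-3-8b-instruct-aqlm,trust_remote_code=True,dtype=float16 \
  --tasks arc_challenge,winogrande,hellaswag \
  --batch_size 4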


catid commented Apr 21, 2024

I'd like to do the 70B model at 3 bits, which seems like it would cost about $5k on RunPod. Do you have access to cheaper compute, or are you already doing it, @Godofnothing?


catid commented Apr 21, 2024

Evaluation results without global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft

Evaluation results with global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm

The results are exactly the same, so I think the fine-tuned model did not get saved out properly? It looks like the weights are identical for both =(
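
A quick way to check whether the two uploaded checkpoints actually differ is to hash their weight files. A sketch, assuming local clones of the two repos in the working directory (paths are hypothetical):

import hashlib
from pathlib import Path

def digest_weights(repo_dir: str) -> dict:
    """SHA-256 of every weight shard in a local model directory."""
    shards = sorted(Path(repo_dir).glob("*.safetensors")) + sorted(Path(repo_dir).glob("*.bin"))
    return {f.name: hashlib.sha256(f.read_bytes()).hexdigest() for f in shards}

noft = digest_weights("cat-llama-3-8b-instruct-aqlm-noft")
ft = digest_weights("cat-llama-3-8b-instruct-aqlm")
print("identical weights" if noft == ft else "weights differ")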


catid commented Apr 21, 2024

Honestly, I'm a bit disappointed in the performance of AQLM. It loses about 2.5% accuracy on arc_challenge, and inference is very slow. I'm worried the 70B version will be disappointing as well.


Godofnothing commented Apr 22, 2024

@catid Unfortunately, 2-bit quantization is not lossless at the moment. In addition, it seems that very accurate models are harder to compress without noticeable degradation relative to the floating-point model.

We have just published a quantized Meta-Llama-3-8B-Instruct with 1x16 quantization to the hub. Drops are more pronounced on the more challenging MMLU and GSM8k tasks.

We are currently running 70B models anyway; I would suggest you not spend your money.

Concerning the slow inference: did you run the conversion script convert_to_hf.py? It seems like the checkpoint is in an improper format, as it is larger than the one posted in our quantized model repo.
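
For reference, the conversion step looks roughly like the following; the positional arguments (original model path, quantized checkpoint path, output directory) are an assumption based on memory, so check the AQLM README for the exact usage:

# CONVERTED_CHECKPOINT_PATH is a hypothetical output directory for the HF-format model.
python convert_to_hf.py $MODEL_PATH $SAVE_PATH $CONVERTED_CHECKPOINT_PATH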


catid commented Apr 22, 2024

Cool, I hope you do a 3-bit version so that we can do longer context or speculative decoding with the 8B model.

catid closed this as completed Apr 22, 2024

catid commented Apr 22, 2024

Ah, mine is bigger because I did a 4-bit quant, not a 2-bit quant. IMHO 2 bits is too small for an 8B model.


catid commented Apr 22, 2024

Oof, your GSM8k (8-shot) result is really bad: 74% vs. 34%. Maybe something is broken in the quantization or your scripts?


catid commented Apr 22, 2024

Maybe that should be fixed before spending the $$ on a 70B model.


Godofnothing commented Apr 22, 2024

@catid We observed earlier on Llama-2 that quality on GSM8k drops much more dramatically than on other tasks. Perplexity evaluation, hellaswag, winogrande, arc-easy/challenge and similar tasks give an overly optimistic estimate of model performance.

We have measurements for a 1x15 AQLM quant and 2-bit QuIP# on Llama-2-7b, and the drops in either case are quite severe. Specifically, fp16 llama-2-7b has ~14.7% accuracy on GSM8k, whereas AQLM and QuIP# (the fine-tuned version) yield ~6.2% and ~5.4%, respectively.

Our conjecture is that the calibration set used to find the optimal model configuration doesn't involve math; therefore this task is in some sense OOD for the resulting model.

Godofnothing (Collaborator) commented

@catid The 2x16 kernel is at the moment not as efficient as the 1x16 and 2x8 ones, unfortunately, so I would not recommend using it.

Godofnothing (Collaborator) commented

@catid, we added more evaluations for our 1x16 model on the hub. Drops on the five 0-shot tasks are significant, but not catastrophic.
