GPTQ quantization not working #12
Open
lopuhin opened this issue Dec 1, 2023 · 16 comments

Comments

@lopuhin commented Dec 1, 2023

Running quantize.py with --mode int4-gptq does not seem to work:

  • the code tries to import lm-evaluation-harness, which is not included, documented, or used elsewhere
  • an import in eval.py is incorrect; it should probably be from model import Transformer as LLaMA instead of from model import LLaMA (see the sketch below)
  • after fixing the two issues above, the next one is a circular import
  • after fixing that, import lm_eval should be replaced with import lm_eval.base
  • there is one more circular import
  • there are a few other missing imports from lm_eval
  • and a few other errors

Overall here are the fixes I had to apply to make it run: lopuhin@86d990b
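
For reference, a minimal sketch of the two import fixes called out in the list above, as they would appear in eval.py; the linked commit is the authoritative set of changes (including the circular-import fixes):

    # eval.py (sketch): the two corrected imports from the list above
    from model import Transformer as LLaMA   # was: from model import LLaMA
    import lm_eval.base                      # was: import lm_eval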

Based on this, could you please check if the right version of the code was included for GPTQ quantization?

@lopuhin commented Dec 1, 2023

One more issue is very high memory usage: it exceeds 128 GB after processing only the first 9 layers of the 13b model.

@jamestwhedbee

I am stuck at the third bullet point here as well; I'm just going to follow along with the comments here.

@lopuhin commented Dec 1, 2023

@jamestwhedbee to get rid of those Python issues you can try this fork in the meantime: https://github.com/lopuhin/gpt-fast/ -- but I don't have a solution for the high RAM usage yet, so in the end I didn't manage to produce a converted model.

@jamestwhedbee commented Dec 1, 2023

That looked promising, but I unfortunately ran into another issue you probably wouldn't have. I am on AMD, so that might be the cause; I can't find anything online related to this issue. I noticed that non-GPTQ int4 quantization does not work for me either, with the same error. int8 quantization works fine, and I have run GPTQ int4-quantized models with the auto-gptq library for ROCm before, so I'm not sure what the issue is.

Traceback (most recent call last):
  File "/home/telnyxuser/gpt-fast/quantize.py", line 614, in <module>
    quantize(args.checkpoint_path, args.model_name, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 560, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 423, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/home/telnyxuser/gpt-fast/quantize.py", line 358, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/_ops.py", line 753, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.
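
If it helps anyone isolate this, here is a minimal repro sketch that calls the packing op directly, outside of quantize.py. The shapes, the int32 dtype, and inner_k_tiles=8 mirror what the traceback above shows quantize.py doing and are assumptions, not part of the repo:

    # repro sketch: does the int4 packing kernel work on this setup?
    import torch

    print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability())
    weight_int32 = torch.randint(0, 16, (4096, 4096), dtype=torch.int32, device="cuda")
    try:
        packed = torch.ops.aten._convert_weight_to_int4pack(weight_int32, 8)  # inner_k_tiles=8
        print("int4 packing OK, packed shape:", packed.shape)
    except RuntimeError as e:
        print("int4 packing unavailable:", e)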

@lopuhin commented Dec 1, 2023

I got the same error when trying a conversion on another machine with more RAM but an older NVIDIA GPU.

@MrD005 commented Dec 4, 2023

Has anyone solved all of these problems? I am hitting every problem discussed in this thread.

@MrD005 commented Dec 4, 2023

@jamestwhedbee @lopuhin I am stuck on this:

Traceback (most recent call last):
  File "quantize.py", line 614, in <module>
    quantize(args.checkpoint_path, args.model_name, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "quantize.py", line 560, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/root/development/dev/venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "quantize.py", line 423, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "quantize.py", line 358, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/root/development/dev/venv/lib/python3.8/site-packages/torch/_ops.py", line 753, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

Were you able to solve this?

@lopuhin commented Dec 4, 2023

RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

@MrD005 I got this error when trying to run on a 2080 Ti but not on an L4 (both using CUDA 12.1), so I suspect this function is missing on lower compute capabilities.

@MrD005 commented Dec 4, 2023

@lopuhin I am running it on an A100, Python 3.8, with the CUDA 11.8 nightly, so I don't think it is about lower compute capability.

@chu-tianxiang

According to the code here, both CUDA 12.x and compute capability 8.0+ are probably required.
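
A quick way to check both conditions on a given machine; this is a sketch that assumes the CUDA 12.x / compute capability 8.0+ requirement above is what gates the kernel (on ROCm builds torch.version.cuda is None, so the check simply reports the setup as unsupported):

    # sketch: check the reported requirements for the int4 packing kernel
    import torch

    cuda = torch.version.cuda                  # e.g. "12.1"; None on CPU/ROCm builds
    cc = torch.cuda.get_device_capability()    # e.g. (8, 0) for A100, (7, 5) for 2080 Ti
    print("CUDA:", cuda, "compute capability:", cc)
    if cuda is None or int(cuda.split(".")[0]) < 12 or cc < (8, 0):
        print("_convert_weight_to_int4pack_cuda is likely unavailable on this setup")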

@briandw commented Dec 7, 2023

I had the same _convert_weight_to_int4pack_cuda-not-available problem. It was due to CUDA 11.8 not supporting the operator. It works now with an RTX 4090 and CUDA 12.1.

@xin-li-67

I got this problem on a single RTX 4090 with the PyTorch nightly built for CUDA 11.8. After I switched to the PyTorch nightly for CUDA 12.1, the problem was gone.

@lufixSch commented Jan 7, 2024

@jamestwhedbee did you find a solution for ROCm?

@jamestwhedbee

@lufixSch no, but as of last week v0.2.7 of vLLM supports GPTQ with ROCm, and I am seeing pretty good results there. So maybe that is an option for you.

@ce1190222 commented Feb 2, 2024

I applied all the fixes mentioned, but I'm still getting this error:

  File "/kaggle/working/quantize.py", line 14, in <module>
    from GPTQ import GenericGPTQRunner, InputRecorder
  File "/kaggle/working/GPTQ.py", line 12, in <module>
    from eval import setup_cache_padded_seq_input_pos_max_seq_length_for_prefill
  File "/kaggle/working/eval.py", line 20, in <module>
    import lm_eval.base
ModuleNotFoundError: No module named 'lm_eval.base'

I am using lm_eval 0.4.0
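
As a stopgap (the next comment points to the proper fix), a hedged sketch of a version guard for eval.py; pinning lm_eval==0.3.0 is the simpler workaround, since 0.4.0 reorganized the package and removed lm_eval.base:

    # eval.py (sketch): tolerate both lm_eval layouts, or fall back to pinning 0.3.0
    try:
        import lm_eval.base as lm_eval_base   # lm_eval <= 0.3.x
    except ModuleNotFoundError:
        lm_eval_base = None                   # lm_eval >= 0.4.0 removed lm_eval.base;
                                              # adapt to the new layout or pin lm_eval==0.3.0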

@jerryzh168 commented Feb 7, 2024

Support for lm_eval 0.3.0 and 0.4.0 has been updated in eb1789b
