GPTQ quantization not working #12
Open
lopuhin opened this issue Dec 1, 2023 · 16 comments

Comments

@lopuhin commented Dec 1, 2023

Running quantize.py with --mode int4-gptq does not seem to work:

  • the code tries to import lm-evaluation-harness, which is not included, documented, or used elsewhere
  • an import in eval.py is incorrect; it should probably be from model import Transformer as LLaMA instead of from model import LLaMA (see the sketch below)
  • after fixing the two issues above, the next one is a circular import
  • after fixing that, import lm_eval should be replaced with import lm_eval.base
  • there is one more circular import
  • there are a few other missing imports from lm_eval
  • and a few other errors

Overall here are the fixes I had to apply to make it run: lopuhin@86d990b
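
For reference, a minimal sketch of the two import fixes called out in the list above, as they would appear in eval.py; the linked commit is the authoritative set of changes (including the circular-import fixes):

    # eval.py (sketch): the two corrected imports from the list above
    from model import Transformer as LLaMA   # was: from model import LLaMA
    import lm_eval.base                      # was: import lm_eval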

Based on this, could you please check if the right version of the code was included for GPTQ quantization?

@lopuhin commented Dec 1, 2023

One more issue is very high memory usage: it exceeds 128 GB after processing only the first 9 layers of the 13b model.

@jamestwhedbee

I am stuck at the third bullet point here as well; I'm just going to follow along with the comments here.

@lopuhin commented Dec 1, 2023

@jamestwhedbee to get rid of those Python issues you can try this fork in the meantime: https://github.com/lopuhin/gpt-fast/ -- but I don't have a solution for the high RAM usage yet, so in the end I didn't manage to produce a converted model.

@jamestwhedbee commented Dec 1, 2023

That looked promising, but I unfortunately ran into another issue you probably wouldn't have. I am on AMD, so that might be the cause; I can't find anything online related to this issue. I noticed that non-GPTQ int4 quantization does not work for me either, with the same error. int8 quantization works fine, and I have run GPTQ int4-quantized models with the auto-gptq library for ROCm before, so I'm not sure what the issue is.

Traceback (most recent call last):
  File "/home/telnyxuser/gpt-fast/quantize.py", line 614, in <module>
    quantize(args.checkpoint_path, args.model_name, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 560, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/telnyxuser/gpt-fast/quantize.py", line 423, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/home/telnyxuser/gpt-fast/quantize.py", line 358, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/home/telnyxuser/.local/lib/python3.10/site-packages/torch/_ops.py", line 753, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.
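
If it helps anyone isolate this, here is a minimal repro sketch that calls the packing op directly, outside of quantize.py. The shapes, the int32 dtype, and inner_k_tiles=8 mirror what the traceback above shows quantize.py doing and are assumptions, not part of the repo:

    # repro sketch: does the int4 packing kernel work on this setup?
    import torch

    print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability())
    weight_int32 = torch.randint(0, 16, (4096, 4096), dtype=torch.int32, device="cuda")
    try:
        packed = torch.ops.aten._convert_weight_to_int4pack(weight_int32, 8)  # inner_k_tiles=8
        print("int4 packing OK, packed shape:", packed.shape)
    except RuntimeError as e:
        print("int4 packing unavailable:", e)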

@lopuhin commented Dec 1, 2023

I got the same error when trying a conversion on another machine with more RAM but an older NVIDIA GPU.

@MrD005 commented Dec 4, 2023

Has anyone solved all of these problems? I am hitting every problem discussed in this thread.

@MrD005 commented Dec 4, 2023

@jamestwhedbee @lopuhin I am stuck on this:

Traceback (most recent call last):
  File "quantize.py", line 614, in <module>
    quantize(args.checkpoint_path, args.model_name, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "quantize.py", line 560, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/root/development/dev/venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "quantize.py", line 423, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "quantize.py", line 358, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/root/development/dev/venv/lib/python3.8/site-packages/torch/_ops.py", line 753, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

Were you able to solve this?

@lopuhin commented Dec 4, 2023

RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.

@MrD005 I got this error when trying to run on a 2080 Ti but not on an L4 (both using CUDA 12.1), so I suspect this function is missing on lower compute capabilities.

@MrD005 commented Dec 4, 2023

@lopuhin I am running it on an A100, Python 3.8, with the CUDA 11.8 nightly, so I don't think it is about lower compute capability.

@chu-tianxiang

According to the code here, both CUDA 12.x and compute capability 8.0+ are probably required.
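
A quick way to check both conditions on a given machine; this is a sketch that assumes the CUDA 12.x / compute capability 8.0+ requirement above is what gates the kernel (on ROCm builds torch.version.cuda is None, so the check simply reports the setup as unsupported):

    # sketch: check the reported requirements for the int4 packing kernel
    import torch

    cuda = torch.version.cuda                  # e.g. "12.1"; None on CPU/ROCm builds
    cc = torch.cuda.get_device_capability()    # e.g. (8, 0) for A100, (7, 5) for 2080 Ti
    print("CUDA:", cuda, "compute capability:", cc)
    if cuda is None or int(cuda.split(".")[0]) < 12 or cc < (8, 0):
        print("_convert_weight_to_int4pack_cuda is likely unavailable on this setup")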

@briandw commented Dec 7, 2023

I had the same _convert_weight_to_int4pack_cuda-not-available problem. It was due to CUDA 11.8 not supporting the operator. It works now with an RTX 4090 and CUDA 12.1.

@xin-li-67

I got this problem on a single RTX 4090 with the PyTorch nightly built for CUDA 11.8. After I switched to the PyTorch nightly for CUDA 12.1, the problem was gone.

@lufixSch commented Jan 7, 2024

@jamestwhedbee did you find a solution for ROCm?

@jamestwhedbee

@lufixSch no, but as of last week v0.2.7 of vLLM supports GPTQ with ROCm, and I am seeing pretty good results there. So maybe that is an option for you.

@ce1190222 commented Feb 2, 2024

I applied all the fixes mentioned, but I'm still getting this error:

  File "/kaggle/working/quantize.py", line 14, in <module>
    from GPTQ import GenericGPTQRunner, InputRecorder
  File "/kaggle/working/GPTQ.py", line 12, in <module>
    from eval import setup_cache_padded_seq_input_pos_max_seq_length_for_prefill
  File "/kaggle/working/eval.py", line 20, in <module>
    import lm_eval.base
ModuleNotFoundError: No module named 'lm_eval.base'

I am using lm_eval 0.4.0
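
As a stopgap (the next comment points to the proper fix), a hedged sketch of a version guard for eval.py; pinning lm_eval==0.3.0 is the simpler workaround, since 0.4.0 reorganized the package and removed lm_eval.base:

    # eval.py (sketch): tolerate both lm_eval layouts, or fall back to pinning 0.3.0
    try:
        import lm_eval.base as lm_eval_base   # lm_eval <= 0.3.x
    except ModuleNotFoundError:
        lm_eval_base = None                   # lm_eval >= 0.4.0 removed lm_eval.base;
                                              # adapt to the new layout or pin lm_eval==0.3.0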

@jerryzh168 commented Feb 7, 2024

Support for lm_eval 0.3.0 and 0.4.0 has been updated in eb1789b
