## GPTQ - 4-bit Quantization
- A quantization algorithm based on OBQ (Optimal Brain Quantization)
- Uses three techniques: the arbitrary-order insight, lazy batch updates, and a Cholesky reformulation  
    -> efficient enough to apply to large language models
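
To make the OBQ idea concrete, here is a minimal pure-Python sketch; it is not the AutoGPTQ implementation. The weights of one row are quantized one at a time in a fixed left-to-right order, and each rounding error is pushed onto the not-yet-quantized weights through the inverse Hessian of the layer inputs. Lazy batch updates and the Cholesky reformulation are left out, and the toy inputs, the integer rounding grid, and all function names are assumptions chosen for illustration.

```python
def hessian(inputs):
    """H = sum of x x^T over the calibration inputs (the constant factor 2 cancels)."""
    n = len(inputs[0])
    return [[sum(x[i] * x[j] for x in inputs) for j in range(n)] for i in range(n)]


def invert_2x2(h):
    """Closed-form inverse; enough for this two-weight toy example."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quantize_with_compensation(weights, h_inv, quant=round):
    """Quantize left to right, spreading each rounding error over the remaining weights."""
    w = list(weights)
    h_inv = [row[:] for row in h_inv]
    n = len(w)
    for q in range(n):
        w_q = quant(w[q])
        err = (w[q] - w_q) / h_inv[q][q]
        # Compensate: adjust the weights that have not been quantized yet.
        for j in range(q + 1, n):
            w[j] -= err * h_inv[j][q]
        w[q] = w_q
        # Drop index q from the inverse Hessian (one Gaussian elimination step).
        pivot = h_inv[q][q]
        col = [h_inv[i][q] for i in range(n)]
        row = h_inv[q][:]
        for i in range(q + 1, n):
            for j in range(q + 1, n):
                h_inv[i][j] -= col[i] * row[j] / pivot
    return w


def output_error(weights, quantized, inputs):
    """Squared difference between original and quantized layer outputs."""
    total = 0.0
    for x in inputs:
        original = sum(w * xi for w, xi in zip(weights, x))
        approx = sum(q * xi for q, xi in zip(quantized, x))
        total += (original - approx) ** 2
    return total


calib_inputs = [(1.0, 1.0), (1.0, 1.0), (1.0, 0.0)]  # toy calibration data
weights = [0.4, 0.4]

h_inv = invert_2x2(hessian(calib_inputs))
rtn = [round(w) for w in weights]  # plain round-to-nearest baseline
compensated = quantize_with_compensation(weights, h_inv)

rtn_error = output_error(weights, rtn, calib_inputs)
compensated_error = output_error(weights, compensated, calib_inputs)
print(rtn, rtn_error)
print(compensated, compensated_error)
```

On this toy data, round-to-nearest maps both weights to 0 and gives an output error of about 1.44, while the compensated pass ends at `[0, 1]` with an error of about 0.24. The point is that rounding the second weight "the wrong way" can cancel the error introduced by the first.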

In [14]:
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer


# Define base model and output directory
model_id = "gpt2"
out_dir = "/media/shin/T7/model_ckpt/" + model_id + "-GPTQ"

### Load model & tokenizer with configuration

In [2]:
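quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128, # number of weights that share one scale and zero-point
    damp_percent=0.01, # dampening added to the Hessian diagonal to keep the Cholesky reformulation numerically stable
    desc_act=False # quantize columns in order of decreasing activation size -> True: better accuracy, slower inference
)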
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config=quantize_config, cache_dir="/media/shin/T7/huggingface/models")
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir="/media/shin/T7/huggingface/tokenizers")
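
As a rough illustration of what `bits` and `group_size` control, the sketch below applies plain asymmetric min-max rounding to a toy weight row, with one scale and zero-point per group. This is only the quantization grid, not GPTQ's error-compensated procedure, and the function names and example values are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric min-max quantization of one group to `bits`-bit integer codes."""
    qmax = 2 ** bits - 1
    lo = min(min(weights), 0.0)  # keep 0.0 inside the representable range
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)  # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    dequantized = [(c - zero) * scale for c in codes]
    return codes, dequantized


def quantize_groupwise(weights, bits=4, group_size=128):
    """Quantize consecutive chunks of `group_size` weights independently."""
    codes, dequantized = [], []
    for start in range(0, len(weights), group_size):
        c, d = quantize_group(weights[start:start + group_size], bits)
        codes.extend(c)
        dequantized.extend(d)
    return codes, dequantized


def max_abs_error(original, dequantized):
    return max(abs(o - d) for o, d in zip(original, dequantized))


row = [0.0, 0.15, 0.0, 15.0]  # small and large weights in the same row
_, one_group = quantize_groupwise(row, bits=4, group_size=4)
_, two_groups = quantize_groupwise(row, bits=4, group_size=2)
print(max_abs_error(row, one_group))   # the large weight sets the scale for all four
print(max_abs_error(row, two_groups))  # each pair gets its own scale
```

With a single group the weight `0.15` is rounded to zero because the scale is dictated by `15.0`; with two groups it is reconstructed almost exactly. Smaller groups therefore reduce rounding error at the cost of storing more scales and zero-points.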

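### Quantize with samples
- The calibration samples are used to compare the outputs of the original model with those of the quantized model
- More samples generally give better quantization quality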

In [5]:
n_samples = 1024
data = load_dataset(
    "allenai/c4", 
    data_files="en/c4-train.00001-of-01024.json.gz", 
    split=f"train[:{n_samples*5}]", 
    cache_dir="/media/shin/T7/huggingface/datasets"
)

Downloading data: 100%|██████████| 318M/318M [02:03<00:00, 2.58MB/s]
Generating train split: 356318 examples [00:00, 419675.50 examples/s]


In [8]:
tokenized_data = tokenizer("\n\n".join(data["text"]), return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (2441065 > 1024). Running this sequence through the model will result in indexing errors


In [13]:
examples_ids = []
for _ in range(n_samples):
    i = random.randint(0, tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1)
    j = i + tokenizer.model_max_length
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})
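
The sampling loop above can be factored into a small helper. The sketch below uses a plain Python list in place of the tokenized tensor so that it runs on its own; the helper name, the stand-in token stream, and the use of a seeded local `random.Random` for reproducibility are assumptions, not part of the original cell.

```python
import random


def sample_windows(token_ids, n_samples, seq_len, seed=0):
    """Draw `n_samples` random contiguous windows of `seq_len` tokens."""
    if len(token_ids) < seq_len:
        raise ValueError("token stream is shorter than one window")
    rng = random.Random(seed)  # local RNG so the calibration set is reproducible
    windows = []
    for _ in range(n_samples):
        # randint is inclusive on both ends, so this is the last valid start index
        start = rng.randint(0, len(token_ids) - seq_len)
        windows.append(token_ids[start:start + seq_len])
    return windows


token_stream = list(range(10_000))  # stand-in for the tokenized C4 text
windows = sample_windows(token_stream, n_samples=8, seq_len=1024)
print(len(windows), {len(w) for w in windows})
```

Every window has exactly `seq_len` tokens and never runs past the end of the stream, which is the same invariant the loop above maintains with `tokenizer.model_max_length`.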

In [None]:
model.quantize(
    examples_ids,
    batch_size=1,
)
model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)
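
For a rough sense of how much smaller the saved quantized layers are, the arithmetic below estimates the average storage per weight. It assumes each group stores one 16-bit scale and one zero-point of `bits` width, and it ignores any other metadata as well as layers left unquantized, so treat the result as an approximation rather than the exact on-disk size. The function names are hypothetical.

```python
def bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Average storage per weight: the code plus a share of the group's scale and zero-point."""
    per_group_overhead = scale_bits + bits  # one scale and one `bits`-wide zero-point
    return bits + per_group_overhead / group_size


def compression_vs_fp16(bits=4, group_size=128):
    """How many times smaller a quantized layer is than the same layer in fp16."""
    return 16 / bits_per_weight(bits, group_size)


print(bits_per_weight())      # settings used in this notebook
print(compression_vs_fp16())
print(bits_per_weight(group_size=32))  # smaller groups cost more overhead
```

With `bits=4` and `group_size=128` this comes to 4.15625 bits per weight, a little under a 4x reduction relative to fp16 for the quantized layers.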

### Load quantized model & generate

In [17]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Reload model and tokenizer
model = AutoGPTQForCausalLM.from_quantized(
    out_dir,
    device=device,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(out_dir)

INFO - The layer lm_head is not quantized.


In [21]:
from transformers import pipeline

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = generator("I have a dream", do_sample=True, max_length=50)[0]['generated_text']
print(result)

The model 'GPT2GPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCau

I have a dream and can't believe it has arrived!" –Liam MacCulff in Anzac Days "If one does not understand why in a thousand years we were born, what is life? A life of sorrow... A life of


### Reference
https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34