# Transformers Model Quantization Techniques: AWQ

In [1]:
# Alternatively, list all installed packages and their versions at once (filtered by keyword)
!pip list | grep -E "torch|transformers|accelerate|bitsandbytes|datasets|optimum|awq"

accelerate                0.34.2
autoawq                   0.2.9
bitsandbytes              0.46.1
datasets                  4.0.0
optimum                   1.27.0
torch                     2.7.0
torchaudio                2.7.0
torchcodec                0.5
torchvision               0.22.0
transformers              4.54.1


In [2]:
!nvidia-smi

Wed Aug  6 12:37:56 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144                Driver Version: 570.144        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Quadro RTX 6000                Off |   00000000:B3:00.0  On |                  Off |
| 33%   39C    P5             22W /  260W |     641MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
import torch
print(torch.cuda.is_available())  # should print True
print(torch.version.cuda)  # should match (or be compatible with) the version reported by nvcc -V

True
12.6
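
The build version printed here (12.6) differs from the `CUDA Version: 12.8` that `nvidia-smi` reports above. That is usually fine, because `nvidia-smi` shows the highest CUDA version the driver supports rather than the version PyTorch was built against. The sketch below encodes that rule of thumb; `parse_version` and `driver_supports` are hypothetical helper names, and the check ignores CUDA's finer minor-version compatibility rules.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Turn a 'major.minor' string such as '12.6' into the tuple (12, 6)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def driver_supports(build_cuda: str, driver_cuda: str) -> bool:
    """Rule of thumb: the driver's CUDA version must not be older than the build's."""
    return parse_version(driver_cuda) >= parse_version(build_cuda)


# Values taken from this notebook: torch.version.cuda and the nvidia-smi header.
print(driver_supports("12.6", "12.8"))  # True
```

In a live session the first argument would come from `torch.version.cuda`; the driver value still has to be read from `nvidia-smi`.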


In [4]:
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

In [5]:
from transformers import pipeline

model_path = "/home/pactera/aistudy/week03/opt-2.7b"

# Load the original OPT-2.7b model on the GPU
generator = pipeline('text-generation',
                     model=model_path,
                     device=0,
                     do_sample=True,
                     num_return_sequences=3)

Device set to use cuda:0


In [6]:
generator("The woman worked as a")

Both `max_new_tokens` (=256) and `max_length`(=21) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "The woman worked as a sex worker. I mean, that's kind of a dealbreaker.\nWhat about a hooker?\nI'm not sure, I'm not into that sort of thing."},
 {'generated_text': 'The woman worked as a nurse at a private care home for the elderly in Auckland, New Zealand, and her employer has been slammed for allegedly failing to protect her from a colleague who was verbally and physically abusive and then repeatedly raped her.\n\nPuja Gupta, 36, from Auckland, had been working at the home for the past six years and was being trained to become a registered nurse when she was allegedly raped by a colleague.\n\nPolice are investigating the circumstances surrounding the incident and have referred the case to the Health and Disability Commissioner (HDC).\n\nWhile she was being raped, Gupta suffered an epileptic fit and was taken to hospital where she was diagnosed with epilepsy. She was discharged the next day to recover at home.\n\nThe Herald on Sunday reports that Gupta, a mother 

In [7]:
generator("The man worked as a")

Both `max_new_tokens` (=256) and `max_length`(=21) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The man worked as a mechanic, a police officer and a doctor, and he was a devoted father and husband.\n\nBut as the years went on, he lost his appetite for life. He told his wife he was tired of being sick. He had been diagnosed with cancer.\n\n“He was getting no younger,” said his wife, Nancy. “He didn’t want to get old.”\n\nSo on Friday, the family found out that James E. “Jim” Martin, 76, died peacefully at home on the outskirts of Dallas, surrounded by his wife, two kids, and three grandchildren.\n\nMartin was one of three Dallas men who died unexpectedly in recent weeks.\n\nThe Dallas police officer who died last week was a beloved veteran, well-liked by his colleagues and friends.\n\nAnd earlier this month, a longtime lawyer who was known for his work on behalf of the elderly and the terminally ill died suddenly from a heart attack.\n\nThe causes of death for all three men remain under investigation.\n\nBut the circumstances of each man’s death have raised co

# Quantizing the model with AutoAWQ

In [8]:
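from awq import AutoAWQForCausalLM  # imported but not used in this cell
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Note: this cell quantizes with bitsandbytes (4-bit NF4), not with AutoAWQ.
quant_path = "models/opt-2.7b-bab"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load linear-layer weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # also quantize the per-block quantization constants
    bnb_4bit_quant_type="nf4",              # use the NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for computation after dequantization
)

# Quantization happens on the fly while the checkpoint is loaded
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)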

I have left this message as the final dev message to help you transition.

Important Notice:
- AutoAWQ is officially deprecated and will no longer be maintained.
- The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
- If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:
- AutoAWQ has been adopted by the vLLM Project: https://github.com/vllm-project/llm-compressor

For further inquiries, feel free to reach out:
- X: https://x.com/casper_hansen_
- LinkedIn: https://www.linkedin.com/in/casper-hansen-804005170/
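
The cell above imports `AutoAWQForCausalLM` but performs the quantization with bitsandbytes. For comparison, here is a minimal sketch of what an AutoAWQ run could look like, based on the usage pattern in the AutoAWQ README. The wrapper name `quantize_with_autoawq`, the constant `AWQ_QUANT_CONFIG`, and the output path in the usage note are assumptions for illustration. Given the deprecation notice above, it may not work with the installed Transformers version.

```python
# Common 4-bit AWQ settings from the AutoAWQ examples; treat them as a starting point.
AWQ_QUANT_CONFIG = {
    "zero_point": True,    # asymmetric quantization with a zero point
    "q_group_size": 128,   # number of weights sharing one scale
    "w_bit": 4,            # weight bit width
    "version": "GEMM",     # kernel variant
}


def quantize_with_autoawq(model_path, quant_path, quant_config=None):
    """Quantize a causal LM with AutoAWQ and save it (hypothetical wrapper)."""
    # Imports are deferred so that defining this function does not require awq.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_config = quant_config or AWQ_QUANT_CONFIG
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Runs calibration and replaces the linear layers with quantized ones.
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
    return quant_path
```

A call such as `quantize_with_autoawq(model_path, "models/opt-2.7b-awq")` would need a GPU and, as far as I know, network access for the default calibration data; the output path here is made up.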



In [9]:
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)

('models/opt-2.7b-bab/tokenizer_config.json',
 'models/opt-2.7b-bab/special_tokens_map.json',
 'models/opt-2.7b-bab/vocab.json',
 'models/opt-2.7b-bab/merges.txt',
 'models/opt-2.7b-bab/added_tokens.json',
 'models/opt-2.7b-bab/tokenizer.json')
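
The saved weights are 4-bit codes plus per-block scaling constants. The sketch below illustrates that idea with plain blockwise absmax quantization in pure Python. It is a simplification under stated assumptions: it uses uniform signed integer levels rather than the non-uniform NF4 codebook, and the block size of 64 and all function names are chosen for illustration.

```python
import random


def quantize_block(block, bits=4):
    """Quantize one block of floats to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for 4-bit symmetric codes
    absmax = max(abs(x) for x in block) or 1.0   # avoid division by zero
    scale = absmax / qmax
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return codes, scale


def quantize_blockwise(weights, block_size=64, bits=4):
    """Split a flat weight list into blocks and quantize each one separately."""
    return [
        quantize_block(weights[start:start + block_size], bits)
        for start in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Rebuild approximate float weights from (codes, scale) pairs."""
    restored = []
    for codes, scale in blocks:
        restored.extend(code * scale for code in codes)
    return restored


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks: {len(blocks)}, max abs reconstruction error: {max_err:.6f}")
```

Because each block has its own scale, a single outlier only degrades the precision of the 64 weights that share its block, not the whole tensor.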

# Loading the quantized model on the GPU

In [10]:
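from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path)
# device_map="auto" already places the model on the GPU; the trailing .to(0) is
# redundant and is what triggers the accelerate warning below.
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto").to(0)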

You shouldn't move a model that is dispatched using accelerate hooks.
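
To get a feel for what the 4-bit checkpoint saves, here is a back-of-the-envelope estimate of weight storage. It assumes a nominal 2.7 billion parameters, a block size of 64, and the double-quantization layout described for QLoRA (8-bit block constants with one fp32 constant per 256 blocks). It also pretends every parameter is quantized, which is not the case in practice: embeddings and normalization layers typically stay in higher precision, so treat the 4-bit figures as a lower bound.

```python
def weight_bytes(num_params, bits_per_param):
    """Bytes needed to store `num_params` weights at the given average bit width."""
    return num_params * bits_per_param / 8


NUM_PARAMS = 2.7e9  # nominal parameter count of OPT-2.7b (assumption)

FP16_BITS = 16
# 4-bit codes plus one fp32 scaling constant per 64-weight block
NF4_BITS = 4 + 32 / 64
# Double quantization: 8-bit constant per block, one fp32 constant per 256 blocks
NF4_DQ_BITS = 4 + 8 / 64 + 32 / (64 * 256)

for label, bits in [
    ("fp16", FP16_BITS),
    ("nf4", NF4_BITS),
    ("nf4 + double quant", NF4_DQ_BITS),
]:
    gib = weight_bytes(NUM_PARAMS, bits) / 2**30
    print(f"{label:<20} {bits:7.3f} bits/param  ~{gib:.2f} GiB")
```

For a measured value rather than an estimate, `model.get_memory_footprint()` returns the loaded model's size in bytes.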


In [11]:
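def generate_text(text):
    # Tokenize the prompt and move the tensors to GPU 0, where the model lives
    inputs = tokenizer(text, return_tensors="pt").to(0)

    # Generate at most 64 new tokens using the model's default decoding settings
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)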

In [12]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to see you're still around.
I'm still here, just not posting as much. I'm still here though.


In [13]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a nurse at the hospital and was a member of the hospital's staff.

The woman was a member of the hospital's staff and had been working at the hospital for about a year.

The woman was a member of the hospital's staff and had been working at the hospital for about a year.
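
The repeated sentences in this output are typical of greedy decoding, which is likely what `generate_text` ends up using because it passes no sampling arguments (the earlier pipeline set `do_sample=True`). A minimal sketch of sampling settings follows. `sampling_kwargs` is a hypothetical helper, the numeric values are illustrative rather than tuned, and the keys are standard `generate` arguments.

```python
def sampling_kwargs(max_new_tokens=64, temperature=0.8, top_p=0.95,
                    repetition_penalty=1.1):
    """Build keyword arguments for `model.generate` that enable nucleus sampling."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,                         # sample instead of taking the argmax
        "temperature": temperature,                # values below 1 sharpen the distribution
        "top_p": top_p,                            # keep the smallest token set with this mass
        "repetition_penalty": repetition_penalty,  # values above 1 discourage repeats
    }


print(sampling_kwargs())
```

Inside `generate_text`, the call would become `model.generate(**inputs, **sampling_kwargs())`; outputs then vary from run to run unless a seed is fixed.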


