# Transformers Model Quantization Techniques: AWQ

In [1]:
# List all installed packages and their versions in one go, filtered by keyword
!pip list | grep -E "torch|transformers|accelerate|bitsandbytes|datasets|optimum"

accelerate                0.26.1
datasets                  4.0.0
torch                     2.7.0+cu128
transformers              4.54.1


In [2]:
!nvidia-smi

Sun Aug  3 00:49:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3080        On  |   00000000:0B:00.0  On |                  N/A |
| 33%   43C    P8             36W /  320W |    1819MiB /  10240MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------

In [3]:
import torch
print(torch.cuda.is_available())  # should print True
print(torch.version.cuda)  # should match (or be compatible with) the version reported by nvcc -V

True
12.8


In [4]:
import os
# Route Hugging Face Hub downloads through a mirror; set this before importing transformers
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

In [5]:
from transformers import pipeline

model_path = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the original (unquantized) TinyLlama-1.1B-Chat model on the GPU
generator = pipeline('text-generation',
                     model=model_path,
                     device=0,
                     do_sample=True,
                     num_return_sequences=3)

Device set to use cuda:0


In [6]:
generator("The woman worked as a")

[{'generated_text': "The woman worked as a chef in a restaurant in France.\n\n**Based on the text:**\n\nThe woman was a chef. She worked at a restaurant in France. She was a part-time chef, doing her job on the weekends. She had been working in that restaurant for three years. Her boss at the time was a famous chef who was also a friend. The woman worked well with other chefs, too, to ensure that everything went smoothly. She was known for her delicious and creative dishes. She was a member of the Association of Chefs in France. The chef was a member of the same association. The chef was known for his expertise in French cuisine. He had been awarded a Michelin star by the time she had started working there. She was a member of the same association as him. The chef was also an expert in international cuisine. He had traveled extensively, having worked in different restaurants in different countries. The chef was also involved in cooking classes for tourists. She was teaching French cuis

In [7]:
generator("The man worked as a")

[{'generated_text': 'The man worked as a mechanic.'},
 {'generated_text': "The man worked as a janitor at the school for many years, and he was known for his impeccable cleaning skills. He was always on time, and he took great pride in his work. He was a quick thinker, and he could quickly put together a cleaning schedule that would get the school cleaned up in no time. He had a great sense of humor, and he always knew how to get a laugh out of his colleagues. His work ethic was unmatched, and he always went above and beyond to ensure the school was clean and tidy. His colleagues loved him, and he was always the first to arrive at the school. He was the type of person who didn't mind working long hours, as long as he was able to leave the school clean and orderly. His wife was a school nurse, and they had been married for many years. She was also a hardworking and dedicated employee, and she often took over his duties when he was on leave. Their children were grown and had families of 

# Quantizing the Model with AutoAWQ

In [8]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


quant_path = "models/TinyLlama-1.1B-Chat-v1.0"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

I have left this message as the final dev message to help you transition.

Important Notice:
- AutoAWQ is officially deprecated and will no longer be maintained.
- The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
- If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:
- AutoAWQ has been adopted by the vLLM Project: https://github.com/vllm-project/llm-compressor

For further inquiries, feel free to reach out:
- X: https://x.com/casper_hansen_
- LinkedIn: https://www.linkedin.com/in/casper-hansen-804005170/



Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

README.md: 0.00B [00:00, ?B/s]

eval_results.json: 0.00B [00:00, ?B/s]

.gitattributes: 0.00B [00:00, ?B/s]

In [9]:
# Quantize the model (AutoAWQ downloads a calibration dataset on first use)
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (8322 > 2048). Running this sequence through the model will result in indexing errors
AWQ: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [04:41<00:00, 12.78s/it]
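
AWQ (activation-aware weight quantization) relies on the fact that a linear layer's output is unchanged if each input channel of the weight matrix is multiplied by a factor and the matching activation is divided by the same factor. Channels that matter most can then be scaled up before rounding to 4 bits. The sketch below only illustrates that identity in plain Python; the helper names `matvec` and `scale_columns` and the values of `W`, `x`, and `s` are made up for illustration, and it does not reproduce the scale search that `model.quantize` performs.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def scale_columns(W, s):
    """Multiply input channel (column) j of W by s[j]."""
    return [[w * sj for w, sj in zip(row, s)] for row in W]

# Hypothetical 2x3 weight matrix, activation vector, and per-channel scales.
W = [[0.5, -1.0, 0.25],
     [1.5, 0.75, -0.5]]
x = [2.0, 4.0, -8.0]
s = [2.0, 1.0, 0.5]

y_ref = matvec(W, x)                           # [-5.0, 10.0]

W_scaled = scale_columns(W, s)                 # this is what would be quantized
x_scaled = [xi / sj for xi, sj in zip(x, s)]   # inverse scale folded into the input
y_scaled = matvec(W_scaled, x_scaled)          # [-5.0, 10.0], same as y_ref

print(y_ref, y_scaled)
```

In the real algorithm the scaled weights are rounded to low precision, so the two outputs differ slightly; the calibration data is used to choose scales that keep that difference small.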


In [10]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
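
To make the three numeric settings concrete: `w_bit=4` means each weight is stored as an integer in 0..15, `q_group_size=128` means every 128 consecutive weights share one scale and one zero point, and `zero_point=True` selects asymmetric quantization. The following is a minimal pure-Python sketch of that scheme, not AutoAWQ's actual kernel code; the function names and the sample row are hypothetical, and it assumes each group contains both negative and positive values, as weight groups usually do.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of float weights."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:                            # avoid division by zero for constant groups
        scale = 1.0
    zero = max(0, min(q_max, round(-w_min / scale)))   # integer zero point
    q = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Map stored integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]

def quantize_row(row, q_group_size=128, w_bit=4):
    """Split one weight row into groups and quantize each group separately."""
    return [quantize_group(row[i:i + q_group_size], w_bit)
            for i in range(0, len(row), q_group_size)]

# Hypothetical row of 256 weights, i.e. two groups of 128 that each span zero.
row = [((i % 128) - 64) / 100 for i in range(256)]
groups = quantize_row(row)

q, scale, zero = groups[0]
restored = dequantize_group(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(row[:128], restored))
print(len(groups), zero, max_err <= scale / 2 + 1e-9)
```

A smaller group size gives each scale fewer weights to cover, which tends to lower the rounding error at the cost of storing more scales and zero points.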

In [11]:
from transformers import AwqConfig

# Build a quantization config that is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The underlying transformers model is stored in the `model` attribute; its config expects a dict
model.model.config.quantization_config = quantization_config

In [12]:
# Save the quantized model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # save the tokenizer alongside the weights

('models/TinyLlama-1.1B-Chat-v1.0/tokenizer_config.json',
 'models/TinyLlama-1.1B-Chat-v1.0/special_tokens_map.json',
 'models/TinyLlama-1.1B-Chat-v1.0/chat_template.jinja',
 'models/TinyLlama-1.1B-Chat-v1.0/tokenizer.json')
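
As a rough sanity check on the saved checkpoint, the expected weight storage can be estimated from the quantization settings. The sketch below is back-of-the-envelope arithmetic only. It assumes a nominal 1.1 billion parameters (taken from the model name), one 16-bit scale plus one 4-bit zero point per group, and that every parameter is quantized. That last assumption overstates the savings, because layers such as the embeddings are typically left in 16-bit precision. The function name `bits_per_weight` is hypothetical.

```python
def bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16):
    """Effective bits per weight, including per-group scale and zero point."""
    overhead_bits = scale_bits + w_bit          # one scale + one zero point per group
    return w_bit + overhead_bits / q_group_size

n_params = 1.1e9                                # nominal size, assumed from the model name
fp16_bytes = n_params * 16 / 8
awq_bits = bits_per_weight()                    # 4.15625
awq_bytes = n_params * awq_bits / 8

print(f"fp16: {fp16_bytes / 1e9:.2f} GB")
print(f"AWQ 4-bit, group size 128: {awq_bytes / 1e9:.2f} GB")
print(f"compression ratio: {fp16_bytes / awq_bytes:.2f}x")
```

Comparing this estimate with the actual size of the files under `quant_path` shows how much of the checkpoint is taken up by unquantized tensors.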

# Loading the Quantized Model on the GPU

In [13]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path)
# device_map="cuda" already places the weights on the GPU, so no extra .to(0) is needed
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")

In [14]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [15]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to here you are doing well. Here is a new version with the background I used for the first one.
Thank you so much for that beautiful new background, it perfectly suits the subject of the painting and adds a touch of drama to the scene. It truly makes the painting an artwork rather than just a few


In [16]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a waitress on weekdays and a waitress and hostess at weekends. The man was a businessman in his mid-40s with a tattoo on his left forearm. They both wore their regular clothing on all times.
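
To compare the quantized model against the original on more than output quality, it can help to time the same prompt through both. Below is a small, generic timing helper written as a sketch; `time_call` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in so the snippet runs without a GPU. In the notebook you would pass `generate_text` (or the original `generator` pipeline) instead.

```python
import time

def time_call(fn, *args, n_runs=3, **kwargs):
    """Call fn(*args, **kwargs) n_runs times; return (last_result, mean_seconds)."""
    durations = []
    result = None
    for _ in range(n_runs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, sum(durations) / len(durations)

def fake_generate(text):
    """Stand-in for generate_text so this sketch runs anywhere."""
    return text + " ..."

result, mean_s = time_call(fake_generate, "The woman worked as a")
print(f"{mean_s * 1000:.3f} ms per call")
```

Because `generate_text` decodes the output tokens into a Python string, the GPU work has finished by the time it returns, so wall-clock timing of the whole call is meaningful. The first call often includes warm-up cost and may be worth discarding.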


# Homework: Quantize the Facebook OPT-2.7B Model with the AWQ Algorithm
### Facebook OPT models: https://huggingface.co/facebook?search_models=opt