# Transformers Model Quantization Techniques: AWQ

![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)

In June 2023, Ji Lin et al. published the paper [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf).

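The paper describes an activation-aware weight quantization algorithm that can compress any Transformer-based language model with only a minor loss in quality. For a detailed introduction to the AWQ algorithm, see the [project page from Prof. Song Han's lab at MIT](https://hanlab.mit.edu/projects/awq).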
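The core idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the AutoAWQ or LLM-AWQ implementation: the helper names `fake_quantize` and `search_scales`, the random data, and the grid size are all assumptions made for the example, and steps such as scale normalization and weight clipping are left out. Input channels with large activations get their weights scaled up before round-to-nearest quantization, and the activations are divided by the same factor so the layer output is mathematically unchanged.

```python
import numpy as np


def fake_quantize(w, n_bits=4, group_size=128):
    """Asymmetric round-to-nearest quantization per group of input channels.

    Returns the dequantized float weights so the error can be measured.
    """
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    q_max = 2 ** n_bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / q_max
    zero = np.clip(np.round(-w_min / scale), 0, q_max)
    q = np.clip(np.round(groups / scale) + zero, 0, q_max)
    return ((q - zero) * scale).reshape(out_features, in_features)


def search_scales(w, x, n_grid=20):
    """Grid-search alpha in [0, 1) for per-channel scales s = mean(|x|) ** alpha."""
    reference = x @ w.T                      # full-precision layer output
    act_mag = np.abs(x).mean(axis=0)         # average magnitude per input channel
    best_alpha, best_err = 0.0, np.inf
    for step in range(n_grid):
        alpha = step / n_grid                # alpha = 0 means no scaling at all
        s = np.maximum(act_mag ** alpha, 1e-4)
        w_q = fake_quantize(w * s)           # quantize the scaled weights
        err = float(np.mean((reference - (x / s) @ w_q.T) ** 2))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))
x[:, :4] *= 30.0                             # a few channels carry much larger activations
w = rng.normal(size=(128, 256))

baseline_err = float(np.mean((x @ w.T - x @ fake_quantize(w).T) ** 2))
alpha, awq_err = search_scales(w, x)
print(f"plain RTN error: {baseline_err:.4f}")
print(f"activation-aware error: {awq_err:.4f} (alpha={alpha:.2f})")
```

Because `alpha = 0` is part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration data; how much it improves depends on how uneven the activation magnitudes are.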

transformers currently supports two open-source AWQ implementations:

- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) 


Because LLM-AWQ does not support the NVIDIA T4 GPU (the GPU used for the course demos), we use the AutoAWQ library to introduce and demonstrate AWQ model quantization.

## Testing text generation with the model before quantization

In [1]:
from transformers import pipeline

model_path = "facebook/opt-125m"

# Load the original OPT-125m model on the GPU
generator = pipeline('text-generation', # use the transformers pipeline API
                     model=model_path, # model path
                     device=0, # run on GPU 0
                     do_sample=True, # enable sampling
                     num_return_sequences=3) # return 3 generated sequences



#### Measured GPU memory usage: after loading the OPT-125m model

```shell
Fri Aug 22 10:51:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:07.0 Off |                    0 |
| N/A   44C    P0             27W /   70W |     633MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1712559      C   ...xin/miniconda3/envs/peft/bin/python        630MiB |
+-----------------------------------------------------------------------------------------+
```
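As a rough sanity check on the figure above, the memory needed for the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch: the count of roughly 125 million parameters is taken from the model name, the helper `weight_memory_mib` is made up for this example, and it assumes the weights are held as 32-bit floats (a common default when no dtype is passed, but not verified here).

```python
def weight_memory_mib(num_params, bytes_per_param):
    """Memory taken by the weights alone, in MiB."""
    return num_params * bytes_per_param / 2**20


num_params = 125_000_000  # assumption: ~125M parameters, as the model name suggests

fp32_mib = weight_memory_mib(num_params, 4)  # 4 bytes per float32 weight
fp16_mib = weight_memory_mib(num_params, 2)  # 2 bytes per float16 weight
print(f"fp32 weights: {fp32_mib:.0f} MiB, fp16 weights: {fp16_mib:.0f} MiB")
```

The number reported by `nvidia-smi` is expected to be higher than the weight estimate, since it also covers the CUDA context and allocator overhead of the process.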

In [2]:
generator("The woman worked as a") # text generation test

[{'generated_text': 'The woman worked as a secretary and also worked in a kitchen until she turned 65.\n\nShe'},
 {'generated_text': 'The woman worked as a truck driver for a couple years and she had just gotten pregnant with her first'},
 {'generated_text': 'The woman worked as a nurse at United Methodist Nursery School in East Rutherford, New Jersey in the'}]

In [3]:
generator("The man worked as a") # text generation test

[{'generated_text': 'The man worked as a mechanic, you might say, but "the law is just rules and we'},
 {'generated_text': 'The man worked as a salesperson and got pretty tired of the bullshit people kept claiming it is.'},
 {'generated_text': 'The man worked as a construction crew man for a building of an amusement park, but one incident is'}]

## Quantizing the model with AutoAWQ

Below we take the `facebook/opt-125m` model as an example and quantize it with the AWQ algorithm as implemented in the `AutoAWQ` library.

In [4]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


quant_path = "models/opt-125m-awq" # where the quantized model is saved
quant_config = {"zero_point": True, # use a zero point (asymmetric quantization)
                "q_group_size": 128,  # quantization group size
                "w_bit": 4,  # number of bits per weight
                "version": "GEMM"}  # kernel/packing format

# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda") # load the model onto the GPU
tokenizer = AutoTokenizer.from_pretrained(model_path, 
                                          trust_remote_code=True # load the tokenizer, trusting remote code
                                          )  

Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 42.65it/s]


In [5]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|██████████| 12/12 [01:45<00:00,  8.79s/it]


#### Measured GPU memory usage: peaks at nearly 4 GB during quantization, and usage keeps changing while it runs

```shell
Fri Aug 22 10:53:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:07.0 Off |                    0 |
| N/A   53C    P0             55W /   70W |    3253MiB /  15360MiB |     24%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1712559      C   ...xin/miniconda3/envs/peft/bin/python       3250MiB |
+-----------------------------------------------------------------------------------------+
```

In [6]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
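To see what these settings mean for storage, the size of one quantized linear layer can be estimated by hand. The sketch below is only arithmetic under stated assumptions: the helper `awq_layer_bytes` and the 768 x 3072 layer shape are illustrative, and it assumes one 16-bit scale plus one `w_bit`-wide zero point is stored per group of `q_group_size` input channels.

```python
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def awq_layer_bytes(in_features, out_features, w_bit, group_size, scale_bits=16):
    """Approximate storage of one group-quantized linear layer, in bytes."""
    n_weights = in_features * out_features
    n_groups = (in_features // group_size) * out_features
    weight_bits = n_weights * w_bit
    # Assumption: one scale and one zero point are kept per group
    meta_bits = n_groups * (scale_bits + w_bit)
    return (weight_bits + meta_bits) // 8


in_features, out_features = 768, 3072          # illustrative layer shape
fp16_bytes = in_features * out_features * 2    # 2 bytes per fp16 weight
awq_bytes = awq_layer_bytes(in_features, out_features,
                            w_bit=quant_config["w_bit"],
                            group_size=quant_config["q_group_size"])
bits_per_weight = awq_bytes * 8 / (in_features * out_features)

print(f"fp16: {fp16_bytes} bytes, 4-bit grouped: {awq_bytes} bytes")
print(f"effective bits per weight: {bits_per_weight}")
```

Under these assumptions the per-group metadata adds a fraction of a bit per weight on top of the nominal 4 bits. Layers that are not quantized (for example embeddings) are not covered by this estimate.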

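#### Transformers compatibility configuration

To make `quant_config` compatible with transformers, we need to update the model configuration by instantiating the quantization config with `transformers.AwqConfig`.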

In [7]:
from transformers import AwqConfig, AutoConfig

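# Adapt the config so it is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"], # number of bits per weight
    group_size=quant_config["q_group_size"], # quantization group size
    zero_point=quant_config["zero_point"], # use zero-point quantization
    version=quant_config["version"].lower(),  # kernel/packing format
).to_dict() # convert the config to a dict

# The pretrained transformers model is stored in the `model` attribute, and the config must be passed as a dict
model.model.config.quantization_config = quantization_config # attach the quantization config to the model config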

In [8]:
# Save the quantized model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # save the tokenizer

('models/opt-125m-awq/tokenizer_config.json',
 'models/opt-125m-awq/special_tokens_map.json',
 'models/opt-125m-awq/vocab.json',
 'models/opt-125m-awq/merges.txt',
 'models/opt-125m-awq/added_tokens.json',
 'models/opt-125m-awq/tokenizer.json')
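Before loading the saved model with transformers, it can help to confirm that the quantization settings were written next to the weights. The sketch below assumes the save step produced a `config.json` in the output directory with a `quantization_config` entry; the helper `read_quantization_config` is made up for this example and is not part of either library.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` entry of config.json, or None if absent."""
    config_file = Path(model_dir) / "config.json"
    if not config_file.is_file():
        return None
    with config_file.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


# Prints None when the directory or the entry does not exist
print(read_quantization_config("models/opt-125m-awq"))
```

If this prints `None` for a directory that does hold a quantized model, transformers has no recorded quantization settings to work from when loading it.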

### Loading the quantized model on the GPU

In [9]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path) # load the tokenizer saved with the quantized model
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda") # device_map already places the model on the GPU

In [10]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0) # move the inputs to the GPU

    out = model.generate(**inputs, 
                         max_new_tokens=64) # generate at most 64 new tokens

    return tokenizer.decode(out[0], skip_special_tokens=True) # decode the output, skipping special tokens


In [11]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to know you do. I'm in to have. (I do too happy to know it is. (I do)  - I did nothing

- I do not
- I do it


a  - I do to do

/ I do it

i do

a (


In [12]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a female.
*  * * * * * _ * * * *

# * * *
# * *
# * *
# * *
# * * < *
# * *
# * * * *
# / * *
# * * *
% * * *


## Homework: Quantize the Facebook OPT-2.7B model with the AWQ algorithm

Facebook OPT models: https://huggingface.co/facebook?search_models=opt