# Transformers model quantization: AWQ

![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)

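In June 2023, Ji Lin et al. published the paper [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf).

The paper describes an activation-aware weight quantization algorithm that can compress any Transformer-based language model with only a small drop in performance. For a detailed introduction to the AWQ algorithm, see the [project page from Professor Song Han's lab at MIT](https://hanlab.mit.edu/projects/awq).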
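
The central idea, scaling weight channels according to activation magnitude before rounding, can be illustrated with a toy NumPy sketch. This is not the AutoAWQ or LLM-AWQ implementation: the function names, the random data, and the simple per-row min-max quantizer are assumptions made for illustration. The sketch grid-searches an exponent `alpha`, scales each input channel of the weight matrix by `mean|x| ** alpha`, divides the activations by the same factor so the product is unchanged in full precision, and keeps the `alpha` with the lowest output error.

```python
import numpy as np

def quantize_rows(w, n_bits=4):
    """Min-max round-to-nearest quantization per output row, returned dequantized."""
    q_max = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / q_max
    q = np.clip(np.round((w - w_min) / scale), 0, q_max)
    return q * scale + w_min

def search_awq_scale(w, x, n_bits=4, n_grid=20):
    """Grid-search alpha in [0, 1]; s = mean|x|**alpha scales each input channel."""
    y_ref = x @ w.T                      # full-precision layer output
    act_mag = np.abs(x).mean(axis=0)     # per-input-channel activation magnitude
    results = []
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha             # alpha = 0 gives s = 1 (plain quantization)
        y = (x / s) @ quantize_rows(w * s, n_bits).T
        results.append((float(np.mean((y - y_ref) ** 2)), float(alpha)))
    baseline_err = results[0][0]         # error without any activation-aware scaling
    best_err, best_alpha = min(results)
    return best_alpha, best_err, baseline_err

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))           # toy calibration activations
x[:, :4] *= 20.0                         # a few input channels carry large activations
w = rng.normal(size=(32, 64))            # toy weight matrix (out_features, in_features)

best_alpha, best_err, baseline_err = search_awq_scale(w, x)
print(f"alpha={best_alpha:.2f}  error={best_err:.4f}  baseline={baseline_err:.4f}")
```

Because `alpha = 0` is part of the grid, the selected error can never be worse than plain round-to-nearest quantization on the same data.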

transformers now supports two open-source AWQ implementations:

- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) 


Because LLM-AWQ does not support the Nvidia T4 GPU (the GPU used for the course demos), we use the AutoAWQ library to introduce and demonstrate AWQ model quantization.

## Testing text generation with the model before quantization

In [10]:
from transformers import pipeline

model_path = "facebook/opt-125m"

# Load the original OPT-125m model on the GPU
generator = pipeline('text-generation',
                     model=model_path,
                     device=0,
                     do_sample=True,
                     num_return_sequences=3)


#### Measured GPU memory usage: after loading the OPT-125m model

```shell
Sun Dec 24 15:11:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   47C    P0              26W /  70W |    635MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```

In [11]:
generator("The woman worked as a")

[{'generated_text': 'The woman worked as a housekeeper for a day and he just got the job. She was the'},
 {'generated_text': 'The woman worked as a prostitute in a strip club downtown. When she noticed the man had sex with'},
 {'generated_text': 'The woman worked as a nurse in a small office, but did some serious research before starting her own'}]

In [12]:
generator("The man worked as a")

[{'generated_text': 'The man worked as a construction analyst to a contractor on Long Island. His career included selling houses that'},
 {'generated_text': 'The man worked as a security guard as a security guard at a Walmart in downtown Portland, Ore.'},
 {'generated_text': 'The man worked as a contractor for a bank before becoming a software engineer.  After his job he'}]

## Quantizing the model with AutoAWQ

Below we take the `facebook/opt-125m` model as an example and quantize it with the AWQ algorithm as implemented in the `AutoAWQ` library.

In [13]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


quant_path = "../models/opt-125m-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


In [14]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

AWQ: 100%|██████████| 12/12 [01:01<00:00,  5.13s/it]


#### Measured GPU memory usage: peak of nearly 4 GB during quantization

```shell
Sun Dec 24 15:12:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   48C    P0              32W /  70W |   3703MiB / 15360MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```

In [6]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}

#### Transformers compatibility configuration

To make `quant_config` compatible with transformers, we need to modify the configuration by instantiating the quantization config with `transformers.AwqConfig`.

In [15]:
from transformers import AwqConfig

# Adjust the config so that it is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [16]:
# Save the quantized model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # Save the tokenizer



('../models/opt-125m-awq/tokenizer_config.json',
 '../models/opt-125m-awq/special_tokens_map.json',
 '../models/opt-125m-awq/vocab.json',
 '../models/opt-125m-awq/merges.txt',
 '../models/opt-125m-awq/added_tokens.json',
 '../models/opt-125m-awq/tokenizer.json')
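
One way to confirm that the compatibility settings were persisted is to read `config.json` back from the output directory. The sketch below assumes that `save_quantized` writes a `config.json` containing the `quantization_config` entry set above. The helper name `read_quantization_config` is hypothetical, and a temporary stand-in directory is used so the snippet runs without the real checkpoint.

```python
import json
import tempfile
from pathlib import Path

def read_quantization_config(model_dir):
    """Return the `quantization_config` entry of <model_dir>/config.json, or None."""
    config_file = Path(model_dir) / "config.json"
    with config_file.open(encoding="utf-8") as f:
        return json.load(f).get("quantization_config")

# Stand-in directory (an assumption for illustration, not the real checkpoint).
with tempfile.TemporaryDirectory() as tmp:
    stand_in = {
        "model_type": "opt",
        "quantization_config": {
            "bits": 4,
            "group_size": 128,
            "zero_point": True,
            "version": "gemm",
        },
    }
    (Path(tmp) / "config.json").write_text(json.dumps(stand_in), encoding="utf-8")
    found = read_quantization_config(tmp)

print(found)
```

In the notebook, calling `read_quantization_config(quant_path)` after the save step should show the values passed to `AwqConfig`. A result of `None` would suggest the config was not attached before saving.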

### Loading the quantized model on the GPU

In [17]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")  # device_map already places the model on the GPU

In [18]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)


In [19]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to hear you're doing well!
Thank you! I'm glad to hear you're doing well!


In [20]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She


## Homework: quantizing the Facebook OPT-6.7B model with AWQ

Facebook OPT models: https://huggingface.co/facebook?search_models=opt

In [1]:
# Configure the cache environment; only needs to run once at the start
import os
# Set custom download paths for transformers models and datasets
os.environ["HF_DATASETS_CACHE"] = "/root/autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "/root/autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/root/autodl-tmp/hub_cache/"

In [2]:
# Configure the network environment (proxy); only needs to run once at the start

import subprocess
import os

result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

In [3]:
# Verify that the environment settings were applied
print("http_proxy",os.environ.get("http_proxy"))
print("https_proxy",os.environ.get("https_proxy"))
print("HF_HOME",os.environ.get("HF_HOME"))
print("HF_DATASETS_CACHE",os.environ.get("HF_DATASETS_CACHE"))
print("HUGGINGFACE_HUB_CACHE",os.environ.get("HUGGINGFACE_HUB_CACHE"))

http_proxy http://172.20.0.113:12798
https_proxy http://172.20.0.113:12798
HF_HOME /root/autodl-tmp/cache/
HF_DATASETS_CACHE /root/autodl-tmp/datasets_cache/
HUGGINGFACE_HUB_CACHE /root/autodl-tmp/hub_cache/


In [8]:
# Download the "facebook/opt-6.7b" model and tokenizer
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"

# Only needed on the first run
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer_cache_directory = f'../../autodl-tmp/model/{model_id}/tokenizer'

# Create the directory if it does not exist
os.makedirs(tokenizer_cache_directory, exist_ok=True)

tokenizer.save_pretrained(tokenizer_cache_directory)

model = AutoModelForCausalLM.from_pretrained(model_id)
model_cache_directory = f'../../autodl-tmp/model/{model_id}'

# Create the directory if it does not exist
os.makedirs(model_cache_directory, exist_ok=True)

model.save_pretrained(model_cache_directory)


In [4]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "facebook/opt-6.7b"

model_path = f'../../autodl-tmp/model/{model_id}'
tokenizer_path = f'../../autodl-tmp/model/{model_id}/tokenizer'

quant_path =  f'../../autodl-tmp/quant/{model_id}/AWQ'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_id, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

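# "zero_point": True
# Meaning: enables asymmetric (zero-point) quantization. Each group stores an integer zero point
#   next to its scale, so the quantization grid covers the group's actual [min, max] range
#   instead of being forced to be symmetric around zero.
# Effect: usually lowers quantization error when the weights in a group are not centered on zero.

# "q_group_size": 128
# Meaning: the group size for weight quantization. Weights are split into groups, and the
#   weights in each group share one scale (and one zero point).
# Effect: trades accuracy against overhead. Smaller groups follow the weights more closely but
#   store more scales and zero points; larger groups reduce that overhead at the cost of
#   higher quantization error.

# "w_bit": 4
# Meaning: the number of bits used to store each quantized weight.
# Effect: trades model size against accuracy. Fewer bits give a smaller model but may lose
#   more accuracy.

# "version": "GEMM"
# Meaning: selects the AutoAWQ kernel variant (and the matching packed weight layout) used at
#   inference time. "GEMM" refers to the general matrix multiplication kernel.
# Effect: kernel variants differ in speed depending on the workload, so pick the one that
#   suits your task and hardware.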

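
To make the roles of `zero_point`, `q_group_size`, and `w_bit` concrete, here is a minimal NumPy sketch of group-wise round-to-nearest quantization. It is an illustration under stated assumptions, not AutoAWQ's code: the function names are hypothetical, weights are treated as a flat vector, and the storage estimate assumes one 16-bit scale plus one `w_bit`-wide zero point per group.

```python
import numpy as np

def quantize_groupwise(w, w_bit=4, group_size=128, zero_point=True):
    """Quantize a 1-D weight vector in groups; return integer codes, scales, zeros."""
    assert w.size % group_size == 0, "vector length must be a multiple of group_size"
    g = w.reshape(-1, group_size)
    if zero_point:
        # Asymmetric grid: codes 0 .. 2**w_bit - 1 span each group's [min, max].
        q_min, q_max = 0, 2 ** w_bit - 1
        g_min = g.min(axis=1, keepdims=True)
        g_max = g.max(axis=1, keepdims=True)
        scale = np.maximum(g_max - g_min, 1e-8) / q_max
        zero = np.clip(np.round(-g_min / scale), q_min, q_max)
    else:
        # Symmetric grid: signed codes centered on zero, no zero point stored.
        q_max = 2 ** (w_bit - 1) - 1
        q_min = -q_max - 1
        scale = np.maximum(np.abs(g).max(axis=1, keepdims=True), 1e-8) / q_max
        zero = np.zeros_like(scale)
    q = np.clip(np.round(g / scale) + zero, q_min, q_max)
    return q.astype(np.int8), scale, zero

def dequantize_groupwise(q, scale, zero):
    """Map integer codes back to floating-point weights."""
    return (q.astype(np.float64) - zero) * scale

def bits_per_weight(w_bit, group_size, scale_bits=16):
    """Average storage per weight, counting one scale and one zero point per group."""
    return w_bit + (scale_bits + w_bit) / group_size

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                 # toy weights, not taken from a real model

def mse(group_size):
    q, scale, zero = quantize_groupwise(w, group_size=group_size)
    return float(np.mean((w - dequantize_groupwise(q, scale, zero).ravel()) ** 2))

err_128, err_32 = mse(128), mse(32)
print(f"group 128: mse={err_128:.5f}, {bits_per_weight(4, 128)} bits/weight")
print(f"group  32: mse={err_32:.5f}, {bits_per_weight(4, 32)} bits/weight")
```

On this toy data the smaller group size gives a lower reconstruction error and a higher storage cost per weight, which is the trade-off described for `q_group_size`.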

In [None]:
# Load the original OPT-6.7B model on the GPU
from transformers import pipeline

model_id = "facebook/opt-6.7b"

generator = pipeline('text-generation',
                     model=model_id,
                     device=0,
                     do_sample=True,
                     num_return_sequences=3)
generator("The woman worked as a")

OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacty of 23.65 GiB of which 212.06 MiB is free. Process 839054 has 23.43 GiB memory in use. Of the allocated memory 23.05 GiB is allocated by PyTorch, and 3.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [6]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

AWQ: 100%|██████████| 32/32 [10:17<00:00, 19.29s/it]


In [7]:
from transformers import AwqConfig

# Adjust the config so that it is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [8]:
# Save the quantized model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # Save the tokenizer



('../../autodl-tmp/quant/facebook/opt-6.7b/AWQ/tokenizer_config.json',
 '../../autodl-tmp/quant/facebook/opt-6.7b/AWQ/special_tokens_map.json',
 '../../autodl-tmp/quant/facebook/opt-6.7b/AWQ/vocab.json',
 '../../autodl-tmp/quant/facebook/opt-6.7b/AWQ/merges.txt',
 '../../autodl-tmp/quant/facebook/opt-6.7b/AWQ/added_tokens.json',
 '../../autodl-tmp/quant/facebook/opt-6.7b/AWQ/tokenizer.json')

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "facebook/opt-6.7b"
quant_path =  f'../../autodl-tmp/quant/{model_id}/AWQ'

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")  # device_map already places the model on the GPU

In [7]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [8]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to see you're still around.
I'm still around, just not as much as I used to be. I'm still here though.


In [9]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a nurse at a hospital in the city of Wuhan, the epicenter of the coronavirus outbreak, and was diagnosed with the virus on January 20.

The woman, who is in her 60s, was diagnosed with the virus on January 20, according to the Wuhan Municipal Health Commission.



## Results comparison

### Before quantization: model size about 13 GB, about 13 GB of GPU memory in use

Sat Jan 13 17:33:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:32:00.0 Off |                  Off |
|  0%   30C    P8              20W / 450W |  13097MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+


### After quantization: model size about 4 GB, about 4.8 GB of GPU memory in use

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:32:00.0 Off |                  Off |
|  0%   32C    P2              57W / 450W |   4815MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
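
A back-of-the-envelope estimate helps put the measured numbers in context. The sketch below counts weight storage only and rests on several assumptions: a nominal parameter count of 6.7e9, every weight quantized to 4 bits, and one 16-bit scale plus one 4-bit zero point per group of 128. The function name is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2 ** 30

n_params = 6.7e9                          # nominal size of OPT-6.7B (assumed round figure)
awq_bits = 4 + (16 + 4) / 128             # 4-bit codes plus per-group scale and zero point

fp32_gib = weight_memory_gib(n_params, 32)
fp16_gib = weight_memory_gib(n_params, 16)
awq_gib = weight_memory_gib(n_params, awq_bits)

print(f"FP32 weights : {fp32_gib:.1f} GiB")
print(f"FP16 weights : {fp16_gib:.1f} GiB")
print(f"4-bit AWQ    : {awq_gib:.1f} GiB")
```

Under these assumptions the FP32 weights alone exceed the 23.65 GiB reported in the out-of-memory error above, which would explain the failure if the pipeline loaded the model in full precision. The FP16 estimate is close to the 13097 MiB measured before quantization. The 4-bit estimate falls below the measured 4815 MiB, plausibly because the CUDA context, any layers left unquantized, and runtime buffers also occupy memory.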
