# Transformers Model Quantization: AWQ

![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)

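In June 2023, Ji Lin et al. published the paper [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf).

The paper describes an activation-aware weight quantization algorithm that can compress any Transformer-based language model with only a small drop in performance. For a detailed introduction to the AWQ algorithm, see [the project page from Prof. Song Han's lab at MIT](https://hanlab.mit.edu/projects/awq).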
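The central trick of the paper can be illustrated in a few lines. Each input channel of a weight matrix is multiplied by a factor derived from the average activation magnitude of that channel, and the activations are divided by the same factor. The layer output is unchanged before quantization, while channels that see large activations are represented more precisely afterwards. The pure-Python sketch below only demonstrates that equivalence. The helper names and the fixed `alpha=0.5` are assumptions for illustration (the paper searches for this exponent), and this is not the AutoAWQ implementation.

```python
def activation_scales(calib_inputs, alpha=0.5):
    """Per-input-channel scale: (mean |activation|) ** alpha."""
    n_channels = len(calib_inputs[0])
    scales = []
    for j in range(n_channels):
        mean_abs = sum(abs(x[j]) for x in calib_inputs) / len(calib_inputs)
        scales.append(max(mean_abs, 1e-8) ** alpha)  # guard against zero scales
    return scales


def apply_scales(weight, x, scales):
    """Scale weight columns up and the matching activations down."""
    w_scaled = [[w * s for w, s in zip(row, scales)] for row in weight]
    x_scaled = [xi / s for xi, s in zip(x, scales)]
    return w_scaled, x_scaled


def matvec(weight, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy calibration data: channel 1 carries much larger activations.
calib = [[0.1, 8.0, -0.2], [-0.3, 6.0, 0.1]]
weight = [[0.5, -0.25, 1.0], [0.75, 0.1, -0.6]]

scales = activation_scales(calib)
w_scaled, x_scaled = apply_scales(weight, calib[0], scales)

original = matvec(weight, calib[0])
rescaled = matvec(w_scaled, x_scaled)  # same output up to float rounding
```

In the real algorithm the scaled weights are then quantized, which is where the scaling pays off. This sketch stops before that step.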

transformers currently supports two open-source AWQ implementations:

- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) 


Because LLM-AWQ does not support the Nvidia T4 GPU (the GPU used for the course demo), we use the AutoAWQ library to introduce and demonstrate AWQ model quantization.

## Testing text generation with the model before quantization

In [1]:
from transformers import pipeline

model_path = "facebook/opt-125m"

# Load the original OPT-125m model on the GPU
generator = pipeline('text-generation',
                     model=model_path,
                     device=0,
                     do_sample=True,
                     num_return_sequences=3)



#### Measured GPU memory usage: after loading the OPT-125m model

```shell
Sun Dec 24 15:11:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   47C    P0              26W /  70W |    635MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```
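The same measurement could be scripted instead of read off the table. The sketch below is one possible approach, assuming `nvidia-smi` is on the `PATH`. It uses the tool's `--query-gpu` CSV output, and the two helper functions are hypothetical names introduced here for illustration.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines like '635, 15360' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Query nvidia-smi for used/total memory per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(out)
```

Calling `gpu_memory_mib()` before and after a notebook cell gives a rough view of how much memory that step added. Note that the figure includes the CUDA context, not only model weights.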

In [2]:
generator("The woman worked as a")

[{'generated_text': 'The woman worked as a school nurse and the man was a teacher. The men were the teachers!'},
 {'generated_text': 'The woman worked as a waitress, waitress, barista and food truck driver, and was an early'},
 {'generated_text': 'The woman worked as a security guard at a motel in her hometown in San Francisco. Photograph: T'}]

In [3]:
generator("The man worked as a")

[{'generated_text': 'The man worked as a delivery boy before. He made a fortune as a courier and it makes no'},
 {'generated_text': 'The man worked as a taxi driver for five years. He had a lot to do, and there'},
 {'generated_text': 'The man worked as a security guard at a gas station at the time and was a co-worker'}]

## Quantizing the model with AutoAWQ

Below we take the `facebook/opt-125m` model as an example and quantize it with the AWQ algorithm as implemented by the `AutoAWQ` library.

In [4]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


quant_path = "models/opt-125m-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


In [5]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

AWQ: 100%|██████████| 12/12 [01:19<00:00,  6.59s/it]


#### Measured GPU memory usage: peak of nearly 4 GB while quantizing the model

```shell
Sun Dec 24 15:12:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   48C    P0              32W /  70W |    3703MiB / 15360MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```

In [6]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
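To make these fields concrete: `w_bit=4` stores each weight as an integer in the range 0 to 15, `q_group_size=128` lets every 128 consecutive weights share one scale and one zero point, and `zero_point=True` selects asymmetric quantization. The sketch below shows that scheme in pure Python. The function names are hypothetical, and this is a simplified illustration rather than AutoAWQ's kernels or its weight packing.

```python
import math


def quantize_group(values, w_bit=4):
    """Asymmetric (zero-point) quantization of one group to w_bit integers."""
    qmax = 2 ** w_bit - 1                          # 15 for 4-bit
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-5) / qmax              # one float scale per group
    # One integer zero point per group. The clamp assumes the group straddles
    # zero (typical for weights); otherwise the codes would saturate.
    zero = min(max(round(-lo / scale), 0), qmax)
    codes = [min(max(round(v / scale) + zero, 0), qmax) for v in values]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_weights(weights, q_group_size=128, w_bit=4):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group(weights[i:i + q_group_size], w_bit)
        for i in range(0, len(weights), q_group_size)
    ]


# Toy weights: 256 values, so two groups of 128.
weights = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_weights(weights)
restored = [
    v for codes, scale, zero in groups
    for v in dequantize_group(codes, scale, zero)
]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), max_error)
```

A smaller group size tracks local weight ranges more closely at the cost of storing more scales and zero points.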

#### Transformers compatibility configuration

To make `quant_config` compatible with transformers, we need to adapt the configuration: instantiate the quantization config with `transformers.AwqConfig`.

In [7]:
from transformers import AwqConfig

# Adapt the config so that it is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [8]:
# Save the quantized model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # Save the tokenizer



('models/opt-125m-awq/tokenizer_config.json',
 'models/opt-125m-awq/special_tokens_map.json',
 'models/opt-125m-awq/vocab.json',
 'models/opt-125m-awq/merges.txt',
 'models/opt-125m-awq/added_tokens.json',
 'models/opt-125m-awq/tokenizer.json')
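As a sanity check, one could confirm that the saved `config.json` carries the `quantization_config` entry set above. The helper below is a hypothetical convenience (not part of transformers or AutoAWQ) and assumes the config file lives at `<quant_path>/config.json`.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from <model_dir>/config.json, or None."""
    config_file = Path(model_dir) / "config.json"
    if not config_file.is_file():
        return None
    with config_file.open(encoding="utf-8") as f:
        return json.load(f).get("quantization_config")
```

After the save step, `read_quantization_config("models/opt-125m-awq")` should return a dict with entries such as `bits` and `group_size`. A result of `None` would suggest the compatibility step was skipped or the path is wrong.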

### Loading the quantized model on the GPU

In [9]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")

In [10]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)


In [11]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to hear you're doing well!
Thank you! I'm glad to hear you're doing well!


In [12]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She was a nurse at the hospital for a year. She


## Homework: Quantize the Facebook OPT-6.7B model with AWQ

Facebook OPT models: https://huggingface.co/facebook?search_models=opt

### The server GPU does not have enough memory, so facebook/opt-1.3b is used instead
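A back-of-the-envelope estimate of weight storage helps explain the switch. The sketch below counts weights only, using nominal parameter counts read off the model names (1.3e9 and 6.7e9, not measured values). It assumes fp16 for the unquantized model and, for the quantized form, one fp16 scale plus one 4-bit zero point per group of 128 weights. Activations, calibration data, and the CUDA context are ignored, and treating every parameter as quantized overstates the savings because embeddings and layer norms stay in higher precision.

```python
def weight_gib(n_params, bits_per_weight):
    """Storage for n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_weight / 8 / 2 ** 30


def awq_bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16):
    """Effective bits per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + w_bit) / q_group_size


# Nominal parameter counts taken from the model names (assumption).
for name, n_params in [("opt-1.3b", 1.3e9), ("opt-6.7b", 6.7e9)]:
    fp16 = weight_gib(n_params, 16)
    awq = weight_gib(n_params, awq_bits_per_weight())
    print(f"{name}: fp16 ~{fp16:.2f} GiB, 4-bit ~{awq:.2f} GiB")
```

For 6.7e9 parameters the fp16 figure comes to roughly 12.5 GiB for the weights alone, which leaves little headroom for the quantization pass on a GPU with about 15 GiB of memory.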

In [1]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-1.3b"
quant_path = "models/opt-1.3b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


In [2]:
# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)



In [3]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)


In [4]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}

In [5]:
from transformers import AwqConfig

# Adapt the config so that it is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [6]:
# Save the quantized model weights
model.save_quantized(quant_path)
# Save the tokenizer
tokenizer.save_pretrained(quant_path)



('models/opt-1.3b-awq/tokenizer_config.json',
 'models/opt-1.3b-awq/special_tokens_map.json',
 'models/opt-1.3b-awq/vocab.json',
 'models/opt-1.3b-awq/merges.txt',
 'models/opt-1.3b-awq/added_tokens.json',
 'models/opt-1.3b-awq/tokenizer.json')

In [7]:
model.eval()

OptAWQForCausalLM(
  (model): OPTForCausalLM(
    (model): OPTModel(
      (decoder): OPTDecoder(
        (embed_tokens): Embedding(50272, 2048, padding_idx=1)
        (embed_positions): OPTLearnedPositionalEmbedding(2050, 2048)
        (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (layers): ModuleList(
          (0-23): 24 x OPTDecoderLayer(
            (self_attn): OPTAttention(
              (k_proj): WQLinear_GEMM(in_features=2048, out_features=2048, bias=True, w_bit=4, group_size=128)
              (v_proj): WQLinear_GEMM(in_features=2048, out_features=2048, bias=True, w_bit=4, group_size=128)
              (q_proj): WQLinear_GEMM(in_features=2048, out_features=2048, bias=True, w_bit=4, group_size=128)
              (out_proj): WQLinear_GEMM(in_features=2048, out_features=2048, bias=True, w_bit=4, group_size=128)
            )
            (activation_fn): ReLU()
            (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affin

In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")

In [9]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)


In [10]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a nurse in the hospital where the man worked.

The man was arrested on suspicion of assault and was taken to hospital for treatment.

The woman was taken to hospital for treatment.

A man has been arrested on suspicion of assault after a woman was injured in a hospital.

The incident happened at


In [11]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to see you're still around.
I'm glad to be here!


In [12]:
result = generate_text("新年好啊，很高兴")
print(result)

新年好啊，很高兴

第1章 公式サイト

第2章 公式サイト

第3章 公式サイト

第4章 公式サイト

