# Transformers Quantization with BitsAndBytes

![](docs/images/qlora.png)

`bitsandbytes` is the easiest option for quantizing a model to 8-bit and 4-bit.

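- 8-bit quantization (LLM.int8()) multiplies the outliers in fp16 and the non-outliers in int8, converts the non-outlier results back to fp16, and adds the two parts together to return an fp16 result. This reduces the degradation that outliers would otherwise cause in model performance.
- 4-bit quantization compresses the model even further, and it is commonly used with QLoRA to fine-tune quantized LLMs (large language models).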

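(An `outlier` is a hidden state value greater than a certain threshold, and these values are computed in fp16. While hidden state values are usually normally distributed ([-3.5, 3.5]), the distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values of around 5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold may be needed for more unstable models (small models or fine-tuning).)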
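The outlier handling described above can be sketched in plain Python. This is a toy illustration, not the bitsandbytes kernel: Python floats stand in for fp16, the matrices are tiny, and the helper names (`absmax_quantize`, `int8_matmul`, `mixed_matmul`) are hypothetical. Feature dimensions containing a value above the threshold take the floating-point path; all others are quantized to int8 with absmax scaling.

```python
THRESHOLD = 6.0  # default outlier threshold mentioned in the text

def absmax_quantize(vec):
    """Scale a vector into the int8 range [-127, 127]; return (ints, scale)."""
    scale = max(abs(v) for v in vec) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vec], scale

def matmul(A, B):
    """Plain floating-point matrix product (stand-in for the fp16 path)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def int8_matmul(X, W):
    """Quantize rows of X and columns of W to int8, multiply, then dequantize."""
    x_q = [absmax_quantize(row) for row in X]
    w_q = [absmax_quantize(col) for col in zip(*W)]
    return [[sum(a * b for a, b in zip(xi, wi)) * sx * sw for wi, sw in w_q]
            for xi, sx in x_q]

def mixed_matmul(X, W, threshold=THRESHOLD):
    """Send outlier feature dimensions through matmul, the rest through int8_matmul."""
    dims = range(len(W))
    outliers = [j for j in dims if any(abs(row[j]) > threshold for row in X)]
    regular = [j for j in dims if j not in outliers]
    result = [[0.0] * len(W[0]) for _ in X]
    for cols, product in ((outliers, matmul), (regular, int8_matmul)):
        if not cols:
            continue
        part = product([[row[j] for j in cols] for row in X], [W[j] for j in cols])
        result = [[r + p for r, p in zip(rr, pr)] for rr, pr in zip(result, part)]
    return result, outliers

# Hidden states with one outlier feature dimension (index 2) and a small weight matrix.
X = [[0.5, -1.2, 40.0, 0.3], [1.1, 0.7, -35.0, -0.9]]
W = [[0.2, -0.1], [0.4, 0.3], [-0.05, 0.02], [0.1, -0.6]]
Y, outlier_dims = mixed_matmul(X, W)
print(outlier_dims)  # [2]
print(Y)
```

Quantizing all of `X` with `int8_matmul(X, W)` instead lets the outlier set the scale for its whole row, so the small values lose most of their resolution. That is the degradation the mixed scheme is meant to avoid.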

## Quantizing model parameters in Transformers

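Pass the `load_in_8bit` or `load_in_4bit` argument to a Transformers model class's `from_pretrained()` method to quantize the model. This works for almost any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers.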
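The cells in this document only exercise the 4-bit path. For completeness, here is a minimal sketch of 8-bit loading with the outlier threshold set explicitly. It assumes a CUDA GPU with `bitsandbytes` installed; `facebook/opt-2.7b` is only a placeholder checkpoint. Recent Transformers releases prefer passing a `BitsAndBytesConfig` through `quantization_config` over the bare `load_in_8bit` keyword, so the sketch uses that form.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint; any causal LM with torch.nn.Linear layers should work.
model_id = "facebook/opt-2.7b"

int8_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold; 6.0 is the library default
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=int8_config,
    device_map="auto",
)

# Report the weight footprint in MiB, as the 4-bit cells do.
print(f"{model_8bit.get_memory_footprint() / 1024 ** 2:.2f}MiB")
```

Lowering `llm_int8_threshold` routes more feature dimensions through fp16. This may help unstable models but reduces the memory and speed benefit.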

In [2]:

import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import torch
from transformers import AutoModelForCausalLM

model_id = "/home/intel/.cache/huggingface/hub/models--THUDM--chatglm3-6b/snapshots/91a0561caa089280e94bf26a9fc3530482f0fe60"
#model_id = "/home/intel/models/qwen2-7B-instruct"
#model_id = "facebook/opt-2.7b"

model_4bit = AutoModelForCausalLM.from_pretrained(model_id,
                                                  device_map="auto",
                                                  load_in_4bit=True,
                                                  torch_dtype=torch.float16,
                                                  trust_remote_code=True)

Loading checkpoint shards: 100%|██████████| 7/7 [00:04<00:00,  1.49it/s]


In [None]:
model_4bit

### Measured GPU memory usage: Int4 quantization

```shell
Sun Dec 24 18:04:14 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   42C    P0              26W /  70W |   1779MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```

In [None]:
# Get the GPU memory occupied by the current model (any difference from nvidia-smi is memory reserved by PyTorch)
memory_footprint_bytes = model_4bit.get_memory_footprint()
memory_footprint_mib = memory_footprint_bytes / (1024 ** 2)  # convert to MiB

print(f"{memory_footprint_mib:.2f}MiB")
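As a sanity check on the reported footprint, the weight storage can be estimated from the parameter count and the bits per parameter. The sketch below is back-of-the-envelope arithmetic only: `weight_mib` is a hypothetical helper, and `N_PARAMS` is a round figure for a "6B" model rather than a count read from the checkpoint.

```python
def weight_mib(n_params, bits_per_param):
    """Approximate weight storage in MiB for n_params stored at bits_per_param."""
    return n_params * bits_per_param / 8 / (1024 ** 2)

# Assumption: a round 6 billion parameters, not the exact size of any checkpoint.
N_PARAMS = 6_000_000_000

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_mib(N_PARAMS, bits):,.0f} MiB")
```

The value returned by `get_memory_footprint()` is typically higher than the 4-bit estimate, because only the `torch.nn.Linear` weights are quantized. Embeddings, normalization layers, and the quantization constants are stored at higher precision.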

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

quant_model_dir = "models/chaglm3-6b-bnd"

# Save the model weights
model_4bit.save_pretrained(quant_model_dir)
# Save the tokenizer
tokenizer.save_pretrained(quant_model_dir)



In [None]:

tokenizer = AutoTokenizer.from_pretrained(quant_model_dir, trust_remote_code=True)

text = "介绍下上海"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

out = model_4bit.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

In [None]:

model_4bit.eval()


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM


tokenizer = AutoTokenizer.from_pretrained(quant_model_dir,trust_remote_code=True)
model_4 = AutoModelForCausalLM.from_pretrained(quant_model_dir, trust_remote_code=True, device_map="cuda")

In [8]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model_4.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [None]:
result = generate_text("介绍下北京")
print(result)

### Loading the model in NF4 precision

In [None]:
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config,trust_remote_code=True)
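NF4 (4-bit NormalFloat) is the data type introduced with QLoRA for weights that are roughly normally distributed. Each block of weights is divided by its absolute maximum, and every value is then mapped to the nearest of 16 levels spaced according to normal-distribution quantiles. The sketch below illustrates the idea with the standard library only. The codebook built by `normal_codebook` is a hypothetical stand-in: the table shipped with bitsandbytes is constructed differently in detail (it contains an exact zero, for instance).

```python
from statistics import NormalDist

def normal_codebook(bits=4):
    """Hypothetical NF4-like table: normal quantiles rescaled to [-1, 1]."""
    n = 2 ** bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(block, codebook):
    """Return (codebook indices, absmax) for one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    indices = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
        for w in block
    ]
    return indices, absmax

def dequantize_block(indices, absmax, codebook):
    """Look each index up in the codebook and undo the absmax scaling."""
    return [codebook[i] * absmax for i in indices]

codebook = normal_codebook()
block = [0.02, -0.15, 0.07, 0.31, -0.28, 0.0, 0.11, -0.05]
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)
print(indices)
print([round(w, 3) for w in restored])
```

Because the levels are denser near zero, where most normally distributed weights fall, small weights are reconstructed more precisely than they would be on a uniform 4-bit grid.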

In [None]:
# Get the GPU memory occupied by the current model (any difference from nvidia-smi is memory reserved by PyTorch)
memory_footprint_bytes = model_nf4.get_memory_footprint()
memory_footprint_mib = memory_footprint_bytes / (1024 ** 2)  # convert to MiB

print(f"{memory_footprint_mib:.2f}MiB")

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

quant_model_nf4_dir = "models/chaglm3-6b-bnd_nf4"

# Save the model weights
model_nf4.save_pretrained(quant_model_nf4_dir)
# Save the tokenizer
tokenizer.save_pretrained(quant_model_nf4_dir)


tokenizer = AutoTokenizer.from_pretrained(quant_model_nf4_dir, trust_remote_code=True)

text = "介绍下上海"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

out = model_nf4.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))


model_nf4.eval()
from transformers import AutoTokenizer, AutoModelForCausalLM

model_n = AutoModelForCausalLM.from_pretrained(quant_model_nf4_dir, trust_remote_code=True, device_map="cuda")





In [17]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model_n.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [None]:
result = generate_text("介绍下湖南")
print(result)

### Loading the model with double quantization

In [None]:
from transformers import BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config, trust_remote_code=True)
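Double quantization saves memory by quantizing the per-block scaling constants themselves. The arithmetic below follows the figures given in the QLoRA paper: weight blocks of 64 values with a 32-bit constant each, then 8-bit constants grouped in blocks of 256 with one 32-bit second-level constant. These block sizes are assumptions about the defaults, not values read from this notebook, and the function names are hypothetical.

```python
def single_quant_overhead(block_size=64, const_bits=32):
    """Extra bits per parameter when each block stores one full-precision constant."""
    return const_bits / block_size

def double_quant_overhead(block_size=64, const_bits=8,
                          const_block_size=256, second_const_bits=32):
    """Extra bits per parameter when the constants are quantized a second time."""
    first_level = const_bits / block_size
    second_level = second_const_bits / (block_size * const_block_size)
    return first_level + second_level

single = single_quant_overhead()
double = double_quant_overhead()
print(single, double, single - double)  # 0.5 0.126953125 0.373046875
```

At roughly 0.37 bits saved per parameter, a model with a few billion parameters saves on the order of a few hundred MiB. This is the size of difference to expect when comparing `get_memory_footprint()` with and without `bnb_4bit_use_double_quant=True`.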

In [None]:
# Get the GPU memory occupied by the current model (any difference from nvidia-smi is memory reserved by PyTorch)
memory_footprint_bytes = model_double_quant.get_memory_footprint()
memory_footprint_mib = memory_footprint_bytes / (1024 ** 2)  # convert to MiB

print(f"{memory_footprint_mib:.2f}MiB")

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

quant_model_2q_dir = "models/chaglm3-6b-bnd_2q"

# Save the model weights
model_double_quant.save_pretrained(quant_model_2q_dir)
# Save the tokenizer
tokenizer.save_pretrained(quant_model_2q_dir)


tokenizer = AutoTokenizer.from_pretrained(quant_model_2q_dir, trust_remote_code=True)

text = "介绍下河北"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

out = model_double_quant.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))


model_double_quant.eval()
from transformers import AutoTokenizer, AutoModelForCausalLM

model_2 = AutoModelForCausalLM.from_pretrained(quant_model_2q_dir, trust_remote_code=True, device_map="cuda")





In [24]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model_2.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [None]:
result = generate_text("介绍下河南")
print(result)

### Loading the model with all QLoRA quantization techniques

In [4]:
import torch
from transformers import BitsAndBytesConfig

qlora_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_qlora = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=qlora_config, trust_remote_code=True)

Loading checkpoint shards: 100%|██████████| 7/7 [00:01<00:00,  3.56it/s]


In [5]:
# Get the GPU memory occupied by the current model (any difference from nvidia-smi is memory reserved by PyTorch)
memory_footprint_bytes = model_qlora.get_memory_footprint()
memory_footprint_mib = memory_footprint_bytes / (1024 ** 2)  # convert to MiB

print(f"{memory_footprint_mib:.2f}MiB")

3739.69MiB


In [8]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

quant_model_qlora_dir = "models/chaglm3-6b-bnd_qlora"

# Save the model weights
model_qlora.save_pretrained(quant_model_qlora_dir)
# Save the tokenizer
tokenizer.save_pretrained(quant_model_qlora_dir)


tokenizer = AutoTokenizer.from_pretrained(quant_model_qlora_dir, trust_remote_code=True)

text = "介绍下山东"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

out = model_qlora.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))


model_qlora.eval()
from transformers import AutoTokenizer, AutoModelForCausalLM

model_q = AutoModelForCausalLM.from_pretrained(quant_model_qlora_dir, trust_remote_code=True, device_map="cuda")




Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.



[gMASK] sop 介绍下山东论语
《山东论语》是明朝作家李时中创作的一部儒家经典著作。它是对孔子的思想、言论和行为的总结和概括，旨在传承和弘扬儒家文化。

《山东论语》共分为二十篇，每一篇都涵盖了不同的主题，如孝道、忠诚、仁爱


In [9]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model_q.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [10]:
result = generate_text("介绍下陕西")
print(result)

[gMASK] sop 介绍下陕西西安的美食文化
西安是中国西部地区的重要城市,也是中国历史文化名城之一。西安的美食文化悠久而丰富,是中国美食文化的重要组成部分。

西安的美食以面食为主,其中最有名的是羊肉泡馍、肉夹馍、凉皮、油泼面等。羊肉泡馍
