In [1]:
!pip list | grep -E "torch|transformers|accelerate|llm-compressor|datasets|optimum|llmcompressor"

accelerate                  1.9.0
datasets                    4.0.0
intel_extension_for_pytorch 2.7.0
llmcompressor               0.4.1
torch                       2.7.1
transformers                4.39.3


In [2]:
# Check the GPU environment
!nvidia-smi

Mon Aug 11 14:36:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144                Driver Version: 570.144        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Quadro RTX 6000                Off |   00000000:B3:00.0  On |                  Off |
| 33%   39C    P8             27W /  260W |     529MiB /  24576MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
import torch
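print(torch.cuda.is_available())  # Should print True
print(torch.version.cuda)  # CUDA version PyTorch was built against (can differ from the driver's version reported by nvidia-smi)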

True
12.6


In [4]:
# Load the original (unquantized) model
from transformers import pipeline

model_path = "/home/pactera/aistudy/week03/opt-2.7b"  # Path to the original model

# Load the original OPT-2.7b model on the GPU
generator = pipeline(
    'text-generation',
    model=model_path,
    device=0,  # Use GPU 0
    do_sample=True,
    num_return_sequences=3
)

In [5]:
# Generation test with the original model (male occupation)
generator("The man worked as a")

[{'generated_text': 'The man worked as a bouncer for the same establishment in Boston the night of the first shooting and'},
 {'generated_text': 'The man worked as a manager for a couple of companies, who know his name. Just like the'},
 {'generated_text': 'The man worked as a mechanic for a car dealership. His customers, some of whom thought their vehicles'}]

# Quantizing the model with LLM.int8()

In [6]:
# Quantize the model with LLM.int8()
from transformers import AutoModelForCausalLM, AutoTokenizer

# Output path for the quantized model
quant_path = "models/opt-2.7b-int8"

# Load and quantize the model (core LLM.int8() settings)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    load_in_8bit=True,  # Enable 8-bit quantization (the basis of LLM.int8())
    torch_dtype=torch.float16,
    # LLM.int8() handles outlier features automatically; no extra configuration is needed
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Make sure a pad_token exists

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
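
LLM.int8() is designed to keep outlier feature dimensions in higher precision while quantizing the rest to int8. As a rough illustration of that idea (a simplified sketch, not the bitsandbytes implementation), the code below computes a dot product where dimensions with large activations stay in floating point and the remaining dimensions go through absmax int8 quantization. The function names, the toy vectors, and the threshold value are assumptions chosen for illustration.

```python
def absmax_quantize(values):
    """Scale values into the int8 range [-127, 127] using the largest magnitude."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0] * len(values), 1.0
    scale = 127.0 / max_abs
    return [round(v * scale) for v in values], scale


def mixed_precision_dot(x, w, threshold=6.0):
    """Dot product: int8 for regular dimensions, float for outlier dimensions."""
    outlier_idx = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular_idx = [i for i, v in enumerate(x) if abs(v) < threshold]

    # Outlier dimensions are multiplied in floating point.
    outlier_part = sum(x[i] * w[i] for i in outlier_idx)

    # Regular dimensions are quantized, multiplied as integers, then dequantized.
    xq, sx = absmax_quantize([x[i] for i in regular_idx])
    wq, sw = absmax_quantize([w[i] for i in regular_idx])
    int_dot = sum(a * b for a, b in zip(xq, wq))
    regular_part = int_dot / (sx * sw)

    return outlier_part + regular_part


x = [0.5, -1.2, 40.0, 0.3]  # toy activations; 40.0 plays the role of an outlier
w = [0.1, 0.4, -0.2, 0.7]   # toy weights
exact = sum(a * b for a, b in zip(x, w))
approx = mixed_precision_dot(x, w)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Without the split, the single large value would set the absmax scale for the whole vector and squeeze the small values into very few integer levels, which is the precision loss the decomposition is meant to avoid.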


In [7]:
# Save the quantized model and tokenizer
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Quantized model saved to: {quant_path}")

Quantized model saved to: models/opt-2.7b-int8
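
For context on what 8-bit storage buys, here is a back-of-the-envelope estimate of weight memory at different precisions. The parameter count of 2.7 billion is an assumption taken from the model name, and the helper `weight_memory_gib` is hypothetical. The estimate ignores layers that stay in higher precision, activations, and the KV cache, so it is only a rough guide.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 2.7e9  # assumption: nominal size implied by "opt-2.7b"

for label, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, nbytes):.2f} GiB")
```

With the model actually loaded, `model.get_memory_footprint()` reports the measured figure, which can be compared against this estimate.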


In [8]:
# Load and test the LLM.int8() quantized model
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the quantized model
tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(
    quant_path,
    device_map="auto",
    load_in_8bit=True,  # Load the model in 8-bit
    torch_dtype=torch.float16
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
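
The deprecation warning asks for a `BitsAndBytesConfig` passed through the `quantization_config` argument. A minimal sketch of that form follows. The helper name `load_int8_model` is hypothetical, and `llm_int8_threshold=6.0` is written out only to show where the outlier threshold is set. The imports sit inside the function so that it can be defined without the GPU stack; calling it requires `transformers`, `bitsandbytes`, and a CUDA device.

```python
def load_int8_model(path):
    """Load a causal LM with 8-bit weights via BitsAndBytesConfig (hypothetical helper)."""
    # Imported lazily so defining the function does not need the GPU libraries.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=6.0,  # outlier threshold, shown explicitly for illustration
    )
    return AutoModelForCausalLM.from_pretrained(
        path,
        device_map="auto",
        quantization_config=quant_config,
        torch_dtype=torch.float16,
    )
```

Usage would look like `model = load_int8_model(model_path)`. A checkpoint saved after 8-bit loading should already record its quantization settings in its config, so reloading from `quant_path` may not need the argument at all.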


In [9]:
# Define the generation function
def generate_text(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to(model.device)  # Keep the inputs on the same device as the model
    
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)
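
`generate_text` samples with `temperature=0.7`. To show what that parameter does, the sketch below applies temperature scaling to a toy set of logits before a softmax. The logit values and the function name are made up for illustration; this is the underlying math, not the `transformers` implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (1.0, 0.7):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution, so the most likely token is sampled more often and the output becomes less varied.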

In [10]:
# Generation test with the quantized model (holiday greeting)
result = generate_text("Merry Christmas! I'm glad to")
print(result)

# Generation test with the quantized model (female occupation)
result = generate_text("The woman worked as a")
print(result)

# Generation test with the quantized model (male occupation)
result = generate_text("The man worked as a")
print(result)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Merry Christmas! I'm glad to see you're doing well. (I hope I'm not too late!)  I'm currently working on a story. I have to take a bit of a break because my little one is teething, but I'm back at it and I'm actually feeling better about it.   What's your favorite Christmas
The woman worked as a beautician in the town of Leominster in Western Massachusetts, where she was known to many customers as "The Lady of Leominster."

Police say she was killed in a crash involving a car she was driving.

Investigators say that the car was driving at an abnormally fast rate of speed when
The man worked as a construction worker, but he still had enough money to buy this.
Not really, it's just a joke. The man doesn't think he's doing anything wrong.
The joke is that he's poor and buying a Lamborghini. A joke about the poor can't be funny since it's a stereotype.
