# Instruction Fine-Tuning Tutorial for a Large Language Model (Qwen2.5-Coder-3B)

## Environment Setup

This tutorial runs on an RTX 4090 GPU instance on [AutoDL](https://www.autodl.com/home).


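## Tutorial Contents

This tutorial covers the following:

0. [AutoDL configuration](#GPU实例) - how to launch a GPU instance with a suitable configuration
1. [Install dependencies](#Install) - how to install the Python packages
2. [Model preparation](#Model) - how to download and initialize the model
3. [Data preparation](#Data) - how to prepare and process the training data
4. [Model training](#Train) - how to train and optimize the model
5. [Saving the model](#Save) - how to save the training results
6. [Model inference](#Inference) - how to run inference with the trained model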


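## 0. AutoDL Configuration
- **Why AutoDL?** Compared with other cloud providers, AutoDL's GPUs are considerably cheaper, and the platform is simple to use, so getting started takes little effort.
- **Configuration:** GPU: RTX 4090 (24 GB) × 1. Image: PyTorch 2.3.0 --> Python 3.12 (Ubuntu 22.04) --> CUDA 12.1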


## 1. Install Dependencies

In [23]:
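!pip install unsloth
# modelscope 1.9.0 pins datasets<=2.13.0 (see the log below), which conflicts with trl/unsloth;
# installing datasets==2.21.0 after it restores the version those packages require
!pip install modelscope==1.9.0
!pip install datasets==2.21.0
!pip install addict
!pip install vllm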

Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting datasets<=2.13.0,>=2.8.0 (from modelscope==1.9.0)
  Using cached http://mirrors.aliyun.com/pypi/packages/17/d8/f808e32ed7fa86617b9ac7a37b7dcff894c839108c4871cc33ffc4e65b7d/datasets-2.13.0-py3-none-any.whl (485 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.21.0
    Uninstalling datasets-2.21.0:
      Successfully uninstalled datasets-2.21.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trl 0.15.2 requires datasets>=2.21.0, but you have datasets 2.13.0 which is incompatible.
unsloth-zoo 2025.3.15 requires datasets>=2.16.0, but you have datasets 2.13.0 which is incompatible.
unsloth 2025.3.17 requires datasets>=2.16.0, but you have datasets 2.13.0 which i
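Because a resolver error like the one above is easy to miss, a quick post-install check can confirm that `datasets` ended up at a version `trl` and `unsloth` accept. The sketch below uses only the standard library (`importlib.metadata`); the helper names are hypothetical, and its simplified parser compares numeric release parts only, so it is not a full PEP 440 implementation (the `packaging` library handles that properly).

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '2.21.0' into (2, 21, 0); stops at the first non-numeric part such as 'post3' or '+cu124'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):  # e.g. '0+cu124': keep the 0, ignore the local suffix
            break
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """True if `installed` >= `required`, padding the shorter version with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_package(name: str, required: str) -> str:
    """Report whether an installed package satisfies a minimum version."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    status = "OK" if meets_minimum(installed, required) else "TOO OLD"
    return f"{name} {installed} (need >= {required}): {status}"


if __name__ == "__main__":
    # Minimum version taken from the trl requirement in the resolver message above
    print(check_package("datasets", "2.21.0"))
```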

## 2. Model Preparation
### 2.1 Download the Model

In [27]:
from modelscope.hub.snapshot_download import snapshot_download

model_name = "unsloth/Qwen2.5-Coder-3B-Instruct"
local_dir = "./models/Qwen2.5-Coder-3B-Instruct"
snapshot_download(model_name, local_dir=local_dir)

Downloading Model from https://www.modelscope.cn to directory: /autodl-fs/data/models/Qwen2.5-Coder-3B-Instruct


2025-03-20 09:56:02,050 - modelscope - INFO - Target directory already exists, skipping creation.


'./models/Qwen2.5-Coder-3B-Instruct'
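Since the download step skips work when the target directory already exists, an interrupted earlier download can leave an incomplete folder behind. A minimal sketch of a completeness check; the expected file names (`config.json`, `tokenizer_config.json`, `*.safetensors`) are an assumption based on the usual Hugging Face checkpoint layout, not read from this repository, and `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path

# File names typical of a Hugging Face-format checkpoint (assumed, not read from the repo)
REQUIRED_FILES = ["config.json", "tokenizer_config.json"]


def missing_model_files(model_dir: str) -> list:
    """List what looks missing from a local model directory (empty list = looks complete)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"directory {model_dir} does not exist"]
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors weight files")
    return missing


if __name__ == "__main__":
    print(missing_model_files("./models/Qwen2.5-Coder-3B-Instruct") or "looks complete")
```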

### 2.2 Model Initialization

In [28]:
from unsloth import FastLanguageModel

# Basic configuration
max_seq_length = 2048  # maximum sequence length
dtype = None  # None = let Unsloth detect the data type automatically
load_in_4bit = True  # load the weights with 4-bit quantization to reduce memory use

# Load the pretrained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = local_dir,  # local copy of unsloth/Qwen2.5-Coder-3B-Instruct downloaded in 2.1
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank; controls the number of trainable parameters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],  # modules to adapt
    lora_alpha = 16,  # LoRA scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout = 0,  # LoRA dropout rate
    bias = "none",  # do not train bias terms
    use_gradient_checkpointing = "unsloth",  # Unsloth's gradient checkpointing to save VRAM
    random_state = 3407,  # random seed
    use_rslora = False,  # whether to use rank-stabilized LoRA (rsLoRA)
    loftq_config = None,  # LoftQ configuration (None = disabled)
)



==((====))==  Unsloth 2025.3.17: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.643 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
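With `r = 16` on all seven projection modules, the number of trainable LoRA weights follows directly from the layer shapes: each adapted `Linear(in_f, out_f)` gets matrices A (r × in_f) and B (out_f × r), i.e. r · (in_f + out_f) weights. The sketch below assumes Qwen2.5-3B's published configuration (hidden size 2048, MLP size 11008, 16 query heads, 2 key/value heads, 36 layers); the 2048-wide `q_proj` and the 36 decoder layers match the model printout in section 6.2, and the result agrees with the `Trainable parameters = 29,933,568` line in the training log. `lora_alpha` only sets the scale of the update and does not change the count.

```python
def lora_trainable_params(hidden, intermediate, num_heads, num_kv_heads, num_layers, r):
    """Count LoRA A/B weights when adapting all attention and MLP projections."""
    head_dim = hidden // num_heads
    kv_dim = num_kv_heads * head_dim  # grouped-query attention: K/V are narrower than Q
    # (in_features, out_features) of each target module in one decoder layer
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


# Architecture values assumed from the Qwen2.5-3B config.json; r=16 as in get_peft_model above
total = lora_trainable_params(hidden=2048, intermediate=11008, num_heads=16,
                              num_kv_heads=2, num_layers=36, r=16)
print(f"{total:,}")  # 29,933,568
```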

### 2.3 Inference Before Fine-Tuning

In [29]:
import torch
from transformers import GenerationConfig

# Apply the chat template
text = tokenizer.apply_chat_template([
    {"role": "user", "content": "How many r's are in strawberry?"}
], tokenize=False, add_generation_prompt=True)

# Generation settings
generation_config = GenerationConfig(
    do_sample=True,  # needed for temperature/top_p to take effect
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,
)

# Convert the text into input tensors
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# Generate with the standard generate method
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config
    )

# Decode the output (the prompt is included)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Output of the model before supervised fine-tuning: ", output_text)

Output of the model before supervised fine-tuning:  system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many r's are in strawberry?
assistant
There are 3 r's in the word "strawberry".
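Probe questions like this one are easy to score automatically, which makes before/after comparisons less subjective. A minimal sketch with a hypothetical helper, `letter_count_answer_is_correct`, that compares the first number in the reply with the true letter count; it is deliberately simple and does not understand spelled-out numbers such as "three".

```python
import re


def letter_count_answer_is_correct(word: str, letter: str, answer: str) -> bool:
    """Compare the first integer in a model answer with the true letter count."""
    expected = word.lower().count(letter.lower())
    match = re.search(r"\d+", answer)
    return match is not None and int(match.group()) == expected


before = 'There are 3 r\'s in the word "strawberry".'
print("strawberry".count("r"))  # 3
print(letter_count_answer_is_correct("strawberry", "r", before))  # True
```

The same check can be applied to the fine-tuned model's answer in section 6.1.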


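## 3. Data Preparation
### 3.1 Download on a Local PC
`datasets` is Hugging Face's library for loading and processing datasets. AutoDL instances cannot reach Hugging Face directly, so download the "mlabonne/FineTome-100k" dataset on your local PC first, then upload it through AutoDL's "File Storage" to the storage location used by your instance.

On a local machine that can reach Hugging Face (for example, through a proxy), install the library with `pip install datasets`, then run the code below.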

In [13]:
# Download the dataset (run this cell on your local PC)
from datasets import load_dataset

local_path = "./datasets/FineTome-100k"
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset.save_to_disk(local_path)
print(f"Dataset saved to {local_path}")


ConnectionError: Couldn't reach 'mlabonne/FineTome-100k' on the Hub (LocalEntryNotFoundError)

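If you cannot reach Hugging Face and therefore cannot download the dataset, I have also shared an already-downloaded copy. Baidu Netdisk link: https://pan.baidu.com/s/1MoVWFoEacQ4_Mu-SVaNkHg?pwd=3mp3 (extraction code: 3mp3)

Then upload the downloaded data to the matching location on AutoDL, "./datasets/FineTome-100k".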

### 3.2 Load the Data

In [30]:
from datasets import load_from_disk

# Load the dataset from the local path
dataset_path = "./datasets/FineTome-100k"
dataset = load_from_disk(dataset_path)
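FineTome-100k stores each conversation in ShareGPT style, as a list of turns with `from` (values such as `human` and `gpt`) and `value` keys, while the chat template used in the next step expects `role`/`content` keys; `standardize_sharegpt` performs that conversion. The sketch below is a simplified illustration of the mapping, with a hypothetical `to_role_content` helper and a made-up record; it is not unsloth's implementation.

```python
# Simplified illustration of the ShareGPT -> role/content mapping; not unsloth's actual code
ROLE_MAP = {"system": "system", "human": "user", "user": "user", "gpt": "assistant", "assistant": "assistant"}


def to_role_content(conversation):
    """Convert [{'from': ..., 'value': ...}] turns into [{'role': ..., 'content': ...}]."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in conversation]


sharegpt_record = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 = 4."},
]
print(to_role_content(sharegpt_record))
# [{'role': 'user', 'content': 'What is 2 + 2?'}, {'role': 'assistant', 'content': '2 + 2 = 4.'}]
```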

### 3.3 Format Conversion

In [31]:
from unsloth.chat_templates import get_chat_template, standardize_sharegpt
import pprint


# Configure the tokenizer to use the Qwen-2.5 chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)

def formatting_prompts_func(examples):
    """Render each conversation into a single training string.
    Args:
        examples: a batch dict whose "conversations" field is a list of conversations
    Returns:
        a dict with the formatted strings under "text"
    """
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# Normalize ShareGPT-style turns ("from"/"value") to "role"/"content"
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)


# Inspect the record at index 5 (the sixth conversation)
pprint.pprint(dataset[5])

{'conversations': [{'content': 'How do astronomers determine the original '
                               'wavelength of light emitted by a celestial '
                               'body at rest, which is necessary for measuring '
                               'its speed using the Doppler effect?',
                    'role': 'user'},
                   {'content': 'Astronomers make use of the unique spectral '
                               'fingerprints of elements found in stars. These '
                               'elements emit and absorb light at specific, '
                               'known wavelengths, forming an absorption '
                               'spectrum. By analyzing the light received from '
                               'distant stars and comparing it to the '
                               'laboratory-measured spectra of these elements, '
                               'astronomers can identify the shifts in these '
                               'wa
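The `qwen-2.5` template renders each turn in ChatML form, `<|im_start|>{role}\n{content}<|im_end|>\n`, and inserts a default system prompt when a conversation has none (the "You are Qwen, created by Alibaba Cloud..." line visible in the section 2.3 output). A simplified sketch of that rendering, assuming this layout; `render_chatml` is a hypothetical helper, and the real template also covers details such as tool calls that are omitted here. The `<|im_start|>user\n` and `<|im_start|>assistant\n` markers it produces are the strings `train_on_responses_only` searches for in section 4.

```python
def render_chatml(messages, default_system=None, add_generation_prompt=False):
    """Simplified ChatML rendering in the style of the Qwen-2.5 template (illustration only)."""
    if default_system is not None and (not messages or messages[0]["role"] != "system"):
        messages = [{"role": "system", "content": default_system}] + list(messages)
    text = "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # open an assistant turn for the model to complete
    return text


qwen_default = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
print(render_chatml(
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    default_system=qwen_default,
))
```

During training (`add_generation_prompt=False`) the assistant turn is closed with `<|im_end|>`, so the model also learns when to stop.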

## 4. Model Training

In [32]:
import torch
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported


# Checkpoints (snapshots of the training state at a given step, including
# the model weights and optimizer state) are written to this directory
output_dir = "./outputs/01_outputs"


# Configure the trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=4,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=1,  # batch size per device
        gradient_accumulation_steps=4,  # gradient accumulation steps
        warmup_steps=5,  # warmup steps
        max_steps=100,  # maximum number of training steps
        learning_rate=2e-4,  # learning rate
        fp16=not is_bfloat16_supported(),  # use fp16 when bf16 is unavailable
        bf16=is_bfloat16_supported(),  # use bf16 when supported
        logging_steps=1,  # log every step
        optim="paged_adamw_8bit",  # optimizer
        weight_decay=0.01,  # weight decay
        lr_scheduler_type="linear",  # learning-rate scheduler
        seed=3407,  # random seed
        output_dir=output_dir,  # output directory
        report_to="none",  # no external logging tools
    ),
)

from unsloth.chat_templates import train_on_responses_only
# Compute the loss on the assistant replies only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)

# Inspect the input text of sample 5 (print() is needed: only a cell's last expression is displayed)
print(tokenizer.decode(trainer.train_dataset[5]["input_ids"]))

# Inspect the label mask: masked positions (-100) are shown as spaces
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
print(tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]]))

# GPU information
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

trainer_stats = trainer.train()

# Training statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")


GPU = NVIDIA GeForce RTX 4090. Max memory = 23.643 GB.
6.217 GB of memory reserved.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 29,933,568/3,000,000,000 (1.00% trained)


| Step | Training Loss |
|---|---|
| 1 | 0.6904 |
| 2 | 0.6541 |
| 3 | 0.7736 |
| 4 | 0.6078 |
| 5 | 1.1081 |
| 6 | 0.8858 |
| 7 | 0.6535 |
| 8 | 0.8542 |
| 9 | 0.8005 |
| 10 | 0.5051 |
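Because `train_on_responses_only` sets the label of every token outside the assistant replies to -100, which PyTorch's cross-entropy loss ignores by default, the losses above are computed over assistant tokens only. A simplified sketch of that masking on toy token IDs; the marker IDs `[1, 2]` and `[1, 3]` are made-up stand-ins for the tokenized `<|im_start|>user\n` and `<|im_start|>assistant\n`, the helper names are hypothetical, and unsloth's actual implementation may differ in details.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss (its default ignore_index)


def find_subsequence(seq, sub, start=0):
    """Return the first index >= start where `sub` occurs in `seq`, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only between each response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        r = find_subsequence(input_ids, response_ids, pos)
        if r == -1:
            break
        start = r + len(response_ids)  # the marker itself stays masked
        nxt = find_subsequence(input_ids, instruction_ids, start)
        end = len(input_ids) if nxt == -1 else nxt
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels


USER, ASSISTANT = [1, 2], [1, 3]  # hypothetical marker token IDs
ids = USER + [10, 11] + ASSISTANT + [20, 21] + USER + [12] + ASSISTANT + [22]
print(mask_non_responses(ids, USER, ASSISTANT))
```

Only the reply tokens (20, 21 and 22 here) keep their IDs as labels; everything else becomes -100.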


154.1242 seconds used for training.
2.57 minutes used for training.
Peak reserved memory = 7.834 GB.
Peak reserved memory for training = 1.617 GB.
Peak reserved memory % of max memory = 33.135 %.
Peak reserved memory for training % of max memory = 6.839 %.


## 5. Save the Model

In [33]:
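save_path = output_dir + '/lora_model'

# Save locally: for a PEFT model, save_pretrained writes only the LoRA adapter, not the full base model
model.save_pretrained(save_path)  # save the adapter weights and config
tokenizer.save_pretrained(save_path)  # save the tokenizer


if False:
    # Push to the Hugging Face Hub
    model.push_to_hub("your_name/lora_model", token = "...")  # upload the adapter to the Hub
    tokenizer.push_to_hub("your_name/lora_model", token = "...")  # upload the tokenizer to the Hub


if False:
    # Load with the standard Hugging Face interfaces
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        save_path,  # adapter path
        load_in_4bit=load_in_4bit,  # load the base model in 4-bit
    )
    tokenizer = AutoTokenizer.from_pretrained(save_path)  # load the tokenizer


if False:
    # Merge the LoRA weights into the base model and save it in 16-bit to the "model" directory, together with the tokenizer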
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",) 
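After saving, `save_path` holds the LoRA adapter and tokenizer files rather than a full copy of the base model. A minimal sketch for confirming what was saved by reading `adapter_config.json`, the file PEFT's `save_pretrained` writes next to the adapter weights; the helper names are hypothetical, and the listed keys are assumed to match PEFT's LoRA config fields.

```python
import json
from pathlib import Path

# Fields assumed to be present in a LoRA adapter_config.json written by PEFT
LORA_KEYS = ("peft_type", "r", "lora_alpha", "lora_dropout", "target_modules", "base_model_name_or_path")


def load_adapter_config(adapter_dir: str) -> dict:
    """Read adapter_config.json from a directory written by save_pretrained."""
    return json.loads((Path(adapter_dir) / "adapter_config.json").read_text())


def summarize_lora_config(config: dict) -> dict:
    """Pick out the LoRA settings and compute the scale applied to the low-rank update."""
    summary = {key: config.get(key) for key in LORA_KEYS}
    r, alpha = config.get("r"), config.get("lora_alpha")
    if r and alpha is not None:
        # Standard LoRA scales by alpha / r; rsLoRA scales by alpha / sqrt(r)
        summary["scale"] = alpha / (r ** 0.5) if config.get("use_rslora") else alpha / r
    return summary


if __name__ == "__main__":
    adapter_dir = "./outputs/01_outputs/lora_model"  # save_path from the cell above
    if (Path(adapter_dir) / "adapter_config.json").is_file():
        print(summarize_lora_config(load_adapter_config(adapter_dir)))
```

With `r = 16` and `lora_alpha = 16` as configured in section 2.2, the reported scale should be 1.0.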
    


## 6. Model Inference


### 6.1 Inference with the Fine-Tuned LoRA Adapter

In [45]:
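import torch
from transformers import GenerationConfig


# Define the system prompt
SYSTEM_PROMPT = "You are a knowledgeable, friendly assistant who answers all kinds of questions accurately."

# `model` already carries the LoRA weights trained above, so it is used directly.
# Wrapping it again with PeftModel.from_pretrained(model, save_path) nests one more adapter
# wrapper on every run (see the repeated PeftModelForCausalLM layers printed in section 6.2).
# In a fresh session, load the saved adapter instead:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = save_path,  # adapter directory written in section 5
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )

# Apply the chat template
text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "How many r's are in strawberry?"}
], tokenize=False, add_generation_prompt=True)

# Generation settings
generation_config = GenerationConfig(
    do_sample=True,  # needed for temperature/top_p to take effect
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,
)

# Convert the text into input tensors
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# Generate with the standard generate method
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config
    )

# Decode the output (the prompt is included)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Output of the fine-tuned LoRA model: ", output_text)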

Output of the fine-tuned LoRA model:  system
You are a knowledgeable, friendly assistant who answers all kinds of questions accurately.
user
How many r's are in strawberry?
assistant
The word "strawberry" contains 3 'r's.
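The model printout in section 6.2 begins with five nested `PeftModelForCausalLM`/`LoraModel` pairs, which most likely means the adapter was wrapped once per execution of a `PeftModel.from_pretrained(model, ...)` call on a model that was already a PEFT model. A correctly prepared model should contain exactly one such wrapper. A minimal sketch of a quick check on the printed representation; `count_peft_wrappers` is a hypothetical helper, and it relies on the class name appearing in `str(model)`, as it does in the printout.

```python
def count_peft_wrappers(model_repr: str) -> int:
    """Count PeftModelForCausalLM wrappers in a printed model; a single adapter gives 1."""
    return model_repr.count("PeftModelForCausalLM(")


# Usage in the notebook (assumes `model` from the cells above):
# assert count_peft_wrappers(str(model)) == 1, "adapter wrapped more than once"
```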


### 6.2 Configure the Tokenizer for Inference

In [46]:
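# Configure the tokenizer for inference (the same qwen-2.5 template applied in section 3.3)
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)
FastLanguageModel.for_inference(model)   # switch to inference mode (only needs to be called once)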

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PeftModelForCausalLM(
      (base_model): LoraModel(
        (model): PeftModelForCausalLM(
          (base_model): LoraModel(
            (model): PeftModelForCausalLM(
              (base_model): LoraModel(
                (model): PeftModelForCausalLM(
                  (base_model): LoraModel(
                    (model): Qwen2ForCausalLM(
                      (model): Qwen2Model(
                        (embed_tokens): Embedding(151936, 2048, padding_idx=151665)
                        (layers): ModuleList(
                          (0-35): 36 x Qwen2DecoderLayer(
                            (self_attn): Qwen2Attention(
                              (q_proj): lora.Linear4bit(
                                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=True)
                                (lora_dropout): ModuleDict(
                                  (default): Identity()
                          

### 6.3 Prepare the Input

In [47]:
# Prepare the test input
messages = [
    {"role": "user", "content": """Here is a programming problem for testing:

    **Matrix Chain Multiplication Optimization**

    ### Problem:
    Given a chain of matrices `A1, A2, ..., An`, where the dimensions of Ai are `P[i-1] x P[i]`,
    find the optimal parenthesization order that minimizes the total scalar multiplication cost.

    **Input:**
    1. An array `P` representing dimensions, e.g., P = [10, 20, 30, 40].

    **Output:**
    1. The optimal parenthesization order (e.g., `(A1 x (A2 x A3))`).
    2. The minimum scalar multiplication cost.
    3. A comparison to the naive left-to-right multiplication cost.

    ### Constraints:
    - Use dynamic programming to solve this problem efficiently.
    - Provide a solution for P of length up to 10^5 (optional for advanced testing).

    ### Example:
    Input: P = [10, 20, 30]
    Output:
    - Optimal order: `(A1 x A2)`
    - Minimum cost: 6000
    - Naive cost: 6000

    Input: P = [10, 20, 30, 40]
    Output:
    - Optimal order: `((A1 x A2) x A3)`
    - Minimum cost: 18000
    - Naive cost: 18000

    Implement the solution and evaluate it against these criteria."""}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")
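To judge the model's answer to this prompt, it helps to have a reference solution. A compact sketch of the standard matrix-chain dynamic program; it is not the model's output, and at O(n^3) time it is meant for checking small cases, not for the optional 10^5-length variant. For P = [10, 20, 30, 40] it confirms the optimal order ((A1 x A2) x A3) with cost 10*20*30 + 10*30*40 = 18000; that order is also the naive left-to-right one, so the naive cost is 18000 as well.

```python
def matrix_chain_order(P):
    """Classic O(n^3) DP: minimum scalar multiplications and one optimal parenthesization."""
    n = len(P) - 1  # number of matrices; matrix i (0-based) is P[i] x P[i+1]
    if n < 1:
        raise ValueError("P must contain at least two dimensions")
    cost = [[0] * n for _ in range(n)]   # cost[i][j]: cheapest way to multiply A_i..A_j
    split = [[0] * n for _ in range(n)]  # split[i][j]: best k, i.e. (A_i..A_k)(A_k+1..A_j)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):
                candidate = cost[i][k] + cost[k + 1][j] + P[i] * P[k + 1] * P[j + 1]
                if candidate < cost[i][j]:
                    cost[i][j], split[i][j] = candidate, k

    def parenthesize(i, j):
        if i == j:
            return f"A{i + 1}"
        k = split[i][j]
        return f"({parenthesize(i, k)} x {parenthesize(k + 1, j)})"

    return cost[0][n - 1], parenthesize(0, n - 1)


def left_to_right_cost(P):
    """Cost of ((A1 x A2) x A3) x ...: the running product is always P[0] x P[m]."""
    return sum(P[0] * P[m] * P[m + 1] for m in range(1, len(P) - 1))


for P in ([10, 20, 30], [10, 20, 30, 40], [40, 30, 20, 10]):
    best, order = matrix_chain_order(P)
    print(P, order, best, left_to_right_cost(P))
```

The third input is a case where the two orders differ: the optimum (A1 x (A2 x A3)) costs 18000, while left-to-right costs 32000.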

### 6.4 Standard Generation Output

In [49]:
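# Standard (non-streaming) generation; for_inference was already called in section 6.2
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,
    use_cache=True,
    do_sample=True,  # sample, so that temperature and min_p take effect
    temperature=1.5,
    min_p=0.1
)

print("Standard generation result:")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])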

Standard generation result:
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
Here is a programming problem for testing:

    **Matrix Chain Multiplication Optimization**

    ### Problem:
    Given a chain of matrices `A1, A2, ..., An`, where the dimensions of Ai are `P[i-1] x P[i]`,
    find the optimal parenthesization order that minimizes the total scalar multiplication cost.

    **Input:**
    1. An array `P` representing dimensions, e.g., P = [10, 20, 30, 40].

    **Output:**
    1. The optimal parenthesization order (e.g., `(A1 x (A2 x A3))`).
    2. The minimum scalar multiplication cost.
    3. A comparison to the naive left-to-right multiplication cost.

    ### Constraints:
    - Use dynamic programming to solve this problem efficiently.
    - Provide a solution for P of length up to 10^5 (optional for advanced testing).

    ### Example:
    Input: P = [10, 20, 30]
    Output:
    - Optimal order: `(A1 x A2)`
    - Minimum cost: 6000
    - Naive cost: 6000
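The generation calls in 6.4 and 6.5 combine a high `temperature=1.5`, which flattens the distribution, with `min_p=0.1`, which discards every token whose probability is below 10% of the most likely token's. A pure-Python sketch of that combination on made-up logits, assuming temperature scaling is applied before the min-p cut (the usual ordering); `temperature_min_p` is a hypothetical helper that illustrates the idea and is not transformers' implementation.

```python
import math


def temperature_min_p(logits, temperature=1.5, min_p=0.1):
    """Temperature-scale logits, then drop tokens below min_p * (max probability) and renormalize."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(weights)
    probs = [w / total for w in weights]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]  # distribution the next token is sampled from


logits = [5.0, 4.0, 0.0, -3.0]  # made-up scores for four candidate tokens
print([round(p, 3) for p in temperature_min_p(logits)])  # [0.661, 0.339, 0.0, 0.0]
```

Without the filter, temperature 1.5 would leave about 2.6% of the probability mass on the two implausible tokens; min-p removes them while keeping the flattened odds between the two plausible ones.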

   

### 6.5 Streaming Generation Output

**Streaming generation with TextStreamer:**

In [50]:
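# Streaming generation (reuses the inputs prepared above)
from transformers import TextStreamer

print("\nStreaming generation result:")
text_streamer = TextStreamer(tokenizer, skip_prompt=True)  # do not echo the input prompt

# Assign the return value so the notebook does not also display the raw token-ID tensor
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    do_sample=True,  # sample, so that temperature and min_p take effect
    temperature=1.5,
    min_p=0.1
)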


Streaming generation result:
To solve the Matrix Chain Multiplication problem using dynamic programming, we need to determine the most efficient way to parenthesize the product of matrices. Here's a step-by-step implementation in Python, along with comments explaining the logic and comparisons:

```python
def matrix_chain_order(P):
    n = len(P) - 1  # Number of matrices

    # Create a DP table to store the minimum cost and corresponding bracketing
    dp = [[0] * n for _ in range(n)]
    bracketing = [[' '] * n for _ in range(n)]

    # Fill the DP table
    for L in range(
```
