
# 大模型（Qwen2.5_Coder_3B） GRPO训练教程

## 环境准备

本教程可在 [AutoDL](https://www.autodl.com/home) 的 4090 GPU 实例上运行。


## 教程内容

本教程将介绍以下内容:

0. [AutoDL配置](#GPU实例) - 如何启动相应配置的GPU实例
1. [安装依赖库](#Install) - 如何安装python依赖包
2. [模型准备](#Model) - 如何下载和初始化模型
3. [数据准备](#Data) - 如何准备和处理训练数据
4. [模型训练](#Train) - 如何训练和优化模型
5. [模型保存](#Save) - 如何保存训练结果
6. [模型推理](#Inference) - 如何使用训练好的模型进行推理


## 0.AutoDL配置
- **为什么选择 AutoDL？**： 相对于其他云服务器厂商，AutoDL卡相对便宜很多，而且操作相对简单，上手成本很低。
- **如何配置？**： 
    - GPU: RTX 4090(24GB) * 1。  
    - 镜像： PyTorch  2.3.0  -->  Python  3.12(ubuntu22.04)  -->  CUDA  12.1


## 1.安装依赖库

In [22]:
!pip install unsloth vllm modelscope datasets
!pip install --upgrade packaging  # 新增依赖升级
!pip install peft

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: http://mirrors.aliyun.com/pypi/simple
[0m

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: http://mirrors.aliyun.com/pypi/simple
[0m

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: http://mirrors.aliyun.com/pypi/simple
[0m

## 2.模型准备

### 2.1 模型加载

In [9]:
from unsloth import FastLanguageModel


PatchFastRL("GRPO", FastLanguageModel)


# 基础配置参数
max_seq_length = 2048 # 最大序列长度
dtype = None # 自动检测数据类型
load_in_4bit = True # 使用4位量化以减少内存使用
lora_rank = 64   # 选择任何大于 0 的数, 建议 8, 16, 32, 64, 128

local_dir = "./models/Qwen2.5-Coder-3B-Instruct"


# 加载预训练模型和分词器
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = local_dir, # "unsloth/Qwen2.5-Coder-32B-Instruct", # 选择Qwen2.5 3B指令模型
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # LoRA秩,控制可训练参数数量
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # 需要训练的目标模块
    lora_alpha = lora_rank, # LoRA缩放因子
    lora_dropout = 0, # LoRA dropout率
    bias = "none", # 是否训练偏置项
    use_gradient_checkpointing = "unsloth", # 使用梯度检查点节省显存
    random_state = 3407, # 随机数种子
    use_rslora = False, # 是否使用稳定版LoRA
    loftq_config = None, # LoftQ配置
)



==((====))==  Unsloth 2025.3.17: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.643 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Unsloth 2025.3.17 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


### 2.2 未经过 GRPO 训练的模型推理

In [27]:
import torch
from transformers import GenerationConfig

# 应用聊天模板
text = tokenizer.apply_chat_template([
    {"role": "user", "content": "How many r's are in strawberry?"}
], tokenize=False, add_generation_prompt=True)

# 配置生成参数
generation_config = GenerationConfig(
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,
)

# 将文本转换为输入张量
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# 使用标准的 generate 方法生成输出
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config
    )

# 解码输出
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("未经过监督微调的模型输出: ", output_text)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


未经过监督微调的模型输出:  system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many r's are in strawberry?
assistant
There are 3 r's in the word "strawberry".


## 3. 数据准备
### 3.1 本地PC准备步骤
datasets 是 Hugging Face 提供的用于加载和处理各种数据集的库。AutoDL上无法直接访问 Hugging Face。 因此数据集 "mlabonne/FineTome-100k" 需要在本地PC下载后，从AutoDL的“文件存储”上传到你所使用的实例存储位置。

本地可以科学上网后，安装 pip install datasets 后，然后运行下面代码。

In [19]:
# 下载数据集 （此段代码本地PC运行）
from datasets import load_dataset
import json

# 下载 gsm8k 数据集的训练集
dataset = load_dataset('openai/gsm8k', 'main', split='train')

# 保存为 JSON 文件
with open('./datasets/gsm8k/gsm8k_train.json', 'w', encoding='utf-8') as f:
    for example in dataset:
        json.dump(example, f, ensure_ascii=False)
        f.write('\n')

ConnectionError: Couldn't reach 'openai/gsm8k' on the Hub (ConnectTimeout)

如果不能科学上网，无法下载对应数据集，我这里也提供了依据下载好的。链接: https://pan.baidu.com/s/1ftrbEn7FHZaYG3CjXEkqzA?pwd=sakn 提取码: sakn 。

随后需要将下载的数据上传到AutoDL 对应的位子 "./datasets/gsm8k"。

### 3.2 数据加载

In [13]:
# 本地数据集路径（请提前上传到AutoDL的 /auto-fs/datasets目录）
import re
import json
from datasets import Dataset


# 加载并准备数据集
SYSTEM_PROMPT = """
响应格式如下：
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    """从文本中提取 XML 格式的答案"""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    """从文本中提取带有哈希的答案"""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

'''
def load_local_gsm8k(file_path="./datasets/gsm8k/gsm8k_train.json"):
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return Dataset.from_list(data)
'''

def load_local_gsm8k(file_path="./datasets/gsm8k/gsm8k_train.json"):
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data.append(json.loads(line.strip()))  # 逐行加载
    return Dataset.from_list(data)

# 加载本地数据集
dataset = load_local_gsm8k()

# 处理数据集
def process_dataset(data):
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data

dataset = process_dataset(dataset)

# 奖励函数
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """正确性奖励函数"""
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    """整数奖励函数"""
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """严格格式奖励函数"""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """软格式奖励函数"""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    """计算 XML 格式的得分"""
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """XML计数奖励函数"""
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

## 4. 模型GRPO训练

### 4.1 数据模型验证


```
/models/
└── gsm8k/Qwen2.5-Coder-3B-Instruct   # 模型

/datasets/
└── gsm8k/
    └── gsm8k_train.json  # 数据集

/outputs/
└── 02_outputs/
    ├── grpo_saved_lora/     # LoRA适配器
    └── qwen-grpo-model/     # ModelScope格式模型
```


In [None]:
# 验证模型加载
from modelscope import Model
import pprint
loaded_model = Model.from_pretrained(local_dir)
assert loaded_model is not None, "模型加载失败"

# 验证数据集
assert len(dataset) > 0, "数据集加载失败"
pprint.pprint(dataset[0])  # 验证数据格式

### 4.2 训练

In [18]:
from trl import GRPOConfig, GRPOTrainer
from unsloth import is_bfloat16_supported

save_path = "./outputs/02_outputs"

training_args = GRPOConfig(
    use_vllm=False,  # 使用 vLLM 进行快速推理！
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    logging_steps=1,
    bf16=is_bfloat16_supported(),  # 是否支持 bfloat16
    fp16=not is_bfloat16_supported(),  # 是否使用 fp16
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # 增加到 4 以获得更平滑的训练
    num_generations=8,  # 如果内存不足，请减少
    max_prompt_length=256,
    max_completion_length=200,
    # num_train_epochs=1,  # 设置为 1 进行完整的训练
    max_steps=250,
    save_steps=250,
    max_grad_norm=0.1,
    report_to="none",  # 可以使用 Weights & Biases
    output_dir=save_path,  # AutoDL推荐输出路径
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

# 启动训练
train_result = trainer.train()

# 指定路径保存训练好的模型
trainer.save_model(save_path)

# 打印训练结果
print("训练结果：", train_result)


Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 119,734,272/3,000,000,000 (3.99% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>
1. Mr. Benson bought 12 tickets.
2. There is a 5% discount for tickets bought that exceed 10. This means the first 10 tickets do not have a discount, and the remaining 2 tickets have a 5% discount.
3. The cost of one ticket without a discount is $40.
4. Calculate the cost of the first 10 tickets: \(10 \times 40 = 400\).
5. Calculate the cost of the discounted tickets: Each discounted ticket costs \(40 - (5\% \times 40) = 40 - 2 = 38\).
6. Calculate the total cost for the 2 discounted tickets: \(2 \times 38 = 76\).
7. Add the costs of the non-discounted and discounted tickets to get the total amount Mr. Benson paid: \(400 +  
Extracted:
<reasoning>
1. Mr. Benson bought 12 tickets.
2. There is a 5% discount for tickets bought that exceed 10. This means the first 1

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / int_reward_func,rewards / correctness_reward_func
1,0.0,0.095375,0.083792,198.75,0.0,0.095375,0.0,0.0,0.0,0.0
2,0.0,0.125,0.0,200.0,0.0,0.125,0.0,0.0,0.0,0.0
3,0.0,0.06975,0.156271,200.0,0.000121,0.06975,0.0,0.0,0.0,0.0
4,0.0,0.174,0.181924,179.75,0.000292,0.1115,0.0,0.0,0.0625,0.0
5,0.0,-0.012875,0.14547,185.875,0.000116,-0.012875,0.0,0.0,0.0,0.0
6,0.0,0.07275,0.147785,200.0,0.000135,0.07275,0.0,0.0,0.0,0.0
7,0.0,0.125,0.0,200.0,0.000166,0.125,0.0,0.0,0.0,0.0
8,0.0,0.167875,0.132088,186.0,0.000283,0.167875,0.0,0.0,0.0,0.0
9,0.0,0.734875,1.129308,187.125,0.000146,0.109875,0.0,0.0,0.125,0.5
10,0.0,0.001375,0.230688,200.0,0.00015,0.001375,0.0,0.0,0.0,0.0


-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
<reasoning>
To determine how much more Jane would pay per month on the house compared to the trailer, we need to calculate the monthly payments for each option and then find the difference.

First, let's calculate the monthly payment for the house:
1. The house costs $480,000.
2. The loan is paid in equal monthly installments over 20 years.

We'll use the formula for the monthly payment of a loan:
\[ P = \frac{L \times r}{1 - (1 + r)^{-n}} \]
where:
- \( P \) is the monthly payment.
- \( L \) is the principal loan amount.
- \( r \) is the monthly interest rate (annual interest rate by 12).
- \( n \) is the number of payments (loan term in years times 12).

For a house:
- \( L = 480,0

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 119,734,272/3,000,000,000 (3.99% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>Mr. Benson bought 12 tickets, which means he bought 2 more than the 10 he needed an exclusive discount for. A normal ticket costs $40, so an exclusive ticket costs 40\*0.95. So an exclusive ticket costs $38. Mr. Benson bought 2 exclusive tickets for $38\*2 and the other 10 normal tickets for $40\*10. So the total cost Mr. Benson paid is $38\*2 + $40\*10 = $388</reasoning>
<answer>388</answer> 
Extracted:
388


Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / int_reward_func,rewards / correctness_reward_func
1,0.0023,1.328125,1.00209,143.0,0.058549,0.078125,0.3125,0.0,0.4375,0.5
2,0.0019,1.046875,1.271138,179.75,0.047626,0.046875,0.25,0.0,0.25,0.5
3,0.0021,2.388625,0.947753,155.25,0.051462,0.076125,0.3125,0.0,0.5,1.5
4,0.0019,2.15625,1.034559,107.875,0.046841,0.09375,0.3125,0.0,0.5,1.25
5,0.005,0.84375,0.895799,80.125,0.124994,0.15625,0.1875,0.0,0.25,0.25
6,0.0011,0.984375,1.285978,189.125,0.027227,0.046875,0.1875,0.0,0.25,0.5
7,0.0019,1.2875,1.1366,165.375,0.047618,0.1625,0.0,0.0,0.375,0.75
8,0.0023,1.4035,1.104066,133.75,0.056409,0.1535,0.125,0.0,0.375,0.75
9,0.0019,1.802375,1.011981,121.5,0.047243,0.114875,0.1875,0.0,0.5,1.0
10,0.0011,1.0,1.198958,185.375,0.026514,0.125,0.125,0.0,0.25,0.5


-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
<reasoning>First, we need to calculate the monthly payment for each loan. The monthly payment can be calculated using the formula: Total cost ÷ (Number of months × 12). For the house, the payment is $480,000 ÷ (20 years × 12 months/year) = $2,200. For the trailer, the payment is $120,000 ÷ (20 years × 12 months/year) = $600. Then, we subtract the trailer payment from the house payment to find the difference: $2,200 - $600 = $1,600.</reasoning>
<answer>1600</answer> 
Extracted:
1600
-------------------- Question:
Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet les

## 5. 模型保存

In [25]:
from peft import PeftModel
from modelscope.models import Model

# 保存 LoRA 适配器
PeftModel.save_pretrained(model, save_path + '/grpo_saved_lora')

# 保存为 ModelScope 格式（可选）
model.save_pretrained(
    save_path + "/qwen-grpo-model",
    tokenizer=tokenizer
)

## 6.模型推理

### 6.1 使用 GRPO 训练的 LoRA 进行推理

In [30]:

import torch
import warnings
from peft import PeftModel
from transformers import GenerationConfig


# 禁用 peft 的 UserWarning（关键修改）
warnings.filterwarnings("ignore", category=UserWarning, module="peft")

# 定义 SYSTEM_PROMPT
SYSTEM_PROMPT = "你是一个知识渊博、友好的助手，能准确回答各种问题。"

# 加载 LoRA 权重
model = PeftModel.from_pretrained(model, save_path+"/grpo_saved_lora")

# 应用聊天模板
text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "How many r's are in strawberry?"}
], tokenize=False, add_generation_prompt=True)

# 配置生成参数
generation_config = GenerationConfig(
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,
)

# 将文本转换为输入张量
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# 使用标准的 generate 方法生成输出
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config
    )

# 解码输出
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("使用监督微调的 LoRA 模型输出: ", output_text)

使用监督微调的 LoRA 模型输出:  system
你是一个知识渊博、友好的助手，能准确回答各种问题。
user
How many r's are in strawberry?
assistant
The word "strawberry" contains 3 'r's.
