### Unsloth

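Goal: use GRPO with the Open R1 math dataset to turn `Qwen3-4B-Base` into a reasoning model.

We first pre-fine-tune the model so that GRPO does not have to spend steps learning the output format, which speeds up GRPO training.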

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "models/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16-bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0! Suggested: 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)
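
To make the `r` and `lora_alpha` settings more concrete, here is a minimal sketch of how many trainable parameters one LoRA adapter adds to a linear layer and what scaling factor is applied to its update. The layer shape below is hypothetical (it is not taken from `Qwen3-4B-Base`), the helper names are made up for illustration, and the `alpha / rank` scaling is the standard LoRA convention; other scaling modes exist and are not shown.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scales the low-rank update B @ A by alpha / rank."""
    return alpha / rank

lora_rank = 32
# Hypothetical square projection, for illustration only.
d_in, d_out = 4096, 4096

full = d_in * d_out
extra = lora_extra_params(d_in, d_out, lora_rank)
print(f"full layer: {full:,} params, LoRA adds: {extra:,} ({extra / full:.2%})")
print("scaling with lora_alpha = 2 * rank:", lora_scaling(lora_rank * 2, lora_rank))
```

Doubling `lora_alpha` relative to the rank therefore doubles the effective size of the adapter update, which is one way to read the "*2 speeds up training" comment.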

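### GRPO chat template
Since we are using a base model, we should set a chat template. You can also write your own chat template!
1. DeepSeek uses `<think>` and `</think>`, but this is **not** required; you can customize the tags however you like.
2. A `system_prompt` is recommended to at least guide the model's responses.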

In [None]:
reasoning_start = "<start_working_out>" # Acts as <think>
reasoning_end   = "<end_working_out>"   # Acts as </think>
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""
system_prompt

We create a simple chat template below. Note that `add_generation_prompt` appends `<start_working_out>` to the end of the prompt to nudge the model to start its reasoning process.

In [None]:
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

# Replace with our specific template:
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template

Let's see how our chat template behaves on an example:

In [None]:
tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"{reasoning_start}I think it's 2.{reasoning_end}{solution_start}2{solution_end}"},
    {"role" : "user", "content" : "What is 2+2?"},
], tokenize = False, add_generation_prompt = True)
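
As a rough guide to what the template produces, the following pure-Python sketch mirrors its logic without a tokenizer. `render_chat` is a hypothetical helper, the default system text is a shortened stand-in for `system_prompt`, and the EOS string `<|endoftext|>` is an assumption for illustration; the real value comes from `tokenizer.eos_token`.

```python
EOS = "<|endoftext|>"  # assumed placeholder; use tokenizer.eos_token in practice
REASONING_START = "<start_working_out>"
DEFAULT_SYSTEM = "You are given a problem."  # shortened stand-in for system_prompt

def render_chat(messages, add_generation_prompt=False):
    """Mimic the Jinja template: system + EOS, raw user turns, assistant turns + EOS."""
    if messages and messages[0]["role"] == "system":
        out = messages[0]["content"] + EOS
        rest = messages[1:]
    else:
        out = DEFAULT_SYSTEM + EOS
        rest = messages
    for m in rest:
        if m["role"] == "user":
            out += m["content"]
        elif m["role"] == "assistant":
            out += m["content"] + EOS
    if add_generation_prompt:
        out += REASONING_START  # nudge the model to begin reasoning
    return out

print(render_chat(
    [{"role": "user", "content": "What is 2+2?"}],
    add_generation_prompt=True,
))
```

Because the prompt already ends with `<start_working_out>`, the model's completion begins inside the reasoning section, which is why the reward functions later do not score that tag.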

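### Pre fine-tuning for formatting
We now use a subset of NVIDIA's [Open Math Reasoning dataset](https://huggingface.co/datasets/nvidia/OpenMathReasoning), filtered to contain only high-quality DeepSeek R1 reasoning traces.

We keep only about 59 examples, which is enough to first "prime" (pre-fine-tune) the model so that it understands our custom GRPO format.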

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]

# Try converting to a number; anything that fails becomes NaN
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
# Keep only rows with numeric answers
dataset = dataset.iloc[np.where(is_number)[0]]

dataset

We have to format the dataset to follow our GRPO-style formatting:

In [None]:
def format_dataset(x):
    expected_answer = x["expected_answer"]
    problem = x["problem"]

    # Remove the generated <think> and </think> tags
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")

    # Strip leading and trailing whitespace (including newlines)
    thoughts = thoughts.strip()
    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + expected_answer + solution_end
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : problem},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_dataset, axis = 1)
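
For a single row, the string handling looks roughly like the sketch below. The sample text is made up for illustration, and `to_assistant_text` is a hypothetical stand-alone version of the logic inside `format_dataset`.

```python
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

def to_assistant_text(generated_solution: str, expected_answer: str) -> str:
    """Strip DeepSeek-style think tags and re-wrap the trace in the custom tags."""
    thoughts = generated_solution.replace("<think>", "").replace("</think>", "").strip()
    return (reasoning_start + thoughts + reasoning_end
            + solution_start + expected_answer + solution_end)

# Made-up example row.
print(to_assistant_text("<think>\n1 + 1 = 2\n</think>\n", "2"))
```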

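Check that it worked:

In [None]:
# .iloc[0] takes the first remaining row; label 0 may have been filtered out above
tokenizer.apply_chat_template(dataset["Messages"].iloc[0], tokenize = False)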

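Let's restrict the pre-fine-tuning dataset to examples of at most `max_seq_length/2` tokens, since we don't want overly long reasoning traces.

Note that this might take about 2 minutes!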

In [None]:
dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x)))

dataset = dataset.loc[dataset["N"] <= max_seq_length/2].copy()
dataset.shape

Then we render the messages to text with the chat template and convert the result to a Hugging Face compatible dataset:

In [None]:
from datasets import Dataset

dataset["text"] = tokenizer.apply_chat_template(dataset["Messages"].values.tolist(), tokenize = False)
dataset = Dataset.from_pandas(dataset)
dataset

Now let's pre-fine-tune the model so that it follows our custom GRPO formatting!

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 1, # Use GA to mimic a larger batch size!
        warmup_steps = 5,
        num_train_epochs = 2, # Number of passes over the small priming set
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Set to "wandb" etc. to enable experiment tracking
    ),
)

In [None]:
trainer.train()

Let's check whether the model has learned to follow the custom format:

In [None]:
text = tokenizer.apply_chat_template(
    dataset[0]["Messages"][:2],
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 0,
    max_new_tokens = 1024,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

Yes, it did follow the formatting! Great! Let's free some memory before the GRPO step.

In [None]:
del dataset
torch.cuda.empty_cache()
import gc
gc.collect()

### Data Prep
<a name="Data"></a>

We're using Hugging Face's [Open R1 Math dataset](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed). You can also use OpenAI's well-known [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k).

In [None]:
from datasets import load_dataset
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")
dataset

Let's look at the first row:

In [None]:
dataset[0]["prompt"]

In [None]:
dataset[0]["solution"]

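In GSM8K, every answer contains a `####` marker followed by the final answer, so we would extract the text after it. For the Open R1 dataset we can skip this step, so the function below simply returns its input.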

In [None]:
def extract_hash_answer(text):
    # if "####" not in text: return None
    # return text.split("####")[1].strip()
    return text
extract_hash_answer(dataset[0]["solution"])

Let's map the dataset and look at the first row:

In [None]:
dataset = dataset.map(lambda x: {
    "prompt" : [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["prompt"]},
    ],
    "answer": extract_hash_answer(x["solution"]),
})
dataset[0]

We create a regular expression that matches the end of the reasoning section and captures the answer:

In [None]:
import re

# Add optional EOS token matching
solution_end_regex = r"</SOLUTION>[\s]{0,}" + \
    "(?:" + re.escape(tokenizer.eos_token) + ")?"

match_format = re.compile(
    rf"{reasoning_end}.*?"\
    rf"{solution_start}(.+?){solution_end_regex}"\
    rf"[\s]{{0,}}$",
    flags = re.MULTILINE | re.DOTALL
)
match_format

We verify that it works:

In [None]:
match_format.findall(
    "Let me think!<end_working_out>"\
    f"<SOLUTION>\n2\n</SOLUTION>",
)

In [None]:
match_format.findall(
    "<start_working_out>Let me think!<end_working_out>"\
    f"<SOLUTION>  2  </SOLUTION>\n\n",
)
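
It is also worth seeing which completions the pattern rejects. The sketch below rebuilds the same regex in a self-contained way and runs it on a few made-up strings; the EOS string `<|endoftext|>` is an assumption standing in for `tokenizer.eos_token`.

```python
import re

reasoning_end  = "<end_working_out>"
solution_start = "<SOLUTION>"
eos_token = "<|endoftext|>"  # assumption; the notebook reads tokenizer.eos_token

solution_end_regex = r"</SOLUTION>[\s]{0,}" + "(?:" + re.escape(eos_token) + ")?"
match_format = re.compile(
    rf"{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end_regex}"
    rf"[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

samples = {
    "well formed":     "thinking<end_working_out><SOLUTION>42</SOLUTION>",
    "with EOS":        "thinking<end_working_out><SOLUTION>42</SOLUTION><|endoftext|>",
    "missing end tag": "thinking<SOLUTION>42</SOLUTION>",
    "trailing text":   "thinking<end_working_out><SOLUTION>42</SOLUTION> So the answer is 42.",
}
for name, text in samples.items():
    m = match_format.search(text)
    print(f"{name:16s} ->", m.group(1) if m else None)
```

The first two strings yield `42`, while the last two yield `None`: the pattern requires `<end_working_out>` before the solution and nothing but whitespace (and an optional EOS token) after `</SOLUTION>`.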

Now we create a reward function that matches the format exactly; a match earns 3 points:

In [None]:
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match the format exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

If that fails, we still want to reward the model for following the format partially, by counting each tag:

In [None]:
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
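        # Count how many times each tag appears; penalize if there are too many!
        # A tag that appears exactly once earns points.

        # No need to reward <start_working_out> since we always prepend it!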
        # score += 0.5 if response.count(reasoning_start) == 1 else -1.0
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        score += 0.5 if response.count(solution_start)  == 1 else -1.0
        score += 0.5 if response.count(solution_end)    == 1 else -1.0
        scores.append(score)
    return scores
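
To make the arithmetic concrete, here is a small stand-alone sketch of the same counting rule applied to three made-up responses. `approx_score` is a hypothetical single-response version of the function above.

```python
reasoning_end  = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end   = "</SOLUTION>"

def approx_score(response: str) -> float:
    """+0.5 for each tag that appears exactly once, -1.0 otherwise."""
    score = 0.0
    for tag in (reasoning_end, solution_start, solution_end):
        score += 0.5 if response.count(tag) == 1 else -1.0
    return score

print(approx_score("work<end_working_out><SOLUTION>2</SOLUTION>"))   # all three tags once
print(approx_score("work<end_working_out>2"))                        # both solution tags missing
print(approx_score("<SOLUTION>2</SOLUTION><SOLUTION>3</SOLUTION>"))  # no end tag, duplicated tags
```

The scores range from 1.5 (every tag exactly once) down to -3.0 (every tag missing or repeated), so a partially formatted response still gets a graded signal.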

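Finally, we extract the generated answer and reward or penalize it! We also give partial reward based on how close the answer is to the true one, measured as a ratio: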

In [None]:
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
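        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if only surrounding whitespace differs, but with less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward the answer if it is close in ratio,
            # i.e. if it falls within some range of the true answer.
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except (ValueError, ZeroDivisionError):
                score -= 4.5 # Penalize non-numeric answers (or a zero true answer)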
        scores.append(score)
    return scores
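
The ratio-based fallback can be illustrated in isolation. The sketch below is a hypothetical helper, `ratio_score`, that reproduces only that branch with the same thresholds and penalties; the inputs are made up.

```python
def ratio_score(guess: str, true_answer: str) -> float:
    """Partial credit from check_answer's fallback branch, isolated for illustration."""
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -4.5  # not parseable as numbers, or true answer is zero
    if 0.9 <= ratio <= 1.1:
        return 2.0
    if 0.8 <= ratio <= 1.2:
        return 1.5
    return -2.5

for guess in ["95", "115", "300", "abc"]:
    print(guess, "->", ratio_score(guess, "100"))
```

One consequence of dividing by the true answer is that a true answer of zero always lands in the exception branch, so a numerically equal but differently written guess such as `0.0` is penalized rather than partially rewarded.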

Sometimes the answer is not a bare number but a sentence, for example "The solution is $20", from which we extract 20.

We also remove any commas, as in 123,456.

In [None]:
match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)
print(match_numbers.findall("<SOLUTION>  0.34  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  123,456  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  -0.234  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>17</SOLUTION>"))
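
Putting extraction and conversion together, a minimal sketch might look as follows. `extract_number` is a hypothetical helper that combines the regex above with the comma removal and `float` conversion used later; the inputs are made up.

```python
import re

solution_start = "<SOLUTION>"
match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

def extract_number(text: str):
    """Return the first number after <SOLUTION> as a float, or None if absent/unparseable."""
    m = match_numbers.search(text)
    if m is None:
        return None
    try:
        return float(m.group(1).strip().replace(",", ""))  # drop thousands separators
    except ValueError:
        return None

print(extract_number("<SOLUTION>The solution is $20</SOLUTION>"))
print(extract_number("<SOLUTION>  123,456  </SOLUTION>"))
print(extract_number("<SOLUTION>no digits here</SOLUTION>"))
```

Note that the character class also accepts lone periods and commas, so a sentence containing punctuation before the number (for example "approx. 20") would capture the period first and fail to convert.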

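We now prepare our main function, which prints the generated response and the true answer, along with another reward function that converts the text to a number via `float` and checks whether it equals the true answer.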

In [None]:
global PRINTED_TIMES
PRINTED_TIMES = 0
global PRINT_EVERY_STEPS
PRINT_EVERY_STEPS = 5

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Print only every few steps
    global PRINTED_TIMES
    global PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(
            '*'*20 + f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}"
        )
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            # Remove commas, as in 123,456
            guess       = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess == true_answer else -1.5)
        except:
            scores.append(0)
            continue
    return scores

Get the 90th percentile of prompt lengths so we don't accidentally truncate prompts!

In other words, we remove the longest 10% of prompts.

In [None]:
tokenized = dataset.map(
    lambda x: {"tokens" : tokenizer.apply_chat_template(x["prompt"], add_generation_prompt = True, tokenize = True)},
    batched = True,
)
print(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L" : len(x["tokens"])})

import numpy as np
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print("Max Length = ", maximum_length)

# Keep only samples at or below the 90th-percentile length
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized
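
The cutoff logic can be checked on a toy list without NumPy. The sketch below uses a hypothetical `quantile_linear` helper that interpolates linearly between order statistics, which is NumPy's default quantile method; the lengths are made up.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between order statistics (NumPy's default method)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

# Made-up prompt lengths in tokens.
lengths = [12, 15, 18, 20, 22, 25, 30, 41, 64, 120]
maximum_length = int(quantile_linear(lengths, 0.9))
kept = [n for n in lengths if n <= maximum_length]
print(maximum_length, len(kept), "of", len(lengths))
```

Here the cutoff is 69 tokens and nine of the ten prompts survive; the single 120-token outlier is dropped instead of forcing a much larger `max_prompt_length`.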

<a name="Train"></a>
### Train the model

Now set up the GRPO Trainer and all its configuration!

In [None]:
max_prompt_length = maximum_length + 1 # + 1 just in case!
max_completion_length = max_seq_length - max_prompt_length

from vllm import SamplingParams
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 100,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",

    # For optional training + evaluation
    # fp16_full_eval = True,
    # per_device_eval_batch_size = 4,
    # eval_accumulation_steps = 1,
    # eval_strategy = "steps",
    # eval_steps = 1,
)

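Let's run the trainer! If you scroll up in the training output, you'll see a table of rewards like the sample below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps before anything happens, and you will probably get 0 reward for the first 100 steps. Please be patient! Note that `max_steps` is set to 100 above, so raise it if you want to train long enough to see this.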

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args = training_args,
    train_dataset = dataset,

    # For optional training + evaluation
    # train_dataset = new_dataset["train"],
    # eval_dataset = new_dataset["test"],
)
trainer.train()
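
With several reward functions, each completion receives one score per function. The sketch below assumes TRL's default behavior of summing those scores with equal weights (no `reward_weights` are set here) and uses the constants from the four functions defined earlier to show the range of totals.

```python
# Best-case score from each reward function defined earlier.
best_case = {
    "match_format_exactly":       3.0,  # full format matched
    "match_format_approximately": 1.5,  # 3 tags x 0.5
    "check_answer":               5.0,  # exact string match
    "check_numbers":              3.5,  # numeric match
}
# Scores when no tags are produced at all, so nothing can be extracted.
no_format_case = {
    "match_format_exactly":        0.0,
    "match_format_approximately": -3.0,  # 3 tags x -1.0
    "check_answer":               -2.0,  # match_format found nothing
    "check_numbers":              -2.5,  # no number found after <SOLUTION>
}
print("max total reward:", sum(best_case.values()))
print("total with no tags at all:", sum(no_format_case.values()))
```

Under that assumption a fully correct, well-formatted numeric answer earns 13.0 in total, while a completion with no tags at all earns -7.5, which gives a sense of scale for the `reward` column.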

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [None]:
text = "What is the sqrt of 101?"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Now with the LoRA we just trained with GRPO. We first save the LoRA:

In [None]:
model.save_lora("grpo_saved_lora")

Verify that the LoRA was actually trained!

In [None]:
from safetensors import safe_open

with safe_open("grpo_saved_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify that neither the A nor the B matrices are all zeros
    for key in f.keys():
        tensor = f.get_tensor(key)
        n_zeros = (tensor == 0).sum().item() # Number of zero entries
        assert n_zeros != tensor.numel(), f"{key} is all zeros"

Now we load the LoRA and test:

In [None]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

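Our reasoning model is much better. It is not always correct, since we only trained it for about an hour; it will improve if we extend the sequence length and train for longer!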

<a name="Save"></a>
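### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can get a personal token at https://huggingface.co/settings/tokens.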

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

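### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods, such as `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To fine-tune and auto-export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)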

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want several!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system such as Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM developments, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving fine-tunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision fine-tuning - radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks) for notebooks on DPO, ORPO, continued pretraining, conversational fine-tuning, and more!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a></i> ⭐️
</div>
