<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/mini_deepseek_r1_aha_grpo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

from: https://www.philschmid.de/mini-deepseek-r1

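# Mini-R1: Reproducing the Deepseek R1 "aha moment", an RL tutorial

The release of Deepseek R1 shocked the industry. Why? DeepSeek-R1 is an open model that rivals OpenAI's o1 in complex reasoning tasks, trained using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach. They not only released the model, but also a research paper on how they did it.

In the paper, they describe an "aha moment" that occurred when training the model with pure RL. During this phase, DeepSeek-R1-Zero (the first test of DeepSeek-R1) learns to allocate more thinking time to a problem by reevaluating its initial approach, without any human feedback or data describing how to do it. They describe this as an "aha moment" because:

> This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

It is a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future. In this blog post we want to recreate the small "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game. We will train an open model using reinforcement learning, trying to teach it self-verification and search abilities all on its own to solve the Countdown Game. The Countdown Game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach or get as close as possible to a target number.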

```
Target Number: 952
Available Numbers: 25, 50, 75, 100, 3, 6

((100 + 6) × 3 × 75 - 50) / 25 = 952
```
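
To make the puzzle concrete, here is a minimal brute-force sketch of a Countdown solver. It is not part of the original tutorial: the function names are hypothetical, and it assumes the stricter variant used for training later in this post, where every number must be used exactly once. It repeatedly combines two values with one of the four operations until a single value remains.

```python
from fractions import Fraction
from itertools import combinations


def solve_countdown(numbers, target):
    """Return an expression that hits `target` using every number exactly once, or None."""
    # Each item pairs an exact value with the expression that produced it.
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value is left, so all numbers have been consumed.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two values, combine them, and recurse on the smaller list.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
            (a * b, f"({ea} * {eb})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


print(solve_countdown([19, 36, 55, 7], 65))
```

Exhaustive search like this is cheap for 3 to 4 numbers. The point of the tutorial is that the model has to discover a comparable search-and-verify strategy on its own, in text.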

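This blog post includes interactive code that you can run in a Jupyter Notebook to learn how to train a model using GRPO and Q-Lora. This is a great way to learn how to use TRL and GRPO, but it is very slow and requires a lot of compute. Additionally, I added a [script](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_r1_grpo.py) and instructions to run the training on a node with multiple GPUs or a SLURM cluster.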

1.  [Setup the development environment](#1-setup-the-development-environment)
2.  [Generate training samples with reasoning prefix from the Countdown Game](#2-generate-training-samples-with-reasoning-prefix-from-the-countdown-game)
3.  [Train the model using GRPO](#3-train-the-model-using-grpo)
4.  [Distributed Training example for GRPO using Deepspeed and vLLM](#4-distributed-training-example-for-grpo-using-deepspeed-and-vllm)
5.  [Results and Training Observations](#5-results-and-training-observations)

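*Note: This blog is inspired by [Jiayi Pan](https://x.com/jiayi_pirate/status/1882839370505621655), who initially explored the idea and validated it with a small model.*

But before we start, let's take a look at [Group Relative Policy Optimization (GRPO)](https://arxiv.org/abs/2402.03300) and understand how it works.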

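## Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm to improve the reasoning capabilities of LLMs. It was introduced in the [DeepSeekMath](https://arxiv.org/abs/2402.03300) paper in the context of mathematical reasoning. GRPO modifies the traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates baselines from group scores, reducing memory usage and computational overhead. GRPO, now also used by the Qwen team, can be used with rule-based/binary rewards as well as general reward models to improve a model's helpfulness.

1.  **Sampling**: Generate multiple outputs for each prompt using the current policy.
2.  **Reward Scoring**: Each generation is scored using a reward function, which can be rule-based or outcome-based.
3.  **Advantage Calculation**: The average reward of the generated outputs is used as a baseline. The advantage of each solution within the group is then computed relative to this baseline. The reward is normalized within the group.
4.  **Policy Optimization**: The policy tries to maximize the GRPO objective, which includes the calculated advantages and a KL divergence term. This is different from how PPO implements the KL term within the reward.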

![grpo.png](https://raw.githubusercontent.com/philschmid/deep-learning-pytorch-huggingface/59b37973074de90004d10e5ff636f98160c9743a/assets/grpo.png)
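
The advantage calculation in step 3 can be sketched in a few lines. This is an illustration rather than the TRL implementation: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made here, and real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group (all completions sampled for a single prompt)."""
    baseline = mean(rewards)   # the group mean replaces a learned value function
    spread = pstdev(rewards)   # assumption: population standard deviation
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four completions for the same prompt, each scored by two 0/1 rewards that are summed.
print(group_relative_advantages([2.0, 1.0, 0.0, 1.0]))
```

Note that when every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal. This is one reason why very small groups (such as 2 generations per prompt) learn slowly.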

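## 1. Setup the development environment

Our first step is to install the Hugging Face libraries and Pytorch, vllm, trl, transformers, and datasets. If you haven't heard of trl yet, don't worry. It is a library built on top of transformers and datasets that makes it easier to fine-tune open LLMs, apply RLHF, and align them.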



In [None]:
# Install Pytorch & other libraries, make sure to match your GPU driver version
%pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1  tensorboard "setuptools<71.0.0"  --index-url https://download.pytorch.org/whl/cu121

# Install flash-attn
%pip install flash-attn --no-build-isolation

# Install Hugging Face libraries
%pip install  --upgrade \
  "transformers==4.48.1" \
  "datasets==3.1.0" \
  "accelerate==1.3.0" \
  "hf-transfer==0.1.9" \
  "deepspeed==0.15.4" \
  "trl==0.14.0"

# install vLLM
%pip install "vllm==0.7.0"

## IMPORTANT: If you want to run the notebook and the interactive cells, you also need to install the following libraries.
# Read the blog post first and then decide, as they might conflict with the libraries for distributed training.
# %pip install "peft==0.14.0" "bitsandbytes==0.45.0"


In [7]:
!pip install numpy==1.26.4





In [2]:
!pip list | grep -E "flash|transformers|numpy|torch|trl|deepspeed|hf-transfer|datasets|tensorboard"

datasets                              3.1.0
deepspeed                             0.15.4
fastrlock                             0.8.3
flash_attn                            2.7.4.post1
numpy                                 1.26.4
sentence-transformers                 4.1.0
tensorboard                           2.18.0
tensorboard-data-server               0.7.2
tensorflow-datasets                   4.9.9
torch                                 2.5.1+cu121
torchao                               0.10.0
torchaudio                            2.5.1+cu121
torchdata                             0.11.0
torchsummary                          1.5.1
torchtune                             0.6.1
torchvision                           0.20.1+cu121
transformers                          4.48.1
trl                                   0.14.0
vega-datasets                         0.9.0


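_Note: You may need to restart the kernel to use the updated packages._

We will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. This means we will automatically push our model, logs, and information to the Hub during training. You must register on [Hugging Face](https://huggingface.co/join) for this. After you have an account, we will use the `login` util from the `huggingface_hub` package to log into our account and store our token (access key) on disk.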

In [1]:
from huggingface_hub import login
from google.colab import userdata

login(token=userdata.get('HF_TOKEN'), add_to_git_credential=True) # ADD YOUR TOKEN HERE

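## 2. Generate training samples with reasoning prefix from the Countdown Game

We are going to use the [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset, which contains samples with 3 to 4 numbers and solutions.

As the model, we are going to use [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), a 3B parameter instruction-tuned model. This makes it easier to showcase the "aha moment", as it already follows the prompt format. But you can use the base version of Qwen or other models as well. [Jiayi-Pan](https://x.com/jiayi_pirate/status/1882839487417561307) found that a model needs a certain quality to be able to learn the reasoning process, starting at more than 1.5B parameters.

In this notebook, the 0.5B model is used on an L4 GPU for debugging and to verify that the pipeline works. For actual training, use a model with more than 1.5B parameters.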

In [2]:
from transformers import AutoTokenizer
from datasets import load_dataset

# Load dataset from Hugging Face Hub
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
dataset = load_dataset(dataset_id, split="train")
# select a random subset of 50k samples
dataset = dataset.shuffle(seed=42).select(range(50000))

# model_id = "Qwen/Qwen2.5-3B-Instruct"
model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # small model, used here for debugging only
# Load the tokenizer of the selected model to format the dataset to our "r1" prompt
tokenizer = AutoTokenizer.from_pretrained(model_id)

# generate r1 prompt with a prefix for the model to already start with the thinking process
def generate_r1_prompt(numbers, target):
    r1_prefix = [{
        "role": "system",
        "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."
      },
      {
        "role": "user",
        "content": f"Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 = 1 </answer>."
      },
      {
        "role": "assistant",
        "content": "Let me solve this step by step.\n<think>"
      }]
    return {"prompt": tokenizer.apply_chat_template(r1_prefix, tokenize=False, continue_final_message=True), "target": target}

# convert our dataset to the r1 prompt
dataset = dataset.map(lambda x: generate_r1_prompt(x["nums"], x["target"]))

# split the dataset into train and test
train_test_split = dataset.train_test_split(test_size=0.1)

train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]


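## 3. Train the model using GRPO

_Note: Section 3 shows the basics of how to use TRL and GRPO. If you want to run the interactive cells, you need to install `bitsandbytes` and `peft`, as the `Trainer` class requires them. This section is mostly for educational purposes._

TRL supports Group Relative Policy Optimization (GRPO) through a dedicated [GRPOTrainer](https://huggingface.co/docs/trl/main/en/grpo_trainer) for aligning LLMs from preference data, as described in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). The `GRPOTrainer` is a subclass of the `Trainer` from the `transformers` library and supports all the same features, including logging, checkpointing, distributed training, and parameter-efficient fine-tuning (PEFT).

The `GRPOTrainer` supports generic Outcome Reward Models (ORM) and custom reward functions, which can be used to implement rule-based reward models. In the Deepseek R1 paper, they implemented rule-based reward models to verify the correctness of the generated solutions. In our example we take a similar approach and create 2 reward functions:
1. Format Reward: Checks whether the generated format is correct: `<think> [thinking] </think><answer> [answer] </answer>`
2. Accuracy Reward: Extracts the equation from the `<answer>` tag, evaluates it against the target, and checks that every number is used exactly once.

_Note: A correct `<answer>` in our example contains the equation, for example `<answer> 55 + 36 - 7 - 19 </answer>`_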

In [3]:
import re

def format_reward_func(completions, target, **kwargs):
    """
    Format: <think>...</think><answer>...</answer>
    Args:
        completions (list[str]): Generated outputs
        target (list[str]): Expected answers

    Returns:
        list[float]: Reward scores
    """
    rewards = []

    for completion, gt in zip(completions, target):

      try:
        # add synthetic <think> as its already part of the prompt and prefilled for the assistant to more easily match the regex
        completion = "<think>" + completion
        # Check if the format is correct
        regex = r"^<think>([^<]*(?:<(?!/?think>)[^<]*)*)<\/think>\n<answer>([\s\S]*?)<\/answer>$"

        match = re.search(regex, completion, re.DOTALL)
        # if the format is not correct, reward is 0
        if match is None or len(match.groups()) != 2:
            rewards.append(0.0)
        else:
            rewards.append(1.0)
      except Exception:
        rewards.append(0.0)
    return rewards

def equation_reward_func(completions, target, nums, **kwargs):
    """
    Evaluates completions based on the mathematical correctness of the answer:
    the equation must use every available number exactly once and evaluate to the target.

    Args:
        completions (list[str]): Generated outputs
        target (list[str]): Expected answers
        nums (list[str]): Available numbers

    Returns:
        list[float]: Reward scores
    """
    rewards = []
    for completion, gt, numbers in zip(completions, target, nums):
      try:
        # add synthetic <think> as its already part of the prompt and prefilled for the assistant to more easily match the regex
        completion = "<think>" + completion
        # Check if the format is correct
        match = re.search(r"<answer>(.*?)<\/answer>", completion)
        if match is None:
            rewards.append(0.0)
            continue
        # Extract the "answer" part from the completion
        equation = match.group(1).strip()
        # Extract all numbers from the equation
        used_numbers = [int(n) for n in re.findall(r'\d+', equation)]

        # Check if all numbers are used exactly once
        if sorted(used_numbers) != sorted(numbers):
            rewards.append(0.0)
            continue
        # Define a regex pattern that only allows numbers, operators, parentheses, and whitespace
        allowed_pattern = r'^[\d+\-*/().\s]+$'
        if not re.match(allowed_pattern, equation):
            rewards.append(0.0)
            continue

        # Evaluate the equation with builtins disabled and empty locals
        result = eval(equation, {"__builtins__": None}, {})
        # Check if the equation is correct and matches the ground truth
        if abs(float(result) - float(gt)) < 1e-5:
            rewards.append(1.0)
        else:
            rewards.append(0.0)
      except Exception:
            # If evaluation fails, reward is 0
            rewards.append(0.0)
    return rewards
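
The accuracy reward above relies on `eval` guarded by a character whitelist. As an alternative, not used in the original tutorial, the equation could be evaluated by walking its syntax tree and allowing only arithmetic nodes. The sketch below is one hedged way to do that; `safe_arithmetic_eval` is a hypothetical helper name.

```python
import ast
import operator

# Only the four basic operations are permitted.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_arithmetic_eval(expression):
    """Evaluate +, -, *, / over numeric literals; raise ValueError for anything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Names, calls, attribute access, powers, etc. are all rejected.
        raise ValueError("unsupported expression")

    try:
        tree = ast.parse(expression.strip(), mode="eval")
    except SyntaxError as exc:
        raise ValueError("not a valid expression") from exc
    return _eval(tree)


print(safe_arithmetic_eval("80 + (70 / 14)"))
```

A `ZeroDivisionError` still propagates from this helper, so a reward function using it would keep a surrounding `try`/`except` that maps any failure to a reward of 0.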

Let's try our reward functions with a few samples.

_Note: None of the examples start with `<think>`, as we add it synthetically to the prompt._

In [4]:
correct_sample_1 = """We need to find an equation using the numbers 19, 36, 55, and 7
exactly once, with basic arithmetic operations, that equals 65. One possible
combination is 55 + 36 - 19 + 7... </think>
<answer> 55 + 36 - 7 - 19 </answer>"""

correct_sample_2 = """ ... </think>
<answer> 55 + 36 - 7 - 19 </answer>"""

wrong_format = """User: Using the numbers [19, 36, 55, 7], create an equation that equals 65."""

wrong_format_2 = """To find the equation that equals 79 using the numbers 95, 78, 6, 88, I'll start by adding 88 and 95:
95 + 88 = 183
Now, let's subtract 104 from 183 to get 79:
183 - 104 = 79
<think> 183 - 104 = 79 </think><think> 183 - 104 = 79 </think><answer> 183 - 104 = 79 </answer>"""

wrong_result = """ ... </think>
<answer> 55 + 36 - 7 - 18 </answer>"""


test_rewards = format_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=["65", "65", "65", "65", "65"], nums=[[19, 36, 55, 7]] * 5)
assert test_rewards == [1.0, 1.0, 0.0, 0.0, 1.0], "Reward function is not working"
test_rewards = equation_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=["65", "65", "65", "65", "65"], nums=[[19, 36, 55, 7]] * 5)
assert test_rewards == [1.0, 1.0, 0.0, 0.0, 0.0], "Reward function is not working"

This looks good. Now let's define the remaining training parameters, create a trainer, and start training.

In [6]:
from trl import GRPOConfig, GRPOTrainer, get_peft_config, ModelConfig

# our model we are going to use as policy
model_config = ModelConfig(
    model_name_or_path=model_id,
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
    use_peft=True,
    load_in_4bit=True,
)

# Hyperparameters
training_args = GRPOConfig(
    output_dir="qwen-r1-aha-moment",
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    logging_steps=10,
    max_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=True,
    # GRPO specific parameters
    max_prompt_length=256,
    max_completion_length=1024, # max length of the generated output for our solution
    num_generations=2,
    beta=0.001,

)
trainer = GRPOTrainer(
    model=model_config.model_name_or_path,
    reward_funcs=[format_reward_func, equation_reward_func],
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=get_peft_config(model_config),
)


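We can start training by calling the `train` method on the trainer instance.

_Note: Reinforcement learning training is very slow and compute intensive. Running a single step for the 3B model on 1x L4 GPU with Q-LoRA, a batch size of 1, and only 2 generations per sample takes more than 20 minutes. Training the 0.5B model for 100 steps takes roughly 1 hour and is only meant as a sanity check._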

In [7]:
# Train the model
trainer.train()
# Save model
trainer.save_model(training_args.output_dir)





Step,Training Loss
10,0.0
20,0.0
30,0.0
40,0.0
50,0.0
60,0.0
70,0.0
80,0.0
90,0.0
100,0.0


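## 4. Distributed Training example for GRPO using Deepspeed and vLLM

More than 20 minutes per step with only 2 generations per sample is not feasible. We need to scale up our training. Hugging Face TRL added support for distributed training with Deepspeed and for faster generation with vLLM. I prepared a [run_r1_grpo.py](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_r1_grpo.py) script and a [receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml) config file to run the training.

This configuration was tested and validated on a node with 4x H100 80GB, where a single step takes around 45-60s, as we can leverage vLLM for generation and DeepSpeed for distributed training. Therefore we need to make sure `num_processes` is set to the number of GPUs you have minus 1, as the last GPU is used with vLLM for generation. If you are using more GPUs, you need to change `vllm_device` in the config file to the last GPU index, e.g. if you have 8 GPUs, you need to set `vllm_device=7` and `num_processes` to 7.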
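
The GPU split described above is simple enough to express as a small helper. This is only an illustration: `plan_grpo_devices` and the key `vllm_gpu_index` are hypothetical names, and the exact value format expected by `vllm_device` should be taken from the config file.

```python
def plan_grpo_devices(num_gpus):
    """Reserve the last GPU for vLLM generation and use the others for training."""
    if num_gpus < 2:
        raise ValueError("need at least 2 GPUs: one for vLLM and one for training")
    return {
        "num_processes": num_gpus - 1,   # passed to `accelerate launch --num_processes`
        "vllm_gpu_index": num_gpus - 1,  # zero-based index of the last GPU
    }


print(plan_grpo_devices(4))
print(plan_grpo_devices(8))
```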

Command to run the training:
```bash
accelerate launch --num_processes 3 --config_file configs/accelerate_configs/deepspeed_zero3.yaml scripts/run_r1_grpo.py --config receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml
```

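With the optimized distributed training, a single step with 8 generations per sample on 4x H100 80GB takes around 45-60s. The full training for 450 steps takes around 6 hours.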
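
As a quick sanity check on these figures, the step time and step count can be converted into wall-clock hours. The helper name below is hypothetical.

```python
def training_hours(steps, seconds_per_step):
    """Convert a step count and a per-step duration into wall-clock hours."""
    return steps * seconds_per_step / 3600


low = training_hours(450, 45)
high = training_hours(450, 60)
print(f"{low:.2f} to {high:.2f} hours")
```

The range works out to roughly 5.6 to 7.5 hours, so a total of about 6 hours corresponds to an average near the faster end, around 48 seconds per step.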

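## 5. Results and Training Observations

The script saves random completions to the `completion_samples` folder, which you can use to inspect the model's progress. It includes `completion_samples.txt` and `success_completion_samples.txt`. The `completion_samples.txt` file includes all completions, while `success_completion_samples.txt` includes only those that correctly solve the equation. Below you can find interesting training observations on how the performance changes over time, as well as the Tensorboard logs and successful reasoning samples.

The model with checkpoints for every 25th step can be found at [philschmid/qwen-2.5-3b-r1-countdown](https://huggingface.co/philschmid/qwen-2.5-3b-r1-countdown).

### Hyperparameters

I started the experiment using the hyperparameters from the [DeepSeekMath](https://arxiv.org/abs/2402.03300) paper, with a learning rate of 1e-6 and a beta (KL coefficient) of 0.04, which led to unstable training runs after around 150 steps. I ran some small ablations and decreased the learning rate to 5e-7 and the beta to 0.001, based on a test from [OpenRLHF](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d33ecc9806090f3d5c749d31f05). I couldn't test whether increasing `num_generations` from 8 to 64 would affect the training. 64 is the generation value used in the DeepSeekMath paper. All other parameters can be found in the [grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml) config file.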
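
To get a feel for what the beta coefficient does, the sketch below computes a per-token KL estimate of the form `ratio - log(ratio) - 1` with `ratio = pi_ref / pi_theta`, which is the estimator described in the DeepSeekMath paper, and subtracts it from an advantage term with weight beta. The function names are hypothetical and the advantage term is passed in as a plain number, so this is a simplified illustration rather than the full GRPO objective.

```python
import math


def kl_estimate(logp_policy, logp_ref):
    """Per-token KL estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def penalized_term(advantage_term, logp_policy, logp_ref, beta):
    """Per-token objective contribution: advantage term minus the beta-weighted KL estimate."""
    return advantage_term - beta * kl_estimate(logp_policy, logp_ref)


# The same token under the two beta values mentioned in the text.
for beta in (0.04, 0.001):
    print(beta, penalized_term(1.0, logp_policy=-1.0, logp_ref=-2.0, beta=beta))
```

The estimate is zero when the policy and the reference model agree and positive otherwise, so a smaller beta lets the policy drift further from the reference model before the penalty becomes noticeable.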

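### Training Observations:

  * At ~50 steps the model has learned the correct format `<think>...</think>\n<answer>...</answer>`.
  * At 100 steps the success rate for solving the equation is around 25%. The model starts to "reason" with words, see the examples below.
  * At 200 steps the performance seems to converge much more slowly, at a ~40% success rate. The model starts to learn a new "format", where it solves the equation similar to how you would do it programmatically, by trying different combinations and reviewing the results, see "Successful Reasoning Samples between step 200 and 450".
  * At 450 steps we have a 50% success rate for solving the equation. The performance still improves slowly and the model kept its new format from step 200.

I have 4 potential assumptions for why the model shifts from "word reasoning" to "programmatic execution":

1.  Qwen 2.5 3B is not strong enough or too small; Deepseek mentions that you need a very strong base model.
2.  The reward functions are not well enough defined, and the model reward-hacks its way to solving the equation. We could try to force it to use words, e.g. with a number-to-word frequency condition. (We don't know much about the reward functions from Deepseek.)
3.  Training only on the Countdown Game task might naturally force the model to learn the most effective way to solve the equation, as no other formats are required.
4.  We didn't train the model long enough; in the R1 paper they showed visuals of training for over 8000 steps.

#### Tensorboard results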




![Qwen R1 Tensorboard results](https://raw.githubusercontent.com/philschmid/deep-learning-pytorch-huggingface/59b37973074de90004d10e5ff636f98160c9743a/assets/tensorboard-r1.png)

#### Successful Reasoning Samples up to step 200


```xml
<think> First, I need to think about the numbers we have: 14, 70, and 80. To get 85, I think 80 might be a good starting point because it's right in between 85. If I subtract or add 80, it won't lead to 85 easily. Next, 70 seems to be a good intermediate number, especially to use in conjunction with 14.
Now, I can think about different combinations using addition, subtraction and multiplication/division. If I take 80 - 70 = 10, which is not needed. So, I need to find a way to use 14. If I subtract 14 from 80, I get 66, which still isn't 85. If I subtract 70 from 80, I get 10 again, this isn't constructive yet.
Multiplication could create numbers larger than the target 85. Let's consider division: If I divide 70 by 14, I get 5. Then if I add this result to 80, I get 85. This combination seems promising: 80 - (70 / 14) = 80 - 5 = 75, this still isn't quite right. Let me correct my approach: Let's use division and then addition. If I take 80 / (70 / 14) → 80 / 5 = 16, and then 80 - 16 = 64, still incorrect. Let me simplify: 80 / (70 / 14) → 80 / 5 = 16, and then 80 - 16 gives 64 again, let me correct: 80 - (70 / 14) → 80 - 5 = 75, but I need 85. Thus I will use the combinations in a slightly different order to insert numbers: I get 80 + (70 / 14) = 80 + 5 = 85</think>
<answer> 80 + (70 / 14) </answer>
```

```xml
<think> First, I need to consider how I can use the numbers 1, 21, and 90 to reach 68. I can try different combinations of operations. Let's try to use subtraction and addition. If I subtract 90 from 21, I get -69, which is too far from 68. Now, if I add 1 to 90, I get 91, which is also not right. Another way is to see what happens if I subtract 21 from 90, which gives me 69. This is closer but still not 68. Let me use multiplication instead of subtraction. 90 * 1 = 90. Now, if I try to reduce this slightly by subtracting 21, I am not getting close to 68.
Now, let's try another multiplication. I'll think of 1 * (larger value) + (smaller value). Let's try 1 * 90 + 21. This evaluates to 90 + 21 = 111, which is too much. Instead, I will try: 90 / 1 + 21 - 7. But this gives a division by 7, which seems complex. Let's re-evaluate: I'll use 1 to divide 21: 21 / 1 = 21. Now, subtract 21 from 90, and add 1, we have 90 - 21 = 69, using one...
Hey, wait on re-evaluating I will subtract 21 from 90, and include 1 to complete the solution: 90 - (21 + 1) which would be 68.</think>
<answer> 90 - (21 + 1) </answer>
```


```xml
<think> First, I need to consider the numbers we have: 59, 80, and 66. I need to use these numbers to get 45. Let's try to use subtraction first since it can quickly reduce large numbers.
Let's see if we can get close by subtracting or adding these numbers. If I subtract 66 from 80, I get 14. Then I need to make this 14 equal 45 when combined with 59. That gives me 14 + 59 = 73, which is too high.
Now let's try adding numbers. If I add 59 and 66, I get 125. Then I would need to subtract 80 from 125 to get 45, but 80 is too large to fit well with this.
Let me try a different approach. What if I subtract 66 from 80 first and then add 59? That would be 14 + 59 = 73, which is still not 45.
Hmm, I need to think of another combination. What if I start with 59 and then add 80 and subtract 66? That would be (59 + 80) - 66, let me calculate it. That equals 139 - 66, which is 73.
That's still not 45, so let me try rearranging it slightly: maybe 80 - 66 + 59 would be closer to 45. Calculating again: 80 - 66 is 14, then adding 59 gives 73.
This way doesn't work either. Let's flip this: maybe (59 + 66) - 80. Let me check: that's 125 - 80, which equals 45.</think>
<answer>(59 + 66) - 80</answer>
```

```xml
<think> First, I need to find a combination of the numbers 61, 4, 94, and 21 that can be manipulated using basic arithmetic operations to equal 50. One way to approach this is to look for numbers that can be used to reach a target number close to 50 and then adjust with the remaining numbers.
Let's try combining 94 and 21 first:
94 + 21 = 115
Now, we need to get from 115 to 50. We can try subtracting 65:
115 - 65 = 50
Now we need to find a way to get 65 using the remaining numbers 61 and 4. We can try:
61 - 4 = 57
Now we need to get from 57 to 65. Adding 8 would give us that:
57 + 8 = 65
So, we need to find a way to get 8 using 61 and 4. We can try:
61 - 94 = -36
This is negative, so let's try dividing:
61 / 94 = 0.648987...
This is not helping, so let's try multiplication and division in a different way:
(61 - 4) * (94 / 21) = 57 * 4.476... = 254.4
This is not working either. Let's try a different approach:
(94 - 61) = 33
Now we have:
33 + 21 = 54
We need to adjust from 54 to 50, so we can:
54 - 4 = 50
So, the equation becomes:
(94 - 61) + (21 - 4) = 50
</think>
<answer>(94 - 61) + (21 - 4)</answer>
```

```xml
<think> First, I'll consider the possibility of directly subtracting two numbers to get close to 31. If I subtract 57 from 63, I get 7. Now, if I subtract this 7 from 25, I get 25 - 7 = 18. Far from 31, so this doesn't work.
Next, let's try adding and subtracting. Dividing 63 by 25 will give a decimal, so that's unlikely to work. Now, let's see if adding and subtracting in another order could work. For example, let's try 63 - 25, and then we see this gets us 38. Now, we need a way to get to 31 from here. Subtracting 7 from 38 would work, but we can't figure out how to create that 7 from the given numbers.
Hmm, let me try to incorporate all numbers by adding and subtracting. If I try 63 + 25, I get 88. This is quite far from 31, but if 88 - 57, I get 31 exactly!
This seems to be the equation: 63 + 25 - 57 equals 31.
</think>
<answer> 63 + 25 - 57 </answer>
```

#### Successful Reasoning Samples between step 200 and 450

```xml
<think> We need to use the numbers 39, 16, and 55 with basic arithmetic operations to get 78. Let's try different combinations:
- 55 + 39 - 16 = 84 - 16 = 68 (too low)
- 55 + 16 - 39 = 71 - 39 = 32 (too low)
- 55 - 16 + 39 = 39 + 39 = 78 (this works)
So, the equation is 55 - 16 + 39 = 78. </think>
<answer> 55 - 16 + 39 </answer>
```

```xml
<think> We need to use the numbers 21, 5, and 40 with basic arithmetic operations to get 24. Let's try different combinations:
- 21 - 5 - 40 = -15
- 21 - 5 + 40 = 56
- 21 + 5 - 40 = -14
- 21 * 5 / 40 = 2.1 (not an integer, let's try another)
- 40 * 5 / 21 = 9.52 (not 24)
- 40 / 5 - 21 = 9 - 21 = -12
- 5 * 40 / 21 = 100 / 21 ~= 4.76 (not 24)
- 5 + 40 - 21 = 24 ! This works.
</think>
<answer> 5 + 40 - 21 </answer>
```

```xml
<think> We need to use the numbers 75, 99, 62, and 34 with basic arithmetic operations to get 72. Let's try different combinations:
- 99 - 75 + 62 - 34 = 92 - 34 = 58 (too high)
- 99 - 75 - 62 + 34 = 24 - 62 + 34 = -38 + 34 = -4 (too low)
- 99 + 75 - 62 - 34 = 174 - 96 = 78 (too high)
- 99 + 75 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)
- 99 - 75 + 62 + 34 = 24 + 96 = 120 (too high)
- 75 + 99 - 62 - 34 = 174 - 96 = 78 (too high)
- 75 + 99 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)
- 75 + 62 - 99 + 34 = 137 - 99 + 34 = 38 + 34 = 72
So, 75 + 62 - 99 + 34 equals 72.
</think>
<answer> 75 + 62 - 99 + 34 </answer>
```

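## Conclusion

The release of DeepSeek R1 and its research paper might be a turning point for open science and open-source development. Just a week after the DeepSeek release, we've been able to reproduce a simple version of R1 learning "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning, and converges into a very specific "reasoning" format, it shows that the method works.

In our Mini-R1 experiment we used GRPO with two rule-based rewards, but it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model. This gives us an idea of the compute needed to scale reinforcement learning. Deepseek trained a 671B model for over 8000 steps, and they probably ran many ablations.

Looking into 2025, it's clear that we are on the cusp of even more significant progress. RL will become more accessible and user-friendly, and more researchers and developers will explore its potential, but it will also require more compute than before and more than supervised fine-tuning.

I am excited for 2025. If you have any questions or ideas, feel free to reach out to me.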