# Mini-R1: Reproduce the Deepseek R1 "aha moment", an RL tutorial

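The release of Deepseek R1 shocked the industry. Why? DeepSeek-R1 is an open model that rivals OpenAI's o1 in complex reasoning tasks. Its core innovation is Group Relative Policy Optimization (GRPO) combined with an RL-focused multi-stage training approach. They not only released the model but also published the accompanying research paper.

The paper describes an "aha moment" of the model when it is trained with pure RL. During this phase, DeepSeek-R1-Zero (the first test of DeepSeek-R1) learns to allocate more thinking time to a problem by reevaluating its initial approach, without any human feedback or data describing how to do it. They describe this as an "aha moment":

> This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

It is a reminder that RL has the potential to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future. In this blog post we want to recreate the small "aha moment" of DeepSeek-R1 using GRPO and the Countdown Game. We will train an open model with reinforcement learning so that it learns self-verification and search abilities on its own to solve the Countdown Game. The Countdown Game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach, or get as close as possible to, a target number.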

```
Target Number: 952
Available Numbers: 25, 50, 75, 100, 3, 6

(((100 + 6) × 3 × 75) - 50) / 25 = 952
```
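To make the rules concrete, here is a minimal brute-force sketch of a solver. It is not part of the original tutorial: the names `solve_countdown` and `_search` are hypothetical, and it assumes that every number must be used exactly once and that only exact matches count. It is only practical for small inputs such as 3 or 4 numbers.

```python
from fractions import Fraction
from itertools import combinations


def _search(items, target):
    """Combine two entries at a time until one value is left."""
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Division is only tried when the divisor is non-zero.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


def solve_countdown(numbers, target):
    """Return an expression that uses every number once, or None."""
    # Fractions keep the arithmetic exact, so 70 / 14 is exactly 5.
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


print(solve_countdown([80, 70, 14], 85))
```

The model trained in this post has no such solver available. It has to discover a comparable trial-and-error search purely from the reward signal.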

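This blog post includes interactive code that you can run in a Jupyter Notebook to see how to train a model using GRPO and Q-LoRA. This is a good way to learn TRL and GRPO, but it is slow and requires a lot of compute. Additionally, I provide a [script](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_r1_grpo.py) and instructions for running the training on a node with multiple GPUs or on a SLURM cluster.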

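1. [Setup the development environment](#1-setup-the-development-environment)
2. [Generate training samples with reasoning prefix from the Countdown Game](#2-generate-training-samples-with-reasoning-prefix-from-the-countdown-game)
3. [Train the model using GRPO](#3-train-the-model-using-grpo)
4. [Distributed Training example for GRPO using Deepspeed and vLLM](#4-distributed-training-example-for-grpo-using-deepspeed-and-vllm)
5. [Results and Training Observations](#5-results-and-training-observations)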

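_Note: This blog is inspired by [Jiayi Pan](https://x.com/jiayi_pirate/status/1882839370505621655), who first explored the idea and validated it with a small model._

Before we start, let's take a look at [Group Relative Policy Optimization (GRPO)](https://arxiv.org/abs/2402.03300) and understand how it works.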

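## Group Relative Policy Optimization (GRPO)

GRPO is a reinforcement learning algorithm for improving the reasoning capabilities of large language models (LLMs). It was first introduced in the [DeepSeekMath](https://arxiv.org/abs/2402.03300) paper. GRPO modifies traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates the baseline from group scores, which reduces memory usage and computational overhead. GRPO is now also used by the Qwen team and can be combined with rule-based or binary rewards as well as general reward models to improve a model's helpfulness.

1. **Sampling**: Generate multiple outputs for each prompt using the current policy.
2. **Reward scoring**: Score each generation with a reward function, which can be rule-based or outcome-based.
3. **Advantage calculation**: Use the average reward of the generations in a group as the baseline and compute the advantage of each solution relative to it. The reward is normalized within the group.
4. **Policy optimization**: Optimize the policy by maximizing the GRPO objective, which includes the computed advantages and a KL divergence term.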
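For readers who prefer notation, steps 3 and 4 can be written roughly as follows. This is a sketch of the outcome-supervision form from the DeepSeekMath paper as I understand it; the symbols $G$ (group size), $r_i$ (reward of output $o_i$), $\rho_{i,t}$ (probability ratio), $\varepsilon$ (clipping range) and $\beta$ (KL coefficient) are introduced here for illustration.

$$
\hat{A}_{i} = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})},
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}
$$

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( \rho_{i,t}\, \hat{A}_{i},\ \operatorname{clip}\left(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i} \right) - \beta\, \mathbb{D}_{KL}\left[ \pi_\theta \,\|\, \pi_{ref} \right] \right) \right]
$$

Every token of an output shares the same advantage $\hat{A}_{i}$, and $\beta$ controls how strongly the policy is pulled back toward the reference model.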

![grpo.png](../assets/grpo.png)
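The advantage calculation in step 3 can be sketched in a few lines of plain Python. This is an illustration, not TRL's implementation: the function name `group_advantages`, the use of the population standard deviation, and the small `eps` term are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's group of completions."""
    baseline = mean(rewards)   # group average acts as the baseline
    spread = pstdev(rewards)   # population standard deviation of the group
    # eps avoids a division by zero when all rewards are identical
    return [(r - baseline) / (spread + eps) for r in rewards]


# One prompt with four sampled completions, each scored by summing
# a format reward and an equation reward (values are made up).
rewards = [2.0, 1.0, 1.0, 0.0]
print(group_advantages(rewards))
```

If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal. This is one reason why larger groups tend to help.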

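## 1. Setup the development environment

Our first step is to install the Hugging Face libraries and PyTorch, vLLM, trl, transformers and datasets. If you haven't heard of trl yet, it is a library on top of transformers and datasets that makes it easier to fine-tune, apply RLHF to, and align open LLMs.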

```python
# Install PyTorch and other libraries; make sure the build matches your GPU driver
%pip install "torch==2.5.1" tensorboard "setuptools<71.0.0"  --index-url https://download.pytorch.org/whl/cu121

# Install flash-attn
%pip install flash-attn

# Install Hugging Face libraries
%pip install  --upgrade \
  "transformers==4.48.1" \
  "datasets==3.1.0" \
  "accelerate==1.3.0" \
  "hf-transfer==0.1.9" \
  "deepspeed==0.15.4" \
  "trl==0.14.0"

# Install vLLM
%pip install "vllm==0.7.0"

## IMPORTANT: to run the interactive cells you also need the libraries below
## (they might conflict with the libraries used for distributed training)
# %pip install "peft==0.14.0" "bitsandbytes==0.45.0"
```

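_Note: You may need to restart the kernel after installation to use the updated packages._

We will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. This means the model, logs and information are automatically pushed to the Hub during training. You need a [Hugging Face account](https://huggingface.co/join) for this. Once you have one, use the `login` util from the `huggingface_hub` package to log in and store your access token.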

```python
from huggingface_hub import login

login(token="", add_to_git_credential=True)  # ADD YOUR TOKEN HERE
```
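If you would rather not paste the token into the notebook, one option is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; the helper names `resolve_hf_token` and `login_from_env` are hypothetical and only wrap the same `login` call shown above.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token"
        )
    return token


def login_from_env():
    """Log in with the token from the environment instead of a hardcoded string."""
    # Imported lazily so resolve_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=resolve_hf_token(), add_to_git_credential=True)
```

Calling `login_from_env()` then replaces the `login(token="", ...)` line, and the token never ends up in a saved notebook.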

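## 2. Generate training samples with reasoning prefix from the Countdown Game

We are going to use the [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset, which contains samples with 3 to 4 numbers and their solutions.

As the model we are going to use [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), a 3B parameter instruction-tuned model. Because it already follows the prompt format, it is easier to showcase the "aha moment". You can also use the base version of Qwen or other models. [Jiayi Pan](https://x.com/jiayi_pirate/status/1882839487417561307) found that a model needs a certain size (more than 1.5B parameters) to be able to learn the reasoning process.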

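```python
from transformers import AutoTokenizer
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
dataset = load_dataset(dataset_id, split="train")
# Select a random subset of 50k samples
dataset = dataset.shuffle(seed=42).select(range(50000))

# Load the tokenizer to format the dataset as "r1" prompts
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Generate an r1 prompt with a prefix for the model to start its reasoning
def generate_r1_prompt(numbers, target):
    r1_prefix = [{
        "role": "system",
        "content": "You are a helpful assistant. You first think about the reasoning process in your mind and then provide the user with the answer."
      },
      {
        "role": "user",
        "content": f"Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 = 1 </answer>."
      },
      {
        "role": "assistant",
        "content": "Let me solve this step by step.\n<think>"
      }]
    # continue_final_message=True leaves the assistant turn open so the model continues after <think>
    return {"prompt": tokenizer.apply_chat_template(r1_prefix, tokenize=False, continue_final_message=True), "target": target}

# Convert the dataset to r1 prompts
dataset = dataset.map(lambda x: generate_r1_prompt(x["nums"], x["target"]))

# Split the dataset into train and test sets
train_test_split = dataset.train_test_split(test_size=0.1)

train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
```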
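To see why the reward functions later prepend `<think>` to each completion, it helps to look at what the rendered prompt roughly looks like. The sketch below is an approximation that assumes a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers; the helper `render_chatml_prefix` is hypothetical, and the tokenizer's own chat template remains the authoritative source.

```python
def render_chatml_prefix(messages):
    """Approximate a ChatML rendering where the final message is left open."""
    parts = []
    for index, message in enumerate(messages):
        is_last = index == len(messages) - 1
        text = f"<|im_start|>{message['role']}\n{message['content']}"
        # Every turn except the last one is closed, so the model keeps
        # writing inside the final assistant message.
        parts.append(text if is_last else text + "<|im_end|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Using the numbers [80, 70, 14], create an equation that equals 85."},
    {"role": "assistant", "content": "Let me solve this step by step.\n<think>"},
]
print(render_chatml_prefix(messages))
```

Because the prompt already ends with `<think>`, the generated completion starts in the middle of the thinking block and contains only the closing `</think>` tag.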

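## 3. Train the model using GRPO

_Note: This section shows the basics of how to use TRL and GRPO. If you want to run the interactive cells you need to install `bitsandbytes` and `peft`. This section is mostly for educational purposes._

TRL supports GRPO through a dedicated [GRPOTrainer](https://huggingface.co/docs/trl/main/en/grpo_trainer) for aligning LLMs from preference data, as described in the [DeepSeekMath paper](https://arxiv.org/abs/2402.03300). The `GRPOTrainer` is a subclass of the `Trainer` from the `transformers` library and supports the same features, including logging, checkpointing, distributed training, and parameter-efficient fine-tuning (PEFT).

The `GRPOTrainer` supports generic Outcome Reward Models (ORM) as well as custom reward functions. In the Deepseek R1 paper they used rule-based reward models to verify the correctness of the generated solutions. In our example we follow a similar approach and create two reward functions:
1. **Format reward**: checks that the generation follows the expected format, `<think>...</think>` followed by `<answer>...</answer>`.
2. **Accuracy reward**: extracts the equation from the `<answer>` tag and checks that it evaluates to the target and uses every number exactly once.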

```python
import re

def format_reward_func(completions, target, **kwargs):
    """
    Format: <think>...</think>\\n<answer>...</answer>
    Args:
        completions (list[str]): Generated outputs
        target (list[str]): Expected answers
    Returns:
        list[float]: Reward scores
    """
    rewards = []
    for completion, gt in zip(completions, target):
        try:
            # Add the synthetic <think> prefix, as it is already part of the prompt
            completion = "<think>" + completion
            # Exactly one <think> block, a newline, then one <answer> block
            regex = r"^<think>([^<]*(?:<(?!/?think>)[^<]*)*)<\/think>\n<answer>([\s\S]*?)<\/answer>$"
            match = re.search(regex, completion, re.DOTALL)
            rewards.append(1.0 if match and len(match.groups()) == 2 else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards

def equation_reward_func(completions, target, nums, **kwargs):
    """
    Evaluates completions based on:
    1. Mathematical correctness of the equation
    2. Every number being used exactly once
    Args:
        completions (list[str]): Generated outputs
        target (list[str]): Target numbers
        nums (list[list[int]]): Available numbers for each sample
    Returns:
        list[float]: Reward scores
    """
    rewards = []
    for completion, gt, numbers in zip(completions, target, nums):
        try:
            # Add the synthetic <think> prefix, as it is already part of the prompt
            completion = "<think>" + completion
            match = re.search(r"<answer>(.*?)<\/answer>", completion)
            if not match:
                rewards.append(0.0)
                continue
            equation = match.group(1).strip()
            used_numbers = [int(n) for n in re.findall(r'\d+', equation)]
            # Check that all numbers are used exactly once
            if sorted(used_numbers) != sorted(numbers):
                rewards.append(0.0)
                continue
            # Only allow digits, operators, parentheses, dots and whitespace
            if not re.match(r'^[\d+\-*/().\s]+$', equation):
                rewards.append(0.0)
                continue
            # Evaluate the equation with restricted globals and locals
            result = eval(equation, {"__builtins__": None}, {})
            rewards.append(1.0 if abs(float(result) - float(gt)) < 1e-5 else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards
```
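The accuracy reward relies on `eval` with restricted builtins plus a character whitelist. That whitelist still admits `**`, since it is made of allowed `*` characters. A possible hardening, shown here only as a sketch and not as the author's method, is to walk the parsed expression with Python's `ast` module and accept nothing but the four basic operations. The name `safe_arithmetic_eval` is hypothetical.

```python
import ast
import operator

# Only the four Countdown operations are accepted.
_ALLOWED_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_arithmetic_eval(expression):
    """Evaluate +, -, *, / over numeric literals and reject everything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_BINARY:
            return _ALLOWED_BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression.strip(), mode="eval"))


print(safe_arithmetic_eval("80 + (70 / 14)"))
```

Inside `equation_reward_func` this could stand in for the `eval` call. Errors such as `ValueError`, `SyntaxError` or `ZeroDivisionError` would still be caught by the existing `except Exception` branch and scored as 0.0.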

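Let's try the reward functions on a few samples. Note that none of the examples start with `<think>`, because the reward functions add it to the completion synthetically.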

```python
correct_sample_1 = """We need to use the numbers 19, 36, 55 and 7 exactly once with basic operations to reach 65. One possible solution is 55 + 36 - 19 + 7... </think>
<answer> 55 + 36 - 7 - 19 </answer>"""

correct_sample_2 = """ ... </think>
<answer> 55 + 36 - 7 - 19 </answer>"""

wrong_format = """User: Using the numbers [19, 36, 55, 7], create an equation that equals 65."""

wrong_format_2 = """Trying to reach 79 with 95, 78, 6 and 88:
95 + 88 = 183
183 - 104 = 79
<think> 183 - 104 = 79 </think><think> 183 - 104 = 79 </think><answer> 183 - 104 = 79 </answer>"""

wrong_result = """ ... </think>
<answer> 55 + 36 - 7 - 18 </answer>"""

# The format reward only checks the structure, so wrong_result still scores 1.0
test_rewards = format_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=["65", "65", "65", "65", "65"], nums=[[19, 36, 55, 7]] * 5)
assert test_rewards == [1.0, 1.0, 0.0, 0.0, 1.0], "Reward function is not working"
# The equation reward also checks the numbers used and the result
test_rewards = equation_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=["65", "65", "65", "65", "65"], nums=[[19, 36, 55, 7]] * 5)
assert test_rewards == [1.0, 1.0, 0.0, 0.0, 0.0], "Reward function is not working"
```

This looks good. Now let's define the remaining training parameters, create a trainer and start training.

```python
from trl import GRPOConfig, GRPOTrainer, get_peft_config, ModelConfig

# The model we are going to use as the policy
model_config = ModelConfig(
    model_name_or_path="Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
    use_peft=True,
    load_in_4bit=True,
)

# Hyperparameters
training_args = GRPOConfig(
    output_dir="qwen-r1-aha-moment",
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    logging_steps=10,
    max_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=True,
    # GRPO specific parameters
    max_prompt_length=256,
    max_completion_length=1024,  # max length of the generated output for our solution
    num_generations=2,
    beta=0.001,
)
trainer = GRPOTrainer(
    model=model_config.model_name_or_path,
    reward_funcs=[format_reward_func, equation_reward_func],
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=get_peft_config(model_config),
)
```

We can start our training by calling the `train` method on the trainer instance.

_Note: Reinforcement learning training is very slow and compute intensive. Running a single step on 1x L4 with Q-LoRA, a batch size of 1 and only 2 generations per sample takes more than 20 minutes._

```python
# Start the training
trainer.train()
# Save the final model to the output directory
trainer.save_model(training_args.output_dir)
```

## 4. Distributed Training example for GRPO using Deepspeed and vLLM

More than 20 minutes per step with only 2 generations per sample is not feasible, so we need to scale up our training. Hugging Face TRL added support for distributed training with DeepSpeed and for using vLLM for faster generation. I prepared a [run_r1_grpo.py](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_r1_grpo.py) script and a [receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml) config file to run the training.

This configuration is tested and validated on a node with 4x H100 80GB GPUs, where a single step takes around 45-60s, as we can leverage vLLM for generation and DeepSpeed for distributed training. Make sure to set `num_processes` to the number of GPUs you have minus 1, because the last GPU is reserved for generation with vLLM. If you are using more GPUs, change `vllm_device` in the config file to the index of the last GPU. For example, with 8 GPUs set `vllm_device=7` and `num_processes` to 7.

Command to run the training:
```bash
accelerate launch --num_processes 3 --config_file configs/accelerate_configs/deepspeed_zero3.yaml scripts/run_r1_grpo.py --config receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml
```
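The GPU split described above is simple arithmetic, sketched here for convenience. The helper `plan_gpu_split` is hypothetical and only restates the rule from the text; the exact value format that the config file expects for `vllm_device` may differ from a bare index.

```python
def plan_gpu_split(total_gpus):
    """Reserve the last GPU for vLLM generation and train on the rest."""
    if total_gpus < 2:
        raise ValueError("Need at least 2 GPUs: one for vLLM and one for training")
    return {
        "num_processes": total_gpus - 1,  # passed to `accelerate launch`
        "vllm_device": total_gpus - 1,    # zero-based index of the last GPU
    }


for gpus in (4, 8):
    print(gpus, plan_gpu_split(gpus))
```

For the 4x H100 node this gives `num_processes=3`, which matches the command above.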

With the optimized distributed training, a single step with 8 generations per sample on 4x H100 80GB GPUs takes around 45-60s. The full training for 450 steps takes around 6 hours.
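As a quick sanity check on these numbers, the wall-clock time can be estimated from the step count and the per-step duration. The helper `training_hours` is hypothetical, and the figures are the ones quoted in this post.

```python
def training_hours(steps, seconds_per_step):
    """Estimate wall-clock training time in hours."""
    return steps * seconds_per_step / 3600


# Distributed setup: 450 steps at 45-60 seconds per step
fast = training_hours(450, 45)
slow = training_hours(450, 60)
print(f"4x H100: {fast:.1f} to {slow:.1f} hours")

# Notebook setup: 100 steps at more than 20 minutes per step on 1x L4
notebook = training_hours(100, 20 * 60)
print(f"1x L4 with Q-LoRA: more than {notebook:.0f} hours")
```

The distributed estimate lands in the same range as the roughly 6 hours observed, while the single-GPU notebook run would take well over a day for fewer steps and fewer generations.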

## 5. Results and Training Observations

The script saves random completions to the `completion_samples` folder, which you can use to inspect the model's progress. The folder includes `completion_samples.txt` and `success_completion_samples.txt`. `completion_samples.txt` contains all completions, while `success_completion_samples.txt` contains only those that correctly solve the equation. Below you can find the interesting training observations on how the performance changes over time, as well as the TensorBoard logs and successful reasoning samples.

The model with checkpoints for every 25th step can be found at [philschmid/qwen-2.5-3b-r1-countdown](https://huggingface.co/philschmid/qwen-2.5-3b-r1-countdown). 

### Hyperparameters

I started the experiment using the hyperparameters from the [DeepSeekMath](https://arxiv.org/abs/2402.03300) paper with a learning rate of 1e-6 and a beta (KL coefficient) of 0.04, which led to unstable training runs after around 150 steps. I ran some small ablations and decreased the learning rate to 5e-7 and the beta to 0.001, based on a test from [OpenRLHF](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05). I couldn't test how increasing `num_generations` from 8 to 64 would affect the training; 64 is the value used in the DeepSeekMath paper. All other parameters can be found in the [grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/receipes/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml) config file.


### Training Observations

* At ~50 steps the model has learned the correct format `<think>...</think>\n<answer>...</answer>`.
* At 100 steps the success rate for solving the equation is around 25%. The model starts to "reason" with words (see the samples up to step 200 below).
* At 200 steps the performance seems to converge much more slowly, and we are at a ~40% success rate. The model starts to learn a new "format" where it solves the equation similar to how you would do it programmatically, by trying different combinations and reviewing the results (see the samples between step 200 and 450 below).
* At 450 steps we have a 50% success rate for solving the equation. The performance still improves slowly, and the model kept the new format it picked up around step 200.

I have 4 potential assumptions for why the model shifts from "word reasoning" to "programmatic execution":
1. Qwen 2.5 3B is not strong enough or too small. Deepseek mentions that you need a very strong base model.
2. The reward functions are not defined well enough, and the model reward-hacks them to solve the equation. We could try to force it to use words, e.g. with a number-to-word frequency condition. (We don't know much about the reward functions Deepseek used.)
3. Training only on Countdown Game tasks might naturally push the model to learn the most effective way to solve the equation, as no other formats are required.
4. We didn't train the model long enough. In the R1 paper they showed visuals of training for over 8000 steps.
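The "number-to-word frequency condition" from assumption 2 could look something like the sketch below. It is purely hypothetical and was not used or tested in the training run: the function name `word_ratio_reward_func` and the `min_word_ratio` threshold are made up for illustration, and the signature only mirrors the custom reward functions defined earlier.

```python
import re


def word_ratio_reward_func(completions, min_word_ratio=0.5, **kwargs):
    """Reward completions whose reasoning contains enough words relative to numbers."""
    rewards = []
    for completion in completions:
        # Only look at the reasoning part, i.e. everything before </think>
        thinking = completion.split("</think>")[0]
        words = re.findall(r"[A-Za-z]+", thinking)
        numbers = re.findall(r"\d+(?:\.\d+)?", thinking)
        total = len(words) + len(numbers)
        if total == 0:
            rewards.append(0.0)
            continue
        ratio = len(words) / total
        rewards.append(1.0 if ratio >= min_word_ratio else 0.0)
    return rewards


wordy = "First, I need to think about how these numbers can be combined to reach the target. </think>\n<answer> 5 + 40 - 21 </answer>"
numeric = "- 21 - 5 - 40 = -15\n- 21 - 5 + 40 = 56 </think>\n<answer> 5 + 40 - 21 </answer>"
print(word_ratio_reward_func([wordy, numeric]))
```

Such a reward would need care, since the model could satisfy it with filler words instead of actual reasoning.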


#### Tensorboard Results

![Qwen R1 Tensorboard results](../assets/tensorboard-r1.png)


#### Successful Reasoning Samples up to step 200

```xml
<think> First, I need to think about the numbers we have: 14, 70, and 80. To get 85, I think 80 might be a good starting point because it's right in between 85. If I subtract or add 80, it won't lead to 85 easily. Next, 70 seems to be a good intermediate number, especially to use in conjunction with 14. 
Now, I can think about different combinations using addition, subtraction and multiplication/division. If I take 80 - 70 = 10, which is not needed. So, I need to find a way to use 14. If I subtract 14 from 80, I get 66, which still isn't 85. If I subtract 70 from 80, I get 10 again, this isn't constructive yet. 
Multiplication could create numbers larger than the target 85. Let's consider division: If I divide 70 by 14, I get 5. Then if I add this result to 80, I get 85. This combination seems promising: 80 - (70 / 14) = 80 - 5 = 75, this still isn't quite right. Let me correct my approach: Let's use division and then addition. If I take 80 / (70 / 14) → 80 / 5 = 16, and then 80 - 16 = 64, still incorrect. Let me simplify: 80 / (70 / 14) → 80 / 5 = 16, and then 80 - 16 gives 64 again, let me correct: 80 - (70 / 14) → 80 - 5 = 75, but I need 85. Thus I will use the combinations in a slightly different order to insert numbers: I get 80 + (70 / 14) = 80 + 5 = 85</think>
<answer> 80 + (70 / 14) </answer>
```

```xml
<think> First, I need to consider how I can use the numbers 1, 21, and 90 to reach 68. I can try different combinations of operations. Let's try to use subtraction and addition. If I subtract 90 from 21, I get -69, which is too far from 68. Now, if I add 1 to 90, I get 91, which is also not right. Another way is to see what happens if I subtract 21 from 90, which gives me 69. This is closer but still not 68. Let me use multiplication instead of subtraction. 90 * 1 = 90. Now, if I try to reduce this slightly by subtracting 21, I am not getting close to 68.
Now, let's try another multiplication. I'll think of 1 * (larger value) + (smaller value). Let's try 1 * 90 + 21. This evaluates to 90 + 21 = 111, which is too much. Instead, I will try: 90 / 1 + 21 - 7. But this gives a division by 7, which seems complex. Let's re-evaluate: I'll use 1 to divide 21: 21 / 1 = 21. Now, subtract 21 from 90, and add 1, we have 90 - 21 = 69, using one...
Hey, wait on re-evaluating I will subtract 21 from 90, and include 1 to complete the solution: 90 - (21 + 1) which would be 68.</think>
<answer> 90 - (21 + 1) </answer>
```


```xml
<think> First, I need to consider the numbers we have: 59, 80, and 66. I need to use these numbers to get 45. Let's try to use subtraction first since it can quickly reduce large numbers.
Let's see if we can get close by subtracting or adding these numbers. If I subtract 66 from 80, I get 14. Then I need to make this 14 equal 45 when combined with 59. That gives me 14 + 59 = 73, which is too high.
Now let's try adding numbers. If I add 59 and 66, I get 125. Then I would need to subtract 80 from 125 to get 45, but 80 is too large to fit well with this.
Let me try a different approach. What if I subtract 66 from 80 first and then add 59? That would be 14 + 59 = 73, which is still not 45.
Hmm, I need to think of another combination. What if I start with 59 and then add 80 and subtract 66? That would be (59 + 80) - 66, let me calculate it. That equals 139 - 66, which is 73.
That's still not 45, so let me try rearranging it slightly: maybe 80 - 66 + 59 would be closer to 45. Calculating again: 80 - 66 is 14, then adding 59 gives 73.
This way doesn't work either. Let's flip this: maybe (59 + 66) - 80. Let me check: that's 125 - 80, which equals 45.</think>
<answer>(59 + 66) - 80</answer>
```

```xml
<think> First, I need to find a combination of the numbers 61, 4, 94, and 21 that can be manipulated using basic arithmetic operations to equal 50. One way to approach this is to look for numbers that can be used to reach a target number close to 50 and then adjust with the remaining numbers.
Let's try combining 94 and 21 first:
94 + 21 = 115
Now, we need to get from 115 to 50. We can try subtracting 65:
115 - 65 = 50
Now we need to find a way to get 65 using the remaining numbers 61 and 4. We can try:
61 - 4 = 57
Now we need to get from 57 to 65. Adding 8 would give us that:
57 + 8 = 65
So, we need to find a way to get 8 using 61 and 4. We can try:
61 - 94 = -36
This is negative, so let's try dividing:
61 / 94 = 0.648987...
This is not helping, so let's try multiplication and division in a different way:
(61 - 4) * (94 / 21) = 57 * 4.476... = 254.4
This is not working either. Let's try a different approach:
(94 - 61) = 33
Now we have:
33 + 21 = 54
We need to adjust from 54 to 50, so we can:
54 - 4 = 50
So, the equation becomes:
(94 - 61) + (21 - 4) = 50
</think>
<answer>(94 - 61) + (21 - 4)</answer>
```

```xml
<think> First, I'll consider the possibility of directly subtracting two numbers to get close to 31. If I subtract 57 from 63, I get 7. Now, if I subtract this 7 from 25, I get 25 - 7 = 18. Far from 31, so this doesn't work.
Next, let's try adding and subtracting. Dividing 63 by 25 will give a decimal, so that's unlikely to work. Now, let's see if adding and subtracting in another order could work. For example, let's try 63 - 25, and then we see this gets us 38. Now, we need a way to get to 31 from here. Subtracting 7 from 38 would work, but we can't figure out how to create that 7 from the given numbers.
Hmm, let me try to incorporate all numbers by adding and subtracting. If I try 63 + 25, I get 88. This is quite far from 31, but if 88 - 57, I get 31 exactly!
This seems to be the equation: 63 + 25 - 57 equals 31.
</think>
<answer> 63 + 25 - 57 </answer>
```

#### Successful Reasoning Samples between step 200 and 450

```xml
<think> We need to use the numbers 39, 16, and 55 with basic arithmetic operations to get 78. Let's try different combinations:
- 55 + 39 - 16 = 84 - 16 = 68 (too low)
- 55 + 16 - 39 = 71 - 39 = 32 (too low)
- 55 - 16 + 39 = 39 + 39 = 78 (this works)
So, the equation is 55 - 16 + 39 = 78. </think>
<answer> 55 - 16 + 39 </answer>
```

```xml
<think> We need to use the numbers 21, 5, and 40 with basic arithmetic operations to get 24. Let's try different combinations:
- 21 - 5 - 40 = -15
- 21 - 5 + 40 = 56
- 21 + 5 - 40 = -14
- 21 * 5 / 40 = 2.1 (not an integer, let's try another)
- 40 * 5 / 21 = 9.52 (not 24)
- 40 / 5 - 21 = 9 - 21 = -12
- 5 * 40 / 21 = 100 / 21 ~= 4.76 (not 24)
- 5 + 40 - 21 = 24 ! This works.
</think>
<answer> 5 + 40 - 21 </answer>
```

```xml
<think> We need to use the numbers 75, 99, 62, and 34 with basic arithmetic operations to get 72. Let's try different combinations:
- 99 - 75 + 62 - 34 = 92 - 34 = 58 (too high)
- 99 - 75 - 62 + 34 = 24 - 62 + 34 = -38 + 34 = -4 (too low)
- 99 + 75 - 62 - 34 = 174 - 96 = 78 (too high)
- 99 + 75 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)
- 99 - 75 + 62 + 34 = 24 + 96 = 120 (too high)
- 75 + 99 - 62 - 34 = 174 - 96 = 78 (too high)
- 75 + 99 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)
- 75 + 62 - 99 + 34 = 137 - 99 + 34 = 38 + 34 = 72
So, 75 + 62 - 99 + 34 equals 72.
</think>
<answer> 75 + 62 - 99 + 34 </answer>
```

## Conclusion

The release of DeepSeek R1 and its research paper might be a turning point for open-science and open-source development. Just a week after the DeepSeek release, we were able to reproduce a simple version of R1's learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning and converges on a very specific "reasoning" format, it shows that the method works.

In our mini R1 experiment we used GRPO with two rule-based rewards, and it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model. This gives us an idea of the compute needed to scale reinforcement learning. Deepseek trained a 671B model for over 8000 steps, and they probably ran many ablations.

Looking into 2025, it's clear that we are on the cusp of even more significant progress. RL will become more accessible and user-friendly, and more researchers and developers will explore its potential, but it will also require more compute than before and more than supervised fine-tuning.

I am excited for 2025. If you have any questions or ideas, feel free to reach out to me.