Train your own frontier-level deductive reasoning model with reinforcement learning.
This repository contains the training recipe for creating frontier-level deductive reasoning models using reinforcement learning. Our research demonstrates how smaller, open-weight language models can be trained to perform complex logical deduction tasks at frontier-level performance, matching or exceeding proprietary models at a fraction of the cost.
We used the Temporal Clue puzzle dataset to train Qwen 14B and 32B models, improving their deductive reasoning capabilities significantly through Group Relative Policy Optimization (GRPO). Our trained models approach the performance of leading proprietary models like Claude 3.7 Sonnet while maintaining cost efficiency.
- Training Recipe: This repository (recipe for RL training)
- Training Dataset: Temporal Clue Puzzles
- RL Experiments: OpenPipe RL Experiments
- Model Weights:
- Blog Post: Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"
Follow these steps to run the training recipe.

Prerequisites:

- Sufficient NVIDIA GPUs for your chosen model:
  - Qwen 14B requires at least 2 GPUs
  - Qwen 32B requires at least 4 GPUs
- uv package manager
- Weights & Biases account
1. Clone this repository:

   ```bash
   git clone https://github.com/bradhilton/deductive-reasoning.git
   cd deductive-reasoning
   ```
2. Install dependencies using uv:

   ```bash
   uv sync
   ```
3. Reinstall torchtune due to an executable naming conflict with Ray Tune 🙈:

   ```bash
   uv remove torchtune
   uv add torchtune
   ```
4. (Optional) Configure environment variables:

   ```bash
   cp .env.example .env
   ```

   Edit the `.env` file to add your Weights & Biases API key and project name:

   ```
   WANDB_API_KEY=your_wandb_api_key_here
   WANDB_PROJECT=your_project_name
   ```
5. Open the `train.ipynb` notebook or `train.py` script and configure the training parameters (see the illustrative sketch after these steps):
   - Set a unique `run_name` for your experiment
   - Choose the model (e.g., `models.qwen_14b()` or `models.qwen_32b()`)
   - Adjust other parameters as needed (learning rate, number of iterations, etc.)
6. Run the training:
   - If using the notebook: execute all cells in `train.ipynb`
   - If using the script: run `uv run train.py`
7. Monitor training progress in Weights & Biases.
The training process will save the latest and/or best checkpoints in your output directory, allowing you to resume training if interrupted.
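As an illustration of the configuration step above, the top of `train.py` / `train.ipynb` might look roughly like the sketch below. Only `run_name`, `models.qwen_14b()`, and `models.qwen_32b()` are named in the steps; the remaining parameter names and values are assumptions and may not match the actual script.

```python
# Illustrative configuration sketch; parameter names other than run_name and the
# models helpers are assumptions and may differ from the repository's train.py.
import models  # repository module exposing the model presets mentioned above

run_name = "qwen-14b-temporal-clue-001"  # unique name for this experiment
model = models.qwen_14b()                # or models.qwen_32b() (needs >= 4 GPUs)
learning_rate = 6e-6                     # value used in the training recipe
num_iterations = 100                     # assumed placeholder; adjust as needed
```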
Our training approach used reinforcement learning to incrementally improve models' deductive reasoning capabilities:
- Environment: Temporal Clue puzzles (inspired by the board game Clue/Cluedo) with verifiable solutions
- Algorithm: Group Relative Policy Optimization (GRPO) without KL divergence penalty
- Training Loop:
  1. Generate model responses to puzzle tasks
  2. Grade responses and estimate advantages for each group of completions
  3. Fine-tune the model using clipped policy gradients
  4. Repeat with new puzzles until peak performance
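To make the grading and update steps concrete, here is a minimal sketch of group-relative advantage estimation and the clipped policy-gradient objective without a KL penalty. It is illustrative only, assuming PyTorch, scalar per-completion rewards, and sequence-level log-probabilities; the function names and shapes are not taken from the repository.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within each group (one group = all samples for one task)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)


def clipped_pg_loss(
    logp_new: torch.Tensor,
    logp_old: torch.Tensor,
    advantages: torch.Tensor,
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """PPO-style clipped surrogate objective (no KL penalty term)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Example shapes: 32 tasks per iteration, 50 sampled completions per task.
rewards = torch.rand(32, 50)                     # puzzle-grading scores in [0, 1]
advantages = group_relative_advantages(rewards)  # relative quality within each task
```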
We used the torchtune library for efficient training and vLLM for inference, with the following key parameters:
- Models: Qwen 2.5 Instruct 14B & 32B
- Tasks per Iteration: 32
- Samples per Task per Iteration: 50
- Learning Rate: 6e-6
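As a rough sketch of the inference side, the "samples per task" setting maps to vLLM's `n` sampling parameter. The snippet below is illustrative, not the repository's actual code; the temperature, token limit, and placeholder prompts are assumptions.

```python
# Illustrative rollout generation with vLLM (not the repository's actual code).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-14B-Instruct", tensor_parallel_size=2)

# One prompt per Temporal Clue puzzle in the current iteration (32 in practice).
prompts = ["<temporal clue puzzle #1>", "<temporal clue puzzle #2>"]

# n=50 completions per task; other settings are assumed values.
sampling = SamplingParams(n=50, temperature=1.0, max_tokens=2048)
outputs = llm.generate(prompts, sampling)  # completions to grade and fine-tune on
```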
Our training produced impressive performance gains, demonstrating that open-weight models can achieve frontier-level reasoning capabilities.
We dramatically improved the cost-accuracy tradeoff compared to proprietary models.
Notably, we discovered that meaningful performance improvements (10-15%) can be achieved with as few as 16 training examples, making this approach accessible even with limited data.
This training recipe is freely available under the MIT license.