Train your own frontier-level deductive reasoning model with reinforcement learning.
This repository contains the training recipe for creating frontier-level deductive reasoning models using reinforcement learning. Our research demonstrates how smaller, open-weight language models can be trained to perform complex logical deduction tasks at frontier-level performance, matching or exceeding proprietary models at a fraction of the cost.
We used the Temporal Clue puzzle dataset to train Qwen 14B and 32B models, improving their deductive reasoning capabilities significantly through Group Relative Policy Optimization (GRPO). Our trained models approach the performance of leading proprietary models like Claude 3.7 Sonnet while maintaining cost efficiency.
- Training Recipe: This repository (recipe for RL training)
- Training Dataset: Temporal Clue Puzzles
- RL Experiments: OpenPipe RL Experiments
- Model Weights:
- Blog Post: Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"
Follow these steps to run the training recipe.

Prerequisites:

- Sufficient NVIDIA GPUs for your chosen model:
  - Qwen 14B requires at least 2 GPUs
  - Qwen 32B requires at least 4 GPUs
- uv package manager
- Weights & Biases account
1. Clone this repository:

   ```bash
   git clone https://github.com/bradhilton/deductive-reasoning.git
   cd deductive-reasoning
   ```
2. Install dependencies using uv:

   ```bash
   uv sync
   ```
3. Reinstall torchtune due to an executable naming conflict with Ray Tune 🙈:

   ```bash
   uv remove torchtune
   uv add torchtune
   ```
4. (Optional) Configure environment variables:

   ```bash
   cp .env.example .env
   ```

   Edit the `.env` file to add your Weights & Biases API key and project name:

   ```
   WANDB_API_KEY=your_wandb_api_key_here
   WANDB_PROJECT=your_project_name
   ```
5. Open the `train.ipynb` notebook or `train.py` script and configure the training parameters (see the illustrative sketch after these steps):
   - Set a unique `run_name` for your experiment
   - Choose the model (e.g., `models.qwen_14b()` or `models.qwen_32b()`)
   - Adjust other parameters as needed (learning rate, number of iterations, etc.)
6. Run the training:
   - If using the notebook: execute all cells in `train.ipynb`
   - If using the script: run `uv run train.py`
7. Monitor training progress in Weights & Biases.
The training process will save the latest and/or best checkpoints in your output directory, allowing you to resume training if interrupted.
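As an illustration of the configuration step above, the top of `train.py` / `train.ipynb` might look roughly like the sketch below. Only `run_name`, `models.qwen_14b()`, and `models.qwen_32b()` are named in the steps; the remaining parameter names and values are assumptions and may not match the actual script.

```python
# Illustrative configuration sketch; parameter names other than run_name and the
# models helpers are assumptions and may differ from the repository's train.py.
import models  # repository module exposing the model presets mentioned above

run_name = "qwen-14b-temporal-clue-001"  # unique name for this experiment
model = models.qwen_14b()                # or models.qwen_32b() (needs >= 4 GPUs)
learning_rate = 6e-6                     # value used in the training recipe
num_iterations = 100                     # assumed placeholder; adjust as needed
```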
Our training approach used reinforcement learning to incrementally improve models' deductive reasoning capabilities:
- Environment: Temporal Clue puzzles (inspired by the board game Clue/Cluedo) with verifiable solutions
- Algorithm: Group Relative Policy Optimization (GRPO) without KL divergence penalty
- Training Loop:
  1. Generate model responses to puzzle tasks
  2. Grade responses and estimate advantages for each group of completions
  3. Fine-tune the model using clipped policy gradients
  4. Repeat with new puzzles until peak performance
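To make the grading and update steps concrete, here is a minimal sketch of group-relative advantage estimation and the clipped policy-gradient objective without a KL penalty. It is illustrative only, assuming PyTorch, scalar per-completion rewards, and sequence-level log-probabilities; the function names and shapes are not taken from the repository.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within each group (one group = all samples for one task)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)


def clipped_pg_loss(
    logp_new: torch.Tensor,
    logp_old: torch.Tensor,
    advantages: torch.Tensor,
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """PPO-style clipped surrogate objective (no KL penalty term)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Example shapes: 32 tasks per iteration, 50 sampled completions per task.
rewards = torch.rand(32, 50)                     # puzzle-grading scores in [0, 1]
advantages = group_relative_advantages(rewards)  # relative quality within each task
```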
We used the torchtune library for efficient training and vLLM for inference, with the following key parameters:
- Models: Qwen 2.5 Instruct 14B & 32B
- Tasks per Iteration: 32
- Samples per Task per Iteration: 50
- Learning Rate: 6e-6
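As a rough sketch of the inference side, the "samples per task" setting maps to vLLM's `n` sampling parameter. The snippet below is illustrative, not the repository's actual code; the temperature, token limit, and placeholder prompts are assumptions.

```python
# Illustrative rollout generation with vLLM (not the repository's actual code).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-14B-Instruct", tensor_parallel_size=2)

# One prompt per Temporal Clue puzzle in the current iteration (32 in practice).
prompts = ["<temporal clue puzzle #1>", "<temporal clue puzzle #2>"]

# n=50 completions per task; other settings are assumed values.
sampling = SamplingParams(n=50, temperature=1.0, max_tokens=2048)
outputs = llm.generate(prompts, sampling)  # completions to grade and fine-tune on
```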
Our training produced impressive performance gains, demonstrating that open-weight models can achieve frontier-level reasoning capabilities.
We dramatically improved the cost-accuracy tradeoff compared to proprietary models.
Notably, we discovered that meaningful performance improvements (10-15%) can be achieved with as few as 16 training examples, making this approach accessible even with limited data.
This training recipe is freely available under the MIT license.