
✅ DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search


📖 Overview

DeepSearch is a reinforcement learning framework that tackles the exploration bottleneck of reinforcement learning with verifiable rewards (RLVR) by integrating Monte Carlo Tree Search (MCTS) into training. Its rollout-guided exploration mechanism and adaptive replay buffers balance exploration breadth and depth, achieving superior performance on mathematical reasoning tasks.

Built on VERL commit 2bd291e5494db03ba358ef279a334c2f0829b979


DeepSearch Overview

🔧 Environment Setup

Prerequisites

DeepSearch requires the following dependencies:

  • Python >= 3.10
  • verl framework
  • sglang for efficient inference

Installation

  1. Install VERL framework:

    cd DeepSearch
    # Follow the official VERL installation guide
    # https://verl.readthedocs.io/en/latest/start/install.html
  2. Install SGLang:

    # Follow the official SGLang installation guide
    # https://docs.sglang.ai/get_started/install.html
    pip install sglang
  3. Install DeepSearch:

    cd DeepSearch
    pip install -e .   # to enable `import deepsearch`

🚀 Quick Start

Inference with vLLM

Our pre-trained model fangwu97/DeepSearch-1.5B supports efficient inference using vLLM.

Install dependencies:

pip install "vllm>=0.8.5.post1"
pip install "transformers>=4.52.4"

Example usage:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


def convert_question_to_messages(question: str):
    messages = [
        {"role": "user",
         "content": question + " Let's think step by step and output the final answer within \\boxed{}."}
    ]
    return messages


model_id = "fangwu97/DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768
)

model = LLM(
    model=model_id,
    tensor_parallel_size=1
)
prompt = tokenizer.apply_chat_template(
    convert_question_to_messages("Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."),
    add_generation_prompt=True,
    tokenize=False
)

outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
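
Because the prompt asks for the final answer inside \boxed{}, a small helper (our own sketch, not part of the released code) can extract it from the generated text for checking:

def extract_boxed_answer(text: str):
    # Return the content of the last \boxed{...} in the response, handling nested braces.
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i, depth, chars = start + len("\\boxed{"), 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        chars.append(ch)
        i += 1
    return "".join(chars)

print(extract_boxed_answer(response))  # the sample question's answer is 70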

🎓 Training

DeepSearch employs a multi-round training strategy with iterative data filtering and replay buffer accumulation to progressively improve model performance.

Dataset

We provide the complete training data used in our experiments under the data/ directory:

data/
├── deepmath103k_round0/
│   ├── train_0.parquet         # Problems with Pass1@4=0% on nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
│   └── train_25.parquet        # Problems with Pass1@4=25% on nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
├── deepmath103k_round1/
│   ├── train_0.parquet         # Problems with Pass1@4=0% on Round 0 best checkpoint
│   └── train_25.parquet        # Problems with Pass1@4=25% on Round 0 best checkpoint
└── val/
    └── aime2025_boxed_n32.parquet  # Validation set (placeholder, not actively used)

Data source: zwhe99/DeepMath-103K

The dataset is progressively filtered based on model performance across training rounds, focusing on challenging problems that the model has not yet mastered.
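
Each split is a standard parquet file, so a quick look at its size and fields needs nothing beyond pandas (no particular column schema is assumed here):

import pandas as pd

# Inspect one of the Round 0 splits: number of problems and whatever fields it carries.
df = pd.read_parquet("data/deepmath103k_round0/train_0.parquet")
print(f"{len(df)} problems; columns: {df.columns.tolist()}")
print(df.iloc[0])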


Round 0: Initial Training

Launch training:

bash round0.sh

Output structure:

results/
└── deepsearch_1_5b_round0/
    ├── checkpoints/              # Model checkpoints saved every 5 steps
    │   ├── global_step_5/
    │   ├── global_step_10/
    │   └── ...
    └── rollout_data/             # MCTS rollout trajectories saved every step
        ├── 1.jsonl
        ├── 2.jsonl
        └── ...
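
Each file under rollout_data/ is JSON Lines, one rollout record per line; you can peek at a record's fields without assuming any particular schema:

import json

# Print the top-level keys of the first MCTS rollout record from training step 1.
with open("results/deepsearch_1_5b_round0/rollout_data/1.jsonl") as f:
    first_record = json.loads(f.readline())
print(sorted(first_record.keys()))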

Convert checkpoints to HuggingFace format:

Follow the VERL checkpoint conversion guide to merge FSDP/Megatron checkpoints into standard HuggingFace format.
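
Once merged, the result should behave like any other Hugging Face checkpoint; a quick sanity check (the directory below is a placeholder for wherever the conversion step writes the merged weights):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the output of the VERL checkpoint conversion.
ckpt_dir = "results/deepsearch_1_5b_round0/checkpoints/global_step_5/huggingface"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
print(model.config.model_type, f"{model.num_parameters() / 1e9:.2f}B parameters")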


Replay Buffer Registration (Round 0)

Register replay buffer from rollout data:

python deepsearch/data_setup/register_replay_buffer.py \
    --rollout_dir results/deepsearch_1_5b_round0/rollout_data

Output:

results/
└── deepsearch_1_5b_round0/
    └── rollout_data/
        ├── 1.jsonl
        ├── 2.jsonl
        ├── ...
        └── replay_buffer_0.25.json  # Replay buffer with 25% success threshold

Data Filtering for Round 1

We provide pre-filtered training data in data/deepmath103k_round1/. To create your own filtered dataset:

  1. Generate predictions using the best Round 0 checkpoint:

    • Perform n=4 generations per problem on data/deepmath103k_round0
    • Follow the vLLM inference example shown above
  2. Calculate Pass@4 scores:

    • Use the evaluation metric from deepsearch/data_setup/register_replay_buffer.py
  3. Filter data based on performance (a minimal filtering sketch follows this list):

    • Keep problems with Pass1@4 = 0% (still challenging)
    • Keep problems with Pass1@4 = 25% (moderate difficulty)
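
A minimal sketch of these steps, assuming you have already run 4 generations per problem with the vLLM example above and graded each one yourself; the placeholder flags and the output filename below are illustrative, and the metric in deepsearch/data_setup/register_replay_buffer.py remains the reference implementation:

import pandas as pd

def pass1_at_k(flags):
    # Fraction of the k sampled generations graded correct (0.0, 0.25, ... for k=4).
    return sum(flags) / len(flags)

df = pd.read_parquet("data/deepmath103k_round0/train_0.parquet")

# Grade the 4 generations per problem with your own checker; the placeholder below
# only shows the expected shape: a list of 4 booleans per problem.
flags_per_problem = [[False, False, False, True] for _ in range(len(df))]

df["pass1_at_4"] = [pass1_at_k(flags) for flags in flags_per_problem]
filtered = df[df["pass1_at_4"].isin([0.0, 0.25])].drop(columns=["pass1_at_4"])
filtered.to_parquet("data/deepmath103k_round1/train_custom.parquet")  # illustrative output name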

Round 1: Training with Replay Buffer

Launch Round 1 training:

bash round1.sh

This round incorporates:

  • ✅ Filtered training data from Round 0 evaluation
  • ✅ Accumulated replay buffer from Round 0
  • ✅ Same rollout-guided MCTS exploration strategy

Output structure:

results/
└── deepsearch_1_5b_round1/
    ├── checkpoints/              # Round 1 checkpoints
    │   ├── global_step_5/
    │   └── ...
    └── rollout_data/             # Round 1 rollout data
        ├── 1.jsonl
        └── ...

Extending to More Rounds

To continue training for additional rounds (Round 2, 3, ...):

  1. Register new replay buffer:

    python deepsearch/data_setup/register_replay_buffer.py \
        --rollout_dir results/deepsearch_1_5b_round{N}/rollout_data

    ⚠️ Important: Combine the new replay buffer with previous ones to accumulate high-quality trajectories (see the merge sketch after this list).

  2. Filter training data:

    • Evaluate the current round's best model on the previous round's training data
    • Select problems with Pass1@4 = 0% or 25%
    • Update data.train_files in the training configuration
  3. Update training configuration:

    • Set data.train_files to the newly filtered dataset path
    • Set actor_rollout_ref.rollout.mcts_config.replay_buffer to the accumulated replay buffer path
  4. Launch next round:

    bash round{N}.sh
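
The exact merge depends on the JSON layout written by register_replay_buffer.py; assuming each buffer is either a JSON object keyed by problem or a JSON list of trajectories, a generic accumulation could look like this (paths and the merged filename are illustrative):

import json

def load_json(path):
    with open(path) as f:
        return json.load(f)

round0 = load_json("results/deepsearch_1_5b_round0/rollout_data/replay_buffer_0.25.json")
round1 = load_json("results/deepsearch_1_5b_round1/rollout_data/replay_buffer_0.25.json")

# Dict-shaped buffers merge key-by-key (later rounds win on collisions);
# list-shaped buffers simply concatenate.
if isinstance(round0, dict):
    merged = {**round0, **round1}
else:
    merged = round0 + round1

with open("results/replay_buffer_round0_round1.json", "w") as f:
    json.dump(merged, f)

Point actor_rollout_ref.rollout.mcts_config.replay_buffer at the merged file when configuring the next round.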

📊 Results

DeepSearch achieves competitive performance on challenging mathematical reasoning benchmarks. Please refer to our paper for detailed experimental results and analysis.


📝 Citation

If you find DeepSearch useful in your research, please consider citing:

@article{wu2025deepsearch,
  title={DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
  author={Wu, Fang and Xuan, Weihao and Qi, Heli and Lu, Ximing and Tu, Aaron and Li, Li Erran and Choi, Yejin},
  journal={arXiv preprint arXiv:2509.25454},
  year={2025}
}

🙏 Acknowledgments

DeepSearch is built on the VERL framework and uses SGLang for efficient inference.

🤝 Contributing

We welcome contributions! Please feel free to submit issues or pull requests.


📧 Contact

For questions or feedback, please open an issue on GitHub.
