SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

This repository serves as the official implementation of the paper "SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks".

News

[2026.04] 🎉 This paper has been accepted to the ACL 2026 Main Conference!

💡 Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting because temporal credit assignment is unstable over long Chain-of-Thought (CoT) horizons and the value model has a prohibitive memory cost. Critic-free alternatives such as GRPO mitigate these issues, but they need multiple samples per prompt to estimate a baseline, which adds significant computational overhead and severely limits training throughput.
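To make the multi-sampling cost concrete, here is a minimal sketch of a group-relative baseline of the kind GRPO uses. It is not code from this repository: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt.

    Assumption: population standard deviation and a small eps for stability;
    real implementations may differ in both choices.
    """
    baseline = mean(rewards)          # the group mean serves as the baseline
    scale = pstdev(rewards) + eps     # the group spread normalizes the signal
    return [(r - baseline) / scale for r in rewards]

# Four rollouts for one prompt, two of them correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# A single rollout carries no learning signal: its reward equals the baseline.
print(group_relative_advantages([1.0]))
```

With one sample the advantage is always zero, so a group baseline needs several rollouts per prompt before any update is possible.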

Sequence-Level PPO (SPPO) is a scalable algorithm that combines the sample efficiency of PPO with the stability of outcome-based updates by reformulating the problem from a Multi-Step MDP to a Sequence-Level Contextual Bandit.

Sequence-Level Optimization: Treats the entire reasoning chain as a single atomic action, utilizing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling.
Decoupled Small Critic: Because scalar solvability estimation is significantly simpler than generative reasoning, SPPO enables training with a lightweight critic (e.g., 1.5B Critic for a 7B Policy), radically reducing VRAM requirements without sacrificing performance.
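The two points above can be sketched as follows: one scalar critic estimate per prompt, one advantage per response, shared by every token of that response. This is a minimal illustration, not the repository's implementation. The function names, the token-level probability ratio, and the `clip_eps` default are assumptions; the paper may define the ratio at the sequence level instead.

```python
import math

def sequence_advantage(reward, value):
    """Outcome reward minus the critic's scalar estimate for the prompt."""
    return reward - value

def clipped_surrogate_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped objective with one shared advantage per response.

    logp_new / logp_old: per-token log-probabilities under the current policy
    and the rollout policy. Assumption: the ratio is computed per token.
    """
    losses = []
    for new, old in zip(logp_new, logp_old):
        ratio = math.exp(new - old)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        # PPO takes the pessimistic (smaller) of the two surrogate terms.
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)

# One rollout (N=1): the verifier returns 1.0, the critic predicted 0.25.
adv = sequence_advantage(reward=1.0, value=0.25)
loss = clipped_surrogate_loss([-1.0, -2.0], [-1.0, -2.0], adv)
print(adv, loss)
```

Because the baseline comes from the critic rather than from sibling samples, a single rollout per prompt already yields a non-zero advantage.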

Figure 1: Overall Architecture of Sequence-Level PPO (SPPO).

Figure 2: Peak VRAM Allocation Analysis.

Figure 3: Training Efficiency and Performance.

🚀 Key Features

Exclusive SPPO Implementation: Full support for the Sequence-Level Contextual Bandit formulation with Single-Sample Efficiency ($N=1$).
Efficient & Stable: Resolves the temporal credit assignment problem in long-horizon CoT tasks while avoiding the computational bottleneck of multi-sampling.
Extreme Memory Efficiency: Natively supports "Small Critic" architectures (e.g., training a 7B policy with a 1.5B critic), making efficient RL alignment accessible on consumer-grade hardware.
Scalable: Built on top of verl, supporting FSDP and Megatron for training large-scale models.
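As a rough back-of-the-envelope check on the "Small Critic" point, the sketch below estimates the training state for a critic of a given size. The figure of 16 bytes per parameter (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments) is a common mixed-precision accounting and an assumption here, not a number from the paper. It ignores activations, sharding, and offloading, so it does not reproduce the measured peak VRAM in Figure 2.

```python
def training_state_gb(n_params, bytes_per_param=16):
    """Approximate weight + gradient + optimizer state size in GB (1 GB = 1e9 bytes).

    Assumption: 2 (bf16 weights) + 2 (bf16 grads) + 4 (fp32 master weights)
    + 8 (fp32 Adam first and second moments) = 16 bytes per parameter.
    """
    return n_params * bytes_per_param / 1e9

full_critic = training_state_gb(7e9)     # critic the same size as a 7B policy
small_critic = training_state_gb(1.5e9)  # 1.5B critic
print(full_critic, small_critic, full_critic - small_critic)
```

Under these assumptions, the critic's training state scales linearly with its parameter count, so swapping a 7B critic for a 1.5B one shrinks that component by roughly a factor of 4.7 before any sharding is applied.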

🛠️ Quick Start

Installation

Option 1: Automated Setup (Recommended)

bash uv_verl.sh

Option 2: Manual Setup

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip

# Install package in editable mode
pip install --no-deps -e .

# Add project root to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd)

⚙️ Training with SPPO

We provide pre-configured scripts for various model sizes and settings.

# 1. DeepSeek-R1-Distill-Qwen 1.5B (SPPO DeepscaleR)
bash run_scripts/run_ds1.5B_PPO_SEQUENCE_shuffle.sh

# 2. DeepSeek-R1-Distill-Qwen 7B (DAPO-17k)
bash run_scripts/run_R1-7B_DAPO_SEQUENCE.sh

# 3. DeepSeek-R1-Distill-Qwen 7B (DAPO-17k with Small Critic)
# Utilizes a 1.5B critic to align the 7B policy.
bash run_scripts/run_R1-7B_DAPO_SEQUENCE_small_critic.sh

📊 Evaluation

Models can be evaluated out of the box on the provided AIME24/25, AMC23, MATH, and Minerva benchmarks using the verl evaluation toolkit. Training logs and checkpoints are written automatically to the current working directory.

📜 Citation

If you find SPPO useful for your research, please cite our paper:

@misc{wang2026spposequencelevelppolonghorizon,
      title={SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks}, 
      author={Tianyi Wang and Yixia Li and Long Li and Yibiao Chen and Shaohan Huang and Yun Chen and Peng Li and Yang Liu and Guanhua Chen},
      year={2026},
      eprint={2604.08865},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.08865}, 
}
