PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency
This repository provides the code for the paper PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency.
Zhangyi Liu*, Huaizhi Qu*, Xiaowei Yin*, He Sun, Yanjun Han, Tianlong Chen, and Zhun Deng (*Equal contribution, sorted alphabetically)
PETS introduces a principled study of trajectory allocation for test-time scaling through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. We study both offline and online settings: in the offline regime, we connect trajectory allocation to crowdsourcing, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm; in the online streaming regime, we propose a novel one-shot allocation strategy that adapts budgets to question difficulty. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation.
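To make the self-consistency rate concrete, here is a minimal sketch (not the repository's implementation): it approximates the infinite-budget majority vote by the vote over all pre-sampled trajectories for a question, and measures how often a budget of `n` trajectories agrees with it.

```python
# Illustrative sketch (not the repository's code): estimating the
# self-consistency rate of a trajectory budget n against the vote
# over all available trajectories, used as a proxy for the
# infinite-budget majority vote.
from collections import Counter
import random

def majority_vote(answers):
    """Return the most common answer among sampled trajectories."""
    return Counter(answers).most_common(1)[0][0]

def self_consistency_rate(all_answers, n, trials=1000, seed=0):
    """Fraction of size-n majority votes that agree with the
    majority vote over *all* available trajectories."""
    rng = random.Random(seed)
    reference = majority_vote(all_answers)
    hits = sum(
        majority_vote(rng.sample(all_answers, n)) == reference
        for _ in range(trials)
    )
    return hits / trials

# Example: 64 pre-sampled answers for one question
answers = ["A"] * 40 + ["B"] * 18 + ["C"] * 6
print(self_consistency_rate(answers, n=8))
```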
PETS provides a complete toolchain for evaluating LLM performance on various reasoning tasks:
- Multiple Benchmarks: AIME, HMMT, BRUMO, GPQA, MMLU-Pro
- Confidence Tracking: Per-token confidence statistics via vLLM plugin (required)
- Majority Voting: Vote on multiple samples to improve answer accuracy
- Parallel Inference: Multi-threaded concurrent requests for efficient large-scale evaluation (see the sketch after this list)
- Flexible Parallelism: Configurable tensor parallel (TP) and data parallel (DP) sizes
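As a rough illustration of the parallel-inference pattern (a minimal sketch, not `reasoning/common.py`; the endpoint, model name, and sampling parameters below are assumptions matching the quick-start examples):

```python
# Illustrative sketch: sampling many trajectories per question with
# concurrent requests against the OpenAI-compatible vLLM server
# started by launch.sh.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Assumed endpoint and model name; adjust to your server configuration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "/path/to/your/model"

def sample_once(question: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        temperature=1.1,
        top_p=0.95,
    )
    return resp.choices[0].message.content

def sample_n(question: str, n: int = 64, workers: int = 16) -> list[str]:
    # Issue n independent sampling requests concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sample_once, [question] * n))
```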
```
PETS/
├── README.md                    # This file
├── install.sh                   # Dependency installation (includes confidence plugin)
├── patch/                       # vLLM patches and plugins
│   └── vllm_confidence_plugin/  # vLLM confidence plugin (required)
├── budget_allocation/           # Budget allocation experiments (offline/online)
│   ├── README.md                # Usage and data format for budget allocation
│   └── plots/                   # Shared plotting modules for budget curves
└── reasoning/                   # Benchmark scripts and shared utilities
    ├── common.py                # Shared utilities (client, inference, confidence, voting)
    ├── aime24.py                # AIME 2024 benchmark
    ├── aime25.py                # AIME 2025 benchmark
    ├── hmmt.py                  # HMMT February 2025 benchmark
    ├── brumo.py                 # BRUMO 2025 benchmark
    ├── gpqa.py                  # GPQA Diamond benchmark
    ├── mmlu_pro.py              # MMLU-Pro benchmark (few-shot CoT)
    └── launch.sh                # Complete launcher (starts server + runs benchmark)
```
```bash
bash install.sh
```

This will install vLLM and the required confidence plugin automatically.
Use `launch.sh` to automatically start the vLLM server with the confidence plugin and run a benchmark:

```bash
cd reasoning

# Run AIME 2024 with 64 samples on 8 GPUs (TP=1, DP=8)
./launch.sh --model-dir /path/to/your/model --task aime24 --n 64

# Run with custom parallelism: TP=2, DP=4 on 8 GPUs
./launch.sh --model-dir /path/to/your/model --task aime24 --tp-size 2 --dp-size 4

# Run GPQA Diamond
./launch.sh --model-dir /path/to/your/model --task gpqa --n 32 --temp 0.6
```

If you prefer to manage the server separately:
```bash
# 1. Enable the confidence plugin (required)
export VLLM_PLUGINS="confidence_logprobs"
export VLLM_CONF_MODE=stats
export VLLM_CONF_TOPK=20
export VLLM_FLAT_LOGPROBS=1

# 2. Start the vLLM server with your desired parallelism
vllm serve /path/to/your/model \
    --tensor-parallel-size 2 \
    --data-parallel-size 4 \
    --port 8000 \
    --gpu-memory-utilization 0.97

# 3. Run a benchmark (in another terminal)
cd reasoning
python aime24.py --n 64 --temperature 1.1 --top_p 0.95
```

Key `launch.sh` parameters:

| Parameter | Description | Default |
|---|---|---|
| `--tp-size` | Tensor parallel size (split the model across GPUs) | 1 |
| `--dp-size` | Data parallel size (replicate the model) | 1 |
| `--api-server-count` | Number of API server processes | `2 * dp-size` |
| `--enable-expert-parallel` | Enable for MoE models (flag => 1, or pass 0/1) | 0 |
Example configurations:
- 8 GPUs, small model: `--tp-size 1 --dp-size 8` (8 replicas)
- 8 GPUs, large model: `--tp-size 4 --dp-size 2` (2 replicas, each spanning 4 GPUs)
- 8 GPUs, huge model: `--tp-size 8 --dp-size 1` (1 replica across all 8 GPUs)
The vLLM confidence plugin is required for PETS. It provides per-token confidence statistics that are used for:
- Tracking inference quality
- Confidence-based answer selection
- Debugging and analysis
Configuration via environment variables:
| Variable | Description | Default |
|---|---|---|
| `VLLM_CONF_MODE` | Output mode: `stats`, `per_token`, `summary`, or `empty` | `stats` |
| `VLLM_CONF_TOPK` | Top-k for confidence calculation | 20 |
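For intuition, the sketch below computes a simple per-token top-k confidence statistic from the standard OpenAI-compatible `logprobs` field; it only illustrates the idea, and the plugin's actual output format may differ.

```python
# Illustrative sketch: a per-token top-k confidence statistic derived
# from the standard logprobs field of an OpenAI-compatible response.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/path/to/your/model",  # assumed model name
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    logprobs=True,
    top_logprobs=20,  # mirrors VLLM_CONF_TOPK=20
)

for tok in resp.choices[0].logprobs.content:
    # Probability mass of the sampled token relative to its
    # top-k alternatives at that position.
    top_mass = sum(math.exp(alt.logprob) for alt in tok.top_logprobs)
    conf = math.exp(tok.logprob) / top_mass
    print(f"{tok.token!r}: confidence={conf:.3f}")
```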
See the README in `reasoning/` for benchmark-specific details.
When the served model name contains `gpt` (e.g. `gpt-oss-120b`), PETS automatically:
- Forces `temperature=1` and `top_p=1` (ignoring user-provided values)
- Adds `extra_body={"reasoning_effort": "high"}` to API requests

These overrides are applied transparently inside `common.process_question()`; no user action is needed.
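A minimal sketch of what this override can look like (hypothetical helper, not the actual `common.process_question()` source):

```python
# Hypothetical sketch of the gpt-model override described above;
# the real logic lives inside common.process_question().
def apply_gpt_overrides(model_name: str, request_kwargs: dict) -> dict:
    """Force sampling parameters for gpt-family served models."""
    if "gpt" in model_name:
        request_kwargs = dict(request_kwargs)  # avoid mutating the caller's dict
        request_kwargs["temperature"] = 1
        request_kwargs["top_p"] = 1
        request_kwargs["extra_body"] = {"reasoning_effort": "high"}
    return request_kwargs
```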
If you use PETS in your research, please cite:
```bibtex
@article{liu2026pets,
  title={PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency},
  author={Liu, Zhangyi and Qu, Huaizhi and Yin, Xiaowei and Sun, He and Han, Yanjun and Chen, Tianlong and Deng, Zhun},
  journal={arXiv preprint arXiv:2602.16745},
  year={2026},
  url={https://arxiv.org/abs/2602.16745}
}
```

This project is released under the MIT License.
