PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency
This repository provides the code for the paper PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency.
Zhangyi Liu*, Huaizhi Qu*, Xiaowei Yin*, He Sun, Yanjun Han, Tianlong Chen, and Zhun Deng (*Equal contribution, sorted alphabetically)
PETS introduces a principled study of trajectory allocation for test-time scaling through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. We study both offline and online settings: in the offline regime, we connect trajectory allocation to crowdsourcing, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm; in the online streaming regime, we propose a novel one-shot allocation strategy that adapts budgets to question difficulty. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation.
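To make the self-consistency rate concrete, here is a minimal sketch (not the repository's implementation): it approximates the infinite-budget majority vote by the vote over all pre-sampled trajectories for a question, and measures how often a budget of `n` trajectories agrees with it.

```python
# Illustrative sketch (not the repository's code): estimating the
# self-consistency rate of a trajectory budget n against the vote
# over all available trajectories, used as a proxy for the
# infinite-budget majority vote.
from collections import Counter
import random

def majority_vote(answers):
    """Return the most common answer among sampled trajectories."""
    return Counter(answers).most_common(1)[0][0]

def self_consistency_rate(all_answers, n, trials=1000, seed=0):
    """Fraction of size-n majority votes that agree with the
    majority vote over *all* available trajectories."""
    rng = random.Random(seed)
    reference = majority_vote(all_answers)
    hits = sum(
        majority_vote(rng.sample(all_answers, n)) == reference
        for _ in range(trials)
    )
    return hits / trials

# Example: 64 pre-sampled answers for one question
answers = ["A"] * 40 + ["B"] * 18 + ["C"] * 6
print(self_consistency_rate(answers, n=8))
```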
PETS provides a complete toolchain for evaluating LLM performance on various reasoning tasks:
- Multiple Benchmarks: AIME, HMMT, BRUMO, GPQA, MMLU-Pro
- Confidence Tracking: Per-token confidence statistics via vLLM plugin (required)
- Majority Voting: Vote on multiple samples to improve answer accuracy
- Parallel Inference: Multi-threaded concurrent requests for efficient large-scale evaluation (see the sketch after this list)
- Flexible Parallelism: Configurable tensor parallel (TP) and data parallel (DP) sizes
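As a rough illustration of the parallel-inference pattern (a minimal sketch, not `reasoning/common.py`; the endpoint, model name, and sampling parameters below are assumptions matching the quick-start examples):

```python
# Illustrative sketch: sampling many trajectories per question with
# concurrent requests against the OpenAI-compatible vLLM server
# started by launch.sh.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Assumed endpoint and model name; adjust to your server configuration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "/path/to/your/model"

def sample_once(question: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        temperature=1.1,
        top_p=0.95,
    )
    return resp.choices[0].message.content

def sample_n(question: str, n: int = 64, workers: int = 16) -> list[str]:
    # Issue n independent sampling requests concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sample_once, [question] * n))
```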
```
PETS/
├── README.md                    # This file
├── install.sh                   # Dependency installation (includes confidence plugin)
├── patch/                       # vLLM patches and plugins
│   └── vllm_confidence_plugin/  # vLLM confidence plugin (required)
├── budget_allocation/           # Budget allocation experiments (offline/online)
│   ├── README.md                # Usage and data format for budget allocation
│   └── plots/                   # Shared plotting modules for budget curves
└── reasoning/                   # Benchmark scripts and shared utilities
    ├── common.py                # Shared utilities (client, inference, confidence, voting)
    ├── aime24.py                # AIME 2024 benchmark
    ├── aime25.py                # AIME 2025 benchmark
    ├── hmmt.py                  # HMMT February 2025 benchmark
    ├── brumo.py                 # BRUMO 2025 benchmark
    ├── gpqa.py                  # GPQA Diamond benchmark
    ├── mmlu_pro.py              # MMLU-Pro benchmark (few-shot CoT)
    └── launch.sh                # Complete launcher (starts server + runs benchmark)
```
```bash
bash install.sh
```

This will install vLLM and the required confidence plugin automatically.
Use `launch.sh` to automatically start the vLLM server with the confidence plugin and run a benchmark:

```bash
cd reasoning

# Run AIME 2024 with 64 samples on 8 GPUs (TP=1, DP=8)
./launch.sh --model-dir /path/to/your/model --task aime24 --n 64

# Run with custom parallelism: TP=2, DP=4 on 8 GPUs
./launch.sh --model-dir /path/to/your/model --task aime24 --tp-size 2 --dp-size 4

# Run GPQA Diamond
./launch.sh --model-dir /path/to/your/model --task gpqa --n 32 --temp 0.6
```

If you prefer to manage the server separately:
```bash
# 1. Enable the confidence plugin (required)
export VLLM_PLUGINS="confidence_logprobs"
export VLLM_CONF_MODE=stats
export VLLM_CONF_TOPK=20
export VLLM_FLAT_LOGPROBS=1

# 2. Start the vLLM server with your desired parallelism
vllm serve /path/to/your/model \
    --tensor-parallel-size 2 \
    --data-parallel-size 4 \
    --port 8000 \
    --gpu-memory-utilization 0.97

# 3. Run a benchmark (in another terminal)
cd reasoning
python aime24.py --n 64 --temperature 1.1 --top_p 0.95
```

Key `launch.sh` parameters:

| Parameter | Description | Default |
|---|---|---|
| `--tp-size` | Tensor parallel size (split the model across GPUs) | 1 |
| `--dp-size` | Data parallel size (replicate the model) | 1 |
| `--api-server-count` | Number of API server processes | `2 * dp-size` |
| `--enable-expert-parallel` | Enable for MoE models (flag => 1, or pass 0/1) | 0 |
Example configurations:
- 8 GPUs, small model: `--tp-size 1 --dp-size 8` (8 replicas)
- 8 GPUs, large model: `--tp-size 4 --dp-size 2` (2 replicas, each spanning 4 GPUs)
- 8 GPUs, huge model: `--tp-size 8 --dp-size 1` (1 replica across all 8 GPUs)
The vLLM confidence plugin is required for PETS. It provides per-token confidence statistics that are used for:
- Tracking inference quality
- Confidence-based answer selection
- Debugging and analysis
Configuration via environment variables:
| Variable | Description | Default |
|---|---|---|
| `VLLM_CONF_MODE` | Output mode: `stats`, `per_token`, `summary`, or `empty` | `stats` |
| `VLLM_CONF_TOPK` | Top-k for confidence calculation | 20 |
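For intuition, the sketch below computes a simple per-token top-k confidence statistic from the standard OpenAI-compatible `logprobs` field; it only illustrates the idea, and the plugin's actual output format may differ.

```python
# Illustrative sketch: a per-token top-k confidence statistic derived
# from the standard logprobs field of an OpenAI-compatible response.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/path/to/your/model",  # assumed model name
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    logprobs=True,
    top_logprobs=20,  # mirrors VLLM_CONF_TOPK=20
)

for tok in resp.choices[0].logprobs.content:
    # Probability mass of the sampled token relative to its
    # top-k alternatives at that position.
    top_mass = sum(math.exp(alt.logprob) for alt in tok.top_logprobs)
    conf = math.exp(tok.logprob) / top_mass
    print(f"{tok.token!r}: confidence={conf:.3f}")
```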
See the README in `reasoning/` for benchmark-specific details.
When the served model name contains `gpt` (e.g. `gpt-oss-120b`), PETS automatically:
- Forces `temperature=1` and `top_p=1` (ignoring user-provided values)
- Adds `extra_body={"reasoning_effort": "high"}` to API requests

These overrides are applied transparently inside `common.process_question()`; no user action is needed.
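A minimal sketch of what this override can look like (hypothetical helper, not the actual `common.process_question()` source):

```python
# Hypothetical sketch of the gpt-model override described above;
# the real logic lives inside common.process_question().
def apply_gpt_overrides(model_name: str, request_kwargs: dict) -> dict:
    """Force sampling parameters for gpt-family served models."""
    if "gpt" in model_name:
        request_kwargs = dict(request_kwargs)  # avoid mutating the caller's dict
        request_kwargs["temperature"] = 1
        request_kwargs["top_p"] = 1
        request_kwargs["extra_body"] = {"reasoning_effort": "high"}
    return request_kwargs
```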
If you use PETS in your research, please cite:
```bibtex
@article{liu2026pets,
  title={PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency},
  author={Liu, Zhangyi and Qu, Huaizhi and Yin, Xiaowei and Sun, He and Han, Yanjun and Chen, Tianlong and Deng, Zhun},
  journal={arXiv preprint arXiv:2602.16745},
  year={2026},
  url={https://arxiv.org/abs/2602.16745}
}
```

This project is released under the MIT License.
