CopT is a reasoning pipeline with continuous-space verifiers, enabling LLMs to start with a draft answer and invoke on-policy thinking conditioned on it.
(a) Conceptual comparison between CoT thinking and CopT on-policy thinking. (b) CopT contrasts the output distributions under discrete and continuous inputs. (c) CopT improves peak accuracy, marked by *, across mathematics, coding, and agentic reasoning tasks and nearly halves token usage at matched accuracy.
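As a rough illustration of panel (b) and of accepting a draft answer directly, the sketch below compares two toy output distributions and accepts the draft when they agree closely. Everything in it is an assumption made for illustration: this README does not specify the verifier score, so KL divergence is used as a stand-in, and `kl_divergence`, `accept_draft`, and the probabilities are hypothetical names and values rather than parts of the CopT codebase. The only behavior taken from this README is the direction of the threshold: a smaller `--tau_a` lets fewer drafts through.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def accept_draft(p_discrete, p_continuous, tau_a):
    """Hypothetical gate: accept the draft when the disagreement is at most tau_a."""
    return kl_divergence(p_discrete, p_continuous) <= tau_a

# Toy distributions over three candidate tokens (made-up numbers).
p_discrete = [0.7, 0.2, 0.1]
p_continuous = [0.5, 0.3, 0.2]

# A strict threshold rejects the draft; a loose one accepts it.
for tau_a in (0.01, 0.6):
    print(tau_a, accept_draft(p_discrete, p_continuous, tau_a))
```

With these toy numbers the divergence is about 0.085, so the draft is rejected at 0.01 and accepted at 0.6.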
git clone https://github.com/sdc17/CopT.git
cd CopT
conda create -n copt python=3.12
conda activate copt
pip install -r requirements.txt
pip install transformers==5.7.0 # Only for Qwen3.5 support

- Supported models: Qwen3 and Qwen3.5 model families, from 2B to 35B
# Evaluate on Math500, for example
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run.py --model_name Qwen/Qwen3-8B --dataset_name math500 --batch_size 128 --method copt
python merge.py --model_name Qwen/Qwen3-8B --dataset_name math500 --method copt
# Reasoning effort control via tau_a and tau_r
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run.py --model_name Qwen/Qwen3-8B --dataset_name math500 --batch_size 128 --method copt --tau_a 0.6 --tau_r 0.4
python merge.py --model_name Qwen/Qwen3-8B --dataset_name math500 --method copt

- Reasoning effort control
  - Decrease --tau_a: Increase reasoning effort by allowing fewer draft answers to be accepted directly
  - Decrease --tau_r: Increase reasoning effort by making on-policy thinking rely less on draft answers
  - Increase --tau_a: Decrease reasoning effort by allowing more draft answers to be accepted directly
  - Increase --tau_r: Decrease reasoning effort by making on-policy thinking rely more on draft answers
- Increase --nproc_per_node to enable faster evaluation on multiple GPUs
- Modify --model_name and --dataset_name for evaluation with different models and datasets
- Please check run.sh for more examples
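To try several reasoning-effort settings without retyping the long command lines, the run and merge commands can be generated programmatically. The sketch below is a minimal helper under stated assumptions: `build_eval_commands` is a hypothetical name that is not part of this repository, the second pair of threshold values is illustrative only, and the helper uses only flags that appear in the commands above. It prints the commands rather than executing them.

```python
import random
import shlex

def build_eval_commands(model_name, dataset_name, tau_a=None, tau_r=None,
                        nproc_per_node=1, batch_size=128, method="copt"):
    """Return (run_cmd, merge_cmd) as argument lists mirroring the README commands."""
    # Mirrors $((RANDOM + 20000)): bash RANDOM is an integer in [0, 32767].
    master_port = random.randint(0, 32767) + 20000
    run_cmd = [
        "torchrun", "--nproc_per_node", str(nproc_per_node),
        "--nnodes", "1", "--node_rank", "0", "--master_port", str(master_port),
        "run.py", "--model_name", model_name, "--dataset_name", dataset_name,
        "--batch_size", str(batch_size), "--method", method,
    ]
    # Pass the thresholds only when given, so the script's own defaults apply otherwise.
    if tau_a is not None:
        run_cmd += ["--tau_a", str(tau_a)]
    if tau_r is not None:
        run_cmd += ["--tau_r", str(tau_r)]
    merge_cmd = ["python", "merge.py", "--model_name", model_name,
                 "--dataset_name", dataset_name, "--method", method]
    return run_cmd, merge_cmd

# (0.6, 0.4) comes from the example above; (0.3, 0.2) lowers both thresholds,
# which per the notes above increases reasoning effort. Values are illustrative.
for tau_a, tau_r in [(0.6, 0.4), (0.3, 0.2)]:
    run_cmd, merge_cmd = build_eval_commands("Qwen/Qwen3-8B", "math500", tau_a, tau_r)
    print(shlex.join(run_cmd))
    print(shlex.join(merge_cmd))
```

This README does not say whether runs with different thresholds write to separate output files, so it is safer to run the merge step after each run and before starting the next one.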
# Evaluate on ZebraArena as an example
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run_agents.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --batch_size 16 --method copt --zebra_arena_space Small --zebra_arena_max_turns 16
python merge.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --method copt --zebra_arena_space Small
# Reasoning effort control via tau_a and tau_r
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run_agents.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --batch_size 16 --method copt --zebra_arena_space Small --zebra_arena_max_turns 16 --tau_a 1.5 --tau_r 0
python merge.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --method copt --zebra_arena_space Small

- Download the dataset here or view the raw dataset here
- Specify the dataset path with --zebra_arena_data_dir
- We recommend using Qwen3.5 instead of Qwen3 for agentic tasks
- Same reasoning effort control as general reasoning
- Increase --nproc_per_node to enable faster evaluation on multiple GPUs
- Modify --model_name and --dataset_name for evaluation with different models and datasets
- Please check run.sh for more examples
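The agentic commands carry a few extra ZebraArena flags, including the dataset path. The following sketch assembles them in one place. It is an illustration under stated assumptions: `build_agent_commands` is a hypothetical helper, `data/zebra_arena` is a placeholder path, the fixed `master_port` default replaces the random port used above, and `nproc_per_node=4` assumes four local GPUs. Only flags shown in this README are used, and the merge command is left exactly as in the example because the README does not indicate that it takes the data directory.

```python
import shlex

def build_agent_commands(data_dir, model_name="Qwen/Qwen3.5-35B-A3B", space="Small",
                         max_turns=16, nproc_per_node=1, batch_size=16, master_port=29500):
    """Return (run_cmd, merge_cmd) shell strings for a ZebraArena evaluation."""
    run_args = [
        "torchrun", "--nproc_per_node", str(nproc_per_node),
        "--nnodes", "1", "--node_rank", "0", "--master_port", str(master_port),
        "run_agents.py", "--model_name", model_name, "--dataset_name", "zebra_arena",
        "--batch_size", str(batch_size), "--method", "copt",
        "--zebra_arena_space", space, "--zebra_arena_max_turns", str(max_turns),
        "--zebra_arena_data_dir", data_dir,
    ]
    merge_args = [
        "python", "merge.py", "--model_name", model_name, "--dataset_name", "zebra_arena",
        "--method", "copt", "--zebra_arena_space", space,
    ]
    return shlex.join(run_args), shlex.join(merge_args)

# Placeholder dataset path; four processes assume four GPUs on this machine.
run_cmd, merge_cmd = build_agent_commands("data/zebra_arena", nproc_per_node=4)
print(run_cmd)
print(merge_cmd)
```

The printed strings can be pasted into a shell, or the helper can be adapted to pass `--tau_a` and `--tau_r` as in the agentic example above.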
We thank the contributors of the open-source projects SwiReasoning and ZebraArena.
Please cite our paper if you find this codebase helpful.
@misc{shi2026coptcontrastiveonpolicythinking,
title={CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning},
author={Dachuan Shi and Hanlin Zhu and Xiangchi Yuan and Wanjia Zhao and Kejing Xia and Wen Xiao and Wenke Lee},
year={2026},
eprint={2605.20075},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.20075},
}