CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

👀 TL;DR

CopT is a reasoning pipeline with continuous-space verifiers, enabling LLMs to start with a draft answer and invoke on-policy thinking conditioned on it.

Figure: (a) Conceptual comparison between CoT thinking and CopT on-policy thinking. (b) CopT contrasts the output distributions under discrete and continuous inputs. (c) CopT improves peak accuracy, marked by *, across mathematics, coding, and agentic reasoning tasks and nearly halves token usage at matched accuracy.

⚙️ Getting Started

Clone the project

git clone https://github.com/sdc17/CopT.git
cd CopT

Environment setup

conda create -n copt python=3.12
conda activate copt
pip install -r requirements.txt
pip install transformers==5.7.0 # Only for Qwen3.5 support

🔍 Supported Models

Qwen3 and Qwen3.5 model families, from 2B to 35B parameters

📈 General Reasoning

# Evaluate on Math500, for example
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run.py --model_name Qwen/Qwen3-8B --dataset_name math500 --batch_size 128 --method copt
python merge.py --model_name Qwen/Qwen3-8B --dataset_name math500 --method copt

# Reasoning effort control via tau_a and tau_r
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run.py --model_name Qwen/Qwen3-8B --dataset_name math500 --batch_size 128 --method copt --tau_a 0.6 --tau_r 0.4
python merge.py --model_name Qwen/Qwen3-8B --dataset_name math500 --method copt

Reasoning effort control
- Decrease --tau_a: Increase reasoning effort by allowing fewer draft answers to be accepted directly
- Decrease --tau_r: Increase reasoning effort by making on-policy thinking rely less on draft answers
- Increase --tau_a: Decrease reasoning effort by allowing more draft answers to be accepted directly
- Increase --tau_r: Decrease reasoning effort by making on-policy thinking rely more on draft answers

Increase --nproc_per_node to speed up evaluation by using multiple GPUs.
Modify --model_name and --dataset_name to evaluate other models and datasets.
See run.sh for more examples.
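
The effort controls described above lend themselves to a small sweep. The sketch below prints the commands it would run and executes them only when DRY_RUN=0 is set. It uses only flags that appear in this README; the helper names `run_or_print` and `copt_sweep`, the `DRY_RUN` variable, and the threshold values are illustrative assumptions, not part of the repository. This README does not say whether runs with different thresholds write to separate output files, so check that before sweeping for real.

```shell
#!/usr/bin/env bash
# Illustrative sweep over --tau_a / --tau_r.
# The threshold values below are examples, not recommended settings.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run_or_print() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: run (or print) one evaluation plus merge per threshold pair.
copt_sweep() {
  local model="$1" dataset="$2"
  local tau_a tau_r
  for tau_a in 0.4 0.6; do
    for tau_r in 0.2 0.4; do
      run_or_print torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 \
        --master_port "$((RANDOM + 20000))" run.py \
        --model_name "$model" --dataset_name "$dataset" \
        --batch_size 128 --method copt --tau_a "$tau_a" --tau_r "$tau_r"
      run_or_print python merge.py --model_name "$model" \
        --dataset_name "$dataset" --method copt
    done
  done
}

copt_sweep Qwen/Qwen3-8B math500
```

Per the list above, lower values of either threshold should correspond to more reasoning effort, so the (0.4, 0.2) pair is expected to be the most expensive of the four.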

🔧 Agentic Reasoning

# Evaluate on ZebraArena as an example
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run_agents.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --batch_size 16 --method copt --zebra_arena_space Small --zebra_arena_max_turns 16 
python merge.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --method copt --zebra_arena_space Small

# Reasoning effort control via tau_a and tau_r
torchrun --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_port $((RANDOM + 20000)) run_agents.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --batch_size 16 --method copt --zebra_arena_space Small --zebra_arena_max_turns 16 --tau_a 1.5 --tau_r 0
python merge.py --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena --method copt --zebra_arena_space Small

Download the dataset here or view the raw dataset here
Specify the dataset path with --zebra_arena_data_dir
We recommend using Qwen3.5 instead of Qwen3 for agentic tasks
Reasoning effort is controlled the same way as for general reasoning, via --tau_a and --tau_r
Increase --nproc_per_node to speed up evaluation by using multiple GPUs
Modify --model_name and --dataset_name to evaluate other models and datasets
See run.sh for more examples
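
As a sketch of how the notes above combine, the helper below prints a ZebraArena command that uses several GPUs and an explicit dataset directory. The function name `zebra_cmd`, the GPU count, and the path `/path/to/zebra_arena` are placeholders chosen for illustration; only the flags themselves come from this README, and passing --zebra_arena_data_dir to run_agents.py is an assumption about where that flag belongs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a multi-GPU ZebraArena evaluation command for review.
# $1 = number of GPUs on this node, $2 = local dataset directory (placeholder).
zebra_cmd() {
  local num_gpus="$1" data_dir="$2"
  echo torchrun --nproc_per_node "$num_gpus" --nnodes 1 --node_rank 0 \
    --master_port "$((RANDOM + 20000))" run_agents.py \
    --model_name Qwen/Qwen3.5-35B-A3B --dataset_name zebra_arena \
    --batch_size 16 --method copt --zebra_arena_space Small \
    --zebra_arena_max_turns 16 --zebra_arena_data_dir "$data_dir"
}

# Example values only; replace with your GPU count and dataset location.
zebra_cmd 4 /path/to/zebra_arena
```

Printing the command first makes it easy to check the flags before committing GPU time; paste the output into the shell to run it.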

💬 Acknowledgments

We thank the contributors of open-source projects SwiReasoning and ZebraArena.

✨ BibTeX

Please cite our paper if you find this codebase helpful.

@misc{shi2026coptcontrastiveonpolicythinking,
      title={CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning}, 
      author={Dachuan Shi and Hanlin Zhu and Xiangchi Yuan and Wanjia Zhao and Kejing Xia and Wen Xiao and Wenke Lee},
      year={2026},
      eprint={2605.20075},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20075}, 
}
