Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Xin Xu^1,2, Clive Bai¹, Kai Yang¹, Tianhao Chen^2,, Yangkun Chen¹, Weijie Liu¹,

Hao Chen², Yang Wang^3,, Saiyong Yang^1†, Can Yang^2†

¹Tencent HY
²The Hong Kong University of Science and Technology
³The University of Hong Kong

📝 News

[2025/03/17] We released a new compositional training set, Polaris-Composition-1323K, constructed from Polaris53K.
[2025/03/03] We released our evaluation and data generation codes.
[2026/02/12] We released the paper and datasets & models!

🧠 Overview

Composition-RL is a data-efficient RLVR approach that combats the growing number of “too-easy” prompts (pass-rate = 1) by automatically composing multiple verifiable problems into a single, harder yet still-verifiable prompt, and then performing RL on these compositional prompts to maintain informative training signals. Across 4B–30B models, Composition-RL consistently improves reasoning performance over RL on the original dataset, gains further boost with a curriculum that gradually increases composition depth, and enables stronger cross-domain RL (e.g., math + physics) than simply mixing or sequentially training on the two domains.

🚀 Quick Start

Installation

1. Environment setup

conda create -n crl python=3.10 -y
conda activate crl

2. Requirements installation

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5.post1
cd verl
pip install -e .
pip install vertexai
pip install sentence_transformers
pip install flash-attn==2.7.4.post1 --no-build-isolation

Data Generation

Our datasets are available at Composition-RL HF, and you can download them therein. If you want to generate your own dataset, please follow these steps:

First, deploy vllm instances across nodes:

cd deployment

# export your node ip lists, e.g., 192.168.1.101:8,192.168.1.102:8
export NODE_IP_LIST=xxx 
bash nodes_config.sh
bash deploy_vllm_cluster.sh # wait 2-3 mins for serving llms
bash generate_config.sh # It will generate a generated_vllm_config.yaml file

These steps will deploy vllm instances across your nodes. The generated configuration file deployment/generated_vllm_config.yaml contains resp_urls, resp_server_names, and resp_api_keys.

Copy these configurations to /apdcephfs_zwfy3/share_302867165/xxucaxu/codes/BGRPO/NCSP/paper/repo/testtest/NCSP/project/NCSP/config/v4_4step/stable_with_code_math4500_demo.yaml and adapt the dataset_path to your own datasets.

Then, run the following commands:

python3 main.py --config_path project/NCSP/config/v4_4step/your_config.yaml

The results will be saved to the output_folder parameter of your config file.

Finally, make the final dataset using the following command:

python3 project/NCSP/custom_functions/v4_4step/pre_and_post/make_final_dataset.py --path project/NCSP/result/v4_4step/your_path/step10.jsonl --save_path $your_save_path

Run Evaluation

First, download the evaluation datasets using

hf download xx18/Composition-RL-EVA --repo-type=dataset --local-dir ./data/eval

All test datasets are downloaded to the folder data/eval.

for evaluation, use:

bash ./scripts/ray_start.sh # start ray, use pssh to run on multiple nodes if necessary
bash scripts/eval/start_generate.sh

The resulting metrics and evaluation outputs will be saved under the folder your_model_path/eval_results.

🤗 Datasets and Models

We are open-sourcing our complete code and training details for the research community. All our checkpoints can be found in Composition-RL Collection.

Name	Link	Remarks
Evaluation Sets	Composition-RL-EVA	All evaluation datasets used in our paper, including AIME24, AIME25, BeyondAIME, IMO-AnswerBench, GPQA, and MMLU-Pro
MATH-Composition-199K	MATH-Composition-199K	Training set of our main experiments; Results in Table 1 and Section 4.2
MATH-Composition-Depth3	MATH-Composition-Depth3	Training set of our curriculum RL; Results in Table 1 and Section 4.3
Physics-MATH-Composition-141K	Physics-MATH-Composition-141K	Training set of our cross-domain experiments; Results in Table 2 and Section 4.4
Composition-RL-4B	Composition-RL-4B	Initial Model: Qwen3-4b-Base; Training set: MATH-Composition-199K; Results in Table 1
Composition-RL-8B	Composition-RL-8B	Initial Model: Qwen3-8b-Base; Training set: MATH-Composition-199K; Results in Table 1
Composition-RL-14B	Composition-RL-14B	Initial Model: Qwen3-14b-Base; Training set: MATH-Composition-199K; Results in Table 1
Composition-RL-30B-A3B	Composition-RL-30B-A3B	Initial Model: Qwen3-30b-a3b-Base; Training set: MATH-Composition-199K; Results in Table 1
Baseline-4B-MATH12K	Baseline-4B-MATH12K	Initial Model: Qwen3-4b-Base; Training set: MATH12K; Results in Table 1
Composition-RL-4B-Depth1_2	Composition-RL-4B-Depth1_2	Initial Model: Baseline-4B-MATH12K; Training set: MATH-Composition-199K; Results in Table 1
Composition-RL-4B-Depth1_2_3	Composition-RL-4B-Depth1_2_3	Initial Model: Composition-RL-4B-Depth1_2; Training set: MATH-Composition-Depth3; Results in Table 1
Composition-RL-4B-Physics_Math	Composition-RL-4B-Physics_Math	Initial Model: Qwen3-4b-Base; Training set: Physics-MATH-Composition-141K; Results in Table 2
Polaris-Composition-1323K	Polaris-Composition-1323K	Compositional prompts constructed from Polaris53K

📮Contact

If you have any questions or would like to discuss collaboration, please feel free to contact:
Xin Xu — xxuca@connect.ust.hk

Saiyong Yang — stevesyang@tencent.com

Can Yang - macyang@ust.hk

🤝 Acknowledgement

We are deeply grateful for the following GitHub repositories, as their valuable code and efforts have been incredibly helpful:

📚 Citation

If you find our work helpful for your research, please consider citing our paper:

@article{xu2026composition-rl,
  title={Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models},
  author={Xu, Xin and Bai, Clive and Yang, Kai and Chen, Tianhao and Chen, Yangkun and Liu, Weijie and Chen, Hao and Wang, Yang and Yang, Saiyong and Yang, Can},
  journal={arXiv preprint arXiv:2602.12036},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
NCSP		NCSP
deepscaler		deepscaler
deployment		deployment
figures		figures
rllm		rllm
scripts		scripts
verl		verl
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

📝 News

🧠 Overview

🚀 Quick Start

Installation

1. Environment setup

2. Requirements installation

Data Generation

Run Evaluation

🤗 Datasets and Models

📮Contact

🤝 Acknowledgement

📚 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

📝 News

🧠 Overview

🚀 Quick Start

Installation

1. Environment setup

2. Requirements installation

Data Generation

Run Evaluation

🤗 Datasets and Models

📮Contact

🤝 Acknowledgement

📚 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages