
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

arXiv: 2602.12036

¹Tencent HY · ²The Hong Kong University of Science and Technology · ³The University of Hong Kong

📝 News

  • [2025/03/17] We released a new compositional training set, Polaris-Composition-1323K, constructed from Polaris53K.
  • [2025/03/03] We released our evaluation and data generation codes.
  • [2026/02/12] We released the paper and datasets & models!

🧠 Overview

[Figure: Entropy Control]
Composition-RL is a data-efficient RLVR approach that combats the growing number of "too-easy" prompts (pass rate = 1) by automatically composing multiple verifiable problems into a single, harder, still-verifiable prompt, then running RL on these compositional prompts to maintain informative training signals. Across 4B–30B models, Composition-RL consistently improves reasoning performance over RL on the original dataset, gains a further boost from a curriculum that gradually increases composition depth, and enables stronger cross-domain RL (e.g., math + physics) than simply mixing the two domains or training on them sequentially.
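The composition idea above can be sketched in a few lines. This is illustrative only: the `compose` function, the prompt wording, and the chaining-by-reference scheme are assumptions, not the repo's actual API; the paper defines the real construction.

```python
# Minimal sketch of prompt composition (assumed interface, not the repo's code).
# Two verifiable (question, answer) pairs are merged into one harder prompt
# whose final answer depends on every sub-answer, so the composite prompt
# remains automatically verifiable against a single ground-truth answer.

def compose(problems):
    """Chain verifiable (question, answer) pairs into one compositional prompt.

    Each sub-question after the first references the previous result, so
    solving the composite requires solving every part in order.  The
    composite is verified against the last sub-answer; its composition
    depth is the number of chained problems.
    """
    parts, answers = [], []
    for i, (question, answer) in enumerate(problems, start=1):
        parts.append(f"Part {i}: {question}")
        answers.append(answer)
    prompt = "Solve all parts; report only the final answer.\n" + "\n".join(parts)
    return prompt, answers[-1], len(problems)

# Example: a depth-2 composite whose verifiable answer is 42.
prompt, final_answer, depth = compose([
    ("Compute 3 + 4. Call the result x.", 7),
    ("Using x from Part 1, compute x * 6.", 42),
])
```

A depth-3 curriculum step would simply pass three chained problems to the same function.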

🚀 Quick Start

Installation

1. Environment setup

conda create -n crl python=3.10 -y
conda activate crl

2. Requirements installation

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5.post1
cd verl
pip install -e .
pip install vertexai
pip install sentence_transformers
pip install flash-attn==2.7.4.post1 --no-build-isolation

Data Generation

Our datasets are available at Composition-RL HF, where you can download them directly. To generate your own dataset, follow these steps:

First, deploy vllm instances across nodes:

cd deployment

# export your node IP list, e.g., 192.168.1.101:8,192.168.1.102:8
export NODE_IP_LIST=xxx
bash nodes_config.sh
bash deploy_vllm_cluster.sh # wait 2-3 minutes for the LLM servers to come up
bash generate_config.sh     # writes generated_vllm_config.yaml
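The `NODE_IP_LIST` format above (`ip:num_gpus` entries separated by commas) can be parsed as in this sketch; the deployment scripts' actual parsing may differ, and `parse_node_ip_list` is an invented helper name:

```python
# Hypothetical parser for the NODE_IP_LIST format shown above
# ("ip:num_gpus" entries joined by commas). Illustration only.

def parse_node_ip_list(spec):
    nodes = []
    for entry in spec.split(","):
        ip, gpus = entry.rsplit(":", 1)  # split off the trailing GPU count
        nodes.append((ip, int(gpus)))
    return nodes

nodes = parse_node_ip_list("192.168.1.101:8,192.168.1.102:8")
```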

These steps will deploy vllm instances across your nodes. The generated configuration file deployment/generated_vllm_config.yaml contains resp_urls, resp_server_names, and resp_api_keys.
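One way a client might consume those fields is to cycle requests across the deployed endpoints. The field names (`resp_urls`, `resp_server_names`, `resp_api_keys`) come from the README; the dict literal below is an invented example, not real generated output:

```python
# Sketch: round-robin over the vLLM endpoints listed in the generated
# config. The config contents here are made up for illustration.
import itertools

config = {
    "resp_urls": ["http://192.168.1.101:8000/v1", "http://192.168.1.102:8000/v1"],
    "resp_server_names": ["node1", "node2"],
    "resp_api_keys": ["EMPTY", "EMPTY"],
}

# itertools.cycle caches the (url, key) pairs and repeats them forever,
# spreading successive requests evenly across nodes.
endpoints = itertools.cycle(zip(config["resp_urls"], config["resp_api_keys"]))
url, key = next(endpoints)  # endpoint for the next request
```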

Copy these configurations into your data-generation config (e.g., project/NCSP/config/v4_4step/stable_with_code_math4500_demo.yaml) and set dataset_path to your own datasets.

Then, run the following commands:

python3 main.py --config_path project/NCSP/config/v4_4step/your_config.yaml

The results will be saved to the output_folder parameter of your config file.

Finally, make the final dataset using the following command:

python3 project/NCSP/custom_functions/v4_4step/pre_and_post/make_final_dataset.py --path project/NCSP/result/v4_4step/your_path/step10.jsonl --save_path $your_save_path
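The final-dataset step above reads a step-10 JSONL file and writes the cleaned training set. The sketch below assumes a record schema with `prompt`, `answer`, and an optional `keep` flag; the repo's `make_final_dataset.py` defines the real schema and filtering:

```python
# Illustrative sketch of the step10.jsonl -> final-dataset conversion.
# The field names ("prompt", "answer", "keep") are assumptions.
import json

def make_final_dataset(path, save_path):
    """Copy kept records from a step JSONL file into the final dataset."""
    with open(path) as fin, open(save_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            if record.get("keep", True):  # drop records flagged for removal
                fout.write(json.dumps({"prompt": record["prompt"],
                                       "answer": record["answer"]}) + "\n")
```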

Run Evaluation

First, download the evaluation datasets using:

hf download xx18/Composition-RL-EVA --repo-type=dataset --local-dir ./data/eval

All test datasets are downloaded to the folder data/eval.

For evaluation, run:

bash ./scripts/ray_start.sh # start ray, use pssh to run on multiple nodes if necessary
bash scripts/eval/start_generate.sh

The resulting metrics and evaluation outputs will be saved under the folder your_model_path/eval_results.
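To summarize a run, the per-dataset metrics under eval_results can be averaged. The metric-dict layout below (an `accuracy` field per dataset) is an assumption; adapt it to the files the evaluation scripts actually emit:

```python
# Hedged sketch: average an assumed 'accuracy' field across the
# per-dataset metric dicts produced under eval_results.
import statistics

def mean_accuracy(metric_dicts):
    """Mean of the 'accuracy' field across per-dataset metric dicts."""
    return statistics.mean(m["accuracy"] for m in metric_dicts)

# Made-up numbers for illustration.
acc = mean_accuracy([{"accuracy": 0.5}, {"accuracy": 0.7}])
```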

🤗 Datasets and Models

We are open-sourcing our complete code and training details for the research community. All our checkpoints can be found in Composition-RL Collection.

| Name | Link | Remarks |
| --- | --- | --- |
| Evaluation Sets | Composition-RL-EVA | All evaluation datasets used in our paper, including AIME24, AIME25, BeyondAIME, IMO-AnswerBench, GPQA, and MMLU-Pro |
| MATH-Composition-199K | MATH-Composition-199K | Training set of our main experiments; results in Table 1 and Section 4.2 |
| MATH-Composition-Depth3 | MATH-Composition-Depth3 | Training set of our curriculum RL; results in Table 1 and Section 4.3 |
| Physics-MATH-Composition-141K | Physics-MATH-Composition-141K | Training set of our cross-domain experiments; results in Table 2 and Section 4.4 |
| Composition-RL-4B | Composition-RL-4B | Initial model: Qwen3-4B-Base; training set: MATH-Composition-199K; results in Table 1 |
| Composition-RL-8B | Composition-RL-8B | Initial model: Qwen3-8B-Base; training set: MATH-Composition-199K; results in Table 1 |
| Composition-RL-14B | Composition-RL-14B | Initial model: Qwen3-14B-Base; training set: MATH-Composition-199K; results in Table 1 |
| Composition-RL-30B-A3B | Composition-RL-30B-A3B | Initial model: Qwen3-30B-A3B-Base; training set: MATH-Composition-199K; results in Table 1 |
| Baseline-4B-MATH12K | Baseline-4B-MATH12K | Initial model: Qwen3-4B-Base; training set: MATH12K; results in Table 1 |
| Composition-RL-4B-Depth1_2 | Composition-RL-4B-Depth1_2 | Initial model: Baseline-4B-MATH12K; training set: MATH-Composition-199K; results in Table 1 |
| Composition-RL-4B-Depth1_2_3 | Composition-RL-4B-Depth1_2_3 | Initial model: Composition-RL-4B-Depth1_2; training set: MATH-Composition-Depth3; results in Table 1 |
| Composition-RL-4B-Physics_Math | Composition-RL-4B-Physics_Math | Initial model: Qwen3-4B-Base; training set: Physics-MATH-Composition-141K; results in Table 2 |
| Polaris-Composition-1323K | Polaris-Composition-1323K | Compositional prompts constructed from Polaris53K |

📮Contact

If you have any questions or would like to discuss collaboration, please feel free to contact:
Xin Xu — xxuca@connect.ust.hk

Saiyong Yang — stevesyang@tencent.com

Can Yang — macyang@ust.hk

🤝 Acknowledgement

We are deeply grateful to the following GitHub repositories; their valuable code and efforts have been incredibly helpful:

📚 Citation

If you find our work helpful for your research, please consider citing our paper:

@article{xu2026composition-rl,
  title={Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models},
  author={Xu, Xin and Bai, Clive and Yang, Kai and Chen, Tianhao and Chen, Yangkun and Liu, Weijie and Chen, Hao and Wang, Yang and Yang, Saiyong and Yang, Can},
  journal={arXiv preprint arXiv:2602.12036},
  year={2026}
}

About

Official repository for the paper "Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models"
