If you find DPH-RL useful, please ⭐ star this repo!
DPH-RL is an RL algorithm built on GRPO. It maintains policy diversity by pre-computing the f-divergence from reference-policy samples, which removes the need to keep a reference model loaded during training.
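Concretely, because the reference policy is frozen, its samples and log-probabilities can be computed once, offline. For instance, a forward-KL term can be estimated as an expectation over those pre-drawn samples (a sketch of the general idea; see the paper for the exact DPH-RL objective):

$$
D_{\mathrm{KL}}\big(\pi_{\text{ref}} \,\|\, \pi_\theta\big) = \mathbb{E}_{y \sim \pi_{\text{ref}}}\left[\log \frac{\pi_{\text{ref}}(y)}{\pi_\theta(y)}\right]
$$

Only $\log \pi_\theta(y)$ depends on the current policy, so the reference model never has to be held in memory during training.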
*Figure: Method overview.*
Math and SQL generation experiments show that DPH-RL both improves in-domain Pass@1 and Pass@k scores and effectively prevents catastrophic forgetting on out-of-domain tasks.
*Figures: In-Domain results; OOD and Keep results.*
This repo is forked from verl.
For SQL tasks, you need to load the databases. Please download the BIRD train and dev databases from BIRD bench and the Spider dev databases from Spider, then copy the databases to your local /cache/ directory, which should look like the sketch below.
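A sketch of the expected layout, assuming the default directory names from the BIRD and Spider releases (the exact folder names are an assumption; match them to what ./rl/scorer/sql.yaml expects):

```
/cache/
├── bird/
│   ├── train/train_databases/
│   └── dev/dev_databases/
└── spider/
    └── database/
```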

Our data is split into three parts:
- `data/sql/bird_train.parquet`
- `data/sql/bird_dev.parquet`
- `data/sql/spider_dev.parquet`
This is our extra collected training data, referred to as $\mathcal{D}_{\text{exp}}$ and $\mathcal{D}_{\text{pef}}$ in the paper. You can find it in the data/sql/llama3.1-8b/ directory.
You can install the dependencies by running:
```bash
pip install -r requirements.txt
```
We launch a server for evaluation with the following commands:
```bash
cd your_path
python rl/scorer/scorer_server_without_ray.py -c ./rl/scorer/sql.yaml
```
Evaluation is then performed via the server's IP address and port. For an implementation reference, please see the code in verl/utils/reward_score/sql.py.
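For orientation, the snippet below sketches how a reward function might query the scorer server over HTTP. The endpoint path and JSON field names are hypothetical, introduced only for illustration; the actual request schema is defined by scorer_server_without_ray.py and verl/utils/reward_score/sql.py.

```python
# Hypothetical sketch of querying the scorer server.
# NOTE: the endpoint path and JSON field names are assumptions;
# check verl/utils/reward_score/sql.py for the actual schema.
import requests

SCORER_URL = "http://127.0.0.1:8000/score"  # assumption: host/port from sql.yaml

def compute_reward(predicted_sql: str, gold_sql: str, db_id: str) -> float:
    payload = {"prediction": predicted_sql, "reference": gold_sql, "db_id": db_id}
    resp = requests.post(SCORER_URL, json=payload, timeout=60)
    resp.raise_for_status()
    # assumption: the server returns {"score": ...}, e.g. 1.0 for an execution match
    return float(resp.json()["score"])
```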
Notes: Please remember to set your SWANLAB_API_KEY or WANDB_API_KEY. All of our scripts assume multi-machine deployment; we mount the evaluation server on non-rank-0 machines to reduce cluster load. If you only have a single machine, please launch the evaluation server separately.
For the llama-sql experiment, you can skip this step and directly use the data in data/sql/llama3.1-8b/.
For DPH-RL, you need to split the full dataset into two sub-datasets by running a correctness check k times. This requires the following steps:
```bash
bash scripts/llama/offline_sampling.sh sampling
```
This script samples the training data eight times and saves the data by default to $PROJECT_DIR/data/sql/llama3.1-8b/generate_data/0.jsonl.
Next, run
```bash
python ./data/sql/process_data/exact_correct_id.py
```
which splits the data into data/sql/llama3.1-8b/train_wrong.parquet and data/sql/llama3.1-8b/train_correct.parquet.
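Conceptually, a prompt lands in the "correct" set if any of its k samples passes the correctness check, and in the "wrong" set otherwise. The sketch below assumes each line of 0.jsonl carries an `id` and a boolean `correct` field; these field names are assumptions, and the actual logic lives in exact_correct_id.py.

```python
# Sketch: split prompts by whether any of the k samples was correct.
# Field names ("id", "correct") are assumptions; see
# data/sql/process_data/exact_correct_id.py for the real logic.
import json
from collections import defaultdict

import pandas as pd

passed = defaultdict(bool)  # prompt id -> True if any sample was correct
rows = {}
with open("data/sql/llama3.1-8b/generate_data/0.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        passed[rec["id"]] |= bool(rec["correct"])
        rows[rec["id"]] = rec

correct = [rows[i] for i in rows if passed[i]]
wrong = [rows[i] for i in rows if not passed[i]]
pd.DataFrame(correct).to_parquet("data/sql/llama3.1-8b/train_correct.parquet")
pd.DataFrame(wrong).to_parquet("data/sql/llama3.1-8b/train_wrong.parquet")
```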
To facilitate further exploration, we pre-generate training-format data from data/sql/llama3.1-8b/train_correct.parquet: each data point is sampled once, and the correct samples are saved to $PROJECT_DIR/data/sql/llama3.1-8b/8b_llama3.1_all_right.pt. Please run:
```bash
bash scripts/llama/get_correct_data_tensor.sh
```
Now, you can load this .pt file directly for model training. The actor_rollout_ref.actor.generate_sft parameter determines whether to sample SFT data.
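If you want to inspect the tensorized data before training, you can load the .pt file directly. This is only a sketch: the structure of the saved object is an assumption, since it is produced by scripts/llama/get_correct_data_tensor.sh.

```python
# Sketch: inspect the pre-generated SFT data tensor.
# The structure of the saved object is an assumption; it is produced
# by scripts/llama/get_correct_data_tensor.sh.
import torch

data = torch.load(
    "data/sql/llama3.1-8b/8b_llama3.1_all_right.pt", map_location="cpu"
)
print(type(data))
if isinstance(data, dict):
    for key, value in data.items():
        shape = getattr(value, "shape", None)
        print(key, shape if shape is not None else type(value))
```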
You can run the different methods by directly calling the corresponding scripts in scripts/llama. The sft_loss_mode and sft_loss_coeff parameters select the specific method and adjust its hyperparameters.
The following table outlines the key settings (by default, `use_kl_loss=False`):

| sft_loss_mode | Description | Additional Settings | sft_loss_coeff |
|---|---|---|---|
| forward | Forward KL | None | 0.01~0.05 |
| js | The JS definition | `use_kl_loss=True` | 0.05~0.2 |
| js_low_var | The JS Generator | `use_kl_loss=False` | 0.05~0.2 |
| reverse_kl | Reverse KL | `data.sft_files=None`, `data.sft_pt=None` | 0.01~0.05 |
| alpha | Alpha divergence | `data.sft_files=None`, `data.sft_pt=None` | 0.01~0.05 |
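For example, to run the JS variant you might append `sft_loss_mode=js sft_loss_coeff=0.1 actor_rollout_ref.actor.use_kl_loss=True` to the corresponding script's arguments. The exact argument prefixes (e.g., whether these flags live under `actor_rollout_ref.actor.`) depend on the script, so treat this as a sketch and check scripts/llama for the ground truth.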
If this paper helps you, please cite it:
```bibtex
@misc{li2025choicedivergenceneglectedkey,
  title={The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward},
  author={Long Li and Jiaran Hao and Jason Klein Liu and Zhijian Zhou and Xiaoyu Tan and Wei Chu and Zhe Wang and Shirui Pan and Chao Qu and Yuan Qi},
  year={2025},
  eprint={2509.07430},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.07430},
}
```

