Rethinking On-Policy Distillation of Large Language Models:
Phenomenology, Mechanism, and Recipe

🎉News

[2026-04-15] We investigate the dynamics and mechanisms of on-policy distillation (OPD) of LLMs, and propose practical strategies to recover failing OPD. Check it out: Paper.

📖Overview

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states: a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
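To make the "shared token set" idea concrete, here is a minimal sketch of how one might measure the probability mass that falls on tokens appearing in both the student's and the teacher's Top-K at a single position. The function names, the toy distributions, and the exact definition of the metric are assumptions for illustration, not the paper's measurement code.

```python
def top_k_ids(probs, k):
    """Return the indices of the k largest probabilities as a set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def shared_mass(student_probs, teacher_probs, k):
    """Mass that each model places on tokens found in both Top-K sets.

    Assumption: the "shared token set" is the intersection of the two
    Top-K sets at one position; the paper may define it differently.
    """
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    student_mass = sum(student_probs[i] for i in shared)
    teacher_mass = sum(teacher_probs[i] for i in shared)
    return shared, student_mass, teacher_mass


# Toy next-token distributions over a 5-token vocabulary (made-up numbers).
student = [0.70, 0.20, 0.05, 0.03, 0.02]
teacher = [0.60, 0.02, 0.30, 0.05, 0.03]

shared, s_mass, t_mass = shared_mass(student, teacher, k=3)
print(sorted(shared), round(s_mass, 2), round(t_mass, 2))
```

In this toy case the two Top-3 sets overlap on two tokens, which carry about 0.75 of the student's mass and about 0.90 of the teacher's.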

✨Getting Started

Environment Setup

Our code is mainly based on verl (v0.7.0). To prepare the environment used for OPD and RL:

conda create -n verl python==3.12
conda activate verl
cd verl/
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install math-verify

We use LlamaFactory (v0.9.5) for SFT. To prepare the SFT environment:

conda create -n sft python==3.11
conda activate sft
cd LlamaFactory/
pip install -e .
pip install -r requirements/metrics.txt

Training

OPD

Use the following command to start on-policy distillation:

bash on_policy_distillation.sh

Key Parameters

Parameter	Default	Description
Distillation Method
`ADV_ESTIMATOR`	`token_reward_direct`	Advantage estimator; must stay at `token_reward_direct` when running OPD
`ACTOR_MODEL_PATH`	—	Path to the student (policy) model to be trained
`REWARD_MODEL_PATH`	—	Path to the teacher model that provides token-level reward signals
Generation Control
`N_RESPONSES`	`4`	Number of rollout responses generated per prompt
`MAX_PROMPT_LENGTH`	`1024`	Maximum token length for prompts
`MAX_RESP_LENGTH`	`7168`	Maximum token length for responses during training
`MAX_VAL_RESP_LENGTH`	`31744`	Maximum token length for responses during validation (set larger to ensure complete generation)
Top-K & Weighting Strategy
`LOG_PROB_TOP_K`	`16`	Number of Top-K tokens retained when computing token-level rewards; setting to `0` falls back to sampled-token OPD
`TOP_K_STRATEGY`	`only_stu`	Strategy for selecting the Top-K token set. Options: `only_stu` (select Top-K from the student, then query the teacher for corresponding log-probs), `only_tch` (select Top-K from the teacher), `intersection` (keep tokens appearing in both student and teacher Top-K), `union` (merge student and teacher Top-K), `union-intersection` (tokens in either Top-K but not both, i.e. symmetric difference)
`REWARD_WEIGHT_MODE`	`student_p`	Weighting scheme for token rewards. `student_p`: weighted by student probability; `teacher_p`: weighted by teacher probability; `none`: no weighting
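The `TOP_K_STRATEGY` and `REWARD_WEIGHT_MODE` options above map naturally onto set operations and per-token weights. The sketch below illustrates that mapping on plain Python sets and dicts; `select_token_set` and `token_weights` are hypothetical helpers, not functions from this repository, and treating `none` as a weight of 1.0 is an assumption.

```python
def select_token_set(student_top_k, teacher_top_k, strategy="only_stu"):
    """Pick the token ids used for the token-level reward at one position."""
    stu, tch = set(student_top_k), set(teacher_top_k)
    if strategy == "only_stu":
        return stu
    if strategy == "only_tch":
        return tch
    if strategy == "intersection":
        return stu & tch
    if strategy == "union":
        return stu | tch
    if strategy == "union-intersection":
        return stu ^ tch  # symmetric difference: in either Top-K, not both
    raise ValueError(f"unknown strategy: {strategy}")


def token_weights(token_ids, student_p, teacher_p, mode="student_p"):
    """Per-token weights; student_p / teacher_p map token id -> probability."""
    if mode == "student_p":
        return {t: student_p[t] for t in token_ids}
    if mode == "teacher_p":
        return {t: teacher_p[t] for t in token_ids}
    if mode == "none":
        return {t: 1.0 for t in token_ids}  # assumption: unweighted means 1.0
    raise ValueError(f"unknown mode: {mode}")


# Toy Top-3 sets for one position (token ids are made up).
student_top = [11, 42, 7]
teacher_top = [42, 7, 99]
for name in ("only_stu", "only_tch", "intersection", "union", "union-intersection"):
    print(name, sorted(select_token_set(student_top, teacher_top, name)))
```

Whether the real implementation renormalizes the weights over the selected set is not stated here, so the sketch leaves them as raw probabilities.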

Note

You can use scripts/infer/dedup_deepmath.py to deduplicate DeepMath against DAPO-Math-17K and avoid data overlap, as in the experiments in Section 5.2 of our paper.
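As a rough picture of what cross-dataset deduplication involves, the sketch below drops candidate prompts whose normalized text already appears in a reference set. It is not the logic of dedup_deepmath.py: the exact-match criterion, the normalization, and the function names are all assumptions for illustration.

```python
import re


def normalize(prompt):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def deduplicate(candidates, reference):
    """Keep candidate prompts that match neither the reference set nor an
    earlier candidate (exact match after normalization; an assumption)."""
    seen = {normalize(p) for p in reference}
    kept = []
    for prompt in candidates:
        key = normalize(prompt)
        if key not in seen:
            seen.add(key)
            kept.append(prompt)
    return kept


# Toy prompts standing in for the two datasets.
reference_prompts = ["What is 2 + 2?", "Solve x^2 = 9."]
candidate_prompts = ["what is  2 + 2?", "Find the derivative of x^3.", "Find the derivative of x^3."]
print(deduplicate(candidate_prompts, reference_prompts))
```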

SFT

Use scripts/infer/vllm_rollout.py to generate teacher responses (rollouts) that will later be used for student SFT.

Key Parameters

Parameter	Default	Description
`--input-parquet`	required	Path to the parquet file that provides prompts for teacher rollout
`--model-path`	required	Path to the teacher model checkpoint used to generate responses
`--gpu-ids`	`0,1,2,3,4,5,6,7`	Comma-separated GPU IDs used for multiprocessing rollout
`--enable-thinking`	`false`	Whether to enable the model's thinking template when formatting prompts
`--enable-rejection-sampling`	`true`	Whether to reject invalid outputs and retry generation
`--max-attempts-per-rollout`	`3`	Maximum number of retries for each rollout slot when rejection sampling is enabled

Below is an example command for generating teacher responses with Qwen3-4B (Non-thinking):

python scripts/infer/vllm_rollout.py \
  --input-parquet datasets/OpenThoughts3-1.2M-math.parquet \
  --model-path model/Qwen3-4B \
  --gpu-ids 0,1,2,3,4,5,6,7 \
  --enable-thinking false \
  --enable-rejection-sampling true \
  --max-attempts-per-rollout 3
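The interaction between `--enable-rejection-sampling` and `--max-attempts-per-rollout` can be pictured as a bounded retry loop. The sketch below is an illustration only: `rollout_with_rejection`, the validity check, counting the first try as an attempt, and returning `None` once attempts are exhausted are assumptions, not the behavior of vllm_rollout.py.

```python
def rollout_with_rejection(generate, is_valid, max_attempts=3,
                           enable_rejection_sampling=True):
    """Call generate() until is_valid(response) holds or attempts run out.

    Returns (response, attempts_used); response is None when every attempt
    was rejected. With rejection sampling disabled, the first response is
    returned unchecked.
    """
    attempts = max_attempts if enable_rejection_sampling else 1
    for attempt in range(1, attempts + 1):
        response = generate()
        if not enable_rejection_sampling or is_valid(response):
            return response, attempt
    return None, attempts


def has_boxed_answer(text):
    """Hypothetical validity check: the response contains a boxed answer."""
    return r"\boxed{" in text


# A fake generator that succeeds on its third call.
fake_outputs = iter(["no answer yet", "still thinking", r"The answer is \boxed{4}."])
result, used = rollout_with_rejection(lambda: next(fake_outputs), has_boxed_answer)
print(used, result)
```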

After the rollout finishes, use the generated teacher responses for student SFT. An example SFT training command is:

llamafactory-cli train LlamaFactory/examples/train_full/qwen3_base_full_sft.yaml

RL (GRPO)

We use GRPO as the RL algorithm. To enable RL, set ADV_ESTIMATOR=grpo and LOG_PROB_TOP_K=0. A reference script grpo.sh is provided.
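For readers new to GRPO, the core idea is that each response's advantage is its reward standardized against the other responses sampled for the same prompt. The sketch below shows that computation for one prompt with a group of four responses, matching the `N_RESPONSES` default; it uses the population standard deviation and a small `eps`, which are assumptions (implementations differ on both), and it is not verl's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every response gets the same reward, all advantages are zero,
# so that prompt contributes no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```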

Important

Non-thinking Models: When training a non-thinking model (e.g., Qwen3-1.7B (Non-thinking)) using OPD or RL, you must add +data.apply_chat_template_kwargs.enable_thinking=False to the training script.

Validation

We reuse the evaluation pipeline from JustRL.

Generation (Optional)

cd scripts/val/eval
python gen_vllm.py

Before running generation, set MODEL_NAMES in gen_vllm.py to the checkpoint(s) you want to evaluate, and set available_workers to an appropriate value.

Grading

cd scripts/val/eval
python grade.py

The grading script processes all JSONL files in the output directory and generates grading_results.json. If needed, you can enable the LLM-based verifier with:

python grade.py --enable_model_verifier

All experiments were conducted on 8 x NVIDIA A800 80GB GPUs.

📨Contact

Bingxiang He: hebx24@mails.tsinghua.edu.cn
Ning Ding: dingning@mail.tsinghua.edu.cn

🎈Citation

If you find this work helpful, please cite us:

@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}
