AwesomeOPD is an awesome list summarising open-source repositories and papers for training LLMs (and VLMs / agents / draft models) with On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD):
OPD = C1 + C2. C1: the student samples its own trajectories y ~ π_student(·|x) during training. C2: the teacher provides per-token or sequence-level supervision on those student samples. Methods that only partially satisfy these conditions are flagged in the Strictness notes of each section.
OPSD = the special case where the teacher is the same model, either conditioned on privileged context (verified trace / answer / "be concise" prefix / longer context) or taken from an earlier checkpoint.
Each entry is annotated along four design axes: teacher source (external · same model with privileged context · earlier checkpoint · multi-teacher · discriminator), supervision signal (logits / top-k / sequence reward / verbal score / discriminator / verifier / feature), rollout consumption (all / selected / truncated / replaced / as PG samples), and pipeline slot (cold-start / mid / RL-replacement / inside-RL / inter-stage / compression / continual-anchor).
⚠️ Built by reading paper PDFs, project pages, and source code with LLM coding agents; manually reviewed, but errors are possible. PRs welcome.
If you find this repository helpful for your research, please cite it via the "Cite this repository" button in the right sidebar of the GitHub page.
Last updated: 2026-06-08
Taxonomy:
Surveys, Foundations & Position Papers: meta-references and seed papers (GKD, MiniLLM, Thinking Machines blog, Tencent / THUNLP surveys)
White-Box: logit-based OPD on student rollouts with an external teacher
Catalogues 50+ methods; useful as a reference index.
| THUNLP Rethinking OPD | Reverse KL with progressive top-K alignment | Student | White-box | Token | Identifies two success conditions: compatible thinking patterns and genuinely new teacher capability. Recipe: off-policy cold-start plus teacher-aligned prompt selection. |
| Lightning OPD | Cached teacher log-probs over SFT rollouts (offline OPD) | Student (cached) | White-box | Token | Introduces "teacher consistency": the same teacher must be used for SFT and OPD, otherwise the gradient is biased. Eliminates the live teacher server. |
| OPSD Survey | (survey) | (survey) | (survey) | (survey) | Categorizes eight designs; useful as a reference index. |
Strictness notes (against the strict OPD definition: C1, the student samples its own trajectories during training, and C2, the teacher provides supervision on those samples)
Lightning OPD: ⚠️ partially satisfies C1. Teacher log-probs are pre-computed once over SFT rollouts and reused during training; the student doesn't actively sample during the OPD step. The authors explicitly call this "offline OPD". Listed under OPD because the data consists of rollouts generated by a past student, not by the teacher.
OPD with Larger External Teachers: White-Box
White-box methods use teacher logits / log-probabilities to supervise the student on student-generated rollouts. Each entry below has been verified to (a) train on student rollouts and (b) operate at the token level.
Methods that turned out to be RL-style on verification have been moved to OPD-RL Hybrids; off-policy / pure-loss-function / pretraining-side methods are excluded from this list.
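To make the white-box setting concrete, here is a minimal sketch of a per-token reverse-KL loss averaged over the positions of one student rollout. It assumes both models expose full-vocabulary logits at each position; the function names and the toy three-token vocabulary are illustrative and not taken from any listed method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the full vocabulary at one position."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(s * (math.log(s) - math.log(t)) for s, t in zip(p_s, p_t) if s > 0.0)

def rollout_loss(student_logits_per_pos, teacher_logits_per_pos):
    """Mean per-token reverse KL over the positions of one student rollout."""
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_per_pos, teacher_logits_per_pos)
    ]
    return sum(per_token) / len(per_token)

# Toy example: 2 positions of a student-sampled rollout, vocabulary of size 3.
# In a real setup both sets of logits would be computed on the same student prefix.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[1.5, 1.0, -1.0], [0.1, 0.2, 0.3]]
loss = rollout_loss(student, teacher)
```

The second position contributes zero because the two distributions agree there; only positions where the student disagrees with the teacher produce a training signal.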
Unifies KD as token-level reweighted likelihood; lightweight on-policy sampling preserves training efficiency.
OPD with Black-Box / Outcome-Based Teachers
When the teacher is API-only (no logits), OPD uses scalar rewards, verbal scores, preferences, or adversarial discriminators, all evaluated on student rollouts. Entries that turned out to use static teacher data only (Lion, SuperCorrect, DAIL, SODA) are excluded from this list.
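As a rough illustration of sequence-level supervision without logits, the sketch below turns teacher-assigned scores for a group of student rollouts into mean-centred, variance-normalised advantages. The GRPO-style normalisation is an assumption made for this example, not a description of any specific entry, and the scores are made-up numbers standing in for an API teacher's verbal ratings.

```python
import statistics

def scores_to_advantages(scores, eps=1e-6):
    """Normalise per-rollout scalar scores within one prompt's group.

    Rollouts scored above the group mean get a positive advantage,
    those below get a negative one.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]

# Hypothetical verbal scores (0-9) for four student rollouts of the same prompt.
verbal_scores = [7, 3, 9, 5]
advantages = scores_to_advantages(verbal_scores)
best_rollout = advantages.index(max(advantages))
```

Each advantage would then weight the log-likelihood of its whole rollout, which is why these methods are tagged "Sequence" in the granularity column.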
A trained discriminator distinguishes student outputs from teacher (e.g. GPT-5) responses; minimax game makes the discriminator co-evolve into an on-policy reward model. Qwen2.5-14B student becomes comparable to GPT-5-Chat on LMSYS.
| OVD | Verbal scores (0–9) on student trajectories | Student | Sequence | General | Replaces token-level logit matching with verbal scoring; +25.7% over baselines. |
Minimal edits keep samples proximal to the student distribution, targeting reasoning gains with knowledge retention.
| SODA | DPO: teacher responses as preferred vs. base student (q₀) zero-shot responses as rejected | Mixed | Sequence | Cross-architecture | "Semi on-policy" paradigm: captures student-specific inferior behaviors from a one-time static snapshot of q₀, eliminating the need for dynamic rollouts or adversarial training. 10× faster and 27% less peak GPU memory than GAD. Outperforms GAD on 15/16 benchmarks. |
Self-Distillation with Privileged Context: OPSD
Same model = teacher = student, but the teacher is conditioned on something the student doesn't see (verified trace, ground-truth answer, "be concise" prefix, longer context, document, …). The gap exists because of the conditioning, not the weights.
Several entries previously listed here turned out on verification to use static teacher data or a fixed self-rewritten dataset rather than student rollouts; those have been excluded. SPIN was reclassified to Iterative Self-Bootstrapping.
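The idea that the teacher-student gap comes from conditioning rather than weights can be sketched with a toy model. Below, one set of weights is queried twice per step: once on the plain prompt (student) and once with a privileged hint prepended (teacher), always on the student's own sampled prefix. The `model_probs` function, the `answer=4` hint format, and the tiny vocabulary are hypothetical stand-ins, not any paper's setup.

```python
import math
import random

VOCAB = ["4", "5", "<eos>"]

def model_probs(context, weights):
    """Toy 'model': base logits from weights, shifted if a hint is visible."""
    logits = list(weights)
    if "answer=4" in context:  # privileged context moves mass toward "4"
        logits[0] += 2.0
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def opsd_rollout_loss(prompt, privileged, weights, rng, max_len=4):
    """Sample a student rollout (C1) and score it with the same weights
    conditioned on privileged context (C2). Returns tokens and mean reverse KL."""
    student_ctx = prompt
    teacher_ctx = privileged + " " + prompt
    tokens, losses = [], []
    for _ in range(max_len):
        p_s = model_probs(student_ctx, weights)
        p_t = model_probs(teacher_ctx, weights)
        losses.append(sum(s * math.log(s / t) for s, t in zip(p_s, p_t)))
        tok = VOCAB[rng.choices(range(len(VOCAB)), weights=p_s)[0]]
        tokens.append(tok)
        if tok == "<eos>":
            break
        # Both views continue from the student's sampled token.
        student_ctx += " " + tok
        teacher_ctx += " " + tok
    return tokens, sum(losses) / len(losses)

shared_weights = [0.0, 0.0, 0.0]
tokens, loss_with_hint = opsd_rollout_loss("2+2=?", "answer=4", shared_weights, random.Random(0))
_, loss_without_hint = opsd_rollout_loss("2+2=?", "", shared_weights, random.Random(0))
```

With the hint removed the two views coincide and the loss is exactly zero, which mirrors the point that the supervision signal exists only because of the privileged conditioning.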
COPSD: crosslingual OPSD; the teacher sees an English problem translation + reference solution, while the student rolls out in a low-resource language (17 African languages)
CODE: OPSD on Knowledge Editing + Causal Editing
| Method | Privileged Context (Teacher) | Loss / Divergence | Granularity | Domain | Notes |
|---|---|---|---|---|---|
| OPSD (Self-Distilled Reasoner) | Verified reasoning trace | Per-token RKL with point-wise clipping | Token | Math reasoning | Same-model OPSD; matches GRPO with 1×8 rollouts and 1024 length vs. GRPO's 8×16 / 16k. The canonical OPSD paper. Built on TRL's GOLD trainer. |
| SDFT-Continual (idanshen) | Demo-conditioned same model | RKL on student rollouts vs. demo-conditioned teacher | Token | Continual learning | Self-distillation enables continual learning. |
| MTP Self-Distill | Multi-token prediction same model | RKL on student rollouts | Token | General | Multi-Token Prediction via Self-Distillation. Author-stated on-policy. |
| OPCD | In-context-knowledge-augmented same model | RKL on student rollouts | Token | Knowledge internalisation | Internalise context to be faithful even after context is removed. |
| GATES | Document-conditioned tutor (same model) | RKL gated by tutor consensus | Token (gated) | Document QA | Both tutor and student sample rollouts; on-policy student-rollout updates contribute "modest additional improvement" on top of off-policy distillation. Mixed. |
| EMPO² | Same model conditioned on self-generated memory "tips" (summaries of its own past rollouts) guiding tip-conditioned exploratory rollouts; the base policy acts without memory | Hybrid on-/off-policy: high-reward tip-conditioned trajectories are selectively self-distilled into the memory-free base policy, alongside GRPO | Token | Multi-turn LLM agents (ScienceWorld, WebShop) | Memory is training-only scaffolding for exploration: its benefit is internalized into the weights and removed at inference. +128.6% / +11.3% over GRPO on ScienceWorld / WebShop; strong OOD adaptability with no parameter updates. Implemented in Microsoft Agent Lightning. Same self-distillation-from-derived-context motivation as Skill-SD, which it predates. |
| CRISP / OPSDC | "Be concise" instruction prefix | Per-token RKL on student rollouts | Token | Reasoning compression | Compresses long-CoT without entropy collapse (unlike RL-with-length-penalty). |
| OEL (Online Experiential Learning) | Same model with interactive game environment | RKL on student rollouts | Token | Game / planning | Self-distillation on interactive trajectories. |
| Why-Does-SD-Degrade (analysis) | Varies (controlled study over rich-vs-thin context teachers) | RKL on student rollouts (analysis only) | Token | Math reasoning (in-domain + OOD) | Diagnostic paper, not a training method. Finds that conditioning the teacher on richer privileged context suppresses epistemic verbalization (uncertainty expression) in the student: fast in-domain gains but up to 40% OOD drops on Qwen3-8B / DeepSeek-Distill-Qwen-7B / Olmo3-7B-Instruct. Implication: privileged-context richness is a double-edged knob in OPSD. |
| Apple SSD | Same model w/ temperature/truncation sampling | Cross-entropy on its own samples | Sequence | Code generation | "Embarrassingly simple": sample, then SFT on those samples. Degenerate OPSD; "decoding-config" privilege. |
| Skill-SD | Trajectory-derived skill summaries condition the teacher only | GRPO + importance-weighted reverse-KL on student rollouts | Token | Multi-turn agentic tasks (AppWorld, Sokoban) | Extends OPSD to multi-turn agentic interaction with dynamic training-only skills. |
| SD-Zero | Reviser conditioned on generator's response + binary reward | Per-token KL: distill reviser → generator on student rollouts | Token | Math / code reasoning | Single model plays Generator + Reviser; reviser's reward-conditioned token distribution becomes dense supervision over the generator's response. Outperforms RFT, GRPO, SDFT under matched sample budget on Qwen3-4B-Instruct / Olmo-3-7B-Instruct (≥10% over base). Exhibits token-level self-localization and iterative self-evolution. |
| π-Play | Teacher conditioned on Question Construction Path (QCP), the reverse-direction artifact emitted by an examiner agent when it generates the task | Per-token reverse KL on student rollouts; teacher is an EMA copy of the student (τ=0.05) | | | Self-play loop between an examiner and a student/teacher pair, with no external data. The QCP is privileged because it captures the reverse solution path the examiner used to construct the task; the teacher sees it, the student doesn't. Converts sparse-reward self-play into dense per-token supervision; data-free π-Play surpasses fully supervised search agents and is 2–3× more sample-efficient than conventional self-play. |
| OPSDL | Short-context same model | Point-wise RKL | Token | Long-context | On-Policy Self-Distillation for Long-Context LMs. |
| MSD | English (high-resource) query translation + CoT instruction (privileged crosslingual context) | | | | Same model = teacher = student; ships on-policy MSD (student samples its own multilingual responses, strict C1+C2) and off-policy MSD (teacher-sampled) variants. Requires no translated response data, only multilingual queries. |
| COPSD | English problem translation + reference solution (privileged crosslingual context) | Per-token reverse KL with full-vocabulary logit distillation; teacher fixed during training; gradients flow only through student | Token | Multilingual math reasoning (PolyMath, AfriMGSM; 17 low-resource African languages) | Same model serves as both student (rollouts in low-resource language) and teacher (English-conditioned). Transfers a model's own high-resource reasoning behavior to low-resource languages; improves answer-format adherence and test-time scaling. |
| | | Polarity-gated objective (bounded local reverse-KL approximation) | Token | Mathematical reasoning | Formulates self-distillation as teacher hypothesis validation rather than unconditional imitation. Constructs a multi-teacher pool where each teacher is conditioned on different skill-mistake pairs. Uses verifier outcomes to determine teacher polarity, distilling helpful stances and reversing misleading ones. Achieves competitive performance with answer-conditioned OPSD under weaker assumptions. |
| CODE | Explicit causal narrative (cognitive scaffold detailing the transition logic) | Stage 1: SFT on filtered teacher trajectories. Stage 2: forward KL divergence on student-sampled rollouts | Token | Knowledge Editing (multi-hop reasoning) | Exposes Epistemic Dissonance inherent in static fact overwriting. Uses a two-stage framework (Causal Bootstrapping + Causal Internalization) to permanently engrave explicitly causal transition logic into parametric memory, achieving knowledge evolution rather than isolated fact injection. |
Strictness notes
Apple SSD: ⚠️ C2 is degenerate: no teacher KL signal; pure self-generated SFT (sample with temperature/truncation, then SFT on those samples). Closer to STaR-style self-bootstrapping than to OPSD. Kept because the "teacher" is the same model with a different decoding config (privileged-context-by-decoding).
GATES: ⚠️ The authors' own ablation says off-policy trajectory-level distillation drives the primary gains; on-policy student-rollout updates contribute only "modest additional improvement". Mixed; the OPSD leg is genuine but secondary.
SD-Zero: privileged context is non-textual: the reviser is conditioned on the generator's full response plus its scalar binary reward. C1 ✓ (generator samples its own rollouts), C2 ✓ (per-token KL from reviser). Compared head-to-head against GRPO in the paper but is not itself an RL method: there is no policy-gradient objective; the reward is a conditioning signal, not a return. Listed in OPSD rather than OPD-RL Hybrids for that reason.
Why-Does-SD-Degrade: analysis-only; no new training algorithm proposed. Listed here because the failure mode it characterises (epistemic-verbalization collapse under rich privileged context) is specific to OPSD.
π-Play: teacher and student have separate parameter sets; the teacher is an EMA-tracking copy of the student rather than literally the same weights. Listed in OPSD because (i) the paper itself frames the method as "Privileged Self-Distillation" and (ii) the gap between teacher and student exists because of QCP conditioning, not weight divergence (the EMA target collapses to the student in the limit). C1 ✓ (student samples its own rollouts), C2 ✓ (per-token RKL from QCP-conditioned teacher).
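The remark that an EMA teacher collapses to the student in the limit can be checked with a few lines. This is a minimal sketch assuming the common convention teacher ← (1 − tau) · teacher + tau · student; the source does not spell out the exact update rule, so the convention, the function names, and the single-parameter "model" are all illustrative.

```python
def ema_update(teacher, student, tau=0.05):
    """One EMA step: move each teacher parameter a fraction tau toward the student."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(teacher, student)]

def gap_after(steps, tau=0.05):
    """Track |teacher - student| for a frozen one-parameter student."""
    teacher, student = [0.0], [1.0]
    for _ in range(steps):
        teacher = ema_update(teacher, student, tau)
    return abs(teacher[0] - student[0])

gap_1 = gap_after(1)      # the gap shrinks by a factor (1 - tau) per step
gap_100 = gap_after(100)  # roughly 0.95 ** 100 of the initial gap
```

Under this convention a small tau keeps the teacher a slowly moving target, while any remaining teacher-student difference comes from the privileged conditioning rather than from the weights.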
Iterative Self-Bootstrapping
Same model is the teacher, but as a frozen earlier checkpoint, not a privileged-context view. The teacher snapshot is frozen for one round, the student trains, then the snapshot rolls forward. Listed separately because the supervision is typically sequence-level / preference, not per-token logit-distillation.
SPIN: ⚠️ C1 ✓ (student samples), but C2 fails the strict per-token logit form: supervision is sequence-level DPO preference against the previous frozen checkpoint. More accurately "iterative on-policy DPO" than per-token OPD. Kept because the "teacher = previous self" pattern is what people search for in OPD lists.
rStar / rStar-Math / rStar2-Agent: ⚠️ MCTS-filtered student samples + SFT; the "teacher signal" is a step-level PPM / discriminator score, not per-token logit KL. Iterative self-improvement, not classical OPD.
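For readers unfamiliar with the sequence-level preference form mentioned above, here is a minimal sketch of the standard DPO loss with the frozen previous checkpoint acting as the reference model. The inputs are whole-sequence log-probabilities; the numbers and the `beta` value are illustrative assumptions, and nothing here reproduces a specific paper's training recipe.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares how much more the
    policy prefers the chosen sequence than the frozen reference does."""
    margin = beta * (
        (policy_chosen_lp - ref_chosen_lp)
        - (policy_rejected_lp - ref_rejected_lp)
    )
    return math.log1p(math.exp(-margin))

# Start of a round: the policy equals the frozen snapshot, so the margin is 0.
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After some training: the policy raises the chosen and lowers the rejected sequence.
loss_later = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

At the start of each round the loss equals log 2, and the snapshot "rolling forward" resets it to that value for the next round. The supervision is one scalar per sequence pair, which is why it does not meet the strict per-token form of C2.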
OPD-RL Hybrids: Inside-RL OPD
Methods that fuse OPD with RLVR / GRPO / PPO / DPO. Teacher logits become a dense reward-shaping term or a trust-region anchor inside an RL objective, or BoN / preference signals are used as the imitation target.
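One simple way to picture teacher logits acting as dense reward shaping is to add a per-token teacher-minus-student log-probability term to a sequence-level outcome advantage. The sketch below is one possible form under that assumption; the function name, the `beta` weight, and the toy numbers are hypothetical, and individual methods in this section differ in how they combine the two signals.

```python
def shaped_advantages(outcome_advantage, student_logprobs, teacher_logprobs, beta=1.0):
    """Per-token advantage = sequence-level outcome advantage
    + beta * (log p_teacher(y_t) - log p_student(y_t)).

    The second term is a single-sample estimate of the negative reverse KL
    at each position of the student's own rollout.
    """
    return [
        outcome_advantage + beta * (t - s)
        for s, t in zip(student_logprobs, teacher_logprobs)
    ]

# Log-probs of the sampled tokens under each model (toy values).
student_lp = [-0.5, -2.0, -0.1]
teacher_lp = [-0.4, -0.5, -3.0]
adv = shaped_advantages(1.0, student_lp, teacher_lp, beta=0.5)
```

Here the rollout as a whole was rewarded, yet the third token ends up with a negative advantage because the teacher finds it very unlikely. That token-level credit assignment is what the outcome reward alone cannot provide.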
Treats Best-of-N as the target distribution; iterative anchor; Jeffreys divergence.
| Faster WIND | Win-rate dominance | Same model BoN | Student (iterative) | Sequence | Alignment | Game-theoretic acceleration of BOND. |
| AlignDistil | RLHF-equivalent KD | DPO-derived combination of DPO model + ref-model logits | Student | Token | Alignment | Re-frames DPO as policy distillation. |
| LUFFY | Mixed-Policy GRPO + policy shaping | Off-policy R1 traces inserted into student rollouts | Mixed | Token + sequence | Reasoning | "Learn to reason under off-policy guidance". On-policy student-roll + off-policy teacher-trace mix. |
| KETCHUP | k-step return REINFORCE on KD | External teacher | Student | Sequence | General | RL-based KD with k-step Bellman returns. |
| KDRL | Joint reverse-KL + GRPO rule-based reward | External teacher (Skywork-OR1) | Student | Token + outcome | Reasoning | Unified KD + RL objective. |
| SDPO | Custom self-distillation policy gradient | Feedback-conditioned same model = self-teacher | Student | Token | Code, tool-use, science | Sample student rollout, get tokenised feedback, re-evaluate under feedback-conditioned self-teacher, distill the corrected next-token distribution back into policy. |
| KEPO | Knowledge-enhanced PO | Knowledge-base teacher | Mixed | Sequence | Reasoning | Adds KB grounding to preference RL. |
| Open-AgentRL | GRPO-TCR | Multi-domain teachers | Student | Token | Reasoning / GUI / Coding | Includes process-reward modelling via SandboxFusion. |
| DDT | On-policy SFT theory | Theoretical | Student | Token | General | Distribution Discriminant Theory; foundations for on-policy SFT. |
| 𝒳-KD | AVRIL inverse-RL | Joint reward + policy distillation | Student | Token + sequence | General | IRL-flavoured experiential KD. |
| RLAD | PPO/GRPO ratio anchored to teacher–old-policy mixture | External teacher (Qwen3-32B) | Student | Token | Reasoning | Trust-region likelihood-ratio. |
| OpenClaw-RL | GRPO + OPD | Judge model extracts hindsight hints, teacher token-logprob gap = directional advantage | Mixed | Token | Terminal / GUI / SWE / Tool-call | Unifies binary RL and OPD in one trainer. |
| Probing-to-Refine | "Explanatory probes" force logical articulation; GRPO + dialogue-structure reward | Self-probe | Student | Sequence | Reasoning | Reinforcement Distillation via Explanatory Inversion. |
| HDPO | RL on most prompts; on "cliff" prompts generate privileged rollouts and self-distill | Same model w/ privilege | Student | Token | Reasoning | Privileged self-distillation as RL fallback. |
| Self-Distilled RLVR (RLSD) | RLVR direction + teacher evidence-ratio modulates magnitude | Same model + privileged answer | Student | Token + outcome | Reasoning | Combines self-distillation magnitudes with RLVR directions. |
| NPO / AutoNPO | Mixed-Policy GRPO | Verifier-filtered trajectories from a later checkpoint of the same training run | Mixed | Sequence | Reasoning (RLVR) | "Learn from your near-future self". Picks a teacher that is strong enough (higher Q than current policy) yet close enough (low V vs. external teachers like R1), maximising effective Q/V signal. AutoNPO adaptively schedules the interventions; preserves higher entropy than vanilla GRPO. |
| ROSD | SDPO-style self-distillation policy optimization | Reflection-conditioned same model = self-teacher | Student | Token (error-localized) | Reasoning (science, tool-use, math) | Uses a self-reflector to extract a corrective idea and localize the first erroneous span, then restricts self-distillation to the aligned error span to improve LLM reasoning. |
Strictness notes
LUFFY: ⚠️ Mixed-policy: half on-policy student rollouts (C1+C2 ✓) + half off-policy R1 traces inserted into GRPO (C1 ✗ on the off-policy half). Net is OPD-flavor with off-policy import.
NPO / AutoNPO: ⚠️ Same mixed-policy GRPO pattern as LUFFY, but the off-policy traces come from a near-future checkpoint of the same run instead of an external R1 teacher. The authors frame it as RLVR, not OPD; included here as an OPD variant because (a) the imported trajectories play the same "stronger-self teacher" role, and (b) the paper itself explicitly invites follow-up work to inject the near-future-self signal via on-policy distillation. Strict per-token logit KL (C2) is not the loss; supervision is verifier-filtered sequence-level trajectory mixing inside GRPO.
BOND, Faster WIND: ⚠️ Iterative self-bootstrapping; teacher = same model's BoN distribution. Loss is Jeffreys / win-rate-dominance at the sequence level, with no per-token logit supervision (C2 partially fails the strict form). More accurately "on-policy iterative alignment" than OPD.
KETCHUP: ⚠️ Sequence-level RL-based KD with k-step Bellman returns; the paper describes itself as "RL-based KD". Closer to RL with KD-anchor reward than per-token OPD.
𝒳-KD: ⚠️ Built on the AVRIL inverse-RL framework with joint reward modeling; closer to an IRL+OPD hybrid than pure OPD.
DDT: ⚠️ Theoretical foundations paper for "on-policy SFT" (Distribution Discriminant Theory); not a specific deployable algorithm. Kept for completeness.
KEPO, Open-AgentRL, Probing-to-Refine: ⚠️ C1 ✓ (on-policy student rollouts), but whether C2 takes the form of a per-token KL component, sequence-level reward shaping, or preference optimization cannot be fully resolved from the abstracts. Listed because the papers self-describe as OPD / on-policy distillation, but the exact form of C2 needs a full-paper reading.
Reasoning OPD (by application)
Genuine OPD work on math / code / long-CoT reasoning. Off-policy SFT-distill from R1, pure RL methods (Skywork-OR1, SimpleRL-Zoo, Time-R1), and analysis-only papers are excluded from this list because none of them had a student-rollout-with-teacher-supervision component.
The reasoning-OPD canon already lives across OPSD (siyan-zhao/OPSD, CRISP, SD-Zero), Iterative Self-Bootstrapping (rStar / rStar-Math), OPD-RL Hybrids (LUFFY, RLAD, KDRL, RLSD, HDPO), and White-Box (REOPOLD, Fast OPD, Entropy-Aware OPD, TIP, SCOPE, PACED). This section only lists items not already covered above.
| Method | Loss / Objective | Data | Teacher | Granularity | Base / Benchmark | Notes |
|---|---|---|---|---|---|---|
| OPD for AV Motion Planning | GPT-Driver framework + GKD on student-generated trajectories | Student | White-box (LLM teacher) | Token | Driving | 5× model-size reduction. |
| Rethinking OPD (THUNLP) | RKL with progressive top-K alignment + off-policy cold-start | Mixed | White-box (Qwen3-4B/1.7B teacher pairs) | Token | Math reasoning | Identifies teacher-novelty and thinking-pattern compatibility as success conditions. |
Multimodal OPD (VLM, Video, Audio, Image)
Strict OPD work in non-text modalities. Many "R1"/"GRPO" multimodal models that bear the brand are pure RL (no teacher-distillation loss) and are excluded.
Single- or multi-teacher; supports strong-to-weak and cross-modal
Outcome-guided margin calibration + offline/online data balancing
Student rollouts
Dual-perspective recipe: addresses (i) insufficient exploration of informative student states via data balancing and (ii) unreliable teacher supervision via margin calibration restoring order-consistency between correct/incorrect trajectories.
Agent & Embodied OPD (by application)
Genuine OPD where the student is an agent rolling out actions; teacher (or self) supervises those trajectories. Pure-RL agent works (WebRL, WebAgent-R1, InfiGUI-G1, GUI-R1) and off-policy SFT-on-teacher-trajectories (Nardien, AgentRefine, Chain-of-Agents, MapCoder-Lite, SAD, Structured-Web) are excluded.
Distillation of the draft model so it better mimics the verifier/target. The on-policy element here is over the drafter's own continuations as judged by the target. Listed separately because the goal is inference speedup, not student capability.
This section only lists drafters trained with the drafter's own rollouts. Off-policy drafter training (EAGLE-1/2, Medusa, Hydra, Kangaroo, ReDrafter, BiTA, SpecDec++, LayerSkip, FREE, AdaSPEC, POSS) and training-free system tricks (Ouroboros, Sequoia, TriForce, SwiftKV, SuffixDecoding) are excluded.
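Why distilling the drafter toward the target speeds up inference can be seen from the standard speculative-sampling acceptance rule: a drafted token x ~ q is accepted with probability min(1, p(x)/q(x)), so the expected per-token acceptance rate is the sum of min(p, q), which equals one minus the total variation distance. The sketch below checks that identity on toy distributions; the numbers are illustrative only.

```python
def acceptance_rate(draft_probs, target_probs):
    """Expected probability that one drafted token is accepted:
    sum over tokens of q(x) * min(1, p(x) / q(x)) = sum of min(p(x), q(x))."""
    return sum(min(q, p) for q, p in zip(draft_probs, target_probs))

def total_variation(a, b):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))

target = [0.5, 0.3, 0.2]
draft_before = [0.7, 0.2, 0.1]     # drafter before distillation (toy values)
draft_after = [0.55, 0.28, 0.17]   # drafter moved closer to the target

rate_before = acceptance_rate(draft_before, target)
rate_after = acceptance_rate(draft_after, target)
```

Because the acceptance rate depends on the drafter's distribution at the contexts the drafter itself visits, training on the drafter's own continuations targets exactly the states that matter at inference time.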
HASS, Falcon: ⚠️ Partial on-policy: multi-step draft trajectory / glancing distillation uses drafter samples for a subset of the training signal. Listed because the on-policy leg drives the gains.
| Method | Drafter type | On-/Off-policy | Loss | Notes |
|---|---|---|---|---|
| Online Speculative Decoding (OSD) | Draft-model | On-policy / online | Online KD on rejected tokens | The canonical online/on-policy SD paper. |
| DistillSpec | Draft-model | On-policy (draft samples) | Choice of FKL/RKL/JSD/TVD | The seminal "OPD for SD" paper. |
| HASS | Self-speculative | Partial on-policy (multi-step draft trajectory in training) | | |
| | KL-controlled (off-policy default; integrates into GRPO) | One of many | PyTorch | Yes (async distributed) | distill_loss_weight. |
| slime | Reverse KL token-level | OPD as additive penalty on any advantage estimator | PyTorch + Megatron | Yes (SGLang teacher mode) | Behind GLM-4.5/4.6/4.7. |
| NeMo-RL | FKL / RKL / mixed (configurable kl_type) | OPD documented | PyTorch | Yes (Ray + Megatron + vLLM) | Replaces archived NeMo-Aligner. |
| KDFlow | FKL / RKL / JSD / AKL + Skewed-KL/RKL variants | Yes (KD-first) | PyTorch | Yes (Ray + SGLang teacher + FSDP2 student) | Decoupled backends; transmits teacher hidden states (zero-copy) and recomputes logits on student to cut comm cost; 1.44–6.36× speedup over homogeneous-backend baselines. Native cross-tokenizer; VLM support (Qwen3-VL). Colocate mode shares GPUs via SGLang sleep/wakeup. |
Excluded (no native OPD support, or distillation pipeline is offline / fixed-corpus rather than student-rollout): axolotl, OpenRLHF, allenai/open-instruct, prime-rl, TextBrewer (pre-LLM era), open-r1 (off-policy SFT + GRPO), Modelopt, Tunix v0.1.6, DistillKit, easydistill.
Strictness notes: frameworks are judged by whether they ship a recipe that satisfies C1+C2
LLaMA-Factory: ⚠️ OPD only available via TRL integration; no native OPD trainer. Listed for users who already use LLaMA-Factory and want to know it can host OPD.
Industrial / Production Model Reports
Flagship model technical reports that publicly describe on-policy distillation in their post-training pipeline. Reports whose tech papers don't actually describe student-rollout distillation (Qwen2.5, Qwen2.5-Math, MiMo predecessor, DeepSeek-V3 / V3.2-Exp / R1, Phi-4, Hunyuan-Large / A13B, Kimi-K2 / K2.5, Yi-Lightning, DistilQwen) are excluded.
Multi-Teacher On-Policy Distillation (MOPD): "the student model samples from its own evolving distribution and receives token-level supervision from domain-specific teachers"
RL + OPD objective with step-level (not token-level) supervision. Reasoning results modeled as reasoning tree; alignment performed at tree nodes. Domain teachers: SWE, WebCoding, Terminal, WebSearch, General.
79.6% SWE-bench Verified (vs Claude Opus 4.6 80.8%); Tree Training eliminates redundant computation over tree-structured trajectories (6.2× speedup); MCLA stabilizes MoE RL; KwaiEnv sandbox infrastructure
| HY-Embodied-0.5 | Post-training (final stage after embodied post-training) | Forward KL (FKL) on-policy distillation from 32B large variant to MoT-2B edge variant. Student generates embodied reasoning trajectories, teacher provides FKL targets. | Mixture-of-Transformers (MoT) architecture with visual latent tokens; 22 embodied benchmarks; downstream VLA for real-world robot control (dual-arm Xtrainer); native-resolution ViT with discrete visual codebook (2k codebook, 8×8 patches) |
| DeepSeek-V4 | Post-training (replaces unified mixed-RL stage) | Multi-teacher OPD: domain specialists trained independently (SFT + GRPO per domain: math, code, agent, IF), then a unified student optimises reverse-KL against the specialist set on its own rollouts | Full-vocabulary KL (not token-level estimate) stabilises gradients when specialists disagree; first DeepSeek release where OPD replaces the RL consolidation stage from V3 / R1. V4-Pro 1.6T MoE; V4-Flash 284B. |
| Qwen3.5-Omni | Post-training (cross-modal capability transfer) | Cross-modal on-policy distillation: transferring text reasoning capabilities into audio-input reasoning. Thinker-Talker architecture with Hybrid Attention MoE for both modules. | |
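The DeepSeek-V4 entry contrasts a full-vocabulary KL with a token-level estimate. The sketch below illustrates that difference on toy distributions: the full-vocabulary reverse KL is an exact, non-negative number, while the sampled-token estimate log p_s(y) − log p_t(y) is unbiased but noisy and can be negative for a single token. Averaging the per-teacher KL over a specialist set is an assumption made here for illustration; the source does not specify how the specialists are combined.

```python
import math
import random

def full_vocab_rkl(p_s, p_t):
    """Exact KL(student || teacher) summed over the whole vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(p_s, p_t) if s > 0.0)

def sampled_rkl_estimate(p_s, p_t, rng):
    """Single-sample estimate: draw y ~ student, return log p_s(y) - log p_t(y)."""
    i = rng.choices(range(len(p_s)), weights=p_s)[0]
    return math.log(p_s[i] / p_t[i])

def multi_teacher_rkl(p_s, teachers):
    """Hypothetical combination rule: mean full-vocabulary RKL over teachers."""
    return sum(full_vocab_rkl(p_s, t) for t in teachers) / len(teachers)

student = [0.6, 0.3, 0.1]
math_teacher = [0.3, 0.3, 0.4]   # toy specialist distributions
code_teacher = [0.5, 0.4, 0.1]

exact = full_vocab_rkl(student, math_teacher)
rng = random.Random(0)
estimates = [sampled_rkl_estimate(student, math_teacher, rng) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
combined = multi_teacher_rkl(student, [math_teacher, code_teacher])
```

The sampled estimates average out to the exact value, but individual draws range from about −1.39 to +0.69 in this toy case. That per-token variance is the kind of noise a full-vocabulary objective avoids.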
GLM-4.5 / 4.6: ⚠️ Tech report describes "expert iteration + RL" without explicit OPD wording. Kept as predecessor of GLM-5, which does have explicit cross-stage OPD.
Curator's Picks: where to start
Opinionated reading order for someone starting an OPD project today.
| # | Why it's the pick | Resource |
|---|---|---|
| 1 | Clearest one-page explanation of why OPD beats both SFT and RL on token efficiency. | |
PRs are very welcome. When adding an entry, please attempt to fill in the technical details columns (loss / divergence, data source, teacher access, granularity). If you cannot determine these by reading the paper or repo, leave a ?, which is still useful.
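As a convenience for contributors, here is a small sketch that formats a new entry as a table row and fills any unknown field with ?. The column names follow the list in the paragraph above plus a method name and notes; the helper, the dictionary keys, and the example method name are hypothetical and not part of the repository.

```python
COLUMNS = [
    "Method",
    "Loss / Divergence",
    "Data source",
    "Teacher access",
    "Granularity",
    "Notes",
]

def format_entry(entry):
    """Render one entry as a pipe-delimited row, using '?' for missing fields."""
    cells = [str(entry.get(col) or "?") for col in COLUMNS]
    return "| " + " | ".join(cells) + " |"

# Example: a contributor knows only the method name and granularity.
row = format_entry({"Method": "MyOPD", "Granularity": "Token"})
```

Leaving explicit ? markers makes it easy for later reviewers to find the cells that still need verification against the paper or code.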