A principled and efficient post-training method for large language models
He Zhu¹*, Junyou Su¹*, Peng Lai², Ren Ma³, Wenjia Zhang¹, Linyi Yang², Guanhua Chen²†
¹Peking University ²Southern University of Science and Technology ³Shanghai Artificial Intelligence Laboratory
*Equal Contribution †Corresponding Author
Post-training large language models (LLMs) faces a trade-off:
- Supervised Fine-Tuning (SFT) is efficient but prone to memorization.
- Reinforcement Learning (RL) improves generalization but is costly and unstable.
- Dynamic Fine-Tuning (DFT) tightens the learning bound but suffers from distributional drift and instability.
👉 We propose Anchored Supervised Fine-Tuning (ASFT), a lightweight extension of DFT that adds a KL anchor to a reference policy: it preserves DFT's tighter bound while preventing drift, combining SFT's efficiency with RL-style generalization.
📄 2026-02-12: ASFT has been merged into LLaMA-Factory main (commit #10174).
Latest release is v0.9.4, so ASFT support is currently available on main and will be included in the next tagged release.
📄 2026-01-30: Accepted to ICLR 2026.
📄 2026-01-23: Added support for DeepSpeed and LoRA.
📄 2025-09-28: Released ASFT code and paper - Paper | Code
Theoretical foundation:
- Formalized in the Reward-Weighted Regression (RWR) framework.
- Proves DFT yields tighter RL lower bounds than SFT.
- Identifies drift as the key weakness of DFT.
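The RWR-style argument can be sketched as follows (notation and constants are ours, simplified; see the paper for the exact bounds). With behavior distribution $q$, reward $r \ge 0$, normalizer $Z = \mathbb{E}_{y\sim q}[r(y)]$, and tilted distribution $\tilde q(y) \propto r(y)\,q(y)$:

$$
\log \mathbb{E}_{y\sim q}\!\left[\frac{\pi_\theta(y\mid x)}{q(y\mid x)}\,r(y)\right]
= \log Z + \log \mathbb{E}_{y\sim \tilde q}\!\left[\frac{\pi_\theta(y\mid x)}{q(y\mid x)}\right]
\;\ge\; \log Z + \mathbb{E}_{y\sim \tilde q}\!\left[\log \frac{\pi_\theta(y\mid x)}{q(y\mid x)}\right]
$$

by Jensen's inequality. Maximizing the right-hand side over $\theta$ is reward-weighted log-likelihood, $\tfrac{1}{Z}\,\mathbb{E}_{q}[r(y)\log \pi_\theta(y\mid x)]$ up to a constant; plain SFT drops the reward weighting, which loosens the bound, while DFT-style weighting keeps it tighter.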
Anchored stability:
- Adds a KL divergence regularization term to prevent drift.
- Retains DFT’s advantages with controlled variance.
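As a concrete illustration, here is a minimal PyTorch sketch of an ASFT-style objective (our simplification, not the repository's exact implementation): DFT's probability-weighted token loss plus a per-token KL anchor to a frozen reference model.

```python
import torch
import torch.nn.functional as F

def asft_loss(logits, ref_logits, labels, kl_weight=0.03, ignore_index=-100):
    """Sketch of an ASFT-style loss: DFT weighting + KL anchor (illustrative only)."""
    # Shift for next-token prediction.
    logits, ref_logits, labels = logits[:, :-1], ref_logits[:, :-1], labels[:, 1:]
    mask = labels != ignore_index
    safe_labels = labels.clamp(min=0)

    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    # DFT: reweight each token's NLL by its (stop-gradient) model probability.
    dft = -(tok_logp.exp().detach() * tok_logp)

    # KL anchor: full-vocabulary KL(pi_theta || pi_ref) at each position,
    # computed against a frozen reference model's logits.
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum(-1)

    per_token = dft + kl_weight * kl
    return (per_token * mask).sum() / mask.sum()
```

The `kl_weight` knob corresponds to the `--kl_weight` flag used in the training commands below; the exact placement of the KL term (token-level vs. sequence-level) is an assumption of this sketch.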
Practical benefits:
- Minimal overhead compared to SFT.
- Outperforms SFT, DFT, and iw-SFT across reasoning, medical, and code benchmarks.
- Provides stronger initialization for RL methods like DAPO/GRPO.
Performance comparison of fine-tuning methods on medical and math benchmarks under different dataset scales. ASFT consistently outperforms other methods.
Training dynamics comparison showing ASFT maintains stability through KL anchoring while DFT exhibits severe distributional drift.
Comparison across different model architectures (LLaMA-2, Qwen2.5) demonstrating ASFT's consistent effectiveness across various model sizes and families.
Clone the repository and install dependencies:
```bash
git clone https://github.com/zhuchichi56/ASFT.git
cd ASFT
conda create -n asft python=3.10
conda activate asft
pip install -r requirements.txt
```

If you need flash-attn (prebuilt wheel):

```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Note: install a matching PyTorch build first (e.g., CUDA 12 + PyTorch 2.4) before installing flash-attn.
Train an ASFT model with default settings (v2 supports more model families and multi-GPU training):
```bash
python train_v2.py \
    --model_name_or_path models/your-model \
    --mode asft \
    --data_path data/your-data.jsonl \
    --kl_weight 0.03 \
    --num_train_epochs 3 \
    --learning_rate 2e-5
```

DeepSpeed is supported via `--deepspeed_config` (Zero-2/Zero-3). Config files are in `scripts/` (e.g., `scripts/ds_zero2_bf16.json`). In practice, DeepSpeed Zero tends to be less stable; native (non-DeepSpeed) runs are the most stable overall. For example:
```bash
deepspeed --num_gpus 8 train_v2.py \
    --deepspeed_config scripts/ds_zero2_bf16.json \
    --model_name_or_path models/your-model \
    --mode asft \
    --data_path data/your-data.jsonl \
    --kl_weight 0.03 \
    --num_train_epochs 3 \
    --learning_rate 2e-5
```

Note: for mixed precision (bf16/fp16), we recommend `kl_weight=0.03`. Larger KL weights amplify precision noise and can destabilize training, leading to degraded accuracy. Setting `0.03` keeps the KL anchor effective without over-regularizing under lower precision.
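For reference, a minimal Zero-2 bf16 DeepSpeed config of the kind referenced above (a generic sketch; the actual `scripts/ds_zero2_bf16.json` in the repo may differ):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_clipping": 1.0
}
```

The `"auto"` values are resolved by the Hugging Face Trainer/DeepSpeed integration from the training arguments.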
We recommend LoRA with rank=8, lora_alpha=16, lora_dropout=0.05, and learning rate 5e-4 for medical tasks. In our grid, lr=5e-4, r=8 performs best on average and is noticeably stronger than lr=2e-5 under the same rank.
Example (LoRA):
```bash
python train_v2.py \
    --model_name_or_path models/your-model \
    --mode asft \
    --data_path data/your-data.jsonl \
    --use_lora True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --learning_rate 5e-4
```

Partial grid (Med, LLaMA2-7B):
| lr | rank | MedQA | MMLU | MedMCQA | avg |
|---|---|---|---|---|---|
| 2e-5 | 8 | 0.3064 | 0.3366 | 0.3376 | 0.3269 |
| 5e-5 | 8 | 0.3299 | 0.3607 | 0.3464 | 0.3457 |
| 1e-4 | 8 | 0.3511 | 0.3896 | 0.3588 | 0.3665 |
| 2e-4 | 4 | 0.3692 | 0.4188 | 0.3717 | 0.3866 |
| 5e-4 | 8 | 0.3951 | 0.4147 | 0.3737 | 0.3945 |
Evaluate trained models on various benchmarks. See eval/README.md for detailed steps and required inputs.
```bash
# AlpacaEval-style evaluation
python eval/alpaca_eval_test.py

# Math evaluation
bash eval/math_evaluation/eval.sh

# Medical evaluation
python eval/medeval/vllm_medical_test.py
```

Large-scale training data is not stored in this repository. Please download it from the Hugging Face dataset repository: chichi56/ASFT

You can also download all dataset files with the provided script:

```bash
python download_data.py --output_dir data
```

If you find this work useful, please cite:
```bibtex
@misc{zhu2025anchoredsupervisedfinetuning,
  title={Anchored Supervised Fine-Tuning},
  author={He Zhu and Junyou Su and Peng Lai and Ren Ma and Wenjia Zhang and Linyi Yang and Guanhua Chen},
  year={2025},
  eprint={2509.23753},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.23753},
}
```

We welcome contributions! Please open issues or submit PRs for:
- Extending ASFT to new domains
- Improving training efficiency
- Adding evaluation benchmarks
Key features:
- SFT efficiency + RL generalization
- Tighter theoretical guarantees
- Stable across tasks and scales
- Plug-and-play for LLaMA, Qwen, and more


