Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen*, Jinan Xu
Some of our code follows MiniLLM and DistiLLM.
- deepspeed >= 0.14.0
- torch >= 2.0.1
- transformers >= 4.40.2
- peft >= 0.8.2
- rouge_score >= 0.1.2
The processed data used in our paper can be downloaded here.
You can download the corresponding model files (e.g., `pytorch_model.bin` or `model.safetensors`) of the LLMs used in this paper into `model_hub/*/*/`.
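If you prefer to fetch the model files programmatically, here is a hedged sketch using `huggingface_hub`; the tool, the example repo id, and the target directory below are assumptions, and any download method that places the files under `model_hub/*/*/` works just as well.

```python
# Hypothetical download sketch (not part of this repo).
# The repo id and model_hub layout are example assumptions -- adjust them
# to the actual models and directory structure you use.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen1.5-1.8B",             # one of the LLMs used in the paper
    local_dir="model_hub/qwen/qwen1.5-1.8B",  # example target under model_hub/*/*/
)
```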
To fine-tune the teacher models:
For Qwen1.5-1.8B (full fine-tuning), run:
bash scripts/gpt2/sft_teacher_qwen.sh
For LLaMA2-7B (LoRA), run:
bash scripts/tinyllama/sft_teacher_llama2.sh
For Mistral-7B (LoRA), run:
bash scripts/tinyllama/sft_teacher_mistral.sh
To fine-tune the student models as SFT baselines:
For GPT2-base (full fine-tuning), run:
bash scripts/gpt2/sft_gpt2_base.sh
For TinyLLaMA-1.1B (LoRA), run:
bash scripts/tinyllama/sft_tinyllama.sh
To run vanilla KD:
For GPT2-base, run:
bash scripts/gpt2/vanilla_kd_gpt2_base.sh
For TinyLLaMA-1.1B, run:
bash scripts/tinyllama/vanilla_kd_tinyllama.sh
You can change the distance function (e.g., KL divergence, reverse KL divergence, JS divergence, etc.) via `KD_OBJ` in the above scripts.
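For reference, below is a minimal, illustrative sketch of what these objectives compute when the teacher and student share a vocabulary. It is not the repo's implementation (the actual criteria live under `code/criterions/`), and the option strings are assumptions; check the scripts for the values `KD_OBJ` actually accepts.

```python
# Illustrative sketch of common KD objectives; NOT the repo's implementation.
# Option strings below are assumptions -- see the scripts for the real KD_OBJ values.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, kd_obj="forward_kl"):
    """Token-level distillation loss; logits have shape [batch, seq_len, vocab]."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    t_p = t_logp.exp()
    if kd_obj == "forward_kl":        # KL(teacher || student)
        loss = (t_p * (t_logp - s_logp)).sum(-1)
    elif kd_obj == "reverse_kl":      # KL(student || teacher)
        s_p = s_logp.exp()
        loss = (s_p * (s_logp - t_logp)).sum(-1)
    elif kd_obj == "js_divergence":   # Jensen-Shannon divergence
        s_p = s_logp.exp()
        log_m = (0.5 * (s_p + t_p)).clamp_min(1e-9).log()
        loss = 0.5 * (t_p * (t_logp - log_m)).sum(-1) \
             + 0.5 * (s_p * (s_logp - log_m)).sum(-1)
    else:
        raise ValueError(f"unknown KD_OBJ: {kd_obj}")
    return loss.mean()
```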
To run our DSKD method:
For GPT2-base, run:
bash scripts/gpt2/dskd_gpt2_base.sh
For TinyLLaMA-1.1B, run:
bash scripts/tinyllama/dskd_tinyllama.sh
Again, you can change the distance function via `KD_OBJ` in the above scripts.
Logits Alignment by Minimum Edit Distance (paper, original implementation)
The original implementation pre-computes the logit alignment before distillation, while we re-implement this method with a faster alignment computed on the fly during distillation in `code/criterions/min_edit_dis_kld.py`.
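To make the idea concrete, here is a simplified, hypothetical sketch of the alignment step at the core of MinED: a dynamic-programming minimum edit distance between two tokenizations of the same text, whose matched (kept or substituted) positions can then be used for logit distillation. This is not the code in `code/criterions/min_edit_dis_kld.py`.

```python
# Simplified sketch of minimum-edit-distance token alignment (the idea behind
# MinED); hypothetical helper, not the repo's implementation.

def min_edit_alignment(student_tokens, teacher_tokens):
    """Return (i, j) index pairs where the two sequences keep/substitute tokens."""
    n, m = len(student_tokens), len(teacher_tokens)
    # dp[i][j] = edit distance between first i student and first j teacher tokens
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if student_tokens[i - 1] == teacher_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # keep / substitute
    # Backtrack to collect aligned (student, teacher) positions.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        cost = 0 if student_tokens[i - 1] == teacher_tokens[j - 1] else 1
        if dp[i][j] == dp[i - 1][j - 1] + cost:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```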
For GPT2-base, run:
bash scripts/gpt2/minedit_gpt2_base.sh
For TinyLLaMA-1.1B, run:
bash scripts/tinyllama/minedit_tinyllama.sh
Universal Logit Distillation (paper, original implementation)
We also re-implement this method in `code/criterions/universal_logit_distillation.py`.
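As a rough sketch of the idea behind ULD (not our re-implementation, and assuming the sorted-probability comparison described in the ULD paper): the teacher and student distributions are sorted and compared element-wise, so their vocabularies need not match.

```python
# Illustrative sketch of the Universal Logit Distillation idea; NOT the code
# in code/criterions/universal_logit_distillation.py.
import torch.nn.functional as F

def uld_loss(student_logits, teacher_logits):
    """student_logits: [..., V_s], teacher_logits: [..., V_t]; V_s may differ from V_t."""
    s_probs = F.softmax(student_logits, dim=-1).sort(dim=-1, descending=True).values
    t_probs = F.softmax(teacher_logits, dim=-1).sort(dim=-1, descending=True).values
    # Zero-pad the smaller vocabulary so the sorted vectors have equal length.
    pad = s_probs.size(-1) - t_probs.size(-1)
    if pad > 0:
        t_probs = F.pad(t_probs, (0, pad))
    elif pad < 0:
        s_probs = F.pad(s_probs, (0, -pad))
    # L1 distance between sorted probabilities (a 1-Wasserstein-style gap).
    return (s_probs - t_probs).abs().sum(-1).mean()
```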
For GPT2-base, run:
bash scripts/gpt2/uld_gpt2_base.sh
For TinyLLaMA-1.1B, run:
bash scripts/tinyllama/uld_tinyllama.sh
To run DSKD with cross-model attention (CMA), the variant for teacher-student pairs with different vocabularies:
For GPT2-base, run:
bash scripts/gpt2/dskd_cma_gpt2_base.sh
For TinyLLaMA-1.1B, run:
bash scripts/tinyllama/dskd_cma_tinyllama.sh
To evaluate full fine-tuned models, run:
bash scripts/gpt2/run_eval.sh ${CKPT_PATH} ${EVAL_BATCH_SIZE}
To evaluate LoRA fine-tuned models, run:
bash scripts/tinyllama/run_eval_lora.sh ${LORA_ADAPTER_PATH} ${EVAL_BATCH_SIZE}
Please note that `MODEL_PATH` should be changed accordingly for different base models (TinyLLaMA, LLaMA2, Mistral).
If you find this repo useful for your research, please consider citing our paper:
@article{zhang2024dskd,
  title={Dual-Space Knowledge Distillation for Large Language Models},
  author={Songming Zhang and Xue Zhang and Zengkui Sun and Yufeng Chen and Jinan Xu},
  journal={arXiv preprint arXiv:2406.17328},
  year={2024},
}