Dual-Space Knowledge Distillation for Large Language Models

Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen*, Jinan Xu

Some of our code follows MiniLLM and Distillm.

Requirements

  • deepspeed >= 0.14.0
  • torch >= 2.0.1
  • transformers >= 4.40.2
  • peft >= 0.8.2
  • rouge_score >= 0.1.2

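As a minimal setup sketch, you can install these packages with pip (the exact torch build for your CUDA version may differ):

pip install "deepspeed>=0.14.0" "torch>=2.0.1" "transformers>=4.40.2" "peft>=0.8.2" "rouge_score>=0.1.2"
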
Data

The processed data used in our paper can be downloaded here.

Models

Please download the corresponding model files (e.g., pytorch_model.bin or model.safetensors) of the LLMs used in this paper into model_hub/*/*/.
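
For example, one way to fetch models into this layout is via huggingface-cli; the sub-directory names below are only an illustration and are not prescribed by the repo, so match them to the MODEL_PATH used in your scripts:

# directory names under model_hub/ are illustrative placeholders
huggingface-cli download gpt2 --local-dir model_hub/gpt2/gpt2-base
huggingface-cli download Qwen/Qwen1.5-1.8B --local-dir model_hub/qwen/Qwen1.5-1.8B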

Training

SFT for teacher models

For Qwen1.5-1.8B (full fine-tuning), run:

bash scripts/gpt2/sft_teacher_qwen.sh

For LLaMA2-7B (LoRA), run:

bash scripts/tinyllama/sft_teacher_llama2.sh

For Mistral-7B (LoRA), run:

bash scripts/tinyllama/sft_teacher_mistral.sh

SFT for student models

For GPT2-base (full fine-tuning), run:

bash scripts/gpt2/sft_gpt2_base.sh

For TinyLLaMA-1.1B (LoRA), run:

bash scripts/tinyllama/sft_tinyllama.sh

KD for the Same Vocabulary

Vanilla KD framework

For GPT2-base, run:

bash scripts/gpt2/vanilla_kd_gpt2_base.sh

For TinyLLaMA-1.1B, run:

bash scripts/tinyllama/vanilla_kd_tinyllama.sh

You can change the distance function (e.g., KL divergence, reverse KL divergence, JS divergence) via the KD_OBJ variable in the above scripts, as shown in the sketch below.
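
For instance, you would edit the KD_OBJ line in the script before launching it. The value name below is hypothetical; check code/criterions/ for the exact option names the repo accepts:

# in scripts/gpt2/vanilla_kd_gpt2_base.sh (value name is illustrative; see code/criterions/ for valid choices)
KD_OBJ="forward_kl"   # a reverse-KL or JS-style objective would be set the same way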

Dual-Space KD framework

For GPT2-base, run:

bash scripts/gpt2/dskd_gpt2_base.sh

For TinyLLaMA-1.1B, run:

bash scripts/tinyllama/dskd_tinyllama.sh

Likewise, you can change the distance function via KD_OBJ in the above scripts.

KD for Different Vocabularies

Logits Alignment by Minimum Edit Distance (paper, original implementation)

The original implementation pre-processes the logit alignment before distillation, whereas our re-implementation in code/criterions/min_edit_dis_kld.py computes the alignment on the fly during distillation, which is faster.

For GPT2-base, run:

bash scripts/gpt2/minedit_gpt2_base.sh

For TinyLLaMA-1.1B, run:

bash scripts/tinyllama/minedit_tinyllama.sh

Universal Logit Distillation (paper, original implementation)

We also re-implement this method in code/criterions/universal_logit_distillation.py.

For GPT2-base, run:

bash scripts/gpt2/uld_gpt2_base.sh

For TinyLLaMA-1.1B, run:

bash scripts/tinyllama/uld_tinyllama.sh

Our Dual-Space KD with Cross-Model Attention (CMA)

For GPT2-base, run:

bash scripts/gpt2/dskd_cma_gpt2_base.sh

For TinyLLaMA-1.1B, run:

bash scripts/tinyllama/dskd_cma_tinyllama.sh

Evaluation

Evaluate Full Fine-tuning Checkpoints

bash scripts/gpt2/run_eval.sh ${CKPT_PATH} ${EVAL_BATCH_SIZE}

Evaluate LoRA Fine-tuning Checkpoints

bash scripts/tinyllama/run_eval_lora.sh ${LORA_ADAPTER_PATH} ${EVAL_BATCH_SIZE}

Please note that you need to change MODEL_PATH in the script for different base models (TinyLLaMA, LLaMA2, Mistral).
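
As a concrete (hypothetical) invocation, assuming a full fine-tuned GPT2 checkpoint under outputs/ and an eval batch size of 32 (the paths below are placeholders, not paths produced by the repo):

# point the first argument at your own checkpoint directory
bash scripts/gpt2/run_eval.sh outputs/gpt2/dskd/checkpoint-best 32
# for LoRA checkpoints, pass the adapter directory and set MODEL_PATH to the matching base model
bash scripts/tinyllama/run_eval_lora.sh outputs/tinyllama/dskd_cma/adapter-best 32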

BibTeX

If you find this repo useful for your research, please consider citing our paper:

@article{zhang2024dskd,
      title={Dual-Space Knowledge Distillation for Large Language Models}, 
      author={Songming Zhang and Xue Zhang and Zengkui Sun and Yufeng Chen and Jinan Xu},
      year={2024},
      journal={arXiv preprint arXiv:2406.17328},
}
