This repository contains the code for reproducing the experiments in our paper "SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation".
Experimental Setup:
| Role | Model |
|---|---|
| Teacher | Qwen2.5-7B-Instruct |
| Teacher | Phi-4-mini-instruct |
| Student | Gemma-2-2B-IT |
| Student | Phi-4-mini-instruct |
Evaluation Benchmarks: GSM8K, MATH-500, MBPP, LiveCodeBench-v6
Our experiments are built on top of KDFlow. Please install the dependencies:
```bash
git clone https://github.com/sunjie279/SimCT-.git
cd SimCT-
pip install -e ./
pip install flash_attn==2.8.3 --no-build-isolation
```

Then set the following environment variables:
```bash
export MODEL_PATH="./models"        # Directory containing model weights
export DATA_PATH="./data"           # Directory for datasets
export OUTPUT_PATH="./output/ckpts" # Directory for checkpoints
```

Required model weights (download from HuggingFace):

- `$MODEL_PATH/Qwen2.5-7B-Instruct`
- `$MODEL_PATH/Phi-4-mini-instruct`
- `$MODEL_PATH/gemma-2-2b-it`
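Before launching any step it can help to sanity-check that the expected directory layout is in place. A minimal sketch (the helper `missing_models` is our convenience function, not part of the repo):

```python
import os

# Model directories the pipeline expects under $MODEL_PATH
# (names taken from the layout above).
EXPECTED_MODELS = [
    "Qwen2.5-7B-Instruct",
    "Phi-4-mini-instruct",
    "gemma-2-2b-it",
]

def missing_models(model_path: str) -> list[str]:
    """Return the expected model directories that are not present yet."""
    return [
        name for name in EXPECTED_MODELS
        if not os.path.isdir(os.path.join(model_path, name))
    ]

if __name__ == "__main__":
    missing = missing_models(os.environ.get("MODEL_PATH", "./models"))
    if missing:
        print("Missing model weights:", ", ".join(missing))
    else:
        print("All model weights found.")
```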
The full pipeline consists of 5 steps:
Data Preparation → Generate Teacher Responses → SFT Warmup → Distillation Training → Evaluation
We construct a 10K mixed math+code training dataset from multiple sources:
```bash
# Download raw datasets from HuggingFace
python scripts/data/download_datasets.py

# Prepare individual datasets
python scripts/data/prepare_gsm8k.py
python scripts/data/prepare_orca_math.py

# Build the 10K mixed dataset
python scripts/data/prepare_mixed_math_code.py
```

This produces:

- `$DATA_PATH/mixed_math_code_10k/`: training prompts
- `$DATA_PATH/mixed_math_code_10k_with_source/`: training prompts with source labels
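The core of the mixing step can be pictured as shuffling labeled examples from both pools into one fixed-size set. This is a sketch of the plausible logic only; `build_mixed_dataset` is our illustration, and the real script's sampling ratios and record format may differ:

```python
import random

def build_mixed_dataset(math_prompts, code_prompts, total=10_000, seed=0):
    """Interleave math and code prompts into one shuffled training set,
    tagging each example with its source so the *_with_source variant
    can be derived from the same pool."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    pool = (
        [{"prompt": p, "source": "math"} for p in math_prompts]
        + [{"prompt": p, "source": "code"} for p in code_prompts]
    )
    rng.shuffle(pool)
    return pool[:total]
```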
Generate 8 trajectories per question for each teacher model using SGLang:
```bash
# Qwen2.5-7B-Instruct
bash scripts/sft/run_generate_responses_10k_qwen.sh

# Phi-4-mini-instruct
bash scripts/sft/run_generate_responses_10k_phi4.sh
```

Each script starts an SGLang server (DP=8), generates responses (temperature=0.6, top_p=0.95), and saves them to `$DATA_PATH/teacher_responses_10k_<model_tag>/`.
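SGLang exposes an OpenAI-compatible chat endpoint, so the per-question sampling settings used above translate into a request body like the following. `generation_request` is our illustrative helper, not a function from the repo:

```python
def generation_request(prompt: str, model: str) -> dict:
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint,
    mirroring the decoding settings in the generation scripts:
    8 trajectories per question at temperature 0.6, top_p 0.95."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "n": 8,              # 8 trajectories per question
        "temperature": 0.6,
        "top_p": 0.95,
    }
```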
Before distillation, the student needs an SFT warmup to establish basic instruction-following capability.
Select the shortest correct response per question from teacher trajectories:
```bash
# From Qwen responses
bash scripts/sft/run_build_sft_10k_qwen.sh

# From Phi-4 responses
bash scripts/sft/run_build_sft_10k_phi4.sh
```

We use LLaMA-Factory for SFT:
```bash
# Gemma-2-2B-IT with Qwen teacher data
bash scripts/sft/run_gemma2_sft_warmup_10k_qwen.sh

# Gemma-2-2B-IT with Phi-4 teacher data
bash scripts/sft/run_gemma2_sft_warmup_10k_phi4.sh

# Phi-4-mini with Qwen teacher data
bash scripts/sft/run_phi4_sft_warmup_10k_qwen.sh
```

Run distillation training for each teacher→student pair:
```bash
# Qwen2.5-7B → Gemma-2-2B
bash scripts/ctopd/qwen25_gemma2_span_mix10k_lr5e-7.sh

# Qwen2.5-7B → Phi-4-mini
bash scripts/ctopd/qwen25_phi4_span_mix10k_lr5e-7.sh

# Phi-4-mini → Gemma-2-2B
bash scripts/ctopd/phi4_gemma2_span_mix10k_lr5e-7.sh
```

Evaluate on GSM8K, MATH-500, MBPP, and LiveCodeBench-v6:
```bash
# Prepare LiveCodeBench data (one-time)
python scripts/evaluation/prepare_lcb_data.py

# Run all evaluations
bash scripts/evaluation/eval_all_monitor.sh
```

Or evaluate a single checkpoint:
```bash
python scripts/evaluation/evaluation.py \
    --model_path $MODEL_PATH/your-checkpoint \
    --dataset gsm8k \
    --base_url http://127.0.0.1:30000 \
    --temperature 0.6 \
    --top_p 0.95 \
    --n 1 \
    --max_tokens 4096
```

Supported datasets: `gsm8k`, `math500`, `mbpp`, `live-code-bench-v6`
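To sweep all four benchmarks over one checkpoint, the invocation above can be generated programmatically and run via `subprocess`. A minimal sketch; `eval_command` is our convenience wrapper, not part of the repo:

```python
def eval_command(model_path: str, dataset: str, n: int = 1) -> list[str]:
    """Build the evaluation.py argument list for one checkpoint/dataset
    pair, mirroring the flags shown above."""
    supported = {"gsm8k", "math500", "mbpp", "live-code-bench-v6"}
    if dataset not in supported:
        raise ValueError(f"unknown dataset: {dataset}")
    return [
        "python", "scripts/evaluation/evaluation.py",
        "--model_path", model_path,
        "--dataset", dataset,
        "--base_url", "http://127.0.0.1:30000",
        "--temperature", "0.6",
        "--top_p", "0.95",
        "--n", str(n),
        "--max_tokens", "4096",
    ]
```

Each command list can then be passed to `subprocess.run(cmd, check=True)` in a loop over the supported datasets.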
This codebase is built on top of KDFlow, a user-friendly and efficient framework for LLM knowledge distillation. We sincerely thank the KDFlow team for their excellent work.
```bibtex
@article{simct2026,
  title={SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation},
  author={TODO},
  year={2026},
}
```