# Qwen Distillation Lab (System 1 + System 2)

**Colab-ready notebook** to distill **Qwen2.5-7B-Instruct** into two smaller students:
- **System 1** — instruction-following (7B → 0.5B) via black-box KD on DistilQwen_100k
- **System 2** — reasoning / chain-of-thought (1.5B) via SFT-style KD on OmniThought

Uses `distill_app.py` for data prep and EasyDistill for training. **Run cells in order** (or *Run all*); a **GPU runtime is recommended**. In Colab, open the notebook from GitHub so the repo is cloned; locally, open it from the repo root.

## 0️⃣ Runtime setup

Confirm GPU and Python. In Colab: **Runtime → Change runtime type → GPU** (e.g. T4) before running.

In [1]:
!nvidia-smi
!python --version

Sun Feb 22 11:30:09 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   51C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------
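
The same check can be done from Python, which is handy when a later cell should stop early on a CPU-only runtime. The sketch below is a minimal version that assumes only that `nvidia-smi` is on the `PATH` when a GPU is attached; the helper name `gpu_summary` is illustrative and not part of `distill_app.py`.

```python
import shutil
import subprocess

def gpu_summary():
    """Return one 'name, total memory' line per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()

summary = gpu_summary()
print(summary if summary else "No GPU detected: switch the Colab runtime to GPU before training.")
```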

## 1️⃣ Install dependencies

Core libs + **EasyDistill from source**. Clone EasyDistill so the `easydistill` CLI and templates are available.

In [2]:
!pip install -q "torch>=2.1.0" "transformers>=4.36.0" "datasets>=2.16.0" "accelerate>=0.25.0" "sentencepiece>=0.1.99"
!pip install -q "bitsandbytes>=0.43.0" tqdm
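
To confirm what actually got installed, something like the following can be run after the install cell. It is a small sketch using only the standard library; the helper name `installed_versions` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

core_libs = ["torch", "transformers", "datasets", "accelerate", "sentencepiece", "bitsandbytes", "tqdm"]
for name, ver in installed_versions(core_libs).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Version specifiers containing `>` should be quoted in `!pip` lines; unquoted, the shell treats `>` as an output redirect.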

In [3]:
import os
import subprocess
import sys
from pathlib import Path

if Path("/content").exists():
    EASYDIR = Path("/content/easydistill")
else:
    EASYDIR = Path.cwd() / "easydistill"

if not EASYDIR.exists():
    subprocess.run(["git", "clone", "https://github.com/modelscope/easydistill.git", str(EASYDIR)], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-e", str(EASYDIR)], check=True)
print("EasyDistill installed from", EASYDIR)

EasyDistill installed from /content/easydistill


### Clone repo (if needed)

If the "Project root and imports" cell raises `FileNotFoundError` (e.g. you opened this notebook from Drive or uploaded it), run the code cell below once, then re-run the "Project root and imports" cell. Set `GITHUB_REPO` to your fork if needed.

In [None]:
# Replace YOUR_USERNAME with your GitHub username
GITHUB_REPO = "https://github.com/YOUR_USERNAME/distilled-model-research.git"

import os
import subprocess
from pathlib import Path

REPO_DIR = Path("/content/distilled-model-research")

# Clone only in Colab, and only if the repo (with distill_app.py) is not already there.
if Path("/content").exists() and not (REPO_DIR / "distill_app.py").exists():
    subprocess.run(["git", "clone", GITHUB_REPO, str(REPO_DIR)], check=True)
    os.chdir(REPO_DIR)
    print("Cloned. Now re-run the 'Project root and imports' cell.")
else:
    print("Not in Colab or repo already present. If you still see FileNotFoundError, run locally from the repo root.")

### Project root and imports

Ensure we're in the repo root (where `distill_app.py` lives). In Colab (opened from GitHub), the repo is usually at `/content/distilled-model-research`.

In [4]:
import sys
from pathlib import Path

def _find_project_root():
    if Path("/content").exists():
        for d in Path("/content").iterdir():
            if d.is_dir() and (d / "distill_app.py").exists():
                return d
    for p in [Path.cwd()] + list(Path.cwd().parents):
        if (p / "distill_app.py").exists():
            return p
    return Path.cwd()

ROOT = _find_project_root()
if ROOT != Path.cwd():
    import os
    os.chdir(ROOT)
    print("Working directory:", ROOT)
sys.path.insert(0, str(ROOT))

if not (ROOT / "distill_app.py").exists():
    raise FileNotFoundError(
        "distill_app.py not found. Colab (Drive/upload): run the 'Clone repo (if needed)' cell, then re-run this cell. "
        "Local: run this notebook from the repo root (the folder that contains distill_app.py)."
    )

from distill_app import (
    load_teacher,
    prepare_system1_dataset,
    prepare_system2_dataset,
    distill_system1,
    distill_system2,
    compare_models,
    load_student,
    infer_student,
    format_prompt,
)
print("distill_app imported from", ROOT)

FileNotFoundError: distill_app.py not found. Colab: open this notebook from GitHub (File → Open notebook → GitHub) so the repo is cloned. Local: run from the repo root.

In [5]:
# Imports are in the "Project root and imports" cell above. Skip this cell.

---
## 2️⃣ System 1: Instruction-following distillation (7B → 0.5B)

Load a subset of **DistilQwen_100k**, optionally re-label with the teacher, then run black-box KD.

### Config

Increase `DATASET_SLICE_SYS1` (e.g. `train[:5000]`) or `NUM_EPOCHS_SYS1` for better quality.

In [6]:
TEACHER_MODEL_SYS1 = "Qwen/Qwen2.5-7B-Instruct"
STUDENT_MODEL_SYS1 = "Qwen/Qwen2.5-0.5B-Instruct"
DATASET_SLICE_SYS1 = "train[:1000]"
NUM_EPOCHS_SYS1 = 1

### Prepare Data & Label (Optional)

Loads DistilQwen_100k, maps to `{instruction, input, output}`. Set `RELABEL_WITH_TEACHER = True` to re-generate outputs with the teacher (slower, more VRAM).

In [7]:
RELABEL_WITH_TEACHER = False  # Set True to re-label with teacher (requires loading teacher first)

teacher_sys1 = None
tokenizer_sys1 = None
if RELABEL_WITH_TEACHER:
    teacher_sys1, tokenizer_sys1 = load_teacher(TEACHER_MODEL_SYS1)

prepare_system1_dataset(
    slice_str=DATASET_SLICE_SYS1,
    teacher_model=teacher_sys1,
    teacher_tokenizer=tokenizer_sys1,
    relabel_with_teacher=RELABEL_WITH_TEACHER,
    out_instructions="data/train_instructions.json",
    out_labeled="data/train_labeled.json",
)

NameError: name 'prepare_system1_dataset' is not defined
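
The mapping to `{instruction, input, output}` can be pictured with the sketch below. It is not the implementation of `prepare_system1_dataset`, which is not shown in this notebook. It assumes raw rows are dicts that carry `instruction` and `output` keys (and optionally `input`), and the helper names `to_kd_record` and `save_records` are illustrative.

```python
import json
import tempfile
from pathlib import Path

def to_kd_record(row):
    """Map a raw row to the {instruction, input, output} shape; missing fields become ''."""
    return {
        "instruction": str(row.get("instruction", "")).strip(),
        "input": str(row.get("input", "")).strip(),
        "output": str(row.get("output", "")).strip(),
    }

def save_records(rows, out_path):
    """Write mapped records to a JSON file, skipping rows with an empty instruction."""
    records = [to_kd_record(r) for r in rows]
    records = [r for r in records if r["instruction"]]
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(records)

# Demo on two made-up rows, written to a temporary directory.
demo_rows = [
    {"instruction": " What is KD? ", "output": "Knowledge distillation is ..."},
    {"instruction": "", "output": "dropped because the instruction is empty"},
]
with tempfile.TemporaryDirectory() as tmp:
    n_saved = save_records(demo_rows, Path(tmp) / "data" / "train_labeled.json")
print("records saved:", n_saved)
```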

### Run Distillation

Calls EasyDistill (black-box KD). Checkpoint will be written to `./distilled-qwen2.5-0.5b` (or the path you set in config).

In [8]:
config_sys1 = {
    "teacher_model": TEACHER_MODEL_SYS1,
    "student_model": STUDENT_MODEL_SYS1,
    "labeled_path": "data/train_labeled.json",
    "num_epochs": NUM_EPOCHS_SYS1,
    "out_dir": "./distilled-qwen2.5-0.5b",
    "config_path": "configs/kd_black_box_qwen_0_5b.json",
    "template_path": None,
}

# If EasyDistill was cloned, point to its template (configs/chat_template/chat_template_kd.jinja).
# Check the Colab location first, then a clone next to the current working directory.
_tpl_rel = Path("easydistill") / "configs" / "chat_template" / "chat_template_kd.jinja"
for _base in (Path("/content"), Path.cwd()):
    if (_base / _tpl_rel).exists():
        config_sys1["template_path"] = str(_base / _tpl_rel)
        break

path_sys1 = distill_system1(config_sys1)
if path_sys1:
    print("Final checkpoint path:", path_sys1)

NameError: name 'distill_system1' is not defined
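
Before a long run it can help to look at the config file that will be handed to EasyDistill. The sketch below only lists the top-level keys of a JSON file and makes no claim about EasyDistill's schema. It assumes `distill_system1` writes the file named in `config_path`, which this notebook does not confirm; `summarize_config` is an illustrative helper.

```python
import json
from pathlib import Path

def summarize_config(config_path):
    """Return the sorted top-level keys of a JSON config, or None if the file is missing."""
    path = Path(config_path)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        config = json.load(f)
    return sorted(config) if isinstance(config, dict) else None

keys = summarize_config("configs/kd_black_box_qwen_0_5b.json")
print(keys if keys is not None else "Config not generated yet.")
```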

### Test System 1 student

Load the distilled model and run a few prompts.

In [9]:
student_path_sys1 = "./distilled-qwen2.5-0.5b"
if Path(student_path_sys1).exists():
    student_sys1, tok_sys1 = load_student(student_path_sys1)
    for p in [
        "Explain what a large language model is to a high school student.",
        "Write a Python function to check if a number is prime.",
        "Give me three use cases of knowledge distillation in deep learning.",
    ]:
        print("=" * 72)
        print("Prompt:", p)
        print("Student (System 1):", infer_student(student_sys1, tok_sys1, p, mode="system1", max_new_tokens=256))
        print()
else:
    print("Checkpoint not found — run System 1 distillation first.")

Checkpoint not found — run System 1 distillation first.


---
## 3️⃣ System 2: Reasoning / CoT distillation (1.5B)

Train a CoT-capable student on OmniThought so it shows step-by-step reasoning.

### Config

In [10]:
STUDENT_MODEL_SYS2 = "Qwen/Qwen2.5-1.5B-Instruct"
DATASET_SLICE_SYS2 = "train[:2000]"
RV_MIN = 0.6
CD_MIN = 0.6
NUM_EPOCHS_SYS2 = 1

### Prepare CoT Data

Load OmniThought, filter by RV/CD if present, map to `{instruction, output=cot}` and save to `data/omnithought_cot.json`.

In [11]:
prepare_system2_dataset(
    slice_str=DATASET_SLICE_SYS2,
    rv_min=RV_MIN,
    cd_min=CD_MIN,
    out_cot="data/omnithought_cot.json",
)

NameError: name 'prepare_system2_dataset' is not defined
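
The "filter by RV/CD if present" rule can be read as: drop a row only when a score exists and falls below its threshold. The sketch below illustrates that reading. The field names `RV` and `CD` and the helper `passes_thresholds` are assumptions for illustration; the actual OmniThought schema and the logic in `prepare_system2_dataset` may differ.

```python
def passes_thresholds(row, rv_min=0.6, cd_min=0.6, rv_key="RV", cd_key="CD"):
    """Keep a row unless a score that is present falls below its minimum."""
    for key, minimum in ((rv_key, rv_min), (cd_key, cd_min)):
        value = row.get(key)
        if value is not None and float(value) < minimum:
            return False
    return True

rows = [
    {"question": "q1", "RV": 0.9, "CD": 0.7},  # kept: both scores pass
    {"question": "q2", "RV": 0.3, "CD": 0.8},  # dropped: RV is below the threshold
    {"question": "q3"},                         # kept: no scores present
]
kept = [r["question"] for r in rows if passes_thresholds(r)]
print(kept)  # ['q1', 'q3']
```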

### Run CoT Distillation

Calls EasyDistill (`kd_black_box_train_only`). Checkpoint: `./distilled-qwen2.5-1.5b-cot`.

In [12]:
config_sys2 = {
    "student_model": STUDENT_MODEL_SYS2,
    "cot_path": "data/omnithought_cot.json",
    "num_epochs": NUM_EPOCHS_SYS2,
    "out_dir": "./distilled-qwen2.5-1.5b-cot",
    "config_path": "configs/kd_cot_qwen_1_5b.json",
}
# Use EasyDistill template from clone (same as System 1)
_easydir = Path("/content/easydistill") if Path("/content").exists() else Path.cwd() / "easydistill"
_tpl = _easydir / "configs" / "chat_template" / "chat_template_kd.jinja"
if _tpl.exists():
    config_sys2["template_path"] = str(_tpl)

path_sys2 = distill_system2(config_sys2)
if path_sys2:
    print("Final checkpoint path:", path_sys2)

NameError: name 'distill_system2' is not defined

### Test System 2 (CoT) student

Prompts include CoT instruction; responses should show step-by-step reasoning.

In [13]:
student_path_sys2 = "./distilled-qwen2.5-1.5b-cot"
if Path(student_path_sys2).exists():
    student_sys2, tok_sys2 = load_student(student_path_sys2)
    for p in [
        "A train travels 120 km in 2 hours. If it continues at the same speed, how far will it travel in 5 hours?",
        "You flip a fair coin 3 times. What is the probability of getting exactly two heads?",
        "Explain the difference between overfitting and underfitting with an example.",
    ]:
        print("=" * 72)
        print("Prompt:", p)
        print("Student (System 2 CoT):", infer_student(student_sys2, tok_sys2, p, mode="system2", max_new_tokens=512))
        print()
else:
    print("Checkpoint not found — run System 2 distillation first.")

Checkpoint not found — run System 2 distillation first.
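
When reading the student's step-by-step output, it helps to have the expected answers for the two numeric prompts at hand. They can be computed exactly as follows:

```python
from fractions import Fraction
from math import comb

# Train: 120 km in 2 hours gives the speed; distance = speed * time.
speed_kmh = Fraction(120, 2)
distance_km = speed_kmh * 5

# Coin: P(exactly two heads in three fair flips) = C(3, 2) / 2**3.
p_two_heads = Fraction(comb(3, 2), 2 ** 3)

print(distance_km)  # 300
print(p_two_heads)  # 3/8
```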


---
## 4️⃣ Teacher vs student comparison

Side-by-side: **Prompt → Teacher | System 1 | System 2**. Missing checkpoints are skipped with a clear message.

In [14]:
COMPARE_PROMPTS = [
    "Explain what overfitting means.",
    "What is the time complexity of binary search?",
    "A train travels 120 km in 2 hours. What is its average speed?",
    "Explain the concept of knowledge distillation and why it is useful.",
]

compare_models(
    COMPARE_PROMPTS,
    teacher_path="Qwen/Qwen2.5-7B-Instruct",
    system1_path="./distilled-qwen2.5-0.5b",
    system2_path="./distilled-qwen2.5-1.5b-cot",
)

NameError: name 'compare_models' is not defined
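
The "missing checkpoints are skipped" behavior can be sketched as a pre-check on the local student directories. This is an illustration only, not the code inside `compare_models`; the helper name `split_available` is hypothetical. The teacher is left out because it is a Hub model ID rather than a local path.

```python
from pathlib import Path

def split_available(checkpoints):
    """Split {label: path} into (available dict, missing labels) by directory existence."""
    available, missing = {}, []
    for label, path in checkpoints.items():
        if Path(path).is_dir():
            available[label] = path
        else:
            missing.append(label)
    return available, missing

available, missing = split_available({
    "System 1": "./distilled-qwen2.5-0.5b",
    "System 2": "./distilled-qwen2.5-1.5b-cot",
})
for label in missing:
    print(f"[{label}] checkpoint not found, skipping.")
```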

### Optional: Teacher vs System 2 CoT (side-by-side)

Compare teacher and System 2 student on reasoning prompts with CoT-style prompting. Loads teacher and student if not already in memory.

In [15]:
COT_COMPARE = [
    "A bag has 3 red balls and 2 blue balls. If you draw two without replacement, what is the probability both are red?",
    "What is the derivative of x^3 + 2x^2 - 5x + 7? Explain the steps.",
]
if Path("./distilled-qwen2.5-1.5b-cot").exists():
    try:
        _t, _tt = load_teacher("Qwen/Qwen2.5-7B-Instruct")
        _s2, _ts2 = load_student("./distilled-qwen2.5-1.5b-cot")
        for p in COT_COMPARE:
            print("#" * 72)
            print("Prompt:", p)
            print("\n[Teacher CoT]", infer_student(_t, _tt, p, mode="system2", max_new_tokens=512)[:1000])
            print("\n[Student CoT]", infer_student(_s2, _ts2, p, mode="system2", max_new_tokens=512)[:1000])
            print()
    except Exception as e:
        print("Could not load models:", e)
else:
    print("Run System 2 distillation first.")

Run System 2 distillation first.
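
For judging the teacher and student outputs, the expected results for both prompts can be worked out directly. The polynomial helper below is a small illustration written for this check:

```python
from fractions import Fraction

# Both red without replacement: first draw 3/5, second draw 2/4.
p_both_red = Fraction(3, 5) * Fraction(2, 4)

def differentiate(coeffs):
    """Differentiate a polynomial given as coefficients from highest to lowest degree."""
    degree = len(coeffs) - 1
    return [c * (degree - i) for i, c in enumerate(coeffs[:-1])]

# x^3 + 2x^2 - 5x + 7  ->  3x^2 + 4x - 5
derivative = differentiate([1, 2, -5, 7])

print(p_both_red)  # 3/10
print(derivative)  # [3, 4, -5]
```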


---
## 5️⃣ Scaling up

Once a small run works:
- **Data:** Increase slices (e.g. `train[:10000]` for System 1, `train[:5000]` for System 2).
- **Epochs:** Set `NUM_EPOCHS_SYS1` / `NUM_EPOCHS_SYS2` to 2–3.
- **Batch size:** Increase in generated configs if VRAM allows.
- **System 2 student:** Use `Qwen/Qwen2.5-0.5B-Instruct` if VRAM is tight.
- **Relabeling:** Set `RELABEL_WITH_TEACHER = True` for teacher-generated labels (slower, often better).
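
To get a feel for how data size, epochs, and batch size interact, the number of optimizer steps can be estimated as below. This is a rough sketch: the batch size of 8 is a placeholder, not a value taken from the generated configs, and `grad_accum` is included only for setups that use gradient accumulation.

```python
from math import ceil

def optimizer_steps(num_examples, num_epochs, batch_size, grad_accum=1):
    """Approximate optimizer steps: ceil(examples / effective batch) per epoch, times epochs."""
    effective_batch = batch_size * grad_accum
    return ceil(num_examples / effective_batch) * num_epochs

print(optimizer_steps(1000, 1, 8))   # 125  (train[:1000], 1 epoch)
print(optimizer_steps(10000, 3, 8))  # 3750 (train[:10000], 3 epochs)
```

Going from the default run to the larger one multiplies the step count by 30, so expect training time to grow roughly in proportion.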

### Quick single-prompt inference

Uncomment and run after you have a checkpoint.

In [None]:
# model, tokenizer = load_student("./distilled-qwen2.5-0.5b")
# print(infer_student(model, tokenizer, "Explain what overfitting means.", mode="system1"))