# Qwen Distillation Lab (System 1 + System 2)

**Colab-ready notebook** to distill **Qwen2.5-7B-Instruct** into two smaller students:
- **System 1** — instruction-following (7B → 0.5B) via black-box KD on DistilQwen_100k
- **System 2** — reasoning / chain-of-thought (1.5B) via SFT-style KD on OmniThought

The notebook uses `distill_app.py` for data preparation and EasyDistill for training. **Run the cells in order** (or use *Run all*); a **GPU runtime is recommended**. In Colab, open the notebook from GitHub so the repo is cloned; locally, open it from the repo root.

## 0️⃣ Runtime setup

Confirm GPU and Python. In Colab: **Runtime → Change runtime type → GPU** (e.g. T4) before running.

In [1]:
!nvidia-smi
!python --version

/bin/bash: line 1: nvidia-smi: command not found
Python 3.12.12
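
The output above shows `nvidia-smi` missing, which usually means the runtime has no GPU attached. A minimal sketch of a fail-fast check, using only the standard library (the helper name `has_nvidia_gpu` is hypothetical and not part of `distill_app.py`):

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one device."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


if not has_nvidia_gpu():
    print("No NVIDIA GPU detected; switch the runtime to GPU before training.")
```

Running this before the install cells avoids discovering a CPU-only runtime after the downloads have finished.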


## 1️⃣ Install dependencies

Core libs + **EasyDistill from source**. Clone EasyDistill so the `easydistill` CLI and templates are available.

In [2]:
%pip install -q "torch>=2.1.0" "transformers>=4.36.0" "datasets>=2.16.0" "accelerate>=0.25.0" "sentencepiece>=0.1.99"
%pip install -q "bitsandbytes>=0.43.0" tqdm

In [3]:
import os
import subprocess
import sys
from pathlib import Path

if Path("/content").exists():
    EASYDIR = Path("/content/easydistill")
else:
    EASYDIR = Path.cwd() / "easydistill"

if not EASYDIR.exists():
    subprocess.run(["git", "clone", "https://github.com/modelscope/easydistill.git", str(EASYDIR)], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-e", str(EASYDIR)], check=True)
print("EasyDistill installed from", EASYDIR)

EasyDistill installed from /content/easydistill


### Clone repo (if needed)

If the "Project root and imports" cell raises `FileNotFoundError` (e.g. you opened this notebook from Drive or uploaded it), run the code cell below once, then re-run the "Project root and imports" cell. Set `GITHUB_REPO` to your fork if needed.

In [3]:
# Set GITHUB_REPO to your fork's URL if needed
GITHUB_REPO = "https://github.com/zacharias1219/distilled-model-research.git"

import subprocess
from pathlib import Path
if Path("/content").exists() and not (Path("/content/distilled-model-research") / "distill_app.py").exists():
    subprocess.run(["git", "clone", GITHUB_REPO, "/content/distilled-model-research"], check=True)
    import os
    os.chdir("/content/distilled-model-research")
    print("Cloned. Now re-run the 'Project root and imports' cell above.")
else:
    print("Not in Colab or repo already present. If you still see FileNotFoundError, run locally from the repo root.")

Cloned. Now re-run the 'Project root and imports' cell above.


### HF token (optional)

**Colab (open from GitHub):** Add `HF_TOKEN` in Colab Secrets (key icon in the left sidebar) so Hugging Face uses it for auth and higher rate limits. Run this cell once.

**Local:** If you have a `.env` in the repo root with `HF_TOKEN=...`, it is loaded when you import `distill_app` below; no need to do anything here.

In [None]:
import os
try:
    from google.colab import userdata
    os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
    print("HF_TOKEN set from Colab secrets.")
except Exception:
    pass  # Local or no secret: .env will be used when distill_app is imported

### Project root and imports

Make sure the working directory is the repo root (where `distill_app.py` lives). When the notebook is opened in Colab from GitHub, the repo is usually at `/content/distilled-model-research`.

In [4]:
import sys
from pathlib import Path

def _find_project_root():
    if Path("/content").exists():
        for d in Path("/content").iterdir():
            if d.is_dir() and (d / "distill_app.py").exists():
                return d
    for p in [Path.cwd()] + list(Path.cwd().parents):
        if (p / "distill_app.py").exists():
            return p
    return Path.cwd()

ROOT = _find_project_root()
if ROOT != Path.cwd():
    import os
    os.chdir(ROOT)
    print("Working directory:", ROOT)
sys.path.insert(0, str(ROOT))

if not (ROOT / "distill_app.py").exists():
    raise FileNotFoundError(
        "distill_app.py not found. Colab (Drive/upload): run the 'Clone repo (if needed)' cell, then re-run this cell. "
        "Local: run this notebook from the repo root (the folder that contains distill_app.py)."
    )

from distill_app import (
    load_teacher,
    prepare_system1_dataset,
    prepare_system2_dataset,
    distill_system1,
    distill_system2,
    compare_models,
    load_student,
    infer_student,
    format_prompt,
)
print("distill_app imported from", ROOT)

distill_app imported from /content/distilled-model-research


---
## 2️⃣ System 1: Instruction-following distillation (7B → 0.5B)

Load a subset of **DistilQwen_100k**, optionally re-label with the teacher, then run black-box KD.

### Config

Increase `DATASET_SLICE_SYS1` (e.g. `train[:5000]`) or `NUM_EPOCHS_SYS1` for better quality.

In [5]:
TEACHER_MODEL_SYS1 = "Qwen/Qwen2.5-7B-Instruct"
STUDENT_MODEL_SYS1 = "Qwen/Qwen2.5-0.5B-Instruct"
DATASET_SLICE_SYS1 = "train[:1000]"
NUM_EPOCHS_SYS1 = 1

### Prepare Data & Label (Optional)

Loads DistilQwen_100k, maps to `{instruction, input, output}`. Set `RELABEL_WITH_TEACHER = True` to re-generate outputs with the teacher (slower, more VRAM).

**Note:** HF Hub may show warnings about `HF_TOKEN` / unauthenticated requests. You can ignore them; downloads still work. For higher rate limits, add `HF_TOKEN` in Colab secrets (key icon in the sidebar) and run `from huggingface_hub import login; login()`.

In [6]:
RELABEL_WITH_TEACHER = False  # Set True to re-label with teacher (requires loading teacher first)

teacher_sys1 = None
tokenizer_sys1 = None
if RELABEL_WITH_TEACHER:
    teacher_sys1, tokenizer_sys1 = load_teacher(TEACHER_MODEL_SYS1)

prepare_system1_dataset(
    slice_str=DATASET_SLICE_SYS1,
    teacher_model=teacher_sys1,
    teacher_tokenizer=tokenizer_sys1,
    relabel_with_teacher=RELABEL_WITH_TEACHER,
    out_instructions="data/train_instructions.json",
    out_labeled="data/train_labeled.json",
)

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/124M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Saved System 1 instructions to data/train_instructions.json
Saved System 1 labeled data to data/train_labeled.json
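
To make the `{instruction, input, output}` target format concrete, here is a minimal sketch of such a mapping. The source field names (`instruction`, `prompt`, `output`, `response`) and the helper `to_sft_record` are assumptions for illustration; the actual logic lives in `prepare_system1_dataset` and may differ.

```python
import json


def to_sft_record(row: dict) -> dict:
    """Map one raw row to the {instruction, input, output} schema.

    The source keys tried here are hypothetical, not confirmed dataset fields.
    """
    instruction = row.get("instruction") or row.get("prompt") or ""
    output = row.get("output") or row.get("response") or ""
    return {
        "instruction": str(instruction).strip(),
        "input": str(row.get("input") or "").strip(),
        "output": str(output).strip(),
    }


rows = [
    {"instruction": " Define KD. ",
     "output": "Knowledge distillation trains a small model on a larger model's outputs."},
]
records = [to_sft_record(r) for r in rows]
print(json.dumps(records, indent=2))
```

Missing fields become empty strings, so every record carries all three keys.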


### Run Distillation

Calls EasyDistill (black-box KD). Checkpoint will be written to `./distilled-qwen2.5-0.5b` (or the path you set in config).

In [7]:
config_sys1 = {
    "teacher_model": TEACHER_MODEL_SYS1,
    "student_model": STUDENT_MODEL_SYS1,
    "labeled_path": "data/train_labeled.json",
    "num_epochs": NUM_EPOCHS_SYS1,
    "out_dir": "./distilled-qwen2.5-0.5b",
    "config_path": "configs/kd_black_box_qwen_0_5b.json",
    "template_path": None,
}

# If EasyDistill was cloned, point to its template (configs/chat_template/chat_template_kd.jinja)
_tpl_rel = Path("configs") / "chat_template" / "chat_template_kd.jinja"
for _base in (Path("/content/easydistill"), Path.cwd() / "easydistill"):
    if (_base / _tpl_rel).exists():
        config_sys1["template_path"] = str(_base / _tpl_rel)
        break

path_sys1 = distill_system1(config_sys1)
if path_sys1:
    print("Final checkpoint path:", path_sys1)

config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Wrote System 1 KD config to configs/kd_black_box_qwen_0_5b.json
Running: easydistill --config /content/distilled-model-research/configs/kd_black_box_qwen_0_5b.json


FileNotFoundError: [Errno 2] No such file or directory: 'easydistill'
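
The `FileNotFoundError` above means the `easydistill` executable was not found on `PATH` when the subprocess was launched. A small preflight check can surface this before the distillation call is made. This is a sketch using only the standard library; `require_cli` is a hypothetical helper, not part of `distill_app.py` or EasyDistill.

```python
import shutil


def require_cli(name: str) -> str:
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' not found on PATH; re-run the install cell and check "
            "that the package actually provides this command."
        )
    return path


try:
    print("Using", require_cli("easydistill"))
except RuntimeError as err:
    print("Preflight failed:", err)
```

If the check fails right after a successful install, restarting the runtime and re-running the install cell is a reasonable first step.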

### Test System 1 student

Load the distilled model and run a few prompts.

In [10]:
student_path_sys1 = "./distilled-qwen2.5-0.5b"
if Path(student_path_sys1).exists():
    student_sys1, tok_sys1 = load_student(student_path_sys1)
    for p in [
        "Explain what a large language model is to a high school student.",
        "Write a Python function to check if a number is prime.",
        "Give me three use cases of knowledge distillation in deep learning.",
    ]:
        print("=" * 72)
        print("Prompt:", p)
        print("Student (System 1):", infer_student(student_sys1, tok_sys1, p, mode="system1", max_new_tokens=256))
        print()
else:
    print("Checkpoint not found — run System 1 distillation first.")

Checkpoint not found — run System 1 distillation first.


---
## 3️⃣ System 2: Reasoning / CoT distillation (1.5B)

Train a CoT-capable student on OmniThought so it shows step-by-step reasoning.

### Config

In [8]:
STUDENT_MODEL_SYS2 = "Qwen/Qwen2.5-1.5B-Instruct"
DATASET_SLICE_SYS2 = "train[:2000]"
RV_MIN = 0.6
CD_MIN = 0.6
NUM_EPOCHS_SYS2 = 1

### Prepare CoT Data

Load OmniThought, filter by the RV (reasoning verbosity) and CD (cognitive difficulty) scores if present, map each row to `{instruction, output=cot}`, and save the result to `data/omnithought_cot.json`.
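
The filter-and-map step described above can be sketched as follows. The field names `question`, `cot`, `RV`, and `CD` are assumptions about the row layout, and `filter_and_map` is a hypothetical helper; `prepare_system2_dataset` may implement this differently. Rows without a score are kept, matching the "if present" wording.

```python
def filter_and_map(rows, rv_min, cd_min):
    """Drop rows whose RV/CD score is below the threshold, then map to {instruction, output}."""
    kept = []
    for row in rows:
        rv, cd = row.get("RV"), row.get("CD")
        if rv is not None and rv < rv_min:
            continue  # score present and below the RV threshold
        if cd is not None and cd < cd_min:
            continue  # score present and below the CD threshold
        kept.append({"instruction": row["question"], "output": row["cot"]})
    return kept


sample = [
    {"question": "2 + 2?", "cot": "2 + 2 = 4.", "RV": 0.9, "CD": 0.7},
    {"question": "1 + 1?", "cot": "1 + 1 = 2.", "RV": 0.2, "CD": 0.9},  # dropped: RV too low
    {"question": "3 + 3?", "cot": "3 + 3 = 6."},                        # kept: no scores
]
kept = filter_and_map(sample, rv_min=0.6, cd_min=0.6)
print(kept)
```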

In [9]:
prepare_system2_dataset(
    slice_str=DATASET_SLICE_SYS2,
    rv_min=RV_MIN,
    cd_min=CD_MIN,
    out_cot="data/omnithought_cot.json",
)

README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/135 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/135 [00:00<?, ?files/s]

Generating train split:   0%|          | 0/708009 [00:00<?, ? examples/s]

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=67398587788, num_examples=708009, shard_lengths=None, original_shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=64077440743, num_examples=709263, shard_lengths=[5253, 5253, 5253, 5253, 5253, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 10506, 10506, 10506, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 10506, 5253, 5253, 5253, 5253, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 10506, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 5253, 10506, 5253, 5253, 10506, 10506, 10506, 10506, 10506, 5253, 5253, 5361], original_shard_lengths=None, dataset_name='omni_thought')}]
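
The run above started fetching all 135 shards and then failed with `NonMatchingSplitsSizesError`, which `datasets` raises when the recorded split metadata disagrees with the downloaded data. One possible workaround, sketched below, is to stream the dataset and keep only the first rows. This avoids the full download and should also sidestep the split-size check. The dataset id `alibaba-pai/OmniThought` and both helper names are assumptions, and the change would have to be made inside `distill_app.py`. For the non-streaming path, passing `verification_mode="no_checks"` to `load_dataset` skips the check instead.

```python
from itertools import islice

DATASET_ID = "alibaba-pai/OmniThought"  # assumption: Hub id of the OmniThought dataset


def take_rows(rows, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(rows, n))


def load_head_streaming(n, dataset_id=DATASET_ID):
    """Stream the train split and keep only the first n rows (needs `datasets` and network)."""
    from datasets import load_dataset  # imported lazily so take_rows works without it

    stream = load_dataset(dataset_id, split="train", streaming=True)
    return take_rows(stream, n)


# Offline demonstration of the truncation logic only:
print(take_rows(iter(range(10)), 3))
```

With network access, `load_head_streaming(2000)` would play the role of the `train[:2000]` slice.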

### Run CoT Distillation

Calls EasyDistill (kd_black_box_train_only). Checkpoint: `./distilled-qwen2.5-1.5b-cot`.

In [10]:
config_sys2 = {
    "student_model": STUDENT_MODEL_SYS2,
    "cot_path": "data/omnithought_cot.json",
    "num_epochs": NUM_EPOCHS_SYS2,
    "out_dir": "./distilled-qwen2.5-1.5b-cot",
    "config_path": "configs/kd_cot_qwen_1_5b.json",
}
# Use EasyDistill template from clone (same as System 1)
if Path("/content").exists():
    _tpl = Path("/content/easydistill/configs/chat_template/chat_template_kd.jinja")
else:
    _tpl = Path.cwd() / "easydistill" / "configs" / "chat_template" / "chat_template_kd.jinja"
if _tpl.exists():
    config_sys2["template_path"] = str(_tpl)

path_sys2 = distill_system2(config_sys2)
if path_sys2:
    print("Final checkpoint path:", path_sys2)

config.json:   0%|          | 0.00/660 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

No CoT data found. Run Prepare CoT Data cell first.


### Test System 2 (CoT) student

Prompts include a CoT instruction; responses should show step-by-step reasoning.

In [13]:
student_path_sys2 = "./distilled-qwen2.5-1.5b-cot"
if Path(student_path_sys2).exists():
    student_sys2, tok_sys2 = load_student(student_path_sys2)
    for p in [
        "A train travels 120 km in 2 hours. If it continues at the same speed, how far will it travel in 5 hours?",
        "You flip a fair coin 3 times. What is the probability of getting exactly two heads?",
        "Explain the difference between overfitting and underfitting with an example.",
    ]:
        print("=" * 72)
        print("Prompt:", p)
        print("Student (System 2 CoT):", infer_student(student_sys2, tok_sys2, p, mode="system2", max_new_tokens=512))
        print()
else:
    print("Checkpoint not found — run System 2 distillation first.")

Checkpoint not found — run System 2 distillation first.


---
## 4️⃣ Teacher vs student comparison

Side-by-side: **Prompt → Teacher | System 1 | System 2**. Missing checkpoints are skipped with a clear message.

In [14]:
COMPARE_PROMPTS = [
    "Explain what overfitting means.",
    "What is the time complexity of binary search?",
    "A train travels 120 km in 2 hours. What is its average speed?",
    "Explain the concept of knowledge distillation and why it is useful.",
]

compare_models(
    COMPARE_PROMPTS,
    teacher_path="Qwen/Qwen2.5-7B-Instruct",
    system1_path="./distilled-qwen2.5-0.5b",
    system2_path="./distilled-qwen2.5-1.5b-cot",
)

NameError: name 'compare_models' is not defined
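
The `NameError` above suggests the "Project root and imports" cell was not run in this session, so re-run it before this cell. Separately, a quick look at which checkpoints exist tells you which columns to expect in the comparison. A sketch with a hypothetical helper `existing_checkpoints` (not part of `distill_app.py`):

```python
from pathlib import Path


def existing_checkpoints(paths: dict) -> dict:
    """Map each label to True if its checkpoint directory exists."""
    return {label: Path(p).is_dir() for label, p in paths.items()}


status = existing_checkpoints({
    "system1": "./distilled-qwen2.5-0.5b",
    "system2": "./distilled-qwen2.5-1.5b-cot",
})
for label, ok in status.items():
    print(f"{label}: {'found' if ok else 'missing'}")
```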

### Optional: Teacher vs System 2 CoT (side-by-side)

Compare teacher and System 2 student on reasoning prompts with CoT-style prompting. Loads teacher and student if not already in memory.

In [15]:
COT_COMPARE = [
    "A bag has 3 red balls and 2 blue balls. If you draw two without replacement, what is the probability both are red?",
    "What is the derivative of x^3 + 2x^2 - 5x + 7? Explain the steps.",
]
if Path("./distilled-qwen2.5-1.5b-cot").exists():
    try:
        _t, _tt = load_teacher("Qwen/Qwen2.5-7B-Instruct")
        _s2, _ts2 = load_student("./distilled-qwen2.5-1.5b-cot")
        for p in COT_COMPARE:
            print("#" * 72)
            print("Prompt:", p)
            print("\n[Teacher CoT]", infer_student(_t, _tt, p, mode="system2", max_new_tokens=512)[:1000])
            print("\n[Student CoT]", infer_student(_s2, _ts2, p, mode="system2", max_new_tokens=512)[:1000])
            print()
    except Exception as e:
        print("Could not load models:", e)
else:
    print("Run System 2 distillation first.")

Run System 2 distillation first.


---
## 5️⃣ Scaling up

Once a small run works:
- **Data:** Increase slices (e.g. `train[:10000]` System 1, `train[:5000]` System 2).
- **Epochs:** Set `NUM_EPOCHS_SYS1` / `NUM_EPOCHS_SYS2` to 2–3.
- **Batch size:** Increase in generated configs if VRAM allows.
- **System 2 student:** Use `Qwen/Qwen2.5-0.5B-Instruct` if VRAM is tight.
- **Relabeling:** Set `RELABEL_WITH_TEACHER = True` for teacher-generated labels (slower, often better).
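
To gauge how these knobs affect run length, the number of optimizer steps can be estimated from dataset size, batch size, and epochs. The batch size and gradient-accumulation values below are placeholders, not values read from the generated configs.

```python
import math


def optimizer_steps(num_examples, epochs, batch_size, grad_accum=1):
    """Estimate optimizer steps: ceil(examples / effective batch) per epoch, times epochs."""
    per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    return per_epoch * epochs


small = optimizer_steps(1000, epochs=1, batch_size=8)    # the default System 1 slice
large = optimizer_steps(10000, epochs=3, batch_size=8)   # scaled-up slice, 3 epochs
print(small, large)  # 125 3750
```

Going from the default slice to `train[:10000]` with three epochs multiplies the step count by 30 at the same batch size, so expect run time to grow roughly in proportion.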

### Quick single-prompt inference

Uncomment and run after you have a checkpoint.

In [None]:
# model, tokenizer = load_student("./distilled-qwen2.5-0.5b")
# print(infer_student(model, tokenizer, "Explain what overfitting means.", mode="system1"))