# PrimeIntellect SYNTHETIC-1 → Distillix → One Model (Colab-ready)

This notebook builds a single training dataset from the **PrimeIntellect SYNTHETIC‑1** collection (multiple datasets), then **optionally trains one Distillix model** on the unified data.

**What you get**
- `data/training/unified_primeintellect.jsonl` in `{prompt,response,source,metadata}` format
- (optional) a trained checkpoint under `artifacts/` via `scripts/train_codetether.py`
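
Each line of the unified file is one JSON object with those four keys. A minimal sketch of a per-line check follows. `check_record` and `REQUIRED_KEYS` are hypothetical names that do not come from the Distillix repo, and treating `prompt` and `response` as plain strings is an assumption based on how they are previewed later in this notebook.

```python
import json

# Keys named in the unified format; everything else here is illustrative.
REQUIRED_KEYS = ('prompt', 'response', 'source', 'metadata')


def check_record(line):
    """Parse one JSONL line and return the record, or raise ValueError."""
    obj = json.loads(line)  # json.JSONDecodeError is a subclass of ValueError
    missing = [k for k in REQUIRED_KEYS if k not in obj]
    if missing:
        raise ValueError(f'missing keys: {missing}')
    # Assumption: prompt and response are plain strings.
    for key in ('prompt', 'response'):
        if not isinstance(obj[key], str):
            raise ValueError(f'{key} must be a string')
    return obj


sample = json.dumps({
    'prompt': 'Reverse a string in Python.',
    'response': 's[::-1]',
    'source': 'example',
    'metadata': {},
})
record = check_record(sample)
print(sorted(record))
```

Running a check like this over the first few hundred lines is usually enough to catch a malformed export before training starts.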


## 0) Setup (clone repo + install dependencies)

Colab starts in `/content`. This cell:
1. Clones the Distillix repo (or reuses it if already present)
2. Installs minimal Python dependencies needed for data import + training

> If you already uploaded the repo to Colab, point `DISTILLIX_REPO_DIR` (or `REPO_DIR` in the cell) at it. The clone is skipped automatically when that directory exists.

In [None]:
import os, sys, subprocess
from pathlib import Path

print('Python:', sys.version.split()[0])
print('Executable:', sys.executable)

# ---- configure repo location ----
REPO_URL = os.environ.get('DISTILLIX_REPO_URL', 'https://github.com/rileyseaburg/distillix.git')
REPO_DIR = Path(os.environ.get('DISTILLIX_REPO_DIR', '/content/distillix')).resolve()

print('Repo URL:', REPO_URL)
print('Repo dir:', REPO_DIR)

if not REPO_DIR.exists():
    subprocess.run(['git', 'clone', '--depth', '1', REPO_URL, str(REPO_DIR)], check=True)
else:
    print('Repo already exists, skipping clone.')

os.chdir(REPO_DIR)
print('CWD:', Path.cwd())

# ---- install dependencies (minimal) ----
deps = [
    'datasets>=2.16.0',
    'transformers>=4.36.0',
    'tokenizers>=0.15.0',
    'sentencepiece>=0.1.99',
    'tqdm>=4.66.0',
    'pyyaml>=6.0.1',
    'pydantic>=2.5.0',
    'numpy>=1.24.0',
    'safetensors>=0.4.0',
    'einops>=0.7.0',
    # training utils
    'rich>=13.7.0',
    'typer>=0.9.0',
    'python-dotenv>=1.0.0',
    # torch is usually preinstalled in Colab; keep this optional
    # 'torch>=2.1.0',
]

subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', *deps], check=True)
print('Installed deps OK')

## 1) Smoke test the pipeline (tiny limits)

This is a quick sanity check that:
- Hugging Face datasets import works
- the per-dataset JSONLs are created
- they combine into a unified JSONL
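
To make the last step concrete, here is a rough sketch of what combining per-dataset JSONLs can look like. It is not the code in `scripts/pipeline_primeintellect_one_model.py`: the `combine_jsonl` helper, its hash-based deduplication key, and the seeded shuffle are assumptions chosen for illustration. The `dedup` and `shuffle` switches only mirror the spirit of the `--no-dedup` and `--no-shuffle` flags.

```python
import hashlib
import json
import random
import tempfile
from pathlib import Path


def combine_jsonl(inputs, out_path, dedup=True, shuffle=True, seed=0):
    """Concatenate JSONL files into one file and return the number of rows kept."""
    rows, seen = [], set()
    for path in inputs:
        with Path(path).open('r', encoding='utf-8') as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                obj = json.loads(line)
                if dedup:
                    # Assumption: two rows are duplicates when prompt and response match.
                    text = obj.get('prompt', '') + '\x00' + obj.get('response', '')
                    key = hashlib.sha256(text.encode('utf-8')).hexdigest()
                    if key in seen:
                        continue
                    seen.add(key)
                rows.append(obj)
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded so the order is reproducible
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open('w', encoding='utf-8') as fh:
        for obj in rows:
            fh.write(json.dumps(obj, ensure_ascii=False) + '\n')
    return len(rows)


# Tiny demo: two files that share one identical row.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / 'a.jsonl').write_text(
        json.dumps({'prompt': 'p1', 'response': 'r1'}) + '\n'
        + json.dumps({'prompt': 'p2', 'response': 'r2'}) + '\n',
        encoding='utf-8',
    )
    (tmp / 'b.jsonl').write_text(
        json.dumps({'prompt': 'p1', 'response': 'r1'}) + '\n',
        encoding='utf-8',
    )
    kept = combine_jsonl([tmp / 'a.jsonl', tmp / 'b.jsonl'],
                         tmp / 'unified.jsonl', shuffle=False)
print('rows kept:', kept)
```

This version holds every row in memory, which is fine at smoke-test scale but would need a streaming approach for much larger caps.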


In [None]:
from pathlib import Path
import json
import subprocess, sys

tmp_work = Path('/content/pi_work')
tmp_unified = Path('/content/pi_unified.jsonl')

cmd = [
    sys.executable, 'scripts/pipeline_primeintellect_one_model.py',
    '--dataset', 'PrimeIntellect/verifiable-coding-problems=3',
    '--dataset', 'PrimeIntellect/real-world-swe-problems=2',
    '--work-dir', str(tmp_work),
    '--unified-out', str(tmp_unified),
    '--no-shuffle',
    '--no-dedup',
]
print('Running:', ' '.join(cmd))
subprocess.run(cmd, check=True)

print('Unified exists:', tmp_unified.exists())
with tmp_unified.open('r', encoding='utf-8') as fh:
    lines = fh.readlines()  # the smoke-test file is tiny, so reading it whole is fine
print('Unified lines:', len(lines))
first = json.loads(lines[0])
print('Keys:', sorted(first.keys()))
print('Source:', first.get('source'))
print('Prompt preview:', first.get('prompt','')[:200].replace('\n','\\n'))
print('Response preview:', first.get('response','')[:200].replace('\n','\\n'))

## 2) Build a real unified dataset (capped)

The per-dataset caps below keep the build runnable in Colab without downloading the full parquet corpora.
Raise the limits once the pipeline works end to end.
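
Each `--dataset` value pairs a Hugging Face dataset name with a row cap, written `NAME=LIMIT`. The sketch below shows one way such a spec could be parsed and how the caps in this section add up. `parse_dataset_spec` is a hypothetical helper, and the real script may parse its arguments differently.

```python
def parse_dataset_spec(spec):
    """Split 'NAME=LIMIT' into (name, limit); limit is None when no cap is given."""
    name, sep, limit = spec.rpartition('=')
    if not sep:
        return spec, None  # assumption: a bare name means "no cap"
    if not name:
        raise ValueError(f'missing dataset name in {spec!r}')
    return name, int(limit)  # int() raises ValueError on a non-numeric cap


specs = [
    'PrimeIntellect/SYNTHETIC-1-SFT-Data=20000',
    'PrimeIntellect/verifiable-coding-problems=20000',
    'PrimeIntellect/real-world-swe-problems=10000',
    'PrimeIntellect/synthetic-code-understanding=10000',
    'PrimeIntellect/stackexchange-question-answering=10000',
]

total = sum(parse_dataset_spec(s)[1] for s in specs)
print('Upper bound on unified rows:', total)  # 70000
```

The sum is an upper bound: if the pipeline deduplicates rows, or a dataset has fewer rows than its cap, the unified file will be smaller.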

In [None]:
from pathlib import Path
import subprocess, sys

work_dir = Path('data/training/primeintellect_pipeline')
unified_out = Path('data/training/unified_primeintellect.jsonl')

dataset_specs = [
    'PrimeIntellect/SYNTHETIC-1-SFT-Data=20000',
    'PrimeIntellect/verifiable-coding-problems=20000',
    'PrimeIntellect/real-world-swe-problems=10000',
    'PrimeIntellect/synthetic-code-understanding=10000',
    'PrimeIntellect/stackexchange-question-answering=10000',
]

cmd = [sys.executable, 'scripts/pipeline_primeintellect_one_model.py']
for spec in dataset_specs:
    cmd += ['--dataset', spec]
cmd += ['--work-dir', str(work_dir), '--unified-out', str(unified_out)]

print('Running:', ' '.join(cmd))
subprocess.run(cmd, check=True)

print('Unified path:', unified_out)
with unified_out.open('r', encoding='utf-8') as fh:
    print('Unified lines:', sum(1 for _ in fh))

## 3) Inspect the unified JSONL

> Optional but recommended: sanity-check the mixture by `source` and the average prompt/response length in characters. The cell reads only the first `max_lines` rows to stay cheap.

In [None]:
import json
from collections import Counter, defaultdict
from pathlib import Path
import random

path = Path('data/training/unified_primeintellect.jsonl')
assert path.exists(), f"Missing: {path}"

counts = Counter()
lens = defaultdict(lambda: {'p': 0, 'r': 0, 'n': 0})
examples_by_source = defaultdict(list)

max_lines = 2000  # keep cheap in Colab
with path.open('r', encoding='utf-8') as fh:
    sampled = [line for _, line in zip(range(max_lines), fh)]  # first max_lines rows only
for line in sampled:
    obj = json.loads(line)
    src = obj.get('source', 'unknown')
    p = obj.get('prompt', '')
    r = obj.get('response', '')

    counts[src] += 1
    lens[src]['p'] += len(p)
    lens[src]['r'] += len(r)
    lens[src]['n'] += 1

    if len(examples_by_source[src]) < 3:
        examples_by_source[src].append(obj)

print('Counts (first', max_lines, 'rows):')
for src, c in counts.most_common():
    n = lens[src]['n']
    ap = lens[src]['p'] / max(1, n)
    ar = lens[src]['r'] / max(1, n)
    print(f"- {src}: {c} | avg_prompt_chars={ap:.0f} avg_response_chars={ar:.0f}")

# show the first stored example from up to two randomly chosen sources
sources = list(examples_by_source.keys())
random.shuffle(sources)
for src in sources[:2]:
    ex = examples_by_source[src][0]
    print('\n--- example source:', src, '---')
    print('PROMPT:', ex['prompt'][:400])
    print('\nRESPONSE:', ex['response'][:400])

## 4) Optional: Train one model

> **GPU recommended.** This uses `scripts/train_codetether.py` and expects CUDA to be available.
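
Before flipping `RUN_TRAIN`, a quick pre-flight check can confirm that PyTorch sees a GPU. `cuda_status` is a hypothetical helper written for this notebook; it relies only on `torch.cuda.is_available()` and tolerates a missing `torch` install.

```python
import importlib.util


def cuda_status():
    """Return a short string describing whether PyTorch can see a CUDA device."""
    if importlib.util.find_spec('torch') is None:
        return 'torch-not-installed'
    import torch  # imported lazily so the check works without torch
    return 'cuda-available' if torch.cuda.is_available() else 'cuda-unavailable'


status = cuda_status()
print('CUDA status:', status)
```

Anything other than `cuda-available` means the training cell is likely to fail, so fix the runtime before starting a long run.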

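Earlier training runs in this repo suffered from BitNet weight collapse. To guard against it, the command below passes `--polarize` (with `--target-scale 0.01` and `--polarization-strength 0.1`), which enables the anti-collapse polarization hook, and sets both weight-decay values to `0.0`.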

In [None]:
import os, sys, subprocess
from pathlib import Path

RUN_TRAIN = False  # <- set True to actually train

if RUN_TRAIN:
    data_path = Path('data/training/unified_primeintellect.jsonl')
    assert data_path.exists(), f"Missing: {data_path}"

    # Pick a healthy-ish base checkpoint if present: prefer 1500 steps, else 500.
    base = Path('artifacts/distillix-v05-1500steps.pt')
    if not base.exists():
        # Fallback: swap this for any base checkpoint you have.
        base = Path('artifacts/distillix-v05-500steps.pt')
    if not base.exists():
        print('Warning: base checkpoint not found:', base)

    cmd = [
        sys.executable, 'scripts/train_codetether.py',
        '--base', str(base),
        '--data', str(data_path),
        '--output', 'distillix-primeintellect',
        '--steps', '1000',
        '--batch', '4',
        '--accum', '8',
        '--muon-weight-decay', '0.0',
        '--adamw-weight-decay', '0.0',
        '--polarize',
        '--target-scale', '0.01',
        '--polarization-strength', '0.1',
    ]

    print('Running training:', ' '.join(cmd))
    subprocess.run(cmd, check=True)
else:
    print('RUN_TRAIN=False (skipping). Set it to True to launch training.')

## 5) Troubleshooting

- **Native abort after Hugging Face streaming**: In some environments, `datasets.load_dataset(..., streaming=True)` can abort during interpreter shutdown. This repo’s importer (`scripts/import_hf_synthetic1.py`) uses a hard-exit after flushing output to avoid that. The pipeline runs the importer in a subprocess so your notebook stays alive.
- **Disk usage**: The raw SYNTHETIC‑1 dataset is large. Keep limits small until you’re happy with the mixture.
- **CUDA**: `scripts/train_codetether.py` currently moves the model to CUDA. In Colab, switch Runtime → Change runtime type → GPU.
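
The hard-exit workaround from the first bullet can be sketched as follows. This illustrates the general pattern (flush, then `os._exit`, inside a child process) and is not the code in `scripts/import_hf_synthetic1.py`; the `CHILD` script and the `run_isolated` helper are made up for the example.

```python
import subprocess
import sys
import textwrap

# Stand-in for an importer script: it prints its result, flushes, then
# exits without running normal interpreter teardown.
CHILD = textwrap.dedent('''
    import os, sys
    print("imported 3 rows")
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(0)  # skips atexit handlers and object finalizers
''')


def run_isolated(code):
    """Run code in a separate Python process so a crash cannot kill the notebook."""
    return subprocess.run(
        [sys.executable, '-c', code],
        capture_output=True, text=True, check=True,
    )


result = run_isolated(CHILD)
print(result.stdout.strip())
```

Because `os._exit` bypasses buffered-output cleanup, the explicit flushes matter: without them, output written to a pipe can be lost.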