A personal study of inference-time compression and optimization for small open-weight language models, run entirely on a single Apple M5 MacBook with 32 GB unified memory. Not a benchmark suite, not a paper, not a production system. Every number comes from a script in this repo that actually ran on this machine.
Speculative decoding on this hardware, visualized. Qwen2.5-0.5B drafts four tokens;
Qwen2.5-3B verifies them in one forward pass. Yellow = proposed, green = accepted,
red = rejected, blue = bonus correction from the target. The draft agrees often
enough that multiple tokens fall out per target step, which is where the speedup
comes from. Aggregate accept rates across four prompts are in docs/RESULTS.md;
the GIF itself is regenerated by scripts/make_specdec_gif.py.
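The accept/reject loop the GIF animates can be sketched in a few lines. This is a toy greedy variant with stand-in `draft` and `target_next` functions over integer token ids, not the repo's actual Qwen-based implementation:

```python
def draft(prefix, k):
    # Hypothetical draft model: proposes k greedy continuations,
    # deliberately drifting wrong after the first two tokens.
    last = prefix[-1]
    out = [last + i + 1 for i in range(k)]
    out[2:] = [t + 1 for t in out[2:]]
    return out

def target_next(prefix):
    # Hypothetical target model: its own greedy next token.
    return prefix[-1] + 1

def spec_decode_step(prefix, k=4):
    """One target step: verify k drafted tokens, keep the agreeing
    prefix, then append the target's own token (the bonus correction)."""
    proposed = draft(prefix, k)
    accepted = []
    for tok in proposed:
        if tok == target_next(prefix + accepted):  # greedy verification
            accepted.append(tok)
        else:
            break
    bonus = target_next(prefix + accepted)
    return accepted, bonus

accepted, bonus = spec_decode_step([0], k=4)  # → ([1, 2], 3)
```

Here three tokens land per target forward pass (two accepted plus the bonus), which is exactly where the speedup comes from when the draft agrees often.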
Thirteen methods across three Qwen2.5 base checkpoints:
- Baselines: fp32 / fp16 / bf16 on CPU or MPS.
- Quantization: dynamic int8, weight-only int8/int4, small GPTQ-style pass.
- Compile / export: torch.compile, ONNX Runtime CPU.
- KV + decode: KV-cache growth probe, int8 KV-cache, speculative decoding, SDPA probe.
- Sparsity and distillation: magnitude pruning, short distillation run.
Everything is driven by one harness in src/tinycompress/eval/ and lands in one JSON
per (model, method) under results/raw/. CoreML, CUDA kernels (bitsandbytes / AWQ /
GPTQ / FlashAttention), and frontier-scale models are out of scope.
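For a flavor of the quantization row, here is a minimal pure-Python sketch of symmetric per-tensor weight-only int8 quantization; the harness itself uses torch, so this is illustrative only:

```python
def quantize_int8(weights):
    """Weight-only int8: store int8 values plus one float scale;
    dequantize on the fly at matmul time."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero weights
    q = [round(w / scale) for w in weights]              # values in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -1.27, 0.635, 0.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# w_hat approximates w; per-entry error is bounded by ~scale/2
```

Dynamic int8 additionally quantizes activations per batch, and a GPTQ-style pass picks the int values to minimize layer output error rather than rounding to nearest.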
Environment, captured in every results JSON: Apple M5, 32 GB unified memory, macOS 26.1, Python 3.14.3, torch 2.11.0 (MPS available). Runs were sequential on a laptop, with no active throttling management.
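A stdlib-only sketch of the kind of environment record the harness embeds; the field names here are illustrative, not the actual schema of results/raw/*.json:

```python
import json
import platform

def environment_record():
    # Illustrative analog of `python -m tinycompress.hardware_info`.
    return {
        "machine": platform.machine(),  # e.g. "arm64" on Apple silicon
        "os": platform.system(),
        "os_version": platform.mac_ver()[0] or platform.release(),
        "python": platform.python_version(),
    }

print(json.dumps(environment_record(), indent=2))
```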
Three Qwen2.5 base checkpoints, chosen so the same tokenizer is shared across the ladder (required for the speculative-decoding experiment):
- Qwen/Qwen2.5-0.5B - draft model for spec decoding.
- Qwen/Qwen2.5-1.5B - spec-decoding target; fills the "~1B" slot.
- Qwen/Qwen2.5-3B - largest that fits comfortably in 32 GB with headroom.
- results/tables/summary.md - one row per (model, method); peak MB, forward ms, tok/s, PPL.
- docs/RESULTS.md - headline observations with supporting numbers, figures, cross-run variance, and domain-shift checks.
- docs/METHODS.md - what each method measures, how, and the harness dataflow.
- docs/LIMITATIONS.md - what this does not claim.
Per-model detail: 0.5B / 1.5B / 3B.
Figures: PPL by method / latency by method / KV growth / pruning cliff.
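The KV-growth figure is just linear arithmetic; a back-of-envelope version is below. The config numbers are assumptions about a Qwen2.5-0.5B-like model (24 layers, 2 KV heads under GQA, head dim 64) — check the checkpoint's config.json before relying on them:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V caches: 2 tensors x layers x kv_heads x seq_len x head_dim."""
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

# Assumed Qwen2.5-0.5B-like config at 4096 tokens, fp16 cache:
mb = kv_cache_bytes(seq_len=4096, n_layers=24, n_kv_heads=2, head_dim=64) / 2**20
# grows linearly with seq_len; an int8 KV cache halves bytes_per_elem
```

Under these assumptions the fp16 cache reaches 48 MB at 4096 tokens, which is why the probe measures growth rather than a single point.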
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,plots]"
# exact rerun of the numbers above:
pip install -r requirements-lock.txt -e .
pytest -q # smoke tests
python -m tinycompress.hardware_info # sanity check
python scripts/run_baseline.py --model qwen2_5_0_5b --method fp32_cpu # one cell
bash scripts/run_all.sh                                                  # full matrix

The per-area runners under scripts/run_*.sh are idempotent. scripts/self_audit.py
is the cross-cut check and must stay green ([OK] all checks passed).
- Qwen2.5: huggingface.co/Qwen/Qwen2.5-0.5B
- PyTorch dynamic quantization: docs.pytorch.org
- torch.compile: docs.pytorch.org
- ONNX Runtime: onnxruntime.ai
- Speculative decoding, Leviathan et al. 2023: arxiv.org/abs/2211.17192
- GPTQ, Frantar et al. 2022: arxiv.org/abs/2210.17323
- Distillation, Hinton et al. 2015: arxiv.org/abs/1503.02531
- wikitext-2-raw-v1: huggingface.co/datasets/Salesforce/wikitext
src/tinycompress/ library: loader, harness, quant / compile / kv / prune / distill
scripts/ thin entry points (run_*.py, make_tables, make_plots, self_audit)
results/raw/ one JSON per (model, method); ground truth
results/tables/ derived tables (regenerated from raw)
results/figures/ plots (regenerated from raw)
docs/ METHODS.md, RESULTS.md, LIMITATIONS.md
tests/ hermetic, CPU-only, CI-friendly
MIT.
