Releases: timtoole02/Camelid
Camelid v0.1.0
Camelid v0.1.0 is an evidence-first release: it claims exactly what the repository can defend with committed artifacts, and states its boundaries as plainly as its results.
What this is
A Rust-native local GGUF inference backend for Apple Silicon with an OpenAI-style API and a React/Vite chat frontend. Q8_0 weights load directly from GGUF (no conversion step), run on a Metal-resident GPU path with greedy sampling on the GPU, and fall back to validated CPU paths where the resident gates do not apply.
Supported rows (exact, evidence-cited)
Support is per exact model row with row-specific artifacts — see SUPPORT_MATRIX_v0.1.md:
- TinyLlama 1.1B Chat Q8_0 — verified support gate
- Llama 3.2 1B Instruct Q8_0 — verified bounded support
- Llama 3.2 3B Instruct Q8_0 — supported exact-row smoke
- Llama 3 8B Instruct Q8_0 — verified bounded support
- Mistral-7B / Mixtral — evidence-only bring-up; fail-closed in v0.1
Performance (same host, same prompts, three alternating rounds; medians)
One Apple M4 (10-core GPU, 16 GB), comparators llama.cpp (Metal, brew) and MLX-LM (8-bit). Raw logs and methods in the committed evidence bundles under qa/evidence-bundles/; reading boundaries in BENCHMARKS.md.
| Row / lane | Camelid | llama.cpp | MLX-LM |
|---|---|---|---|
| 3B prefill, 601-token prompt (tok/s) | 587.3 | 543.7 | 577.9 |
| 3B decode, short context (tok/s) | 29.7 | 29.1 | 29.1 |
| 1B prefill (tok/s) | 1664.3 | 1472.8 | 1670.0 |
| 1B decode (tok/s) | 74.8 | 67.2 | 69.7 |
| 8B prefill (tok/s) | 234.2 | 220.4 | 229.2 |
| 8B decode (tok/s) | 12.1 | 12.1 | 12.0 |
Stated boundaries, in the same spirit:
- These are short-prompt, same-session snapshots on exact rows; nothing transfers to other shapes or machines.
- Past ~1.7k-token prompts Camelid prefill reads 2–4× below llama.cpp; decode at depth reads below both comparators (25.0 / 16.9 tok/s at 1.5k / 8k vs 26.4 / 19.1 and 26.9 / 22.2) — both recorded as known-behind lanes with their own bundles.
- 1B prefill is parity with MLX-LM (no win claimed); 8B decode is a three-way parity band.
Greedy continuations are token-parity-checked against the CPU reference path throughout, including the GPU sampling fast lane.
Gate
The release gate (RELEASE_GATE_v0.1.md) is green on the tag head: format, clippy (-D warnings, all features), 533 tests, release build, public evidence-claim and scrub checks, benchmark harness self-test, and the frontend build + model-state smoke. The Ollama comparator baseline is explicitly deferred with rationale recorded in the gate.
Full notes: RELEASE_NOTES_v0.1.md