Releases: timtoole02/Camelid

Camelid v0.1.0

05 Jun 04:59

Camelid v0.1.0 is an evidence-first release: it claims exactly what the repository can defend with committed artifacts, and states its boundaries as plainly as its results.

What this is

A Rust-native local GGUF inference backend for Apple Silicon with an OpenAI-style API and a React/Vite chat frontend. Q8_0 weights load directly from GGUF (no conversion step), run on a Metal-resident GPU path with greedy sampling on the GPU, and fall back to validated CPU paths where the resident gates do not apply.
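
As background on why no conversion step is needed: in the ggml/GGUF Q8_0 layout, weights are stored in blocks of 32 signed 8-bit values that share one half-precision scale (34 bytes per block), and each weight is recovered as scale × quant. The sketch below dequantizes one such block on the CPU. It illustrates the format only and is not Camelid's loader or Metal kernel; the function names are hypothetical.

```rust
/// Number of quantized values per Q8_0 block in the ggml/GGUF layout.
const QK8_0: usize = 32;
/// Bytes per block: a 2-byte f16 scale followed by 32 int8 quants.
const BLOCK_BYTES: usize = 2 + QK8_0;

/// Convert IEEE 754 half-precision bits to f32 (stable Rust has no f16 type).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * 2f32.powi(-24),
        // Infinity (frac == 0) or NaN.
        0x1f => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal numbers: (1 + frac/1024) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Dequantize one Q8_0 block: weight[i] = scale * quant[i].
/// Returns None if the slice is not exactly one block long.
fn dequantize_q8_0_block(block: &[u8]) -> Option<[f32; QK8_0]> {
    if block.len() != BLOCK_BYTES {
        return None;
    }
    // The scale is stored little-endian in the first two bytes.
    let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut out = [0.0f32; QK8_0];
    for (o, &q) in out.iter_mut().zip(&block[2..]) {
        // Reinterpret the raw byte as a signed 8-bit quant.
        *o = scale * (q as i8) as f32;
    }
    Some(out)
}
```

A resident GPU path would typically keep the blocks in this packed form and apply the scale inside the matrix-multiply kernel rather than expanding to f32 up front; that is a general observation about the format, not a description of Camelid's kernels.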

Supported rows (exact, evidence-cited)

Support is per exact model row with row-specific artifacts — see SUPPORT_MATRIX_v0.1.md:

  • TinyLlama 1.1B Chat Q8_0 — verified support gate
  • Llama 3.2 1B Instruct Q8_0 — verified bounded support
  • Llama 3.2 3B Instruct Q8_0 — supported exact-row smoke
  • Llama 3 8B Instruct Q8_0 — verified bounded support
  • Mistral-7B / Mixtral — evidence-only bring-up; fail-closed in v0.1
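
The per-row, fail-closed policy in the list above can be sketched as an exact-match lookup: a model is accepted only if its (model, quantization) pair is listed, and everything else is rejected rather than attempted. The enum, table, and function names here are hypothetical; the authoritative rows live in SUPPORT_MATRIX_v0.1.md.

```rust
/// Support levels named after the labels in the supported-rows list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Support {
    VerifiedGate,
    VerifiedBounded,
    ExactRowSmoke,
}

/// Hypothetical in-code mirror of the supported rows: (model, quantization, level).
const SUPPORTED_ROWS: &[(&str, &str, Support)] = &[
    ("TinyLlama 1.1B Chat", "Q8_0", Support::VerifiedGate),
    ("Llama 3.2 1B Instruct", "Q8_0", Support::VerifiedBounded),
    ("Llama 3.2 3B Instruct", "Q8_0", Support::ExactRowSmoke),
    ("Llama 3 8B Instruct", "Q8_0", Support::VerifiedBounded),
];

/// Fail closed: anything that is not an exact listed row is rejected
/// instead of being run on a best-effort basis.
fn check_row(model: &str, quant: &str) -> Result<Support, String> {
    SUPPORTED_ROWS
        .iter()
        .find(|(m, q, _)| *m == model && *q == quant)
        .map(|(_, _, level)| *level)
        .ok_or_else(|| format!("unsupported row: {model} {quant} (fail-closed)"))
}
```

Under this sketch, `check_row("Mistral-7B", "Q8_0")` returns an error, as does a listed model paired with an unlisted quantization.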

Performance (same host, same prompts, three alternating rounds; medians)

Host: one Apple M4 (10-core GPU, 16 GB). Comparators: llama.cpp (Metal, installed via Homebrew) and MLX-LM (8-bit). Raw logs and methods are in the committed evidence bundles under qa/evidence-bundles/; reading boundaries are in BENCHMARKS.md.
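
Each figure in this section is a median over three rounds. A minimal sketch of that reduction follows; the function name is hypothetical and the actual harness may compute it differently.

```rust
/// Median of a set of throughput samples (tok/s).
/// Returns None for an empty slice or if any sample is NaN.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() || samples.iter().any(|x| x.is_nan()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // partial_cmp cannot fail here because NaN was excluded above.
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}
```

With three rounds the median is simply the middle value, so a single outlier round (for example, one disturbed by background load) does not move the reported number.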

| Row / lane | Camelid | llama.cpp | MLX-LM |
|---|---|---|---|
| 3B prefill, 601-token prompt (tok/s) | 587.3 | 543.7 | 577.9 |
| 3B decode, short context (tok/s) | 29.7 | 29.1 | 29.1 |
| 1B prefill (tok/s) | 1664.3 | 1472.8 | 1670.0 |
| 1B decode (tok/s) | 74.8 | 67.2 | 69.7 |
| 8B prefill (tok/s) | 234.2 | 220.4 | 229.2 |
| 8B decode (tok/s) | 12.1 | 12.1 | 12.0 |
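
To read the table as relative differences, the sketch below copies the medians and expresses Camelid's figure as a percentage above or below each comparator. The helper names are hypothetical; the numbers are the ones in the table.

```rust
/// Median throughput (tok/s) per row: (row, Camelid, llama.cpp, MLX-LM).
const ROWS: &[(&str, f64, f64, f64)] = &[
    ("3B prefill", 587.3, 543.7, 577.9),
    ("3B decode", 29.7, 29.1, 29.1),
    ("1B prefill", 1664.3, 1472.8, 1670.0),
    ("1B decode", 74.8, 67.2, 69.7),
    ("8B prefill", 234.2, 220.4, 229.2),
    ("8B decode", 12.1, 12.1, 12.0),
];

/// Relative difference of `ours` against `baseline`, in percent.
fn pct_vs(ours: f64, baseline: f64) -> f64 {
    (ours / baseline - 1.0) * 100.0
}

/// One formatted line per row: Camelid vs llama.cpp and vs MLX-LM.
fn summary() -> Vec<String> {
    ROWS.iter()
        .map(|(name, camelid, llama, mlx)| {
            format!(
                "{name}: {:+.1}% vs llama.cpp, {:+.1}% vs MLX-LM",
                pct_vs(*camelid, *llama),
                pct_vs(*camelid, *mlx)
            )
        })
        .collect()
}
```

For example, 1B decode works out to roughly +11.3% against llama.cpp (74.8 / 67.2), while 1B prefill is about 0.3% below MLX-LM (1664.3 / 1670.0).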

Stated boundaries, in the same spirit:

  • These are short-prompt, same-session snapshots on exact rows; nothing transfers to other shapes or machines.
  • Beyond roughly 1.7k-token prompts, Camelid prefill reads 2–4× below llama.cpp. Decode at depth also reads below both comparators: 25.0 / 16.9 tok/s at 1.5k / 8k context, vs 26.4 / 19.1 and 26.9 / 22.2. Both are recorded as known-behind lanes with their own bundles.
  • 1B prefill is at parity with MLX-LM (no win claimed); 8B decode is a three-way parity band.
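
One way to make "parity" versus "ahead" or "behind" mechanical is a relative tolerance band, sketched below. The tolerance is an illustrative parameter chosen for this example; the release does not state the threshold it uses, and the names are hypothetical.

```rust
/// Outcome of comparing one throughput figure against another.
#[derive(Debug, PartialEq)]
enum Verdict {
    Ahead,
    Parity,
    Behind,
}

/// Classify `ours` against `other` using a relative tolerance `tol`
/// (e.g. 0.01 for 1%). `tol` is an assumption of this sketch, not a
/// value taken from the release.
fn classify(ours: f64, other: f64, tol: f64) -> Verdict {
    let rel = (ours - other) / other;
    if rel > tol {
        Verdict::Ahead
    } else if rel < -tol {
        Verdict::Behind
    } else {
        Verdict::Parity
    }
}
```

With a 1% band, 1B prefill against MLX-LM (1664.3 vs 1670.0) and 8B decode against either comparator (12.1 vs 12.1 and 12.0) both land in `Parity`, while 8k-context decode (16.9 vs 19.1 or 22.2) lands in `Behind`.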

Greedy continuations are token-parity-checked against the CPU reference path throughout, including the GPU sampling fast lane.
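
A token-parity check of this kind can be sketched in two pieces: greedy sampling as an argmax over logits, and a comparison that reports the first position where two continuations diverge. This is a minimal sketch with hypothetical names, assuming token ids are `u32` and logits contain no NaN; it is not the project's test harness.

```rust
/// Greedy sampling: index of the largest logit. The first index wins ties.
/// Assumes the logits contain no NaN values.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best = None;
    for (i, &v) in logits.iter().enumerate() {
        match best {
            // Keep the current best unless `v` is strictly larger.
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}

/// Index of the first position where two token sequences differ,
/// or None if they are identical (same tokens, same length).
fn first_divergence(reference: &[u32], candidate: &[u32]) -> Option<usize> {
    let common = reference.len().min(candidate.len());
    (0..common)
        .find(|&i| reference[i] != candidate[i])
        .or(if reference.len() != candidate.len() {
            Some(common)
        } else {
            None
        })
}
```

Parity then means `first_divergence(cpu_tokens, gpu_tokens)` returns `None`. Note that exact parity requires both paths to break ties between equal logits the same way; the first-index rule above is one possible convention.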

Gate

The release gate (RELEASE_GATE_v0.1.md) is green on the tag head: format, clippy (-D warnings, all features), 533 tests, release build, public evidence-claim and scrub checks, benchmark harness self-test, and the frontend build + model-state smoke. The Ollama comparator baseline is explicitly deferred with rationale recorded in the gate.
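
The Rust-side checks named in the gate (format, clippy, tests, release build) map onto standard cargo subcommands. The sketch below runs them in order and stops at the first failure. The exact flags are an assumption, and the evidence-claim, scrub, benchmark self-test, and frontend steps are not covered; the authoritative sequence is RELEASE_GATE_v0.1.md.

```rust
use std::process::Command;

/// Cargo argument lists for the Rust-side checks. The flags are an
/// assumption of this sketch, not copied from the project's gate.
fn cargo_gate_steps() -> Vec<Vec<&'static str>> {
    vec![
        vec!["fmt", "--all", "--", "--check"],
        vec!["clippy", "--all-features", "--", "-D", "warnings"],
        vec!["test"],
        vec!["build", "--release"],
    ]
}

/// Run each step in order and stop at the first failure.
fn run_gate() -> Result<(), String> {
    for args in cargo_gate_steps() {
        let status = Command::new("cargo")
            .args(&args)
            .status()
            .map_err(|e| format!("could not start cargo {}: {e}", args.join(" ")))?;
        if !status.success() {
            return Err(format!("gate failed at: cargo {}", args.join(" ")));
        }
    }
    Ok(())
}
```

Stopping at the first failing step keeps the report unambiguous about which check turned the gate red.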

Full notes: RELEASE_NOTES_v0.1.md