## Problem
terraphim-ai has no local inference capability. All LLM calls go through remote APIs (OpenRouter via `genai`). For machines without GPU or API access, there is no fallback. GGUF models for MedGemma are available (`unsloth/medgemma-1.5-4b-it-GGUF`, 11.8K downloads on HuggingFace).
## Proposed Change
Add a `terraphim_llm_local` crate (or a feature in `terraphim_multi_agent`) that wraps `llama-cpp-rs` for local GGUF inference. Implement the same LLM client trait so agents can transparently use local or remote models.
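To illustrate the transparent-swap idea, here is a minimal sketch of what a shared client trait could look like. The trait and type names (`LlmClient`, `RemoteClient`, `LocalGgufClient`) are hypothetical stand-ins, not the actual terraphim trait:

```rust
// Hypothetical shared trait -- the real terraphim LLM client trait
// (and its error type) will differ.
pub trait LlmClient {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

// Stand-in for the remote genai/OpenRouter client.
struct RemoteClient;
impl LlmClient for RemoteClient {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[remote] {prompt}"))
    }
}

// Stand-in for a llama-cpp-rs-backed local GGUF client.
struct LocalGgufClient;
impl LlmClient for LocalGgufClient {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[local] {prompt}"))
    }
}

// Agents depend only on the trait object, so backends swap freely.
fn run(client: &dyn LlmClient, prompt: &str) -> String {
    client.complete(prompt).unwrap_or_default()
}

fn main() {
    println!("{}", run(&RemoteClient, "hi"));    // [remote] hi
    println!("{}", run(&LocalGgufClient, "hi")); // [local] hi
}
```

Because agents hold a `&dyn LlmClient` (or a generic bound), the local/remote decision becomes pure configuration.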
Key requirements:
- CPU-only inference support (many dev machines lack GPU)
- Automatic GGUF model download via the `hf-hub` crate
- Quantization variant selection (`Q4_K_M`, ~2.5 GB for 4B models; `Q8_0` for higher quality)
- Same trait interface as the remote genai client for seamless swapping
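The quantization-selection requirement could be sketched as a small enum that maps a variant to a GGUF filename; the download itself would then go through `hf-hub`. Everything here is illustrative: the enum, the filename layout on the unsloth repo, and the commented-out `hf-hub` call should all be verified against the actual repo contents:

```rust
/// Hypothetical quantization selector; variant names follow the
/// llama.cpp GGUF convention (Q4_K_M, Q8_0).
#[derive(Clone, Copy, Debug)]
#[allow(non_camel_case_types)]
enum Quant {
    Q4_K_M, // ~2.5 GB for a 4B model: reasonable CPU-only default
    Q8_0,   // larger download, higher quality
}

impl Quant {
    fn suffix(self) -> &'static str {
        match self {
            Quant::Q4_K_M => "Q4_K_M",
            Quant::Q8_0 => "Q8_0",
        }
    }
}

/// Build a GGUF filename; the `<model>-<quant>.gguf` layout is an
/// assumption about how the repo names its files.
fn gguf_filename(model: &str, quant: Quant) -> String {
    format!("{model}-{}.gguf", quant.suffix())
}

fn main() {
    let file = gguf_filename("medgemma-1.5-4b-it", Quant::Q4_K_M);
    println!("{file}"); // medgemma-1.5-4b-it-Q4_K_M.gguf
    // Download sketch via hf-hub's blocking API (not compiled here):
    // let path = hf_hub::api::sync::Api::new()?
    //     .model("unsloth/medgemma-1.5-4b-it-GGUF".to_string())
    //     .get(&file)?;
}
```

`hf-hub` caches downloads under the standard HuggingFace cache directory, so repeated runs would not re-fetch the model.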
## Scope
- New crate `crates/terraphim_llm_local/` or feature gate in `terraphim_multi_agent`
- Dependencies: `llama-cpp-rs`, `hf-hub` (both approved)
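If the feature-gate route is chosen, the dependencies could be made optional in `terraphim_multi_agent`'s `Cargo.toml`. A sketch only: the feature name (`llm-local`) and version numbers are placeholders:

```toml
[dependencies]
llama-cpp-rs = { version = "*", optional = true }  # pin a real version
hf-hub = { version = "*", optional = true }        # pin a real version

[features]
llm-local = ["dep:llama-cpp-rs", "dep:hf-hub"]
```

Builds without `--features llm-local` would then compile exactly as today, which keeps CI and GPU-less machines unaffected by default.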
## Context
This is UPLIFT-5 from the medgemma-competition multi-agent integration plan. Local GGUF inference is essential for development workflows where remote API calls are slow or unavailable. The MedGemma 1.5-4b-it GGUF model is the primary target for local inference.
Related upstream issues: #534, #535, #536