
feat: Add GGUF/llama-cpp backend to terraphim LLM proxy layer #538

@AlexMikhalev

Description

Problem

terraphim-ai currently has no local inference capability: all LLM calls go through remote APIs (OpenRouter via genai). For machines without a GPU or API access, there is no fallback. GGUF models for MedGemma are available (unsloth/medgemma-1.5-4b-it-GGUF, 11.8K downloads on HuggingFace).

Proposed Change

Add a terraphim_llm_local crate (or a feature in terraphim_multi_agent) that wraps llama-cpp-rs for local GGUF inference. Implement the same LLM client trait as the remote backend so agents can switch transparently between local and remote models.

Key requirements:

  • CPU-only inference support (many dev machines lack GPU)
  • Automatic GGUF model download via hf-hub crate
  • Quantization variant selection (Q4_K_M ~2.5GB for 4B models, Q8_0 for higher quality)
  • Same trait interface as the remote genai client for seamless swapping
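The "same trait interface" requirement above can be sketched as follows. This is a minimal illustration only: the names LlmClient, LocalGgufClient, and RemoteGenaiClient are hypothetical, terraphim's actual trait is not shown in this issue, and the inference bodies are stubbed so the sketch stays self-contained.

```rust
/// Hypothetical shared interface; terraphim's real trait may differ.
pub trait LlmClient {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Local backend: in the real crate this would hold a llama-cpp-rs
/// context loaded from a GGUF file on disk.
pub struct LocalGgufClient {
    pub model_path: String,
}

impl LlmClient for LocalGgufClient {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        // A real implementation would tokenize `prompt` and run
        // llama.cpp inference; stubbed here for illustration.
        Ok(format!("[local:{}] {}", self.model_path, prompt))
    }
}

/// Remote backend: would delegate to the existing genai/OpenRouter client.
pub struct RemoteGenaiClient {
    pub model: String,
}

impl LlmClient for RemoteGenaiClient {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[remote:{}] {}", self.model, prompt))
    }
}

/// Agents depend only on the trait object, so backends swap transparently.
pub fn run_agent(client: &dyn LlmClient, prompt: &str) -> String {
    client.complete(prompt).unwrap_or_else(|e| format!("error: {e}"))
}
```

Because agents call run_agent against &dyn LlmClient, selecting local vs. remote becomes a construction-time decision (e.g. driven by config or feature flags) rather than a change to agent code.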

Scope

  • New crate crates/terraphim_llm_local/, or a feature gate in terraphim_multi_agent
  • Dependencies: llama-cpp-rs, hf-hub (both approved)
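To make the quantization-variant requirement concrete, here is a sketch of variant-to-filename selection for the automatic download. The repo id comes from this issue; the Quant enum and the per-variant filenames are assumptions (they follow the common unsloth GGUF naming pattern but should be checked against the actual repo contents).

```rust
// HuggingFace repo named in this issue.
pub const REPO_ID: &str = "unsloth/medgemma-1.5-4b-it-GGUF";

/// Hypothetical quantization-variant enum.
#[derive(Clone, Copy, Debug)]
pub enum Quant {
    /// ~2.5 GB for 4B models; reasonable default for CPU-only machines.
    Q4KM,
    /// Larger file, higher quality.
    Q8_0,
}

impl Quant {
    /// Map a variant to its assumed GGUF filename inside the repo.
    pub fn filename(self) -> String {
        let tag = match self {
            Quant::Q4KM => "Q4_K_M",
            Quant::Q8_0 => "Q8_0",
        };
        format!("medgemma-1.5-4b-it-{tag}.gguf")
    }
}

// With the hf-hub crate, the download would look roughly like:
//
//   let api = hf_hub::api::sync::Api::new()?;
//   let path = api.model(REPO_ID.to_string()).get(&Quant::Q4KM.filename())?;
//
// which caches the file under the local HuggingFace cache directory and
// returns its path, ready to hand to the llama-cpp-rs loader.
```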

Context

This is UPLIFT-5 from the medgemma-competition multi-agent integration plan. Local GGUF inference is essential for development workflows where remote API calls are slow or unavailable. The MedGemma 1.5-4b-it GGUF model is the primary target for local inference.

Related upstream issues: #534, #535, #536
