feat(providers): in-process local llama backend, default Qwen2.5-Coder-1.5B (closes #42) #43
Merged
…Coder-1.5B

Closes #42.

Adds Path 4: an offline-first provider that runs a GGUF model directly in the Python process via llama-cpp-python, with no API key and no HTTP overhead. The model is loaded once and reused for the lifetime of the process.

Why: the project's own benchmark shows the 6-layer contract lifts a 1.5B coder from ~34% to ~88% pass-rate. A strong, zero-config local default makes that reachable without Ollama or any external endpoint, and reduces the dependency on remote APIs for small edits.

- Default to Qwen2.5-Coder-1.5B-Instruct-Q5_K_M (bartowski GGUF) when neither SIMPLICIO_MODEL nor SIMPLICIO_BASE_URL is set; weights are fetched once from HF.
- Explicit route local-llama/<repo>::<file.gguf> | local-llama/default | local-llama//abs/path.gguf, plus `simplicio task --local`.
- Tuning via SIMPLICIO_LOCAL_* env (ctx, threads, gpu layers, max tokens, temp, model path/repo/file).
- New optional extra `simplicio-cli[local]` (llama-cpp-python, huggingface-hub).
- Friendly SystemExit when the extra is not installed.

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
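The default-selection rule from this commit message can be sketched roughly as follows. This is an illustration only, not the project's code: `should_use_local_default` is a hypothetical helper, and treating an empty variable as unset is an assumption.

```python
from typing import Mapping


def should_use_local_default(env: Mapping[str, str], force_local: bool = False) -> bool:
    """Decide whether to fall back to the in-process local model.

    Hypothetical helper. `force_local` mirrors `simplicio task --local`.
    """
    if force_local:
        return True
    # Assumption: an empty or whitespace-only value counts as "not set".
    model = env.get("SIMPLICIO_MODEL", "").strip()
    base_url = env.get("SIMPLICIO_BASE_URL", "").strip()
    return not model and not base_url
```

In a real CLI the mapping passed in would typically be `os.environ`; taking it as a parameter keeps the rule easy to test.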
The Path 4 local backend downloads GGUF weights at runtime; keep them out of version control. https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
The release-metadata test pins the expected version; update it alongside the 0.4.4 -> 0.5.0 bump from the local-llama backend feature. https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
A regex-bench run across GGUF quantizations surfaced a latent collision: when models route as local-llama/default (switching weights via SIMPLICIO_LOCAL_MODEL_PATH/_REPO/_FILE), the completion cache key only used the logical model id, so different GGUFs could share cached completions. Fold the resolved weights (path or repo/file) into the key, covered by a regression test. https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
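A minimal sketch of the idea behind this fix, assuming a hash-based cache key. The function name, the key layout, and the use of SHA-256 are illustrative assumptions; the project's actual cache code is not shown in this PR text.

```python
import hashlib
import json
from typing import Optional


def completion_cache_key(model_id: str, prompt: str, weights: Optional[str] = None) -> str:
    """Build a cache key from the logical model id, the resolved weights, and the prompt.

    `weights` is the resolved GGUF identity: an absolute path, or a
    "repo::file" string. Hypothetical helper for illustration.
    """
    payload = json.dumps(
        {"model": model_id, "weights": weights or "", "prompt": prompt},
        sort_keys=True,  # stable field order, so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With the weights folded in, two runs that both route as `local-llama/default` but point at different GGUF files no longer share cached completions.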
What changes
Implements issue #42: Qwen2.5-Coder-1.5B as the default local model via `llama-cpp-python`.

Adds Path 4: an in-process provider that runs a GGUF model directly in the Python process, with no API key and no HTTP overhead. The model is loaded once and reused by the process.
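The "load once, reuse" behavior and the friendly exit when the extra is missing could look roughly like this. It is a sketch under assumptions: `get_model`, `_load_llama`, and the exit message are hypothetical, and only the `llama_cpp.Llama(model_path=...)` constructor comes from llama-cpp-python itself.

```python
from typing import Any, Callable, Dict, Optional

_MODELS: Dict[str, Any] = {}  # process-wide cache: model path -> loaded model


def _load_llama(model_path: str) -> Any:
    """Load a GGUF file with llama-cpp-python, or exit with a readable hint."""
    try:
        from llama_cpp import Llama  # installed by the optional extra
    except ImportError as exc:
        # Illustrative wording; the PR only says the exit is "friendly".
        raise SystemExit(
            "Local backend unavailable. Install it with: "
            "pip install 'simplicio-cli[local]'"
        ) from exc
    return Llama(model_path=model_path, verbose=False)


def get_model(model_path: str, loader: Optional[Callable[[str], Any]] = None) -> Any:
    """Return the cached model for model_path, loading it on first use."""
    if model_path not in _MODELS:
        _MODELS[model_path] = (loader or _load_llama)(model_path)
    return _MODELS[model_path]
```

The optional `loader` argument exists only so the caching can be exercised without real weights; a production version would probably also guard the cache with a lock if models can be requested from several threads.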
Decisions (confirmed with the author)
- The local model is the default when neither `SIMPLICIO_MODEL` nor `SIMPLICIO_BASE_URL` is set, plus an explicit route.

Details
- Default model: `Qwen2.5-Coder-1.5B-Instruct-Q5_K_M` (`bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF`), downloaded once from the HF Hub.
- Explicit routes: `local-llama/default`, `local-llama/<repo>::<file.gguf>`, `local-llama//abs/path.gguf`.
- `simplicio task --local` forces the local model regardless of the environment.
- Tuning via environment variables: `SIMPLICIO_LOCAL_{MODEL_PATH,MODEL_REPO,MODEL_FILE,CTX,THREADS,GPU_LAYERS,MAX_TOKENS,TEMP}`.
- Optional extra: `pip install 'simplicio-cli[local]'` (`llama-cpp-python>=0.3.2`, `huggingface-hub>=0.23`).
- Friendly error (`SystemExit`) when the extra is not installed.

Why
The project's own benchmark shows the 6-layer contract lifting a 1.5B coder from ~34% to ~88% pass-rate. A strong, zero-config local default makes that reachable without Ollama or an external endpoint, reducing the dependency on remote APIs.
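For reference, the three route forms listed under Details could be parsed along these lines. This is a hedged sketch: `LocalSpec`, `parse_local_route`, and the default `.gguf` filename are assumptions, and the `SIMPLICIO_LOCAL_MODEL_PATH`/`_REPO`/`_FILE` overrides that apply to `local-llama/default` are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

PREFIX = "local-llama/"
DEFAULT_REPO = "bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF"
# Assumed filename inside the repo; the PR text names only the quantization.
DEFAULT_FILE = "Qwen2.5-Coder-1.5B-Instruct-Q5_K_M.gguf"


@dataclass(frozen=True)
class LocalSpec:
    """Where the weights come from: a local path, or a Hub repo plus file."""
    path: Optional[str] = None
    repo: Optional[str] = None
    file: Optional[str] = None


def parse_local_route(route: str) -> LocalSpec:
    """Parse a `local-llama/...` route string (hypothetical helper)."""
    if not route.startswith(PREFIX):
        raise ValueError(f"not a local-llama route: {route!r}")
    rest = route[len(PREFIX):]
    if rest == "default":
        return LocalSpec(repo=DEFAULT_REPO, file=DEFAULT_FILE)
    if rest.startswith("/"):
        # local-llama//abs/path.gguf leaves "/abs/path.gguf" after the prefix.
        return LocalSpec(path=rest)
    repo, sep, file = rest.partition("::")
    if not sep or not repo or not file:
        raise ValueError(f"expected <repo>::<file.gguf>, got {rest!r}")
    return LocalSpec(repo=repo, file=file)
```

The double slash in the absolute-path form falls out naturally here: the first slash belongs to the `local-llama/` prefix and the second starts the POSIX path. Windows-style absolute paths are not handled by this sketch.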
Behavior change
`simplicio` with no provider configured no longer errors out; it falls back to the local Qwen model (offline-first). Setting `SIMPLICIO_BASE_URL`/`SIMPLICIO_MODEL` switches back to the remote provider. The test that assumed the error was updated, and a new test covers the new default.

Tests
- `tests/python/test_providers_local.py` (routing, spec resolution, knobs, cache, error paths).
- `ruff check` is clean on the touched files.

Version / docs
- `0.4.4 → 0.5.0` (`pyproject.toml`, `simplicio/__init__.py`).
- `CHANGELOG.md` and `README.md` (new section "Path 4 — offline-first local model") updated.

Closes #42
https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
Generated by Claude Code