feat(providers): in-process local llama backend, default Qwen2.5-Coder-1.5B (closes #42) by wesleysimplicio · Pull Request #43 · wesleysimplicio/simplicio-dev-cli

wesleysimplicio · 2026-05-31T03:10:05Z

What changes

Implements issue #42: Qwen2.5-Coder-1.5B as the default local model via llama-cpp-python.

Adds Path 4: an in-process provider that runs a GGUF directly inside the Python process, with no API key and no HTTP overhead. The model is loaded once and reused for the lifetime of the process.
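The load-once behavior, together with the friendly error for a missing optional dependency, could look roughly like the sketch below. It is not the code from this PR: `LocalModelCache`, `_default_factory`, and the `n_ctx` default of 4096 are hypothetical names and values chosen for illustration. Only the `llama_cpp.Llama` constructor arguments `model_path` and `n_ctx` come from llama-cpp-python itself.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def _default_factory(**kwargs: Any) -> Any:
    """Build a llama_cpp.Llama, importing the optional dependency lazily."""
    try:
        from llama_cpp import Llama  # shipped by the optional [local] extra
    except ImportError as exc:
        # SystemExit gives CLI users a short message instead of a traceback.
        raise SystemExit(
            "Local model support is not installed. "
            "Run: pip install 'simplicio-cli[local]'"
        ) from exc
    return Llama(**kwargs)


class LocalModelCache:
    """Keep one loaded model per (path, context size) for the process lifetime."""

    def __init__(self, factory: Optional[Callable[..., Any]] = None) -> None:
        # The factory is injectable so the cache can be exercised without weights.
        self._factory = factory or _default_factory
        self._models: Dict[Tuple[str, int], Any] = {}

    def get(self, model_path: str, n_ctx: int = 4096) -> Any:
        key = (model_path, n_ctx)
        if key not in self._models:
            # First request for this key: pay the load cost once.
            self._models[key] = self._factory(model_path=model_path, n_ctx=n_ctx)
        return self._models[key]
```

Repeated calls to `get` with the same path and context size return the same object, so only the first completion in a process pays the load time.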

Decisions (confirmed with the author)

Activation: automatic default when neither SIMPLICIO_MODEL nor SIMPLICIO_BASE_URL is set, plus an explicit route.
Integration: direct, in-process (not an OpenAI-compatible server).
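The activation rule can be read as a small predicate over the environment. The following is a minimal sketch under assumptions: the function name `falls_back_to_local` is hypothetical, and treating an empty or whitespace-only variable as unset is a guess about the intended behavior, not something the PR states.

```python
import os
from typing import Mapping, Optional


def falls_back_to_local(
    env: Optional[Mapping[str, str]] = None,
    force_local: bool = False,
) -> bool:
    """Return True when the local default model should be used.

    force_local models the `--local` flag; the env mapping is injectable
    so the rule can be checked without touching os.environ.
    """
    env = os.environ if env is None else env
    # Assumption: blank values count as "not set".
    model = env.get("SIMPLICIO_MODEL", "").strip()
    base_url = env.get("SIMPLICIO_BASE_URL", "").strip()
    return force_local or (not model and not base_url)
```

For example, `falls_back_to_local({})` is true, while setting either variable makes it false unless `force_local` is passed.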

Details

Default model: Qwen2.5-Coder-1.5B-Instruct-Q5_K_M (bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF), downloaded once from the HF Hub.
Explicit routes: local-llama/default, local-llama/<repo>::<file.gguf>, local-llama//abs/path.gguf.
Flag: simplicio task --local forces the local model regardless of the environment.
Knobs: SIMPLICIO_LOCAL_{MODEL_PATH,MODEL_REPO,MODEL_FILE,CTX,THREADS,GPU_LAYERS,MAX_TOKENS,TEMP}.
Optional extra: pip install 'simplicio-cli[local]' (llama-cpp-python>=0.3.2, huggingface-hub>=0.23).
Friendly error (SystemExit) when the extra is not installed.
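To make the three explicit route shapes concrete, here is one way they might be parsed. This is an illustrative sketch, not the parser from this PR: `LocalSpec`, `parse_local_route`, and the `kind` labels are hypothetical, and it assumes POSIX-style absolute paths (a leading `/` after the `local-llama/` prefix).

```python
from dataclasses import dataclass
from typing import Optional

PREFIX = "local-llama/"


@dataclass(frozen=True)
class LocalSpec:
    kind: str  # "default", "hub", or "path"
    repo: Optional[str] = None
    filename: Optional[str] = None
    path: Optional[str] = None


def parse_local_route(route: str) -> LocalSpec:
    """Split a local-llama route into one of the three documented shapes."""
    if not route.startswith(PREFIX):
        raise ValueError(f"not a local-llama route: {route!r}")
    rest = route[len(PREFIX):]
    if rest == "default":
        return LocalSpec(kind="default")
    if rest.startswith("/"):
        # local-llama//abs/path.gguf -> the remainder is an absolute path.
        return LocalSpec(kind="path", path=rest)
    repo, sep, filename = rest.partition("::")
    if sep and repo and filename:
        # local-llama/<repo>::<file.gguf> -> Hub repo id plus file name.
        return LocalSpec(kind="hub", repo=repo, filename=filename)
    raise ValueError(f"unrecognized local-llama route: {route!r}")
```

The path check runs before the `::` split so that an absolute path containing `::` is still treated as a path.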

Why

The project's own benchmark shows the 6-layer contract lifting a 1.5B coder from ~34% to ~88% pass rate. A strong, zero-config local default makes that reachable without Ollama or any external endpoint, reducing the dependency on remote APIs.

Behavior change

simplicio with no provider configured no longer errors; it falls back to the local Qwen model (offline-first). Setting SIMPLICIO_BASE_URL/SIMPLICIO_MODEL switches back to the remote provider. The test that assumed the error was updated, and a new test covers the new default.

Tests

New tests/python/test_providers_local.py (routing, spec resolution, knobs, cache, error paths).
Full suite: 332 passed, 1 skipped.
ruff check is clean on the touched files.

Version / docs

0.4.4 → 0.5.0 (pyproject.toml, simplicio/__init__.py).
CHANGELOG.md and README.md (new section "Path 4 — offline-first local model") updated.

Closes #42

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG

Generated by Claude Code

…Coder-1.5B

Closes #42.

Adds Path 4: an offline-first provider that runs a GGUF model directly in the Python process via llama-cpp-python, with no API key and no HTTP overhead. The model is loaded once and reused for the lifetime of the process.

Why: the project's own benchmark shows the 6-layer contract lifts a 1.5B coder from ~34% to ~88% pass-rate. A strong, zero-config local default makes that reachable without Ollama or any external endpoint, and reduces the dependency on remote APIs for small edits.

- Default to Qwen2.5-Coder-1.5B-Instruct-Q5_K_M (bartowski GGUF) when neither SIMPLICIO_MODEL nor SIMPLICIO_BASE_URL is set; weights fetched once from HF.
- Explicit route local-llama/<repo>::<file.gguf> | local-llama/default | local-llama//abs/path.gguf, plus `simplicio task --local`.
- Tuning via SIMPLICIO_LOCAL_* env (ctx, threads, gpu layers, max tokens, temp, model path/repo/file).
- New optional extra `simplicio-cli[local]` (llama-cpp-python, huggingface-hub).
- Friendly SystemExit when the extra is not installed.

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
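As an illustration of the SIMPLICIO_LOCAL_* tuning variables named in this commit, the sketch below reads them into a typed settings object. Only the variable names come from the PR. The `LocalKnobs` and `read_knobs` names and every default value (4096, 0, 512, 0.2) are placeholders picked for the example and should not be taken as the project's real defaults.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")
ENV_PREFIX = "SIMPLICIO_LOCAL_"


@dataclass(frozen=True)
class LocalKnobs:
    model_path: Optional[str]
    model_repo: Optional[str]
    model_file: Optional[str]
    n_ctx: int
    n_threads: Optional[int]  # None leaves the choice to the backend
    n_gpu_layers: int
    max_tokens: int
    temperature: float


def _text(env: Mapping[str, str], name: str) -> Optional[str]:
    return env.get(ENV_PREFIX + name, "").strip() or None


def _number(env: Mapping[str, str], name: str, cast: Callable[[str], T], default):
    raw = _text(env, name)
    if raw is None:
        return default
    try:
        return cast(raw)
    except ValueError as exc:
        # Exit with a readable message rather than a traceback.
        raise SystemExit(f"{ENV_PREFIX}{name} must be a number, got {raw!r}") from exc


def read_knobs(env: Optional[Mapping[str, str]] = None) -> LocalKnobs:
    env = os.environ if env is None else env
    return LocalKnobs(
        model_path=_text(env, "MODEL_PATH"),
        model_repo=_text(env, "MODEL_REPO"),
        model_file=_text(env, "MODEL_FILE"),
        n_ctx=_number(env, "CTX", int, 4096),            # placeholder default
        n_threads=_number(env, "THREADS", int, None),
        n_gpu_layers=_number(env, "GPU_LAYERS", int, 0),  # placeholder default
        max_tokens=_number(env, "MAX_TOKENS", int, 512),  # placeholder default
        temperature=_number(env, "TEMP", float, 0.2),     # placeholder default
    )
```

Parsing everything in one place means a malformed value fails at startup instead of midway through a task.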

The Path 4 local backend downloads GGUF weights at runtime; keep them out of version control. https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG

The release-metadata test pins the expected version; update it alongside the 0.4.4 -> 0.5.0 bump from the local-llama backend feature. https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG

A regex-bench run across GGUF quantizations surfaced a latent collision: when models route as local-llama/default (switching weights via SIMPLICIO_LOCAL_MODEL_PATH/_REPO/_FILE), the completion cache key only used the logical model id, so different GGUFs could share cached completions. Fold the resolved weights (path or repo/file) into the key, covered by a regression test. https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
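The fix described above amounts to including the resolved weights in whatever identifies a cached completion. A minimal sketch of that idea follows, assuming a hash-based key. The function name `completion_cache_key` and the exact fields are hypothetical, and the real cache may key on more inputs (sampling parameters, for instance).

```python
import hashlib
import json


def completion_cache_key(model_id: str, weights: str, prompt: str) -> str:
    """Derive a cache key that distinguishes GGUFs behind one logical model id.

    `weights` is the resolved source of the model: an absolute path, or a
    "repo::file" string when the weights come from the Hub.
    """
    payload = json.dumps(
        {"model": model_id, "weights": weights, "prompt": prompt},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this shape, two runs that both route as `local-llama/default` but point at different GGUF files produce different keys, so one quantization can no longer be served another's cached output.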

wesleysimplicio marked this pull request as ready for review May 31, 2026 03:24

claude added 3 commits May 31, 2026 03:37

chore: gitignore local GGUF weights (*.gguf, /models)

a3d4170


test: bump version assertion to 0.5.0

96b4753


wesleysimplicio merged commit 9805ea6 into master May 31, 2026
1 check passed

wesleysimplicio deleted the claude/qwen-coder-local-tokens-k6l0N branch May 31, 2026 10:13

wesleysimplicio merged 4 commits into master from claude/qwen-coder-local-tokens-k6l0N
