
feat(providers): in-process local llama backend, default Qwen2.5-Coder-1.5B (closes #42) #43

Merged: wesleysimplicio merged 4 commits into master from claude/qwen-coder-local-tokens-k6l0N on May 31, 2026.

@wesleysimplicio
Owner

What changes

Implements issue #42: Qwen2.5-Coder-1.5B as the default local model via llama-cpp-python.

Adds Path 4: an in-process provider that runs a GGUF model directly inside the Python process, with no API key and no HTTP overhead. The model is loaded once and reused for the lifetime of the process.
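The load-once behavior can be sketched as a memoized loader. This is a minimal illustration, not the PR's code: `make_model_getter`, `llama_loader` and `get_local_model` are hypothetical names, and only `llama_cpp.Llama(model_path=..., verbose=...)` comes from llama-cpp-python itself. The import is deferred so the CLI still starts when the `[local]` extra is absent.

```python
import functools
from typing import Any, Callable


def make_model_getter(loader: Callable[[str], Any]) -> Callable[[str], Any]:
    """Wrap `loader` so each GGUF path is loaded at most once per process."""

    @functools.lru_cache(maxsize=None)
    def get_model(gguf_path: str) -> Any:
        return loader(gguf_path)

    return get_model


def llama_loader(gguf_path: str) -> Any:
    # Deferred import: the CLI still starts when the [local] extra is missing.
    from llama_cpp import Llama

    return Llama(model_path=gguf_path, verbose=False)


# Hypothetical module-level getter shared by every request in the process.
get_local_model = make_model_getter(llama_loader)
```

`functools.lru_cache` does not stop two threads from loading the same path concurrently on a cold cache; a lock around the first load would be needed if that matters.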

Decisions (confirmed with the author)

  • Activation: automatic default when neither SIMPLICIO_MODEL nor SIMPLICIO_BASE_URL is set, plus an explicit route.
  • Integration: direct and in-process (not an OpenAI-compatible server).
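The activation rule can be sketched as a small routing function over the environment and the `--local` flag. `resolve_route` and the `"remote"` placeholder are hypothetical; the actual provider selection in simplicio may be organized differently.

```python
from typing import Mapping

LOCAL_DEFAULT_ROUTE = "local-llama/default"
REMOTE = "remote"  # hypothetical placeholder for "use the configured remote provider"


def resolve_route(env: Mapping[str, str], force_local: bool = False) -> str:
    """Pick a provider route from the environment and the --local flag."""
    if force_local:
        # `simplicio task --local` wins regardless of the environment.
        return LOCAL_DEFAULT_ROUTE
    model = env.get("SIMPLICIO_MODEL", "").strip()
    base_url = env.get("SIMPLICIO_BASE_URL", "").strip()
    if model.startswith("local-llama/"):
        # Explicit local route, e.g. local-llama/<repo>::<file.gguf>.
        return model
    if model or base_url:
        return REMOTE
    # Nothing configured: fall back to the offline-first local default.
    return LOCAL_DEFAULT_ROUTE
```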

Details

  • Default model: Qwen2.5-Coder-1.5B-Instruct-Q5_K_M (bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF), downloaded once from the HF Hub.
  • Explicit routes: local-llama/default, local-llama/<repo>::<file.gguf>, local-llama//abs/path.gguf.
  • Flag: simplicio task --local forces the local model regardless of the environment.
  • Knobs: SIMPLICIO_LOCAL_{MODEL_PATH,MODEL_REPO,MODEL_FILE,CTX,THREADS,GPU_LAYERS,MAX_TOKENS,TEMP}.
  • Optional extra: pip install 'simplicio-cli[local]' (llama-cpp-python>=0.3.2, huggingface-hub>=0.23).
  • Friendly error (SystemExit) when the extra is not installed.
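The three explicit route shapes can be parsed with plain string handling. This sketch is an assumption about how resolution might look: `LocalSpec` and `parse_local_route` are hypothetical, the exact default `.gguf` filename is assumed from the model name, and the `SIMPLICIO_LOCAL_MODEL_*` overrides are ignored here.

```python
from dataclasses import dataclass
from typing import Optional

ROUTE_PREFIX = "local-llama/"
# Default weights named in the PR; the exact .gguf filename is an assumption.
DEFAULT_REPO = "bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF"
DEFAULT_FILE = "Qwen2.5-Coder-1.5B-Instruct-Q5_K_M.gguf"


@dataclass(frozen=True)
class LocalSpec:
    """Where to find the GGUF weights: either a local path or an HF repo/file."""
    path: Optional[str] = None
    repo: Optional[str] = None
    filename: Optional[str] = None


def parse_local_route(route: str) -> LocalSpec:
    """Parse local-llama/default, local-llama/<repo>::<file.gguf>, local-llama//abs/path.gguf."""
    if not route.startswith(ROUTE_PREFIX):
        raise ValueError(f"not a local-llama route: {route!r}")
    rest = route[len(ROUTE_PREFIX):]
    if rest == "default":
        return LocalSpec(repo=DEFAULT_REPO, filename=DEFAULT_FILE)
    if rest.startswith("/"):
        # The double slash leaves an absolute path: "local-llama//m/q.gguf" -> "/m/q.gguf".
        return LocalSpec(path=rest)
    repo, sep, filename = rest.partition("::")
    if not sep or not repo or not filename.endswith(".gguf"):
        raise ValueError(f"expected local-llama/<repo>::<file.gguf>, got {route!r}")
    return LocalSpec(repo=repo, filename=filename)
```

A resolved repo/file pair could then be fetched with `huggingface_hub.hf_hub_download(repo_id=..., filename=...)`, which caches the download locally so it only happens once.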

Why

The project's own benchmark shows the 6-layer contract lifting a 1.5B coder from ~34% to ~88% pass rate. A strong, zero-config local default makes that result reachable without Ollama or any external endpoint, reducing dependence on remote APIs.

Behavior change

Running simplicio with no provider configured no longer fails; it falls back to the local Qwen model (offline-first). Setting SIMPLICIO_BASE_URL/SIMPLICIO_MODEL switches back to the remote provider. The test that expected the error was updated, and a new test covers the new default.

Tests

  • New tests/python/test_providers_local.py (routing, spec resolution, knobs, cache, error paths).
  • Full suite: 332 passed, 1 skipped.
  • ruff check is clean on the touched files.

Version / docs

  • 0.4.4 → 0.5.0 (pyproject.toml, simplicio/__init__.py).
  • CHANGELOG.md and README.md updated (new section "Path 4 — offline-first local model").

Closes #42

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG


Generated by Claude Code

…Coder-1.5B

Closes #42.

Adds Path 4: an offline-first provider that runs a GGUF model directly in
the Python process via llama-cpp-python — no API key and no HTTP overhead.
The model is loaded once and reused for the lifetime of the process.

Why: the project's own benchmark shows the 6-layer contract lifts a 1.5B
coder from ~34% to ~88% pass-rate. A strong, zero-config local default makes
that reachable without Ollama or any external endpoint, and reduces the
dependency on remote APIs for small edits.

- Default to Qwen2.5-Coder-1.5B-Instruct-Q5_K_M (bartowski GGUF) when neither
  SIMPLICIO_MODEL nor SIMPLICIO_BASE_URL is set; weights fetched once from HF.
- Explicit route local-llama/<repo>::<file.gguf> | local-llama/default |
  local-llama//abs/path.gguf, plus `simplicio task --local`.
- Tuning via SIMPLICIO_LOCAL_* env (ctx, threads, gpu layers, max tokens, temp,
  model path/repo/file).
- New optional extra `simplicio-cli[local]` (llama-cpp-python, huggingface-hub).
- Friendly SystemExit when the extra is not installed.

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
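The tuning knobs and the friendly missing-extra error can be sketched together. Field names mirror real llama-cpp-python parameters (`n_ctx`, `n_threads`, `n_gpu_layers` on `Llama(...)`; `max_tokens`, `temperature` on completion calls), but `LocalSettings`, `_knob`, `load_settings`, `require_local_extra` and every default value below are hypothetical, not the values shipped in 0.5.0.

```python
import importlib.util
import os
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Optional


@dataclass(frozen=True)
class LocalSettings:
    n_ctx: int
    n_threads: Optional[int]
    n_gpu_layers: int
    max_tokens: int
    temperature: float


def _knob(env: Mapping[str, str], name: str, cast: Callable[[str], Any], default: Any) -> Any:
    """Read SIMPLICIO_LOCAL_<name>, falling back to `default` when unset or empty."""
    key = f"SIMPLICIO_LOCAL_{name}"
    raw = env.get(key, "").strip()
    if not raw:
        return default
    try:
        return cast(raw)
    except ValueError:
        raise SystemExit(f"{key} must be a {cast.__name__}, got {raw!r}") from None


def load_settings(env: Mapping[str, str] = os.environ) -> LocalSettings:
    # Illustrative defaults only.
    return LocalSettings(
        n_ctx=_knob(env, "CTX", int, 4096),
        n_threads=_knob(env, "THREADS", int, None),
        n_gpu_layers=_knob(env, "GPU_LAYERS", int, 0),
        max_tokens=_knob(env, "MAX_TOKENS", int, 1024),
        temperature=_knob(env, "TEMP", float, 0.2),
    )


def require_local_extra() -> None:
    """Exit with an install hint instead of an ImportError traceback."""
    missing = [m for m in ("llama_cpp", "huggingface_hub") if importlib.util.find_spec(m) is None]
    if missing:
        raise SystemExit(
            f"Local model support needs {', '.join(missing)}. "
            "Install it with: pip install 'simplicio-cli[local]'"
        )
```

Raising `SystemExit` with a string prints the message to stderr and exits with status 1, which gives the user an actionable hint rather than a stack trace.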
wesleysimplicio marked this pull request as ready for review on May 31, 2026 03:24
claude added 3 commits on May 31, 2026 03:37
The Path 4 local backend downloads GGUF weights at runtime; keep them out
of version control.

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
The release-metadata test pins the expected version; update it alongside the
0.4.4 -> 0.5.0 bump from the local-llama backend feature.

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
A regex-bench run across GGUF quantizations surfaced a latent collision: when
models route as local-llama/default (switching weights via
SIMPLICIO_LOCAL_MODEL_PATH/_REPO/_FILE), the completion cache key only used
the logical model id, so different GGUFs could share cached completions. Fold
the resolved weights (path or repo/file) into the key, covered by a
regression test.

https://claude.ai/code/session_01GuocKeRWEE3fg1mKTRNauG
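The cache-key fix can be illustrated by hashing the resolved weights alongside the logical model id. This is a sketch under assumptions: `resolved_weights`, `completion_cache_key`, the `path:`/`hf:` prefixes and the default repo/file values are hypothetical, not the names used in the merged commit.

```python
import hashlib
import json
from typing import Mapping, Optional

# Hypothetical defaults used when no SIMPLICIO_LOCAL_MODEL_* override is set.
DEFAULT_REPO = "bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF"
DEFAULT_FILE = "Qwen2.5-Coder-1.5B-Instruct-Q5_K_M.gguf"


def resolved_weights(env: Mapping[str, str]) -> str:
    """Identify the concrete GGUF behind local-llama/default."""
    path = env.get("SIMPLICIO_LOCAL_MODEL_PATH")
    if path:
        return f"path:{path}"
    repo = env.get("SIMPLICIO_LOCAL_MODEL_REPO") or DEFAULT_REPO
    filename = env.get("SIMPLICIO_LOCAL_MODEL_FILE") or DEFAULT_FILE
    return f"hf:{repo}::{filename}"


def completion_cache_key(
    model_id: str, prompt: str, weights: Optional[str], params: Mapping[str, object]
) -> str:
    """Hash everything that can change the completion, including the weights."""
    payload = {"model": model_id, "weights": weights, "prompt": prompt, "params": dict(params)}
    # sort_keys makes the JSON (and so the hash) independent of dict ordering.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

With the weights folded in, two quantizations both routed as `local-llama/default` produce different keys, so a Q4 completion can no longer be served for a Q8 run.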
wesleysimplicio merged commit 9805ea6 into master on May 31, 2026
1 check passed
wesleysimplicio deleted the claude/qwen-coder-local-tokens-k6l0N branch on May 31, 2026 10:13

Development

Successfully merging this pull request may close these issues.

feat: Make Qwen2.5-Coder-1.5B-Instruct-Q5_K_M the default local model using llama-cpp-python

2 participants