Lossless local line routing for massive logs, source files, and routeable agent-context references before they reach a cloud reasoning model.
token-router is a Codex skill and standalone Python router that uses a local Ollama model to identify exact line ranges in oversized files, then returns the raw, unmodified slices to a high-reasoning cloud model such as GPT-5.5, o3, or another AI coding agent.
Large logs, legacy source files, and long agent instruction references are expensive context. Sending a 2,000+ line deployment log, monolithic source file, or routeable project-rules document directly to a cloud LLM can waste tokens, increase latency, and exhaust budgets before the model reaches the evidence that matters.
token-router solves this with a hybrid separation-of-concerns architecture:
- Local model for search and triage: Gemma 4 via Ollama runs on the user's machine and scans large files for relevant line coordinates.
- Cloud model for reasoning: GPT-5.5, o3, or another cloud model receives only raw high-density evidence slices and applies deep reasoning where it matters.
- No lossy summarization: The local model does not rewrite, summarize, or interpret code. It emits JSON line ranges; the router then extracts exact raw text from the original file.
- Static context routing: Long optional agent-reference docs can be routed on demand while short mandatory root instructions stay always-on.
The result is aggressive context reduction without degrading the technical evidence available to the reasoning model.
token-router deliberately splits the workflow into three specialized phases:
| Phase | Engine | Responsibility | Output |
|---|---|---|---|
| Search / Triage | Local Ollama model | Find relevant line coordinates | JSON ranges |
| Evidence Extraction | Router script | Slice original file by line number | Raw unedited text |
| Reasoning | Cloud LLM / Codex | Debug, explain, patch, or review | High-confidence answer |
This prevents the local model from becoming an unreliable summarizer. The local model is a router, not the analyst.
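As a rough illustration of the search phase, the sketch below builds a request body for Ollama's generate endpoint. The `model`, `prompt`, `stream`, `options.num_ctx`, and `keep_alive` fields are standard Ollama request fields; the helper names `number_lines` and `build_generate_payload`, the prompt wording, and the `{"start", "end"}` range schema are assumptions made for this example and may differ from what `scripts/router.py` actually sends.

```python
def number_lines(lines):
    """Prefix each line with its 1-indexed number so the model can cite coordinates."""
    return "\n".join(f"{i}: {line.rstrip()}" for i, line in enumerate(lines, start=1))


def build_generate_payload(numbered_text, query,
                           model="gemma4:e2b-it-q4_K_M",
                           num_ctx=4096, keep_alive="0s"):
    """Build a non-streaming request body for POST /api/generate (hypothetical prompt)."""
    prompt = (
        "Return only JSON: a list of {\"start\": int, \"end\": int} line ranges "
        "relevant to the query. Do not summarize or rewrite the content.\n"
        f"[Query Terms] {', '.join(query.split())}\n\n"
        f"{numbered_text}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON response instead of chunks
        "options": {"num_ctx": num_ctx},  # bounds the local context window
        "keep_alive": keep_alive,         # "0s" unloads the model after the call
    }
```

The model is only asked for coordinates, which keeps it in the router role; the payload would then be posted to the URL configured in `OLLAMA_URL`.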
Summarizing source code or logs through a smaller local model is lossy. A single dropped stack frame, config key, indentation detail, or error suffix can change the diagnosis.
token-router avoids that failure mode:
- The file is scanned locally.
- The local model returns line coordinates.
- The Python script slices the original file directly.
- The cloud model receives raw text exactly as it appeared on disk.
The selection step can be imperfect, but the extracted evidence itself is lossless. If the cloud model needs more context, it can request wider line ranges instead of hallucinating around omitted dependencies.
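The extraction step described above can be sketched in a few lines. This is a minimal illustration, not the router's actual code: the function name `slice_ranges` and the `[{"start": ..., "end": ...}]` JSON schema are assumptions, with line numbers treated as 1-indexed and inclusive.

```python
import json


def slice_ranges(path, ranges_json):
    """Return (start, end, raw_text) for each requested 1-indexed, inclusive range."""
    # newline="" disables newline translation, so line endings stay as on disk.
    with open(path, encoding="utf-8", newline="") as handle:
        lines = handle.readlines()
    slices = []
    for item in json.loads(ranges_json):
        # Clamp to the file bounds instead of failing on an overshooting range.
        start = max(1, int(item["start"]))
        end = min(len(lines), int(item["end"]))
        if start <= end:
            slices.append((start, end, "".join(lines[start - 1:end])))
    return slices
```

Because the text is copied from the file rather than from the model's response, a wrong selection can miss evidence but cannot alter it.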
agent_context mode extends the same line-routing architecture to long instruction references such as detailed AGENTS.md, CLAUDE.md, GEMINI.md, .cursorrules, or agent-context/*.md files.
The important boundary is architectural: if a platform has already injected a long root instruction file into the prompt, token-router cannot retroactively remove that token cost. The intended pattern is:
- Keep the root always-on instruction file compact.
- Move long task-specific rules into routeable reference files.
- Use `agent_context` with the current task query to retrieve only relevant raw rule slices.
- Three routing modes
  - `error_log`: optimized for logs, stack traces, CI output, and deployment failures.
  - `heavy_code`: optimized for long source files and localized code investigations.
  - `agent_context`: optimized for long routeable agent instruction references.
- Query pass-through
  - Use `--query "token expiration"` to bias routing toward user intent.
  - Multi-word queries are tokenized and exposed as `[Query Terms]` so the local model can match any relevant term, not only the full phrase.
- Deterministic prefilters
  - Log mode uses keyword and tail-window scanning.
  - Code mode prioritizes query hits, then suspicious code markers, then structural head/tail previews.
  - Agent-context mode prioritizes query hits, frontmatter, headings, mandatory rules, workflow/tool/security/testing keywords, and head/tail context.
- Lossless raw slicing
  - Router output is copied from the original file by line number.
- Output caps
  - `ROUTER_MAX_OUTPUT_LINES` bounds cloud-visible raw text.
  - In `error_log` mode, newer log ranges survive first when the cap is exceeded.
- Memory safety
  - `OLLAMA_KEEP_ALIVE=0s` unloads the model immediately after routing.
  - `OLLAMA_NUM_CTX=4096` or `8192` bounds local context pressure.
- Regression harness
  - Fixture-based tests catch prompt, line-selection, cap, and fallback regressions.
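Two of the features above, keyword context windows and the newest-first output cap, can be sketched as follows. Both functions are hypothetical illustrations of the described behavior, not the router's implementation; the names `keyword_windows` and `cap_ranges` and the trimming policy for a range that only partly fits are assumptions.

```python
def keyword_windows(lines, terms, context=6):
    """Return merged 1-indexed (start, end) windows around lines containing any term."""
    lowered = [term.lower() for term in terms if term]
    windows = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(term in text for term in lowered):
            start = max(1, number - context)
            end = min(len(lines), number + context)
            if windows and start <= windows[-1][1] + 1:
                # Overlaps or touches the previous window: extend it instead.
                windows[-1] = (windows[-1][0], max(windows[-1][1], end))
            else:
                windows.append((start, end))
    return windows


def cap_ranges(ranges, max_lines=160, newest_first=True):
    """Keep ranges until max_lines is spent; newest_first favors later line numbers."""
    kept, budget = [], max_lines
    for start, end in sorted(ranges, reverse=newest_first):
        if budget <= 0:
            break
        size = end - start + 1
        if size > budget:
            # Assumed policy: trim the range, keeping its newest (or oldest) lines.
            if newest_first:
                start = end - budget + 1
            else:
                end = start + budget - 1
            size = budget
        kept.append((start, end))
        budget -= size
    return sorted(kept)
```

A multi-word query such as `"token expiration"` would be passed as `"token expiration".split()`, so a line matching either term opens a window.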
Token counts are estimated with chars / 4 and should be treated as directional rather than billing-exact.
| Case | Mode | Est. Input Tokens | Router Output Tokens | Reduction | Time |
|---|---|---|---|---|---|
| Sparse infra log | `error_log` | 41,711 | 131 | 99.69% | 5.37s |
| Legacy bug source | `heavy_code` | 7,520 | 70 | 99.06% | 4.46s |
| Keywordless structural source | `heavy_code` | 4,188 | 48 | 98.85% | 6.13s |
See docs/benchmark-report.md for methodology, caveats, and regression results.
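For readers who want to reproduce the arithmetic, the sketch below applies the `chars / 4` heuristic and computes the reduction percentage. The function names are hypothetical, and integer division and rounding to two decimals are assumptions about how the figures were produced.

```python
def estimate_tokens(text):
    """Directional token estimate: roughly one token per four characters."""
    return len(text) // 4


def reduction_percent(input_tokens, output_tokens):
    """Percentage of estimated input tokens that never reach the cloud model."""
    if input_tokens <= 0:
        return 0.0
    return round(100.0 * (1 - output_tokens / input_tokens), 2)
```

With the first benchmark row, `reduction_percent(41711, 131)` gives 99.69.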
- Python 3.10+
- Ollama installed and running
- A local routing model, defaulting to:

      ollama pull gemma4:e2b-it-q4_K_M

Route a file by passing a mode, a path, and an optional query:

    OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
    python3 scripts/router.py error_log path/to/deploy.log --query "database migration timeout"

    OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
    python3 scripts/router.py heavy_code path/to/service.py --query "token expiration"

    OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
    python3 scripts/router.py agent_context path/to/agent-context/frontend.md --query "frontend testing workflow"

The default model is intentionally small so the router can run on modest local machines:

    OLLAMA_MODEL=gemma4:e2b-it-q4_K_M

Very small local models may occasionally return invalid JSON on complex logs, unusual symbols, or dense instruction files. The router cleans common code-fence noise and falls back safely, but if JSON errors are frequent and you have enough VRAM, try a larger routing model:

    OLLAMA_MODEL=qwen2.5-coder:7b \
    python3 scripts/router.py error_log path/to/deploy.log --query "timeout"

Community feedback suggests coder-oriented 7B models such as `qwen2.5-coder:7b`, or 4B-class Gemma variants, can reduce JSON-format failures compared with ultra-small 2B-class models. Please include your `OLLAMA_MODEL`, quantization tag, and reproduction command when reporting model-specific behavior.
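The fence cleanup and safe fallback mentioned above might look roughly like this. It is a sketch under assumptions: the name `parse_ranges`, the `{"start", "end"}` schema, and the choice to fall back to prefilter ranges are illustrative and not confirmed details of the router.

```python
import json
import re

# Matches a leading ```lang fence and a trailing ``` fence around the payload.
FENCE = re.compile(r"^```[a-zA-Z0-9_-]*\s*|\s*```$")


def parse_ranges(raw, fallback):
    """Parse model output into (start, end) tuples, or return the fallback ranges."""
    text = FENCE.sub("", raw.strip())
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return list(fallback)
    if not isinstance(data, list):
        return list(fallback)
    ranges = []
    for item in data:
        try:
            start, end = int(item["start"]), int(item["end"])
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed entries rather than failing the whole response
        if 1 <= start <= end:
            ranges.append((start, end))
    return ranges or list(fallback)
```

Falling back to deterministic ranges means a malformed model response degrades selection quality without ever changing the raw text that is returned.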
Install or copy this repository into your Codex skills directory:
    mkdir -p ~/.codex/skills
    cp -R token-router ~/.codex/skills/token-router

Then invoke it naturally:

    Use $token-router to analyze this large log with query "payment timeout".
    Use $token-router agent_context to route this long AGENTS.md reference for "deployment approval".
This repository includes a compact CLAUDE.md bootstrap for Claude Code. It tells Claude to call the same router script instead of loading oversized logs, source files, or long routeable instruction references directly.
Keep project-level CLAUDE.md files short when they are automatically loaded by Claude Code. Put long task-specific guidance in separate reference files, then route them on demand:
    OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
    python3 scripts/router.py agent_context path/to/claude-context.md --query "deployment approval"

| Variable | Default | Purpose |
|---|---|---|
| `OLLAMA_MODEL` | `gemma4:e2b-it-q4_K_M` | Local model used for line routing |
| `OLLAMA_URL` | `http://localhost:11434/api/generate` | Ollama generate endpoint |
| `OLLAMA_NUM_CTX` | `8192` | Local model context window |
| `OLLAMA_KEEP_ALIVE` | `0s` | Unload model immediately after routing |
| `ROUTER_TIMEOUT` | `120` | Ollama request timeout in seconds |
| `ROUTER_MAX_CHARS` | `120000` | Maximum line-numbered content sent to Ollama |
| `ROUTER_MAX_OUTPUT_LINES` | `160` | Maximum raw lines returned to the cloud model |
| `ROUTER_STREAM_THRESHOLD_BYTES` | `5000000` | File size threshold for streaming log prefiltering |
| `ROUTER_LOG_CONTEXT_LINES` | `6` | Log context lines around keyword/query hits |
| `ROUTER_LOG_TAIL_LINES` | `200` | Tail lines preserved for large logs |
| `ROUTER_CODE_CONTEXT_LINES` | `8` | Code context lines around query/keyword hits |
| `ROUTER_AGENT_CONTEXT_LINES` | `6` | Agent-context lines around query/instruction keyword hits |
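One plausible way to read these variables is shown below. The variable names and defaults come from the table; the `load_config` helper and its behavior of falling back to the default on an empty or non-numeric value are assumptions for illustration.

```python
import os

# Defaults copied from the configuration table; int defaults mark numeric settings.
DEFAULTS = {
    "OLLAMA_MODEL": "gemma4:e2b-it-q4_K_M",
    "OLLAMA_URL": "http://localhost:11434/api/generate",
    "OLLAMA_NUM_CTX": 8192,
    "OLLAMA_KEEP_ALIVE": "0s",
    "ROUTER_TIMEOUT": 120,
    "ROUTER_MAX_CHARS": 120000,
    "ROUTER_MAX_OUTPUT_LINES": 160,
    "ROUTER_STREAM_THRESHOLD_BYTES": 5000000,
    "ROUTER_LOG_CONTEXT_LINES": 6,
    "ROUTER_LOG_TAIL_LINES": 200,
    "ROUTER_CODE_CONTEXT_LINES": 8,
    "ROUTER_AGENT_CONTEXT_LINES": 6,
}


def load_config(env=None):
    """Merge environment overrides onto the defaults, coercing numeric values."""
    env = os.environ if env is None else env
    config = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        if raw is None or raw == "":
            config[name] = default
        elif isinstance(default, int):
            try:
                config[name] = int(raw)
            except ValueError:
                config[name] = default  # assumed: ignore non-numeric overrides
        else:
            config[name] = raw
    return config
```

Note that the command examples set `OLLAMA_NUM_CTX=4096` explicitly, which overrides the `8192` default for tighter memory use.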
Run the fast mock-based test suite:
    python3 scripts/run_router_tests.py tests/router-tests.json

Run against the real local model only when you intentionally want to exercise Ollama:

    python3 scripts/run_router_tests.py tests/router-tests.json --real

- See CONTRIBUTING.md for local validation, model compatibility notes, and pull request guidelines.
- See SECURITY.md for private vulnerability reporting.
- Please do not paste secrets, private logs, credentials, API keys, personal data, customer data, or proprietary source code into public issues.
- If you hit invalid JSON from a local model, try a larger model such as `qwen2.5-coder:7b` and include the model name in the bug report.
| Situation | Use token-router? | Rationale |
|---|---|---|
| Massive deployment logs | Yes | Error evidence is usually localized and recent |
| CI logs with stack traces | Yes | Keyword and tail scanning preserve high-value ranges |
| Long files with a specific bug or query | Yes | `--query` and code markers narrow routing safely |
| Legacy files with `TODO`, `FIXME`, `raise`, or `assert` markers | Yes | Code keyword windows are highly effective |
| Long routeable agent-reference docs | Yes | `agent_context` retrieves task-relevant raw instruction slices |
| Compact root `AGENTS.md` with mandatory rules | Bypass | Always-on rules should remain directly available |
| Already auto-injected long instruction files | Bypass for savings | Router cannot remove token cost that was already injected |
| Broad architecture review | Bypass | The model needs global context, not narrow slices |
| Refactor planning across many modules | Bypass or combine manually | Local routing may hide cross-file relationships |
| Security-sensitive code requiring complete audit | Bypass | Completeness matters more than token reduction |
| Very dense source files where every section matters | Bypass | Context reduction can remove necessary dependencies |
token-router is lossless in extraction, not omniscient in selection. If a selected range is too narrow, the cloud model should request nearby or broader line ranges. The intended workflow is iterative: route, reason, expand when needed.
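The "expand when needed" step can be illustrated with a small helper that widens previously selected ranges by a margin and merges any that now touch. The name `widen_ranges` and the default margin are hypothetical; how a cloud model actually requests wider ranges is not specified here.

```python
def widen_ranges(ranges, total_lines, margin=10):
    """Grow each 1-indexed (start, end) range by margin lines, clamped and merged."""
    widened = []
    for start, end in sorted(ranges):
        start = max(1, start - margin)
        end = min(total_lines, end + margin)
        if widened and start <= widened[-1][1] + 1:
            # The grown range touches the previous one, so fold them together.
            widened[-1] = (widened[-1][0], max(widened[-1][1], end))
        else:
            widened.append((start, end))
    return widened
```

The widened ranges can then be sliced from the original file again, so each expansion round still returns unmodified text.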
For static agent instructions, do not move non-negotiable safety or ownership rules out of the always-on root file purely for token savings. Use agent_context for long reference material that is useful only for specific tasks.
MIT License. See LICENSE.