token-router

Lossless local line routing for massive logs, source files, and routeable agent-context references before they reach a cloud reasoning model.

token-router is a Codex skill and standalone Python router that uses a local Ollama model to identify exact line ranges in oversized files, then returns the raw, unmodified slices to a high-reasoning cloud model such as GPT-5.5, o3, or another AI coding agent.

Executive Summary

Large logs, legacy source files, and long agent instruction references are expensive context. Sending a 2,000+ line deployment log, monolithic source file, or routeable project-rules document directly to a cloud LLM can waste tokens, increase latency, and exhaust budgets before the model reaches the evidence that matters.

token-router solves this with a hybrid separation-of-concerns architecture:

  • Local model for search and triage: Gemma 4 via Ollama runs on the user's machine and scans large files for relevant line coordinates.
  • Cloud model for reasoning: GPT-5.5, o3, or another cloud model receives only raw high-density evidence slices and applies deep reasoning where it matters.
  • No lossy summarization: The local model does not rewrite, summarize, or interpret code. It emits JSON line ranges; the router then extracts exact raw text from the original file.
  • Static context routing: Long optional agent-reference docs can be routed on demand while short mandatory root instructions stay always-on.

The result is aggressive context reduction without altering the technical evidence that reaches the reasoning model. Selection can still miss relevant lines; see Design Caveats.

Architecture & Core Philosophy

Separation Of Concerns

token-router deliberately splits the workflow into two specialized phases:

| Phase | Engine | Responsibility | Output |
| --- | --- | --- | --- |
| Search / Triage | Local Ollama model | Find relevant line coordinates | JSON ranges |
| Evidence Extraction | Router script | Slice original file by line number | Raw unedited text |
| Reasoning | Cloud LLM / Codex | Debug, explain, patch, or review | High-confidence answer |

This prevents the local model from becoming an unreliable summarizer. The local model is a router, not the analyst.

Lossless Line Routing

Summarizing source code or logs through a smaller local model is lossy. A single dropped stack frame, config key, indentation detail, or error suffix can change the diagnosis.

token-router avoids that failure mode:

  1. The file is scanned locally.
  2. The local model returns line coordinates.
  3. The Python script slices the original file directly.
  4. The cloud model receives raw text exactly as it appeared on disk.

The selection step can be imperfect, but the extracted evidence itself is lossless. If the cloud model needs more context, it can request wider line ranges instead of hallucinating around omitted dependencies.
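The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in scripts/router.py: the function name `slice_ranges` and the JSON shape `{"ranges": [{"start": ..., "end": ...}]}` are assumptions made for the example.

```python
import json


def slice_ranges(path, ranges_json):
    """Return (start, end, raw_text) for each 1-indexed, inclusive line range.

    The JSON shape {"ranges": [{"start": int, "end": int}]} is an assumed
    schema for illustration; the real router may use a different one.
    """
    # newline="" disables newline translation, so the original line endings
    # (\n or \r\n) are preserved exactly as they appear on disk.
    with open(path, encoding="utf-8", newline="") as fh:
        lines = fh.readlines()

    slices = []
    for item in json.loads(ranges_json)["ranges"]:
        # Clamp to the file so an over-wide range from the model is harmless.
        start = max(1, int(item["start"]))
        end = min(len(lines), int(item["end"]))
        if start <= end:
            slices.append((start, end, "".join(lines[start - 1:end])))
    return slices
```

Because the text is copied from the file rather than from the model's reply, a wrong range costs relevance but never changes the content of a line.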

Static Agent Context Routing

agent_context mode extends the same line-routing architecture to long instruction references such as detailed AGENTS.md, CLAUDE.md, GEMINI.md, .cursorrules, or agent-context/*.md files.

The important boundary is architectural: if a platform has already injected a long root instruction file into the prompt, token-router cannot retroactively remove that token cost. The intended pattern is:

  1. Keep the root always-on instruction file compact.
  2. Move long task-specific rules into routeable reference files.
  3. Use agent_context with the current task query to retrieve only relevant raw rule slices.

Features & Context Safety Guardrails

  • Three routing modes
    • error_log: optimized for logs, stack traces, CI output, and deployment failures.
    • heavy_code: optimized for long source files and localized code investigations.
    • agent_context: optimized for long routeable agent instruction references.
  • Query pass-through
    • Use --query "token expiration" to bias routing toward user intent.
    • Multi-word queries are tokenized and exposed as [Query Terms] so the local model can match any relevant term, not only the full phrase.
  • Deterministic prefilters
    • Log mode uses keyword and tail-window scanning.
    • Code mode prioritizes query hits, then suspicious code markers, then structural head/tail previews.
    • Agent-context mode prioritizes query hits, frontmatter, headings, mandatory rules, workflow/tool/security/testing keywords, and head/tail context.
  • Lossless raw slicing
    • Router output is copied from the original file by line number.
  • Output caps
    • ROUTER_MAX_OUTPUT_LINES bounds cloud-visible raw text.
    • In error_log mode, newer log ranges survive first when the cap is exceeded.
  • Memory safety
    • OLLAMA_KEEP_ALIVE=0s unloads the model immediately after routing.
    • OLLAMA_NUM_CTX=4096 or 8192 bounds local context pressure.
  • Regression harness
    • Fixture-based tests catch prompt, line-selection, cap, and fallback regressions.
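To make the log prefilter and output cap concrete, here is a rough sketch of keyword-window scanning with a tail window, followed by a cap that keeps the newest ranges first. The function names `keyword_windows` and `cap_newest_first` are hypothetical, and the trimming behavior is an assumption; only the default values (6 context lines, 200 tail lines, 160 output lines) come from the configuration table.

```python
def keyword_windows(lines, keywords, context=6, tail=200):
    """Return merged 1-indexed (start, end) ranges around keyword hits,
    plus a window covering the last `tail` lines of the file."""
    total = len(lines)
    lowered = [k.lower() for k in keywords]
    ranges = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(k in text for k in lowered):
            ranges.append((max(1, number - context), min(total, number + context)))
    if total and tail > 0:
        ranges.append((max(1, total - tail + 1), total))
    if not ranges:
        return []

    # Merge overlapping or adjacent windows so no line is emitted twice.
    ranges.sort()
    merged = [ranges[0]]
    for start, end in ranges[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end + 1:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


def cap_newest_first(ranges, max_lines=160):
    """Keep the highest-numbered (newest) ranges until max_lines is spent."""
    kept, budget = [], max_lines
    for start, end in sorted(ranges, reverse=True):
        if budget <= 0:
            break
        size = end - start + 1
        if size > budget:
            # Assumed policy: trim the older side of a range that does not fit.
            start = end - budget + 1
            size = budget
        kept.append((start, end))
        budget -= size
    return sorted(kept)
```

For a 100-line log with one hit on line 10, `keyword_windows(lines, ["error"], context=2, tail=5)` yields `[(8, 12), (96, 100)]`, and capping that at 7 lines keeps the whole tail and only the newest two lines of the earlier window.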

Benchmark Highlights

Token counts are estimated with chars / 4 and should be treated as directional rather than billing-exact.

| Case | Mode | Est. Input Tokens | Router Output Tokens | Reduction | Time |
| --- | --- | --- | --- | --- | --- |
| Sparse infra log | error_log | 41,711 | 131 | 99.69% | 5.37s |
| Legacy bug source | heavy_code | 7,520 | 70 | 99.06% | 4.46s |
| Keywordless structural source | heavy_code | 4,188 | 48 | 98.85% | 6.13s |

See docs/benchmark-report.md for methodology, caveats, and regression results.
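The chars / 4 estimate and the reduction percentage can be reproduced with a short helper. This is a sketch under assumptions: the names `estimate_tokens` and `reduction_percent` are hypothetical, and the use of integer division is a guess at how the estimate is rounded.

```python
def estimate_tokens(text):
    """Rough token estimate: one token per four characters (directional only)."""
    return len(text) // 4


def reduction_percent(input_tokens, output_tokens):
    """Percentage of estimated tokens removed by routing, rounded to 2 places."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return round(100 * (1 - output_tokens / input_tokens), 2)


# Figures from the "Sparse infra log" benchmark row.
print(reduction_percent(41_711, 131))
```

Small differences in the last digit against published figures are possible if the original numbers were computed from character counts before rounding to tokens.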

Quick Start

Prerequisites

  • Python 3.10+
  • Ollama installed and running
  • A local routing model, defaulting to:
ollama pull gemma4:e2b-it-q4_K_M

Run Against A Large Log

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py error_log path/to/deploy.log --query "database migration timeout"

Run Against A Long Source File

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py heavy_code path/to/service.py --query "token expiration"

Run Against A Routeable Agent Context Reference

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py agent_context path/to/agent-context/frontend.md --query "frontend testing workflow"

Model Reliability

The default model is intentionally small so the router can run on modest local machines:

OLLAMA_MODEL=gemma4:e2b-it-q4_K_M

Very small local models may occasionally return invalid JSON on complex logs, unusual symbols, or dense instruction files. The router cleans common code-fence noise and falls back safely, but if JSON errors are frequent and you have enough VRAM, try a larger routing model:

OLLAMA_MODEL=qwen2.5-coder:7b \
python3 scripts/router.py error_log path/to/deploy.log --query "timeout"

Community feedback suggests coder-oriented 7B models such as qwen2.5-coder:7b, or 4B-class Gemma variants, can reduce JSON-format failures compared with ultra-small 2B-class models. Please include your OLLAMA_MODEL, quantization tag, and reproduction command when reporting model-specific behavior.
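The cleanup-and-fallback behavior described above might look roughly like the following. It is an illustrative sketch only: `parse_ranges`, the `{"ranges": [...]}` schema, and the choice of falling back to prefilter ranges are all assumptions, not the router's documented internals.

```python
import json
import re

TICKS = "`" * 3  # a literal code fence, built indirectly to keep this block readable
FENCE = re.compile(r"^" + TICKS + r"[\w-]*\s*|\s*" + TICKS + r"$")


def parse_ranges(reply, fallback):
    """Parse the model reply into (start, end) tuples.

    Strips a leading and trailing code fence if present. Any malformed reply
    returns `fallback` (assumed here to be the deterministic prefilter ranges).
    """
    cleaned = FENCE.sub("", reply.strip())
    try:
        data = json.loads(cleaned)
        ranges = [(int(r["start"]), int(r["end"])) for r in data["ranges"]]
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing key, or an unexpected type: do not guess.
        return list(fallback)
    valid = [(s, e) for s, e in ranges if 1 <= s <= e]
    return valid or list(fallback)
```

Falling back to ranges that were computed without the model keeps the router useful even when a small model produces unusable output.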

Use As A Codex Skill

Install or copy this repository into your Codex skills directory:

mkdir -p ~/.codex/skills
cp -R token-router ~/.codex/skills/token-router

Then invoke it naturally:

Use $token-router to analyze this large log with query "payment timeout".
Use $token-router agent_context to route this long AGENTS.md reference for "deployment approval".

Use With Claude Code

This repository includes a compact CLAUDE.md bootstrap for Claude Code. It tells Claude to call the same router script instead of loading oversized logs, source files, or long routeable instruction references directly.

Keep project-level CLAUDE.md files short when they are automatically loaded by Claude Code. Put long task-specific guidance in separate reference files, then route them on demand:

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py agent_context path/to/claude-context.md --query "deployment approval"

Configuration

| Variable | Default | Purpose |
| --- | --- | --- |
| OLLAMA_MODEL | gemma4:e2b-it-q4_K_M | Local model used for line routing |
| OLLAMA_URL | http://localhost:11434/api/generate | Ollama generate endpoint |
| OLLAMA_NUM_CTX | 8192 | Local model context window |
| OLLAMA_KEEP_ALIVE | 0s | Unload model immediately after routing |
| ROUTER_TIMEOUT | 120 | Ollama request timeout in seconds |
| ROUTER_MAX_CHARS | 120000 | Maximum line-numbered content sent to Ollama |
| ROUTER_MAX_OUTPUT_LINES | 160 | Maximum raw lines returned to the cloud model |
| ROUTER_STREAM_THRESHOLD_BYTES | 5000000 | File size threshold for streaming log prefiltering |
| ROUTER_LOG_CONTEXT_LINES | 6 | Log context lines around keyword/query hits |
| ROUTER_LOG_TAIL_LINES | 200 | Tail lines preserved for large logs |
| ROUTER_CODE_CONTEXT_LINES | 8 | Code context lines around query/keyword hits |
| ROUTER_AGENT_CONTEXT_LINES | 6 | Agent-context lines around query/instruction keyword hits |
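As a rough sketch of how these variables could feed an Ollama request, the snippet below reads a subset of them with the documented defaults and builds a payload for the generate endpoint. The helper names `load_config` and `build_generate_payload` are hypothetical, and requesting `"format": "json"` with streaming disabled is an assumption about how the router calls Ollama; the payload fields themselves (`model`, `prompt`, `stream`, `format`, `keep_alive`, `options.num_ctx`) are standard Ollama generate parameters.

```python
import os


def load_config(env=None):
    """Read router settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        "model": env.get("OLLAMA_MODEL", "gemma4:e2b-it-q4_K_M"),
        "url": env.get("OLLAMA_URL", "http://localhost:11434/api/generate"),
        "num_ctx": int(env.get("OLLAMA_NUM_CTX", "8192")),
        "keep_alive": env.get("OLLAMA_KEEP_ALIVE", "0s"),
        "timeout": float(env.get("ROUTER_TIMEOUT", "120")),
        "max_output_lines": int(env.get("ROUTER_MAX_OUTPUT_LINES", "160")),
    }


def build_generate_payload(config, prompt):
    """Build the JSON body for a non-streaming Ollama generate request."""
    return {
        "model": config["model"],
        "prompt": prompt,
        "stream": False,            # assumed: one complete reply, not chunks
        "format": "json",           # assumed: ask Ollama for JSON-only output
        "keep_alive": config["keep_alive"],
        "options": {"num_ctx": config["num_ctx"]},
    }
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real environment.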

Regression Tests

Run the fast mock-based test suite:

python3 scripts/run_router_tests.py tests/router-tests.json

Run against the real local model only when you intentionally want to exercise Ollama:

python3 scripts/run_router_tests.py tests/router-tests.json --real

Contributing And Security

  • See CONTRIBUTING.md for local validation, model compatibility notes, and pull request guidelines.
  • See SECURITY.md for private vulnerability reporting.
  • Please do not paste secrets, private logs, credentials, API keys, personal data, customer data, or proprietary source code into public issues.
  • If you hit invalid JSON from a local model, try a larger model such as qwen2.5-coder:7b and include the model name in the bug report.

When To Use vs When To Bypass

| Situation | Use token-router? | Rationale |
| --- | --- | --- |
| Massive deployment logs | Yes | Error evidence is usually localized and recent |
| CI logs with stack traces | Yes | Keyword and tail scanning preserve high-value ranges |
| Long files with a specific bug or query | Yes | --query and code markers narrow routing safely |
| Legacy files with TODO, FIXME, raise, or assert markers | Yes | Code keyword windows are highly effective |
| Long routeable agent-reference docs | Yes | agent_context retrieves task-relevant raw instruction slices |
| Compact root AGENTS.md with mandatory rules | Bypass | Always-on rules should remain directly available |
| Already auto-injected long instruction files | Bypass for savings | Router cannot remove token cost that was already injected |
| Broad architecture review | Bypass | The model needs global context, not narrow slices |
| Refactor planning across many modules | Bypass or combine manually | Local routing may hide cross-file relationships |
| Security-sensitive code requiring complete audit | Bypass | Completeness matters more than token reduction |
| Very dense source files where every section matters | Bypass | Context reduction can remove necessary dependencies |

Design Caveats

token-router is lossless in extraction, not omniscient in selection. If a selected range is too narrow, the cloud model should request nearby or broader line ranges. The intended workflow is iterative: route, reason, expand when needed.
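The "expand when needed" step amounts to widening a range and clamping it to the file. A minimal sketch, with the hypothetical helper name `widen_range` and an assumed default margin of 8 lines:

```python
def widen_range(start, end, total_lines, margin=8):
    """Grow a 1-indexed, inclusive range by `margin` lines on each side,
    clamped to the bounds of the file."""
    return max(1, start - margin), min(total_lines, end + margin)


# A slice near the top of a 100-line file cannot extend before line 1.
print(widen_range(5, 10, total_lines=100))  # (1, 18)
```

The widened range can then be sliced from the original file the same way as the first pass, so the expanded evidence stays raw.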

For static agent instructions, do not move non-negotiable safety or ownership rules out of the always-on root file purely for token savings. Use agent_context for long reference material that is useful only for specific tasks.

License

MIT License. See LICENSE.
