token-router

Lossless local line routing for massive logs, source files, and routeable agent-context references before they reach a cloud reasoning model.

token-router is a Codex skill and standalone Python router that uses a local Ollama model to identify exact line ranges in oversized files, then returns the raw, unmodified slices to a high-reasoning cloud model such as GPT-5.5, o3, or another AI coding agent.

Executive Summary

Large logs, legacy source files, and long agent instruction references are expensive context. Sending a 2,000+ line deployment log, monolithic source file, or routeable project-rules document directly to a cloud LLM can waste tokens, increase latency, and exhaust budgets before the model reaches the evidence that matters.

token-router solves this with a hybrid separation-of-concerns architecture:

Local model for search and triage: Gemma 4 via Ollama runs on the user's machine and scans large files for relevant line coordinates.
Cloud model for reasoning: GPT-5.5, o3, or another cloud model receives only raw high-density evidence slices and applies deep reasoning where it matters.
No lossy summarization: The local model does not rewrite, summarize, or interpret code. It emits JSON line ranges; the router then extracts exact raw text from the original file.
Static context routing: Long optional agent-reference docs can be routed on demand while short mandatory root instructions stay always-on.

The result is aggressive context reduction without degrading the technical evidence available to the reasoning model.

Architecture & Core Philosophy

Separation Of Concerns

token-router deliberately splits the workflow into three specialized phases, handled by a local model, a deterministic script, and a cloud model:

Phase	Engine	Responsibility	Output
Search / Triage	Local Ollama model	Find relevant line coordinates	JSON ranges
Evidence Extraction	Router script	Slice original file by line number	Raw unedited text
Reasoning	Cloud LLM / Codex	Debug, explain, patch, or review	High-confidence answer

This prevents the local model from becoming an unreliable summarizer. The local model is a router, not the analyst.

Lossless Line Routing

Summarizing source code or logs through a smaller local model is lossy. A single dropped stack frame, config key, indentation detail, or error suffix can change the diagnosis.

token-router avoids that failure mode:

The file is scanned locally.
The local model returns line coordinates.
The Python script slices the original file directly.
The cloud model receives raw text exactly as it appeared on disk.

The selection step can be imperfect, but the extracted evidence itself is lossless. If the cloud model needs more context, it can request wider line ranges instead of hallucinating around omitted dependencies.
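
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the router's actual code: the name `slice_ranges`, the `{"ranges": [{"start": ..., "end": ...}]}` JSON shape, and the 1-indexed inclusive convention are all assumptions made for the example.

```python
import json


def slice_ranges(path, ranges_json):
    """Return (start, end, raw_text) tuples copied verbatim from the file.

    Assumes 1-indexed, inclusive ranges in a hypothetical JSON shape:
    {"ranges": [{"start": 3, "end": 4}]}
    """
    # newline="" disables newline translation, so line endings stay as stored.
    with open(path, encoding="utf-8", newline="") as handle:
        lines = handle.readlines()
    slices = []
    for item in json.loads(ranges_json)["ranges"]:
        # Clamp to the file so an over-eager model cannot index out of bounds.
        start = max(1, int(item["start"]))
        end = min(len(lines), int(item["end"]))
        if start <= end:
            slices.append((start, end, "".join(lines[start - 1:end])))
    return slices
```

Because the text comes from the file rather than from the model, indentation, tabs, and trailing error suffixes survive untouched; the model only influences which lines are chosen.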

Static Agent Context Routing

agent_context mode extends the same line-routing architecture to long instruction references such as detailed AGENTS.md, CLAUDE.md, GEMINI.md, .cursorrules, or agent-context/*.md files.

The important boundary is architectural: if a platform has already injected a long root instruction file into the prompt, token-router cannot retroactively remove that token cost. The intended pattern is:

Keep the root always-on instruction file compact.
Move long task-specific rules into routeable reference files.
Use agent_context with the current task query to retrieve only relevant raw rule slices.

Features & Context Safety Guardrails

Three routing modes
- error_log: optimized for logs, stack traces, CI output, and deployment failures.
- heavy_code: optimized for long source files and localized code investigations.
- agent_context: optimized for long routeable agent instruction references.
Query pass-through
- Use --query "token expiration" to bias routing toward user intent.
- Multi-word queries are tokenized and exposed as [Query Terms] so the local model can match any relevant term, not only the full phrase.
Deterministic prefilters
- Log mode uses keyword and tail-window scanning.
- Code mode prioritizes query hits, then suspicious code markers, then structural head/tail previews.
- Agent-context mode prioritizes query hits, frontmatter, headings, mandatory rules, workflow/tool/security/testing keywords, and head/tail context.
Lossless raw slicing
- Router output is copied from the original file by line number.
Output caps
- ROUTER_MAX_OUTPUT_LINES bounds cloud-visible raw text.
- In error_log mode, the most recent log ranges are kept first when the cap is exceeded.
Memory safety
- OLLAMA_KEEP_ALIVE=0s unloads the model immediately after routing.
- OLLAMA_NUM_CTX=4096 or 8192 bounds local context pressure.
Regression harness
- Fixture-based tests catch prompt, line-selection, cap, and fallback regressions.
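
The output cap described above could work roughly as follows. The sketch assumes ranges are non-overlapping `(start, end)` tuples of 1-indexed inclusive line numbers, and that "newer" in a log means higher line numbers. `apply_output_cap` is a hypothetical name, and the real router may trim differently.

```python
def apply_output_cap(ranges, max_lines, newest_first=True):
    """Trim (start, end) ranges so their total line count fits max_lines.

    Hypothetical helper: with newest_first=True, ranges with higher line
    numbers (later log entries) are kept before earlier ones.
    Assumes the input ranges do not overlap.
    """
    ordered = sorted(ranges, key=lambda r: r[0], reverse=newest_first)
    kept, budget = [], max_lines
    for start, end in ordered:
        if budget <= 0:
            break
        size = end - start + 1
        if size > budget:
            # Keep the part of the range closest to the preferred end of the file.
            if newest_first:
                start = end - budget + 1
            else:
                end = start + budget - 1
            size = budget
        kept.append((start, end))
        budget -= size
    # Return in file order so the slices read top to bottom.
    return sorted(kept)
```

With a cap of 12 lines and ranges `(10, 19)`, `(100, 109)`, and `(200, 204)`, this sketch keeps `(103, 109)` and `(200, 204)` and drops the oldest range entirely.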

Benchmark Highlights

Token counts are estimated with chars / 4 and should be treated as directional rather than billing-exact.

Case	Mode	Est. Input Tokens	Router Output Tokens	Reduction	Time
Sparse infra log	`error_log`	41,711	131	99.69%	5.37s
Legacy bug source	`heavy_code`	7,520	70	99.06%	4.46s
Keywordless structural source	`heavy_code`	4,188	48	98.85%	6.13s

See docs/benchmark-report.md for methodology, caveats, and regression results.
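
As a worked example of the chars / 4 heuristic, the helpers below estimate tokens and compute the reduction percentage for the first row of the table. The function names are illustrative and are not taken from the benchmark scripts.

```python
def estimate_tokens(text):
    """Rough token estimate: one token per four characters."""
    return len(text) // 4


def reduction_percent(input_tokens, output_tokens):
    """Share of input tokens that never reach the cloud model."""
    if input_tokens == 0:
        return 0.0
    return round(100 * (1 - output_tokens / input_tokens), 2)


# Sparse infra log row: 41,711 estimated input tokens, 131 routed tokens.
print(reduction_percent(41_711, 131))  # 99.69
```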

Quick Start

Prerequisites

Python 3.10+
Ollama installed and running
A local routing model, defaulting to:

ollama pull gemma4:e2b-it-q4_K_M

Run Against A Large Log

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py error_log path/to/deploy.log --query "database migration timeout"

Run Against A Long Source File

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py heavy_code path/to/service.py --query "token expiration"

Run Against A Routeable Agent Context Reference

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py agent_context path/to/agent-context/frontend.md --query "frontend testing workflow"

Model Reliability

The default model is intentionally small so the router can run on modest local machines:

OLLAMA_MODEL=gemma4:e2b-it-q4_K_M

Very small local models may occasionally return invalid JSON on complex logs, unusual symbols, or dense instruction files. The router cleans common code-fence noise and falls back safely, but if JSON errors are frequent and you have enough VRAM, try a larger routing model:

OLLAMA_MODEL=qwen2.5-coder:7b \
python3 scripts/router.py error_log path/to/deploy.log --query "timeout"

Community feedback suggests coder-oriented 7B models such as qwen2.5-coder:7b, or 4B-class Gemma variants, can reduce JSON-format failures compared with ultra-small 2B-class models. Please include your OLLAMA_MODEL, quantization tag, and reproduction command when reporting model-specific behavior.
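
The code-fence cleanup and safe fallback mentioned above might look like the sketch below. It assumes the model is asked for an object of the form `{"ranges": [...]}`, which is an illustrative shape rather than the router's documented schema, and `parse_model_ranges` is a hypothetical name. Returning `None` tells the caller to fall back to whatever non-model path it has.

```python
import json
import re

# Matches a leading ```lang fence or a trailing ``` fence.
FENCE = re.compile(r"^```[a-zA-Z]*\s*|\s*```$")


def parse_model_ranges(raw):
    """Parse model output into a list of ranges, or None if unusable."""
    cleaned = FENCE.sub("", raw.strip())
    # Keep only the outermost JSON object in case the model added chatter.
    first, last = cleaned.find("{"), cleaned.rfind("}")
    if first == -1 or last <= first:
        return None
    try:
        data = json.loads(cleaned[first:last + 1])
    except json.JSONDecodeError:
        return None
    ranges = data.get("ranges") if isinstance(data, dict) else None
    return ranges if isinstance(ranges, list) else None
```

Treating every malformed reply as `None` instead of raising keeps a flaky small model from crashing the routing step.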

Use As A Codex Skill

Install or copy this repository into your Codex skills directory:

mkdir -p ~/.codex/skills
cp -R token-router ~/.codex/skills/token-router

Then invoke it naturally:

Use $token-router to analyze this large log with query "payment timeout".
Use $token-router agent_context to route this long AGENTS.md reference for "deployment approval".

Use With Claude Code

This repository includes a compact CLAUDE.md bootstrap for Claude Code. It tells Claude to call the same router script instead of loading oversized logs, source files, or long routeable instruction references directly.

Keep project-level CLAUDE.md files short when they are automatically loaded by Claude Code. Put long task-specific guidance in separate reference files, then route them on demand:

OLLAMA_NUM_CTX=4096 OLLAMA_KEEP_ALIVE=0s \
python3 scripts/router.py agent_context path/to/claude-context.md --query "deployment approval"

Configuration

Variable	Default	Purpose
`OLLAMA_MODEL`	`gemma4:e2b-it-q4_K_M`	Local model used for line routing
`OLLAMA_URL`	`http://localhost:11434/api/generate`	Ollama generate endpoint
`OLLAMA_NUM_CTX`	`8192`	Local model context window
`OLLAMA_KEEP_ALIVE`	`0s`	Unload model immediately after routing
`ROUTER_TIMEOUT`	`120`	Ollama request timeout in seconds
`ROUTER_MAX_CHARS`	`120000`	Maximum characters of line-numbered content sent to Ollama
`ROUTER_MAX_OUTPUT_LINES`	`160`	Maximum raw lines returned to the cloud model
`ROUTER_STREAM_THRESHOLD_BYTES`	`5000000`	File size threshold for streaming log prefiltering
`ROUTER_LOG_CONTEXT_LINES`	`6`	Log context lines around keyword/query hits
`ROUTER_LOG_TAIL_LINES`	`200`	Tail lines preserved for large logs
`ROUTER_CODE_CONTEXT_LINES`	`8`	Code context lines around query/keyword hits
`ROUTER_AGENT_CONTEXT_LINES`	`6`	Agent-context lines around query/instruction keyword hits
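
To show how a few of these variables might feed a request, the sketch below reads them with the defaults from the table and builds a payload for Ollama's `/api/generate` endpoint. `load_config` and `build_payload` are hypothetical names, and whether the router sets `format` and `stream` this way is an assumption. The fields themselves (`model`, `prompt`, `stream`, `format`, `keep_alive`, `options.num_ctx`) are standard for that endpoint.

```python
import os


def load_config(env=os.environ):
    """Read router settings, falling back to the documented defaults."""
    return {
        "model": env.get("OLLAMA_MODEL", "gemma4:e2b-it-q4_K_M"),
        "url": env.get("OLLAMA_URL", "http://localhost:11434/api/generate"),
        "num_ctx": int(env.get("OLLAMA_NUM_CTX", "8192")),
        "keep_alive": env.get("OLLAMA_KEEP_ALIVE", "0s"),
        "timeout": int(env.get("ROUTER_TIMEOUT", "120")),
        "max_output_lines": int(env.get("ROUTER_MAX_OUTPUT_LINES", "160")),
    }


def build_payload(config, prompt):
    """Build a non-streaming generate request that asks for JSON output."""
    return {
        "model": config["model"],
        "prompt": prompt,
        "stream": False,
        "format": "json",
        "keep_alive": config["keep_alive"],
        "options": {"num_ctx": config["num_ctx"]},
    }
```

Passing `env` as a parameter makes the defaults easy to check in a test without touching the real environment.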

Regression Tests

Run the fast mock-based test suite:

python3 scripts/run_router_tests.py tests/router-tests.json

Run against the real local model only when you intentionally want to exercise Ollama:

python3 scripts/run_router_tests.py tests/router-tests.json --real

Contributing And Security

See CONTRIBUTING.md for local validation, model compatibility notes, and pull request guidelines.
See SECURITY.md for private vulnerability reporting.
Please do not paste secrets, private logs, credentials, API keys, personal data, customer data, or proprietary source code into public issues.
If you hit invalid JSON from a local model, try a larger model such as qwen2.5-coder:7b and include the model name in the bug report.

When To Use vs When To Bypass

Situation	Use `token-router`?	Rationale
Massive deployment logs	Yes	Error evidence is usually localized and recent
CI logs with stack traces	Yes	Keyword and tail scanning preserve high-value ranges
Long files with a specific bug or query	Yes	`--query` and code markers narrow routing safely
Legacy files with `TODO`, `FIXME`, `raise`, or `assert` markers	Yes	Code keyword windows are highly effective
Long routeable agent-reference docs	Yes	`agent_context` retrieves task-relevant raw instruction slices
Compact root `AGENTS.md` with mandatory rules	Bypass	Always-on rules should remain directly available
Already auto-injected long instruction files	Bypass for savings	Router cannot remove token cost that was already injected
Broad architecture review	Bypass	The model needs global context, not narrow slices
Refactor planning across many modules	Bypass or combine manually	Local routing may hide cross-file relationships
Security-sensitive code requiring complete audit	Bypass	Completeness matters more than token reduction
Very dense source files where every section matters	Bypass	Context reduction can remove necessary dependencies

Design Caveats

token-router is lossless in extraction, not omniscient in selection. If a selected range is too narrow, the cloud model should request nearby or broader line ranges. The intended workflow is iterative: route, reason, expand when needed.
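
The "expand when needed" step can be as simple as padding a range and clamping it to the file. This is a sketch with a hypothetical `widen_range` helper and an arbitrary default padding; how the router itself accepts requests for broader ranges is not specified here.

```python
def widen_range(start, end, total_lines, padding=20):
    """Pad a 1-indexed inclusive range by `padding` lines on each side."""
    return max(1, start - padding), min(total_lines, end + padding)


print(widen_range(5, 12, total_lines=300))     # (1, 32)
print(widen_range(290, 295, total_lines=300))  # (270, 300)
```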

For static agent instructions, do not move non-negotiable safety or ownership rules out of the always-on root file purely for token savings. Use agent_context for long reference material that is useful only for specific tasks.

License

MIT License. See LICENSE.
