An enterprise grade, highly performant route virtualizer for your infrerence workloads. Written in Golang for speed, concurrency and efficient resource usage, Hikyaku is the reliable messaging layer between your agentic workers and the AI inference layers you have deployed.
| Accelerate and connect your agentic process designers and software engineers to the models they need | Manage inference backends across any disparate landscape of providers, local or in the cloud |
| Simple to deploy in minutes, serve thousands of users with minimal hardawre | Secure by default - protect your traffic, your API keys and control access |
| Curate and Control approved models | Reliable low latency and highly available |
| Smart load-balancing with KV-cache affinity, ground-truth health, and failover migration | Inspect and debug messages to speed up analysis and problem solving |
| Set and clamp sampling parameters to ensure optimal model performance | Deploy defenders like loop detector and supress 'zero' messages before they waste valuable tokens |
| RFC 3986 compliant URL resolution — no string concatenation, no double-prefix bugs | Zero-probe mode — health driven by real traffic alone, no synthetic probes needed |
Virtualize models from any provider — local or cloud — and switch between them in your client UI.
client → hikyaku:4000/v1 → vLLM (local)
→ OpenAI (cloud)
→ HuggingFace (cloud)
→ any OpenAI-compatible API
Unify your backends. Point your client at one URL. The proxy forwards requests transparently — model resolution, auth headers, and parameter profiles are applied, but the request format is never translated. Supports /v1/chat/completions, /v1/completions (code completion / FIM), and /v1/embeddings.
Virtual models. Name the same underlying model multiple times with different parameter profiles. A coder with low temperature and thinking enabled, a creative with high temperature and thinking off — same model, different behaviour. Clients just switch the model name.
routes:
- virtual_model: coder
backend: local
real_model: "Qwen/Qwen3.5-35B-A3B-FP8"
defaults: { temperature: 0.2, enable_thinking: true, max_tokens: 16384 }
clamp: { enable_thinking: true }
- virtual_model: creative
backend: local
real_model: "Qwen/Qwen3.5-35B-A3B-FP8"
defaults: { temperature: 0.9, enable_thinking: false, max_tokens: 8192 }Parameter control. Three-layer merge: defaults < caller < clamp. Set sensible defaults, let callers override what you allow, lock down what they can't.
Observability. OpenTelemetry metrics out of the box — TTFT, request duration, token counts, active requests, generation speed. Prometheus exporter, ready for Grafana. Request journal logs structured data about every request for analysis.
On-demand full-body capture. When metrics aren't enough and you need the exact bytes going upstream (prompt-cache debugging, context-size investigations), SIGUSR1 arms a bounded capture window that dumps the next N full request/response bodies to disk. Off by default.
Zero overhead. SSE streams flow directly to the client. Metrics are parsed from the byte stream without buffering. Single static Go binary, ~7 MB Docker image on scratch.
mkdir -p config
cp config.example.yaml config/config.yaml
# Edit config/config.yaml — set your backends and API keys
docker compose up -dPoint your client at http://localhost:4000/v1. Metrics at http://localhost:9091/metrics.
(The config/ subdirectory exists so docker bind-mounts the folder, not the file — editors that save atomically (vim's writebackup, VS Code, etc.) won't orphan the container's view of the file on SIGHUP reload.)
backends:
- id: local
type: openai
base_url: "http://gpu-server:8000/v1/"
timeout_seconds: 300
- id: hf
type: openai
base_url: "https://router.huggingface.co/v1/"
api_key: "${HF_TOKEN}"
skip_probe: trueMap syntax is also supported — the key becomes the id (same for routes → virtual_model):
backends:
local:
type: openai
base_url: "http://gpu-server:8000/v1/"
hf:
type: openai
base_url: "https://router.huggingface.co/v1/"
api_key: "${HF_TOKEN}"
skip_probe: trueBoth formats produce identical results. An explicit id / virtual_model inside the map value takes precedence over the map key.
URL resolution (RFC 3986). The proxy combines base_url with the request path using standard URL resolution (RFC 3986 §5.2.2). Include a trailing slash on base_url when your backend uses a path prefix (e.g. "http://host:port/v1/") so that relative resolution appends correctly. The incoming request path from the client is forwarded verbatim — the proxy does not manipulate it.
Secrets use ${ENV_VAR} syntax — resolved at startup, never stored in config. Hot-reload with SIGHUP — config, log level, and backend probes update without restart.
See the full configuration reference for details on auth types, TLS, auto-routing, and parameter profiles.
Quick start shows the minimum. This section explains the full model of how requests find their way to a backend — because once you've got more than one virtual model and more than one client, the details matter.
Every request follows the same path:
client → POST /v1/chat/completions body.model = "my-virtual-name"
│
▼
┌──────────────────┐
│ 1. Resolve route │──── unknown model? → 404 (or passthrough)
└──────────────────┘
│
▼
┌──────────────────┐
│ 2. Merge params │ defaults < caller < clamp
└──────────────────┘
│
▼
┌────────────────────┐
│ 3. Translate │ enable_thinking → provider-specific shape
└────────────────────┘
│
▼
┌──────────────────────┐
│ 4. Forward to backend│ SSE streams straight to the client
└──────────────────────┘
The proxy does not translate between protocols. A client speaking OpenAI chat completions (/v1/chat/completions) can only reach openai-type backends; Anthropic messages (/v1/messages) can only reach anthropic-type backends; Ollama native (/api/chat, /api/generate, etc.) can only reach ollama-type backends. Pick one per client — the proxy forwards the wire format unchanged.
The virtual_model name is an arbitrary string — whatever the client sends in the "model" field is what the proxy matches against. Many client tools prefix the model name to identify the provider (openai/gpt-4, anthropic/claude-3), and some strip that prefix before sending the request onward. You can make the proxy look like any provider by naming your virtual models to match the client's convention:
| Client convention | Virtual model name you'd configure |
|---|---|
| Bare name (OpenWebUI, most clients) | qwen-coder |
| Provider prefix, prefix stripped before send (LiteLLM) | qwen-coder |
| Slash namespace in the name (dropdowns) | local/qwen-coder |
| Fully-qualified HF-style | Qwen/Qwen3-Coder-30B |
If a client breaks, set LOG_LEVEL=1 and read the [req] line — it shows the exact model name that arrived. Then name your virtual model to match.
Three layers decide where a request goes, in order:
- Explicit route — a route with
virtual_model: Xand eitherbackend: Y,backend_group: G, orauto_route:. Checked first. auto_route— a virtual model whose body is inspected to pick between two sub-routes:- virtual_model: smart auto_route: text: my-text-route # used when request is text-only vision: my-vision-route # used when request contains images
passthrough_unrouted— whentrue, unknown model names are forwarded as-is to the default backend (see below). Whenfalse(default), they return 404 with the list of available virtual models.
The default backend is the one marked default: true in its config, or the first backend in the list if none is marked. It is the target for passthrough_unrouted requests.
Instead of pinning a route to a single backend, use backend_group: to spread traffic across a named group. The proxy supports four strategies — sticky_least_loaded (pins sessions for KV-cache locality), least_loaded, round_robin, and single.
Ground-truth health. Real request outcomes drive backend health: 3 consecutive failures mark a backend unhealthy; any success recovers it. Probes complement this by catching issues during quiet periods. If a real request succeeded within the last 60s, probe failures are ignored (oscillation prevention). All probes can be disabled — the proxy works on traffic alone.
Failover migration. When a pinned backend fails (connection error or 5xx), the affinity pin is invalidated and the backend excluded for a short cooldown. Next request migrates to a healthy peer. Timeouts are tolerated — they mean the backend is busy, not dead.
Innocent until proven guilty. Backends with active requests are immune to probe-based demotion. Long generations won't trigger false positives.
Graduated recovery. Backends recover gently: ramp-up phase honors existing pins but declines new ones, giving time to warm the KV cache.
See Configuration → Load Balancing for the full reference and LOAD-BALANCING.md for the design specification.
Three layers merge per route, in priority order:
- virtual_model: coder
backend: local
real_model: Qwen/Qwen3-Coder-30B
defaults:
temperature: 0.2
max_tokens: 16384
enable_thinking: true
clamp:
enable_thinking: true # caller cannot disable thinking, everdefaults— applied if the caller didn't set the key. "My preferred defaults."- caller — the request body's own values override
defaults. "Let clients tune it." clamp— applied unconditionally, overriding both. "This is non-negotiable."
Use clamp sparingly — it's useful for "always on / always off" guarantees like thinking mode, a ceiling on max_tokens, or pinning temperature for reproducibility.
Two escape hatches for vendor-specific quirks that don't belong in the proxy core:
- virtual_model: gresh-gemma-think
backend: vllm-gemma
real_model: gemma-4-27b
system_prompt:
prepend: "<|think|>\n" # also: append, replace (mutually exclusive)
- virtual_model: gresh-kimi-deep
backend: vllm-kimi
inject: # deep-merged into body, route wins per leaf
chat_template_kwargs:
thinking_mode: "deep"
preserve_thinking: truesystem_prompt mutates the system message (or body.system for Anthropic) before the request leaves the proxy. inject deep-merges arbitrary JSON into the request body. Together they cover most "this vendor needs an unusual knob set" cases without the proxy growing per-vendor knowledge. Full semantics in docs/configuration.md.
Per-route and per-backend headers: blocks let operators rewrite outbound HTTP headers. Three ops — add, remove, rename — applied in that order so an operator can rename Authorization to preserve audit, drop it, and add a fresh one in a single block:
backends:
- id: corporate-llm
headers:
add: { X-Corp-Auth: "${CORP_TOKEN}" }
routes:
- virtual_model: gresh-internal
backend: corporate-llm
real_model: corp-llama
headers:
remove: [X-Forwarded-For, User-Agent]
add: { X-Tenant-Id: "${TENANT_ID}" }
rename: { Authorization: X-Original-Auth }Backend headers apply first; route headers apply on top, route wins on conflict. Full reference in docs/configuration.md.
Admin Panel → Settings → Connections → add an OpenAI API:
- Base URL:
http://hikyaku:4000/v1 - API Key: whatever you set as
server.api_key
Click the refresh icon on the connection. OpenWebUI calls /v1/models and populates its dropdown with your virtual model names. That's it.
LiteLLM wants a provider prefix. For custom OpenAI-compatible endpoints, prefix with openai/ — at time of writing LiteLLM strips that prefix before calling the base URL, so a virtual model named qwen-coder is reached as openai/qwen-coder in LiteLLM config.
If the version of LiteLLM you're running doesn't strip the prefix, you'll see openai/qwen-coder arrive at the proxy (check LOG_LEVEL=1). Either register a matching virtual model or update LiteLLM.
Clients speaking Ollama native (/api/chat, /api/generate, /api/embed, /api/embeddings, /api/tags):
- Base URL:
http://hikyaku:4000(no/apisuffix — hikyaku exposes those paths directly) - API key: your
server.api_key
Configure a route pointing at a type: ollama backend. Ollama's sampling params live under "options" in the request body; route defaults and clamp merge into that nested object automatically. Pure passthrough — the proxy doesn't reshape messages or reinterpret parameters the way Ollama's OpenAI-compat layer does.
Clients that only speak OpenAI-compat mode: point them at /v1 with a type: openai backend instead (Ollama exposes /v1 too, but you get Ollama's own interpretation of OpenAI semantics — known to surprise on system prompts and temperature).
Add a provider block to ~/.opencode/config.json:
{
"providers": {
"local": {
"baseUrl": "http://hikyaku:4000/v1",
"apiKey": "nokey",
"api": "openai-completions",
"models": [
{
"id": "qwen-coder",
"name": "qwen-coder",
"reasoning": true,
"input": ["text", "image"],
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"contextWindow": 131072,
"maxTokens": 8192
}
]
}
}
}Inside opencode, /model will list the models. cost is zero because it's your own hardware — set real values if you're proxying a cloud API and want opencode's cost accounting to be accurate.
Point it at an anthropic-type backend through the proxy:
export ANTHROPIC_BASE_URL=http://hikyaku:4000
export ANTHROPIC_AUTH_TOKEN=$PROXY_API_KEY
claudeUse /model inside Claude Code to pick a virtual model whose route points to your anthropic backend. The proxy forwards /v1/messages unchanged, so thinking, tool use, and streaming all work.
- Base URL:
http://hikyaku:4000/v1 - API key: your
server.api_key
The client sees virtual models via /v1/models. If anything odd happens, LOG_LEVEL=1 + [req] line will show you the model name as received — 90% of client-integration issues diagnose from that one line.
The metrics pipeline records sizes and previews — fine for observability, not enough when you need to diff two requests byte-for-byte or inspect a tool array. For that, add to config.yaml:
sig_message_capture:
enabled: true
output_folder: /capture # auto-created with mode 0700 at startup
max_messages: 5SIGHUP to pick up the config, then when you want to dump:
docker kill --signal=USR1 hikyaku
# ...make up to 5 requests...
docker cp hikyaku:/capture/. ./local-captures/
jq . ./local-captures/*.jsonThe container-local path is ephemeral — files live in the container's writable layer and are discarded on rebuild/recreate, which suits ad-hoc debugging. If you want persistence across rebuilds, bind-mount or use a named volume; see logging.md for the full reference (file format, security notes, streaming behaviour).
hikyaku is built for personal/small-team use behind a trusted boundary — Tailscale, VPN, or a private subnet. A few things are worth knowing:
Transport is secure by default. The proxy refuses to start without TLS unless server.allow_plaintext: true is explicitly set. Configure server.tls.cert + server.tls.key for HTTPS, or set the plaintext opt-in if you're behind Tailscale/VPN and accepting the link-layer encryption as your boundary. Outbound to providers is always HTTPS when the base_url says so, verified against the system CA bundle.
Auth. Clients send Authorization: Bearer <server.api_key>. The compare is constant-time. The token is static — rotate by editing config and sending SIGHUP. Empty api_key disables auth (only safe on a loopback-only bind).
Metrics. /metrics has no auth. It binds to 127.0.0.1:9091 by default — localhost-only, safe for plaintext. To expose it off-host, either bind wider and configure telemetry.prometheus.tls.cert + tls.key for HTTPS (independent of the gateway cert), or set telemetry.prometheus.allow_plaintext: true on a trusted network. Non-loopback plaintext without opt-in is refused at startup.
Features that can expose prompt contents. This is the part to read carefully in the full security doc. In summary:
LOG_LEVEL=3logs 80 bytes of request bodies.LOG_LEVEL=4logs full request and response message text.- The request journal, when enabled, records up to 2 KB of system prompt and 8 KB of the last user message per request — regardless of log level.
- SIGUSR1 message capture writes full request/response bodies to disk for a bounded window, for debugging. Off by default; requires both
sig_message_capture.enabled: trueand anoutput_folderto activate.
Hardened build. For deployments where any of the above is unacceptable, build with -tags hardened:
make build-hardened
# or:
go build -tags hardened -o hikyaku ./cmd/hikyakuThe hardened tag compiles out (not just disables) SIGUSR1 capture, log levels 3-4, and the prompt text in journal entries. Structural telemetry — counts, structural signals, routing params — is kept. Every build prints a banner at startup identifying which mode it's running in, so operators can verify via docker logs without re-reading config. Details in docs/security.md.
Container. FROM scratch + USER 65532:65532 — non-root, no shell, no libc. Weekly Dependabot PRs for go.mod, GitHub Actions, and the Docker base image.
- Configuration — backends, routes, virtual models, parameter profiles, TLS
- Logging and diagnostics — log levels, interaction IDs, request journal, hot reload
- Metrics — OTel/Prometheus metrics reference
- Security — threat model, transport, prompt-exposing features, hardened build
- Development — building, testing, project structure
MIT — Copyright (c) 2026 Paul Gresham Advisory LLC
