llm-proxy


Virtualize models from any provider — local or cloud — and switch between them in your client UI.

client → llm-proxy:4000/v1 → vLLM (local)
                             → OpenAI (cloud)
                             → HuggingFace (cloud)
                             → any OpenAI-compatible API

What it does

Unify your backends. Point your client at one URL. The proxy forwards requests transparently — model resolution, auth headers, and parameter profiles are applied, but the request format is never translated. Supports /v1/chat/completions, /v1/completions (code completion / FIM), and /v1/embeddings.
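The transparent-forwarding idea can be sketched with Go's standard `httputil.ReverseProxy`. This is a minimal illustration, not llm-proxy's actual code — the function name, backend URL, and header handling are assumptions:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newForwarder returns a handler that forwards requests to the backend
// unchanged except for the Authorization header — the request body is
// never translated. Illustrative sketch only.
func newForwarder(backend, apiKey string) (http.Handler, error) {
	target, err := url.Parse(backend)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	orig := proxy.Director
	proxy.Director = func(r *http.Request) {
		orig(r) // rewrite scheme/host to point at the backend
		if apiKey != "" {
			r.Header.Set("Authorization", "Bearer "+apiKey)
		}
	}
	return proxy, nil
}

func main() {
	h, err := newForwarder("http://gpu-server:8000", "")
	if err != nil {
		panic(err)
	}
	http.Handle("/v1/", h)
	// http.ListenAndServe(":4000", nil) // sketch only; not started here
}
```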

Virtual models. Name the same underlying model multiple times with different parameter profiles. A coder with low temperature and thinking enabled, a creative with high temperature and thinking off — same model, different behaviour. Clients just switch the model name.

routes:
  - virtual_model: coder
    backend: local
    real_model: "Qwen/Qwen3.5-35B-A3B-FP8"
    defaults: { temperature: 0.2, enable_thinking: true, max_tokens: 16384 }
    clamp: { enable_thinking: true }

  - virtual_model: creative
    backend: local
    real_model: "Qwen/Qwen3.5-35B-A3B-FP8"
    defaults: { temperature: 0.9, enable_thinking: false, max_tokens: 8192 }

Parameter control. Three-layer merge: defaults < caller < clamp. Set sensible defaults, let callers override what you allow, and clamp the parameters they must not change.
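The three-layer merge can be sketched in a few lines of Go. The generic map representation is an assumption for illustration; llm-proxy's real parameter types may differ:

```go
package main

import "fmt"

// merge applies the three-layer rule: defaults < caller < clamp.
// Later layers win key by key. Illustrative sketch only.
func merge(defaults, caller, clamp map[string]any) map[string]any {
	out := map[string]any{}
	for k, v := range defaults {
		out[k] = v
	}
	for k, v := range caller { // caller overrides defaults
		out[k] = v
	}
	for k, v := range clamp { // clamp always wins
		out[k] = v
	}
	return out
}

func main() {
	defaults := map[string]any{"temperature": 0.2, "max_tokens": 16384}
	caller := map[string]any{"temperature": 0.9, "enable_thinking": false}
	clamp := map[string]any{"enable_thinking": true}
	m := merge(defaults, caller, clamp)
	// Caller wins over defaults; clamp wins over the caller.
	fmt.Println(m["temperature"], m["enable_thinking"]) // 0.9 true
}
```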

Observability. OpenTelemetry metrics out of the box — TTFT, request duration, token counts, active requests, generation speed. Prometheus exporter, ready for Grafana. A request journal logs structured data about every request for later analysis.

Zero overhead. SSE streams flow directly to the client. Metrics are parsed from the byte stream without buffering. Single static Go binary, ~7 MB Docker image on scratch.

Quick start

cp config.example.yaml config.yaml
# Edit config.yaml — set your backends and API keys
docker compose up -d

Point your client at http://localhost:4000/v1. Metrics at http://localhost:9091/metrics.

Configuration

backends:
  - id: local
    type: openai
    base_url: "http://gpu-server:8000"
    timeout_seconds: 300

  - id: hf
    type: openai
    base_url: "https://router.huggingface.co"
    api_key: "${HF_TOKEN}"
    skip_probe: true

Secrets use ${ENV_VAR} syntax — resolved at startup, never stored in config. Hot-reload with SIGHUP — config, log level, and backend probes update without restart.

See the full configuration reference for details on auth types, TLS, auto-routing, and parameter profiles.

Documentation

License

MIT — Copyright (c) 2026 Paul Gresham Advisory LLC
