Virtualize models from any provider — local or cloud — and switch between them in your client UI.
```
client → llm-proxy:4000/v1 → vLLM (local)
                           → OpenAI (cloud)
                           → HuggingFace (cloud)
                           → any OpenAI-compatible API
```
Unify your backends. Point your client at one URL. The proxy forwards requests transparently — model resolution, auth headers, and parameter profiles are applied, but the request format is never translated. Supports /v1/chat/completions, /v1/completions (code completion / FIM), and /v1/embeddings.
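For example, a chat request goes to the proxy exactly as it would go to the backend itself. A minimal sketch, using the coder virtual model from the routes example below and the default localhost:4000 address from the quickstart:

```sh
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "coder",
    "messages": [{"role": "user", "content": "Write a binary search in Go."}]
  }'
```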
Virtual models. Name the same underlying model multiple times with different parameter profiles. A coder with low temperature and thinking enabled, a creative with high temperature and thinking off — same model, different behaviour. Clients just switch the model name.
```yaml
routes:
  - virtual_model: coder
    backend: local
    real_model: "Qwen/Qwen3.5-35B-A3B-FP8"
    defaults: { temperature: 0.2, enable_thinking: true, max_tokens: 16384 }
    clamp: { enable_thinking: true }
  - virtual_model: creative
    backend: local
    real_model: "Qwen/Qwen3.5-35B-A3B-FP8"
    defaults: { temperature: 0.9, enable_thinking: false, max_tokens: 8192 }
```

Parameter control. Three-layer merge: defaults < caller < clamp. Set sensible defaults, let callers override what you allow, lock down what they can't.
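With the routes above, a caller can override a default but never a clamped value. A sketch of the merge in practice, assuming the proxy accepts enable_thinking as a top-level request field alongside the standard sampling parameters:

```sh
# defaults < caller: temperature 0.2 is overridden by the caller's 0.5.
# caller < clamp: enable_thinking is clamped to true, so the caller's
# "false" is discarded before the request reaches the backend.
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "coder",
    "temperature": 0.5,
    "enable_thinking": false,
    "messages": [{"role": "user", "content": "Refactor this function."}]
  }'
```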
Observability. OpenTelemetry metrics out of the box — time to first token (TTFT), request duration, token counts, active requests, generation speed. Prometheus exporter, ready for Grafana. The request journal logs structured data about every request for later analysis.
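A quick way to see what the exporter serves, without assuming any metric names, is to list the metric families on the Prometheus endpoint from the quickstart below:

```sh
# List every exported metric family and its help string.
curl -s http://localhost:9091/metrics | grep '^# HELP'
```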
Zero overhead. SSE streams flow directly to the client. Metrics are parsed from the byte stream without buffering. Single static Go binary, ~7 MB Docker image on scratch.
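Streaming pass-through is easy to observe with curl; -N turns off curl's output buffering so SSE chunks print as the backend emits them (same assumed coder route as in the example above):

```sh
curl -N http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "coder",
    "stream": true,
    "messages": [{"role": "user", "content": "Count to five."}]
  }'
```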
```sh
cp config.example.yaml config.yaml
# Edit config.yaml — set your backends and API keys
docker compose up -d
```

Point your client at http://localhost:4000/v1. Metrics at http://localhost:9091/metrics.
```yaml
backends:
  - id: local
    type: openai
    base_url: "http://gpu-server:8000"
    timeout_seconds: 300
  - id: hf
    type: openai
    base_url: "https://router.huggingface.co"
    api_key: "${HF_TOKEN}"
    skip_probe: true
```

Secrets use ${ENV_VAR} syntax — resolved at startup, never stored in config. Hot-reload with SIGHUP — config, log level, and backend probes update without restart.
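A minimal reload flow when running under Docker Compose (the service name llm-proxy is an assumption; use the name from your compose file):

```sh
# Secrets are read from the environment when the proxy starts.
export HF_TOKEN=hf_xxxxxxxx
docker compose up -d

# After editing config.yaml, apply changes without a restart.
docker compose kill -s SIGHUP llm-proxy
```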
See the full configuration reference for details on auth types, TLS, auto-routing, and parameter profiles.
- Configuration — backends, routes, virtual models, parameter profiles, TLS
- Logging and diagnostics — log levels, interaction IDs, request journal, hot reload
- Metrics — OTel/Prometheus metrics reference
- Development — building, testing, project structure
MIT — Copyright (c) 2026 Paul Gresham Advisory LLC