Olla is a high-performance proxy and load balancer for LLM infrastructure.
Quick Start
# Docker
docker pull ghcr.io/thushan/olla:v0.0.28
# Binary (see assets below)
./olla --config config.yaml
What's New in v0.0.28
This is a huge release with a lot of exciting new changes and bug fixes!
New Backends & Compatibility
- Native oMLX support - added the oMLX runtime as a first-class backend with native Anthropic Messages API passthrough (#167).
- Docker Model Runner Anthropic passthrough - passthrough now resolves the native messages path from the backend profile (`anthropic_support.messages_path`), fixing 404s against DMR's `/anthropic/v1/messages` (#171).
- Lemonade models are now routable - Lemonade's `downloaded` flag is mapped to `available` state, so downloaded models route correctly instead of being treated as unhealthy. Thanks to @matthewjhunter for the report and fix (#161, issue #160).
- OpenAI alias fixes - `type: "openai"` now resolves model listings correctly via `/olla/openai/v1/models`; removed the duplicate `openai.yaml` profile that caused non-deterministic route registration. `openai-compatible` is the canonical value, `openai` remains an accepted alias. Thanks to @petersimmons1972 for the report (#151, issue #148).
- OpenAI-compatible backends now surface maximum context-length information (#167).
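As a rough illustration of the alias behaviour described above, the sketch below normalises an endpoint `type` to its canonical value before any profile lookup, so both spellings land on the same profile. The function name and the alias table are hypothetical; only the `openai` to `openai-compatible` pair comes from these notes, and Olla's actual implementation may differ.

```python
# Hypothetical alias table: only the openai -> openai-compatible pair
# is taken from the release notes.
TYPE_ALIASES = {"openai": "openai-compatible"}


def canonical_endpoint_type(value: str) -> str:
    """Return the canonical endpoint type for a configured `type` value."""
    key = value.strip()
    # Unknown values pass through unchanged; aliases collapse to one name,
    # so only a single profile is ever registered per backend type.
    return TYPE_ALIASES.get(key, key)
```

Resolving the alias once, up front, is one way to avoid having two profiles compete for the same routes.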
Authentication & Security
Highly requested feature finally implemented!
- Per-endpoint authentication for local backends - `bearer`, `api_key` and `basic` auth, with credentials from inline strings, `${ENV_VAR}`, or `_file` siblings for Docker/k8s. Works with vLLM/llama.cpp/LiteLLM `--api-key` or any bearer-auth reverse proxy. Thanks to @ShubhamTiwary914 for the report (#146, issue #132).
- 401/403 health probes now report `config_error` instead of `dead`; 429 honours `Retry-After` without tripping the circuit breaker; POST retries are skipped once response bytes have flushed (no double billing on mid-stream resets).
- Capped upstream error bodies at 1 MiB and enforced `max_body_size` on non-proxy and translator routes (including chunked bodies) (#173).
- Strip inbound `X-Olla-*` headers from upstream responses (anti-spoofing), sanitise `X-Request-ID` to prevent log injection, and stop leaking the auth scheme in status JSON (#173).
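To make the three credential sources above concrete, here is a minimal sketch of how a value could be resolved from an inline string, a `${ENV_VAR}` reference, or a `_file` sibling. The field name `token`, the helper `resolve_secret`, and the rule that a `_file` sibling wins over an inline value are all assumptions for illustration, not Olla's documented behaviour.

```python
import os
import re
from pathlib import Path

# Matches a value that is exactly one ${NAME} reference.
_ENV_REF = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_secret(auth: dict, key: str, env=os.environ) -> str:
    """Resolve auth[key] from a `<key>_file` sibling, ${ENV_VAR}, or inline."""
    file_key = f"{key}_file"
    if auth.get(file_key):
        # Assumed precedence: a mounted secret file (Docker/k8s) wins.
        return Path(auth[file_key]).read_text(encoding="utf-8").strip()
    raw = auth.get(key, "")
    match = _ENV_REF.match(raw)
    if match:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]
    # Anything else is treated as an inline literal.
    return raw
```

For example, `resolve_secret({"token": "${VLLM_API_KEY}"}, "token")` would read the key from the environment, where `VLLM_API_KEY` is just a placeholder name.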
CORS
- CORS support (opt-in, via `rs/cors`) for browser-based clients like OpenWebUI and dashboards. Off by default; auto-exposes the `X-Olla-*` header set. Thanks to @ccsmart for the request (#159, issue #156).
Anthropic / Translation
- Reasoning → thinking translation - OpenAI `reasoning`/`reasoning_content` now maps to Anthropic `thinking` blocks in both streaming and non-streaming responses (#176).
- Hardened streaming translation: preserves final `usage` chunks, synthesises fallback `output_tokens`, keeps same-chunk reasoning/content/tool-call deltas intact, and fixes repeated/interleaved tool-call handling (#176).
- More reliable translated SSE: emits valid SSE error events after response commit, with proper `Flush`/`Unwrap` on wrapped writers (#176).
- Lenient Anthropic parsing - unknown/experimental Anthropic request fields (e.g. `context_management`) are now ignored during translation and forwarded unchanged in passthrough mode. Thanks to @RenxuLogan for the report (#157, issue #154).
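The reasoning → thinking mapping above can be pictured, for the non-streaming case, roughly as follows. This is a simplified sketch and not Olla's translator: the function name is hypothetical, preferring `reasoning_content` over `reasoning` is an assumption, and fields such as a thinking-block `signature`, tool calls and usage are left out.

```python
def to_anthropic_content(message: dict) -> list[dict]:
    """Map an OpenAI-style assistant message to Anthropic content blocks."""
    blocks = []
    # Assumption: prefer reasoning_content, fall back to reasoning.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    if reasoning:
        # The thinking block comes first, ahead of the visible answer.
        blocks.append({"type": "thinking", "thinking": reasoning})
    text = message.get("content")
    if text:
        blocks.append({"type": "text", "text": text})
    return blocks
```

A message with no reasoning field simply yields a single `text` block, so clients that never see reasoning output are unaffected.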
Configuration Tunables
- New tunables, all zero-value-means-default and `OLLA_`-overridable (#164, #165):
  - `server.read_header_timeout` (default `10s`) - guards against Slowloris-style slow-header attacks.
  - `proxy.response_header_timeout` (default `30s`) - raise for on-demand model loaders (e.g. Lemonade) that would otherwise abort cold starts at 30s. Now honoured by both Olla and Sherpa. Thanks to @matthewjhunter (#164, issue #163).
  - `proxy.connection_keep_alive` (default `30s`) and `proxy.tls_handshake_timeout` (default `10s`).
  - `model_registry.unification.cleanup_interval`/`stale_threshold` are now honoured (previously silently ignored, hard-coded to `5m`) (#171).
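The "zero value means default" rule above can be sketched like this: an unset or zero duration falls back to the documented default, and an environment override takes priority over the config file. The defaults are the ones listed in these notes; the environment variable naming (`OLLA_` plus the upper-cased key with dots replaced by underscores), the accepted duration units, and the helper names are assumptions for illustration only.

```python
import re

# Defaults from the release notes, expressed in seconds.
DEFAULTS = {
    "server.read_header_timeout": 10.0,
    "proxy.response_header_timeout": 30.0,
    "proxy.connection_keep_alive": 30.0,
    "proxy.tls_handshake_timeout": 10.0,
}
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m|h)$")


def parse_duration(text: str) -> float:
    """Parse strings such as '30s' or '2m' into seconds; '' and '0' give 0."""
    text = text.strip()
    if text in ("", "0"):
        return 0.0
    match = _DURATION.match(text)
    if not match:
        raise ValueError(f"invalid duration: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def env_name(key: str) -> str:
    # Hypothetical naming scheme, not confirmed by the release notes.
    return "OLLA_" + key.replace(".", "_").upper()


def effective_timeout(key: str, config: dict, env: dict) -> float:
    """Environment override first, then config; zero falls back to default."""
    raw = env.get(env_name(key)) or config.get(key, "")
    seconds = parse_duration(raw)
    return seconds if seconds > 0 else DEFAULTS[key]
```

Under these assumptions, a Lemonade user hitting cold-start aborts would set `proxy.response_header_timeout` to something like `2m`, and leaving it out keeps the `30s` default.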
Reliability & Performance
- Olla config is now an `atomic.Pointer`, snapshotted once per request/stream, so a hot-reload can't hand an in-flight request a torn config (#173, #176).
- Endpoint pools and circuit breakers use `LoadOrCompute`, eliminating racing transport rebuilds on first use (#173).
- Cleanup loops survive per-tick panics; fixed event-bus shutdown and health-checker double-stop races; proxy engine cleanup now runs on shutdown (#176).
- `litepool.Get` returns errors instead of panicking (handles nil and typed-nil factory results) (#176).
- Hot-path tuning: typed-struct streaming chunk parsing and pooled SSE encoding buffers (#176).
- Removed dead code in the Olla engine (unused methods/pools, a forced `runtime.GC()`, an unread request counter) (#173).
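The snapshot-per-request idea from the first item above can be illustrated outside Go as well. The sketch below is a Python stand-in for the pattern, not Olla's code: an immutable config object is swapped as a whole, and each request reads the reference once and uses only that copy. The class and field names are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical fields; frozen so a published config is never mutated.
    engine: str
    response_header_timeout_s: float


class ConfigHolder:
    """Holds the current config; a reload swaps the whole object."""

    def __init__(self, initial: Config) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def load(self) -> Config:
        with self._lock:
            return self._current

    def store(self, new: Config) -> None:
        with self._lock:
            self._current = new


def handle_request(holder: ConfigHolder, reload_midway: Config = None):
    cfg = holder.load()  # snapshot once, at the start of the request
    engine = cfg.engine
    if reload_midway is not None:
        holder.store(reload_midway)  # simulate a hot-reload mid-request
    # Still reads from the snapshot, so both values come from one config.
    return engine, cfg.response_header_timeout_s
```

Because the request never re-reads the holder, it cannot see half of the old config and half of the new one; the next request picks up the reloaded values.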
Behaviour Changes to be aware of
- `/internal/status/models` now returns endpoint names for every entry (was inconsistently URL-then-names). Consumers reading `endpoints[0]` as a URL must switch to names (#173).
- Upstream `X-Olla-*` response headers are dropped, so a chained Olla-in-front-of-Olla setup loses the inner instance's headers (#173).
- Default proxy engine is now `Olla` and the default load balancer is `least-connections` (#172).
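For consumers affected by the `/internal/status/models` change, one possible migration is to treat `endpoints[0]` as a name and look the URL up separately. The sketch below assumes each entry carries an `endpoints` list (as `endpoints[0]` in the note suggests) and that the caller keeps its own name-to-URL map; the entry shape, the helper name and the sample endpoint name are illustrative only.

```python
def endpoint_url(entry: dict, urls_by_name: dict) -> str:
    """Resolve the first endpoint of a status entry to a URL via its name."""
    name = entry["endpoints"][0]
    if "://" in name:
        # Pre-v0.0.28 payloads could put a URL here; fail loudly instead
        # of silently mixing the two formats.
        raise ValueError(f"expected an endpoint name, got a URL: {name}")
    return urls_by_name[name]
```

With a map such as `{"local-ollama": "http://localhost:11434"}`, an entry whose first endpoint is `local-ollama` resolves to that URL.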
Tooling & Docs
- New validation harness (`/olla-validate --quick/--nightly`) with a local multi-protocol mock backend and fault injection - no Docker, CI-safe (#174).
- Standardised GitHub issue/PR templates (#170).
- Documentation refresh aligning all guides, headers and config examples with the codebase (#172).
- Fixed `/version` self-path reporting (previously reported a 404'ing `/internal/version`) (#171).
Acknowledgements
Thanks to everyone who reported issues and contributed to this release:
- @matthewjhunter - contributed the Lemonade routing fix (#161, issue #160) and the configurable `response_header_timeout` (#164, issue #163).
- @ShubhamTiwary914 - reported the need for auth against local backends (#132), which drove per-endpoint authentication.
- @petersimmons1972 - reported empty model listings for `type: "openai"` endpoints (#148).
- @RenxuLogan - reported newer Anthropic request fields breaking translation (#154), driving lenient parsing.
- @ccsmart - requested CORS support for browser-based clients (#156).
What's Changed Summary
- fix: openai endpoint type returning empty model listings by @thushan in #151
- fix: be lenient for anthropic parsing by @thushan in #157
- feat: Authentication for local backends by @thushan in #146
- feat: CORS implementation by @thushan in #159
- fix: route Lemonade models by mapping downloaded to available state by @matthewjhunter in #161
- fix: make proxy response_header_timeout configurable by @matthewjhunter in #164
- feat: configuration tunables by @thushan in #165
- feat: backend omlx by @thushan in #167
- ci: improve issues templates etc. by @thushan in #170
- fix: anthropic passthrough path for Docker Model Runner + align config defaults by @thushan in #171
- align documentation with the codebase for v0.0.28 by @thushan in #172
- feat: olla hardening June 2026 by @thushan in #173
- feat: olla mocking skills for claude by @thushan in #174
- rollup: June 2026 by @thushan in #176
New Contributors
- @matthewjhunter made their first contribution in #161
Full Changelog: v0.0.27...v0.0.28
Documentation: thushan.github.io/olla | Issues: github.com/thushan/olla/issues