Release Olla v0.0.28 · thushan/olla

Olla is a high-performance proxy and load balancer for LLM infrastructure.

Quick Start

# Docker
docker pull ghcr.io/thushan/olla:v0.0.28

# Binary (see assets below)
./olla --config config.yaml

What's New in v0.0.28

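This is a large release: new backends, per-endpoint authentication, opt-in CORS, more robust Anthropic translation, new configuration tunables and a long list of reliability fixes.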

New Backends & Compatibility

Native oMLX support - added the oMLX runtime as a first-class backend with native Anthropic Messages API passthrough (#167).
Docker Model Runner Anthropic passthrough - passthrough now resolves the native messages path from the backend profile (anthropic_support.messages_path), fixing 404s against DMR's /anthropic/v1/messages (#171).
Lemonade models are now routable - Lemonade's downloaded flag is mapped to available state, so downloaded models route correctly instead of being treated as unhealthy. Thanks to @matthewjhunter for the report and fix (#161, issue #160).
OpenAI alias fixes - type: "openai" now resolves model listings correctly via /olla/openai/v1/models; removed the duplicate openai.yaml profile that caused non-deterministic route registration. openai-compatible is the canonical value; openai remains an accepted alias. Thanks to @petersimmons1972 for the report (#151, issue #148).
OpenAI-compatible backends now surface maximum context-length information (#167).

Authentication & Security

Highly requested feature finally implemented!

Per-endpoint authentication for local backends - bearer, api_key and basic auth, with credentials from inline strings, ${ENV_VAR}, or _file siblings for Docker/k8s. Works with vLLM/llama.cpp/LiteLLM --api-key or any bearer-auth reverse proxy. Thanks to @ShubhamTiwary914 for the report (#146, issue #132).
401/403 health probes now report config_error instead of dead; 429 honours Retry-After without tripping the circuit breaker; POST retries are skipped once response bytes have flushed (no double billing on mid-stream resets).
Capped upstream error bodies at 1 MiB and enforced max_body_size on non-proxy and translator routes (including chunked bodies) (#173).
Strip inbound X-Olla-* headers from upstream responses (anti-spoofing), sanitise X-Request-ID to prevent log injection, and stop leaking the auth scheme in status JSON (#173).

CORS

CORS support (opt-in, via rs/cors) for browser-based clients like OpenWebUI and dashboards. Off by default; auto-exposes the X-Olla-* header set. Thanks to @ccsmart for the request (#159, issue #156).

Anthropic / Translation

Reasoning → thinking translation - OpenAI reasoning/reasoning_content now maps to Anthropic thinking blocks in both streaming and non-streaming responses (#176).
Hardened streaming translation: preserves final usage chunks, synthesises fallback output_tokens, keeps same-chunk reasoning/content/tool-call deltas intact, and fixes repeated/interleaved tool-call handling (#176).
More reliable translated SSE: emits valid SSE error events after response commit, with proper Flush/Unwrap on wrapped writers (#176).
Lenient Anthropic parsing - unknown/experimental Anthropic request fields (e.g. context_management) are now ignored during translation and forwarded unchanged in passthrough mode. Thanks to @RenxuLogan for the report (#157, issue #154).
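
As a rough picture of the reasoning → thinking mapping for a non-streaming response, the sketch below converts an OpenAI-style message into Anthropic-style content blocks. The struct and function names are hypothetical, the precedence of reasoning_content over reasoning is an assumption, and details such as the thinking block's signature field are left out.

```go
package main

// openAIMessage is a hypothetical, trimmed-down view of an OpenAI-style
// assistant message. Backends differ in which reasoning field they emit.
type openAIMessage struct {
	Content          string `json:"content"`
	Reasoning        string `json:"reasoning,omitempty"`
	ReasoningContent string `json:"reasoning_content,omitempty"`
}

// anthropicBlock is a simplified Anthropic content block covering only the
// "thinking" and "text" types used in this sketch.
type anthropicBlock struct {
	Type     string `json:"type"`
	Thinking string `json:"thinking,omitempty"`
	Text     string `json:"text,omitempty"`
}

// toAnthropicBlocks emits a thinking block (when reasoning is present)
// followed by a text block (when content is present).
// Assumption: reasoning_content wins when both reasoning fields are set.
func toAnthropicBlocks(m openAIMessage) []anthropicBlock {
	var blocks []anthropicBlock
	reasoning := m.ReasoningContent
	if reasoning == "" {
		reasoning = m.Reasoning
	}
	if reasoning != "" {
		blocks = append(blocks, anthropicBlock{Type: "thinking", Thinking: reasoning})
	}
	if m.Content != "" {
		blocks = append(blocks, anthropicBlock{Type: "text", Text: m.Content})
	}
	return blocks
}
```

The streaming case is more involved, since reasoning, content and tool-call deltas can arrive in the same chunk; this sketch does not attempt to cover it.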

Configuration Tunables

New tunables, all zero-value-means-default and OLLA_-overridable (#164, #165):
server.read_header_timeout (default 10s) - guards against Slowloris-style slow-header attacks.
proxy.response_header_timeout (default 30s) - raise for on-demand model loaders (e.g. Lemonade) that would otherwise abort cold starts at 30s. Now honoured by both Olla and Sherpa. Thanks to @matthewjhunter (#164, issue #163).
proxy.connection_keep_alive (default 30s) and proxy.tls_handshake_timeout (default 10s).
model_registry.unification.cleanup_interval / stale_threshold are now honoured (previously silently ignored, hard-coded to 5m) (#171).

Reliability & Performance

Olla config is now an atomic.Pointer, snapshotted once per request/stream, so a hot-reload can't hand an in-flight request a torn config (#173, #176).
Endpoint pools and circuit breakers use LoadOrCompute, eliminating racing transport rebuilds on first use (#173).
Cleanup loops survive per-tick panics; fixed event-bus shutdown and health-checker double-stop races; proxy engine cleanup now runs on shutdown (#176).
litepool.Get returns errors instead of panicking (handles nil and typed-nil factory results) (#176).
Hot-path tuning: typed-struct streaming chunk parsing and pooled SSE encoding buffers (#176).
Removed dead code in the Olla engine (unused methods/pools, a forced runtime.GC(), an unread request counter) (#173).
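
The config snapshot idea in the first item can be illustrated with Go's sync/atomic Pointer type. This is a minimal sketch under assumed names (config, configStore, handle); it shows the pattern, not Olla's actual types.

```go
package main

import "sync/atomic"

// config is a hypothetical, immutable configuration value. A reload
// publishes a new *config rather than mutating the existing one.
type config struct {
	Engine      string
	MaxBodySize int64
}

// configStore holds the current configuration behind an atomic pointer.
type configStore struct {
	current atomic.Pointer[config]
}

// Reload atomically publishes a new configuration.
func (s *configStore) Reload(c *config) { s.current.Store(c) }

// Snapshot returns the configuration that is current at call time.
func (s *configStore) Snapshot() *config { return s.current.Load() }

// handle takes one snapshot at the start of a request and uses only that
// snapshot afterwards. duringRequest stands in for work during which a
// hot-reload may happen; both returned values still come from the same
// configuration, so the request never sees a mix of old and new fields.
func handle(s *configStore, duringRequest func()) (string, int64) {
	cfg := s.Snapshot()
	engine := cfg.Engine
	duringRequest()
	return engine, cfg.MaxBodySize
}
```

The key design point is that each published config is never modified after Store, so readers need no lock and a single Load per request is enough.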

Behaviour Changes to be aware of

/internal/status/models now returns endpoint names for every entry (was inconsistently URL-then-names). Consumers reading endpoints[0] as a URL must switch to names (#173).
Upstream X-Olla-* response headers are dropped, so a chained Olla-in-front-of-Olla setup loses the inner instance's headers (#173).
Default proxy engine is now Olla and the default load balancer is least-connections (#172).
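
For readers unfamiliar with the new default balancer, least-connections selection can be sketched as below. The endpoint type, its fields and the tie-breaking rule (first endpoint wins) are assumptions for illustration; Olla's balancer may also weigh health or priority.

```go
package main

import "sync/atomic"

// endpoint is a hypothetical backend with a counter of in-flight requests.
type endpoint struct {
	Name   string
	active atomic.Int64
}

// pickLeastConnections returns the endpoint with the fewest in-flight
// requests, or nil when the slice is empty. Ties go to the earliest entry.
func pickLeastConnections(eps []*endpoint) *endpoint {
	var best *endpoint
	for _, ep := range eps {
		if best == nil || ep.active.Load() < best.active.Load() {
			best = ep
		}
	}
	return best
}
```

A caller would increment active on the chosen endpoint before proxying and decrement it when the response completes, so long-running streams naturally steer new requests elsewhere.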

Tooling & Docs

New validation harness (/olla-validate --quick / --nightly) with a local multi-protocol mock backend and fault injection - no Docker, CI-safe (#174).
Standardised GitHub issue/PR templates (#170).
Documentation refresh aligning all guides, headers and config examples with the codebase (#172).
Fixed /version self-path reporting (previously reported a 404'ing /internal/version) (#171).

Acknowledgements

Thanks to everyone who reported issues and contributed to this release:

@matthewjhunter - contributed the Lemonade routing fix (#161, issue #160) and the configurable response_header_timeout (#164, issue #163).
@ShubhamTiwary914 - reported the need for auth against local backends (#132), which drove per-endpoint authentication.
@petersimmons1972 - reported empty model listings for type: "openai" endpoints (#148).
@RenxuLogan - reported newer Anthropic request fields breaking translation (#154), driving lenient parsing.
@ccsmart - requested CORS support for browser-based clients (#156).

What's Changed Summary

fix: openai endpoint type returning empty model listings by @thushan in #151
fix: be lenient for anthropic parsing by @thushan in #157
feat: Authentication for local backends by @thushan in #146
feat: CORS implementation by @thushan in #159
fix: route Lemonade models by mapping downloaded to available state by @matthewjhunter in #161
fix: make proxy response_header_timeout configurable by @matthewjhunter in #164
feat: configuration tunables by @thushan in #165
feat: backend omlx by @thushan in #167
ci: improve issues templates etc. by @thushan in #170
fix: anthropic passthrough path for Docker Model Runner + align config defaults by @thushan in #171
align documentation with the codebase for v0.0.28 by @thushan in #172
feat: olla hardening June 2026 by @thushan in #173
feat: olla mocking skills for claude by @thushan in #174
rollup: June 2026 by @thushan in #176

New Contributors

@matthewjhunter made their first contribution in #161

Full Changelog: v0.0.27...v0.0.28

Documentation: thushan.github.io/olla | Issues: github.com/thushan/olla/issues
