Olla v0.0.28

@github-actions released this 14 Jun 00:03
e1e4442

Olla is a high-performance proxy and load balancer for LLM infrastructure.

Quick Start

# Docker
docker pull ghcr.io/thushan/olla:v0.0.28

# Binary (see assets below)
./olla --config config.yaml

What's New in v0.0.28

This is a large release: new backends, per-endpoint authentication, opt-in CORS, more robust Anthropic translation, new configuration tunables, and many bug fixes.

New Backends & Compatibility

  • Native oMLX support - added the oMLX runtime as a first-class backend with native Anthropic Messages API passthrough (#167).
  • Docker Model Runner Anthropic passthrough - passthrough now resolves the native messages path from the backend profile (anthropic_support.messages_path), fixing 404s against DMR's /anthropic/v1/messages (#171).
  • Lemonade models are now routable - Lemonade's downloaded flag is mapped to available state, so downloaded models route correctly instead of being treated as unhealthy. Thanks to @matthewjhunter for the report and fix (#161, issue #160).
  • OpenAI alias fixes - type: "openai" now resolves model listings correctly via /olla/openai/v1/models, and the duplicate openai.yaml profile that caused non-deterministic route registration has been removed. openai-compatible is the canonical value;
    openai remains an accepted alias. Thanks to @petersimmons1972 for the report (#151, issue #148).
  • OpenAI-compatible backends now surface maximum context-length information (#167).

Authentication & Security

Per-endpoint authentication, a highly requested feature, is now implemented.

  • Per-endpoint authentication for local backends - bearer, api_key and basic auth, with credentials from inline strings, ${ENV_VAR}, or _file siblings for Docker/k8s. Works with vLLM/llama.cpp/LiteLLM --api-key or any bearer-auth reverse
    proxy. Thanks to @ShubhamTiwary914 for the report (#146, issue #132).
  • 401/403 health probes now report config_error instead of dead; 429 honours Retry-After without tripping the circuit breaker; POST retries are skipped once response bytes have flushed (no double billing on mid-stream resets).
  • Capped upstream error bodies at 1 MiB and enforced max_body_size on non-proxy and translator routes (including chunked bodies) (#173).
  • Stripped inbound X-Olla-* headers from upstream responses (anti-spoofing), sanitised X-Request-ID to prevent log injection, and stopped leaking the auth scheme in status JSON (#173).
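
The probe behaviour described above can be sketched roughly as follows. This is an illustration, not Olla's actual code. Only config_error comes from these notes; the other state names, the ProbeResult type, and the choice not to count 401/403 against the circuit breaker are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    state: str                       # "config_error" is from the notes; others are hypothetical
    trips_breaker: bool              # whether this result counts as a circuit-breaker failure
    retry_after_s: Optional[float] = None


def parse_retry_after(value, now=None):
    """Return the delay in seconds requested by a Retry-After header, or None."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():              # delta-seconds form, e.g. "120"
        return float(value)
    try:                             # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def classify_probe(status, retry_after=None):
    """Map a health-probe HTTP status to a ProbeResult."""
    if status in (401, 403):
        # Bad or missing credentials: the backend is alive but misconfigured.
        # Not tripping the breaker here is an assumption of this sketch.
        return ProbeResult("config_error", trips_breaker=False)
    if status == 429:
        # Rate limited: wait as asked, without tripping the circuit breaker.
        return ProbeResult("rate_limited", False, parse_retry_after(retry_after))
    if 200 <= status < 300:
        return ProbeResult("healthy", trips_breaker=False)
    return ProbeResult("unhealthy", trips_breaker=True)
```

A caller would schedule the next probe after retry_after_s when it is set, and fall back to its normal interval otherwise.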

CORS

  • CORS support (opt-in, via rs/cors) for browser-based clients like OpenWebUI and dashboards. Off by default; auto-exposes the X-Olla-* header set. Thanks to @ccsmart for the request (#159, issue #156).

Anthropic / Translation

  • Reasoning → thinking translation - OpenAI reasoning/reasoning_content now maps to Anthropic thinking blocks in both streaming and non-streaming responses (#176).
  • Hardened streaming translation: preserves final usage chunks, synthesises fallback output_tokens, keeps same-chunk reasoning/content/tool-call deltas intact, and fixes repeated/interleaved tool-call handling (#176).
  • More reliable translated SSE: emits valid SSE error events after response commit, with proper Flush/Unwrap on wrapped writers (#176).
  • Lenient Anthropic parsing - unknown/experimental Anthropic request fields (e.g. context_management) are now ignored during translation and forwarded unchanged in passthrough mode. Thanks to @RenxuLogan for the report (#157, issue #154).
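
As a rough illustration of the reasoning to thinking mapping for a non-streaming response, the sketch below converts one OpenAI-style assistant message into Anthropic-style content blocks. The function name is hypothetical and this is not Olla's translator. Real Anthropic thinking blocks also carry a signature field, which this sketch omits.

```python
def openai_message_to_anthropic_content(message):
    """Convert one OpenAI-style assistant message dict into Anthropic content blocks."""
    blocks = []
    # OpenAI-compatible backends differ on the field name, so accept either.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    if reasoning:
        blocks.append({"type": "thinking", "thinking": reasoning})
    text = message.get("content")
    if text:
        blocks.append({"type": "text", "text": text})
    return blocks
```

The thinking block is emitted before the text block, so the reasoning precedes the answer it produced.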

Configuration Tunables

  • New tunables, all zero-value-means-default and OLLA_-overridable (#164, #165):
      • server.read_header_timeout (default 10s) - guards against Slowloris-style slow-header attacks.
      • proxy.response_header_timeout (default 30s) - raise for on-demand model loaders (e.g. Lemonade) that would otherwise abort cold starts at 30s. Now honoured by both Olla and Sherpa. Thanks to @matthewjhunter (#164, issue #163).
      • proxy.connection_keep_alive (default 30s) and proxy.tls_handshake_timeout (default 10s).
  • model_registry.unification.cleanup_interval / stale_threshold are now honoured (previously silently ignored, hard-coded to 5m) (#171).
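
One way to read "zero-value-means-default and OLLA_-overridable" is sketched below. The precedence order (environment, then config file, then default), the environment variable naming, and the flat config dictionary are all assumptions of this example. Only the tunable names and default values come from the list above.

```python
import os
import re

DEFAULTS = {
    "server.read_header_timeout": "10s",
    "proxy.response_header_timeout": "30s",
    "proxy.connection_keep_alive": "30s",
    "proxy.tls_handshake_timeout": "10s",
}

_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse a simple single-unit duration such as '30s' or '5m' into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text.strip())
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def env_name(key):
    # Hypothetical mapping: "proxy.response_header_timeout"
    # becomes "OLLA_PROXY_RESPONSE_HEADER_TIMEOUT".
    return "OLLA_" + key.replace(".", "_").upper()


def resolve_timeout(key, config, env=os.environ):
    """Return the first non-zero value from the environment or config, else the default."""
    for raw in (env.get(env_name(key)), config.get(key)):
        if raw:
            seconds = parse_duration(raw)
            if seconds > 0:          # a zero value means "use the default"
                return seconds
    return parse_duration(DEFAULTS[key])
```

For example, a deployment with slow cold starts could set the hypothetical OLLA_PROXY_RESPONSE_HEADER_TIMEOUT variable to 2m without editing the config file.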

Reliability & Performance

  • Olla config is now an atomic.Pointer, snapshotted once per request/stream, so a hot-reload can't hand an in-flight request a torn config (#173, #176).
  • Endpoint pools and circuit breakers use LoadOrCompute, eliminating racing transport rebuilds on first use (#173).
  • Cleanup loops survive per-tick panics; fixed event-bus shutdown and health-checker double-stop races; proxy engine cleanup now runs on shutdown (#176).
  • litepool.Get returns errors instead of panicking (handles nil and typed-nil factory results) (#176).
  • Hot-path tuning: typed-struct streaming chunk parsing and pooled SSE encoding buffers (#176).
  • Removed dead code in the Olla engine (unused methods/pools, a forced runtime.GC(), an unread request counter) (#173).
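
The snapshot-once-per-request idea behind the atomic.Pointer change can be illustrated in Python. This is a conceptual sketch rather than a port of the Go code: ConfigHolder, ProxyConfig, and handle_request are hypothetical names, and a lock stands in for the atomic pointer.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyConfig:
    # Immutable, so a snapshot can never change underneath a request.
    response_header_timeout_s: float
    load_balancer: str


class ConfigHolder:
    """Holds the current config; a reload swaps the whole object at once."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._current = initial

    def load(self):
        with self._lock:
            return self._current

    def store(self, new_config):
        with self._lock:
            self._current = new_config


def handle_request(holder, reload_midway=None):
    cfg = holder.load()              # take the snapshot once, up front
    if reload_midway is not None:
        holder.store(reload_midway)  # simulate a hot reload during the request
    # Both values come from the same snapshot, never a mix of old and new.
    return cfg.load_balancer, cfg.response_header_timeout_s
```

Reading the holder a second time mid-request is what would produce a torn view; keeping a single local reference avoids that.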

Behaviour Changes to be aware of

  • /internal/status/models now returns endpoint names for every entry (was inconsistently URL-then-names). Consumers reading endpoints[0] as a URL must switch to names (#173).
  • Upstream X-Olla-* response headers are dropped, so a chained Olla-in-front-of-Olla setup loses the inner instance's headers (#173).
  • Default proxy engine is now Olla and the default load balancer is least-connections (#172).
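
For readers unfamiliar with the new default balancer, least-connections selection in general can be sketched as below. This shows the generic strategy, not Olla's implementation. The function name and the alphabetical tie-break are choices made for this example.

```python
def pick_least_connections(active):
    """Return the endpoint name with the fewest in-flight requests.

    `active` maps endpoint name to its current number of open connections.
    Ties are broken alphabetically so the result is deterministic.
    """
    if not active:
        raise ValueError("no endpoints available")
    return min(sorted(active), key=lambda name: active[name])
```

A proxy would increment the chosen endpoint's count when a request starts and decrement it when the response completes, so slower backends naturally receive fewer new requests.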

Tooling & Docs

  • New validation harness (/olla-validate --quick / --nightly) with a local multi-protocol mock backend and fault injection - no Docker, CI-safe (#174).
  • Standardised GitHub issue/PR templates (#170).
  • Documentation refresh aligning all guides, headers and config examples with the codebase (#172).
  • Fixed /version self-path reporting (previously reported a 404'ing /internal/version) (#171).

Acknowledgements

Thanks to everyone who reported issues and contributed to this release:

  • @matthewjhunter - contributed the Lemonade routing fix (#161, issue #160) and the configurable response_header_timeout (#164, issue #163).
  • @ShubhamTiwary914 - reported the need for auth against local backends (#132), which drove per-endpoint authentication.
  • @petersimmons1972 - reported empty model listings for type: "openai" endpoints (#148).
  • @RenxuLogan - reported newer Anthropic request fields breaking translation (#154), driving lenient parsing.
  • @ccsmart - requested CORS support for browser-based clients (#156).

Full Changelog: v0.0.27...v0.0.28


Documentation: thushan.github.io/olla | Issues: github.com/thushan/olla/issues