
v0.3.1

@raffaelschneider tagged this 07 Jan 19:14

This release adds first-class support for LLM/AI inference traffic:

- Rate limit by tokens-per-minute instead of requests-per-second
- Dual limiting (tokens + requests) for comprehensive protection
- Token estimation from the request body, with a refund based on the actual token count in the response
- Support for OpenAI, Anthropic, and generic providers
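
The estimate-then-refund flow can be sketched as a token bucket. This is a minimal illustration, not Sentinel's implementation: the `TokenBucket` class, the `estimate_tokens` helper, and the roughly-4-characters-per-token heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def estimate_tokens(body: str) -> int:
    # Hypothetical heuristic: roughly 4 characters per token.
    return max(1, len(body) // 4)


@dataclass
class TokenBucket:
    """Illustrative bucket refilled at tokens_per_minute, capped at burst_tokens."""
    tokens_per_minute: float
    burst_tokens: float
    available: float = field(default=0.0, init=False)
    last_refill: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.available = self.burst_tokens  # start with a full bucket

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_refill)
        refill = elapsed * self.tokens_per_minute / 60.0
        self.available = min(self.burst_tokens, self.available + refill)
        self.last_refill = now

    def try_reserve(self, estimated: int, now: float) -> bool:
        """Charge the estimated cost up front; reject if the bucket is short."""
        self._refill(now)
        if estimated > self.available:
            return False
        self.available -= estimated
        return True

    def settle(self, estimated: int, actual: int) -> None:
        """Refund an over-estimate (or charge an under-estimate) once the
        response reports the actual token count."""
        self.available = min(self.burst_tokens,
                             self.available + (estimated - actual))


bucket = TokenBucket(tokens_per_minute=100_000, burst_tokens=20_000)
est = estimate_tokens("x" * 8_000)        # 2000 estimated tokens
allowed = bucket.try_reserve(est, now=0.0)
bucket.settle(est, actual=1_500)          # 500 tokens go back into the bucket
```

Passing `now` explicitly keeps the sketch deterministic; a real limiter would read a monotonic clock and would also need to be safe under concurrent requests.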

- New `least_tokens_queued` load balancing algorithm optimized for inference workloads
  - Routes to the target with the lowest estimated queue time
  - Tracks queued tokens and a tokens-per-second EWMA per target
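
One way to picture the selection is sketched below. It assumes the estimated queue time is queued tokens divided by the EWMA throughput; the release notes do not state the exact formula. The `Target` class, `pick_target`, and the smoothing factor `alpha` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Target:
    name: str
    queued_tokens: int = 0
    tokens_per_second: float = 1.0  # EWMA of observed throughput (assumed seed value)

    def observe(self, tokens: int, seconds: float, alpha: float = 0.2) -> None:
        """Fold one completed request into the throughput EWMA."""
        sample = tokens / seconds
        self.tokens_per_second = alpha * sample + (1 - alpha) * self.tokens_per_second

    def estimated_queue_time(self) -> float:
        # Assumed formula: seconds needed to drain the current queue.
        return self.queued_tokens / max(self.tokens_per_second, 1e-9)


def pick_target(targets: Iterable[Target]) -> Target:
    """Return the target with the lowest estimated queue time."""
    return min(targets, key=Target.estimated_queue_time)


slow = Target("slow", queued_tokens=4_000, tokens_per_second=100.0)  # about 40 s
fast = Target("fast", queued_tokens=6_000, tokens_per_second=300.0)  # about 20 s
chosen = pick_target([slow, fast])
```

In this example `fast` is chosen even though it has more tokens queued, because its higher throughput drains the queue sooner. That is the difference from a plain least-connections or least-queued policy.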

- New `inference` health check type for LLM backends
  - Probes the `/v1/models` endpoint (configurable)
  - Optional model availability verification
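
As a rough sketch of what such a probe might check, the function below evaluates a `/v1/models` response. It assumes an OpenAI-style body of the form `{"data": [{"id": ...}]}`. The function name, its parameters, and the model names are hypothetical and not part of Sentinel's API.

```python
import json
from typing import Optional


def inference_backend_healthy(status: int, body: str,
                              required_model: Optional[str] = None) -> bool:
    """Decide health from a /v1/models response (illustrative only)."""
    if status != 200:
        return False
    if required_model is None:
        return True  # reachability alone counts as healthy
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    models = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(models, list):
        return False
    # Healthy only if the required model is listed by the backend.
    return any(isinstance(m, dict) and m.get("id") == required_model
               for m in models)


sample = '{"object": "list", "data": [{"id": "model-a"}, {"id": "model-b"}]}'
ok = inference_backend_healthy(200, sample, required_model="model-a")
missing = inference_backend_healthy(200, sample, required_model="model-c")
```

Keeping the decision separate from the HTTP call makes the check easy to test; the actual probe would issue a GET to the configured endpoint and pass the status and body in.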

- New `service-type "inference"` for routes
- `inference {}` block for provider, rate limiting, and routing config
- Full KDL parsing support

```kdl
route "llm-api" {
    service-type "inference"
    upstream "llm-pool"

    inference {
        provider "openai"
        rate-limit {
            tokens-per-minute 100000
            burst-tokens 20000
        }
    }
}

upstream "llm-pool" {
    load-balancing "least_tokens_queued"
    health-check {
        type "inference" {
            endpoint "/v1/models"
        }
    }
}
```

Full documentation: https://sentinel.raskell.io/v/26.01/configuration/inference/