This release adds first-class support for LLM/AI inference traffic:
- Token-based rate limiting
  - Rate limit by tokens-per-minute instead of requests-per-second
  - Dual limiting (tokens + requests) for comprehensive protection
  - Token estimation from the request body, with a refund based on the actual token count in the response
  - Support for OpenAI, Anthropic, and generic providers
- New `least_tokens_queued` load balancing algorithm optimized for inference workloads
  - Routes to the target with the lowest estimated queue time
  - Tracks queued tokens and a tokens-per-second EWMA per target
- New `inference` health check type for LLM backends
  - Probes the `/v1/models` endpoint (configurable)
  - Optional model availability verification
- New `service-type "inference"` for routes
  - `inference {}` block for provider, rate limiting, and routing config
  - Full KDL parsing support
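
The estimate-then-refund flow described in the rate limiting bullets can be sketched roughly as follows. This is an illustrative model, not Sentinel's implementation: the `TokenBucket` class, the `estimate_tokens` helper, the 4-characters-per-token heuristic, and the reading of `burst-tokens` as bucket capacity are all assumptions made for the example.

```python
class TokenBucket:
    """Hypothetical token bucket: refills at tokens_per_minute, capped at burst_tokens."""

    def __init__(self, tokens_per_minute: int, burst_tokens: int, now: float = 0.0):
        self.rate_per_second = tokens_per_minute / 60.0
        self.capacity = float(burst_tokens)   # assumption: burst-tokens is the bucket size
        self.available = float(burst_tokens)  # start with a full bucket
        self.last_refill = now

    def _refill(self, now: float) -> None:
        # Add tokens for the elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.available = min(self.capacity, self.available + elapsed * self.rate_per_second)
        self.last_refill = now

    def try_acquire(self, estimated: int, now: float) -> bool:
        """Reserve the estimated token cost before forwarding the request."""
        self._refill(now)
        if estimated > self.available:
            return False
        self.available -= estimated
        return True

    def settle(self, estimated: int, actual: int) -> None:
        """Refund the over-estimate (or charge the under-estimate) once the
        response reports actual usage. The balance may go negative if the
        estimate was too low; later refills pay that back."""
        self.available = min(self.capacity, self.available + (estimated - actual))


def estimate_tokens(body: str) -> int:
    # Assumed rule of thumb: roughly 4 characters per token.
    return max(1, len(body) // 4)
```

With the limits used in this release's example configuration (`tokens-per-minute 100000`, `burst-tokens 20000`), a request estimated at 15,000 tokens that actually consumes 9,000 would get 6,000 tokens back after `settle`, so a follow-up request is less likely to be rejected because of an over-estimate.
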
```kdl
route "llm-api" {
    service-type "inference"
    upstream "llm-pool"
    inference {
        provider "openai"
        rate-limit {
            tokens-per-minute 100000
            burst-tokens 20000
        }
    }
}

upstream "llm-pool" {
    load-balancing "least_tokens_queued"
    health-check {
        type "inference" {
            endpoint "/v1/models"
        }
    }
}
```
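
To make the `least_tokens_queued` routing idea concrete, here is a minimal sketch of choosing the target with the lowest estimated queue time, computed as queued tokens divided by a tokens-per-second EWMA. The `Target` class, the `pick_target` function, and the smoothing factor `alpha = 0.2` are hypothetical names and values chosen for illustration; the release notes do not specify how Sentinel computes these internally.

```python
class Target:
    """Hypothetical per-target state: queued tokens plus a throughput EWMA."""

    def __init__(self, name: str, initial_tps: float, alpha: float = 0.2):
        self.name = name
        self.queued_tokens = 0
        self.tps_ewma = float(initial_tps)  # tokens per second, smoothed
        self.alpha = alpha                  # assumed EWMA smoothing factor

    def estimated_queue_time(self) -> float:
        # Seconds needed to drain the queue at the current smoothed throughput.
        return self.queued_tokens / max(self.tps_ewma, 1e-9)

    def on_dispatch(self, tokens: int) -> None:
        self.queued_tokens += tokens

    def on_complete(self, tokens: int, seconds: float) -> None:
        # Remove the finished work and fold the observed rate into the EWMA.
        self.queued_tokens = max(0, self.queued_tokens - tokens)
        if seconds > 0:
            sample = tokens / seconds
            self.tps_ewma = self.alpha * sample + (1 - self.alpha) * self.tps_ewma


def pick_target(targets):
    """Return the target with the lowest estimated queue time."""
    return min(targets, key=lambda t: t.estimated_queue_time())
```

Dividing by throughput rather than comparing raw queue depth means a fast backend with a longer queue can still be preferred over a slow backend with a shorter one, which is presumably the motivation for tracking both values per target.
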
Full documentation: https://sentinel.raskell.io/v/26.01/configuration/inference/
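
As a rough illustration of the `inference` health check with optional model verification, the sketch below evaluates a probe result from `/v1/models`. It assumes the OpenAI-style response shape (`{"object": "list", "data": [{"id": ...}]}`); other providers may return a different structure. The `is_healthy` function and its parameters are hypothetical and not part of Sentinel's API.

```python
import json
from typing import Optional


def is_healthy(status_code: int, body: str, required_model: Optional[str] = None) -> bool:
    """Decide backend health from a /v1/models probe (illustrative only)."""
    if status_code != 200:
        return False
    if required_model is None:
        # No model verification requested: a 200 response is enough.
        return True
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    models = payload.get("data", [])
    if not isinstance(models, list):
        return False
    # Healthy only if the required model id appears in the listing.
    return any(isinstance(m, dict) and m.get("id") == required_model for m in models)
```

A backend that answers the probe but no longer lists the expected model would be marked unhealthy under this logic, which is the failure case that model availability verification is meant to catch.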