Background
Currently, the global configuration contains static settings, including feature gates for semantic cache, prompt guard, reasoning families, and categories.
It also contains dynamic configuration, such as model configuration, endpoint configuration, and algorithms.
This proposal designs a semantic route resource as an abstraction for configuring dynamic routing information. It can be used locally and also as a CRD that is listed and watched in Kubernetes.
This would improve maintainability and UX considerably. In addition, experiments show that different models behave differently in reasoning mode, so introducing SemanticRoute will help optimize model-level routing, for example which categories are turned on or off.
Goals
- Semantic Route Design: design the next-generation routing abstraction for LLM semantic routing.
- Multiple Environments Support: work across multiple environments, such as local and Kubernetes.
- Easy to Extend: a pluggable filter design for matching requests, to support future iteration.
What came before?
Before LLMs, routing was essentially bound to network protocols, as in Gateway API:
- L7 protocol: HTTPRoute, GRPCRoute...
- L4 protocol: TCPRoute, UDPRoute...
For L7 protocols, taking HTTPRoute as an example, the main idea is to match a rule and route to specific backends.

Here is an example that routes to a backend by matching hostname, path, and header:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: foo-route
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "foo.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /login
    - headers:
      - type: Exact
        name: env
        value: canary
    backendRefs:
    - name: foo-svc
      port: 8080
This works well for traditional workloads, but it does not carry over well to LLM workloads.
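As a rough illustration of the rule matching in the HTTPRoute above, here is a minimal Python sketch. It follows Gateway API semantics, where entries under `matches` are ORed and the conditions inside one entry are ANDed. The names `Match`, `path_has_prefix`, and `route` are hypothetical and not part of any library, and header names are compared case-sensitively for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def path_has_prefix(path: str, prefix: str) -> bool:
    """PathPrefix matches whole path segments: /login matches /login/sso, not /loginx."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


@dataclass
class Match:
    """One entry under `matches`; every condition set on it must hold."""
    path_prefix: Optional[str] = None
    headers: Dict[str, str] = field(default_factory=dict)  # Exact matches only

    def accepts(self, path: str, headers: Dict[str, str]) -> bool:
        if self.path_prefix is not None and not path_has_prefix(path, self.path_prefix):
            return False
        return all(headers.get(name) == value for name, value in self.headers.items())


def route(hostname: str, path: str, headers: Dict[str, str],
          hostnames: List[str], matches: List[Match], backend: str) -> Optional[str]:
    """Return the backend if the hostname is accepted and any match entry accepts."""
    if hostname not in hostnames:
        return None
    if any(m.accepts(path, headers) for m in matches):
        return backend
    return None


# Mirrors foo-route: either the /login prefix or the env=canary header is enough.
MATCHES = [Match(path_prefix="/login"), Match(headers={"env": "canary"})]
print(route("foo.example.com", "/login/sso", {}, ["foo.example.com"], MATCHES, "foo-svc:8080"))
```

Everything the rule looks at is a protocol-level field, which is the point of contrast with the intent-based matching proposed here.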
What is now?
vLLM Semantic Router takes a very different view of routing to LLM workloads. It does not match protocol-bound elements such as headers, path, and hostname in HTTP, or srcIP and dstIP in TCP.
It aims to match the intent contained in the request, with an extensible filter chain for request control. Some SemanticRoute examples are listed below. Besides intent understanding, it also supports powerful filters such as:
- PIIDetection enables PII detection and filtering
- PromptGuard enables prompt security and jailbreak detection
- SemanticCache enables semantic caching for performance optimization
- ReasoningControl enables reasoning mode control
- ToolSelection enables automatic tool selection based on semantic similarity
Simple SemanticRoute with intent detection
Here is an example of what a SemanticRoute looks like: it matches the math and computer science categories and routes to an LLM model; if the match fails, it falls back to the default model:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: reasoning-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
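The match-or-fallback behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intent classifier is out of scope, so the detected category is passed in directly, and the names `ModelRef`, `Rule`, and `resolve` are hypothetical rather than part of the proposed API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelRef:
    model_name: str
    address: str
    port: int


@dataclass
class Rule:
    categories: List[str]      # from `intents[].category`
    model_refs: List[ModelRef]  # from `modelRefs`
    default_model: ModelRef     # from `defaultModel`


def resolve(rule: Rule, category: str) -> ModelRef:
    """Return the rule's model when the category matches, else the default model."""
    if category in rule.categories:
        return rule.model_refs[0]
    return rule.default_model


# Mirrors the reasoning-route example.
rule = Rule(
    categories=["computer science", "math"],
    model_refs=[ModelRef("gpt-oss", "127.0.0.1", 8080)],
    default_model=ModelRef("deepseek-v31", "127.0.0.1", 8088),
)
print(resolve(rule, "math").model_name)
print(resolve(rule, "history").model_name)
```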
Complex SemanticRoute within Filter Chain
Here is an example of what a SemanticRoute looks like with a filter chain: it matches the math and computer science categories and routes to an LLM model; if the match fails, it falls back to the default model. It also enables PIIDetection, PromptGuard, SemanticCache, ToolSelection, and reasoning control in this route:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: complex-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: PIIDetection
      allowByDefault: false
      pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
    - type: PromptGuard
      threshold: 0.7
    - type: SemanticCache
      similarityThreshold: 0.8
      maxEntries: 1000
      ttlSeconds: 3600
    - type: ToolSelection
      similarityThreshold: 0.8
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
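One way to picture a pluggable filter chain like the one above is a registry keyed by filter `type`, with the filters of a rule evaluated in order. The Python sketch below is an assumption-heavy illustration, not the proposed implementation: `FILTERS`, `register`, and `run_chain` are hypothetical names, the request fields `pii_types` and `jailbreak_score` stand in for the output of detectors that are out of scope here, and the reading of `allowByDefault` (deny every PII type not listed when it is false) is an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

Request = Dict[str, Any]
FilterFn = Callable[[Request, Dict[str, Any]], Tuple[bool, str]]

FILTERS: Dict[str, FilterFn] = {}  # filter type name -> implementation


def register(type_name: str) -> Callable[[FilterFn], FilterFn]:
    """Decorator that plugs a filter implementation into the registry."""
    def wrap(fn: FilterFn) -> FilterFn:
        FILTERS[type_name] = fn
        return fn
    return wrap


@register("PIIDetection")
def pii_detection(request: Request, config: Dict[str, Any]) -> Tuple[bool, str]:
    if config.get("allowByDefault", False):
        return True, ""
    blocked = set(request.get("pii_types", [])) - set(config.get("pii_types_allowed", []))
    if blocked:
        return False, "PII not allowed: " + ", ".join(sorted(blocked))
    return True, ""


@register("PromptGuard")
def prompt_guard(request: Request, config: Dict[str, Any]) -> Tuple[bool, str]:
    if request.get("jailbreak_score", 0.0) >= config["threshold"]:
        return False, "prompt rejected by PromptGuard"
    return True, ""


def run_chain(request: Request, filters: List[Dict[str, Any]]) -> Tuple[bool, str]:
    """Run filters in declaration order and stop at the first rejection."""
    for spec in filters:
        ok, reason = FILTERS[spec["type"]](request, spec)
        if not ok:
            return False, reason
    return True, ""


chain = [
    {"type": "PIIDetection", "allowByDefault": False,
     "pii_types_allowed": ["EMAIL_ADDRESS", "PERSON"]},
    {"type": "PromptGuard", "threshold": 0.7},
]
print(run_chain({"pii_types": ["EMAIL_ADDRESS"], "jailbreak_score": 0.1}, chain))
```

Adding a new filter type then only requires registering another function, which is the extensibility the goals call for.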
Multiple SemanticRoute
Here is an example of what multiple SemanticRoutes look like: one matches the math and computer science categories and routes to a powerful LLM model with reasoning on; the other matches the other and creative categories and routes to a lightweight LLM model with reasoning off.
Non-reasoning Route:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: lightweight-route
spec:
  rules:
  - intents:
    - category: "creative"
    - category: "other"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: false
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
Reasoning Route:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: reasoning-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
Mixed Reasoning SemanticRoute
Here is an example of what a single SemanticRoute with mixed reasoning looks like: one rule matches the math and computer science categories and routes to a powerful LLM model with reasoning on; another rule matches the other and creative categories and routes to a lightweight LLM model with reasoning off:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: lightweight-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
  - intents:
    - category: "creative"
    - category: "other"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: false
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
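With several rules in one route, evaluation has to pick a rule and then read its ReasoningControl setting. The sketch below shows one possible reading in Python, using plain dictionaries that mirror the YAML. It assumes first-match-wins ordering and returns `None` when no rule matches, since the proposal does not yet say which `defaultModel` applies in that case; `select` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Optional, Tuple

RULES: List[Dict[str, Any]] = [
    {
        "intents": [{"category": "computer science"}, {"category": "math"}],
        "modelRefs": [{"modelName": "gpt-oss", "port": 8080, "address": "127.0.0.1"}],
        "filters": [{"type": "ReasoningControl", "reasonFamily": "gpt-oss",
                     "enableReasoning": True}],
    },
    {
        "intents": [{"category": "creative"}, {"category": "other"}],
        "modelRefs": [{"modelName": "gpt-oss", "port": 8080, "address": "127.0.0.1"}],
        "filters": [{"type": "ReasoningControl", "reasonFamily": "gpt-oss",
                     "enableReasoning": False}],
    },
]


def select(rules: List[Dict[str, Any]], category: str) -> Optional[Tuple[str, bool]]:
    """Return (model name, reasoning enabled) for the first rule matching the category."""
    for rule in rules:
        if any(intent["category"] == category for intent in rule["intents"]):
            # Reasoning stays off unless a ReasoningControl filter turns it on.
            reasoning = next(
                (f["enableReasoning"] for f in rule.get("filters", [])
                 if f["type"] == "ReasoningControl"),
                False,
            )
            return rule["modelRefs"][0]["modelName"], reasoning
    return None  # assumption: fallback handling is left to the caller


print(select(RULES, "math"))
print(select(RULES, "creative"))
```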
Multiple Weighted ModelRef SemanticRoute
Here is an example of a SemanticRoute with multiple weighted ModelRefs:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRoute
metadata:
  name: lightweight-route
spec:
  rules:
  - intents:
    - category: "computer science"
    - category: "math"
    modelRefs:
    - modelName: gpt-oss
      port: 8080
      address: 127.0.0.1
      weight: 80
    - modelName: qwen3
      port: 8089
      address: 127.0.0.1
      weight: 20
    filters:
    - type: ReasoningControl
      reasonFamily: gpt-oss
      enableReasoning: true
    defaultModel:
      modelName: deepseek-v31
      port: 8088
      address: 127.0.0.1
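The `weight` fields above suggest a proportional split of traffic between the referenced models. A minimal Python sketch of weighted selection follows. It assumes weights are relative shares (80 and 20 give roughly an 80/20 split) and that a missing weight counts as 1; `pick_model` is a hypothetical helper, and the proposal does not specify the actual selection algorithm.

```python
import random
from collections import Counter
from typing import Any, Dict, List, Optional

MODEL_REFS: List[Dict[str, Any]] = [
    {"modelName": "gpt-oss", "port": 8080, "address": "127.0.0.1", "weight": 80},
    {"modelName": "qwen3", "port": 8089, "address": "127.0.0.1", "weight": 20},
]


def pick_model(model_refs: List[Dict[str, Any]],
               rng: Optional[random.Random] = None) -> Dict[str, Any]:
    """Pick one model reference at random, in proportion to its weight."""
    rng = rng or random.Random()
    weights = [ref.get("weight", 1) for ref in model_refs]
    return rng.choices(model_refs, weights=weights, k=1)[0]


# A seeded generator keeps the demonstration reproducible.
rng = random.Random(0)
counts = Counter(pick_model(MODEL_REFS, rng)["modelName"] for _ in range(10_000))
print(counts)
```

Random selection per request is stateless, which keeps it simple to run in both local and Kubernetes deployments. A deterministic scheme such as hashing a session key would be an alternative if requests from one conversation should stick to one model.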
Implementation
- Semantic Route API Design
- Add Local Support
- Add Kubernetes Support
- Add E2E Test
- Add User Facing Docs