Systematic testing of llm-d inference features across two deployment modes: aggregated (multi-replica with intelligent routing) and prefill/decode disaggregated (NIXL KV cache transfer with GPU time-slicing). All testing performed on an NVIDIA GH200 running k3s.
| Feature | Mode | Status | Details |
|---|---|---|---|
| Prefix-cache-aware routing | Aggregated | Pass -- 73% cache hit rate, prefix affinity routes to same pod | prefix-cache-testing.md |
| Queue-depth load balancing | Aggregated | Pass -- 30 concurrent requests split 18/12 across pods | aggregated-test-results.md |
| KV-cache-utilization scoring | Aggregated | Pass -- weighted scorer influences pod selection | epp-scoring-deep-dive.md |
| Multi-model routing | Aggregated | Pass -- 7B + 0.5B models routed by model name | multi-model-test-results.md |
| HPA autoscaling (inference pool metrics) | Aggregated | Pass -- scales 1 to 4 pods, drains cleanly on scale-down | autoscaling-test-results.md |
| P/D with NIXL on MIG partitions | P/D | Fail -- CUDA IPC blocked between MIG instances | issues-and-resolutions.md |
| P/D with NIXL on time-sliced GPU | P/D | Pass -- time-slicing preserves CUDA IPC, NIXL works | pd-timeslicing-breakthrough.md |
| Selective P/D (prefix-based-pd-decider) | P/D | Pass -- short requests skip prefill, long requests trigger P/D split | pd-timeslicing-breakthrough.md |
| Precise prefix routing in P/D | P/D | Fail -- KV events not reaching EPP; scorer returns 0 for all endpoints | precise-prefix-routing-findings.md |
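The selective P/D row above hinges on a simple idea: long prompts justify a separate prefill pass, short ones go straight to decode. A minimal sketch of that threshold logic follows; the function name and the 256-token cutoff are illustrative assumptions, not the actual prefix-based-pd-decider implementation in the EPP.

```python
# Hypothetical sketch of a prefix-length-based P/D decider.
# The real prefix-based-pd-decider runs inside the EPP; names and the
# threshold here are illustrative, not taken from the shipped code.

PROMPT_TOKEN_THRESHOLD = 256  # assumed cutoff, not a documented default

def decide_pd_split(prompt_tokens: int, threshold: int = PROMPT_TOKEN_THRESHOLD) -> str:
    """Return "prefill+decode" for long prompts, "decode-only" for short ones."""
    return "prefill+decode" if prompt_tokens >= threshold else "decode-only"

if __name__ == "__main__":
    print(decide_pd_split(32))    # short request skips the prefill pod
    print(decide_pd_split(2048))  # long request triggers a P/D split
```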
Two vLLM replicas behind the EPP scheduler. The scheduler scores candidate pods using a weighted combination of prefix-cache affinity, queue depth, and KV-cache utilization, then routes each request to the highest-scoring pod.
```
Client --> agentgateway --> HTTPRoute --> InferencePool --> EPP scheduler
                                                                 |
                                                prefix-cache-scorer         (weight 2)
                                                queue-scorer                (weight 1)
                                                kv-cache-utilization-scorer (weight 2)
                                                                 |
                                            +--------------------+--------------------+
                                            v                                         v
                                       vLLM Pod A                                vLLM Pod B
                                     (GPU slice 0)                             (GPU slice 1)
```
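The weighted selection in the diagram above can be sketched as follows. This is a toy model of the scheduler's scoring step, not the EPP plugin interface: pod names, per-scorer scores, and the dict-based API are illustrative, while the weights match the configuration shown above.

```python
# Sketch of weighted pod scoring: each scorer returns a per-pod score in
# [0, 1]; the scheduler routes to the pod with the highest weighted sum.
# Pod names and score values are illustrative assumptions.

SCORER_WEIGHTS = {
    "prefix-cache-scorer": 2,
    "queue-scorer": 1,
    "kv-cache-utilization-scorer": 2,
}

def pick_pod(scores_by_scorer: dict[str, dict[str, float]]) -> str:
    """Return the pod with the highest weighted total across all scorers."""
    totals: dict[str, float] = {}
    for scorer, per_pod in scores_by_scorer.items():
        weight = SCORER_WEIGHTS[scorer]
        for pod, score in per_pod.items():
            totals[pod] = totals.get(pod, 0.0) + weight * score
    return max(totals, key=totals.get)

# Pod A has strong prefix affinity; Pod B has the shorter queue.
scores = {
    "prefix-cache-scorer":         {"pod-a": 0.9, "pod-b": 0.1},
    "queue-scorer":                {"pod-a": 0.3, "pod-b": 0.8},
    "kv-cache-utilization-scorer": {"pod-a": 0.5, "pod-b": 0.5},
}
print(pick_pod(scores))  # prefix affinity (weight 2) dominates -> "pod-a"
```

The double weight on the prefix-cache scorer is what produces the prefix-affinity behavior seen in testing: a warm cache outvotes a shorter queue unless the queue imbalance is large.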
Prefill and decode run as separate pods on the same physical GPU via time-slicing. The prefill pod processes the prompt and transfers KV cache to the decode pod over NIXL (CUDA IPC + TCP). The pd-profile-handler in the EPP decides whether each request warrants a P/D split or should go directly to decode.
```
Client --> agentgateway --> HTTPRoute --> InferencePool --> EPP scheduler
                                                                 |
                                                 pd-profile-handler decides P/D split
                                                                 |
                      +------------------------------------------+-------------------------------+
                      v                                                                          v
                 Prefill pod                                                                Decode pod
            vLLM (NixlConnector)                                                 routing-sidecar --> vLLM
             GPU time-slice 0                                                        GPU time-slice 1
                      |                                                                          |
                      +----------------- NIXL KV transfer (cuda_ipc + tcp) ----------------------+
```
Why time-slicing, not MIG: MIG creates hardware-isolated GPU partitions that block CUDA IPC between instances. NIXL requires CUDA IPC (or RDMA) for KV cache transfer. Time-slicing shares the same GPU memory space across pods, keeping CUDA IPC functional. Configuration requires UCX_TLS=tcp,sm,cuda_copy,cuda_ipc and hostIPC: true on the pods.
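A values fragment for the two requirements just named (UCX transport list and host IPC namespace) might look like the following. The key paths here are a sketch, since the exact schema depends on the modelservice chart; the working configuration is values_gh200_ts_pd.yaml.

```yaml
# Sketch only -- key paths are illustrative, not the modelservice chart schema.
# The tested, working config is values_gh200_ts_pd.yaml.
prefill:
  hostIPC: true                                # CUDA IPC across pods needs the host IPC namespace
  containers:
    - name: vllm
      env:
        - name: UCX_TLS
          value: "tcp,sm,cuda_copy,cuda_ipc"   # keep cuda_ipc in UCX's transport list
decode:
  hostIPC: true
  containers:
    - name: vllm
      env:
        - name: UCX_TLS
          value: "tcp,sm,cuda_copy,cuda_ipc"
```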
| Component | Version / Details |
|---|---|
| GPU | NVIDIA GH200 (96 GB HBM3, Hopper compute 9.0, ARM64 Grace CPU) |
| Kubernetes | k3s v1.34.6 (ARM64) |
| Gateway | agentgateway v1.0.0 |
| llm-d Helm charts | infra v1.4.0, inferencepool v1.4.0, modelservice v0.4.9 |
| EPP scheduler | ghcr.io/llm-d/llm-d-inference-scheduler:v0.7.0 |
| Model server | ghcr.io/llm-d/llm-d-cuda:v0.6.0 (native ARM64) |
| Model | Qwen/Qwen2.5-7B-Instruct (BF16, V1 engine, FlashAttention v3) |
Setup instructions: k3s-mig-setup.md | Helm installation: llm-d-install.md
| Document | Description |
|---|---|
| prefix-cache-testing.md | Prefix cache methodology, 73% hit rate, metrics reference |
| aggregated-test-results.md | Prefix routing, queue balancing, load distribution |
| multi-model-test-results.md | Qwen 7B + 0.5B model-name routing |
| autoscaling-test-results.md | Native HPA with EPP metrics, 1 to 4 pod scaling |
| pd-timeslicing-breakthrough.md | P/D via time-slicing: MIG failure analysis, working configuration, UCX transport setup |
| precise-prefix-routing-findings.md | Precise prefix routing in P/D mode: configuration and current limitations |
| Document | Description |
|---|---|
| epp-scoring-deep-dive.md | All EPP scorer plugins, weighted scoring algorithm, overflow behavior |
| autoscaling-metrics-reference.md | All 54 EPP metrics, HPA configuration, WVA overview |
| issues-and-resolutions.md | Issues encountered during deployment with root causes and fixes |
| llm-d-install.md | Helm installation for infra, inferencepool, and modelservice charts |
| k3s-mig-setup.md | k3s cluster setup with GPU partitioning on GH200 |
| File | Description |
|---|---|
| values_gh200_agg.yaml | Aggregated mode (2 replicas) |
| values_gh200_pd.yaml | P/D mode with MIG (non-functional -- documented for reference) |
| values_gh200_ts_pd.yaml | P/D mode with time-slicing (working) |
| File | Description |
|---|---|
| test-aggregated.sh | Aggregated mode test suite |
- Precise prefix routing in P/D mode does not work. The `precise-prefix-cache-scorer` loads and scores, but KV events from vLLM pods do not reach the EPP's prefix index; the scorer returns `score: 0` for all endpoints. Approximate prefix routing works as a fallback. See precise-prefix-routing-findings.md.
- MIG is incompatible with NIXL on a single GPU. CUDA IPC is blocked across MIG instances by design. P/D disaggregation requires time-slicing or multiple physical GPUs with RDMA/InfiniBand.
- Time-slicing has no memory isolation. Both pods share GPU memory; a prefill pod processing a large prompt can OOM both pods. This is a known tradeoff of time-slicing versus MIG.
- Provision a GH200 instance and install k3s per k3s-mig-setup.md
- Install llm-d Helm charts per llm-d-install.md
- For aggregated mode, use values_gh200_agg.yaml and run test-aggregated.sh
- For P/D mode, configure time-slicing per pd-timeslicing-breakthrough.md and use values_gh200_ts_pd.yaml
Apache 2.0