richajoy/llm-d-lambda-deployment
# llm-d Feature Testing -- Aggregated & P/D Disaggregated Inference

Systematic testing of llm-d inference features across two deployment modes: aggregated (multi-replica with intelligent routing) and prefill/decode disaggregated (NIXL KV cache transfer with GPU time-slicing). All testing performed on an NVIDIA GH200 running k3s.


## Features Tested

| Feature | Mode | Status | Details |
|---|---|---|---|
| Prefix-cache-aware routing | Aggregated | Pass -- 73% cache hit rate; prefix affinity routes to the same pod | `prefix-cache-testing.md` |
| Queue-depth load balancing | Aggregated | Pass -- 30 concurrent requests split 18/12 across pods | `aggregated-test-results.md` |
| KV-cache-utilization scoring | Aggregated | Pass -- weighted scorer influences pod selection | `epp-scoring-deep-dive.md` |
| Multi-model routing | Aggregated | Pass -- 7B + 0.5B models routed by model name | `multi-model-test-results.md` |
| HPA autoscaling (inference pool metrics) | Aggregated | Pass -- scales 1 to 4 pods, drains cleanly on scale-down | `autoscaling-test-results.md` |
| P/D with NIXL on MIG partitions | P/D | Fail -- CUDA IPC blocked between MIG instances | `issues-and-resolutions.md` |
| P/D with NIXL on time-sliced GPU | P/D | Pass -- time-slicing preserves CUDA IPC, NIXL works | `pd-timeslicing-breakthrough.md` |
| Selective P/D (prefix-based-pd-decider) | P/D | Pass -- short requests skip prefill; long requests trigger a P/D split | `pd-timeslicing-breakthrough.md` |
| Precise prefix routing in P/D | P/D | Fail -- KV events not reaching EPP; scorer returns 0 for all endpoints | `precise-prefix-routing-findings.md` |

## Architecture

### Aggregated mode

Two vLLM replicas sit behind the EPP scheduler. The scheduler scores candidate pods using a weighted combination of prefix-cache affinity, queue depth, and KV-cache utilization, then routes each request to the highest-scoring pod.

```
Client --> agentgateway --> HTTPRoute --> InferencePool --> EPP scheduler
                                                              |
                                          prefix-cache-scorer (weight 2)
                                          queue-scorer (weight 1)
                                          kv-cache-utilization-scorer (weight 2)
                                                              |
                                           +------------------+------------------+
                                           v                                     v
                                      vLLM Pod A                            vLLM Pod B
                                      (GPU slice 0)                         (GPU slice 1)
```
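The weighted selection above can be sketched in Python. This is an illustrative model only: the scorer functions, normalizations, and pod-stat field names are assumptions, not the EPP's actual plugin API.

```python
# Illustrative sketch of weighted pod scoring (not the EPP's real plugin API).
# Each scorer returns a score in [0, 1]; the scheduler combines them with the
# configured weights and routes the request to the highest-scoring pod.

def prefix_cache_score(pod):
    # Fraction of the request's prefix already cached on this pod (assumed field).
    return pod["cached_prefix_ratio"]

def queue_score(pod):
    # Fewer queued requests -> higher score (assumed normalization).
    return 1.0 / (1.0 + pod["queue_depth"])

def kv_cache_utilization_score(pod):
    # More free KV-cache capacity -> higher score.
    return 1.0 - pod["kv_cache_utilization"]

SCORERS = [
    (prefix_cache_score, 2.0),          # prefix-cache-scorer (weight 2)
    (queue_score, 1.0),                 # queue-scorer (weight 1)
    (kv_cache_utilization_score, 2.0),  # kv-cache-utilization-scorer (weight 2)
]

def pick_pod(pods):
    """Return the pod with the highest weighted total score."""
    return max(pods, key=lambda pod: sum(w * fn(pod) for fn, w in SCORERS))

pods = [
    {"name": "vllm-a", "cached_prefix_ratio": 0.9, "queue_depth": 4, "kv_cache_utilization": 0.7},
    {"name": "vllm-b", "cached_prefix_ratio": 0.1, "queue_depth": 1, "kv_cache_utilization": 0.3},
]
print(pick_pod(pods)["name"])  # prefix affinity dominates here -> "vllm-a"
```

Note how the weight-2 prefix scorer lets a warm cache outweigh a shorter queue, which is the behavior the prefix-affinity tests exercise.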

### P/D disaggregated mode

Prefill and decode run as separate pods on the same physical GPU via time-slicing. The prefill pod processes the prompt and transfers its KV cache to the decode pod over NIXL (CUDA IPC + TCP). The pd-profile-handler in the EPP decides whether each request warrants a P/D split or should go directly to decode.

```
Client --> agentgateway --> HTTPRoute --> InferencePool --> EPP scheduler
                                                              |
                                           pd-profile-handler decides P/D split
                                                              |
                                +-----------------------------+-----------------------------+
                                v                                                           v
                          Prefill pod                                                 Decode pod
                          vLLM (NixlConnector)                                        routing-sidecar --> vLLM
                          GPU time-slice 0                                            GPU time-slice 1
                                |                                                           |
                                +-------- NIXL KV transfer (cuda_ipc + tcp) ----------------+
```
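The selective-P/D behavior (short requests skip prefill, long requests trigger a split) amounts to a threshold decision on prompt length. A minimal sketch, with an assumed token-count cutoff; the real prefix-based-pd-decider's inputs and configuration may differ:

```python
# Hypothetical sketch of the prefill/decode split decision. The threshold
# value and function name are assumptions, not the actual pd-profile-handler
# configuration.

PROMPT_TOKEN_THRESHOLD = 256  # assumed cutoff for a "long" prompt

def should_split_pd(prompt_tokens: int) -> bool:
    """Long prompts go through the prefill pod, whose KV cache is then
    transferred to the decode pod over NIXL; short prompts skip prefill
    and go directly to the decode pod."""
    return prompt_tokens >= PROMPT_TOKEN_THRESHOLD

print(should_split_pd(32))    # short request: decode-only path
print(should_split_pd(2048))  # long request: prefill -> NIXL transfer -> decode
```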

**Why time-slicing, not MIG:** MIG creates hardware-isolated GPU partitions that block CUDA IPC between instances, and NIXL requires CUDA IPC (or RDMA) for KV cache transfer. Time-slicing shares the same GPU memory space across pods, keeping CUDA IPC functional. The working configuration sets `UCX_TLS=tcp,sm,cuda_copy,cuda_ipc` and `hostIPC: true` on the pods.
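As a sketch, those two settings land in the pod spec roughly as follows; the container layout here is illustrative, and the authoritative values live in `values_gh200_ts_pd.yaml`:

```yaml
# Illustrative pod-spec fragment; see values_gh200_ts_pd.yaml for the real config.
spec:
  hostIPC: true                   # required so CUDA IPC works across the P/D pods
  containers:
    - name: vllm
      env:
        - name: UCX_TLS
          value: "tcp,sm,cuda_copy,cuda_ipc"   # UCX transports available to NIXL
      resources:
        limits:
          nvidia.com/gpu: 1       # a time-sliced replica, not a MIG slice
```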


## Infrastructure

| Component | Version / Details |
|---|---|
| GPU | NVIDIA GH200 (96 GB HBM3, Hopper compute 9.0, ARM64 Grace CPU) |
| Kubernetes | k3s v1.34.6 (ARM64) |
| Gateway | agentgateway v1.0.0 |
| llm-d Helm charts | infra v1.4.0, inferencepool v1.4.0, modelservice v0.4.9 |
| EPP scheduler | ghcr.io/llm-d/llm-d-inference-scheduler:v0.7.0 |
| Model server | ghcr.io/llm-d/llm-d-cuda:v0.6.0 (native ARM64) |
| Model | Qwen/Qwen2.5-7B-Instruct (BF16, V1 engine, FlashAttention v3) |

Setup instructions: `k3s-mig-setup.md` | Helm installation: `llm-d-install.md`


## Documentation

### Test results

| Document | Description |
|---|---|
| `prefix-cache-testing.md` | Prefix cache methodology, 73% hit rate, metrics reference |
| `aggregated-test-results.md` | Prefix routing, queue balancing, load distribution |
| `multi-model-test-results.md` | Qwen 7B + 0.5B model-name routing |
| `autoscaling-test-results.md` | Native HPA with EPP metrics, 1 to 4 pod scaling |
| `pd-timeslicing-breakthrough.md` | P/D via time-slicing: MIG failure analysis, working configuration, UCX transport setup |
| `precise-prefix-routing-findings.md` | Precise prefix routing in P/D mode: configuration and current limitations |

### Reference

| Document | Description |
|---|---|
| `epp-scoring-deep-dive.md` | All EPP scorer plugins, weighted scoring algorithm, overflow behavior |
| `autoscaling-metrics-reference.md` | All 54 EPP metrics, HPA configuration, WVA overview |
| `issues-and-resolutions.md` | Issues encountered during deployment with root causes and fixes |
| `llm-d-install.md` | Helm installation for infra, inferencepool, and modelservice charts |
| `k3s-mig-setup.md` | k3s cluster setup with GPU partitioning on GH200 |

### Helm values

| File | Description |
|---|---|
| `values_gh200_agg.yaml` | Aggregated mode (2 replicas) |
| `values_gh200_pd.yaml` | P/D mode with MIG (non-functional -- documented for reference) |
| `values_gh200_ts_pd.yaml` | P/D mode with time-slicing (working) |

### Scripts

| File | Description |
|---|---|
| `test-aggregated.sh` | Aggregated mode test suite |

## Known Issues and Gaps

- **Precise prefix routing in P/D mode does not work.** The precise-prefix-cache-scorer loads and scores, but KV events from the vLLM pods do not reach the EPP's prefix index, so the scorer returns `score: 0` for all endpoints. Approximate prefix routing works as a fallback. See `precise-prefix-routing-findings.md`.

- **MIG is incompatible with NIXL on a single GPU.** CUDA IPC is blocked across MIG instances by design. P/D disaggregation requires time-slicing or multiple physical GPUs with RDMA/InfiniBand.

- **Time-slicing has no memory isolation.** Both pods share GPU memory, so a prefill pod processing a large prompt can OOM both pods. This is a known tradeoff of time-slicing vs MIG.


## Reproducing

1. Provision a GH200 instance and install k3s per `k3s-mig-setup.md`.
2. Install the llm-d Helm charts per `llm-d-install.md`.
3. For aggregated mode, use `values_gh200_agg.yaml` and run `test-aggregated.sh`.
4. For P/D mode, configure time-slicing per `pd-timeslicing-breakthrough.md` and use `values_gh200_ts_pd.yaml`.

## License

Apache 2.0
