richajoy/llm-d-lambda-deployment
# llm-d Feature Testing -- Aggregated & P/D Disaggregated Inference

Systematic testing of llm-d inference features across two deployment modes: aggregated (multi-replica with intelligent routing) and prefill/decode disaggregated (NIXL KV cache transfer with GPU time-slicing). All testing performed on an NVIDIA GH200 running k3s.


## Features Tested

| Feature | Mode | Status | Details |
|---|---|---|---|
| Prefix-cache-aware routing | Aggregated | Pass -- 73% cache hit rate; prefix affinity routes to the same pod | `prefix-cache-testing.md` |
| Queue-depth load balancing | Aggregated | Pass -- 30 concurrent requests split 18/12 across pods | `aggregated-test-results.md` |
| KV-cache-utilization scoring | Aggregated | Pass -- weighted scorer influences pod selection | `epp-scoring-deep-dive.md` |
| Multi-model routing | Aggregated | Pass -- 7B + 0.5B models routed by model name | `multi-model-test-results.md` |
| HPA autoscaling (inference pool metrics) | Aggregated | Pass -- scales 1 to 4 pods, drains cleanly on scale-down | `autoscaling-test-results.md` |
| P/D with NIXL on MIG partitions | P/D | Fail -- CUDA IPC blocked between MIG instances | `issues-and-resolutions.md` |
| P/D with NIXL on time-sliced GPU | P/D | Pass -- time-slicing preserves CUDA IPC, NIXL works | `pd-timeslicing-breakthrough.md` |
| Selective P/D (prefix-based-pd-decider) | P/D | Pass -- short requests skip prefill; long requests trigger a P/D split | `pd-timeslicing-breakthrough.md` |
| Precise prefix routing in P/D | P/D | Fail -- KV events not reaching EPP; scorer returns 0 for all endpoints | `precise-prefix-routing-findings.md` |

## Architecture

### Aggregated mode

Two vLLM replicas sit behind the EPP scheduler. The scheduler scores candidate pods using a weighted combination of prefix-cache affinity, queue depth, and KV-cache utilization, then routes each request to the highest-scoring pod.

```
Client --> agentgateway --> HTTPRoute --> InferencePool --> EPP scheduler
                                                              |
                                          prefix-cache-scorer (weight 2)
                                          queue-scorer (weight 1)
                                          kv-cache-utilization-scorer (weight 2)
                                                              |
                                           +------------------+------------------+
                                           v                                     v
                                      vLLM Pod A                            vLLM Pod B
                                      (GPU slice 0)                         (GPU slice 1)
```
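The weighted selection above can be sketched in Python. This is an illustrative model only: the scorer functions, normalizations, and pod-stat field names are assumptions, not the EPP's actual plugin API.

```python
# Illustrative sketch of weighted pod scoring (not the EPP's real plugin API).
# Each scorer returns a score in [0, 1]; the scheduler combines them with the
# configured weights and routes the request to the highest-scoring pod.

def prefix_cache_score(pod):
    # Fraction of the request's prefix already cached on this pod (assumed field).
    return pod["cached_prefix_ratio"]

def queue_score(pod):
    # Fewer queued requests -> higher score (assumed normalization).
    return 1.0 / (1.0 + pod["queue_depth"])

def kv_cache_utilization_score(pod):
    # More free KV-cache capacity -> higher score.
    return 1.0 - pod["kv_cache_utilization"]

SCORERS = [
    (prefix_cache_score, 2.0),          # prefix-cache-scorer (weight 2)
    (queue_score, 1.0),                 # queue-scorer (weight 1)
    (kv_cache_utilization_score, 2.0),  # kv-cache-utilization-scorer (weight 2)
]

def pick_pod(pods):
    """Return the pod with the highest weighted total score."""
    return max(pods, key=lambda pod: sum(w * fn(pod) for fn, w in SCORERS))

pods = [
    {"name": "vllm-a", "cached_prefix_ratio": 0.9, "queue_depth": 4, "kv_cache_utilization": 0.7},
    {"name": "vllm-b", "cached_prefix_ratio": 0.1, "queue_depth": 1, "kv_cache_utilization": 0.3},
]
print(pick_pod(pods)["name"])  # prefix affinity dominates here -> "vllm-a"
```

Note how the weight-2 prefix scorer lets a warm cache outweigh a shorter queue, which is the behavior the prefix-affinity tests exercise.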

### P/D disaggregated mode

Prefill and decode run as separate pods on the same physical GPU via time-slicing. The prefill pod processes the prompt and transfers its KV cache to the decode pod over NIXL (CUDA IPC + TCP). The pd-profile-handler in the EPP decides whether each request warrants a P/D split or should go directly to decode.

```
Client --> agentgateway --> HTTPRoute --> InferencePool --> EPP scheduler
                                                              |
                                           pd-profile-handler decides P/D split
                                                              |
                                +-----------------------------+-----------------------------+
                                v                                                           v
                          Prefill pod                                                 Decode pod
                          vLLM (NixlConnector)                                        routing-sidecar --> vLLM
                          GPU time-slice 0                                            GPU time-slice 1
                                |                                                           |
                                +-------- NIXL KV transfer (cuda_ipc + tcp) ----------------+
```
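The selective-P/D behavior (short requests skip prefill, long requests trigger a split) amounts to a threshold decision on prompt length. A minimal sketch, with an assumed token-count cutoff; the real prefix-based-pd-decider's inputs and configuration may differ:

```python
# Hypothetical sketch of the prefill/decode split decision. The threshold
# value and function name are assumptions, not the actual pd-profile-handler
# configuration.

PROMPT_TOKEN_THRESHOLD = 256  # assumed cutoff for a "long" prompt

def should_split_pd(prompt_tokens: int) -> bool:
    """Long prompts go through the prefill pod, whose KV cache is then
    transferred to the decode pod over NIXL; short prompts skip prefill
    and go directly to the decode pod."""
    return prompt_tokens >= PROMPT_TOKEN_THRESHOLD

print(should_split_pd(32))    # short request: decode-only path
print(should_split_pd(2048))  # long request: prefill -> NIXL transfer -> decode
```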

**Why time-slicing, not MIG:** MIG creates hardware-isolated GPU partitions that block CUDA IPC between instances, and NIXL requires CUDA IPC (or RDMA) for KV cache transfer. Time-slicing shares the same GPU memory space across pods, keeping CUDA IPC functional. The working configuration sets `UCX_TLS=tcp,sm,cuda_copy,cuda_ipc` and `hostIPC: true` on the pods.
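As a sketch, those two settings land in the pod spec roughly as follows; the container layout here is illustrative, and the authoritative values live in `values_gh200_ts_pd.yaml`:

```yaml
# Illustrative pod-spec fragment; see values_gh200_ts_pd.yaml for the real config.
spec:
  hostIPC: true                   # required so CUDA IPC works across the P/D pods
  containers:
    - name: vllm
      env:
        - name: UCX_TLS
          value: "tcp,sm,cuda_copy,cuda_ipc"   # UCX transports available to NIXL
      resources:
        limits:
          nvidia.com/gpu: 1       # a time-sliced replica, not a MIG slice
```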


## Infrastructure

| Component | Version / Details |
|---|---|
| GPU | NVIDIA GH200 (96 GB HBM3, Hopper compute 9.0, ARM64 Grace CPU) |
| Kubernetes | k3s v1.34.6 (ARM64) |
| Gateway | agentgateway v1.0.0 |
| llm-d Helm charts | infra v1.4.0, inferencepool v1.4.0, modelservice v0.4.9 |
| EPP scheduler | ghcr.io/llm-d/llm-d-inference-scheduler:v0.7.0 |
| Model server | ghcr.io/llm-d/llm-d-cuda:v0.6.0 (native ARM64) |
| Model | Qwen/Qwen2.5-7B-Instruct (BF16, V1 engine, FlashAttention v3) |

Setup instructions: `k3s-mig-setup.md` | Helm installation: `llm-d-install.md`


## Documentation

### Test results

| Document | Description |
|---|---|
| `prefix-cache-testing.md` | Prefix cache methodology, 73% hit rate, metrics reference |
| `aggregated-test-results.md` | Prefix routing, queue balancing, load distribution |
| `multi-model-test-results.md` | Qwen 7B + 0.5B model-name routing |
| `autoscaling-test-results.md` | Native HPA with EPP metrics, 1 to 4 pod scaling |
| `pd-timeslicing-breakthrough.md` | P/D via time-slicing: MIG failure analysis, working configuration, UCX transport setup |
| `precise-prefix-routing-findings.md` | Precise prefix routing in P/D mode: configuration and current limitations |

### Reference

| Document | Description |
|---|---|
| `epp-scoring-deep-dive.md` | All EPP scorer plugins, weighted scoring algorithm, overflow behavior |
| `autoscaling-metrics-reference.md` | All 54 EPP metrics, HPA configuration, WVA overview |
| `issues-and-resolutions.md` | Issues encountered during deployment with root causes and fixes |
| `llm-d-install.md` | Helm installation for infra, inferencepool, and modelservice charts |
| `k3s-mig-setup.md` | k3s cluster setup with GPU partitioning on GH200 |

### Helm values

| File | Description |
|---|---|
| `values_gh200_agg.yaml` | Aggregated mode (2 replicas) |
| `values_gh200_pd.yaml` | P/D mode with MIG (non-functional -- documented for reference) |
| `values_gh200_ts_pd.yaml` | P/D mode with time-slicing (working) |

### Scripts

| File | Description |
|---|---|
| `test-aggregated.sh` | Aggregated mode test suite |

## Known Issues and Gaps

- **Precise prefix routing in P/D mode does not work.** The precise-prefix-cache-scorer loads and scores, but KV events from the vLLM pods do not reach the EPP's prefix index, so the scorer returns `score: 0` for all endpoints. Approximate prefix routing works as a fallback. See `precise-prefix-routing-findings.md`.

- **MIG is incompatible with NIXL on a single GPU.** CUDA IPC is blocked across MIG instances by design. P/D disaggregation requires time-slicing or multiple physical GPUs with RDMA/InfiniBand.

- **Time-slicing has no memory isolation.** Both pods share GPU memory, so a prefill pod processing a large prompt can OOM both pods. This is a known tradeoff of time-slicing vs MIG.


## Reproducing

1. Provision a GH200 instance and install k3s per `k3s-mig-setup.md`.
2. Install the llm-d Helm charts per `llm-d-install.md`.
3. For aggregated mode, use `values_gh200_agg.yaml` and run `test-aggregated.sh`.
4. For P/D mode, configure time-slicing per `pd-timeslicing-breakthrough.md` and use `values_gh200_ts_pd.yaml`.

## License

Apache 2.0
