Release v0.3.0 · vllm-project/aibrix

Automatically generated release for tag v0.3.0.

🚀 New Features Highlights

AIBrix KVCache Offloading Framework: Introduces a pluggable multi-tier KVCache architecture with support for DRAM and remote backends, enabling efficient offloading of KV states to reduce GPU memory pressure and increase deployment density. (#1057, #1061, #1062, #1063, #1064, #1068, #1069, #1080, #1107)
New KVCache orchestration API: Refactors the orchestration layer to support distributed hashing based caching solutions. (#971, #984, #985, #1037, #1055, #1071, #1114)
Prefix Cache and Load-aware Routing: Uses hash-based token prefix matching and load awareness to reduce latency by raising the prefix cache hit rate and improving routing efficiency. (#838, #774, #933, #1067)
Preble Routing (ICLR’25): Implements Preble, which balances KV cache reuse and GPU load by comparing prefix lengths and computing prompt-aware cost scores for each routing decision. (#678, #719, #730, #1024)
Fairness-oriented Routing (OSDI’24 VTC): Introduces the vtc-basic router with Windowed Adaptive Fairness Routing, which dynamically tracks token usage and ensures fair load distribution across pods. (#964, #1011, #1065)
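As a rough illustration of the prefix-cache and load-aware routing idea above, the sketch below hashes a prompt's tokens in fixed-size blocks and sends the request to the pod with the longest cached prefix, as long as that pod is not much busier than the least-loaded one. This is a minimal sketch and not the AIBrix gateway code: `Pod`, `block_hashes`, `matched_blocks`, `route`, the block size, and the `max_load_gap` threshold are all hypothetical names and values chosen for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Sequence, Set

BLOCK_SIZE = 4  # tokens per hashed block (illustrative value, not an AIBrix default)


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Chain-hash full token blocks so each hash identifies the whole prefix so far."""
    hashes: List[bytes] = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        block = list(tokens[i:i + block_size])
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


@dataclass
class Pod:
    """Hypothetical view of a serving pod as seen by the router."""
    name: str
    running_requests: int = 0
    cached: Set[bytes] = field(default_factory=set)


def matched_blocks(pod: Pod, hashes: Sequence[bytes]) -> int:
    """Count how many leading blocks of the prompt this pod has already seen."""
    count = 0
    for h in hashes:
        if h not in pod.cached:
            break
        count += 1
    return count


def route(pods: Sequence[Pod], tokens: Sequence[int], max_load_gap: int = 2) -> Pod:
    """Pick the pod with the longest prefix match among pods that are not overloaded."""
    hashes = block_hashes(tokens)
    min_load = min(p.running_requests for p in pods)
    # Load awareness: drop pods that are far busier than the least-loaded pod.
    candidates = [p for p in pods if p.running_requests - min_load <= max_load_gap]
    # Prefer the longest prefix match; break ties by lower load.
    best = max(candidates, key=lambda p: (matched_blocks(p, hashes), -p.running_requests))
    best.cached.update(hashes)
    best.running_requests += 1
    return best
```

With two idle pods, a repeated prompt prefix keeps landing on the pod that served it first, until that pod's load exceeds the allowed gap and the router falls back to a less busy pod.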

📊 Feature Enhancements

Gateway Enhancements:

Added support for OpenAI-compatible APIs, including streaming responses, usage reporting, asynchronous handling, and standardized error responses for end-to-end integration. (#703, #788, #799)
Introduced the /v1/models endpoint for compatibility with OpenAI-style API clients. (#802)
Refactored gateway-plugins with an extensible ext-proc server architecture, laying the foundation for pluggable policies. (#810)
Improved concurrency safety and routing stability through major cache and router redesigns (#878, #884)
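For clients that want to check the /v1/models endpoint mentioned above, the sketch below parses an OpenAI-style model list (an object of type `list` with a `data` array of model entries). It is a minimal sketch under stated assumptions: the helper names `model_ids` and `fetch_model_ids` are hypothetical, the gateway base URL must be supplied by the caller, and the bearer-token header is only sent if your deployment requires one.

```python
import json
import urllib.request
from typing import List, Optional


def model_ids(payload: dict) -> List[str]:
    """Extract model IDs from an OpenAI-style list response body."""
    if payload.get("object") != "list":
        raise ValueError("expected an OpenAI-style list object")
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str, api_key: Optional[str] = None) -> List[str]:
    """Query <base_url>/v1/models; base_url is whatever address exposes your gateway."""
    request = urllib.request.Request(base_url.rstrip("/") + "/v1/models")
    if api_key:
        # Assumption: the deployment accepts a standard bearer token.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return model_ids(json.load(response))
```

Keeping the parsing separate from the HTTP call makes the response handling easy to test without a running gateway.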

Control Plane:

Added Kubernetes webhook validation for CRDs, providing early error feedback during resource creation (#748, #786).
Improved RayClusterFleet to fully support DeepSeek-R1/V3 models (#789, #826, #835, #914, #954).
Added the scale subresource to the RayClusterFleet CRD and enabled HPA support (#1082, #1109).

Installation Experiences:

Introduced Terraform modules for GCP and Kubernetes deployment (#823).
Added setup guides for Minikube on Lambda Cloud and AWS in the documentation (#1020).
Enabled standalone controller installation for simplified system bootstrapping. (#930, #931)
Streamlined upgrade workflows by introducing kubectl apply support. CRDs are now split and applied with --server-side, avoiding annotation size limits and enabling smooth incremental updates. (#793)
Enabled container image publishing to GitHub Container Registry (GHCR) (#1041).
Added support for ARM container images (#1090).

Observability & Stability:

Shipped prebuilt Grafana dashboards covering control plane, gateway, and KV cache components for out-of-the-box observability. (#1048)
Tuned Envoy proxy memory and buffer configurations for better performance under high concurrency. (#825, #967)
Added graceful shutdown, liveness, and readiness probes to improve service resilience (#962).
Delivered production-ready monitoring setups for all major system components (#1048).

New Contributors

@gaocegege made their first contribution in #731
@eltociear made their first contribution in #736
@terrytangyuan made their first contribution in #746
@jolfr made their first contribution in #744
@Abirdcfly made their first contribution in #763
@pierDipi made their first contribution in #764
@Xunzhuo made their first contribution in #810
@zjd0112 made their first contribution in #849
@SongGuyang made their first contribution in #850
@vaaandark made their first contribution in #856
@vie-serendipity made their first contribution in #860
@nurali-techie made their first contribution in #867
@legendtkl made their first contribution in #870
@ronaldosaheki made their first contribution in #886
@nadongjun made their first contribution in #890
@cr7258 made their first contribution in #893
@thomasjpfan made their first contribution in #883
@runzhen made their first contribution in #896
@my-git9 made their first contribution in #895
@googs1025 made their first contribution in #908
@Iceber made their first contribution in #926
@ModiIntel made their first contribution in #954
@Venkat2811 made their first contribution in #964
@SuperMohit made their first contribution in #992
@weapons97 made their first contribution in #990
@zhixian82 made their first contribution in #1082

What's Changed

Full Changelog: v0.2.0...v0.3.0

[Docs] fix format of the dist kv cache doc by @DwyaneShi in #714
complete the 'make generate' command by @kerthcet in #711
Update organization reference in code base by @Jeffwan in #717
[Misc] Update the documentation link by @Jeffwan in #720
Initial implementation of radix tree-based cache by @gangmuk in #678
Add model adapter e2e tests by @varungup90 in #701
Add vllm cpu alternative for local development by @varungup90 in #721
Add white paper file by @Jeffwan in #724
Adding streaming client for AIbrix experiments by @happyandslow in #676
[Docs] Update Readme with new links and blog post, and update white paper by @xieus in #725
Recording failed requests in benchmark client by @gangmuk in #727
Process response headers in gateway by @varungup90 in #703
[misc] Fix white paper link by @Jeffwan in #728
Prefix and load aware routing with radix tree kv cache by @gangmuk in #719
Fix slack link in README.md by @Jeffwan in #729
[readme] Fix wrong link by @gaocegege in #731
[Misc] update scheduler.py by @eltociear in #736
Improve thread safety for TreeNode data structure and refactor related codes by @gangmuk in #730
Fix CacheSpec api scheme by @kerthcet in #740
docs: Fix link to license by @terrytangyuan in #746
Use native codegen cmd generating client-go by @kerthcet in #741
[Docs]: Fixed kubectl commands for install of components by @jolfr in #744
[fix] fixing bug in using AsyncOpenAI client (header setting, token counting, etc) by @gangmuk in #738
Add webhook framework by @kerthcet in #748
Use random seed for xxhash by @varungup90 in #752
Create SECURITY.md to enable security policy by @xieus in #756
[CI] Add integration test by @kerthcet in #759
[Bug] fix: correct non-inherited context by @Abirdcfly in #763
[Misc] Parametrize Makefile for mocked vLLM apps by @pierDipi in #764
Support benchmarking script by using real application trace by @nwangfw in #737
Maintaining common benchmarks utils in a separate dir by @gangmuk in #770
Ignore worker pods for gateway routing by @varungup90 in #776
Disable ENABLE_PROBES_INJECTION in correct way by @Jeffwan in #779
Make stream include usage as optional by @varungup90 in #788
Append ray head label selector in PodAutoscaler by @Jeffwan in #789
Remove redundant install crds in makefile by @varungup90 in #792
Update request message processing for /v1/completion input by @varungup90 in #794
Added target pod to client result and made clients consistent by @gangmuk in #799
Enable CI tests for release branch by @Jeffwan in #805
Move modelAdapter runtime validation to webhook by @kerthcet in #786
[Misc] Adding model field to each request by @happyandslow in #812
[Refactor]: gateway-plugins ext-proc server codebase by @Xunzhuo in #810
[CI]: update release tags pattern by @Xunzhuo in #815
[Docs]: fix vllm mock app Unauthorized response by @Xunzhuo in #817
Reconfigure workload generator for predefined synthetic patterns by @happyandslow in #771
Workload generation scripts for prefix aware routing by @gangmuk in #820
Fix the paths in lambda cloud doc by @gangmuk in #824
[Bug] Added Startup Probe in Quickstart Model by @jolfr in #773
Add /v1/models endpoint to gateway by @varungup90 in #802
Increase envoy proxy memory config and client connection buffersize by @varungup90 in #825
Support to create default HttpRoute for RayClusterFleet by @Jeffwan in #826
[Misc] Fix CI issue on release branch and clean up logs by @Jeffwan in #837
Fix repeated initialization of gateway routers and add unit test for prefix cache by @varungup90 in #838
Add deepseek-r1 671B deployment sample and docs by @Jeffwan in #835
Bump AIBrix version to v0.2.1 in manifests by @Jeffwan in #839
[Docs] Update Slack link by @gaocegege in #841
[Docs] Remove repeated lines by @zjd0112 in #849
Bump AIBrix version to v0.2.1 for standalone distributed inference by @SongGuyang in #850
Support OpenAI api style /v1/models response by @Jeffwan in #829
[Misc] Resolve symlink ambiguity when generating codes by @vaaandark in #856
Introduce RoutingContext in Route interface and clean up stale codes by @Jeffwan in #855
[Misc]: sync hpa status to podAutoScaler by @vie-serendipity in #860
Generate workload based on prefix sharing synthetic data by @happyandslow in #840
Fixing missing image link in #840 by @happyandslow in #871
Cite Melange paper in heterogeneous feature by @Jeffwan in #872
[Misc] support linux for vllm cpu local development by @nurali-techie in #867
Refactor make deploy to use apply instead of create by @varungup90 in #793
Use string based tokenizer in prefix cache by @varungup90 in #774
Add profiling support for gateway plugins and bug fix to close stream decoder by @varungup90 in #857
Add flag to enable/disable GPU Optimizer tracing by @varungup90 in #875
[Docs] fix typo in runtime feature page by @legendtkl in #870
chore: clean-up mock yaml by @Xunzhuo in #877
Fixing image link error in workload generator README.md by @happyandslow in #888
Update Synthetic Load Prodefined Config for Geneerator by @happyandslow in #889
[Misc] Fix plot_workload to pass dirname to makedirs by @ronaldosaheki in #886
[Misc] Fix client.py in case workload has model null and client has default_model by @ronaldosaheki in #887
[WIP] Adding input/output distribution argument to constant load generator by @happyandslow in #882
[Docs] Fix broken contributing guidelines link in README by @nadongjun in #890
[Bug] fix install script PATH environment variable by @cr7258 in #893
[Docs] Link to dynamic lora from docs by @thomasjpfan in #883
[API] Refactor: core cache design and impl by @Xunzhuo in #878
Added antiaffinity in kvcache crd by @gangmuk in #865
[Docs] Fix tpm and rpm typo in gateway-plugins.rst by @runzhen in #896
[Misc] Remove unused function in pkg/utils by @my-git9 in #895
Remove model name from client and generator by @happyandslow in #894
[Misc] Add PS benchmark manifests and scripts by @Jeffwan in #899
Add release overlays to update control plane config for production deployment by @varungup90 in #900
[Misc][Docs]: GCP and Kubernetes Terraform Deployment Modules by @jolfr in #823
[Misc] Cleanup deprecated function intstr.FromInt by @my-git9 in #901
[Bug] Routers that require cache failed on Register by @zhangjyr in #913
[Misc] chore: remove unnecessary check for pod is zero by @googs1025 in #908
[Bug] add Tolerations for kvcache pod to fix Pending and CrashLoopBackOff on GKE by @runzhen in #909
[Misc] chore(raycluster): add concurrency limit and error aggregation to scaleDown by @googs1025 in #914
[API] Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing. by @zhangjyr in #884
Enable parallel client using thread pool in benchmark client by @happyandslow in #919
[Misc] Add pods stats example: running requests. by @zhangjyr in #918
[CLI] feature(modeladapter): make modeladapter controller scheduler policy be configured by @googs1025 in #921
[Bug] Syncmap.Store does not update. by @zhangjyr in #925
[API] [Misc]: Support LRU cache with TTL for prefix cache indexer by @vie-serendipity in #905
Remove unused argument from workload generator by @happyandslow in #929
[BUG] cache: handle DeletedFinalStateUnknown by the delete func by @Iceber in #926
[Misc] feature(rayclusterreplicaset): check rayclusters crd is installed before controller start by @googs1025 in #922
[BUG] return directly when error occurs while adding the controller by @Iceber in #937
[Misc] fix log typo by @Iceber in #935
Move delays to threads in benchmark by @happyandslow in #939
[BUG] controller: fix generating the corresponding HPA object for the PA by @Iceber in #934
Support multi-turn scenarios in benchmark client by @happyandslow in #907
Refactoring benchmark folder by @happyandslow in #946
Performance improvements for prefix cache routing by @varungup90 in #933
[Misc]: move crd check in Initialize part by @googs1025 in #949
[CLI] Add --disableWebhook in controller by @Jeffwan in #931
[BUG] controller: handle DeletedFinalStateUnknown by the delete func by @Iceber in #938
fix: complete RayClusterFleet example for multi-node vLLM inference by @ModiIntel in #954
[Misc] remove the duplicated env functions by @Iceber in #953
[Misc] increase the memory limit of the controller-manager by @Iceber in #952
chore: add help func for get Env value by @googs1025 in #941
[Bug] avoid frequent lookup of the routing strategy env by @Iceber in #956
Enable standalone installation of kv-cache-controller by @Jeffwan in #930
[fix] Fix wheel build errors in runtime image by @Jeffwan in #961
Control maximum concurrent session for workload generator by @happyandslow in #963
Updating Plotting Script to Visualize Sharing Patterns by @happyandslow in #965
Change synthetic cache sharing dataset format by @happyandslow in #966
[Bug] prevent reference grant delete if shared by other deployments by @varungup90 in #968
Add graceful shutdown for gateway and add liveness/readiness probes by @varungup90 in #962
Add httproute status check for response header errors by @varungup90 in #957
Update envoy proxy and gateway-plugins config by @varungup90 in #967
Refactor kv cache controller to support different setup modes by @Jeffwan in #971
[Misc] chore: use t.Log instead of Println by @googs1025 in #973
[fix] Handle error output in analysis script by @happyandslow in #975
[Fix] Unify all workload generator output file names by @happyandslow in #976
[Fix] Fix shallowcopy error in prompt history retrieval by @happyandslow in #978
Assign tasks to client by keys by @happyandslow in #979
[Fix] Fix error case handling for client output analysis by @happyandslow in #980
cmd/controllers: add readyz check for the webhook by @Iceber in #969
Bug fix generating plain data format by @happyandslow in #982
[BUG] cache: start informer after adding the resource handler by @Iceber in #981
Support distributed hashing mode kv cache pool by @Jeffwan in #984
[Feature]: introducing a basic VTC router in gateway plugin to start supporting fairness based routing by @Venkat2811 in #964
[Docs]: Fixed Broken link for tutorials by @SuperMohit in #992
End-to-end script for workload runnning process by @happyandslow in #947
Support hpkv in kv cache controller by @Jeffwan in #985
Update add redis pass for client by @weapons97 in #990
[Misc] chore: refactor selectTargetPod func by @googs1025 in #1000
[Misc] chore: remove unuse func by @googs1025 in #1005
[Misc] Move redis load_env to method level by @Jeffwan in #1002
[Fix] 401 errors in gateway should be returned as immediate response by @Jeffwan in #1006
[BUG] ratelimit: fix the wrong TPM key name by @runzhen in #987
[Misc] Add app.kubernetes.io/name labels to components by @Jeffwan in #1003
E2E CI fix to ensure all pods are ready by @varungup90 in #972
Allow manual trigger for build/push docker images by @varungup90 in #1017
[Fix] Prioritize AutoTokenizer in get_tokenizer with fallback to tiktoken by @Jeffwan in #1016
[Bug] fix: update metaPods cache to use namespace/name as the key by @googs1025 in #1015
[Fix]: optimize vtc-basic router algo from modulo to more robust adaptive-clamped-linear by @Venkat2811 in #1011
Enabling adjustable client pool size and output token limit by @happyandslow in #1025
[Misc] fix: Optimizing Route method of the gateway algorithms by @googs1025 in #1001
[Docs] Support minikube on Lambda cloud and add AWS page by @Jeffwan in #1020
[BUG] Use more accurate chi-squared test for randomness validation in e2e test. by @zhangjyr in #1027
Support multiple configs in synthetic shared dataset by @happyandslow in #1033
[Fix] Bug fix for constant workload QPS by @happyandslow in #1036
Improve installlation test e2e time by @varungup90 in #1034
Add Pareto Sampler for Multiturn Dataset Generation by @happyandslow in #1038
use atomic.Int32c instead of use Int32 type by @googs1025 in #1035
[CI] Enable GHCR image build and push by @Jeffwan in #1041
Adding interval scaling factor for client by @happyandslow in #1043
[Bug] fix: use RLock() instead of Lock() when reading var by @googs1025 in #1044
Dataset generator output argument fix by @happyandslow in #1042
Docker push multi-platform images by @varungup90 in #1026
Refactor the kvcache backend to support infinistore by @Jeffwan in #1037
[Docs] Document gpu optimizer as experimental and improve deployment config. by @zhangjyr in #1051
[Fix] Removing redundant locks in prefix cache and load router function by @gangmuk in #1024
[Feature] AIBrix KVCache common by @DwyaneShi in #1057
[Feature] AIBrix KVCache L1Cache by @DwyaneShi in #1061
[Feature] AIBrix KVCache L2Cache Part1 by @DwyaneShi in #1062
Add dashboard and monitoring setup steps for control plane by @Jeffwan in #1048
[Feature] AIBrix KVCache L2Cache Part2 by @DwyaneShi in #1063
[Feature] AIBrix KVCache L2Cache Part3 and KVCache Managers by @DwyaneShi in #1064
[Bug] fix: add more info for pod metric fetch failures in GetMetricsFromPods by @googs1025 in #1039
[Fix] Fix multiple issues for benchmark implementation by @happyandslow in #1049
[Fix] Fix L2Cache's register descriptor container by @DwyaneShi in #1068
Update kvcache v1alpha1 api spec by @Jeffwan in #1055
[Integration] vLLM integration patch for AIBrix KVCache by @DwyaneShi in #1069
[Misc] fix: add miss Close() for redis client by @googs1025 in #1056
Support prometheus metrics in kv watcher pod by @Jeffwan in #1073
Add rdma gid search scripts by @Jeffwan in #1072
Support watcher pod rbac in kvcache controller by @Jeffwan in #1071
[Misc] chore: change Scheduler interface describe in modeladapter controller by @googs1025 in #1075
[MISC]: add vtc_bucket_size_active metric gauge for vtc-basic by @Venkat2811 in #1065
[Integration] Update vLLM integration by @DwyaneShi in #1080
[Misc] Clean up deployment scripts for volcengine by @Jeffwan in #1081
Cut v0.3.0-rc.1 release by @Jeffwan in #1083
[fix] Correct the python build path in same step by @Jeffwan in #1084
[Misc] Skip attaching python artifacts to github release by @Jeffwan in #1085
[Bug] fix: condition nil panic in FindStatusCondition func by @googs1025 in #1078
Refactor request body processing and add multi-turn conversation support by @varungup90 in #1067
Upload arm build images with git.ref_name by @varungup90 in #1090
Update documentation and add openai sdk samples by @varungup90 in #1092
Rename preble based prefix routing strategy by @varungup90 in #1104
Add v0.3.0 ps performance regression test scenario by @Jeffwan in #1099
Migrating benchmark entrypoints to python client by @happyandslow in #1066
[Misc] Add demo manifests for volcano engine by @Jeffwan in #1105
[Integration] KVCache: update vLLM integration by @DwyaneShi in #1107
[Bug]fix: add scale subresource to rayclusterfleet by @zhixian82 in #1082
[Feature] KVCache: Suppport InfiniStore GID and enhance cluster mode by @DwyaneShi in #1106
[Chore] fix: regenerate crd by @zhixian82 in #1109
[Chore] KVCache: enhance format and dependencies by @DwyaneShi in #1108
Polish benchmark manifests and VE samples by @Jeffwan in #1113
[API] Support customized template for cache by @Jeffwan in #1114
Bump version to v0.3.0-rc.2 by @Jeffwan in #1115
[Fix] Move pdb from patch to resources by @Jeffwan in #1117
[Docs] Add feature manuals for KVCache by @DwyaneShi in #1119
[Docs] Adding benchmark doc by @happyandslow in #999
[Docs] format KVCache docs to eliminate warnings by @DwyaneShi in #1122
Update multi-arch image push to include all platforms for release by @varungup90 in #1124
[Docs] Init folder for KVCache benchmark scenario by @DwyaneShi in #1125
[Docs] Addressing benchmark doc comments by @happyandslow in #1123
Cut v0.3.0 release by @Jeffwan in #1126
Bump python project version to v0.3.0 by @Jeffwan in #1127
