Automatically generated release for tag v0.3.0.
🚀 New Feature Highlights
- AIBrix KVCache Offloading Framework: Introduces a pluggable multi-tier KVCache architecture with support for DRAM and remote backends, enabling efficient offloading of KV states to reduce GPU memory pressure and increase deployment density. (#1057, #1061, #1062, #1063, #1064, #1068, #1069, #1080, #1107)
- New KVCache orchestration API: Refactors the orchestration layer to support distributed hashing based caching solutions. (#971, #984, #985, #1037, #1055, #1071, #1114)
- Prefix-Cache and Load-Aware Routing: Uses hashed-token prefix matching together with load awareness to raise the prefix cache hit rate and reduce latency. (#838, #774, #933, #1067)
- Preble Routing (ICLR’25): Implements Preble, which balances KV cache reuse against GPU load by comparing prefix lengths and computing prompt-aware cost scores for each routing decision. (#678, #719, #730, #1024)
- Fairness-oriented Routing (OSDI’24 VTC): Introduces the vtc-basic router with Windowed Adaptive Fairness Routing, which dynamically tracks token usage and ensures fair load distribution across pods. (#964, #1011, #1065)
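As a rough illustration of the prefix-cache and load-aware routing idea above, the following is a minimal sketch, not the AIBrix gateway implementation. The block size, the `Pod` class, the `route` function, and the `max_load_gap` threshold are all hypothetical names and values chosen for the example. It hashes the prompt tokens in fixed-size blocks, counts how many leading blocks each pod has already seen, and prefers the longest match among pods that are not much busier than the least-loaded pod.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per hashed block (illustrative value, an assumption)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block of tokens, chaining in the previous hash so that
    a block hash identifies the whole prefix ending at that block."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - block_size + 1, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


@dataclass
class Pod:
    """Hypothetical view of a serving pod as seen by the router."""
    name: str
    running_requests: int = 0
    cached: set = field(default_factory=set)  # prefix block hashes seen so far

    def matched_blocks(self, hashes):
        """Count the leading blocks of a request already cached on this pod."""
        n = 0
        for h in hashes:
            if h not in self.cached:
                break
            n += 1
        return n


def route(pods, tokens, max_load_gap=2):
    """Choose the pod with the longest cached prefix among pods whose load is
    within max_load_gap of the least-loaded pod; ties go to the lower load."""
    hashes = block_hashes(tokens)
    min_load = min(p.running_requests for p in pods)
    candidates = [p for p in pods if p.running_requests - min_load <= max_load_gap]
    best = max(candidates, key=lambda p: (p.matched_blocks(hashes), -p.running_requests))
    best.cached.update(hashes)   # the chosen pod will now hold this prefix
    best.running_requests += 1   # the caller would decrement on completion
    return best
```

In this sketch a request sharing a prefix with an earlier one lands on the same pod until that pod's load exceeds the gap, at which point the router falls back to a less loaded pod and accepts a cache miss.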
📊 Feature Enhancements
Gateway Enhancements
- Added support for OpenAI-compatible APIs, including streaming responses, usage reporting, asynchronous handling, and standardized error responses. (#703, #788, #799)
- Introduced the /v1/models endpoint for compatibility with OpenAI-style API clients. (#802)
- Refactored gateway-plugins with an extensible ext-proc server architecture, laying the foundation for pluggable policies. (#810)
- Improved concurrency safety and routing stability through major cache and router redesigns. (#878, #884)
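For readers wiring a client against the /v1/models endpoint, here is a minimal sketch using only the Python standard library. It assumes the gateway returns the OpenAI-style list shape (`{"object": "list", "data": [...]}`), which these notes imply but do not spell out. The base URL, the bearer-token header, and the helper names `build_models_request` and `model_ids` are illustrative assumptions, not part of AIBrix.

```python
import json
from urllib.request import Request


def build_models_request(base_url, api_key=None):
    """Build (but do not send) a GET request for the /v1/models endpoint."""
    url = base_url.rstrip("/") + "/v1/models"
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: the deployment expects an OpenAI-style bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(url, headers=headers, method="GET")


def model_ids(payload):
    """Extract model IDs from an OpenAI-style model list response body."""
    body = json.loads(payload)
    if body.get("object") != "list":
        raise ValueError("unexpected response shape: expected an object of type 'list'")
    return [m["id"] for m in body.get("data", [])]
```

To send the request, pass the result of `build_models_request` to `urllib.request.urlopen` and feed the response body to `model_ids`.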
Control Plane:
- Added Kubernetes webhook validation for CRDs, providing early error feedback during resource creation (#748, #786).
- Improved RayClusterFleet to fully support DeepSeek-R1/V3 models. (#789, #826, #835, #914, #954)
- Added the scale subresource to the RayClusterFleet CRD and enabled HPA support. (#1082, #1109)
Installation Experience:
- Introduced Terraform modules for GCP and Kubernetes deployment (#823).
- Added setup guides for Minikube on Lambda Cloud and AWS in the documentation (#1020).
- Enabled standalone controller installation for simplified system bootstrapping. (#930, #931)
- Streamlined upgrade workflows by introducing kubectl apply support. CRDs are now split and applied with --server-side, avoiding annotation size limits and enabling smooth incremental updates. (#793)
- Enabled container image publishing to GitHub Container Registry (GHCR). (#1041)
- Added support for ARM container images. (#1090)
Observability & Stability:
- Shipped prebuilt Grafana dashboards covering control plane, gateway, and KV cache components for out-of-the-box observability. (#1048)
- Tuned Envoy proxy memory and buffer configurations for better performance under high concurrency. (#825, #967)
- Added graceful shutdown, liveness, and readiness probes to improve service resilience (#962).
- Delivered production-ready monitoring setups for all major system components (#1048).
New Contributors
- @gaocegege made their first contribution in #731
- @eltociear made their first contribution in #736
- @terrytangyuan made their first contribution in #746
- @jolfr made their first contribution in #744
- @Abirdcfly made their first contribution in #763
- @pierDipi made their first contribution in #764
- @Xunzhuo made their first contribution in #810
- @zjd0112 made their first contribution in #849
- @SongGuyang made their first contribution in #850
- @vaaandark made their first contribution in #856
- @vie-serendipity made their first contribution in #860
- @nurali-techie made their first contribution in #867
- @legendtkl made their first contribution in #870
- @ronaldosaheki made their first contribution in #886
- @nadongjun made their first contribution in #890
- @cr7258 made their first contribution in #893
- @thomasjpfan made their first contribution in #883
- @runzhen made their first contribution in #896
- @my-git9 made their first contribution in #895
- @googs1025 made their first contribution in #908
- @Iceber made their first contribution in #926
- @ModiIntel made their first contribution in #954
- @Venkat2811 made their first contribution in #964
- @SuperMohit made their first contribution in #992
- @weapons97 made their first contribution in #990
- @zhixian82 made their first contribution in #1082
What's Changed
Full Changelog: v0.2.0...v0.3.0
- [Docs] fix format of the dist kv cache doc by @DwyaneShi in #714
- complete the 'make generate' command by @kerthcet in #711
- Update organization reference in code base by @Jeffwan in #717
- [Misc] Update the documentation link by @Jeffwan in #720
- Initial implementation of radix tree-based cache by @gangmuk in #678
- Add model adapter e2e tests by @varungup90 in #701
- Add vllm cpu alternative for local development by @varungup90 in #721
- Add white paper file by @Jeffwan in #724
- Adding streaming client for AIBrix experiments by @happyandslow in #676
- [Docs] Update Readme with new links and blog post, and update white paper by @xieus in #725
- Recording failed requests in benchmark client by @gangmuk in #727
- Process response headers in gateway by @varungup90 in #703
- [misc] Fix white paper link by @Jeffwan in #728
- Prefix and load aware routing with radix tree kv cache by @gangmuk in #719
- Fix slack link in README.md by @Jeffwan in #729
- [readme] Fix wrong link by @gaocegege in #731
- [Misc] update scheduler.py by @eltociear in #736
- Improve thread safety for TreeNode data structure and refactor related codes by @gangmuk in #730
- Fix CacheSpec api scheme by @kerthcet in #740
- docs: Fix link to license by @terrytangyuan in #746
- Use native codegen cmd generating client-go by @kerthcet in #741
- [Docs]: Fixed kubectl commands for install of components by @jolfr in #744
- [fix] fixing bug in using AsyncOpenAI client (header setting, token counting, etc) by @gangmuk in #738
- Add webhook framework by @kerthcet in #748
- Use random seed for xxhash by @varungup90 in #752
- Create SECURITY.md to enable security policy by @xieus in #756
- [CI] Add integration test by @kerthcet in #759
- [Bug] fix: correct non-inherited context by @Abirdcfly in #763
- [Misc] Parametrize Makefile for mocked vLLM apps by @pierDipi in #764
- Support benchmarking script by using real application trace by @nwangfw in #737
- Maintaining common benchmarks utils in a separate dir by @gangmuk in #770
- Ignore worker pods for gateway routing by @varungup90 in #776
- Disable ENABLE_PROBES_INJECTION in correct way by @Jeffwan in #779
- Make stream include usage as optional by @varungup90 in #788
- Append ray head label selector in PodAutoscaler by @Jeffwan in #789
- Remove redundant install crds in makefile by @varungup90 in #792
- Update request message processing for /v1/completion input by @varungup90 in #794
- Added target pod to client result and made clients consistent by @gangmuk in #799
- Enable CI tests for release branch by @Jeffwan in #805
- Move modelAdapter runtime validation to webhook by @kerthcet in #786
- [Misc] Adding model field to each request by @happyandslow in #812
- [Refactor]: gateway-plugins ext-proc server codebase by @Xunzhuo in #810
- [CI]: update release tags pattern by @Xunzhuo in #815
- [Docs]: fix vllm mock app Unauthorized response by @Xunzhuo in #817
- Reconfigure workload generator for predefined synthetic patterns by @happyandslow in #771
- Workload generation scripts for prefix aware routing by @gangmuk in #820
- Fix the paths in lambda cloud doc by @gangmuk in #824
- [Bug] Added Startup Probe in Quickstart Model by @jolfr in #773
- Add /v1/models endpoint to gateway by @varungup90 in #802
- Increase envoy proxy memory config and client connection buffersize by @varungup90 in #825
- Support to create default HttpRoute for RayClusterFleet by @Jeffwan in #826
- [Misc] Fix CI issue on release branch and clean up logs by @Jeffwan in #837
- Fix repeated initialization of gateway routers and add unit test for prefix cache by @varungup90 in #838
- Add deepseek-r1 671B deployment sample and docs by @Jeffwan in #835
- Bump AIBrix version to v0.2.1 in manifests by @Jeffwan in #839
- [Docs] Update Slack link by @gaocegege in #841
- [Docs] Remove repeated lines by @zjd0112 in #849
- Bump AIBrix version to v0.2.1 for standalone distributed inference by @SongGuyang in #850
- Support OpenAI api style /v1/models response by @Jeffwan in #829
- [Misc] Resolve symlink ambiguity when generating codes by @vaaandark in #856
- Introduce RoutingContext in Route interface and clean up stale codes by @Jeffwan in #855
- [Misc]: sync hpa status to podAutoScaler by @vie-serendipity in #860
- Generate workload based on prefix sharing synthetic data by @happyandslow in #840
- Fixing missing image link in #840 by @happyandslow in #871
- Cite Melange paper in heterogeneous feature by @Jeffwan in #872
- [Misc] support linux for vllm cpu local development by @nurali-techie in #867
- Refactor make deploy to use apply instead of create by @varungup90 in #793
- Use string based tokenizer in prefix cache by @varungup90 in #774
- Add profiling support for gateway plugins and bug fix to close stream decoder by @varungup90 in #857
- Add flag to enable/disable GPU Optimizer tracing by @varungup90 in #875
- [Docs] fix typo in runtime feature page by @legendtkl in #870
- chore: clean-up mock yaml by @Xunzhuo in #877
- Fixing image link error in workload generator README.md by @happyandslow in #888
- Update Synthetic Load Predefined Config for Generator by @happyandslow in #889
- [Misc] Fix plot_workload to pass dirname to makedirs by @ronaldosaheki in #886
- [Misc] Fix client.py in case workload has model null and client has default_model by @ronaldosaheki in #887
- [WIP] Adding input/output distribution argument to constant load generator by @happyandslow in #882
- [Docs] Fix broken contributing guidelines link in README by @nadongjun in #890
- [Bug] fix install script PATH environment variable by @cr7258 in #893
- [Docs] Link to dynamic lora from docs by @thomasjpfan in #883
- [API] Refactor: core cache design and impl by @Xunzhuo in #878
- Added antiaffinity in kvcache crd by @gangmuk in #865
- [Docs] Fix tpm and rpm typo in gateway-plugins.rst by @runzhen in #896
- [Misc] Remove unused function in pkg/utils by @my-git9 in #895
- Remove model name from client and generator by @happyandslow in #894
- [Misc] Add PS benchmark manifests and scripts by @Jeffwan in #899
- Add release overlays to update control plane config for production deployment by @varungup90 in #900
- [Misc][Docs]: GCP and Kubernetes Terraform Deployment Modules by @jolfr in #823
- [Misc] Cleanup deprecated function intstr.FromInt by @my-git9 in #901
- [Bug] Routers that require cache failed on Register by @zhangjyr in #913
- [Misc] chore: remove unnecessary check for pod is zero by @googs1025 in #908
- [Bug] add Tolerations for kvcache pod to fix Pending and CrashLoopBackOff on GKE by @runzhen in #909
- [Misc] chore(raycluster): add concurrency limit and error aggregation to scaleDown by @googs1025 in #914
- [API] Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing. by @zhangjyr in #884
- Enable parallel client using thread pool in benchmark client by @happyandslow in #919
- [Misc] Add pods stats example: running requests. by @zhangjyr in #918
- [CLI] feature(modeladapter): make modeladapter controller scheduler policy be configured by @googs1025 in #921
- [Bug] Syncmap.Store does not update. by @zhangjyr in #925
- [API] [Misc]: Support LRU cache with TTL for prefix cache indexer by @vie-serendipity in #905
- Remove unused argument from workload generator by @happyandslow in #929
- [BUG] cache: handle DeletedFinalStateUnknown by the delete func by @Iceber in #926
- [Misc] feature(rayclusterreplicaset): check rayclusters crd is installed before controller start by @googs1025 in #922
- [BUG] return directly when error occurs while adding the controller by @Iceber in #937
- [Misc] fix log typo by @Iceber in #935
- Move delays to threads in benchmark by @happyandslow in #939
- [BUG] controller: fix generating the corresponding HPA object for the PA by @Iceber in #934
- Support multi-turn scenarios in benchmark client by @happyandslow in #907
- Refactoring benchmark folder by @happyandslow in #946
- Performance improvements for prefix cache routing by @varungup90 in #933
- [Misc]: move crd check in Initialize part by @googs1025 in #949
- [CLI] Add --disableWebhook in controller by @Jeffwan in #931
- [BUG] controller: handle DeletedFinalStateUnknown by the delete func by @Iceber in #938
- fix: complete RayClusterFleet example for multi-node vLLM inference by @ModiIntel in #954
- [Misc] remove the duplicated env functions by @Iceber in #953
- [Misc] increase the memory limit of the controller-manager by @Iceber in #952
- chore: add help func for get Env value by @googs1025 in #941
- [Bug] avoid frequent lookup of the routing strategy env by @Iceber in #956
- Enable standalone installation of kv-cache-controller by @Jeffwan in #930
- [fix] Fix wheel build errors in runtime image by @Jeffwan in #961
- Control maximum concurrent session for workload generator by @happyandslow in #963
- Updating Plotting Script to Visualize Sharing Patterns by @happyandslow in #965
- Change synthetic cache sharing dataset format by @happyandslow in #966
- [Bug] prevent reference grant delete if shared by other deployments by @varungup90 in #968
- Add graceful shutdown for gateway and add liveness/readiness probes by @varungup90 in #962
- Add httproute status check for response header errors by @varungup90 in #957
- Update envoy proxy and gateway-plugins config by @varungup90 in #967
- Refactor kv cache controller to support different setup modes by @Jeffwan in #971
- [Misc] chore: use t.Log instead of Println by @googs1025 in #973
- [fix] Handle error output in analysis script by @happyandslow in #975
- [Fix] Unify all workload generator output file names by @happyandslow in #976
- [Fix] Fix shallowcopy error in prompt history retrieval by @happyandslow in #978
- Assign tasks to client by keys by @happyandslow in #979
- [Fix] Fix error case handling for client output analysis by @happyandslow in #980
- cmd/controllers: add readyz check for the webhook by @Iceber in #969
- Bug fix generating plain data format by @happyandslow in #982
- [BUG] cache: start informer after adding the resource handler by @Iceber in #981
- Support distributed hashing mode kv cache pool by @Jeffwan in #984
- [Feature]: introducing a basic VTC router in gateway plugin to start supporting fairness based routing by @Venkat2811 in #964
- [Docs]: Fixed Broken link for tutorials by @SuperMohit in #992
- End-to-end script for workload running process by @happyandslow in #947
- Support hpkv in kv cache controller by @Jeffwan in #985
- Update add redis pass for client by @weapons97 in #990
- [Misc] chore: refactor selectTargetPod func by @googs1025 in #1000
- [Misc] chore: remove unused func by @googs1025 in #1005
- [Misc] Move redis load_env to method level by @Jeffwan in #1002
- [Fix] 401 errors in gateway should be returned as immediate response by @Jeffwan in #1006
- [BUG] ratelimit: fix the wrong TPM key name by @runzhen in #987
- [Misc] Add app.kubernetes.io/name labels to components by @Jeffwan in #1003
- E2E CI fix to ensure all pods are ready by @varungup90 in #972
- Allow manual trigger for build/push docker images by @varungup90 in #1017
- [Fix] Prioritize AutoTokenizer in get_tokenizer with fallback to tiktoken by @Jeffwan in #1016
- [Bug] fix: update metaPods cache to use namespace/name as the key by @googs1025 in #1015
- [Fix]: optimize vtc-basic router algo from modulo to more robust adaptive-clamped-linear by @Venkat2811 in #1011
- Enabling adjustable client pool size and output token limit by @happyandslow in #1025
- [Misc] fix: Optimizing Route method of the gateway algorithms by @googs1025 in #1001
- [Docs] Support minikube on Lambda cloud and add AWS page by @Jeffwan in #1020
- [BUG] Use more accurate chi-squared test for randomness validation in e2e test. by @zhangjyr in #1027
- Support multiple configs in synthetic shared dataset by @happyandslow in #1033
- [Fix] Bug fix for constant workload QPS by @happyandslow in #1036
- Improve installation test e2e time by @varungup90 in #1034
- Add Pareto Sampler for Multiturn Dataset Generation by @happyandslow in #1038
- use atomic.Int32c instead of use Int32 type by @googs1025 in #1035
- [CI] Enable GHCR image build and push by @Jeffwan in #1041
- Adding interval scaling factor for client by @happyandslow in #1043
- [Bug] fix: use RLock() instead of Lock() when reading var by @googs1025 in #1044
- Dataset generator output argument fix by @happyandslow in #1042
- Docker push multi-platform images by @varungup90 in #1026
- Refactor the kvcache backend to support infinistore by @Jeffwan in #1037
- [Docs] Document gpu optimizer as experimental and improve deployment config. by @zhangjyr in #1051
- [Fix] Removing redundant locks in prefix cache and load router function by @gangmuk in #1024
- [Feature] AIBrix KVCache common by @DwyaneShi in #1057
- [Feature] AIBrix KVCache L1Cache by @DwyaneShi in #1061
- [Feature] AIBrix KVCache L2Cache Part1 by @DwyaneShi in #1062
- Add dashboard and monitoring setup steps for control plane by @Jeffwan in #1048
- [Feature] AIBrix KVCache L2Cache Part2 by @DwyaneShi in #1063
- [Feature] AIBrix KVCache L2Cache Part3 and KVCache Managers by @DwyaneShi in #1064
- [Bug] fix: add more info for pod metric fetch failures in GetMetricsFromPods by @googs1025 in #1039
- [Fix] Fix multiple issues for benchmark implementation by @happyandslow in #1049
- [Fix] Fix L2Cache's register descriptor container by @DwyaneShi in #1068
- Update kvcache v1alpha1 api spec by @Jeffwan in #1055
- [Integration] vLLM integration patch for AIBrix KVCache by @DwyaneShi in #1069
- [Misc] fix: add miss Close() for redis client by @googs1025 in #1056
- Support prometheus metrics in kv watcher pod by @Jeffwan in #1073
- Add rdma gid search scripts by @Jeffwan in #1072
- Support watcher pod rbac in kvcache controller by @Jeffwan in #1071
- [Misc] chore: change Scheduler interface describe in modeladapter controller by @googs1025 in #1075
- [MISC]: add vtc_bucket_size_active metric gauge for vtc-basic by @Venkat2811 in #1065
- [Integration] Update vLLM integration by @DwyaneShi in #1080
- [Misc] Clean up deployment scripts for volcengine by @Jeffwan in #1081
- Cut v0.3.0-rc.1 release by @Jeffwan in #1083
- [fix] Correct the python build path in same step by @Jeffwan in #1084
- [Misc] Skip attaching python artifacts to github release by @Jeffwan in #1085
- [Bug] fix: condition nil panic in FindStatusCondition func by @googs1025 in #1078
- Refactor request body processing and add multi-turn conversation support by @varungup90 in #1067
- Upload arm build images with git.ref_name by @varungup90 in #1090
- Update documentation and add openai sdk samples by @varungup90 in #1092
- Rename preble based prefix routing strategy by @varungup90 in #1104
- Add v0.3.0 ps performance regression test scenario by @Jeffwan in #1099
- Migrating benchmark entrypoints to python client by @happyandslow in #1066
- [Misc] Add demo manifests for volcano engine by @Jeffwan in #1105
- [Integration] KVCache: update vLLM integration by @DwyaneShi in #1107
- [Bug]fix: add scale subresource to rayclusterfleet by @zhixian82 in #1082
- [Feature] KVCache: Support InfiniStore GID and enhance cluster mode by @DwyaneShi in #1106
- [Chore] fix: regenerate crd by @zhixian82 in #1109
- [Chore] KVCache: enhance format and dependencies by @DwyaneShi in #1108
- Polish benchmark manifests and VE samples by @Jeffwan in #1113
- [API] Support customized template for cache by @Jeffwan in #1114
- Bump version to v0.3.0-rc.2 by @Jeffwan in #1115
- [Fix] Move pdb from patch to resources by @Jeffwan in #1117
- [Docs] Add feature manuals for KVCache by @DwyaneShi in #1119
- [Docs] Adding benchmark doc by @happyandslow in #999
- [Docs] format KVCache docs to eliminate warnings by @DwyaneShi in #1122
- Update multi-arch image push to include all platforms for release by @varungup90 in #1124
- [Docs] Init folder for KVCache benchmark scenario by @DwyaneShi in #1125
- [Docs] Addressing benchmark doc comments by @happyandslow in #1123
- Cut v0.3.0 release by @Jeffwan in #1126
- Bump python project version to v0.3.0 by @Jeffwan in #1127