Skip to content

v0.3.0

Latest
Compare
Choose a tag to compare
@github-actions github-actions released this 21 May 21:42
· 61 commits to main since this release
ecc3529

Automatically generated release for tag v0.3.0.

🚀 New Features Highlights

  • AIBrix KVCache Offloading Framework: Introduces a pluggable multi-tier KVCache architecture with support for DRAM and remote backends, enabling efficient offloading of KV states to reduce GPU memory pressure and increase deployment density. (#1057, #1061, #1062, #1063, #1064, #1068, #1069, #1080, #1107)
  • New KVCache orchestration API: Refactors the orchestration layer to support distributed hashing based caching solutions. (#971, #984, #985, #1037, #1055, #1071, #1114)
  • Prefix Cache and Load aware Routing: Uses hash token-based prefix matching and load awareness to reduce latency by increasing prefix cache hit rate and routing efficiency (#838, #774, #933, #1067)
  • Preble Routing (ICLR’25): An implementation of Preble, it balances KV cache reuse and GPU load by comparing prefix lengths and computing prompt-aware cost scores for optimal routing. (#678, #719, #730, #1024)
  • Fairness-oriented Routing (OSDI’24 VTC): Introduces the vtc-basic router with Windowed Adaptive Fairness Routing, which dynamically tracks token usage and ensures fair load distribution across pods. (#964, #1011, #1065)

📊 Feature Enhancements

Gateway Enhancements

  • Support for OpenAI-compatible APIs, including streaming responses, usage reporting, asynchronous handling, and standardized error responses for seamless end-to-end integration. (#703, #788, #799)
  • Introduced the /v1/models endpoint for compatibility with OpenAI-style API clients. (#802)
  • Refactored gateway-plugins with an extensible ext-proc server architecture, laying the foundation for pluggable policies. (#810)
  • Improved concurrency safety and routing stability through major cache and router redesigns (#878, #884)

Control Plane:

  • Added Kubernetes webhook validation for CRDs, providing early error feedback during resource creation (#748, #786).
  • Improve RayClusterFleet to fully support Deepseek-r1/v3 models (#789, #826, #835, #914, #954).
  • Add scale subresource in RayClusterFleet CRD and enable HPA support (#1082, #1109)

Installation Experiences:

  • Introduced Terraform modules for GCP and Kubernetes deployment (#823).
  • Added setup guides for Minikube on Lambda Cloud and AWS in the documentation (#1020).
  • Enabled standalone controller installation for simplified system bootstrapping.(#930, #931)
  • Streamlined upgrade workflows by introducing kubectl apply support. CRDs are now split and applied with --server-side, avoiding annotation size limits and enabling smooth incremental updates. (#793)
  • Enabled container image publishing to Github Container Registry (GHCR) (#1041).
  • Support ARM container Images (#1090)

Observability & Stability:

  • Shipped prebuilt Grafana dashboards covering control plane, gateway, and KV cache components for out-of-the-box observability. (#1048)
  • Tuned Envoy proxy memory and buffer configurations for better performance under high concurrency. (#825)
  • Tuned Envoy proxy configurations for memory and buffer management under high concurrency (#967).
  • Added graceful shutdown, liveness, and readiness probes to improve service resilience (#962).
  • Delivered production-ready monitoring setups for all major system components (#1048).

New Contributors

What's Changed

Full Changelog: v0.2.0...v0.3.0

  • [Docs] fix format of the dist kv cache doc by @DwyaneShi in #714
  • complete the 'make generate' command by @kerthcet in #711
  • Update organization reference in code base by @Jeffwan in #717
  • [Misc] Update the documentation link by @Jeffwan in #720
  • Initial implementation of radix tree-based cache by @gangmuk in #678
  • Add model adapter e2e tests by @varungup90 in #701
  • Add vllm cpu alternative for local development by @varungup90 in #721
  • Add white paper file by @Jeffwan in #724
  • Adding streaming client for AIbrix experiments by @happyandslow in #676
  • [Docs] Update Readme with new links and blog post, and update white paper by @xieus in #725
  • Recording failed requests in benchmark client by @gangmuk in #727
  • Process response headers in gateway by @varungup90 in #703
  • [misc] Fix white paper link by @Jeffwan in #728
  • Prefix and load aware routing with radix tree kv cache by @gangmuk in #719
  • Fix slack link in README.md by @Jeffwan in #729
  • [readme] Fix wrong link by @gaocegege in #731
  • [Misc] update scheduler.py by @eltociear in #736
  • Improve thread safety for TreeNode data structure and refactor related codes by @gangmuk in #730
  • Fix CacheSpec api scheme by @kerthcet in #740
  • docs: Fix link to license by @terrytangyuan in #746
  • Use native codegen cmd generating client-go by @kerthcet in #741
  • [Docs]: Fixed kubectl commands for install of components by @jolfr in #744
  • [fix] fixing bug in using AsyncOpenAI client (header setting, token counting, etc) by @gangmuk in #738
  • Add webhook framework by @kerthcet in #748
  • Use random seed for xxhash by @varungup90 in #752
  • Create SECURITY.md to enable security policy by @xieus in #756
  • [CI] Add integration test by @kerthcet in #759
  • [Bug] fix: correct non-inherited context by @Abirdcfly in #763
  • [Misc] Parametrize Makefile for mocked vLLM apps by @pierDipi in #764
  • Support benchmarking script by using real application trace by @nwangfw in #737
  • Maintaining common benchmarks utils in a separate dir by @gangmuk in #770
  • Ignore worker pods for gateway routing by @varungup90 in #776
  • Disable ENABLE_PROBES_INJECTION in correct way by @Jeffwan in #779
  • Make stream include usage as optional by @varungup90 in #788
  • Append ray head label selector in PodAutoscaler by @Jeffwan in #789
  • Remove redundant install crds in makefile by @varungup90 in #792
  • Update request message processing for /v1/completion input by @varungup90 in #794
  • Added target pod to client result and made clients consistent by @gangmuk in #799
  • Enable CI tests for release branch by @Jeffwan in #805
  • Move modelAdapter runtime validation to webhook by @kerthcet in #786
  • [Misc] Adding model field to each request by @happyandslow in #812
  • [Refactor]: gateway-plugins ext-proc server codebase by @Xunzhuo in #810
  • [CI]: update release tags pattern by @Xunzhuo in #815
  • [Docs]: fix vllm mock app Unauthorized response by @Xunzhuo in #817
  • Reconfigure workload generator for predefined synthetic patterns by @happyandslow in #771
  • Workload generation scripts for prefix aware routing by @gangmuk in #820
  • Fix the paths in lambda cloud doc by @gangmuk in #824
  • [Bug] Added Startup Probe in Quickstart Model by @jolfr in #773
  • Add /v1/models endpoint to gateway by @varungup90 in #802
  • Increase envoy proxy memory config and client connection buffersize by @varungup90 in #825
  • Support to create default HttpRoute for RayClusterFleet by @Jeffwan in #826
  • [Misc] Fix CI issue on release branch and clean up logs by @Jeffwan in #837
  • Fix repeated initialization of gateway routers and add unit test for prefix cache by @varungup90 in #838
  • Add deepseek-r1 671B deployment sample and docs by @Jeffwan in #835
  • Bump AIBrix version to v0.2.1 in manifests by @Jeffwan in #839
  • [Docs] Update Slack link by @gaocegege in #841
  • [Docs] Remove repeated lines by @zjd0112 in #849
  • Bump AIBrix version to v0.2.1 for standalone distributed inference by @SongGuyang in #850
  • Support OpenAI api style /v1/models response by @Jeffwan in #829
  • [Misc] Resolve symlink ambiguity when generating codes by @vaaandark in #856
  • Introduce RoutingContext in Route interface and clean up stale codes by @Jeffwan in #855
  • [Misc]: sync hpa status to podAutoScaler by @vie-serendipity in #860
  • Generate workload based on prefix sharing synthetic data by @happyandslow in #840
  • Fixing missing image link in #840 by @happyandslow in #871
  • Cite Melange paper in heterogeneous feature by @Jeffwan in #872
  • [Misc] support linux for vllm cpu local development by @nurali-techie in #867
  • Refactor make deploy to use apply instead of create by @varungup90 in #793
  • Use string based tokenizer in prefix cache by @varungup90 in #774
  • Add profiling support for gateway plugins and bug fix to close stream decoder by @varungup90 in #857
  • Add flag to enable/disable GPU Optimizer tracing by @varungup90 in #875
  • [Docs] fix typo in runtime feature page by @legendtkl in #870
  • chore: clean-up mock yaml by @Xunzhuo in #877
  • Fixing image link error in workload generator README.md by @happyandslow in #888
  • Update Synthetic Load Prodefined Config for Geneerator by @happyandslow in #889
  • [Misc] Fix plot_workload to pass dirname to makedirs by @ronaldosaheki in #886
  • [Misc] Fix client.py in case workload has model null and client has default_model by @ronaldosaheki in #887
  • [WIP] Adding input/output distribution argument to constant load generator by @happyandslow in #882
  • [Docs] Fix broken contributing guidelines link in README by @nadongjun in #890
  • [Bug] fix install script PATH environment variable by @cr7258 in #893
  • [Docs] Link to dynamic lora from docs by @thomasjpfan in #883
  • [API] Refactor: core cache design and impl by @Xunzhuo in #878
  • Added antiaffinity in kvcache crd by @gangmuk in #865
  • [Docs] Fix tpm and rpm typo in gateway-plugins.rst by @runzhen in #896
  • [Misc] Remove unused function in pkg/utils by @my-git9 in #895
  • Remove model name from client and generator by @happyandslow in #894
  • [Misc] Add PS benchmark manifests and scripts by @Jeffwan in #899
  • Add release overlays to update control plane config for production deployment by @varungup90 in #900
  • [Misc][Docs]: GCP and Kubernetes Terraform Deployment Modules by @jolfr in #823
  • [Misc] Cleanup deprecated function intstr.FromInt by @my-git9 in #901
  • [Bug] Routers that require cache failed on Register by @zhangjyr in #913
  • [Misc] chore: remove unnecessary check for pod is zero by @googs1025 in #908
  • [Bug] add Tolerations for kvcache pod to fix Pending and CrashLoopBackOff on GKE by @runzhen in #909
  • [Misc] chore(raycluster): add concurrency limit and error aggregation to scaleDown by @googs1025 in #914
  • [API] Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing. by @zhangjyr in #884
  • Enable parallel client using thread pool in benchmark client by @happyandslow in #919
  • [Misc] Add pods stats example: running requests. by @zhangjyr in #918
  • [CLI] feature(modeladapter): make modeladapter controller scheduler policy be configured by @googs1025 in #921
  • [Bug] Syncmap.Store does not update. by @zhangjyr in #925
  • [API] [Misc]: Support LRU cache with TTL for prefix cache indexer by @vie-serendipity in #905
  • Remove unused argument from workload generator by @happyandslow in #929
  • [BUG] cache: handle DeletedFinalStateUnknown by the delete func by @Iceber in #926
  • [Misc] feature(rayclusterreplicaset): check rayclusters crd is installed before controller start by @googs1025 in #922
  • [BUG] return directly when error occurs while adding the controller by @Iceber in #937
  • [Misc] fix log typo by @Iceber in #935
  • Move delays to threads in benchmark by @happyandslow in #939
  • [BUG] controller: fix generating the corresponding HPA object for the PA by @Iceber in #934
  • Support multi-turn scenarios in benchmark client by @happyandslow in #907
  • Refactoring benchmark folder by @happyandslow in #946
  • Performance improvements for prefix cache routing by @varungup90 in #933
  • [Misc]: move crd check in Initialize part by @googs1025 in #949
  • [CLI] Add —disableWebhook in controller by @Jeffwan in #931
  • [BUG] controller: handle DeletedFinalStateUnknown by the delete func by @Iceber in #938
  • fix: complete RayClusterFleet example for multi-node vLLM inference by @ModiIntel in #954
  • [Misc] remove the duplicated env functions by @Iceber in #953
  • [Misc] increase the memory limit of the controller-manager by @Iceber in #952
  • chore: add help func for get Env value by @googs1025 in #941
  • [Bug] avoid frequent lookup of the routing strategy env by @Iceber in #956
  • Enable standalone installation of kv-cache-controller by @Jeffwan in #930
  • [fix] Fix wheel build errors in runtime image by @Jeffwan in #961
  • Control maximum concurrent session for workload generator by @happyandslow in #963
  • Updating Plotting Script to Visualize Sharing Patterns by @happyandslow in #965
  • Change synthetic cache sharing dataset format by @happyandslow in #966
  • [Bug] prevent reference grant delete if shared by other deployments by @varungup90 in #968
  • Add graceful shutdown for gateway and add liveness/readiness probes by @varungup90 in #962
  • Add httproute status check for response header errors by @varungup90 in #957
  • Update envoy proxy and gateway-plugins config by @varungup90 in #967
  • Refactor kv cache controller to support different setup modes by @Jeffwan in #971
  • [Misc] chore: use t.Log instead of Println by @googs1025 in #973
  • [fix] Handle error output in analysis script by @happyandslow in #975
  • [Fix] Unify all workload generator output file names by @happyandslow in #976
  • [Fix] Fix shallowcopy error in prompt history retrieval by @happyandslow in #978
  • Assign tasks to client by keys by @happyandslow in #979
  • [Fix] Fix error case handling for client output analysis by @happyandslow in #980
  • cmd/controllers: add readyz check for the webhook by @Iceber in #969
  • Bug fix generating plain data format by @happyandslow in #982
  • [BUG] cache: start informer after adding the resource handler by @Iceber in #981
  • Support distributed hashing mode kv cache pool by @Jeffwan in #984
  • [Feature]: introducing a basic VTC router in gateway plugin to start supporting fairness based routing by @Venkat2811 in #964
  • [Docs]: Fixed Broken link for tutorials by @SuperMohit in #992
  • End-to-end script for workload runnning process by @happyandslow in #947
  • Support hpkv in kv cache controller by @Jeffwan in #985
  • Update add redis pass for client by @weapons97 in #990
  • [Misc] chore: refactor selectTargetPod func by @googs1025 in #1000
  • [Misc] chore: remove unuse func by @googs1025 in #1005
  • [Misc] Move redis load_env to method level by @Jeffwan in #1002
  • [Fix] 401 errors in gateway should be returned as immediate response by @Jeffwan in #1006
  • [BUG] ratelimit: fix the wrong TPM key name by @runzhen in #987
  • [Misc] Add app.kubernetes.io/name labels to components by @Jeffwan in #1003
  • E2E CI fix to ensure all pods are ready by @varungup90 in #972
  • Allow manual trigger for build/push docker images by @varungup90 in #1017
  • [Fix] Prioritize AutoTokenizer in get_tokenizer with fallback to tiktoken by @Jeffwan in #1016
  • [Bug] fix: update metaPods cache to use namespace/name as the key by @googs1025 in #1015
  • [Fix]: optimize vtc-basic router algo from modulo to more robust adaptive-clamped-linear by @Venkat2811 in #1011
  • Enabling adjustable client pool size and output token limit by @happyandslow in #1025
  • [Misc] fix: Optimizing Route method of the gateway algorithms by @googs1025 in #1001
  • [Docs] Support minikube on Lambda cloud and add AWS page by @Jeffwan in #1020
  • [BUG] Use more accurate chi-squared test for randomness validation in e2e test. by @zhangjyr in #1027
  • Support multiple configs in synthetic shared dataset by @happyandslow in #1033
  • [Fix] Bug fix for constant workload QPS by @happyandslow in #1036
  • Improve installlation test e2e time by @varungup90 in #1034
  • Add Pareto Sampler for Multiturn Dataset Generation by @happyandslow in #1038
  • use atomic.Int32c instead of use Int32 type by @googs1025 in #1035
  • [CI] Enable GHCR image build and push by @Jeffwan in #1041
  • Adding interval scaling factor for client by @happyandslow in #1043
  • [Bug] fix: use RLock() instead of Lock() when reading var by @googs1025 in #1044
  • Dataset generator output argument fix by @happyandslow in #1042
  • Docker push multi-platform images by @varungup90 in #1026
  • Refactor the kvcache backend to support infinistore by @Jeffwan in #1037
  • [Docs] Document gpu optimizer as experimental and improve deployment config. by @zhangjyr in #1051
  • [Fix] Removing redundant locks in prefix cache and load router function by @gangmuk in #1024
  • [Feature] AIBrix KVCache common by @DwyaneShi in #1057
  • [Feature] AIBrix KVCache L1Cache by @DwyaneShi in #1061
  • [Feature] AIBrix KVCache L2Cache Part1 by @DwyaneShi in #1062
  • Add dashboard and monitoring setup steps for control plane by @Jeffwan in #1048
  • [Feature] AIBrix KVCache L2Cache Part2 by @DwyaneShi in #1063
  • [Feature] AIBrix KVCache L2Cache Part3 and KVCache Managers by @DwyaneShi in #1064
  • [Bug] fix: add more info for pod metric fetch failures in GetMetricsFromPods by @googs1025 in #1039
  • [Fix] Fix multiple issues for benchmark implementation by @happyandslow in #1049
  • [Fix] Fix L2Cache's register descriptor container by @DwyaneShi in #1068
  • Update kvcache v1alpha1 api spec by @Jeffwan in #1055
  • [Integration] vLLM integration patch for AIBrix KVCache by @DwyaneShi in #1069
  • [Misc] fix: add miss Close() for redis client by @googs1025 in #1056
  • Support prometheus metrics in kv watcher pod by @Jeffwan in #1073
  • Add rdma gid search scripts by @Jeffwan in #1072
  • Support watcher pod rbac in kvcache controller by @Jeffwan in #1071
  • [Misc] chore: change Scheduler interface describe in modeladapter controller by @googs1025 in #1075
  • [MISC]: add vtc_bucket_size_active metric gauge for vtc-basic by @Venkat2811 in #1065
  • [Integration] Update vLLM integration by @DwyaneShi in #1080
  • [Misc] Clean up deployment scripts for volcengine by @Jeffwan in #1081
  • Cut v0.3.0-rc.1 release by @Jeffwan in #1083
  • [fix] Correct the python build path in same step by @Jeffwan in #1084
  • [Misc] Skip attaching python artifacts to github release by @Jeffwan in #1085
  • [Bug] fix: condition nil panic in FindStatusCondition func by @googs1025 in #1078
  • Refactor request body processing and add multi-turn conversation support by @varungup90 in #1067
  • Upload arm build images with git.ref_name by @varungup90 in #1090
  • Update documentation and add openai sdk samples by @varungup90 in #1092
  • Rename preble based prefix routing strategy by @varungup90 in #1104
  • Add v0.3.0 ps performance regression test scenario by @Jeffwan in #1099
  • Migrating benchmark entrypoints to python client by @happyandslow in #1066
  • [Misc] Add demo manifests for volcano engine by @Jeffwan in #1105
  • [Integration] KVCache: update vLLM integration by @DwyaneShi in #1107
  • [Bug]fix: add scale subresource to rayclusterfleet by @zhixian82 in #1082
  • [Feature] KVCache: Suppport InfiniStore GID and enhance cluster mode by @DwyaneShi in #1106
  • [Chore] fix: regenerate crd by @zhixian82 in #1109
  • [Chore] KVCache: enhance format and dependencies by @DwyaneShi in #1108
  • Polish benchmark manifests and VE samples by @Jeffwan in #1113
  • [API] Support customized template for cache by @Jeffwan in #1114
  • Bump version to v0.3.0-rc.2 by @Jeffwan in #1115
  • [Fix] Move pdb from patch to resources by @Jeffwan in #1117
  • [Docs] Add feature manuals for KVCache by @DwyaneShi in #1119
  • [Docs] Adding benchmark doc by @happyandslow in #999
  • [Docs] format KVCache docs to eliminate warnings by @DwyaneShi in #1122
  • Update multi-arch image push to include all platforms for release by @varungup90 in #1124
  • [Docs] Init folder for KVCache benchmark scenario by @DwyaneShi in #1125
  • [Docs] Addressing benchmark doc comments by @happyandslow in #1123
  • Cut v0.3.0 release by @Jeffwan in #1126
  • Bump python project version to v0.3.0 by @Jeffwan in #1127