Releases: vllm-project/aibrix
v0.3.0
Automatically generated release for tag v0.3.0.
🚀 New Features Highlights
- AIBrix KVCache Offloading Framework: Introduces a pluggable multi-tier KVCache architecture with support for DRAM and remote backends, enabling efficient offloading of KV states to reduce GPU memory pressure and increase deployment density. (#1057, #1061, #1062, #1063, #1064, #1068, #1069, #1080, #1107)
- New KVCache orchestration API: Refactors the orchestration layer to support distributed hashing based caching solutions. (#971, #984, #985, #1037, #1055, #1071, #1114)
- Prefix Cache and Load-Aware Routing: Uses hashed-token prefix matching together with load awareness to raise the prefix cache hit rate and routing efficiency, which reduces latency. (#838, #774, #933, #1067)
- Preble Routing (ICLR’25): An implementation of Preble that balances KV cache reuse and GPU load by comparing prefix lengths and computing prompt-aware cost scores for optimal routing. (#678, #719, #730, #1024)
- Fairness-oriented Routing (OSDI’24 VTC): Introduces the vtc-basic router with Windowed Adaptive Fairness Routing, which dynamically tracks token usage and ensures fair load distribution across pods. (#964, #1011, #1065)
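The prefix-cache and load-aware routing idea above can be sketched roughly as follows. This is a minimal illustration, not the AIBrix implementation: the block size, the SHA-256 chain hash, the `pods` dictionary layout, and the function names are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per hashed block (illustrative value, not from AIBrix)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash full blocks of tokens so each hash identifies a whole prefix."""
    hashes = []
    prev = b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


def matched_blocks(request_hashes, cached_hashes):
    """Count the leading request blocks that a pod already holds."""
    count = 0
    for h in request_hashes:
        if h not in cached_hashes:
            break
        count += 1
    return count


def pick_pod(tokens, pods):
    """Pick a pod name from a hypothetical {name: {"cache": set, "load": int}} map.

    Prefer the longest prefix match; break ties by lowest load, then by name.
    """
    req = block_hashes(tokens)
    return min(
        pods,
        key=lambda name: (
            -matched_blocks(req, pods[name]["cache"]),
            pods[name]["load"],
            name,
        ),
    )
```

A pod that already holds the leading blocks of a prompt wins even when it is busier; when no pod has a matching prefix, the least-loaded pod is chosen. A production router would also cap how much load a cache hit may outweigh.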
📊 Feature Enhancements
Gateway Enhancements
- Support for OpenAI-compatible APIs, including streaming responses, usage reporting, asynchronous handling, and standardized error responses for seamless end-to-end integration. (#703, #788, #799)
- Introduced the /v1/models endpoint for compatibility with OpenAI-style API clients. (#802)
- Refactored gateway-plugins with an extensible ext-proc server architecture, laying the foundation for pluggable policies. (#810)
- Improved concurrency safety and routing stability through major cache and router redesigns. (#878, #884)
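For readers consuming the streaming responses, a client-side sketch may help. It assumes the gateway follows the OpenAI chat-completions streaming convention (`data: {json}` lines, a final `data: [DONE]`, and an optional `usage` object on a late chunk); the release notes do not spell out the wire format, and `parse_stream` is a hypothetical helper name.

```python
import json


def parse_stream(lines):
    """Collect text and the optional usage block from OpenAI-style SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            text_parts.append(delta.get("content") or "")
        if chunk.get("usage"):
            usage = chunk["usage"]  # present only when usage reporting is enabled
    return "".join(text_parts), usage
```

Because usage reporting is optional, callers should treat a `None` usage value as normal rather than as an error.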
Control Plane:
- Added Kubernetes webhook validation for CRDs, providing early error feedback during resource creation (#748, #786).
- Improved RayClusterFleet to fully support DeepSeek-R1/V3 models. (#789, #826, #835, #914, #954)
- Added a scale subresource to the RayClusterFleet CRD and enabled HPA support. (#1082, #1109)
Installation Experiences:
- Introduced Terraform modules for GCP and Kubernetes deployment (#823).
- Added setup guides for Minikube on Lambda Cloud and AWS in the documentation (#1020).
- Enabled standalone controller installation for simplified system bootstrapping. (#930, #931)
- Streamlined upgrade workflows by introducing kubectl apply support. CRDs are now split and applied with --server-side, avoiding annotation size limits and enabling smooth incremental updates. (#793)
- Enabled container image publishing to GitHub Container Registry (GHCR). (#1041)
- Added support for ARM container images. (#1090)
Observability & Stability:
- Shipped prebuilt Grafana dashboards covering control plane, gateway, and KV cache components for out-of-the-box observability. (#1048)
- Tuned Envoy proxy memory and buffer configurations for better performance under high concurrency. (#825)
- Tuned Envoy proxy configurations for memory and buffer management under high concurrency (#967).
- Added graceful shutdown, liveness, and readiness probes to improve service resilience (#962).
- Delivered production-ready monitoring setups for all major system components (#1048).
New Contributors
- @gaocegege made their first contribution in #731
- @eltociear made their first contribution in #736
- @terrytangyuan made their first contribution in #746
- @jolfr made their first contribution in #744
- @Abirdcfly made their first contribution in #763
- @pierDipi made their first contribution in #764
- @Xunzhuo made their first contribution in #810
- @zjd0112 made their first contribution in #849
- @SongGuyang made their first contribution in #850
- @vaaandark made their first contribution in #856
- @vie-serendipity made their first contribution in #860
- @nurali-techie made their first contribution in #867
- @legendtkl made their first contribution in #870
- @ronaldosaheki made their first contribution in #886
- @nadongjun made their first contribution in #890
- @cr7258 made their first contribution in #893
- @thomasjpfan made their first contribution in #883
- @runzhen made their first contribution in #896
- @my-git9 made their first contribution in #895
- @googs1025 made their first contribution in #908
- @Iceber made their first contribution in #926
- @ModiIntel made their first contribution in #954
- @Venkat2811 made their first contribution in #964
- @SuperMohit made their first contribution in #992
- @weapons97 made their first contribution in #990
- @zhixian82 made their first contribution in #1082
What's Changed
Full Changelog: v0.2.0...v0.3.0
- [Docs] fix format of the dist kv cache doc by @DwyaneShi in #714
- complete the 'make generate' command by @kerthcet in #711
- Update organization reference in code base by @Jeffwan in #717
- [Misc] Update the documentation link by @Jeffwan in #720
- Initial implementation of radix tree-based cache by @gangmuk in #678
- Add model adapter e2e tests by @varungup90 in #701
- Add vllm cpu alternative for local development by @varungup90 in #721
- Add white paper file by @Jeffwan in #724
- Adding streaming client for AIbrix experiments by @happyandslow in #676
- [Docs] Update Readme with new links and blog post, and update white paper by @xieus in #725
- Recording failed requests in benchmark client by @gangmuk in #727
- Process response headers in gateway by @varungup90 in #703
- [misc] Fix white paper link by @Jeffwan in #728
- Prefix and load aware routing with radix tree kv cache by @gangmuk in #719
- Fix slack link in README.md by @Jeffwan in #729
- [readme] Fix wrong link by @gaocegege in #731
- [Misc] update scheduler.py by @eltociear in #736
- Improve thread safety for TreeNode data structure and refactor related codes by @gangmuk in #730
- Fix CacheSpec api scheme by @kerthcet in #740
- docs: Fix link to license by @terrytangyuan in #746
- Use native codegen cmd generating client-go by @kerthcet in #741
- [Docs]: Fixed kubectl commands for install of components by @jolfr in #744
- [fix] fixing bug in using AsyncOpenAI client (header setting, token counting, etc) by @gangmuk in #738
- Add webhook framework by @kerthcet in #748
- Use random seed for xxhash by @varungup90 in #752
- Create SECURITY.md to enable security policy by @xieus in #756
- [CI] Add integration test by @kerthcet in #759
- [Bug] fix: correct non-inherited context by @Abirdcfly in #763
- [Misc] Parametrize Makefile for mocked vLLM apps by @pierDipi in #764
- Support benchmarking script by using real application trace by @nwangfw in #737
- Maintaining common benchmarks utils in a separate dir by @gangmuk in #770
- Ignore worker pods for gateway routing by @varungup90 in #776
- Disable ENABLE_PROBES_INJECTION in correct way by @Jeffwan in #779
- Make stream include usage as optional by @varungup90 in #788
- Append ray head label selector in PodAutoscaler by @Jeffwan in #789
- Remove redundant install crds in makefile by @varungup90 in #792
- Update request message processing for /v1/completion input by @varungup90 in #794
- Added target...
v0.3.0-rc.2
Automatically generated release for tag v0.3.0-rc.2.
What's Changed
- [Bug] fix: condition nil panic in FindStatusCondition func by @googs1025 in #1078
- Refactor request body processing and add multi-turn conversation support by @varungup90 in #1067
- Upload arm build images with git.ref_name by @varungup90 in #1090
- Update documentation and add openai sdk samples by @varungup90 in #1092
- Rename preble based prefix routing strategy by @varungup90 in #1104
- Add v0.3.0 ps performance regression test scenario by @Jeffwan in #1099
- Migrating benchmark entrypoints to python client by @happyandslow in #1066
- [Misc] Add demo manifests for volcano engine by @Jeffwan in #1105
- [Integration] KVCache: update vLLM integration by @DwyaneShi in #1107
- [Bug]fix: add scale subresource to rayclusterfleet by @zhixian82 in #1082
- [Feature] KVCache: Support InfiniStore GID and enhance cluster mode by @DwyaneShi in #1106
- [Chore] fix: regenerate crd by @zhixian82 in #1109
- [Chore] KVCache: enhance format and dependencies by @DwyaneShi in #1108
- Polish benchmark manifests and VE samples by @Jeffwan in #1113
- [API] Support customized template for cache by @Jeffwan in #1114
- Bump version to v0.3.0-rc.2 by @Jeffwan in #1115
- [Fix] Move pdb from patch to resources by @Jeffwan in #1117
New Contributors
- @zhixian82 made their first contribution in #1082
Full Changelog: v0.3.0-rc.1...v0.3.0-rc.2
v0.3.0-rc.1
What's Changed
- [Docs] fix format of the dist kv cache doc by @DwyaneShi in #714
- complete the 'make generate' command by @kerthcet in #711
- Update organization reference in code base by @Jeffwan in #717
- [Misc] Update the documentation link by @Jeffwan in #720
- Initial implementation of radix tree-based cache by @gangmuk in #678
- Add model adapter e2e tests by @varungup90 in #701
- Add vllm cpu alternative for local development by @varungup90 in #721
- Add white paper file by @Jeffwan in #724
- Adding streaming client for AIbrix experiments by @happyandslow in #676
- [Docs] Update Readme with new links and blog post, and update white paper by @xieus in #725
- Recording failed requests in benchmark client by @gangmuk in #727
- Process response headers in gateway by @varungup90 in #703
- [misc] Fix white paper link by @Jeffwan in #728
- Prefix and load aware routing with radix tree kv cache by @gangmuk in #719
- Fix slack link in README.md by @Jeffwan in #729
- [readme] Fix wrong link by @gaocegege in #731
- [Misc] update scheduler.py by @eltociear in #736
- Improve thread safety for TreeNode data structure and refactor related codes by @gangmuk in #730
- Fix CacheSpec api scheme by @kerthcet in #740
- docs: Fix link to license by @terrytangyuan in #746
- Use native codegen cmd generating client-go by @kerthcet in #741
- [Docs]: Fixed kubectl commands for install of components by @jolfr in #744
- [fix] fixing bug in using AsyncOpenAI client (header setting, token counting, etc) by @gangmuk in #738
- Add webhook framework by @kerthcet in #748
- Use random seed for xxhash by @varungup90 in #752
- Create SECURITY.md to enable security policy by @xieus in #756
- [CI] Add integration test by @kerthcet in #759
- [Bug] fix: correct non-inherited context by @Abirdcfly in #763
- [Misc] Parametrize Makefile for mocked vLLM apps by @pierDipi in #764
- Support benchmarking script by using real application trace by @nwangfw in #737
- Maintaining common benchmarks utils in a separate dir by @gangmuk in #770
- Ignore worker pods for gateway routing by @varungup90 in #776
- Disable ENABLE_PROBES_INJECTION in correct way by @Jeffwan in #779
- Make stream include usage as optional by @varungup90 in #788
- Append ray head label selector in PodAutoscaler by @Jeffwan in #789
- Remove redundant install crds in makefile by @varungup90 in #792
- Update request message processing for /v1/completion input by @varungup90 in #794
- Added target pod to client result and made clients consistent by @gangmuk in #799
- Enable CI tests for release branch by @Jeffwan in #805
- Move modelAdapter runtime validation to webhook by @kerthcet in #786
- [Misc] Adding model field to each request by @happyandslow in #812
- [Refactor]: gateway-plugins ext-proc server codebase by @Xunzhuo in #810
- [CI]: update release tags pattern by @Xunzhuo in #815
- [Docs]: fix vllm mock app Unauthorized response by @Xunzhuo in #817
- Reconfigure workload generator for predefined synthetic patterns by @happyandslow in #771
- Workload generation scripts for prefix aware routing by @gangmuk in #820
- Fix the paths in lambda cloud doc by @gangmuk in #824
- [Bug] Added Startup Probe in Quickstart Model by @jolfr in #773
- Add /v1/models endpoint to gateway by @varungup90 in #802
- Increase envoy proxy memory config and client connection buffersize by @varungup90 in #825
- Support to create default HttpRoute for RayClusterFleet by @Jeffwan in #826
- [Misc] Fix CI issue on release branch and clean up logs by @Jeffwan in #837
- Fix repeated initialization of gateway routers and add unit test for prefix cache by @varungup90 in #838
- Add deepseek-r1 671B deployment sample and docs by @Jeffwan in #835
- Bump AIBrix version to v0.2.1 in manifests by @Jeffwan in #839
- [Docs] Update Slack link by @gaocegege in #841
- [Docs] Remove repeated lines by @zjd0112 in #849
- Bump AIBrix version to v0.2.1 for standalone distributed inference by @SongGuyang in #850
- Support OpenAI api style /v1/models response by @Jeffwan in #829
- [Misc] Resolve symlink ambiguity when generating codes by @vaaandark in #856
- Introduce RoutingContext in Route interface and clean up stale codes by @Jeffwan in #855
- [Misc]: sync hpa status to podAutoScaler by @vie-serendipity in #860
- Generate workload based on prefix sharing synthetic data by @happyandslow in #840
- Fixing missing image link in #840 by @happyandslow in #871
- Cite Melange paper in heterogeneous feature by @Jeffwan in #872
- [Misc] support linux for vllm cpu local development by @nurali-techie in #867
- Refactor make deploy to use apply instead of create by @varungup90 in #793
- Use string based tokenizer in prefix cache by @varungup90 in #774
- Add profiling support for gateway plugins and bug fix to close stream decoder by @varungup90 in #857
- Add flag to enable/disable GPU Optimizer tracing by @varungup90 in #875
- [Docs] fix typo in runtime feature page by @legendtkl in #870
- chore: clean-up mock yaml by @Xunzhuo in #877
- Fixing image link error in workload generator README.md by @happyandslow in #888
- Update Synthetic Load Predefined Config for Generator by @happyandslow in #889
- [Misc] Fix plot_workload to pass dirname to makedirs by @ronaldosaheki in #886
- [Misc] Fix client.py in case workload has model null and client has default_model by @ronaldosaheki in #887
- [WIP] Adding input/output distribution argument to constant load generator by @happyandslow in #882
- [Docs] Fix broken contributing guidelines link in README by @nadongjun in #890
- [Bug] fix install script PATH environment variable by @cr7258 in #893
- [Docs] Link to dynamic lora from docs by @thomasjpfan in #883
- [API] Refactor: core cache design and impl by @Xunzhuo in #878
- Added antiaffinity in kvcache crd by @gangmuk in #865
- [Docs] Fix tpm and rpm typo in gateway-plugins.rst by @runzhen in #896
- [Misc] Remove unused function in pkg/utils by @my-git9 in #895
- Remove model name from client and generator by @happyandslow in #894
- [Misc] Add PS benchmark manifests and scripts by @Jeffwan in #899
- Add release overlays to update control plane config for production deployment by @varungup90 in https://github.com/vllm-project/aibri...
v0.2.1
Automatically generated release for tag v0.2.1.
What's Changed
- Cherry-pick Enable CI tests for release branch (#805) by @Jeffwan in #808
- Cherry pick #776 #779 #788 #789 #794 to release branch by @Jeffwan @varungup90 in #809
- Cherry-pick #825 #826 part of #717 in release branch by @varungup90 @Jeffwan in #828
- Update version and tags to v0.2.1 by @Jeffwan in #833
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Automatically generated release for tag v0.2.0.
🚀 New Features Highlights
- Distributed KV Cache: Implemented support for managing KV cache across multiple nodes, enhancing performance.
- Cost-Driven Heterogeneous Serving: Improved scheduling and inference strategies for mixed GPU environments, optimizing cost and resource utilization. (#371, #430, #509, #554, #598)
- Optimizer Based Autoscaling: Leverage offline profiles of inference server to calculate the number of replicas. (#430, #500, #692, #508)
- Prefix Cache Aware Routing: Added support for routing decisions based on prefix cache hits, improving inference efficiency. (#641, #657)
📊 Feature Enhancements
- LoRA Scheduling Enhancements: Introduced multiple scheduling strategies, including bin packing, least latency, least throughput, and random. (#544)
- Gateway Enhancements: Improved request handling efficiency by enabling streaming in the Envoy gateway (#377). Enhanced the handling of model registration and invalid cache scenarios (#542). Introduced fallback strategies to ensure robust request allocation (#445). Optimized cache store retrieval, reducing unnecessary overhead (#639). Fixed a missing Prometheus config that prevented gateway startup (#441).
- PodAutoscaler Scaling improvements: Improved scaling logic to handle edge cases more efficiently. (#508, #515)
🛠 Infrastructure & CI/CD Upgrades
- Parallelized Build Tasks: CI efficiency improvements by running builds in parallel. (#398)
- CrashLoopBackOff Detection in CI: Added monitoring for pod failures in testing workflows. (#444)
- Improved GitHub Actions Cost Efficiency: Optimized triggers and removed unnecessary nightly builds. (#411, #422)
- Integration Tests for Core Components: Added integration tests for autoscalers, routing policies, and deployment configurations. (#616, #620)
What's Changed
- Add envoy gateway streaming support by @varungup90 in #377
- Add client traffic policy to increase per connection buffer size from 32kb to 256kb by @varungup90 in #395
- Misc: add support to metricsSources property of podautoscaler by @zhangjyr in #371
- [Misc] Update runtime server startup command in v0.1.0 by @brosoul in #396
- [CI] improve the ci efficiency by parallelizing the build tasks by @nwangfw in #398
- Fix the ticker interval by removing unnecessary ms by @Jeffwan in #415
- [Misc] Disable specific endpoints logs by @Jeffwan in #418
- [CI] Github Action trigger condition optimized for cost saving by @nwangfw in #411
- [Misc] Fix the mocked app role permission issue by @Jeffwan in #416
- [CI] Nightly tag removed for release branch by @nwangfw in #422
- Enable setting PodAutoscaler configuration via YAML labels by @kr11 in #409
- Update manifest to adopt v0.1.1 images by @Jeffwan in #429
- [Bug]: duplicated http in rest metrics fetcher (#408) by @zhangjyr in #421
- [MISC]: Improve Request Trace Granularity with Version Control by @zhangjyr in #431
- Support histogram metrics from engine in cache by @Jeffwan in #424
- Support fetching metrics from remote Prometheus server by @Jeffwan in #433
- [CI] Add python wheel to release artifact by @Jeffwan in #434
- Fix update cache pod issue and refactor updatePod handler by @Jeffwan in #439
- Extract common metrics structure to types and utils by @Jeffwan in #438
- Fix gateway startup issue due to missing prometheus config by @Jeffwan in #441
- [feat]: GPU Optimizer and Simulator development app by @zhangjyr in #430
- Add selectrandom fallback in routing and only scraping healthy pods by @Jeffwan in #445
- AIBrix Workload Generator / Scenario Simulator by @happyandslow in #428
- CrashLoopBackOff status detection in CI by @nwangfw in #444
- Support installing individual controllers from giant controller-manager by @nwangfw in #442
- Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs by @kr11 in #437
- Support metrics multi labels for different models by @brosoul in #450
- Add health check api interface for runtime by @Jeffwan in #451
- Fix the service name override issue in rolebindings by @Jeffwan in #453
- Reorganize docs/development and docs/tutorial structure by @Jeffwan in #455
- Move tools to separate folders and update mocked app README.md by @Jeffwan in #457
- Fix multi models metric result in PromQL by @brosoul in #458
- Support Azure LLM trace in workload generator by @happyandslow in #462
- Fix autoscaler scalingstrategy switching logic by @nwangfw in #475
- Fix missing handle of PromQL scope is PodMetricScope by @brosoul in #479
- [Misc] Consolidate app and simulator by @zhangjyr in #477
- [Bug] Avoid including sensitive info in Dockerfile ENV by @zhangjyr in #487
- Refactor generator to generate time-based traces by @happyandslow in #478
- [CI] Update deploy workload script in installation test by @nwangfw in #499
- [Bug] handle metricKey creation with MetricsSources by @nwangfw in #498
- Adding Client for Workload Generator Workload File by @happyandslow in #501
- [Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity by @zhangjyr in #500
- Fix some simulator format issue and add some TODOs by @Jeffwan in #505
- [Bug] Fix the way how podautoscaler handle 0 pods. by @zhangjyr in #508
- [Misc] Improve gpu optimizer debugging on podautoscaler. by @zhangjyr in #509
- Optimize kustomize overlay for volcano engine deployment by @Jeffwan in #512
- [perf] Refact tos downloader in Runtime by @brosoul in #510
- Refactor metric source for customized protocol, port and path by @kr11 in #511
- [Bug] Fixed the yaml of deployments in heterogenous GPU settings to make KPA scaling work as expected. by @zhangjyr in #513
- [Misc] Heterogeneous GPU Optimizer Logging Clean Up by @nwangfw in #514
- Fix KPA bug, and an elaborate KPA test case by @kr11 in #515
- Cut v0.2.0-rc.1 release by @Jeffwan in #516
- [Bug] Accumulated bug fix on controller manager, mock app configuration, and gpu optimizer. by @zhangjyr in #522
- [Misc] Reduced runtime's container image size by @nwangfw in #518
- clean memory scaler object when pa crd is deleted by @kr11 in #520
- Configure autoscaler http client to skip certificate check by @Jeffwan in #530
- [Doc] Update aibrix documentation by @Jeffwan in #533
- Refactor the gateway-plugin and metadata service manifests by @Jeffwan in #531
- Fix the GITHUB_WORKSPACE artifact sharing issue in release workflow by @Jeffwan in #532
- [Misc] Polish the benchmark scripts by @Jeffwan in #525
- Fix APA bugs in creation, add test and demo yaml by @kr11 in #536
- Add VKE IPv4 Testing Cluster Config by @nwangfw in #537
- Support for request length internal trace by @happyandslow in #538
- [Feat] Add download status into runtime downloader by @brosoul in #539
- [Feat] Add runtime model management api by @brosoul in #540
- [gateway] handle the wrong model name and cache inconsistency case by @Jeffwan in #542
- [Docs] fix: update the parameters instruction in readme by @scarlet25151 in #548
- add lora schedulers - bin pack, least latency, least throughput, random by @Aspirin96 in #544
- add request routers - least kv cache, least expected latency by @Aspirin96 in #543
- [Docs] heterogenous gpu docs added by ...
v0.2.0-rc.2
Automatically generated release for tag v0.2.0-rc.2.
What's Changed
- [Bug] Accumulated bug fix on controller manager, mock app configuration, and gpu optimizer. by @zhangjyr in #522
- [Misc] Reduced runtime's container image size by @nwangfw in #518
- clean memory scaler object when pa crd is deleted by @kr11 in #520
- Configure autoscaler http client to skip certificate check by @Jeffwan in #530
- [Doc] Update aibrix documentation by @Jeffwan in #533
- Refactor the gateway-plugin and metadata service manifests by @Jeffwan in #531
- Fix the GITHUB_WORKSPACE artifact sharing issue in release workflow by @Jeffwan in #532
- [Misc] Polish the benchmark scripts by @Jeffwan in #525
- Fix APA bugs in creation, add test and demo yaml by @kr11 in #536
- Add VKE IPv4 Testing Cluster Config by @nwangfw in #537
- Support for request length internal trace by @happyandslow in #538
- [Feat] Add download status into runtime downloader by @brosoul in #539
- [Feat] Add runtime model management api by @brosoul in #540
- [gateway] handle the wrong model name and cache inconsistency case by @Jeffwan in #542
- [Docs] fix: update the parameters instruction in readme by @scarlet25151 in #548
- add lora schedulers - bin pack, least latency, least throughput, random by @Aspirin96 in #544
- add request routers - least kv cache, least expected latency by @Aspirin96 in #543
- [Docs] heterogenous gpu docs added by @nwangfw in #545
- Fix race condition in cache by @varungup90 in #550
- Fix pod internal cache delete handling by @varungup90 in #552
- Handle terminating pod for request routing by @varungup90 in #549
- Support absolute path as lora adapter artifact path by @Jeffwan in #556
- Deadlock fix for cache by @varungup90 in #557
- Mock app log fix for missing metrics warning by @varungup90 in #564
- Add vllm graceful termination configuration by @nwangfw in #568
- Enhance dynamic lora adapter support for auth enabled scenario by @Jeffwan in #571
- Update pyproject.toml to support python 3.12 by @Jeffwan in #579
- [Docs ]Update ai runtime management api and downloader docs by @Jeffwan in #577
- Check the HPA ownerReference in request enqueue by @Jeffwan in #582
- Add request length for traces by @happyandslow in #569
- Support model registration flow using aibrix runtime api by @Jeffwan in #580
- Gateway plugin report total incoming requests and pending requests by @zhangjyr in #554
- Support distributed kv cache orchestration by @Jeffwan in #583
- Grant workflow action permission to write packages by @Jeffwan in #586
- Update routers to use GetPodModelMetric api and misc cleanup in metri… by @varungup90 in #590
- Update upload/download artifact github actions version to v4 by @varungup90 in #591
- Update version in aibrix/python to 0.2.0-rc.2 by @varungup90 in #594
New Contributors
- @scarlet25151 made their first contribution in #548
- @Aspirin96 made their first contribution in #544
Full Changelog: v0.2.0-rc.1...v0.2.0-rc.2
v0.1.2
v0.2.0-rc.1
What's Changed
- Add envoy gateway streaming support by @varungup90 in #377
- Add client traffic policy to increase per connection buffer size from 32kb to 256kb by @varungup90 in #395
- Misc: add support to metricsSources property of podautoscaler by @zhangjyr in #371
- [Misc] Update runtime server startup command in v0.1.0 by @brosoul in #396
- [CI] improve the ci efficiency by parallelizing the build tasks by @nwangfw in #398
- Fix the ticker interval by removing unnecessary ms by @Jeffwan in #415
- [Misc] Disable specific endpoints logs by @Jeffwan in #418
- [CI] Github Action trigger condition optimized for cost saving by @nwangfw in #411
- [Misc] Fix the mocked app role permission issue by @Jeffwan in #416
- [CI] Nightly tag removed for release branch by @nwangfw in #422
- Enable setting PodAutoscaler configuration via YAML labels by @kr11 in #409
- Update manifest to adopt v0.1.1 images by @Jeffwan in #429
- [Bug]: duplicated http in rest metrics fetcher (#408) by @zhangjyr in #421
- [MISC]: Improve Request Trace Granularity with Version Control by @zhangjyr in #431
- Support histogram metrics from engine in cache by @Jeffwan in #424
- Support fetching metrics from remote Prometheus server by @Jeffwan in #433
- [CI] Add python wheel to release artifact by @Jeffwan in #434
- Fix update cache pod issue and refactor updatePod handler by @Jeffwan in #439
- Extract common metrics structure to types and utils by @Jeffwan in #438
- Fix gateway startup issue due to missing prometheus config by @Jeffwan in #441
- [feat]: GPU Optimizer and Simulator development app by @zhangjyr in #430
- Add selectrandom fallback in routing and only scraping healthy pods by @Jeffwan in #445
- AIBrix Workload Generator / Scenario Simulator by @happyandslow in #428
- CrashLoopBackOff status detection in CI by @nwangfw in #444
- Support installing individual controllers from giant controller-manager by @nwangfw in #442
- Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs by @kr11 in #437
- Support metrics multi labels for different models by @brosoul in #450
- Add health check api interface for runtime by @Jeffwan in #451
- Fix the service name override issue in rolebindings by @Jeffwan in #453
- Reorganize docs/development and docs/tutorial structure by @Jeffwan in #455
- Move tools to separate folders and update mocked app README.md by @Jeffwan in #457
- Fix multi models metric result in PromQL by @brosoul in #458
- Support Azure LLM trace in workload generator by @happyandslow in #462
- Fix autoscaler scalingstrategy switching logic by @nwangfw in #475
- Fix missing handle of PromQL scope is PodMetricScope by @brosoul in #479
- [Misc] Consolidate app and simulator by @zhangjyr in #477
- [Bug] Avoid including sensitive info in Dockerfile ENV by @zhangjyr in #487
- Refactor generator to generate time-based traces by @happyandslow in #478
- [CI] Update deploy workload script in installation test by @nwangfw in #499
- [Bug] handle metricKey creation with MetricsSources by @nwangfw in #498
- Adding Client for Workload Generator Workload File by @happyandslow in #501
- [Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity by @zhangjyr in #500
- Fix some simulator format issue and add some TODOs by @Jeffwan in #505
- [Bug] Fix the way how podautoscaler handle 0 pods. by @zhangjyr in #508
- [Misc] Improve gpu optimizer debugging on podautoscaler. by @zhangjyr in #509
- Optimize kustomize overlay for volcano engine deployment by @Jeffwan in #512
- [perf] Refact tos downloader in Runtime by @brosoul in #510
- Refactor metric source for customized protocol, port and path by @kr11 in #511
- [Bug] Fixed the yaml of deployments in heterogenous GPU settings to make KPA scaling work as expected. by @zhangjyr in #513
- [Misc] Heterogeneous GPU Optimizer Logging Clean Up by @nwangfw in #514
- Fix KPA bug, and an elaborate KPA test case by @kr11 in #515
- Cut v0.2.0-rc.1 release by @Jeffwan in #516
Full Changelog: v0.1.1...v0.2.0-rc.1
v0.1.1
v0.1.0
Feature Highlights
1. Dynamic LoRA Adapter
The Dynamic LoRA Adapter introduces a flexible approach to model adaptation, allowing dynamic management of LoRA models within Kubernetes. This new functionality includes efficient handling of model registration, unloading, and routing, significantly enhancing operational control and scalability for production environments.
2. Gateway Extension Server with Multi-Algorithm Routing Support
We extend the Envoy Gateway with an extension server, so that an external processing service can inspect and mutate requests and responses. This lets us add features that a Kubernetes Service does not support directly, such as rate limiting and routing algorithms including least request, least throughput, and random. This flexibility allows users to fine-tune routing strategies based on their specific application needs, ultimately improving traffic distribution and system performance.
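As a rough illustration of the three routing algorithms named above, the sketch below selects a pod from a small in-memory table. The strategy strings, the `pods` layout, and the `route` function are hypothetical and chosen for the example; they are not taken from the gateway's code or configuration.

```python
import random


def route(pods, strategy, rng=random):
    """Choose a pod from a hypothetical {name: {"requests": int, "throughput": float}} map."""
    names = sorted(pods)  # sorted so that ties resolve deterministically
    if strategy == "least-request":
        # Fewest in-flight requests wins.
        return min(names, key=lambda n: pods[n]["requests"])
    if strategy == "least-throughput":
        # Lowest recent token throughput wins.
        return min(names, key=lambda n: pods[n]["throughput"])
    if strategy == "random":
        return rng.choice(names)
    raise ValueError(f"unknown strategy: {strategy}")
```

Least request favors the pod with the shortest queue, least throughput favors the pod doing the least token work, and random spreads traffic without consulting any metrics.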
3. LLM-specific Autoscaler
This release integrates multiple autoscaling algorithms, including HPA (Horizontal Pod Autoscaler), KPA (Knative Pod Autoscaler), and APA (AIBrix Pod Autoscaler). The autoscaling framework now features a direct connection to fetch metrics from pods, enabling real-time adjustments based on load and optimized resource utilization.
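To give a feel for how a KPA-style calculation turns pod metrics into a replica count, here is a simplified sketch of the Knative-style stable/panic window idea. It is an assumption-laden illustration, not the AIBrix scaler: the parameter names, the default panic threshold of 2.0, and the floor of one replica are choices made for the example.

```python
import math


def desired_replicas(stable_avg, panic_avg, target_per_pod, current,
                     panic_threshold=2.0):
    """Return a replica count from two metric windows (simplified KPA-style logic).

    stable_avg: total metric value averaged over the long (stable) window
    panic_avg: total metric value averaged over the short (panic) window
    target_per_pod: metric value one pod is expected to handle
    current: number of ready pods right now
    """
    stable_want = math.ceil(stable_avg / target_per_pod)
    panic_want = math.ceil(panic_avg / target_per_pod)
    # Enter panic mode when the short window demands far more than we have.
    if current > 0 and panic_want / current >= panic_threshold:
        return max(current, panic_want)  # never scale down while panicking
    return max(1, stable_want)
```

The short window reacts quickly to bursts, while the long window keeps the replica count steady under ordinary fluctuation.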
4. Unified AI Runtime
The AI runtime supports faster model downloading via GPU streaming, streamlined metrics aggregation, and efficient LoRA request delegation that abstracts away the complexities of the underlying engine. This runtime provides an optimized environment for deploying and managing machine learning models, making it easier to handle high-volume requests.
Additional Enhancements:
- Doc website: Updated documents, including quick-start guides, installation instructions, and tutorials for autoscaling, make setup and onboarding smoother.
- Benchmarking and Performance Analysis Tools: Integrated tools for benchmarking autoscalers, gateways, and LoRA to monitor and improve system efficiency and performance.
- CI/CD Workflow: The new CI/CD pipeline includes automated image builds, GitHub Actions for testing and linting, and release pipelines for simplified deployment.
What's Changed
- Add common project documents and skeleton folders by @Jeffwan in #4
- Scaffolding aibrix project using kubebuilder by @Jeffwan in #17
- Optimize project layouts by moving controllers to pkg folder by @Jeffwan in #21
- Create Lora api and controller by @Jeffwan in #23
- Rename LoraAdapter to ModelAdapter by @Jeffwan in #25
- Add ModelAdapter API by @Jeffwan in #26
- Use better way to set up controller with Manager by @Jeffwan in #27
- Initial model adapter controller implementation by @Jeffwan in #32
- Add mocked model container for lora adapter fast prototyping by @Jeffwan in #33
- [Misc] Add the PR and issues template by @jsw-zorro in #38
- [Docs] Add example to run vLLM distributed inference using Ray by @Jeffwan in #39
- [Doc] Improve the model adapter mock service by @Jeffwan in #45
- [Misc] Simplify the feature/bug/enhancement template. by @jsw-zorro in #48
- [Misc] Make model adapter controller e2e work by @Jeffwan in #50
- [Docs] A draft version of the contributing guideline document by @kr11 in #47
- [Core] Improve model adapter controller by handling existing resources by @Jeffwan in #54
- [Feat] Initial Implementation of PodAutoscaler Reconciler by @kr11 in #55
- [Docs] Move the sample mocked application to common folder by @Jeffwan in #64
- [Misc] Minor refactor the PodAutoscaler codes by @Jeffwan in #68
- [Core] Add model router controller by @varungup90 in #57
- Add rbac rules in model router by @varungup90 in #71
- [bugs] Add autoscaler RBAC to successfully list horizontalpodautoscalers by @kr11 in #72
- [Misc] Update license info; Add license check by @happyandslow in #73
- add github workflow to lint & test code by @M00nF1sh in #74
- [CI] Fix the golang lint issues by @Jeffwan in #77
- [CI] fix the failures from make test by @Jeffwan in #80
- [Misc] Add code-generator and openapi-gen as dependencies by @Jeffwan in #59
- [Misc] Reconcile hpa, kpa and apa separately by @Jeffwan in #83
- [feat] Add rpm/tpm extension proc plugin by @varungup90 in #79
- Add kpa scale algorithm implementation by @kr11 in #87
- Add host override to query specific pod by @varungup90 in #86
- [Core] init aibrix runtime framework by @brosoul in #88
- Support kpa/apa autoscaling workflow part I by @Jeffwan in #85
- Fix Dockerfile Packaging Issues Related to Go Version and Missing Utils by @kr11 in #92
- Autoscaling Workflow Enhancement - Part 2 by @kr11 in #94
- Add custom CRD clientset by @varungup90 in #97
- Autoscaling Workflow Enhancement - Part 3 by @kr11 in #101
- [Core] Add Downloader implementation for runtime by @brosoul in #96
- Add RayClusterReplicaSet and RayClusterFleet apis by @Jeffwan in #103
- Apply crd:maxDescLen=0 in manifest generation by @Jeffwan in #108
- Apply filter to objects owned by model adapters by @varungup90 in #111
- Add custom cache and interface for model adapter scheduling by @varungup90 in #100
- Refactor gateway package by @varungup90 in #112
- BatchAPI storage component together with test by @xinchen384 in #104
- Update the installation guidance and README.md by @Jeffwan in #115
- [CI] Package AI Runtime by @brosoul in #118
- Add gateway installation by @varungup90 in #122
- [CI] Support container image build and push in CI by @Jeffwan in #120
- [CI] Fix nightly image push error by @Jeffwan in #127
- [Bug] Fix download bugs during download benchmark by @brosoul in #134
- Autoscaling Workflow Enhancement - Part 4: Integrating MetricClient into Autoscaling Workflow by @kr11 in #116
- Update make generate by @varungup90 in #132
- Model adapter controller improvement and refactor by @Jeffwan in #135
- Improve the aibrix installation scripts by @Jeffwan in #141
- [CI] Support python package publish by @brosoul in #138
- Fix some typo and naming issues by @Jeffwan in #150
- Fix gateway bootstrap issues by @varungup90 in #154
- Add kubeconfig flag for cache initialization by @varungup90 in #155
- Using sphinx to generate html pages for our project static site by @xinchen384 in #153
- Add finalizer and handle the model unload requests by @Jeffwan in #152
- Fix kubeConfig redefined issue and update imagePullPolicy by @Jeffwan in #158
- Add expectation lib to allows us to set and wait on expectations by @Jeffwan in #164
- Add routing algorithms by @varungup90 in #143
- Add readthedocs configuration for CI builds and update theme by @Jeffwan in #169
- Add RayClusterReplicaSet initial implementation by @Jeffwan in #165
- Add template page for the docs by @Jeffwan in #170
- Remove myst_parser from sphinx extensions by @Jeffwan in #172
- Update quickstart in the doc by @Jeffwan in #174
- Metric standardizing in ai runtime by @brosoul in #163
- [Misc] Rename env in runtime by @brosoul in #176
- Add readiness check for redis in gateway plugin by @varungup90 in #173
- [batch] job manager handles job state transition by @xinchen384 in #180
- Add users CRUD API by @varungup90 in #181
- Add routing for model adapter by @varungup90 in https:/...