Releases: kubernetes-sigs/inference-perf
v0.1.0
We are excited to announce the initial release of Inference Perf v0.1.0! This release comes with the following key features:
- Highly scalable: can benchmark large production inference deployments at up to 10k QPS.
- Reports the key metrics needed to measure LLM performance.
- Supports different real-world and synthetic datasets.
- Supports multiple APIs and multiple model servers.
- Supports specifying exact input and output length distributions to simulate different scenarios: Gaussian, fixed-length, and min-max distributions are all supported.
- Generates different load patterns and can benchmark specific cases such as burst traffic, scaling to saturation, and other autoscaling/routing scenarios.
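The input/output length distributions above can be pictured with a short sketch. This is an illustration of the idea (sampling a token length from a clamped Gaussian), not inference-perf's actual implementation; the function name and parameters are hypothetical.

```python
import random

def sample_length(mean: int, stddev: int, min_len: int, max_len: int) -> int:
    """Sample a token length from a Gaussian, clamped to [min_len, max_len].

    Illustrative only: fixed-length is the degenerate case where
    stddev=0 and min_len == max_len == mean.
    """
    n = int(round(random.gauss(mean, stddev)))
    return max(min_len, min(max_len, n))

random.seed(0)  # deterministic for the example
lengths = [sample_length(mean=512, stddev=128, min_len=16, max_len=1024)
           for _ in range(10000)]
```

Sampling a length per request this way lets a synthetic dataset mimic the prompt/completion size mix of a real workload while staying reproducible under a fixed seed.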
What's Changed
- Add directory structure for the tool by @achandrasekar in #1
- Add Makefile and typecheck presubmit by @Bslabe123 in #3
- Fix Makefile Typo by @Bslabe123 in #6
- Add design document to the repo by @achandrasekar in #4
- Add default python gitignore by @sjmonson in #11
- Make Inference-Perf Package-able / Use Modern Python Tooling by @sjmonson in #13
- Added Abstract Type for Metrics Client by @Bslabe123 in #7
- Add Chen Wang to OWNERS by @terrytangyuan in #20
- Inference perf basic load run implementation by @SachinVarghese in #21
- Adding vLLM Client to inference perf runner by @SachinVarghese in #27
- Add HF ShareGPT Data Generator by @vivekk16 in #33
- Add Unit Testing Maketargets and Unit Testing Github Workflow by @Bslabe123 in #19
- Mock metrics client implementation by @SachinVarghese in #32
- Parameterization of CLI tool using config file by @SachinVarghese in #34
- Add SachinVarghese as approver, add owner aliases by @achandrasekar in #36
- Containerize the benchmark by @achandrasekar in #38
- Added demo example for vLLM Server and shareGPT datagen component by @SachinVarghese in #37
- Fix: Raising error for api type mismatch by @SachinVarghese in #44
- Add Custom Tokenizer by @vivekk16 in #43
- Multi-stage performance run by @SachinVarghese in #49
- Update README.md with meeting time / recording links by @achandrasekar in #54
- Add support for cluster-local benchmarking by @Bslabe123 in #60
- Update DataGenerator to Handle Both Chat and Completion APIs by @Bslabe123 in #58
- Lint and type check fixes by @SachinVarghese in #62
- Add StorageClient abstract type and GCS Client Implementation by @Bslabe123 in #61
- Added Prometheus client to get model server metrics by @aish1331 in #64
- Add support for different input distributions with a synthetic dataset by @achandrasekar in #66
- Automatically Populate Missing Fields in Config by @Bslabe123 in #71
- Generic model server client config by @SachinVarghese in #72
- Request Lifecycle Report Generation by @Bslabe123 in #77
- Add output distribution to synthetic data generator by @achandrasekar in #79
- Improved Logging for Writing Report Files by @Bslabe123 in #80
- Add the option to ignore end of sequence by @achandrasekar in #83
- Add GitHub Release Workflow and Changelog Configuration by @wangchen615 in #41
- Improved abstractions for perf project by @SachinVarghese in #84
- Add issue templates for the repo by @achandrasekar in #90
- docs: Update link to Slack channel in README.md by @terrytangyuan in #91
- Add random data generator by @achandrasekar in #94
- Multi-stage report generation for Prometheus Metrics by @aish1331 in #95
- Add Docker build and push workflows for PRs and releases by @wangchen615 in #97
- Add shared prefix generator to benchmark prefix caching by @achandrasekar in #98
- Added throughput metrics to output report by @Bslabe123 in #101
- Basic code test setup by @SachinVarghese in #96
- Enable Docker Build Workflow on Push to Main Branch by @wangchen615 in #102
- Fix Docker Tag Generation by Using env.QUAY_USERNAME in Workflow by @wangchen615 in #105
- Update Quay.io Organization Name in Docker Build Workflow by @wangchen615 in #106
- Add Support for Streaming Requests to Completions API by @Bslabe123 in #103
- Add multiprocess, multithreaded loadgen by @jjk-g in #99
- Update documentation to cover newer capabilities by @achandrasekar in #104
- Use logging methods with levels instead of print by @shotarok in #110
- Merge the latest fixes to the release branch by @achandrasekar in #115
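Several of the changes above (notably the multiprocess, multithreaded load generator in #99) revolve around open-loop load generation: requests are fired on a fixed schedule regardless of how long responses take, so a slow server cannot throttle the offered load. A minimal single-process sketch of that idea, assuming a placeholder `send_request` in place of a real model-server call (this is not the tool's implementation):

```python
import asyncio
import time

async def send_request(i: int) -> int:
    # Placeholder for a non-blocking HTTP call to a model server.
    await asyncio.sleep(0)
    return i

async def run_constant_qps(qps: float, duration_s: float) -> list[int]:
    """Fire requests at a fixed rate for duration_s seconds (open loop)."""
    interval = 1.0 / qps
    start = time.perf_counter()
    tasks = []
    i = 0
    while time.perf_counter() - start < duration_s:
        tasks.append(asyncio.create_task(send_request(i)))
        i += 1
        # Schedule the next send relative to the start time so timing
        # errors do not accumulate; sleep(0) if we are already behind.
        next_send = start + i * interval
        await asyncio.sleep(max(0.0, next_send - time.perf_counter()))
    return await asyncio.gather(*tasks)

results = asyncio.run(run_constant_qps(qps=100, duration_s=0.5))
```

Scheduling sends against an absolute timeline (rather than sleeping a fixed interval after each send) is what lets an open-loop generator hold its target rate even when individual requests are slow.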
New Contributors
- @achandrasekar made their first contribution in #1
- @Bslabe123 made their first contribution in #3
- @sjmonson made their first contribution in #11
- @terrytangyuan made their first contribution in #20
- @SachinVarghese made their first contribution in #21
- @vivekk16 made their first contribution in #33
- @aish1331 made their first contribution in #64
- @wangchen615 made their first contribution in #41
- @jjk-g made their first contribution in #99
- @shotarok made their first contribution in #110
Full Changelog: https://github.com/kubernetes-sigs/inference-perf/commits/v0.1.0