The Inference Perf project aims to provide a GenAI inference performance benchmarking tool. It came out of wg-serving and is sponsored by SIG Scalability. See the proposal for more info.
This project is currently in development.
You can configure inference-perf to run with different data generation and load generation configurations today. Please see config.yml and the examples in /examples.
Supported datasets include the following:
- ShareGPT (for a real-world conversational dataset)
- Synthetic (for specific input / output distributions)
- Mock (for testing)
Similarly, load generation can be configured to run with different request rates and durations. You can also run multiple stages with different request rates and durations within a single run.
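As a minimal illustrative sketch only (field names, dataset identifiers, and values below are assumptions for illustration; config.yml and /examples remain the authoritative reference), a configuration combining a dataset with a multi-stage load might look like:

```
# Illustrative sketch only -- field and value names are assumptions;
# see config.yml and /examples for the authoritative schema.
data:
  type: shareGPT        # dataset used to generate prompts (e.g. synthetic, mock)
load:
  type: constant
  stages:               # multiple stages with different rates/durations in one run
  - rate: 1             # requests per second
    duration: 30        # seconds
  - rate: 2
    duration: 30
```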
- Set up a virtual environment and install inference-perf (a sketch for creating the environment follows this list)

  ```
  pip install .
  ```

- Run the inference-perf CLI with a configuration file

  ```
  inference-perf --config_file config.yml
  ```

- See more examples in /examples
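A minimal sketch of the first step, assuming Python's built-in venv module (any equivalent virtual environment tooling works):

```
# Create and activate a virtual environment, then install from the repo root.
python -m venv .venv
source .venv/bin/activate
pip install .
```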
- Build the container

  ```
  docker build -t inference-perf .
  ```

- Run the container

  ```
  docker run -it --rm -v $(pwd)/config.yml:/workspace/config.yml inference-perf
  ```
To run in a Kubernetes cluster, refer to the guide in /deploy.
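As a rough sketch only, assuming /deploy contains plain Kubernetes manifests that can be applied as-is (the guide in /deploy is authoritative and may require configuration first):

```
# Assumption: the manifests in /deploy apply directly with kubectl.
kubectl apply -f deploy/
```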
Our community meeting is weekly on Thursdays, alternating between 09:00 and 11:30 PDT (Zoom Link, Meeting Notes, Meeting Recordings).
We currently use the #inference-perf channel in the Kubernetes Slack workspace for communication.
Contributions are welcome; thanks for joining us!
Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.