GuideLLM v0.7.0
Overview
GuideLLM v0.7.0 improves how you configure and run benchmarks, adds production-realistic trace replay, and expands support for modern LLM workloads: reasoning models, tool calling, embeddings, and Mooncake traces.
To get started, install with:
pip install guidellm[recommended]==0.7.0
Or from source with:
pip install 'guidellm[recommended] @ git+https://github.com/vllm-project/guidellm.git'@v0.7.0
What's New
- Support for server-side conversation history in /v1/responses API
- Support for client-side and server-side tool calling using the /v1/chat/completions and /v1/responses APIs
- Websocket backend (openai_websocket) for realtime audio transcription
- vLLM Python backend (vllm_python) for running inference in the same process as GuideLLM using vLLM's Python API (AsyncLLMEngine), without an HTTP server
- Synthetic tool calling support in chat/completions and responses APIs
- Mooncake LLM trace replay
  - Given a data file with timestamps and prompt_tokens/output_tokens synthetic data parameters, you can replay a recorded session. Mooncake provides KV cache replay using token cache block hash values, which are used to compute tokens that replay a conversation with equivalent cache sensitivity, without recording or using the original data.
- Support for testing OpenAI embeddings API
- Support for building ARM64 container images
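As an illustration of the Mooncake trace replay item above, here is a minimal Python sketch of how cache block hash values could be expanded into synthetic tokens so that requests sharing hash values also share token prefixes. It is not GuideLLM's implementation: the record field names (hash_ids, input_length), the block size, and the vocabulary size are all assumptions chosen for the example.

```python
import random

BLOCK_SIZE = 512    # assumption: tokens covered by one cache block hash
VOCAB_SIZE = 32000  # assumption: illustrative vocabulary size


def tokens_for_block(hash_id: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Derive a block of token ids deterministically from one hash value."""
    rng = random.Random(hash_id)  # same hash value -> same tokens every time
    return [rng.randrange(VOCAB_SIZE) for _ in range(block_size)]


def prompt_from_record(record: dict, block_size: int = BLOCK_SIZE) -> list[int]:
    """Build a synthetic prompt whose prefix structure follows the hash list."""
    tokens: list[int] = []
    for hash_id in record["hash_ids"]:  # hypothetical field name
        tokens.extend(tokens_for_block(hash_id, block_size))
    # Trim to the recorded prompt length (hypothetical field name).
    return tokens[: record["input_length"]]
```

Two records whose hash lists start with the same values produce prompts with identical leading blocks, which is what lets a replay exercise prefix caching without the original text.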
Major CLI refactoring
- Constraints (--max-requests, --max-duration, --max-errors, etc.) are now treated as first-class objects using the consistent syntax --constraint kind=<name>,<OPTIONS>..., such as --constraint kind=max_requests,count=1000 or --constraint kind=over_saturation,mode=enforce,moe_threshold=3.0
- The --data handling was overloaded, often with no clear way to determine what sort of data was to be loaded (for example, a Hugging Face dataset vs. a local file). Error messages were often unclear, because the engine searched through a list of possibilities with no way to know which was expected to succeed. Now "data" is clearly typed, like --data kind=huggingface,source=<name> or --data kind=json_file,path=<path>
- Specification of profiles and backends is now clearly typed, with clearly connected parameters, like --backend kind=openai_http,target=<url>,streaming=true and --profile async '{"rate":[10,20]}'
- The syntax is designed to allow pre-loading layered "config" files like the previous --scenario <file>, while also allowing overrides to the scenario/global values. This will enable the long-requested feature of being able to override constraints for each benchmark ("strategy") scheduled under a profile. For example, --profile kind=async,rate=10 schedules one asynchronous strategy with rate 10, but specifying multiple rates requires using inline JSON to specify a list. Instead, with --profile kind=async --override profile.rate=10,20 you can run two rates.
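To make the typed syntax above concrete, the following Python sketch parses a kind=<name>,key=value string and applies a dotted-path override to a nested configuration. It is a toy illustration only, not GuideLLM's parser: the function names parse_typed_arg and apply_override are hypothetical, values are kept as strings, and treating a comma-separated override value as a list is an assumption about the intended behavior.

```python
import copy


def parse_typed_arg(text: str) -> tuple[str, dict[str, str]]:
    """Split 'kind=<name>,key=value,...' into (kind, options)."""
    options: dict[str, str] = {}
    for part in text.split(","):
        key, sep, value = part.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {part!r}")
        options[key.strip()] = value.strip()
    if "kind" not in options:
        raise ValueError("missing required 'kind' field")
    kind = options.pop("kind")
    return kind, options


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of config with a 'dotted.path=value' override applied."""
    path, sep, raw = override.partition("=")
    if not sep or not path:
        raise ValueError(f"expected path=value, got {override!r}")
    # Assumption: a comma-separated value means a list of values.
    value = raw.split(",") if "," in raw else raw
    result = copy.deepcopy(config)  # leave the layered base config untouched
    node = result
    *parents, leaf = path.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return result
```

For example, parse_typed_arg("kind=max_requests,count=1000") yields the kind max_requests with a single count option, and apply_override on a config holding one async rate replaces it with a two-element list while the base config stays unchanged.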
CLI Migration Guide
Read on GitHub at v0.7.0 Migration Guide
What's Fixed
- Several fixes for the HTML report output rendering. (Future plans include making the HTML report format completely self-contained to eliminate many problems.)
- Improved handling of audio file format: preserve the original format if possible and, when transcoding is necessary, default to WAV rather than MP3
- Improved reporting of TTFT, especially when reasoning models first generate “non-output” thinking tokens
- Several fixes in dataset column mapping, including fast failure when there are no mappable columns
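The TTFT item above distinguishes the first token of any kind from the first token of actual output. A minimal Python sketch of that distinction follows, assuming streamed deltas arrive as dictionaries where thinking text appears under a reasoning_content key and answer text under content. The function name first_token_times and the (timestamp, delta) input shape are hypothetical, and this is not the code GuideLLM uses.

```python
from typing import Optional


def first_token_times(
    start: float, chunks: list[tuple[float, dict]]
) -> tuple[Optional[float], Optional[float]]:
    """Return (time to first token, time to first output token) in seconds.

    Each chunk is (arrival_timestamp, delta). Reasoning text counts toward
    the first value only; answer text counts toward both.
    """
    ttft: Optional[float] = None
    ttfot: Optional[float] = None
    for timestamp, delta in chunks:
        has_reasoning = bool(delta.get("reasoning_content"))
        has_output = bool(delta.get("content"))
        if ttft is None and (has_reasoning or has_output):
            ttft = timestamp - start
        if ttfot is None and has_output:
            ttfot = timestamp - start
            break  # both values are known once output text has arrived
    return ttft, ttfot
```

With a reasoning model the two values can differ substantially, because the model may stream thinking tokens for some time before the first answer token appears.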
Known Limitations
- Tool call responses are currently added to the following user turn. The following release will separate them into dedicated response turns.
Changelog
Features
- Process tool call requests for chat completions API by @jaredoconnell in #687
- Apply tool calling stats to the responses API by @jaredoconnell in #692
- Support server-side conversation history on responses API by @jaredoconnell in #697
- feat: Add embeddings endpoint support (MVP) by @maryamtahhan in #710
- [v0.7 CLI Refactor] Rework BackendArgs to be the authoritative config location by @sjmonson in #723
- [FEAT] Add replay from trace strategy by @VincentG1234 in #620
- Multi-turn tool call chat completions conversations by @jaredoconnell in #712
- Add multi-arch (x86, arm) container image support by @maryamtahhan in #720
- [v0.7 CLI Refactor] Rework Data Deserialization Config by @sjmonson in #733
- Feat/multi turn tools responses by @jaredoconnell in #739
- [v0.7 CLI Refactor] Finish up data rework by @sjmonson in #754
- CLI profile refactor by @dbutenhof in #753
- Add lazy-loading for extras packages by @sjmonson in #641
- Constraints refactor by @jaredoconnell in #786
- Add Mooncake trace format support by @SkiHatDuckie in #777
- Infer audio encoding format from source instead of defaulting to MP3 by @jaredoconnell in #794
- [v0.7 CLI Refactor] New internal top-level args / CLI by @sjmonson in #789
- Simplify constraint parameter names by @dbutenhof in #799
- Realtime transcription endpoint by @ushaket in #713
- Revert Pydantic Registry parameters back to single value fields by @sjmonson in #826
- Restore startup logs and fix config environment variable support by @sjmonson in #825
- Improve Pydantic commenting by @dbutenhof in #818
- Rename "config" CLI command to "env" by @jaredoconnell in #849
- Added registry setup for metrics, with accompanying CLI option by @jaredoconnell in #863
- Add --label argument by @sjmonson in #846
Internal refactoring and cleanup
- Add AGENTS.md by @sjmonson in #714
- Refactor CSV tests by @jaredoconnell in #715
- Replace various key=value string parsers with a single utility by @sjmonson in #569
- Refactor CLI into nested structure by @sjmonson in #717
- Add instructions against common AI poor code quality habits by @jaredoconnell in #718
- Build ProfileArgs in BenchmarkGenerativeTextArgs by @dbutenhof in #774
- [v0.7 CLI Refactor] Misc Cleanup by @sjmonson in #788
- Disable reloading parent schemas by default by @sjmonson in #805
- Switch BenchmarksArgs to BenchmarkScernario in output by @sjmonson in #816
- Remove "rate" as an alias for profile parameters by @dbutenhof in #836
- Conversation extraction script for debugging by @jaredoconnell in #848
Bug fixes
- Add a health check for the worker processes by @jaredoconnell in #686
- Fix long import times due to processing all pydantic subclasses by @jaredoconnell in #689
- Fix custom column mapping by @sjmonson in #711
- Fix CSV column misalignment across multiple benchmarks by @leehyeoklee in #707
- Checks for no valid requests in processed dataset by @jaredoconnell in #709
- Fix TypeError when streaming delta has tool_calls=null by @rgerganov in #752
- Fix blank HTML report by serving UI assets from GitHub Pages by @regrow1123 in #744
- Fix TTFT measurement for reasoning-capable models by @soyr-redhat in #742
- Make synthetic_text output_tokens optional and improve CLI errors by @rgerganov in #759
- Time to first output token and related fixes by @jaredoconnell in #760
- Enable multiprocessing support for trace replay strategy by @VincentG1234 in #745
- Fix unpicklable lambda collate_fn in TorchDataLoader for Python 3.14 by @rgerganov in #782
- Fix report serialization for nested paths by @sjh9714 in #783
- fix(data): fail fast when no mappable columns are found (#787) by @Anai-Guo in #796
- fix(openai): surface streaming SSE error payloads as request failures (#743) by @Anai-Guo in #795
- fix(openai): recognize reasoning_content key for TTFT calculation by @rgerganov in #835
- Fix: Exclude computed fields during pydantic MP serialization by @SkiHatDuckie in #845
- Fix preprocess dataset by @dbutenhof in #842
- Fix: Add relative_timestamp column to output in Mooncake deserializer by @SkiHatDuckie in #855
- Misc v0.7.0 Fixes by @sjmonson in #843
CI environment
- Drop UI CI jobs by @sjmonson in #694
- Redrop nightly PyPi build by @sjmonson in #695
- [GitHub Actions]: Bump actions/cache from 5.0.4 to 5.0.5 by @dependabot[bot] in #698
- [GitHub Actions]: Bump actions/upload-artifact from 7.0.0 to 7.0.1 by @dependabot[bot] in #699
- Rename tox envs by @sjmonson in #701
- ci(mergify): upgrade configuration to current format by @mergify[bot] in #685
- Made E2E test theoretically run better in CI by @jaredoconnell in #738
- Run a useful set of tox environments by default by @sjmonson in #748
- Ensure system env vars don't taint tests by @jaredoconnell in #751
- feat: add automated docs deployment workflow by @1195343015 in #747
- Add a rebase + merge based merge queue by @sjmonson in #756
- fix mergify queue settings by @sjmonson in #757
- fix: fix docs deploy trigger and minor doc issues by @1195343015 in #758
- [GitHub Actions]: Bump docker/setup-qemu-action from 3.2.0 to 4.1.0 by @dependabot[bot] in #749
- Implement a Squash Commit PR Workflow by @sjmonson in #763
- Fix update-description job by @sjmonson in #766
- [GitHub Actions]: Bump docker/setup-buildx-action from 4.0.0 to 4.1.0 by @dependabot[bot] in #750
- [GitHub Actions]: Bump snok/container-retention-policy from 3.0.1 to 3.1.0 by @dependabot[bot] in #772
- [GitHub Actions]: Bump actions/checkout from 6.0.2 to 6.0.3 by @dependabot[bot] in #773
- Force linux LF line endings by @dbutenhof in #798
- Add a HuggingFace cache step to tox-run by @sjmonson in #806
- [GitHub Actions]: Bump actions/checkout from 6.0.3 to 7.0.0 by @dependabot[bot] in #815
- Update container build by @dbutenhof in #819
- test: register slow pytest marker to silence PytestUnknownMarkWarning (#840) by @Anai-Guo in #841
- Migrate docs build to nightly and fix error by @sjmonson in #859
- [GitHub Actions]: Bump actions/setup-python from 6.2.0 to 6.3.0 by @dependabot[bot] in #858
- [GitHub Actions]: Bump actions/cache from 5.0.5 to 6.0.0 by @dependabot[bot] in #857
Documentation
- Add documentation section regarding AI assistance by @dbutenhof in #721
- Update README.md by @SkiHatDuckie in #792
- Documentation refactoring by @dbutenhof in #814
- Add troubleshooting guide by @jaredoconnell in #860
- Fix up doc linkages by @dbutenhof in #870
Dependency updates
- Bump aiohttp from 3.13.3 to 3.13.4 by @dependabot[bot] in #682
- Bump pillow from 12.1.1 to 12.2.0 by @dependabot[bot] in #693
- Bump python-dotenv from 1.2.1 to 1.2.2 by @dependabot[bot] in #702
- Bump lxml from 6.0.2 to 6.1.0 by @dependabot[bot] in #703
- Bump urllib3 from 2.6.3 to 2.7.0 by @dependabot[bot] in #727
- Bump ujson from 5.12.0 to 5.12.1 by @dependabot[bot] in #730
- Bump idna from 3.11 to 3.15 by @dependabot[bot] in #734
- Bump aiohttp from 3.13.4 to 3.14.0 by @dependabot[bot] in #771
- Bump pyarrow from 22.0.0 to 23.0.1 by @dependabot[bot] in #780
- Bump aiohttp from 3.14.0 to 3.14.1 by @dependabot[bot] in #801
- Bump ujson from 5.12.0 to 5.13.0 by @dependabot[bot] in #822
- Bump pydantic-settings from 2.12.0 to 2.14.2 by @dependabot[bot] in #823
- Bump msgpack from 1.1.2 to 1.2.1 by @dependabot[bot] in #824
New Contributors
- @mergify[bot] made their first contribution in #685
- @leehyeoklee made their first contribution in #707
- @VincentG1234 made their first contribution in #620
- @regrow1123 made their first contribution in #744
- @1195343015 made their first contribution in #747
- @soyr-redhat made their first contribution in #742
- @sjh9714 made their first contribution in #783
- @SkiHatDuckie made their first contribution in #777
- @Anai-Guo made their first contribution in #796
Full Changelog: v0.6.0...v0.7.0