GuideLLM v0.7.0

@dbutenhof dbutenhof released this 29 Jun 13:49
c6389a5

Overview

GuideLLM v0.7.0 improves how you configure and run benchmarks, adds production-realistic trace replay, and expands support for modern LLM workloads: reasoning models, tool calling, embeddings, and Mooncake traces.

To get started, install with:

pip install 'guidellm[recommended]==0.7.0'

Or from source with:

pip install 'guidellm[recommended] @ git+https://github.com/vllm-project/guidellm.git@v0.7.0'

What's New

  • Support for server-side conversation history in /v1/responses API
  • Support for client-side and server-side tool calling using the /v1/chat/completions and /v1/responses APIs
  • WebSocket backend (openai_websocket) for realtime audio transcription
  • vLLM Python backend (vllm_python) for running inference in the same process as GuideLLM using vLLM's Python API (AsyncLLMEngine), without an HTTP server
  • Synthetic tool calling support in chat/completions and responses APIs
  • Mooncake LLM trace replay
    • Given a data file with timestamps and the synthetic data parameters prompt_tokens and output_tokens, you can replay a recorded session. Mooncake traces also support KV cache replay: the token cache block hash values in the trace are used to compute tokens that reproduce the conversation with equivalent cache sensitivity, without recording or using the original data.
  • Support for testing OpenAI embeddings API
  • Support for building ARM64 container images
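
The idea behind hash-based KV cache replay can be sketched in a few lines of Python. This is only an illustration, not GuideLLM's implementation: the block size, the vocabulary size, the function names, and the use of a seeded random generator are all assumptions made for the example. It shows how identical block hashes can yield identical token blocks, so two requests that shared a cached prefix in the recorded session share one again in the replay.

```python
import random

BLOCK_SIZE = 16      # assumed tokens per cache block (hypothetical value)
VOCAB_SIZE = 32_000  # assumed vocabulary size, for illustration only


def tokens_for_block(hash_id: int, length: int = BLOCK_SIZE) -> list[int]:
    """Derive a deterministic token block from a cache block hash."""
    rng = random.Random(hash_id)  # same hash always gives the same tokens
    return [rng.randrange(VOCAB_SIZE) for _ in range(length)]


def synthesize_prompt(hash_ids: list[int], prompt_tokens: int) -> list[int]:
    """Build a synthetic prompt of up to prompt_tokens tokens from block hashes."""
    tokens: list[int] = []
    for hash_id in hash_ids:
        remaining = prompt_tokens - len(tokens)
        if remaining <= 0:
            break
        # The last block may be partial when prompt_tokens is not a multiple
        # of BLOCK_SIZE.
        tokens.extend(tokens_for_block(hash_id, min(BLOCK_SIZE, remaining)))
    return tokens


# Two hypothetical trace records whose first two block hashes match.
first = synthesize_prompt([101, 102, 103], prompt_tokens=40)
second = synthesize_prompt([101, 102, 204], prompt_tokens=48)
shared_prefix = first[: 2 * BLOCK_SIZE] == second[: 2 * BLOCK_SIZE]
print(len(first), len(second), shared_prefix)
```

Because the tokens are derived only from the hashes, no original prompt text is needed, which matches the "without recording or using the original data" property described above.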

Major CLI refactoring

  • Constraints (--max-requests, --max-duration, --max-errors, etc.) are now treated as first-class objects with a consistent syntax, --constraint kind=<name>,<OPTIONS>..., such as --constraint kind=max_requests,count=1000 or --constraint kind=over_saturation,mode=enforce,moe_threshold=3.0
  • The --data option was overloaded, often with no clear way to determine what sort of data was to be loaded (for example, a Hugging Face dataset vs. a local file). Error messages were often unclear, because the engine searched through a list of possibilities with no way to know which one was expected to succeed. Now “data” is clearly typed, like --data kind=huggingface,source=<name> or --data kind=json_file,path=<path>
  • Profiles and backends are now clearly typed as well, with clearly connected parameters, like --backend kind=openai_http,target=<url>,streaming=true and --profile async '{"rate":[10,20]}'
  • The syntax is designed to allow pre-loading layered “config” files, like the previous --scenario <file>, while also allowing overrides of the scenario/global values. This will enable the long-requested ability to override constraints for each benchmark (“strategy”) scheduled under a profile. For example, --profile kind=async,rate=10 schedules one asynchronous strategy with rate 10, but specifying multiple rates inline requires JSON to express the list. Instead, --profile kind=async --override profile.rate=10,20 runs two rates.
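
To make the shape of the new syntax concrete, here is a minimal Python sketch of how a kind=<name>,key=value specification and a dotted --override could be interpreted. It is not GuideLLM's parser: the function names are hypothetical, values are kept as strings with no type conversion, and values that themselves contain commas or inline JSON are not handled.

```python
def parse_typed_spec(spec: str) -> dict[str, str]:
    """Split 'kind=<name>,key=value,...' into a dict of strings."""
    fields: dict[str, str] = {}
    for part in spec.split(","):
        key, sep, value = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {part!r}")
        fields[key.strip()] = value.strip()
    if "kind" not in fields:
        raise ValueError("a typed spec needs a kind=<name> field")
    return fields


def apply_override(config: dict, override: str) -> None:
    """Apply 'section.key=v1,v2' in place; several values become a list."""
    path, sep, raw = override.partition("=")
    if not sep:
        raise ValueError(f"expected path=value, got {override!r}")
    *parents, leaf = path.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # create missing sections on the way
    values = raw.split(",")
    node[leaf] = values if len(values) > 1 else values[0]


constraint = parse_typed_spec("kind=max_requests,count=1000")
config = {"profile": parse_typed_spec("kind=async,rate=10")}
apply_override(config, "profile.rate=10,20")
print(constraint)
print(config)
```

The layering is the point of the design: the base value comes from the typed specification (or a config file), and the override replaces a single field without restating the rest.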

CLI Migration Guide

Read on GitHub at v0.7.0 Migration Guide

What's Fixed

  • Several fixes for the HTML report output rendering. (Future plans include making the HTML report format completely self-contained to eliminate many problems.)
  • Improved handling of audio file formats: the original format is preserved where possible, and when transcoding is necessary the default is WAV rather than MP3
  • Improved reporting of TTFT, especially when reasoning models first generate “non-output” thinking tokens
  • Several fixes in dataset column mapping, including fast failure when there are no mappable columns
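
The release notes do not describe how the TTFT measurement changed, so the following Python sketch only illustrates why reasoning models make the metric ambiguous. The StreamEvent type and time_to_first function are hypothetical names invented for this example, and the events are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamEvent:
    timestamp: float    # seconds on an arbitrary clock
    is_reasoning: bool  # True for thinking tokens, False for answer tokens


def time_to_first(
    events: list[StreamEvent], request_start: float, include_reasoning: bool
) -> Optional[float]:
    """Return the delay until the first qualifying token, or None if absent."""
    for event in events:
        if include_reasoning or not event.is_reasoning:
            return event.timestamp - request_start
    return None


# A hypothetical stream: two thinking tokens, then the first answer token.
events = [
    StreamEvent(10.5, True),
    StreamEvent(10.75, True),
    StreamEvent(12.0, False),
]
ttft_any = time_to_first(events, 10.0, include_reasoning=True)
ttft_answer = time_to_first(events, 10.0, include_reasoning=False)
print(ttft_any, ttft_answer)
```

Here the first token of any kind arrives after 0.5 seconds, while the first answer token arrives after 2.0 seconds, so a report needs to be clear about which of the two it calls TTFT.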

Known Limitations

  • Tool call responses are currently appended to the user turn that follows them. The next release will separate them into dedicated response turns.

Changelog

Features

Internal refactoring and cleanup

Bug fixes

CI environment

Documentation

Dependency updates

New Contributors

Full Changelog: v0.6.0...v0.7.0