Description
Motivation.
As we continue to push the limits of vLLM performance, an important aspect is making sure those wins stick. In practice, commits quite often silently regress engine performance. For instance, we observed a downward trend in engine performance over the past month:

vLLM Llama 8b shareGPT benchmark dashboard
To ensure long-term performance and reliability, we want to target the following areas:
- Unified Optimization Measurement: formalize the process for measuring optimization wins (low-level component or e2e latency improvements)
- Regression Detection and Prevention: ensure optimizations hold over the long run and prevent slow performance bleed
- DevX: reduce the operational overhead of measuring wins and investigating regressions, freeing up time for optimization exploration
Existing Work
@huydhn and the PyTorch team have already done a great job building the vLLM performance benchmark dashboard, which runs predefined benchmarks every 2 hours. Users can view vLLM’s performance on certain benchmarks over time (as shown above), benchmark arbitrary vLLM commits, and compare benchmark performance across commits.
Proposed Change.
Proposal
We propose integrating lightweight profiling instrumentation into vLLM coupled with periodic e2e profiling jobs. The system will feature the following:
- Extend the vLLM benchmark dashboard to also cover SGLang on certain model/workload combinations
  - This helps us understand vLLM vs. SGLang performance trends over time
- Continuously dump PyTorch profiler traces from fixed benchmarks to S3 for deep dives
- Extend the vLLM continuous benchmark to collect profiling metrics (latency breakdowns, GPU metrics, scheduling stats, etc.), store the profiling data, and visualize trends over time
- Tooling for automated anomaly detection and bisection
Design
On a high level, we divide the continuous profiling work into the following components:
- vLLM metric collector
- Performance measurement tooling
- Data storage and visualization
- Regression detection alerting and bisecting

vLLM Metrics Collector
We are targeting functionality similar to Jialin#1 / Jialin#6, which log the latency component breakdown, but with a pluggable instrumentation layer in the form of hooks/timers controlled by an environment variable.
Ideally, this instrumentation layer should export breakdowns like the one below, giving us observability into where time is spent.

The implementation should be extensible so that it can cover PD/RL/disagg scenarios.
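As a minimal sketch of what the hook/timer layer could look like (the env var name `VLLM_PROFILE_BREAKDOWN` and the `stage_timer`/`report` API are hypothetical, not existing vLLM interfaces):

```python
import os
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical switch: breakdown timing is a no-op unless this is set.
ENABLED = os.getenv("VLLM_PROFILE_BREAKDOWN", "0") == "1"

_totals: dict[str, float] = defaultdict(float)
_counts: dict[str, int] = defaultdict(int)

@contextmanager
def stage_timer(name: str):
    """Time a named engine stage; does nothing when profiling is disabled."""
    if not ENABLED:
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        _totals[name] += time.perf_counter() - start
        _counts[name] += 1

def report() -> dict[str, dict[str, float]]:
    """Export the latency breakdown: total and mean seconds per stage."""
    return {
        name: {"total_s": total, "mean_s": total / _counts[name]}
        for name, total in _totals.items()
    }

# Usage inside an engine step:
with stage_timer("schedule"):
    ...  # scheduler work goes here
with stage_timer("model_forward"):
    ...  # model execution goes here
```

Since the timers compile down to a single env check when disabled, the hooks can stay in the hot path without measurable overhead in production.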
Performance Measurement Tooling
We’ll develop a script that can be run pre-commit to trigger selected performance benchmarks and report diffs of detailed latency breakdowns and throughput.
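A rough sketch of the comparison step, assuming each benchmark run dumps its metrics as a flat JSON object (the `{"metrics": {...}}` layout is an assumed format, not the current benchmark output schema):

```python
import json
import sys

def load_metrics(path: str) -> dict[str, float]:
    """Load one benchmark dump; assumes a {"metrics": {name: value}} layout."""
    with open(path) as f:
        return json.load(f)["metrics"]

def diff(base_path: str, cand_path: str) -> None:
    """Print before/after values and percentage deltas for shared metrics."""
    base, cand = load_metrics(base_path), load_metrics(cand_path)
    for name in sorted(base.keys() & cand.keys()):
        delta = (cand[name] - base[name]) / base[name] * 100
        print(f"{name:<30} {base[name]:>10.4f} -> {cand[name]:>10.4f} ({delta:+.1f}%)")

if __name__ == "__main__":
    diff(sys.argv[1], sys.argv[2])
```

Running it against a baseline and a candidate dump (e.g. `python compare.py main.json branch.json`) would surface per-metric regressions before the commit lands.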
Data Storage and Visualization
For profiling traces, we’ll continuously export them to S3 with a 6-month TTL.
For latency breakdown visualization, we’ll reuse the vLLM benchmark flow: at the end of the benchmark job, we will dump the result into a local file and upload it to S3. An AWS Lambda job will be triggered to parse the data and write it to a hosted ClickHouse database, and we’ll use Grafana for visualization.
Our work here is to define the profiling data dump format, create a new AWS Lambda rule to ingest the data, create a new ClickHouse table to store it, and finally build a new dashboard to visualize the profiling data. We want to show the latency trend of each component over time.
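One possible shape for a per-run dump record, to make the schema discussion concrete (all field names and values here are illustrative assumptions, not the final format):

```python
# Illustrative dump record for one benchmark run; the actual schema is
# part of the proposed work and every field below is an assumption.
record = {
    "commit": "abc1234",
    "timestamp": "2024-01-01T00:00:00Z",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "workload": "sharegpt",
    "breakdown_ms": {  # per-request mean latency by component
        "schedule": 0.42,
        "model_forward": 18.7,
        "sampling": 1.3,
        "detokenize": 0.9,
    },
    "throughput_tok_s": 12345.6,
}
```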
Regression Detection, Alerting, and Bisecting
The benchmark flow can already be triggered on any commit; we can build a tool that extends this to bisect the culprit commit based on a given latency or throughput metric.
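One way to build this is a metric-driven predicate that `git bisect run` can execute at each candidate commit (a sketch; `benchmarks/run_fixed_benchmark.py`, the `p99_tpot_ms` field, and the threshold are all hypothetical):

```python
import json
import subprocess
import sys

THRESHOLD_MS = 55.0  # assumed regression threshold for the tracked metric

def run_benchmark() -> float:
    """Run the fixed benchmark at the current checkout, return the metric."""
    out = subprocess.run(
        ["python", "benchmarks/run_fixed_benchmark.py", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["p99_tpot_ms"]

if __name__ == "__main__":
    # `git bisect run` treats exit code 0 as "good" and 1 as "bad".
    sys.exit(0 if run_benchmark() <= THRESHOLD_MS else 1)
```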
We will also set up alerts on performance regressions.
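Even a simple rolling z-score over the benchmark time series would catch sustained drifts like the one shown above (a sketch; the window size and threshold are tuning assumptions):

```python
import statistics

def detect_anomalies(values: list[float], window: int = 24,
                     z_thresh: float = 3.0) -> list[int]:
    """Return indices whose value deviates more than z_thresh standard
    deviations from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(values)):
        past = values[i - window : i]
        mean, stdev = statistics.mean(past), statistics.pstdev(past)
        if stdev > 0 and abs(values[i] - mean) / stdev > z_thresh:
            anomalies.append(i)
    return anomalies
```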
Feedback Period.
No response
CC List.
@Jialin @huydhn @namanlalitnyu
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.