[RFC]: Continuous profiling and regression prevention for vLLM #23431

@linzebing

Description

Motivation.

While we continue to push the limits of vLLM performance, it is equally important to make sure that these wins stay. In practice, commits quite commonly regress engine performance silently. For instance, we observed a downward trend in engine performance over the past month:

[Image: vLLM Llama 8B ShareGPT benchmark dashboard]

To ensure long-term performance and reliability, we want to target the following areas:

  • Unified Optimization Measurement: formalize the process for measuring optimization wins (low-level component or e2e latency improvements)
  • Regression Detection and Prevention: ensure optimizations hold over the long run and prevent slow performance bleed
  • DevX: reduce the operational overhead of measuring wins and investigating regressions, freeing up time for optimization exploration

Existing Work

@huydhn and the PyTorch team already did a great job building the vLLM performance benchmark dashboard, which runs predefined benchmarks every 2 hours. Users can view vLLM’s performance on certain benchmarks over time (as shown above), benchmark arbitrary vLLM commits, and compare benchmark performance across commits.

Proposed Change.

Proposal

We propose integrating lightweight profiling instrumentation into vLLM coupled with periodic e2e profiling jobs. The system will feature the following:

  • Extend the vLLM benchmark dashboard to also include SGLang on certain model/workload combinations, helping us understand vLLM vs. SGLang performance trends over time
  • Continuously dump PyTorch profiler traces from fixed benchmarks to S3 for deep dives
  • Extend the vLLM continuous benchmark to collect profiling metrics (latency breakdowns, GPU metrics, scheduling stats, etc.), store the profiling data, and visualize trends over time
  • Tooling for automated anomaly detection and bisection

Design

On a high level, we divide the continuous profiling work into the following components:

  • vLLM metric collector
  • Performance measurement tooling
  • Data storage and visualization
  • Regression detection alerting and bisecting

[Image: high-level architecture of the continuous profiling system]

vLLM Metrics Collector

We are targeting functionality similar to Jialin#1 / Jialin#6, which log the latency component breakdown, but with a pluggable instrumentation layer in the form of hooks/timers controlled by an environment variable.

Ideally, this instrumentation layer should export breakdowns similar to the following, so that we gain observability into where time is spent.

[Image: example latency component breakdown]

The implementation should be extensible to PD/RL/disagg scenarios.
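
As a rough illustration, below is a minimal sketch of what such an env-var-gated timer layer could look like. The variable name VLLM_PROFILE_BREAKDOWN and the hook points are hypothetical placeholders, not a settled API:

```python
import os
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical env var name; the actual gate is yet to be decided.
_ENABLED = os.environ.get("VLLM_PROFILE_BREAKDOWN", "0") == "1"

class LatencyBreakdown:
    """Accumulates per-component wall-clock time when profiling is enabled."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def timer(self, component: str):
        # No-op when instrumentation is disabled, keeping overhead near zero.
        if not _ENABLED:
            yield
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[component] += time.perf_counter() - start
            self.counts[component] += 1

    def snapshot(self) -> dict:
        return {
            name: {"total_s": total, "calls": self.counts[name]}
            for name, total in self.totals.items()
        }

# Usage: wrap engine-loop components at hypothetical hook points.
breakdown = LatencyBreakdown()
with breakdown.timer("schedule"):
    ...  # scheduler step
with breakdown.timer("model_forward"):
    ...  # model execution
print(breakdown.snapshot())
```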

Performance Measurement Tooling

We’ll develop a script that can be run pre-commit; it will trigger selected performance benchmarks and report the differences in detailed latency breakdowns and throughput between the baseline and the candidate commit.
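
A minimal sketch of the comparison step, assuming each benchmark run emits a JSON result file; the schema and the tolerance below are placeholders, not a settled format:

```python
import json
import sys

def load(path: str) -> dict:
    # Assumed per-run schema: {"throughput_tok_s": float,
    # "breakdown": {component: seconds}}.
    with open(path) as f:
        return json.load(f)

def compare(baseline: dict, candidate: dict, tol_pct: float = 2.0) -> int:
    """Print per-metric diffs; exit non-zero on a throughput regression
    beyond tol_pct (breakdown diffs are informational in this sketch)."""
    base_tp = baseline["throughput_tok_s"]
    cand_tp = candidate["throughput_tok_s"]
    delta = (cand_tp - base_tp) / base_tp * 100
    print(f"throughput: {base_tp:.1f} -> {cand_tp:.1f} tok/s ({delta:+.2f}%)")
    for comp, base_s in baseline["breakdown"].items():
        cand_s = candidate["breakdown"].get(comp, 0.0)
        comp_delta = (cand_s - base_s) / base_s * 100 if base_s else 0.0
        print(f"  {comp}: {base_s:.4f}s -> {cand_s:.4f}s ({comp_delta:+.2f}%)")
    return 1 if delta < -tol_pct else 0

if __name__ == "__main__":
    sys.exit(compare(load(sys.argv[1]), load(sys.argv[2])))
```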

Data Storage and Visualization

For profiling traces, we’ll simply export continuously to S3 with a 6-month TTL.

For latency breakdown visualization, we’ll reuse the vLLM benchmark flow: at the end of the benchmark job, we dump the result into a local file and upload it to S3. An AWS Lambda job is then triggered to parse the data and write it to a hosted ClickHouse database, and we’ll use Grafana for visualization.

Our work here is to define the profiling data dump format, create a new AWS Lambda rule to ingest the data, create a new ClickHouse table to store it, and finally build a new dashboard to visualize the profiling data. We want to show the latency trend of different components over time.
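
To make the format discussion concrete, a per-run profiling record might look like the sketch below; every field name here is illustrative only, since defining the actual schema is part of this work item:

```python
import json
import time

# Illustrative-only schema for one benchmark run's profiling dump.
record = {
    "commit": "abc123",                  # vLLM commit under test
    "benchmark": "llama-8b-sharegpt",    # fixed benchmark identifier
    "timestamp": int(time.time()),
    "throughput_tok_s": 4321.0,
    "latency_breakdown_s": {             # per-component totals
        "schedule": 1.23,
        "model_forward": 45.6,
        "sampling": 2.34,
    },
    "gpu_metrics": {"util_pct": 92.5, "mem_gb": 61.2},
}
print(json.dumps(record, indent=2))
```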

Regression detection alerting and bisecting

Currently we can trigger the benchmark flow on any commit; we can build a tool that extends this to bisect for the culprit commit based on a chosen latency or throughput metric.
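
A sketch of the bisection core, assuming a helper that runs the benchmark flow on a given commit and returns the tracked metric; both run_benchmark.sh and the tolerance are hypothetical:

```python
import subprocess

def benchmark_metric(commit: str) -> float:
    # Hypothetical entry point: in practice this would trigger the existing
    # dashboard benchmark job for the commit and poll for its result.
    out = subprocess.run(
        ["./run_benchmark.sh", commit],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def bisect(commits: list[str], baseline: float, tol_pct: float = 2.0) -> str:
    """Binary-search a commit range, ordered oldest (known good) to newest
    (known bad), for the first commit whose metric drops more than tol_pct
    below the baseline."""
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        drop = (baseline - benchmark_metric(commits[mid])) / baseline * 100
        if drop > tol_pct:
            hi = mid  # regression already present at mid
        else:
            lo = mid + 1
    return commits[lo]
```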

We will also set up alerts on performance regressions.
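
One simple detection rule, sketched below, would flag a run whose metric falls several standard deviations below a rolling baseline; the window size and threshold are assumptions to be tuned:

```python
from statistics import mean, stdev

def is_regression(history: list[float], latest: float,
                  z_thresh: float = 3.0) -> bool:
    """Flag the latest data point if it falls more than z_thresh standard
    deviations below the rolling baseline. history holds recent values of
    the tracked metric, where higher is better (e.g., throughput)."""
    if len(history) < 8:  # not enough data for a stable baseline
        return False
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and (mu - latest) / sigma > z_thresh
```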

Feedback Period.

No response

CC List.

@Jialin @huydhn @namanlalitnyu

Any Other Things.

No response
