[Feature]: Enabling performance optimizations by default #25689

@ProExpertProg

Description

🚀 The feature, motivation and pitch

In vLLM, achieving the best performance often requires many additional CLI/config flags. We should instead enable the best behavior by default wherever we can. For example, consider the Blackwell recipe for Llama 3:

```yaml
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'
```
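
For reference, here is a minimal sketch of the same recipe applied through the offline Python API. The model name is just an illustrative placeholder, and it assumes a recent vLLM where `LLM()` accepts `kv_cache_dtype` and a `compilation_config` dict:

```python
from vllm import LLM

# Same settings as the Blackwell recipe above, expressed as Python kwargs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    kv_cache_dtype="fp8",
    compilation_config={
        "pass_config": {
            "enable_fi_allreduce_fusion": True,
            "enable_attn_fusion": True,
            "enable_noop": True,
        },
        "custom_ops": ["+quant_fp8", "+rms_norm"],
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "splitting_ops": [],
    },
)
```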

Resolving most of these requires feature work, and some of them will only be available with torch>=2.9. Instead of waiting for that release, we should enable the relevant features now (conditional on the torch version), or as soon as they are merged. The feature work is tracked in a milestone; this issue just tracks changing the defaults.
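
To illustrate the "conditional on the torch version" part, here is a hedged sketch of how a default could be gated on the installed torch version; the helper name `_fusion_defaults_available` is hypothetical, not vLLM's actual code:

```python
import torch
from packaging.version import Version

def _fusion_defaults_available() -> bool:
    """Hypothetical helper: only flip the new defaults on torch >= 2.9."""
    # Drop any local build suffix such as "+cu128" before comparing.
    base = torch.__version__.split("+")[0]
    return Version(base) >= Version("2.9")
```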

Features only dependent on additional work:

Features conditional on torch 2.9:

Alternatives

No response

Additional context

No response
