🚀 The feature, motivation and pitch
In vLLM, achieving the best performance often requires many additional CLI/config flags. We should instead enable the best behavior by default where we can. For example, look at the Blackwell recipe for LLaMa 3:
```yaml
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'
```
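For reference, a minimal sketch of how the same recipe might be passed through the offline Python API, assuming the LLM constructor accepts kv_cache_dtype and a compilation_config dict as in recent vLLM releases (the model name is only an illustrative placeholder):

```python
# Sketch only: assumes the offline LLM constructor accepts the same payload as the
# CLI flags above (kv_cache_dtype string, compilation_config dict); the model name
# is just a placeholder, not part of the recipe.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",
    compilation_config={
        "pass_config": {
            "enable_fi_allreduce_fusion": True,
            "enable_attn_fusion": True,
            "enable_noop": True,
        },
        "custom_ops": ["+quant_fp8", "+rms_norm"],
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "splitting_ops": [],
    },
)
```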
Resolving most of these requires feature work, and some of them will only be available when using torch>=2.9. Instead of waiting for that release, we should enable the relevant features now (conditional on the torch version), or as soon as they are merged. The feature work is tracked in the milestone; this issue just tracks changing the defaults.
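Conceptually, the version gating could look like the sketch below. The helper name and config shape are hypothetical; only the idea of keying defaults off the installed torch version comes from this issue.

```python
# Hypothetical sketch of version-gated defaults; not vLLM's actual resolution code.
import torch
from packaging.version import Version

TORCH_AT_LEAST_2_9 = Version(torch.__version__.split("+")[0]) >= Version("2.9")

def resolve_default_compilation_config(cfg: dict) -> dict:
    # Features that only need vLLM-side work: enable unconditionally once merged.
    cfg.setdefault("cudagraph_mode", "FULL_AND_PIECEWISE")
    # Features that require torch>=2.9: gate on the installed version.
    cfg.setdefault("use_inductor_graph_partition", TORCH_AT_LEAST_2_9)
    return cfg
```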
Features only dependent on additional work:
- CUDAGraphMode: this should already be possible; when splitting_ops=[], FULL_AND_PIECEWISE should convert to FULL, which should downgrade to FULL_DECODE_ONLY (needs verification; see the sketch after this list): [V1] address post issues related to #20059 (part 1); cascade attention reenable by default #23046
- Remove custom_ops flag requirement for fusions: Luka/custom op matching 2 #24604
- Enable RMSNorm+QuantFP8 & SiluMul/QuantFP8 fusions by default: Luka/custom op matching 2 #24604
  - will also enable_noop
- Enable FI allreduce fusion by default: Luka/custom op matching 2 #24604 -> [PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds #24248 -> [Compile] Conditional compilation. Introduce compile_ranges #24252
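For the CUDAGraphMode item above, the intended downgrade chain could look roughly like the following. The enum values mirror the names used in this issue; the helper function and the backend-support check are hypothetical.

```python
# Rough sketch of the FULL_AND_PIECEWISE -> FULL -> FULL_DECODE_ONLY downgrade
# described above; enum values mirror vLLM's CUDAGraphMode, the rest is hypothetical.
from enum import Enum

class CUDAGraphMode(Enum):
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"

def resolve_cudagraph_mode(mode, splitting_ops, attn_supports_full_cudagraph):
    # With no splitting ops there are no piecewise graphs, so FULL_AND_PIECEWISE
    # collapses to FULL.
    if mode is CUDAGraphMode.FULL_AND_PIECEWISE and not splitting_ops:
        mode = CUDAGraphMode.FULL
    # If the attention backend cannot be captured in a full graph for mixed
    # prefill/decode batches, fall back to capturing decode-only batches.
    if mode is CUDAGraphMode.FULL and not attn_supports_full_cudagraph:
        mode = CUDAGraphMode.FULL_DECODE_ONLY
    return mode
```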
Features conditional on torch 2.9:
- enable use_inductor_graph_partition (see the before/after sketch below this list)
  - only if use_inductor=True
  - removes the need for splitting_ops=[]
  - after [RFC][UX][torch.compile][CUDAGraph]: Overhaul CompilationConfig and improve CLI -O<n> #20283, only if -O2 or -O3 (we can discuss this; basically -O1 might want to compile faster and hence should use the faster-startup Dynamo partitioning, at a slight performance cost in general, as well as disabling the AttnFusion and SP/AsyncTP passes, which only apply to quant models and TP scenarios respectively anyway)
  - nice to have: use register_should_partition_rule: [Feature][DRAFT]: Inductor partitioning should decide what ops to partition on dynamically #25691
- enable AttnFusion by default: possible with use_inductor_graph_partition
- reenable VLLM_STANDALONE_COMPILE by default
  - we disabled it due to [Bug]: torch.compile fails for Gemma3n on pytorch 2.8 #24547; once that's fixed we should reenable it
- enable SequenceParallel and AsyncTP passes: [Bug]: Sequence Parallelism and Async TP disabled by default #25277
  - relies on use_inductor_graph_partition and fixing issues
- enable torch group quant by default: [Performance]: Compiled QuantFP8.forward_native group quantization (1, 128) slower than CUDA on H100/RTX5090 #25094
  - don't know the root cause for the perf slowdown yet
- enable VLLM_USE_AOT_COMPILE by default (after AOT Compilation for torch.compile (Bundled) #24274)
  - not in 2.9; we should still enable this conditionally in 2.10 to make sure there are no issues
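To illustrate the use_inductor_graph_partition item, here is a rough before/after of the compilation-config payload. The flag names are taken from this issue; whether they combine exactly like this is an assumption.

```python
# Hedged sketch: contrasting today's config with the intended torch>=2.9 default.
# Flag names come from this issue; treat the exact shapes as assumptions.

# Today (torch < 2.9): full cudagraphs require manually emptying splitting_ops
# so the model is captured as one graph.
config_today = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "splitting_ops": [],  # must be set by hand for FULL capture
}

# Target (torch >= 2.9): Inductor decides the partition points itself, so
# splitting_ops no longer needs to be overridden and attention fusion
# (enable_attn_fusion) becomes viable as a default.
config_torch_2_9 = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "use_inductor_graph_partition": True,
    "pass_config": {"enable_attn_fusion": True},
}
```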
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.