🚀 The feature, motivation and pitch
In vLLM, achieving the best performance often requires many additional CLI/config flags. We should instead enable the best behavior by default where we can. For example, look at the Blackwell recipe for LLaMa 3:
```yaml
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'
```
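For reference, a minimal sketch of how the same recipe might be passed through the offline Python API, assuming the LLM constructor accepts kv_cache_dtype and a compilation_config dict as in recent vLLM releases (the model name is only an illustrative placeholder):

```python
# Sketch only: assumes the offline LLM constructor accepts the same payload as the
# CLI flags above (kv_cache_dtype string, compilation_config dict); the model name
# is just a placeholder, not part of the recipe.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",
    compilation_config={
        "pass_config": {
            "enable_fi_allreduce_fusion": True,
            "enable_attn_fusion": True,
            "enable_noop": True,
        },
        "custom_ops": ["+quant_fp8", "+rms_norm"],
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "splitting_ops": [],
    },
)
```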
Resolving most of these requires feature work, and some of them will only be available when using torch>=2.9. Instead of waiting for that release, we should enable the relevant features now (conditional on the torch version), or as soon as they are merged. The feature work is tracked in the milestone; this issue just tracks changing the defaults.
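Conceptually, the version gating could look like the sketch below. The helper name and config shape are hypothetical; only the idea of keying defaults off the installed torch version comes from this issue.

```python
# Hypothetical sketch of version-gated defaults; not vLLM's actual resolution code.
import torch
from packaging.version import Version

TORCH_AT_LEAST_2_9 = Version(torch.__version__.split("+")[0]) >= Version("2.9")

def resolve_default_compilation_config(cfg: dict) -> dict:
    # Features that only need vLLM-side work: enable unconditionally once merged.
    cfg.setdefault("cudagraph_mode", "FULL_AND_PIECEWISE")
    # Features that require torch>=2.9: gate on the installed version.
    cfg.setdefault("use_inductor_graph_partition", TORCH_AT_LEAST_2_9)
    return cfg
```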
Features only dependent on additional work:
- CUDAGraphMode: this should already be possible; when splitting_ops=[], FULL_AND_PIECEWISE should convert to FULL, which should downgrade to FULL_DECODE_ONLY (needs verification; see the sketch after this list): [V1] address post issues related to #20059 (part 1); cascade attention reenable by default #23046
- Remove custom_ops flag requirement for fusions: Luka/custom op matching 2 #24604
- Enable RMSNorm+QuantFP8 & SiluMul/QuantFP8 fusions by default: Luka/custom op matching 2 #24604
  - will also enable_noop
- Enable FI allreduce fusion by default: Luka/custom op matching 2 #24604 -> [PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds #24248 -> [Compile] Conditional compilation. Introduce compile_ranges #24252
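For the CUDAGraphMode item above, the intended downgrade chain could look roughly like the following. The enum values mirror the names used in this issue; the helper function and the backend-support check are hypothetical.

```python
# Rough sketch of the FULL_AND_PIECEWISE -> FULL -> FULL_DECODE_ONLY downgrade
# described above; enum values mirror vLLM's CUDAGraphMode, the rest is hypothetical.
from enum import Enum

class CUDAGraphMode(Enum):
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"

def resolve_cudagraph_mode(mode, splitting_ops, attn_supports_full_cudagraph):
    # With no splitting ops there are no piecewise graphs, so FULL_AND_PIECEWISE
    # collapses to FULL.
    if mode is CUDAGraphMode.FULL_AND_PIECEWISE and not splitting_ops:
        mode = CUDAGraphMode.FULL
    # If the attention backend cannot be captured in a full graph for mixed
    # prefill/decode batches, fall back to capturing decode-only batches.
    if mode is CUDAGraphMode.FULL and not attn_supports_full_cudagraph:
        mode = CUDAGraphMode.FULL_DECODE_ONLY
    return mode
```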
Features conditional on torch 2.9:
- enable use_inductor_graph_partition (see the before/after sketch below this list)
  - only if use_inductor=True
  - removes the need for splitting_ops=[]
  - after [RFC][UX][torch.compile][CUDAGraph]: Overhaul CompilationConfig and improve CLI -O<n> #20283, only if -O2 or -O3 (we can discuss this; basically -O1 might want to compile faster and hence should use the faster-startup Dynamo partitioning, at a slight performance cost in general, as well as disabling the AttnFusion and SP/AsyncTP passes, which only apply to quant models and TP scenarios respectively anyway)
  - nice to have: use register_should_partition_rule: [Feature][DRAFT]: Inductor partitioning should decide what ops to partition on dynamically #25691
- enable AttnFusion by default: possible with use_inductor_graph_partition
- reenable VLLM_STANDALONE_COMPILE by default
  - we disabled it due to [Bug]: torch.compile fails for Gemma3n on pytorch 2.8 #24547; once that's fixed we should reenable it
- enable SequenceParallel and AsyncTP passes: [Bug]: Sequence Parallelism and Async TP disabled by default #25277
  - relies on use_inductor_graph_partition and fixing issues
- enable torch group quant by default: [Performance]: Compiled QuantFP8.forward_native group quantization (1, 128) slower than CUDA on H100/RTX5090 #25094
  - don't know the root cause for the perf slowdown yet
- enable VLLM_USE_AOT_COMPILE by default (after AOT Compilation for torch.compile (Bundled) #24274)
  - not in 2.9; we should still enable this conditionally in 2.10 to make sure there are no issues
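To illustrate the use_inductor_graph_partition item, here is a rough before/after of the compilation-config payload. The flag names are taken from this issue; whether they combine exactly like this is an assumption.

```python
# Hedged sketch: contrasting today's config with the intended torch>=2.9 default.
# Flag names come from this issue; treat the exact shapes as assumptions.

# Today (torch < 2.9): full cudagraphs require manually emptying splitting_ops
# so the model is captured as one graph.
config_today = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "splitting_ops": [],  # must be set by hand for FULL capture
}

# Target (torch >= 2.9): Inductor decides the partition points itself, so
# splitting_ops no longer needs to be overridden and attention fusion
# (enable_attn_fusion) becomes viable as a default.
config_torch_2_9 = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "use_inductor_graph_partition": True,
    "pass_config": {"enable_attn_fusion": True},
}
```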
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.