
[Feature]: Improve startup time UX #19824

@ProExpertProg


🚀 The feature, motivation and pitch

vLLM startup time has become a pain point for certain use cases, like auto-scaling instances or model swapping. This leads to a poor user experience, or even to users falling back to --enforce-eager and sacrificing steady-state performance. I'm creating this parent issue to track our work on better understanding startup time, as well as the throughput tradeoffs from skipping certain steps.

🚧 [WIP] Startup heavy hitters (most time-consuming)

  1. P2P access check
  2. Weight loading
  3. Dynamo tracing
  4. Inductor compilation
    a. Additional time spent on extra compile_sizes and max-autotune
  5. CUDAGraph capture
  6. PTX compilation

Other

  • @ywang96 mentioned that multimodal models (LMMs) take a long time generating dummy multi-modal data in the profile_run

Recent measurements from @robertgshaw2-redhat:

Llama-70B-Fp8 on TP=8. I see the following:

  • ~60s to check for P2P access manually. We can disable this check with VLLM_SKIP_P2P_CHECK=1 (see the sketch after this list)
  • ~10s to load weights (from a hot page cache)
  • ~15s to convert dynamo bytecode
  • ~70s to run torch.compile
  • ~60s to capture the cudagraphs
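
As a concrete illustration of the knobs mentioned above, here is a minimal sketch of trading throughput for startup time, assuming the standard vllm.LLM entrypoint; the model name and parallelism are placeholders:

```python
import os

# Skip the manual P2P access check (~60s in the measurements above);
# VLLM_SKIP_P2P_CHECK is the environment variable mentioned there.
os.environ["VLLM_SKIP_P2P_CHECK"] = "1"

from vllm import LLM

# enforce_eager=True skips Dynamo/Inductor compilation and CUDAGraph capture
# entirely, shortening startup at the cost of steady-state throughput.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,
    enforce_eager=True,
)
```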

Comment from #17280:

Furthermore the very first request after starting up vLLM takes 30-60 seconds.

Due to #19336, we only build FA for SM80, SM90, and PTX. On machines with other architectures, the PTX is compiled dynamically at runtime.

Proposed roadmap

1. Enumerate use-cases and models we care about

Larger models take longer to load and compile than smaller ones, so we should decide what models we want to look at. More on startup "regimes"/use-cases can come later.

2. Measure time taken and performance tradeoffs

We do an end-to-end measurement of the startup time and make sure we're not missing anything from the list of heavy hitters above. We might need to improve vLLM's time-measurement infrastructure to have better visibility into startup time moving forward.
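
As a strawman for that measurement infrastructure, a minimal sketch of a phase timer (hypothetical helper, not vLLM's actual instrumentation; the wrapped calls in the comments are illustrative):

```python
import time
from contextlib import contextmanager

# Collected wall-clock durations per startup phase.
STARTUP_TIMINGS: dict[str, float] = {}

@contextmanager
def startup_phase(name: str):
    """Time one startup phase so heavy hitters can be compared end to end."""
    start = time.perf_counter()
    try:
        yield
    finally:
        STARTUP_TIMINGS[name] = time.perf_counter() - start

# Illustrative usage around the phases listed above:
#   with startup_phase("weight_loading"):
#       model_runner.load_model()
#   with startup_phase("cudagraph_capture"):
#       model_runner.capture_model()
#   print(sorted(STARTUP_TIMINGS.items(), key=lambda kv: -kv[1]))
```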

3. Address heavy-hitters

The exact mitigation strategies should balance effort with measured benefit, e.g. I think Dynamo caching might be a bit hard infrastructure-wise and not provide much time savings.

1. P2P access check:

  • per @njhill's suggestion, this could be done asynchronously
  • @aarnphm suggested using a hardware mapping instead

2. Weight loading: ❓

3. Dynamo tracing:

  • AFAIK this is currently not cached, but we could try to manually cache it

4. Inductor compilation:

  • This is fully cacheable. We should first advertise this to users to make sure they are, e.g., sharing the cache between auto-scaling deployments (see the cache-sharing sketch after this list).
  • It seems like the Triton autotuning is not cached for explicit compile sizes (TODO @aarnphm create the issue)
  • Depending on the benefit provided, this could be disabled, but I am strongly against recommending disabling it to users, as we rely on Inductor for custom passes for performance, and there are more passes in progress.
  • We can still improve the custom ops to make sure performance is as good as possible without Inductor: [Feature]: CustomOp cleanup #19817
  • Inductor can generate Triton kernels in parallel; we should make sure this actually happens.
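
For the cache-sharing point above, a minimal sketch: the idea is simply to point the caches at storage shared between replicas so only the first replica pays the compilation cost. VLLM_CACHE_ROOT and TORCHINDUCTOR_CACHE_DIR are assumed here to control the vLLM and Inductor cache locations; the paths and model are placeholders:

```python
import os

# Shared volume mounted into every auto-scaling replica (placeholder paths).
os.environ["VLLM_CACHE_ROOT"] = "/mnt/shared/vllm-cache"        # assumed vLLM cache root
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/shared/inductor"  # Inductor artifact cache

from vllm import LLM

# The first replica compiles and populates the caches; later replicas reuse them.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=8)
```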

5. CUDAGraph capture

  • We could reduce the number of sizes we capture cudagraphs for (see the sketch after this list).
    • A larger step, a smaller maximum size, or even a larger minimum size if we know we'll only be hitting larger batch sizes due to high QPS.
  • For larger models, I assume CUDA graphs provide less benefit, so we could also turn them off.
  • If somebody was interested in a research project that tries to manually serialize cudaGraph_t and use mmap tricks for tensors to save and load cudagraphs from memory without capture, that would be amazing. But it's not clear this is possible. Perhaps we can get help from NVIDIA/AMD on this.
  • @lionelvillard proposed "lazy cudagraph capture" where we only capture cudagraphs as needed. TODO write the RFC.
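
For reducing the captured sizes, a minimal sketch, assuming CompilationConfig and its cudagraph_capture_sizes field behave as they do today; the sizes and model are illustrative only:

```python
from vllm import LLM
from vllm.config import CompilationConfig

# Capture CUDA graphs only for the batch sizes we expect to hit under high QPS,
# instead of the full default sweep; fewer captures means faster startup.
compilation_config = CompilationConfig(
    cudagraph_capture_sizes=[64, 128, 256],  # illustrative sizes
)

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,
    compilation_config=compilation_config,
)
```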

6. PTX compilation

Since the pip package size is limited, we cannot bundle FA code for all architectures into the package. Perhaps we could invoke PTX compilation upon vLLM installation? But that wouldn't work on a "headless" machine (e.g., one without a GPU available at install time).

4. Add explicit performance "regimes"

We're starting to see a need for different regimes with prefill disaggregation and with throughput- and latency-optimized kernels for prefill and decode respectively. Similarly, we could have a "faster startup" regime that provides sensible defaults for skipping steps. Alternatively, we could use throughput and latency regimes and set different defaults for each (perhaps the prefill/throughput instance skips cudagraph capture, while the decode/latency instance does full cudagraphs and focuses on smaller sizes). This would not change our current ability to control every aspect of compilation through CompilationConfig (still necessary for developers), but users shouldn't need to tweak these settings for common use cases.
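
Purely as an illustration of what such regimes could map to (the preset names and exact defaults below are made up, expressed in terms of existing CompilationConfig-style knobs):

```python
# Hypothetical presets; names and values are illustrative, not proposed defaults.
FAST_STARTUP_DEFAULTS = {
    "compilation_config": {
        "cudagraph_capture_sizes": [128],  # capture far fewer graph sizes
        "compile_sizes": [],               # no extra explicit compile sizes
    },
}

THROUGHPUT_PREFILL_DEFAULTS = {
    "enforce_eager": True,  # prefill/throughput instance skips capture entirely
}

LATENCY_DECODE_DEFAULTS = {
    "compilation_config": {
        # full cudagraphs focused on small decode batch sizes
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32],
    },
}
```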

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
