
Conversation

@rouchenzi (Contributor) commented Sep 23, 2025

Set max_autotune & coordinate_descent_tuning as env variables in the Inductor config for static shape compilation.

Purpose

These two Inductor options are hard-coded to True in the Inductor config for static shape compilation (code ref), so they are enabled by default whenever compile_sizes is passed via the vLLM engine args.

For some model architectures, these two tunings may not be consistently useful and can instead extend the vLLM engine's first start time. This change surfaces them as vLLM-level environment variables, giving users the flexibility to turn them off to reduce cold start, or keep them on if they improve performance.

They are added as environment variables in this initial commit, partly to avoid conflicts with the ongoing CompilationConfig overhaul PR: #20283.
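For reference, a minimal sketch of the intended gating (an illustration, not the exact vLLM code; only the env var names and the two Inductor options come from this PR, and the helper name is hypothetical):

# Sketch: overrides applied to the Inductor config used for static-shape
# (compile_sizes) compilation. Both tunings default to enabled ("1") and can
# be disabled by setting the corresponding env var to "0".
import os

def static_shape_inductor_overrides() -> dict:
    return {
        "max_autotune":
        bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1"))),
        "coordinate_descent_tuning":
        bool(int(os.getenv(
            "VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING", "1"))),
    }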

Test Plan & Test Result

Sanity test on a Qwen3 model. Performance may vary across models, but overall the cold start with compile_sizes is reduced when these two env vars are set to 0 (False).

vLLM engine first start time (w/o cache)

[TEST1] w/o compile_sizes

vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 

vLLM log

INFO 09-22 23:27:20 [monitor.py:34] torch.compile takes 27.28 s in total
INFO 09-22 23:27:27 [core.py:210] init engine (profile, create kv cache, warmup model) took 39.62 seconds

[TEST2] w/ compile_sizes=[1,2,4,8,16,24]

This by default has max_autotune=True & coordinate_descent_tuning=True

vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 \
    --compilation-config '{"compile_sizes": [1,2,4,8,16,24]}'

vLLM log

INFO 09-22 23:38:14 [monitor.py:34] torch.compile takes 222.34 s in total
INFO 09-22 23:38:17 [core.py:210] init engine (profile, create kv cache, warmup model) took 245.31 seconds

[TEST3] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

This by default has coordinate_descent_tuning=True

VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0 vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 \
    --compilation-config '{"compile_sizes": [1,2,4,8,16,24]}'

vLLM log

INFO 09-22 23:10:55 [monitor.py:34] torch.compile takes 187.04 s in total
INFO 09-22 23:10:57 [core.py:210] init engine (profile, create kv cache, warmup model) took 206.25 seconds

[TEST4] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING=0 & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING=0 VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0 vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 \
    --compilation-config '{"compile_sizes": [1,2,4,8,16,24]}'

vLLM log

INFO 09-22 23:22:38 [monitor.py:34] torch.compile takes 85.19 s in total
INFO 09-22 23:22:39 [core.py:210] init engine (profile, create kv cache, warmup model) took 99.96 seconds

vLLM performance

command

vllm bench serve \
    --port 3000 \
    --model Qwen/Qwen3-8B \
    --request-rate 5 \
    --num-prompts 200 \
    --random-input-len 4096 \
    --random-output-len 300 \
    --tokenizer Qwen/Qwen3-8B

The compiled shapes [1,2,4,8,16,24] show up among the running batch sizes in one of the example logs:

INFO 09-22 22:58:28 [forward_context.py:365] Batchsize forward time stats (batchsize, count, median_time(ms)): [(8, 6090, 3.39), (16, 3487, 3.51), (4, 1174, 3.4), (1, 912, 3.39), (2, 547, 3.4), (24, 114, 3.54), (32, 27, 3.8), (4102, 25, 31.94), (4103, 21, 31.95), (4101, 19, 31.93), (4104, 18, 31.89), (4105, 15, 31.99), (4108, 15, 31.99), (4106, 14, 31.94), (4107, 11, 32.01), (4100, 8, 31.91), (4098, 7, 31.89), (4099, 6, 31.86), (4109, 5, 32.02), (8192, 4, 55.54), (4097, 3, 31.9), (4110, 3, 32.14), (4111, 3, 32.06), (4043, 2, 31.32)]

[TEST1] w/o compile_sizes

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.24
Total input tokens: 818247
Total generated tokens: 55904
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1355.52
Peak output token throughput (tok/s): 2646.00
Peak concurrent requests: 17.00
Total Token throughput (tok/s): 21195.76
---------------Time to First Token----------------
Mean TTFT (ms): 22.98
Median TTFT (ms): 23.09
P99 TTFT (ms): 32.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.13
Median TPOT (ms): 4.13
P99 TPOT (ms): 4.32
---------------Inter-token Latency----------------
Mean ITL (ms): 4.13
Median ITL (ms): 4.06
P99 ITL (ms): 5.27

[TEST2] w/ compile_sizes=[1,2,4,8,16,24]

This by default has max_autotune=True & coordinate_descent_tuning=True

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.25
Total input tokens: 818247
Total generated tokens: 56254
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1363.83
Peak output token throughput (tok/s): 2702.00
Peak concurrent requests: 16.00
Total Token throughput (tok/s): 21201.55
---------------Time to First Token----------------
Mean TTFT (ms): 23.28
Median TTFT (ms): 23.46
P99 TTFT (ms): 33.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.16
Median TPOT (ms): 4.16
P99 TPOT (ms): 4.26
---------------Inter-token Latency----------------
Mean ITL (ms): 4.17
Median ITL (ms): 4.08
P99 ITL (ms): 5.39

[TEST3] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

This by default has coordinate_descent_tuning=True

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.26
Total input tokens: 818247
Total generated tokens: 56266
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1363.53
Peak output token throughput (tok/s): 2707.00
Peak concurrent requests: 17.00
Total Token throughput (tok/s): 21192.62
---------------Time to First Token----------------
Mean TTFT (ms): 24.51
Median TTFT (ms): 24.86
P99 TTFT (ms): 34.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.22
Median TPOT (ms): 4.20
P99 TPOT (ms): 4.52
---------------Inter-token Latency----------------
Mean ITL (ms): 4.22
Median ITL (ms): 4.12
P99 ITL (ms): 5.74

[TEST4] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING=0 & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.27
Total input tokens: 818247
Total generated tokens: 55620
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1347.75
Peak output token throughput (tok/s): 2652.00
Peak concurrent requests: 17.00
Total Token throughput (tok/s): 21175.01
---------------Time to First Token----------------
Mean TTFT (ms): 23.23
Median TTFT (ms): 23.19
P99 TTFT (ms): 34.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.19
Median TPOT (ms): 4.18
P99 TPOT (ms): 4.70
---------------Inter-token Latency----------------
Mean ITL (ms): 4.20
Median ITL (ms): 4.08
P99 ITL (ms): 5.83



…TE_DESCENT_TUNING in environment variable

Signed-off-by: rouchenzi <ruochenwen@gmail.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request successfully exposes two Inductor tuning parameters, max_autotune and coordinate_descent_tuning, as environment variables. This is a valuable change that gives users more control over the trade-off between compilation time and potential runtime performance. The implementation is clean and follows existing patterns, and the PR description includes thorough test results. I have one suggestion to improve the robustness of how the new environment variables are parsed to prevent potential runtime errors.

Comment on lines +1408 to +1414
"VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE":
lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1"))),
# If set to 1, enable coordinate_descent_tuning;
# By default, this is enabled (1)
"VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING":
lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING",
"1"))),

Severity: high

The current implementation bool(int(os.getenv(...))) for parsing these boolean environment variables is not robust. It will raise a ValueError if a user sets the variable to a non-integer string like "true" or "false". To improve user experience and prevent runtime crashes from misconfiguration, it's better to adopt a more resilient parsing pattern that is also used elsewhere in this file.

Suggested change

Before:

    "VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE":
    lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1"))),
    # If set to 1, enable coordinate_descent_tuning;
    # By default, this is enabled (1)
    "VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING":
    lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING",
                               "1"))),

After:

    "VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE":
    lambda: os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1").lower() in ("1", "true"),
    # If set to 1, enable coordinate_descent_tuning;
    # By default, this is enabled (1)
    "VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING":
    lambda: (os.getenv("VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING",
                       "1").lower() in ("1", "true")),

@simon-mo (Collaborator)

Can these be compilation config instead? Seems like they are not temporary...

@rouchenzi (Contributor, Author)

@simon-mo yeah, was thinking about compilation config, but those vars currently only apply to the static-shape Inductor config and there is an ongoing compilation overhaul of CompilationConfig: #20283.

So it seemed more straightforward to start with env vars to avoid conflicts; we could move them to the config incrementally once it's more stable.

@ProExpertProg (Collaborator)

I think this should start as an env variable because of the compilation config overhaul.

Alternatively, we could add an inductor_compile_config_static for Inductor config only applied to static sizes.
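If that alternative were pursued later, it might look roughly like the sketch below (hypothetical: the inductor_compile_config_static field name comes from the suggestion above, and the dataclass is illustrative, not the actual CompilationConfig):

from dataclasses import dataclass, field
from typing import Any

@dataclass
class CompilationConfigSketch:
    # Inductor overrides applied to all compiled graphs (modeled on the
    # existing inductor_compile_config-style override dict).
    inductor_compile_config: dict[str, Any] = field(default_factory=dict)
    # Hypothetical: overrides applied only when compiling static compile_sizes,
    # defaulting to the two tunings this PR makes optional.
    inductor_compile_config_static: dict[str, Any] = field(
        default_factory=lambda: {
            "max_autotune": True,
            "coordinate_descent_tuning": True,
        })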


mergify bot commented Sep 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rouchenzi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify (bot) added the needs-rebase label Sep 23, 2025
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
@mergify (bot) removed the needs-rebase label Sep 23, 2025
@simon-mo enabled auto-merge (squash) September 23, 2025 20:30
@github-actions (bot) added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 23, 2025
@simon-mo merged commit eca7be9 into vllm-project:main Sep 23, 2025
52 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
vllm-project#25493)

Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
#25493)

Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>