
Conversation

@rouchenzi (Contributor) commented Sep 23, 2025

Set max_autotune & coordinate_descent_tuning as env variables in the Inductor config for static shape compilation.

Purpose

These two Inductor options are hard-coded to True in the Inductor config for static shape compilation (code ref), so they are enabled by default whenever compile_sizes is passed via the vLLM engine args.

For some model architectures, these two tunings may not be consistently useful and can instead extend the vLLM engine's first start time. This change surfaces them as vLLM-level environment variables, giving users the flexibility to turn them off to reduce cold start, or keep them on if they improve performance.

They are added as environment variables in this initial commit, partly to avoid conflicts with the ongoing CompilationConfig overhaul PR: #20283.
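For reference, a minimal sketch of the intended gating (an illustration, not the exact vLLM code; only the env var names and the two Inductor options come from this PR, and the helper name is hypothetical):

# Sketch: overrides applied to the Inductor config used for static-shape
# (compile_sizes) compilation. Both tunings default to enabled ("1") and can
# be disabled by setting the corresponding env var to "0".
import os

def static_shape_inductor_overrides() -> dict:
    return {
        "max_autotune":
        bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1"))),
        "coordinate_descent_tuning":
        bool(int(os.getenv(
            "VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING", "1"))),
    }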

Test Plan & Test Result

Sanity test on a Qwen3 model. Performance may vary across models, but overall the cold start with compile_sizes is reduced when these two env vars are set to 0 (False).

vLLM engine first start time (w/o cache)

[TEST1] w/o compile_sizes

vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 

vLLM log

INFO 09-22 23:27:20 [monitor.py:34] torch.compile takes 27.28 s in total
INFO 09-22 23:27:27 [core.py:210] init engine (profile, create kv cache, warmup model) took 39.62 seconds

[TEST2] w/ compile_sizes=[1,2,4,8,16,24]

This by default has max_autotune=True & coordinate_descent_tuning=True

vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 \
    --compilation-config '{"compile_sizes": [1,2,4,8,16,24]}'

vLLM log

INFO 09-22 23:38:14 [monitor.py:34] torch.compile takes 222.34 s in total
INFO 09-22 23:38:17 [core.py:210] init engine (profile, create kv cache, warmup model) took 245.31 seconds

[TEST3] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

This by default has coordinate_descent_tuning=True

VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0 vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 \
    --compilation-config '{"compile_sizes": [1,2,4,8,16,24]}'

vLLM log

INFO 09-22 23:10:55 [monitor.py:34] torch.compile takes 187.04 s in total
INFO 09-22 23:10:57 [core.py:210] init engine (profile, create kv cache, warmup model) took 206.25 seconds

[TEST4] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING=0 & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING=0 VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0 vllm serve Qwen/Qwen3-8B \
    --port 3000 \
    --tensor-parallel-size 8 \
    --compilation-config '{"compile_sizes": [1,2,4,8,16,24]}'

vLLM log

INFO 09-22 23:22:38 [monitor.py:34] torch.compile takes 85.19 s in total
INFO 09-22 23:22:39 [core.py:210] init engine (profile, create kv cache, warmup model) took 99.96 seconds

vLLM performance

command

vllm bench serve \
    --port 3000 \
    --model Qwen/Qwen3-8B \
    --request-rate 5 \
    --num-prompts 200 \
    --random-input-len 4096 \
    --random-output-len 300 \
    --tokenizer Qwen/Qwen3-8B

The compiled shapes [1,2,4,8,16,24] show up among the running batch sizes in one of the example logs:

INFO 09-22 22:58:28 [forward_context.py:365] Batchsize forward time stats (batchsize, count, median_time(ms)): [(8, 6090, 3.39), (16, 3487, 3.51), (4, 1174, 3.4), (1, 912, 3.39), (2, 547, 3.4), (24, 114, 3.54), (32, 27, 3.8), (4102, 25, 31.94), (4103, 21, 31.95), (4101, 19, 31.93), (4104, 18, 31.89), (4105, 15, 31.99), (4108, 15, 31.99), (4106, 14, 31.94), (4107, 11, 32.01), (4100, 8, 31.91), (4098, 7, 31.89), (4099, 6, 31.86), (4109, 5, 32.02), (8192, 4, 55.54), (4097, 3, 31.9), (4110, 3, 32.14), (4111, 3, 32.06), (4043, 2, 31.32)]

[TEST1] w/o compile_sizes

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.24
Total input tokens: 818247
Total generated tokens: 55904
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1355.52
Peak output token throughput (tok/s): 2646.00
Peak concurrent requests: 17.00
Total Token throughput (tok/s): 21195.76
---------------Time to First Token----------------
Mean TTFT (ms): 22.98
Median TTFT (ms): 23.09
P99 TTFT (ms): 32.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.13
Median TPOT (ms): 4.13
P99 TPOT (ms): 4.32
---------------Inter-token Latency----------------
Mean ITL (ms): 4.13
Median ITL (ms): 4.06
P99 ITL (ms): 5.27

[TEST2] w/ compile_sizes=[1,2,4,8,16,24]

This by default has max_autotune=True & coordinate_descent_tuning=True

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.25
Total input tokens: 818247
Total generated tokens: 56254
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1363.83
Peak output token throughput (tok/s): 2702.00
Peak concurrent requests: 16.00
Total Token throughput (tok/s): 21201.55
---------------Time to First Token----------------
Mean TTFT (ms): 23.28
Median TTFT (ms): 23.46
P99 TTFT (ms): 33.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.16
Median TPOT (ms): 4.16
P99 TPOT (ms): 4.26
---------------Inter-token Latency----------------
Mean ITL (ms): 4.17
Median ITL (ms): 4.08
P99 ITL (ms): 5.39

[TEST3] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

This by default has coordinate_descent_tuning=True

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.26
Total input tokens: 818247
Total generated tokens: 56266
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1363.53
Peak output token throughput (tok/s): 2707.00
Peak concurrent requests: 17.00
Total Token throughput (tok/s): 21192.62
---------------Time to First Token----------------
Mean TTFT (ms): 24.51
Median TTFT (ms): 24.86
P99 TTFT (ms): 34.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.22
Median TPOT (ms): 4.20
P99 TPOT (ms): 4.52
---------------Inter-token Latency----------------
Mean ITL (ms): 4.22
Median ITL (ms): 4.12
P99 ITL (ms): 5.74

[TEST4] w/ compile_sizes=[1,2,4,8,16,24] & VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING=0 & VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE=0

============ Serving Benchmark Result ============
Successful requests: 200
Request rate configured (RPS): 5.00
Benchmark duration (s): 41.27
Total input tokens: 818247
Total generated tokens: 55620
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1347.75
Peak output token throughput (tok/s): 2652.00
Peak concurrent requests: 17.00
Total Token throughput (tok/s): 21175.01
---------------Time to First Token----------------
Mean TTFT (ms): 23.23
Median TTFT (ms): 23.19
P99 TTFT (ms): 34.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.19
Median TPOT (ms): 4.18
P99 TPOT (ms): 4.70
---------------Inter-token Latency----------------
Mean ITL (ms): 4.20
Median ITL (ms): 4.08
P99 ITL (ms): 5.83



…TE_DESCENT_TUNING in environment variable

Signed-off-by: rouchenzi <ruochenwen@gmail.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request successfully exposes two Inductor tuning parameters, max_autotune and coordinate_descent_tuning, as environment variables. This is a valuable change that gives users more control over the trade-off between compilation time and potential runtime performance. The implementation is clean and follows existing patterns, and the PR description includes thorough test results. I have one suggestion to improve the robustness of how the new environment variables are parsed to prevent potential runtime errors.

Comment on lines +1408 to +1414
"VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE":
lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1"))),
# If set to 1, enable coordinate_descent_tuning;
# By default, this is enabled (1)
"VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING":
lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING",
"1"))),

Severity: high

The current implementation bool(int(os.getenv(...))) for parsing these boolean environment variables is not robust. It will raise a ValueError if a user sets the variable to a non-integer string like "true" or "false". To improve user experience and prevent runtime crashes from misconfiguration, it's better to adopt a more resilient parsing pattern that is also used elsewhere in this file.

Suggested change

Before:

    "VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE":
    lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1"))),
    # If set to 1, enable coordinate_descent_tuning;
    # By default, this is enabled (1)
    "VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING":
    lambda: bool(int(os.getenv("VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING",
                               "1"))),

After:

    "VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE":
    lambda: os.getenv("VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE", "1").lower() in ("1", "true"),
    # If set to 1, enable coordinate_descent_tuning;
    # By default, this is enabled (1)
    "VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING":
    lambda: (os.getenv("VLLM_ENABLE_INDUCTOR_COORDINATE_DESCENT_TUNING",
                       "1").lower() in ("1", "true")),

@simon-mo (Collaborator)

Can these be compilation config instead? Seems like they are not temporary...

@rouchenzi (Contributor, Author)

@simon-mo yeah, was thinking about compilation config, but those vars currently only apply to the static-shape Inductor config and there is an ongoing compilation overhaul of CompilationConfig: #20283.

So it seemed more straightforward to start with env vars to avoid conflicts; we could move them to the config incrementally once it's more stable.

@ProExpertProg (Collaborator)

I think this should start as an env variable because of the compilation config overhaul.

Alternatively, we could add an inductor_compile_config_static for Inductor config only applied to static sizes.
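If that alternative were pursued later, it might look roughly like the sketch below (hypothetical: the inductor_compile_config_static field name comes from the suggestion above, and the dataclass is illustrative, not the actual CompilationConfig):

from dataclasses import dataclass, field
from typing import Any

@dataclass
class CompilationConfigSketch:
    # Inductor overrides applied to all compiled graphs (modeled on the
    # existing inductor_compile_config-style override dict).
    inductor_compile_config: dict[str, Any] = field(default_factory=dict)
    # Hypothetical: overrides applied only when compiling static compile_sizes,
    # defaulting to the two tunings this PR makes optional.
    inductor_compile_config_static: dict[str, Any] = field(
        default_factory=lambda: {
            "max_autotune": True,
            "coordinate_descent_tuning": True,
        })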


mergify bot commented Sep 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rouchenzi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify (bot) added the needs-rebase label Sep 23, 2025
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
@mergify (bot) removed the needs-rebase label Sep 23, 2025
@simon-mo enabled auto-merge (squash) September 23, 2025 20:30
@github-actions (bot) added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 23, 2025
@simon-mo merged commit eca7be9 into vllm-project:main Sep 23, 2025
52 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
vllm-project#25493)

Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
#25493)

Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>