From bea9d1c685ed82731daafdc609e9e6d095717ca5 Mon Sep 17 00:00:00 2001
From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com>
Date: Thu, 9 Oct 2025 16:33:57 +0000
Subject: [PATCH 1/2] Documentation updates from Promptless
---
docs.json | 3 +-
serverless/vllm/environment-variables.mdx | 178 ++++++++++++++++++++++
serverless/vllm/get-started.mdx | 2 +-
serverless/vllm/openai-compatibility.mdx | 2 +-
serverless/vllm/overview.mdx | 4 +-
5 files changed, 184 insertions(+), 5 deletions(-)
create mode 100644 serverless/vllm/environment-variables.mdx
diff --git a/docs.json b/docs.json
index 8bd11fc9..c45cddf3 100644
--- a/docs.json
+++ b/docs.json
@@ -82,7 +82,8 @@
"serverless/vllm/overview",
"serverless/vllm/get-started",
"serverless/vllm/vllm-requests",
- "serverless/vllm/openai-compatibility"
+ "serverless/vllm/openai-compatibility",
+ "serverless/vllm/environment-variables"
]
},
{
diff --git a/serverless/vllm/environment-variables.mdx b/serverless/vllm/environment-variables.mdx
new file mode 100644
index 00000000..0440d2fd
--- /dev/null
+++ b/serverless/vllm/environment-variables.mdx
@@ -0,0 +1,178 @@
+---
+title: "vLLM environment variables"
+sidebarTitle: "Environment variables"
+---
+
+Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options.
+
+To set environment variables, go to your endpoint settings and add them under **Public Environment Variables** or **Private Environment Variables** depending on your security needs.
+
+## LLM settings
+
+These variables control the core language model configuration.
+
+| Variable | Default | Type/Choices | Description |
+| --- | --- | --- | --- |
+| `MODEL_NAME` | `facebook/opt-125m` | `str` | Name or path of the Hugging Face model to use. |
+| `MODEL_REVISION` | `main` | `str` | Model revision to load. |
+| `TOKENIZER` | None | `str` | Name or path of the Hugging Face tokenizer to use. |
+| `SKIP_TOKENIZER_INIT` | `False` | `bool` | Skip initialization of tokenizer and detokenizer. |
+| `TOKENIZER_MODE` | `auto` | `auto`, `slow` | The tokenizer mode. |
+| `TRUST_REMOTE_CODE` | `False` | `bool` | Trust remote code from Hugging Face. |
+| `DOWNLOAD_DIR` | None | `str` | Directory to download and load the weights. |
+| `LOAD_FORMAT` | `auto` | `str` | The format of the model weights to load. |
+| `HF_TOKEN` | - | `str` | Hugging Face token for private and gated models. |
+| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | Data type for model weights and activations. |
+| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | Data type for KV cache storage. |
+| `QUANTIZATION_PARAM_PATH` | None | `str` | Path to the JSON file containing the KV cache scaling factors. |
+| `MAX_MODEL_LEN` | None | `int` | Model context length. |
+| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | Which engine will be used for guided decoding by default. |
+| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | Backend to use for distributed serving. |
+| `WORKER_USE_RAY` | `False` | `bool` | Deprecated, use `DISTRIBUTED_EXECUTOR_BACKEND=ray`. |
+| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | Number of pipeline stages. |
+| `TENSOR_PARALLEL_SIZE` | `1` | `int` | Number of tensor parallel replicas. |
+| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches. |
+| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If specified, use nsight to profile Ray workers. |
+| `ENABLE_PREFIX_CACHING` | `False` | `bool` | Enables automatic prefix caching. |
+| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | Disables sliding window, capping to sliding window size. |
+| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | Use BlockSpaceManagerV2. |
+| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | Experimental scheduling config necessary for speculative decoding. |
+| `SEED` | `0` | `int` | Random seed for operations. |
+| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, ignore GPU profiling result and use this number of GPU blocks. |
+| `MAX_NUM_BATCHED_TOKENS` | None | `int` | Maximum number of batched tokens per iteration. |
+| `MAX_NUM_SEQS` | `256` | `int` | Maximum number of sequences per iteration. |
+| `MAX_LOGPROBS` | `20` | `int` | Max number of log probs to return when logprobs is specified in SamplingParams. |
+| `DISABLE_LOG_STATS` | `False` | `bool` | Disable logging statistics. |
+| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | Method used to quantize the weights. |
+| `ROPE_SCALING` | None | `dict` | RoPE scaling configuration in JSON format. |
+| `ROPE_THETA` | None | `float` | RoPE theta. Use with `ROPE_SCALING`. |
+| `TOKENIZER_POOL_SIZE` | `0` | `int` | Size of tokenizer pool to use for asynchronous tokenization. |
+| `TOKENIZER_POOL_TYPE` | `ray` | `str` | Type of tokenizer pool to use for asynchronous tokenization. |
+| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra config for tokenizer pool. |
+
+## LoRA settings
+
+Configure LoRA (Low-Rank Adaptation) adapters for your model.
+
+| Variable | Default | Type | Description |
+| --- | --- | --- | --- |
+| `ENABLE_LORA` | `False` | `bool` | If True, enable handling of LoRA adapters. |
+| `MAX_LORAS` | `1` | `int` | Max number of LoRAs in a single batch. |
+| `MAX_LORA_RANK` | `16` | `int` | Max LoRA rank. |
+| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | Maximum size of extra vocabulary for LoRA adapters. |
+| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | Data type for LoRA. |
+| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specify multiple scaling factors for LoRA adapters. |
+| `MAX_CPU_LORAS` | None | `int` | Maximum number of LoRAs to store in CPU memory. |
+| `FULLY_SHARDED_LORAS` | `False` | `bool` | Enable fully sharded LoRA layers. |
+| `LORA_MODULES` | `[]` | `list[dict]` | Add lora adapters from Hugging Face `[{"name": "xx", "path": "xxx/xxxx", "base_model_name": "xxx/xxxx"}]` |
+
+## Speculative decoding settings
+
+Configure speculative decoding to improve inference performance.
+
+| Variable | Default | Type/Choices | Description |
+| --- | --- | --- | --- |
+| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Apply a delay before scheduling next prompt. |
+| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | Enable chunked prefill requests. |
+| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model to be used in speculative decoding. |
+| `NUM_SPECULATIVE_TOKENS` | None | `int` | The number of speculative tokens to sample from the draft model. |
+| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | Number of tensor parallel replicas for the draft model. |
+| `SPECULATIVE_MAX_MODEL_LEN` | None | `int` | The maximum sequence length supported by the draft model. |
+| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disable speculative decoding if the number of enqueue requests is larger than this value. |
+| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | Max size of window for ngram prompt lookup in speculative decoding. |
+| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | Min size of window for ngram prompt lookup in speculative decoding. |
+| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | Specify the acceptance method for draft token verification in speculative decoding. |
+| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Set the lower bound threshold for the posterior probability of a token to be accepted. |
+| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA` | None | `float` | A scaling factor for the entropy-based threshold for token acceptance. |
+
+## System performance settings
+
+Configure GPU memory and system resource utilization.
+
+| Variable | Default | Type/Choices | Description |
+| --- | --- | --- | --- |
+| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | Sets GPU VRAM utilization. |
+| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models. |
+| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | Token block size for contiguous chunks of tokens. |
+| `SWAP_SPACE` | `4` | `int` | CPU swap space size (GiB) per GPU. |
+| `ENFORCE_EAGER` | `False` | `bool` | Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility. |
+| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. |
+| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | Enables or disables custom all reduce. |
+
+## Tokenizer settings
+
+Customize tokenizer behavior and chat templates.
+
+| Variable | Default | Type/Choices | Description |
+| --- | --- | --- | --- |
+| `TOKENIZER_NAME` | None | `str` | Tokenizer repository to use a different tokenizer than the model's default. |
+| `TOKENIZER_REVISION` | None | `str` | Tokenizer revision to load. |
+| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line jinja template | Custom chat jinja template. [More info](https://huggingface.co/docs/transformers/chat_templating) |
+
+## Streaming and batch settings
+
+Control how tokens are streamed back in HTTP responses.
+
+The way this works is that the first request will have a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request will have a batch size of `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`. This will continue until the batch size reaches `DEFAULT_BATCH_SIZE`. For example, with the default values, the batch sizes will be `1, 3, 9, 27, 50, 50, 50, ...`. You can also specify this per request, with inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`. This has nothing to do with vLLM's internal batching, but rather the number of tokens sent in each HTTP request from the worker.
+
+| Variable | Default | Type/Choices | Description |
+| --- | --- | --- | --- |
+| `DEFAULT_BATCH_SIZE` | `50` | `int` | Default and maximum batch size for token streaming to reduce HTTP calls. |
+| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
+| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | Growth factor for dynamic batch size. |
+
+## OpenAI compatibility settings
+
+Configure OpenAI API compatibility features.
+
+| Variable | Default | Type/Choices | Description |
+| --- | --- | --- | --- |
+| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | Enables raw OpenAI SSE format string output when streaming. **Required** to be enabled (which it is by default) for OpenAI compatibility. |
+| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the name of the served model from model repo/path to specified name, which you will then be able to use the value for the `model` parameter when making OpenAI requests. |
+| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | Role of the LLM's response in OpenAI chat completions. |
+| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | Enables automatic tool selection for supported models. Set to `true` to activate. |
+| `TOOL_CALL_PARSER` | None | `str` | Specifies the parser for tool calls. Options: `mistral`, `hermes`, `llama3_json`, `llama4_json`, `llama4_pythonic`, `granite`, `granite-20b-fc`, `deepseek_v3`, `internlm`, `jamba`, `phi4_mini_json`, `pythonic` |
+| `REASONING_PARSER` | None | `str` | Parser for reasoning-capable models (enables reasoning mode). Examples: `deepseek_r1`, `qwen3`, `granite`, `hunyuan_a13b`. Leave unset to disable. |
+
+## Serverless and concurrency settings
+
+Configure concurrency and logging for Serverless deployments.
+
+| Variable | Default | Type/Choices | Description |
+| --- | --- | --- | --- |
+| `MAX_CONCURRENCY` | `300` | `int` | Max concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM, this is for improving scaling/load balancing efficiency. |
+| `DISABLE_LOG_STATS` | `False` | `bool` | Enables or disables vLLM stats logging. |
+| `DISABLE_LOG_REQUESTS` | `False` | `bool` | Enables or disables vLLM request logging. |
+
+## Advanced settings
+
+Additional configuration options for specialized use cases.
+
+| Variable | Default | Type | Description |
+| --- | --- | --- | --- |
+| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra config for model loader. |
+| `PREEMPTION_MODE` | None | `str` | If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations into the CPU memory as preemption happens. |
+| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | How frequently the engine checks if a preemption happens. |
+| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory used for the saved activations. |
+| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | Disable logging requests. |
+| `MAX_LOG_LEN` | None | `int` | Max number of prompt characters or prompt ID numbers being printed in log. |
+
+## Docker build arguments
+
+These variables are used when building custom Docker images with models baked in.
+
+| Variable | Default | Type | Description |
+| --- | --- | --- | --- |
+| `BASE_PATH` | `/runpod-volume` | `str` | Storage directory for huggingface cache and model. |
+| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | CUDA version for the worker image. |
+
+## Deprecated variables
+
+
+The following variables are deprecated and will be removed in future versions.
+
+
+| Old Variable | New Variable | Note |
+| --- | --- | --- |
+| `MAX_CONTEXT_LEN_TO_CAPTURE` | `MAX_SEQ_LEN_TO_CAPTURE` | Use new variable name. |
+| `kv_cache_dtype=fp8_e5m2` | `kv_cache_dtype=fp8` | Simplified fp8 format. |
diff --git a/serverless/vllm/get-started.mdx b/serverless/vllm/get-started.mdx
index 587cdef3..170a3dce 100644
--- a/serverless/vllm/get-started.mdx
+++ b/serverless/vllm/get-started.mdx
@@ -145,7 +145,7 @@ To add or modify environment variables:
4. Add/edit your desired variables.
5. Click **Save Endpoint**.
-You can find a full list of available environment variables in the [vLLM worker GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#environment-variablessettings).
+For a complete list of available environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).
You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests#request-input-parameters).
diff --git a/serverless/vllm/openai-compatibility.mdx b/serverless/vllm/openai-compatibility.mdx
index 8d3c3d50..9dd5edf3 100644
--- a/serverless/vllm/openai-compatibility.mdx
+++ b/serverless/vllm/openai-compatibility.mdx
@@ -259,7 +259,7 @@ Use these environment variables to customize the OpenAI compatibility:
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses |
| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions |
-You can find a complete list of vLLM environment variables on the [GitHub README](https://github.com/runpod-workers/worker-vllm#environment-variables).
+For a complete list of all vLLM environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).
## Client libraries
diff --git a/serverless/vllm/overview.mdx b/serverless/vllm/overview.mdx
index 0e50052e..d34d4ac5 100644
--- a/serverless/vllm/overview.mdx
+++ b/serverless/vllm/overview.mdx
@@ -14,7 +14,7 @@ vLLM workers offer several advantages that make them ideal for LLM deployment:
* **Pre-built optimization**: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
* **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key.
* **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
-* **Configurable environments**: Extensive customization options through [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
+* **Configurable environments**: Extensive customization options through [environment variables](/serverless/vllm/environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
* **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.
## Deployment options
@@ -33,7 +33,7 @@ Quick-deployed workers will download models during initialization, which can tak
### Option 2: Deploy using a Docker image
-Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variablessettings).
+Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](/serverless/vllm/environment-variables).
Follow the instructions in the [vLLM worker README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside) to build a model into your worker image.
From a2220ae2778b6ee9c35ce5f4c9d6da6c3079213d Mon Sep 17 00:00:00 2001
From: Mo King
Date: Thu, 9 Oct 2025 12:52:43 -0400
Subject: [PATCH 2/2] vLLM environment vars second draft
---
serverless/vllm/environment-variables.mdx | 195 +++++++++++-----------
1 file changed, 93 insertions(+), 102 deletions(-)
diff --git a/serverless/vllm/environment-variables.mdx b/serverless/vllm/environment-variables.mdx
index 0440d2fd..8b513d4b 100644
--- a/serverless/vllm/environment-variables.mdx
+++ b/serverless/vllm/environment-variables.mdx
@@ -5,50 +5,50 @@ sidebarTitle: "Environment variables"
Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options.
-To set environment variables, go to your endpoint settings and add them under **Public Environment Variables** or **Private Environment Variables** depending on your security needs.
+To set environment variables, go to your [endpoint settings](/serverless/endpoints/manage-endpoints#edit-an-endpoint) and add them under **Environment Variables**.
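+
+Values are entered as strings, and the worker parses them into the types listed in the tables below. The following is a minimal, illustrative sketch of that parsing, assuming standard `os.environ` access; the `get_env` helper is hypothetical and not part of the worker-vllm codebase.
+
+```python
+import json
+import os
+
+def get_env(name: str, default=None, cast=str):
+    """Read an environment variable and coerce it from its string form."""
+    raw = os.environ.get(name)
+    if raw is None:
+        return default
+    if cast is bool:
+        return raw.strip().lower() in ("1", "true", "yes")
+    if cast in (dict, list):
+        return json.loads(raw)
+    return cast(raw)
+
+# Illustrative reads; the real worker handles many more variables.
+model_name = get_env("MODEL_NAME", default="facebook/opt-125m")
+max_model_len = get_env("MAX_MODEL_LEN", cast=int)
+trust_remote_code = get_env("TRUST_REMOTE_CODE", default=False, cast=bool)
+rope_scaling = get_env("ROPE_SCALING", cast=dict)
+```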
## LLM settings
These variables control the core language model configuration.
-| Variable | Default | Type/Choices | Description |
+| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
-| `MODEL_NAME` | `facebook/opt-125m` | `str` | Name or path of the Hugging Face model to use. |
-| `MODEL_REVISION` | `main` | `str` | Model revision to load. |
-| `TOKENIZER` | None | `str` | Name or path of the Hugging Face tokenizer to use. |
-| `SKIP_TOKENIZER_INIT` | `False` | `bool` | Skip initialization of tokenizer and detokenizer. |
+| `MODEL_NAME` | `facebook/opt-125m` | `str` | The name or path of the Hugging Face model to use. |
+| `MODEL_REVISION` | `main` | `str` | The model revision to load. |
+| `TOKENIZER` | None | `str` | The name or path of the Hugging Face tokenizer to use. |
+| `SKIP_TOKENIZER_INIT` | `False` | `bool` | If `True`, skips the initialization of the tokenizer and detokenizer. |
| `TOKENIZER_MODE` | `auto` | `auto`, `slow` | The tokenizer mode. |
-| `TRUST_REMOTE_CODE` | `False` | `bool` | Trust remote code from Hugging Face. |
-| `DOWNLOAD_DIR` | None | `str` | Directory to download and load the weights. |
+| `TRUST_REMOTE_CODE` | `False` | `bool` | If `True`, trusts remote code from Hugging Face. |
+| `DOWNLOAD_DIR` | None | `str` | The directory to download and load the model weights from. |
| `LOAD_FORMAT` | `auto` | `str` | The format of the model weights to load. |
-| `HF_TOKEN` | - | `str` | Hugging Face token for private and gated models. |
-| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | Data type for model weights and activations. |
-| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | Data type for KV cache storage. |
-| `QUANTIZATION_PARAM_PATH` | None | `str` | Path to the JSON file containing the KV cache scaling factors. |
-| `MAX_MODEL_LEN` | None | `int` | Model context length. |
-| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | Which engine will be used for guided decoding by default. |
-| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | Backend to use for distributed serving. |
-| `WORKER_USE_RAY` | `False` | `bool` | Deprecated, use `DISTRIBUTED_EXECUTOR_BACKEND=ray`. |
-| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | Number of pipeline stages. |
-| `TENSOR_PARALLEL_SIZE` | `1` | `int` | Number of tensor parallel replicas. |
-| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches. |
-| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If specified, use nsight to profile Ray workers. |
-| `ENABLE_PREFIX_CACHING` | `False` | `bool` | Enables automatic prefix caching. |
-| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | Disables sliding window, capping to sliding window size. |
-| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | Use BlockSpaceManagerV2. |
-| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | Experimental scheduling config necessary for speculative decoding. |
-| `SEED` | `0` | `int` | Random seed for operations. |
-| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, ignore GPU profiling result and use this number of GPU blocks. |
-| `MAX_NUM_BATCHED_TOKENS` | None | `int` | Maximum number of batched tokens per iteration. |
-| `MAX_NUM_SEQS` | `256` | `int` | Maximum number of sequences per iteration. |
-| `MAX_LOGPROBS` | `20` | `int` | Max number of log probs to return when logprobs is specified in SamplingParams. |
-| `DISABLE_LOG_STATS` | `False` | `bool` | Disable logging statistics. |
-| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | Method used to quantize the weights. |
-| `ROPE_SCALING` | None | `dict` | RoPE scaling configuration in JSON format. |
-| `ROPE_THETA` | None | `float` | RoPE theta. Use with `ROPE_SCALING`. |
-| `TOKENIZER_POOL_SIZE` | `0` | `int` | Size of tokenizer pool to use for asynchronous tokenization. |
-| `TOKENIZER_POOL_TYPE` | `ray` | `str` | Type of tokenizer pool to use for asynchronous tokenization. |
-| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra config for tokenizer pool. |
+| `HF_TOKEN` | - | `str` | Your Hugging Face token, used for private and gated models. |
+| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | The data type for model weights and activations. |
+| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | The data type for KV cache storage. |
+| `QUANTIZATION_PARAM_PATH` | None | `str` | The path to the JSON file containing the KV cache scaling factors. |
+| `MAX_MODEL_LEN` | None | `int` | The maximum model context length. |
+| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | The default engine for guided decoding. |
+| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | The backend to use for distributed serving. |
+| `WORKER_USE_RAY` | `False` | `bool` | Deprecated. Use `DISTRIBUTED_EXECUTOR_BACKEND=ray` instead. |
+| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | The number of pipeline stages. |
+| `TENSOR_PARALLEL_SIZE` | `1` | `int` | The number of tensor parallel replicas. |
+| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | The number of workers to use for parallel model loading. |
+| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If `True`, uses nsight to profile Ray workers. |
+| `ENABLE_PREFIX_CACHING` | `False` | `bool` | If `True`, enables automatic prefix caching. |
+| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | If `True`, disables the sliding window, capping to the sliding window size. |
+| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | If `True`, uses BlockSpaceManagerV2. |
+| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. |
+| `SEED` | `0` | `int` | The random seed for operations. |
+| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, this value overrides the GPU profiling result for the number of GPU blocks. |
+| `MAX_NUM_BATCHED_TOKENS` | None | `int` | The maximum number of batched tokens per iteration. |
+| `MAX_NUM_SEQS` | `256` | `int` | The maximum number of sequences per iteration. |
+| `MAX_LOGPROBS` | `20` | `int` | The maximum number of log probabilities to return when `logprobs` is specified in `SamplingParams`. |
+| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables logging statistics. |
+| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | The method used to quantize the model weights. |
+| `ROPE_SCALING` | None | `dict` | The RoPE scaling configuration in JSON format. |
+| `ROPE_THETA` | None | `float` | The RoPE theta value. Use with `ROPE_SCALING`. |
+| `TOKENIZER_POOL_SIZE` | `0` | `int` | The size of the tokenizer pool for asynchronous tokenization. |
+| `TOKENIZER_POOL_TYPE` | `ray` | `str` | The type of the tokenizer pool for asynchronous tokenization. |
+| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra configuration for the tokenizer pool. |
## LoRA settings
@@ -56,93 +56,95 @@ Configure LoRA (Low-Rank Adaptation) adapters for your model.
| Variable | Default | Type | Description |
| --- | --- | --- | --- |
-| `ENABLE_LORA` | `False` | `bool` | If True, enable handling of LoRA adapters. |
-| `MAX_LORAS` | `1` | `int` | Max number of LoRAs in a single batch. |
-| `MAX_LORA_RANK` | `16` | `int` | Max LoRA rank. |
-| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | Maximum size of extra vocabulary for LoRA adapters. |
-| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | Data type for LoRA. |
-| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specify multiple scaling factors for LoRA adapters. |
-| `MAX_CPU_LORAS` | None | `int` | Maximum number of LoRAs to store in CPU memory. |
-| `FULLY_SHARDED_LORAS` | `False` | `bool` | Enable fully sharded LoRA layers. |
-| `LORA_MODULES` | `[]` | `list[dict]` | Add lora adapters from Hugging Face `[{"name": "xx", "path": "xxx/xxxx", "base_model_name": "xxx/xxxx"}]` |
+| `ENABLE_LORA` | `False` | `bool` | If `True`, enables the handling of LoRA adapters. |
+| `MAX_LORAS` | `1` | `int` | The maximum number of LoRAs in a single batch. |
+| `MAX_LORA_RANK` | `16` | `int` | The maximum LoRA rank. |
+| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | The maximum size of the extra vocabulary for LoRA adapters. |
+| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | The data type for LoRA. |
+| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specifies multiple scaling factors for LoRA adapters. |
+| `MAX_CPU_LORAS` | None | `int` | The maximum number of LoRAs to store in CPU memory. |
+| `FULLY_SHARDED_LORAS` | `False` | `bool` | If `True`, enables fully sharded LoRA layers. |
+| `LORA_MODULES` | `[]` | `list[dict]` | A list of LoRA adapters to add from Hugging Face. Example: `[{"name": "adapter1", "path": "user/adapter1"}]` |
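+
+`LORA_MODULES` expects the whole list as a single JSON string. A simple way to avoid quoting mistakes is to build the value with `json.dumps`, as in this sketch (the adapter names and repository paths are placeholders):
+
+```python
+import json
+
+# Placeholder adapters; replace with LoRA repositories you actually have access to.
+lora_modules = [
+    {"name": "customer-support", "path": "your-org/customer-support-lora"},
+    {"name": "sql-gen", "path": "your-org/sql-lora", "base_model_name": "your-org/base-model"},
+]
+
+# Paste the printed single-line JSON string as the value of LORA_MODULES.
+print(json.dumps(lora_modules))
+```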
## Speculative decoding settings
Configure speculative decoding to improve inference performance.
-| Variable | Default | Type/Choices | Description |
+| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
-| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Apply a delay before scheduling next prompt. |
-| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | Enable chunked prefill requests. |
-| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model to be used in speculative decoding. |
+| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Applies a delay before scheduling the next prompt. |
+| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | If `True`, enables chunked prefill requests. |
+| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model for speculative decoding. |
| `NUM_SPECULATIVE_TOKENS` | None | `int` | The number of speculative tokens to sample from the draft model. |
-| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | Number of tensor parallel replicas for the draft model. |
+| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | The number of tensor parallel replicas for the draft model. |
| `SPECULATIVE_MAX_MODEL_LEN` | None | `int` | The maximum sequence length supported by the draft model. |
-| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disable speculative decoding if the number of enqueue requests is larger than this value. |
-| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | Max size of window for ngram prompt lookup in speculative decoding. |
-| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | Min size of window for ngram prompt lookup in speculative decoding. |
-| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | Specify the acceptance method for draft token verification in speculative decoding. |
-| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Set the lower bound threshold for the posterior probability of a token to be accepted. |
+| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disables speculative decoding if the number of enqueued requests is larger than this value. |
+| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | The maximum window size for ngram prompt lookup in speculative decoding. |
+| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | The minimum window size for ngram prompt lookup in speculative decoding. |
+| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | The acceptance method for draft token verification in speculative decoding. |
+| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA` | None | `float` | A scaling factor for the entropy-based threshold for token acceptance. |
## System performance settings
Configure GPU memory and system resource utilization.
-| Variable | Default | Type/Choices | Description |
+| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
-| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | Sets GPU VRAM utilization. |
-| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models. |
-| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | Token block size for contiguous chunks of tokens. |
-| `SWAP_SPACE` | `4` | `int` | CPU swap space size (GiB) per GPU. |
-| `ENFORCE_EAGER` | `False` | `bool` | Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility. |
-| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. |
-| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | Enables or disables custom all reduce. |
+| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | The fraction of GPU VRAM to use, between `0.0` and `1.0`. |
+| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
+| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | The token block size for contiguous chunks of tokens. |
+| `SWAP_SPACE` | `4` | `int` | The CPU swap space size (in GiB) per GPU. |
+| `ENFORCE_EAGER` | `False` | `bool` | If `True`, always uses eager-mode PyTorch. If `False`, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
+| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. |
+| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | If `0`, enables custom all-reduce. If `1`, disables it. |
## Tokenizer settings
Customize tokenizer behavior and chat templates.
-| Variable | Default | Type/Choices | Description |
+| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
-| `TOKENIZER_NAME` | None | `str` | Tokenizer repository to use a different tokenizer than the model's default. |
-| `TOKENIZER_REVISION` | None | `str` | Tokenizer revision to load. |
-| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line jinja template | Custom chat jinja template. [More info](https://huggingface.co/docs/transformers/chat_templating) |
+| `TOKENIZER_NAME` | None | `str` | The tokenizer repository to use instead of the model's default tokenizer. |
+| `TOKENIZER_REVISION` | None | `str` | The tokenizer revision to load. |
+| `CUSTOM_CHAT_TEMPLATE` | None | `str` (single-line Jinja template) | A custom chat Jinja template. See the [Hugging Face documentation](https://huggingface.co/docs/transformers/chat_templating) for more information. |
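+
+Because `CUSTOM_CHAT_TEMPLATE` must be a single-line string, it can help to write the template normally and collapse it before pasting it into your endpoint settings. The sketch below uses a deliberately simplified template; real chat templates are usually longer, but the same collapsing step applies.
+
+```python
+# A simplified, hypothetical chat template. See the Hugging Face docs linked
+# above for real-world templates.
+template = """
+{% for message in messages %}
+{{ message['role'] }}: {{ message['content'] }}
+{% endfor %}
+"""
+
+# Collapse to a single line for the CUSTOM_CHAT_TEMPLATE value.
+single_line = "".join(line.strip() for line in template.splitlines())
+print(single_line)
+```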
## Streaming and batch settings
Control how tokens are streamed back in HTTP responses.
-The way this works is that the first request will have a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request will have a batch size of `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`. This will continue until the batch size reaches `DEFAULT_BATCH_SIZE`. For example, with the default values, the batch sizes will be `1, 3, 9, 27, 50, 50, 50, ...`. You can also specify this per request, with inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`. This has nothing to do with vLLM's internal batching, but rather the number of tokens sent in each HTTP request from the worker.
+The batch size starts at `DEFAULT_MIN_BATCH_SIZE` and increases by a factor of `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` with each request until it reaches `DEFAULT_BATCH_SIZE`.
-| Variable | Default | Type/Choices | Description |
+For example, with default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM's internal batching.
+
+| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
-| `DEFAULT_BATCH_SIZE` | `50` | `int` | Default and maximum batch size for token streaming to reduce HTTP calls. |
-| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
-| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | Growth factor for dynamic batch size. |
+| `DEFAULT_BATCH_SIZE` | `50` | `int` | The default and maximum batch size for token streaming. |
+| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | The initial batch size for the first request. |
+| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | The growth factor for the dynamic batch size. |
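+
+The sketch below is just the arithmetic described above, assuming the default values; it is not the worker's actual implementation.
+
+```python
+# Reproduce the token batch-size schedule with the defaults:
+# DEFAULT_MIN_BATCH_SIZE=1, DEFAULT_BATCH_SIZE_GROWTH_FACTOR=3, DEFAULT_BATCH_SIZE=50.
+min_batch_size = 1
+growth_factor = 3
+max_batch_size = 50
+
+batch_size = min_batch_size
+schedule = []
+for _ in range(7):
+    schedule.append(batch_size)
+    batch_size = min(int(batch_size * growth_factor), max_batch_size)
+
+print(schedule)  # [1, 3, 9, 27, 50, 50, 50]
+```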
## OpenAI compatibility settings
Configure OpenAI API compatibility features.
-| Variable | Default | Type/Choices | Description |
+| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
-| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | Enables raw OpenAI SSE format string output when streaming. **Required** to be enabled (which it is by default) for OpenAI compatibility. |
-| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the name of the served model from model repo/path to specified name, which you will then be able to use the value for the `model` parameter when making OpenAI requests. |
-| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | Role of the LLM's response in OpenAI chat completions. |
-| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | Enables automatic tool selection for supported models. Set to `true` to activate. |
-| `TOOL_CALL_PARSER` | None | `str` | Specifies the parser for tool calls. Options: `mistral`, `hermes`, `llama3_json`, `llama4_json`, `llama4_pythonic`, `granite`, `granite-20b-fc`, `deepseek_v3`, `internlm`, `jamba`, `phi4_mini_json`, `pythonic` |
-| `REASONING_PARSER` | None | `str` | Parser for reasoning-capable models (enables reasoning mode). Examples: `deepseek_r1`, `qwen3`, `granite`, `hunyuan_a13b`. Leave unset to disable. |
+| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | If `1`, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
+| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the served model name. This allows you to use a custom name in the `model` parameter of your OpenAI requests. |
+| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | The role of the LLM's response in OpenAI chat completions. |
+| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | If `true`, enables automatic tool selection for supported models. |
+| `TOOL_CALL_PARSER` | None | `str` | The parser for tool calls. |
+| `REASONING_PARSER` | None | `str` | The parser for reasoning-capable models. Setting this enables reasoning mode. |
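+
+For example, if you set `OPENAI_SERVED_MODEL_NAME_OVERRIDE=my-model`, clients pass that name as the `model` parameter. The sketch below uses the official `openai` Python client; the endpoint ID, API key, and model name are placeholders, and the base URL follows the format described in the [OpenAI compatibility guide](/serverless/vllm/openai-compatibility).
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # replace <ENDPOINT_ID>
+    api_key="<API_KEY>",  # replace with your API key
+)
+
+# The model name matches OPENAI_SERVED_MODEL_NAME_OVERRIDE.
+response = client.chat.completions.create(
+    model="my-model",
+    messages=[{"role": "user", "content": "Hello!"}],
+)
+print(response.choices[0].message.content)
+```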
## Serverless and concurrency settings
Configure concurrency and logging for Serverless deployments.
-| Variable | Default | Type/Choices | Description |
+| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
-| `MAX_CONCURRENCY` | `300` | `int` | Max concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM, this is for improving scaling/load balancing efficiency. |
-| `DISABLE_LOG_STATS` | `False` | `bool` | Enables or disables vLLM stats logging. |
-| `DISABLE_LOG_REQUESTS` | `False` | `bool` | Enables or disables vLLM request logging. |
+| `MAX_CONCURRENCY` | `300` | `int` | The maximum number of concurrent requests per worker. vLLM's internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. |
+| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables vLLM stats logging. |
+| `DISABLE_LOG_REQUESTS` | `False` | `bool` | If `True`, disables vLLM request logging. |
## Advanced settings
@@ -150,12 +152,12 @@ Additional configuration options for specialized use cases.
| Variable | Default | Type | Description |
| --- | --- | --- | --- |
-| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra config for model loader. |
-| `PREEMPTION_MODE` | None | `str` | If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations into the CPU memory as preemption happens. |
-| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | How frequently the engine checks if a preemption happens. |
-| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory used for the saved activations. |
-| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | Disable logging requests. |
-| `MAX_LOG_LEN` | None | `int` | Max number of prompt characters or prompt ID numbers being printed in log. |
+| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra configuration for the model loader. |
+| `PREEMPTION_MODE` | None | `str` | The preemption mode. If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations to CPU memory during preemption. |
+| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | The frequency (in seconds) at which the engine checks for preemption. |
+| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory to use for saved activations. |
+| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | If `True`, disables logging requests. |
+| `MAX_LOG_LEN` | None | `int` | The maximum number of prompt characters or prompt ID numbers to print in the log. |
## Docker build arguments
@@ -163,16 +165,5 @@ These variables are used when building custom Docker images with models baked in
| Variable | Default | Type | Description |
| --- | --- | --- | --- |
-| `BASE_PATH` | `/runpod-volume` | `str` | Storage directory for huggingface cache and model. |
-| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | CUDA version for the worker image. |
-
-## Deprecated variables
-
-
-The following variables are deprecated and will be removed in future versions.
-
-
-| Old Variable | New Variable | Note |
-| --- | --- | --- |
-| `MAX_CONTEXT_LEN_TO_CAPTURE` | `MAX_SEQ_LEN_TO_CAPTURE` | Use new variable name. |
-| `kv_cache_dtype=fp8_e5m2` | `kv_cache_dtype=fp8` | Simplified fp8 format. |
+| `BASE_PATH` | `/runpod-volume` | `str` | The storage directory for the Hugging Face cache and model. |
+| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | The CUDA version for the worker image. |