3 changes: 2 additions & 1 deletion docs.json
@@ -73,7 +73,8 @@
"serverless/vllm/overview",
"serverless/vllm/get-started",
"serverless/vllm/vllm-requests",
"serverless/vllm/openai-compatibility"
"serverless/vllm/openai-compatibility",
"serverless/vllm/environment-variables"
]
},
{
169 changes: 169 additions & 0 deletions serverless/vllm/environment-variables.mdx
@@ -0,0 +1,169 @@
---
title: "vLLM environment variables"
sidebarTitle: "Environment variables"
---

Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options.

To set environment variables, go to your [endpoint settings](/serverless/endpoints/manage-endpoints#edit-an-endpoint) and add them under **Environment Variables**.
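
For example, a typical configuration for serving a gated instruction-tuned model might set values like the ones below. These are placeholder values for illustration; in practice you enter each pair in the endpoint's **Environment Variables** section.

```python
# Placeholder values for illustration only. Enter each key/value pair in the
# endpoint's Environment Variables settings; this dict is not meant to be run.
environment_variables = {
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",  # example gated model
    "HF_TOKEN": "hf_xxxxxxxxxxxxxxxx",                 # placeholder token
    "MAX_MODEL_LEN": "8192",
    "DTYPE": "bfloat16",
    "GPU_MEMORY_UTILIZATION": "0.95",
}
```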

## LLM settings

These variables control the core language model configuration.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `MODEL_NAME` | `facebook/opt-125m` | `str` | The name or path of the Hugging Face model to use. |
| `MODEL_REVISION` | `main` | `str` | The model revision to load. |
| `TOKENIZER` | None | `str` | The name or path of the Hugging Face tokenizer to use. |
| `SKIP_TOKENIZER_INIT` | `False` | `bool` | If `True`, skips the initialization of the tokenizer and detokenizer. |
| `TOKENIZER_MODE` | `auto` | `auto`, `slow` | The tokenizer mode. `auto` uses the fast tokenizer when available, while `slow` always uses the slow tokenizer. |
| `TRUST_REMOTE_CODE` | `False` | `bool` | If `True`, trusts remote code from Hugging Face. |
| `DOWNLOAD_DIR` | None | `str` | The directory to download and load the model weights from. |
| `LOAD_FORMAT` | `auto` | `str` | The format of the model weights to load. |
| `HF_TOKEN` | - | `str` | Your Hugging Face token, used for private and gated models. |
| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | The data type for model weights and activations. |
| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | The data type for KV cache storage. |
| `QUANTIZATION_PARAM_PATH` | None | `str` | The path to the JSON file containing the KV cache scaling factors. |
| `MAX_MODEL_LEN` | None | `int` | The maximum model context length. |
| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | The default engine for guided decoding. |
| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | The backend to use for distributed serving. |
| `WORKER_USE_RAY` | `False` | `bool` | Deprecated. Use `DISTRIBUTED_EXECUTOR_BACKEND=ray` instead. |
| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | The number of pipeline stages. |
| `TENSOR_PARALLEL_SIZE` | `1` | `int` | The number of tensor parallel replicas. |
| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | The number of workers to use for parallel model loading. |
| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If `True`, uses nsight to profile Ray workers. |
| `ENABLE_PREFIX_CACHING` | `False` | `bool` | If `True`, enables automatic prefix caching. |
| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | If `True`, disables the sliding window, capping the context to the sliding window size. |
| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | If `True`, uses BlockSpaceManagerV2. |
| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. |
| `SEED` | `0` | `int` | The random seed for operations. |
| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, this value overrides the GPU profiling result for the number of GPU blocks. |
| `MAX_NUM_BATCHED_TOKENS` | None | `int` | The maximum number of batched tokens per iteration. |
| `MAX_NUM_SEQS` | `256` | `int` | The maximum number of sequences per iteration. |
| `MAX_LOGPROBS` | `20` | `int` | The maximum number of log probabilities to return when `logprobs` is specified in `SamplingParams`. |
| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables logging statistics. |
| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | The method used to quantize the model weights. |
| `ROPE_SCALING` | None | `dict` | The RoPE scaling configuration in JSON format. |
| `ROPE_THETA` | None | `float` | The RoPE theta value. Use with `ROPE_SCALING`. |
| `TOKENIZER_POOL_SIZE` | `0` | `int` | The size of the tokenizer pool for asynchronous tokenization. |
| `TOKENIZER_POOL_TYPE` | `ray` | `str` | The type of the tokenizer pool for asynchronous tokenization. |
| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra configuration for the tokenizer pool. |
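
All values are supplied to the worker as strings. The sketch below (an illustration of the expected formats, not the worker's actual parsing code) shows how boolean, integer, and dict-typed variables such as `ROPE_SCALING` are typically expressed and parsed; the example `ROPE_SCALING` value is hypothetical.

```python
import json
import os

# Illustration only: environment variables arrive as strings and are parsed
# into the types listed in the table above.
trust_remote_code = os.environ.get("TRUST_REMOTE_CODE", "False").lower() in ("1", "true")
max_model_len = int(os.environ["MAX_MODEL_LEN"]) if os.environ.get("MAX_MODEL_LEN") else None

# Dict-typed variables are provided as JSON strings, for example:
#   ROPE_SCALING='{"type": "dynamic", "factor": 2.0}'  (hypothetical value)
rope_scaling = json.loads(os.environ["ROPE_SCALING"]) if os.environ.get("ROPE_SCALING") else None

print(trust_remote_code, max_model_len, rope_scaling)
```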

## LoRA settings

Configure LoRA (Low-Rank Adaptation) adapters for your model.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| `ENABLE_LORA` | `False` | `bool` | If `True`, enables the handling of LoRA adapters. |
| `MAX_LORAS` | `1` | `int` | The maximum number of LoRAs in a single batch. |
| `MAX_LORA_RANK` | `16` | `int` | The maximum LoRA rank. |
| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | The maximum size of the extra vocabulary for LoRA adapters. |
| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | The data type for LoRA. |
| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specifies multiple scaling factors for LoRA adapters. |
| `MAX_CPU_LORAS` | None | `int` | The maximum number of LoRAs to store in CPU memory. |
| `FULLY_SHARDED_LORAS` | `False` | `bool` | If `True`, enables fully sharded LoRA layers. |
| `LORA_MODULES` | `[]` | `list[dict]` | A list of LoRA adapters to add from Hugging Face. Example: `[{"name": "adapter1", "path": "user/adapter1"}]` |
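
For example, to serve a LoRA adapter alongside the base model, set `ENABLE_LORA` to `True` and point `LORA_MODULES` at a JSON list like the one below. The adapter name and repository are placeholders; the sketch simply shows how to produce the JSON string to paste into the endpoint settings.

```python
import json

# Hypothetical adapter for illustration; "user/adapter1" is a placeholder
# Hugging Face repository.
lora_modules = [{"name": "adapter1", "path": "user/adapter1"}]

# The string value to enter for LORA_MODULES in the endpoint settings:
print(json.dumps(lora_modules))
# [{"name": "adapter1", "path": "user/adapter1"}]
```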

## Speculative decoding settings

Configure speculative decoding to improve inference performance.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Applies a delay (this factor multiplied by the previous prompt's latency) before scheduling the next prompt. |
| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | If `True`, enables chunked prefill requests. |
| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model for speculative decoding. |
| `NUM_SPECULATIVE_TOKENS` | None | `int` | The number of speculative tokens to sample from the draft model. |
| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | The number of tensor parallel replicas for the draft model. |
| `SPECULATIVE_MAX_MODEL_LEN` | None | `int` | The maximum sequence length supported by the draft model. |
| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disables speculative decoding if the number of enqueued requests is larger than this value. |
| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | The maximum window size for ngram prompt lookup in speculative decoding. |
| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | The minimum window size for ngram prompt lookup in speculative decoding. |
| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | The acceptance method for draft token verification in speculative decoding. |
| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA` | None | `float` | A scaling factor for the entropy-based threshold for token acceptance. |
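
As a rough example, draft-model speculative decoding might be configured with values like the following. The draft model name is a placeholder, and whether a given base/draft pairing works depends on the models and the vLLM version in the worker image.

```python
# Placeholder values for illustration only; enter each pair in the endpoint's
# Environment Variables settings rather than running this as code.
speculative_decoding_settings = {
    "SPECULATIVE_MODEL": "example-org/tiny-draft-model",  # hypothetical draft model
    "NUM_SPECULATIVE_TOKENS": "5",
    "SPEC_DECODING_ACCEPTANCE_METHOD": "rejection_sampler",
}
```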

## System performance settings

Configure GPU memory and system resource utilization.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | The fraction of GPU VRAM to use, between 0 and 1. |
| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | The token block size for contiguous chunks of tokens. |
| `SWAP_SPACE` | `4` | `int` | The CPU swap space size (in GiB) per GPU. |
| `ENFORCE_EAGER` | `False` | `bool` | If `True`, always uses eager-mode PyTorch. If `False`, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. |
| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | If `0`, enables custom all-reduce. If `1`, disables it. |

## Tokenizer settings

Customize tokenizer behavior and chat templates.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `TOKENIZER_NAME` | None | `str` | The tokenizer repository to use when you want a tokenizer other than the model's default. |
| `TOKENIZER_REVISION` | None | `str` | The tokenizer revision to load. |
| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line jinja template | A custom chat Jinja template. See the [Hugging Face documentation](https://huggingface.co/docs/transformers/chat_templating) for more information. |
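
A `CUSTOM_CHAT_TEMPLATE` value is a single-line Jinja template. The minimal template below is purely illustrative (real templates are usually model-specific) and is shown as a Python string only so the Jinja syntax stays readable; the concatenated result is the value you would set.

```python
# Illustrative single-line Jinja chat template; real templates are usually
# model-specific (see the Hugging Face chat templating docs linked above).
CUSTOM_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }} "
    "{% endfor %}"
    "assistant:"
)
print(CUSTOM_CHAT_TEMPLATE)
```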

## Streaming and batch settings

Control how tokens are streamed back in HTTP responses.

These settings control how tokens are batched in HTTP responses when streaming. The batch size starts at `DEFAULT_MIN_BATCH_SIZE` and increases by a factor of `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` with each request until it reaches `DEFAULT_BATCH_SIZE`.

For example, with default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM's internal batching.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `DEFAULT_BATCH_SIZE` | `50` | `int` | The default and maximum batch size for token streaming. |
| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | The initial batch size for the first request. |
| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | The growth factor for the dynamic batch size. |
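
The schedule described above can be reproduced in a few lines (a worked illustration of the default values, not worker code):

```python
# Reproduce the default streaming batch-size schedule: 1, 3, 9, 27, 50, 50, ...
DEFAULT_MIN_BATCH_SIZE = 1
DEFAULT_BATCH_SIZE_GROWTH_FACTOR = 3
DEFAULT_BATCH_SIZE = 50

batch_size = DEFAULT_MIN_BATCH_SIZE
schedule = []
for _ in range(6):
    schedule.append(min(int(batch_size), DEFAULT_BATCH_SIZE))
    batch_size *= DEFAULT_BATCH_SIZE_GROWTH_FACTOR

print(schedule)  # [1, 3, 9, 27, 50, 50]
```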

## OpenAI compatibility settings

Configure OpenAI API compatibility features.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | If `1`, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the served model name. This allows you to use a custom name in the `model` parameter of your OpenAI requests. |
| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | The role of the LLM's response in OpenAI chat completions. |
| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | If `true`, enables automatic tool selection for supported models. |
| `TOOL_CALL_PARSER` | None | `str` | The parser for tool calls. |
| `REASONING_PARSER` | None | `str` | The parser for reasoning-capable models. Setting this enables reasoning mode. |
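
For example, if `OPENAI_SERVED_MODEL_NAME_OVERRIDE` is set to `my-model`, a request with the OpenAI Python client might look like the sketch below. The endpoint ID and API key are placeholders, and the base URL assumes Runpod's standard OpenAI-compatible route (see [OpenAI compatibility](/serverless/vllm/openai-compatibility)).

```python
from openai import OpenAI

# Placeholders: substitute your endpoint ID and Runpod API key.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

response = client.chat.completions.create(
    model="my-model",  # matches OPENAI_SERVED_MODEL_NAME_OVERRIDE
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```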

## Serverless and concurrency settings

Configure concurrency and logging for Serverless deployments.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `MAX_CONCURRENCY` | `300` | `int` | The maximum number of concurrent requests per worker. vLLM's internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. |
| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables vLLM stats logging. |
| `DISABLE_LOG_REQUESTS` | `False` | `bool` | If `True`, disables vLLM request logging. |

## Advanced settings

Additional configuration options for specialized use cases.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra configuration for the model loader. |
| `PREEMPTION_MODE` | None | `str` | The preemption mode. If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations to CPU memory during preemption. |
| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | The frequency (in seconds) at which the engine checks for preemption. |
| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory to use for saved activations. |
| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | If `True`, disables logging requests. |
| `MAX_LOG_LEN` | None | `int` | The maximum number of prompt characters or prompt ID numbers to print in the log. |

## Docker build arguments

These variables are used when building custom Docker images with models baked in.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| `BASE_PATH` | `/runpod-volume` | `str` | The storage directory for the Hugging Face cache and model. |
| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | The CUDA version for the worker image. |
2 changes: 1 addition & 1 deletion serverless/vllm/get-started.mdx
@@ -145,7 +145,7 @@ To add or modify environment variables:
4. Add/edit your desired variables.
5. Click **Save Endpoint**.

You can find a full list of available environment variables in the [vLLM worker GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#environment-variablessettings).
For a complete list of available environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).

You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests#request-input-parameters).

2 changes: 1 addition & 1 deletion serverless/vllm/openai-compatibility.mdx
@@ -259,7 +259,7 @@ Use these environment variables to customize the OpenAI compatibility:
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses |
| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions |

You can find a complete list of vLLM environment variables on the [GitHub README](https://github.com/runpod-workers/worker-vllm#environment-variables).
For a complete list of all vLLM environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).

## Client libraries

4 changes: 2 additions & 2 deletions serverless/vllm/overview.mdx
@@ -14,7 +14,7 @@ vLLM workers offer several advantages that make them ideal for LLM deployment:
* **Pre-built optimization**: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
* **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key.
* **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
* **Configurable environments**: Extensive customization options through [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
* **Configurable environments**: Extensive customization options through [environment variables](/serverless/vllm/environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
* **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.

## Deployment options
@@ -33,7 +33,7 @@ Quick-deployed workers will download models during initialization, which can tak

### Option 2: Deploy using a Docker image

Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variablessettings).
Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](/serverless/vllm/environment-variables).

Follow the instructions in the [vLLM worker README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside) to build a model into your worker image.
