From bea9d1c685ed82731daafdc609e9e6d095717ca5 Mon Sep 17 00:00:00 2001 From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com> Date: Thu, 9 Oct 2025 16:33:57 +0000 Subject: [PATCH 1/2] Documentation updates from Promptless --- docs.json | 3 +- serverless/vllm/environment-variables.mdx | 178 ++++++++++++++++++++++ serverless/vllm/get-started.mdx | 2 +- serverless/vllm/openai-compatibility.mdx | 2 +- serverless/vllm/overview.mdx | 4 +- 5 files changed, 184 insertions(+), 5 deletions(-) create mode 100644 serverless/vllm/environment-variables.mdx diff --git a/docs.json b/docs.json index 8bd11fc9..c45cddf3 100644 --- a/docs.json +++ b/docs.json @@ -82,7 +82,8 @@ "serverless/vllm/overview", "serverless/vllm/get-started", "serverless/vllm/vllm-requests", - "serverless/vllm/openai-compatibility" + "serverless/vllm/openai-compatibility", + "serverless/vllm/environment-variables" ] }, { diff --git a/serverless/vllm/environment-variables.mdx b/serverless/vllm/environment-variables.mdx new file mode 100644 index 00000000..0440d2fd --- /dev/null +++ b/serverless/vllm/environment-variables.mdx @@ -0,0 +1,178 @@ +--- +title: "vLLM environment variables" +sidebarTitle: "Environment variables" +--- + +Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options. + +To set environment variables, go to your endpoint settings and add them under **Public Environment Variables** or **Private Environment Variables** depending on your security needs. + +## LLM settings + +These variables control the core language model configuration. + +| Variable | Default | Type/Choices | Description | +| --- | --- | --- | --- | +| `MODEL_NAME` | `facebook/opt-125m` | `str` | Name or path of the Hugging Face model to use. | +| `MODEL_REVISION` | `main` | `str` | Model revision to load. | +| `TOKENIZER` | None | `str` | Name or path of the Hugging Face tokenizer to use. | +| `SKIP_TOKENIZER_INIT` | `False` | `bool` | Skip initialization of tokenizer and detokenizer. | +| `TOKENIZER_MODE` | `auto` | `auto`, `slow` | The tokenizer mode. | +| `TRUST_REMOTE_CODE` | `False` | `bool` | Trust remote code from Hugging Face. | +| `DOWNLOAD_DIR` | None | `str` | Directory to download and load the weights. | +| `LOAD_FORMAT` | `auto` | `str` | The format of the model weights to load. | +| `HF_TOKEN` | - | `str` | Hugging Face token for private and gated models. | +| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | Data type for model weights and activations. | +| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | Data type for KV cache storage. | +| `QUANTIZATION_PARAM_PATH` | None | `str` | Path to the JSON file containing the KV cache scaling factors. | +| `MAX_MODEL_LEN` | None | `int` | Model context length. | +| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | Which engine will be used for guided decoding by default. | +| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | Backend to use for distributed serving. | +| `WORKER_USE_RAY` | `False` | `bool` | Deprecated, use `DISTRIBUTED_EXECUTOR_BACKEND=ray`. | +| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | Number of pipeline stages. | +| `TENSOR_PARALLEL_SIZE` | `1` | `int` | Number of tensor parallel replicas. | +| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches. 
| +| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If specified, use nsight to profile Ray workers. | +| `ENABLE_PREFIX_CACHING` | `False` | `bool` | Enables automatic prefix caching. | +| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | Disables sliding window, capping to sliding window size. | +| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | Use BlockSpaceMangerV2. | +| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | Experimental scheduling config necessary for speculative decoding. | +| `SEED` | `0` | `int` | Random seed for operations. | +| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, ignore GPU profiling result and use this number of GPU blocks. | +| `MAX_NUM_BATCHED_TOKENS` | None | `int` | Maximum number of batched tokens per iteration. | +| `MAX_NUM_SEQS` | `256` | `int` | Maximum number of sequences per iteration. | +| `MAX_LOGPROBS` | `20` | `int` | Max number of log probs to return when logprobs is specified in SamplingParams. | +| `DISABLE_LOG_STATS` | `False` | `bool` | Disable logging statistics. | +| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | Method used to quantize the weights. | +| `ROPE_SCALING` | None | `dict` | RoPE scaling configuration in JSON format. | +| `ROPE_THETA` | None | `float` | RoPE theta. Use with `ROPE_SCALING`. | +| `TOKENIZER_POOL_SIZE` | `0` | `int` | Size of tokenizer pool to use for asynchronous tokenization. | +| `TOKENIZER_POOL_TYPE` | `ray` | `str` | Type of tokenizer pool to use for asynchronous tokenization. | +| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra config for tokenizer pool. | + +## LoRA settings + +Configure LoRA (Low-Rank Adaptation) adapters for your model. + +| Variable | Default | Type | Description | +| --- | --- | --- | --- | +| `ENABLE_LORA` | `False` | `bool` | If True, enable handling of LoRA adapters. | +| `MAX_LORAS` | `1` | `int` | Max number of LoRAs in a single batch. | +| `MAX_LORA_RANK` | `16` | `int` | Max LoRA rank. | +| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | Maximum size of extra vocabulary for LoRA adapters. | +| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | Data type for LoRA. | +| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specify multiple scaling factors for LoRA adapters. | +| `MAX_CPU_LORAS` | None | `int` | Maximum number of LoRAs to store in CPU memory. | +| `FULLY_SHARDED_LORAS` | `False` | `bool` | Enable fully sharded LoRA layers. | +| `LORA_MODULES` | `[]` | `list[dict]` | Add lora adapters from Hugging Face `[{"name": "xx", "path": "xxx/xxxx", "base_model_name": "xxx/xxxx"}]` | + +## Speculative decoding settings + +Configure speculative decoding to improve inference performance. + +| Variable | Default | Type/Choices | Description | +| --- | --- | --- | --- | +| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Apply a delay before scheduling next prompt. | +| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | Enable chunked prefill requests. | +| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model to be used in speculative decoding. | +| `NUM_SPECULATIVE_TOKENS` | None | `int` | The number of speculative tokens to sample from the draft model. | +| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | Number of tensor parallel replicas for the draft model. | +| `SPECULATIVE_MAX_MODEL_LEN` | None | `int` | The maximum sequence length supported by the draft model. | +| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disable speculative decoding if the number of enqueue requests is larger than this value. 
| +| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | Max size of window for ngram prompt lookup in speculative decoding. | +| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | Min size of window for ngram prompt lookup in speculative decoding. | +| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | Specify the acceptance method for draft token verification in speculative decoding. | +| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Set the lower bound threshold for the posterior probability of a token to be accepted. | +| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA` | None | `float` | A scaling factor for the entropy-based threshold for token acceptance. | + +## System performance settings + +Configure GPU memory and system resource utilization. + +| Variable | Default | Type/Choices | Description | +| --- | --- | --- | --- | +| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | Sets GPU VRAM utilization. | +| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models. | +| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | Token block size for contiguous chunks of tokens. | +| `SWAP_SPACE` | `4` | `int` | CPU swap space size (GiB) per GPU. | +| `ENFORCE_EAGER` | `False` | `bool` | Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility. | +| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. | +| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | Enables or disables custom all reduce. | + +## Tokenizer settings + +Customize tokenizer behavior and chat templates. + +| Variable | Default | Type/Choices | Description | +| --- | --- | --- | --- | +| `TOKENIZER_NAME` | None | `str` | Tokenizer repository to use a different tokenizer than the model's default. | +| `TOKENIZER_REVISION` | None | `str` | Tokenizer revision to load. | +| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line jinja template | Custom chat jinja template. [More info](https://huggingface.co/docs/transformers/chat_templating) | + +## Streaming and batch settings + +Control how tokens are streamed back in HTTP responses. + +The way this works is that the first request will have a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request will have a batch size of `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`. This will continue until the batch size reaches `DEFAULT_BATCH_SIZE`. For example, with the default values, the batch sizes will be `1, 3, 9, 27, 50, 50, 50, ...`. You can also specify this per request, with inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`. This has nothing to do with vLLM's internal batching, but rather the number of tokens sent in each HTTP request from the worker. + +| Variable | Default | Type/Choices | Description | +| --- | --- | --- | --- | +| `DEFAULT_BATCH_SIZE` | `50` | `int` | Default and maximum batch size for token streaming to reduce HTTP calls. | +| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. | +| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | Growth factor for dynamic batch size. | + +## OpenAI compatibility settings + +Configure OpenAI API compatibility features. 
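
For example, if you set `OPENAI_SERVED_MODEL_NAME_OVERRIDE=my-model` on your endpoint, an OpenAI client can reference that name instead of the Hugging Face repo path. The snippet below is a minimal sketch, not a definitive implementation: it assumes the OpenAI Python SDK, a `RUNPOD_API_KEY` and `RUNPOD_ENDPOINT_ID` available in your local environment, and the usual `https://api.runpod.ai/v2/<endpoint_id>/openai/v1` base URL. The table that follows lists the variables involved.

```python
import os
from openai import OpenAI

# Point the OpenAI client at the endpoint's OpenAI-compatible URL (sketch; verify the
# base URL format against your endpoint's details page).
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
)

response = client.chat.completions.create(
    # Matches OPENAI_SERVED_MODEL_NAME_OVERRIDE rather than the model repo path.
    model="my-model",
    messages=[{"role": "user", "content": "Explain what a vLLM worker does."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
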
+ +| Variable | Default | Type/Choices | Description | +| --- | --- | --- | --- | +| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | Enables raw OpenAI SSE format string output when streaming. **Required** to be enabled (which it is by default) for OpenAI compatibility. | +| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the name of the served model from model repo/path to specified name, which you will then be able to use the value for the `model` parameter when making OpenAI requests. | +| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | Role of the LLM's response in OpenAI chat completions. | +| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | Enables automatic tool selection for supported models. Set to `true` to activate. | +| `TOOL_CALL_PARSER` | None | `str` | Specifies the parser for tool calls. Options: `mistral`, `hermes`, `llama3_json`, `llama4_json`, `llama4_pythonic`, `granite`, `granite-20b-fc`, `deepseek_v3`, `internlm`, `jamba`, `phi4_mini_json`, `pythonic` | +| `REASONING_PARSER` | None | `str` | Parser for reasoning-capable models (enables reasoning mode). Examples: `deepseek_r1`, `qwen3`, `granite`, `hunyuan_a13b`. Leave unset to disable. | + +## Serverless and concurrency settings + +Configure concurrency and logging for Serverless deployments. + +| Variable | Default | Type/Choices | Description | +| --- | --- | --- | --- | +| `MAX_CONCURRENCY` | `300` | `int` | Max concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM, this is for improving scaling/load balancing efficiency. | +| `DISABLE_LOG_STATS` | `False` | `bool` | Enables or disables vLLM stats logging. | +| `DISABLE_LOG_REQUESTS` | `False` | `bool` | Enables or disables vLLM request logging. | + +## Advanced settings + +Additional configuration options for specialized use cases. + +| Variable | Default | Type | Description | +| --- | --- | --- | --- | +| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra config for model loader. | +| `PREEMPTION_MODE` | None | `str` | If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations into the CPU memory as preemption happens. | +| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | How frequently the engine checks if a preemption happens. | +| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory used for the saved activations. | +| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | Disable logging requests. | +| `MAX_LOG_LEN` | None | `int` | Max number of prompt characters or prompt ID numbers being printed in log. | + +## Docker build arguments + +These variables are used when building custom Docker images with models baked in. + +| Variable | Default | Type | Description | +| --- | --- | --- | --- | +| `BASE_PATH` | `/runpod-volume` | `str` | Storage directory for huggingface cache and model. | +| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | CUDA version for the worker image. | + +## Deprecated variables + + +The following variables are deprecated and will be removed in future versions. + + +| Old Variable | New Variable | Note | +| --- | --- | --- | +| `MAX_CONTEXT_LEN_TO_CAPTURE` | `MAX_SEQ_LEN_TO_CAPTURE` | Use new variable name. | +| `kv_cache_dtype=fp8_e5m2` | `kv_cache_dtype=fp8` | Simplified fp8 format. 
| diff --git a/serverless/vllm/get-started.mdx b/serverless/vllm/get-started.mdx index 587cdef3..170a3dce 100644 --- a/serverless/vllm/get-started.mdx +++ b/serverless/vllm/get-started.mdx @@ -145,7 +145,7 @@ To add or modify environment variables: 4. Add/edit your desired variables. 5. Click **Save Endpoint**. -You can find a full list of available environment variables in the [vLLM worker GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#environment-variablessettings). +For a complete list of available environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables). You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per reponse. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests#request-input-parameters). diff --git a/serverless/vllm/openai-compatibility.mdx b/serverless/vllm/openai-compatibility.mdx index 8d3c3d50..9dd5edf3 100644 --- a/serverless/vllm/openai-compatibility.mdx +++ b/serverless/vllm/openai-compatibility.mdx @@ -259,7 +259,7 @@ Use these environment variables to customize the OpenAI compatibility: | `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses | | `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions | -You can find a complete list of vLLM environment variables on the [GitHub README](https://github.com/runpod-workers/worker-vllm#environment-variables). +For a complete list of all vLLM environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables). ## Client libraries diff --git a/serverless/vllm/overview.mdx b/serverless/vllm/overview.mdx index 0e50052e..d34d4ac5 100644 --- a/serverless/vllm/overview.mdx +++ b/serverless/vllm/overview.mdx @@ -14,7 +14,7 @@ vLLM workers offer several advantages that make them ideal for LLM deployment: * **Pre-built optimization**: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference. * **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key. * **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others. -* **Configurable environments**: Extensive customization options through [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variables) allow you to adjust model parameters, performance settings, and other behaviors. +* **Configurable environments**: Extensive customization options through [environment variables](/serverless/vllm/environment-variables) allow you to adjust model parameters, performance settings, and other behaviors. * **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis. 
## Deployment options @@ -33,7 +33,7 @@ Quick-deployed workers will download models during initialization, which can tak ### Option 2: Deploy using a Docker image -Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variablessettings). +Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](/serverless/vllm/environment-variables). Follow the instructions in the [vLLM worker README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside) to build a model into your worker image. From a2220ae2778b6ee9c35ce5f4c9d6da6c3079213d Mon Sep 17 00:00:00 2001 From: Mo King Date: Thu, 9 Oct 2025 12:52:43 -0400 Subject: [PATCH 2/2] vLLM environment vars second draft --- serverless/vllm/environment-variables.mdx | 195 +++++++++++----------- 1 file changed, 93 insertions(+), 102 deletions(-) diff --git a/serverless/vllm/environment-variables.mdx b/serverless/vllm/environment-variables.mdx index 0440d2fd..8b513d4b 100644 --- a/serverless/vllm/environment-variables.mdx +++ b/serverless/vllm/environment-variables.mdx @@ -5,50 +5,50 @@ sidebarTitle: "Environment variables" Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options. -To set environment variables, go to your endpoint settings and add them under **Public Environment Variables** or **Private Environment Variables** depending on your security needs. +To set environment variables, go to your [endpoint settings](/serverless/endpoints/manage-endpoints#edit-an-endpoint) and add them under **Environment Variables**. ## LLM settings These variables control the core language model configuration. -| Variable | Default | Type/Choices | Description | +| Variable | Default | Type(s) | Description | | --- | --- | --- | --- | -| `MODEL_NAME` | `facebook/opt-125m` | `str` | Name or path of the Hugging Face model to use. | -| `MODEL_REVISION` | `main` | `str` | Model revision to load. | -| `TOKENIZER` | None | `str` | Name or path of the Hugging Face tokenizer to use. | -| `SKIP_TOKENIZER_INIT` | `False` | `bool` | Skip initialization of tokenizer and detokenizer. | +| `MODEL_NAME` | `facebook/opt-125m` | `str` | The name or path of the Hugging Face model to use. | +| `MODEL_REVISION` | `main` | `str` | The model revision to load. | +| `TOKENIZER` | None | `str` | The name or path of the Hugging Face tokenizer to use. | +| `SKIP_TOKENIZER_INIT` | `False` | `bool` | If `True`, skips the initialization of the tokenizer and detokenizer. | | `TOKENIZER_MODE` | `auto` | `auto`, `slow` | The tokenizer mode. | -| `TRUST_REMOTE_CODE` | `False` | `bool` | Trust remote code from Hugging Face. | -| `DOWNLOAD_DIR` | None | `str` | Directory to download and load the weights. | +| `TRUST_REMOTE_CODE` | `False` | `bool` | If `True`, trusts remote code from Hugging Face. | +| `DOWNLOAD_DIR` | None | `str` | The directory to download and load the model weights from. | | `LOAD_FORMAT` | `auto` | `str` | The format of the model weights to load. 
| -| `HF_TOKEN` | - | `str` | Hugging Face token for private and gated models. | -| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | Data type for model weights and activations. | -| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | Data type for KV cache storage. | -| `QUANTIZATION_PARAM_PATH` | None | `str` | Path to the JSON file containing the KV cache scaling factors. | -| `MAX_MODEL_LEN` | None | `int` | Model context length. | -| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | Which engine will be used for guided decoding by default. | -| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | Backend to use for distributed serving. | -| `WORKER_USE_RAY` | `False` | `bool` | Deprecated, use `DISTRIBUTED_EXECUTOR_BACKEND=ray`. | -| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | Number of pipeline stages. | -| `TENSOR_PARALLEL_SIZE` | `1` | `int` | Number of tensor parallel replicas. | -| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches. | -| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If specified, use nsight to profile Ray workers. | -| `ENABLE_PREFIX_CACHING` | `False` | `bool` | Enables automatic prefix caching. | -| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | Disables sliding window, capping to sliding window size. | -| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | Use BlockSpaceMangerV2. | -| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | Experimental scheduling config necessary for speculative decoding. | -| `SEED` | `0` | `int` | Random seed for operations. | -| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, ignore GPU profiling result and use this number of GPU blocks. | -| `MAX_NUM_BATCHED_TOKENS` | None | `int` | Maximum number of batched tokens per iteration. | -| `MAX_NUM_SEQS` | `256` | `int` | Maximum number of sequences per iteration. | -| `MAX_LOGPROBS` | `20` | `int` | Max number of log probs to return when logprobs is specified in SamplingParams. | -| `DISABLE_LOG_STATS` | `False` | `bool` | Disable logging statistics. | -| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | Method used to quantize the weights. | -| `ROPE_SCALING` | None | `dict` | RoPE scaling configuration in JSON format. | -| `ROPE_THETA` | None | `float` | RoPE theta. Use with `ROPE_SCALING`. | -| `TOKENIZER_POOL_SIZE` | `0` | `int` | Size of tokenizer pool to use for asynchronous tokenization. | -| `TOKENIZER_POOL_TYPE` | `ray` | `str` | Type of tokenizer pool to use for asynchronous tokenization. | -| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra config for tokenizer pool. | +| `HF_TOKEN` | - | `str` | Your Hugging Face token, used for private and gated models. | +| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | The data type for model weights and activations. | +| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | The data type for KV cache storage. | +| `QUANTIZATION_PARAM_PATH` | None | `str` | The path to the JSON file containing the KV cache scaling factors. | +| `MAX_MODEL_LEN` | None | `int` | The maximum model context length. | +| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | The default engine for guided decoding. | +| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | The backend to use for distributed serving. | +| `WORKER_USE_RAY` | `False` | `bool` | Deprecated. Use `DISTRIBUTED_EXECUTOR_BACKEND=ray` instead. | +| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | The number of pipeline stages. 
| +| `TENSOR_PARALLEL_SIZE` | `1` | `int` | The number of tensor parallel replicas. | +| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | The number of workers to use for parallel model loading. | +| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If `True`, uses nsight to profile Ray workers. | +| `ENABLE_PREFIX_CACHING` | `False` | `bool` | If `True`, enables automatic prefix caching. | +| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | If `True`, disables the sliding window, capping to the sliding window size. | +| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | If `True`, uses the BlockSpaceMangerV2. | +| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. | +| `SEED` | `0` | `int` | The random seed for operations. | +| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, this value overrides the GPU profiling result for the number of GPU blocks. | +| `MAX_NUM_BATCHED_TOKENS` | None | `int` | The maximum number of batched tokens per iteration. | +| `MAX_NUM_SEQS` | `256` | `int` | The maximum number of sequences per iteration. | +| `MAX_LOGPROBS` | `20` | `int` | The maximum number of log probabilities to return when `logprobs` is specified in `SamplingParams`. | +| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables logging statistics. | +| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | The method used to quantize the model weights. | +| `ROPE_SCALING` | None | `dict` | The RoPE scaling configuration in JSON format. | +| `ROPE_THETA` | None | `float` | The RoPE theta value. Use with `ROPE_SCALING`. | +| `TOKENIZER_POOL_SIZE` | `0` | `int` | The size of the tokenizer pool for asynchronous tokenization. | +| `TOKENIZER_POOL_TYPE` | `ray` | `str` | The type of the tokenizer pool for asynchronous tokenization. | +| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra configuration for the tokenizer pool. | ## LoRA settings @@ -56,93 +56,95 @@ Configure LoRA (Low-Rank Adaptation) adapters for your model. | Variable | Default | Type | Description | | --- | --- | --- | --- | -| `ENABLE_LORA` | `False` | `bool` | If True, enable handling of LoRA adapters. | -| `MAX_LORAS` | `1` | `int` | Max number of LoRAs in a single batch. | -| `MAX_LORA_RANK` | `16` | `int` | Max LoRA rank. | -| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | Maximum size of extra vocabulary for LoRA adapters. | -| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | Data type for LoRA. | -| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specify multiple scaling factors for LoRA adapters. | -| `MAX_CPU_LORAS` | None | `int` | Maximum number of LoRAs to store in CPU memory. | -| `FULLY_SHARDED_LORAS` | `False` | `bool` | Enable fully sharded LoRA layers. | -| `LORA_MODULES` | `[]` | `list[dict]` | Add lora adapters from Hugging Face `[{"name": "xx", "path": "xxx/xxxx", "base_model_name": "xxx/xxxx"}]` | +| `ENABLE_LORA` | `False` | `bool` | If `True`, enables the handling of LoRA adapters. | +| `MAX_LORAS` | `1` | `int` | The maximum number of LoRAs in a single batch. | +| `MAX_LORA_RANK` | `16` | `int` | The maximum LoRA rank. | +| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | The maximum size of the extra vocabulary for LoRA adapters. | +| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | The data type for LoRA. | +| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specifies multiple scaling factors for LoRA adapters. 
| +| `MAX_CPU_LORAS` | None | `int` | The maximum number of LoRAs to store in CPU memory. | +| `FULLY_SHARDED_LORAS` | `False` | `bool` | If `True`, enables fully sharded LoRA layers. | +| `LORA_MODULES` | `[]` | `list[dict]` | A list of LoRA adapters to add from Hugging Face. Example: `[{"name": "adapter1", "path": "user/adapter1"}]` | ## Speculative decoding settings Configure speculative decoding to improve inference performance. -| Variable | Default | Type/Choices | Description | +| Variable | Default | Type(s) | Description | | --- | --- | --- | --- | -| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Apply a delay before scheduling next prompt. | -| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | Enable chunked prefill requests. | -| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model to be used in speculative decoding. | +| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Applies a delay before scheduling the next prompt. | +| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | If `True`, enables chunked prefill requests. | +| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model for speculative decoding. | | `NUM_SPECULATIVE_TOKENS` | None | `int` | The number of speculative tokens to sample from the draft model. | -| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | Number of tensor parallel replicas for the draft model. | +| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | The number of tensor parallel replicas for the draft model. | | `SPECULATIVE_MAX_MODEL_LEN` | None | `int` | The maximum sequence length supported by the draft model. | -| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disable speculative decoding if the number of enqueue requests is larger than this value. | -| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | Max size of window for ngram prompt lookup in speculative decoding. | -| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | Min size of window for ngram prompt lookup in speculative decoding. | -| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | Specify the acceptance method for draft token verification in speculative decoding. | -| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Set the lower bound threshold for the posterior probability of a token to be accepted. | +| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disables speculative decoding if the number of enqueued requests is larger than this value. | +| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | The maximum window size for ngram prompt lookup in speculative decoding. | +| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | The minimum window size for ngram prompt lookup in speculative decoding. | +| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | The acceptance method for draft token verification in speculative decoding. | +| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Sets the lower bound threshold for the posterior probability of a token to be accepted. | | `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA` | None | `float` | A scaling factor for the entropy-based threshold for token acceptance. | ## System performance settings Configure GPU memory and system resource utilization. -| Variable | Default | Type/Choices | Description | +| Variable | Default | Type(s) | Description | | --- | --- | --- | --- | -| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | Sets GPU VRAM utilization. 
| -| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models. | -| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | Token block size for contiguous chunks of tokens. | -| `SWAP_SPACE` | `4` | `int` | CPU swap space size (GiB) per GPU. | -| `ENFORCE_EAGER` | `False` | `bool` | Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility. | -| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. | -| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | Enables or disables custom all reduce. | +| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | The GPU VRAM utilization. | +| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. | +| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | The token block size for contiguous chunks of tokens. | +| `SWAP_SPACE` | `4` | `int` | The CPU swap space size (in GiB) per GPU. | +| `ENFORCE_EAGER` | `False` | `bool` | If `True`, always uses eager-mode PyTorch. If `False`, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. | +| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. | +| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | If `0`, enables custom all-reduce. If `1`, disables it. | ## Tokenizer settings Customize tokenizer behavior and chat templates. -| Variable | Default | Type/Choices | Description | +| Variable | Default | Type(s) | Description | | --- | --- | --- | --- | -| `TOKENIZER_NAME` | None | `str` | Tokenizer repository to use a different tokenizer than the model's default. | -| `TOKENIZER_REVISION` | None | `str` | Tokenizer revision to load. | -| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line jinja template | Custom chat jinja template. [More info](https://huggingface.co/docs/transformers/chat_templating) | +| `TOKENIZER_NAME` | None | `str` | The tokenizer repository to use a different tokenizer than the model's default. | +| `TOKENIZER_REVISION` | None | `str` | The tokenizer revision to load. | +| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line jinja template | A custom chat Jinja template. See the [Hugging Face documentation](https://huggingface.co/docs/transformers/chat_templating) for more information. | ## Streaming and batch settings Control how tokens are streamed back in HTTP responses. -The way this works is that the first request will have a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request will have a batch size of `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`. This will continue until the batch size reaches `DEFAULT_BATCH_SIZE`. For example, with the default values, the batch sizes will be `1, 3, 9, 27, 50, 50, 50, ...`. You can also specify this per request, with inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`. This has nothing to do with vLLM's internal batching, but rather the number of tokens sent in each HTTP request from the worker. +These settings control how tokens are batched in HTTP responses when streaming. 
The batch size starts at `DEFAULT_MIN_BATCH_SIZE` and increases by a factor of `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` with each request until it reaches `DEFAULT_BATCH_SIZE`. -| Variable | Default | Type/Choices | Description | +For example, with default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM's internal batching. + +| Variable | Default | Type(s) | Description | | --- | --- | --- | --- | -| `DEFAULT_BATCH_SIZE` | `50` | `int` | Default and maximum batch size for token streaming to reduce HTTP calls. | -| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. | -| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | Growth factor for dynamic batch size. | +| `DEFAULT_BATCH_SIZE` | `50` | `int` | The default and maximum batch size for token streaming. | +| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | The initial batch size for the first request. | +| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | The growth factor for the dynamic batch size. | ## OpenAI compatibility settings Configure OpenAI API compatibility features. -| Variable | Default | Type/Choices | Description | +| Variable | Default | Type(s) | Description | | --- | --- | --- | --- | -| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | Enables raw OpenAI SSE format string output when streaming. **Required** to be enabled (which it is by default) for OpenAI compatibility. | -| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the name of the served model from model repo/path to specified name, which you will then be able to use the value for the `model` parameter when making OpenAI requests. | -| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | Role of the LLM's response in OpenAI chat completions. | -| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | Enables automatic tool selection for supported models. Set to `true` to activate. | -| `TOOL_CALL_PARSER` | None | `str` | Specifies the parser for tool calls. Options: `mistral`, `hermes`, `llama3_json`, `llama4_json`, `llama4_pythonic`, `granite`, `granite-20b-fc`, `deepseek_v3`, `internlm`, `jamba`, `phi4_mini_json`, `pythonic` | -| `REASONING_PARSER` | None | `str` | Parser for reasoning-capable models (enables reasoning mode). Examples: `deepseek_r1`, `qwen3`, `granite`, `hunyuan_a13b`. Leave unset to disable. | +| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | If `1`, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. | +| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the served model name. This allows you to use a custom name in the `model` parameter of your OpenAI requests. | +| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | The role of the LLM's response in OpenAI chat completions. | +| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | If `true`, enables automatic tool selection for supported models. | +| `TOOL_CALL_PARSER` | None | `str` | The parser for tool calls. | +| `REASONING_PARSER` | None | `str` | The parser for reasoning-capable models. Setting this enables reasoning mode. | ## Serverless and concurrency settings Configure concurrency and logging for Serverless deployments. -| Variable | Default | Type/Choices | Description | +| Variable | Default | Type(s) | Description | | --- | --- | --- | --- | -| `MAX_CONCURRENCY` | `300` | `int` | Max concurrent requests per worker. 
vLLM has an internal queue, so you don't have to worry about limiting by VRAM, this is for improving scaling/load balancing efficiency. | -| `DISABLE_LOG_STATS` | `False` | `bool` | Enables or disables vLLM stats logging. | -| `DISABLE_LOG_REQUESTS` | `False` | `bool` | Enables or disables vLLM request logging. | +| `MAX_CONCURRENCY` | `300` | `int` | The maximum number of concurrent requests per worker. vLLM's internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. | +| `DISABLE_LOG_STATS` | `False` | `bool` | If `False`, enables vLLM stats logging. | +| `DISABLE_LOG_REQUESTS` | `False` | `bool` | If `False`, enables vLLM request logging. | ## Advanced settings @@ -150,12 +152,12 @@ Additional configuration options for specialized use cases. | Variable | Default | Type | Description | | --- | --- | --- | --- | -| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra config for model loader. | -| `PREEMPTION_MODE` | None | `str` | If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations into the CPU memory as preemption happens. | -| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | How frequently the engine checks if a preemption happens. | -| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory used for the saved activations. | -| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | Disable logging requests. | -| `MAX_LOG_LEN` | None | `int` | Max number of prompt characters or prompt ID numbers being printed in log. | +| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra configuration for the model loader. | +| `PREEMPTION_MODE` | None | `str` | The preemption mode. If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations to CPU memory during preemption. | +| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | The frequency (in seconds) at which the engine checks for preemption. | +| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory to use for saved activations. | +| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | If `True`, disables logging requests. | +| `MAX_LOG_LEN` | None | `int` | The maximum number of prompt characters or prompt ID numbers to print in the log. | ## Docker build arguments @@ -163,16 +165,5 @@ These variables are used when building custom Docker images with models baked in | Variable | Default | Type | Description | | --- | --- | --- | --- | -| `BASE_PATH` | `/runpod-volume` | `str` | Storage directory for huggingface cache and model. | -| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | CUDA version for the worker image. | - -## Deprecated variables - - -The following variables are deprecated and will be removed in future versions. - - -| Old Variable | New Variable | Note | -| --- | --- | --- | -| `MAX_CONTEXT_LEN_TO_CAPTURE` | `MAX_SEQ_LEN_TO_CAPTURE` | Use new variable name. | -| `kv_cache_dtype=fp8_e5m2` | `kv_cache_dtype=fp8` | Simplified fp8 format. | +| `BASE_PATH` | `/runpod-volume` | `str` | The storage directory for the Hugging Face cache and model. | +| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | The CUDA version for the worker image. |
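
For example, a model can be baked into a custom worker image by passing these values with `--build-arg`. The command below is only a sketch: the image tag and model are placeholders, passing `MODEL_NAME` as a build argument is an assumption based on the worker-vllm README, and you should check the Dockerfile in the [worker-vllm repository](https://github.com/runpod-workers/worker-vllm) for the arguments it actually accepts.

```bash
# Sketch only: build a worker image with the model weights baked in.
# Verify the accepted build arguments against the worker-vllm Dockerfile.
docker build -t yourusername/vllm-worker:dev \
  --build-arg MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.2" \
  --build-arg BASE_PATH="/models" \
  --build-arg WORKER_CUDA_VERSION="12.1.0" \
  .
```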