3 changes: 2 additions & 1 deletion docs.json
@@ -73,7 +73,8 @@
"serverless/vllm/overview",
"serverless/vllm/get-started",
"serverless/vllm/vllm-requests",
"serverless/vllm/openai-compatibility"
"serverless/vllm/openai-compatibility",
"serverless/vllm/environment-variables"
]
},
{
169 changes: 169 additions & 0 deletions serverless/vllm/environment-variables.mdx
@@ -0,0 +1,169 @@
---
title: "vLLM environment variables"
sidebarTitle: "Environment variables"
---

Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options.

To set environment variables, go to your [endpoint settings](/serverless/endpoints/manage-endpoints#edit-an-endpoint) and add them under **Environment Variables**.
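
For example, a typical configuration for serving a gated instruction-tuned model might set values like the ones below. These are placeholder values for illustration; in practice you enter each pair in the endpoint's **Environment Variables** section.

```python
# Placeholder values for illustration only. Enter each key/value pair in the
# endpoint's Environment Variables settings; this dict is not meant to be run.
environment_variables = {
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",  # example gated model
    "HF_TOKEN": "hf_xxxxxxxxxxxxxxxx",                 # placeholder token
    "MAX_MODEL_LEN": "8192",
    "DTYPE": "bfloat16",
    "GPU_MEMORY_UTILIZATION": "0.95",
}
```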

## LLM settings

These variables control the core language model configuration.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `MODEL_NAME` | `facebook/opt-125m` | `str` | The name or path of the Hugging Face model to use. |
| `MODEL_REVISION` | `main` | `str` | The model revision to load. |
| `TOKENIZER` | None | `str` | The name or path of the Hugging Face tokenizer to use. |
| `SKIP_TOKENIZER_INIT` | `False` | `bool` | If `True`, skips the initialization of the tokenizer and detokenizer. |
| `TOKENIZER_MODE` | `auto` | `auto`, `slow` | The tokenizer mode. `auto` uses the fast tokenizer when available, while `slow` always uses the slow tokenizer. |
| `TRUST_REMOTE_CODE` | `False` | `bool` | If `True`, trusts remote code from Hugging Face. |
| `DOWNLOAD_DIR` | None | `str` | The directory to download and load the model weights from. |
| `LOAD_FORMAT` | `auto` | `str` | The format of the model weights to load. |
| `HF_TOKEN` | - | `str` | Your Hugging Face token, used for private and gated models. |
| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | The data type for model weights and activations. |
| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | The data type for KV cache storage. |
| `QUANTIZATION_PARAM_PATH` | None | `str` | The path to the JSON file containing the KV cache scaling factors. |
| `MAX_MODEL_LEN` | None | `int` | The maximum model context length. |
| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | The default engine for guided decoding. |
| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | The backend to use for distributed serving. |
| `WORKER_USE_RAY` | `False` | `bool` | Deprecated. Use `DISTRIBUTED_EXECUTOR_BACKEND=ray` instead. |
| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | The number of pipeline stages. |
| `TENSOR_PARALLEL_SIZE` | `1` | `int` | The number of tensor parallel replicas. |
| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | The number of workers to use for parallel model loading. |
| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If `True`, uses nsight to profile Ray workers. |
| `ENABLE_PREFIX_CACHING` | `False` | `bool` | If `True`, enables automatic prefix caching. |
| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | If `True`, disables the sliding window, capping the context to the sliding window size. |
| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | If `True`, uses BlockSpaceManagerV2. |
| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. |
| `SEED` | `0` | `int` | The random seed for operations. |
| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, this value overrides the GPU profiling result for the number of GPU blocks. |
| `MAX_NUM_BATCHED_TOKENS` | None | `int` | The maximum number of batched tokens per iteration. |
| `MAX_NUM_SEQS` | `256` | `int` | The maximum number of sequences per iteration. |
| `MAX_LOGPROBS` | `20` | `int` | The maximum number of log probabilities to return when `logprobs` is specified in `SamplingParams`. |
| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables logging statistics. |
| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | The method used to quantize the model weights. |
| `ROPE_SCALING` | None | `dict` | The RoPE scaling configuration in JSON format. |
| `ROPE_THETA` | None | `float` | The RoPE theta value. Use with `ROPE_SCALING`. |
| `TOKENIZER_POOL_SIZE` | `0` | `int` | The size of the tokenizer pool for asynchronous tokenization. |
| `TOKENIZER_POOL_TYPE` | `ray` | `str` | The type of the tokenizer pool for asynchronous tokenization. |
| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra configuration for the tokenizer pool. |
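
All values are supplied to the worker as strings. The sketch below (an illustration of the expected formats, not the worker's actual parsing code) shows how boolean, integer, and dict-typed variables such as `ROPE_SCALING` are typically expressed and parsed; the example `ROPE_SCALING` value is hypothetical.

```python
import json
import os

# Illustration only: environment variables arrive as strings and are parsed
# into the types listed in the table above.
trust_remote_code = os.environ.get("TRUST_REMOTE_CODE", "False").lower() in ("1", "true")
max_model_len = int(os.environ["MAX_MODEL_LEN"]) if os.environ.get("MAX_MODEL_LEN") else None

# Dict-typed variables are provided as JSON strings, for example:
#   ROPE_SCALING='{"type": "dynamic", "factor": 2.0}'  (hypothetical value)
rope_scaling = json.loads(os.environ["ROPE_SCALING"]) if os.environ.get("ROPE_SCALING") else None

print(trust_remote_code, max_model_len, rope_scaling)
```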

## LoRA settings

Configure LoRA (Low-Rank Adaptation) adapters for your model.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| `ENABLE_LORA` | `False` | `bool` | If `True`, enables the handling of LoRA adapters. |
| `MAX_LORAS` | `1` | `int` | The maximum number of LoRAs in a single batch. |
| `MAX_LORA_RANK` | `16` | `int` | The maximum LoRA rank. |
| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | The maximum size of the extra vocabulary for LoRA adapters. |
| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | The data type for LoRA. |
| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specifies multiple scaling factors for LoRA adapters. |
| `MAX_CPU_LORAS` | None | `int` | The maximum number of LoRAs to store in CPU memory. |
| `FULLY_SHARDED_LORAS` | `False` | `bool` | If `True`, enables fully sharded LoRA layers. |
| `LORA_MODULES` | `[]` | `list[dict]` | A list of LoRA adapters to add from Hugging Face. Example: `[{"name": "adapter1", "path": "user/adapter1"}]` |
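
For example, to serve a LoRA adapter alongside the base model, set `ENABLE_LORA` to `True` and point `LORA_MODULES` at a JSON list like the one below. The adapter name and repository are placeholders; the sketch simply shows how to produce the JSON string to paste into the endpoint settings.

```python
import json

# Hypothetical adapter for illustration; "user/adapter1" is a placeholder
# Hugging Face repository.
lora_modules = [{"name": "adapter1", "path": "user/adapter1"}]

# The string value to enter for LORA_MODULES in the endpoint settings:
print(json.dumps(lora_modules))
# [{"name": "adapter1", "path": "user/adapter1"}]
```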

## Speculative decoding settings

Configure speculative decoding to improve inference performance.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Applies a delay (this factor multiplied by the previous prompt's latency) before scheduling the next prompt. |
| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | If `True`, enables chunked prefill requests. |
| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model for speculative decoding. |
| `NUM_SPECULATIVE_TOKENS` | None | `int` | The number of speculative tokens to sample from the draft model. |
| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | The number of tensor parallel replicas for the draft model. |
| `SPECULATIVE_MAX_MODEL_LEN` | None | `int` | The maximum sequence length supported by the draft model. |
| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disables speculative decoding if the number of enqueued requests is larger than this value. |
| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | The maximum window size for ngram prompt lookup in speculative decoding. |
| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | The minimum window size for ngram prompt lookup in speculative decoding. |
| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | The acceptance method for draft token verification in speculative decoding. |
| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA` | None | `float` | A scaling factor for the entropy-based threshold for token acceptance. |
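
As a rough example, draft-model speculative decoding might be configured with values like the following. The draft model name is a placeholder, and whether a given base/draft pairing works depends on the models and the vLLM version in the worker image.

```python
# Placeholder values for illustration only; enter each pair in the endpoint's
# Environment Variables settings rather than running this as code.
speculative_decoding_settings = {
    "SPECULATIVE_MODEL": "example-org/tiny-draft-model",  # hypothetical draft model
    "NUM_SPECULATIVE_TOKENS": "5",
    "SPEC_DECODING_ACCEPTANCE_METHOD": "rejection_sampler",
}
```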

## System performance settings

Configure GPU memory and system resource utilization.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | The fraction of GPU VRAM to use, between 0 and 1. |
| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | The token block size for contiguous chunks of tokens. |
| `SWAP_SPACE` | `4` | `int` | The CPU swap space size (in GiB) per GPU. |
| `ENFORCE_EAGER` | `False` | `bool` | If `True`, always uses eager-mode PyTorch. If `False`, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. |
| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | If `0`, enables custom all-reduce. If `1`, disables it. |

## Tokenizer settings

Customize tokenizer behavior and chat templates.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `TOKENIZER_NAME` | None | `str` | The tokenizer repository to use when you want a tokenizer other than the model's default. |
| `TOKENIZER_REVISION` | None | `str` | The tokenizer revision to load. |
| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line jinja template | A custom chat Jinja template. See the [Hugging Face documentation](https://huggingface.co/docs/transformers/chat_templating) for more information. |
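
A `CUSTOM_CHAT_TEMPLATE` value is a single-line Jinja template. The minimal template below is purely illustrative (real templates are usually model-specific) and is shown as a Python string only so the Jinja syntax stays readable; the concatenated result is the value you would set.

```python
# Illustrative single-line Jinja chat template; real templates are usually
# model-specific (see the Hugging Face chat templating docs linked above).
CUSTOM_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }} "
    "{% endfor %}"
    "assistant:"
)
print(CUSTOM_CHAT_TEMPLATE)
```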

## Streaming and batch settings

Control how tokens are streamed back in HTTP responses.

These settings control how tokens are batched in HTTP responses when streaming. The batch size starts at `DEFAULT_MIN_BATCH_SIZE` and increases by a factor of `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` with each request until it reaches `DEFAULT_BATCH_SIZE`.

For example, with default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM's internal batching.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `DEFAULT_BATCH_SIZE` | `50` | `int` | The default and maximum batch size for token streaming. |
| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | The initial batch size for the first request. |
| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | The growth factor for the dynamic batch size. |
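
The schedule described above can be reproduced in a few lines (a worked illustration of the default values, not worker code):

```python
# Reproduce the default streaming batch-size schedule: 1, 3, 9, 27, 50, 50, ...
DEFAULT_MIN_BATCH_SIZE = 1
DEFAULT_BATCH_SIZE_GROWTH_FACTOR = 3
DEFAULT_BATCH_SIZE = 50

batch_size = DEFAULT_MIN_BATCH_SIZE
schedule = []
for _ in range(6):
    schedule.append(min(int(batch_size), DEFAULT_BATCH_SIZE))
    batch_size *= DEFAULT_BATCH_SIZE_GROWTH_FACTOR

print(schedule)  # [1, 3, 9, 27, 50, 50]
```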

## OpenAI compatibility settings

Configure OpenAI API compatibility features.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | If `1`, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the served model name. This allows you to use a custom name in the `model` parameter of your OpenAI requests. |
| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | The role of the LLM's response in OpenAI chat completions. |
| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | If `true`, enables automatic tool selection for supported models. |
| `TOOL_CALL_PARSER` | None | `str` | The parser for tool calls. |
| `REASONING_PARSER` | None | `str` | The parser for reasoning-capable models. Setting this enables reasoning mode. |
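
For example, if `OPENAI_SERVED_MODEL_NAME_OVERRIDE` is set to `my-model`, a request with the OpenAI Python client might look like the sketch below. The endpoint ID and API key are placeholders, and the base URL assumes Runpod's standard OpenAI-compatible route (see [OpenAI compatibility](/serverless/vllm/openai-compatibility)).

```python
from openai import OpenAI

# Placeholders: substitute your endpoint ID and Runpod API key.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

response = client.chat.completions.create(
    model="my-model",  # matches OPENAI_SERVED_MODEL_NAME_OVERRIDE
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```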

## Serverless and concurrency settings

Configure concurrency and logging for Serverless deployments.

| Variable | Default | Type(s) | Description |
| --- | --- | --- | --- |
| `MAX_CONCURRENCY` | `300` | `int` | The maximum number of concurrent requests per worker. vLLM's internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. |
| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables vLLM stats logging. |
| `DISABLE_LOG_REQUESTS` | `False` | `bool` | If `True`, disables vLLM request logging. |

## Advanced settings

Additional configuration options for specialized use cases.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra configuration for the model loader. |
| `PREEMPTION_MODE` | None | `str` | The preemption mode. If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations to CPU memory during preemption. |
| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | The frequency (in seconds) at which the engine checks for preemption. |
| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory to use for saved activations. |
| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | If `True`, disables logging requests. |
| `MAX_LOG_LEN` | None | `int` | The maximum number of prompt characters or prompt ID numbers to print in the log. |

## Docker build arguments

These variables are used when building custom Docker images with models baked in.

| Variable | Default | Type | Description |
| --- | --- | --- | --- |
| `BASE_PATH` | `/runpod-volume` | `str` | The storage directory for the Hugging Face cache and model. |
| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | The CUDA version for the worker image. |
2 changes: 1 addition & 1 deletion serverless/vllm/get-started.mdx
@@ -145,7 +145,7 @@ To add or modify environment variables:
4. Add/edit your desired variables.
5. Click **Save Endpoint**.

You can find a full list of available environment variables in the [vLLM worker GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#environment-variablessettings).
For a complete list of available environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).

You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests#request-input-parameters).

2 changes: 1 addition & 1 deletion serverless/vllm/openai-compatibility.mdx
@@ -259,7 +259,7 @@ Use these environment variables to customize the OpenAI compatibility:
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses |
| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions |

You can find a complete list of vLLM environment variables on the [GitHub README](https://github.com/runpod-workers/worker-vllm#environment-variables).
For a complete list of all vLLM environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).

## Client libraries

4 changes: 2 additions & 2 deletions serverless/vllm/overview.mdx
@@ -14,7 +14,7 @@ vLLM workers offer several advantages that make them ideal for LLM deployment:
* **Pre-built optimization**: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
* **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key.
* **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
* **Configurable environments**: Extensive customization options through [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
* **Configurable environments**: Extensive customization options through [environment variables](/serverless/vllm/environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
* **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.

## Deployment options
@@ -33,7 +33,7 @@ Quick-deployed workers will download models during initialization, which can tak

### Option 2: Deploy using a Docker image

Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variablessettings).
Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](/serverless/vllm/environment-variables).

Follow the instructions in the [vLLM worker README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside) to build a model into your worker image.
