diff --git a/docs.json b/docs.json
index 1ae27b65..c24f8737 100644
--- a/docs.json
+++ b/docs.json
@@ -73,7 +73,8 @@
             "serverless/vllm/overview",
             "serverless/vllm/get-started",
             "serverless/vllm/vllm-requests",
-            "serverless/vllm/openai-compatibility"
+            "serverless/vllm/openai-compatibility",
+            "serverless/vllm/environment-variables"
           ]
         },
         {
diff --git a/serverless/vllm/environment-variables.mdx b/serverless/vllm/environment-variables.mdx
new file mode 100644
index 00000000..8b513d4b
--- /dev/null
+++ b/serverless/vllm/environment-variables.mdx
@@ -0,0 +1,169 @@
+---
+title: "vLLM environment variables"
+sidebarTitle: "Environment variables"
+---
+
+Environment variables let you configure your vLLM worker without rebuilding your Docker image. You can customize model behavior, performance settings, and other deployment options.
+
+To set environment variables, go to your [endpoint settings](/serverless/endpoints/manage-endpoints#edit-an-endpoint) and add them under **Environment Variables**.
+
+## LLM settings
+
+These variables control the core language model configuration.
+
+| Variable | Default | Type(s) | Description |
+| --- | --- | --- | --- |
+| `MODEL_NAME` | `facebook/opt-125m` | `str` | The name or path of the Hugging Face model to use. |
+| `MODEL_REVISION` | `main` | `str` | The model revision to load. |
+| `TOKENIZER` | None | `str` | The name or path of the Hugging Face tokenizer to use. |
+| `SKIP_TOKENIZER_INIT` | `False` | `bool` | If `True`, skips the initialization of the tokenizer and detokenizer. |
+| `TOKENIZER_MODE` | `auto` | `auto`, `slow` | The tokenizer mode. |
+| `TRUST_REMOTE_CODE` | `False` | `bool` | If `True`, trusts remote code from Hugging Face. |
+| `DOWNLOAD_DIR` | None | `str` | The directory to download and load the model weights from. |
+| `LOAD_FORMAT` | `auto` | `str` | The format of the model weights to load. |
+| `HF_TOKEN` | - | `str` | Your Hugging Face token, used for private and gated models. |
+| `DTYPE` | `auto` | `auto`, `half`, `float16`, `bfloat16`, `float`, `float32` | The data type for model weights and activations. |
+| `KV_CACHE_DTYPE` | `auto` | `auto`, `fp8` | The data type for KV cache storage. |
+| `QUANTIZATION_PARAM_PATH` | None | `str` | The path to the JSON file containing the KV cache scaling factors. |
+| `MAX_MODEL_LEN` | None | `int` | The maximum model context length. |
+| `GUIDED_DECODING_BACKEND` | `outlines` | `outlines`, `lm-format-enforcer` | The default engine for guided decoding. |
+| `DISTRIBUTED_EXECUTOR_BACKEND` | None | `ray`, `mp` | The backend to use for distributed serving. |
+| `WORKER_USE_RAY` | `False` | `bool` | Deprecated. Use `DISTRIBUTED_EXECUTOR_BACKEND=ray` instead. |
+| `PIPELINE_PARALLEL_SIZE` | `1` | `int` | The number of pipeline stages. |
+| `TENSOR_PARALLEL_SIZE` | `1` | `int` | The number of tensor parallel replicas. |
+| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | The number of workers to use for parallel model loading. |
+| `RAY_WORKERS_USE_NSIGHT` | `False` | `bool` | If `True`, uses nsight to profile Ray workers. |
+| `ENABLE_PREFIX_CACHING` | `False` | `bool` | If `True`, enables automatic prefix caching. |
+| `DISABLE_SLIDING_WINDOW` | `False` | `bool` | If `True`, disables the sliding window, capping to the sliding window size. |
+| `USE_V2_BLOCK_MANAGER` | `False` | `bool` | If `True`, uses the BlockSpaceManagerV2. |
+| `NUM_LOOKAHEAD_SLOTS` | `0` | `int` | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. |
+| `SEED` | `0` | `int` | The random seed for operations. |
+| `NUM_GPU_BLOCKS_OVERRIDE` | None | `int` | If specified, this value overrides the GPU profiling result for the number of GPU blocks. |
+| `MAX_NUM_BATCHED_TOKENS` | None | `int` | The maximum number of batched tokens per iteration. |
+| `MAX_NUM_SEQS` | `256` | `int` | The maximum number of sequences per iteration. |
+| `MAX_LOGPROBS` | `20` | `int` | The maximum number of log probabilities to return when `logprobs` is specified in `SamplingParams`. |
+| `DISABLE_LOG_STATS` | `False` | `bool` | If `True`, disables logging statistics. |
+| `QUANTIZATION` | None | `awq`, `squeezellm`, `gptq`, `bitsandbytes` | The method used to quantize the model weights. |
+| `ROPE_SCALING` | None | `dict` | The RoPE scaling configuration in JSON format. |
+| `ROPE_THETA` | None | `float` | The RoPE theta value. Use with `ROPE_SCALING`. |
+| `TOKENIZER_POOL_SIZE` | `0` | `int` | The size of the tokenizer pool for asynchronous tokenization. |
+| `TOKENIZER_POOL_TYPE` | `ray` | `str` | The type of the tokenizer pool for asynchronous tokenization. |
+| `TOKENIZER_POOL_EXTRA_CONFIG` | None | `dict` | Extra configuration for the tokenizer pool. |
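+
+For example, a typical configuration for a gated Hugging Face model with a capped context length might look like the following in your endpoint's environment variables (the model name and values here are only an illustration):
+
+```txt
+MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct
+MAX_MODEL_LEN=8192
+DTYPE=bfloat16
+HF_TOKEN=<your Hugging Face access token>
+```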
+
+## LoRA settings
+
+Configure LoRA (Low-Rank Adaptation) adapters for your model.
+
+| Variable | Default | Type | Description |
+| --- | --- | --- | --- |
+| `ENABLE_LORA` | `False` | `bool` | If `True`, enables the handling of LoRA adapters. |
+| `MAX_LORAS` | `1` | `int` | The maximum number of LoRAs in a single batch. |
+| `MAX_LORA_RANK` | `16` | `int` | The maximum LoRA rank. |
+| `LORA_EXTRA_VOCAB_SIZE` | `256` | `int` | The maximum size of the extra vocabulary for LoRA adapters. |
+| `LORA_DTYPE` | `auto` | `auto`, `float16`, `bfloat16`, `float32` | The data type for LoRA. |
+| `LONG_LORA_SCALING_FACTORS` | None | `tuple` | Specifies multiple scaling factors for LoRA adapters. |
+| `MAX_CPU_LORAS` | None | `int` | The maximum number of LoRAs to store in CPU memory. |
+| `FULLY_SHARDED_LORAS` | `False` | `bool` | If `True`, enables fully sharded LoRA layers. |
+| `LORA_MODULES` | `[]` | `list[dict]` | A list of LoRA adapters to add from Hugging Face. Example: `[{"name": "adapter1", "path": "user/adapter1"}]` |
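+
+For example, to load a single LoRA adapter from Hugging Face, you could set `ENABLE_LORA` to `True` and give `LORA_MODULES` a JSON value like the one below (the adapter name and repository path are placeholders):
+
+```json
+[{"name": "adapter1", "path": "user/adapter1"}]
+```
+
+With vLLM's OpenAI-compatible server, a loaded adapter is typically selected by passing its `name` in the `model` field of a request.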
+
+## Speculative decoding settings
+
+Configure speculative decoding to improve inference performance.
+
+| Variable | Default | Type(s) | Description |
+| --- | --- | --- | --- |
+| `SCHEDULER_DELAY_FACTOR` | `0.0` | `float` | Applies a delay before scheduling the next prompt. |
+| `ENABLE_CHUNKED_PREFILL` | `False` | `bool` | If `True`, enables chunked prefill requests. |
+| `SPECULATIVE_MODEL` | None | `str` | The name of the draft model for speculative decoding. |
+| `NUM_SPECULATIVE_TOKENS` | None | `int` | The number of speculative tokens to sample from the draft model. |
+| `SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE` | None | `int` | The number of tensor parallel replicas for the draft model. |
+| `SPECULATIVE_MAX_MODEL_LEN` | None | `int` | The maximum sequence length supported by the draft model. |
+| `SPECULATIVE_DISABLE_BY_BATCH_SIZE` | None | `int` | Disables speculative decoding if the number of enqueued requests is larger than this value. |
+| `NGRAM_PROMPT_LOOKUP_MAX` | None | `int` | The maximum window size for ngram prompt lookup in speculative decoding. |
+| `NGRAM_PROMPT_LOOKUP_MIN` | None | `int` | The minimum window size for ngram prompt lookup in speculative decoding. |
+| `SPEC_DECODING_ACCEPTANCE_METHOD` | `rejection_sampler` | `rejection_sampler`, `typical_acceptance_sampler` | The acceptance method for draft token verification in speculative decoding. |
+| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD` | None | `float` | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
+| `TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA` | None | `float` | A scaling factor for the entropy-based threshold for token acceptance. |
+
+## System performance settings
+
+Configure GPU memory and system resource utilization.
+
+| Variable | Default | Type(s) | Description |
+| --- | --- | --- | --- |
+| `GPU_MEMORY_UTILIZATION` | `0.95` | `float` | The GPU VRAM utilization. |
+| `MAX_PARALLEL_LOADING_WORKERS` | None | `int` | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
+| `BLOCK_SIZE` | `16` | `8`, `16`, `32` | The token block size for contiguous chunks of tokens. |
+| `SWAP_SPACE` | `4` | `int` | The CPU swap space size (in GiB) per GPU. |
+| `ENFORCE_EAGER` | `False` | `bool` | If `True`, always uses eager-mode PyTorch. If `False`, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
+| `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. |
+| `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | If `0`, enables custom all-reduce. If `1`, disables it. |
+
+## Tokenizer settings
+
+Customize tokenizer behavior and chat templates.
+
+| Variable | Default | Type(s) | Description |
+| --- | --- | --- | --- |
+| `TOKENIZER_NAME` | None | `str` | The tokenizer repository to use when you want a different tokenizer than the model's default. |
+| `TOKENIZER_REVISION` | None | `str` | The tokenizer revision to load. |
+| `CUSTOM_CHAT_TEMPLATE` | None | `str` of single-line Jinja template | A custom chat Jinja template. See the [Hugging Face documentation](https://huggingface.co/docs/transformers/chat_templating) for more information. |
+
+## Streaming and batch settings
+
+Control how tokens are streamed back in HTTP responses.
+
+These settings control how tokens are batched in HTTP responses when streaming. The batch size starts at `DEFAULT_MIN_BATCH_SIZE` and increases by a factor of `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` with each request until it reaches `DEFAULT_BATCH_SIZE`.
+
+For example, with the default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM's internal batching.
+
+| Variable | Default | Type(s) | Description |
+| --- | --- | --- | --- |
+| `DEFAULT_BATCH_SIZE` | `50` | `int` | The default and maximum batch size for token streaming. |
+| `DEFAULT_MIN_BATCH_SIZE` | `1` | `int` | The initial batch size for the first request. |
+| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | `3` | `float` | The growth factor for the dynamic batch size. |
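+
+The growth pattern described above can be modeled with a short sketch (a simplified illustration of the documented behavior, not the worker's actual implementation):
+
+```python
+def stream_batch_sizes(min_size=1, growth_factor=3, max_size=50, steps=6):
+    """Yield the token batch size used for each successive streamed batch."""
+    size = min_size
+    for _ in range(steps):
+        # The batch size is capped at the maximum once the growth exceeds it.
+        yield min(int(size), max_size)
+        size *= growth_factor
+
+print(list(stream_batch_sizes()))  # [1, 3, 9, 27, 50, 50]
+```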
+
+## OpenAI compatibility settings
+
+Configure OpenAI API compatibility features.
+
+| Variable | Default | Type(s) | Description |
+| --- | --- | --- | --- |
+| `RAW_OPENAI_OUTPUT` | `1` | boolean as `int` | If `1`, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
+| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | `str` | Overrides the served model name. This allows you to use a custom name in the `model` parameter of your OpenAI requests. |
+| `OPENAI_RESPONSE_ROLE` | `assistant` | `str` | The role of the LLM's response in OpenAI chat completions. |
+| `ENABLE_AUTO_TOOL_CHOICE` | `false` | `bool` | If `true`, enables automatic tool selection for supported models. |
+| `TOOL_CALL_PARSER` | None | `str` | The parser for tool calls. |
+| `REASONING_PARSER` | None | `str` | The parser for reasoning-capable models. Setting this enables reasoning mode. |
+
+## Serverless and concurrency settings
+
+Configure concurrency and logging for Serverless deployments.
+
+| Variable | Default | Type(s) | Description |
+| --- | --- | --- | --- |
+| `MAX_CONCURRENCY` | `300` | `int` | The maximum number of concurrent requests per worker. vLLM's internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. |
+| `DISABLE_LOG_STATS` | `False` | `bool` | If `False`, enables vLLM stats logging. |
+| `DISABLE_LOG_REQUESTS` | `False` | `bool` | If `False`, enables vLLM request logging. |
+
+## Advanced settings
+
+Additional configuration options for specialized use cases.
+
+| Variable | Default | Type | Description |
+| --- | --- | --- | --- |
+| `MODEL_LOADER_EXTRA_CONFIG` | None | `dict` | Extra configuration for the model loader. |
+| `PREEMPTION_MODE` | None | `str` | The preemption mode. If `recompute`, the engine performs preemption-aware recomputation. If `save`, the engine saves activations to CPU memory during preemption. |
+| `PREEMPTION_CHECK_PERIOD` | `1.0` | `float` | The frequency (in seconds) at which the engine checks for preemption. |
+| `PREEMPTION_CPU_CAPACITY` | `2` | `float` | The percentage of CPU memory to use for saved activations. |
+| `DISABLE_LOGGING_REQUEST` | `False` | `bool` | If `True`, disables logging requests. |
+| `MAX_LOG_LEN` | None | `int` | The maximum number of prompt characters or prompt ID numbers to print in the log. |
+
+## Docker build arguments
+
+These variables are used when building custom Docker images with models baked in.
+
+| Variable | Default | Type | Description |
+| --- | --- | --- | --- |
+| `BASE_PATH` | `/runpod-volume` | `str` | The storage directory for the Hugging Face cache and model. |
+| `WORKER_CUDA_VERSION` | `12.1.0` | `str` | The CUDA version for the worker image. |
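+
+For example, if you build a custom image from the [vLLM worker repository](https://github.com/runpod-workers/worker-vllm), you can override these values at build time. The command below is only a sketch (the image tag is a placeholder, and both arguments are shown with their default values); see the repository's README for the full build workflow:
+
+```sh
+docker build -t my-vllm-worker \
+  --build-arg WORKER_CUDA_VERSION=12.1.0 \
+  --build-arg BASE_PATH=/runpod-volume \
+  .
+```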
diff --git a/serverless/vllm/get-started.mdx b/serverless/vllm/get-started.mdx
index 587cdef3..170a3dce 100644
--- a/serverless/vllm/get-started.mdx
+++ b/serverless/vllm/get-started.mdx
@@ -145,7 +145,7 @@ To add or modify environment variables:
 4. Add/edit your desired variables.
 5. Click **Save Endpoint**.
 
-You can find a full list of available environment variables in the [vLLM worker GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#environment-variablessettings).
+For a complete list of available environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).
 
 You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests#request-input-parameters).
 
diff --git a/serverless/vllm/openai-compatibility.mdx b/serverless/vllm/openai-compatibility.mdx
index 8d3c3d50..9dd5edf3 100644
--- a/serverless/vllm/openai-compatibility.mdx
+++ b/serverless/vllm/openai-compatibility.mdx
@@ -259,7 +259,7 @@ Use these environment variables to customize the OpenAI compatibility:
 | `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses |
 | `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions |
 
-You can find a complete list of vLLM environment variables on the [GitHub README](https://github.com/runpod-workers/worker-vllm#environment-variables).
+For a complete list of all vLLM environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables).
 
 ## Client libraries
 
diff --git a/serverless/vllm/overview.mdx b/serverless/vllm/overview.mdx
index 0e50052e..d34d4ac5 100644
--- a/serverless/vllm/overview.mdx
+++ b/serverless/vllm/overview.mdx
@@ -14,7 +14,7 @@ vLLM workers offer several advantages that make them ideal for LLM deployment:
 * **Pre-built optimization**: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
 * **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key.
 * **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
-* **Configurable environments**: Extensive customization options through [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
+* **Configurable environments**: Extensive customization options through [environment variables](/serverless/vllm/environment-variables) allow you to adjust model parameters, performance settings, and other behaviors.
 * **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.
 
 ## Deployment options
@@ -33,7 +33,7 @@ Quick-deployed workers will download models during initialization, which can tak
 
 ### Option 2: Deploy using a Docker image
 
-Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variablessettings).
+Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](/serverless/vllm/environment-variables).
 
 Follow the instructions in the [vLLM worker README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside) to build a model into your worker image.
 