From 06e26ec7f62d6044f96bbae1dee76c2c9bbbf57c Mon Sep 17 00:00:00 2001 From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com> Date: Thu, 9 Oct 2025 19:27:16 +0000 Subject: [PATCH 1/2] Documentation updates from Promptless --- serverless/vllm/get-started.mdx | 92 +++---- serverless/vllm/openai-compatibility.mdx | 183 +++++++++----- serverless/vllm/overview.mdx | 135 ++++++---- serverless/vllm/vllm-requests.mdx | 303 ++++++++++++++++------- 4 files changed, 457 insertions(+), 256 deletions(-) diff --git a/serverless/vllm/get-started.mdx b/serverless/vllm/get-started.mdx index 587cdef3..12209316 100644 --- a/serverless/vllm/get-started.mdx +++ b/serverless/vllm/get-started.mdx @@ -1,16 +1,17 @@ --- title: "Deploy a vLLM worker" +description: "Deploy a large language model using Runpod's vLLM workers and start serving requests in minutes." --- -Learn how to deploy a large language model (LLM) using Runpod's preconfigured vLLM workers. By the end of this guide, you'll have a fully functional API endpoint that you can use to handle LLM inference requests. +Learn how to deploy a large language model using Runpod's vLLM workers. By the end of this guide, you'll have a fully functional Serverless endpoint that can handle LLM inference requests. ## What you'll learn In this tutorial, you'll learn how to: -* Configure and deploy a vLLM worker using Runpod's Serverless platform. +* Configure and deploy a vLLM worker using Runpod Serverless. * Select the appropriate hardware and scaling settings for your model. -* Set up environmental variables to customize your deployment. +* Set up environment variables to customize your deployment. * Test your endpoint using the Runpod API. * Troubleshoot common issues that might arise during deployment. @@ -21,47 +22,33 @@ In this tutorial, you'll learn how to: ## Step 1: Choose your model -First, decide which LLM you want to deploy. The vLLM worker supports most Hugging Face models, including: +First, decide which LLM you want to deploy. The vLLM worker supports most models on Hugging Face, including: -* Llama 3 (e.g., `meta-llama/Llama-3.2-3B-Instruct`) -* Mistral (e.g., `mistralai/Ministral-8B-Instruct-2410`) -* Qwen3 (e.g., `Qwen/Qwen3-8B`) -* OpenChat (e.g., `openchat/openchat-3.5-0106`) -* Gemma (e.g., `google/gemma-3-1b-it`) -* Deepseek-R1 (e.g., `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`) -* Phi-4 (e.g., `microsoft/Phi-4-mini-instruct`) +* Llama 3 (e.g., `meta-llama/Llama-3.2-3B-Instruct`). +* Mistral (e.g., `mistralai/Ministral-8B-Instruct-2410`). +* Qwen3 (e.g., `Qwen/Qwen3-8B`). +* OpenChat (e.g., `openchat/openchat-3.5-0106`). +* Gemma (e.g., `google/gemma-3-1b-it`). +* DeepSeek-R1 (e.g., `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`). +* Phi-4 (e.g., `microsoft/Phi-4-mini-instruct`). -For this walkthrough, we'll use `openchat/openchat-3.5-0106`, but you can substitute this with [any compatible model](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#compatible-model-architectures). +For this walkthrough, we'll use `openchat/openchat-3.5-0106`, but you can substitute this with [any compatible model](https://docs.vllm.ai/en/latest/models/supported_models.html). ## Step 2: Deploy using the Runpod console -The easiest way to deploy a vLLM worker is through the Runpod console: +The easiest way to deploy a vLLM worker is through Runpod's Ready-to-Deploy Repos: -1. Navigate to the [Serverless page](https://www.console.runpod.io/serverless). +1. 
Find the [vLLM repo](https://console.runpod.io/hub/runpod-workers/worker-vllm) in the Runpod Hub. -2. Under **Quick Deploy**, find **Serverless vLLM** and click **Configure**. +2. Click **Deploy**, using the latest vLLM worker version. -3. In the deployment modal: +3. In the **Model (optional)** field, end the model name: `openchat/openchat-3.5-0106`. +4. Click **Advanced** to expand the vLLM settings. +5. Set **Max Model Length** to `8192` (or an appropriate context length for your model). +6. Leave other settings at their defaults unless you have specific requirements, then click **Next**. +7. Click **Create Endpoint** - * Select a vLLM version (latest stable recommended). - * Under **Hugging Face Models**, enter your model: `openchat/openchat-3.5-0106`. - * If using a gated model, enter your **Hugging Face Token**. - * Click **Next**. - -4. In the vLLM settings modal, under **LLM Settings**: - - * Set **Max Model Length** to `8192` (or an appropriate context length for your model). - * Leave other settings at their defaults unless you have specific requirements. - * Click **Next**. - -5. Make changes to the endpoint settings if you have specific requirements, then click **Deploy**. - - - - - - -Your endpoint will now begin initializing. This may take several minutes while Runpod provisions resources and downloads your model. +Your endpoint will now begin initializing. This may take several minutes while Runpod provisions resources and downloads the selected model. @@ -71,12 +58,9 @@ For more details on how to optimize your endpoint, see [Endpoint configurations] ## Step 3: Understand your endpoint -While your endpoint is initializing, let's understand what's happening and what you'll be able to do with it: - -* Runpod is creating a Serverless endpoint with your specified configuration. -* The vLLM worker image is being deployed with your chosen model. +While your endpoint is initializing, let's understand what's happening and what you'll be able to do with it. -Once deployment is complete, make a note of your **Endpoint ID**. You'll need this to make API requests. +Runpod is creating a Serverless endpoint with your specified configuration, and the vLLM worker image is being deployed using your chosen model. Once deployment is complete, make a note of your **Endpoint ID**, as you'll need this to make API requests. @@ -127,27 +111,27 @@ When the workers finish processing your request, you should see output on the ri } ``` -## Step 5: Customize your model (optional) +## Step 5: Customize your deployment with environment variables (optional) If you need to customize your model deployment, you can edit your endpoint settings to add environment variables. Here are some useful environment variables you might want to set: -* `MAX_MODEL_LEN`: Maximum context length (e.g., `16384`) -* `DTYPE`: Data type for model weights (`float16`, `bfloat16`, or `float32`) -* `GPU_MEMORY_UTILIZATION`: Controls VRAM usage (e.g., `0.95` for 95%) -* `CUSTOM_CHAT_TEMPLATE`: For models that need a custom chat template -* `OPENAI_SERVED_MODEL_NAME_OVERRIDE`: Change the model name to use in OpenAI requests +* `MAX_MODEL_LEN`: Maximum context length (e.g., `16384`). +* `DTYPE`: Data type for model weights (`float16`, `bfloat16`, or `float32`). +* `GPU_MEMORY_UTILIZATION`: Controls VRAM usage (e.g., `0.95` for 95%). +* `CUSTOM_CHAT_TEMPLATE`: For models that need a custom chat template. +* `OPENAI_SERVED_MODEL_NAME_OVERRIDE`: Change the model name to use in OpenAI requests. To add or modify environment variables: 1. 
Go to your endpoint details page. 2. Select **Manage**, then select **Edit Endpoint**. 3. Expand the **Public Environment Variables** section. -4. Add/edit your desired variables. +4. Add or edit your desired variables. 5. Click **Save Endpoint**. -You can find a full list of available environment variables in the [vLLM worker GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#environment-variablessettings). +You can find a full list of available environment variables in the [environment variables documentation](/serverless/vllm/environment-variables). -You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per reponse. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests#request-input-parameters). +You may also wish to adjust the input parameters for your request. For example, use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see [Send vLLM requests](/serverless/vllm/vllm-requests). ## Troubleshooting @@ -160,11 +144,11 @@ If you encounter issues with your deployment: ## Next steps -Congratulations! You've successfully deployed a vLLM worker on Runpod's Serverless platform. You now have a powerful, scalable LLM inference API that's compatible with both the OpenAI client and Runpod's native API. +Congratulations! You've successfully deployed a vLLM worker on Runpod Serverless. You now have a powerful, scalable LLM inference API that's compatible with both the OpenAI client and Runpod's native API. Next you can try: -* [Sending requests using the Runpod API.](/serverless/vllm/vllm-requests) -* [Learning about vLLM's OpenAI API compatibility.](/serverless/vllm/openai-compatibility) -* [Customizing your vLLM worker's handler function.](/serverless/workers/handler-functions) -* [Building a custom worker for more specialized workloads.](/serverless/workers/custom-worker) +* [Sending requests using the Runpod API](/serverless/vllm/vllm-requests). +* [Learning about vLLM's OpenAI API compatibility](/serverless/vllm/openai-compatibility). +* [Customizing your vLLM worker's handler function](/serverless/workers/handler-functions). +* [Building a custom worker for more specialized workloads](/serverless/workers/custom-worker). diff --git a/serverless/vllm/openai-compatibility.mdx b/serverless/vllm/openai-compatibility.mdx index 8d3c3d50..0ed8fa8c 100644 --- a/serverless/vllm/openai-compatibility.mdx +++ b/serverless/vllm/openai-compatibility.mdx @@ -1,19 +1,20 @@ --- title: "OpenAI API compatibility guide" -sidebarTitle: "OpenAI API compability" +sidebarTitle: "OpenAI API compatibility" +description: "Integrate vLLM workers with OpenAI client libraries and API-compatible tools." --- -Runpod's [vLLM workers](/serverless/vllm/overview) implement OpenAI API compatibility, allowing you to use familiar [OpenAI client libraries](https://platform.openai.com/docs/libraries) with your deployed models. This guide will help you understand how to leverage this compatibility to integrate your models seamlessly with existing OpenAI-based applications. +Runpod's vLLM workers implement OpenAI API compatibility, allowing you to use familiar [OpenAI client libraries](https://platform.openai.com/docs/libraries) with your deployed models. This guide explains how to leverage this compatibility to integrate your models seamlessly with existing OpenAI-based applications. 
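In practice, pointing an existing OpenAI-based application at a vLLM worker usually only requires swapping the API key and base URL. Here's a minimal sketch using the OpenAI Python client; the endpoint ID and API key are placeholders, and the full setup for each API is covered in the sections below:

```python
from openai import OpenAI

# Placeholders: replace with your own Runpod API key and endpoint ID.
client = OpenAI(
    api_key="<RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

# Quick connectivity check: list the model(s) served by your endpoint.
print([model.id for model in client.models.list()])
```

Everything else in your application, including prompts, parameters, and response handling, can stay the same.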
## Endpoint structure -When using the OpenAI-compatible API with Runpod, your requests will be directed to this base URL pattern: +When using the OpenAI-compatible API with Runpod, your requests are directed to this base URL pattern: -```bash -https://api.runpod.ai/v2/[ENDPOINT_ID]/openai/v1 +``` +https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1 ``` -Replace `[ENDPOINT_ID]` with your Serverless endpoint ID. +Replace `ENDPOINT_ID` with your Serverless endpoint ID. ## Supported APIs @@ -29,30 +30,30 @@ The vLLM worker implements these core OpenAI API endpoints: The `MODEL_NAME` environment variable is essential for all OpenAI-compatible API requests. This variable corresponds to either: -1. The [Hugging Face model](https://huggingface.co/models) you've deployed (e.g., `mistralai/Mistral-7B-Instruct-v0.2`) -2. A custom name if you've set `OPENAI_SERVED_MODEL_NAME_OVERRIDE` as an environment variable +1. The [Hugging Face model](https://huggingface.co/models) you've deployed (e.g., `mistralai/Mistral-7B-Instruct-v0.2`). +2. A custom name if you've set `OPENAI_SERVED_MODEL_NAME_OVERRIDE` as an environment variable. + +This model name is used in chat and text completion API requests to identify which model should process your request. -This model name is used in chat/text completion API requests to identify which model should process your request. -## Initilization +## Initialize the OpenAI client -Before you can send API requests, start by setting up an OpenAI client with your Runpod API key and endpoint URL: +Before you can send API requests, set up an OpenAI client with your Runpod API key and endpoint URL: ```python from openai import OpenAI -import os -MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2" # Use your deployed model +MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2" # Use your deployed model client = OpenAI( - api_key=[RUNPOD_API_KEY], - base_url=f"https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]/openai/v1", + api_key="", + base_url="https://api.runpod.ai/v2//openai/v1", ) ``` -## Send a request +## Send requests -You can use Runpod's OpenAI compatible API to send requests to your Runpod endpoint, enabling you to use the same client libraries and code that you use with OpenAI's services. You only need to change the base URL to point to your Runpod endpoint. +You can use Runpod's OpenAI-compatible API to send requests to your Runpod endpoint, enabling you to use the same client libraries and code that you use with OpenAI's services. You only need to change the base URL to point to your Runpod endpoint. @@ -64,20 +65,19 @@ You can also send requests using [Runpod's native API](/serverless/vllm/vllm-req The `/chat/completions` endpoint is designed for instruction-tuned LLMs that follow a chat format. -#### Non-streaming request example +#### Non-streaming request Here's how you can make a basic chat completion request: ```python from openai import OpenAI -import os MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2" # Use your deployed model # Initialize the OpenAI client client = OpenAI( - api_key=[RUNPOD_API_KEY], - base_url=f"https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]/openai/v1", + api_key="", + base_url="https://api.runpod.ai/v2//openai/v1", ) # Chat completion request (for instruction-tuned models) @@ -123,13 +123,11 @@ The API returns responses in this JSON format: } ``` -#### Streaming request example +#### Streaming request Streaming allows you to receive the model's output incrementally as it's generated, rather than waiting for the complete response. 
This real-time delivery enhances responsiveness, making it ideal for interactive applications like chatbots or for monitoring the progress of lengthy generation tasks. ```python -# ... Imports and initialization ... - # Create a streaming chat completion request stream = client.chat.completions.create( model=MODEL_NAME, @@ -154,16 +152,14 @@ print() The `/completions` endpoint is designed for base LLMs and text completion tasks. -#### Non-streaming request example +#### Non-streaming request Here's how you can make a text completion request: ```python -# ... Imports and initialization ... - # Text completion request response = client.completions.create( - model="mistralai/Mistral-7B-Instruct-v0.2", + model=MODEL_NAME, prompt="Write a poem about artificial intelligence:", temperature=0.7, max_tokens=150 @@ -199,11 +195,9 @@ The API returns responses in this JSON format: } ``` -#### Streaming request example +#### Streaming request ```python -# ... Imports and initialization ... - # Create a completion stream response_stream = client.completions.create( model=MODEL_NAME, @@ -212,6 +206,7 @@ response_stream = client.completions.create( max_tokens=100, stream=True, ) + # Stream the response for response in response_stream: print(response.choices[0].text or "", end="", flush=True) @@ -222,8 +217,6 @@ for response in response_stream: The `/models` endpoint allows you to get a list of available models on your endpoint: ```python -# ... Imports and initialization ... - models_response = client.models.list() list_of_models = [model.id for model in models_response] print(list_of_models) @@ -245,9 +238,67 @@ print(list_of_models) } ``` -## Request input parameters - -vLLM workers support various parameters to control generation behavior. You can find a complete list of OpenAI request input parameters on the [GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#openai-request-input-parameters). +## Chat completion parameters + +Here are all available parameters for the `/chat/completions` endpoint: + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `messages` | `list[dict[str, str]]` | Required | List of messages with `role` and `content` keys. The model's chat template will be applied automatically. | +| `model` | `string` | Required | The model repo that you've deployed on your Runpod Serverless endpoint. | +| `temperature` | `float` | `0.7` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | +| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | +| `n` | `int` | `1` | Number of output sequences to return for the given prompt. | +| `max_tokens` | `int` | None | Maximum number of tokens to generate per output sequence. | +| `seed` | `int` | None | Random seed to use for the generation. | +| `stop` | `string` or `list[str]` | `list` | String(s) that stop generation when produced. The returned output will not contain the stop strings. | +| `stream` | `bool` | `false` | Whether to stream the response. | +| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. 
Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `logit_bias` | `dict[str, float]` | None | Unsupported by vLLM. | +| `user` | `string` | None | Unsupported by vLLM. | + +### Additional vLLM parameters + +vLLM supports additional parameters beyond the standard OpenAI API: + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `best_of` | `int` | None | Number of output sequences generated from the prompt. From these `best_of` sequences, the top `n` sequences are returned. Must be ≥ `n`. Treated as beam width when `use_beam_search` is `true`. | +| `top_k` | `int` | `-1` | Controls the number of top tokens to consider. Set to -1 to consider all tokens. | +| `ignore_eos` | `bool` | `false` | Whether to ignore the EOS token and continue generating tokens after EOS is generated. | +| `use_beam_search` | `bool` | `false` | Whether to use beam search instead of sampling. | +| `stop_token_ids` | `list[int]` | `list` | List of token IDs that stop generation when produced. The returned output will contain the stop tokens unless they are special tokens. | +| `skip_special_tokens` | `bool` | `true` | Whether to skip special tokens in the output. | +| `spaces_between_special_tokens` | `bool` | `true` | Whether to add spaces between special tokens in the output. | +| `add_generation_prompt` | `bool` | `true` | Whether to add generation prompt. Read more [here](https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts). | +| `echo` | `bool` | `false` | Echo back the prompt in addition to the completion. | +| `repetition_penalty` | `float` | `1.0` | Penalizes new tokens based on whether they appear in the prompt and generated text so far. Values > 1 encourage new tokens, values < 1 encourage repetition. | +| `min_p` | `float` | `0.0` | Minimum probability for a token to be considered. | +| `length_penalty` | `float` | `1.0` | Penalizes sequences based on their length. Used in beam search. | +| `include_stop_str_in_output` | `bool` | `false` | Whether to include the stop strings in output text. | + +## Text completion parameters + +Here are all available parameters for the `/completions` endpoint: + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `prompt` | `string` or `list[str]` | Required | The prompt(s) to generate completions for. | +| `model` | `string` | Required | The model repo that you've deployed on your Runpod Serverless endpoint. | +| `temperature` | `float` | `0.7` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | +| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | +| `n` | `int` | `1` | Number of output sequences to return for the given prompt. | +| `max_tokens` | `int` | `16` | Maximum number of tokens to generate per output sequence. | +| `seed` | `int` | None | Random seed to use for the generation. | +| `stop` | `string` or `list[str]` | `list` | String(s) that stop generation when produced. The returned output will not contain the stop strings. | +| `stream` | `bool` | `false` | Whether to stream the response. | +| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. 
| +| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `logit_bias` | `dict[str, float]` | None | Unsupported by vLLM. | +| `user` | `string` | None | Unsupported by vLLM. | + +Text completions support the same additional vLLM parameters as chat completions (see the Additional vLLM parameters section above). ## Environment variables @@ -255,11 +306,11 @@ Use these environment variables to customize the OpenAI compatibility: | Variable | Default | Description | | ----------------------------------- | ----------- | ------------------------------------------- | -| `RAW_OPENAI_OUTPUT` | `1` (true) | Enables raw OpenAI SSE format for streaming | -| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses | -| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions | +| `RAW_OPENAI_OUTPUT` | `1` (true) | Enables raw OpenAI SSE format for streaming. | +| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses. | +| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions. | -You can find a complete list of vLLM environment variables on the [GitHub README](https://github.com/runpod-workers/worker-vllm#environment-variables). +For a complete list of vLLM environment variables, see the [environment variables documentation](/serverless/vllm/environment-variables). ## Client libraries @@ -267,16 +318,16 @@ The OpenAI-compatible API works with standard [OpenAI client libraries](https:// ### Python -```py +```python from openai import OpenAI client = OpenAI( - api_key="[RUNPOD_API_KEY]", - base_url=f"https://api.runpod.ai/v2/your_endpoint_id/openai/v1" + api_key="", + base_url="https://api.runpod.ai/v2//openai/v1" ) response = client.chat.completions.create( - model="[MODEL_NAME]", + model="", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"} @@ -286,16 +337,16 @@ response = client.chat.completions.create( ### JavaScript -```js +```javascript import { OpenAI } from "openai"; const openai = new OpenAI({ - apiKey: "[RUNPOD_API_KEY]", - baseURL: "https://api.runpod.ai/v2/your_endpoint_id/openai/v1" + apiKey: "", + baseURL: "https://api.runpod.ai/v2//openai/v1" }); const response = await openai.chat.completions.create({ - model: "[MODEL_NAME]", + model: "", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Hello!" } @@ -307,16 +358,21 @@ const response = await openai.chat.completions.create({ While the vLLM worker aims for high compatibility, there are some differences from OpenAI's implementation: -1. **Token counting**: Token counts may differ slightly from OpenAI models. -2. **Streaming format**: The exact chunking of streaming responses may vary. -3. **Error format**: Error responses follow a similar but not identical format. -4. **Rate limits**: Rate limits follow Runpod's endpoint policies rather than OpenAI's. +**Token counting** may differ slightly from OpenAI models due to different tokenizers. + +**Streaming format** follows OpenAI's Server-Sent Events (SSE) format, but the exact chunking of streaming responses may vary. + +**Error responses** follow a similar but not identical format to OpenAI's error responses. + +**Rate limits** follow Runpod's endpoint policies rather than OpenAI's rate limiting structure. 
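Because streaming chunk boundaries and error payloads can differ slightly from OpenAI's, it's worth configuring client-side timeouts and retries and guarding against empty chunks when you stream. The snippet below is one possible approach using the OpenAI Python client; the endpoint ID, API key, and model name are placeholders:

```python
from openai import OpenAI, APIError, APITimeoutError

# Placeholders: replace with your own endpoint ID, API key, and model name.
client = OpenAI(
    api_key="<RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    timeout=120,     # allow extra time for cold starts and large models
    max_retries=2,   # retry transient connection failures
)

try:
    stream = client.chat.completions.create(
        model="<MODEL_NAME>",
        messages=[{"role": "user", "content": "Give me a two-sentence summary of vLLM."}],
        stream=True,
    )
    for chunk in stream:
        # Chunking may vary, so confirm the chunk actually carries content.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except APITimeoutError:
    print("Request timed out; consider raising the client timeout.")
except APIError as e:
    print(f"The endpoint returned an API error: {e}")
```

If streamed output arrives in an unexpected format, also confirm that `RAW_OPENAI_OUTPUT` is left enabled on your endpoint (see the troubleshooting table below).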
+ +### Current limitations -The vLLM worker also currently has a few limitations: +The vLLM worker has a few limitations: -* The function and tool APIs are not currently supported. +* Function and tool calling APIs are not currently supported. * Some OpenAI-specific features like moderation endpoints are not available. -* Vision models and multimodal capabilities depend on the underlying model support. +* Vision models and multimodal capabilities depend on the underlying model support in vLLM. ## Troubleshooting @@ -324,14 +380,15 @@ Common issues and their solutions: | Issue | Solution | | ------------------------- | --------------------------------------------------------------------- | -| "Invalid model" error | Verify your model name matches what you deployed | -| Authentication error | Check that you're using your Runpod API key, not an OpenAI key | -| Timeout errors | Increase client timeout settings for large models | -| Incompatible responses | Set `RAW_OPENAI_OUTPUT=1` in your environment variables | -| Different response format | Some models may have different output formatting; use a chat template | +| "Invalid model" error | Verify your model name matches what you deployed. | +| Authentication error | Check that you're using your Runpod API key, not an OpenAI key. | +| Timeout errors | Increase client timeout settings for large models. | +| Incompatible responses | Set `RAW_OPENAI_OUTPUT=1` in your environment variables. | +| Different response format | Some models may have different output formatting; use a chat template. | ## Next steps -* [Learn how to send vLLM requests.](/serverless/vllm/vllm-requests) -* [Explore Runpod endpoint operations.](/serverless/endpoints/operations) -* [Explore the OpenAI API documentation.](https://platform.openai.com/docs/api-reference) +* [Learn how to send vLLM requests using Runpod's native API](/serverless/vllm/vllm-requests). +* [Explore environment variables for customization](/serverless/vllm/environment-variables). +* [Review all Serverless endpoint operations](/serverless/endpoints/send-requests). +* [Explore the OpenAI API documentation](https://platform.openai.com/docs/api-reference). diff --git a/serverless/vllm/overview.mdx b/serverless/vllm/overview.mdx index 0e50052e..9eabe2d4 100644 --- a/serverless/vllm/overview.mdx +++ b/serverless/vllm/overview.mdx @@ -1,81 +1,124 @@ --- -title: "vLLM worker overview" +title: "vLLM workers overview" sidebarTitle: "Overview" +description: "Learn what vLLM is, how it works, and why you should use it for deploying large language models on Runpod Serverless." --- -vLLM workers are specialized containers designed to efficiently deploy and serve large language models (LLMs) on Runpod's [Serverless infrastructure](/serverless/overview). By leveraging Runpod's vLLM workers, you can quickly deploy state-of-the-art language models with optimized performance, flexible scaling, and cost-effective operation. +vLLM workers let you deploy and serve large language models on Runpod Serverless. They use vLLM, a high-performance inference engine, to deliver fast and efficient LLM inference with automatic scaling. -For detailed information on model compatibility and configuration options, check out the [vLLM worker GitHub repository](https://github.com/runpod-workers/worker-vllm). +## What is vLLM? -## Key features +vLLM is an open-source inference engine designed to serve large language models efficiently. It maximizes throughput and minimizes latency when running LLM inference workloads. 
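To make this concrete, here's a minimal sketch of vLLM's own offline Python API, shown for illustration only (the model name is just an example). On Runpod Serverless you don't write this code yourself; the vLLM worker wraps the engine and exposes it through your Serverless endpoint instead:

```python
from vllm import LLM, SamplingParams

# Load a supported Hugging Face model into the vLLM engine.
llm = LLM(model="openchat/openchat-3.5-0106")

# Generate text using sampling parameters like those exposed by the worker.
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```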
-vLLM workers offer several advantages that make them ideal for LLM deployment: +vLLM workers include the vLLM engine with GPU optimizations and support for both OpenAI's API and Runpod's native API. You can deploy any supported model from Hugging Face with minimal configuration and start serving requests immediately. The workers run on Runpod Serverless, which automatically scales based on demand. -* **Pre-built optimization**: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference. -* **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key. -* **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others. -* **Configurable environments**: Extensive customization options through [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variables) allow you to adjust model parameters, performance settings, and other behaviors. -* **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis. +## How vLLM works -## Deployment options +vLLM uses several advanced techniques to achieve high performance when serving LLMs. Understanding these can help you optimize your deployments and troubleshoot issues. + +### PagedAttention for memory efficiency + +PagedAttention is the key innovation in vLLM. It dramatically improves how GPU memory is used during inference. Traditional LLM serving wastes memory by pre-allocating large contiguous blocks for key-value (KV) caches. PagedAttention breaks the KV cache into smaller pages, similar to how operating systems manage memory. + +This reduces memory waste and allows vLLM to serve more requests concurrently on the same GPU. You can handle higher throughput or serve larger models on smaller GPUs. + +### Continuous batching + +vLLM uses continuous batching (also called dynamic batching) to process multiple requests simultaneously. Unlike traditional batching, which waits for a batch to fill up before processing, continuous batching processes requests as they arrive and adds new requests to the batch as soon as previous ones complete. + +This keeps your GPU busy and reduces latency for individual requests, especially during periods of variable traffic. + +### Request lifecycle + +When you send a request to a vLLM worker endpoint: + +1. The request arrives at Runpod Serverless infrastructure. +2. If no worker is available, the request is queued and a worker starts automatically. +3. The worker loads your model from Hugging Face (or from the pre-baked Docker image). +4. vLLM processes the request using PagedAttention and continuous batching. +5. The response is returned to your application. +6. If there are no more requests, the worker scales down to zero after a configured timeout. + +vLLM endpoints use the same `/run` and `/runsync` operations as other Runpod Serverless endpoints. The only difference is the input format and the specialized LLM processing inside the worker. + +## Why use vLLM workers? + +vLLM workers offer several advantages over other LLM deployment options. + +### Performance and efficiency + +vLLM's PagedAttention and continuous batching deliver significantly better throughput than traditional serving methods. 
You can serve 2-3x more requests per GPU compared to naive implementations, which directly translates to lower costs and better user experiences. + +### OpenAI API compatibility -There are two ways to deploy a vLLM worker: +vLLM workers provide a drop-in replacement for OpenAI's API. If you're already using the OpenAI Python client or any other OpenAI-compatible library, you can switch to your Runpod endpoint by changing just two lines of code: the API key and the base URL. Your existing prompts, parameters, and response handling code continue to work without modification. -### Option 1: Quick deploy a vLLM endpoint +### Model flexibility -This is the simplest approach. Use Runpod's UI to deploy a model directly from Hugging Face with minimal configuration. For step-by-step instructions, see [Deploy a vLLM worker](/serverless/vllm/get-started). +You can deploy virtually any model available on Hugging Face, including popular options like Llama, Mistral, Qwen, Gemma, and thousands of others. vLLM supports a wide range of model architectures out of the box, and new architectures are added regularly. - +### Auto-scaling and cost efficiency -Quick-deployed workers will download models during initialization, which can take some time depending on the model selected. If you plan to run a vLLM endpoint in production, it’s best to package your model into a Docker image ahead of time (using the Docker image method below), as this can significantly reduce cold start times. +Runpod Serverless automatically scales your vLLM workers from zero to many based on demand. You only pay for the seconds when workers are actively processing requests. This makes vLLM workers ideal for workloads with variable traffic patterns or when you're getting started and don't want to pay for idle capacity. - +### Production-ready features -### Option 2: Deploy using a Docker image +vLLM workers come with features that make them suitable for production deployments, including streaming responses, configurable context lengths, quantization support (AWQ, GPTQ), multi-GPU tensor parallelism, and comprehensive error handling. -Deploy a packaged vLLM worker image from [GitHub](https://github.com/runpod-workers/worker-vllm) or [Docker Hub](https://hub.docker.com/r/runpod/worker-v1-vllm/tags), configuring your endpoint using [environment variables](https://github.com/runpod-workers/worker-vllm#environment-variablessettings). +## Deployment options + +There are two ways to deploy vLLM workers on Runpod. + +### Using pre-built Docker images + +This is the fastest and most common approach. Runpod provides pre-built vLLM worker images that you can deploy directly from the console. You specify your model name as an environment variable, and the worker downloads it from Hugging Face during initialization. + +This method is ideal for getting started quickly, testing different models, or deploying models that change frequently. However, model download time adds to your cold start latency. -Follow the instructions in the [vLLM worker README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside) to build a model into your worker image. +### Building custom Docker images with models baked in -You can add new functionality your vLLM worker deployment by customizing its [handler function](/serverless/workers/handler-functions). +For production deployments where cold start time matters, you can build a custom Docker image that includes your model weights. 
This eliminates download time and can reduce cold starts from minutes to seconds. + +This approach requires more upfront work but provides the best performance for production workloads with consistent traffic. ## Compatible models -You can deploy almost any model on [Hugging Face](https://huggingface.co/models?other=LLM) as a vLLM worker. You can find a full list of supported models architectures on the [GitHub README](https://github.com/runpod-workers/worker-vllm/blob/main/README.md#compatible-model-architectures). +vLLM supports most model architectures available on Hugging Face. You can deploy models from families including Llama (1, 2, 3, 3.1, 3.2), Mistral and Mixtral, Qwen2 and Qwen2.5, Gemma and Gemma 2, Phi (2, 3, 3.5, 4), DeepSeek (V2, V3, R1), GPT-2, GPT-J, OPT, BLOOM, Falcon, MPT, StableLM, Yi, and many others. -## How vLLM works +For a complete and up-to-date list of supported model architectures, see the [vLLM supported models documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). + +## Performance considerations -When deployed to a [Serverless endpoint](/serverless/endpoints/overview), vLLM workers: +Several factors affect vLLM worker performance. -1. Download and load the specified LLM from Hugging Face or other compatible sources. -2. Optimize the model for inference using vLLM's techniques like continuous batching and PagedAttention. -3. Expose API endpoints for both [OpenAI-compatible requests](/serverless/vllm/openai-compatibility) and Runpod's native [endpoint request](/serverless/endpoints/send-requests) format. -4. Process incoming requests by dynamically allocating GPU resources. -5. Scale workers up or down based on traffic patterns. +**GPU selection** is the most important factor. Larger models require more VRAM, and inference speed scales with GPU memory bandwidth. For 7B parameter models, an A10G or better is recommended. For 70B+ models, you'll need an A100 or H100. See [GPU types](/references/gpu-types) for details on available GPUs. + +**Model size** directly impacts both loading time and inference speed. Smaller models (7B parameters) load quickly and generate tokens fast. Larger models (70B+ parameters) provide better quality but require more powerful GPUs and have higher latency. + +**Quantization** reduces model size and memory requirements by using lower-precision weights. Methods like AWQ and GPTQ can reduce memory usage by 2-4x with minimal quality loss. This lets you run larger models on smaller GPUs or increase throughput on a given GPU. + +**Context length** affects memory requirements and processing time. Longer contexts require more memory for the KV cache and take longer to process. Set `MAX_MODEL_LEN` to the minimum value that meets your needs. + +**Concurrent requests** benefit from vLLM's continuous batching, but too many concurrent requests can exceed GPU memory and cause failures. The `MAX_NUM_SEQS` environment variable controls the maximum number of concurrent sequences. ## Use cases -vLLM workers are an effective choice for: +vLLM workers are ideal for several types of applications. -* High-performance inference for text generation. -* Cost-effective scaling for LLM workloads. -* Integration with existing OpenAI-based applications. -* Deploying open-source models with commercial licenses. -* AI systems requiring both synchronous and streaming responses. +**Production LLM APIs** benefit from vLLM's high throughput and OpenAI compatibility. 
You can build scalable APIs for chatbots, content generation, code completion, or any other LLM-powered feature. -## Performance considerations +**Cost-effective scaling** is enabled by Serverless auto-scaling. If your traffic varies significantly throughout the day or week, vLLM workers automatically scale down to zero during quiet periods, saving costs compared to always-on servers. + +**OpenAI migration** is straightforward because vLLM provides API compatibility. You can migrate existing OpenAI-based applications to open-source models by changing only your endpoint URL and API key. -The performance of vLLM workers depends on several factors: +**Custom model hosting** lets you deploy fine-tuned or specialized models. If you've trained a custom model or fine-tuned an existing one, vLLM workers make it easy to serve it at scale. -* **GPU selection**: Larger models require more VRAM (A10G or better recommended for 7B+ parameter models). For a list of available GPUs, see [GPU types](/references/gpu-types) -* **Model size**: Affects both loading time and inference speed. -* **Quantization**: Options like AWQ or GPTQ can reduce memory requirements at a small quality cost. -* **Batch size settings**: Impact throughput and latency tradeoffs. -* **Context length**: Longer contexts require more memory and processing time. +**Development and experimentation** is cheaper with pay-per-second billing. You can test multiple models and configurations without worrying about idle costs. ## Next steps -* [Deploy a vLLM worker as a Serverless endpoint.](/serverless/vllm/get-started) -* [Send requests to a vLLM endpoint.](/serverless/vllm/vllm-requests) -* [Learn about Runpod's OpenAI API compatibility.](/serverless/vllm/openai-compatibility) -* [Deploy Google's Gemma model using a vLLM Worker.](/tutorials/serverless/run-gemma-7b) +Ready to deploy your first vLLM worker? Start with the [get started guide](/serverless/vllm/get-started) to deploy a model in minutes. + +Once your endpoint is running, learn how to send requests using [Runpod's native API](/serverless/vllm/vllm-requests) or the [OpenAI-compatible API](/serverless/vllm/openai-compatibility). + +For advanced configuration options, see the [environment variables documentation](/serverless/vllm/environment-variables). diff --git a/serverless/vllm/vllm-requests.mdx b/serverless/vllm/vllm-requests.mdx index 2f245095..dbe9b613 100644 --- a/serverless/vllm/vllm-requests.mdx +++ b/serverless/vllm/vllm-requests.mdx @@ -1,158 +1,275 @@ --- title: "Send requests to vLLM workers" sidebarTitle: "Send vLLM requests" +description: "Send requests to vLLM workers using Runpod's native API." --- -This guide covers different methods for sending requests to vLLM workers on Runpod, including code examples and best practices for Runpod's native API format. Use this guide to effectively integrate LLMs into your applications while maintaining control over performance and cost. +This guide covers how to send requests to vLLM workers using Runpod's native API format. vLLM workers use the same request operations as any other Runpod Serverless endpoint, with specialized input parameters for LLM inference. -## Requirements +## How vLLM requests work -* You've [created a Runpod account](/get-started/manage-accounts). -* You've created a [Runpod API key](/get-started/api-keys). -* You've installed [Python](https://www.python.org/downloads/). -* (For gated models) You've created a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). 
+vLLM workers are queue-based Serverless endpoints. They use the same `/run` and `/runsync` operations as other Runpod endpoints, following the standard [Serverless request structure](/serverless/endpoints/send-requests). -Many of the code samples below will require you to input your endpoint ID. You can find your endpoint ID on the endpoint details page: +The key difference is the input format. vLLM workers expect specific parameters for language model inference, such as prompts, messages, and sampling parameters. The worker's handler processes these inputs using the vLLM engine and returns generated text. - - - +## Request operations -## Runpod API requests +vLLM endpoints support both synchronous and asynchronous requests. -Runpod's native API provides additional flexibility and control over your requests. These requests follow Runpod's standard [endpoint operations](/serverless/endpoints/operations) format. +### Asynchronous requests with `/run` -### Python Example - -Replace `[RUNPOD_API_KEY]` with your Runpod API key. +Use `/run` to submit a job that processes in the background. You'll receive a job ID immediately, then poll for results using the `/status` endpoint. ```python import requests url = "https://api.runpod.ai/v2//run" -headers = {"Authorization": "Bearer [RUNPOD_API_KEY]", "Content-Type": "application/json"} +headers = { + "Authorization": "Bearer ", + "Content-Type": "application/json" +} data = { "input": { - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Write a short poem."} - ], - "sampling_params": {"temperature": 0.7, "max_tokens": 100} + "prompt": "Explain quantum computing in simple terms.", + "sampling_params": { + "temperature": 0.7, + "max_tokens": 200 + } } } response = requests.post(url, headers=headers, json=data) -print(response.json()) +job_id = response.json()["id"] +print(f"Job ID: {job_id}") ``` -### cURL Example +### Synchronous requests with `/runsync` + +Use `/runsync` to wait for the complete response in a single request. The client blocks until processing is complete. + +```python +import requests -Run the following command in your local terminal, replacing `[RUNPOD_API_KEY]` with your Runpod API key and `[RUNPOD_ENDPOINT_ID]` with your vLLM endpoint ID. +url = "https://api.runpod.ai/v2//runsync" +headers = { + "Authorization": "Bearer ", + "Content-Type": "application/json" +} + +data = { + "input": { + "prompt": "Explain quantum computing in simple terms.", + "sampling_params": { + "temperature": 0.7, + "max_tokens": 200 + } + } +} -```sh -curl -X POST "https://api.runpod.ai/v2/[RUNPOD_ENDPOINT_ID]/run" \ - -H "Authorization: Bearer [RUNPOD_API_KEY]" \ - -H "Content-Type: application/json" \ - -d '{ - "input": { - "prompt": "Write a haiku about nature.", - "sampling_params": {"temperature": 0.8, "max_tokens": 50} - } - }' +response = requests.post(url, headers=headers, json=data) +print(response.json()) ``` -## Request formats +For more details on request operations, see [Send API requests to Serverless endpoints](/serverless/endpoints/send-requests). + +## Input formats -vLLM workers accept two primary input formats: +vLLM workers accept two input formats for text generation. ### Messages format (for chat models) +Use the messages format for instruction-tuned models that expect conversation history. The worker automatically applies the model's chat template. 
+ ```json { - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Tell me about the solar system."} - ] + "input": { + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is the capital of France?"} + ], + "sampling_params": { + "temperature": 0.7, + "max_tokens": 100 + } + } } ``` ### Prompt format (for text completion) +Use the prompt format for base models or when you want to provide raw text without a chat template. + ```json { - "prompt": "Summarize the following text: Climate change is a global challenge that affects..." + "input": { + "prompt": "The capital of France is", + "sampling_params": { + "temperature": 0.7, + "max_tokens": 50 + } + } +} +``` + +### Applying chat templates to prompts + +If you use the prompt format but want the model's chat template applied, set `apply_chat_template` to `true`. + +```json +{ + "input": { + "prompt": "What is the capital of France?", + "apply_chat_template": true, + "sampling_params": { + "temperature": 0.7, + "max_tokens": 100 + } + } } ``` ## Request input parameters -vLLM workers support various parameters to control generation behavior. Here are some commonly used parameters: +Here are all available parameters you can include in the `input` object of your request. + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `prompt` | `string` | None | Prompt string to generate text based on. | +| `messages` | `list[dict[str, str]]` | None | List of messages with `role` and `content` keys. The model's chat template will be applied automatically. Overrides `prompt`. | +| `apply_chat_template` | `bool` | `false` | Whether to apply the model's chat template to the `prompt`. | +| `sampling_params` | `dict` | `{}` | Sampling parameters to control generation (see Sampling parameters section below). | +| `stream` | `bool` | `false` | Whether to enable streaming of output. If `true`, responses are streamed as they are generated. | +| `max_batch_size` | `int` | env `DEFAULT_BATCH_SIZE` | The maximum number of tokens to stream per HTTP POST call. | +| `min_batch_size` | `int` | env `DEFAULT_MIN_BATCH_SIZE` | The minimum number of tokens to stream per HTTP POST call. | +| `batch_size_growth_factor` | `int` | env `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | The growth factor by which `min_batch_size` multiplies for each call until `max_batch_size` is reached. | + +## Sampling parameters + +Sampling parameters control how the model generates text. Include them in the `sampling_params` dictionary in your request. + +| Parameter | Type | Default | Description | +| --- | --- | --- | --- | +| `n` | `int` | `1` | Number of output sequences generated from the prompt. The top `n` sequences are returned. | +| `best_of` | `int` | `n` | Number of output sequences generated from the prompt. The top `n` sequences are returned from these `best_of` sequences. Must be ≥ `n`. Treated as beam width in beam search. | +| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | +| `repetition_penalty` | `float` | `1.0` | Penalizes new tokens based on their appearance in the prompt and generated text. 
Values > 1 encourage new tokens, values < 1 encourage repetition. | +| `temperature` | `float` | `1.0` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | +| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | +| `top_k` | `int` | `-1` | Controls the number of top tokens to consider. Set to -1 to consider all tokens. | +| `min_p` | `float` | `0.0` | Represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. | +| `use_beam_search` | `bool` | `false` | Whether to use beam search instead of sampling. | +| `length_penalty` | `float` | `1.0` | Penalizes sequences based on their length. Used in beam search. | +| `early_stopping` | `bool` or `string` | `false` | Controls stopping condition in beam search. Can be `true`, `false`, or `"never"`. | +| `stop` | `string` or `list[str]` | `None` | String(s) that stop generation when produced. The output will not contain these strings. | +| `stop_token_ids` | `list[int]` | `None` | List of token IDs that stop generation when produced. Output contains these tokens unless they are special tokens. | +| `ignore_eos` | `bool` | `false` | Whether to ignore the End-Of-Sequence token and continue generating tokens after its generation. | +| `max_tokens` | `int` | `16` | Maximum number of tokens to generate per output sequence. | +| `min_tokens` | `int` | `0` | Minimum number of tokens to generate per output sequence before EOS or stop sequences. | +| `skip_special_tokens` | `bool` | `true` | Whether to skip special tokens in the output. | +| `spaces_between_special_tokens` | `bool` | `true` | Whether to add spaces between special tokens in the output. | +| `truncate_prompt_tokens` | `int` | `None` | If set, truncate the prompt to this many tokens. | + +## Streaming responses + +Enable streaming to receive tokens as they're generated instead of waiting for the complete response. + +```python +import requests +import json -| Parameter | Type | Description | -| -------------------- | ------------------- | ----------------------------------------------------------- | -| `temperature` | `float` | Controls randomness (0.0-1.0) | -| `max_tokens` | `int` | Maximum number of tokens to generate | -| `top_p` | `float` | Nucleus sampling parameter (0.0-1.0) | -| `top_k` | `int` | Limits consideration to top k tokens | -| `stop` | `string` or `array` | Sequence(s) at which to stop generation | -| `repetition_penalty` | `float` | Penalizes repetition (1.0 = no penalty) | -| `presence_penalty` | `float` | Penalizes new tokens already in text | -| `frequency_penalty` | `float` | Penalizes token frequency | -| `min_p` | `float` | Minimum probability threshold relative to most likely token | -| `best_of` | `int` | Number of completions to generate server-side | -| `use_beam_search` | `boolean` | Whether to use beam search instead of sampling | +url = "https://api.runpod.ai/v2/ENDPOINT_ID/run" +headers = { + "Authorization": "Bearer RUNPOD_API_KEY", + "Content-Type": "application/json" +} -You can find a complete list of request input parameters on the [GitHub README](https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#usage-standard-non-openai). 
+data = { + "input": { + "prompt": "Write a short story about a robot.", + "sampling_params": { + "temperature": 0.8, + "max_tokens": 500 + }, + "stream": True + } +} -## Error handling +response = requests.post(url, headers=headers, json=data) +job_id = response.json()["id"] + +# Stream the results +stream_url = f"https://api.runpod.ai/v2//stream/{job_id}" +with requests.get(stream_url, headers=headers, stream=True) as r: + for line in r.iter_lines(): + if line: + print(json.loads(line)) +``` + +Replace `ENDPOINT_ID` and `RUNPOD_API_KEY` with your actual values. -When working with vLLM workers, it's crucial to implement proper error handling to address potential issues such as network timeouts, rate limiting, worker initialization delays, and model loading errors. +For more information on streaming, see the [stream operation documentation](/serverless/endpoints/send-requests#stream). -Here is an example error handling implementation: +## Error handling + +Implement proper error handling to manage network timeouts, rate limiting, worker initialization delays, and model loading errors. ```python import requests import time -import backoff # pip install backoff - -@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_tries=5) -def send_request(url, headers, payload): - response = requests.post(url, headers=headers, json=payload) - response.raise_for_status() # Raises an exception for 4XX/5XX responses - return response.json() - -try: - result = send_request(url, headers, payload) - print(f"Success: {result}") -except requests.exceptions.HTTPError as e: - if e.response.status_code == 429: - print("Rate limit exceeded. Try again later.") - elif e.response.status_code == 500: - print("Server error. The model may be having trouble loading.") - else: - print(f"HTTP error: {e}") -except requests.exceptions.ConnectionError: - print("Connection error. Check your network and endpoint ID.") -except requests.exceptions.Timeout: - print("Request timed out. The model may be processing a large batch.") -except Exception as e: - print(f"An unexpected error occurred: {e}") + +def send_vllm_request(url, headers, payload, max_retries=3): + for attempt in range(max_retries): + try: + response = requests.post(url, headers=headers, json=payload, timeout=300) + response.raise_for_status() + return response.json() + except requests.exceptions.Timeout: + print(f"Request timed out. Attempt {attempt + 1}/{max_retries}") + if attempt < max_retries - 1: + time.sleep(2 ** attempt) # Exponential backoff + except requests.exceptions.HTTPError as e: + if e.response.status_code == 429: + print("Rate limit exceeded. Waiting before retry...") + time.sleep(5) + elif e.response.status_code >= 500: + print(f"Server error: {e.response.status_code}") + if attempt < max_retries - 1: + time.sleep(2 ** attempt) + else: + raise + except requests.exceptions.RequestException as e: + print(f"Request failed: {e}") + if attempt < max_retries - 1: + time.sleep(2 ** attempt) + + raise Exception("Max retries exceeded") + +# Usage +result = send_vllm_request(url, headers, data) ``` ## Best practices -Here are some best practices to keep in mind when creating your requests: +Follow these best practices when sending requests to vLLM workers. + +**Set appropriate timeouts** based on your model size and expected generation length. Larger models and longer generations require longer timeouts. + +**Implement retry logic** with exponential backoff for failed requests. 
This handles temporary network issues and worker initialization delays. + +**Use streaming for long responses** to provide a better user experience. Users see output immediately instead of waiting for the entire response. + +**Optimize sampling parameters** for your use case. Lower temperature for factual tasks, higher temperature for creative tasks. + +**Monitor response times** to identify performance issues. If requests consistently take longer than expected, consider using a more powerful GPU or optimizing your parameters. + +**Handle rate limits** gracefully by implementing queuing or request throttling in your application. -1. **Use appropriate timeouts**: Set timeouts based on your model size and complexity. -2. **Implement retry logic**: Add exponential backoff for failed requests. -3. **Optimize batch size**: Adjust request frequency based on model inference speed. -4. **Monitor response times**: Track performance to identify optimization opportunities. -5. **Use streaming for long responses**: Improve user experience for lengthy content generation. -6. **Cache frequent requests**: Reduce redundant API calls for common queries. -7. **Handle rate limits**: Implement queuing for high-volume applications. +**Cache common requests** when appropriate to reduce redundant API calls and improve response times. ## Next steps -* [Send requests using the OpenAI-compatible API.](/serverless/vllm/openai-compatibility) -* [Learn how to use Serverless endpoint operations.](/serverless/endpoints/operations) +* [Learn about OpenAI API compatibility](/serverless/vllm/openai-compatibility). +* [Explore environment variables for customization](/serverless/vllm/environment-variables). +* [Review all Serverless endpoint operations](/serverless/endpoints/send-requests). 
From 242ace225a37a0de163dc6e1b474e1e94bc1d443 Mon Sep 17 00:00:00 2001 From: Mo King Date: Tue, 21 Oct 2025 15:20:50 -0400 Subject: [PATCH 2/2] vLLM rewrite second draft --- .github/ISSUE_TEMPLATE/bug-report.md | 2 +- .../ISSUE_TEMPLATE/documentation-request.md | 2 +- api-reference/openapi.json | 2 +- api-reference/overview.mdx | 27 +++ .../comfyui-to-api/overview.mdx | 150 +++++++++++++ community-solutions/overview.mdx | 8 + docs.json | 105 ++++----- overview.mdx | 7 +- pods/choose-a-pod.mdx | 6 + pods/manage-pods.mdx | 32 ++- pods/overview.mdx | 6 +- pods/pricing.mdx | 4 +- pods/storage/create-network-volumes.mdx | 2 +- pods/storage/types.mdx | 22 +- references/billing-information.mdx | 39 +++- references/faq.mdx | 204 ------------------ references/faq/manage-cards.mdx | 29 --- references/security-and-compliance.mdx | 37 ++++ .../troubleshooting/manage-payment-cards.mdx | 38 ++++ references/troubleshooting/storage-full.mdx | 6 +- references/troubleshooting/zero-gpus.mdx | 29 +++ runpodctl/reference/runpodctl-create-pod.mdx | 2 +- runpodctl/reference/runpodctl-create-pods.mdx | 2 +- sdks/graphql/manage-pods.mdx | 2 +- serverless/endpoints/send-requests.mdx | 1 + serverless/load-balancing/build-a-worker.mdx | 8 +- serverless/load-balancing/overview.mdx | 57 +++++ serverless/load-balancing/vllm-worker.mdx | 8 +- serverless/overview.mdx | 2 +- serverless/pricing.mdx | 2 +- serverless/storage/overview.mdx | 6 +- serverless/vllm/environment-variables.mdx | 169 +++++++++++++++ serverless/vllm/get-started.mdx | 2 +- serverless/vllm/openai-compatibility.mdx | 40 ++-- serverless/vllm/overview.mdx | 16 +- serverless/vllm/vllm-requests.mdx | 16 +- storage/network-volumes.mdx | 173 +++++++++++++-- storage/s3-api.mdx | 15 +- tutorials/pods/comfyui.mdx | 2 +- 39 files changed, 887 insertions(+), 393 deletions(-) create mode 100644 community-solutions/comfyui-to-api/overview.mdx delete mode 100644 references/faq.mdx delete mode 100644 references/faq/manage-cards.mdx create mode 100644 references/security-and-compliance.mdx create mode 100644 references/troubleshooting/manage-payment-cards.mdx create mode 100644 references/troubleshooting/zero-gpus.mdx create mode 100644 serverless/vllm/environment-variables.mdx diff --git a/.github/ISSUE_TEMPLATE/bug-report.md b/.github/ISSUE_TEMPLATE/bug-report.md index 9f674ae4..e6a982a7 100644 --- a/.github/ISSUE_TEMPLATE/bug-report.md +++ b/.github/ISSUE_TEMPLATE/bug-report.md @@ -3,7 +3,7 @@ name: Bug report about: Create a report to help us improve title: "[Bug]" labels: '' -assignees: AllysonRosenthal, rachfop +assignees: muhsinking --- diff --git a/.github/ISSUE_TEMPLATE/documentation-request.md b/.github/ISSUE_TEMPLATE/documentation-request.md index 9406a085..d656e95f 100644 --- a/.github/ISSUE_TEMPLATE/documentation-request.md +++ b/.github/ISSUE_TEMPLATE/documentation-request.md @@ -3,7 +3,7 @@ name: Documentation request about: Suggest an idea for this project title: "[Request]" labels: '' -assignees: AllysonRosenthal, rachfop +assignees: muhsinking --- diff --git a/api-reference/openapi.json b/api-reference/openapi.json index 253c1a19..a2244fe8 100644 --- a/api-reference/openapi.json +++ b/api-reference/openapi.json @@ -2535,7 +2535,7 @@ "SECURE", "COMMUNITY" ], - "description": "Set to SECURE to create the Pod in Secure Cloud. Set to COMMUNITY to create the Pod in Community Cloud. To determine which one suits your needs, see https://docs.runpod.io/references/faq/#secure-cloud-vs-community-cloud." 
+ "description": "Set to SECURE to create the Pod in Secure Cloud. Set to COMMUNITY to create the Pod in Community Cloud. To determine which one suits your needs, see https://docs.runpod.io/pods/overview#pod-types." }, "computeType": { "type": "string", diff --git a/api-reference/overview.mdx b/api-reference/overview.mdx index 97a09c37..00d87c3f 100644 --- a/api-reference/overview.mdx +++ b/api-reference/overview.mdx @@ -31,4 +31,31 @@ To use the REST API, you'll need a [Runpod API key](/get-started/api-keys) with All API requests require authentication using your API key in the request headers. The API uses standard HTTP methods (GET, POST, PATCH, DELETE) and returns JSON responses with detailed error information when needed. +## Retrieve the OpenAPI schema + +You can get the complete OpenAPI specification for the Runpod API using the `/openapi.json` endpoint. Use this to generate client libraries, validate requests, or integrate the API specification into your development tools. + +The schema includes all available endpoints, request and response formats, authentication requirements, and data models. + + + +```bash cURL +curl --request GET \ + --url https://rest.runpod.io/v1/openapi.json \ + --header 'Authorization: Bearer RUNPOD_API_KEY' +``` + +```python Python +import requests + +url = "https://rest.runpod.io/v1/openapi.json" +headers = {"Authorization": "Bearer RUNPOD_API_KEY"} +response = requests.get(url, headers=headers) +print(response.json()) +``` + + + +The endpoint returns the OpenAPI 3.0 specification in JSON format. You can use it with tools like Swagger UI, Postman, or code generation utilities. + For detailed endpoint documentation, request/response schemas, and code examples, explore the sections in the sidebar to the left. diff --git a/community-solutions/comfyui-to-api/overview.mdx b/community-solutions/comfyui-to-api/overview.mdx new file mode 100644 index 00000000..c262c39d --- /dev/null +++ b/community-solutions/comfyui-to-api/overview.mdx @@ -0,0 +1,150 @@ +--- +title: "ComfyUI-to-API" +description: "Convert your ComfyUI workflows into deployment-ready GitHub repositories with automated Dockerfile generation" +icon: "wand-magic-sparkles" +--- + +Access the tool at [comfy.getrunpod.io](https://comfy.getrunpod.io) + +ComfyUI-to-API is a web-based tool that analyzes your ComfyUI workflows and automatically generates a Dockerfile and GitHub repository for deployment on Runpod Serverless. The tool uses a custom Comfy Agent to understand your workflow dependencies, searches for the right custom nodes and models, and creates a complete repository that you can deploy using [Runpod's GitHub integration](/serverless/workers/github-integration). + +## What you'll learn + +In this guide, you'll learn how to use ComfyUI-to-API to transform your ComfyUI workflows into Serverless workers. You'll upload a workflow file exported from ComfyUI, review the AI-powered analysis, generate a Dockerfile with all necessary dependencies, create a GitHub repository, and deploy it to Runpod Serverless using GitHub integration. + +## Requirements + +Before getting started, make sure you have: + +- A ComfyUI workflow file exported via Comfy → File → Export (the full workflow export, not the API export). +- A GitHub account for repository creation. +- A Runpod account for Serverless deployment. +- Basic familiarity with ComfyUI workflows and custom nodes. + + + Make sure to export your workflow using **Comfy → File → Export** (not the API + export). 
The tool needs the complete workflow information to properly analyze + your setup. + + +## Step 1: Sign in with GitHub + +Visit [comfy.getrunpod.io](https://comfy.getrunpod.io) to access ComfyUI-to-API. The tool requires you to sign in with your GitHub account so it can create a repository on your behalf in a later step. + +Click **Sign in with GitHub** and authorize the application. Once authenticated, you'll be able to proceed with uploading your workflow. + +## Step 2: Upload your workflow + +Once you've accessed the tool, the first step is to upload your ComfyUI workflow file. + +In ComfyUI, export your workflow using **Comfy → File → Export** (not the API export). This exports the complete workflow information needed for analysis. Then click the upload area or drag and drop your `workflow.json` file into the tool. + +After uploading, you'll see a confirmation showing the filename and basic workflow information. The tool performs a quick validation to ensure your file is a valid ComfyUI workflow before proceeding to analysis. + +## Step 3: Analyze your workflow + +Click the **Analyze** button to run the custom Comfy Agent. You'll see progress information as the tool works through your workflow. + +The analysis process detects custom nodes in your workflow, locates required models and their download URLs, resolves dependencies between nodes, and determines installation methods for each component. This typically takes 30-60 seconds depending on the complexity of your workflow. + +The analysis results are displayed in expandable sections showing custom nodes found, models identified, and any potential issues or recommendations. + +## Step 4: Review the generated Dockerfile + +After analysis completes, the tool generates a Dockerfile based on your workflow. Review the Dockerfile to see what will be installed. The generated Dockerfile uses the [runpod-workers/worker-comfyui](https://github.com/runpod-workers/worker-comfyui) base image, which provides a clean ComfyUI installation without any models. The Dockerfile then adds installation commands for all your custom nodes, model download instructions with URLs, proper Python dependencies, and optimized caching layers for faster builds. + +You can review and tweak the Dockerfile if needed, or skip ahead to the next step to create your GitHub repository. + + + Review the Dockerfile carefully to ensure all your custom nodes and models are + included. The AI is quite thorough, but you may want to add additional + customizations for your specific use case. + + +## Step 5: Create your repository + +Now you can create a GitHub repository with all the necessary files for Serverless deployment. Since you authenticated with GitHub in step 1, the tool can create the repository on your behalf. Enter your desired repository name and optionally provide a description. + +The tool creates a repository containing your Dockerfile and a README with setup and deployment instructions. + +Click **Create GitHub repo** and watch the real-time progress as the tool creates your repository. Once complete, you'll see a success message with a direct link to your new GitHub repository. + +## Step 6: Deploy to Runpod Serverless + +With your GitHub repository created, you're ready to deploy to Runpod Serverless using the [GitHub integration](/serverless/workers/github-integration). This is where Runpod builds your Docker image from the Dockerfile and deploys it to an endpoint. 
+ +Navigate to [create a new endpoint](https://console.runpod.io/serverless/new-endpoint) in the Runpod console. Under "Import Git Repository", select your newly created repository from the dropdown menu. You'll need to connect your GitHub account to Runpod once if you haven't already. + +Configure your deployment settings including which branch to deploy from (typically `main`), the path to your Dockerfile (typically in the root), and your GPU requirements based on your workflow's needs. + + + For detailed instructions on deploying from GitHub, including how to authorize + Runpod with your GitHub account and configure deployment options, see [Deploy + workers from GitHub](/serverless/workers/github-integration). + + +After clicking "Deploy Endpoint", Runpod builds the Docker image from your repository and deploys it to your endpoint automatically. You can monitor the build status in the "Builds" tab of your endpoint detail page. Once the build is complete, you can use the **example API request** provided in step 5 of the tool to test your endpoint. + +## Understanding the repository files + +The generated repository includes the essential files needed to build your Serverless worker. + +The `Dockerfile` contains all the instructions to build your worker image. It starts from the [runpod-workers/worker-comfyui](https://github.com/runpod-workers/worker-comfyui) base image, which provides a clean ComfyUI installation along with all the handler code needed to run as a Serverless endpoint. Your Dockerfile then adds your custom nodes and downloads your required models. When you deploy via [GitHub integration](/serverless/workers/github-integration), Runpod uses this Dockerfile to build your Docker image. + +The `README.md` provides documentation about your workflow and deployment instructions. + + + You can customize the Dockerfile after creation. For example, you might want + to add additional dependencies, adjust model download locations, or optimize + the build process. + + +## Updating your deployed worker + +When you make changes to your repository, such as updating the Dockerfile to add new models or custom nodes, Runpod automatically detects these changes. + +The [GitHub integration](/serverless/workers/github-integration) monitors your `main` branch for changes. When you push commits to `main`, Runpod automatically rebuilds the Docker image and updates your endpoint with the new version. You can monitor the rebuild progress in the "Builds" tab of your endpoint detail page. + +For more details on managing deployments and monitoring builds, see [Deploy workers from GitHub](/serverless/workers/github-integration). + +## Troubleshooting + + + + **Problem:** The tool shows an error or doesn't detect my workflow properly. + **Solution:** + {""} + 1. Make sure you exported using **Comfy → File → Export** (not the API + export). + {""} + 2. The file should be named `workflow.json` and contain the complete + workflow structure. + {""} + 3. If you used the API export, go back to ComfyUI and use File → Export + instead. + + + **Problem:** The generated Dockerfile doesn't include all your custom nodes + or models. + {""} + **Solution:** + {""} + 1. Review the analysis results to see what was detected. + {""} + 2. Some very new or obscure custom nodes may not be found automatically. + {""} + 3. You can manually edit the Dockerfile in your GitHub repository to add + missing components. + {""} + 4. Check the ComfyUI Registry to find installation instructions for missing + nodes. 
+ + + +## Related documentation + +- [Deploy workers from GitHub](/serverless/workers/github-integration) +- [Handler functions](/serverless/workers/handler-functions) +- [Creating a Dockerfile](/serverless/workers/deploy#creating-a-dockerfile) +- [Serverless overview](/serverless/overview) +- [Managing endpoints](/serverless/endpoints/manage-endpoints) diff --git a/community-solutions/overview.mdx b/community-solutions/overview.mdx index dd46e4e0..8a07d5e4 100644 --- a/community-solutions/overview.mdx +++ b/community-solutions/overview.mdx @@ -17,6 +17,14 @@ Community tools and solutions are provided as-is and maintained by their creator Explore these community-created tools that can enhance your Runpod workflow: + + Web-based tool that analyzes ComfyUI workflows and automatically generates Dockerfiles and GitHub repositories for deployment on Runpod Serverless. + + + Submit a support request using our contact page. + + Email help@runpod.io for direct support. + Check the status of Runpod services and infrastructure. Join the Runpod community on Discord. - + diff --git a/pods/choose-a-pod.mdx b/pods/choose-a-pod.mdx index c9e71fea..d6a84801 100644 --- a/pods/choose-a-pod.mdx +++ b/pods/choose-a-pod.mdx @@ -56,6 +56,12 @@ When selecting a Pod, consider these strategies for balancing performance and co 3. For extended usage, explore Runpod's [savings plans](/pods/pricing#savings-plans) to optimize your spending while ensuring access to the resources you need. +## Secure Cloud vs Community Cloud + +Secure Cloud operates in T3/T4 data centers with high reliability, redundancy, security, and fast response times to minimize downtime. It's designed for sensitive and enterprise workloads. + +Community Cloud connects individual compute providers to users through a peer-to-peer GPU computing platform. Hosts are invite-only and vetted to maintain quality standards. Community Cloud offers competitive pricing with good server quality, though with less redundancy for power and networking compared to Secure Cloud. + ## Next steps Once you've determined your resource requirements, you can learn how to: diff --git a/pods/manage-pods.mdx b/pods/manage-pods.mdx index 865ca4da..2634821f 100644 --- a/pods/manage-pods.mdx +++ b/pods/manage-pods.mdx @@ -158,12 +158,12 @@ With custom templates, you can: If your Pod has a [network volume](/storage/network-volumes) attached, it cannot be stopped, only terminated. When you terminate the Pod, data in the `/workspace` directory will be preserved in the network volume, and you can regain access by deploying a new Pod with the same network volume attached. -When a Pod is stopped, data in the container volume is cleared, but data in the `/workspace` directory is preserved. To learn more about how Pod storage works, see [Storage overview](/pods/storage/types). +When a Pod is stopped, data in the container disk is cleared, but data in the `/workspace` directory is preserved. To learn more about how Pod storage works, see [Storage overview](/pods/storage/types). -By stopping a Pod you are effectively releasing the GPU on the machine, and you may be reallocated 0 GPUs when you start the Pod again. For more info, see the [FAQ](/references/faq#why-do-i-have-zero-gpus-assigned-to-my-pod%3F). +By stopping a Pod you are effectively releasing the GPU on the machine, and you may be reallocated [zero GPUs](/references/troubleshooting/zero-gpus) when you start the Pod again. 
-After a Pod is stopped, you will still be charged for its [disk volume](/pods/storage/types#disk-volume) storage. If you don't need to retain your Pod environment, you should terminate it completely. +After a Pod is stopped, you will still be charged for its [volume disk](/pods/storage/types#volume-disk) storage. If you don't need to retain your Pod environment, you should terminate it completely. @@ -255,6 +255,8 @@ pod "wu5ekmn69oh1xr" started with $0.290 / hr + + ## Terminate a Pod @@ -328,3 +330,27 @@ Pods provide two types of logs to help you monitor and troubleshoot your workloa - **System logs** provide detailed information about your Pod's lifecycle, such as container creation, image download, extraction, startup, and shutdown events. To view your logs, open the [Pods page](https://www.console.runpod.io/pods), expand your Pod, and click the **Logs** button. This gives you real-time access to both container and system logs, making it easy to diagnose issues or monitor your Pod's activity. + +## Troubleshooting + +Below are some common issues and solutions for troubleshooting Pod deployments. + +### Zero GPU Pods + +See [Zero GPU Pods on restart](/references/troubleshooting/zero-gpus). + +### Pod stuck on initializing + +If your Pod is stuck on initializing, check for these common issues: + +1. You're trying to SSH into the Pod but didn't provide an idle job like `sleep infinity` to keep it running. +2. The Pod received a command it can't execute. Check your logs for syntax errors or invalid commands. + +If you need help, [contact support](https://www.runpod.io/contact). + +### Docker daemon limitations + +Runpod manages the Docker daemon for you, which means you can't run your own Docker instance inside a Pod. This prevents you from building Docker containers or using tools like Docker Compose. + +To work around this, create a [custom template](/pods/templates/overview) with the Docker image you need. + diff --git a/pods/overview.mdx b/pods/overview.mdx index 3ca996bf..d4ea4054 100644 --- a/pods/overview.mdx +++ b/pods/overview.mdx @@ -32,11 +32,11 @@ Templates eliminate the need to manually set up environments, saving time and re Pods offer three types of storage to match different use cases: -Every Pod comes with a resizable **container volume** that houses the operating system and stores temporary files, which are cleared after the Pod stops. +Every Pod comes with a resizable **container disk** that houses the operating system and stores temporary files, which are cleared after the Pod stops. -**Disk volumes** provide persistent storage that is preserved throughout the Pod's lease, functioning like a dedicated hard drive. Data stored in the disk volume directory (`/workspace` by default) persists when you stop the Pod, but is erased when the Pod is deleted. +**Volume disks** provide persistent storage that is preserved throughout the Pod's lease, functioning like a dedicated hard drive. Data stored in the volume disk directory (`/workspace` by default) persists when you stop the Pod, but is erased when the Pod is deleted. -Optional [network volumes](/pods/storage/network-volumes) provide more flexible permanent storage that can be transferred between Pods, replacing the disk volume when attached. When using a Pod with network volume attached, you can safely delete your Pod without losing the data stored in your network volume directory (`/workspace` by default). 
+Optional [network volumes](/pods/storage/network-volumes) provide more flexible permanent storage that can be transferred between Pods, replacing the volume disk when attached. When using a Pod with network volume attached, you can safely delete your Pod without losing the data stored in your network volume directory (`/workspace` by default). To learn more, see [Storage options](/pods/storage/types). diff --git a/pods/pricing.mdx b/pods/pricing.mdx index b27cbe42..c55e511a 100644 --- a/pods/pricing.mdx +++ b/pods/pricing.mdx @@ -136,7 +136,7 @@ Runpod offers [three types of storage](/pods/storage/types) for Pods:: You are not charged for storage if the host machine is down or unavailable from the public internet. -Container and disk volume storage will be included in your Pod's displayed hourly cost during deployment. +Container and volume disk storage will be included in your Pod's displayed hourly cost during deployment. Runpod is not designed as a long-term cloud storage system. Storage is provided to support compute tasks. We recommend regularly backing up critical data to your local machine or to a dedicated cloud storage provider. @@ -144,7 +144,7 @@ Runpod is not designed as a long-term cloud storage system. Storage is provided ## Pricing for stopped Pods -When you [stop a Pod](/pods/manage-pods#stop-a-pod), you will no longer be charged for the Pod's hourly GPU cost, but will continue to be charged for the Pod's disk volume at a rate of \$0.20 per GB per month. +When you [stop a Pod](/pods/manage-pods#stop-a-pod), you will no longer be charged for the Pod's hourly GPU cost, but will continue to be charged for the Pod's volume disk at a rate of \$0.20 per GB per month. ## Tracking costs and savings plans diff --git a/pods/storage/create-network-volumes.mdx b/pods/storage/create-network-volumes.mdx index 02da4868..37d79b94 100644 --- a/pods/storage/create-network-volumes.mdx +++ b/pods/storage/create-network-volumes.mdx @@ -6,7 +6,7 @@ description: "Persistent, portable storage for Pods." Network volumes offer persistent storage that exists independently of the lifecycle of a Pod. This means your data is retained even if a Pod is terminated. They can be attached to multiple Pods, making them ideal for sharing data, transferring storage between machines, or maintaining datasets. -When attached to a Pod, a network volume replaces the Pod's default disk volume and is typically mounted at ` /workspace`. +When attached to a Pod, a network volume replaces the Pod's default volume disk and is typically mounted at ` /workspace`. For a comparison with other storage types, see [Storage options](/pods/storage/types). diff --git a/pods/storage/types.mdx b/pods/storage/types.mdx index 3a879a4f..b114bd87 100644 --- a/pods/storage/types.mdx +++ b/pods/storage/types.mdx @@ -7,21 +7,21 @@ Choosing the right type of storage is crucial for optimizing your workloads, whe This page describes the different types of storage options available for your Pods, and when to use each in your workflow. -## Container volume +## Container disk -A container volume houses the operating system and provides temporary storage for a Pod. It's created when a Pod is launched and is directly tied to the Pod's lifecycle. +A container disk houses the operating system and provides temporary storage for a Pod. It's created when a Pod is launched and is directly tied to the Pod's lifecycle. -## Disk volume +## Volume disk -A disk volume provides persistent storage that remains available for the duration of the Pod's lease. 
It functions like a dedicated hard drive, allowing you to store data that needs to be retained even if the Pod is stopped or rebooted. +A volume disk provides persistent storage that remains available for the duration of the Pod's lease. It functions like a dedicated hard drive, allowing you to store data that needs to be retained even if the Pod is stopped or rebooted. -The disk volume is mounted at `/workspace` by default (this will be replaced by the network volume if one is attached). This can be changed by [editing your Pod configuration](#modifying-storage-capacity). +The volume disk is mounted at `/workspace` by default (this will be replaced by the network volume if one is attached). This can be changed by [editing your Pod configuration](#modifying-storage-capacity). ## Network volume -[Network volumes](/storage/network-volumes) offer persistent storage similar to the disk volume, but with the added benefit that they can be attached to multiple Pods, and that they persist independently from the Pod's lifecycle. This allows you to share and access data across multiple instances or transfer storage between machines, and retain data even after a Pod is deleted. +[Network volumes](/storage/network-volumes) offer persistent storage similar to the volume disk, but with the added benefit that they can be attached to multiple Pods, and that they persist independently from the Pod's lifecycle. This allows you to share and access data across multiple instances or transfer storage between machines, and retain data even after a Pod is deleted. -When attached to a Pod, a network volume replaces the disk volume, and by default they are similarly mounted at `/workspace`. +When attached to a Pod, a network volume replaces the volume disk, and by default they are similarly mounted at `/workspace`. @@ -32,7 +32,7 @@ Network volumes must be attached during Pod creation, and cannot be unattached l ## Storage type comparison This table provides a comparative overview of the storage types available for your Pods: -| Feature | Container Volume | Disk Volume | Network Volume | +| Feature | Container Disk | Volume Disk | Network Volume | | :---------------- | :------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------------------------------------------- | | **Data persistence** | Volatile (lost on stop/restart) | Persistent (retained until Pod deletion) | Permanent (retained independently from Pod lifecycles) | | **Lifecycle** | Tied directly to the Pod's active session | Tied to the Pod's lease period | Independent, can outlive Pods | @@ -51,17 +51,17 @@ Here's what you should consider when selecting storage for your Pods: ## Modifying storage capacity -To update the size of a Pod's container or disk volume: +To update the size of a Pod's container or volume disk: 1. Navigate to the [Pod page](https://console.runpod.io/pod) in the Runpod console. 2. Expand the Pod you want to modify. 3. Click the hamburger menu in the bottom right of the Pod card and select **Edit Pod**. -4. Adjust the storage capacity for the container or disk volume. Disk volume size can be increased, but not decreased. +4. Adjust the storage capacity for the container or volume disk. Volume disk size can be increased, but not decreased. 5. Click **Save** to apply the changes. 
-Editing a running Pod will cause it to reset completely, erasing all data that isn't stored in your disk/network volume mount path (`/workspace` by default). +Editing a running Pod will cause it to reset completely, erasing all data that isn't stored in your volume disk/network volume mount path (`/workspace` by default). diff --git a/references/billing-information.mdx b/references/billing-information.mdx index f7d1e8d5..10679a7d 100644 --- a/references/billing-information.mdx +++ b/references/billing-information.mdx @@ -1,10 +1,47 @@ --- title: "Billing information" +description: "Understand how billing works for Pods, storage, network volumes, refunds, and spending limits." --- All billing, including per-hour compute and storage billing, is charged per minute. -For more information on billing questions, see [Billing FAQ](/references/faq/#billing). +## How billing works + +Every Pod has an hourly cost based on [GPU type](/references/gpu-types). Your Runpod credits are charged every minute the Pod is running. If you run out of credits, your Pods are automatically stopped and you'll receive an email notification. Pods are eventually terminated if you don't refill your credits. + + +Runpod pre-emptively stops all your Pods when your account balance is projected to cover less than 10 minutes of remaining runtime. This ensures your account retains a small balance to help preserve your data volumes. If your balance is completely drained, all Pods are subject to deletion. Setting up [automatic payments](https://www.console.runpod.io/user/billing) is recommended to avoid service interruptions. + + +You must have at least one hour's worth of runtime in your balance to rent a Pod at your given spec. If your balance is insufficient, consider renting the Pod on Spot, depositing additional funds, or lowering your GPU spec requirements. + +## Storage billing + +Storage billing varies depending on Pod state. Running Pods are charged \$0.10 per GB per month for all storage, while stopped Pods are charged \$0.20 per GB per month for volume storage. + +Storage is charged per minute. You are not charged for storage if the host machine is down or unavailable from the public internet. + +## Network volume billing + +Network volumes are billed hourly based on storage size. For storage below 1TB, you'll pay \$0.07 per GB per month. Above 1TB, the rate drops to \$0.05 per GB per month. + +Network volumes are hosted on storage servers located in the same datacenters where you rent GPU servers. These servers are connected via a high-speed local network (25Gbps to 200Gbps depending on location) and use NVME SSDs for storage. + + +If your machine-based storage or network volume is terminated due to lack of funds, that disk space is immediately freed up for use by other clients. Runpod cannot assist in recovering lost storage. Runpod is not designed as a cloud storage system—storage is provided to support running tasks on GPUs. Back up critical data regularly to your local machine or a dedicated cloud storage provider. + + +## Refunds and withdrawals + +Runpod does not offer the option to withdraw your unused balance after depositing funds. When you add funds to your Runpod account, credits are non-refundable and can only be used for Runpod services. Only deposit the amount you intend to use. + +If you aren't sure if Runpod is right for you, you can load as little as \$10 into your account to try things out. 
Visit the [Discord community](https://discord.gg/pJ3P2DbUUq) to ask questions or email [help@runpod.io](mailto:help@runpod.io). Refunds and trial credits are not currently offered due to processing overhead. + +If you have questions about billing or need assistance planning your Runpod expenses, contact support at [help@runpod.io](mailto:help@runpod.io). + +## Spending limits + +Spending limits are implemented for newer accounts to prevent fraud. These limits grow over time and should not impact normal usage. If you need an increased spending limit, [contact support](https://www.runpod.io/contact) and share your use case. ### Payment methods diff --git a/references/faq.mdx b/references/faq.mdx deleted file mode 100644 index d900b35f..00000000 --- a/references/faq.mdx +++ /dev/null @@ -1,204 +0,0 @@ ---- -title: "Frequently asked questions" -sidebarTitle: "FAQ" ---- - -## Secure Cloud vs Community Cloud - -Runpod provides two cloud computing services: [Secure Cloud](https://www.console.runpod.io/gpu-secure-cloud) and [Community Cloud.](https://www.console.runpod.io/gpu-cloud) - -**Secure Cloud** runs in T3/T4 data centers by our trusted partners. Our close partnership comes with high-reliability with redundancy, security, and fast response times to mitigate any downtimes. For any sensitive and enterprise workloads, we highly recommend Secure Cloud. - -**Community Cloud** brings power in numbers and diversity spanning the whole world. Through our decentralized platform, we can offer peer-to-peer GPU computing that connects individual compute providers to compute consumers. Our Community Cloud hosts are invite-only and vetted by us, and still have to abide by our standards. Even though their associated infrastructure might not offer as much redundancy for power and networking, they still offer good servers that combine quality and affordability. - -Both solutions offer far more competitive prices than large cloud providers such as AWS or GCP. - -## On-Demand vs. Spot Pod - -**On-Demand Pods** can run forever without interruptions with resources dedicated to your Pod. They do incur higher costs than Spot Pods. - -**Spot Pods** use spare compute capacity, allowing you to bid for those compute resources. Resources are dedicated to your Pod, but someone else can bid higher or start an On-Demand Pod that will stop your Pod. When this happens, your Pod is given a signal to stop 5 seconds prior with SIGTERM, and eventually, the kill signal SIGKILL after 5 seconds. You can use volumes to save any data to the disk in that 5s period or push data to the cloud periodically. - -### How does Runpod work? - -Runpod leverages technologies like [Docker](https://www.docker.com/) to containerize and isolate guest workloads on a host machine. We have built a decentralized platform where thousands of servers can be connected to offer a seamless experience for all users. - -### Where can I go for help? - -We'd be happy to help! Join our community on [Discord](https://discord.gg/pJ3P2DbUUq), message us in our support chat, or email us at [help@runpod.io](mailto:help@runpod.io). - -### What is Runpod's policy on refunds and credits? - -If you aren't sure if Runpod is for you, feel free to hang out in our [Discord](https://discord.gg/cUpRmau42V) to ask questions or email [help@runpod.io](mailto:help@runpod.io) You can load as little as \$10 into your account to try things out. We don't currently offer refunds or trial credits due to the overhead of processing these requests. Please plan accordingly! 
- -## What are Pods? - -*** - -### What is an On-Demand instance? - -**On-Demand instances** are for non-interruptible workloads. You pay the On-Demand price and cannot be displaced by other users if you have funds to keep your Pod running. - -### What is a Spot instance? - -A **Spot instance** is an interruptible instance that can generally be rented much cheaper than an On-Demand one. Spot instances are great for stateless workloads like an API or for workloads you can periodically save to a volume disk. Your volume disk is retained even if your Spot instance is interrupted. - -### What is a Savings Plan? - -Savings Plans are a way for you to pay up-front and get a discount for it. This is great for when you know you will need prolonged access to compute. You can learn more on the about [Savings Plans here](/pods/pricing#savings-plans). - -## Billing - -All billing, including per-hour compute and storage billing, is charged per minute. - -### How does Pod billing work? - -Every Pod has an hourly cost based on GPU type. Your Runpod credits are charged for the Pod every minute as long as the Pod is running. If you ever run out of credits, your Pods will be automatically stopped, and you will get an email notification. Eventually, Pods will be terminated if you don't refill your credit. **We pre-emptively stop all of your Pods if you get down to 10 minutes of remaining run time. This gives your account enough balance to keep your data volumes around in the case you need access to your data. Please plan accordingly.** - -Once a balance has been completely drained, all pods are subject to deletion at the discretion of the service. An attempt will be made to hold the pods for as long as possible, but this should not be relied upon! We highly recommend setting up [automatic payments](https://www.console.runpod.io/user/billing) to ensure balances are automatically topped up as needed. - - - -You must have at least one hour's worth of time in your balance to rent a Pod at your given spec. If your balance is insufficient to rent a Pod, then consider renting the Pod on Spot, depositing additional funds, or lowering your GPU spec requirements. - - - -### How does storage billing work? - -We currently charge \$0.10/GB/month for all storage on running Pods and \$0.20/GB/month for volume storage on stopped Pods. Storage is tied to compute servers, and we want to ensure active users have enough space to run their workloads. Storage is charged per minute, and we never charge users if the host machine is down or unavailable from the public internet. - -### How does Network Volume billing work? - -For storage requirements below 1TB, we charge a competitive rate of \$0.07/GB/Month. If your storage requirements exceed 1TB, we provide a cost-effective pricing of \$0.05/GB/Month. This ensures that you receive significant savings as your data storage scales. - -When you choose to create a Network Volume, you gain access to our robust infrastructure, which includes state-of-the-art storage servers located in the same datacenters where you rent GPU servers from us. These servers are connected via a high-speed 25Gbps local network, up to 200 Gbps in some locations, guaranteeing efficient data transfer and minimal latency. Everything is stored on high-speed NVME SSDs to ensure best performance. - -Network volumes are billed on a per-hour basis. 
Please note that if your machine-based storage or network volume is terminated due to lack of funds, that disk space is immediately freed up for use by other clients, and Runpod is unable to assist in recovering lost storage. Runpod is also not designed to be a cloud storage system; storage is provided in the pursuit of running tasks using its GPUs, and not meant to be a long-term backup solution. It is highly advisable to continually back up anything you want to save offsite locally or to a cloud provider. - -### Can I withdraw my unused balance? - -No, Runpod does not offer the option to withdraw your unused balance after depositing funds into your account. When you add funds to your Runpod account, these credits are non-refundable and can only be used for Runpod services. - - - -When depositing funds into your Runpod account, please be aware that you cannot withdraw your balance once it has been added. Only deposit the amount you intend to use for Runpod services. - - - -We recommend carefully considering the amount you wish to deposit based on your expected usage of our services. If you have any questions about billing or need assistance in planning your Runpod expenses, please don't hesitate to contact our support team at [help@runpod.io](mailto:help@runpod.io). - -## Security - -*** - -### Is my data protected from other clients? - -Yes. Your data is run in a multi-tenant environment where other clients can't access your pod. For sensitive workloads requiring the best security, please use Secure Cloud. - -### Is my data protected from the host of the machine my Pod is running on? - -Data privacy is important to us at Runpod. Our Terms of Service prohibit hosts from trying to inspect your Pod data or usage patterns in any way. If you want the highest level of security, use Secure Cloud. - -## Usability - -*** - -### What can I do in a Runpod Pod? - -You can run any Docker container available on any publicly reachable container registry. If you are not well versed in containers, we recommend sticking with the default run templates like our Runpod PyTorch template. However, if you know what you are doing, you can do a lot more! - -### Does Runpod support and maintain Community Templates? - -**No, Runpod does not provide customer support for community templates.** Community templates are Docker container images created and shared by individual users and organizations in the Runpod community. While you can use these templates, you are responsible for ensuring they work for your use case. - -If you encounter issues with a community template: -- Contact the template creator directly for support. -- Check the template's documentation or repository for troubleshooting guides. -- Ask for help in the [Runpod Discord community](https://discord.gg/runpod). -- Consider creating your own custom template based on your specific needs. - -Runpod support can only assist with platform-related issues (Pod deployment, billing, network connectivity, etc.) and does not extend to the contents or functionality of community templates. - -### Can I run my own Docker daemon on Runpod? - -You can't currently spin up your own instance of Docker, as we run Docker for you! Unfortunately, this means that you cannot currently build Docker containers on Runpod or use things like Docker Compose. Many use cases can be solved by creating a custom template with the Docker image that you want to run. - -### My Pod is stuck on initializing. What gives? - -Usually, this happens for one of several reasons. 
If you can't figure it out, [contact us](https://www.runpod.io/contact), and we'll gladly help you. - -1. You are trying to run a Pod to SSH into, but you did not give the Pod an idle job to run like "sleep infinity." -2. You have given your Pod a command that it doesn't know how to run. Check the logs to make sure that you don't have any syntax errors, etc. - -### Can I run Windows? - -We don't currently support Windows. We want to do this in the future, but we do not have a solid timeframe for Windows support. - -### How do I find a reliable server in Community Cloud? - -Runpod needs to provide you with reliable servers. All of our listed servers must meet minimum reliability, and most are running in a data center! However, if you want the highest level of reliability and security, use Secure Cloud. Runpod calculates server reliability by maintaining a heartbeat with each server in real-time. - -### Why do I have zero GPUs assigned to my Pod? - -Most of our machines have between 4 and 8 GPUs per physical machine. When you start a Pod, it is locked to a specific physical machine. If you keep it running (On-Demand), then that GPU cannot be taken from you. However, if you stop your Pod, it becomes available for a different user to rent. When you want to start your Pod again, your specific machine may be wholly occupied. In this case, we give you the option to spin up your Pod with zero GPUs so you can retain access to your data. - -Remember that this does not mean there are no more GPUs of that type available, just none on the physical machine that specific Pod is locked to. Note that transfer Pods have limited computing capabilities, so transferring files using a UI may be difficult, and you may need to resort to terminal access or cloud sync options. - -If you want to avoid this, using network volumes is the best choice. [Learn how to use them here](/storage/network-volumes). - -#### What are Network Volumes? - -Network volumes allow you to share data between Pods and generally be more mobile with your important data. This feature is only available in specific secure cloud data centers, but we are actively rolling it out to more and more of our secure cloud footprint. If you use network volumes, you should rarely run into situations where you cannot use your data with a GPU without a file transfer. - -[Read about it here](/storage/network-volumes). - -## What if? - -*** - -### What if I run out of funds? - -All your Pods are stopped automatically when you don't have enough funds to keep your Pods running for at least ten more minutes. When your Pods are stopped, your container disk data will be lost, but your volume data will be preserved. Pods are scheduled for removal if adequate credit balance is not maintained. If you fail to do so, your Pods will be terminated, and Pod volumes will be removed. - -After you add more funds to your account, you can start your Pod if you wish (assuming enough GPUs are available on the host machine). - -### What if the machine that my Pod is running loses power? - -If the host machine loses power, it will attempt to start your Pod again when it returns online. Your volume data will be preserved, and your container will run the same command as it ran the first time you started renting it. Your container disk and anything in memory will be lost! - -### What if my Pod loses internet connectivity? - -The host machine continues to run your Pod to the best of its ability, even if it is not connected to the internet. 
If your job requires internet connectivity, then it will not function. You will not be charged if the host loses internet connectivity, even if it continues to run your job. You may, of course, request that your Pod exit while the host is offline, and it will exit your Pod when it regains network connectivity. - -### What if it says that my spending limit has been exceeded? - -We implement a spending limit for newer accounts that will grow over time. This is because we have found that sometimes scammers try to interfere with the natural workings of the platform. We believe that this limit should not impact normal usage. We would be delighted to up your spending limit if you [contact us](https://www.runpod.io/contact) and share your use case. - -## Legal - -*** - -### Do you have some legal stuff I can look at? - -Sure, do! Take a look at our [legal page](https://www.runpod.io/legal). - -## GDPR Compliance - -At Runpod, we take data protection and privacy seriously. We have implemented robust policies, procedures, and technical measures to ensure compliance with the GDPR requirements. - -### Is Runpod compliant with GDPR for data processed in Europe? - -Yes, Runpod is fully compliant with the General Data Protection Regulation (GDPR) requirements for any data processed within our European data center regions. - -### What measures does Runpod take to ensure GDPR compliance? - -For servers hosted in GDPR-compliant regions like the European Union, we ensure: - -* **Data processing procedures**: We have established clear procedures for the collection, storage, processing, and deletion of personal data, ensuring transparency and accountability in our data processing activities. -* **Data protection measures**: We have implemented appropriate technical and organizational measures to safeguard personal data against unauthorized access, disclosure, alteration, and destruction. -* **Consent mechanisms**: We obtain and record consent from individuals for the processing of their personal data in accordance with GDPR requirements, and we provide mechanisms for individuals to withdraw consent if desired. -* **Rights of data subjects**: We facilitate the rights of data subjects under the GDPR, including the right to access, rectify, erase, or restrict the processing of their personal data, and we handle data subject requests promptly and efficiently. -* **Data transfer mechanisms**: We ensure lawful and secure transfer of personal data outside the EU, where applicable, in compliance with GDPR requirements, utilizing appropriate mechanisms such as adequacy decisions, standard contractual clauses, or binding corporate rules. -* **Compliance monitoring**: We regularly monitor and review our GDPR compliance to ensure ongoing effectiveness and adherence to regulatory requirements, conducting data protection impact assessments and internal audits as needed. - -For any inquiries or concerns regarding our GDPR compliance or our data protection practices, reach out to our team through email at [help@runpod.io](mailto:help@runpod.io). diff --git a/references/faq/manage-cards.mdx b/references/faq/manage-cards.mdx deleted file mode 100644 index c189eb52..00000000 --- a/references/faq/manage-cards.mdx +++ /dev/null @@ -1,29 +0,0 @@ ---- -title: "Manage Payment Card Declines" ---- - -Runpod is a US-based organization that serves clients all across the world. 
However, credit card processors have in general keyed into international transactions as a potential vector for fraud and tend to apply more stringent standards for blocking transactions. If your card is declined, don't panic! To minimize potential interruptions to your service, you'll want to follow these steps. - -**Keep your balance topped up** - -To avoid any potential issues with your balance being overrun, it's best to refresh your balance at least a few days before you're due to run out so you have a chance to address any last minute delays. Also be aware that there is an option to automatically refresh your balance when you run low under the Billing [page](https://www.console.runpod.io/user/billing): - - - - - -**Call the bank that issued your card** - -Once you do experience a card decline, the first step you'll want to do is to contact your issuing bank to see why a card is declined. Due to consumer/merchant privacy standards in the US, we are not provided with a reason that the card is declined, only that the transaction was not processed. Only your issuing bank can specifically tell you why a payment was declined. Many times, declines are for completely innocent reasons, such as your bank's anti-fraud protection tripping; just the same, Runpod is unable to assist with blocks put in place by your bank. - -It's important that you call your bank for the initial decline before trying a different card, because the processor may block *all* funding attempts from an account if it seems declines from multiple cards for the same account, even if these attempts would have otherwise not had any problems. These account blocks generally clear after 24 hours, but it may be difficult to load the account until then. - -**Other potential reasons for card blocks** - -Our payment processor may block cards for specific users based on their risk profile, so certain use patterns may trigger a block. If you use several different cards within a short period time, or have had disputed transactions in the past, this may also cause cards to decline. - -To see a list of supported cards on Stripe, [click here](https://stripe.com/docs/payments/cards/supported-card-brands%3E). - -**Contact us for support** - -If all else fails, then feel free to contact [Runpod support](https://www.runpod.io/contact) if you are still having trouble loading your account. We ask that you check with your bank first, but if everything checks out on your end, we will be glad to help! diff --git a/references/security-and-compliance.mdx b/references/security-and-compliance.mdx new file mode 100644 index 00000000..e85fa374 --- /dev/null +++ b/references/security-and-compliance.mdx @@ -0,0 +1,37 @@ +--- +title: "Data security and legal compliance" +sidebarTitle: "Security and compliance" +description: "Information about data security, GDPR compliance, and legal resources." +--- + +This page explains how Runpod secures your data, complies with privacy regulations, and where to find legal documentation. + +## Multi-tenant isolation + +Your Pods and workers run in a multi-tenant environment with containerized isolation that prevents other users from accessing your data. Each Pod/worker operates in its own container with strict separation from other workloads. + +For sensitive workloads requiring enhanced security, Secure Cloud operates in T3/T4 data centers with enterprise-grade security, high reliability, redundancy, and fast response times. 
+ +## Host access policies + +Runpod's [terms of service](https://www.runpod.io/legal/terms-of-service) prohibit hosts from inspecting your Pod/worker data or analyzing your usage patterns. Any violation results in immediate removal from the platform. + +For workloads requiring the highest level of security, Secure Cloud provides vetted infrastructure partners who meet enterprise security standards including SOC 2, ISO 27001, and PCI DSS certifications. + +## GDPR compliance + +Runpod is fully compliant with the General Data Protection Regulation (GDPR) for data processed in European data center regions. The platform implements comprehensive policies, procedures, and technical measures to meet GDPR requirements. + +### Compliance measures + +For servers hosted in GDPR-compliant regions like the European Union, Runpod maintains clear procedures for the collection, storage, processing, and deletion of personal data. These procedures ensure transparency and accountability in data processing activities. + +Technical and organizational measures protect personal data against unauthorized access, disclosure, alteration, and destruction. Runpod obtains and records consent from individuals for processing their personal data, and provides mechanisms for individuals to withdraw consent. + +The platform facilitates data subject rights under GDPR, including the right to access, rectify, erase, or restrict the processing of personal data. Data subject requests are handled promptly and efficiently. + +For lawful transfer of personal data outside the EU, Runpod uses appropriate mechanisms such as adequacy decisions, standard contractual clauses, or binding corporate rules. Regular monitoring and internal audits ensure ongoing GDPR compliance, including data protection impact assessments as needed. + +## Legal resources + +For detailed information about terms, policies, and legal agreements, visit the [Runpod legal page](https://www.runpod.io/legal). \ No newline at end of file diff --git a/references/troubleshooting/manage-payment-cards.mdx b/references/troubleshooting/manage-payment-cards.mdx new file mode 100644 index 00000000..c1df39ef --- /dev/null +++ b/references/troubleshooting/manage-payment-cards.mdx @@ -0,0 +1,38 @@ +--- +title: "Manage payment card declines" +description: "Learn how to troubleshoot declined payment cards and prevent service interruptions on Runpod." +--- + +Payment card declines can occur when adding funds to your Runpod account. Credit card processors apply stringent fraud detection standards, particularly for international transactions. This document provides guidance to help you troubleshoot payment issues. + +**Keep your balance topped up** + +To prevent service interruptions, refresh your balance at least a few days before it runs out. This gives you time to address any payment delays. + +You can enable automatic balance refresh from the [Billing page](https://www.console.runpod.io/user/billing): + + + + + +## Contact your card issuer + +If your card is declined, contact your issuing bank to determine the reason. Due to privacy standards, payment processors only indicate that a transaction was not processed without providing specific details. Your bank can tell you why the payment was declined. + +Card declines often occur for routine reasons, such as anti-fraud protection. Your bank can resolve blocks they have placed on your card. + + + +Contact your bank about the initial decline before trying a different card. 
The payment processor may block all funding attempts from your account if it detects multiple card declines, even if those cards would otherwise work. These account blocks typically clear after 24 hours. + + + +## Other reasons for card blocks + +The payment processor may block cards based on user risk profiles. Using several different cards within a short period or having disputed transactions in the past may trigger card declines. + +For a list of supported card brands, see [Stripe's supported cards documentation](https://stripe.com/docs/payments/cards/supported-card-brands). + +## Contact support + +If you're still having trouble after checking with your bank, contact [Runpod support](https://www.runpod.io/contact) for assistance. diff --git a/references/troubleshooting/storage-full.mdx b/references/troubleshooting/storage-full.mdx index 6c23c687..33b8370e 100644 --- a/references/troubleshooting/storage-full.mdx +++ b/references/troubleshooting/storage-full.mdx @@ -34,14 +34,16 @@ tmpfs 252G 0 252G 0% /sys/devices/virtual/powerca ## Key areas to check -**Container Disk Usage**: The primary storage area for your container is mounted on the `overlay` filesystem. This indicates the container’s root directory. +**Container Disk Usage**: The primary storage area for your container is mounted on the `overlay` filesystem. This indicates the container's root directory. ```sh Filesystem Size Used Avail Use% Mounted on overlay 20G 16M 20G 1% / ``` -You can use the command `du -sh .` to check the space usage of the current directory. The default volume of container volume or network volume is mounted at `/workspace`, You can check the usage with the following example:: +You can use the command `du -sh .` to check the space usage of the current directory. + +By default, the volume disk or network volume is mounted at `/workspace`, You can check the usage with the following example:: ```sh root@9b8e325167b2:/# cd workspace/ diff --git a/references/troubleshooting/zero-gpus.mdx b/references/troubleshooting/zero-gpus.mdx new file mode 100644 index 00000000..2b3e2c3a --- /dev/null +++ b/references/troubleshooting/zero-gpus.mdx @@ -0,0 +1,29 @@ +--- +title: "Zero GPU Pods on restart" +sidebarTitle: "Zero GPU Pods" +description: "What to do when your Pod machine has zero GPUs." +--- + +When you restart a stopped Pod, you might see a message telling you that there are "Zero GPU Pods." This is because there are no GPUs available on the machine where your Pod was running. + +## Why does this happen? + +When you deploy a Pod, it's assigned to a GPU on a specific physical machine. This creates a link between your Pod and that particular piece of hardware. As long as your Pod is running, that GPU is exclusively reserved for you. When you stop your Pod, you release that specific GPU, allowing other users to rent it. Your Pod's [volume storage](/pods/storage/types) remains on the physical machine, but the GPU slot becomes available. + +If another user rents that GPU while your Pod is stopped, the GPU will be occupied when you try to restart. Because your Pod is still tied to that original machine, it cannot start with a GPU. + +When this happens, Runpod gives you the option to start the Pod with zero GPUs. This is primarily a data recovery feature, allowing you to access your Pod's volume disk without access to the GPU. + +## What are my options? + +If you encounter this situation, you have three choices: + +1. **Start with zero GPUs for data access**: Start the Pod without a GPU to access its local storage. 
This is useful for retrieving files, but the Pod will have limited CPU resources and is not suitable for compute tasks. You should use this option to back up or transfer your data before terminating the Pod. +2. **Wait and retry**: You can wait and try to restart the Pod again later. The GPU may become available if the other user stops their Pod, but there is no guarantee of when that will happen. +3. **Terminate and redeploy**: If you need a GPU immediately, terminate the current Pod and deploy a new one with the same configuration. The new Pod will be scheduled on any machine in the Runpod network with an available GPU of your chosen type. + +## How do I prevent this? + +The most effective way to avoid this issue is to use **[network volumes](/storage/network-volumes)**. + +Network volumes decouple your data from a specific physical machine. Your `/workspace` data is stored on a separate, persistent volume that can be attached to any Pod. If you need to terminate a Pod, you can simply deploy a new one and attach the same network volume, giving you immediate access to your data on a new machine with an available GPU. \ No newline at end of file diff --git a/runpodctl/reference/runpodctl-create-pod.mdx b/runpodctl/reference/runpodctl-create-pod.mdx index 0f33cdf1..7d2279ac 100644 --- a/runpodctl/reference/runpodctl-create-pod.mdx +++ b/runpodctl/reference/runpodctl-create-pod.mdx @@ -33,7 +33,7 @@ runpodctl create pod [flags] --secureCloud create in secure cloud --templateId string templateId to use with the pod --vcpu int minimum vCPUs needed (default 1) - --volumePath string container volume path (default "/runpod") + --volumePath string volume disk path (default "/workspace") --volumeSize int persistent volume disk size in GB (default 1) --networkVolumeId string network volume id ``` diff --git a/runpodctl/reference/runpodctl-create-pods.mdx b/runpodctl/reference/runpodctl-create-pods.mdx index 04340d9c..5602ca83 100644 --- a/runpodctl/reference/runpodctl-create-pods.mdx +++ b/runpodctl/reference/runpodctl-create-pods.mdx @@ -32,7 +32,7 @@ runpodctl create pods [flags] --ports strings ports to expose; max only 1 http and 1 tcp allowed; e.g. '8888/http' --secureCloud create in secure cloud --vcpu int minimum vCPUs needed (default 1) - --volumePath string container volume path (default "/runpod") + --volumePath string volume disk path (default "/workspace") --volumeSize int persistent volume disk size in GB (default 1) ``` diff --git a/sdks/graphql/manage-pods.mdx b/sdks/graphql/manage-pods.mdx index 5ea6c7a2..70cba509 100644 --- a/sdks/graphql/manage-pods.mdx +++ b/sdks/graphql/manage-pods.mdx @@ -773,7 +773,7 @@ query GpuTypes { ### Check GPU type availability -You can check if a specific [GPU type](/references/gpu-types) is available in the [Secure Cloud or Community Cloud](/references/faq#secure-cloud-vs-community-cloud). +You can check if a specific [GPU type](/references/gpu-types) is available in the [Secure Cloud or Community Cloud](/pods/overview#pod-types). The `stockStatus` field can help you determine the likelihood that a particular GPU type will be available when creating a Pod. For example, if `stockStatus` is `"Low"`, it means there are very few GPUs of that type available. 
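+
+As a convenience, here's a minimal Python sketch that runs a `GpuTypes` query and prints the `stockStatus` for a single GPU type. The field layout mirrors the `GpuTypes` query shown above; treat the exact schema, the example GPU ID, and the placeholder API key as assumptions to adapt for your own account.
+
+```python
+import requests
+
+RUNPOD_API_KEY = "YOUR_API_KEY"  # Placeholder: use your own Runpod API key
+
+# This query mirrors the GpuTypes example above; adjust the GPU id and
+# requested fields to match your needs.
+query = """
+query GpuTypes {
+  gpuTypes(input: {id: "NVIDIA GeForce RTX 4090"}) {
+    id
+    displayName
+    lowestPrice(input: {gpuCount: 1}) {
+      stockStatus
+    }
+  }
+}
+"""
+
+response = requests.post(
+    f"https://api.runpod.io/graphql?api_key={RUNPOD_API_KEY}",
+    json={"query": query},
+    timeout=30,
+)
+response.raise_for_status()
+
+# Print the stock status for each returned GPU type.
+for gpu in response.json()["data"]["gpuTypes"]:
+    lowest_price = gpu.get("lowestPrice") or {}
+    print(f"{gpu['displayName']}: stockStatus={lowest_price.get('stockStatus')}")
+```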
diff --git a/serverless/endpoints/send-requests.mdx b/serverless/endpoints/send-requests.mdx index 3b7cfd4b..693ba462 100644 --- a/serverless/endpoints/send-requests.mdx +++ b/serverless/endpoints/send-requests.mdx @@ -607,6 +607,7 @@ async function main() { } main(); +``` diff --git a/serverless/load-balancing/build-a-worker.mdx b/serverless/load-balancing/build-a-worker.mdx index 16b20128..37b8901b 100644 --- a/serverless/load-balancing/build-a-worker.mdx +++ b/serverless/load-balancing/build-a-worker.mdx @@ -203,13 +203,9 @@ curl -X GET "https://ENDPOINT_ID.api.runpod.ai/stats" \ After sending a request, your workers will take some time to initialize. You can track their progress by checking the logs in the **Workers** tab of your endpoint page. -If you see the following error: +If you see: `{"error":"no workers available"}%` after running the request, this means your workers did not initialize in time to process it. If you try running the request again, this will usually resolve the issue. -``` -{"error":"no workers available"}% -``` - -This means your workers did not initialize in time to process the request. If you try running the request again, this will usually resolve the issue. +For production applications, implement a health check with retries before sending requests. See [Handling cold start errors](/serverless/load-balancing/overview#handling-cold-start-errors) for a complete code example. Congratulations! You've now successfully deployed and tested a load balancing endpoint. If you want to use a real model, you can follow the [vLLM worker](/serverless/load-balancing/vllm-worker) tutorial. diff --git a/serverless/load-balancing/overview.mdx b/serverless/load-balancing/overview.mdx index b48b78c3..e3bed8a7 100644 --- a/serverless/load-balancing/overview.mdx +++ b/serverless/load-balancing/overview.mdx @@ -187,6 +187,63 @@ If your server is misconfigured and the ports are not correctly opened, your wor +## Handling cold start errors + +When you first send a request to a load balancing endpoint, you might get a "no workers available" error. This happens because workers need time to initialize, i.e. the server is up, but the health check at `/ping` isn't passing yet. + +For production applications, you should implement a health check with retries before sending your actual requests. 
+ +Here's a Python function that handles this: + +```python +import requests +import time + +def health_check_with_retry(base_url, api_key, max_retries=3, delay=2): + """Simple health check with retry logic for Runpod cold starts""" + headers = {"Authorization": f"Bearer {api_key}"} + + for attempt in range(max_retries): + try: + response = requests.get(f"{base_url}/ping", headers=headers, timeout=10) + if response.status_code == 200: + print("✓ Health check passed") + return True + except Exception as e: + print(f"Attempt {attempt + 1} failed: {e}") + + if attempt < max_retries - 1: + time.sleep(delay) + + print("✗ Health check failed after retries") + return False + +# Usage example +base_url = "https://ENDPOINT_ID.api.runpod.ai" +api_key = "RUNPOD_API_KEY" + +# Ensures that a worker is ready (with retries) +if health_check_with_retry(base_url, api_key): + # Worker is ready, send your actual /generate request + response = requests.post( + f"{base_url}/generate", + headers={"Authorization": f"Bearer {api_key}"}, + json={"prompt": "Hello, world!"} + ) + print(response.json()) +else: + print("Worker failed to initialize") +``` + +The `health_check_with_retry` function: + +- Sends requests to the `/ping` endpoint with configurable retries (default: 3 attempts). +- Waits between attempts to give workers time to initialize (default: 2 seconds). +- Uses a 10-second timeout per health check request. +- Returns `True` when the worker is ready, or `False` if initialization fails. + +Use at least 3 retries with 5-10 second delays between attempts. This gives workers enough time to complete their cold start process before you send production requests. + ## Technical details The load balancing system employs an HTTP load balancer that inspects application-level protocols to make routing decisions. When a request arrives at `https://ENDPOINT_ID.api.runpod.ai/PATH`, the system: diff --git a/serverless/load-balancing/vllm-worker.mdx b/serverless/load-balancing/vllm-worker.mdx index 437d919b..24223ab7 100644 --- a/serverless/load-balancing/vllm-worker.mdx +++ b/serverless/load-balancing/vllm-worker.mdx @@ -605,13 +605,9 @@ curl -X POST "https://ENDPOINT_ID.api.runpod.ai/v1/chat/completions" \ After sending a request, your workers will take some time to initialize. You can track their progress by checking the logs in the **Workers** tab of your endpoint page. -If you see the following error: +If you see: `{"error":"no workers available"}%` after running the request, this means your workers did not initialize in time to process it. If you try running the request again, this will usually resolve the issue. -``` -{"error":"no workers available"}% -``` - -This means your workers did not initialize in time to process the request. If you try running the request again, this will usually resolve the issue. +For production applications, implement a health check with retries before sending requests. See [Handling cold start errors](/serverless/load-balancing/overview#handling-cold-start-errors) for a complete code example. Congrats! You've created a load balancing vLLM endpoint and used it to serve a large language model. 
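+
+As an optional next step, you can call the endpoint from Python instead of curl. Here's a minimal sketch using the OpenAI client pointed at your load balancing endpoint; the endpoint ID, API key, model name, and prompt are placeholders, so substitute the model your vLLM worker actually serves.
+
+```python
+from openai import OpenAI
+
+# Placeholders: substitute your endpoint ID, Runpod API key, and served model.
+client = OpenAI(
+    api_key="RUNPOD_API_KEY",
+    base_url="https://ENDPOINT_ID.api.runpod.ai/v1",
+)
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-8B",  # The model your vLLM worker is serving
+    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
+    max_tokens=100,
+)
+
+print(response.choices[0].message.content)
+```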
diff --git a/serverless/overview.mdx b/serverless/overview.mdx index 4ae414ae..37991597 100644 --- a/serverless/overview.mdx +++ b/serverless/overview.mdx @@ -7,7 +7,7 @@ Runpod Serverless is a cloud computing platform that lets you run AI models and - ## Network volumes for Instant Clusters @@ -182,4 +169,160 @@ The S3-compatible API supports standard S3 operations including file uploads, do The S3-compatible API is currently available for network volumes in the following datacenters: `EUR-IS-1`, `EU-RO-1`, `EU-CZ-1`, `US-KS-2`, `US-CA-2`. - \ No newline at end of file + + + +## Migrate files + +You can migrate files between network volumes (including between data centers) using the following methods: + +### Using runpodctl + +The simplest way to migrate files between network volumes is to use `runpodctl send` and `receive` on two running Pods. + +Before you begin, you'll need: + +- A source network volume containing the data you want to migrate. +- A destination network volume (which can be empty or contain existing data). + + + +Deploy two Pods using the default Runpod PyTorch template. Each Pod should have one [network volume attached](#attach-to-a-pod). + +1. Deploy the first Pod in the source data center and attach the source network volume. +2. Deploy the second Pod in the target data center and attach the target network volume. +3. Start the [web terminal](/pods/connect-to-a-pod#web-terminal) in both Pods. + + + +Using your source Pod's web terminal, navigate to the network volume directory (usually `/workspace`): + +```bash +cd workspace +``` + + + +Use `runpodctl send` to start the transfer. To transfer the entire volume: + +```bash +runpodctl send * +``` + +You can also specify specific files or directories instead of `*`. + + + +After running the send command, copy the `receive` command from the output. It will look something like this: + +```bash +runpodctl receive 8338-galileo-collect-fidel +``` + + + +Using your destination Pod's web terminal, navigate to the network volume directory (usually `/workspace`): + +```bash +cd workspace +``` + + + +Paste and run the `receive` command you copied earlier: + +```bash +runpodctl receive 8338-galileo-collect-fidel +``` + +The transfer will begin and show progress as it copies files from the source to the destination volume. + + + +For a visual walkthrough using JupyterLab, check out this video tutorial: + + + + + +### Using rsync over SSH + +For faster migration speed and more reliability for large transfers, you can use `rsync` over SSH on two running Pods. + +Before you begin, you'll need: + +- A network volume in the source data center containing the data you want to migrate. +- A network volume in the target data center (which can be empty or contain existing data). + + + +Deploy two Pods using the default Runpod PyTorch template. Each Pod should have one [network volume attached](#attach-to-a-pod). + +1. Deploy the first Pod in the source data center and attach the source network volume. +2. Deploy the second Pod in the target data center and attach the target network volume. +3. Start the [web terminal](/pods/connect-to-a-pod#web-terminal) in both Pods. + + + +On the source Pod, install required packages and generate an SSH key pair: + +```bash +apt update && apt install -y vim rsync && \ +ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -q && \ +cat ~/.ssh/id_ed25519.pub +``` + +Copy the public key that appears in the terminal output. 
+ + + +On the destination Pod, install required packages and add the source Pod's public key to `authorized_keys`: + +```bash +apt update && apt install -y vim rsync && \ +ip=$(printenv RUNPOD_PUBLIC_IP) && \ +port=$(printenv RUNPOD_TCP_PORT_22) && \ +echo "rsync -avzP --inplace -e \"ssh -p $port\" /workspace/ root@$ip:/workspace" && \ +vi ~/.ssh/authorized_keys +``` + +In the editor that opens, paste the public key you copied from the source Pod, then save and exit (press `Esc`, type `:wq`, and press `Enter`). + +The command above also displays the `rsync` command you'll need to run on the source Pod. Copy this command for the next step. + + + +On the source Pod, run the `rsync` command from the previous step. If you didn't copy it, you can construct it manually using the destination Pod's IP address and port number. + +```bash +# Replace DESTINATION_PORT and DESTINATION_IP with values from the destination Pod +rsync -avzP --inplace -e "ssh -p DESTINATION_PORT" /workspace/ root@DESTINATION_IP:/workspace + +# Example: +rsync -avzP --inplace -e "ssh -p 18598" /workspace/ root@157.66.254.13:/workspace +``` + +The `rsync` command displays progress as it transfers files. Depending on the size of your data, this may take some time. + + + +After the `rsync` command completes, verify the data transfer by checking disk usage on both Pods: + +```bash +du -sh /workspace +``` + +The destination Pod should show similar disk usage to the source Pod if all files transferred successfully. + + +You can run the `rsync` command multiple times if the transfer is interrupted. The `--inplace` flag ensures that `rsync` resumes from where it left off rather than starting over. + + + \ No newline at end of file diff --git a/storage/s3-api.mdx b/storage/s3-api.mdx index 847410d1..2748d810 100644 --- a/storage/s3-api.mdx +++ b/storage/s3-api.mdx @@ -112,7 +112,7 @@ You can use the S3-compatible API to interact with your Runpod network volumes u * [AWS s3api CLI](https://docs.aws.amazon.com/cli/latest/reference/s3api/). * [The Boto3 Python library](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html). -Core AWS CLI operations such as `ls`, `cp`, `mv`, `rm`, and `sync` function as expected. +Standard AWS CLI operations such as `ls`, `cp`, `mv`, and `rm` work as expected for most file operations. The `sync` command works for basic use cases but may encounter issues with large numbers of files (10,000+) or complex directory structures. @@ -530,6 +530,19 @@ Workarounds: + + +The `aws s3 sync` command has limited support in Runpod's S3-compatible API. While it works for basic use cases (syncing small numbers of files and simple directory structures), you may encounter errors when syncing directories with very large numbers of files (over 10,000) or complex nested structures. + +Common issues include: +- EOF errors with 200 OK responses. +- Duplicate ContinuationToken errors in ListObjectsV2. +- Intermittent AccessDenied errors. + +To sync large numbers of files, consider breaking your sync operations into smaller batches or using individual `cp` commands for better reliability. Full `sync` support is in development. + + + When uploading large files (10GB+), you may encounter timeout errors during the `CompleteMultipartUpload` operation. To resolve this, we recommend using the [multipart upload helper script](#uploading-very-large-files). 
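+
+Related to the `sync` limitation described above, here's a minimal Python sketch (using Boto3) that uploads a local directory file by file, which is often more reliable than `aws s3 sync` for very large file counts. The endpoint URL, region, credential environment variables, and network volume ID are placeholders — substitute the values for your own datacenter and volume.
+
+```python
+import os
+import boto3
+
+# Placeholders: substitute your datacenter's S3 endpoint URL and region,
+# your S3 API key pair, and your network volume ID (used as the bucket name).
+ENDPOINT_URL = "https://s3api-eu-ro-1.runpod.io"
+REGION = "eu-ro-1"
+VOLUME_ID = "your_network_volume_id"
+
+s3 = boto3.client(
+    "s3",
+    endpoint_url=ENDPOINT_URL,
+    region_name=REGION,
+    aws_access_key_id=os.environ["RUNPOD_S3_ACCESS_KEY"],
+    aws_secret_access_key=os.environ["RUNPOD_S3_SECRET_KEY"],
+)
+
+def upload_directory(local_dir: str, prefix: str = "") -> None:
+    """Upload files one at a time instead of relying on `aws s3 sync`."""
+    for root, _dirs, files in os.walk(local_dir):
+        for name in files:
+            local_path = os.path.join(root, name)
+            # Build the object key relative to the directory being uploaded.
+            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
+            key = f"{prefix}/{rel}" if prefix else rel
+            s3.upload_file(local_path, VOLUME_ID, key)
+            print(f"uploaded {local_path} -> s3://{VOLUME_ID}/{key}")
+
+upload_directory("./my-dataset", prefix="datasets/my-dataset")
+```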
diff --git a/tutorials/pods/comfyui.mdx b/tutorials/pods/comfyui.mdx index e080d9de..3accff57 100644 --- a/tutorials/pods/comfyui.mdx +++ b/tutorials/pods/comfyui.mdx @@ -52,7 +52,7 @@ First, you'll deploy a Pod using a template that pre-installs ComfyUI and the Co Configure your Pod with these settings: - **GPU selection:** Choose an L40 or RTX 4090 for optimal performance with SDXL-Turbo. Lower VRAM GPUs may work for smaller models. - - **Storage:** The default container and disk volume sizes set by the template should be sufficient for SDXL-Turbo. You can also add a [network volume](/storage/network-volumes) to your Pod if you want persistent storage. + - **Storage:** The default container and volume disk sizes set by the template should be sufficient for SDXL-Turbo. You can also add a [network volume](/storage/network-volumes) to your Pod if you want persistent storage. - **Deployment type:** Select **On-Demand** for flexibility.