# Deploying a vision-capable LLM

This tutorial walks you through deploying a vision-capable LLM using Ray Serve LLM.  

---

## Prerequisites

* Access to GPU compute.
* (Optional) A **Hugging Face token** if you use gated models such as Meta’s Llama. Export it as an environment variable: `export HF_TOKEN=<YOUR-TOKEN-HERE>`.

> You can usually request access on the model's Hugging Face page. Approval time depends on the organization. For example, approval for Meta’s Llama models can take anywhere from a few hours to several weeks.

**Dependencies:**  
```bash
pip install "ray[serve,llm]"
```

---

## Configure Ray Serve LLM

You can configure your deployment using the Ray Serve LLM Python SDK for fast iteration, or use a Ray Serve config file for better integration and long-term maintainability in your systems.

To run a gated model, set your Hugging Face token in the configuration (see the commented-out `hf_token` line in each example).

::::{tab-set}

:::{tab-item} Python SDK

Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object.
```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen-VL",
        model_source="qwen/Qwen2.5-VL-7B-Instruct",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=2, max_replicas=2,
        )
    ),
    engine_kwargs=dict(
        max_model_len=8192,
        ### Uncomment if your model is gated and needs your Hugging Face token to access it
        #hf_token=os.environ["HF_TOKEN"],
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})

serve.run(app, blocking=True)
```

:::

:::{tab-item} Serve Config (YAML)

In your Ray Serve config file:
```yaml
applications:
- name: vision-llm-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwen-VL
          model_source: qwen/Qwen2.5-VL-7B-Instruct
        accelerator_type: L40S
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 4
        engine_kwargs:
          max_model_len: 8192
          ### Uncomment if your model is gated and needs your Hugging Face token to access it
          #hf_token: <YOUR-TOKEN-HERE>
```

:::

::::

---

## Deploy

There are different ways to deploy your service depending on how you defined it.  

::::{tab-set}

:::{tab-item} From the Python SDK

Follow the instructions at [Configure Ray Serve LLM (Python SDK)](#configure-ray-serve-llm) to define your app in a Python module `serve_my_qwen_VL.py`.  
> Make sure your script runs your application with `serve.run(app, blocking=True)`.  

In a terminal, run:  
```bash
python serve_my_qwen_VL.py
```

:::

:::{tab-item} From a Serve Config (YAML)  

Follow the instructions at [Configure Ray Serve LLM (Serve Config)](#configure-ray-serve-llm) to define your app with a Ray Serve config file `serve_my_qwen_VL.yaml`.  

In a terminal, run:
```bash
serve run serve_my_qwen_VL.yaml
```

:::

::::

---

## Sending Requests with Images

> Follow the [Deployment instructions](#deploy) to launch your application on your Ray Cluster.

Deployment typically takes a few minutes as the cluster is provisioned, the vLLM server starts, and the model is downloaded.
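
Because startup can take several minutes, it can help to wait for the endpoint before sending traffic. The following is a minimal sketch, not part of Ray Serve LLM: `wait_until_ready` is a hypothetical helper, and it assumes the OpenAI-compatible app answers `GET /v1/models` without authentication once the model is loaded, as is typical for a local deployment.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=600.0, interval_s=5.0):
    """Poll the model listing route until it answers with HTTP 200.

    Returns True once the endpoint responds, False if timeout_s elapses first.
    Assumption: GET <base_url>/v1/models is reachable without an auth header.
    """
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet: connection refused, timeout, or an HTTP error.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` with the defaults polls the local endpoint every 5 seconds for up to 10 minutes. If your deployment requires a token, this sketch would need an `Authorization` header added to the request.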

**Retrieve your authentication token and endpoint**  
If you run locally, your model is available at `http://localhost:8000` (use this value for `<YOUR-ENDPOINT-HERE>`), and you can use a placeholder authentication token such as `FAKE_KEY` (for `<YOUR-TOKEN-HERE>`).
  > **Note:** The OpenAI client requires an `api_key` argument, but local deployments don't need a real key, so a placeholder is enough.  

Otherwise, retrieve both your endpoint and authentication token from your deployment environment’s dashboard or logs.
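
One way to keep the endpoint and token out of your scripts is to read them from environment variables and fall back to the local defaults above. This is a sketch under assumptions: `SERVE_LLM_ENDPOINT`, `SERVE_LLM_TOKEN`, and `load_client_settings` are hypothetical names chosen for illustration, not something Ray Serve defines.

```python
import os


def load_client_settings(env=None):
    """Return (base_url, api_key) suitable for an OpenAI-compatible client.

    SERVE_LLM_ENDPOINT and SERVE_LLM_TOKEN are hypothetical variable names
    used only in this sketch.
    """
    env = os.environ if env is None else env
    # Fall back to the local defaults when the variables are not set.
    endpoint = env.get("SERVE_LLM_ENDPOINT", "http://localhost:8000")
    api_key = env.get("SERVE_LLM_TOKEN", "FAKE_KEY")
    # The OpenAI-compatible routes live under /v1.
    base_url = endpoint.rstrip("/") + "/v1"
    return base_url, api_key
```

The returned pair can be passed as `base_url` and `api_key` when constructing the client.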

**Send a request**  
Use the `model_id` defined in your config (here, `my-qwen-VL`) to query your model.

::::{tab-set}

:::{tab-item} Example Curl
```bash
curl -X POST <YOUR-ENDPOINT-HERE>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR-TOKEN-HERE>" \
  -d '{
        "model": "my-qwen-VL",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What do you see in this image?"},
              {"type": "image_url", "image_url": {
                "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
              }}
            ]
          }
        ]
      }'
```

:::

:::{tab-item} Example Python
```python
from urllib.parse import urljoin
import base64
from openai import OpenAI

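# Replace the placeholders with your values (for example, "FAKE_KEY" and "http://localhost:8000" when running locally)
api_key = "<YOUR-TOKEN-HERE>"
base_url = "<YOUR-ENDPOINT-HERE>"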

client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=api_key)

### From an image locally saved as `example.jpg`
# Load and encode image as base64
with open("example.jpg", "rb") as f:
    img_base64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="my-qwen-VL",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}}
            ]
        }
    ],
    temperature=0.5,
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

### From an image's URI
response = client.chat.completions.create(
    model="my-qwen-VL",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}}
            ]
        }
    ],
    temperature=0.5,
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```

:::

::::
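
The Python example above hardcodes `image/jpeg` in the data URL. If you send local images in several formats, a small helper can infer the MIME type from the file extension. This is a minimal sketch using only the standard library; `image_to_data_url` and `image_part` are hypothetical helper names, and the sketch assumes the file extension reflects the actual image format.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a data URL for an `image_url` content part."""
    # Infer the MIME type from the extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(path):
    """Build the content part in the same shape as the requests above."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

You can then place `image_part("example.jpg")` in the `content` list next to the text part.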

---

## Limiting Images per Prompt

Ray Serve LLM uses [vLLM](https://docs.vllm.ai/en/latest/) as its backend engine. You can configure vLLM by passing parameters through the `engine_kwargs` section of your Serve LLM configuration. For a full list of supported options, see the [vLLM documentation](https://docs.vllm.ai/en/latest/configuration/engine_args.html#multimodalconfig).  

In particular, you can limit the number of images per request by setting `limit_mm_per_prompt` in your configuration.  
```yaml
applications:
- ...
  args:
    llm_configs:
        ...
        engine_kwargs:
          ...
          limit_mm_per_prompt: {"image": 3}
```
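
On the client side, you may want to catch oversized prompts before they reach the server. The sketch below is one possible guard, not part of Ray Serve LLM or vLLM: `build_user_message` is a hypothetical helper, and its `max_images` default of 3 is an assumption that should mirror the `limit_mm_per_prompt` value in your configuration.

```python
def build_user_message(text, image_urls, max_images=3):
    """Build one user message, rejecting prompts that exceed the image limit.

    max_images should mirror the server's limit_mm_per_prompt["image"] value.
    """
    image_urls = list(image_urls)
    if len(image_urls) > max_images:
        raise ValueError(
            f"Got {len(image_urls)} images, but the limit is {max_images} per prompt"
        )
    # One text part followed by one image_url part per image.
    content = [{"type": "text", "text": text}]
    content += [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    return {"role": "user", "content": content}
```

The returned dictionary has the same shape as the entries in `messages` in the request examples, so you can pass it as `messages=[build_user_message(...)]`.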

---

## Summary

In this tutorial, you deployed a vision-capable LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM, deploy your service on your Ray cluster, send requests with images, and limit the number of images per prompt.