# Vision Language Model

SGLang serves vision language models with the same launch command and OpenAI-compatible APIs as completion models. Example models include:

- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- [lmms-lab/llava-onevision-qwen2-7b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov)


## Launch a Server

The Python cell below launches the server. It is equivalent to running this in the shell:

```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
 --port=30010 --chat-template=llama_3_vision
```

Remember to add `--chat-template=llama_3_vision` to select the vision chat template. Otherwise, the server only supports text input.

```python
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

# Start the server in the background; the returned handle is used later to shut it down.
embedding_process = execute_shell_command(
    """
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
        --port=30010 --chat-template=llama_3_vision

"""
)

# Block until the server is ready to accept requests.
wait_for_server("http://localhost:30010")
```


## Using cURL

```python
import subprocess, json, os

curl_command = """
curl -s http://localhost:30010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer None" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)
```

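The JSON body above sits inside a single-quoted shell string, so an ASCII apostrophe in the prompt (as in `What's`) would end the quoted `-d` argument early. One way around this, sketched below under the assumption that you drive curl from Python as this notebook does, is to build the request as an argument list with `json.dumps` and skip the shell entirely. `build_curl_argv` is a hypothetical helper, not part of SGLang.

```python
import json
import shlex


def build_curl_argv(url, payload):
    """Hypothetical helper: curl arguments as a list, so no shell quoting is involved."""
    return [
        "curl",
        "-s",
        url,
        "-H",
        "Content-Type: application/json",
        "-H",
        "Authorization: Bearer None",
        "-d",
        json.dumps(payload),  # Passed as one argument; an apostrophe cannot break quoting
    ]


payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

argv = build_curl_argv("http://localhost:30010/v1/chat/completions", payload)
print(shlex.join(argv))  # Shell-safe string, for copying into a terminal

# With the server running, pass the list directly (no shell=True):
# response = json.loads(subprocess.check_output(argv))
```

Passing a list to `subprocess.check_output` hands each element to curl unchanged, which is why no escaping is needed; `shlex.join` is only used to print a copy-pasteable version.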


## Using requests

```python
import requests, json

url = "http://localhost:30010/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer None"
}
data = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What’s in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}

response = requests.post(url, headers=headers, json=data)
print_highlight(response.json())
```

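The server replies in the OpenAI chat-completion format, so the generated text sits at `choices[0].message.content`, the same path the OpenAI client example reads as `response.choices[0].message.content`. A minimal sketch of pulling it out of the JSON returned by `requests`, with a guard for responses that carry no `choices`; `extract_reply` is a hypothetical helper, and the sample dict is made up to show the shape, not real server output.

```python
def extract_reply(response_json):
    """Hypothetical helper: return the assistant text from an OpenAI-style chat completion."""
    choices = response_json.get("choices")
    if not choices:
        # Error responses have no "choices"; surface whatever the server sent instead.
        raise ValueError(f"Unexpected response: {response_json}")
    return choices[0]["message"]["content"]


# Made-up response with the same shape as a /v1/chat/completions reply.
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "A wooden boardwalk through a grassy field."},
            "finish_reason": "stop",
        }
    ],
}

print(extract_reply(sample))

# With the server running:
# print_highlight(extract_reply(response.json()))
```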


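## Using the OpenAI-Compatible API

The example below downloads the image, base64-encodes it, and sends it inline as a `data:image/jpeg` URL through the `openai` Python client.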

```python
import base64
import os

import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30010/v1", api_key="None")


def encode_image(image_path):
    """Return the contents of a local image file as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def download_image(image_url, image_path):
    """Download image_url to image_path, raising an exception on failure."""
    # Make sure the parent directory exists. For a bare filename such as
    # "boardwalk.jpeg", os.path.dirname() returns "" and os.makedirs("")
    # raises FileNotFoundError, so only create a directory when there is one.
    directory = os.path.dirname(image_path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    response = requests.get(image_url)
    response.raise_for_status()  # Check that the request succeeded

    with open(image_path, "wb") as f:
        f.write(response.content)

    print(f"Image successfully downloaded to {image_path}")


# Usage example
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_path = "boardwalk.jpeg"  # Make sure this path is valid and writable
download_image(image_url, image_path)

# Send the image inline as a base64 data URL.
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)
```

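The example above hard-codes `image/jpeg` in the data URL. If you send files in other formats, a minimal sketch is to derive the MIME type from the file extension with the standard-library `mimetypes` module. `image_to_data_url` is a hypothetical helper; whether the server accepts a particular image format is not covered on this page.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Hypothetical helper: read an image file and return a base64 data URL."""
    # guess_type() only looks at the extension, e.g. "x.png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Usage inside a message, in place of the hand-built data URL:
# {"type": "image_url", "image_url": {"url": image_to_data_url("boardwalk.jpeg")}}
```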

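## Multiple Image Inputs

To send several images, add one `image_url` part per image to the message `content`. The example below passes the same image twice.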

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30010/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Are there any differences between these two images?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
```

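Each extra image is simply another `image_url` part in the same `content` list. A minimal sketch that builds such a message for any number of images; `build_multi_image_message` is a hypothetical helper, the example.com URLs are placeholders, and putting the text part first just mirrors the example above.

```python
def build_multi_image_message(question, image_urls):
    """Hypothetical helper: a user message with one text part and one part per image URL."""
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


# Placeholder URLs; replace them with real image URLs or base64 data URLs.
message = build_multi_image_message(
    "Are there any differences between these two images?",
    ["https://example.com/first.jpg", "https://example.com/second.jpg"],
)
print(message)

# With the server running:
# response = client.chat.completions.create(
#     model="meta-llama/Llama-3.2-11B-Vision-Instruct",
#     messages=[message],
#     max_tokens=300,
# )
```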

```python
# Shut down the server and delete the downloaded image.
terminate_process(embedding_process)
os.remove(image_path)
```

## Chat Template

As mentioned above, if you do not pass `--chat-template` for a vision model, the server falls back to the default Hugging Face chat template, which only supports text.

To add your own chat template, see [custom chat template](../references/custom_chat_template.md).

Popular vision models and their chat templates:

- [meta-llama/Llama-3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) uses `llama_3_vision`.
- [LLaVA-NeXT](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff) uses `chatml-llava`.
- [llama3-llava-next](https://huggingface.co/lmms-lab/llama3-llava-next-8b) uses `llava_llama_3`.
- [llava-onevision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) uses `chatml-llava`.
- [liuhaotian/llava-v1.5 / 1.6](https://huggingface.co/liuhaotian/llava-v1.5-13b) uses `vicuna_v1.1`.
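When scripting launches, the list above can be kept as a small lookup table. A minimal sketch, assuming the Hugging Face model IDs from the links above (the LLaVA-NeXT entry is a collection, so it is left out); `CHAT_TEMPLATES` and `launch_command` are hypothetical names, and the flags are the ones used earlier on this page.

```python
import shlex

# Model IDs from the links above, mapped to the chat template listed for each.
CHAT_TEMPLATES = {
    "meta-llama/Llama-3.2-11B-Vision-Instruct": "llama_3_vision",
    "lmms-lab/llama3-llava-next-8b": "llava_llama_3",
    "lmms-lab/llava-onevision-qwen2-7b-ov": "chatml-llava",
    "liuhaotian/llava-v1.5-13b": "vicuna_v1.1",
}


def launch_command(model_path, port=30010):
    """Hypothetical helper: build the launch_server command with the matching chat template."""
    template = CHAT_TEMPLATES.get(model_path)
    if template is None:
        # Launching without --chat-template would leave the server text-only.
        raise KeyError(f"No vision chat template listed for {model_path}")
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        f"--port={port}",
        f"--chat-template={template}",
    ]
    return shlex.join(argv)


print(launch_command("lmms-lab/llava-onevision-qwen2-7b-ov"))
```

Failing loudly on an unknown model, rather than launching without a template, avoids silently getting a server that ignores images.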