# OpenAI APIs - Vision

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).
This tutorial covers the vision APIs for vision language models.

SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, and Qwen2-VL:
- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)  
- [lmms-lab/llava-onevision-qwen2-72b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat)  
- [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)  

## Launch A Server

The code cell below is equivalent to running the following command in your terminal and waiting for the server to be ready:

```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --port 30010 --chat-template llama_3_vision
```

Vision language models need an explicit `--chat-template` (here, `--chat-template llama_3_vision`) because the chat template provided in the Hugging Face tokenizer only supports text.
Without this flag, the server only supports text input.

```python
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

# Launch the server as a background process.
embedding_process = execute_shell_command(
"""
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
    --port=30010 --chat-template=llama_3_vision
"""
)

# Block until the server is ready to accept requests.
wait_for_server("http://localhost:30010")
```
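To make "wait for the server to be ready" concrete, here is a minimal sketch of a readiness poll. It is not the implementation of `wait_for_server`. The helper names `models_endpoint_ready` and `poll_until_ready` are hypothetical, and the sketch assumes the server answers `GET /v1/models` with HTTP 200 once the model is loaded, which this tutorial does not state.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ready(base_url: str) -> bool:
    """Return True if GET <base_url>/v1/models answers with HTTP 200.

    The /v1/models path is an assumption made for illustration.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def poll_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 1.0,
) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Under these assumptions, `poll_until_ready(lambda: models_endpoint_ready("http://localhost:30010"))` would return `True` once the endpoint responds and `False` if the timeout expires first.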


## Using curl

Once the server is up, you can send test requests using `curl` or the Python `requests` library.

```python
import subprocess

curl_command = """
curl -s http://localhost:30010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer None" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'
"""

# Run curl in a shell and print the raw JSON response.
response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
```

[2024-11-01 21:05:28 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 6462, cache hit rate: 49.97%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-11-01 21:05:29 TP0] Decode batch. #running-req: 1, #token: 6467, token usage: 0.05, gen throughput (token/s): 1.59, #queue-req: 0
[2024-11-01 21:05:29] INFO:     127.0.0.1:32892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
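The curl call returns the raw JSON body as a string. A minimal sketch of pulling out just the model's reply follows, assuming the body uses the OpenAI chat-completions layout (`choices[0].message.content`). The helper name `extract_reply` and the sample body are illustrative and hand-written, not real server output.

```python
import json


def extract_reply(body: str) -> str:
    """Return the assistant text from a chat-completions JSON body."""
    payload = json.loads(body)
    # Assumed layout: {"choices": [{"message": {"content": "..."}}]}
    return payload["choices"][0]["message"]["content"]


# Hand-written stand-in for a server response (not actual output).
sample_body = json.dumps(
    {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "A placeholder reply."},
            }
        ]
    }
)

print(extract_reply(sample_body))
```

With the cell above, `extract_reply(response)` would take the place of printing the full JSON.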


## Using requests

```python
import requests

url = "http://localhost:30010/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer None"
}
data = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What’s in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}

# POST the JSON payload and print the raw response body.
response = requests.post(url, headers=headers, json=data)
print_highlight(response.text)
```

[2024-11-01 21:06:37 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 6462, cache hit rate: 66.63%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-11-01 21:06:38 TP0] Decode batch. #running-req: 1, #token: 6478, token usage: 0.05, gen throughput (token/s): 0.58, #queue-req: 0
[2024-11-01 21:06:39 TP0] Decode batch. #running-req: 1, #token: 6518, token usage: 0.05, gen throughput (token/s): 33.71, #queue-req: 0
[2024-11-01 21:06:40 TP0] Decode batch. #running-req: 1, #token: 6558, token usage: 0.05, gen throughput (token/s): 33.66, #queue-req: 0
[2024-11-01 21:06:41 TP0] Decode batch. #running-req: 1, #token: 6598, token usage: 0.05, gen throughput (token/s): 33.64, #queue-req: 0
[2024-11-01 21:06:42 TP0] Decode batch. #running-req: 1, #token: 6638, token usage: 0.05, gen throughput (token/s): 33.65, #queue-req: 0
[2024-11-01 21:06:44 TP0] Decode batch. #running-req: 1, #token: 6678, token usage: 0.05, gen throughput (token/s): 33.64, #queue-req: 0

## Using OpenAI Python Client

You can also use the OpenAI Python library to send requests.

```python
from openai import OpenAI

# Point the OpenAI client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30010/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)
```

[2024-11-01 21:06:48 TP0] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 6452, cache hit rate: 74.93%, token usage: 0.05, #running-req: 0, #queue-req: 0
[2024-11-01 21:06:49 TP0] Decode batch. #running-req: 1, #token: 6485, token usage: 0.05, gen throughput (token/s): 8.04, #queue-req: 0
[2024-11-01 21:06:50 TP0] Decode batch. #running-req: 1, #token: 6525, token usage: 0.05, gen throughput (token/s): 33.74, #queue-req: 0
[2024-11-01 21:06:51 TP0] Decode batch. #running-req: 1, #token: 6565, token usage: 0.05, gen throughput (token/s): 33.67, #queue-req: 0
[2024-11-01 21:06:51] INFO:     127.0.0.1:57550 - "POST /v1/chat/completions HTTP/1.1" 200 OK
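The examples above pass images by HTTP URL. The OpenAI vision guide also documents base64 `data:` URLs in the same `image_url` field for OpenAI's own API. Whether the SGLang server accepts them is an assumption here, not something this tutorial confirms. The helpers `to_data_url` and `local_image_part` below are hypothetical names, sketched only to show how such a content part could be built from a local file.

```python
import base64
from pathlib import Path


def to_data_url(data: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def local_image_part(path: str, mime_type: str = "image/png") -> dict:
    """Build an `image_url` content part from a local image file.

    Assumption: the server accepts data URLs in the `image_url.url` field.
    """
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(Path(path).read_bytes(), mime_type)},
    }
```

If the assumption holds, the dictionary returned by `local_image_part` could replace the URL-based image entry in the `content` list.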


## Multiple-Image Inputs

The server also accepts multiple images, as well as interleaved text and images, in a single request, provided the model supports them.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30010/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                            "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)
```

[2024-11-01 21:07:05 TP0] Prefill batch. #new-seq: 1, #new-token: 12895, #cached-token: 0, cache hit rate: 50.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-01 21:07:07 TP0] Decode batch. #running-req: 1, #token: 12919, token usage: 0.10, gen throughput (token/s): 2.57, #queue-req: 0
[2024-11-01 21:07:08 TP0] Decode batch. #running-req: 1, #token: 12959, token usage: 0.10, gen throughput (token/s): 33.40, #queue-req: 0
[2024-11-01 21:07:08] INFO:     127.0.0.1:34782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
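When the number of images varies, the `content` list can be assembled programmatically. The following is a small sketch that mirrors the request structure used above. The helper name `build_multi_image_message` is hypothetical, and placing all images before the text is just one possible ordering.

```python
from typing import Dict, List


def build_multi_image_message(image_urls: List[str], prompt: str) -> Dict:
    """Build one user message with every image first, then the text prompt."""
    content: List[Dict] = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


# The two image URLs used in this tutorial.
IMAGE_URLS = [
    "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
    "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
]

message = build_multi_image_message(
    IMAGE_URLS, "Describe each image in one sentence."
)
```

The resulting `message` can be passed as `messages=[message]` to `client.chat.completions.create`.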


```python
# Shut down the server process started at the beginning of the tutorial.
terminate_process(embedding_process)
```

## Chat Template

As mentioned before, if you do not specify a vision model's `--chat-template`, the server uses Hugging Face's default template, which only supports text.

We list popular vision models with their chat templates:

- [meta-llama/Llama-3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) uses `llama_3_vision`.
- [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) uses `qwen2-vl`.
- [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) uses `chatml-llava`.
- [LLaVA-NeXT](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff) uses `chatml-llava`.
- [Llama3-LLaVA-NeXT](https://huggingface.co/lmms-lab/llama3-llava-next-8b) uses `llava_llama_3`.
- [LLaVA-v1.5 / 1.6](https://huggingface.co/liuhaotian/llava-v1.6-34b) uses `vicuna_v1.1`.
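
The list above can be turned into a small lookup so that the launch command always carries the matching template. This is a sketch only: `CHAT_TEMPLATES` and `launch_command` are hypothetical names, the model IDs are taken from the links in the list, and the LLaVA-NeXT collection is omitted because it is not a single model ID.

```python
import shlex

# Model ID -> chat template, copied from the list above.
CHAT_TEMPLATES = {
    "meta-llama/Llama-3.2-11B-Vision-Instruct": "llama_3_vision",
    "Qwen/Qwen2-VL-7B-Instruct": "qwen2-vl",
    "lmms-lab/llava-onevision-qwen2-7b-ov": "chatml-llava",
    "lmms-lab/llama3-llava-next-8b": "llava_llama_3",
    "liuhaotian/llava-v1.6-34b": "vicuna_v1.1",
}


def launch_command(model_path: str, port: int = 30010) -> str:
    """Return the launch command for a listed model.

    Raises KeyError for models that are not in CHAT_TEMPLATES.
    """
    template = CHAT_TEMPLATES[model_path]
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--chat-template", template,
    ]
    return shlex.join(args)


print(launch_command("Qwen/Qwen2-VL-7B-Instruct"))
```

Failing with a `KeyError` for unlisted models is a deliberate choice in this sketch: it avoids silently launching a vision model with a text-only template.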