# OpenAI APIs - Vision

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).
This tutorial covers the vision APIs for vision language models.

SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, and Qwen2-VL:

- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- [lmms-lab/llava-onevision-qwen2-72b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat)
- [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)

## Launch A Server

Launch the server in your terminal and wait for it to initialize.

**Remember to add `--chat-template llama_3_vision` to specify the vision chat template; otherwise, the server supports only text input and performance may degrade.**

We need to specify `--chat-template` for vision language models because the chat template provided in the Hugging Face tokenizer supports only text.

In [1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

embedding_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
    --chat-template=llama_3_vision
"""
)

wait_for_server(f"http://localhost:{port}")

INFO 02-19 22:16:49 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:16:51,109 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:16:51] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-11B-Vision-Instruct', chat_template='llama_3_vision', is_embedding=False, revision=None, skip_tokenizer_init=False, host='127.0.0.1', port=30049, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=763110421, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_

[2025-02-19 22:16:54] Use chat template for the OpenAI-compatible API server: llama_3_vision
[2025-02-19 22:16:54] Using a chat_template: 'llama_3_vision', which is different from official chat template: 'llama-3-instruct', This discrepancy may lead to performance degradation.


INFO 02-19 22:16:55 __init__.py:190] Automatically detected platform cuda.
INFO 02-19 22:16:55 __init__.py:190] Automatically detected platform cuda.


2025-02-19 22:16:57,221 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-02-19 22:16:57,234 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[2025-02-19 22:16:59 TP0] Overlap scheduler is disabled for multimodal models.
[2025-02-19 22:17:00 TP0] Automatically reduce --mem-fraction-static to 0.836 because this is a multimodal model.
[2025-02-19 22:17:00 TP0] Automatically turn off --chunked-prefill-size for mllama.
[2025-02-19 22:17:00 TP0] Init torch distributed begin.


[2025-02-19 22:17:00 TP0] Load weight begin. avail mem=78.84 GB


[2025-02-19 22:17:00 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:01,  3.92it/s]


Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:01,  1.53it/s]


Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.20it/s]


Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.05it/s]


Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.14it/s]

[2025-02-19 22:17:05 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=58.67 GB


[2025-02-19 22:17:05 TP0] KV Cache is allocated. K size: 22.87 GB, V size: 22.87 GB.
[2025-02-19 22:17:05 TP0] Memory pool end. avail mem=11.89 GB
[2025-02-19 22:17:06 TP0] Capture cuda graph begin. This can take up to several minutes.
  0%|          | 0/23 [00:00<?, ?it/s]

2025-02-19 22:17:06,685 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-02-19 22:17:06,707 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False


  4%|▍         | 1/23 [00:01<00:28,  1.30s/it]

  9%|▊         | 2/23 [00:01<00:14,  1.40it/s]

 13%|█▎        | 3/23 [00:01<00:10,  1.99it/s]

 17%|█▋        | 4/23 [00:02<00:07,  2.43it/s]

 22%|██▏       | 5/23 [00:02<00:06,  2.81it/s]

 26%|██▌       | 6/23 [00:02<00:05,  3.10it/s]

 30%|███       | 7/23 [00:02<00:04,  3.35it/s]

 35%|███▍      | 8/23 [00:03<00:04,  3.48it/s]

 39%|███▉      | 9/23 [00:03<00:03,  3.55it/s]

 43%|████▎     | 10/23 [00:03<00:03,  3.65it/s]

 48%|████▊     | 11/23 [00:03<00:03,  3.64it/s]

 52%|█████▏    | 12/23 [00:04<00:03,  3.61it/s]

 57%|█████▋    | 13/23 [00:04<00:02,  3.54it/s]

 61%|██████    | 14/23 [00:04<00:02,  3.58it/s]

 65%|██████▌   | 15/23 [00:05<00:02,  3.59it/s]

 70%|██████▉   | 16/23 [00:05<00:01,  3.51it/s]

 74%|███████▍  | 17/23 [00:05<00:01,  3.58it/s]

 78%|███████▊  | 18/23 [00:05<00:01,  3.63it/s]

 83%|████████▎ | 19/23 [00:06<00:01,  3.66it/s]

 87%|████████▋ | 20/23 [00:06<00:00,  3.67it/s]

 91%|█████████▏| 21/23 [00:06<00:00,  3.74it/s]

 96%|█████████▌| 22/23 [00:07<00:00,  3.64it/s]

100%|██████████| 23/23 [00:07<00:00,  3.62it/s]100%|██████████| 23/23 [00:07<00:00,  3.16it/s]
[2025-02-19 22:17:13 TP0] Capture cuda graph end. Time elapsed: 7.29 s


[2025-02-19 22:17:15 TP0] max_total_num_tokens=299781, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-02-19 22:17:15] INFO:     Started server process [1850809]
[2025-02-19 22:17:15] INFO:     Waiting for application startup.
[2025-02-19 22:17:15] INFO:     Application startup complete.
[2025-02-19 22:17:15] INFO:     Uvicorn running on http://127.0.0.1:30049 (Press CTRL+C to quit)


[2025-02-19 22:17:15] INFO:     127.0.0.1:55226 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-19 22:17:16] INFO:     127.0.0.1:55238 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-19 22:17:16 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-02-19 22:17:16,401 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-19 22:17:16,425 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False


[2025-02-19 22:17:16] INFO:     127.0.0.1:55250 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 22:17:16] The server is fired up and ready to roll!




NOTE: Typically, the server runs in a separate terminal. In this notebook, we run the server and notebook code together, so their outputs are combined. To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue. We are running these notebooks in a parallel CI environment, so the throughput is not representative of actual performance.


## Using cURL

Once the server is up, you can send test requests using `curl` or Python's `requests` library.

In [2]:
import subprocess

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "text",
            "text": "What’s in this image?"
          }},
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }}
          }}
        ]
      }}
    ],
    "max_tokens": 300
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)


response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)

[2025-02-19 22:17:24 TP0] Prefill batch. #new-seq: 1, #new-token: 6463, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-19 22:17:26 TP0] Decode batch. #running-req: 1, #token: 6496, token usage: 0.02, gen throughput (token/s): 3.72, #queue-req: 0


[2025-02-19 22:17:26] INFO:     127.0.0.1:56912 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"eeab91c5f481483e90946088a5e00a3e","object":"chat.completion","created":1740003446,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"This image depicts an ironing board and iron set up on the back of a taxi cab. A man in a yellow jacket is ironing a shirt on the board. There are pink banners in the background hanging from light poles. The image appears to be a city street with a second taxi in the background.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":128009}],"usage":{"prompt_tokens":6463,"total_tokens":6526,"completion_tokens":63,"prompt_tokens_details":null}}


[2025-02-19 22:17:27 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 6462, cache hit rate: 49.97%, token usage: 0.02, #running-req: 0, #queue-req: 0


[2025-02-19 22:17:27 TP0] Decode batch. #running-req: 1, #token: 6474, token usage: 0.02, gen throughput (token/s): 34.76, #queue-req: 0


[2025-02-19 22:17:27 TP0] Decode batch. #running-req: 1, #token: 0, token usage: 0.00, gen throughput (token/s): 63.48, #queue-req: 0
[2025-02-19 22:17:27] INFO:     127.0.0.1:56922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"c164e16fc0244938998bab75614a95b3","object":"chat.completion","created":1740003447,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a man ironing clothes on the back of a yellow taxi cab in the middle of the street. The man is standing on the back of the taxi, holding an iron and ironing board, and appears to be ironing a shirt.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":128009}],"usage":{"prompt_tokens":6463,"total_tokens":6515,"completion_tokens":52,"prompt_tokens_details":null}}
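Each response is an OpenAI-style chat completion JSON object. The following is a minimal sketch of extracting the reply text and token counts from such a string. The helper name `summarize_completion` is hypothetical, and `sample` is a shortened copy of the second response above with the message content truncated.

```python
import json


def summarize_completion(raw: str) -> dict:
    """Extract the assistant reply and token counts from a chat completion JSON string."""
    body = json.loads(raw)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


# Shortened copy of the second response shown above (content truncated for brevity).
sample = (
    '{"id":"c164e16fc0244938998bab75614a95b3","object":"chat.completion",'
    '"choices":[{"index":0,"message":{"role":"assistant",'
    '"content":"The image depicts a man ironing clothes..."},'
    '"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":6463,"total_tokens":6515,"completion_tokens":52}}'
)

summary = summarize_completion(sample)
print(summary["finish_reason"], summary["prompt_tokens"], summary["completion_tokens"])
```

With the `curl` cell above, the same helper could be applied to the decoded `response` string instead of `sample`.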


## Using Python Requests

In [3]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)

[2025-02-19 22:17:28 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 6462, cache hit rate: 66.63%, token usage: 0.02, #running-req: 0, #queue-req: 0


[2025-02-19 22:17:28] INFO:     127.0.0.1:56936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"98a78fb7619c46529409674974fd58e5","object":"chat.completion","created":1740003448,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image depicts a man ironing clothes on the back of a taxi cab, which is a common sight in many cities where taxis are abundant.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":128009}],"usage":{"prompt_tokens":6463,"total_tokens":6493,"completion_tokens":30,"prompt_tokens_details":null}}


## Using OpenAI Python Client

In [4]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)

[2025-02-19 22:17:29 TP0] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 6452, cache hit rate: 74.93%, token usage: 0.02, #running-req: 0, #queue-req: 0


[2025-02-19 22:17:29 TP0] Decode batch. #running-req: 1, #token: 6474, token usage: 0.02, gen throughput (token/s): 23.93, #queue-req: 0


[2025-02-19 22:17:29] INFO:     127.0.0.1:56952 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The image shows a man ironing clothes on an ironing board that is mounted on the back of a yellow taxi cab.
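The examples above pass a public image URL. The OpenAI vision message format also allows a base64 `data:` URL in the same `url` field, which is useful for local files. The sketch below only builds such a message. It assumes the SGLang server accepts `data:` URLs in the same way, which you should verify against your installed version. The helper names `to_data_url` and `build_image_message` are hypothetical.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt: str, image_url: str) -> dict:
    """Build a single user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Placeholder bytes: only the 8-byte PNG signature, not a complete image.
# A real request needs the full contents of an image file.
placeholder_png = b"\x89PNG\r\n\x1a\n"
message = build_image_message("What is in this image?", to_data_url(placeholder_png))
print(message["content"][1]["image_url"]["url"])
```

In practice, the bytes would come from something like `pathlib.Path("image.png").read_bytes()` (a hypothetical path), and the result would be passed as `messages=[message]` to `client.chat.completions.create`.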


## Multiple-Image Inputs

The server also supports multiple images, as well as interleaved text and images, provided the model supports them.

In [5]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)

[2025-02-19 22:17:30 TP0] Prefill batch. #new-seq: 1, #new-token: 12895, #cached-token: 0, cache hit rate: 50.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-19 22:17:31 TP0] Decode batch. #running-req: 1, #token: 12921, token usage: 0.04, gen throughput (token/s): 18.08, #queue-req: 0


[2025-02-19 22:17:32 TP0] Decode batch. #running-req: 1, #token: 12961, token usage: 0.04, gen throughput (token/s): 62.98, #queue-req: 0


[2025-02-19 22:17:32] INFO:     127.0.0.1:51056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The first image shows a man in a yellow shirt ironing a shirt on the back of a yellow taxi cab, with a small icon of a computer code snippet in the top-left corner. The second image shows a large orange "SGI" logo, with the "S" and "G" in a large font and the "I" in a smaller font, with a white background.
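The `content` list in the request above can also be assembled programmatically when the number of images varies. This is a minimal sketch; `build_multi_image_content` is a hypothetical helper, not part of SGLang or the OpenAI client, and it places all images before the text as the example above does.

```python
from typing import Dict, List


def build_multi_image_content(image_urls: List[str], prompt: str) -> List[Dict]:
    """Return one image_url part per URL, followed by a single text part."""
    parts: List[Dict] = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    parts.append({"type": "text", "text": prompt})
    return parts


content = build_multi_image_content(
    [
        "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
        "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
    ],
    "Describe the first image in one sentence, and then the second image in another sentence.",
)
messages = [{"role": "user", "content": content}]
print([part["type"] for part in content])
```

The resulting `messages` list has the same shape as the one passed to `client.chat.completions.create` above.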


In [6]:
terminate_process(embedding_process)

## Chat Template

As mentioned above, if you do not specify `--chat-template` for a vision model, the server falls back to the default chat template from the Hugging Face tokenizer, which supports only text.

We list popular vision models with their chat templates:

- [meta-llama/Llama-3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) uses `llama_3_vision`.
- [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) uses `qwen2-vl`.
- [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) uses `chatml-llava`.
- [LLaVA-NeXT](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff) uses `chatml-llava`.
- [Llama3-LLaVA-NeXT](https://huggingface.co/lmms-lab/llama3-llava-next-8b) uses `llava_llama_3`.
- [LLaVA-v1.5 / 1.6](https://huggingface.co/liuhaotian/llava-v1.6-34b) uses `vicuna_v1.1`.
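
The list above can be kept as a small lookup table so that the launch command always carries a matching template. This is a sketch, not an SGLang API: `CHAT_TEMPLATES` and `build_launch_command` are hypothetical names, the repository IDs are taken from the links above, and the LLaVA-NeXT collection is omitted because it does not correspond to a single repository.

```python
import shlex

# Model repository ID -> chat template name, copied from the list above.
CHAT_TEMPLATES = {
    "meta-llama/Llama-3.2-11B-Vision-Instruct": "llama_3_vision",
    "Qwen/Qwen2-VL-7B-Instruct": "qwen2-vl",
    "lmms-lab/llava-onevision-qwen2-7b-ov": "chatml-llava",
    "lmms-lab/llama3-llava-next-8b": "llava_llama_3",
    "liuhaotian/llava-v1.6-34b": "vicuna_v1.1",
}


def build_launch_command(model_path: str) -> str:
    """Return a launch command that includes the chat template for a known model."""
    template = CHAT_TEMPLATES.get(model_path)
    if template is None:
        raise KeyError(f"No chat template recorded for {model_path!r}")
    return shlex.join(
        [
            "python3", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--chat-template", template,
        ]
    )


print(build_launch_command("Qwen/Qwen2-VL-7B-Instruct"))
```

Raising on an unknown model is a deliberate choice here: silently launching without a template would give a text-only server.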