# OpenAI APIs - Vision

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).
This tutorial covers the vision APIs for vision language models.

SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/multimodal_language_models.md).

As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py).

## Launch A Server

Launch the server in your terminal and wait for it to initialize.

In [1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

vision_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct
"""
)

wait_for_server(f"http://localhost:{port}")

W0824 08:22:00.534000 2155419 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 08:22:00.534000 2155419 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.


[2025-08-24 08:22:01] MOE_RUNNER_BACKEND is not initialized, using triton backend


[2025-08-24 08:22:02] Ignore import error when loading sglang.srt.multimodal.processors.glm4v: No module named 'transformers.models.glm4v_moe'


You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
[2025-08-24 08:22:02] You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


[2025-08-24 08:22:03] Using default HuggingFace chat template with detected content format: openai


W0824 08:22:08.872000 2155848 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 08:22:08.872000 2155848 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0824 08:22:08.872000 2155847 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 08:22:08.872000 2155847 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.


You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
[2025-08-24 08:22:10] You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


[2025-08-24 08:22:10] Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-08-24 08:22:10] Init torch distributed begin.


[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-08-24 08:22:11] Init torch distributed ends. mem usage=0.00 GB
[2025-08-24 08:22:11] MOE_RUNNER_BACKEND is not initialized, using triton backend


[2025-08-24 08:22:12] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 08:22:12] Load weight begin. avail mem=78.08 GB


[2025-08-24 08:22:12] Multimodal attention backend not set. Use fa3.
[2025-08-24 08:22:12] Using fa3 as multimodal attention backend.


[2025-08-24 08:22:12] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.50it/s]


Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.46it/s]


Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.44it/s]


Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:02<00:00,  1.45it/s]


Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.86it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.66it/s]

[2025-08-24 08:22:15] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=35.46 GB, mem usage=42.61 GB.
[2025-08-24 08:22:15] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-08-24 08:22:15] Memory pool end. avail mem=34.13 GB


[2025-08-24 08:22:16] Capture cuda graph begin. This can take up to several minutes. avail mem=33.56 GB


[2025-08-24 08:22:16] Capture cuda graph bs [1, 2, 4]


  0%|          | 0/3 [00:00<?, ?it/s]Capturing batches (bs=4 avail_mem=33.56 GB):   0%|          | 0/3 [00:00<?, ?it/s][2025-08-24 08:22:16] IS_TBO_ENABLED is not initialized, using False


Capturing batches (bs=4 avail_mem=33.56 GB):  33%|███▎      | 1/3 [00:03<00:06,  3.40s/it]Capturing batches (bs=2 avail_mem=28.96 GB):  33%|███▎      | 1/3 [00:03<00:06,  3.40s/it]

Capturing batches (bs=2 avail_mem=28.96 GB):  67%|██████▋   | 2/3 [00:06<00:03,  3.05s/it]Capturing batches (bs=1 avail_mem=28.86 GB):  67%|██████▋   | 2/3 [00:06<00:03,  3.05s/it]

Capturing batches (bs=1 avail_mem=28.86 GB): 100%|██████████| 3/3 [00:07<00:00,  2.06s/it]Capturing batches (bs=1 avail_mem=28.86 GB): 100%|██████████| 3/3 [00:07<00:00,  2.36s/it]
[2025-08-24 08:22:23] Capture cuda graph end. Time elapsed: 7.87 s. mem usage=4.72 GB. avail mem=28.83 GB.


[2025-08-24 08:22:25] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=128, context_len=128000, available_gpu_mem=28.83 GB


[2025-08-24 08:22:25] INFO:     Started server process [2155419]
[2025-08-24 08:22:25] INFO:     Waiting for application startup.
[2025-08-24 08:22:25] INFO:     Application startup complete.
[2025-08-24 08:22:25] INFO:     Uvicorn running on http://127.0.0.1:32474 (Press CTRL+C to quit)
[2025-08-24 08:22:25] INFO:     127.0.0.1:50956 - "GET /v1/models HTTP/1.1" 200 OK


[2025-08-24 08:22:26] INFO:     127.0.0.1:50970 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-08-24 08:22:26] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 


[2025-08-24 08:22:29] INFO:     127.0.0.1:50972 - "POST /generate HTTP/1.1" 200 OK
[2025-08-24 08:22:29] The server is fired up and ready to roll!


## Using cURL

Once the server is up, you can send test requests using curl or requests.

In [2]:
import subprocess

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "text",
            "text": "What’s in this image?"
          }},
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }}
          }}
        ]
      }}
    ],
    "max_tokens": 300
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)


response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)

[2025-08-24 08:22:31] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 


[2025-08-24 08:22:32] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, cuda graph: True, gen throughput (token/s): 5.58, #queue-req: 0, 


[2025-08-24 08:22:32] Decode batch. #running-req: 1, #token: 380, token usage: 0.02, cuda graph: True, gen throughput (token/s): 148.36, #queue-req: 0, 
[2025-08-24 08:22:32] INFO:     127.0.0.1:50988 - "POST /v1/chat/completions HTTP/1.1" 200 OK


[2025-08-24 08:22:33] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0, 


[2025-08-24 08:22:36] Decode batch. #running-req: 1, #token: 342, token usage: 0.02, cuda graph: True, gen throughput (token/s): 12.24, #queue-req: 0, 


[2025-08-24 08:22:36] Decode batch. #running-req: 1, #token: 382, token usage: 0.02, cuda graph: True, gen throughput (token/s): 150.39, #queue-req: 0, 
[2025-08-24 08:22:36] INFO:     127.0.0.1:50994 - "POST /v1/chat/completions HTTP/1.1" 200 OK


## Using Python Requests

In [3]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)

[2025-08-24 08:22:36] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0, 
[2025-08-24 08:22:37] Decode batch. #running-req: 1, #token: 329, token usage: 0.02, cuda graph: True, gen throughput (token/s): 53.17, #queue-req: 0, 


[2025-08-24 08:22:37] Decode batch. #running-req: 1, #token: 369, token usage: 0.02, cuda graph: True, gen throughput (token/s): 150.18, #queue-req: 0, 
[2025-08-24 08:22:37] INFO:     127.0.0.1:60084 - "POST /v1/chat/completions HTTP/1.1" 200 OK


## Using OpenAI Python Client

In [4]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)

[2025-08-24 08:22:37] Prefill batch. #new-seq: 1, #new-token: 292, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-08-24 08:22:38] Decode batch. #running-req: 1, #token: 320, token usage: 0.02, cuda graph: True, gen throughput (token/s): 53.30, #queue-req: 0, 


[2025-08-24 08:22:38] Decode batch. #running-req: 1, #token: 360, token usage: 0.02, cuda graph: True, gen throughput (token/s): 150.04, #queue-req: 0, 


[2025-08-24 08:22:38] Decode batch. #running-req: 1, #token: 400, token usage: 0.02, cuda graph: True, gen throughput (token/s): 150.01, #queue-req: 0, 
[2025-08-24 08:22:38] INFO:     127.0.0.1:60090 - "POST /v1/chat/completions HTTP/1.1" 200 OK


## Multiple-Image Inputs

The server also supports multiple images and interleaved text and images if the model supports it.

In [5]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)

[2025-08-24 08:22:39] Prefill batch. #new-seq: 1, #new-token: 2532, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0, 


[2025-08-24 08:22:39] Decode batch. #running-req: 1, #token: 2581, token usage: 0.13, cuda graph: True, gen throughput (token/s): 30.70, #queue-req: 0, 
[2025-08-24 08:22:40] INFO:     127.0.0.1:60098 - "POST /v1/chat/completions HTTP/1.1" 200 OK


In [6]:
terminate_process(vision_process)