<a href="https://colab.research.google.com/github/MehediAhamed/vlmrun-cookbook/blob/Autoreload-error-fixed-orion-image-understanding/notebooks/12_orion_image_understanding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Image Understanding, Reasoning and Execution

This comprehensive cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) image understanding, reasoning and execution capabilities. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

In this notebook, we'll use the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface that exposes visual intelligence capabilities through the familiar chat-completions request format.

We'll cover the following topics:
 1. Image VQA (captioning, tagging, question-answering)
 2. Object Detection (people, faces, objects, etc.)
 3. Object Segmentation (semantic, instance, etc.)
 4. UI Parsing (Graphical UI parsing and understanding)
 5. OCR (text detection, recognition, and understanding)
 6. Image Generation (text-to-image, in-painting, out-painting, etc.)
 7. Image Tools (cropping, super-resolution, rotating, etc.)

## Prerequisites

- Python 3.10+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- VLM Run Python client with the OpenAI extra (`vlmrun[openai]`)

## Setup

First, install the required packages and configure the environment.

In [1]:
# Install required packages
!pip install "vlmrun[openai]" --upgrade --quiet
!pip install pillow requests numpy --quiet

In [2]:
import os
import getpass
import json
from typing import List, Any
from functools import cached_property

import numpy as np
from PIL import Image
from pydantic import BaseModel, Field

VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass("Enter your VLM Run API key: ")

Enter your VLM Run API key: ··········


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.

In [3]:
from vlmrun.client import VLMRun

client = VLMRun(
    api_key=VLMRUN_API_KEY, base_url="https://agent.vlm.run/v1"
)
print("VLM Run client initialized successfully!")
print("Base URL: https://agent.vlm.run/v1")
print("Model: vlmrun-orion-1")

VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1


## Response Models (dtypes)

We define Pydantic models for structured outputs. These models include **cached properties** that download images from their pre-signed URLs on first access and return them as PIL Images, plus `render` helpers that draw detections and keypoints onto an image.

In [4]:
from PIL import ImageDraw
from vlmrun.common.utils import download_image


class ImageUrlResponse(BaseModel):
    """Response model for image URL operations."""
    url: str = Field(..., description="Pre-signed URL to the image")

    @cached_property
    def image(self) -> Image.Image | None:
        """Download and return the image as a PIL Image (for image types)."""
        return download_image(self.url) if self.url else None


class ImageUrlListResponse(BaseModel):
    """Response model for multiple image URLs."""
    urls: List[ImageUrlResponse] = Field(..., description="List of pre-signed image URL responses")

    @cached_property
    def images(self) -> List[Image.Image]:
        """Download and return all images as PIL Images."""
        return [item.image for item in self.urls if item.image is not None]


class DetectionsResponse(BaseModel):
    """Collection of object detections."""

    class Detection(BaseModel):
        """Single object detection result."""
        label: str = Field(..., description="Name of the detected object")
        xywh: tuple[float, float, float, float] = Field(..., description="Bounding box (x, y, width, height) normalized from 0-1")
        confidence: float | None = Field(None, description="Detection confidence score from 0-1")

    detections: List[Detection] = Field(..., description="List of detected objects with bounding boxes")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for detection in self.detections:
            x, y, w, h = detection.xywh
            draw.rectangle([int(x * W), int(y * H), int((x + w) * W), int((y + h) * H)], outline="red", width=4)
            draw.text((int(x * W), int(y * H)), detection.label, fill="white", font_size=12)
        return vis


class KeypointsResponse(BaseModel):
    """Collection of keypoint detections."""

    class KeyPoint(BaseModel):
        """Single keypoint detection."""
        xy: tuple[float, float] = Field(..., description="Normalized keypoint coordinates (x, y) between 0-1")
        label: str = Field(..., description="Label of the keypoint")

    keypoints: List[KeyPoint] = Field(..., description="List of detected keypoints")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the keypoint detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for keypoint in self.keypoints:
            x, y = keypoint.xy
            draw.circle([int(x * W), int(y * H)], 5, fill="green")
            draw.text((int(x * W), int(y * H)), keypoint.label, fill="white", font_size=12)
        return vis


print("Response models defined successfully!")
print("Models include cached properties for automatic image/mask downloading.")

Response models defined successfully!
Models include cached properties for automatic image/mask downloading.
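
The `render` methods above scale normalized coordinates by the image size before drawing. As a minimal, dependency-free sketch of that conversion (the helper name `xywh_to_pixel_box` is hypothetical and not part of the VLM Run SDK):

```python
from typing import Tuple


def xywh_to_pixel_box(
    xywh: Tuple[float, float, float, float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized (x, y, w, h) box into pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    # Same arithmetic as DetectionsResponse.render: scale by image size, truncate to int.
    return int(x * width), int(y * height), int((x + w) * width), int((y + h) * height)


# A box covering the centre quarter of a 640x480 image (sizes chosen for illustration).
box = xywh_to_pixel_box((0.25, 0.25, 0.5, 0.5), width=640, height=480)
print(box)  # (160, 120, 480, 360)
```

The same corner format is what `ImageDraw.rectangle` and `Image.crop` expect, so the result can be passed to either.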


## Helper Functions

We create helper functions to simplify making chat completion requests with structured outputs.

In [5]:
import hashlib
import cachetools
from typing import Type, TypeVar
from IPython.display import HTML
from vlmrun.common.image import encode_image


T = TypeVar('T', bound=BaseModel)


def display(images: Image.Image | list[Image.Image], texts: list[str] | None = None, width: int = 300):
    if isinstance(images, Image.Image):
        images = [images]
    if texts is None:
        texts = [None] * len(images)
    elif isinstance(texts, str):
        texts = [texts]
    elif len(texts) != len(images):
        raise ValueError("`texts` must be a list of the same length as `images`")

    imgs_html = ""
    for image, text in zip(images, texts):
        W, H = image.size
        if W > width:
            H = int(H * width / W)
            W = width
            image = image.resize((W, H))
        im_bytes = encode_image(image, format="JPEG")
        imgs_html += f"<div style='display:inline-block; margin:5px; text-align:center'>"
        imgs_html += f"<img src='{im_bytes}' style='width:{width}px; border-radius:6px'>"
        if text:
            imgs_html += f"<div style='font-size:12px; color:#666; margin-top:5px'>{text}</div>"
        imgs_html += f"</div>"
    return HTML(f"<div style='display:flex; flex-wrap:wrap'>{imgs_html}</div>")


def custom_key(prompt: str, images: list[Image.Image] | list[str] | None = None, response_model: Type[T] | None = None, model: str = "vlmrun-orion-1:auto"):
    """Custom key for caching chat_completion."""
    image_keys = []
    for image in images or []:  # `images` defaults to None; treat that as "no images"
        if isinstance(image, Image.Image):
            thumb = image.copy()
            thumb.thumbnail((128, 128))
            encoded = encode_image(thumb, format="JPEG")
            image_keys.append(encoded)
        elif isinstance(image, str):
            image_keys.append(image)


    response_key = hashlib.sha256(json.dumps(response_model.model_json_schema(), sort_keys=True).encode()).hexdigest() if response_model else ""
    return (prompt, tuple(image_keys), response_key, model)



In [6]:
from typing import Type, TypeVar, Any
from pathlib import Path
from PIL import Image
import uuid
import tempfile

T = TypeVar("T", bound=BaseModel)  # response models must be Pydantic models


# ---------- Artifact Resolver ----------

def is_artifact_ref(obj) -> bool:
    return isinstance(obj, str) and (
        obj.startswith("img_")
        or obj.startswith("vid_")
        or obj.startswith("url_")
    )


def resolve_artifacts(obj, client, session_id):
    if is_artifact_ref(obj):
        return client.artifacts.get(
            session_id=session_id,
            object_id=obj
        )

    if isinstance(obj, list):
        return [resolve_artifacts(x, client, session_id) for x in obj]

    if hasattr(obj, "__dict__"):
        for k, v in obj.__dict__.items():
            setattr(obj, k, resolve_artifacts(v, client, session_id))
        return obj

    return obj


# ---------- Chat Completion ----------

def chat_completion(
    prompt: str,
    images: list[str] | list[Image.Image] | None = None,
    videos: list[str] | list[Path] | None = None,
    response_model: Type[T] | None = None,
    model: str = "vlmrun-orion-1:auto"
) -> Any:

    session_id = str(uuid.uuid4())

    content = [{"type": "text", "text": prompt}]

    # ---------- IMAGES ----------
    if images:
        for img in images:

            # ✅ CASE 1: URL image
            if isinstance(img, str):
                assert img.startswith(("http://", "https://")), "Image URL must start with http:// or https://"
                content.append({
                    "type": "image_url",
                    "image_url": {"url": img}
                })

            # ✅ CASE 2: PIL Image → upload → input_file
            elif isinstance(img, Image.Image):
                with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
                    img.convert("RGB").save(f.name, format="JPEG")
                    uploaded = client.files.upload(
                        file=Path(f.name),
                        purpose="assistants"
                    )
                content.append({
                    "type": "input_file",
                    "file_id": uploaded.id
                })

            else:
                raise TypeError(f"Unsupported image type: {type(img)}")

    # ---------- VIDEOS ----------
    if videos:
        for vid in videos:
            if isinstance(vid, str):
                assert vid.startswith(("http://", "https://")), "Video URL must start with http:// or https://"
                content.append({
                    "type": "video_url",
                    "video_url": {"url": vid}
                })
            elif isinstance(vid, Path):
                uploaded = client.files.upload(
                    file=vid,
                    purpose="assistants"
                )
                content.append({
                    "type": "input_file",
                    "file_id": uploaded.id
                })
            else:
                raise TypeError(f"Unsupported video type: {type(vid)}")

    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "extra_body": {"session_id": session_id}
    }

    if response_model:
        kwargs["response_format"] = {
            "type": "json_schema",
            "schema": response_model.model_json_schema()
        }

    response = client.agent.completions.create(**kwargs)

    text = response.choices[0].message.content

    if not response_model:
        return text

    parsed = response_model.model_validate_json(text)

    return resolve_artifacts(parsed, client, response.session_id)
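
To illustrate how the artifact resolver walks a parsed response, here is a self-contained sketch that follows the same prefix check and recursion. `StubArtifacts` is a hypothetical stand-in for `client.artifacts`, and its return value is invented for illustration; what the real `client.artifacts.get` returns is not shown in this notebook.

```python
from dataclasses import dataclass, field

ARTIFACT_PREFIXES = ("img_", "vid_", "url_")


class StubArtifacts:
    """Hypothetical stand-in for client.artifacts; records which ids were requested."""

    def __init__(self):
        self.requested = []

    def get(self, session_id: str, object_id: str) -> str:
        self.requested.append(object_id)
        return f"<artifact {object_id} from session {session_id}>"


@dataclass
class Result:
    caption: str
    frames: list = field(default_factory=list)


def resolve(obj, artifacts, session_id):
    """Replace artifact reference strings, recursing into lists and object attributes."""
    if isinstance(obj, str) and obj.startswith(ARTIFACT_PREFIXES):
        return artifacts.get(session_id=session_id, object_id=obj)
    if isinstance(obj, list):
        return [resolve(item, artifacts, session_id) for item in obj]
    if hasattr(obj, "__dict__"):
        for name, value in list(vars(obj).items()):
            setattr(obj, name, resolve(value, artifacts, session_id))
    return obj


stub = StubArtifacts()
result = resolve(Result(caption="a car", frames=["img_123", "vid_456"]), stub, "session-1")
print(result.caption)  # a car
print(stub.requested)  # ['img_123', 'vid_456']
```

Two properties of this logic are worth keeping in mind: any string that happens to start with one of the prefixes is treated as a reference, and dictionaries and tuples are returned unchanged because only lists and objects with a `__dict__` are traversed.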


## Image Understanding, Reasoning, and Execution Capabilities

VLM Run agents can perform a wide range of image processing tasks including object detection, face detection, segmentation, OCR, and more.

### 1. Captioning & Tagging

The simplest operation - load an image from a URL and caption it.

In [7]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

result = chat_completion(
    prompt=f"Generate a detailed description of this image.",
    images=[IMAGE_URL],
)
print(">> RESPONSE")
print(result)
print(">> IMAGE")
display(images=[download_image(IMAGE_URL)], texts=[result], width=600)

>> RESPONSE
The image features a vintage mint green Volkswagen Beetle parked curbside, viewed from its side profile. The car is characterized by its distinctive rounded shape, chrome hubcaps, and side mirrors. In the background, there is an aged building with a textured, light yellow wall. This wall features two dark brown wooden doors; one is a double door with arched upper panels located behind the car's roof, and the other is a single door to the right of the car's front. The ground is paved with block-like stones, suggesting a traditional street or sidewalk. The scene is brightly lit, indicating a daytime setting.
>> IMAGE


### 2a. Object Detection

Detect objects in images with bounding boxes. The agent can detect common objects like people, vehicles, animals, and more.

In [8]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/10-finding-nemo.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the sea creatures in this image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} objects"], width=600)

>> RESPONSE
detections=[Detection(label='great white shark', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='hammerhead shark', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='grey shark', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='Moorish Idol fish', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='small yellow fish (Yellow Tang)', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='large blue and yellow fish (Royal Blue Tang)', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='pufferfish', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='adult sea turtle', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='clownfish', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='seahorse', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='purple and yellow striped fish', xywh=(0.0, 0.0, 0.0, 0.0), confidence=None), Detection(label='red and white shrimp', xywh=(0.0, 0.

### 2b. Object Detection with Specific Prompt

You can specify exactly which objects to detect using natural language.

In [9]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect the 'car' and its 'wheels' in the image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
detections=[Detection(label='car', xywh=(0.053, 0.343, 0.881, 0.428), confidence=None), Detection(label='wheel', xywh=(0.148, 0.575, 0.166, 0.186), confidence=None), Detection(label='wheel', xywh=(0.697, 0.575, 0.168, 0.186), confidence=None)]

>> IMAGE
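
Because the boxes are normalized `(x, y, width, height)` tuples, simple geometric checks can be run directly on the response. As a small sketch (the `contains` helper is hypothetical, not an SDK function), the following verifies that both wheel boxes fall inside the car box, using the values printed in the response above:

```python
def contains(outer, inner) -> bool:
    """Return True if the normalized (x, y, w, h) box `inner` lies within `outer`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


# Values copied from the response above.
car = (0.053, 0.343, 0.881, 0.428)
wheels = [(0.148, 0.575, 0.166, 0.186), (0.697, 0.575, 0.168, 0.186)]

print([contains(car, wheel) for wheel in wheels])  # [True, True]
```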


### 2c. Face Detection

Detect and localize faces in images with bounding boxes.

In [10]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the faces in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} faces"], width=600)

>> RESPONSE
detections=[Detection(label='face', xywh=(0.055, 0.149, 0.287, 0.566), confidence=0.98), Detection(label='face', xywh=(0.356, 0.153, 0.286, 0.562), confidence=0.97), Detection(label='face', xywh=(0.655, 0.175, 0.285, 0.54), confidence=0.96)]

>> IMAGE
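
When the response includes confidence scores, a common follow-up step is to keep only detections above a threshold. The sketch below uses a plain dataclass named `Box` (a hypothetical stand-in that mirrors the fields of `DetectionsResponse.Detection`) so that it runs without any dependencies; the threshold of 0.97 is chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    label: str
    xywh: Tuple[float, float, float, float]
    confidence: Optional[float] = None


def filter_by_confidence(boxes: List[Box], threshold: float) -> List[Box]:
    """Keep boxes whose confidence is reported and at least `threshold`."""
    return [b for b in boxes if b.confidence is not None and b.confidence >= threshold]


# Values copied from the face detection response above.
faces = [
    Box("face", (0.055, 0.149, 0.287, 0.566), 0.98),
    Box("face", (0.356, 0.153, 0.286, 0.562), 0.97),
    Box("face", (0.655, 0.175, 0.285, 0.54), 0.96),
]
kept = filter_by_confidence(faces, threshold=0.97)
print(len(kept))  # 2
```

Detections without a reported confidence (`None`, as in the car and wheels example) are dropped by this filter; whether that is the right default depends on the application.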


### 2d. Person Detection

Detect and localize people in images with bounding boxes.

In [11]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the people in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} people"], width=600)

>> RESPONSE
detections=[Detection(label='person', xywh=(0.044, 0.304, 0.07, 0.24), confidence=0.98), Detection(label='person', xywh=(0.087, 0.291, 0.084, 0.274), confidence=0.97), Detection(label='person', xywh=(0.167, 0.285, 0.088, 0.265), confidence=0.96), Detection(label='person', xywh=(0.232, 0.281, 0.088, 0.299), confidence=0.95), Detection(label='person', xywh=(0.314, 0.318, 0.086, 0.28), confidence=0.94), Detection(label='person', xywh=(0.385, 0.315, 0.089, 0.27), confidence=0.93), Detection(label='person', xywh=(0.468, 0.3, 0.088, 0.295), confidence=0.92), Detection(label='person', xywh=(0.55, 0.327, 0.088, 0.278), confidence=0.91), Detection(label='person', xywh=(0.632, 0.318, 0.087, 0.297), confidence=0.9), Detection(label='person', xywh=(0.71, 0.336, 0.089, 0.29), confidence=0.89), Detection(label='person', xywh=(0.791, 0.314, 0.091, 0.322), confidence=0.88)]

>> IMAGE


### 2e. Detect and blur faces

Detect faces and blur them for privacy protection. Here we combine object / face detection with an image tool.

In [12]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Blur all the faces in this image and return the blurred image",
    images=[image],
    response_model=ImageUrlResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.image], texts=[f"Blurred image"], width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/936bbed1-21af-4b5f-83e7-1036dd346123/img_cd9567.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T203506Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=0dc089e9069d337fa6cc64b76d2202a92fccb8416dd737f6ab5dc70f3bc00fc82477d2820d23fb6377cc556a0cef7ebb91724ceb6e20b7c66d2a0d380de9484859d0e7db538f829930b34317e00c5b5910607e9fcf238e10ac237763dfa4c44e4ec3067afbbb60f3ed4c3eb5726efa69c7a159074d7c81a9edf2697caae4570185729c834afa3d22112624499ab2bc7ba7118ae73e97a22aca49f1a37478f2e1d22f2faa55bbdc3e1fbe7ba947b3e370e9069f61989455a742a96bd0d9b84239dd1fac65d95545f98a52981e8c1ea6cd4ec94b90d554075467b88f53a6e52b1922714f8a821c0fff56aadacb7024f4fb9972bd23abd402d529a2977126bf3b90'

>> IMAGE
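
The returned URL is pre-signed, so it stops working after a fixed period. Assuming the query string follows the Google Cloud Storage V4 signed-URL convention, in which `X-Goog-Expires` is a lifetime in seconds counted from `X-Goog-Date`, the expiry time can be computed as sketched below. The helper name `signed_url_expiry` and the shortened example URL are hypothetical; only the two parameter values are taken from the response above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url: str) -> datetime:
    """Compute the expiry time of a URL carrying X-Goog-Date and X-Goog-Expires."""
    params = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(params["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    lifetime = timedelta(seconds=int(params["X-Goog-Expires"][0]))
    return signed_at.replace(tzinfo=timezone.utc) + lifetime


# Shortened, hypothetical URL reusing the date and expiry parameters from the response above.
url = "https://storage.example.com/img.jpg?X-Goog-Date=20251219T203506Z&X-Goog-Expires=604800"
print(signed_url_expiry(url).isoformat())  # 2025-12-26T20:35:06+00:00
```

If the image is needed beyond that window, saving it locally (for example via `result.image.save(...)`) avoids depending on the link.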


### 3. Keypoint Detection

Detect keypoints in images for counting and localization tasks.

In [13]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.object-detection/donuts.png"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the donuts as keypoints and return the coordinates.",
    images=[image],
    response_model=KeypointsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
keypoints=[KeyPoint(xy=(0.1045, 0.1094), label='donuts'), KeyPoint(xy=(0.3594, 0.0781), label='donuts'), KeyPoint(xy=(0.542, 0.1094), label='donuts'), KeyPoint(xy=(0.7881, 0.1045), label='donuts'), KeyPoint(xy=(0.7686, 0.335), label='donuts'), KeyPoint(xy=(0.5, 0.5), label='donuts'), KeyPoint(xy=(0.2725, 0.3594), label='donuts'), KeyPoint(xy=(0.0537, 0.5), label='donuts'), KeyPoint(xy=(0.0293, 0.8525), label='donuts'), KeyPoint(xy=(0.2305, 0.7441), label='donuts'), KeyPoint(xy=(0.5, 0.8096), label='donuts'), KeyPoint(xy=(0.7881, 0.7441), label='donuts'), KeyPoint(xy=(0.832, 0.9414), label='donuts'), KeyPoint(xy=(0.9639, 0.5), label='donuts'), KeyPoint(xy=(0.2197, 0.959), label='donuts')]

>> IMAGE
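
For counting tasks, the keypoints only need to be grouped by label; for localization, each normalized point is scaled by the image size. A minimal sketch of both steps follows. The helper names are hypothetical, the keypoints are the first three from the response above, and the 1024x1024 image size is an assumption made for illustration.

```python
from collections import Counter

# First three keypoints copied from the response above, as (label, (x, y)) pairs.
keypoints = [
    ("donuts", (0.1045, 0.1094)),
    ("donuts", (0.3594, 0.0781)),
    ("donuts", (0.542, 0.1094)),
]


def count_by_label(points):
    """Count keypoints per label."""
    return Counter(label for label, _ in points)


def to_pixels(xy, width, height):
    """Scale a normalized (x, y) point to integer pixel coordinates."""
    x, y = xy
    return int(x * width), int(y * height)


print(count_by_label(keypoints))  # Counter({'donuts': 3})
print(to_pixels((0.5, 0.5), width=1024, height=1024))  # (512, 512)
```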


### 4. Segmentation

Create pixel-level segmentation masks for objects, people or regions in images.

In [14]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the people in this image, and segment them.",
    images=[image],
    response_model=ImageUrlResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.image, width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/2d91425e-1bd2-4c59-a4b1-611ad0406d7f/img_898895.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T203907Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=64eba5a3716f6048c3c38c6edc70e083cac93351186df970df1a14875beb421992a07a07af6a6e093e0a70bff12d271da8b41cec7d567eb5a0740bdf890115433d2db08549fb225930ab522071e95866796f9d2120927edd3a8b5878fb1c0df0238dcece883b590f7789426ca4c5902b087bd474bfabeb583b728edf589f4097710a3d05b238292614aba392fafa480664a9210fb49dab848654503a1c348be15ab9c0a4bf2b4a8d283fee3a8141493658530bbf944b0ae4dc06190d3ede55f20e7a4c91c33abe999779e6dc43e334610f626d3e6d40f3774c98a3e9f17e66703c45e014c83f11244379b77a6b13afa703120a7c650097c8b7a56e788932c4d4'

>> IMAGE


### 6. OCR (Optical Character Recognition)

Extract text from images using OCR capabilities.

In [15]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Read the text in this image",
    images=[image],
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[image], texts=[result], width=600)

>> RESPONSE
Today is Thursday, October 20th- But it definitely feels like a Friday. I'm already considering making a second cup of coffee- and I haven't even finished my first. Do I have a problem? Sometimes I'll flip through older notes I've taken, and my handwriting is unrecognizable, Perhaps it depends on the type of pen I use? I've tried writing in all caps But IT Looks So FORCED AND UNNATURAL Often times, I'll just take notes on muy laptop, but I still seem to gravitate toward pen and paper. Any advice on what to I'm prove ? I already feel stressed out looking back at what I've just written- it looks like 3 different people wrote this!

>> IMAGE


### 5. Image Generation

Create, modify and remix images from text prompts or existing visuals.

### 5a. Virtual Try-On

Generate a virtual try-on of a dress on a person, with front and side views and seamless compositing.

In [16]:
img_1 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png")
img_2 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png")
display([img_1, img_2], width=400)

In [17]:
# Generate a virtual try-on of a dress on a person, with unique views
result = chat_completion(
    prompt="You are provided with two images: one of a dress (the first image) and one of a person (the second image). Generate highly realistic virtual try-on images by seamlessly compositing the dress onto the person, ensuring natural fit and alignment, and that the person appears fully and appropriately dressed. Provide 2 images (9:16 aspect ratio) as output: one from the front and one from the side.",
    images=[
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png",
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png"
    ],
    response_model=ImageUrlListResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGES")
display(result.images, width=400)

>> RESPONSE
urls=[ImageUrlResponse(url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/d6f69181-81a1-4ffa-adf0-098059df5c79/img_ad12a3.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T204021Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=a5da2a051b53f55118878243369ee456f377d47e4f4087a49a627c90172b7e57502ebf2e586267405d030143e79e748524e5ed1a229e5200c4951d5c98e0e62abed9e58fc7047d80155402c4b61d3e9d0ceceeeb1c8818b15c10b1e82ac409ef2e46110b975ce69c5caaf056d04261af7ecf411fbbd15b15e577c0675ad11d5ad143706a0fad6b372b0e341d4355045145144be26cff83697b957800f8b8807e290bab6b1d153a49a2b4f9cacc690be9d0384338889656a775a812813b3f96b5ff8f2da0664179889447f9ebd8fb423fe45b7d6c4cd4d57e1a63ed8ab15fc3421535b1e3d3147ee9dd526be8458da5478ad3ee8010366224adf26863979fe945'), ImageUrlResponse(url='https://storage.goog
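The generated images come back as signed URLs whose query string carries `X-Goog-Date` and `X-Goog-Expires` parameters, so the links stop working after a fixed lifetime. The sketch below, which uses only the standard library, reads those two parameters to estimate when a link expires. The helper name `signed_url_expiry` and the example URL are illustrative assumptions, not part of the VLM Run client.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url: str) -> Optional[datetime]:
    """Return the UTC expiry time of a signed URL, or None if it is unsigned."""
    query = parse_qs(urlparse(url).query)
    if "X-Goog-Date" not in query or "X-Goog-Expires" not in query:
        return None
    # X-Goog-Date is the signing time; X-Goog-Expires is the lifetime in seconds.
    issued = datetime.strptime(query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    issued = issued.replace(tzinfo=timezone.utc)
    return issued + timedelta(seconds=int(query["X-Goog-Expires"][0]))


# Hypothetical URL carrying the same date and lifetime values as the output above.
example = (
    "https://storage.googleapis.com/example-bucket/img.jpg"
    "?X-Goog-Date=20251219T204021Z&X-Goog-Expires=604800"
)
print(signed_url_expiry(example))  # 2025-12-26 20:40:21+00:00
```

A lifetime of 604800 seconds is seven days, so it is worth downloading any generated image you want to keep rather than storing only its URL.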

### 6. Template Matching

Find a template image within a larger reference image.

In [18]:
TEMPLATE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png"
REFERENCE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png"

template_img = download_image(TEMPLATE_URL)
reference_img = download_image(REFERENCE_URL)
display([template_img, reference_img], width=400)


In [19]:
result = chat_completion(
    prompt="Given two images, identify the item shown in the first image (the template) within the second image (the reference). Clearly highlight and draw bounding boxes around all occurrences of the item in the second image. Provide a brief description of the results.",
    images=[template_img, reference_img],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(reference_img), width=600)


>> RESPONSE
detections=[Detection(label='lemon', xywh=(0.0, 0.0, 1.0, 1.0), confidence=0.99)]

>> IMAGE


### 7. UI Parsing

Parse user interface elements from screenshots.

In [21]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt="Parse the UI of this screenshot and detect all the UI elements.",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)



>> RESPONSE
detections=[Detection(label='search', xywh=(0.3779, 0.1096, 0.0332, 0.0209), confidence=None), Detection(label='Pinned', xywh=(0.3096, 0.1896, 0.0371, 0.0243), confidence=None), Detection(label='WinObs', xywh=(0.3428, 0.8678, 0.0371, 0.0243), confidence=None), Detection(label='Store', xywh=(0.4953, 0.2329, 0.0782, 0.1091), confidence=None), Detection(label='Microsoft', xywh=(0.2863, 0.228, 0.0755, 0.1094), confidence=None), Detection(label='AoX', xywh=(0.36, 0.3444, 0.0637, 0.0998), confidence=None), Detection(label='Mcte', xywh=(0.6368, 0.5957, 0.0521, 0.041), confidence=None), Detection(label='(II} png', xywh=(0.5158, 0.6509, 0.137, 0.0739), confidence=None), Detection(label='Woiuz', xywh=(0.9256, 0.9386, 0.0671, 0.0567), confidence=None), Detection(label='Recommended', xywh=(0.2998, 0.5829, 0.1325, 0.0648), confidence=None), Detection(label='Photoshop', xywh=(0.5706, 0.4562, 0.067, 0.1075), confidence=None), Detection(label='Calendar', xywh=(0.4358, 0.2324, 0.0598, 0.108
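The `xywh` values in the detections above all fall between 0 and 1, which suggests they are normalized to the image size. As a minimal sketch, assuming `xywh` means (x of the left edge, y of the top edge, width, height) with the origin at the top-left corner, the helper below converts one box to pixel coordinates. The function name and the 1920x1080 screenshot size are hypothetical.

```python
from typing import Tuple


def xywh_to_pixel_box(
    xywh: Tuple[float, float, float, float],
    image_width: int,
    image_height: int,
) -> Tuple[int, int, int, int]:
    """Convert a normalized (x, y, w, h) box to pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + w) * image_width)
    bottom = round((y + h) * image_height)
    # Clamp to the image bounds in case a box slightly overshoots the frame.
    left = min(max(left, 0), image_width)
    right = min(max(right, 0), image_width)
    top = min(max(top, 0), image_height)
    bottom = min(max(bottom, 0), image_height)
    return left, top, right, bottom


# The 'search' box from the output above, on a hypothetical 1920x1080 screenshot.
print(xywh_to_pixel_box((0.3779, 0.1096, 0.0332, 0.0209), 1920, 1080))
# (726, 118, 789, 141)
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it can be passed straight to `image.crop(...)` to cut out a single UI element.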

### 8. Streaming Responses

For long-running tasks, you can use streaming to get partial results as they become available.

In [22]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

stream = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail"},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}}
        ]
    }],
    stream=True
)

print("Streaming response:")
full_response = ""
for chunk in stream:
    # Some chunks may carry no choices or no text delta; skip those.
    if not chunk.choices:
        continue
    content = getattr(chunk.choices[0].delta, "content", None)
    if content:
        full_response += content
        print(content, end="", flush=True)

Streaming response:
This image, img_a29e46, displays a vintage, light blue or mint green Volkswagen Beetle, positioned in a side profile view. The car is characterized by its classic rounded fenders, chrome hubcaps, chrome trim around the windows, and chrome bumpers, with a subtle white stripe running along its lower side panel. It is parked on a paved surface made of rectangular, light-colored interlocking stones or cobblestones.

Behind the car stands a building with a light yellow or ochre-colored stucco wall. The building features a brown wooden window with two distinct arched panels on the left side and a larger double brown wooden door, framed in white, serving as an entrance on the right. Hints of greenery are visible above the wall on the right side of the image, contributing to an overall atmosphere of nostalgia and charm, suggesting a setting in a historic town.
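The accumulation logic in the loop above can be pulled into a small helper and exercised offline. The sketch below assumes only that each chunk exposes `choices[0].delta.content`, as in the streaming cell. The helper `collect_stream_text` and the `fake_chunk` stand-ins are illustrative and not part of the VLM Run client.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def collect_stream_text(chunks: Iterable[Any]) -> str:
    """Join the text deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # chunk without choices, nothing to read
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object shaped like a streamed chunk (for testing only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    fake_chunk("A vintage "),
    SimpleNamespace(choices=[]),  # chunk with no choices
    fake_chunk(None),             # chunk with an empty delta
    fake_chunk("Beetle."),
]
print(collect_stream_text(demo_chunks))  # A vintage Beetle.
```

Collecting the parts in a list and joining once avoids repeated string concatenation, which matters little for short captions but keeps the helper cheap for long responses.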

---

## Conclusion

This cookbook walked through the image understanding, generation, and parsing capabilities of **VLM Run Orion**, all through the Agent Chat Completions API.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a Pydantic model as the `response_model` parameter to get type-safe, validated responses with automatic parsing.
3. **Cached Properties**: Response models can include `@cached_property` decorators to lazily download and cache images, masks, and other binary data.
4. **Streaming Support**: For long-running tasks, enable streaming to receive partial results as they become available, improving user experience.
5. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.
6. **Rich Rendering**: Built-in visualization methods like `render()` make it easy to display detection results directly in notebooks.
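To make the cached-property takeaway concrete, here is a minimal, self-contained sketch of the lazy-download pattern. It uses a standard-library dataclass as a stand-in for a Pydantic response model, and an injected `fetch` function instead of a real HTTP call so it runs offline. The names `LazyImageRef`, `fetch`, and `fake_fetch` are hypothetical and do not come from the VLM Run client.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import Callable, List


@dataclass
class LazyImageRef:
    """Hypothetical stand-in for a response model that holds an image URL."""

    url: str
    fetch: Callable[[str], bytes]  # injected downloader (assumption for this sketch)

    @cached_property
    def data(self) -> bytes:
        # Runs on first access only; the result is then stored on the instance.
        return self.fetch(self.url)


calls: List[str] = []


def fake_fetch(url: str) -> bytes:
    """Record each call and return placeholder bytes instead of downloading."""
    calls.append(url)
    return b"image-bytes-for:" + url.encode()


ref = LazyImageRef(url="https://example.com/img.jpg", fetch=fake_fetch)
first = ref.data   # triggers the fetch
second = ref.data  # served from the cache
print(len(calls))  # 1
```

Nothing is downloaded until `data` is first read, and repeated reads reuse the stored bytes, which is the behavior the takeaway describes for images and masks.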

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)

Happy building!