<a href="https://colab.research.google.com/github/MehediAhamed/vlmrun-cookbook/blob/orion-image-understanding-code-fix/notebooks/12_orion_image_understanding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Image Understanding, Reasoning and Execution

This comprehensive cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) image understanding, reasoning and execution capabilities. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

In this notebook, we use the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface that lets you build visual intelligence applications with the familiar chat-completions format.

We'll cover the following topics, in the order they appear below:
 1. Image VQA (captioning, tagging, question-answering)
 2. Object Detection (objects, faces, people), including face blurring
 3. Keypoint Detection (counting and localization)
 4. Object Segmentation (pixel-level masks)
 5. OCR (text detection, recognition, and understanding)
 6. Image Generation (virtual try-on)
 7. Template Matching (finding a template within a reference image)
 8. UI Parsing (Graphical UI parsing and understanding)
 9. Streaming Responses

## Prerequisites

- Python 3.10+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- VLM Run Python Client with OpenAI extra `vlmrun[openai]`

## Setup

First, install the required packages and configure the environment.

In [3]:
# Install required packages
!pip install vlmrun[openai] --upgrade --quiet
!pip install pillow requests numpy --quiet


In [4]:
import os
import getpass
import json
from typing import List, Any
from functools import cached_property

import numpy as np
from PIL import Image
from pydantic import BaseModel, Field

VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass("Enter your VLM Run API key: ")

Enter your VLM Run API key: ··········


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.

In [5]:
from vlmrun.client import VLMRun

client = VLMRun(
    api_key=VLMRUN_API_KEY, base_url="https://agent.vlm.run/v1"
)
print("VLM Run client initialized successfully!")
print("Base URL: https://agent.vlm.run/v1")
print("Model: vlmrun-orion-1")

VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1


## Response Models (dtypes)

We define Pydantic models for structured outputs. These models include **cached properties** that automatically download and convert images/masks from URLs to PIL Images or numpy arrays for easy manipulation.

In [6]:
from PIL import ImageDraw
from vlmrun.common.utils import download_image


class ImageUrlResponse(BaseModel):
    """Response model for image URL operations."""
    url: str = Field(..., description="Pre-signed URL to the image")

    @cached_property
    def image(self) -> Image.Image | None:
        """Download and return the image as a PIL Image (for image types)."""
        return download_image(self.url) if self.url else None


class ImageUrlListResponse(BaseModel):
    """Response model for multiple image URLs."""
    urls: List[ImageUrlResponse] = Field(..., description="List of pre-signed image URL responses")

    @cached_property
    def images(self) -> List[Image.Image]:
        """Download and return all images as PIL Images."""
        return [item.image for item in self.urls if item.image is not None]


class DetectionsResponse(BaseModel):
    """Collection of object detections."""

    class Detection(BaseModel):
        """Single object detection result."""
        label: str = Field(..., description="Name of the detected object")
        xywh: tuple[float, float, float, float] = Field(..., description="Bounding box (x, y, width, height) normalized from 0-1")
        confidence: float | None = Field(None, description="Detection confidence score from 0-1")

    detections: List[Detection] = Field(..., description="List of detected objects with bounding boxes")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for detection in self.detections:
            x, y, w, h = detection.xywh
            draw.rectangle([int(x * W), int(y * H), int((x + w) * W), int((y + h) * H)], outline="red", width=4)
            draw.text((int(x * W), int(y * H)), detection.label, fill="white", font_size=12)
        return vis


class KeypointsResponse(BaseModel):
    """Collection of keypoint detections."""

    class KeyPoint(BaseModel):
        """Single keypoint detection."""
        xy: tuple[float, float] = Field(..., description="Normalized keypoint coordinates (x, y) between 0-1")
        label: str = Field(..., description="Label of the keypoint")

    keypoints: List[KeyPoint] = Field(..., description="List of detected keypoints")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the keypoint detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for keypoint in self.keypoints:
            x, y = keypoint.xy
            draw.circle([int(x * W), int(y * H)], 5, fill="green")
            draw.text((int(x * W), int(y * H)), keypoint.label, fill="white", font_size=12)
        return vis


print("Response models defined successfully!")
print("Models include cached properties for automatic image/mask downloading.")

Response models defined successfully!
Models include cached properties for automatic image/mask downloading.
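
The `xywh` and `xy` fields hold coordinates normalized to the 0-1 range, so they have to be scaled by the image size before drawing. As a small illustration of the arithmetic that `render()` performs, the sketch below converts one normalized box into pixel corners. `xywh_to_pixel_box` is a hypothetical helper name, and the 1000x500 image size is an assumed example value.

```python
def xywh_to_pixel_box(
    xywh: tuple[float, float, float, float], width: int, height: int
) -> tuple[int, int, int, int]:
    """Convert a normalized (x, y, w, h) box into pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    return (int(x * width), int(y * height), int((x + w) * width), int((y + h) * height))


# Assumed example: a box covering the middle half of a 1000x500 image.
box = xywh_to_pixel_box((0.25, 0.5, 0.5, 0.25), width=1000, height=500)
print(box)  # (250, 250, 750, 375)
```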


## Helper Functions

We create helper functions to simplify making chat completion requests with structured outputs.

In [7]:
import hashlib
import cachetools
from typing import Type, TypeVar
from IPython.display import HTML
from vlmrun.common.image import encode_image


T = TypeVar('T', bound=BaseModel)


def display(images: Image.Image | list[Image.Image], texts: list[str] | str | None = None, width: int = 300):
    if isinstance(images, Image.Image):
        images = [images]
    if texts is None:
        texts = [None] * len(images)
    elif isinstance(texts, str):
        texts = [texts]
    elif len(texts) != len(images):
        raise ValueError("`texts` must be a list of the same length as `images`")

    imgs_html = ""
    for image, text in zip(images, texts):
        W, H = image.size
        if W > width:
            H = int(H * width / W)
            W = width
            image = image.resize((W, H))
        im_bytes = encode_image(image, format="JPEG")
        imgs_html += f"<div style='display:inline-block; margin:5px; text-align:center'>"
        imgs_html += f"<img src='{im_bytes}' style='width:{width}px; border-radius:6px'>"
        if text:
            imgs_html += f"<div style='font-size:12px; color:#666; margin-top:5px'>{text}</div>"
        imgs_html += f"</div>"
    return HTML(f"<div style='display:flex; flex-wrap:wrap'>{imgs_html}</div>")


def custom_key(prompt: str, images: list[Image.Image] | list[str] | None = None, response_model: Type[T] | None = None, model: str = "vlmrun-orion-1:auto"):
    """Custom key for caching chat_completion."""
    image_keys = []
    for image in images or []:  # `images` may be None for text-only prompts
        if isinstance(image, Image.Image):
            thumb = image.copy()
            thumb.thumbnail((128, 128))
            encoded = encode_image(thumb, format="JPEG")
            image_keys.append(encoded)
        elif isinstance(image, str):
            image_keys.append(image)


    response_key = hashlib.sha256(json.dumps(response_model.model_json_schema(), sort_keys=True).encode()).hexdigest() if response_model else ""
    return (prompt, tuple(image_keys), response_key, model)


@cachetools.cached(cache=cachetools.TTLCache(maxsize=1000, ttl=3600), key=custom_key)
def chat_completion(
    prompt: str,
    images: list[Image.Image] | list[str] | None = None,
    response_model: Type[T] | None = None,
    model: str = "vlmrun-orion-1:auto"
) -> Any:
    """
    Make a chat completion request with optional images and structured output.

    Args:
        prompt: The text prompt/instruction
        images: Optional list of images to process (either PIL Images or URLs)
        response_model: Optional Pydantic model for structured output
        model: Model to use (default: vlmrun-orion-1:auto)

    Returns:
        Parsed response model if response_model provided, else raw response text
    """
    content = []
    content.append({"type": "text", "text": prompt})

    if images:
        for image in images:
            if isinstance(image, str):
                assert image.startswith("http"), "Image URLs must start with http or https"
                content.append({
                    "type": "image_url",
                    "image_url": {"url": image, "detail": "auto"}
                })
            elif isinstance(image, Image.Image):
                content.append({
                    "type": "image_url",
                    "image_url": {"url": encode_image(image, format="JPEG"), "detail": "auto"}
                })
            else:
                raise ValueError("Images must be either PIL Images or URLs")

    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": content}]
    }

    if response_model:
        kwargs["response_format"] = {
            "type": "json_schema",
            "schema": response_model.model_json_schema()
        }

    response = client.agent.completions.create(**kwargs)
    response_text = response.choices[0].message.content

    if response_model:
        return response_model.model_validate_json(response_text)

    return response_text

print("Helper functions defined!")

Helper functions defined!
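
The cache key in `custom_key` hashes the response model's JSON schema, so two requests with the same prompt but different output schemas are cached separately. The stand-alone sketch below shows why `sort_keys=True` matters: it makes the hash independent of dictionary key order. The `schema_fingerprint` name and the toy schemas are illustrative assumptions, not part of the VLM Run SDK.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serializable schema."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


# Toy schemas (assumptions for illustration only).
schema_a = {"type": "object", "properties": {"url": {"type": "string"}}}
schema_b = {"properties": {"url": {"type": "string"}}, "type": "object"}  # same content, different key order
schema_c = {"type": "object", "properties": {"urls": {"type": "array"}}}

print(schema_fingerprint(schema_a) == schema_fingerprint(schema_b))  # True
print(schema_fingerprint(schema_a) == schema_fingerprint(schema_c))  # False
```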


## Image Understanding, Reasoning, and Execution Capabilities

VLM Run agents can perform a wide range of image processing tasks including object detection, face detection, segmentation, OCR, and more.

### 1. Captioning & Tagging

The simplest operation: pass an image URL and ask for a detailed caption.

In [8]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

result = chat_completion(
    prompt=f"Generate a detailed description of this image.",
    images=[IMAGE_URL],
)
print(">> RESPONSE")
print(result)
print(">> IMAGE")
display(images=[download_image(IMAGE_URL)], texts=[result], width=600)

>> RESPONSE
This image shows a vintage Volkswagen Beetle, painted a light mint green or aqua, parked on a street. The car is facing right and appears to be well-maintained, featuring chrome bumpers, trim around the windows, and shiny chrome hubcaps on its black tires. A subtle white stripe can be seen along the side.

Behind the car, there's a building with a textured, pale yellow or ochre stucco wall, showing some signs of age and weathering. A prominent dark brown wooden double door, with vertical planks and visible hardware, is set within a white, slightly raised frame on the right side of the building. To the left, above the car's roof, is a window with dark brown wooden shutters that have an arched, traditional design.

The car is parked on a surface paved with light-colored, irregularly laid rectangular or square pavers, resembling cobblestones or bricks. The ground is dry and clean, and the bright, sunny lighting illuminates the entire scene.
>> IMAGE


### 2a. Object Detection

Detect objects in images with bounding boxes. The agent can detect common objects like people, vehicles, animals, and more.

In [9]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/10-finding-nemo.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the sea creatures in this image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} objects"], width=600)

>> RESPONSE
detections=[Detection(label='Nemo', xywh=(0.014, 0.475, 0.163, 0.499), confidence=None), Detection(label='Dory', xywh=(0.324, 0.136, 0.438, 0.498), confidence=None), Detection(label='Marlin', xywh=(0.28, 0.348, 0.226, 0.295), confidence=None), Detection(label='Marlin', xywh=(0.486, 0.634, 0.352, 0.35), confidence=None), Detection(label='Crush', xywh=(0.021, 0.036, 0.234, 0.261), confidence=None), Detection(label='Crush', xywh=(0.116, 0.238, 0.205, 0.154), confidence=None), Detection(label='Squirt', xywh=(0.785, 0.567, 0.187, 0.255), confidence=None), Detection(label='Squirt', xywh=(0.805, 0.074, 0.179, 0.189), confidence=None), Detection(label='Squirt', xywh=(0.805, 0.4, 0.17, 0.203), confidence=None), Detection(label='Bruce the shark', xywh=(0.148, 0.534, 0.15, 0.358), confidence=None), Detection(label='Bruce the shark', xywh=(0.012, 0.078, 0.207, 0.185), confidence=None), Detection(label='Bruce the shark', xywh=(0.148, 0.024, 0.21, 0.161), confidence=None)]

>> IMAGE
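
Structured detections are easy to post-process with ordinary Python. As a sketch, the snippet below tallies the labels from the response printed above. The `labels` list is copied by hand from that output so the example runs without an API call; in the notebook you would build it from `result.detections` instead.

```python
from collections import Counter

# Labels copied from the detection output above; in the notebook this would be
# [d.label for d in result.detections].
labels = [
    "Nemo", "Dory", "Marlin", "Marlin", "Crush", "Crush",
    "Squirt", "Squirt", "Squirt",
    "Bruce the shark", "Bruce the shark", "Bruce the shark",
]

counts = Counter(labels)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```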


### 2b. Object Detection with Specific Prompt

You can specify exactly which objects to detect using natural language.

In [10]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect the 'car' and its 'wheels' in the image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
detections=[Detection(label='car', xywh=(0.053, 0.334, 0.881, 0.43), confidence=0.98), Detection(label='wheels', xywh=(0.142, 0.586, 0.164, 0.182), confidence=0.98), Detection(label='wheels', xywh=(0.703, 0.576, 0.162, 0.192), confidence=0.97)]

>> IMAGE
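
Because `confidence` is optional (it was `None` for every detection in section 2a), downstream filtering has to handle missing scores explicitly. The following is a minimal sketch that uses plain tuples copied from the output above instead of the Pydantic models. `filter_by_confidence` and the 0.975 threshold are assumptions chosen for illustration.

```python
def filter_by_confidence(detections, threshold=0.975, keep_unscored=False):
    """Keep (label, xywh, confidence) tuples whose confidence meets the threshold.

    Detections without a score (confidence is None) are kept only if
    keep_unscored is True.
    """
    kept = []
    for label, xywh, confidence in detections:
        if confidence is None:
            if keep_unscored:
                kept.append((label, xywh, confidence))
        elif confidence >= threshold:
            kept.append((label, xywh, confidence))
    return kept


# Values copied from the response above.
detections = [
    ("car", (0.053, 0.334, 0.881, 0.43), 0.98),
    ("wheels", (0.142, 0.586, 0.164, 0.182), 0.98),
    ("wheels", (0.703, 0.576, 0.162, 0.192), 0.97),
]
confident = filter_by_confidence(detections)
print([label for label, _, _ in confident])  # ['car', 'wheels']
```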


### 2c. Face Detection

Detect and localize faces in images with bounding boxes.

In [11]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the faces in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} faces"], width=600)

>> RESPONSE
detections=[Detection(label='HAIDI STROUD-WATTS', xywh=(0.063, 0.194, 0.275, 0.53), confidence=0.98), Detection(label='ANNABELLE DROULERS', xywh=(0.346, 0.188, 0.274, 0.536), confidence=0.97), Detection(label='VONNIE QUINN', xywh=(0.632, 0.194, 0.274, 0.53), confidence=0.96)]

>> IMAGE


### 2d. Person Detection

Detect and localize people in images with bounding boxes.

In [12]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the people in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} people"], width=600)

>> RESPONSE
detections=[Detection(label='person', xywh=(0.031, 0.295, 0.077, 0.252), confidence=0.98), Detection(label='person', xywh=(0.066, 0.274, 0.078, 0.279), confidence=0.97), Detection(label='person', xywh=(0.133, 0.269, 0.081, 0.288), confidence=0.96), Detection(label='person', xywh=(0.198, 0.268, 0.08, 0.298), confidence=0.95), Detection(label='person', xywh=(0.258, 0.274, 0.078, 0.303), confidence=0.94), Detection(label='person', xywh=(0.319, 0.295, 0.08, 0.289), confidence=0.93), Detection(label='person', xywh=(0.382, 0.288, 0.081, 0.301), confidence=0.92), Detection(label='person', xywh=(0.447, 0.282, 0.08, 0.31), confidence=0.91), Detection(label='person', xywh=(0.513, 0.315, 0.081, 0.287), confidence=0.9), Detection(label='person', xywh=(0.578, 0.309, 0.081, 0.299), confidence=0.89), Detection(label='person', xywh=(0.642, 0.318, 0.081, 0.297), confidence=0.88), Detection(label='person', xywh=(0.707, 0.312, 0.081, 0.31), confidence=0.87), Detection(label='person', xywh=(0.

### 2e. Detect and blur faces

Detect faces and blur them for privacy protection. Here we combine object / face detection with an image tool.

In [13]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Blur all the faces in this image and return the blurred image",
    images=[image],
    response_model=ImageUrlResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.image], texts=[f"Blurred image"], width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/da021367-a6ff-4556-99dc-cd7f936ac0cb/img_0efab6.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251216%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251216T192032Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=4d377c555f679e0b5e913b5033e86105058a26f045b9d869d034b2596bc9d5b738077ec4646a5d63d002703363760ed9898726b79e36685356226e51d807ff4fc1d9660816db3e0389b63789621d123e05b3ae0e3ff781cb741e74914056fbc71b40624eaa86a46753cd2343f81327d69e0c62af788727e592d9d81c630d900cf3c74658113353fadd687c4297480b178dd372ac6f40224e8e2cc51a9555914553331cf5085582854b838ae34a1ee495435b0fcca11f3f4baf682ffebaf83812a7ed29b713fc96795974c13a6eed1aed02c396f5cefcb9a472914ba34bd4c75b4c00f796b852e01c731badec6fa34db3e77bf2b2f401c5abfbbb9e4fb2c1cf30'

>> IMAGE
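
The returned URL is a pre-signed Google Cloud Storage link, so it stops working once its signature expires. The sketch below reads the `X-Goog-Date` and `X-Goog-Expires` query parameters (both visible in the response above) to compute the expiry time with only the standard library. The URL here is shortened, with a made-up bucket path and a placeholder signature, and `presigned_url_expiry` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url: str) -> datetime:
    """Return the UTC expiry time encoded in a pre-signed URL's query string."""
    query = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Goog-Expires"][0]))


# Shortened example URL; the bucket path and signature are placeholders.
url = (
    "https://storage.googleapis.com/example-bucket/img.jpg"
    "?X-Goog-Algorithm=GOOG4-RSA-SHA256"
    "&X-Goog-Date=20251216T192032Z"
    "&X-Goog-Expires=604800"
    "&X-Goog-Signature=PLACEHOLDER"
)
expiry = presigned_url_expiry(url)
print(expiry.isoformat())  # 2025-12-23T19:20:32+00:00
```

With the values shown in the response, the link is valid for 604800 seconds (7 days), so save the image locally if you need it for longer.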


### 3. Keypoint Detection

Detect keypoints in images for counting and localization tasks.

In [14]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.object-detection/donuts.png"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the donuts as keypoints and return the coordinates.",
    images=[image],
    response_model=KeypointsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
keypoints=[KeyPoint(xy=(0.0391, 0.8525), label='donuts'), KeyPoint(xy=(0.1885, 0.7686), label='donuts'), KeyPoint(xy=(0.2109, 0.959), label='donuts'), KeyPoint(xy=(0.5, 0.832), label='donuts'), KeyPoint(xy=(0.5, 0.5), label='donuts'), KeyPoint(xy=(0.7686, 0.6738), label='donuts'), KeyPoint(xy=(0.8105, 0.9414), label='donuts'), KeyPoint(xy=(0.7881, 0.3594), label='donuts'), KeyPoint(xy=(0.959, 0.5), label='donuts'), KeyPoint(xy=(0.832, 0.1094), label='donuts'), KeyPoint(xy=(0.5596, 0.1885), label='donuts'), KeyPoint(xy=(0.3594, 0.1094), label='donuts'), KeyPoint(xy=(0.335, 0.3594), label='donuts'), KeyPoint(xy=(0.1094, 0.5), label='donuts'), KeyPoint(xy=(0.1094, 0.1885), label='donuts')]

>> IMAGE


### 4. Segmentation

Create pixel-level segmentation masks for objects, people or regions in images.

In [15]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the people in this image, and segment them.",
    images=[image],
    response_model=ImageUrlResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.image, width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/84367bea-1750-46b5-a19a-2978edae081f/img_a2c8b7.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251216%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251216T192212Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=4d6992d9388cf940b208bff3db40c0f9f7b1791cf4c6931c97a2d434bc383bd289b0e6281536536537989bfa1a28f2f89c2f8ce5abf1abc1a1f362dfa95a6a91b07f4274698048a5074396c7165e7f77c31ca39a81e00c2a977085d2138fcd235d163b7634144dcc794f518be68a6b713b2fc139e96e9eaef5fa9fdf9241d12dfb6a25eaadca99d2979606b560adfd249087e1d057fc2cc8a5f10342f5b20395cdf7f4f0b10cdb8bc0d78b842f2012499bffa0eb3ca771cedfc70108d74b2ec0e12e34da4420623dfaa0921abbed049dfee098346ee7df7921866d3f06536b2af41a95dbebd0d859d365f8e5eb7e7b13b2bb89af5143675694a4eb3bb7e3d5ae'

>> IMAGE


### 6. OCR (Optical Character Recognition)

Extract text from images using OCR capabilities.

In [16]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Read the text in this image",
    images=[image],
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[image], texts=[result], width=600)

>> RESPONSE
The text in the image img_f9922b reads: "Today is Thursday, October 20th- But it definitely feels like a Friday. I'm already considering making a second cup of coffee- and I haven't even finished my first. Do I have a problem? Sometimes I'll flip through older notes I've taken, and my handwriting is unrecognizable, Perhaps it depends on the type of pen I use? I've tried writing in all caps But IT Looks So FORCED AND UNNATURAL Often times, I'll just take notes on my laptop, but I still seem to gravitate toward pen and paper. Any advice on what to I'm prove ? I already feel stressed out looking back at what I've just written- it looks like 3 different people wrote this!"

>> IMAGE


### 5. Image Generation

Create, modify and remix images from text prompts or existing visuals.

### 5a. Virtual Try-On

Generate a virtual try-on of a dress on a person, rendered from multiple views with seamless compositing.

In [22]:
img_1 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png")
img_2 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png")
display([img_1, img_2], width=400)

In [40]:
# Generate a virtual try-on of a dress on a person, with unique views
result = chat_completion(
    prompt="You are provided with two images: one of a dress(the first image) and one of a person(the second image). Generate a few highly realistic virtual try on by seamlessly compositing the dress onto the person, ensuring natural fit, alignment, and that the person appears fully and appropriately dressed. Provide 2 images (9:16 aspect ratio) as output: one from the front and one from the side. Always give presigned valid URL with https",
    images=[
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png",
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png"

    ],
    response_model=ImageUrlListResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGES")
display(result.images, width=400)

>> RESPONSE
urls=[ImageUrlResponse(url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/7015e08e-358a-45c6-8fb9-6013945df99a/img_e24214.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251216%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251216T194728Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=594c159715d6ce2ca5b3c8e708d21e9f61cb148479b445fd6d0709378ce10ee37838d761b0fa9027e6641f13bfcab552fff982e477bd77f712e3321b023268f0896aa20545f5a1f96dd904d9eb89b2ad61fd2b45b73b99ae19f786b44f1a575e4037e4653dfba8a9b3fba3c412fad489067a4fdab758fa211480e1e34fa69860a1dfe5a5e8a9c6f9b7671e2247b2114b661ffe0834f275b9dcbfe7122dd4483d5074ca0cd284d6e70faa0488efa0190961229cf8ebce761db2aca6356d29035a796a450362168d924d05220f2b62e9f0318d12f1aaf466472b3f2c2f899fb9948ad8396ef640008ad229c50ab909242ec4218ecd7d08928ab7b2c498f1f3271c'), ImageUrlResponse(url='https://storage.goog

### 6. Template Matching

Find a template image within a larger reference image.

In [33]:
TEMPLATE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png"
REFERENCE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png"

template_img = download_image(TEMPLATE_URL)
reference_img = download_image(REFERENCE_URL)
display([template_img, reference_img], width=400)


In [34]:
result = chat_completion(
    prompt=f"Given two images, identify the specified item from the second image within the first image. Clearly highlight and draw bounding boxes around all occurrences of the item in the first image. Provide a brief description of the results.",
    images=[reference_img, template_img],  # order matches the prompt: first = reference, second = template
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(reference_img), width=600)

>> RESPONSE
detections=[Detection(label='lemon', xywh=(0.0, 0.0, 0.999, 0.999), confidence=0.99)]

>> IMAGE


### 7. UI Parsing

Parse user interface elements from screenshots.

In [35]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Parse the UI of this screenshot and detect all the UI elements.",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
detections=[Detection(label='text', xywh=(0.3779, 0.1096, 0.0332, 0.0209), confidence=None), Detection(label='icon', xywh=(0.497, 0.229, 0.0766, 0.1193), confidence=None), Detection(label='icon', xywh=(0.2862, 0.2271, 0.077, 0.1149), confidence=None), Detection(label='icon', xywh=(0.3615, 0.3448, 0.0668, 0.1039), confidence=None), Detection(label='icon', xywh=(0.6376, 0.5957, 0.0519, 0.0414), confidence=None), Detection(label='icon', xywh=(0.3036, 0.6456, 0.2022, 0.0829), confidence=None), Detection(label='icon', xywh=(0.5168, 0.6492, 0.1814, 0.0774), confidence=None), Detection(label='icon', xywh=(0.3058, 0.7263, 0.1988, 0.0925), confidence=None), Detection(label='icon', xywh=(0.5173, 0.7267, 0.1776, 0.0923), confidence=None), Detection(label='icon', xywh=(0.9246, 0.9386, 0.0683, 0.0565), confidence=None), Detection(label='icon', xywh=(0.3007, 0.581, 0.2063, 0.0685), confidence=None), Detection(label='icon', xywh=(0.2884, 0.8365, 0.1219, 0.0916), confidence=None), Detectio

### 8. Streaming Responses

For long-running tasks, you can use streaming to get partial results as they become available.

In [36]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

stream = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Describe this image in detail"},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}}
        ]
    }],
    stream=True
)

print("Streaming response:")
full_response = ""
for chunk in stream:
    if getattr(chunk.choices[0].delta, "content", None):
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)

Streaming response:
This image depicts a classic, pastel mint green Volkswagen Beetle, commonly known as a "Bug," parked on a textured street. The car is adorned with chrome accents on its bumpers, hubcaps, window trim, and side mirrors, and features a subtle white stripe along its side below the windows. Behind the vehicle stands a building with a warm, faded yellow or ochre facade, resembling stucco. This building has two distinct dark brown wooden doors; the left door exhibits two arched panels, while the right is a tall, rectangular door with a white frame. Both doors display noticeable wood grain, indicating their age. The foreground reveals a street or sidewalk paved with interlocking gray and brown pavers or cobblestones. The overall impression of the image is one of calmness, charm, and nostalgia, suggesting a bright day in a historic or picturesque town.
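
The loop above skips chunks whose `delta.content` is empty, which typically happens for role-only or final chunks in OpenAI-style streams. To show that logic in isolation, the sketch below replays it against a hand-built fake stream. `fake_stream` and `collect_stream` are illustrative names, and the chunk objects only mimic the attributes the loop reads; they are not real SDK objects.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield objects shaped like streaming chunks: chunk.choices[0].delta.content."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(stream) -> str:
    """Concatenate the non-empty content deltas from a stream of chunks."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # defensively skip chunks that carry no choices
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


text = collect_stream(fake_stream([None, "A classic ", "", "Beetle.", None]))
print(text)  # A classic Beetle.
```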

---

## Conclusion

This cookbook demonstrated the comprehensive capabilities of the **VLM Run Orion Image Agent API**.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a Pydantic model as the `response_model` argument of the `chat_completion` helper, which sends its JSON schema as `response_format` and validates the reply into a typed object.
3. **Cached Properties**: Response models can include `@cached_property` decorators to lazily download and cache images, masks, and other binary data.
4. **Streaming Support**: For long-running tasks, enable streaming to receive partial results as they become available, improving user experience.
5. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.
6. **Rich Rendering**: The `render()` methods defined on the response models in this notebook make it easy to display detection and keypoint results directly in notebooks.

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)

Happy building!