<a href="https://colab.research.google.com/github/MehediAhamed/vlmrun-cookbook/blob/mehedi/notebooks/12_orion_image_understanding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Image Understanding, Reasoning and Execution

This comprehensive cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) image understanding, reasoning and execution capabilities. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

This notebook uses the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface that lets you build visual intelligence workflows with the familiar chat-completions request format.

We'll cover the following topics:
 1. Captioning and tagging
 2. Object detection (objects, faces, people) and face blurring
 3. Keypoint detection
 4. Segmentation
 5. OCR (text detection, recognition, and understanding)
 6. Image generation (virtual try-on)
 7. Template matching
 8. UI parsing
 9. Streaming responses

## Prerequisites

- Python 3.10+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- VLM Run Python Client with OpenAI extra `vlmrun[openai]`

## Setup

First, install the required packages and configure the environment.

In [1]:
# Install required packages
!pip install vlmrun[openai] --upgrade --quiet
!pip install pillow requests numpy --quiet


In [2]:
import os
import getpass
import json
from typing import List, Any
from functools import cached_property

import numpy as np
from PIL import Image
from pydantic import BaseModel, Field

VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass("Enter your VLM Run API key: ")

Enter your VLM Run API key: ··········


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.

In [3]:
from vlmrun.client import VLMRun

client = VLMRun(
    api_key=VLMRUN_API_KEY, base_url="https://agent.vlm.run/v1"
)
print("VLM Run client initialized successfully!")
print(f"Base URL: https://agent.vlm.run/v1")
print(f"Model: vlmrun-orion-1")

VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1


## Response Models (dtypes)

We define Pydantic models for structured outputs. The image URL models include **cached properties** that lazily download each URL and return a PIL Image, and the detection and keypoint models add a `render()` method that draws the results on an image.

In [4]:
from PIL import ImageDraw
from vlmrun.common.utils import download_image


class ImageUrlResponse(BaseModel):
    """Response model for image URL operations."""
    url: str = Field(..., description="Pre-signed URL to the image")

    @cached_property
    def image(self) -> Image.Image | None:
        """Download and return the image as a PIL Image (for image types)."""
        return download_image(self.url) if self.url else None


class ImageUrlListResponse(BaseModel):
    """Response model for multiple image URLs."""
    urls: List[ImageUrlResponse] = Field(..., description="List of pre-signed image URL responses")

    @cached_property
    def images(self) -> List[Image.Image]:
        """Download and return all images as PIL Images."""
        return [item.image for item in self.urls if item.image is not None]


class DetectionsResponse(BaseModel):
    """Collection of object detections."""

    class Detection(BaseModel):
        """Single object detection result."""
        label: str = Field(..., description="Name of the detected object")
        xywh: tuple[float, float, float, float] = Field(..., description="Bounding box (x, y, width, height) normalized from 0-1")
        confidence: float | None = Field(None, description="Detection confidence score from 0-1")

    detections: List[Detection] = Field(..., description="List of detected objects with bounding boxes")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for detection in self.detections:
            x, y, w, h = detection.xywh
            draw.rectangle([int(x * W), int(y * H), int((x + w) * W), int((y + h) * H)], outline="red", width=4)
            draw.text((int(x * W), int(y * H)), detection.label, fill="white", font_size=12)
        return vis


class KeypointsResponse(BaseModel):
    """Collection of keypoint detections."""

    class KeyPoint(BaseModel):
        """Single keypoint detection."""
        xy: tuple[float, float] = Field(..., description="Normalized keypoint coordinates (x, y) between 0-1")
        label: str = Field(..., description="Label of the keypoint")

    keypoints: List[KeyPoint] = Field(..., description="List of detected keypoints")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the keypoint detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for keypoint in self.keypoints:
            x, y = keypoint.xy
            draw.circle([int(x * W), int(y * H)], 5, fill="green")
            draw.text((int(x * W), int(y * H)), keypoint.label, fill="white", font_size=12)
        return vis


print("Response models defined successfully!")
print("Models include cached properties for automatic image/mask downloading.")

Response models defined successfully!
Models include cached properties for automatic image/mask downloading.
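
The `render()` methods above scale normalized coordinates by the image size before drawing. As a minimal standalone sketch of that conversion (the helper name `xywh_to_pixel_box` and the clamping step are illustrative assumptions, not part of the VLM Run SDK):

```python
def xywh_to_pixel_box(
    xywh: tuple[float, float, float, float], width: int, height: int
) -> tuple[int, int, int, int]:
    """Convert a normalized (x, y, w, h) box to a pixel (left, top, right, bottom) box."""
    x, y, w, h = xywh
    left, top = int(x * width), int(y * height)
    right, bottom = int((x + w) * width), int((y + h) * height)
    # Clamp to the image bounds in case a box overshoots slightly (an added safeguard).
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return left, top, right, bottom


# A box covering the lower-middle of a hypothetical 400x200 image.
print(xywh_to_pixel_box((0.25, 0.5, 0.5, 0.25), width=400, height=200))  # (100, 100, 300, 150)
```

The returned tuple has the layout that `Image.crop()` expects, so it can be used to cut a detection out of the original image.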


## Helper Functions

We create helper functions to simplify making chat completion requests with structured outputs.

In [5]:
import hashlib
import cachetools
from typing import Type, TypeVar
from IPython.display import HTML
from vlmrun.common.image import encode_image


T = TypeVar('T', bound=BaseModel)


def display(images: Image.Image | list[Image.Image], texts: list[str] | str | None = None, width: int = 300):
    """Render one or more images inline as HTML, each with an optional caption."""
    if isinstance(images, Image.Image):
        images = [images]
    if texts is None:
        texts = [None] * len(images)
    elif isinstance(texts, str):
        texts = [texts]
    elif len(texts) != len(images):
        raise ValueError("`texts` must be a list of the same length as `images`")

    imgs_html = ""
    for image, text in zip(images, texts):
        W, H = image.size
        if W > width:
            H = int(H * width / W)
            W = width
            image = image.resize((W, H))
        im_bytes = encode_image(image, format="JPEG")
        imgs_html += f"<div style='display:inline-block; margin:5px; text-align:center'>"
        imgs_html += f"<img src='{im_bytes}' style='width:{width}px; border-radius:6px'>"
        if text:
            imgs_html += f"<div style='font-size:12px; color:#666; margin-top:5px'>{text}</div>"
        imgs_html += f"</div>"
    return HTML(f"<div style='display:flex; flex-wrap:wrap'>{imgs_html}</div>")


def custom_key(prompt: str, images: list[Image.Image] | list[str] | None = None, response_model: Type[T] | None = None, model: str = "vlmrun-orion-1:auto"):
    """Custom key for caching chat_completion."""
    image_keys = []
    # `images` defaults to None for text-only prompts, so fall back to an empty list.
    for image in images or []:
        if isinstance(image, Image.Image):
            # Key PIL images by a small JPEG thumbnail to keep the cache key compact.
            thumb = image.copy()
            thumb.thumbnail((128, 128))
            encoded = encode_image(thumb, format="JPEG")
            image_keys.append(encoded)
        elif isinstance(image, str):
            image_keys.append(image)

    # Hash the JSON schema so that different response models produce different keys.
    response_key = hashlib.sha256(json.dumps(response_model.model_json_schema(), sort_keys=True).encode()).hexdigest() if response_model else ""
    return (prompt, tuple(image_keys), response_key, model)


@cachetools.cached(cache=cachetools.TTLCache(maxsize=1000, ttl=3600), key=custom_key)
def chat_completion(
    prompt: str,
    images: list[Image.Image] | list[str] | None = None,
    response_model: Type[T] | None = None,
    model: str = "vlmrun-orion-1:auto"
) -> Any:
    """
    Make a chat completion request with optional images and structured output.

    Args:
        prompt: The text prompt/instruction
        images: Optional list of images to process (either PIL Images or URLs)
        response_model: Optional Pydantic model for structured output
        model: Model to use (default: vlmrun-orion-1:auto)

    Returns:
        Parsed response model if response_model provided, else raw response text
    """
    content = []
    content.append({"type": "text", "text": prompt})

    if images:
        for image in images:
            if isinstance(image, str):
                if not image.startswith(("http://", "https://")):
                    raise ValueError("Image URLs must start with http:// or https://")
                content.append({
                    "type": "image_url",
                    "image_url": {"url": image, "detail": "auto"}
                })
            elif isinstance(image, Image.Image):
                content.append({
                    "type": "image_url",
                    "image_url": {"url": encode_image(image, format="JPEG"), "detail": "auto"}
                })
            else:
                raise ValueError("Images must be either PIL Images or URLs")

    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": content}]
    }

    if response_model:
        kwargs["response_format"] = {
            "type": "json_schema",
            "schema": response_model.model_json_schema()
        }

    response = client.agent.completions.create(**kwargs)
    response_text = response.choices[0].message.content

    if response_model:
        return response_model.model_validate_json(response_text)

    return response_text

print("Helper functions defined!")

Helper functions defined!


## Image Understanding, Reasoning, and Execution Capabilities

VLM Run agents can perform a wide range of image processing tasks including object detection, face detection, segmentation, OCR, and more.

### 1. Captioning & Tagging

The simplest operation: pass an image URL and ask for a detailed description.

In [6]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

result = chat_completion(
    prompt=f"Generate a detailed description of this image.",
    images=[IMAGE_URL],
)
print(">> RESPONSE")
print(result)
print(">> IMAGE")
display(images=[download_image(IMAGE_URL)], texts=[result], width=600)

>> RESPONSE
The image features a vintage, mint green Volkswagen Beetle parked on a cobblestone street against a warm yellow building. The car appears to be in excellent, well-maintained condition with a glossy finish. Distinguishing features include chrome bumpers at the front and rear, chrome trim along the side panels, chrome hubcaps with a central cap, and subtle white-wall tires. A side mirror and a vent on the rear quarter window are also visible. The car's classic design is intact, suggesting careful preservation or restoration.

The building behind the car has a warm yellow, stucco-like facade that shows signs of age and texture but is largely intact. It features two rustic wooden doors. The door on the left is smaller, dark brown, and appears to be inset, possibly a boarded-up window or an older, less used entrance, with two distinct arched panels at the top. The door on the right is larger, also dark brown, and rectangular with visible horizontal panels, a dark handle/knob, an

### 2a. Object Detection

Detect objects in images with bounding boxes. The agent can detect common objects like people, vehicles, animals, and more.

In [7]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/10-finding-nemo.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the sea creatures in this image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} objects"], width=600)

>> RESPONSE
detections=[Detection(label='Dory', xywh=(0.442, 0.127, 0.314, 0.511), confidence=None), Detection(label='Marlin', xywh=(0.285, 0.338, 0.219, 0.283), confidence=None), Detection(label='Marlin', xywh=(0.404, 0.645, 0.215, 0.335), confidence=None), Detection(label='Nemo', xywh=(0.014, 0.463, 0.162, 0.506), confidence=None), Detection(label='Crush', xywh=(0.029, 0.015, 0.18, 0.273), confidence=None), Detection(label='Bruce', xywh=(0.101, 0.02, 0.215, 0.19), confidence=None), Detection(label='Squirt', xywh=(0.772, 0.55, 0.184, 0.255), confidence=None), Detection(label='Squirt', xywh=(0.793, 0.069, 0.192, 0.186), confidence=None), Detection(label='Squirt', xywh=(0.031, 0.074, 0.155, 0.173), confidence=None), Detection(label='Squirt', xywh=(0.025, 0.214, 0.291, 0.18), confidence=None)]

>> IMAGE
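
Since several detections share a label, a per-label tally is often more useful than the raw list. A small sketch, assuming each detection has been converted to a plain dict (for example with Pydantic's `model_dump()`); the helper name `count_labels` is hypothetical:

```python
from collections import Counter


def count_labels(detections: list[dict]) -> Counter:
    """Count how many detections were returned for each label."""
    return Counter(d["label"] for d in detections)


# Labels copied from the response above; bounding boxes omitted for brevity.
labels = ["Dory", "Marlin", "Marlin", "Nemo", "Crush", "Bruce",
          "Squirt", "Squirt", "Squirt", "Squirt"]
counts = count_labels([{"label": name} for name in labels])
print(counts.most_common(2))  # [('Squirt', 4), ('Marlin', 2)]
```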


### 2b. Object Detection with Specific Prompt

You can specify exactly which objects to detect using natural language.

In [8]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect the 'car' and its 'wheels' in the image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
detections=[Detection(label='car', xywh=(0.038, 0.327, 0.904, 0.44), confidence=0.98), Detection(label='wheel', xywh=(0.116, 0.577, 0.188, 0.19), confidence=0.96), Detection(label='wheel', xywh=(0.701, 0.575, 0.168, 0.19), confidence=0.95)]

>> IMAGE
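
When comparing detections across runs or spotting near-duplicate boxes, intersection-over-union (IoU) is the usual overlap measure. The sketch below computes it for the normalized `(x, y, w, h)` format used by `DetectionsResponse`; the helper `iou_xywh` is an illustrative addition, not something the API returns:

```python
Box = tuple[float, float, float, float]


def iou_xywh(a: Box, b: Box) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes in the same coordinate space."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis is zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


car = (0.038, 0.327, 0.904, 0.44)
front_wheel = (0.116, 0.577, 0.188, 0.19)
print(f"car vs. wheel IoU: {iou_xywh(car, front_wheel):.3f}")
print(f"car vs. itself IoU: {iou_xywh(car, car):.1f}")
```

A low IoU between the car and a wheel is expected: the wheel lies inside the car box but covers only a small part of it.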


### 2c. Face Detection

Detect and localize faces in images with bounding boxes.

In [9]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the faces in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} faces"], width=600)

>> RESPONSE
detections=[Detection(label='face', xywh=(0.063, 0.199, 0.27, 0.526), confidence=0.98), Detection(label='face', xywh=(0.352, 0.199, 0.27, 0.526), confidence=0.97), Detection(label='face', xywh=(0.651, 0.199, 0.27, 0.526), confidence=0.96)]

>> IMAGE


### 2d. Person Detection

Detect and localize people in images with bounding boxes.

In [10]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the people in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} people"], width=600)

>> RESPONSE
detections=[Detection(label='person', xywh=(0.04, 0.303, 0.074, 0.247), confidence=0.98), Detection(label='person', xywh=(0.089, 0.287, 0.083, 0.278), confidence=0.97), Detection(label='person', xywh=(0.168, 0.283, 0.082, 0.267), confidence=0.96), Detection(label='person', xywh=(0.23, 0.28, 0.082, 0.3), confidence=0.95), Detection(label='person', xywh=(0.3, 0.316, 0.082, 0.279), confidence=0.94), Detection(label='person', xywh=(0.367, 0.313, 0.083, 0.272), confidence=0.93), Detection(label='person', xywh=(0.44, 0.297, 0.083, 0.293), confidence=0.92), Detection(label='person', xywh=(0.508, 0.316, 0.082, 0.279), confidence=0.91), Detection(label='person', xywh=(0.575, 0.323, 0.083, 0.287), confidence=0.9), Detection(label='person', xywh=(0.644, 0.316, 0.083, 0.299), confidence=0.89), Detection(label='person', xywh=(0.712, 0.33, 0.083, 0.295), confidence=0.88), Detection(label='person', xywh=(0.78, 0.312, 0.095, 0.323), confidence=0.87)]

>> IMAGE
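
Detections arrive in no guaranteed spatial order, and `confidence` may be `None` (as in the sea-creature example). A possible post-processing step is to drop low-confidence boxes and sort the rest left to right. The helper name, the 0.9 threshold, and the choice to discard unscored detections are all assumptions made for illustration:

```python
def confident_left_to_right(detections: list[dict], min_confidence: float = 0.9) -> list[dict]:
    """Keep detections at or above `min_confidence`, ordered by their left edge."""
    kept = [
        d for d in detections
        # Detections without a score (confidence=None) are dropped; that policy is an assumption.
        if d.get("confidence") is not None and d["confidence"] >= min_confidence
    ]
    return sorted(kept, key=lambda d: d["xywh"][0])


# Three boxes taken from the response above, plus one hypothetical unscored box.
sample = [
    {"label": "person", "xywh": (0.575, 0.323, 0.083, 0.287), "confidence": 0.9},
    {"label": "person", "xywh": (0.04, 0.303, 0.074, 0.247), "confidence": 0.98},
    {"label": "person", "xywh": (0.78, 0.312, 0.095, 0.323), "confidence": 0.87},
    {"label": "person", "xywh": (0.3, 0.3, 0.1, 0.3), "confidence": None},
]
ordered = confident_left_to_right(sample)
print([d["xywh"][0] for d in ordered])  # [0.04, 0.575]
```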


### 2e. Detect and blur faces

Detect faces and blur them for privacy protection. Here we combine object / face detection with an image tool.

In [11]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Blur all the faces in this image and return the blurred image",
    images=[image],
    response_model=ImageUrlResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.image], texts=[f"Blurred image"], width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/1b56decb-e336-4f34-b6e5-395f1b0b44f5/img_bf6717.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251211%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251211T093256Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=50364e23ea6626d76d6c40a7559e0f32fd573da82f24a3902fa3e0368010f68be78d2fe35820f1ec3a44f97e241f3df046e06839183b19a38526db700b9cfd9398388dfc6c2d7adba5904599319cb438be4e538f0502ad50fbac76eea49f61af5afd321d19b4ea0c90feab44d033375d20d8fa8ed6f492b7c96e62ae0ad3ab35ef27da22a5a5a9a7f3b198475a931ea5e7d0b1abe4642063474c2066f8a2336f8f45aa2647352ab5245bc3b5a9ea9650886d1aee759486c36689ad5d4a7081482ec0cedd10bdb01a781564c5dae36b87c2c1c7bf82ffbe1de7cd3f8cacd4dd3075ebb6d630e2579dda8132b6190624c3e17ce88f43ded91edc6d088b84e6654e'

>> IMAGE


### 3. Keypoint Detection

Detect keypoints in images for counting and localization tasks.

In [12]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.object-detection/donuts.png"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Detect all the donuts as keypoints and return the coordinates.",
    images=[image],
    response_model=KeypointsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
keypoints=[KeyPoint(xy=(0.0391, 0.8525), label='donuts'), KeyPoint(xy=(0.1885, 0.7686), label='donuts'), KeyPoint(xy=(0.2109, 0.959), label='donuts'), KeyPoint(xy=(0.5, 0.832), label='donuts'), KeyPoint(xy=(0.5, 0.5), label='donuts'), KeyPoint(xy=(0.7686, 0.6738), label='donuts'), KeyPoint(xy=(0.8105, 0.9414), label='donuts'), KeyPoint(xy=(0.7881, 0.3594), label='donuts'), KeyPoint(xy=(0.959, 0.5), label='donuts'), KeyPoint(xy=(0.832, 0.1094), label='donuts'), KeyPoint(xy=(0.5596, 0.1885), label='donuts'), KeyPoint(xy=(0.3594, 0.1094), label='donuts'), KeyPoint(xy=(0.335, 0.3594), label='donuts'), KeyPoint(xy=(0.1094, 0.5), label='donuts'), KeyPoint(xy=(0.1094, 0.1885), label='donuts')]

>> IMAGE
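
For counting, `len(result.keypoints)` is usually all that is needed (15 in the response above). If a model ever returns two points for the same object, one simple remedy is to merge points that lie very close together. The sketch below is one such heuristic; the helper name and the `min_dist` threshold of 0.02 are assumptions, not a documented part of the API:

```python
import math

Point = tuple[float, float]


def dedupe_keypoints(points: list[Point], min_dist: float = 0.02) -> list[Point]:
    """Keep a point only if it is at least `min_dist` away from every point already kept."""
    kept: list[Point] = []
    for p in points:
        if all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept


# Hypothetical input: the second point is a near-duplicate of the first.
points = [(0.5, 0.5), (0.505, 0.5), (0.1094, 0.1885)]
unique = dedupe_keypoints(points)
print(len(unique))  # 2
```

The loop is quadratic in the number of points, which is fine for the tens of keypoints typical of counting tasks.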


### 4. Segmentation

Create pixel-level segmentation masks for objects, people or regions in images.

In [39]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/chat_examples/vision/intersection.jpg"

image = download_image(IMAGE_URL)

result = chat_completion(
    prompt="Detect all the people in this image, and segment them.",
    images=[image],
    response_model=ImageUrlResponse,
)

print(">> RESPONSE")
print(result)

display(images=[result.image], texts=[f"Detected Persons"], width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/8dd81fc0-d6f7-4525-a762-990e61e66232/img_dbce6a.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251211%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251211T095754Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=089c60d70475c641a4bae90cf383622fc2f072e83e91eaa848abafaf6b1a0399681086b4368666ed5ef4d6b99bca6864d9b5c33359669d91b3b4865b38b054b5e43abbe48489aeb921605e8bcf83d513420ca21b52004f8c0181321027a33ca2c4395af23a2718623be8caf93590c342275cac7fe3777691d548ff8b21329fd43a9574d790e813828bfcfa3002561095ba58f983f1a39abbf0b6846a9f847f2ed08a33d425aa506a2132f01a55190523fdfe8c1d00db4109f5498db8c287204b04826f84013d274c12b158725a1c53561a356160e29b369ecf8c4a635bb79b8356048359d5b07b4758811c189d2de4a5eb3f55befa37e4c04e3de8b801971a7e'


### 6. OCR (Optical Character Recognition)

Extract text from images using OCR capabilities.

In [18]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Read the text in this image",
    images=[image],
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[image], texts=[result], width=600)

>> RESPONSE
Here is the text I extracted from the image `img_d169eb`:

"Today is Thursday, October 20th- But it definitely feels like a Friday. I'm already considering making a second cup of coffee- and I haven't even finished my first. Do I have a problem? Sometimes I'll flip through older notes I've taken, and my handwriting is unrecognizable. Perhaps it depends on the type of pen I use? I've tried writing in all caps But IT Looks So FORCED AND UNNATURAL Often times, I'll just take notes on my laptop, but I still seem to gravitate toward pen and paper. Any advice on what to I'm prove? I already feel stressed out looking back at what I've just written- it looks like 3 different people wrote this!"

>> IMAGE
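
Without a `response_model`, the reply is free-form text and may include a preamble before the transcription, as it does above. One rough way to isolate the transcription is to take everything between the first and last double quote. This is a heuristic sketch with a hypothetical helper name; requesting structured output would likely be more robust:

```python
def extract_quoted(response: str) -> str:
    """Return the text between the first and last double quote, or the stripped input."""
    start = response.find('"')
    end = response.rfind('"')
    if start == -1 or end <= start:
        # No quoted span found, so fall back to the whole response.
        return response.strip()
    return response[start + 1 : end].strip()


reply = 'Here is the text I extracted from the image:\n\n"Today is Thursday, October 20th"'
print(extract_quoted(reply))  # Today is Thursday, October 20th
```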


### 5. Image Generation

Create, modify and remix images from text prompts or existing visuals.

### 5a. Virtual Try-On

Generate a virtual try-on of a dress on a person, with front and side views and seamless compositing.

In [19]:
img_1 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png")
img_2 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png")
display([img_1, img_2], width=400)

In [20]:
# Generate a virtual try-on of a dress on a person, with unique views
result = chat_completion(
    prompt="You are provided with two images: one of a dress(the first image) and one of a person(the second image). Generate a few highly realistic virtual try-on by seamlessly compositing the dress onto the person, ensuring natural fit, alignment, and that the person appears fully and appropriately dressed. Provide 2 images (9:16 aspect ratio) as output: one from the front and one from the side.",
    images=[
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png",
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png"
    ],
    response_model=ImageUrlListResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGES")
display(result.images, width=400)

>> RESPONSE
urls=[ImageUrlResponse(url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/36d0b1e0-47e2-4e84-a76d-6b4cd36757d8/img_07e15b.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251211%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251211T094122Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=66c91ad19b74383fcb99b5adae73ea1f140954e619c32b2290ba524dd9dff7cc51ffbeff2d0a36c30f3f0899055075bc168ccee778bec9d471d05ca0a65ef2754bc02203de565ae1b5ca303b9538eea03c80f320ef04c9fd55f355fc351a58e4a30513f8fbbd500d97020e3a61a4c98cf0c98554d9bf7994036c8f924bf7eaeee92f78b9f3cffdd0ef5d268f70ae12c8f318949b54a6b3906c40421031ebee1d68e03e170760193d7efcab3834e8d3c57b42feec116b8f6ef2f478f76e60af8b322dd6ba83e2c0729c6892a86038e2bddd798e8b7fee3f6b52df9b0146c0a29cfa214728ff2cdc2db6a3fa3f0a8172ba82845af37689d040d80652e9f157db5c'), ImageUrlResponse(url='https://storage.goog

### 6. Template Matching

Find a template image within a larger reference image.

In [21]:
TEMPLATE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png"
REFERENCE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png"

template_img = download_image(TEMPLATE_URL)
reference_img = download_image(REFERENCE_URL)
display([template_img, reference_img], width=400)


In [22]:
result = chat_completion(
    prompt=f"Given two images, identify the specified item from the second image within the first image. Clearly highlight and draw bounding boxes around all occurrences of the item in the first image. Provide a brief description of the results.",
    images=[template_img, reference_img],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(reference_img), width=600)

>> RESPONSE
detections=[Detection(label='lemon', xywh=(0.02, 0.113, 0.94, 0.874), confidence=0.98)]

>> IMAGE


### 7. UI Parsing

Parse user interface elements from screenshots.

In [23]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result = chat_completion(
    prompt=f"Parse the UI of this screenshot and detect all the UI elements.",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
detections=[Detection(label='search', xywh=(0.3779, 0.1096, 0.0332, 0.0209), confidence=None), Detection(label='Store', xywh=(0.497, 0.229, 0.0766, 0.1193), confidence=None), Detection(label='Microsoft', xywh=(0.2862, 0.2271, 0.077, 0.1149), confidence=None), Detection(label='Aox', xywh=(0.3615, 0.3448, 0.0668, 0.1039), confidence=None), Detection(label='Mcte', xywh=(0.6376, 0.5957, 0.0519, 0.0414), confidence=None), Detection(label='(12) Fng', xywh=(0.3036, 0.6456, 0.2022, 0.0829), confidence=None), Detection(label='(II} png', xywh=(0.5168, 0.6492, 0.1814, 0.0774), confidence=None), Detection(label='Tonday a', xywh=(0.3058, 0.7263, 0.1988, 0.0925), confidence=None), Detection(label='(Blpng', xywh=(0.5173, 0.7267, 0.1776, 0.0923), confidence=None), Detection(label='Waiuz', xywh=(0.9246, 0.9386, 0.0683, 0.0565), confidence=None), Detection(label='Recommended', xywh=(0.3007, 0.581, 0.2063, 0.0685), confidence=None), Detection(label='WinObs', xywh=(0.2884, 0.8365, 0.1219, 0.09
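
For UI automation, a detected element usually has to be turned into a click target. The sketch below looks up elements by label and converts a normalized box to the pixel coordinates of its center. Both helper names and the 1920x1080 screen size are assumptions for illustration, and since the labels above are noisy OCR-style strings, substring matching may need to be loosened in practice:

```python
def find_by_label(detections: list[dict], query: str) -> list[dict]:
    """Return detections whose label contains `query`, ignoring case."""
    q = query.casefold()
    return [d for d in detections if q in d["label"].casefold()]


def click_point(xywh: tuple[float, float, float, float], screen_width: int, screen_height: int) -> tuple[int, int]:
    """Return the pixel coordinates of the center of a normalized (x, y, w, h) box."""
    x, y, w, h = xywh
    return round((x + w / 2) * screen_width), round((y + h / 2) * screen_height)


# Two elements copied from the response above.
elements = [
    {"label": "search", "xywh": (0.3779, 0.1096, 0.0332, 0.0209)},
    {"label": "Recommended", "xywh": (0.3007, 0.581, 0.2063, 0.0685)},
]
target = find_by_label(elements, "Search")[0]
print(click_point(target["xywh"], screen_width=1920, screen_height=1080))
```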

### 8. Streaming Responses

For long-running tasks, you can use streaming to get partial results as they become available.

In [24]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

stream = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Describe this image in detail"},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}}
        ]
    }],
    stream=True
)

print("Streaming response:")
full_response = ""
for chunk in stream:
    if getattr(chunk.choices[0].delta, "content", None):
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)

Streaming response:
The image shows a vintage, well-maintained Volkswagen Beetle, which appears to be a light blue or mint green color, with shiny paint and clean chrome accents. It is parked on a cobblestone street, paved with light grey, rectangular pavers that exhibit a somewhat uneven surface. Behind the car is a building with a light yellow or beige stucco wall. The building features two prominent dark brown wooden openings: on the left, a window with two arched top panels, and on the right, a large double door set within a white frame, featuring vertical planks. The overall scene evokes a sense of nostalgia and charm, suggesting a historic or picturesque setting.
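
The streaming loop above can be wrapped in a small function that is easy to test without network access. In this sketch the chunks are simulated with `SimpleNamespace` objects that mimic only the `choices[0].delta.content` shape used in the loop; the helper names and the guard for empty `choices` are assumptions:

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas from an iterable of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Skip chunks that carry no choices (assumed possible for metadata-only chunks).
            continue
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute layout as a streamed delta."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


chunks = [fake_chunk("The image "), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("shows a car.")]
print(collect_stream(chunks))  # The image shows a car.
```

Collecting the parts in a list and joining once avoids repeated string concatenation on long responses.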

---

## Conclusion

This cookbook demonstrated the comprehensive capabilities of the **VLM Run Orion Image Agent API**.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a Pydantic model through the `response_model` parameter of the `chat_completion` helper to get type-safe, validated responses with automatic parsing.
3. **Cached Properties**: Response models can include `@cached_property` decorators to lazily download and cache images, masks, and other binary data.
4. **Streaming Support**: For long-running tasks, enable streaming to receive partial results as they become available, improving user experience.
5. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.
6. **Rich Rendering**: The `render()` methods defined on the response models in this notebook make it easy to display detection and keypoint results directly in notebooks.

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)

Happy building!