<a href="https://colab.research.google.com/github/MehediAhamed/vlmrun-cookbook/blob/Autoreload-error-fixed-orion-image-understanding/notebooks/12_orion_image_understanding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Image Understanding, Reasoning and Execution

This cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) image understanding, reasoning, and execution capabilities. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

This notebook uses the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface, so every visual task is expressed through the familiar chat-completions request format.

We'll cover the following topics:
 1. Captioning and tagging
 2. Object detection (objects, faces, people) and face blurring
 3. Keypoint detection
 4. Segmentation
 5. OCR (text detection, recognition, and understanding)
 6. Image generation (virtual try-on)
 7. Template matching
 8. UI parsing (graphical UI parsing and understanding)
 9. Streaming responses

## Prerequisites

- Python 3.10+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- VLM Run Python Client with OpenAI extra `vlmrun[openai]`

## Setup

First, install the required packages and configure the environment.

In [1]:
# Install required packages
!pip install vlmrun[openai] --upgrade --quiet
!pip install pillow requests numpy --quiet


In [2]:
import os
import getpass
import json
from typing import List, Any
from functools import cached_property

import numpy as np
from PIL import Image
from pydantic import BaseModel, Field

VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass("Enter your VLM Run API key: ")

Enter your VLM Run API key: ··········


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.

In [3]:
from vlmrun.client import VLMRun

client = VLMRun(
    api_key=VLMRUN_API_KEY, base_url="https://agent.vlm.run/v1"
)
print("VLM Run client initialized successfully!")
print("Base URL: https://agent.vlm.run/v1")
print("Model: vlmrun-orion-1")

VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1
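
To make the chat-completions format concrete, here is a minimal sketch of how one user message can carry both text and images. It uses only the standard library. `build_user_message` is a hypothetical helper, not part of the VLM Run SDK, and inlining local images as base64 data URLs is an assumption based on how OpenAI-compatible APIs generally accept them.

```python
import base64
from typing import Optional


def build_user_message(
    prompt: str,
    image_urls: list[str],
    jpeg_bytes: Optional[list[bytes]] = None,
) -> dict:
    """Assemble one OpenAI-style user message with text and image parts."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        # Remote images are passed by URL.
        content.append({"type": "image_url", "image_url": {"url": url, "detail": "auto"}})
    for raw in jpeg_bytes or []:
        # Local images are inlined as base64 data URLs (assumed to be JPEG bytes).
        data_url = "data:image/jpeg;base64," + base64.b64encode(raw).decode("ascii")
        content.append({"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}})
    return {"role": "user", "content": content}


message = build_user_message(
    "Describe this image.",
    ["https://example.com/cat.jpg"],  # placeholder URL
    jpeg_bytes=[b"\xff\xd8\xff"],  # placeholder bytes, not a real image
)
print([part["type"] for part in message["content"]])
```

The resulting dictionary is what goes into the `messages` list of a chat-completions request.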


## Response Models (dtypes)

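We define Pydantic models for structured outputs. `ImageUrlResponse` and `ImageUrlListResponse` expose **cached properties** that download each image from its pre-signed URL on first access and return it as a PIL Image. The detection and keypoint models add a `render()` method that draws the results on an image.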

In [4]:
from PIL import ImageDraw
from vlmrun.common.utils import download_image


class ImageUrlResponse(BaseModel):
    """Response model for image URL operations."""
    url: str = Field(..., description="Pre-signed URL to the image")

    @cached_property
    def image(self) -> Image.Image | None:
        """Download and return the image as a PIL Image (for image types)."""
        return download_image(self.url) if self.url else None


class ImageUrlListResponse(BaseModel):
    """Response model for multiple image URLs."""
    urls: List[ImageUrlResponse] = Field(..., description="List of pre-signed image URL responses")

    @cached_property
    def images(self) -> List[Image.Image]:
        """Download and return all images as PIL Images."""
        return [item.image for item in self.urls if item.image is not None]


class DetectionsResponse(BaseModel):
    """Collection of object detections."""

    class Detection(BaseModel):
        """Single object detection result."""
        label: str = Field(..., description="Name of the detected object")
        xywh: tuple[float, float, float, float] = Field(..., description="Bounding box (x, y, width, height) normalized from 0-1")
        confidence: float | None = Field(None, description="Detection confidence score from 0-1")

    detections: List[Detection] = Field(..., description="List of detected objects with bounding boxes")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for detection in self.detections:
            x, y, w, h = detection.xywh
            draw.rectangle([int(x * W), int(y * H), int((x + w) * W), int((y + h) * H)], outline="red", width=4)
            draw.text((int(x * W), int(y * H)), detection.label, fill="white", font_size=12)
        return vis


class KeypointsResponse(BaseModel):
    """Collection of keypoint detections."""

    class KeyPoint(BaseModel):
        """Single keypoint detection."""
        xy: tuple[float, float] = Field(..., description="Normalized keypoint coordinates (x, y) between 0-1")
        label: str = Field(..., description="Label of the keypoint")

    keypoints: List[KeyPoint] = Field(..., description="List of detected keypoints")

    def render(self, image: Image.Image) -> Image.Image:
        """Render the keypoint detections on the image."""
        vis = image.copy()
        W, H = vis.size
        draw = ImageDraw.Draw(vis)
        for keypoint in self.keypoints:
            x, y = keypoint.xy
            draw.circle([int(x * W), int(y * H)], 5, fill="green")
            draw.text((int(x * W), int(y * H)), keypoint.label, fill="white", font_size=12)
        return vis


print("Response models defined successfully!")
print("Models include cached properties for automatic image/mask downloading.")

Response models defined successfully!
Models include cached properties for automatic image/mask downloading.
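
As a quick illustration of what `@cached_property` buys us, the sketch below counts how often the expensive step runs. `LazyAsset` is a hypothetical stand-in for the response models above, and the "download" is faked with a string so the example runs offline.

```python
from functools import cached_property


class LazyAsset:
    """Stand-in for a response model whose payload is expensive to fetch."""

    def __init__(self, url: str):
        self.url = url
        self.fetch_count = 0

    @cached_property
    def payload(self) -> bytes:
        # Runs once per instance; the result is then stored on the instance.
        self.fetch_count += 1
        return f"bytes-for:{self.url}".encode()  # placeholder for a real download


asset = LazyAsset("https://example.com/img.jpg")  # placeholder URL
first = asset.payload
second = asset.payload
print(asset.fetch_count)  # 1
```

The second access returns the stored object, so a response model can be passed around freely without re-downloading its image.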


## Helper Functions

We create helper functions to simplify making chat completion requests with structured outputs.

In [25]:
import hashlib
import cachetools
from typing import Type, TypeVar
from IPython.display import HTML
from vlmrun.common.image import encode_image
import re
import json
from urllib.parse import urlparse

T = TypeVar('T', bound=BaseModel)


def display(images: Image.Image | list[Image.Image], texts: list[str] | None = None, width: int = 300):
    if isinstance(images, Image.Image):
        images = [images]
    if texts is None:
        texts = [None] * len(images)
    elif isinstance(texts, str):
        texts = [texts]
    elif len(texts) != len(images):
        raise ValueError("`texts` must be a list of the same length as `images`")

    imgs_html = ""
    for image, text in zip(images, texts):
        W, H = image.size
        if W > width:
            H = int(H * width / W)
            W = width
            image = image.resize((W, H))
        im_bytes = encode_image(image, format="JPEG")
        imgs_html += f"<div style='display:inline-block; margin:5px; text-align:center'>"
        imgs_html += f"<img src='{im_bytes}' style='width:{width}px; border-radius:6px'>"
        if text:
            imgs_html += f"<div style='font-size:12px; color:#666; margin-top:5px'>{text}</div>"
        imgs_html += f"</div>"
    return HTML(f"<div style='display:flex; flex-wrap:wrap'>{imgs_html}</div>")


def custom_key(prompt: str, images: list[Image.Image] | list[str] | None = None, response_model: Type[T] | None = None, model: str = "vlmrun-orion-1:auto"):
    """Custom key for caching chat_completion."""
    image_keys = []
    for image in images or []:  # `images` may be None for text-only prompts
        if isinstance(image, Image.Image):
            thumb = image.copy()
            thumb.thumbnail((128, 128))
            encoded = encode_image(thumb, format="JPEG")
            image_keys.append(encoded)
        elif isinstance(image, str):
            image_keys.append(image)


    response_key = hashlib.sha256(json.dumps(response_model.model_json_schema(), sort_keys=True).encode()).hexdigest() if response_model else ""
    return (prompt, tuple(image_keys), response_key, model)


@cachetools.cached(cache=cachetools.TTLCache(maxsize=1000, ttl=3600), key=custom_key)
def chat_completion(
    prompt: str,
    images: list[Image.Image] | list[str] | None = None,
    response_model: Type[T] | None = None,
    model: str = "vlmrun-orion-1:auto"
) -> Any:
    """
    Make a chat completion request with optional images and structured output.

    Args:
        prompt: The text prompt/instruction
        images: Optional list of images to process (either PIL Images or URLs)
        response_model: Optional Pydantic model for structured output
        model: Model to use (default: vlmrun-orion-1:auto)

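    Returns:
        A `(result, response)` tuple: `result` is the parsed response model if
        `response_model` is provided, else the raw response text; `response` is
        the full chat completion object.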
    """
    content = []
    content.append({"type": "text", "text": prompt})

    if images:
        for image in images:
            if isinstance(image, str):
                assert image.startswith("http"), "Image URLs must start with http or https"
                content.append({
                    "type": "image_url",
                    "image_url": {"url": image, "detail": "auto"}
                })
            elif isinstance(image, Image.Image):
                content.append({
                    "type": "image_url",
                    "image_url": {"url": encode_image(image, format="JPEG"), "detail": "auto"}
                })
            else:
                raise ValueError("Images must be either PIL Images or URLs")

    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": content}]
    }

    if response_model:
        kwargs["response_format"] = {
            "type": "json_schema",
            "schema": response_model.model_json_schema()
        }

    response = client.agent.completions.create(**kwargs)
    response_text = response.choices[0].message.content

    if response_model:
        return response_model.model_validate_json(response_text), response

    return response_text, response


def extract_object_id(chat_response):
    """
    Extract the artifact object ID (e.g. img_e98ca8) from the `url` field
    of the JSON content in a ChatCompletion response.
    """
    # Parse the message content directly instead of regex-matching the repr,
    # which breaks when the content contains quotes or nested braces.
    try:
        data = json.loads(chat_response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError, AttributeError, IndexError):
        return None

    url = data.get("url") if isinstance(data, dict) else None
    if not url:
        return None

    # Drop the query string, then take the last path segment
    filename = urlparse(url).path.split("/")[-1]

    # Remove the extension (.jpg, .png, .mp4, etc.)
    return filename.rsplit(".", 1)[0]

print("Helper functions defined!")

Helper functions defined!
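
The cache key built by `custom_key` hashes the response model's JSON schema with `sort_keys=True`. The sketch below shows why that matters, using plain dictionaries in place of a Pydantic schema. `schema_fingerprint` is a hypothetical name introduced for illustration.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hash a JSON-schema dict so that key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


a = {"type": "object", "properties": {"url": {"type": "string"}}}
b = {"properties": {"url": {"type": "string"}}, "type": "object"}  # same schema, different order

# Tuples of strings are hashable, so they can serve directly as cache keys.
key_a = ("Describe the image", ("https://example.com/a.jpg",), schema_fingerprint(a), "vlmrun-orion-1:auto")
key_b = ("Describe the image", ("https://example.com/a.jpg",), schema_fingerprint(b), "vlmrun-orion-1:auto")
print(key_a == key_b)  # True
```

Two requests with the same prompt, images, schema, and model therefore map to the same cache entry, regardless of how the schema dictionary happens to be ordered.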


## Image Understanding, Reasoning, and Execution Capabilities

VLM Run agents can perform a wide range of image processing tasks including object detection, face detection, segmentation, OCR, and more.

### 1. Captioning & Tagging

The simplest operation - load an image from a URL and caption it.

In [6]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

result, response = chat_completion(
    prompt=f"Generate a detailed description of this image.",
    images=[IMAGE_URL],
)
print(">> RESPONSE")
print(result)
print(">> IMAGE")
display(images=[download_image(IMAGE_URL)], texts=[result], width=600)

>> RESPONSE
This image features a vintage Volkswagen Beetle, painted in a distinctive light teal or mint green color, parked on a paved street. The car is viewed from its left side, showcasing its classic rounded silhouette. Key details of the car include chrome hubcaps on its wheels, which have a darker central circle and a lighter, reflective outer ring, and a chrome bumper visible at the rear. The side windows have chrome trim, and there's a subtle, lighter stripe or reflection running horizontally along the lower part of the side door. The car's paint appears somewhat faded or distressed in places, adding to its vintage character.

The background consists of a building with a light, warm yellow or ochre-colored stucco wall. This wall shows signs of wear and age, with some areas appearing darker or dirtier, particularly towards the top. There are two distinct wooden doors set into the wall. The door on the left is double-leafed with an arched top, featuring symmetrical panels and a 

### 2a. Object Detection

Detect objects in images with bounding boxes. The agent can detect common objects like people, vehicles, animals, and more.

In [7]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/10-finding-nemo.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Detect all the sea creatures in this image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} objects"], width=600)

>> RESPONSE
detections=[Detection(label='Nemo', xywh=(0.28, 0.336, 0.212, 0.292), confidence=0.95), Detection(label='Dory', xywh=(0.421, 0.133, 0.346, 0.488), confidence=0.96), Detection(label='Marlin', xywh=(0.41, 0.628, 0.214, 0.357), confidence=0.94), Detection(label='Crush', xywh=(0.597, 0.565, 0.25, 0.383), confidence=0.93), Detection(label='Squirt', xywh=(0.804, 0.496, 0.171, 0.3), confidence=0.92), Detection(label='Bruce the shark', xywh=(0.002, 0.467, 0.176, 0.507), confidence=0.91)]

>> IMAGE
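
The `xywh` boxes above are normalized to the 0-1 range, so they must be scaled by the image size before drawing or cropping. A minimal sketch of that conversion, mirroring the arithmetic in `DetectionsResponse.render` (the function name `xywh_to_pixel_box` is introduced here for illustration):

```python
def xywh_to_pixel_box(xywh, image_size):
    """Convert a normalized (x, y, w, h) box to pixel (left, top, right, bottom)."""
    x, y, w, h = xywh
    W, H = image_size
    return (int(x * W), int(y * H), int((x + w) * W), int((y + h) * H))


box = xywh_to_pixel_box((0.25, 0.5, 0.5, 0.25), (1000, 800))
print(box)  # (250, 400, 750, 600)
```

The returned tuple has the layout that PIL's `Image.crop` and `ImageDraw.rectangle` accept.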


### 2b. Object Detection with Specific Prompt

You can specify exactly which objects to detect using natural language.

In [9]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Detect the 'car' and its 'wheels' in the image",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
detections=[Detection(label='car', xywh=(0.054, 0.338, 0.876, 0.435), confidence=0.98), Detection(label='wheel', xywh=(0.146, 0.587, 0.154, 0.186), confidence=0.95), Detection(label='wheel', xywh=(0.709, 0.581, 0.151, 0.192), confidence=0.94)]

>> IMAGE


### 2c. Face Detection

Detect and localize faces in images with bounding boxes.

In [10]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Detect all the faces in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} faces"], width=600)

>> RESPONSE
detections=[Detection(label='face', xywh=(0.066, 0.198, 0.27, 0.522), confidence=0.98), Detection(label='face', xywh=(0.358, 0.19, 0.27, 0.53), confidence=0.97), Detection(label='face', xywh=(0.65, 0.198, 0.27, 0.522), confidence=0.96)]

>> IMAGE
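
Each detection carries an optional confidence score, which can be used to drop uncertain results before rendering. The sketch below uses plain dictionaries as stand-ins for `Detection` objects. `filter_by_confidence` and the threshold are illustrative choices, not part of the VLM Run API.

```python
def filter_by_confidence(detections, min_confidence, keep_unscored=False):
    """Keep detections scoring at least `min_confidence`.

    Detections whose confidence is None are kept only if `keep_unscored` is True.
    """
    kept = []
    for det in detections:
        score = det.get("confidence")
        if score is None:
            if keep_unscored:
                kept.append(det)
        elif score >= min_confidence:
            kept.append(det)
    return kept


detections = [
    {"label": "face", "confidence": 0.98},
    {"label": "face", "confidence": 0.97},
    {"label": "face", "confidence": 0.96},
    {"label": "face", "confidence": None},  # some responses omit the score
]
confident = filter_by_confidence(detections, min_confidence=0.97)
print(len(confident))  # 2
```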


### 2d. Person Detection

Detect and localize people in images with bounding boxes.

In [11]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Detect all the people in the image",
    images=[image],
    response_model=DetectionsResponse,
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[result.render(image)], texts=[f"Detected {len(result.detections)} people"], width=600)

>> RESPONSE
detections=[Detection(label='person', xywh=(0.04, 0.304, 0.074, 0.24), confidence=0.98), Detection(label='person', xywh=(0.087, 0.289, 0.084, 0.275), confidence=0.97), Detection(label='person', xywh=(0.167, 0.285, 0.088, 0.265), confidence=0.96), Detection(label='person', xywh=(0.232, 0.281, 0.088, 0.305), confidence=0.95), Detection(label='person', xywh=(0.314, 0.318, 0.086, 0.28), confidence=0.94), Detection(label='person', xywh=(0.385, 0.315, 0.089, 0.271), confidence=0.93), Detection(label='person', xywh=(0.468, 0.3, 0.088, 0.296), confidence=0.92), Detection(label='person', xywh=(0.55, 0.327, 0.088, 0.28), confidence=0.91), Detection(label='person', xywh=(0.632, 0.318, 0.087, 0.297), confidence=0.9), Detection(label='person', xywh=(0.71, 0.336, 0.089, 0.29), confidence=0.89), Detection(label='person', xywh=(0.785, 0.314, 0.097, 0.322), confidence=0.88)]

>> IMAGE
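
In crowded scenes like this one, neighboring boxes often overlap. Intersection-over-union (IoU) is a common way to quantify that overlap, for example to spot duplicate detections. The notebook itself does not do this. The following is a sketch for normalized `xywh` boxes.

```python
def iou_xywh(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes in the same coordinate frame."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


overlap = iou_xywh((0.0, 0.0, 0.5, 0.5), (0.25, 0.0, 0.5, 0.5))
print(round(overlap, 3))  # 0.333
```

A pair of boxes with IoU close to 1 most likely describes the same person. The cutoff to use is a judgment call that depends on the scene.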


### 2e. Detect and blur faces

Detect faces and blur them for privacy protection. Here we combine object / face detection with an image tool.

In [29]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Blur all the faces in this image and return the blurred image",
    images=[image],
    response_model=ImageUrlResponse,
)

print(">> RESPONSE")
print(result)

artifact_bytes = client.artifacts.get(
    session_id = response.session_id,
    object_id  = extract_object_id(response)
)
print("\n>> SAVED")
print(artifact_bytes)

print("\n>> IMAGE")
display(images=[artifact_bytes], texts=[f"Blurred image"], width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/2784e4d9-df81-44fa-a804-c7764b8007fc/img_313cc8.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251223%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251223T145211Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=ac81d24bdcb42856afd3881145b1be0d2c6a8898bce70cffd5fb558cbda17252b097ed44158e776ad1d60168ff0f924f27b72af978d0730e2f64fcf3715b38ce9a0182b04bb3594c1cf1912a283e559eb8137cc01a4b273705cd07ce8aa2cedffcd2e8558c1c982d0c976f6a8b910075647bbd252aa2acb4fff6f759d8adf782696ee8ece8dea4c736fa2684c805f0abd7d893919dafb524e0d8d2b2eee4423c4d575918f0189c419859f0aba046ff4854dd138a6e3d49fd5816aee7176c3ec5436159fa2cf74d5e2c5a5481383ad629475c04b6ae55f213e2aa7c6cd823b8091488e2265a291e593ca6f90f7f71ec1fed540aeee9a61dd6fd8cb07762a7f29f'

>> SAVED
<PIL.Image.Image image mode=RGB size=1453x760 at 0x7EE344
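
The object ID passed to `client.artifacts.get` is simply the filename in the returned URL, without its extension or query string. A standalone sketch of that step, using a shortened placeholder URL rather than a real pre-signed one (`object_id_from_url` is a hypothetical name):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def object_id_from_url(url: str) -> str:
    """Return the last path segment of a URL without its file extension."""
    # urlparse separates the path from the query string that carries the signature.
    return PurePosixPath(urlparse(url).path).stem


url = "https://storage.example.com/bucket/session-id/img_313cc8.jpg?X-Goog-Expires=604800"
print(object_id_from_url(url))  # img_313cc8
```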

### 3. Keypoint Detection

Detect keypoints in images for counting and localization tasks.

In [13]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.object-detection/donuts.png"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Detect all the donuts as keypoints and return the coordinates.",
    images=[image],
    response_model=KeypointsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
keypoints=[KeyPoint(xy=(0.0391, 0.8525), label='donuts'), KeyPoint(xy=(0.1885, 0.7686), label='donuts'), KeyPoint(xy=(0.2109, 0.9521), label='donuts'), KeyPoint(xy=(0.5, 0.832), label='donuts'), KeyPoint(xy=(0.5, 0.5), label='donuts'), KeyPoint(xy=(0.7686, 0.6738), label='donuts'), KeyPoint(xy=(0.8105, 0.9414), label='donuts'), KeyPoint(xy=(0.7881, 0.3594), label='donuts'), KeyPoint(xy=(0.959, 0.5), label='donuts'), KeyPoint(xy=(0.832, 0.1094), label='donuts'), KeyPoint(xy=(0.5596, 0.1885), label='donuts'), KeyPoint(xy=(0.3594, 0.1094), label='donuts'), KeyPoint(xy=(0.335, 0.3594), label='donuts'), KeyPoint(xy=(0.1094, 0.5), label='donuts'), KeyPoint(xy=(0.1094, 0.1885), label='donuts')]

>> IMAGE
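
Because each keypoint marks one object, counting reduces to tallying labels. The sketch below uses plain dictionaries with made-up coordinates and labels in place of `KeyPoint` objects, and also shows the conversion to pixel positions.

```python
from collections import Counter

# Illustrative data only; real values come from KeypointsResponse.keypoints.
keypoints = [
    {"label": "donuts", "xy": (0.5, 0.5)},
    {"label": "donuts", "xy": (0.25, 0.75)},
    {"label": "sprinkles", "xy": (0.125, 0.5)},
]

# One keypoint per object, so the tally per label is the object count.
counts = Counter(kp["label"] for kp in keypoints)


def to_pixels(xy, image_size):
    """Scale a normalized (x, y) point to integer pixel coordinates."""
    x, y = xy
    W, H = image_size
    return (int(x * W), int(y * H))


pixels = [to_pixels(kp["xy"], (640, 480)) for kp in keypoints]
print(counts["donuts"], pixels)  # 2 [(320, 240), (160, 360), (80, 240)]
```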


### 4. Segmentation

Create pixel-level segmentation masks for objects, people or regions in images.

In [14]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Detect all the people in this image, and segment them.",
    images=[image],
    response_model=ImageUrlResponse
)

print(">> RESPONSE")
print(result)

artifact_bytes = client.artifacts.get(
    session_id = response.session_id,
    object_id  = extract_object_id(response)
)
print("\n>> SAVED")
print(artifact_bytes)

print("\n>> IMAGE")
display(artifact_bytes, width=600)

>> RESPONSE
url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/dea75928-b90b-47e1-92e4-3f599db5e55c/img_8e963c.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251223%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251223T144015Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=5646dc5e97eb41360cc7569da0e62b46bca5d3c8858bca41e16c843cddc4acd8ba5652b848eaace982a4d2d2637e58a01ddd606352067d3a9b0dcdbc45c3980236f7de7b567b5950612d795c542ef06f7f8445391bdc3ad8e792ccc7a180cd5b186b86fd8697c61f5366a4f4ebb916b7cec6acff0e0b454f8e0583c01f97f1f191554378dc3bd7102ab9f1f66b300c6ab79163f3e70f3fb9c2ee4b8c27e742af956353124397a46bdcf21f246bdcd98ac953e167e3d55e4c968064e4bd69b4432c7f0df4e18f027ff2fa9cf6265f64ba26d9d2560b7690c8b2bc63a5c154c668a0490d9f4cc9f94e5b6d97fa0772c55113fd8b500032e7102948fe197edb0f07'

>> SAVED
<PIL.Image.Image image mode=RGB size=1024x804 at 0x7EE344

### 6. OCR (Optical Character Recognition)

Extract text from images using OCR capabilities.

In [15]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg"

image: Image.Image = download_image(IMAGE_URL)
# chat_completion returns a (result, response) tuple, so unpack both
result, response = chat_completion(
    prompt=f"Read the text in this image",
    images=[image],
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(images=[image], texts=[result], width=600)

>> RESPONSE
Here is the text I extracted from the image `img_30c705`:

"Today is Thursday, October 20th- But it definitely feels like a Friday. I'm already considering making a second cup of coffee- and I haven't even finished my first. Do I have a problem? Sometimes I'll flip through older notes I've taken, and my handwriting is unrecognizable, Perhaps it depends on the type of pen I use? I've tried writing in all caps But IT Looks So FORCED AND UNNATURAL Often times, I'll just take notes on my laptop, but I still seem to gravitate toward pen and paper. Any advice on what to I'm prove ? I already feel stressed out looking back at what I've just written- it looks like 3 different people wrote this!"

>> IMAGE

### 5. Image Generation

Create, modify and remix images from text prompts or existing visuals.

### 5a. Virtual Try-On

Generate a virtual try-on of a dress on a person from multiple views, with seamless compositing.

In [16]:
img_1 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png")
img_2 = download_image("https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png")
display([img_1, img_2], width=400)

In [40]:
# Generate a virtual try-on of a dress on a person, with unique views
result, response = chat_completion(
    prompt="You are provided with two images: one of a dress(the first image) and one of a person(the second image). Generate a few highly realistic virtual try-on by seamlessly compositing the dress onto the person, ensuring natural fit, alignment, and that the person appears fully and appropriately dressed. Provide 2 images (9:16 aspect ratio) as output: one from the front and one from the side.",
    images=[
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png",
        "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png"
    ],
    response_model=ImageUrlListResponse
)

print(">> RESPONSE")
print(result)

print("\n>> IMAGES")
display(result.images, width=400)

>> RESPONSE
urls=[ImageUrlResponse(url='https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/70cbdd6d-25a3-45f7-918f-94a9e61d9dec/img_d1cdbc.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251223%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251223T145332Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=21c938af914533003a8710cc65dd3564504c580ad890af515ad546b375afa8ac591dc30d8be6872ed12ecbbe0bc645b15f59b19c07c654c01b0514c7c8956ea3ec38b44f8f612f860199243a5d7bd519bf4642a3a81194998efb47e9d6da53b22b8c4deec8c0805328ac8cf7c7d654f1e5c28c18b7beddbf8a8b436628fe21dcfbf708e37c972fa44544b84f237d46590bae1852dc7b9003d990ab63580b20ac340feff64eaef57495af954e122a351b35ba06b83cc3c2920f91566d906fa6bbf9ef74bb69d3543ee29faa846a5caf577bc1a8816b972d4e1c752caf78aa74648f5aeaf847fc6660f349b23f6c12a05dc4966f807294dac7dcf2cef4d4c1613f'), ImageUrlResponse(url='https://storage.goog

### 6. Template Matching

Find a template image within a larger reference image.

In [33]:
TEMPLATE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png"
REFERENCE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png"

template_img = download_image(TEMPLATE_URL)
reference_img = download_image(REFERENCE_URL)
display([template_img, reference_img], width=400)


In [34]:
result, response = chat_completion(
    prompt=f"Given two images, identify the specified item from the second image within the first image. Clearly highlight and draw bounding boxes around all occurrences of the item in the first image. Provide a brief description of the results.",
    images=[template_img, reference_img],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(reference_img), width=600)

>> RESPONSE
detections=[Detection(label='yellow fruit', xywh=(0.252, 0.231, 0.15, 0.136), confidence=0.98), Detection(label='yellow fruit', xywh=(0.508, 0.627, 0.207, 0.209), confidence=0.97), Detection(label='yellow fruit', xywh=(0.121, 0.431, 0.134, 0.143), confidence=0.96), Detection(label='yellow fruit', xywh=(0.741, 0.256, 0.115, 0.116), confidence=0.95)]

>> IMAGE


### 7. UI Parsing

Parse user interface elements from screenshots.

In [35]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg"

image: Image.Image = download_image(IMAGE_URL)
result, response = chat_completion(
    prompt=f"Parse the UI of this screenshot and detect all the UI elements.",
    images=[image],
    response_model=DetectionsResponse
)

print(">> RESPONSE")
print(result)
print("\n>> IMAGE")
display(result.render(image), width=600)

>> RESPONSE
detections=[Detection(label='text', xywh=(0.3779, 0.1096, 0.0332, 0.0209), confidence=None), Detection(label='Store', xywh=(0.497, 0.229, 0.0766, 0.1193), confidence=None), Detection(label='Microsoft', xywh=(0.2862, 0.2271, 0.077, 0.1149), confidence=None), Detection(label='Aox', xywh=(0.3615, 0.3448, 0.0668, 0.1039), confidence=None), Detection(label='Mcte', xywh=(0.6376, 0.5957, 0.0519, 0.0414), confidence=None), Detection(label='(12) Fng', xywh=(0.3036, 0.6456, 0.2022, 0.0829), confidence=None), Detection(label='(II} png', xywh=(0.5168, 0.6492, 0.1814, 0.0774), confidence=None), Detection(label='Tonday a', xywh=(0.3058, 0.7263, 0.1988, 0.0925), confidence=None), Detection(label='(Blpng', xywh=(0.5173, 0.7267, 0.1776, 0.0923), confidence=None), Detection(label='Waiuz', xywh=(0.9246, 0.9386, 0.0683, 0.0565), confidence=None), Detection(label='Recommended', xywh=(0.3007, 0.581, 0.2063, 0.0685), confidence=None), Detection(label='WinObs "', xywh=(0.2884, 0.8365, 0.1219, 0.09
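
UI elements come back in no particular order. For downstream automation it can help to sort them into reading order, top to bottom and then left to right. The sketch below does this with a few boxes taken from the output above. `reading_order` and the `row_height` band of 0.05 are illustrative assumptions, not part of the API.

```python
def reading_order(detections, row_height=0.05):
    """Sort boxes top-to-bottom, then left-to-right within each row band."""

    def sort_key(det):
        x, y, _w, _h = det["xywh"]
        # Boxes whose top edges fall in the same band are treated as one row.
        return (int(y // row_height), x)

    return sorted(detections, key=sort_key)


elements = [
    {"label": "Recommended", "xywh": (0.3007, 0.581, 0.2063, 0.0685)},
    {"label": "Store", "xywh": (0.497, 0.229, 0.0766, 0.1193)},
    {"label": "text", "xywh": (0.3779, 0.1096, 0.0332, 0.0209)},
    {"label": "Microsoft", "xywh": (0.2862, 0.2271, 0.077, 0.1149)},
]
ordered = [det["label"] for det in reading_order(elements)]
print(ordered)  # ['text', 'Microsoft', 'Store', 'Recommended']
```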

### 8. Streaming Responses

For long-running tasks, you can use streaming to get partial results as they become available.

In [36]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"

stream = client.agent.completions.create(
    model="vlmrun-orion-1:auto",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Describe this image in detail"},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}}
        ]
    }],
    stream=True
)

print("Streaming response:")
full_response = ""
for chunk in stream:
    if getattr(chunk.choices[0].delta, "content", None):
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)

Streaming response:
The image features a classic light teal or mint green Volkswagen Beetle, seen from its side, parked on a paved street. The car is well-maintained, with chrome hubcaps. Behind the car is a light yellow or beige stucco wall with two wooden features: a smaller, recessed window with wooden shutters on the left and a larger, taller wooden door framed in white on the right. The street is made of light-colored bricks or cobblestones, and some green foliage can be seen at the top of the image.
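
The accumulation loop above can be tested without a network call by simulating the stream. In this sketch, `fake_stream` yields objects shaped like OpenAI-style streaming chunks. It is a hypothetical test double, not an SDK function. The guard for chunks without choices is a defensive assumption.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield objects shaped like OpenAI-style streaming chunks (simulated)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(stream) -> str:
    """Concatenate the text deltas of a chat-completions stream."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            parts.append(text)
    return "".join(parts)


full = collect_stream(fake_stream(["A teal ", None, "Beetle."]))
print(full)  # A teal Beetle.
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on long responses.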

---

## Conclusion

This cookbook walked through the image understanding, reasoning, and execution capabilities of the **VLM Run Orion Image Agent API**.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a Pydantic model as `response_model` to the `chat_completion` helper to get validated, typed responses; the helper sends the model's JSON schema as `response_format` and parses the reply.
3. **Cached Properties**: Response models can include `@cached_property` decorators to lazily download and cache images, masks, and other binary data.
4. **Streaming Support**: For long-running tasks, enable streaming to receive partial results as they become available, improving user experience.
5. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.
6. **Rich Rendering**: The `render()` methods defined on the response models in this notebook draw detections and keypoints on the image for display directly in the notebook.

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)

Happy building!