<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Image Understanding, Reasoning and Execution (Node.js)

This cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) image understanding, reasoning, and execution capabilities using **Node.js/TypeScript**. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

In this notebook, we use the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface for building visual intelligence applications with the familiar chat-completions request format.

We'll cover the following topics:
 1. Captioning and tagging
 2. Object detection (objects, faces, people) and face blurring
 3. Keypoint detection
 4. Segmentation
 5. OCR (text detection, recognition, and understanding)
 6. Image generation (virtual try-on)
 7. Template matching
 8. UI parsing
 9. Streaming responses

## Prerequisites

- Node.js 18+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- Deno or tslab kernel for running TypeScript in Jupyter


## Setup

First, install the required packages and configure the environment.


In [None]:
// Install the VLM Run SDK
// npm install vlmrun openai zod zod-to-json-schema

// If using Deno kernel, install dependencies via npm specifiers
// For tslab, run: npm install vlmrun openai zod zod-to-json-schema in your project directory


In [1]:
// Import the VLM Run SDK and dependencies
import { VlmRun } from "vlmrun";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";


In [2]:
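// Get API key from environment variable.
// Deno.env is specific to the Deno kernel; under tslab (Node.js), read process.env.VLMRUN_API_KEY instead.
const VLMRUN_API_KEY = Deno.env.get("VLMRUN_API_KEY");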

if (!VLMRUN_API_KEY) {
    throw new Error("Please set the VLMRUN_API_KEY environment variable");
}

console.log("✓ API Key loaded successfully");


✓ API Key loaded successfully
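
The cell above reads the key through `Deno.env`, which exists only under the Deno kernel, while the prerequisites also allow tslab on Node.js. A minimal sketch of a lookup that works in both runtimes follows. `getEnv` is a hypothetical helper name introduced here for illustration; it is not part of the VLM Run SDK.

```typescript
// Hypothetical helper (not part of the VLM Run SDK): read an environment
// variable under either the Deno kernel or a Node.js-based kernel such as tslab.
function getEnv(name: string): string | undefined {
    const g = globalThis as any;
    // Deno exposes environment variables through Deno.env.get()
    if (g.Deno?.env?.get) {
        return g.Deno.env.get(name);
    }
    // Node.js exposes them through process.env
    return g.process?.env?.[name];
}

const apiKey = getEnv("VLMRUN_API_KEY");
console.log(apiKey ? "✓ API Key loaded successfully" : "VLMRUN_API_KEY is not set");
```

Going through `globalThis` keeps the TypeScript compiler from rejecting the runtime that is not present.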


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.


In [3]:
// Initialize the VLM Run client using the SDK
const client = new VlmRun({
    apiKey: VLMRUN_API_KEY,
    baseURL: "https://agent.vlm.run/v1"  // Use the agent API endpoint
});

console.log("✓ VLM Run client initialized successfully!");
console.log("Base URL: https://agent.vlm.run/v1");
console.log("Model: vlmrun-orion-1");


✓ VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1


## Response Models (Schemas)

We define Zod schemas for structured outputs. These schemas provide type-safe, validated responses.


In [4]:
// Helper function to download images from URLs
async function downloadImage(url: string): Promise<Uint8Array> {
    const response = await fetch(url);
    if (!response.ok) {
        throw new Error(`Failed to download image: ${response.statusText}`);
    }
    return new Uint8Array(await response.arrayBuffer());
}

// Image URL Response Schema
const ImageUrlResponseSchema = z.object({
    url: z.string().describe("Pre-signed URL to the image")
});

type ImageUrlResponse = z.infer<typeof ImageUrlResponseSchema>;

// Image URL List Response Schema
const ImageUrlListResponseSchema = z.object({
    urls: z.array(ImageUrlResponseSchema).describe("List of pre-signed image URL responses")
});

type ImageUrlListResponse = z.infer<typeof ImageUrlListResponseSchema>;

// Detection Schema
const DetectionSchema = z.object({
    label: z.string().describe("Name of the detected object"),
    xywh: z.tuple([z.number(), z.number(), z.number(), z.number()])
        .describe("Bounding box (x, y, width, height) normalized from 0-1"),
    confidence: z.number().nullable().optional().describe("Detection confidence score from 0-1")
});

// Detections Response Schema
const DetectionsResponseSchema = z.object({
    detections: z.array(DetectionSchema).describe("List of detected objects with bounding boxes")
});

type DetectionsResponse = z.infer<typeof DetectionsResponseSchema>;

// Keypoint Schema
const KeypointSchema = z.object({
    xy: z.tuple([z.number(), z.number()])
        .describe("Normalized keypoint coordinates (x, y) between 0-1"),
    label: z.string().describe("Label of the keypoint")
});

// Keypoints Response Schema
const KeypointsResponseSchema = z.object({
    keypoints: z.array(KeypointSchema).describe("List of detected keypoints")
});

type KeypointsResponse = z.infer<typeof KeypointsResponseSchema>;

console.log("✓ Response schemas defined successfully!");
console.log("Schemas include type-safe validation for structured outputs.");


✓ Response schemas defined successfully!
Schemas include type-safe validation for structured outputs.
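
Because `DetectionSchema` describes boxes as normalized values between 0 and 1, they usually need to be scaled to pixels before drawing or cropping. The sketch below assumes `(x, y)` is the top-left corner of the box, which the schema description does not state explicitly. `toPixelBox` and the 1000x500 image size are illustrative choices, not SDK names.

```typescript
// Normalized box as described by DetectionSchema: [x, y, width, height] in 0-1.
type Xywh = [number, number, number, number];

// Hypothetical helper: scale a normalized box to integer pixel coordinates.
// Assumes (x, y) is the top-left corner of the box.
function toPixelBox(xywh: Xywh, imageWidth: number, imageHeight: number) {
    const [x, y, w, h] = xywh;
    return {
        left: Math.round(x * imageWidth),
        top: Math.round(y * imageHeight),
        width: Math.round(w * imageWidth),
        height: Math.round(h * imageHeight),
    };
}

// Example with an assumed 1000x500 image (dimensions chosen for illustration).
const box = toPixelBox([0.25, 0.5, 0.5, 0.25], 1000, 500);
console.log(box); // { left: 250, top: 250, width: 500, height: 125 }
```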


## Helper Functions

We create helper functions to simplify making chat completion requests with structured outputs.


In [5]:
/**
 * Make a chat completion request with optional images and structured output.
 * 
 * @param prompt - The text prompt/instruction
 * @param images - Optional list of images to process (URLs)
 * @param responseSchema - Optional Zod schema for structured output
 * @param model - Model to use (default: vlmrun-orion-1:auto)
 * @returns Parsed response if responseSchema provided, else raw response text
 */
async function chatCompletion<T>(
    prompt: string,
    images?: string[],
    responseSchema?: z.ZodSchema<T>,
    model: string = "vlmrun-orion-1:auto"
): Promise<T | string> {
    const content: any[] = [];
    content.push({ type: "text", text: prompt });

    if (images) {
        for (const image of images) {
            if (typeof image === "string") {
                if (!image.startsWith("http")) {
                    throw new Error("Image URLs must start with http or https");
                }
                content.push({
                    type: "image_url",
                    image_url: { url: image, detail: "auto" }
                });
            }
        }
    }

    const kwargs: any = {
        model: model,
        messages: [{ role: "user", content: content }]
    };

    if (responseSchema) {
        kwargs.response_format = {
            type: "json_schema",
            schema: zodToJsonSchema(responseSchema)
        } as any;
    }

    const response = await client.agent.completions.create(kwargs);
    const responseText = response.choices[0].message.content || "";

    if (responseSchema) {
        const parsed = JSON.parse(responseText);
        return responseSchema.parse(parsed) as T;
    }

    return responseText;
}

console.log("✓ Helper functions defined!");


✓ Helper functions defined!
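
To make the request shape easier to inspect, the content-building half of `chatCompletion` can be pulled out as a pure function that sends nothing. This is a sketch for illustration only: `buildUserContent` and the type names are hypothetical, and the stricter `http://`/`https://` check is a small variation on the `startsWith("http")` test used above.

```typescript
// Shapes of the content parts used by the helper above.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string; detail: "auto" } };
type ContentPart = TextPart | ImagePart;

// Hypothetical pure function mirroring the first half of chatCompletion:
// build the user-message content without sending a request.
function buildUserContent(prompt: string, images: string[] = []): ContentPart[] {
    const content: ContentPart[] = [{ type: "text", text: prompt }];
    for (const url of images) {
        if (!/^https?:\/\//.test(url)) {
            throw new Error(`Image URLs must start with http:// or https://, got: ${url}`);
        }
        content.push({ type: "image_url", image_url: { url, detail: "auto" } });
    }
    return content;
}

const content = buildUserContent("Describe this image.", [
    "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg",
]);
// Print the message exactly as it would appear in the `messages` array
console.log(JSON.stringify({ role: "user", content }, null, 2));
```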


## Image Understanding, Reasoning, and Execution Capabilities

VLM Run agents can perform a wide range of image processing tasks including object detection, face detection, segmentation, OCR, and more.


### 1. Captioning & Tagging

The simplest operation: pass an image URL and ask for a caption.


In [6]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg";

const result = await chatCompletion(
    "Generate a detailed description of this image.",
    [IMAGE_URL]
);

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> IMAGE URL:", IMAGE_URL);


>> RESPONSE
The image features a classic Volkswagen Beetle, painted in a vibrant aqua green or mint green color, parked parallel to a building. The car is well-preserved, showcasing chrome hubcaps and bumpers, with visible side mirrors. In the background, there's a light yellow-orange wall, possibly stucco, which shows signs of age and character. Set into this wall are two dark brown wooden doors: a double door with distinct arched top panels on the left and a simpler single door on the right. The ground is paved with textured, greyish-tan cobblestones or similar paving stones. The overall atmosphere is quaint, nostalgic, and warm, suggesting a sunny day in a historic or charming locale.

>> IMAGE URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg


### 2a. Object Detection

Detect objects in images with bounding boxes. The agent can detect common objects like people, vehicles, animals, and more.


In [8]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/10-finding-nemo.jpeg";

const result = await chatCompletion(
    "Detect all the sea creatures in this image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} objects`);
result.detections.forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "Marlin",
      xywh: [ 0.276, 0.343, 0.213, 0.295 ],
      confidence: 0.95
    },
    {
      label: "Dory",
      xywh: [ 0.426, 0.14, 0.34, 0.49 ],
      confidence: 0.96
    },
    {
      label: "Marlin",
      xywh: [ 0.375, 0.642, 0.233, 0.344 ],
      confidence: 0.94
    },
    {
      label: "seahorse",
      xywh: [ 0.014, 0.473, 0.16, 0.505 ],
      confidence: 0.92
    },
    {
      label: "turtle",
      xywh: [ 0.782, 0.4, 0.176, 0.2 ],
      confidence: 0.91
    },
    {
      label: "fish",
      xywh: [ 0.774, 0.568, 0.192, 0.262 ],
      confidence: 0.93
    },
    {
      label: "fish",
      xywh: [ 0.144, 0.525, 0.133, 0.365 ],
      confidence: 0.92
    },
    {
      label: "fish",
      xywh: [ 0.488, 0.035, 0.073, 0.1 ],
      confidence: 0.91
    },
    {
      label: "fish",
      xywh: [ 0.613, 0.008, 0.07, 0.119 ],
      confidence: 0.9
    },
    {
      label: "shark",
      xywh: [ 0.161, 0.035, 0.163, 

### 2b. Object Detection with Specific Prompt

You can specify exactly which objects to detect using natural language.


In [9]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg";

const result = await chatCompletion(
    "Detect the 'car' and its 'wheels' in the image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} objects`);
result.detections.forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "car",
      xywh: [ 0.054, 0.343, 0.88, 0.428 ],
      confidence: 0.98
    },
    {
      label: "wheel",
      xywh: [ 0.142, 0.575, 0.147, 0.192 ],
      confidence: 0.98
    },
    {
      label: "wheel",
      xywh: [ 0.707, 0.569, 0.149, 0.198 ],
      confidence: 0.97
    }
  ]
}

>> Detected 3 objects
  1. car: xywh=[0.054, 0.343, 0.880, 0.428]
  2. wheel: xywh=[0.142, 0.575, 0.147, 0.192]
  3. wheel: xywh=[0.707, 0.569, 0.149, 0.198]
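
Structured detections are easy to post-process locally. The sketch below filters by label and confidence using the three detections printed above. `filterDetections` is a hypothetical helper, and the 0.975 threshold is chosen only to show the filter excluding one item.

```typescript
type Detection = {
    label: string;
    xywh: [number, number, number, number];
    confidence?: number | null;
};

// Hypothetical helper: keep detections with a given label whose confidence
// meets a threshold. Detections without a score are kept, since the schema
// allows `confidence` to be null or absent.
function filterDetections(detections: Detection[], label: string, minConfidence = 0.5): Detection[] {
    return detections.filter(
        (d) => d.label === label && (d.confidence == null || d.confidence >= minConfidence)
    );
}

// Detections copied from the output above.
const detections: Detection[] = [
    { label: "car", xywh: [0.054, 0.343, 0.88, 0.428], confidence: 0.98 },
    { label: "wheel", xywh: [0.142, 0.575, 0.147, 0.192], confidence: 0.98 },
    { label: "wheel", xywh: [0.707, 0.569, 0.149, 0.198], confidence: 0.97 },
];

const wheels = filterDetections(detections, "wheel", 0.975);
console.log(wheels.length); // 1
```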


### 2c. Face Detection

Detect and localize faces in images with bounding boxes.


In [10]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg";

const result = await chatCompletion(
    "Detect all the faces in the image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} faces`);
result.detections.forEach((det, i) => {
    console.log(`  Face ${i + 1}: ${det.label}, xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "face",
      xywh: [ 0.063, 0.197, 0.281, 0.527 ],
      confidence: 0.98
    },
    {
      label: "face",
      xywh: [ 0.359, 0.193, 0.278, 0.531 ],
      confidence: 0.97
    },
    {
      label: "face",
      xywh: [ 0.657, 0.197, 0.28, 0.527 ],
      confidence: 0.96
    }
  ]
}

>> Detected 3 faces
  Face 1: face, xywh=[0.063, 0.197, 0.281, 0.527]
  Face 2: face, xywh=[0.359, 0.193, 0.278, 0.531]
  Face 3: face, xywh=[0.657, 0.197, 0.280, 0.527]


### 2d. Person Detection

Detect and localize people in images with bounding boxes.


In [11]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg";

const result = await chatCompletion(
    "Detect all the people in the image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} people`);
result.detections.forEach((det, i) => {
    console.log(`  Person ${i + 1}: ${det.label}, xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "person",
      xywh: [ 0.044, 0.306, 0.078, 0.24 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.094, 0.292, 0.081, 0.273 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.172, 0.287, 0.085, 0.259 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.242, 0.285, 0.084, 0.3 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.318, 0.322, 0.086, 0.28 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.39, 0.317, 0.09, 0.275 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.47, 0.303, 0.092, 0.297 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.55, 0.33, 0.086, 0.28 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.629, 0.318, 0.086, 0.302 ],
      confidence: null
    },
    {
      label: "person",
      xywh: [ 0.703, 0.3

### 2e. Detect and blur faces

Detect faces and blur them for privacy protection. Here we combine object / face detection with an image tool.


In [12]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg";

const result = await chatCompletion(
    "Blur all the faces in this image and return the blurred image",
    [IMAGE_URL],
    ImageUrlResponseSchema
) as ImageUrlResponse;

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> Blurred image URL:", result.url);


>> RESPONSE
{
  url: "https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/e69bf92d-2bcb-4520-91b6-aeb59810df37/img_5e39d8.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T083700Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=5618099a04c626b9d3368c6573d5f31f4cfcadcb843052f31301074d719f7bbe10a1870710fb808c68e5daab21c299abdd69f51d4cd69432fa07f316808f154a35b8d24dc9de3c4daff8396a852319be5711af7378eece165b437cd9b7b7e3a662ae26f91724364fcc81491c06286b646f78b37e4ba9132463b7115054c9478eed4e41e443979c5a008f673be37e54ab01502ec53a0298d67eece40d389e8460ab4828ffe9541bc79d39c382f959f43e5559f4856611afb3f74e218a2426b48537d7afe859110435a1df37a04e6b899a8cf97eba2eea45bccc7a5c9b109792c8b5517668e33bf3525f54f4ead64e7783335eaf122557a78d8e778456b755c0ee"
}

>> Blurred image URL: https://storage.googleapis.com/vlm-us
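
The returned URL is pre-signed, so it stops working after a fixed lifetime. For Google Cloud Storage V4 signed URLs, the signing time and lifetime are carried in the `X-Goog-Date` and `X-Goog-Expires` query parameters, both visible in the output above. The sketch below computes the expiry from them; `signedUrlExpiry` is a hypothetical helper and the example URL is a shortened placeholder that reuses only the date and lifetime values.

```typescript
// Hypothetical helper: compute when a GCS V4 signed URL expires from its
// X-Goog-Date (UTC, YYYYMMDDTHHMMSSZ) and X-Goog-Expires (lifetime in seconds).
function signedUrlExpiry(url: string): Date | null {
    const params = new URL(url).searchParams;
    const date = params.get("X-Goog-Date");
    const expires = params.get("X-Goog-Expires");
    const m = date?.match(/^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/);
    if (!m || !expires) return null;
    const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
    const signedAt = Date.UTC(year, month - 1, day, hour, minute, second);
    return new Date(signedAt + Number(expires) * 1000);
}

// Placeholder URL (not a real artifact) with the same date and lifetime as above.
const example =
    "https://storage.googleapis.com/bucket/img.jpg?X-Goog-Date=20251219T083700Z&X-Goog-Expires=604800";
console.log(signedUrlExpiry(example)?.toISOString()); // 2025-12-26T08:37:00.000Z
```

If the image is needed beyond that time, download it and store it yourself rather than keeping the URL.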

### 3. Keypoint Detection

Detect keypoints in images for counting and localization tasks.


In [13]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.object-detection/donuts.png";

const result = await chatCompletion(
    "Detect all the donuts as keypoints and return the coordinates.",
    [IMAGE_URL],
    KeypointsResponseSchema
) as KeypointsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.keypoints.length} keypoints`);
result.keypoints.forEach((kp, i) => {
    console.log(`  ${i + 1}. ${kp.label}: xy=[${kp.xy.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  keypoints: [
    { xy: [ 0.1094, 0.1094 ], label: "donuts" },
    { xy: [ 0.3594, 0.0781 ], label: "donuts" },
    { xy: [ 0.5479, 0.1094 ], label: "donuts" },
    { xy: [ 0.7881, 0.1094 ], label: "donuts" },
    { xy: [ 0.7881, 0.2842 ], label: "donuts" },
    { xy: [ 0.5, 0.5 ], label: "donuts" },
    { xy: [ 0.2656, 0.3594 ], label: "donuts" },
    { xy: [ 0.0537, 0.5 ], label: "donuts" },
    { xy: [ 0.0293, 0.8525 ], label: "donuts" },
    { xy: [ 0.2305, 0.7441 ], label: "donuts" },
    { xy: [ 0.5, 0.832 ], label: "donuts" },
    { xy: [ 0.7881, 0.7441 ], label: "donuts" },
    { xy: [ 0.8486, 0.9414 ], label: "donuts" },
    { xy: [ 0.9639, 0.5 ], label: "donuts" }
  ]
}

>> Detected 14 keypoints
  1. donuts: xy=[0.109, 0.109]
  2. donuts: xy=[0.359, 0.078]
  3. donuts: xy=[0.548, 0.109]
  4. donuts: xy=[0.788, 0.109]
  5. donuts: xy=[0.788, 0.284]
  6. donuts: xy=[0.500, 0.500]
  7. donuts: xy=[0.266, 0.359]
  8. donuts: xy=[0.054, 0.500]
  9. donuts: xy=[0.029
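
Since keypoints are mainly useful for counting, a per-label tally is a natural follow-up. The sketch below counts keypoints by label using the first three entries from the output above; `countByLabel` is a hypothetical helper, not part of the SDK.

```typescript
type Keypoint = { xy: [number, number]; label: string };

// Hypothetical helper: count how many keypoints were returned for each label.
function countByLabel(keypoints: Keypoint[]): Map<string, number> {
    const counts = new Map<string, number>();
    for (const kp of keypoints) {
        counts.set(kp.label, (counts.get(kp.label) ?? 0) + 1);
    }
    return counts;
}

// First three keypoints copied from the output above.
const keypoints: Keypoint[] = [
    { xy: [0.1094, 0.1094], label: "donuts" },
    { xy: [0.3594, 0.0781], label: "donuts" },
    { xy: [0.5479, 0.1094], label: "donuts" },
];

const counts = countByLabel(keypoints);
console.log(counts.get("donuts")); // 3
```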

### 4. Segmentation

Create pixel-level segmentation masks for objects, people or regions in images.


In [14]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg";

const result = await chatCompletion(
    "Detect all the people in this image, and segment them.",
    [IMAGE_URL],
    ImageUrlResponseSchema
) as ImageUrlResponse;

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> Segmented image URL:", result.url);


>> RESPONSE
{
  url: "https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/dfc02661-6871-49eb-9232-35b35d543952/img_981194.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T084334Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=74aa63c9d557a95f5676f489ab73fd0ad0b3a1cdc61c23bfeec6790328568d2721a07422a5818b103b8855bac6aee95d024e59ec5a99aa796032fad045ef2e29909eb8751f04f8f682413d755fb4b0de22db3b4a739f43c1a1d7d4190a9b621f219d6af2a2e1c8c7998e595af690afb940a05bd27d84ddd9f0c6b94543f9d104ce1ed612d7c7a71238c54295629d64265df741becf120738b865378c7b657717b9aeb4e387d704ba9064a384667630cd798bac4cacb6979b0b4bb747b7d6ef424769fe5ef7dcb1f8aceb287774e77594ec333e35e4bc78eec729b8398b02fdcb1002a0fa8850e2a8806abc462c4e27cd597f6fee933dc03e42f9908c13e0361e"
}

>> Segmented image URL: https://storage.googleapis.com/vlm-

### 5. OCR (Optical Character Recognition)

Extract text from images using OCR capabilities.


In [15]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg";

const result = await chatCompletion(
    "Read the text in this image",
    [IMAGE_URL]
);

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> IMAGE URL:", IMAGE_URL);


>> RESPONSE
Today is Thursday, October 20th - But it definitely feels like a Friday. I'm already considering making a second cup of coffee - and I haven't even finished my first. Do I have a problem? Sometimes I'll flip through older notes I've taken, and my handwriting is unrecognizable. Perhaps it depends on the type of pen I use? I've tried writing in all caps BUT IT LOOKS SO FORCED AND UNNATURAL. Often times, I'll just take notes on my laptop, but I still seem to gravitate toward pen and paper. Any advice on what to improve? I already feel stressed out looking back at what I've just written - it looks like 3 different people wrote this!

>> IMAGE URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg


### 6. Image Generation

Create, modify and remix images from text prompts or existing visuals.


### 6a. Virtual Try-On

Generate a virtual try-on of a dress on a person, with multiple views and seamless compositing.


In [16]:
const DRESS_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png";
const PERSON_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png";

console.log("Dress URL:", DRESS_URL);
console.log("Person URL:", PERSON_URL);


Dress URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png
Person URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png


In [17]:
// Generate a virtual try-on of a dress on a person, with unique views
const result = await chatCompletion(
    "You are provided with two images: one of a dress(the first image) and one of a person(the second image). Generate a few highly realistic virtual try-on by seamlessly compositing the dress onto the person, ensuring natural fit, alignment, and that the person appears fully and appropriately dressed. Provide 2 images (9:16 aspect ratio) as output: one from the front and one from the side.",
    [DRESS_URL, PERSON_URL],
    ImageUrlListResponseSchema
) as ImageUrlListResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Generated ${result.urls.length} images`);
result.urls.forEach((url, i) => {
    console.log(`  Image ${i + 1}: ${url.url}`);
});


>> RESPONSE
{
  urls: [
    {
      url: "https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/ec3fc215-4583-4093-b3ee-55201aee2e34/img_3dc78a.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T084509Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=45c8f19fc28a12f92bc94e093185b236f13ac7a2e47843e4c77bde94ca5592bf9d8bfcd5e5fae1231e33ad712c27cf023556207cbb9c35e79c318ac304e9aa3941837941f70b31da86b66c43b8d68a00053b7290580510c45e002db165994f49ac816bebb5121f6828b7a9825f7715a8f280934ef55eef4c9d1f88c90f2ee206be6a14ba0715d9094347fd1bedaba6274770611df36c16afafde82b18d1c2c093e825d6f6c41d6c49d1570cdb59ec699071d33f770df379f1a8be3066277f4a60bef717b6c3a52c8bdcb4087a45e12be15b6c1ac5bdc9ed3840fe4d4ecaf7e47ccc933f643cfc27d9d7f33afb32bb077cb5d4f6b9f14f6e6eddb2e076469e31f"
    },
    {
      url: "https://storage.g

### 7. Template Matching

Find a template image within a larger reference image.


In [18]:
const TEMPLATE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png";
const REFERENCE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png";

console.log("Template URL:", TEMPLATE_URL);
console.log("Reference URL:", REFERENCE_URL);


Template URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png
Reference URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png


In [19]:
const result = await chatCompletion(
    "Given two images, identify the specified item from the second image within the first image. Clearly highlight and draw bounding boxes around all occurrences of the item in the first image. Provide a brief description of the results.",
    [TEMPLATE_URL, REFERENCE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Found ${result.detections.length} matches`);
result.detections.forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "lemon",
      xywh: [ 0.252, 0.231, 0.15, 0.137 ],
      confidence: 0.98
    },
    {
      label: "lemon",
      xywh: [ 0.505, 0.625, 0.209, 0.21 ],
      confidence: 0.97
    },
    {
      label: "lemon",
      xywh: [ 0.737, 0.254, 0.118, 0.118 ],
      confidence: 0.96
    },
    {
      label: "lemon",
      xywh: [ 0.118, 0.43, 0.138, 0.141 ],
      confidence: 0.95
    }
  ]
}

>> Found 4 matches
  1. lemon: xywh=[0.252, 0.231, 0.150, 0.137]
  2. lemon: xywh=[0.505, 0.625, 0.209, 0.210]
  3. lemon: xywh=[0.737, 0.254, 0.118, 0.118]
  4. lemon: xywh=[0.118, 0.430, 0.138, 0.141]
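
One way to sanity-check a set of matches is to confirm that the boxes do not overlap heavily, which would suggest duplicate detections of the same item. The sketch below computes intersection-over-union (IoU) for boxes in `[x, y, width, height]` form. This check is an addition for illustration and is not something the API performs; the second pair of boxes is made up to show a partial overlap.

```typescript
type Xywh = [number, number, number, number];

// Intersection-over-union of two boxes in [x, y, width, height] form.
function iou(a: Xywh, b: Xywh): number {
    const x1 = Math.max(a[0], b[0]);
    const y1 = Math.max(a[1], b[1]);
    const x2 = Math.min(a[0] + a[2], b[0] + b[2]);
    const y2 = Math.min(a[1] + a[3], b[1] + b[3]);
    // Clamp at zero so disjoint boxes contribute no intersection area
    const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
    const union = a[2] * a[3] + b[2] * b[3] - inter;
    return union > 0 ? inter / union : 0;
}

// Two of the lemon boxes from the output above: they are disjoint.
console.log(iou([0.252, 0.231, 0.15, 0.137], [0.505, 0.625, 0.209, 0.21])); // 0

// Illustrative boxes (not from the output) that overlap by half their width.
console.log(iou([0, 0, 0.5, 0.5], [0.25, 0, 0.5, 0.5])); // approximately 0.333
```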


### 8. UI Parsing

Parse user interface elements from screenshots.


In [22]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg";

const result = await chatCompletion(
    "Parse the UI of this screenshot and detect all the UI elements.",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(`Detected ${result.detections.length} UI elements`);
result.detections.slice(0, 10).forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});
if (result.detections.length > 10) {
    console.log(`  ... and ${result.detections.length - 10} more`);
}


>> RESPONSE
Detected 47 UI elements
  1. search: xywh=[0.378, 0.110, 0.033, 0.021]
  2. Store: xywh=[0.497, 0.231, 0.076, 0.119]
  3. Microsoft: xywh=[0.287, 0.227, 0.076, 0.115]
  4. Aox: xywh=[0.361, 0.346, 0.067, 0.099]
  5. Mole: xywh=[0.637, 0.596, 0.053, 0.042]
  6. (12) png: xywh=[0.305, 0.646, 0.198, 0.082]
  7. Tonday alls: xywh=[0.305, 0.727, 0.200, 0.092]
  8. (II} png: xywh=[0.516, 0.649, 0.181, 0.078]
  9. Waiuz: xywh=[0.925, 0.938, 0.068, 0.056]
  10. (Blpng: xywh=[0.517, 0.726, 0.178, 0.091]
  ... and 37 more
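
For UI automation, a detected element is typically turned into a click target at the center of its box. The sketch below does that for the `search` element printed above. `centerPoint` is a hypothetical helper, the 1920x1080 screenshot size is an assumption (the real dimensions are not given here), and `(x, y)` is assumed to be the top-left corner.

```typescript
type Xywh = [number, number, number, number];

// Hypothetical helper: pixel coordinates of the center of a normalized box,
// e.g. as a click target. Assumes (x, y) is the top-left corner.
function centerPoint(xywh: Xywh, screenWidth: number, screenHeight: number): { x: number; y: number } {
    const [x, y, w, h] = xywh;
    return {
        x: Math.round((x + w / 2) * screenWidth),
        y: Math.round((y + h / 2) * screenHeight),
    };
}

// "search" element from the output above, on an assumed 1920x1080 screenshot.
const target = centerPoint([0.378, 0.110, 0.033, 0.021], 1920, 1080);
console.log(target); // { x: 757, y: 130 }
```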


### 9. Streaming Responses

For long-running tasks, you can use streaming to get partial results as they become available.


In [23]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg";

const stream = await client.agent.completions.create({
    model: "vlmrun-orion-1:auto",
    messages: [{
        role: "user",
        content: [
            { type: "text", text: "Describe this image in detail" },
            { type: "image_url", image_url: { url: IMAGE_URL } }
        ]
    }],
    stream: true
});

console.log("Streaming response:");
console.log("----------------------------------------");
let fullResponse = "";
for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
        fullResponse += content;
        // Write each fragment as it arrives. Some notebook kernels do not capture
        // process.stdout, in which case nothing appears between the separators.
        process.stdout.write(content);
    }
}
console.log("\n----------------------------------------");
console.log("\n>> Full response length:", fullResponse.length, "characters");


Streaming response:
----------------------------------------

----------------------------------------

>> Full response length: 252 characters
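
The accumulation loop can be factored into a reusable function that takes a callback for each fragment, so the display mechanism is not tied to `process.stdout`. The sketch below runs offline against a mock async generator standing in for the API stream. `collectStream`, `mockStream`, and the `Chunk` type are hypothetical names, and `Chunk` models only the fields the loop above reads.

```typescript
// Minimal chunk shape: only the fields read by the streaming loop above.
type Chunk = { choices: { delta?: { content?: string | null } }[] };

// Hypothetical helper: consume an async iterable of chunks, call onDelta for
// each text fragment, and return the concatenated text.
async function collectStream(
    stream: AsyncIterable<Chunk>,
    onDelta: (text: string) => void = () => {}
): Promise<string> {
    let full = "";
    for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
            full += content;
            onDelta(content);
        }
    }
    return full;
}

// Mock stream standing in for the API response, so the sketch runs offline.
async function* mockStream(): AsyncGenerator<Chunk> {
    yield { choices: [{ delta: { content: "A classic " } }] };
    yield { choices: [{ delta: {} }] }; // chunks without text are skipped
    yield { choices: [{ delta: { content: "Volkswagen Beetle." } }] };
}

collectStream(mockStream(), (t) => console.log("delta:", t)).then((text) => {
    console.log(text); // A classic Volkswagen Beetle.
});
```

In a notebook, the callback could instead update a display element or append to a buffer that is logged once the stream ends.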


---

## Conclusion

This cookbook demonstrated the image understanding, reasoning, and execution capabilities of the **VLM Run Orion Image Agent API** using Node.js/TypeScript.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a Zod schema (converted to JSON Schema) through the `response_format` parameter, then parse the JSON reply with the same schema to get type-safe, validated responses.
3. **Type Safety**: TypeScript and Zod provide compile-time and runtime type checking for a better developer experience.
4. **Streaming Support**: For long-running tasks, enable streaming to receive partial results as they become available, improving user experience.
5. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.
6. **Rich Capabilities**: The API supports object detection, segmentation, OCR, image generation, UI parsing, and more.

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)
- Review the [VLM Run Node.js SDK](https://github.com/vlm-run/vlmrun-node-sdk) documentation

Happy building!
