<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Image Understanding, Reasoning and Execution (Node.js)

This comprehensive cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) image understanding, reasoning and execution capabilities using **Node.js/TypeScript**. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

In this notebook we use the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface that exposes visual intelligence through the familiar chat-completions format.

We'll cover the following topics:
 1. Captioning and tagging
 2. Object detection (objects, faces, people) and face blurring
 3. Keypoint detection
 4. Segmentation
 5. OCR (text detection, recognition, and understanding)
 6. Image generation (virtual try-on)
 7. Template matching
 8. UI parsing
 9. Streaming responses
 10. Retrieving generated images with the Artifacts API

## Prerequisites

- Node.js 18+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
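- Deno or tslab kernel for running TypeScript in Jupyter (the cells below use Deno's `npm:` import specifiers and `Deno.env`; with tslab, switch to bare package imports and `process.env`)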


## Setup

First, install the required packages and configure the environment.


In [2]:
// Install the VLM Run SDK
// npm install vlmrun openai zod zod-to-json-schema

// If using Deno kernel, install dependencies via npm specifiers
// For tslab, run: npm install vlmrun openai zod zod-to-json-schema in your project directory


In [62]:
// Import the VLM Run SDK and dependencies
import { VlmRun } from "npm:vlmrun";
import { z } from "npm:zod";
import { zodToJsonSchema } from "npm:zod-to-json-schema";


In [63]:
// Get API key from environment variable
const VLMRUN_API_KEY = Deno.env.get("VLMRUN_API_KEY");

if (!VLMRUN_API_KEY) {
    throw new Error("Please set the VLMRUN_API_KEY environment variable");
}

console.log("✓ API Key loaded successfully");


✓ API Key loaded successfully


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.


In [64]:
// Initialize the VLM Run client using the SDK
const client = new VlmRun({
    apiKey: VLMRUN_API_KEY,
    baseURL: "https://agent.vlm.run/v1"  // Use the agent API endpoint
});

console.log("✓ VLM Run client initialized successfully!");
console.log("Base URL: https://agent.vlm.run/v1");
console.log("Model: vlmrun-orion-1");


✓ VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1


## Response Models (Schemas)

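We define Zod schemas for structured outputs; each response is validated against its schema at runtime and typed via `z.infer`. The same cell also defines two helpers, `parseArtifactUrl` and `getArtifactImage`, which are used later to download generated images through the artifacts API.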


In [None]:
/**
 * Parse artifact URL to extract sessionId and objectId.
 * Artifact URLs format: .../artifacts/{sessionId}/{uuid}/{objectId}.jpg
 * 
 * @param artifactUrl - The artifact URL from API response
 * @returns Object with sessionId and objectId, or null if parsing fails
 */
function parseArtifactUrl(artifactUrl: string): { sessionId: string; objectId: string } | null {
    try {
        const url = new URL(artifactUrl);
        const pathParts = url.pathname.split('/').filter(p => p);
        const artifactsIndex = pathParts.indexOf('artifacts');
        
        if (artifactsIndex === -1 || artifactsIndex + 3 >= pathParts.length) {
            return null;
        }
        
        const sessionId = pathParts[artifactsIndex + 1];
        const objectIdWithExt = pathParts[artifactsIndex + 3];
        const objectId = objectIdWithExt.split('.')[0];
        
        if (sessionId && objectId) {
            return { sessionId, objectId };
        }
    } catch (e) {
        // URL parsing failed
    }
    return null;
}

/**
 * Get artifact image using the artifacts API.
 * 
 * @param artifactUrl - The artifact URL from API response (optional if sessionId/objectId provided)
 * @param sessionId - Optional: hardcoded session ID
 * @param objectId - Optional: hardcoded object ID
 * @returns Image data as Uint8Array, or null if failed
 */
async function getArtifactImage(
    artifactUrl?: string,
    sessionId?: string,
    objectId?: string
): Promise<Uint8Array | null> {
    let parsed: { sessionId: string; objectId: string } | null = null;
    
    // Use provided sessionId/objectId or parse from URL
    if (sessionId && objectId) {
        parsed = { sessionId, objectId };
    } else if (artifactUrl) {
        parsed = parseArtifactUrl(artifactUrl);
        if (!parsed) {
            console.error("Could not parse artifact URL");
            return null;
        }
    } else {
        console.error("Must provide either artifactUrl or both sessionId and objectId");
        return null;
    }
    
    try {
        // Use SDK artifacts API
        const artifact = await client.artifacts.get({
            sessionId: parsed.sessionId,
            objectId: parsed.objectId
        });
        
        // Convert Buffer to Uint8Array
        if (artifact instanceof Buffer) {
            return new Uint8Array(artifact);
        } else if (artifact instanceof Uint8Array) {
            return artifact;
        }
        
        return null;
    } catch (error) {
        const err = error instanceof Error ? error : new Error(String(error));
        console.error(`Failed to get artifact: ${err.message}`);
        return null;
    }
}

// Image URL Response Schema
const ImageUrlResponseSchema = z.object({
    url: z.string().describe("Pre-signed URL to the image")
});

type ImageUrlResponse = z.infer<typeof ImageUrlResponseSchema>;

// Image URL List Response Schema
const ImageUrlListResponseSchema = z.object({
    urls: z.array(ImageUrlResponseSchema).describe("List of pre-signed image URL responses")
});

type ImageUrlListResponse = z.infer<typeof ImageUrlListResponseSchema>;

// Detection Schema
const DetectionSchema = z.object({
    label: z.string().describe("Name of the detected object"),
    xywh: z.tuple([z.number(), z.number(), z.number(), z.number()])
        .describe("Bounding box (x, y, width, height) normalized from 0-1"),
    confidence: z.number().nullable().optional().describe("Detection confidence score from 0-1")
});

// Detections Response Schema
const DetectionsResponseSchema = z.object({
    detections: z.array(DetectionSchema).describe("List of detected objects with bounding boxes")
});

type DetectionsResponse = z.infer<typeof DetectionsResponseSchema>;

// Keypoint Schema
const KeypointSchema = z.object({
    xy: z.tuple([z.number(), z.number()])
        .describe("Normalized keypoint coordinates (x, y) between 0-1"),
    label: z.string().describe("Label of the keypoint")
});

// Keypoints Response Schema
const KeypointsResponseSchema = z.object({
    keypoints: z.array(KeypointSchema).describe("List of detected keypoints")
});

type KeypointsResponse = z.infer<typeof KeypointsResponseSchema>;

console.log("✓ Response schemas defined successfully!");
console.log("Schemas include type-safe validation for structured outputs.");


✓ Response schemas defined successfully!
Schemas include type-safe validation for structured outputs.


## Helper Functions

We create helper functions to simplify making chat completion requests with structured outputs.


In [67]:
/**
 * Make a chat completion request with optional images and structured output.
 * 
 * @param prompt - The text prompt/instruction
 * @param images - Optional list of images to process (URLs)
 * @param responseSchema - Optional Zod schema for structured output
 * @param model - Model to use (default: vlmrun-orion-1:auto)
 * @returns Parsed response if responseSchema provided, else raw response text
 */
async function chatCompletion<T>(
    prompt: string,
    images?: string[],
    responseSchema?: z.ZodSchema<T>,
    model: string = "vlmrun-orion-1:auto"
): Promise<T | string> {
    const content: any[] = [];
    content.push({ type: "text", text: prompt });

    if (images) {
        for (const image of images) {
            if (typeof image === "string") {
                if (!image.startsWith("http")) {
                    throw new Error("Image URLs must start with http or https");
                }
                content.push({
                    type: "image_url",
                    image_url: { url: image, detail: "auto" }
                });
            }
        }
    }

    const kwargs: any = {
        model: model,
        messages: [{ role: "user", content: content }]
    };

    if (responseSchema) {
        kwargs.response_format = {
            type: "json_schema",
            schema: zodToJsonSchema(responseSchema)
        } as any;
    }

    const response = await client.agent.completions.create(kwargs);
    const responseText = response.choices[0].message.content || "";

    if (responseSchema) {
        const parsed = JSON.parse(responseText);
        return responseSchema.parse(parsed) as T;
    }

    return responseText;
}

console.log("✓ Helper functions defined!");


✓ Helper functions defined!
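
Agent requests can run for a while and may fail transiently. The following is a minimal sketch of a retry wrapper that could be placed around `chatCompletion`. The name `withRetry`, its parameters, and the backoff values are hypothetical and not part of the VLM Run SDK.

```typescript
// Hypothetical helper (not part of the VLM Run SDK):
// retry an async call with exponential backoff between attempts.
async function withRetry<T>(
    fn: () => Promise<T>,
    maxAttempts: number = 3,
    baseDelayMs: number = 500
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (error) {
            lastError = error;
            if (attempt < maxAttempts) {
                // Wait baseDelayMs, then 2x, 4x, ... before the next attempt
                const delay = baseDelayMs * 2 ** (attempt - 1);
                await new Promise<void>(resolve => setTimeout(resolve, delay));
            }
        }
    }
    // All attempts failed: surface the last error to the caller
    throw lastError;
}
```

A call would then look like `await withRetry(() => chatCompletion("Describe this image.", [IMAGE_URL]))`. Note that this sketch also retries schema validation failures, which may or may not be what you want.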


## Image Understanding, Reasoning, and Execution Capabilities

VLM Run agents can perform a wide range of image processing tasks including object detection, face detection, segmentation, OCR, and more.


### 1. Captioning & Tagging

The simplest operation - load an image from a URL and caption it.


In [48]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg";

const result = await chatCompletion(
    "Generate a detailed description of this image.",
    [IMAGE_URL]
);

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> IMAGE URL:", IMAGE_URL);


>> RESPONSE
The image features a vintage, mint green Volkswagen Beetle parked on a cobblestone street. The car is a classic Volkswagen Beetle, identifiable by its iconic rounded body shape, light mint green or seafoam green color. It has chrome bumpers, chrome trim around the windows, chrome hubcaps with a visible logo on the wheels, a white accent stripe along the side below the windows, side mirrors, and running boards. The car is positioned in front of a rustic building with a weathered, faded yellow or beige facade, likely made of stucco or plaster, showing signs of age and discoloration. The building has two prominent openings: on the left is a recessed dark brown wooden structure, possibly shutters or a window, characterized by two arched top sections. On the right, there is a large, dark brown wooden entrance door with vertical panels, set within a distinct white frame. The building has a flat roofline, and the ground in front of it is paved with light-colored cobblestones or pa

### 2a. Object Detection

Detect objects in images with bounding boxes. The agent can detect common objects like people, vehicles, animals, and more.


In [9]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/10-finding-nemo.jpeg";

const result = await chatCompletion(
    "Detect all the sea creatures in this image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} objects`);
result.detections.forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    { label: "Nemo", xywh: [ 0, 0, 0, 0 ], confidence: null },
    { label: "Dory", xywh: [ 0, 0, 0, 0 ], confidence: null },
    { label: "Marlin", xywh: [ 0, 0, 0, 0 ], confidence: null },
    { label: "Crush", xywh: [ 0, 0, 0, 0 ], confidence: null },
    { label: "Squirt", xywh: [ 0, 0, 0, 0 ], confidence: null },
    { label: "Hank", xywh: [ 0, 0, 0, 0 ], confidence: null }
  ]
}

>> Detected 6 objects
  1. Nemo: xywh=[0.000, 0.000, 0.000, 0.000]
  2. Dory: xywh=[0.000, 0.000, 0.000, 0.000]
  3. Marlin: xywh=[0.000, 0.000, 0.000, 0.000]
  4. Crush: xywh=[0.000, 0.000, 0.000, 0.000]
  5. Squirt: xywh=[0.000, 0.000, 0.000, 0.000]
  6. Hank: xywh=[0.000, 0.000, 0.000, 0.000]
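
In the run above every box came back as `[0, 0, 0, 0]`, so the labels are usable but the localization is not. Below is a small sketch for separating usable boxes from degenerate ones before drawing or cropping. The names `Detection`, `hasArea`, and `splitByValidity` are made up for this example, and the tolerance value is an assumption.

```typescript
// Hypothetical names for illustration; mirrors the shape of DetectionSchema above.
type Detection = {
    label: string;
    xywh: [number, number, number, number];
    confidence?: number | null;
};

// Small tolerance for floating-point sums such as x + w (assumed value).
const EPS = 1e-6;

// A box is usable only if it has positive width and height
// and stays inside the normalized [0, 1] image bounds.
function hasArea(det: Detection): boolean {
    const [x, y, w, h] = det.xywh;
    return w > 0 && h > 0 && x >= 0 && y >= 0 && x + w <= 1 + EPS && y + h <= 1 + EPS;
}

// Partition detections into usable and degenerate boxes.
function splitByValidity(detections: Detection[]): { valid: Detection[]; degenerate: Detection[] } {
    return {
        valid: detections.filter(hasArea),
        degenerate: detections.filter(det => !hasArea(det))
    };
}
```

If `degenerate` is non-empty, rephrasing the prompt with explicit object names is one thing to try.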


### 2b. Object Detection with Specific Prompt

You can specify exactly which objects to detect using natural language.


In [10]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg";

const result = await chatCompletion(
    "Detect the 'car' and its 'wheels' in the image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} objects`);
result.detections.forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "car",
      xywh: [ 0.053, 0.343, 0.881, 0.424 ],
      confidence: 0.99
    },
    {
      label: "wheel",
      xywh: [ 0.148, 0.579, 0.159, 0.19 ],
      confidence: 0.98
    },
    {
      label: "wheel",
      xywh: [ 0.702, 0.579, 0.161, 0.19 ],
      confidence: 0.97
    }
  ]
}

>> Detected 3 objects
  1. car: xywh=[0.053, 0.343, 0.881, 0.424]
  2. wheel: xywh=[0.148, 0.579, 0.159, 0.190]
  3. wheel: xywh=[0.702, 0.579, 0.161, 0.190]


### 2c. Face Detection

Detect and localize faces in images with bounding boxes.


In [11]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg";

const result = await chatCompletion(
    "Detect all the faces in the image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} faces`);
result.detections.forEach((det, i) => {
    console.log(`  Face ${i + 1}: ${det.label}, xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "face",
      xywh: [ 0.066, 0.197, 0.268, 0.522 ],
      confidence: null
    },
    {
      label: "face",
      xywh: [ 0.354, 0.186, 0.268, 0.533 ],
      confidence: null
    },
    {
      label: "face",
      xywh: [ 0.655, 0.197, 0.268, 0.522 ],
      confidence: null
    }
  ]
}

>> Detected 3 faces
  Face 1: face, xywh=[0.066, 0.197, 0.268, 0.522]
  Face 2: face, xywh=[0.354, 0.186, 0.268, 0.533]
  Face 3: face, xywh=[0.655, 0.197, 0.268, 0.522]


### 2d. Person Detection

Detect and localize people in images with bounding boxes.


In [12]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg";

const result = await chatCompletion(
    "Detect all the people in the image",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.detections.length} people`);
result.detections.forEach((det, i) => {
    console.log(`  Person ${i + 1}: ${det.label}, xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [
    {
      label: "person",
      xywh: [ 0.04, 0.304, 0.082, 0.25 ],
      confidence: 0.98
    },
    {
      label: "person",
      xywh: [ 0.089, 0.288, 0.088, 0.276 ],
      confidence: 0.97
    },
    {
      label: "person",
      xywh: [ 0.168, 0.285, 0.09, 0.266 ],
      confidence: 0.96
    },
    {
      label: "person",
      xywh: [ 0.232, 0.282, 0.09, 0.303 ],
      confidence: 0.95
    },
    {
      label: "person",
      xywh: [ 0.304, 0.318, 0.09, 0.28 ],
      confidence: 0.94
    },
    {
      label: "person",
      xywh: [ 0.372, 0.315, 0.09, 0.27 ],
      confidence: 0.93
    },
    {
      label: "person",
      xywh: [ 0.446, 0.299, 0.09, 0.293 ],
      confidence: 0.92
    },
    {
      label: "person",
      xywh: [ 0.524, 0.326, 0.09, 0.275 ],
      confidence: 0.91
    },
    {
      label: "person",
      xywh: [ 0.6, 0.315, 0.09, 0.3 ],
      confidence: 0.9
    },
    {
      label: "person",
      xywh: [ 0.674, 0.335, 0.

### 2e. Detect and blur faces

Detect faces and blur them for privacy protection. Here we combine object / face detection with an image tool.


In [13]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg";

const result = await chatCompletion(
    "Blur all the faces in this image and return the blurred image",
    [IMAGE_URL],
    ImageUrlResponseSchema
) as ImageUrlResponse;

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> Blurred image URL:", result.url);


>> RESPONSE
{
  url: "https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/37d626ae-2fff-47c0-88c4-f1c7ef4d2b0d/img_bde11c.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251221%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251221T081707Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=2ebb1f563dcd0d68b1cfec0e893bfb3975493e872548eebaaccb87c6d3017dcaeedcb83897fe11a555c0fc3f1277998d6b81da228f9a50cd9d7e040d6e110abe56f3ed93048a964b062d1003312c59cf29fe2339c0dfc78b64613a4d9fc97750243ff8520d5b5bd833effc2b330ca23643d7cc0691044116febadf02abf9e08bd2129bd5aa620b3752e05bfed98074b7959b407b28ff72cc5c9f8ddfd6f7343b69cb01ba22ab09438c9f91b9c1c2ecb79ce1426a1e5d600306802dd6980bc8db5ac1c0618c7748b0cfbbf10a016c6f3bcb90ada3ea7d812452ab06f20609f225b73a3a584317a23d7c2e97b577110a22a89e951a97088a40c63b02264793ae16"
}

>> Blurred image URL: https://storage.googleapis.com/vlm-us

### 3. Keypoint Detection

Detect keypoints in images for counting and localization tasks.


In [14]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.object-detection/donuts.png";

const result = await chatCompletion(
    "Detect all the donuts as keypoints and return the coordinates.",
    [IMAGE_URL],
    KeypointsResponseSchema
) as KeypointsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Detected ${result.keypoints.length} keypoints`);
result.keypoints.forEach((kp, i) => {
    console.log(`  ${i + 1}. ${kp.label}: xy=[${kp.xy.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  keypoints: [
    { xy: [ 0.1094, 0.1094 ], label: "donuts" },
    { xy: [ 0.3594, 0.0781 ], label: "donuts" },
    { xy: [ 0.5479, 0.1094 ], label: "donuts" },
    { xy: [ 0.7881, 0.1094 ], label: "donuts" },
    { xy: [ 0.7686, 0.2842 ], label: "donuts" },
    { xy: [ 0.5, 0.5 ], label: "donuts" },
    { xy: [ 0.2725, 0.3594 ], label: "donuts" },
    { xy: [ 0.0537, 0.5 ], label: "donuts" },
    { xy: [ 0.0293, 0.8525 ], label: "donuts" },
    { xy: [ 0.2305, 0.7441 ], label: "donuts" },
    { xy: [ 0.5, 0.832 ], label: "donuts" },
    { xy: [ 0.7881, 0.7441 ], label: "donuts" },
    { xy: [ 0.8486, 0.9414 ], label: "donuts" },
    { xy: [ 0.9639, 0.5 ], label: "donuts" }
  ]
}

>> Detected 14 keypoints
  1. donuts: xy=[0.109, 0.109]
  2. donuts: xy=[0.359, 0.078]
  3. donuts: xy=[0.548, 0.109]
  4. donuts: xy=[0.788, 0.109]
  5. donuts: xy=[0.769, 0.284]
  6. donuts: xy=[0.500, 0.500]
  7. donuts: xy=[0.273, 0.359]
  8. donuts: xy=[0.054, 0.500]
  9. donuts: xy=[0.029
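
Since keypoints are mainly useful for counting, here is a short sketch that tallies them by label. `Keypoint` and `countByLabel` are names chosen for this example; the type mirrors `KeypointSchema` above.

```typescript
// Hypothetical names for illustration; mirrors the shape of KeypointSchema above.
type Keypoint = { xy: [number, number]; label: string };

// Count how many keypoints were returned for each label.
function countByLabel(keypoints: Keypoint[]): Map<string, number> {
    const counts = new Map<string, number>();
    for (const kp of keypoints) {
        counts.set(kp.label, (counts.get(kp.label) ?? 0) + 1);
    }
    return counts;
}
```

With the response above, `countByLabel(result.keypoints).get("donuts")` would give the donut count directly.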

### 4. Segmentation

Create pixel-level segmentation masks for objects, people or regions in images.


In [15]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/lunch-skyscraper.jpg";

const result = await chatCompletion(
    "Detect all the people in this image, and segment them.",
    [IMAGE_URL],
    ImageUrlResponseSchema
) as ImageUrlResponse;

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> Segmented image URL:", result.url);


>> RESPONSE
{
  url: "https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/bf055d72-2cbb-4caa-8967-a838467012ec/img_c2d1f3.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251221%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251221T081926Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=68dfcf62c093c15418ecf0213df0a876464f2fe2d4429210ce44926f0baef71b26b53db9f11d43d120fda84ad7c3e3ed5f45611963ab7f067e95fa7fad8235782566fcaca7379865ea25dbe2fd1d3c63fe917acd6eda2c2909152810ddc7875d7b7bb0e1138a7dc4b7195eac97d061b73df4bf9beb23b1725ea191a07bdce87491381b0a7c0907a72dd83073b626e7ee4495791a45146a43fa121ce5b6c1a95cc9e1d7091cd320d1c0def0941db00a4d7dd39bf96523cfe0fbb048c5318b0b49d9ead768533ac011adfff85cba001b85a52f3e9d870906555e9cc1139e4176e4e7ea6b069b3c0ae5d7bded69cec12ed9beb05978f137d6cd56fd52b78c8dd76b"
}

>> Segmented image URL: https://storage.googleapis.com/vlm-

### 5. OCR (Optical Character Recognition)

Extract text from images using OCR capabilities.


In [16]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg";

const result = await chatCompletion(
    "Read the text in this image",
    [IMAGE_URL]
);

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> IMAGE URL:", IMAGE_URL);


>> RESPONSE
The text in the image `img_eb6ce4` reads:

"Today is Thursday, October 20th- But it definitely feels like a Friday. I'm already considering making a second cup of coffee- and I haven't even finished my first. Do I have a problem? Sometimes I'll flip through older notes I've taken, and my handwriting is unrecognizable, Perhaps it depends on the type of pen I use? I've tried writing in all caps But IT Looks So FORCED AND UNNATURAL Often times, I'll just take notes on my laptop, but I still seem to gravitate toward pen and paper. Any advice on what to I'm prove ? I already feel stressed out looking back at what I've just written- it looks like 3 different people wrote this!"

>> IMAGE URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg


### 6. Image Generation

Create, modify and remix images from text prompts or existing visuals.


### 6a. Virtual Try-On

Generate a virtual try-on of a dress on a person, with multiple views and seamless compositing.


In [17]:
const DRESS_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png";
const PERSON_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png";

console.log("Dress URL:", DRESS_URL);
console.log("Person URL:", PERSON_URL);


Dress URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/dress.png
Person URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/virtual_try_on/person.png


In [18]:
// Generate a virtual try-on of a dress on a person, with unique views
const result = await chatCompletion(
    "You are provided with two images: one of a dress (the first image) and one of a person (the second image). Generate a highly realistic virtual try-on by seamlessly compositing the dress onto the person, ensuring natural fit and alignment, and that the person appears fully and appropriately dressed. Provide 2 images (9:16 aspect ratio) as output: one from the front and one from the side.",
    [DRESS_URL, PERSON_URL],
    ImageUrlListResponseSchema
) as ImageUrlListResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Generated ${result.urls.length} images`);
result.urls.forEach((url, i) => {
    console.log(`  Image ${i + 1}: ${url.url}`);
});


>> RESPONSE
{
  urls: [
    {
      url: "https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/afb27bf5-beb7-47fa-875d-e24546f6f4cc/img_193249.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251221%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251221T082035Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=64126c1487345cdb047fbeda9de5ba829c2b3f85e1f631a908000f429bc22e8569203d5c394a0d7d012907470bd447d71e93453a615b763dbc344c7f7f7bb0449356439da80f0aa9dadac8c9cbcfe14c1e96a0bad0f2ae69e167897f2f7037ad8912b76261440adf9eff80f6c1e95538ea93675e72d4f47efeacf72966182092ebcc8b65484342246080457c4a12bad72a6686bfa7e6522be1188c0104e4bb48213b6b26b9664673d04ef660d9cbf5198e8b86eb14fa9c37293cd14e205b89545a87fd9f73b73aadbba595d17a17505795f9a521ec6f28a021372132b0561cd5d9d3b1e41424ee1f75c553ddbd8f1a43162046f5a214840ef78fca210bd645a1"
    },
    {
      url: "https://storage.g

### 7. Template Matching

Find a template image within a larger reference image.


In [19]:
const TEMPLATE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png";
const REFERENCE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png";

console.log("Template URL:", TEMPLATE_URL);
console.log("Reference URL:", REFERENCE_URL);


Template URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-12.png
Reference URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/template-search/image-13.png


In [20]:
const result = await chatCompletion(
    "Given two images, identify the specified item from the second image within the first image. Clearly highlight and draw bounding boxes around all occurrences of the item in the first image. Provide a brief description of the results.",
    [TEMPLATE_URL, REFERENCE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(result);
console.log(`\n>> Found ${result.detections.length} matches`);
result.detections.forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});


>> RESPONSE
{
  detections: [ { label: "lemon", xywh: [ 0, 0, 1, 1 ], confidence: 0.99 } ]
}

>> Found 1 matches
  1. lemon: xywh=[0.000, 0.000, 1.000, 1.000]


### 8. UI Parsing

Parse user interface elements from screenshots.


In [22]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/web.ui-automation/win11.jpeg";

const result = await chatCompletion(
    "Parse the UI of this screenshot and detect all the UI elements.",
    [IMAGE_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> RESPONSE");
console.log(`Detected ${result.detections.length} UI elements`);
result.detections.slice(0, 10).forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});
if (result.detections.length > 10) {
    console.log(`  ... and ${result.detections.length - 10} more`);
}


>> RESPONSE
Detected 47 UI elements
  1. text: search: xywh=[0.378, 0.110, 0.033, 0.021]
  2. icon: Store: xywh=[0.497, 0.231, 0.076, 0.119]
  3. icon: Microsoft: xywh=[0.287, 0.227, 0.076, 0.115]
  4. icon: Aox: xywh=[0.361, 0.346, 0.067, 0.099]
  5. icon: Mole: xywh=[0.637, 0.596, 0.053, 0.042]
  6. icon: (12) png: xywh=[0.305, 0.646, 0.198, 0.082]
  7. icon: Tonday alls: xywh=[0.305, 0.727, 0.200, 0.092]
  8. icon: (II} png: xywh=[0.516, 0.649, 0.181, 0.078]
  9. icon: Waiuz: xywh=[0.925, 0.938, 0.068, 0.056]
  10. icon: (Blpng: xywh=[0.517, 0.726, 0.178, 0.091]
  ... and 37 more
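
The labels in the output above follow a `<kind>: <name>` pattern (for example `icon: Store`). A short sketch for grouping elements by kind is shown below. `splitUiLabel` and `groupByKind` are hypothetical helpers, and the pattern is inferred from this one response rather than from a documented format.

```typescript
// Split a label such as "icon: Store" into its kind and name.
// Assumes the "<kind>: <name>" pattern seen in the output above;
// labels without a colon are given the kind "unknown".
function splitUiLabel(label: string): { kind: string; name: string } {
    const idx = label.indexOf(":");
    if (idx === -1) {
        return { kind: "unknown", name: label.trim() };
    }
    return { kind: label.slice(0, idx).trim(), name: label.slice(idx + 1).trim() };
}

// Group element names by their kind, preserving the original order.
function groupByKind(labels: string[]): Record<string, string[]> {
    const groups: Record<string, string[]> = {};
    for (const label of labels) {
        const { kind, name } = splitUiLabel(label);
        if (!groups[kind]) {
            groups[kind] = [];
        }
        groups[kind].push(name);
    }
    return groups;
}
```

For the response above, this could be called as `groupByKind(result.detections.map(det => det.label))`.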


### 9. Streaming Responses

For long-running tasks, you can use streaming to get partial results as they become available.


In [23]:
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg";

const stream = await client.agent.completions.create({
    model: "vlmrun-orion-1:auto",
    messages: [{
        role: "user",
        content: [
            { type: "text", text: "Describe this image in detail" },
            { type: "image_url", image_url: { url: IMAGE_URL } }
        ]
    }],
    stream: true
});

console.log("Streaming response:");
console.log("----------------------------------------");
let fullResponse = "";
for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
        fullResponse += content;
        // In a real notebook, you might want to display this incrementally
        process.stdout.write(content);
    }
}
console.log("\n----------------------------------------");
console.log("\n>> Full response length:", fullResponse.length, "characters");


Streaming response:
----------------------------------------

----------------------------------------

>> Full response length: 1118 characters
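
In the output above, 1118 characters were received but none of the streamed text appears between the separator lines. Here is a sketch that separates collecting the stream from displaying it, so the display callback can be swapped for whatever the notebook kernel captures. `StreamChunk`, `collectStream`, and `fakeStream` are hypothetical names, and the chunk type only covers the fields read in the loop above.

```typescript
// Minimal shape of a streamed chunk, covering only the fields read in the loop above.
type StreamChunk = { choices: { delta?: { content?: string | null } }[] };

// Collect all text deltas from a chunk stream,
// calling onDelta for each piece as it arrives.
async function collectStream(
    stream: AsyncIterable<StreamChunk>,
    onDelta: (text: string) => void = () => {}
): Promise<string> {
    let full = "";
    for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content;
        if (text) {
            full += text;
            onDelta(text);
        }
    }
    return full;
}

// Stand-in stream for trying the helper without calling the API.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
    yield { choices: [{ delta: { content: "A mint green " } }] };
    yield { choices: [{ delta: {} }] }; // chunks without content are skipped
    yield { choices: [{ delta: { content: "Beetle." } }] };
}
```

With the real stream, `await collectStream(stream, text => console.log(text))` would print each piece on its own line, which is cruder than incremental output but shows up in any kernel that captures `console.log`.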


### 10. Using Artifacts API for Preview Images

When the API returns image URLs (like in image generation, segmentation, or filtering operations), you can use the artifacts API to retrieve the actual image data for preview, saving, or further processing.
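
The image URLs returned earlier in this notebook are signed Google Cloud Storage URLs carrying `X-Goog-Date` and `X-Goog-Expires=604800` (7 days, in seconds), so they stop working after that window. Below is a minimal sketch for computing the expiry time from those two query parameters. `signedUrlExpiry` is a hypothetical helper, not part of the SDK.

```typescript
// Compute when a signed Google Cloud Storage URL expires, using its
// X-Goog-Date (YYYYMMDDTHHMMSSZ) and X-Goog-Expires (seconds) query parameters.
// Hypothetical helper; returns null if the URL or either parameter is missing or malformed.
function signedUrlExpiry(signedUrl: string): Date | null {
    let params: URLSearchParams;
    try {
        params = new URL(signedUrl).searchParams;
    } catch {
        return null; // not a valid URL
    }
    const date = params.get("X-Goog-Date");
    const expiresRaw = params.get("X-Goog-Expires");
    if (!date || !expiresRaw) return null;

    const m = date.match(/^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/);
    const expiresSeconds = Number(expiresRaw);
    if (!m || !Number.isFinite(expiresSeconds)) return null;

    const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
    const signedAtMs = Date.UTC(year, month - 1, day, hour, minute, second);
    return new Date(signedAtMs + expiresSeconds * 1000);
}
```

Comparing the result with `new Date()` tells you whether a stored URL is still usable or the image should be fetched through the artifacts API instead.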

In [73]:
// Step 1: Generate a processed image
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg";

const result = await chatCompletion(
    "Apply a vintage filter to this image and return the processed image",
    [IMAGE_URL],
    ImageUrlResponseSchema
) as ImageUrlResponse;

console.log(">> Generated image URL:", result.url);

// Step 2: Retrieve the image using artifacts API
console.log("\n>> Retrieving image using artifacts API...");
const imageData = await getArtifactImage(result.url);

if (imageData) {
    console.log(`✓ Successfully retrieved: ${imageData.length} bytes`);
    console.log("\n>> The image data is now available for:");
    console.log("  - Saving to file");
    console.log("  - Displaying in notebook");
    console.log("  - Further processing with image libraries");
} else {
    console.log("✗ Failed to retrieve artifact");
}


>> Generated image URL: https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/7193b85e-de88-40e5-bc7d-063dc6bfa0ea/img_9f6ca0.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251221%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251221T123545Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=51df172ae0204d3d7025cc9b075f129c2b33f447c735bb4cd292222c90016183f732b729b67c160e1e6ef6599124261e846f33490fe82a86712935ed64050601d4afd87c7a49458507f4f6c1e17e958e980bba98ae95533a5d64687e7a4529de30d4f10a61e1489d3e1ff16a012eb2149269ee93db8ce144f5f5427c96ef7452e28f1565b64e91046e6807792006f53874ecb58d504bab8ee9a86910a2dd9c162a1c4dbd98694f7c7215c8d3d95b89ac74e0f0900d1bb0ea680ac79c14ad65748c0fc384a931b666c345d3503de7d0582dad8ba6414f2b01172d75e38d324f17b07d4c1374526e121ff286e16d7cc641108c29eb687afd24d29d408d97d9045e

>> Retrieving image using artifacts API...
✓ Successfully re

In [None]:
// Test artifacts API with hardcoded sessionId and objectId
// Replace these with your actual values from an artifact URL
const SESSION_ID = "f2ab2e1f-f20e-478d-bad6-990449a4a9ee";
const OBJECT_ID = "img_330e53";

console.log(">> Testing Artifacts API with hardcoded values");
console.log(`Session ID: ${SESSION_ID}`);
console.log(`Object ID: ${OBJECT_ID}\n`);

try {
    // Method 1: Try using SDK artifacts API if available
    if (client.artifacts && typeof client.artifacts.get === 'function') {
        console.log("Trying SDK artifacts API...");
        const artifact = await client.artifacts.get({
            sessionId: SESSION_ID,
            objectId: OBJECT_ID
        });
        if (artifact instanceof Buffer) {
            const imageData = new Uint8Array(artifact);
            console.log(`✓ Successfully retrieved via SDK: ${imageData.length} bytes`);
        } else if (artifact instanceof Uint8Array) {
            console.log(`✓ Successfully retrieved via SDK: ${artifact.length} bytes`);
        } else {
            console.log(`✓ Retrieved (type: ${typeof artifact})`);
        }
    } else {
        console.log("SDK artifacts API not available, trying direct fetch...");
    }
    
    // Method 2: Direct fetch to artifacts API endpoint
    const baseURL = "https://agent.vlm.run/v1";
    const artifactsUrl = `${baseURL}/artifacts/${SESSION_ID}/${OBJECT_ID}`;
    
    console.log(`\nTrying direct fetch: ${artifactsUrl}`);
    const response = await fetch(artifactsUrl, {
        headers: {
            "Authorization": `Bearer ${VLMRUN_API_KEY}`,
        },
    });
    
    if (response.ok) {
        const arrayBuffer = await response.arrayBuffer();
        const imageData = new Uint8Array(arrayBuffer);
        console.log(`✓ Successfully retrieved via direct fetch: ${imageData.length} bytes`);
        console.log("\n>> The image data is now available for:");
        console.log("  - Saving to file");
        console.log("  - Displaying in notebook");
        console.log("  - Further processing with image libraries");
    } else {
        console.error(`✗ Failed: ${response.status} ${response.statusText}`);
    }
} catch (error) {
    console.error(`✗ Error: ${error instanceof Error ? error.message : String(error)}`);
}



>> Testing Artifacts API with hardcoded values
Session ID: f2ab2e1f-f20e-478d-bad6-990449a4a9ee
Object ID: img_330e53

Trying SDK artifacts API...
✓ Successfully retrieved via SDK: 103037 bytes

Trying direct fetch: https://agent.vlm.run/v1/artifacts/f2ab2e1f-f20e-478d-bad6-990449a4a9ee/img_330e53
✓ Successfully retrieved via direct fetch: 103037 bytes

>> The image data is now available for:
  - Saving to file
  - Displaying in notebook
  - Further processing with image libraries


---

## Conclusion

This cookbook demonstrated the comprehensive capabilities of the **VLM Run Orion Image Agent API** using Node.js/TypeScript.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a Zod schema (converted to JSON Schema) in the `response_format` parameter to get validated, typed responses; the `chatCompletion` helper parses the JSON and checks it against the schema.
3. **Type Safety**: TypeScript and Zod provide compile-time and runtime type checking for better developer experience.
4. **Streaming Support**: For long-running tasks, enable streaming to receive partial results as they become available, improving user experience.
5. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.
6. **Rich Capabilities**: The API supports object detection, segmentation, OCR, image generation, UI parsing, and more.

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)
- Review the [VLM Run Node.js SDK](https://github.com/vlm-run/vlmrun-node-sdk) documentation

Happy building!
