<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Document Understanding (Node.js)

This cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) document understanding capabilities in **Node.js/TypeScript**, including OCR, layout detection, redaction, and multi-document classification. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

This notebook uses the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface that exposes document intelligence through the familiar chat-completions format.

We'll cover the following topics:
 1. **OCR (Optical Character Recognition)** - Extract text, tables, paragraphs, and figures from documents
 2. **Layout Detection** - Identify document structure (headers, footers, tables, figures, etc.)
 3. **Document Redaction** - Detect and redact sensitive information (PII, financial data, PHI, etc.)
 4. **Multi-Document Classification** - Classify documents into categories based on content and structure

## Prerequisites

- Node.js 18+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- Deno or tslab kernel for running TypeScript in Jupyter


## Setup

First, install the required packages and configure the environment.


In [1]:
// Install the VLM Run SDK
// npm install vlmrun openai zod zod-to-json-schema

// If using Deno kernel, install dependencies via npm specifiers
// For tslab, run: npm install vlmrun openai zod zod-to-json-schema in your project directory


In [2]:
// Import the VLM Run SDK and dependencies
import { VlmRun } from "vlmrun";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";


In [3]:
// Get API key from environment variable
// (Deno.env.get is Deno-specific; under tslab/Node.js, read process.env.VLMRUN_API_KEY instead)
const VLMRUN_API_KEY = Deno.env.get("VLMRUN_API_KEY");

if (!VLMRUN_API_KEY) {
    throw new Error("Please set the VLMRUN_API_KEY environment variable");
}

console.log("✓ API Key loaded successfully");


✓ API Key loaded successfully
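
The cell above relies on `Deno.env.get`, which exists only under the Deno kernel, while the prerequisites also list tslab on Node.js. As a minimal sketch, a small helper can read the variable under either runtime. The name `readEnv` is hypothetical and not part of the VLM Run SDK.

```typescript
// Hypothetical helper (not part of the VLM Run SDK): read an environment
// variable under either the Deno or the Node.js (tslab) runtime.
function readEnv(name: string): string | undefined {
    const g = globalThis as any;
    // Deno exposes environment variables through Deno.env.get().
    if (typeof g.Deno?.env?.get === "function") {
        return g.Deno.env.get(name);
    }
    // Node.js exposes them as properties of process.env.
    return g.process?.env?.[name];
}

const loadedKey = readEnv("VLMRUN_API_KEY");
console.log(loadedKey ? "API key found" : "VLMRUN_API_KEY is not set");
```

The helper returns `undefined` for an unset variable, so the same `if (!VLMRUN_API_KEY)` guard used above still applies.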


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.


In [4]:
// Initialize the VLM Run client using the SDK
const client = new VlmRun({
    apiKey: VLMRUN_API_KEY,
    baseURL: "https://agent.vlm.run/v1"  // Use the agent API endpoint
});

console.log("✓ VLM Run client initialized successfully!");
console.log("Base URL: https://agent.vlm.run/v1");
console.log("Model: vlmrun-orion-1");


✓ VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1


## Response Models (Schemas)

We define Zod schemas for structured outputs. These schemas provide type-safe, validated responses for document understanding tasks.


In [5]:
// Helper function to download images from URLs
async function downloadImage(url: string): Promise<Uint8Array> {
    const response = await fetch(url);
    if (!response.ok) {
        throw new Error(`Failed to download image: ${response.statusText}`);
    }
    return new Uint8Array(await response.arrayBuffer());
}

// Image URL Response Schema
const ImageUrlResponseSchema = z.object({
    url: z.string().describe("Pre-signed URL to the image")
});

type ImageUrlResponse = z.infer<typeof ImageUrlResponseSchema>;

// Document URL Response Schema
const DocumentUrlResponseSchema = z.object({
    url: z.string().describe("Pre-signed URL to the document")
});

type DocumentUrlResponse = z.infer<typeof DocumentUrlResponseSchema>;

// Image URL List Response Schema
const ImageUrlListResponseSchema = z.object({
    urls: z.array(ImageUrlResponseSchema).describe("List of pre-signed image URL responses")
});

type ImageUrlListResponse = z.infer<typeof ImageUrlListResponseSchema>;

// Detection Schema
const DetectionSchema = z.object({
    label: z.string().describe("Name of the detected object or text"),
    xywh: z.tuple([z.number(), z.number(), z.number(), z.number()])
        .describe("Bounding box (x, y, width, height) normalized from 0-1"),
    confidence: z.number().nullable().optional().describe("Detection confidence score from 0-1")
});

// Detections Response Schema
const DetectionsResponseSchema = z.object({
    detections: z.array(DetectionSchema).describe("List of detected objects or text or layout elements with bounding boxes"),
    image_url: z.string().optional().describe("Url to the image for the detections")
});

type DetectionsResponse = z.infer<typeof DetectionsResponseSchema>;

// Document Classification Response Schema
const DocumentClassificationResponseSchema = z.object({
    rationale: z.string().describe("Rationale for the classification"),
    domain: z.string().describe("The classified domain of the document"),
    confidence: z.string().describe("Confidence level: hi, med, or lo"),
    tags: z.array(z.string()).nullable().optional().describe("List of tags describing the document")
});

type DocumentClassificationResponse = z.infer<typeof DocumentClassificationResponseSchema>;

// Layout Element Schema
const LayoutElementSchema = z.object({
    category: z.string().describe("Category: caption, footnote, formula, list-item, page-footer, page-header, picture, section-header, table, text, title"),
    xywh: z.tuple([z.number(), z.number(), z.number(), z.number()])
        .describe("Bounding box (x, y, width, height) normalized from 0-1"),
    text: z.string().nullable().optional().describe("Text content of the element if available")
});

// Layout Detection Response Schema
const LayoutDetectionResponseSchema = z.object({
    elements: z.array(LayoutElementSchema).describe("List of detected layout elements")
});

type LayoutDetectionResponse = z.infer<typeof LayoutDetectionResponseSchema>;

// Sensitive Item Schema
const SensitiveItemSchema = z.object({
    item_type: z.string().describe("Type of sensitive information"),
    value: z.string().describe("The detected sensitive value")
});

// Redaction Detection Response Schema
const RedactionDetectionResponseSchema = z.object({
    detected_items: z.array(SensitiveItemSchema).describe("List of detected sensitive items")
});

type RedactionDetectionResponse = z.infer<typeof RedactionDetectionResponseSchema>;

console.log("✓ Response schemas defined successfully!");
console.log("Schemas include type-safe validation for structured outputs.");


✓ Response schemas defined successfully!
Schemas include type-safe validation for structured outputs.


In [6]:
/**
 * Make a chat completion request with optional images, documents, and structured output.
 * 
 * @param prompt - The text prompt/instruction
 * @param images - Optional list of images to process (URLs)
 * @param documents - Optional list of documents to process (URLs)
 * @param responseSchema - Optional Zod schema for structured output
 * @param model - Model to use (default: vlmrun-orion-1:auto)
 * @returns Parsed response if responseSchema provided, else raw response text
 */
async function chatCompletion<T>(
    prompt: string,
    images?: string[],
    documents?: string[],
    responseSchema?: z.ZodSchema<T>,
    model: string = "vlmrun-orion-1:auto"
): Promise<T | string> {
    const content: any[] = [];
    content.push({ type: "text", text: prompt });

    // Add documents first (if any)
    if (documents) {
        for (const doc of documents) {
            if (typeof doc === "string") {
                if (!doc.startsWith("http")) {
                    throw new Error("Document URLs must start with http or https");
                }
                content.push({
                    type: "file_url",
                    file_url: { url: doc, detail: "auto" }
                });
            }
        }
    }

    // Add images (if any)
    if (images) {
        for (const image of images) {
            if (typeof image === "string") {
                if (!image.startsWith("http")) {
                    throw new Error("Image URLs must start with http or https");
                }
                content.push({
                    type: "image_url",
                    image_url: { url: image, detail: "auto" }
                });
            }
        }
    }

    const kwargs: any = {
        model: model,
        messages: [{ role: "user", content: content }]
    };

    if (responseSchema) {
        kwargs.response_format = {
            type: "json_schema",
            schema: zodToJsonSchema(responseSchema)
        } as any;
    }

    const response = await client.agent.completions.create(kwargs);
    const responseText = response.choices[0].message.content || "";

    if (responseSchema) {
        const parsed = JSON.parse(responseText);
        return responseSchema.parse(parsed) as T;
    }

    return responseText;
}

console.log("✓ Helper functions defined!");


✓ Helper functions defined!
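
The helper above accepts any string that starts with `http`, which would also let through malformed values such as `httpfoo`. A stricter check, sketched here under the assumption that only `http:` and `https:` URLs should be sent, parses the value with the standard `URL` constructor. The name `isHttpUrl` is hypothetical.

```typescript
// Hypothetical helper: stricter than startsWith("http"), which would also
// accept strings such as "httpfoo".
function isHttpUrl(value: string): boolean {
    try {
        const parsed = new URL(value); // throws on malformed input
        return parsed.protocol === "http:" || parsed.protocol === "https:";
    } catch {
        return false;
    }
}

console.log(isHttpUrl("https://example.com/doc.pdf")); // true
console.log(isHttpUrl("httpfoo"));                     // false
console.log(isHttpUrl("ftp://example.com/doc.pdf"));   // false
```

Such a check could replace the two `startsWith("http")` tests in `chatCompletion` if stricter input validation is wanted.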


### 1. OCR (Optical Character Recognition)

Extract text, tables, paragraphs, and figures from documents using OCR capabilities.


In [7]:
// Example: Extract text from a document image
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg";

const result = await chatCompletion(
    "Extract all text from this document image. Return the full text content with proper formatting.",
    [IMAGE_URL]
);

console.log(">> OCR RESULT");
console.log(result);
console.log("\n>> DOCUMENT IMAGE URL:", IMAGE_URL);


>> OCR RESULT
Today is Thursday, October 20th- But it definitely feels like a Friday. I'm already considering making a second cup of coffee- and I haven't even finished my first. Do I have a problem?
Sometimes I'll flip through older notes I've taken, and my handwriting is unrecognizable, Perhaps it depends on the type of pen I use?
I've tried writing in all caps But IT Looks So FORCED AND UNNATURAL
Often times, I'll just take notes on my laptop, but I still seem to gravitate toward pen and paper. Any advice on what to I'm prove ? I already feel stressed out looking back at what I've just written- it looks like 3 different people wrote this!

>> DOCUMENT IMAGE URL: https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/hand_writting_beautification/image-ocr.jpg


### 1b. OCR with Structured Extraction

Extract handwritten text fields from a document as structured output, with each field grounded by a bounding box.


In [8]:
// Example: Extract handwritten text fields with grounded bounding boxes
const DOC_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/healthcare.patient-referral/handwritten-patient-referral.pdf";

const result = await chatCompletion(
    "Extract all handwritten text fields and ground the text. Also return the image as a presigned url",
    undefined,
    [DOC_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> STRUCTURED OCR RESULT");
console.log(">> RESPONSE");
console.log(`Found ${result.detections.length} text fields`);
result.detections.slice(0, 10).forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});
if (result.detections.length > 10) {
    console.log(`  ... and ${result.detections.length - 10} more`);
}
if (result.image_url) {
    console.log("\n>> Document image URL:", result.image_url);
}


>> STRUCTURED OCR RESULT
>> RESPONSE
Found 19 text fields
  1. Irene: xywh=[0.183, 0.119, 0.082, 0.021]
  2. Wong: xywh=[0.368, 0.114, 0.073, 0.025]
  3. family health: xywh=[0.615, 0.110, 0.172, 0.030]
  4. ireneworry@gmail.com: xywh=[0.200, 0.148, 0.268, 0.027]
  5. 6753369412: xywh=[0.610, 0.147, 0.237, 0.027]
  6. Kevin: xywh=[0.191, 0.201, 0.089, 0.025]
  7. chen: xywh=[0.368, 0.204, 0.068, 0.021]
  8. 9/15/2001: xywh=[0.616, 0.196, 0.168, 0.027]
  9. kerimchen@gmail.com: xywh=[0.228, 0.233, 0.269, 0.025]
  10. 6153012311: xywh=[0.611, 0.233, 0.194, 0.025]
  ... and 9 more

>> Document image URL: https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/70de51de-6233-4e2f-89a7-a6748be6f009/img_521cea.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T103556Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog
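
The `xywh` values above are normalized to the 0-1 range, so they need to be scaled before they can be drawn on the page image. The sketch below converts a box to pixel coordinates. It assumes that `(x, y)` is the top-left corner of the box, and the 1000x2000 px page size is a hypothetical value chosen for illustration.

```typescript
type XYWH = [number, number, number, number];

interface PixelBox {
    left: number;
    top: number;
    right: number;
    bottom: number;
}

// Convert a normalized (x, y, width, height) box to integer pixel edges.
// Assumption: (x, y) is the top-left corner of the box.
function toPixelBox(xywh: XYWH, imageWidth: number, imageHeight: number): PixelBox {
    const [x, y, w, h] = xywh;
    // Clamp to [0, 1] so slightly out-of-range predictions stay inside the image.
    const clamp = (v: number) => Math.min(1, Math.max(0, v));
    return {
        left: Math.round(clamp(x) * imageWidth),
        top: Math.round(clamp(y) * imageHeight),
        right: Math.round(clamp(x + w) * imageWidth),
        bottom: Math.round(clamp(y + h) * imageHeight),
    };
}

// First detection from the output above, on a hypothetical 1000x2000 px page.
const pixelBox = toPixelBox([0.183, 0.119, 0.082, 0.021], 1000, 2000);
console.log(pixelBox); // { left: 183, top: 238, right: 265, bottom: 280 }
```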

### 2. Layout Detection

Detect document structure including headers, footers, tables, figures, text blocks, and other layout elements.


In [9]:
// Example: Detect document layout elements
const IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.layout/blackhole.jpeg";

const result = await chatCompletion(
    "Detect and identify all layout elements in this document including headers, footers, titles, paragraphs, tables, figures, and text blocks. Return bounding boxes for each element with their categories.",
    [IMAGE_URL],
    undefined,
    LayoutDetectionResponseSchema
) as LayoutDetectionResponse;

console.log(">> LAYOUT DETECTION RESULT");
console.log(`Found ${result.elements.length} layout elements`);
result.elements.slice(0, 20).forEach((element, idx) => {
    console.log(`  ${idx + 1}. ${element.category}: xywh=[${element.xywh.map(v => v.toFixed(3)).join(", ")}]`);
    if (element.text) {
        console.log(`     Text: ${element.text.substring(0, 50)}...`);
    }
});
if (result.elements.length > 20) {
    console.log(`  ... and ${result.elements.length - 20} more`);
}
console.log("\n>> DOCUMENT IMAGE URL:", IMAGE_URL);


>> LAYOUT DETECTION RESULT
Found 27 layout elements
  1. section-header: xywh=[0.353, 0.042, 0.276, 0.019]
  2. picture: xywh=[0.013, 0.098, 0.273, 0.209]
  3. picture: xywh=[0.280, 0.099, 0.342, 0.126]
  4. text: xywh=[0.294, 0.163, 0.253, 0.070]
  5. text: xywh=[0.622, 0.125, 0.336, 0.193]
  6. picture: xywh=[0.329, 0.238, 0.283, 0.182]
  7. text: xywh=[0.035, 0.345, 0.255, 0.159]
  8. picture: xywh=[0.176, 0.409, 0.458, 0.150]
  9. text: xywh=[0.283, 0.440, 0.220, 0.070]
  10. picture: xywh=[0.630, 0.337, 0.340, 0.208]
  11. section-header: xywh=[0.402, 0.601, 0.176, 0.016]
  12. section-header: xywh=[0.102, 0.638, 0.160, 0.015]
  13. text: xywh=[0.030, 0.677, 0.268, 0.040]
  14. text: xywh=[0.030, 0.738, 0.275, 0.042]
  15. text: xywh=[0.030, 0.803, 0.253, 0.041]
  16. text: xywh=[0.030, 0.866, 0.292, 0.042]
  17. section-header: xywh=[0.380, 0.639, 0.223, 0.015]
  18. text: xywh=[0.337, 0.688, 0.302, 0.053]
  19. text: xywh=[0.337, 0.764, 0.286, 0.042]
  20. text: xywh=[0.337, 0.8
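
A quick way to summarize a layout result is to count elements per category. The following is a minimal sketch that works on plain objects shaped like `LayoutElementSchema`. The helper name `countByCategory` is hypothetical, and the sample data is the first five elements from the output above.

```typescript
interface LayoutElement {
    category: string;
    xywh: [number, number, number, number];
}

// Tally how many elements fall into each layout category.
function countByCategory(elements: LayoutElement[]): Map<string, number> {
    const counts = new Map<string, number>();
    for (const el of elements) {
        counts.set(el.category, (counts.get(el.category) ?? 0) + 1);
    }
    return counts;
}

// First five elements from the output above.
const sampleElements: LayoutElement[] = [
    { category: "section-header", xywh: [0.353, 0.042, 0.276, 0.019] },
    { category: "picture", xywh: [0.013, 0.098, 0.273, 0.209] },
    { category: "picture", xywh: [0.280, 0.099, 0.342, 0.126] },
    { category: "text", xywh: [0.294, 0.163, 0.253, 0.070] },
    { category: "text", xywh: [0.622, 0.125, 0.336, 0.193] },
];

countByCategory(sampleElements).forEach((n, category) => {
    console.log(`${category}: ${n}`);
});
```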

### 2b. Layout Detection with Categories

Detect specific layout categories such as tables, figures, and section headers.


In [10]:
// Example: Detect specific layout categories
const DOC_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/construction.markdown/sample-construction-plan-set.pdf";

const result = await chatCompletion(
    "Detect all tables, figures, and section headers in the first page of this document. Provide bounding boxes of each detected element and return all detections. Return also pre signed url of the image",
    undefined,
    [DOC_URL],
    DetectionsResponseSchema
) as DetectionsResponse;

console.log(">> STRUCTURED LAYOUT Category RESULT");
console.log(">> RESPONSE");
console.log(`Found ${result.detections.length} layout elements`);
result.detections.forEach((det, i) => {
    console.log(`  ${i + 1}. ${det.label}: xywh=[${det.xywh.map(v => v.toFixed(3)).join(", ")}]`);
});
if (result.image_url) {
    console.log("\n>> Document image URL:", result.image_url);
}


>> STRUCTURED LAYOUT Category RESULT
>> RESPONSE
Found 11 layout elements
  1. figure: xywh=[0.088, 0.036, 0.827, 0.940]
  2. page-header: xywh=[0.923, 0.044, 0.013, 0.012]
  3. page-header: xywh=[0.923, 0.512, 0.045, 0.204]
  4. text: xywh=[0.923, 0.737, 0.015, 0.009]
  5. text: xywh=[0.940, 0.737, 0.043, 0.009]
  6. figure: xywh=[0.923, 0.755, 0.060, 0.065]
  7. text: xywh=[0.923, 0.823, 0.028, 0.009]
  8. text: xywh=[0.931, 0.844, 0.032, 0.041]
  9. text: xywh=[0.923, 0.900, 0.038, 0.009]
  10. text: xywh=[0.933, 0.923, 0.022, 0.026]
  11. text: xywh=[0.840, 0.962, 0.075, 0.011]

>> Document image URL: https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/6c145972-c710-4327-ab59-660e99cf09b3/img_ba5078.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T103813Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-
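
The result above mixes a full-page figure with small title-block elements. As a sketch, detections can be filtered by label and ordered by normalized area so the dominant region comes first. The helper name `largestFirst` is hypothetical, and the sample data is taken from the output above.

```typescript
interface Detection {
    label: string;
    xywh: [number, number, number, number];
}

// Keep detections with the given label, largest normalized area first.
// filter() returns a new array, so the input is not reordered.
function largestFirst(detections: Detection[], label: string): Detection[] {
    const area = (d: Detection) => d.xywh[2] * d.xywh[3];
    return detections
        .filter((d) => d.label === label)
        .sort((a, b) => area(b) - area(a));
}

// Three detections from the output above.
const sampleDetections: Detection[] = [
    { label: "figure", xywh: [0.923, 0.755, 0.060, 0.065] },
    { label: "page-header", xywh: [0.923, 0.044, 0.013, 0.012] },
    { label: "figure", xywh: [0.088, 0.036, 0.827, 0.940] },
];

const figures = largestFirst(sampleDetections, "figure");
console.log(figures.map((d) => d.xywh));
```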

### 3. Document Redaction

Detect sensitive information and apply blurring to redact it from the document.


In [11]:
// Example: Detect and blur sensitive information
const DOC_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/playground/1.pdf";

const result = await chatCompletion(
    "Detect all sensitive information (names, addresses, phone numbers, emails, dates) in this document and blur them to redact the information. Return the redacted image.",
    undefined,
    [DOC_URL],
    ImageUrlResponseSchema
) as ImageUrlResponse;

console.log(">> REDACTION RESULT");
console.log(result);
console.log("\n>> REDACTED DOCUMENT URL:", result.url);


>> REDACTION RESULT
{
  url: "https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/ae8ae740-ddd6-426b-bde1-75540f99f277/6a1ca916-2d0e-443b-96ba-7ae9b5d76690/img_44a2ea.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T103947Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=75f588a6093c66f2bc9965ef3153237765faf907c4d53ca03aec44af0046ce4d597530e8d39e111996c51a673d50e15ebb69f20b3393ef631a0928da59661b526e54d701be2417089b04aaa7c18f6d6e2f46a5957155087b744a74253516f811a684758eb102ab0c7fe52a2da833ec48388bd49149ec030e8f46df03d993b3965cf9d661c78d66833a621ced28327d85bec774846862a7f6773c5aacdbc29f53d628c290d988ee6a0b1538de12602465c9b68395c60a9011718e5ed783189fcd34f169145834152b1dea0d1fa9af5d37d2e4025612521f321fdb2a0a915b04fd5b602bcc8bd4a23d451349ee3aa390f66b5a3030d90a68d7e39bd3ef00d2693c"
}

>> REDACTED DOCUMENT URL: https://storage.googleapi
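
The pre-signed URL above carries `X-Goog-Date` and `X-Goog-Expires=604800` query parameters. Assuming these follow the Google Cloud Storage V4 signed-URL convention (signing time in UTC plus a lifetime in seconds), the sketch below estimates when a URL stops working, which helps decide when to download the artifact with `downloadImage`. The helper name `signedUrlExpiry` and the shortened example URL are hypothetical.

```typescript
// Estimate the expiry time of a signed URL from its query parameters.
// Assumption: X-Goog-Date is the UTC signing time (YYYYMMDDTHHMMSSZ) and
// X-Goog-Expires is the lifetime in seconds.
function signedUrlExpiry(url: string): Date | null {
    const params = new URL(url).searchParams;
    const signedAt = params.get("X-Goog-Date");
    const lifetime = params.get("X-Goog-Expires");
    if (!signedAt || !lifetime) return null;

    const m = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/.exec(signedAt);
    const seconds = Number(lifetime);
    if (!m || !Number.isFinite(seconds)) return null;

    const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
    const startMs = Date.UTC(year, month - 1, day, hour, minute, second);
    return new Date(startMs + seconds * 1000);
}

// Hypothetical shortened URL that reuses the date and lifetime from the output above.
const exampleUrl =
    "https://storage.googleapis.com/example-bucket/img.jpg" +
    "?X-Goog-Date=20251219T103947Z&X-Goog-Expires=604800";
console.log(signedUrlExpiry(exampleUrl)?.toISOString()); // 2025-12-26T10:39:47.000Z
```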

### 4. Multi-Document Classification

Classify documents into categories based on their content, structure, and visual features.
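
The request in this section passes no response schema, so the reply is free text in which the JSON is wrapped in a Markdown code fence. A minimal sketch for pulling the JSON out of such a reply is shown here. The helper name `extractJson` and the sample reply are hypothetical; passing a schema to `chatCompletion` avoids this step altogether.

```typescript
// The fence marker is built at runtime so this example contains no literal fence.
const FENCE = "`".repeat(3);

// Return the parsed JSON from a reply that may wrap it in a Markdown code fence.
// Falls back to parsing the whole reply; JSON.parse throws if neither is valid JSON.
function extractJson(reply: string): unknown {
    const pattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);
    const match = pattern.exec(reply);
    const candidate = match ? match[1] : reply;
    return JSON.parse(candidate.trim());
}

// Hypothetical reply, shaped like the fenced output this section produces.
const sampleReply =
    "Here is the information in JSON format:\n\n" +
    FENCE + "json\n{\"domain\": \"healthcare\"}\n" + FENCE;

console.log(extractJson(sampleReply)); // { domain: 'healthcare' }
```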


In [12]:
// Example: Identify the document type of each page in a multi-page PDF and extract each one
const DOC_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.agent/multi-document-input-example.pdf";

const result = await chatCompletion(
    "Analyze this multi-page medical document set. Extract patient referral page, medical insurance card and identification form in 3 separate fields in JSON format.",
    undefined,
    [DOC_URL]
);

console.log(">> RESULT");
console.log(result);


>> RESULT
I have successfully extracted the patient referral page, medical insurance card, and identification form from the document. Here is the information in JSON format:

```json
{
  "patient_referral_page": "www.cviga.org\nCENTER\nFOR THE\nVISUALLY\nIMPAIRED\nPatient Referral Form\nPlease fax this form to CVI at 404-875-4568\nPatient Name: Samuel Jackson\nDate of Birth: 6/4/72 Patient's Phone: 847-292-8014\nAddress: 1643 Elmwood Drive\nCity/State/Zip: Treeslave, NY 10027\nPreferred Contact Name and Number (if other than patient):\nDiagnosis: Retinal Detachment\nVisual Acuities: Distance cc OD:\n20/40\ncc OS\n20160\nVisual Fields (please fax field chart if available):\nReferred by:\nPhysician's name (please print): Johu\nPhysician's signature: Ph\nUPIN:\nFirst\nA\nMiddle\nTravolta\nLast\nNPI: 1134562341 Phone:\n724-891-0707\nAddress: 583 Raven Lame\nCity, State, Zip: Elmsford, NY 10734\nReferral Date: 8128/24 Date of Office Visit: 8/25/24\nHow did you hear about CVI:\nQuestions? Co
```

## Conclusion

This cookbook demonstrated the comprehensive document understanding capabilities of the **VLM Run Orion Agent API** using Node.js/TypeScript.

### Key Takeaways

1. **OCR Capabilities**: Extract text, tables, paragraphs, and figures from documents, as free text or as structured fields grounded with bounding boxes.

2. **Layout Detection**: Identify document structure including headers, footers, tables, figures, text blocks, and other layout elements with precise bounding boxes.

3. **Document Redaction**: Detect and redact sensitive information including:
   - PII (Personally Identifiable Information)
   - Financial data (account numbers, SSNs, credit cards)
   - PHI (Protected Health Information) for HIPAA compliance
   - Domain-specific sensitive data (legal, insurance, real estate, etc.)

4. **Multi-Document Classification**: Classify documents into categories based on content, structure, and visual features with confidence scores and rationales.

5. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.

6. **Structured Outputs**: Use Zod schemas with the `response_format` parameter to get type-safe responses that are parsed and validated against the schema.

7. **Streaming Support**: For long-running document processing tasks, enable streaming to receive partial results as they become available.

8. **Combined Workflows**: Combine OCR, layout detection, redaction, and classification in a single pipeline for comprehensive document understanding.
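
This notebook does not include a streaming call, so the following is only a sketch of how partial results could be accumulated. It assumes that streamed chunks follow the OpenAI chat-completions chunk shape (`choices[0].delta.content`); the `StreamChunk` type, `collectStream`, and `fakeStream` are hypothetical names, and the fake stream stands in for a real response so the example runs without network access.

```typescript
// Assumed chunk shape, modeled on OpenAI chat-completions streaming chunks.
interface StreamChunk {
    choices: { delta: { content?: string | null } }[];
}

// Concatenate the text deltas of a stream, reporting each one as it arrives.
async function collectStream(
    stream: AsyncIterable<StreamChunk>,
    onDelta?: (text: string) => void
): Promise<string> {
    let full = "";
    for await (const chunk of stream) {
        const delta = chunk.choices[0]?.delta?.content ?? "";
        if (delta) {
            full += delta;
            onDelta?.(delta);
        }
    }
    return full;
}

// Stand-in for a real streamed response.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
    yield { choices: [{ delta: { content: "Page 1: " } }] };
    yield { choices: [{ delta: { content: "referral form" } }] };
    yield { choices: [{ delta: {} }] }; // final chunk carries no text
}

collectStream(fakeStream(), (text) => console.log("partial:", text))
    .then((text) => console.log("full:", text));
```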

### Use Cases

- **Document Processing**: Automate extraction of structured data from invoices, receipts, forms, and contracts
- **Compliance**: Support HIPAA, GDPR, PCI-DSS, and other regulatory compliance workflows through automated redaction
- **Document Management**: Classify and organize large document collections
- **Data Extraction**: Extract tables, figures, and structured content from documents
- **Privacy Protection**: Automatically detect and redact sensitive information before sharing documents

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)
- Review the [VLM Run Node.js SDK](https://github.com/vlm-run/vlmrun-node-sdk) documentation
- Review domain-specific redaction agents for financial, healthcare, legal, and other industries

Happy building!
