<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a>
</p>
<p align="center">
<a href="https://discord.gg/AMApC2UzVY"><img alt="Discord" src="https://img.shields.io/badge/discord-chat-purple?color=%235765F2&label=discord&logo=discord"></a>
<a href="https://twitter.com/vlmrun"><img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/vlmrun.svg?style=social&logo=twitter"></a>
</p>
</div>

Welcome to **[VLM Run Cookbooks](https://github.com/vlm-run/vlmrun-cookbook)**, a comprehensive collection of examples and notebooks demonstrating the power of structured visual understanding using the [VLM Run Platform](https://app.vlm.run). 

## 🎨 MCP Showcase

This guide walks through our [new MCP server](https://docs.vlm.run/mcp/introduction) and all the tools it provides to date. For a more detailed overview of the MCP server, please refer to the [MCP documentation](https://docs.vlm.run/mcp/introduction) and the [tools reference](https://docs.vlm.run/mcp/tools/overview).

## Prerequisites

* Python 3.9+
* VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
* OpenAI SDK and API key

In [1]:
! pip install vlmrun openai python-dotenv pydantic pandas tqdm --upgrade --quiet

## MCP Server Setup

First, let's connect to our MCP server and list all the available tools. The server is hosted remotely at `https://mcp.vlm.run` and is currently authenticated with an API key. Head over to the [API keys](https://app.vlm.run/dashboard/settings/api_keys) page to get your full authenticated MCP server URL.
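
Because the authenticated URL embeds the API key in its path (the setup cell in this section builds it as `https://mcp.vlm.run/<key>/sse`), it can be worth masking the key before the URL ends up in logs or notebook output. The helper below is a minimal sketch; `redact_mcp_url` is a hypothetical name, and it assumes the key is always the first path segment.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_mcp_url(url: str) -> str:
    """Mask the first path segment (assumed to be the API key) of an MCP URL."""
    parts = urlsplit(url)
    # A path such as "/<key>/sse" splits into ["", "<key>", "sse"].
    segments = parts.path.split("/")
    if len(segments) > 1 and segments[1]:
        segments[1] = "***"
    return urlunsplit(parts._replace(path="/".join(segments)))


# "my-secret-key" is a placeholder, not a real key.
print(redact_mcp_url("https://mcp.vlm.run/my-secret-key/sse"))
```

Printing the redacted form instead of the raw URL keeps the key out of shared notebooks while still showing which server is in use.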

Let's use the OpenAI SDK to connect to our MCP server. In order to test our connection, let's list all the available tools in a neatly formatted markdown table.

In [None]:
import os
import openai
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", None)
if OPENAI_API_KEY is None:
    raise ValueError("OPENAI_API_KEY is not set")

VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    raise ValueError("VLMRUN_API_KEY is not set")


# Initialize the OpenAI client
client = openai.Client(api_key=OPENAI_API_KEY)

# Fill in the MCP base URL from the API keys page
# https://app.vlm.run/dashboard/settings/api_keys
MCP_BASE_URL = f"https://mcp.vlm.run/{VLMRUN_API_KEY}/sse"

# We'll use the OpenAI SDK to connect to our MCP server.
result = client.responses.create(
    model="gpt-4.1",
    tools=[
        {
            "type": "mcp",
            "server_label": "vlmrun-mcp-server",
            "server_url": MCP_BASE_URL,
            "require_approval": "never",
        },
    ],
    input="List all the available tools in a neatly formatted markdown table",
)
print(result.output_text)

Here is a neatly formatted markdown table listing all the available tools and their descriptions:

| Tool Name                                      | Description |
|------------------------------------------------|-------------|
| put_image_url                                  | Puts an image from a URL into the MCP server and returns an image reference. |
| put_file_url                                   | Puts a file from a URL into the MCP server and returns a file reference. |
| preview_file_url                               | Gets a preview URL of a file object from an object reference. |
| preview_image                                  | Gets a preview image from an image reference. |
| parse_image                                    | Parses the image into a structured format, optionally with a prompt. |
| rotate_image                                   | Returns a copy of the image rotated by a given angle counterclockwise. |
| crop_image                                     | Crop
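
The table above is written by the model, so its wording and completeness can vary between runs. As an alternative, the tool list can usually be read from the response's output items directly. The sketch below assumes that the Responses API returns an output item of type `mcp_list_tools` whose `tools` entries expose `name` and `description`; `tools_to_markdown` is a hypothetical helper, and the demo uses stand-in objects rather than a live response.

```python
from types import SimpleNamespace


def tools_to_markdown(output_items) -> str:
    """Render the tools found in `mcp_list_tools` output items as a markdown table."""
    rows = ["| Tool Name | Description |", "|---|---|"]
    for item in output_items:
        # Skip messages, tool calls, and any other item types.
        if getattr(item, "type", None) != "mcp_list_tools":
            continue
        for tool in item.tools:
            # Keep each description on one line and escape pipes so the table stays valid.
            desc = (tool.description or "").replace("\n", " ").replace("|", "\\|")
            rows.append(f"| {tool.name} | {desc} |")
    return "\n".join(rows)


# Stand-in for `result.output`; the attribute layout is an assumption.
demo_output = [
    SimpleNamespace(
        type="mcp_list_tools",
        tools=[
            SimpleNamespace(
                name="put_image_url",
                description="Puts an image from a URL into the MCP server and returns an image reference.",
            )
        ],
    )
]
print(tools_to_markdown(demo_output))
```

With a live response, you would pass `result.output` in place of `demo_output`, provided the item layout matches the assumption above.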

## Defining a basic LLM agent equipped with [VLM Run MCP](https://docs.vlm.run/mcp/introduction) tools

Let's define a basic visual agent workflow that takes a natural-language instruction along with image references and lets the LLM agent build a workflow to process the images. The agent can leverage our VLM Run MCP tools to accomplish the task.

In [110]:
from pydantic import BaseModel, Field

class ReasoningStep(BaseModel):
    tool_id: str = Field(..., description="The tool used in this step")
    reasoning: str = Field(..., description="The reasoning for this step")

class ToolCallStep(BaseModel):
    tool_id: str = Field(..., description="The tool used in this step")
    reasoning: str = Field(..., description="A short description of the reasoning for this tool call")

class Response(BaseModel):
    reasoning_steps: list[ReasoningStep] = Field(..., description="The reasoning steps for the response")
    tool_calls: list[ToolCallStep] = Field(..., description="The sequence of tool calls for the response to arrive at the final answer")
    result: str = Field(..., description="The final result of the response")
    url: str = Field(..., description="The preview URL of the image, in the format https://mcp.vlm.run/files/<file_id>")

def process(prompt: str) -> Response:
    """The process function is a wrapper around the OpenAI Responses API that 
    executes multiple tool calls to accomplish the user-intent, and finally 
    parses the resulting response into a structured format."""
    result = client.responses.parse(
        model="gpt-4.1",
        tools=[
            {
                "type": "mcp",
                "server_label": "vlmrun-mcp-server",
                "server_url": MCP_BASE_URL,
                "require_approval": "never",
            },
        ],
        input=prompt,
        text_format=Response
    )
    return result.output_parsed

## Use-cases

Now, let's go ahead and define a few use-cases where the user can provide a language instruction along with image references. 

In [111]:

class Example(BaseModel):
    name: str
    prompt: str
    inputs: dict[str, str]


EXAMPLES = [
    # image-workflows (basic)
    Example(
        name="load-image-and-preview",
        prompt="Load this image ({url}) and preview it.",
        inputs={"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"}
    ),
    Example(
        name="face-detection",
        prompt="Detect the faces in this image ({url}). Overlay the detected faces on the image and preview it.",
        inputs={"url": "https://cdn.mos.cms.futurecdn.net/Yvs83nR9GrDk9bq4Weq5eZ.jpg"}
    ),
    Example(
        name="face-detection-and-blurring",
        prompt="Load this image ({url}) and detect all the faces in the image, blur them, and overlay the detected faces on the blurred image, and return the preview URL of the blurred image. ",
        inputs={"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"}
    ),
    Example(
        name="object-detection",
        prompt="Load this image ({url}) and detect any objects or items in the image, and return the resulting image preview URL.",
        inputs={"url": "https://github.com/autonomi-ai/nos/blob/main/nos/test/test_data/test.jpg?raw=true"}
    ),
    Example(
        name="qr-code-detection",
        prompt="Load this image ({url}) and detect any QR codes in the image, and visualize the bounding box locations of each QR code. Also return the QR code content. ",
        inputs={"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/qr-code-screen.jpg"}
    ),
    Example(
        name="crop-right-face-and-preview",
        prompt="Load this image ({url}), detect all the faces in the image, crop the middle face, and preview the cropped face and provide a brief description of the cropped face. ",
        inputs={"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg"}
    ),
    # image-workflows (advanced)
    # Example(
    #     name="find-template-in-image",
    #     prompt="""Given a template image ({template_url}), match it with the following reference image ({url}). Localize the template in the reference image and visualize all the matches with bounding boxes drawn on the reference image. Finally preview the reference image with the bounding boxes. """,
    #     inputs={
    #         "template_url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/apple-logo.png",
    #         "url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/reddit-mac-laptops.jpg"
    #     }        
    # ),
    # document-workflows
    Example(
        name="extract-the-invoice-and-ground-details",
        prompt="Load this invoice ({url}), extract only these details: (invoice number, subtotal, total amount, date, etc.). Extract the bounding boxes or grounding information and visualize the invoice with the extracted details. ",
        inputs={"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.invoice/invoice_1.jpg"}
    ),
    Example(
        name="redact-doc-pii",
        prompt="Can you redact all the personally identifiable information of the patient (name, address, DOB, phone number, email, etc.) in the following image ({url}) and provide a link to the redacted image.",
        inputs={"url": "https://www.carepatron.com/files/physical-therapy-referral-form-sample-template.jpg"}
    ),
    # Example(
    #     name="document-layout-analysis",
    #     prompt="Load this document ({url}), take the first page, detect the layout of the page. Finally visualize the detections on the page image and preview it.",
    #     inputs={"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/2502.13923v1.pdf"}
    # )
]
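
Each prompt above is a template whose `{...}` placeholders are filled from `inputs`, so a placeholder without a matching key would raise a `KeyError` at format time. A small check like the one below can catch that before any API call is made. This is a sketch using only the standard library; `missing_placeholders` is a hypothetical helper and the URLs in the demo are placeholders.

```python
from string import Formatter


def missing_placeholders(prompt: str, inputs: dict) -> set:
    """Return placeholder names used in `prompt` that have no entry in `inputs`."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    fields = {name for _, name, _, _ in Formatter().parse(prompt) if name}
    return fields - set(inputs)


ok = missing_placeholders(
    "Load this image ({url}) and preview it.",
    {"url": "https://example.com/image.jpg"},
)
bad = missing_placeholders(
    "Given a template image ({template_url}), match it with ({url}).",
    {"url": "https://example.com/image.jpg"},
)
print(ok, bad)
```

Running this over every entry in `EXAMPLES` before the main loop is cheap compared with a failed agent run.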

## Let's execute!

Now that we have defined the individual use-cases, let's execute them and see how well our LLM agent equipped with the VLM Run MCP tools can accomplish the tasks!

In [112]:
import warnings
from typing import Any, Optional
from tqdm import tqdm

# Drop any UserWarning in pydantic
warnings.filterwarnings("ignore", category=UserWarning)

# Define the example with response
class ExampleWithResponse(BaseModel):
    input: Example
    # Optional[...] keeps these annotations valid on Python 3.9 (X | None needs 3.10+)
    response: Optional[Any] = None
    error_msg: Optional[str] = None


# Process all the examples
responses = []
for ex in tqdm(EXAMPLES, desc="Processing examples"):
    try:
        # Format the prompt with the inputs
        ex.prompt = ex.prompt.format(**ex.inputs)
        # Process the prompt
        response = process(ex.prompt)
        # Add the response to the list
        responses.append(ExampleWithResponse(input=ex, response=response, error_msg=None))
    except Exception as e:
        responses.append(ExampleWithResponse(input=ex, response=None, error_msg=str(e)))
        print(f"Error: {e}")

Processing examples: 100%|██████████| 8/8 [04:33<00:00, 34.16s/it]
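
Since the loop records failures in `error_msg` instead of stopping, a quick tally helps confirm how many examples actually succeeded. The sketch below works on plain dictionaries, assumed to be what `r.model_dump()` would return for each entry in `responses`; `summarize_runs` is a hypothetical helper and the demo records are made up for illustration.

```python
def summarize_runs(records: list) -> dict:
    """Count successes and collect error messages keyed by example name."""
    failed = {
        r["input"]["name"]: r["error_msg"]
        for r in records
        if r.get("error_msg")
    }
    return {
        "total": len(records),
        "succeeded": len(records) - len(failed),
        "failed": failed,
    }


# Made-up records in the shape of ExampleWithResponse.model_dump().
demo_records = [
    {"input": {"name": "face-detection"}, "response": {"result": "ok"}, "error_msg": None},
    {"input": {"name": "redact-doc-pii"}, "response": None, "error_msg": "request timed out"},
]
print(summarize_runs(demo_records))
```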


Let's check the response of all the visual agent workflows we just created.

In [113]:
import pandas as pd

df = pd.DataFrame([r.model_dump(mode="json") for r in responses])
df = pd.concat([pd.json_normalize(df["input"]), pd.json_normalize(df["response"])], axis=1)
df.head()

| | name | prompt | inputs.url | reasoning_steps | tool_calls | result | url |
|---|---|---|---|---|---|---|---|
| 0 | load-image-and-preview | Load this image (https://storage.googleapis.co... | https://storage.googleapis.com/vlm-data-public... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | The image has been loaded and is available for... | https://mcp.vlm.run/files/img_2c4d |
| 1 | face-detection | Detect the faces in this image (https://cdn.mo... | https://cdn.mos.cms.futurecdn.net/Yvs83nR9GrDk... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | The detected faces are now overlaid on the ima... | https://mcp.vlm.run/files/img_d1be |
| 2 | face-detection-and-blurring | Load this image (https://storage.googleapis.co... | https://storage.googleapis.com/vlm-data-public... | [{'tool_id': 'put_image_url', 'reasoning': 'Lo... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | Here is the preview URL of the image with blur... | https://mcp.vlm.run/files/img_e4ff |
| 3 | object-detection | Load this image (https://github.com/autonomi-a... | https://github.com/autonomi-ai/nos/blob/main/n... | [{'tool_id': 'put_image_url', 'reasoning': 'Lo... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | Objects detected in the image include a bench ... | https://mcp.vlm.run/files/img_2561 |
| 4 | qr-code-detection | Load this image (https://storage.googleapis.co... | https://storage.googleapis.com/vlm-data-public... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | [{'tool_id': 'mcp_vlmrun-mcp-server.put_image_... | A QR code was detected in the image. The conte... | https://mcp.vlm.run/files/img_166a |
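
The `inputs.url` column name comes from flattening the nested `inputs` dictionary: nested keys are joined with a dot, while list values such as `reasoning_steps` stay in a single column. The pure-Python sketch below illustrates that naming scheme only; it is not how pandas implements `json_normalize`, and `flatten` is a hypothetical helper.

```python
def flatten(record: dict, sep: str = ".", prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the dotted prefix along.
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            # Scalars and lists are kept as-is under the dotted name.
            flat[name] = value
    return flat


row = {"name": "face-detection", "inputs": {"url": "https://example.com/a.jpg"}}
print(flatten(row))
```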


## Show me the results!

Enough talking - let's see the results!

In [127]:
from IPython.display import HTML, display

formatters = {
    "prompt": lambda x: f'<div style="width: 300px; word-wrap: break-word;">{x}</div>',
    "result": lambda x: f'<div style="width: 300px; word-wrap: break-word;">{x}</div>',
    # Quote the attribute values so URLs with special characters stay inside `src`
    "inputs.url": lambda x: f'<img src="{x}" width="500">',
    "url": lambda x: f'<img src="{x}" width="500">',
}

cols = ["name", "prompt", "result", "inputs.url", "url"]
html = df[cols].head(n=10).to_html(formatters=formatters, escape=False)
display(HTML(html))

| | name | prompt | result | inputs.url | url |
|---|---|---|---|---|---|
| 0 | load-image-and-preview | Load this image (https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg) and preview it. | The image has been loaded and is available for preview below. | | |
| 1 | face-detection | Detect the faces in this image (https://cdn.mos.cms.futurecdn.net/Yvs83nR9GrDk9bq4Weq5eZ.jpg). Overlay the detected faces on the image and preview it. | The detected faces are now overlaid on the image. You can preview the result using the link below. | | |
| 2 | face-detection-and-blurring | Load this image (https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg) and detect all the faces in the image, blur them, and overlay the detected faces on the blurred image, and return the preview URL of the blurred image. | Here is the preview URL of the image with blurred faces and red rectangles overlayed on the detected faces. | | |
| 3 | object-detection | Load this image (https://github.com/autonomi-ai/nos/blob/main/nos/test/test_data/test.jpg?raw=true) and detect any objects or items in the image, and return the resulting image preview URL. | Objects detected in the image include a bench and multiple cars. Here is the processed image with bounding boxes and labels for detected objects. | | |
| 4 | qr-code-detection | Load this image (https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.agent/qr-code-screen.jpg) and detect any QR codes in the image, and visualize the bounding box locations of each QR code. Also return the QR code content. | A QR code was detected in the image. The content is: "https://vlm.run". The image with the visualized bounding box (in red) around the QR code is provided at the link below. | | |
| 5 | crop-right-face-and-preview | Load this image (https://storage.googleapis.com/vlm-data-public-prod/hub/examples/media.tv-news/finance_bb_3_speakers.jpg), detect all the faces in the image, crop the middle face, and preview the cropped face and provide a brief description of the cropped face. | The cropped middle face shows a close-up of a woman's face. She has blonde hair, is wearing visible makeup, and her eyes are closed or looking downward with her mouth slightly open, showing some teeth. | | |
| 6 | extract-the-invoice-and-ground-details | Load this invoice (https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.invoice/invoice_1.jpg), extract only these details: (invoice number, subtotal, total amount, date, etc.). Extract the bounding boxes or grounding information and visualize the invoice with the extracted details. | Extracted Invoice Details: - Invoice Number: 9999999 - Date: 2023-11-11 - Subtotal: 400.00 - Total: 400.00 The requested fields are visually highlighted in the invoice preview below with their bounding boxes. | | |
| 7 | redact-doc-pii | Can you redact all the personally identifiable information of the patient (name, address, DOB, phone number, email, etc.) in the following image (https://www.carepatron.com/files/physical-therapy-referral-form-sample-template.jpg) and provide a link to the redacted image. | All visible personally identifiable information (PII) such as full names, addresses, birthdate, phone numbers, and email addresses have been redacted from the image. You can view and download the redacted document at the link below. | | |
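
Because the cell above calls `to_html` with `escape=False`, every formatter's output is inserted into the page as raw HTML, including model-generated text in `result`. A more defensive variant of the formatters could escape the values first. The sketch below uses only the standard library; `text_cell` and `img_cell` are hypothetical names, and the URL in the demo is a placeholder.

```python
from html import escape


def text_cell(value, width_px: int = 300) -> str:
    """Wrap text in a fixed-width div, escaping any HTML in the value."""
    return (
        f'<div style="width: {width_px}px; word-wrap: break-word;">'
        f"{escape(str(value))}</div>"
    )


def img_cell(url, width_px: int = 500) -> str:
    """Build an <img> tag with a quoted, escaped src attribute."""
    return f'<img src="{escape(str(url), quote=True)}" width="{width_px}">'


print(text_cell("<b>bold?</b>"))
print(img_cell("https://example.com/a.jpg?raw=true&x=1"))
```

These could be dropped into the `formatters` dictionary in place of the lambdas, for example `"result": text_cell` and `"url": img_cell`.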


### 💡 Pro Tips for Using VLM Run MCP tools

📚 Docs: [VLM Run MCP Docs](https://docs.vlm.run/mcp/introduction)
🌐 Website: [VLM Run MCP](https://vlm.run/mcp)

1. Familiarize yourself with the [tools and the current capabilities](https://docs.vlm.run/mcp/tools/overview)
   - Consider working with individual images and fewer tools first to make sure the workflow is working as expected.
   - Start building and extending the workflow, one new tool at a time. 
   - Guide the workflow by describing the capability you need, and avoid explicitly calling out tool names.

2. Best Practices:
   - Ensure good image quality
   - Validate outputs with structured responses (Pydantic), if possible

3. Common Use Cases:
   - Document Processing: Invoices, resumes, IDs
   - Healthcare: Insurance cards, patient intake forms
   - Sports & Media: Game analysis, news content
   - Retail: Product cataloging, ad analysis
   - Aerospace: Satellite imagery analysis
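
As one way to act on the "validate outputs" tip, the `url` field of the structured response can be checked against the format stated in its description, `https://mcp.vlm.run/files/<file_id>`. The sketch below assumes file IDs consist of letters, digits, underscores, and hyphens (as in `img_2c4d` from the results above); that character set is a guess, and `is_preview_url` is a hypothetical helper.

```python
import re

# Assumed shape of a preview URL; the allowed file-ID characters are a guess.
PREVIEW_URL_RE = re.compile(r"https://mcp\.vlm\.run/files/[A-Za-z0-9_-]+")


def is_preview_url(url: str) -> bool:
    """Return True if `url` looks like an MCP file preview URL."""
    return PREVIEW_URL_RE.fullmatch(url) is not None


print(is_preview_url("https://mcp.vlm.run/files/img_2c4d"))
print(is_preview_url("https://example.com/files/img_2c4d"))
```

A check like this could run right after `process(...)` returns, so that a response with a missing or malformed preview URL is flagged before it reaches the results table.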