<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a>
</p>
</div>

# VLM Run Orion - 3D Reconstruction

This comprehensive cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) 3D reconstruction capabilities. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

This notebook shows how to use the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface, to build 3D reconstruction workflows with the familiar chat-completions request format.

We'll cover the following topics:
 1. 3D Reconstruction from a Single Image (depth estimation and geometry inference)
 2. Multi-Object 3D Reconstruction from a Single Image (detect, segment, then reconstruct)
 3. 3D Reconstruction from Video (reconstruction from frames sampled from a video)

## Prerequisites

- Python 3.10+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- VLM Run Python Client with OpenAI extra `vlmrun[openai]`

## Setup

First, install the required packages and configure the environment.

In [None]:
%load_ext autoreload
%autoreload 2


In [None]:
# Install required packages
%pip install vlmrun[openai] --upgrade --quiet
%pip install pillow requests numpy opencv-python open3d plydata plotly cachetools --quiet

In [None]:
import os
import getpass
import json
from typing import List, Any
from functools import cached_property

import numpy as np
from PIL import Image
from pydantic import BaseModel, Field

VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass("Enter your VLM Run API key: ")

## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.

In [None]:
from vlmrun.client import VLMRun

client = VLMRun(
    api_key=VLMRUN_API_KEY, base_url="https://agent.vlm.run/v1"
)
print("VLM Run client initialized successfully!")
print("Base URL: https://agent.vlm.run/v1")
print("Model: vlmrun-orion-1")

## Response Models (dtypes)

We define a Pydantic model for structured outputs. `Recon3DResponse` has a single field, `recon_path`, which holds the pre-signed URL of the 3D reconstruction file. Later cells download that file and parse it as a Gaussian splat `.ply`.

In [None]:
from pydantic import BaseModel, Field

class Recon3DResponse(BaseModel):
    """Response model for 3D reconstruction operations, expecting 'recon_path' field."""
    recon_path: str = Field(..., description="Pre-signed URL to the 3D reconstruction file")
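
As a minimal sketch of how this model behaves, the snippet below validates a hand-written JSON payload and shows what happens when the required field is missing. The example URL and both payloads are placeholders invented for illustration, not real API output.

```python
from pydantic import BaseModel, Field, ValidationError


class Recon3DResponse(BaseModel):
    """Same shape as the response model defined above."""
    recon_path: str = Field(..., description="Pre-signed URL to the 3D reconstruction file")


# Hypothetical payload; a real response carries a pre-signed URL from the API.
good = Recon3DResponse.model_validate_json(
    '{"recon_path": "https://example.com/model.ply?sig=abc"}'
)
print(good.recon_path)

# A payload without the required field is rejected instead of passing through silently.
try:
    Recon3DResponse.model_validate_json('{"path": "https://example.com/model.ply"}')
    rejected = False
except ValidationError:
    rejected = True
print("rejected:", rejected)

# The JSON schema generated from the model lists `recon_path` as required.
schema = Recon3DResponse.model_json_schema()
print(schema["required"])
```

Validating against the model means a malformed agent response fails loudly at parse time rather than later, when the download step tries to use a missing URL.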

## Chat Completion Helper

We create a helper function to simplify making chat completion requests for 3D reconstruction.

In [None]:
import hashlib
import cachetools
from typing import Type, TypeVar
from IPython.display import HTML
from vlmrun.common.image import encode_image
import os
import requests
import numpy as np
import plotly.graph_objects as go
from PIL import Image
from io import BytesIO
import base64

def download_ply(url, filename=None):
    """Download a .ply file and return the local path."""
    if filename is None:
        # Drop the query string first so signature parameters never end up in the name
        filename = os.path.basename(url.split("?")[0]) or "model.ply"
    print(f"Downloading → {filename}")
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
    print(f"Saved to: {filename}")
    return filename


def load_gaussian_splat_ply(path):
    """Load Gaussian splat PLY file and extract parameters."""
    print("Loading Gaussian splat PLY...")

    with open(path, 'rb') as f:
        # Read header
        line = f.readline().decode('ascii').strip()
        if line != 'ply':
            raise ValueError("Not a valid PLY file")

        line = f.readline().decode('ascii').strip()
        if line != 'format binary_little_endian 1.0':
            raise ValueError("Only binary_little_endian format supported")

        # Read vertex count
        line = f.readline().decode('ascii').strip()
        if not line.startswith('element vertex '):
            raise ValueError("Missing vertex count")
        n = int(line.split()[-1])
        print(f"Found {n} Gaussians")

        # Parse properties
        fields = {}
        idx = 0
        while True:
            line = f.readline().decode('ascii').strip()
            if line == 'end_header':
                break
            if line.startswith('property float '):
                field_name = line.split()[-1]
                fields[field_name] = idx
                idx += 1

        # Required fields
        required = ['x', 'y', 'z', 'f_dc_0', 'f_dc_1', 'f_dc_2', 'opacity', 'scale_0', 'scale_1', 'scale_2']
        for field in required:
            if field not in fields:
                raise ValueError(f"Missing required field: {field}")

        num_fields = len(fields)

        # Read binary data
        data = np.frombuffer(f.read(), dtype=np.float32)
        data = data.reshape(n, num_fields)

        # Extract fields
        positions = data[:, [fields['x'], fields['y'], fields['z']]].astype(np.float32)
        colors = data[:, [fields['f_dc_0'], fields['f_dc_1'], fields['f_dc_2']]].astype(np.float32)
        opacities = data[:, fields['opacity']].astype(np.float32)
        scales = data[:, [fields['scale_0'], fields['scale_1'], fields['scale_2']]].astype(np.float32)

        # Apply transformations
        colors = 1 / (1 + np.exp(-colors))  # sigmoid squashes the DC color coefficients into (0, 1) for display
        opacities = 1 / (1 + np.exp(-opacities))  # sigmoid for opacity
        scales = np.exp(scales)  # scales are stored in log space

        print(f"Loaded {n} Gaussians")
        return positions, colors, opacities, scales



def render_gaussian_splat(positions, colors, opacities, scales, max_points=100000):
    """Render Gaussian splat point cloud with soft distance-based sampling."""

    n = len(positions)

    # ---- 1. Compute center ----
    center = positions.mean(axis=0)

    # ---- 2. Distance from center ----
    dist = np.linalg.norm(positions - center, axis=1)

    # ---- 3. Soft distance weighting: closer = more weight ----
    # sigma = scale of soft falloff (25th percentile of distances)
    sigma = max(np.percentile(dist, 25), 1e-8)  # guard against a zero falloff scale
    distance_weight = np.exp(-(dist**2) / (2 * sigma**2))

    # ---- 4. Soft sampling probability: opacity × distance weight ----
    probs = distance_weight * opacities
    probs /= probs.sum()

    # ---- 5. Sample if needed ----
    if n > max_points:
        print(f"Soft sampling {max_points} out of {n} points...")
        idx = np.random.choice(n, max_points, replace=False, p=probs)
        positions = positions[idx]
        colors = colors[idx]
        opacities = opacities[idx]
        scales = scales[idx]

    # ---- 6. Marker sizes ----
    scale_mags = np.linalg.norm(scales, axis=1)
    sizes = 1 + 9 * (scale_mags - scale_mags.min()) / (scale_mags.max() - scale_mags.min() + 1e-8)

    # ---- 7. Color formatting ----
    rgb = (colors * 255).astype(np.uint8)
    rgba = [f'rgba({r},{g},{b},{a:.3f})'
            for r, g, b, a in zip(rgb[:, 0], rgb[:, 1], rgb[:, 2], opacities)]

    # ---- 8. Plot ----
    fig = go.Figure(data=[go.Scatter3d(
        x=positions[:, 0], y=positions[:, 1], z=positions[:, 2],
        mode='markers',
        marker=dict(size=sizes, color=rgba, line=dict(width=0), sizemode='diameter')
    )])

    fig.update_layout(
        scene=dict(bgcolor='black', xaxis=dict(visible=False),
                   yaxis=dict(visible=False), zaxis=dict(visible=False)),
        paper_bgcolor='black', plot_bgcolor='black'
    )

    return fig

T = TypeVar('T', bound=BaseModel)

def custom_key(prompt: str, images: list[Image.Image] | list[str] | None = None, response_model: Type[T] | None = None, model: str = "vlmrun-orion-1:auto"):
    """Custom key for caching chat_completion."""
    image_keys = []
    for image in images or []:  # images defaults to None, which is not iterable
        if isinstance(image, Image.Image):
            thumb = image.copy()
            thumb.thumbnail((128, 128))
            encoded = encode_image(thumb, format="JPEG")
            image_keys.append(encoded)
        elif isinstance(image, str):
            image_keys.append(image)


    response_key = hashlib.sha256(json.dumps(response_model.model_json_schema(), sort_keys=True).encode()).hexdigest() if response_model else ""
    return (prompt, tuple(image_keys), response_key, model)



#@cachetools.cached(cache=cachetools.TTLCache(maxsize=1000, ttl=3600), key=custom_key)
def chat_completion(
    prompt: str,
    images: list[Image.Image] | list[str] | None = None,
    video: str | None = None,
    response_model: Type[T] | None = None,
    model: str = "vlmrun-orion-1:auto"
) -> Any:
    """
    Make a chat completion request with optional images and structured output.

    Args:
        prompt: The text prompt/instruction
        images: Optional list of images to process (either PIL Images or URLs)
        video: Optional URL of a video to process
        response_model: Optional Pydantic model for structured output
        model: Model to use (default: vlmrun-orion-1:auto)

    Returns:
        Parsed response model if response_model provided, else raw response text
    """
    content = []
    content.append({"type": "text", "text": prompt})

    if images:
        for image in images:
            if isinstance(image, str):
                assert image.startswith("http"), "Image URLs must start with http or https"
                content.append({
                    "type": "image_url",
                    "image_url": {"url": image, "detail": "auto"}
                })
            elif isinstance(image, Image.Image):
                content.append({
                    "type": "image_url",
                    "image_url": {"url": encode_image(image, format="JPEG"), "detail": "auto"}
                })
            else:
                raise ValueError("Images must be either PIL Images or URLs")
    if video:
        content.append({
            "type": "video_url",
            "video_url": {"url": video}
        })

    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": content}]
    }

    if response_model:
        kwargs["response_format"] = {
            "type": "json_schema",
            "schema": response_model.model_json_schema()
        }

    response = client.agent.completions.create(**kwargs)
    response_text = response.choices[0].message.content

    if response_model:
        return response_model.model_validate_json(response_text)

    return response_text

print("Helper functions defined!")
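
To make the request shape concrete, here is a rough sketch of the dictionary that `chat_completion` assembles before calling the client, built without any network call. `build_request` is a hypothetical helper written for this illustration, the image URL is a placeholder, and the hand-written schema is a simplified stand-in for what `response_model.model_json_schema()` would produce.

```python
import json


def build_request(prompt, image_urls=(), video_url=None, schema=None,
                  model="vlmrun-orion-1:auto"):
    """Assemble a chat-completions style request dict (illustrative only)."""
    # The text prompt always comes first in the content list.
    content = [{"type": "text", "text": prompt}]
    # Each image becomes its own `image_url` content part.
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url, "detail": "auto"}})
    # A video, if given, is appended as a `video_url` content part.
    if video_url:
        content.append({"type": "video_url", "video_url": {"url": video_url}})
    request = {"model": model, "messages": [{"role": "user", "content": content}]}
    # Structured output is requested only when a schema is supplied.
    if schema is not None:
        request["response_format"] = {"type": "json_schema", "schema": schema}
    return request


# Simplified stand-in for the schema of a model with one required string field.
schema = {
    "type": "object",
    "properties": {"recon_path": {"type": "string"}},
    "required": ["recon_path"],
}

request = build_request(
    "Generate a 3D reconstruction of the table in the image",
    image_urls=["https://example.com/table.png"],  # placeholder URL
    schema=schema,
)
print(json.dumps(request, indent=2))
```

Printing the request like this is a quick way to check what is being sent when a call returns an unexpected result.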

## 3D Reconstruction Use Cases

The VLM Run API can create 3D models from various inputs:
1. **From Images** - Single or multiple images of a scene/object
2. **From Image** - Reconstruct multiple objects in one image, selected by a text description
3. **From Video** - Automatically extract frames and reconstruct the scene

### 1. 3D Reconstruction from a Single Image

Create a 3D model from a single image. The model will infer depth and geometry to create a full 3D reconstruction.

In [None]:
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/agent_use_cases/guided-segmentation/image-11.png"

result = chat_completion(
    prompt="Generate a 3D reconstruction of the table in the image",
    images=[IMAGE_URL],
    response_model=Recon3DResponse
)
print(">> RESPONSE")
print(result)
print(">> IMAGE")


In [None]:
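# Import under an alias so PIL's `Image`, which the helper functions rely on, is not shadowed
from IPython.display import Image as IPyImage
IPyImage(url=IMAGE_URL, width=500, height=300)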


In [None]:

filename = download_ply(result.recon_path)
positions, colors, opacities, scales = load_gaussian_splat_ply(filename)
fig = render_gaussian_splat(positions, colors, opacities, scales)
fig.show()
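
The loader used here expects a specific file layout: an ASCII header that declares the vertex count and one `property float` line per field, followed by tightly packed little-endian float32 records. The sketch below illustrates that layout with the standard library only, by writing a tiny synthetic file in memory and reading it back. The helper names `build_tiny_ply` and `read_float_ply` and the sample values are made up for this example.

```python
import io
import struct


def build_tiny_ply(rows, names):
    """Serialize rows of floats as a binary little-endian PLY held in memory."""
    header = ["ply", "format binary_little_endian 1.0", f"element vertex {len(rows)}"]
    header += [f"property float {name}" for name in names]
    header.append("end_header")
    body = b"".join(struct.pack(f"<{len(names)}f", *row) for row in rows)
    return ("\n".join(header) + "\n").encode("ascii") + body


def read_float_ply(buffer):
    """Parse the header, then unpack one float32 record per vertex."""
    stream = io.BytesIO(buffer)
    count, names = 0, []
    while True:
        raw = stream.readline()
        if not raw:
            raise ValueError("Header ended without 'end_header'")
        line = raw.decode("ascii").strip()
        if line == "end_header":
            break
        if line.startswith("element vertex "):
            count = int(line.split()[-1])
        elif line.startswith("property float "):
            # Declaration order in the header is the column order in each record.
            names.append(line.split()[-1])
    record = struct.Struct(f"<{len(names)}f")
    rows = [record.unpack(stream.read(record.size)) for _ in range(count)]
    return names, rows


names = ["x", "y", "z", "opacity"]
rows = [(0.0, 1.0, 2.0, 0.5), (3.0, 4.0, 5.0, -0.5)]
parsed_names, parsed_rows = read_float_ply(build_tiny_ply(rows, names))
print(parsed_names)
print(parsed_rows)
```

Because each record is read purely by position, a file that mixes in non-float properties would need a parser that tracks each property's type and byte width.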



### 2. Multi-Object 3D Reconstruction from a Single Image

Reconstruct multiple objects from a single image by asking the agent to detect and segment each object before reconstructing it.

In [None]:
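IMAGE_URL_FURN = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/furniture-colorful.jpg"
# Import under an alias so PIL's `Image`, which the helper functions rely on, is not shadowed
from IPython.display import Image as IPyImage
IPyImage(url=IMAGE_URL_FURN, width=500, height=300)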


In [None]:
result_multi = chat_completion(
    prompt="Generate a 3D reconstruction of the two chairs in the image by first detecting them, segmenting them, and then reconstructing them",
    images=[IMAGE_URL_FURN],
    response_model=Recon3DResponse
)
print(">> RESPONSE")
print(result_multi)


In [None]:

filename = download_ply(result_multi.recon_path)
positions, colors, opacities, scales = load_gaussian_splat_ply(filename)
fig = render_gaussian_splat(positions, colors, opacities, scales)
fig.show()
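
The renderer subsamples large splats before plotting. As a rough sketch of that weighting on synthetic data (the random positions and opacities are invented here and are not reconstruction output), each point's sampling probability is its opacity multiplied by a Gaussian falloff of its distance from the centroid:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Synthetic stand-in for splat data: 1,000 points with opacities in [0.05, 1.0).
positions = rng.normal(size=(1000, 3))
opacities = rng.uniform(0.05, 1.0, size=1000)

# Distance of every point from the centroid.
dist = np.linalg.norm(positions - positions.mean(axis=0), axis=1)

# Gaussian falloff with sigma set to the 25th-percentile distance,
# so points far from the center receive very little weight.
sigma = max(np.percentile(dist, 25), 1e-8)
distance_weight = np.exp(-(dist ** 2) / (2 * sigma ** 2))

# Combine with opacity and normalize into a probability distribution.
probs = distance_weight * opacities
probs /= probs.sum()

# Draw 200 distinct indices, favoring opaque points near the center.
idx = rng.choice(len(positions), size=200, replace=False, p=probs)

print("mean distance, all points:", dist.mean())
print("mean distance, sampled:   ", dist[idx].mean())
```

The sampled points sit closer to the centroid on average, which keeps the plot focused on the reconstructed object and thins out distant, faint Gaussians.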



### 3. 3D Reconstruction from Video

For better results, provide multiple views of the same scene from different viewpoints. Here we pass a video and ask the agent to sample frames from it, which gives the model more information about depth and geometry than a single image.

In [None]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/web/videos/tunnel.mp4"
from IPython.display import Video
Video(VIDEO_URL, width=500, height=300)

In [None]:

result_scene = chat_completion(
    prompt="Generate a 3D reconstruction of the scene by sampling some frames from the video",
    video=VIDEO_URL,
    response_model=Recon3DResponse
)
print(">> RESPONSE")
print(result_scene)


In [None]:
filename = download_ply(result_scene.recon_path)
positions, colors, opacities, scales = load_gaussian_splat_ply(filename)
fig = render_gaussian_splat(positions, colors, opacities, scales)
fig.show()
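
The loader applies a sigmoid to the stored opacity and an exponential to the stored scales. Below is a small sketch of those two transforms, assuming (as the loader's comments do) that opacities are stored as logits and scales in log space. The numeric inputs are made up for illustration. The sigmoid is written in a form that avoids exponentiating large positive numbers, which the plain `1 / (1 + exp(-x))` expression does for very negative inputs (a runtime warning in NumPy, an `OverflowError` with `math.exp`).

```python
import math


def stable_sigmoid(x: float) -> float:
    """Sigmoid that only ever exponentiates non-positive numbers."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so this cannot overflow
    return z / (1.0 + z)


# Hypothetical raw values as they might appear in a splat file.
raw_opacities = [-1000.0, 0.0, 4.0]
log_scales = [math.log(0.01), 0.0]

# Logit -> opacity in [0, 1]; log-scale -> positive scale.
opacities = [stable_sigmoid(v) for v in raw_opacities]
scales = [math.exp(v) for v in log_scales]

print(opacities)
print(scales)
```

Storing parameters in logit and log space lets an optimizer update them freely while the decoded opacity stays within [0, 1] and the decoded scale stays positive.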

## Conclusion
This cookbook demonstrated the comprehensive 3D reconstruction capabilities of the VLM Run Orion Agent API.

### Key Takeaways
*   **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
*   **Structured Outputs**: Use Pydantic models with the `response_model` parameter to get type-safe, validated responses with automatic parsing.
*   **3D Reconstruction from Single Images**: Generate 3D models by inferring depth and geometry from a single input image.
*   **Multi-Object 3D Reconstruction**: Reconstruct multiple objects within a single image by first detecting and segmenting them.
*   **3D Reconstruction from Video**: Utilize videos to automatically extract frames and reconstruct a scene for more robust 3D models.
*   **Gaussian Splatting Visualization**: Display the reconstructed Gaussians as an interactive Plotly 3D scatter plot, with markers colored and sized from the splat parameters.

### Next Steps
*   Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
*   Check out the [Agent API docs](https://docs.vlm.run/agents/introduction) for advanced features
*   Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
*   Check out more examples in the [VLM Run Cookbook](https://docs.vlm.run/blog)

Happy building!