<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a>
</p>
<p align="center">
<a href="https://discord.gg/AMApC2UzVY"><img alt="Discord" src="https://img.shields.io/badge/discord-chat-purple?color=%235765F2&label=discord&logo=discord"></a>
<a href="https://twitter.com/vlmrun"><img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/vlmrun.svg?style=social&logo=twitter"></a>
</p>
</div>

Welcome to **[VLM Run Cookbooks](https://github.com/vlm-run/vlmrun-cookbook)**, a comprehensive collection of examples and notebooks demonstrating the power of structured visual understanding using the [VLM Run Platform](https://app.vlm.run). 

## Case Study: Document Markdown Extraction

The `document.markdown` domain is designed to convert document pages into structured markdown content with rich formatting, tables, and figures. This cookbook provides examples of how to use the domain through the VLM client.

### Response Model

The `document.markdown` domain returns a structured `MarkdownDocument` object that exposes the document's pages, tables, and figures, along with helpers for page lookup and rendering.

```python
from __future__ import annotations
from typing import List

# Main response object (interface sketch: method bodies are elided with `...`)
class MarkdownDocument:
    pages: List[MarkdownPage]

    # Properties
    @property
    def content(self) -> str: ...  # Combined content from all pages
    @property
    def markdown_content(self) -> str: ...  # Combined rendered markdown
    @property
    def tables(self) -> List[MarkdownTable]: ...  # All tables across pages
    @property
    def figures(self) -> List[MarkdownFigure]: ...  # All figures across pages

    # Methods
    def get_page(self, page_number: int) -> MarkdownPage: ...
    def render(self) -> str: ...  # Full document rendering
```

### MarkdownPage Structure
Each page in the document contains structured content with optional tables and figures:

```python
from __future__ import annotations
from typing import List, Optional

# Interface sketch: method bodies are elided with `...`
class MarkdownPage:
    content: str  # Raw markdown content with placeholders
    markdown_content: str  # Rendered markdown content
    tables: Optional[List[MarkdownTable]] = None
    figures: Optional[List[MarkdownFigure]] = None
    page_number: Optional[int] = None  # 0-indexed page number

    # Methods
    def render(self) -> str: ...  # Page-specific rendering
```
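
Because `tables` and `figures` may be `None` on any page, code that aggregates across pages needs a guard. The sketch below illustrates one way to flatten per-page tables and to look up a page by its 0-indexed `page_number`. `PageStub`, `all_tables`, and `find_page` are hypothetical stand-ins written for this example; they are not part of the SDK and do not claim to mirror how `MarkdownDocument.tables` or `get_page` are implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageStub:
    """Hypothetical stand-in for MarkdownPage with only the fields used here."""
    content: str
    tables: Optional[List[str]] = None
    figures: Optional[List[str]] = None
    page_number: Optional[int] = None


def all_tables(pages: List[PageStub]) -> List[str]:
    """Flatten per-page tables, treating None as an empty list."""
    return [table for page in pages for table in (page.tables or [])]


def find_page(pages: List[PageStub], page_number: int) -> Optional[PageStub]:
    """Return the page whose 0-indexed page_number matches, or None."""
    for page in pages:
        if page.page_number == page_number:
            return page
    return None


pages = [
    PageStub(content="intro", page_number=0),
    PageStub(content="data", tables=["t0", "t1"], page_number=1),
]
print(all_tables(pages))            # ['t0', 't1']
print(find_page(pages, 1).content)  # data
```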

### Key Features
- **Structured Document**: Complete `MarkdownDocument` object with pages, tables, and figures
- **Page Numbers**: Optional 0-indexed page numbering for multi-page documents  
- **Rich Tables**: Full pandas DataFrame conversion, rendering, and data manipulation
- **Figures**: Image descriptions and captions (can be None)

### Table Structure
Tables are extracted as structured `MarkdownTable` objects with rich functionality:

```python
from __future__ import annotations
from typing import Any, Callable, List, Optional

# Interface sketch: method bodies are elided with `...`
class MarkdownTable:
    metadata: TableMetadata
    headers: List[TableHeader]
    data: List[List[TableCell]]
    bbox: Optional[BoxCoords] = None

    # Methods
    def to_dataframe(self) -> 'pandas.DataFrame': ...
    def render(self) -> str: ...  # HTML table rendering
    def validate(self) -> bool: ...  # Data validation
    def get_column_data(self, column_name: str) -> List[Any]: ...
    def filter_rows(self, condition: Callable) -> 'MarkdownTable': ...

class TableMetadata:
    title: Optional[str] = None
    caption: Optional[str] = None  
    notes: Optional[str] = None

class TableHeader:
    id: str
    column: int
    name: str
    dtype: str  # 'string', 'number', 'date', etc.

class TableCell:
    content: Any
    bbox: Optional[BoxCoords] = None
    column_index: int
    row_index: int
    colspan: int = 1
    rowspan: int = 1
```
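
The sample output in the "Working with Tables" section shows column keys such as `year_0` and `partially_funded_1`, which look like a lowercased header name joined to its column index. The sketch below reproduces that naming from plain `(name, column)` pairs. The scheme is inferred from the sample output only, and `column_key` and `rows_to_records` are hypothetical helpers, not SDK functions.

```python
import re
from typing import Any, Dict, List, Tuple


def column_key(name: str, column: int) -> str:
    """Build a key such as 'partially_funded_1' from a header name and index."""
    slug = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    return f"{slug}_{column}"


def rows_to_records(
    headers: List[Tuple[str, int]], rows: List[List[Any]]
) -> List[Dict[str, Any]]:
    """Pair each row's values with the generated column keys."""
    keys = [column_key(name, column) for name, column in headers]
    return [dict(zip(keys, row)) for row in rows]


headers = [("Year", 0), ("Partially Funded", 1), ("Wholly Unfunded", 2), ("Total", 3)]
rows = [
    ["2023", "$ 12.6", "$ —", "$ 12.6"],
    ["2022", "$ (145.0)", "$ —", "$ (145.0)"],
]
records = rows_to_records(headers, rows)
print(records[0]["partially_funded_1"])  # $ 12.6
```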

### Figure Structure
Figures and images are represented as `MarkdownFigure` objects with rich metadata:

```python
from __future__ import annotations
from typing import Optional

# Interface sketch: method bodies are elided with `...`
class MarkdownFigure:
    id: int  # Reference ID (e.g., 0)
    title: Optional[str] = None
    caption: Optional[str] = None
    content: Optional[str] = None  # Figure description
    bbox: Optional[BoxCoords] = None

    # Methods
    def render(self) -> str: ...  # HTML figure rendering
```

## Example Files

The following example files are used throughout this cookbook:

1. **Presentation Deck** ([`fine-tuning-deck.pdf`](https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/fine-tuning-deck.pdf))
   - Multi-page presentation with mixed content
   - Contains text, tables, charts, and diagrams
   - Ideal for testing multi-page processing

2. **Single Table Document** ([`earnings_single_table.pdf`](https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/earnings_single_table.pdf))
   - Financial table with structured data
   - Perfect for testing table extraction

3. **Hardware Specification Sheet** ([`AD74413R.pdf`](https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/AD74413R.pdf))
   - Technical document with specifications
   - Contains structured data and technical details

4. **Business Report** ([`bcgxaltagamma-true-luxury-global-consumer-insight-2021.pdf`](https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/bcgxaltagamma-true-luxury-global-consumer-insight-2021.pdf))
   - Comprehensive report with mixed content
   - Contains text, tables, and visualizations

### Environment Setup

To get started, install the VLM Run Python SDK and sign up for an API key on the [VLM Run App](https://app.vlm.run).
- Store the VLM Run API key under the `VLMRUN_API_KEY` environment variable.

## Prerequisites

* Python 3.9+
* VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))

## Setup

First, let's install the required packages:

In [1]:
! pip install vlmrun --upgrade --quiet
! pip install vlmrun-hub --upgrade --quiet

In [2]:
import os
import getpass

VLMRUN_BASE_URL = os.getenv("VLMRUN_BASE_URL", "https://api.vlm.run/v1")
VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass()


Next, let's initialize the VLM Run client:

In [3]:
from vlmrun.client import VLMRun

client = VLMRun(base_url=VLMRUN_BASE_URL, api_key=VLMRUN_API_KEY)

Let's start with a simple example - processing a presentation deck and viewing its contents:

In [4]:
from IPython.display import display, HTML
import requests
import base64

# Example presentation deck
PRESENTATION_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/fine-tuning-deck.pdf"

# First, let's display the PDF to see what we're working with
def display_pdf(url: str, width: int = 800, height: int = 600):
    """Display a PDF from a URL in a notebook."""
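    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors instead of embedding an error page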
    pdf_data = base64.b64encode(response.content).decode('utf-8')
    pdf_html = f"""
    <div style="margin: 20px 0;">
        <iframe
            src="data:application/pdf;base64,{pdf_data}"
            width="{width}"
            height="{height}"
            style="border: 1px solid #ddd; border-radius: 4px;"
        ></iframe>
    </div>
    """
    display(HTML(pdf_html))

# Display the PDF
display_pdf(PRESENTATION_URL)

In [7]:
# Now, let's process the document
prediction = client.document.generate(
    domain="document.markdown",
    url=PRESENTATION_URL,
    batch=True
)

# Wait for completion
prediction = client.predictions.wait(prediction.id)

if prediction.status == "completed":
    result = prediction.response
    print(f"Successfully processed document with {len(result.pages)} pages")
else:
    print(f"Processing failed: {prediction.status}")

Successfully processed document with 7 pages


## Understanding the Response

The `document.markdown` domain returns a structured response containing the document's content as markdown. Let's examine what we got:

In [8]:
# Access the MarkdownDocument response
document = prediction.response

# Process each page
for i, page in enumerate(document.pages):
    print(f"\nPage {i+1}:")
    print("=" * 80)
    
    # Display rendered content  
    print("\nContent:")
    print(page.markdown_content)
    
    # Show what we found on this page
    stats = {
        "Tables": len(page.tables) if page.tables else 0,
        "Figures": len(page.figures) if page.figures else 0
    }
    print("\nPage Statistics:")
    for stat, value in stats.items():
        print(f"- {stat}: {value}")


Page 1:

Content:
<Figure id="fg-0"/>

OpenAI logo

# Fine-tuning
Technique

---

February 2024

Page Statistics:
- Tables: 0
- Figures: 2

Page 2:

Content:
# Overview

Fine-tuning involves adjusting the parameters of pre-trained models on a specific dataset or task. This process enhances the model's ability to generate more accurate and relevant responses for the given context by adapting it to the nuances and specific requirements of the task at hand.

## Example use cases

- Generate output in a consistent format
- Process input by following specific instructions

## What we'll cover

- When to fine-tune
- Preparing the dataset
- Best practices
- Hyperparameters
- Fine-tuning advances
- Resources

3

Page Statistics:
- Tables: 0
- Figures: 0

Page 3:

Content:
# What is Fine-tuning

<Figure id="fg-0"/>

Diagram illustrating the fine-tuning process: A Public Model and Training data are inputs to Training, which results in a Fine-tuned model.

Fine-tuning a model consists of trainin

## Working with Tables

The `document.markdown` domain extracts tables from documents as structured objects. Let's look at a document that contains a table:

In [9]:
TABLE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/earnings_single_table.pdf"

# Display the PDF
display_pdf(TABLE_URL)

In [10]:
prediction = client.document.generate(
    domain="document.markdown",
    url=TABLE_URL,
    batch=True
)

# Wait for completion
prediction = client.predictions.wait(prediction.id)

if prediction.status == "completed":
    document = prediction.response
    
    # Process each page
    for page in document.pages:
        if page.tables:
            for i, table in enumerate(page.tables):
                print(f"\nTable {i+1}:")
                print("=" * 80)
                
                # Display table metadata
                if table.metadata.title:
                    print(f"Title: {table.metadata.title}")
                if table.metadata.caption:
                    print(f"Caption: {table.metadata.caption}")
                
                # Display table headers  
                if table.headers:
                    print("\nHeaders:")
                    for header in table.headers:
                        print(f"- {header.name} ({header.dtype})")
                
                # Convert to DataFrame for easy viewing
                if hasattr(table, 'to_dataframe'):
                    df = table.to_dataframe()
                    print("\nTable Data:")
                    print(df)
                
                # Or access raw data
                print("\nRaw Data:")
                for row in table.data:
                    print(row)


Table 1:

Headers:
- Year (int)
- Partially Funded (currency)
- Wholly Unfunded (currency)
- Total (currency)

Table Data:
  year_0 partially_funded_1 wholly_unfunded_2    total_3
0   2023             $ 12.6               $ —     $ 12.6
1   2022          $ (145.0)               $ —  $ (145.0)

Raw Data:
{'year_0': '2023', 'partially_funded_1': '$ 12.6', 'wholly_unfunded_2': '$ —', 'total_3': '$ 12.6'}
{'year_0': '2022', 'partially_funded_1': '$ (145.0)', 'wholly_unfunded_2': '$ —', 'total_3': '$ (145.0)'}

Table 2:

Headers:
- Assumption (str)
- Partially Funded (percentage)
- Wholly Unfunded (percentage)
- Total (percentage)

Table Data:
                                        assumption_0 partially_funded_1  \
0                  December 31, 2023 > Discount rate              4.15%   
1  December 31, 2023 > Expected rate of compensat...              1.67%   
2                  December 31, 2022 > Discount rate              4.45%   
3  December 31, 2022 > Expected rate of compensat...
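
The currency cells in the sample output above are strings such as `$ 12.6`, `$ (145.0)`, and `$ —`, so numeric work needs a conversion step. A minimal sketch follows, assuming that parentheses mark negative values (the usual accounting convention) and that a lone dash means no value. `parse_amount` is a hypothetical helper, not part of the SDK.

```python
from typing import Optional


def parse_amount(text: str) -> Optional[float]:
    """Convert strings like '$ 12.6' or '$ (145.0)' to floats; dashes become None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    if cleaned in {"", "—", "–", "-"}:
        return None  # Assumption: a lone dash means "no value"
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]  # Assumption: parentheses mark a negative amount
    value = float(cleaned)
    return -value if negative else value


for cell in ["$ 12.6", "$ (145.0)", "$ —"]:
    print(repr(cell), "->", parse_amount(cell))
```

Applied to a DataFrame column, the same helper could be passed to `Series.map`.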

## Working with Figures

The `document.markdown` domain can also extract and describe figures from documents. Let's look at a document with figures:

In [11]:
REPORT_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.markdown/bcgxaltagamma-true-luxury-global-consumer-insight-2021.pdf"

prediction = client.document.generate(
    domain="document.markdown",
    url=REPORT_URL,
    batch=True
)

# Wait for completion
prediction = client.predictions.wait(prediction.id, 300)

if prediction.status == "completed":
    document = prediction.response
    
    # Process each page
    for page_num, page in enumerate(document.pages, 1):
        if page.figures:
            print(f"\nPage {page_num} Figures:")
            print("=" * 80)
            
            for i, figure in enumerate(page.figures):
                print(f"\nFigure {i+1}:")
                if figure.title:
                    print(f"Title: {figure.title}")
                if figure.caption:
                    print(f"Caption: {figure.caption}")
                print(f"Description: {figure.content}")
                
                # Use the render method if available
                if hasattr(figure, 'render'):
                    rendered = figure.render()
                    print(f"Rendered: {rendered}")


Page 1 Figures:

Figure 1:
Description: A woman's profile with eyes closed, illuminated by sunlight casting shadows.
Rendered: <Figure id="fg-0"/>

A woman's profile with eyes closed, illuminated by sunlight casting shadows.

Page 2 Figures:

Figure 1:
Description: A black and white image of a person standing on a road looking towards the horizon with a green glow behind them.
Rendered: <Figure id="fg-0"/>

A black and white image of a person standing on a road looking towards the horizon with a green glow behind them.

Figure 2:
Description: The logos for BCG and ALTAGAMMA CREATIVITÀ E CULTURA ITALIANA.
Rendered: <Figure id="fg-1"/>

The logos for BCG and ALTAGAMMA CREATIVITÀ E CULTURA ITALIANA.

Page 3 Figures:

Figure 1:
Description: ~19M number inside a circle graphic
Rendered: <Figure id="fg-0"/>

~19M number inside a circle graphic

Figure 2:
Description: Image representing respondents (people)
Rendered: <Figure id="fg-1"/>

Image representing respondents (people)

Figure 3:
Des

## Rendering Markdown Content

The response provides both raw (`content`) and rendered (`markdown_content`) markdown. Here's how to preview the rendered markdown:

In [12]:
from IPython.display import Markdown

def render_markdown(markdown_content: str) -> None:
    """Render markdown content."""
    display(Markdown(markdown_content))

# Process a document
prediction = client.document.generate(
    domain="document.markdown",
    url=PRESENTATION_URL
)

# Wait for completion
prediction = client.predictions.wait(prediction.id)

In [13]:
if prediction.status == "completed":
    document = prediction.response
    
    # Render each page using the document object
    for page_num, page in enumerate(document.pages, 1):
        print(f"\nPage {page_num}:")
        print("=" * 80)
        render_markdown(page.markdown_content)
        
    # Alternative: Use document-level rendering if available
    if hasattr(document, 'render'):
        print("\nFull Document:")
        print("=" * 80)
        render_markdown(document.render())


Page 1:


<Figure id="fg-0"/>

OpenAI logo

# Fine-tuning
Technique

---

February 2024


Page 2:


# Overview

Fine-tuning involves adjusting the parameters of pre-trained models on a specific dataset or task. This process enhances the model's ability to generate more accurate and relevant responses for the given context by adapting it to the nuances and specific requirements of the task at hand.

## Example use cases

- Generate output in a consistent format
- Process input by following specific instructions

## What we'll cover

- When to fine-tune
- Preparing the dataset
- Best practices
- Hyperparameters
- Fine-tuning advances
- Resources

<Figure id="fg-0"/>



3


Page 3:


# What is Fine-tuning

<Figure id="fg-0"/>

Diagram illustrating the fine-tuning process. It shows a Public Model and Training data as inputs to a Training step, which outputs a Fine-tuned model.

Fine-tuning a model consists of training the
model to follow a set of given input/output
examples.

This will teach the model to behave in a
certain way when confronted with a similar
input in the future.

We recommend using 50-100 examples
even if the minimum is 10.




Page 4:


# When to fine-tune

## Good for <Figure id="fg-0"/>

Green checkmark icon indicating something is good for fine-tuning.

*   Following a given format or tone for the output
*   Processing the input following specific, complex instructions
*   Improving latency
*   Reducing token usage

## Not good for <Figure id="fg-1"/>

Red cross icon indicating something is not good for fine-tuning.

*   Teaching the model new knowledge
    *   Use RAG or custom models instead
*   Performing well at multiple, unrelated tasks
    *   Do prompt-engineering or create multiple FT models instead
*   Include up-to-date content in responses
    *   Use RAG instead

5


Page 5:


# Preparing the dataset

## Example format

```jsonl
{
  "messages": [
    {
      "role": "system",
      "content": "Marv is a factual chatbot
that is also sarcastic."
    },
    {
      "role": "user",
      "content": "What's the capital of
France?"
    },
    {
      "role": "assistant",
      "content": "Paris, as if everyone
doesn't know that already."
    }
  ]
}
```

-> Take the set of instructions and prompts that you
found worked best for the model prior to fine-tuning.
Include them in every training example

-> If you would like to shorten the instructions or
prompts, it may take more training examples to arrive
at good results

We recommend using 50-100 examples
even if the minimum is 10.

<Figure id="fg-0"/>

OpenAI logo with text 'OpenAI 6'


Page 6:


# Best practices

<Figure id="fg-0"/>

Title graphic for 'Best practices'.

## Curate examples carefully

Datasets can be difficult to build, start small and invest intentionally. Optimize for fewer high-quality training examples.

*   Consider "prompt baking", or using a basic prompt to generate your initial examples
*   If your conversations are multi-turn, ensure your examples are representative
*   Collect examples to target issues detected in evaluation
*   Consider the balance & diversity of data
*   Make sure your examples contain all the information needed in the response

## Automate your feedback pipeline

Introduce automated evaluations to highlight potential problem cases to clean up and use as training data.

Consider the G-Eval approach of using GPT-4 to perform automated testing using a scorecard.

## Iterate on hyperparameters

Start with the defaults and adjust based on performance.

*   If the model does not appear to converge, increase the learning rate multiplier
*   If the model does not follow the training data as much as expected increase the number of epochs
*   If the model becomes less diverse than expected decrease the # of epochs by 1-2

## Establish a baseline

Often users start with a zero-shot or few-shot prompt to build a baseline evaluation before graduating to fine-tuning.

## Optimize for latency and token efficiency

When using GPT-4, once you have a baseline evaluation and training examples consider fine-tuning 3.5 to get similar performance for less cost and latency.

Experiment with reducing or removing system instructions with subsequent fine-tuned model versions.


Page 7:


# Hyperparameters

## Epochs

Refers to 1 full cycle through the training dataset

If you have hundreds of thousands of examples, we would recommend experimenting with two epochs (or one) to avoid overfitting.

default: auto (standard is 4)

## Batch size

Number of training examples used to train a single forward & backward pass
In general, we've found that larger batch sizes tend to work better for larger datasets

default: ~0.2% x N* (max 256)

*N = number of training examples

## Learning rate multiplier

Scaling factor for the original learning rate

We recommend experimenting with values between 0.02-0.2. We've found that larger learning rates often perform better with larger batch sizes.

default: 0.05, 0.1 or 0.2*

*depends on final batch size

8

## Best Practices & Troubleshooting

### Response Handling
- Use `markdown_content` for the fully rendered version with tables and figures in place
- The `content` field contains raw markdown with `<Table>` and `<Figure>` placeholders that reference the structured data
- For complex documents, use `detail="hi"` in the config to improve extraction accuracy
- Always check `prediction.status` before accessing results
- Handle missing tables/figures with `if page.tables:` and `if page.figures:` checks, since both fields can be `None`
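
The placeholders mentioned above can be located with a regular expression. This is a minimal sketch that assumes the `<Figure id="fg-0"/>` form seen in the sample output. The matching `<Table .../>` form is an assumption, and `find_placeholders` is a hypothetical helper rather than an SDK function.

```python
import re
from typing import List, Tuple

# Matches self-closing placeholders such as <Figure id="fg-0"/>.
# Assumption: table placeholders follow the same shape with a "Table" tag.
PLACEHOLDER = re.compile(r'<(Table|Figure)\s+id="([^"]+)"\s*/>')


def find_placeholders(content: str) -> List[Tuple[str, str]]:
    """Return (kind, id) pairs in order of appearance."""
    return PLACEHOLDER.findall(content)


sample = '# What is Fine-tuning\n\n<Figure id="fg-0"/>\n\nDiagram illustrating the fine-tuning process.'
print(find_placeholders(sample))  # [('Figure', 'fg-0')]
```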

### Error Handling
```python
try:
    prediction = client.document.generate(
        domain="document.markdown",
        url=PRESENTATION_URL
    )
    
    prediction = client.predictions.wait(prediction.id)
    
    if prediction.status == "completed":
        result = prediction.response
        # Process and display results
    else:
        print(f"Error: Processing failed with status {prediction.status}")
        
except Exception as e:
    print(f"Error: {str(e)}")
```
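
The example above does not bound how long it waits. Earlier in this notebook, `client.predictions.wait` is called with `300` as a second argument, which appears to be a timeout. For cases where custom polling is needed, the sketch below shows a generic deadline loop. It is independent of the SDK: `wait_for_status` is a hypothetical helper, and every status name other than `"completed"` is an assumption.

```python
import time
from typing import Callable, Sequence


def wait_for_status(
    get_status: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    terminal: Sequence[str] = ("completed", "failed"),  # "failed" is assumed
) -> str:
    """Poll get_status until it returns a terminal status or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout} seconds")
        time.sleep(interval)


# Simulated status source: two in-progress polls, then completion.
statuses = iter(["pending", "running", "completed"])
print(wait_for_status(lambda: next(statuses), timeout=5.0, interval=0.01))  # completed
```

`time.monotonic()` is used rather than `time.time()` so that system clock adjustments do not shorten or extend the deadline.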

### Performance Optimization
- Use batch mode for large documents
- Implement proper timeouts
- Cache results when processing the same document multiple times
- Use appropriate detail levels based on document complexity
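
The caching suggestion above can be sketched as a small in-memory cache keyed by domain and URL. `PredictionCache` and `fake_fetch` are hypothetical names for illustration. In practice the `fetch` callable would wrap the `client.document.generate` and `client.predictions.wait` calls shown earlier, and a persistent store may be preferable for long-running jobs.

```python
from typing import Any, Callable, Dict, List, Tuple


class PredictionCache:
    """In-memory cache keyed by (domain, url)."""

    def __init__(self, fetch: Callable[[str, str], Any]) -> None:
        self._fetch = fetch
        self._store: Dict[Tuple[str, str], Any] = {}

    def get(self, domain: str, url: str) -> Any:
        """Return the cached result, calling fetch only on the first request."""
        key = (domain, url)
        if key not in self._store:
            self._store[key] = self._fetch(domain, url)
        return self._store[key]


calls: List[str] = []


def fake_fetch(domain: str, url: str) -> str:
    """Stand-in for a real API call; records each invocation."""
    calls.append(url)
    return f"result for {url}"


cache = PredictionCache(fake_fetch)
cache.get("document.markdown", "https://example.com/a.pdf")
cache.get("document.markdown", "https://example.com/a.pdf")
print(len(calls))  # 1
```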