# Visual PDF RAG with Vespa - ColPali demo application

We created an end-to-end demo application for visual retrieval of PDF pages using Vespa, including a frontend web application. To see the live demo, visit https://vespa-engine-colpali-vespa-visual-retrieval.hf.space/.

The main goal of the demo is to make it easy for _you_ to create your own PDF Enterprise Search application using Vespa.
To deploy a full demo, you need two main components:

1. A Vespa application that lets you index and search PDF pages using ColPali embeddings.
2. A live web application that lets you interact with the Vespa application.

After running this notebook, you will have set up a Vespa application, and indexed some PDF pages.
You can then test that you are able to query the Vespa application, and you will be ready to deploy the web application including the frontend.

Some of the features we want to highlight in this demo are:

- Visual retrieval of PDF pages using ColPali embeddings
- Explainability by displaying similarity maps over the patches in the PDF pages for each query token.
- Extracting queries and questions from the PDF pages using `gemini-1.5-8b` model.
- Type-ahead search suggestions based on the extracted queries and questions.
- Comparison of different retrieval and ranking strategies (BM25, ColPali MaxSim, and a combination of both).
- AI-generated responses to the query based on the top ranked PDF pages. Also using the `gemini-1.5-8b` model.

We also wanted to give a notion of which latency one can expect using Vespa for this use case.
Event though your users might not state this explicitly, we consider it important to provide a snappy user experience.

In this notebook, we will prepare the Vespa backend application for our visual retrieval demo.
We will use ColPali as the model to extract patch vectors from images of pdf pages.
At query time, we use MaxSim to retrieve and/or (based on the configuration) rank the page results.

The steps we will take in this notebook are:

1. Setup and configuration
2. Download PDFs
3. Convert PDFs to images
4. Generate queries and questions
5. Generate ColPali embeddings
6. Prepare the Vespa application package
7. Deploy the Vespa application to Vespa Cloud
8. Feed the data to the Vespa application
9. Test a query to the Vespa application

All the steps that are needed to provision the Vespa application, including feeding the data, can be done by running this notebook.
We have tried to make it easy for others to run this notebook, to create your own PDF Enterprise Search application using Vespa.

## 1. Setup and Configuration


In [1]:
!python --version

Python 3.12.9


Install dependencies:

Note that the python pdf2image package requires poppler-utils, see other installation options [here](https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler).

In [2]:
!pip install poppler-utils



Now install the required python packages:

In [3]:
!pip3 install colpali-engine==0.3.1 vidore_benchmark==4.0.0 pdf2image pypdf==5.0.1 pyvespa>=0.50.0 vespacli numpy pillow==10.4.0 google-generativeai==0.8.3

In [4]:
import os
import json
from typing import Tuple
import hashlib
import numpy as np

# Vespa
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    Function,
    FieldSet,
    SecondPhaseRanking,
    Summary,
    DocumentSummary,
)
from vespa.deployment import VespaCloud
from vespa.application import Vespa
from vespa.io import VespaResponse

# Google Generative AI for Google Gemini interaction
import google.generativeai as genai

# Torch and other ML libraries
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from pdf2image import convert_from_path
from pypdf import PdfReader

from pdf2image import convert_from_path
import fitz  # PyMuPDF
from tqdm import tqdm
import os

# ColPali model and processor
from colpali_engine.models import ColPali, ColPaliProcessor
from colpali_engine.utils.torch_utils import get_torch_device
from vidore_benchmark.utils.image_utils import scale_image, get_base64_image

# Load environment variables
from dotenv import load_dotenv

load_dotenv()

# Avoid warning from huggingface tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

  from .autonotebook import tqdm as notebook_tqdm


### Create a free trial in Vespa Cloud

Create a tenant from [here](https://vespa.ai/free-trial/).
The trial includes $300 credit.
Take note of your tenant name, and input it below.


In [5]:
VESPA_TENANT_NAME = "treetrash2025"  # Replace with your tenant name

Here, set your desired application name. (Will be created in later steps)
Note that you can not have hyphen `-` or underscore `_` in the application name.


In [6]:
VESPA_APPLICATION_NAME = "colpalidemodev"
VESPA_SCHEMA_NAME = "pdf_page"

Next, you can to create a token. This is an optional authentication method (the default is mTLS), and will be used for feeding data, and querying the application.
For details, see [Authenticating to Vespa Cloud](https://pyvespa.readthedocs.io/en/latest/authenticating-to-vespa-cloud.html).
For now, we will use a single token with both read and write permissions.
For production, we recommend separate tokens for feeding and querying, (the former with write permission, and the latter with read permission).
The tokens can be created from the [Vespa Cloud console](https://console.vespa-cloud.com/) in the 'Account' -> 'Tokens' section. Please make sure to save the both the token id and it's value somwhere safe - you'll need it when you're going to connect to your app.  

In [7]:
# Replace this with the id of your token
VESPA_TOKEN_ID = "pyvespa_integration"  # This needs to match the token_id that you created in the Vespa Cloud Console

We also need to set the value of the write token to be able to feed data to the Vespa application (value of VESPA_TOKEN_ID_WRITE). Please run the cell below to set the variable.

In [8]:
VESPA_CLOUD_SECRET_TOKEN = os.getenv("VESPA_CLOUD_SECRET_TOKEN") or input(
    "Enter Vespa cloud secret token: "
)

We will use Google's Gemini API to create sample queries for our images.
Create a Gemini API key from [here](https://aistudio.google.com/app/apikey). Once you have the key, please run the cell below.
You can also use other VLM's to create these queries.

In [9]:
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") or input(
    "Enter Google Generative AI API key: "
)
# Configure Google Generative AI
genai.configure(api_key=GOOGLE_API_KEY)

### Loading the ColPali model from huggingface 🤗

In [77]:
MODEL_NAME = "vidore/colpali-v1.2"

# Set device for Torch
device = get_torch_device("auto")
print(f"Using device: {device}")

# Load the ColPali model and processor
model = ColPali.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,
    device_map=device,
).eval()

processor = ColPaliProcessor.from_pretrained(MODEL_NAME)

Using device: cpu


Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.82s/it]


## 2. Download PDFs

We are going to use public reports from the Norwegian Government Pension Fund Global (also known as the Oil Fund).
The fund puts transparency at the forefront and publishes reports on its investments, holdings, and returns, as well as its strategy and governance.

As we can see, a lot of the information is in the form of tables, charts and numbers.
These are not easily extractable using pdf-readers or OCR tools.


In [11]:
import requests

pdfs = [
    {
        "url": "https://sustainable.stanford.edu/sites/g/files/sbiybj26701/files/media/file/2022-23-sustainability-year-review.pdf",
        "path": "pdfs/stanford-2023.pdf",
        "year": "2024",
    },
]

### Downloading the PDFs

We create a function to download the PDFs from the web to the provided directory.

In [12]:
PDFS_DIR = "pdfs"
os.makedirs(PDFS_DIR, exist_ok=True)


def download_pdf(url: str, path: str):
    r = requests.get(url, stream=True)
    with open(path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    return path


# Download the pdfs
for pdf in pdfs:
    download_pdf(pdf["url"], pdf["path"])

## 3. Convert PDFs to Images


In [13]:
def get_pdf_images(pdf_path):
    # Extract text using PyMuPDF
    doc = fitz.open(pdf_path)
    page_texts = []
    for page in doc:
        text = page.get_text()
        page_texts.append(text)

    # Convert to PIL images using pdf2image
    images = convert_from_path(pdf_path)
    assert len(images) == len(page_texts)
    return images, page_texts


pdf_folder = "pdfs"
pdf_pages = []
for pdf in tqdm(pdfs):
    pdf_file = pdf["path"]
    title = os.path.splitext(os.path.basename(pdf_file))[0]
    images, texts = get_pdf_images(pdf_file)
    for page_no, (image, text) in enumerate(zip(images, texts)):
        pdf_pages.append(
            {
                "title": title,
                "year": pdf["year"],
                "url": pdf["url"],
                "path": pdf_file,
                "image": image,
                "text": text,
                "page_no": page_no,
            }
        )

100%|██████████| 1/1 [00:13<00:00, 13.08s/it]


In [14]:
len(pdf_pages)

23

In [15]:
MAX_PAGES = 10  # Set to None to use all pages
pdf_pages = pdf_pages[:MAX_PAGES] if MAX_PAGES else pdf_pages

Let us look at the extracted text from the pages displayed above.

## 4. Generate Queries

In this step, we want to generate queries for each page image.
These will be useful for 2 reasons:

1. We can use these queries as typeahead suggestions in the search bar.
2. We could potentially use the queries to generate an evaluation dataset. See [Improving Retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for a deeper dive into this topic. This will not be within the scope of this notebook though.

The prompt for generating queries is adapted from [this](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html#an-update-retrieval-focused-prompt) wonderful blog post by Daniel van Strien.

We have modified the prompt to also generate keword based queries, in addition to the question based queries.

We will use the Gemini API to generate these queries, with `gemini-1.5-flash-8b` as the model.


In [91]:
from pydantic import BaseModel


class GeneratedQueries(BaseModel):
    broad_topical_question: str
    broad_topical_query: str
    specific_detail_question: str
    specific_detail_query: str
    visual_element_question: str
    visual_element_query: str


def get_retrieval_prompt() -> Tuple[str, GeneratedQueries]:
    prompt = (
        prompt
    ) = """You are an trash analyst. You will be presented an image of a document page from a report published by an institution. The report is an annual report on topics such as sustianability.
Your task is to generate retrieval queries and questions that you would use to retrieve this document (or ask based on this document) in a large corpus.
Please generate 3 different types of retrieval queries and questions.
A retrieval query is a keyword based query, made up of 2-5 words, that you would type into a search engine to find this document.
A question is a natural language question that you would ask, for which the document contains the answer.
The queries should be of the following types:
1. A broad topical query: This should cover the main subject of the document.
2. A specific detail query: This should cover a specific detail or aspect of the document.
3. A visual element query: This should cover a visual element of the document, such as a chart, graph, or image.

Important guidelines:
- Ensure the queries are relevant for retrieval tasks, not just describing the page content.
- Use a fact-based natural language style for the questions.
- Frame the queries as if someone is searching for this document in a large corpus.
- Make the queries diverse and representative of different search strategies.

Format your response as a JSON object with the structure of the following example:
{
    "broad_topical_question": "What was the Sustainability Policy in 2023?",
    "broad_topical_query": "sustainability goals",
    "specific_detail_question": "How many pounds of seafood were purchased?",
    "specific_detail_query": "pounds of seafood purchased",
    "visual_element_question": "What is the percent of vegetarian items on the menu?",
    "visual_element_query": "percentage of vegetarian items"
}

If there are no relevant visual elements, provide an empty string for the visual element question and query.
Here is the document image to analyze:
Generate the queries based on this image and provide the response in the specified JSON format.
Only return JSON. Don't return any extra explanation text. """

    return prompt, GeneratedQueries


prompt_text, pydantic_model = get_retrieval_prompt()

In [92]:
gemini_model = genai.GenerativeModel("gemini-1.5-flash-8b")


def generate_queries(image, prompt_text, pydantic_model):
    try:
        response = gemini_model.generate_content(
            [image, "\n\n", prompt_text],
            generation_config=genai.GenerationConfig(
                response_mime_type="application/json",
                response_schema=pydantic_model,
            ),
        )
        queries = json.loads(response.text)
    except Exception as _e:
        print(_e)
        queries = {
            "broad_topical_question": "",
            "broad_topical_query": "",
            "specific_detail_question": "",
            "specific_detail_query": "",
            "visual_element_question": "",
            "visual_element_query": "",
        }
    return queries

In [93]:
for pdf in tqdm(pdf_pages):
    image = pdf.get("image")
    pdf["queries"] = generate_queries(image, prompt_text, pydantic_model)

100%|██████████| 10/10 [00:35<00:00,  3.54s/it]


Let's take a look at the queries and questions generated for the page displayed above.

In [94]:
pdf_pages[8]["queries"]

{'broad_topical_query': 'stanford sustainability',
 'broad_topical_question': "What are Stanford's sustainability initiatives in 2023?",
 'specific_detail_query': 'energy savings forsythe hall',
 'specific_detail_question': 'What were the estimated energy savings in Forsythe Hall in 2023?',
 'visual_element_query': '',
 'visual_element_question': ''}

## 5. Generate embeddings

Now that we have the queries, we can use the ColPali model to generate embeddings for each page image.


In [20]:
def generate_embeddings(images, model, processor, batch_size=1) -> np.ndarray:
    """
    Generate embeddings for a list of images.
    Move to CPU only once per batch.

    Args:
        images (List[PIL.Image]): List of PIL images.
        model (nn.Module): The model to generate embeddings.
        processor: The processor to preprocess images.
        batch_size (int, optional): Batch size for processing. Defaults to 64.

    Returns:
        np.ndarray: Embeddings for the images, shape
                    (len(images), processor.max_patch_length (1030 for ColPali), model.config.hidden_size (Patch embedding dimension - 128 for ColPali)).
    """

    def collate_fn(batch):
        # Batch is a list of images
        return processor.process_images(batch)  # Should return a dict of tensors

    dataloader = DataLoader(
        images,
        shuffle=False,
        collate_fn=collate_fn,
    )

    embeddings_list = []
    for batch in tqdm(dataloader):
        with torch.no_grad():
            batch = {k: v.to(model.device) for k, v in batch.items()}
            embeddings_batch = model(**batch)
            # Convert tensor to numpy array and append to list
            embeddings_list.extend(
                [t.cpu().numpy() for t in torch.unbind(embeddings_batch)]
            )

    # Stack all embeddings into a single numpy array
    all_embeddings = np.stack(embeddings_list, axis=0)
    return all_embeddings

In [21]:
# Generate embeddings for all images
images = [pdf["image"] for pdf in pdf_pages]
embeddings = generate_embeddings(images, model, processor)

  0%|          | 0/10 [00:00<?, ?it/s]You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.
 10%|█         | 1/10 [01:49<16:25, 109.46s/it]You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.
 20%|██        | 2/10 [03:01<11:40, 87.53s/it] You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<ima

Now, we have one embedding vector of dimension 128 for each patch of each image (1024 patches + some special tokens).

In [22]:
embeddings.shape

(10, 1030, 128)

In [23]:
assert len(pdf_pages) == embeddings.shape[0]
assert embeddings.shape[1] == 1030  # Number of patches (including special tokens)
assert embeddings.shape[2] == 128  # Embedding dimension per patch

## 6. Prepare Data on Vespa Format

Now, that we have all the data we need, all that remains is to make sure it is in the right format for Vespa.

We now convert the embeddings to Vespa JSON format so we can store (and index) them in Vespa.
Details in [Vespa JSON feed format doc](https://docs.vespa.ai/en/reference/document-json-format.html).


We use binary quantization (BQ) of the page level ColPali vector embeddings to reduce their size by 32x.

Read more about binarization of multi-vector representations in the [colbert blog post](https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/).

The binarization step maps 128 dimensional floats to 128 bits, or 16 bytes per vector. Reducing the size by 32x. On the [DocVQA benchmark](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), binarization results in only a small drop in ranking accuracy.

In [24]:
def float_to_binary_embedding(float_query_embedding: dict) -> dict:
    """Utility function to convert float query embeddings to binary query embeddings."""
    binary_query_embeddings = {}
    for k, v in float_query_embedding.items():
        binary_vector = (
            np.packbits(np.where(np.array(v) > 0, 1, 0)).astype(np.int8).tolist()
        )
        binary_query_embeddings[k] = binary_vector
    return binary_query_embeddings

Note that we also store a scaled down (blurred) version of the image in Vespa. The purpose of this is to return this fast on first results to the frontend, to provide a snappy user experience, and then load the full resolution image async in the background.

In [64]:
vespa_feed = []
for pdf, embedding in zip(pdf_pages, embeddings):
    url = pdf["url"]
    year = pdf["year"]
    title = pdf["title"]
    image = pdf["image"]
    text = pdf.get("text", "")
    page_no = pdf["page_no"]
    query_dict = pdf["queries"]
    questions = [v for k, v in query_dict.items() if "question" in k and v]
    queries = [v for k, v in query_dict.items() if "query" in k and v]
    base_64_image = get_base64_image(
        scale_image(image, 32), add_url_prefix=False
    )  # Scaled down image to return fast on search (~1kb)
    base_64_full_image = get_base64_image(image, add_url_prefix=False)
    embedding_dict = {k: v for k, v in enumerate(embedding)}
    binary_embedding = float_to_binary_embedding(embedding_dict)
    # id_hash should be md5 hash of url and page_number
    id_hash = hashlib.md5(f"{url}_{page_no}".encode()).hexdigest()
    page = {
        "id": id_hash,
        "fields": {
            "id": id_hash,
            "url": url,
            "title": title,
            "year": year,
            "page_number": page_no,
            "blur_image": base_64_image,
            "full_image": base_64_full_image,
            "text": text,
            "embedding": binary_embedding,
            "queries": queries,
            "questions": questions,
        },
    }
    vespa_feed.append(page)

### [Optional] Saving the feed file

If you have a large dataset, you can optionally save the file, and feed it using the Vespa CLI, which is more performant than the pyvespa client. See [Feeding to Vespa Cloud](https://pyvespa.readthedocs.io/en/latest/examples/feed_performance_cloud.html) for more details. Uncomment the cell below if you want to save the feed file.

In [65]:
os.makedirs("output", exist_ok=True)
with open("output/vespa_feed.jsonl", "w") as f:
    vespa_feed_to_save = []
    for page in vespa_feed:
        document_id = page["id"]
        put_id = f"id:{VESPA_APPLICATION_NAME}:{VESPA_SCHEMA_NAME}::{document_id}"
        vespa_feed_to_save.append({"put": put_id, "fields": page["fields"]})
    json.dump(vespa_feed_to_save, f)

## 7. Prepare Vespa Application


### Configuring the application package

[PyVespa](https://pyvespa.readthedocs.io/en/latest/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html).
A Vespa application package consists of configuration files, schemas, models, and code (plugins).

Here are some of the key components of this application package:

1. We store images (and a scaled down version of the image) as a `raw` field.
2. We store the binarized ColPali embeddings as a `tensor<int8>` field.
3. We store the queries and questions as a `array<str>` field.
4. We define 3 different ranking profiles:
    - `default` Uses BM25 for first phase ranking and MaxSim for second phase ranking.
    - `bm25` Uses `bm25(title) + bm25(text)` (first phase only) for ranking.
    - `retrieval-and-rerank` Uses `nearestneighbor` of the query embedding over the document embeddings for retrieval, `binary_max_sim` for first phase ranking, and `max_sim` of the query-embeddings as float for second phase ranking.
    Vespa's [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) allows us to use different ranking strategies for retrieval and reranking, to choose attractive trade-offs between latency, cost, and accuracy.
5. We also calculate dot product between the query and each document, so that it can be returned with the results, to generate the similarity maps, which show which patches of the image is most similar to the query token embeddings.

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type.

In [95]:
colpali_schema = Schema(
    name=VESPA_SCHEMA_NAME,
    document=Document(
        fields=[
            Field(
                name="id",
                type="string",
                indexing=["summary", "index"],
                match=["word"],
            ),
            Field(name="url", type="string", indexing=["summary", "index"]),
            Field(name="year", type="int", indexing=["summary", "attribute"]),
            Field(
                name="title",
                type="string",
                indexing=["summary", "index"],
                match=["text"],
                index="enable-bm25",
            ),
            Field(name="page_number", type="int", indexing=["summary", "attribute"]),
            Field(name="blur_image", type="raw", indexing=["summary"]),
            Field(name="full_image", type="raw", indexing=["summary"]),
            Field(
                name="text",
                type="string",
                indexing=["summary", "index"],
                match=["text"],
                index="enable-bm25",
            ),
            Field(
                name="embedding",
                type="tensor<int8>(patch{}, v[16])",
                indexing=[
                    "attribute",
                    "index",
                ],
                ann=HNSW(
                    distance_metric="hamming",
                    max_links_per_node=32,
                    neighbors_to_explore_at_insert=400,
                ),
            ),
            Field(
                name="questions",
                type="array<string>",
                indexing=["summary", "attribute"],
                summary=Summary(fields=["matched-elements-only"]),
            ),
            Field(
                name="queries",
                type="array<string>",
                indexing=["summary", "attribute"],
                summary=Summary(fields=["matched-elements-only"]),
            ),
        ]
    ),
    fieldsets=[
        FieldSet(
            name="default",
            fields=["title", "text"],
        ),
    ],
    document_summaries=[
        DocumentSummary(
            name="default",
            summary_fields=[
                Summary(
                    name="text",
                    fields=[("bolding", "on")],
                ),
                Summary(
                    name="snippet",
                    fields=[("source", "text"), "dynamic"],
                ),
            ],
            from_disk=True,
        ),
        DocumentSummary(
            name="suggestions",
            summary_fields=[
                Summary(name="questions"),
            ],
            from_disk=True,
        ),
    ],
)

# Define similarity functions used in all rank profiles
mapfunctions = [
    Function(
        name="similarities",  # computes similarity scores between each query token and image patch
        expression="""
                sum(
                    query(qt) * unpack_bits(attribute(embedding)), v
                )
            """,
    ),
    Function(
        name="normalized",  # normalizes the similarity scores to [-1, 1]
        expression="""
                (similarities - reduce(similarities, min)) / (reduce((similarities - reduce(similarities, min)), max)) * 2 - 1
            """,
    ),
    Function(
        name="quantized",  # quantizes the normalized similarity scores to signed 8-bit integers [-128, 127]
        expression="""
                cell_cast(normalized * 127.999, int8)
            """,
    ),
]

# Define the 'bm25' rank profile
bm25 = RankProfile(
    name="bm25",
    inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
    first_phase="bm25(title) + bm25(text)",
    functions=mapfunctions,
)


# A function to create an inherited rank profile which also returns quantized similarity scores
def with_quantized_similarity(rank_profile: RankProfile) -> RankProfile:
    return RankProfile(
        name=f"{rank_profile.name}_sim",
        first_phase=rank_profile.first_phase,
        inherits=rank_profile.name,
        summary_features=["quantized"],
    )


colpali_schema.add_rank_profile(bm25)
colpali_schema.add_rank_profile(with_quantized_similarity(bm25))


# Update the 'colpali' rank profile
input_query_tensors = []
MAX_QUERY_TERMS = 64
for i in range(MAX_QUERY_TERMS):
    input_query_tensors.append((f"query(rq{i})", "tensor<int8>(v[16])"))

input_query_tensors.extend(
    [
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
        ("query(qtb)", "tensor<int8>(querytoken{}, v[16])"),
    ]
)

colpali = RankProfile(
    name="colpali",
    inputs=input_query_tensors,
    first_phase="max_sim_binary",
    second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
    functions=mapfunctions
    + [
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(embedding)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_binary",
            expression="""
                sum(
                    reduce(
                        1 / (1 + sum(
                            hamming(query(qtb), attribute(embedding)), v)
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
    ],
)
colpali_schema.add_rank_profile(colpali)
colpali_schema.add_rank_profile(with_quantized_similarity(colpali))

# Update the 'hybrid' rank profile
hybrid = RankProfile(
    name="hybrid",
    inputs=input_query_tensors,
    first_phase="max_sim_binary",
    second_phase=SecondPhaseRanking(
        expression="max_sim + 2 * (bm25(text) + bm25(title))", rerank_count=10
    ),
    functions=mapfunctions
    + [
        Function(
            name="max_sim",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(embedding)), v
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_binary",
            expression="""
                sum(
                    reduce(
                        1 / (1 + sum(
                            hamming(query(qtb), attribute(embedding)), v)
                        ),
                        max, patch
                    ),
                    querytoken
                )
            """,
        ),
    ],
)
colpali_schema.add_rank_profile(hybrid)
colpali_schema.add_rank_profile(with_quantized_similarity(hybrid))

### Configuring the `services.xml`

[services.xml](https://docs.vespa.ai/en/reference/services.html) is the primary configuration file for a Vespa application, with a plethora of options to configure the application.

Since `pyvespa` version `0.50.0`, these configuration options are also available in `pyvespa`. See [Pyvespa - Advanced configuration](https://pyvespa.readthedocs.io/en/latest/advanced-configuration.html) for more details. (Note that configurating this is optional, and pyvespa will use basic defaults for you if you opt out).

We will use the advanced configuration to configure up [dynamic snippets](https://docs.vespa.ai/en/document-summaries.html#dynamic-snippets). This allows us to highlight matched terms in the search results and generate a `snippet` to display, rather than the full text of the document.

In [96]:
from vespa.configuration.services import (
    services,
    container,
    search,
    document_api,
    document_processing,
    clients,
    client,
    config,
    content,
    redundancy,
    documents,
    node,
    certificate,
    token,
    document,
    nodes,
)
from vespa.configuration.vt import vt
from vespa.package import ServicesConfiguration

service_config = ServicesConfiguration(
    application_name=VESPA_APPLICATION_NAME,
    services_config=services(
        container(
            search(),
            document_api(),
            document_processing(),
            clients(
                client(
                    certificate(file="security/clients.pem"),
                    id="mtls",
                    permissions="read,write",
                ),
                client(
                    token(id=f"{VESPA_TOKEN_ID}"),
                    id="token_write",
                    permissions="read,write",
                ),
            ),
            config(
                vt("tag")(
                    vt("bold")(
                        vt("open", "<strong>"),
                        vt("close", "</strong>"),
                    ),
                    vt("separator", "..."),
                ),
                name="container.qr-searchers",
            ),
            id=f"{VESPA_APPLICATION_NAME}_container",
            version="1.0",
        ),
        content(
            redundancy("1"),
            documents(document(type="pdf_page", mode="index")),
            nodes(node(distribution_key="0", hostalias="node1")),
            config(
                vt("max_matches", "2", replace_underscores=False),
                vt("length", "1000"),
                vt("surround_max", "500", replace_underscores=False),
                vt("min_length", "300", replace_underscores=False),
                name="vespa.config.search.summary.juniperrc",
            ),
            id=f"{VESPA_APPLICATION_NAME}_content",
            version="1.0",
        ),
        version="1.0",
    ),
)

In [97]:
# Create the Vespa application package
vespa_application_package = ApplicationPackage(
    name=VESPA_APPLICATION_NAME,
    schema=[colpali_schema],
    services_config=service_config,
)

## 8. Deploy Vespa Application


In [98]:
# This is only needed for CI.
VESPA_TEAM_API_KEY = """-----BEGIN PRIVATE KEY-----
MIGHAgEAMBMGByqGSM49AgEGCCqGSM49AwEHBG0wawIBAQQgt8hzL1s6sxGxN9t+
F6CIto9Y5I3A355Qv4GoDKSNU+mhRANCAAQUt9TEg8zwy5TV11/YdP2htLrk4CGP
bjzuZZS8prmXl4Pf+0idp5asi0uyyV5KnoFFoMWvksFAduMc6vzn9/Nj
-----END PRIVATE KEY-----"""

In [99]:
vespa_cloud = VespaCloud(
    tenant=VESPA_TENANT_NAME,
    application=VESPA_APPLICATION_NAME,
    key_content=VESPA_TEAM_API_KEY,
    application_package=vespa_application_package,
)

# Deploy the application
vespa_cloud.deploy()

# Output the endpoint URL
endpoint_url = vespa_cloud.get_token_endpoint()
print(f"Application deployed. Token endpoint URL: {endpoint_url}")

Setting application...
Running: vespa config set application treetrash2025.colpalidemodev
Setting target cloud...
Running: vespa config set target cloud

Api-key found for control plane access. Using api-key.
Deployment started in run 10 of dev-aws-us-east-1c for treetrash2025.colpalidemodev. This may take a few minutes the first time.
INFO    [23:43:37]  Deploying platform version 8.478.26 and application dev build 9 for dev-aws-us-east-1c of default ...
INFO    [23:43:37]  Using CA signed certificate version 1
INFO    [23:43:37]  Using 1 nodes in container cluster 'colpalidemodev_container'
INFO    [23:43:40]  Session 13 for tenant 'treetrash2025' prepared and activated.
INFO    [23:43:41]  ######## Details for all nodes ########
INFO    [23:43:41]  h113661a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP
INFO    [23:43:41]  --- platform vespa/cloud-tenant-rhel8:8.478.26
INFO    [23:43:41]  --- container on port 4080 has config generation 12, wanted is 13
INFO    [23:43:41]  --

Make sure to take note of the token endpoint_url.
You need to put this in your `.env` file for your web application - `VESPA_APP_TOKEN_URL=https://abcd.vespa-app.cloud` - to access the Vespa application from your web application.


## 9. Feed Data to Vespa

We will need the `enpdoint_url` and `colpalidemo_write` token to feed the data to the Vespa application.

In [100]:
print(VESPA_CLOUD_SECRET_TOKEN)

vespa_cloud_rlKFdG9IlqAZWiC4mKrIugyrqKtsh4Jmdi9K4naJgZV


In [101]:
# Instantiate Vespa connection using token
app = Vespa(url=endpoint_url, vespa_cloud_secret_token=VESPA_CLOUD_SECRET_TOKEN)
app.get_application_status()

<Response [200]>

Now, let us feed the data to Vespa.
If you have a large dataset, you could also do this async, with `feed_async_iterable()`, see [Feeding Vespa cloud](https://pyvespa.readthedocs.io/en/latest/examples/feed_performance_cloud.html) for a detailed comparison.

In [102]:
def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )


# Feed data into Vespa synchronously
app.feed_iterable(vespa_feed, schema=VESPA_SCHEMA_NAME, callback=callback)

## 10. Test a query to the Vespa application

For now, we will just run a query with the default rank profile.
We will need a utility function to generate embeddings for the query, and pass this to Vespa to use for calculating MaxSim.
In the web application, we also provide function to generate binary embeddings, allowing the user to choose different rank profiles at query time.

In [103]:
query = "Tell me about the sustainability goals of the organization"

In [104]:
def get_q_embs_vespa_format(query: str):
    inputs = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        embeddings_query = model(**inputs)
        q_embs = embeddings_query.to("cpu")[0]  # Extract the single embedding
    return {idx: emb.tolist() for idx, emb in enumerate(q_embs)}

In [105]:
q_emb = get_q_embs_vespa_format(query)

You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.


In [106]:
with app.syncio() as sess:
    response = sess.query(
        body={
            "yql": (
                f"select id, url, title, year, full_image, quantized  from {VESPA_SCHEMA_NAME} where userQuery();"
            ),
            # "yql": (
            #     f"select id, url, title, year, quantized  from {VESPA_SCHEMA_NAME} where userQuery();"
            # ),
            "ranking": "default",
            "query": query,
            "timeout": "10s",
            "hits": 3,
            "input.query(qt)": q_emb,
            "presentation.timing": True,
        }
    )

In [107]:
# from vespa.io import VespaQueryResponse
# import json

# response:VespaQueryResponse = app.query(
#     yql="select id,title,page,chunks from pdf where userQuery() or ({targetHits:10}nearestNeighbor(embedding,q))",
#     groupname="jo-bergum", 
#     ranking="hybrid",
#     query="why is colbert effective?",
#     body={
#         "presentation.format.tensors": "short-value",
#         "input.query(q)": "embed(e5, \"why is colbert effective?\")",
#     }
# )
# assert(response.is_successful())
# print(json.dumps(response.hits[0], indent=2))

In [116]:
print(response.json["root"]["children"][0]["fields"])

{'id': '8d780112802d1b42ede6b5a2d2da1e69', 'url': 'https://sustainable.stanford.edu/sites/g/files/sbiybj26701/files/media/file/2022-23-sustainability-year-review.pdf', 'year': 2024, 'title': 'stanford-2023', 'full_image': '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAiYBqQDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0

In [109]:
print(len(response.json["root"]["children"]))
img = response.json["root"]["children"][0]["fields"]["full_image"]


3


In [118]:
print(img)

/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAiYBqQDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAoo

In [119]:
import base64
from IPython.display import HTML

def display_jpeg_base64(base64_string, width=None, height=None):
    """Displays a JPEG image from a base64 encoded string in a Jupyter Notebook.

    Handles both cases:
        1. Raw base64 string (starting with /9j/...)
        2. Data URI (starting with data:image/jpeg;base64,...)

    Args:
        base64_string: The base64 encoded JPEG image string.
        width: Optional width for the displayed image (e.g., "300px").
        height: Optional height for the displayed image (e.g., "200px").
    """

    if base64_string.startswith("data:image/jpeg;base64,"):
        # Data URI format - use it directly
        img_tag = f'<img src="{base64_string}"'
    elif base64_string.startswith("/9j/"):  # Check for JPEG signature
        # Raw base64 - create the data URI
        data_uri = f"data:image/jpeg;base64,{base64_string}"
        img_tag = f'<img src="{data_uri}"'
    else:
        print("Warning: Base64 string does not appear to be a JPEG image.")
        return  # Don't try to display

    if width:
        img_tag += f' width="{width}"'
    if height:
        img_tag += f' height="{height}"'
    img_tag += '>'

    display(HTML(img_tag))

display_jpeg_base64(img)


In [122]:
import requests
import base64
import json

def ask_gemini(
        prompt, image_encoding, max_tokens=50, content_only=True
    ):
        """
        Call the API to get the response for given prompt and images
        """

        data_uri = f"data:image/jpeg;base64,{image_encoding}"

        message = prompt + data_uri

    
        responses = client.generate_content(
            message,
        )
        print(responses)
        return responses



# Example usage (REPLACE WITH YOUR ACTUAL VALUES):
api_key = os.getenv("GOOGLE_API_KEY")
gemini_model = "gemini-1.5-flash"
system_instruction = "You are a smart and helpful assistant. Interpret what is in the image and summarize it to answer this question: " + query
client = genai.GenerativeModel(model_name=gemini_model, system_instruction=system_instruction)

gemini_response = ask_gemini(prompt_text, img)

response:
GenerateContentResponse(
    done=True,
    iterator=None,
    result=protos.GenerateContentResponse({
      "candidates": [
        {
          "content": {
            "parts": [
              {
                "text": "```json\n{\n  \"broad_topical_question\": \"What are the organization's sustainability goals?\",\n  \"broad_topical_query\": \"sustainability goals annual report\",\n  \"specific_detail_question\": \"What specific targets has the organization set for reducing its environmental impact?\",\n  \"specific_detail_query\": \"environmental impact reduction targets\",\n  \"visual_element_question\": \"\",\n  \"visual_element_query\": \"\"\n}\n```\n"
              }
            ],
            "role": "model"
          },
          "finish_reason": "STOP",
          "avg_logprobs": -0.011001471176888178
        }
      ],
      "usage_metadata": {
        "prompt_token_count": 654666,
        "candidates_token_count": 103,
        "total_token_count": 654769
      }
 

In [123]:
print(gemini_response.text)

```json
{
  "broad_topical_question": "What are the organization's sustainability goals?",
  "broad_topical_query": "sustainability goals annual report",
  "specific_detail_question": "What specific targets has the organization set for reducing its environmental impact?",
  "specific_detail_query": "environmental impact reduction targets",
  "visual_element_question": "",
  "visual_element_query": ""
}
```



Great. You have now deployed the Vespa application and fed the data to it, and made sure you are able to query it using the vespa endpoint and a token.

### Saving the generated key/cert files

A key and cert file is generated for you as an alternative to using tokens for authentication.
We advise you to save these files in a secure location, in case you want to use them for authentication in the future.

In [112]:
from pathlib import Path

key_path = Path(
    f"~/.vespa/{VESPA_TENANT_NAME}.{VESPA_APPLICATION_NAME}.default/data-plane-private-key.pem"
).expanduser()
cert_path = Path(
    f"~/.vespa/{VESPA_TENANT_NAME}.{VESPA_APPLICATION_NAME}.default/data-plane-public-cert.pem"
).expanduser()

assert key_path.exists(), cert_path.exists()

## 11. Deploying your web app

To deploy a frontend to let users interact with the Vespa application. you can clone the sample app from [sample-apps repo](https://github.com/vespa-engine/sample-apps/blob/master/visual-retrieval-colpali/README.md).
It includes instructions for running and connecting your web application to your vespa app.

In [113]:
!git clone --depth 1 --filter=blob:none --sparse https://github.com/vespa-engine/sample-apps.git src && cd src && git sparse-checkout set visual-retrieval-colpali

fatal: destination path 'src' already exists and is not an empty directory.


Now, you have the code for the webapp in your `src/visual-retrieval-colpali`-directory

In [114]:
os.listdir("src/visual-retrieval-colpali")

['.env',
 '.gitignore',
 'application',
 'prepare_feed_deploy.py',
 'pyproject.toml',
 'README.md',
 'src',
 'uv.lock',
 'vespa_feed_to_hf_dataset.py']

### Setting environment variables for your web app

Now, you need to set the following variables in the `src/.env.example`-file:

```bash
VESPA_APP_TOKEN_URL=https://abcde.z.vespa-app.cloud # Your token endpoint url you got after deploying your Vespa app.
VESPA_CLOUD_SECRET_TOKEN=vespa_cloud_xxxxxxxx # The value of the token that your created in this notebook.
GEMINI_API_KEY=your_api_key # The same as GOOGLE_API_KEY in this notebook
HF_TOKEN=hf_xxxx # If you want to deploy your web app to huggingface spaces - https://huggingface.co/settings/tokens
```

After, that, rename your file to .env.

In [115]:
# rename src/visual-retrieval-colpali/.env.example
os.rename(
    "src/visual-retrieval-colpali/.env.example", dst="src/visual-retrieval-colpali/.env"
)

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'src/visual-retrieval-colpali/.env.example' -> 'src/visual-retrieval-colpali/.env'

And you're ready to spin up your web app locally, and deploy to huggingface spaces if you want.
Navigate to `src/visual-retrieval-colpali/` directory and follow the instructions in the `README.md` to continue. 🚀

## Cleanup

As this notebook runs in CI, we will delete the Vespa application after running the notebook.
DO NOT run the cell below unless you are sure you want to delete the Vespa application.

In [None]:
if os.getenv("CI", "false") == "true":
    vespa_cloud.delete()