<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>


# Standalone ColBERT + Vespa for long-context ranking

This guide shows how to use the [ColBERT](https://github.com/stanford-futuredata/ColBERT) package to produce token-level
vectors, as an alternative to the native Vespa [colbert embedder](https://docs.vespa.ai/en/embedding.html#colbert-embedder).

This guide illustrates how to:

- Feed multiple passages per document (long-context)
- Compress token vectors using binarization compatible with Vespa `unpack_bits`
- Use the Vespa hex feed format for binary vectors with mixed Vespa tensors
- Query with the ColBERT query token vectors

Read more about [Vespa Long-Context ColBERT](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/).

In [None]:
!pip3 install -U pyvespa colbert-ai numpy torch

Load a checkpoint with ColBERT and obtain document and query embeddings.

In [11]:
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint(
    "colbert-ir/colbertv2.0", colbert_config=ColBERTConfig(root="experiments")
)



In [50]:
document_passages = [
    "Alan Turing  was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
    "Born in Maida Vale, London, Turing was raised in southern England. He graduated from King's College, Cambridge, with a degree in mathematics.",
    "After the war, Turing worked at the National Physical Laboratory, where he designed the Automatic Computing Engine, one of the first designs for a stored-program computer.",
    "Turing has an extensive legacy with statues of him and many things named after him, including an annual award for computer science innovations.",
]

In [51]:
document_token_vectors = ckpt.docFromText(document_passages)



In [52]:
document_token_vectors.shape

torch.Size([4, 35, 128])

In [53]:
query_vectors = ckpt.queryFromText(["Who was Alan Turing?"])[0]
query_vectors.shape

torch.Size([32, 128])

Queries are always padded to 32 tokens, so the output above holds 32 query token vectors.

The routines below binarize the document token vectors and emit Vespa tensor formats that can be used in queries and in the JSON feed.

In [67]:
import numpy as np
import torch
from binascii import hexlify
from typing import List, Dict


def binarize_token_vectors_hex(vectors: torch.Tensor) -> List[Dict]:
    # Notice axis=2 to pack the bits in the last dimension, which holds the token-level vectors
    binarized_token_vectors = np.packbits(np.where(vectors > 0, 1, 0), axis=2).astype(
        np.int8
    )
    vespa_tensor = list()
    for chunk_index in range(0, len(binarized_token_vectors)):
        token_vectors = binarized_token_vectors[chunk_index]
        for token_index in range(0, len(token_vectors)):
            values = str(hexlify(token_vectors[token_index].tobytes()), "utf-8")
            if (
                values == "00000000000000000000000000000000"
            ):  # skip empty vectors due to padding of batch
                continue
            vespa_tensor_cell = {
                "address": {"context": chunk_index, "token": token_index},
                "values": values,
            }
            vespa_tensor.append(vespa_tensor_cell)

    return vespa_tensor


def float_query_token_vectors(vectors: torch.Tensor) -> Dict[int, List[float]]:
    vespa_token_feed = dict()
    for index in range(0, len(vectors)):
        vespa_token_feed[index] = vectors[index].tolist()
    return vespa_token_feed

In [None]:
import json

print(json.dumps(binarize_token_vectors_hex(document_token_vectors)))
print(json.dumps(float_query_token_vectors(query_vectors)))
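
To make the packing step concrete, here is a minimal NumPy-only sketch on a toy input: one passage, two tokens, and 16 dimensions instead of 128. The toy values and variable names are illustrative assumptions and not part of the notebook; the thresholding and the `np.packbits` call mirror `binarize_token_vectors_hex` above.

```python
import numpy as np
from binascii import hexlify

# Toy input (assumption): 1 passage, 2 tokens, 16 dimensions. The real model emits 128.
toy = np.array(
    [[[0.3, -0.1, 0.7, -0.2, -0.5, 0.9, -0.4, 0.1,
       -0.6, -0.3, -0.2, -0.1, 0.2, 0.4, 0.6, 0.8],
      [0.0] * 16]],  # an all-zero row stands in for batch padding
    dtype=np.float32,
)

# Positive values become 1-bits; every 8 bits are packed into one byte.
bits = np.where(toy > 0, 1, 0)
packed = np.packbits(bits, axis=2).astype(np.int8)

# One hex string per token: two hex characters per packed byte.
hex_values = [hexlify(row.tobytes()).decode("utf-8") for row in packed[0]]
print(packed.shape)  # (1, 2, 2)
print(hex_values)    # ['a50f', '0000']

# Per-token footprint for 128 dimensions: float32 versus packed bits.
float_bytes = 128 * 4    # 512 bytes
packed_bytes = 128 // 8  # 16 bytes, matching v[16] in the schema
print(float_bytes // packed_bytes)  # 32
```

The all-zero hex string of the second token is the kind of cell that the feed routine skips as padding.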

## Defining the Vespa application
[PyVespa](https://pyvespa.readthedocs.io/en/latest/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html).
A Vespa application package consists of configuration files, schemas, models, and code (plugins).   

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type.

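The `colbert` field stores the binarized token vectors as a mixed tensor: each token vector is 16 `int8` values (128 bits), addressed by passage (`context`) and `token`. The field is a plain attribute without an HNSW index, so it is used for ranking only.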

In [60]:
from vespa.package import Schema, Document, Field

colbert_schema = Schema(
    name="doc",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary"]),
            Field(
                name="passages",
                type="array<string>",
                indexing=["summary", "index"],
                index="enable-bm25",
            ),
            Field(
                name="colbert",
                type="tensor<int8>(context{}, token{}, v[16])",
                indexing=["attribute", "summary"],
            ),
        ]
    ),
)

In [61]:
from vespa.package import ApplicationPackage

vespa_app_name = "colbertlong"
vespa_application_package = ApplicationPackage(
    name=vespa_app_name, schema=[colbert_schema]
)

Note that MaxSim is computed in the first-phase ranking expression, over all hits retrieved by the query.

In [62]:
from vespa.package import RankProfile, Function, FirstPhaseRanking

colbert_profile = RankProfile(
    name="default",
    inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
    functions=[
        Function(
            name="max_sim_per_context",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert)) , v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim", expression="reduce(max_sim_per_context, max, context)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="max_sim"),
    match_features=["max_sim_per_context"],
)
colbert_schema.add_rank_profile(colbert_profile)
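
The ranking expression above can be mirrored offline in NumPy, which is handy for sanity-checking scores. The sketch below is an approximation under stated assumptions: the function name and toy inputs are hypothetical, the dense dimension is 8 bits instead of 128, and the default bit order of `np.unpackbits` is assumed to match Vespa's `unpack_bits`, which is what the `np.packbits` feed code relies on.

```python
import numpy as np


def max_sim_per_context(query: np.ndarray, packed_doc: np.ndarray) -> np.ndarray:
    """query: (querytoken, v) floats; packed_doc: (context, token, v // 8) int8."""
    # Counterpart of unpack_bits: expand every int8 into 8 values in {0, 1}.
    bits = np.unpackbits(packed_doc.view(np.uint8), axis=2).astype(np.float32)
    # Dot product over v for every (context, querytoken, token) combination.
    sims = np.einsum("qv,ctv->cqt", query, bits)
    # Max over document tokens, then sum over query tokens.
    return sims.max(axis=2).sum(axis=1)


# Toy data (assumption): 2 query tokens, 2 contexts, 2 tokens each, 8 bits per token.
query = np.array(
    [[1, 0, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0, 0, 0]],
    dtype=np.float32,
)
packed_doc = np.array(
    [[[0b10000000], [0b01000000]],
     [[0b10000000], [0b00000000]]],
    dtype=np.uint8,
).astype(np.int8)

per_context = max_sim_per_context(query, packed_doc)
max_sim = per_context.max()  # counterpart of reduce(max_sim_per_context, max, context)
print(per_context, max_sim)
```

Unlike the feed code, this sketch does not drop all-zero padding vectors. They contribute a similarity of 0, which only changes the result when every real token scores below zero for some query token.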

## Deploy the application to Vespa Cloud

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/).
It is also possible to deploy the app using Docker; see the [Hybrid Search - Quickstart](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html) guide for
an example of deploying it to a local Docker container.

Install the Vespa CLI using [Homebrew](https://brew.sh/), or download a binary from GitHub as demonstrated below.

In [None]:
!brew install vespa-cli

Alternatively, if running in Colab, download the Vespa CLI:

In [None]:
import os
import requests

res = requests.get(
    url="https://api.github.com/repos/vespa-engine/vespa/releases/latest"
).json()
os.environ["VERSION"] = res["tag_name"].replace("v", "")
!curl -fsSL https://github.com/vespa-engine/vespa/releases/download/v${VERSION}/vespa-cli_${VERSION}_linux_amd64.tar.gz | tar -zxf -
!ln -sf /content/vespa-cli_${VERSION}_linux_amd64/bin/vespa /bin/vespa

Deploying the application to Vespa Cloud requires a tenant:

Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one).
This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial).
Make note of the tenant name; it is used in the next steps.

### Configure Vespa Cloud data-plane security

Create a Vespa Cloud data-plane mTLS certificate/key pair. The certificate pair is used to talk to your Vespa Cloud endpoints. See [Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide) for details.

We save the paths to the credentials for later data-plane access without using pyvespa APIs.

In [None]:
import os

os.environ["TENANT_NAME"] = "vespa-team"  # Replace with your tenant name

vespa_cli_command = (
    f'vespa config set application {os.environ["TENANT_NAME"]}.{vespa_app_name}'
)

!vespa config set target cloud
!{vespa_cli_command}
!vespa auth cert -N

Validate that we have the expected data-plane credential files:

In [35]:
from os.path import exists
from pathlib import Path

cert_path = (
    Path.home()
    / ".vespa"
    / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-public-cert.pem"
)
key_path = (
    Path.home()
    / ".vespa"
    / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-private-key.pem"
)

if not exists(cert_path) or not exists(key_path):
    print(
        "ERROR: set the correct paths to security credentials. Correct paths above and rerun until you do not see this error"
    )

Note that the subsequent Vespa Cloud deploy call below will add `data-plane-public-cert.pem` to the application before deploying it to Vespa Cloud, so that
you have access to both the private key and the public certificate. At the same time, Vespa Cloud only knows the public certificate.

### Configure Vespa Cloud control-plane security

Authenticate to generate a tenant level control plane API key for deploying the applications to Vespa Cloud, and save the path to it.

The generated tenant API key must be added in the Vespa Console before attempting to deploy the application.

```
To use this key in Vespa Cloud click 'Add custom key' at
https://console.vespa-cloud.com/tenant/TENANT_NAME/account/keys
and paste the entire public key including the BEGIN and END lines.
```

In [None]:
!vespa auth api-key

from pathlib import Path

api_key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.api-key.pem"

### Deploy to Vespa Cloud

Now that we have data-plane and control-plane credentials ready, we can deploy our application to Vespa Cloud!

`PyVespa` supports deploying apps to the [development zone](https://cloud.vespa.ai/en/reference/environments#dev-and-perf).

>Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

In [63]:
from vespa.deployment import VespaCloud


def read_secret():
    """Read the API key from the environment variable. This is
    only used for CI/CD purposes."""
    t = os.getenv("VESPA_TEAM_API_KEY")
    if t:
        return t.replace(r"\n", "\n")
    else:
        return t


vespa_cloud = VespaCloud(
    tenant=os.environ["TENANT_NAME"],
    application=vespa_app_name,
    key_content=read_secret() if read_secret() else None,
    key_location=api_key_path,
    application_package=vespa_application_package,
)

Now deploy the app to Vespa Cloud dev zone.

The first deployment typically takes about 2 minutes before the endpoint is up.

In [64]:
from vespa.application import Vespa

app: Vespa = vespa_cloud.deploy()

Deployment started in run 3 of dev-aws-us-east-1c for samples.colbertlong. This may take a few minutes the first time.
INFO    [19:49:37]  Deploying platform version 8.324.16 and application dev build 3 for dev-aws-us-east-1c of default ...
INFO    [19:49:37]  Using CA signed certificate version 0
INFO    [19:49:46]  Using 1 nodes in container cluster 'colbertlong_container'
INFO    [19:49:51]  Session 2737 for tenant 'samples' prepared and activated.
INFO    [19:49:52]  ######## Details for all nodes ########
INFO    [19:49:52]  h88976a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [19:49:52]  --- platform vespa/cloud-tenant-rhel8:8.324.16
INFO    [19:49:52]  --- logserver-container on port 4080 has config generation 2737, wanted is 2737
INFO    [19:49:52]  --- metricsproxy-container on port 19092 has config generation 2737, wanted is 2737
INFO    [19:49:52]  h88976b.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [19:49:52]  -

Use the Vespa tensor `blocks` format for mixed tensors (here, two mapped dimensions and one dense dimension); see the [document JSON format reference](https://docs.vespa.ai/en/reference/document-json-format.html#tensor).

In [65]:
from vespa.io import VespaResponse

vespa_feed_format = {
    "id": "1",
    "passages": document_passages,
    "colbert": {"blocks": binarize_token_vectors_hex(document_token_vectors)},
}
with app.syncio() as sync:
    response: VespaResponse = sync.feed_data_point(
        data_id=1, fields=vespa_feed_format, schema="doc"
    )
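
Before feeding, it can help to sanity-check the `blocks` payload. The sketch below uses a hypothetical helper, `check_blocks`, and made-up example cells; it assumes the cell layout produced by `binarize_token_vectors_hex` (an `address` with `context` and `token`, plus a hex `values` string of 16 bytes).

```python
from typing import Dict, List


def check_blocks(blocks: List[Dict], packed_bytes: int = 16) -> int:
    """Validate cell addresses and hex lengths; return the number of cells."""
    for cell in blocks:
        if set(cell["address"]) != {"context", "token"}:
            raise ValueError(f"unexpected address keys: {cell['address']}")
        values = cell["values"]
        # Two hex characters encode one int8 value of the dense v dimension.
        if len(values) != 2 * packed_bytes:
            raise ValueError(f"expected {2 * packed_bytes} hex chars, got {len(values)}")
        bytes.fromhex(values)  # raises ValueError on non-hex characters
    return len(blocks)


# Made-up cells for illustration only.
example_blocks = [
    {"address": {"context": 0, "token": 0}, "values": "a5" * 16},
    {"address": {"context": 0, "token": 1}, "values": "0f" * 16},
]
n_cells = check_blocks(example_blocks)
print(n_cells)  # 2
```

In the notebook, the same check could be applied to `vespa_feed_format["colbert"]["blocks"]` before calling `feed_data_point`.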

## Querying



This example ranks all documents by brute force: the query matches everything (`where true`), with no retrieval step using `nearestNeighbor` or keywords.

In [66]:
from vespa.io import VespaQueryResponse
import json

response: VespaQueryResponse = app.query(
    yql="select * from doc where true",
    ranking="default",
    body={
        "presentation.format.tensors": "short-value",
        "input.query(qt)": float_query_token_vectors(query_vectors),
    },
)
assert response.is_successful()
print(json.dumps(response.hits[0], indent=2))

{
  "id": "id:doc:doc::1",
  "relevance": 100.0651626586914,
  "source": "colbertlong_content",
  "fields": {
    "matchfeatures": {
      "max_sim_per_context": {
        "0": 100.0651626586914,
        "1": 62.7861328125,
        "2": 67.44772338867188,
        "3": 60.133323669433594
      }
    },
    "sddocname": "doc",
    "documentid": "id:doc:doc::1",
    "id": "1",
    "passages": [
      "Alan Turing  was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
      "Born in Maida Vale, London, Turing was raised in southern England. He graduated from King's College, Cambridge, with a degree in mathematics.",
      "After the war, Turing worked at the National Physical Laboratory, where he designed the Automatic Computing Engine, one of the first designs for a stored-program computer.",
      "Turing has an extensive legacy with statues of him and many things named after him, including an annual award for computer science 

As the match features show, the first context (index 0) scores highest, and that score is used as the score of the entire document.
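
The per-context scores also make it possible to pick the best-matching passage on the client side. The following is a minimal sketch, assuming the hit layout shown in the output above; the helper name `best_passage` is hypothetical and the passage texts are shortened placeholders.

```python
from typing import Dict, Tuple

# Scores copied from the output above; passages replaced by placeholders.
hit = {
    "fields": {
        "matchfeatures": {
            "max_sim_per_context": {
                "0": 100.0651626586914,
                "1": 62.7861328125,
                "2": 67.44772338867188,
                "3": 60.133323669433594,
            }
        },
        "passages": ["passage 0", "passage 1", "passage 2", "passage 3"],
    }
}


def best_passage(hit: Dict) -> Tuple[int, str]:
    """Return the index and text of the passage with the highest MaxSim score."""
    scores = hit["fields"]["matchfeatures"]["max_sim_per_context"]
    # Context labels are strings in the response; they index into the passages array.
    best_context = int(max(scores, key=scores.get))
    return best_context, hit["fields"]["passages"][best_context]


index, text = best_passage(hit)
print(index, text)  # 0 passage 0
```

With a live response, `response.hits[0]` could be passed in place of the hand-written `hit`.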


In [None]:
vespa_cloud.delete()