<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>

# Evaluating OpenAI Matryoshka 🪆 embeddings with Vespa

This notebook demonstrates the effectiveness of using the [recently released](https://openai.com/blog/new-embedding-models-and-api-updates) OpenAI `text-embedding-3` embeddings in Vespa.

Specifically, we are interested in the effectiveness of the "[Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147)" technique used in training, which claims to let us "shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties". If this works well, it should indeed allow us to very easily trade off accuracy in exchange for smaller vector size, reducing both storage and computation costs while retaining quality.

We'll use a standard information retrieval benchmark to measure retrieval quality with different embedding dimensions, and this will give us a chance to show how Vespa could elegantly accommodate this new technique without having to add any new features, using our flexible [tensor data model](https://docs.vespa.ai/en/tensor-user-guide.html).

Let's get started! First, install a few dependencies:

In [None]:
!pip3 install -U pyvespa ir_datasets openai pytrec_eval

## Getting a sample dataset
Let's download a dataset so we have something to embed:

In [271]:
import ir_datasets
dataset = ir_datasets.load('beir/trec-covid')
print("Dataset has", dataset.docs_count(), "documents. Sample:")
dataset.docs_iter()[120]._asdict()

Dataset has 171332 documents. Sample:


{'doc_id': 'z2u5frvq',
 'text': 'The authors discuss humoral immune responses to HIV and approaches to designing vaccines that induce viral neutralizing and other potentially protective antibodies.',
 'title': 'Antibody-Based HIV-1 Vaccines: Recent Developments and Future Directions: A summary report from a Global HIV Vaccine Enterprise Working Group',
 'url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2100141/',
 'pubmed_id': '18052607'}

### Queries
This dataset also comes with a set of queries, and query/document relevance judgements:

In [273]:
print(next(dataset.queries_iter()))
print(next(dataset.qrels_iter()))

BeirCovidQuery(query_id='1', text='what is the origin of COVID-19', query='coronavirus origin', narrative="seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans")
TrecQrel(query_id='1', doc_id='005b2j4b', relevance=2, iteration='0')


## Definining the Vespa application
[PyVespa](https://pyvespa.readthedocs.io/en/latest/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html).
A Vespa application package consists of configuration files, schemas, models, and possibly even custom code (plugins).

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type.

In [192]:
from vespa.package import Schema, Document, Field, FieldSet
my_schema = Schema(
            name="my_schema",
            mode="index",
            document=Document(
                fields=[
                    Field(name="doc_id", type="string", indexing=["summary"]),
                    Field(name="text", type="string", indexing=["summary", "index"], index="enable-bm25"),
                    Field(name="title", type="string", indexing=["summary", "index"], index="enable-bm25"),
                    Field(name="url", type="string", indexing=["summary", "index"]),
                    Field(name="pubmed_id", type="string", indexing=["summary", "index"]),
                    
                    Field(name="embedding", type="tensor<float>(x[3072])",
                        indexing=["attribute"],
                        attribute=["paged", "distance-metric: angular"],
                    ),

                    Field(name="shortened_256", type="tensor<float>(x[256])",
                        indexing=["attribute", "index"],
                        attribute=["distance-metric: angular"]
                    )
                ],
            ),
            fieldsets=[
                FieldSet(name = "default", fields = ["title", "text"])
            ]
)

The last two fields of type `tensor<float>(x[3072/256])` are not from the dataset - they are tensor fields to hold the embedding data, with a size of 3072 dimensions (full) and 256 dimensions (shortened).

For the latter tensor field, `index` means we will build an [HNSW Approximate Nearest Neighbor index](https://docs.vespa.ai/en/approximate-nn-hnsw.html), which we can later use to greatly reduce the query latency.

To evaluate OpenAI embeddings, we must get the embeddings from their API, but Vespa can also embed the document for you with a user-specified model.

Now we must define an [application package](https://andreer.no) which uses this schema:

In [193]:
from vespa.package import ApplicationPackage

vespa_app_name = "matryoshka"
vespa_application_package = ApplicationPackage(
        name=vespa_app_name,
        schema=[my_schema]
)

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding `rank-profile`'s to the schema.

Vespa supports [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) and has a rich set of built-in [rank-features](https://docs.vespa.ai/en/reference/rank-features.html), including many
text-matching features such as:

- [BM25](https://docs.vespa.ai/en/reference/bm25.html),
- [nativeRank](https://docs.vespa.ai/en/reference/nativerank.html)
and [many more](https://docs.vespa.ai/en/reference/rank-features.html).

Users can also define custom functions using [ranking expressions](https://docs.vespa.ai/en/reference/ranking-expressions.html).

The following defines three runtime selectable Vespa ranking profiles:
* `exact` uses the full-size embedding
* `shortened` uses only 256 dimensions (exact, or using the approximate nearest neighbor HNSW index)
* `shortened_rerank` uses the 256-dimension shortened embeddings (exact or ANN) in a first phase, and the full 3072-dimension embeddings in a second phase. By default the second phase is applied to the top 100 documents from the first phase.

In [194]:
from vespa.package import RankProfile, Function, FirstPhaseRanking, SecondPhaseRanking

my_schema.add_rank_profile(RankProfile(name="bm25", first_phase="bm25(title)+bm25(text)"))

exact = RankProfile(
    name="exact",
    inputs=[
        ("query(q3072)", "tensor<float>(x[3072])"),
        ("query(q256)", "tensor<float>(x[256])")
        ],
    functions=[
        Function(
            name="cos_sim_3072",
            expression="closeness(field, embedding)"
        )
    ],
    first_phase=FirstPhaseRanking(
        expression="cos_sim_3072"
    ),
    match_features=["cos_sim_3072"]
)
my_schema.add_rank_profile(exact)

shortened = RankProfile(
    name="shortened",
    inputs=[
        ("query(q3072)", "tensor<float>(x[3072])"),
        ("query(q256)", "tensor<float>(x[256])")
        ],
    functions=[
        Function(
            name="cos_sim_256",
            expression="closeness(field, shortened_256)"
        )
    ],
    first_phase=FirstPhaseRanking(
        expression="cos_sim_256"
    ),
    match_features=["cos_sim_256"]
)
my_schema.add_rank_profile(shortened)

shortened_rerank = RankProfile(
    name="shortened_rerank",
    inputs=[
        ("query(q3072)", "tensor<float>(x[3072])"),
        ("query(q256)", "tensor<float>(x[256])")
        ],
    functions=[
        Function(
            name="cos_sim_256",
            expression="closeness(field, shortened_256)"
        ),
        Function(
            name="cos_sim_3072",
            expression="cosine_similarity(attribute(embedding), query(q3072), x)"
        ),
    ],
    first_phase=FirstPhaseRanking(
        expression="cos_sim_256"
    ),
    second_phase=SecondPhaseRanking(
        expression="cos_sim_3072"
    ),
    match_features=["cos_sim_256", "cos_sim_3072"]
)
my_schema.add_rank_profile(shortened_rerank)

For an example of a `hybrid` rank-profile which combines semantic search with traditional text retrieval such as BM25, see the previous blog post: [Turbocharge RAG with LangChain and Vespa Streaming Mode for Sharded Data](https://blog.vespa.ai/turbocharge-rag-with-langchain-and-vespa-streaming-mode/)

## Deploy the application to Vespa Cloud

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/).
It is also possible to deploy the app using docker; see the [Hybrid Search - Quickstart](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html) guide for
an example of deploying it to a local docker container.

Install the Vespa CLI using [homebrew](https://brew.sh/) - or download a binary from GitHub as demonstrated below.

In [8]:
!brew install vespa-cli

To reinstall 8.294.50, run:
  brew reinstall vespa-cli


Alternatively, if running in Colab, download the Vespa CLI:

In [40]:
import os
import requests
res = requests.get(url="https://api.github.com/repos/vespa-engine/vespa/releases/latest").json()
os.environ["VERSION"] = res["tag_name"].replace("v", "")
!curl -fsSL https://github.com/vespa-engine/vespa/releases/download/v${VERSION}/vespa-cli_${VERSION}_linux_amd64.tar.gz | tar -zxf -
!ln -sf /content/vespa-cli_${VERSION}_linux_amd64/bin/vespa /bin/vespa

To deploy the application to Vespa Cloud we need to create a tenant in the Vespa Cloud:

Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one).
This step requires a Google or GitHub account, and will start your [free trial](https://cloud.vespa.ai/en/free-trial).
Make note of the tenant name, it is used in the next steps.

### Configure Vespa Cloud date-plane security

Create Vespa Cloud data-plane mTLS cert/key-pair. The mutual certificate pair is used to talk to your Vespa cloud endpoints. See [Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide) for details.

We save the paths to the credentials for later data-plane access without using pyvespa APIs.

In [195]:
import os

os.environ["TENANT_NAME"] = "vespa-team" # Replace with your tenant name

vespa_cli_command = f'vespa config set application {os.environ["TENANT_NAME"]}.{vespa_app_name}'

!vespa config set target cloud
!{vespa_cli_command}
!vespa auth cert -N

[32mSuccess:[0m Certificate written to [36m'/Users/andreer/.vespa/vespa-team.matryoshka.default/data-plane-public-cert.pem'[0m
[32mSuccess:[0m Private key written to [36m'/Users/andreer/.vespa/vespa-team.matryoshka.default/data-plane-private-key.pem'[0m


Validate that we have the expected data-plane credential files:

In [196]:
from os.path import exists
from pathlib import Path

cert_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-public-cert.pem"
key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-private-key.pem"

if not exists(cert_path) or not exists(key_path):
    print("ERROR: set the correct paths to security credentials. Correct paths above and rerun until you do not see this error")

Note that the subsequent Vespa Cloud deploy call below will add `data-plane-public-cert.pem` to the application before deploying it to Vespa Cloud, so that
you have access to both the private key and the public certificate. At the same time, Vespa Cloud only knows the public certificate.

### Configure Vespa Cloud control-plane security

Authenticate to generate a tenant level control plane API key for deploying the applications to Vespa Cloud, and save the path to it.

The generated tenant api key must be added in the Vespa Console before attemting to deploy the application.

```
To use this key in Vespa Cloud click 'Add custom key' at
https://console.vespa-cloud.com/tenant/TENANT_NAME/account/keys
and paste the entire public key including the BEGIN and END lines.
```

In [197]:
#!vespa auth api-key

from pathlib import Path
api_key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.api-key.pem"

### Deploy to Vespa Cloud

Now that we have data-plane and control-plane credentials ready, we can deploy our application to Vespa Cloud!

`PyVespa` supports deploying apps to the [development zone](https://cloud.vespa.ai/en/reference/environments#dev-and-perf).

>Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

In [198]:
from vespa.deployment import VespaCloud

def read_secret():
    """Read the API key from the environment variable. This is
    only used for CI/CD purposes."""
    t = os.getenv("VESPA_TEAM_API_KEY")
    if t:
        return t.replace(r"\n", "\n")
    else:
        return t

vespa_cloud = VespaCloud(
    tenant=os.environ["TENANT_NAME"],
    application=vespa_app_name,
    key_content=read_secret() if read_secret() else None,
    key_location=api_key_path,
    application_package=vespa_application_package)

Now deploy the app to Vespa Cloud dev zone.

The first deployment typically takes 2 minutes until the endpoint is up.

In [201]:
from vespa.application import Vespa
app:Vespa = vespa_cloud.deploy()

Deployment started in run 21 of dev-aws-us-east-1c for vespa-team.matryoshka. This may take a few minutes the first time.
INFO    [20:10:43]  Deploying platform version 8.296.15 and application dev build 19 for dev-aws-us-east-1c of default ...
INFO    [20:10:43]  Using CA signed certificate version 0
INFO    [20:10:44]  Using 1 nodes in container cluster 'matryoshka_container'
INFO    [20:10:47]  Session 282296 for tenant 'vespa-team' prepared and activated.
INFO    [20:10:47]  ######## Details for all nodes ########
INFO    [20:10:47]  h88969c.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [20:10:47]  --- platform vespa/cloud-tenant-rhel8:8.296.15
INFO    [20:10:47]  --- container-clustercontroller on port 19050 has config generation 282296, wanted is 282296
INFO    [20:10:47]  --- metricsproxy-container on port 19092 has config generation 282295, wanted is 282296
INFO    [20:10:47]  h88978a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be

## Get OpenAI embeddings for documents in the dataset

When producing the embeddings, we concatenate the title and text into a single string.

This requires an OpenAI API key and thus costs money, approximately $5 for the full dataset.

You may also need to adjust max_workers to ensure you don't run into your rate limit.

In [202]:
from openai import OpenAI
import concurrent.futures

skipped = 0
docs = []
for doc in dataset.docs_iter()[0:100]:
    if(len(doc.title) + 1 + len(doc.text) <= 8192):
        docs.append(doc)
    else:
        skipped+=1

print("embedding", len(docs), "docs (skipped", skipped, "too big for context window)")

client = OpenAI()

def get_embedding(text, model="text-embedding-3-large"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding

def embed_doc(doc):
  embedding = get_embedding(doc.title + " " + doc.text)  
  shortened = embedding[0:256]
  return {
      "doc_id": doc.doc_id,
      "text": doc.text,
      "title": doc.title,
      "url": doc.url,
      "pubmed_id": doc.pubmed_id,
      
      "shortened_256": {"type":"tensor<float>(x[256])","values":shortened},
      "embedding": {"type":"tensor<float>(x[3072])","values":embedding}
  }

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        my_docs_to_feed = list(executor.map(embed_doc, docs))

embedding 100 docs (skipped 0 too big for context window)


## Feeding the dataset and embeddings into Vespa

Now that we have parsed the dataset and created an object with the fields that we want to add to Vespa, we must format the
object into the format that PyVespa accepts. Notice the `fields`, `id` and `groupname` keys. The `groupname` is the
key that is used to shard and co-locate the data and is only relevant when using Vespa with streaming mode.

In [203]:
from typing import Iterable
def vespa_feed(user:str) -> Iterable[dict]:
    for doc in reversed(my_docs_to_feed):
        yield {
            "fields": doc,
            "id": doc["doc_id"],
            "groupname": user
        }

Now, we can feed to the Vespa instance (`app`), using the `feed_iterable` API, using the generator function above as input
with a custom `callback` function.

The most time-consuming part of this feeding is building the HNSW index.

In [205]:
from vespa.io import VespaResponse

def callback(response:VespaResponse, id:str):
    if not response.is_successful():
        print(f"Document {id} failed to feed with status code {response.status_code}, url={response.url} response={response.json}")

app.feed_iterable(schema="my_schema", iter=vespa_feed(""), callback=callback, max_queue_size=8000, max_workers=64, max_connections=128)

### Embedding the queries
We need to obtain embedding for the queries from OpenAI. If we were using an open-source model like e5, vespa could perform the embedding for you as part of the query execution.

In [98]:
queries = []
for q in dataset.queries_iter():
    queries.append({'text': q.text, 'embedding': get_embedding(q.text), 'id': q.query_id})

..................................................

### Querying data

Now, we can query our data. We'll do it in a few different ways, using the rank profiles we defined in the schema:

- BM25 text search, as a baseline
- Exhaustive (exact) nearest neighbor search with the full embeddings (3072 dimensions)
- Exhaustive (exact) nearest neighbor search with the shortened 256 dimensions
- Approximate nearest neighbor search using HNSW index with 256 dimensions
- Approximate nearest neighbor search using HNSW index with 256 dimensions, reranking top 100 hits with the full embeddings

The query request uses the Vespa Query API  and the `Vespa.query()` function
supports passing any of the Vespa query API parameters.

Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)

In [230]:
from vespa.io import VespaQueryResponse
import json

def query_bm25(q):
    yql="select * from my_schema where userQuery() limit 10"
    response:VespaQueryResponse = session.query(
        yql=yql,
        query=q['text'],
        ranking="bm25",
        timeout=10,
        body={
            "presentation.timing": "true"
        }
    )
    assert(response.is_successful())
    return response

def query(embedding, *, field, input, ranking, approximate):
    response:VespaQueryResponse = session.query(
        yql="select doc_id, title from my_schema where ({targetHits:100, approximate:"+str(approximate)+"}nearestNeighbor("+field+","+input+")) limit 10",
        ranking=ranking,
        timeout=10,
        body={
            "presentation.format.tensors": "short-value",
            "presentation.timing": "true",
            "input.query(q3072)": embedding,
            "input.query(q256)": embedding[0:256]
        }
    )
    assert(response.is_successful())
    return response

def query_exact(q):
    return query(q['embedding'], field="embedding", input="q3072", ranking="exact", approximate=False)

def query_256(q):
    return query(q['embedding'], field="shortened_256", input="q256", ranking="shortened", approximate=False)

def query_256_ann(q):
    return query(q['embedding'], field="shortened_256", input="q256", ranking="shortened", approximate=True)

def query_rerank(q):
    return query(q['embedding'], field="shortened_256", input="q256", ranking="shortened_rerank", approximate=True)

print("Query text:", queries[0]['text'])

with app.syncio() as session:
    print(json.dumps(query_rerank(queries[0]).hits[0], indent=2))

Query text: what is the origin of COVID-19


{
  "id": "index:matryoshka_content/0/16c7e8749fb82d3b5e37bedb",
  "relevance": 0.6591723960884718,
  "source": "matryoshka_content",
  "fields": {
    "matchfeatures": {
      "cos_sim_256": 0.5481410972571522,
      "cos_sim_3072": 0.6591723960884718
    },
    "doc_id": "beguhous",
    "title": "The proximal origin of SARS-CoV-2"
  }
}


Notice the `matchfeatures` that returns the configured match-feature from the rank-profile.

Now we can run all our queries and analyze the results:

In [226]:
import pandas

global qt

def run_queries(query_function):
    print("\nrun", query_function.__name__, )
    results = {}
    csv_results = []
    for q in queries:
        response = query_function(q)
        print(".", end="")
        results[q['id']] = {}
        for pos, hit in enumerate(response.hits, start=1):
            global qt 
            qt += float(response.get_json()['timing']['querytime'])
            results[q['id']][hit['fields']['doc_id']] = pos
            csv_results.append(
                {
                    "query_id": q['id'],
                    "iteration": "Q0",
                    "doc_id": hit['fields']['doc_id'],
                    "position": pos,
                    "score": hit['relevance'],
                    "runid": query_function.__name__
                }
            )
    df_result = pandas.DataFrame.from_records(csv_results)
    df_result.to_csv(query_function.__name__+".run", index=False, header=False, sep=' ')
    return results

query_functions = ( query_bm25, query_exact, query_256, query_256_ann, query_rerank )
runs = {}

with app.syncio() as session:
    for f in query_functions:
        qt=0
        runs[f.__name__] = run_queries(f)
        print(" avg query time {:.4f} s".format(qt/len(queries)))


run query_bm25


.................................................. avg query time 0.0694 s

run query_exact
.................................................. avg query time 2.2802 s

run query_256
.................................................. avg query time 0.2710 s

run query_256_ann
.................................................. avg query time 0.0266 s

run query_rerank
.................................................. avg query time 0.0286 s


This also produces a set of csv files which can be used to analyze the results with the trec_eval command line utility.

To run the analysis directly in the notebook, we need to get the query relevance judgements into the format supported by pytrec_eval:

In [129]:
qrels = {}

for q in dataset.queries_iter():
    qrels[q.query_id] = {}

for qrel in dataset.qrels_iter():
    qrels[qrel.query_id][qrel.doc_id] = qrel.relevance

Let's check the scoring for the first query:

In [233]:
for docid in runs['query_256_ann']['1']:
    score = qrels['1'].get(docid)
    print(docid, score or "-")

beguhous 2
k9lcpjyo 2
pl48ev5o 2
jwxt4ygt 2
dv9m19yk 1
ft4rbcxf 1
h8ahn8fw 2
6y1gwszn 2
3xusxrij -
2tyt8255 1


Looks promising! Now we can compute the results:

In [234]:
import pytrec_eval

def evaluate(run):
    evaluator = pytrec_eval.RelevanceEvaluator(
        qrels, {'ndcg_cut.10'})
    evaluation = evaluator.evaluate(run)
    
    sum = 0
    for ev in evaluation:
        sum+=evaluation[ev]['ndcg_cut_10']
    return sum/len(evaluation)

for run in runs:
  print(run, "\tndcg_cut_10: {:.4f}".format(evaluate(runs[run])))

query_bm25 	ndcg_cut_10: 0.6178
query_exact 	ndcg_cut_10: 0.7870
query_256 	ndcg_cut_10: 0.7564
query_256_ann 	ndcg_cut_10: 0.7542
query_rerank 	ndcg_cut_10: 0.7886


## Conclusions

Results indeed look very promising in this (tiny) test.

Querying with the first 256 dimensions still gives very good results, while requiring only **8.3%** of the memory. We also note that speeding up the query by using a HNSW ANN index does not seem to negatively impact the quality.

When adding a second phase to re-rank the top 100 hits using the full embeddings, the results are just as good from the exact search (in fact ever so slightly better), with much lower latency / query cost.

## Summary

For those interested in learning more about Vespa, join the [Vespa community on Slack](https://vespatalk.slack.com/) to exchange ideas,
seek assistance, or stay in the loop on the latest Vespa developments.


We can now delete the cloud instance:

In [235]:
vespa_cloud.delete()

Deactivated vespa-team.matryoshka in dev.aws-us-east-1c
Deleted instance vespa-team.matryoshka.default
