<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg">
  <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;">
</picture>

# Evaluating a Vespa Application

Users and customers often ask us which retrieval and ranking strategy is best for a given use case. Even though we sometimes have an intuition, we always recommend setting up experiments and doing a proper quantitative evaluation.

> Models are temporary; Evals are forever.
> 
> -Eugene Yan


Without a proper evaluation setup, you run the risk of settling for `lgtm@10` (looks good to me @ 10).

Then, once you deploy your application to users, you can be sure that you will get feedback about queries that do not produce relevant results. If you then try to optimize for those queries without knowing whether your tweaks actually improve the overall quality of your search, you might end up with a system that is worse than the one you started with.


![search eval](_static/search-eval.jpg)


## So, what can you do? 

You can set up a proper evaluation pipeline, where you can test different ranking strategies, and see how they perform on a set of evaluation queries that act as a proxy for your real users' queries. This way, you can make informed decisions about what works best for your use case. If you collect real user interactions, it could be even better, but it is important to also keep the evaluation pipeline light enough so that you can run it both during development and in CI pipelines (possibly at different scales).

This guide shows how you can easily evaluate a Vespa application using pyvespa's `VespaMatchEvaluator` and `VespaEvaluator` classes.

### Evaluate match-phase (retrieval) for recall

The match-phase (or retrieval phase) in Vespa is performed to retrieve candidate documents to rank. Here, what we care about is that all possibly relevant documents are retrieved fast, without matching too many documents. If we match too many documents, latency will suffer, as all retrieved docs will be exposed to ranking. For an introduction to phased retrieval in Vespa, see the [docs](https://docs.vespa.ai/en/phased-ranking.html).

For this tutorial, we will evaluate and compare `weakAnd`, `nearestNeighbor`, as well as the combination of the two (using the `OR` operator).

### Evaluate ranking

We will define and compare 4 different ranking strategies in this guide:

1. `bm25` - Keyword-based retrieval and ranking - The solid baseline.
2. `semantic` - Vector search using cosine similarity (using https://huggingface.co/intfloat/e5-small-v2 for embeddings)
3. `fusion` - Hybrid search (semantic + keyword), combining BM25 and semantic with [reciprocal rank fusion](https://docs.vespa.ai/en/phased-ranking.html#cross-hit-normalization-including-reciprocal-rank-fusion)
4. `atan_norm` - Hybrid search, combining BM25 and Semantic with [atan normalization](https://docs.vespa.ai/en/tutorials/hybrid-search.html#hybrid-ranking) as described in Aapo Tanskanen's [Guidebook to the State-of-the-Art Embeddings and Information Retrieval](https://www.linkedin.com/pulse/guidebook-state-of-the-art-embeddings-information-aapo-tanskanen-pc3mf/) (Originally proposed by [Seo et al. (2022)](https://www.mdpi.com/2227-7390/10/8/1335)) 
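To make the two hybrid strategies more concrete, here is a minimal pure-Python sketch of the ideas behind them. It is an illustration only, not how Vespa implements them: the function names, the constant `k=60`, and the exact `2 * atan(x) / pi` scaling are assumptions chosen for the example.

```python
import math


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse several ranked lists of document ids into one score per document.

    Each list contributes 1 / (k + rank) for every document it contains,
    where rank starts at 1. k=60 is an assumed constant for this sketch.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def atan_normalize(score: float) -> float:
    """Map a non-negative, unbounded score (such as BM25) into [0, 1)."""
    return 2.0 * math.atan(score) / math.pi


# Hypothetical rankings from a keyword and a vector retriever
bm25_ranking = ["d1", "d2", "d3"]
semantic_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
best_first = sorted(fused, key=fused.get, reverse=True)
print(best_first)
```

Rank fusion only looks at positions, so it needs no score calibration, while atan normalization keeps the magnitude of the BM25 score but squashes it into the same range as a cosine similarity.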

<div class="alert alert-info">
    Refer to <a href="https://vespa-engine.github.io/pyvespa/troubleshooting.html">troubleshooting</a>
    for any problem when running this guide.
</div>


**Pre-requisite**: Create a tenant at [cloud.vespa.ai](https://cloud.vespa.ai/), save the tenant name.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/pyvespa/blob/master/docs/sphinx/source/evaluating-vespa-application-cloud.ipynb)


## Install

Install [pyvespa](https://pyvespa.readthedocs.io/) >= 0.53.0
and the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html).
The Vespa CLI is used for data and control plane key management ([Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide)).


## Configure application


In [1]:
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "evaluation"
schema_name = "doc"

## Create an application package

The [application package](https://vespa-engine.github.io/pyvespa/api/vespa/package.html)
has all the Vespa configuration files -
create one from scratch:


Note that the name cannot have `-` or `_`.


## Deploy to Vespa Cloud

The app is now defined and ready to deploy to Vespa Cloud.

Deploy `package` to Vespa Cloud, by creating an instance of
[VespaCloud](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#VespaCloud):


For more details on different authentication options and methods, see [authenticating-to-vespa-cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.html).

The following will upload the application package to Vespa Cloud Dev Zone (`aws-us-east-1c`), read more about [Vespa Zones](https://cloud.vespa.ai/en/reference/zones.html).
The Vespa Cloud Dev Zone is considered as a sandbox environment where resources are down-scaled and idle deployments are expired automatically.
For information about production deployments, see the following [method](https://vespa-engine.github.io/pyvespa/api/vespa/deployment#vespa.deployment.VespaCloud.deploy_to_prod).

> Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.


Now deploy the app to Vespa Cloud dev zone.

The first deployment typically takes 2 minutes until the endpoint is up. (Applications that for example refer to large onnx-models may take a bit longer.)


If the deployment failed, it is possible you forgot to add the key in the Vespa Cloud Console in the `vespa auth api-key` step above.

If you can authenticate, you should see lines like the following:

```
 Deployment started in run 1 of dev-aws-us-east-1c for mytenant.hybridsearch.
```

The deployment takes a few minutes the first time while Vespa Cloud sets up the resources for your Vespa application.

`app` now holds a reference to a [Vespa](https://vespa-engine.github.io/pyvespa/api/vespa/application.html#vespa.application.Vespa) instance. We can access the
mTLS-protected endpoint name using the control-plane (`vespa_cloud`) instance. We can query and feed to this endpoint (data plane access) using the
mTLS certificate generated in previous steps.

See [Authenticating to Vespa Cloud](https://vespa-engine.github.io/pyvespa/authenticating-to-vespa-cloud.html) for details on using token authentication instead of mTLS.


## Getting your evaluation data

For evaluating information retrieval methods, in addition to the document corpus, we also need a set of queries and a mapping from queries to relevant documents.

For this guide, we will use the [NanoMSMARCO](https://huggingface.co/datasets/zeta-alpha-ai/NanoMSMARCO) dataset, made available on huggingface by [Zeta Alpha](https://zeta-alpha.com/).

This dataset is a subset of their 🍺[NanoBEIR](https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6)-collection, with 50 queries and up to 10K documents each.

This is a great dataset for testing and evaluating information retrieval methods quickly, as it is small and easy to work with. 

Note that for almost any real-world use case, we recommend creating your own evaluation dataset. See this [Vespa blog post](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) on how you can get help from an LLM for this.

Note that creating 20-50 queries and annotating relevant documents for each query can be a good start and well worth the effort.


In [4]:
from vespa.application import Vespa

app = Vespa("http://localhost:8080")

In [5]:
app.get_application_status()

<Response [200]>

In [6]:
from datasets import load_dataset

dataset_id = "zeta-alpha-ai/NanoMSMARCO"

dataset = load_dataset(dataset_id, "corpus", split="train", streaming=True)
vespa_feed = dataset.map(
    lambda x: {
        "id": x["_id"],
        "fields": {"text": x["text"], "id": x["_id"]},
    }
)

Note that since we are only _evaluating_ rank strategies here, we consider it OK to use the `train` split for evaluation. If we were to make changes to our ranking strategies, such as adding weighting terms or training ML models for ranking, we would suggest adopting a `train`, `validation`, `test` split approach to avoid overfitting.
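If you do start tuning, one simple way to get such a split is to partition the query ids deterministically. The sketch below is only an illustration: the helper name `split_query_ids`, the 60/20/20 proportions, and the seed are assumptions, not something the dataset or pyvespa prescribes.

```python
import random


def split_query_ids(query_ids, seed=42, train=0.6, validation=0.2):
    """Shuffle query ids reproducibly and cut them into three disjoint parts."""
    ids = sorted(query_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * validation)
    return {
        "train": ids[:n_train],
        "validation": ids[n_train : n_train + n_val],
        "test": ids[n_train + n_val :],
    }


# Hypothetical query ids; in this guide they would be the keys of ids_to_query
example_ids = [f"q{i}" for i in range(50)]
splits = split_query_ids(example_ids)
print({name: len(part) for name, part in splits.items()})
```

Tune weights on `train`, pick between variants on `validation`, and report final numbers on `test` only.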

In [7]:
query_ds = load_dataset(dataset_id, "queries", split="train")
qrels = load_dataset(dataset_id, "qrels", split="train")

In [8]:
ids_to_query = dict(zip(query_ds["_id"], query_ds["text"]))

Let us print the first six queries:

In [9]:
for idx, (qid, q) in enumerate(ids_to_query.items()):
    print(f"qid: {qid}, query: {q}")
    if idx == 5:
        break

qid: 994479, query: which health care system provides all citizens or residents with equal access to health care services
qid: 1009388, query: what's right in health care
qid: 1088332, query: weather in oran
qid: 265729, query: how long keep financial records
qid: 1099433, query: how do hoa fees work
qid: 200600, query: heels or heal


In [10]:
relevant_docs = dict(zip(qrels["query-id"], qrels["corpus-id"]))

Let us print the first six query ids and their relevant documents:

In [11]:
for idx, (qid, doc_id) in enumerate(relevant_docs.items()):
    print(f"qid: {qid}, doc_id: {doc_id}")
    if idx == 5:
        break

qid: 994479, doc_id: 7275120
qid: 1009388, doc_id: 7248824
qid: 1088332, doc_id: 7094398
qid: 265729, doc_id: 7369987
qid: 1099433, doc_id: 7255675
qid: 200600, doc_id: 7929603


We can see that this dataset only has one relevant document per query.
The `VespaEvaluator` class handles this just fine, but you could also provide a set of relevant documents per query if there are multiple relevant docs.

```python
# multiple relevant docs per query
qrels = {
    "q1": {"doc1", "doc2"},
    "q2": {"doc3", "doc4"},
    # etc. 
}
```
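Note that `dict(zip(...))` keeps only one document per query id, so it is only appropriate when every query has a single relevant document. For qrels with several rows per query, a grouping step along these lines could be used instead. This is a sketch; the helper name `build_relevant_docs` is hypothetical.

```python
from collections import defaultdict


def build_relevant_docs(query_ids, corpus_ids):
    """Group (query id, document id) pairs into {query_id: {doc_id, ...}}."""
    relevant = defaultdict(set)
    for qid, doc_id in zip(query_ids, corpus_ids):
        relevant[str(qid)].add(str(doc_id))
    return dict(relevant)


# Hypothetical qrels rows where q1 has two relevant documents
example = build_relevant_docs(["q1", "q1", "q2"], ["doc1", "doc2", "doc3"])
print(example)
```

With the dataset in this guide, the same call would be `build_relevant_docs(qrels["query-id"], qrels["corpus-id"])`.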

Now we can feed to Vespa using `feed_iterable`, which accepts any `Iterable` and an optional callback function where we can
check the outcome of each operation. The application is configured to use [embedding](https://docs.vespa.ai/en/embedding.html)
functionality, which produces a vector embedding using a concatenation of the title and the body input fields. This step may be resource intensive, depending on the model size.

Read more about embedding inference in Vespa in the [Accelerating Transformer-based Embedding Retrieval with Vespa](https://blog.vespa.ai/accelerating-transformer-based-embedding-retrieval-with-vespa/)
blog post.

Default node resources in Vespa Cloud have 2 vCPUs for the Dev Zone.


In [12]:
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(vespa_feed, schema="doc", namespace="tutorial", callback=callback)

## Evaluate match-phase

There are two separate classes provided for doing evaluations:
1. `VespaMatchEvaluator`, which is intended to evaluate only the _retrieval_ (or match-phase), and should not do any ranking. This is useful to evaluate whether your relevant documents are retrieved (and thus exposed to ranking). It only computes recall, total matched documents per query as well as `searchtime`.
2. `VespaEvaluator` is intended to evaluate a complete ranking strategy, across several common IR metrics. 

Both APIs are inspired by the [SentenceTransformers](https://www.sbert.net/) [`InformationRetrievalEvaluator`](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator) class.

The difference is that `VespaMatchEvaluator` evaluates only the retrieval phase, while `VespaEvaluator` evaluates your whole retrieval and ranking _system_ (Vespa application) as opposed to a single model. Your application should be fed with the document corpus in advance, instead of the evaluator taking in the document corpus.

We have now created the app, the queries, and the relevant documents. The only thing missing before we can initialize the `VespaMatchEvaluator` is a set of functions that define the Vespa queries. Each of them is passed as `vespa_query_fn`.
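To make clear what "recall" means for the match-phase, here is a small pure-Python sketch. It is not the `VespaMatchEvaluator` implementation; the function name is hypothetical, and the assumption is that recall is the share of relevant documents that appear anywhere among the matched documents, regardless of rank.

```python
def match_recall(retrieved: dict[str, set[str]], relevant: dict[str, set[str]]) -> dict:
    """Compute overall and per-query recall for a set of matched documents.

    Assumes every query in `relevant` has at least one relevant document.
    """
    total_relevant = 0
    total_matched = 0
    per_query = []
    for qid, rel in relevant.items():
        hits = rel & retrieved.get(qid, set())
        total_relevant += len(rel)
        total_matched += len(hits)
        per_query.append(len(hits) / len(rel))
    return {
        "match_recall": total_matched / total_relevant,
        "avg_recall_per_query": sum(per_query) / len(per_query),
        "total_relevant_docs": total_relevant,
        "total_matched_relevant": total_matched,
    }


# Hypothetical example: q1 misses one of its two relevant documents
relevant_example = {"q1": {"a", "b"}, "q2": {"c"}}
retrieved_example = {"q1": {"a", "x"}, "q2": {"c", "y"}}
recall_stats = match_recall(retrieved_example, relevant_example)
print(recall_stats)
```

The two recall numbers differ when queries have different numbers of relevant documents: the first is pooled over all documents, the second is averaged per query.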

We will use the `vespa.querybuilder` module to create the queries. See [reference doc](https://vespa-engine.github.io/pyvespa/api/vespa/querybuilder/builder/builder.html) and [example notebook](https://vespa-engine.github.io/pyvespa/query.html#Using-the-Querybuilder-DSL-API) for more details on usage. 

This module is a Python wrapper around the Vespa Query Language (YQL), which is an alternative to providing the YQL query as a string directly.

In [23]:
import vespa.querybuilder as qb

# 'yql=select * from doc where userQuery()'\
#  'input.query(qt)=embed(moderncolbert, @query)' \
#   'input.query(q)=embed(alibaba_gte_moderncolbert, @query)' \
#  'query=exchanging information by sound'
# def match_weakand_query_fn(query_text: str, top_k: int) -> dict:
#     return {
#         "yql": str(qb.select("*").from_(schema_name).where(qb.userQuery(query_text))),
#         "query": query_text,
#         "ranking": "match-only",
#         "input.query(q)": f"embed({query_text})",
#     }

embedding_field = "embedding_alibaba_gte_modernbert"


def match_hybrid_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field=embedding_field,
                    query_vector="q",
                    annotations={"targetHits": 1000},
                )
                | qb.userQuery(
                    query_text,
                )
            )
        ),
        "query": query_text,
        # "ranking": "match-only",
        "input.query(q)": "embed(alibaba_gte_modernbert, @query)",
        "input.query(qt)": "embed(moderncolbert, @query)",
        "timeout": "20s",
    }


def match_semantic_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field=embedding_field,
                    query_vector="q",
                    annotations={"targetHits": 1000},
                )
            )
        ),
        "query": query_text,
        # "ranking": "match-only",
        "input.query(q)": "embed(alibaba_gte_modernbert, @query)",
        "input.query(qt)": "embed(moderncolbert, @query)",
    }

Now, let us run the evaluator:

In [17]:
from vespa.evaluation import VespaMatchEvaluator


match_results = {}
for evaluator_name, query_fn in [
    ("semantic", match_semantic_query_fn),
    # ("weakand", match_weakand_query_fn),
    ("hybrid", match_hybrid_query_fn),
]:
    print(f"Evaluating {evaluator_name}...")

    match_evaluator = VespaMatchEvaluator(
        queries=ids_to_query,
        relevant_docs=relevant_docs,
        vespa_query_fn=query_fn,
        app=app,
        name="test-run",
        id_field="id",  # specify the id field used in the relevant_docs
        write_csv=True,
        write_verbose=True,  # optionally write verbose metrics to CSV
    )

    results = match_evaluator()
    match_results[evaluator_name] = results
    print(f"Results for {evaluator_name}:")
    print(results)

Evaluating semantic...
Results for semantic:
{'match_recall': 1.0, 'avg_recall_per_query': 1.0, 'total_relevant_docs': 50, 'total_matched_relevant': 50, 'avg_matched_per_query': 1000.0, 'total_queries': 50, 'searchtime_avg': 0.04658, 'searchtime_q50': 0.029, 'searchtime_q90': 0.10160000000000002, 'searchtime_q95': 0.12329999999999998}
Evaluating hybrid...
Results for hybrid:
{'match_recall': 1.0, 'avg_recall_per_query': 1.0, 'total_relevant_docs': 50, 'total_matched_relevant': 50, 'avg_matched_per_query': 1581.42, 'total_queries': 50, 'searchtime_avg': 0.036680000000000004, 'searchtime_q50': 0.0345, 'searchtime_q90': 0.05840000000000001, 'searchtime_q95': 0.06409999999999999}


By setting `write_csv=True` and `write_verbose=True`, we can save a CSV file with per-query results to further inspect the queries whose relevant documents were not matched.
This is important to understand how you could improve recall if some relevant documents were not matched.  

In [None]:
import pandas as pd

results = pd.DataFrame(match_results)
results

Here, we can see that all retrieval strategies actually match all the relevant documents.
To tune the number of documents retrieved and the latency, we could tune the `targetHits` parameter for both the `nearestNeighbor` operator and the `weakAnd` operator (our `userQuery` is converted to `weakAnd`, see [docs](https://docs.vespa.ai/en/reference/query-language-reference.html)), as well as several additional `weakAnd` parameters. See the Vespa [blog](https://blog.vespa.ai/tripling-the-query-performance-of-lexical-search/) for details. We will not go into the details of this in this notebook.
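As a rough illustration of such tuning, the sketch below builds one query body per `targetHits` value using a plain YQL string, so each variant could be passed to the evaluator in turn. The helper name is hypothetical, the field name `embedding` and query tensor `q` follow the ranking query functions in this guide, and `embed(@query)` assumes the application has a single embedder configured.

```python
def hybrid_match_query(
    query_text: str,
    target_hits: int,
    field: str = "embedding",
    schema: str = "doc",
) -> dict:
    """Build a hybrid (nearestNeighbor OR userQuery) request body."""
    yql = (
        f"select * from {schema} where "
        f"({{targetHits:{target_hits}}}nearestNeighbor({field}, q)) or userQuery()"
    )
    return {
        "yql": yql,
        "query": query_text,
        # Assumption: a single embedder is configured, so no embedder id is needed
        "input.query(q)": "embed(@query)",
    }


# One request body per targetHits value to compare recall against matched documents
sweep = {th: hybrid_match_query("weather in oran", th) for th in (10, 100, 1000)}
print(sweep[100]["yql"])
```

Lower `targetHits` values typically reduce the number of matched documents and the latency, at the risk of losing relevant documents, which is exactly the trade-off the match evaluator lets you measure.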

## Evaluate ranking

Now, we will move on to demonstrate how to evaluate the ranking strategies. For that, we will use the `VespaEvaluator` class. Its interface is very similar to `VespaMatchEvaluator`, with the difference that it has many more metrics available. Also note that the number of `hits` will affect the number of documents considered for evaluation.
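For readers less familiar with ranking metrics, here is a minimal sketch of two of them, reciprocal rank and nDCG at a cutoff `k`, for a single query with binary relevance. The function names are hypothetical and this is not the `VespaEvaluator` implementation, which may differ in details such as graded relevance handling.

```python
import math


def reciprocal_rank_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Return 1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking where the only relevant document is in second place
ranked = ["d7", "d2", "d9"]
rr = reciprocal_rank_at_k(ranked, {"d2"}, k=10)
ndcg = ndcg_at_k(ranked, {"d2"}, k=10)
print(rr, ndcg)
```

Averaging the reciprocal rank over all queries gives MRR@k. Because both metrics only look at the top `k` hits, a relevant document returned below the cutoff counts the same as one that was never retrieved.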

## VespaEvaluator


Let us take a look at its API and documentation:

In [19]:
from vespa.evaluation import VespaEvaluator

?VespaEvaluator

Init signature:
VespaEvaluator(
    queries: 'Dict[str, str]',
    relevant_docs: 'Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]',
    vespa_query_fn: 'Callable[[str, int, Optional[str]], dict]',
    app: 'Vespa',
    name: 'str' = '',
    id_field: 'str' = '',
    accuracy_at_k: 'List[int]' = [1, 3, 5, 10],
    precision_recall_at_k: 'List[int]' = [1, 3, 5, 10],
    mrr_at_k: 'List[int]' = [10],
    ndcg_at_k: 'List[int]' = [10],
    map_at_k: 'List[int]' = [100],
    write_csv: 'bool' = False,
    csv_dir: 'Optional[str]' = None,
)
Docstring:
Evaluate retrieval performance on a Vespa application.

This class:

- Iterates over queries and iss

In [None]:
def semantic_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
            )
        ),
        "query": query_text,
        "ranking": "semantic",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
    }


def bm25_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": "select * from sources * where userQuery();",  # provide the yql directly as a string
        "query": query_text,
        "ranking": "bm25",
        "hits": top_k,
    }


def fusion_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
                | qb.userQuery(query_text)
            )
        ),
        "query": query_text,
        "ranking": "fusion",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
    }


def atan_norm_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 100},
                )
                | qb.userQuery(query_text)
            )
        ),
        "query": query_text,
        "ranking": "atan_norm",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
    }

In [24]:
all_results = {}
for evaluator_name, query_fn in [
    # ("semantic", semantic_query_fn),
    # ("bm25", bm25_query_fn),
    # ("fusion", fusion_query_fn),
    # ("atan_norm", atan_norm_query_fn),
    ("default", match_hybrid_query_fn),
]:
    print(f"Evaluating {evaluator_name}...")
    evaluator = VespaEvaluator(
        queries=ids_to_query,
        relevant_docs=relevant_docs,
        vespa_query_fn=query_fn,
        app=app,
        name=evaluator_name,
        write_csv=True,  # optionally write metrics to CSV
    )

    results = evaluator.run()
    all_results[evaluator_name] = results

Evaluating default...


RetryError: RetryError[<Future at 0x1188b11c0 state=finished raised ReadTimeout>]

### Looking at the results

In [None]:
results = pd.DataFrame(all_results)

In [None]:
# take out all rows with "searchtime" to a separate dataframe
searchtime = results[results.index.str.contains("searchtime")]
results = results[~results.index.str.contains("searchtime")]


# Highlight the maximum value in each row
def highlight_max(s):
    is_max = s == s.max()
    return ["background-color: lightgreen; color: black;" if v else "" for v in is_max]


# Style the DataFrame: Highlight max values and format numbers to 4 decimals
styled_df = results.style.apply(highlight_max, axis=1).format("{:.4f}")
styled_df

We can see that for this particular dataset, the hybrid strategy `atan_norm` is the best across all metrics.

In [None]:
results.plot(kind="bar", figsize=(12, 6))

### Looking at searchtimes

Ranking quality is not the only thing that matters. For many applications, search time is equally important. 



In [None]:
# plot search time, add (ms) to the y-axis
# convert to ms
searchtime = searchtime * 1000
searchtime.plot(kind="bar", figsize=(12, 6)).set(ylabel="time (ms)")

We can see that both hybrid strategies, `fusion` and `atan_norm`, are a bit slower on average than pure `bm25` or `semantic`, as expected.

Depending on the latency budget of your application, this is likely still an attractive trade-off.

## Conclusion and next steps

We have shown how you can evaluate a Vespa application on two different levels.
1. Evaluate retrieval (match-phase) using the `VespaMatchEvaluator` class. Here we checked recall, the number of matched documents, and search time.
2. Evaluate ranking strategies using the `VespaEvaluator` class. We have defined and compared 4 different ranking strategies in terms of both ranking quality and searchtime latency.

We hope this can provide you with a good starting point for evaluating your own Vespa application. 

If you are ready to go further, you can try to optimize the ranking strategies, for example by weighting each of the terms in the `atan_norm` strategy differently (`a * normalize_linear(normalized_bm25) + (1-a) * normalize_linear(cos_sim)`), or by adding a [cross-encoder](https://vespa-engine.github.io/pyvespa/examples/cross-encoders-for-global-reranking.html) for re-ranking the top-k results.
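The effect of the weight `a` can be explored offline before changing the rank profile. The following is a pure-Python sketch under stated assumptions: `normalize_linear` here is a stand-in that applies min-max normalization over the candidate list, and the scores are made up for illustration.

```python
def normalize_linear(values: list[float]) -> list[float]:
    """Min-max normalize scores to [0, 1]; stand-in for the rank-profile function."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_hybrid(bm25_scores: list[float], cos_sims: list[float], a: float) -> list[float]:
    """Combine normalized BM25 and cosine similarity with weight a on BM25."""
    bm25_norm = normalize_linear(bm25_scores)
    cos_norm = normalize_linear(cos_sims)
    return [a * b + (1 - a) * c for b, c in zip(bm25_norm, cos_norm)]


# Hypothetical scores for three candidate documents
bm25_scores = [2.0, 6.0, 4.0]
cos_sims = [1.0, 0.0, 0.5]

combined = weighted_hybrid(bm25_scores, cos_sims, a=0.25)
print(combined)
```

Sweeping `a` over a small grid and re-running the evaluator for each value is a simple way to pick the weight, ideally on a validation split rather than on the queries used for the final comparison.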


## Cleanup

In [None]:
vespa_cloud.delete()