<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>


# arXiv AI-powered search

This notebook demonstrates how to load an arXiv dataset hosted on [HF datasets](https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv) 
and feed it to a Vespa instance. The dataset comprises English-language arXiv papers from the Cornell/arXiv dataset, with two new columns added: title-embeddings and abstract-embeddings. The embeddings were generated using the [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) embedding model. 

In this notebook, we use Vespa's embedder functionality to include the [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) embedding
model in Vespa for query serving, so that query text is embedded inside Vespa. 

In [None]:
!pip3 install -U pyvespa 

## Defining the Vespa application

[PyVespa](https://pyvespa.readthedocs.io/en/latest/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). 
A Vespa application package consists of configuration files, schemas, models, and code (plugins).   

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their types. The schema mirrors
the dataset features:

In [93]:
from vespa.package import Schema, Document, Field, FieldSet, HNSW
paper_schema = Schema(
            name="paper",
            mode="index",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary", "index"], match=["word"]),
                    Field(name="submitter", type="string", indexing=["summary", "index"]),
                    Field(name="authors", type="string", indexing=["summary", "index"]),
                    Field(name="title", type="string", indexing=["summary", "index"], index="enable-bm25"),
                    Field(name="abstract", type="string", indexing=["summary", "index"], index="enable-bm25"),
                    Field(name="journal_ref", type="string", indexing=["summary", "index"]),
                    Field(name="doi", type="string", indexing=["summary", "index"]),
                    Field(name="categories", type="array<string>", indexing=["summary", "index"], match=["word"]),
                    Field(name="title_embedding", type="tensor<bfloat16>(x[384])",
                        indexing=["attribute", "index"],
                        ann=HNSW(distance_metric="angular")
                    ),
                    Field(name="abstract_embedding", type="tensor<bfloat16>(x[384])",
                        indexing=["attribute", "index"],
                        ann=HNSW(distance_metric="angular")
                    ),
                ],
            ),
            fieldsets=[
                FieldSet(name="default", fields=["title", "abstract", "authors", "submitter"])
            ]
)
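
Both embedding fields use the `angular` distance metric, and the rank profiles defined later score them with `closeness(field, ...)`. As a minimal sketch, assuming the angular distance is the angle between the two vectors and that closeness is computed as `1 / (1 + distance)`, the relationship can be illustrated in plain Python. The helper names and the toy vectors are illustrative only and are not part of pyvespa:

```python
import math

def angular_distance(a, b):
    """Angle in radians between two vectors (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, cosine)))

def closeness(a, b):
    """Map a distance to a score in (0, 1]; identical directions give 1.0."""
    return 1.0 / (1.0 + angular_distance(a, b))

# Toy 2-dimensional vectors; the real fields have 384 dimensions.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(closeness(query, same_direction))
print(closeness(query, orthogonal))
```

Because only the angle matters, vector length does not affect the score under this metric.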

In [94]:
from vespa.package import ApplicationPackage, Component, Parameter

vespa_app_name = "arxivsearch"
vespa_application_package = ApplicationPackage(
        name=vespa_app_name,
        schema=[paper_schema],
        components=[Component(id="bge", type="hugging-face-embedder",
            parameters=[
                Parameter("transformer-model", {"url": "https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/onnx/model.onnx"}),
                Parameter("tokenizer-model", {"url": "https://huggingface.co/Xenova/bge-small-en-v1.5/raw/main/tokenizer.json"}),
                Parameter("pooling-strategy", args=dict(), children="cls")
            ]
        )]
) 

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding rank profiles to the schema. 

Vespa supports [phased ranking](https://docs.vespa.ai/en/phased-ranking.html) and has a rich set of built-in [rank-features](https://docs.vespa.ai/en/reference/rank-features.html), including many
text-matching features such as:

- [BM25](https://docs.vespa.ai/en/reference/bm25.html).
- [nativeRank](https://docs.vespa.ai/en/reference/nativerank.html) and many more. 

Users can also define custom functions using [ranking expressions](https://docs.vespa.ai/en/reference/ranking-expressions.html). 
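
To make the BM25 text-matching feature listed above more concrete, here is a rough sketch of the standard Okapi BM25 formula for a single field. It assumes the commonly used defaults `k1=1.2` and `b=0.75`; the corpus statistics are hypothetical, and Vespa's own implementation may differ in details such as tokenization and how field lengths are tracked:

```python
import math

def bm25_term(term_freq, field_len, avg_field_len, docs_total, docs_with_term,
              k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term for one field."""
    if term_freq == 0:
        return 0.0
    idf = math.log(1 + (docs_total - docs_with_term + 0.5) / (docs_with_term + 0.5))
    norm = term_freq + k1 * (1 - b + b * field_len / avg_field_len)
    return idf * term_freq * (k1 + 1) / norm

def bm25(query_terms, field_tokens, avg_field_len, docs_total, doc_freq):
    """Sum the per-term contributions over all query terms."""
    return sum(
        bm25_term(field_tokens.count(term), len(field_tokens), avg_field_len,
                  docs_total, doc_freq.get(term, 0))
        for term in query_terms
    )

title = "a model for dark matter halos".split()
# Hypothetical document frequencies, chosen only for illustration.
doc_freq = {"dark": 50, "matter": 80, "model": 400}
score = bm25(["dark", "matter", "fluid"], title, avg_field_len=8.0,
             docs_total=10_000, doc_freq=doc_freq)
print(round(score, 3))
```

Terms that do not occur in the field contribute nothing, and rarer terms contribute more than common ones.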

The following defines a `hybrid` Vespa ranking profile and a plain `bm25` profile.

In [101]:
from vespa.package import RankProfile, FirstPhaseRanking, GlobalPhaseRanking

bm25 = RankProfile(
    name="bm25", 
    inputs=[("query(q)", "tensor<float>(x[384])")],
    
    first_phase=FirstPhaseRanking(
        expression="bm25(title) + bm25(abstract)",
    )
)

hybrid = RankProfile(
    name="hybrid", 
    inputs=[("query(q)", "tensor<float>(x[384])")],
    first_phase=FirstPhaseRanking(
        expression="closeness(field, title_embedding) + closeness(field, abstract_embedding)"
    ),
    global_phase=GlobalPhaseRanking(
        expression="reciprocal_rank_fusion(closeness(field,title_embedding), bm25(title), bm25(abstract), closeness(field,abstract_embedding))"
    ),
    match_features=["bm25(title)", "bm25(abstract)", "closeness(field, title_embedding)", "closeness(field, abstract_embedding)"]
)
paper_schema.add_rank_profile(bm25)
paper_schema.add_rank_profile(hybrid)
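
The `hybrid` profile fuses its four features in the global phase by rank rather than by raw score, which matters because closeness is bounded while BM25 is not. The following is a simplified sketch of reciprocal rank fusion, assuming the usual constant `k=60` (to our understanding also the Vespa default). The function and the toy scores are illustrative and do not reproduce Vespa's implementation:

```python
def reciprocal_rank_fusion(score_lists, k=60):
    """Fuse several {doc_id: score} dicts into one RRF score per document.

    Each scorer ranks the documents by descending score. A document's fused
    score is the sum of 1 / (k + rank) over all scorers, with rank starting at 1.
    """
    fused = {}
    for scores in score_lists:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

# Hypothetical scores for three documents from two scorers.
title_closeness = {"a": 0.66, "b": 0.63, "c": 0.40}
title_bm25 = {"a": 0.0, "b": 12.5, "c": 3.1}

fused = reciprocal_rank_fusion([title_closeness, title_bm25])
best = max(fused, key=fused.get)
print(best, fused)
```

Document `b` wins here because it ranks well under both scorers, even though `a` has the highest closeness.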

## Deploy the application to Vespa Cloud

With the configured application, we can deploy it to [Vespa Cloud](https://cloud.vespa.ai/en/). 
It is also possible to deploy the app using Docker; see the [Hybrid Search - Quickstart](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html) guide for
an example of deploying it to a local Docker container. 

Install the Vespa CLI using [Homebrew](https://brew.sh/), or download a binary from GitHub as demonstrated below. 

In [None]:
!brew install vespa-cli

Alternatively, if running in Colab, download the Vespa CLI:

In [None]:
import os
import requests
res = requests.get(url="https://api.github.com/repos/vespa-engine/vespa/releases/latest").json()
os.environ["VERSION"] = res["tag_name"].lstrip("v")  # strip only the leading "v" of the release tag
!curl -fsSL https://github.com/vespa-engine/vespa/releases/download/v${VERSION}/vespa-cli_${VERSION}_linux_amd64.tar.gz | tar -zxf -
!ln -sf /content/vespa-cli_${VERSION}_linux_amd64/bin/vespa /bin/vespa

To deploy the application to Vespa Cloud, we first need a tenant:

Create a tenant at [console.vespa-cloud.com](https://console.vespa-cloud.com/) (unless you already have one). 
This step requires a Google or GitHub account and will start your [free trial](https://cloud.vespa.ai/en/free-trial). 
Make a note of the tenant name; it is used in the next steps.

### Configure Vespa Cloud data-plane security

Create a Vespa Cloud data-plane mTLS certificate/key pair. The certificate and private key are used to authenticate against your Vespa Cloud endpoints. See the [Vespa Cloud Security Guide](https://cloud.vespa.ai/en/security/guide) for details.

We save the paths to the credentials so they can also be used later for data-plane access without the pyvespa APIs. 

In [None]:
import os

os.environ["TENANT_NAME"] = "vespa-team" # Replace with your tenant name

vespa_cli_command = f'vespa config set application {os.environ["TENANT_NAME"]}.{vespa_app_name}'

!vespa config set target cloud
!{vespa_cli_command}
!vespa auth cert -N 

Validate that we have the expected data-plane credential files:

In [7]:
from os.path import exists
from pathlib import Path

cert_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-public-cert.pem"
key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.{vespa_app_name}.default/data-plane-private-key.pem"

if not exists(cert_path) or not exists(key_path):
    print("ERROR: set the correct paths to security credentials. Correct paths above and rerun until you do not see this error")
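
If the check above fails, it helps to know which file is missing. The sketch below wraps the same path layout in a small helper; `missing_credentials` and its `vespa_home` parameter are hypothetical names introduced for illustration, and the `default` instance name is taken from the paths used above:

```python
from pathlib import Path

def missing_credentials(tenant, application, instance="default", vespa_home=None):
    """Return the expected data-plane credential files that do not exist."""
    home = Path(vespa_home) if vespa_home else Path.home() / ".vespa"
    app_dir = home / f"{tenant}.{application}.{instance}"
    expected = [
        app_dir / "data-plane-public-cert.pem",
        app_dir / "data-plane-private-key.pem",
    ]
    return [path for path in expected if not path.exists()]

# Prints nothing when both files are present.
for path in missing_credentials("vespa-team", "arxivsearch"):
    print(f"Missing credential file: {path}")
```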

Note that the Vespa Cloud deploy call below adds `data-plane-public-cert.pem` to the application package before deploying it to Vespa Cloud.
You hold both the private key and the public certificate, while Vespa Cloud only knows the public certificate. 

### Configure Vespa Cloud control-plane security 

Authenticate to generate a tenant-level control-plane API key for deploying applications to Vespa Cloud, and save the path to it. 

The generated tenant API key must be added in the Vespa Console before attempting to deploy the application. 

```
To use this key in Vespa Cloud click 'Add custom key' at
https://console.vespa-cloud.com/tenant/TENANT_NAME/account/keys
and paste the entire public key including the BEGIN and END lines.
```

In [None]:
!vespa auth api-key

from pathlib import Path
api_key_path = Path.home() / ".vespa" / f"{os.environ['TENANT_NAME']}.api-key.pem"

### Deploy to Vespa Cloud

Now that we have data-plane and control-plane credentials ready, we can deploy our application to Vespa Cloud! 

`PyVespa` supports deploying apps to the [development zone](https://cloud.vespa.ai/en/reference/environments#dev-and-perf).

>Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.

In [103]:
from vespa.deployment import VespaCloud

def read_secret():
    """Read the API key from the environment variable. This is 
    only used for CI/CD purposes."""
    t = os.getenv("VESPA_TEAM_API_KEY")
    if t:
        return t.replace(r"\n", "\n")
    else:
        return t

vespa_cloud = VespaCloud(
    tenant=os.environ["TENANT_NAME"],
    application=vespa_app_name,
    key_content=read_secret() or None,
    key_location=api_key_path,
    application_package=vespa_application_package)
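
The `replace(r"\n", "\n")` call in `read_secret()` is easy to misread. A plausible reason for it, shown in the sketch below, is that a CI system stores the multi-line PEM key on a single line with literal backslash-n sequences, which must be turned back into real line breaks. The key text is a placeholder, not a real key:

```python
# Illustrative placeholder, not a real key.
pem = "-----BEGIN PRIVATE KEY-----\nABC123\n-----END PRIVATE KEY-----\n"

# A CI system may store the key on one line with literal backslash-n sequences.
stored = pem.replace("\n", r"\n")

# The same substitution read_secret() applies to restore real line breaks.
restored = stored.replace(r"\n", "\n")

print("\n" in stored)    # whether the stored form still has real line breaks
print(restored == pem)   # whether the round trip recovers the original key
```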

Now deploy the app to the Vespa Cloud dev zone. 

The first deployment typically takes about 2 minutes until the endpoint is up. 

In [104]:
from vespa.application import Vespa
app:Vespa = vespa_cloud.deploy()

Deployment started in run 7 of dev-aws-us-east-1c for samples.arxivsearch. This may take a few minutes the first time.
INFO    [12:01:11]  Deploying platform version 8.284.4 and application dev build 7 for dev-aws-us-east-1c of default ...
INFO    [12:01:11]  Using CA signed certificate version 0
INFO    [12:01:12]  Using 1 nodes in container cluster 'arxivsearch_container'
INFO    [12:01:13]  Using 1 nodes in container cluster 'arxivsearch_container'
INFO    [12:01:15]  Deployment successful.
INFO    [12:01:15]  Session 247 for tenant 'samples' prepared and activated.
INFO    [12:01:15]  ######## Details for all nodes ########
INFO    [12:01:15]  h90001f.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [12:01:15]  --- platform vespa/cloud-tenant-rhel8:8.284.4
INFO    [12:01:15]  --- logserver-container on port 4080 has config generation 247, wanted is 247
INFO    [12:01:15]  --- metricsproxy-container on port 19092 has config generation 247, wanted is 247

## Index the dataset

The following streams the Hugging Face dataset into the Vespa instance. Notice how the dataset fields are mapped to the Vespa feed
format. 

In [105]:
# app:Vespa = vespa_cloud.deploy()

from datasets import load_dataset
dataset = load_dataset("somewheresystems/dataclysm-arxiv", split="train", streaming=True)
vespa_feed = dataset.map(lambda x: 
{
    "id": x["id"],
    "fields" : {
        "id": x["id"],
        "title": x["title"],
        "abstract": x["abstract"],
        "title_embedding": x["title_embedding"],
        "abstract_embedding": x["abstract_embedding"],
        "journal_ref": x.get("journal-ref",None),
        "doi": x.get("doi",None),
        "categories": x["categories"],
        "authors": x["authors"],
        "submitter": x["submitter"]
    }
})
from vespa.io import VespaResponse

def callback(response:VespaResponse, id:str):
    if not response.is_successful():
        print(f"Document {id} failed to feed with status code {response.status_code}, url={response.url} response={response.json}")

app.feed_iterable(schema="paper", iter=vespa_feed, callback=callback, max_connections=12, max_workers=64, max_queue_size=10000)
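
To inspect the feed format without a running Vespa instance, the mapping can also be written as a named function and applied to a single record. This is a sketch with a made-up record: the field values are hypothetical and the embeddings are shortened to three dimensions instead of 384:

```python
def to_vespa_feed(record):
    """Map one dataset record to the {"id": ..., "fields": {...}} feed format."""
    return {
        "id": record["id"],
        "fields": {
            "id": record["id"],
            "title": record["title"],
            "abstract": record["abstract"],
            "title_embedding": record["title_embedding"],
            "abstract_embedding": record["abstract_embedding"],
            # The dataset column is named "journal-ref"; the Vespa field uses an underscore.
            "journal_ref": record.get("journal-ref", None),
            "doi": record.get("doi", None),
            "categories": record["categories"],
            "authors": record["authors"],
            "submitter": record["submitter"],
        },
    }

# Hypothetical record for illustration only.
sample = {
    "id": "0000.0000",
    "title": "An example title",
    "abstract": "An example abstract.",
    "title_embedding": [0.1, 0.2, 0.3],
    "abstract_embedding": [0.3, 0.2, 0.1],
    "journal-ref": "J. Example 1 (2008)",
    "categories": ["astro-ph"],
    "authors": "A. Author",
    "submitter": "A. Author",
}

doc = to_vespa_feed(sample)
print(sorted(doc["fields"]))
```

Optional columns that are absent from a record, such as `doi` in this sample, are mapped to `None`.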


## Querying data

Now we can start querying the arXiv papers. 

The query request uses the Vespa Query API, and the `Vespa.query()` function 
supports passing any of the Vespa Query API parameters. 

Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)


In [109]:
from vespa.io import VespaQueryResponse
import json

response:VespaQueryResponse = app.query(
    yql="select title, id from paper where ({targetHits:10}nearestNeighbor(title_embedding,q)) or ({targetHits:10}nearestNeighbor(abstract_embedding,q))",
    ranking="hybrid",
    query="dark matter field fluid model",
    body={
        "presentation.format.tensors": "short-value",
        "input.query(q)": "embed(bge, \"dark matter field fluid model\")",
    }
)
assert response.is_successful()
print(json.dumps(response.hits[0:2], indent=2))

[
  {
    "id": "index:arxivsearch_content/0/cfdff72f28cffdb0b73f6026",
    "relevance": 0.06384129063829451,
    "source": "arxivsearch_content",
    "fields": {
      "matchfeatures": {
        "bm25(abstract)": 0.0,
        "bm25(title)": 0.0,
        "closeness(field,abstract_embedding)": 0.6178772298066597,
        "closeness(field,title_embedding)": 0.6288338602029975
      },
      "id": "0812.3122",
      "title": "Cosmological constraints on unifying Dark Fluid models"
    }
  },
  {
    "id": "index:arxivsearch_content/0/c77e9d766bd90c894a5d0481",
    "relevance": 0.06198484047241319,
    "source": "arxivsearch_content",
    "fields": {
      "matchfeatures": {
        "bm25(abstract)": 0.0,
        "bm25(title)": 0.0,
        "closeness(field,abstract_embedding)": 0.5754037589718138,
        "closeness(field,title_embedding)": 0.6644048114912198
      },
      "id": "0711.0466",
      "title": "A Model for Dark Matter Halos"
    }
  }
]


In [108]:


response:VespaQueryResponse = app.query(
    yql="select title, id from paper where userQuery()",
    ranking="bm25",
    query="dark matter field fluid model",
)
assert response.is_successful()
print(json.dumps(response.hits[0:2], indent=2))

[
  {
    "id": "index:arxivsearch_content/0/cfdff72f28cffdb0b73f6026",
    "relevance": 31.398304828681407,
    "source": "arxivsearch_content",
    "fields": {
      "id": "0812.3122",
      "title": "Cosmological constraints on unifying Dark Fluid models"
    }
  },
  {
    "id": "index:arxivsearch_content/0/6033639d686a018894cdd4ec",
    "relevance": 30.574650705468287,
    "source": "arxivsearch_content",
    "fields": {
      "id": "0812.3611",
      "title": "Dark Energy vs. Dark Matter: Towards a Unifying Scalar Field?"
    }
  }
]
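
In the hybrid query above, the `where` clause contains only `nearestNeighbor` operators, which is presumably why both BM25 match features are 0.0 in its output: no text matching takes place. The sketch below builds the same request parameters for arbitrary query text and can optionally add `userQuery()` to the clause. `hybrid_query` and its parameters are hypothetical helpers, and text containing double quotes is rejected because escaping inside `embed(...)` is not covered here:

```python
def hybrid_query(text, target_hits=10, embedder="bge", include_text_match=False):
    """Build keyword arguments for app.query() using the hybrid rank profile."""
    if '"' in text:
        raise ValueError("double quotes are not handled in this sketch")
    yql = (
        "select title, id from paper where "
        f"({{targetHits:{target_hits}}}nearestNeighbor(title_embedding,q)) or "
        f"({{targetHits:{target_hits}}}nearestNeighbor(abstract_embedding,q))"
    )
    if include_text_match:
        # Also retrieve by text so the bm25 features can be non-zero.
        yql += " or userQuery()"
    return {
        "yql": yql,
        "ranking": "hybrid",
        "query": text,
        "body": {
            "presentation.format.tensors": "short-value",
            "input.query(q)": f'embed({embedder}, "{text}")',
        },
    }

params = hybrid_query("dark matter field fluid model")
print(params["yql"])
```

With a deployed application, the result could be passed on as `app.query(**params)`.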


## Summary

This notebook demonstrated how to stream a Hugging Face dataset into Vespa, configure an embedding model in Vespa for query serving, and query the indexed papers. 

In [None]:
vespa_cloud.delete()