# Build end-to-end Vespa apps with pyvespa

> Python API to create, modify, deploy and interact with Vespa applications

`pyvespa` provides a Python API to [vespa.ai](https://vespa.ai). It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and ML experimentation.

This tutorial will create a text search application from scratch based on the MS MARCO dataset, similar to our [text search tutorials](https://docs.vespa.ai/documentation/tutorials/text-search.html). We will first show how to define the app by creating an application package [REF]. Then we deploy the app to Vespa Cloud. Once the app is up and running, we show how to feed data to it. After the data is sent, we can make queries and inspect the results. We then show how to add a new rank profile to the application package and redeploy the app with the latest changes. Finally, we show how to evaluate and compare two rank profiles with evaluation metrics such as Recall and Reciprocal Rank.

## Application package API

We first create a `Document` instance containing the `Field`s that we want to store in the app. In this case we will keep the application simple and only feed a unique `id`, `title` and `body` of the MS MARCO documents.

In [2]:
from vespa.package import Document, Field

document = Document(
    fields=[
        Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
        Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
        Field(name = "body", type = "string", indexing = ["index", "summary"], index = "enable-bm25")        
    ]
)

The complete `Schema` of our application will be named `msmarco` and contains the `Document` instance that we defined above. The default `FieldSet` indicates that queries will look for matches by searching both the titles and the bodies of the documents. The default `RankProfile` indicates that all the matched documents will be ranked by the `nativeRank` expression involving the title and the body of the matched documents.

In [3]:
from vespa.package import Schema, FieldSet, RankProfile

msmarco_schema = Schema(
    name = "msmarco", 
    document = document, 
    fieldsets = [FieldSet(name = "default", fields = ["title", "body"])],
    rank_profiles = [RankProfile(name = "default", first_phase = "nativeRank(title, body)")]
)

Once the `Schema` is defined, all we have to do is to create our msmarco `ApplicationPackage`:

In [4]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name = "msmarco", schema=msmarco_schema)

At this point, `app_package` contains all the relevant information required to create our MS MARCO text search app. We now need to deploy it.

## Deploy to Vespa Cloud

This tutorial shows how to deploy the application package to [Vespa Cloud](https://cloud.vespa.ai/). For the following to work, you need to sign up for Vespa Cloud, register an application name there and generate your user API key on the Vespa Cloud console.

We first create a `VespaCloud` context named `cloud` that will handle the secure communication with Vespa Cloud servers. In order to do that, all we need is your Vespa Cloud tenant name, the application name that you registered and the user key you generated on the Vespa Cloud console:

**Note:** The first call to `cloud.deploy` takes around 15 minutes, as Vespa Cloud has to set up the environment. Subsequent calls will be much faster, usually taking less than 10 seconds.

In [None]:
from vespa.package import VespaCloud

with VespaCloud("vespa-team", "ms-marco", "/Users/tmartins/sample_application/tmartins.vespa-team.pem") as cloud:
    vespa = cloud.deploy('from-notebook', app_package)

In [5]:
from vespa.package import VespaCloud

vespa_cloud = VespaCloud(
    "vespa-team", 
    "ms-marco", 
    "/Users/tmartins/sample_application/tmartins.vespa-team.pem", 
    app_package
)
app = vespa_cloud.deploy('from-notebook', "/Users/tmartins/sample_application")

Deployment started in run 12 of dev-aws-us-east-1c for vespa-team.ms-marco.from-notebook. This may take about 15 minutes the first time.
INFO    [10:37:04]  Deploying platform version 7.278.21 and application version unknown ...
INFO    [10:37:05]  No services requiring restart.
INFO    [10:37:05]  Deployment successful.
INFO    [10:37:05]  Session 13751 for tenant 'vespa-team' prepared and activated.
INFO    [10:37:06]  ######## Details for all nodes ########
INFO    [10:37:06]  h711a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [10:37:06]  --- platform docker.ouroath.com:4443/vespa/centos-tenant:7.278.21
INFO    [10:37:06]  --- container on port 4080 has not started 
INFO    [10:37:06]  h712a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP
INFO    [10:37:06]  --- platform docker.ouroath.com:4443/vespa/centos-tenant:7.278.21
INFO    [10:37:06]  --- logserver-container on port 4080 has config generation 13751, wanted is 13751
INFO    [

The `app` variable above holds a `Vespa` instance that will be used to connect to and interact with our text search application. We can see the deployment message returned by the Vespa engine:

In [None]:
app.__class__

In [None]:
app.deployment_message

## Feed data to the app 

We now have our text search app up and running. We can start to feed data to it. We have pre-processed and sampled some MS MARCO data to use in this tutorial. We can load 996 documents that we want to feed and check the first two documents in this sample.

In [None]:
from pandas import read_csv

docs = read_csv("https://thigm85.github.io/data/msmarco/docs.tsv", sep = "\t")
docs.shape

In [None]:
docs.head(2)

To feed the data we need to specify the `schema` that we are sending data to. We named our schema `msmarco` in a previous section. Each data point needs to have a unique `data_id` associated with it, regardless of whether the schema has an `id` field. The `fields` argument should be a dict containing all the fields in the schema, which are `id`, `title` and `body` in our case.

In [None]:
app.feed_data_point(
    schema = "msmarco", 
    data_id = "test", 
    fields = {
        "id": "test", 
        "title": "this is a test title", 
        "body": "this is test body"
    }
)

In [None]:
for idx, row in docs.iterrows():
    print(idx)
    response = app.feed_data_point(
        schema = "msmarco", 
        data_id = str(row["id"]), 
        fields = {
            "id": str(row["id"]), 
            "title": str(row["title"]), 
            "body": str(row["body"])
        }
    )

Each call to the method `feed_data_point` sends a POST request to the appropriate Vespa endpoint and we can check the response of the requests if needed, such as the status code and the message returned.

In [None]:
response.status_code

In [None]:
response.json()
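
When feeding many documents, it can be more useful to record which ones were not accepted than to print every index. The sketch below shows one possible way to do that. `feed_all` and `fake_feed` are hypothetical names introduced here for illustration and are not part of pyvespa. The only assumption is that the feed function returns an object with a `status_code` attribute, as `feed_data_point` does above, and that 200 signals success.

```python
from types import SimpleNamespace


def feed_all(rows, feed_fn, schema="msmarco"):
    """Feed every row with feed_fn and return the ids whose response was not HTTP 200."""
    failed = []
    for row in rows:
        response = feed_fn(
            schema=schema,
            data_id=str(row["id"]),
            fields={key: str(value) for key, value in row.items()},
        )
        if response.status_code != 200:
            failed.append(str(row["id"]))
    return failed


def fake_feed(schema, data_id, fields):
    """Stand-in for app.feed_data_point (hypothetical): rejects documents with an empty body."""
    return SimpleNamespace(status_code=200 if fields["body"] else 400)


# Made-up rows with the same three fields as the schema
rows = [
    {"id": "D1", "title": "first title", "body": "first body"},
    {"id": "D2", "title": "second title", "body": ""},
]
failed_ids = feed_all(rows, fake_feed)
print(failed_ids)
```

With the deployed app, a call along the lines of `feed_all(docs.to_dict("records"), app.feed_data_point)` should behave similarly, assuming `docs` has exactly the `id`, `title` and `body` columns.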

## Make a simple query

Once our application is fed, we can start to use it by sending queries to it. The MS MARCO app expects to receive questions as queries, and the goal of the application is to return documents that are relevant to those questions.

In the example below, we will send a question via the `query` parameter. In addition, we need to specify how we want the documents to be matched and ranked. We do this by specifying a `Query` model. The query model below uses the `OR` operator in the match phase, so the application matches every document that has at least one query term in the title or the body (because of the default `FieldSet` we defined earlier). The matched documents are then ranked by the default `RankProfile` that we defined earlier.

In [None]:
from vespa.query import Query, OR, RankProfile as Ranking

results = app.query(
    query="Where is my text?", 
    query_model = Query(
        match_phase=OR(), 
        rank_profile=Ranking(name="default")
    ),
    hits = 2
)

In [None]:
results.hits

In addition to the `query` and `query_model` parameters, we can specify a multitude of relevant Vespa parameters such as the number of `hits` that we want Vespa to return. We chose `hits=2` for simplicity in this tutorial.

In [None]:
len(results.hits)
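
To make the `OR` match phase described in this section more concrete, here is a toy illustration in plain Python. It is a deliberate simplification and not how Vespa implements matching: the regular-expression tokenizer, the `or_match` helper and the two documents are all made up for this sketch, and Vespa's real linguistic processing is more involved.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def or_match(query, doc, fields=("title", "body")):
    """Return True if at least one query term occurs in any of the given fields."""
    query_terms = set(tokenize(query))
    doc_terms = set()
    for field in fields:
        doc_terms.update(tokenize(doc[field]))
    return bool(query_terms & doc_terms)


# Made-up documents: only the first one shares a term ("text") with the query
toy_docs = [
    {"id": "a", "title": "Finding text", "body": "Search engines index documents."},
    {"id": "b", "title": "Cooking pasta", "body": "Boil water and add salt."},
]
matched_ids = [doc["id"] for doc in toy_docs if or_match("Where is my text?", doc)]
print(matched_ids)
```

Because a single shared term is enough, `OR` tends to match a large share of the corpus, which is why the ranking step matters for the quality of the top hits.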

## Change the application package and redeploy

We can also make specific changes to our application by changing the application package and redeploying. Let's add a new rank profile based on BM25 to our `Schema`.

In [None]:
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", inherits = "default", first_phase = "bm25(title) + bm25(body)")
)

After that we can redeploy our application, similar to what we did earlier:

In [None]:
app = vespa_cloud.deploy('from-notebook', "/Users/tmartins/sample_application")

We can then use the newly created `bm25` rank profile to make queries:

In [None]:
results = app.query(
    query="Where is my text?", 
    query_model = Query(
        match_phase=OR(), 
        rank_profile=Ranking(name="bm25")
    ),
    hits = 2
)
len(results.hits)
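
For intuition about what an expression such as `bm25(title) + bm25(body)` computes, the sketch below scores toy documents with the textbook Okapi BM25 formula, once per field, and adds the two scores. This is an illustration only: Vespa's own `bm25` rank feature may differ in details such as tokenization and how corpus statistics are estimated. The helper names, the toy documents, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_field(query, docs, field, k1=1.2, b=0.75):
    """Score every document in docs for one field with textbook Okapi BM25."""
    tokenized = [tokenize(doc[field]) for doc in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so that an all-empty field does not divide by zero
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_docs) or 1.0
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            n_with_term = sum(1 for other in tokenized if term in other)
            idf = math.log(1 + (n_docs - n_with_term + 0.5) / (n_with_term + 0.5))
            tf = counts[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Made-up documents for illustration
toy_docs = [
    {"title": "Text search basics", "body": "How a search engine ranks text documents"},
    {"title": "Cooking pasta", "body": "Boil water and add salt"},
    {"title": "Search tips", "body": "Use quotes for exact phrases"},
]
query = "text search"
title_scores = bm25_field(query, toy_docs, "title")
body_scores = bm25_field(query, toy_docs, "body")
combined = [t + s for t, s in zip(title_scores, body_scores)]
print(combined)
```

The first document contains both query terms in both fields and gets the highest combined score, the third matches only `search` in its title, and the second matches nothing and scores zero.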

## Compare query models

When we are building a search application, we often want to experiment and compare different query models. In this section we want to show how easy it is to compare different query models in Vespa.

Let's load some labelled data where each data point contains a `query_id`, a `query` and a list of `relevant_docs` associated with the query. In this case, we have only one relevant document for each query.

In [None]:
import requests, json

labelled_data = json.loads(
    requests.get("https://thigm85.github.io/data/msmarco/query-labels.json").text
)

Below are two examples of the labelled data:

In [None]:
labelled_data[0:2]
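
Before running an evaluation, it can help to check that every labelled data point has the three keys mentioned above. The sketch below is one way to do that. `check_labelled_data` is a hypothetical helper, not part of pyvespa, and the toy records are made up; in particular, representing each relevant document as a dict with an `id` and a `score` is an assumption, and the actual file may use a different layout.

```python
def check_labelled_data(data):
    """Return a list of problems found in the labelled data (empty if none)."""
    required = ("query_id", "query", "relevant_docs")
    problems = []
    for position, point in enumerate(data):
        missing = [key for key in required if key not in point]
        if missing:
            problems.append(f"item {position}: missing {missing}")
        elif not point["relevant_docs"]:
            problems.append(f"item {position}: empty relevant_docs")
    return problems


# Made-up records: the second has no relevant documents, the third lacks two keys
toy_labelled_data = [
    {"query_id": 1, "query": "what is text search", "relevant_docs": [{"id": "D10", "score": 1}]},
    {"query_id": 2, "query": "how to cook pasta", "relevant_docs": []},
    {"query": "missing the id"},
]
problems = check_labelled_data(toy_labelled_data)
for problem in problems:
    print(problem)
```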

Let's define two `Query` models to be compared. We are going to use the same `OR` operator in the match phase and compare the `default` and `bm25` rank profiles.

In [None]:
default_ranking = Query(
    match_phase=OR(), 
    rank_profile=Ranking(name="default")
)

In [None]:
bm25_ranking = Query(
    match_phase=OR(), 
    rank_profile=Ranking(name="bm25")
)

Now we will choose which evaluation metrics we want to look at. In this case we will choose the `MatchRatio` to check how many documents have been matched by the query, the `Recall` at 10 and the `ReciprocalRank` at 10.

In [None]:
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [MatchRatio(), Recall(at = 10), ReciprocalRank(at = 10)]
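
As a rough guide to what these three metrics measure, the sketch below implements their usual textbook definitions in plain Python. The function names and the toy ranking are made up for illustration, and the pyvespa classes may treat edge cases differently, so this should be read as a description of the concepts and not of the library's implementation.

```python
def recall_at(k, ranked_ids, relevant_ids):
    """Fraction of the relevant documents that appear in the top k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank_at(k, ranked_ids, relevant_ids):
    """1 / rank of the first relevant document in the top k, or 0.0 if none is there."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def match_ratio(n_matched, n_documents):
    """Share of the corpus that the match phase retrieved for a query."""
    return n_matched / n_documents


# Made-up ranking for one query whose single relevant document is "D9"
ranked_ids = ["D7", "D3", "D9", "D1"]
relevant_ids = ["D9"]

recall_10 = recall_at(10, ranked_ids, relevant_ids)
rr_10 = reciprocal_rank_at(10, ranked_ids, relevant_ids)
rr_2 = reciprocal_rank_at(2, ranked_ids, relevant_ids)
ratio = match_ratio(250, 1000)
print(recall_10, rr_10, rr_2, ratio)
```

With a single relevant document per query, as in the labelled data used here, recall at 10 for one query is either 0 or 1, while the reciprocal rank also reflects how high in the list the relevant document appears.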

We can now run the `evaluate` method for each `Query` model. This will make queries to the application and process the results to compute the `eval_metrics` defined above.

In [None]:
default_evaluation = app.evaluate(
    labelled_data=labelled_data, 
    eval_metrics=eval_metrics, 
    query_model=default_ranking, 
    id_field="id",
    timeout=5,
    hits=10
)

In [None]:
bm25_evaluation = app.evaluate(
    labelled_data=labelled_data, 
    eval_metrics=eval_metrics, 
    query_model=bm25_ranking, 
    id_field="id",
    timeout=5,
    hits=10
)

We can then merge the DataFrames returned by the `evaluate` method and start to analyse the results.

In [None]:
from pandas import merge

eval_comparison = merge(
    left=default_evaluation, 
    right=bm25_evaluation, 
    on="query_id", 
    suffixes=('_default', '_bm25')
)
eval_comparison.head()

Notice that we expect to observe the same match ratio for both query models since they use the same `OR` operator.

In [None]:
eval_comparison[["match_ratio_value_default", "match_ratio_value_bm25"]].describe().loc[["mean", "std"]]

The `bm25` rank profile obtained a significantly higher recall than the `default`.

In [None]:
eval_comparison[["recall_10_value_default", "recall_10_value_bm25"]].describe().loc[["mean", "std"]]

Similarly, `bm25` also gets a significantly higher reciprocal rank value when compared to the `default` rank profile.

In [None]:
eval_comparison[["reciprocal_rank_10_value_default", "reciprocal_rank_10_value_bm25"]].describe().loc[["mean", "std"]]
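
Means and standard deviations summarise each rank profile separately. Since both profiles were evaluated on the same queries, it can also be informative to look at the per-query differences. The sketch below is one simple way to do that. `compare_paired` is a hypothetical helper and the five reciprocal rank values per profile are made up for illustration; they are not results from this tutorial.

```python
from statistics import mean


def compare_paired(values_a, values_b):
    """Summarise per-query differences (b minus a) between two equally long metric lists."""
    if len(values_a) != len(values_b):
        raise ValueError("both lists must contain one value per query")
    diffs = [b - a for a, b in zip(values_a, values_b)]
    return {
        "mean_diff": mean(diffs),
        "wins_b": sum(d > 0 for d in diffs),
        "wins_a": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }


# Made-up reciprocal rank values for five queries
rr_default = [0.0, 0.5, 1.0, 0.0, 0.25]
rr_bm25 = [1.0, 0.5, 1.0, 0.5, 0.125]
summary = compare_paired(rr_default, rr_bm25)
print(summary)
```

With the merged DataFrame from above, the two inputs could be taken from the corresponding columns, for example `eval_comparison["reciprocal_rank_10_value_default"].tolist()` and `eval_comparison["reciprocal_rank_10_value_bm25"].tolist()`.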