# Getting started with pyvespa

![Vespa logo](https://vespa.ai/assets/vespa-logo-color.png)

This notebook starts Vespa, configures an application, and exercises the document and query APIs.
Install [Jupyter Notebook](https://jupyter.org/install#jupyter-notebook),
then clone the repository, start the notebook server and select `getting-started-pyvespa.ipynb`:

    $ git clone --depth 1 https://github.com/vespa-engine/pyvespa.git
    $ jupyter notebook --notebook-dir pyvespa/docs/sphinx/source

This notebook runs Vespa in Docker; alternatively, use [Vespa Cloud](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html).
Start Docker and verify that at least 4 GB of memory is available:

In [1]:
!docker info | grep "Total Memory"

 Total Memory: 11.7GiB
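
The same check can be scripted. The sketch below parses a `Total Memory` line like the one above and converts it to GiB. The helper name `parse_total_memory_gib` and the set of unit suffixes it accepts are assumptions made for illustration; neither is part of pyvespa or Docker.

```python
import re


def parse_total_memory_gib(docker_info: str) -> float:
    """Return the 'Total Memory' value found in `docker info` output, in GiB."""
    match = re.search(r"Total Memory:\s*([\d.]+)\s*(KiB|MiB|GiB|TiB)", docker_info)
    if match is None:
        raise ValueError("no 'Total Memory' line found")
    value, unit = float(match.group(1)), match.group(2)
    # Conversion factors to GiB (assumed unit suffixes).
    factor = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0, "TiB": 1024.0}[unit]
    return value * factor


sample = " Total Memory: 11.7GiB"  # the line printed above
memory_gib = parse_total_memory_gib(sample)
print("enough memory" if memory_gib >= 4 else "increase the Docker memory limit")
```

To check a live Docker installation, pass the text captured from `docker info` instead of `sample`.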


## Install pyvespa

In [None]:
!python3 -m pip install pyvespa

## Create the application package

Create an [application package](https://pyvespa.readthedocs.io/en/latest/use_cases/text_search/text-search-quick-start.html):

In [1]:
from typing import List

from vespa.package import (
    Document,
    Field,
    Schema,
    FieldSet,
    RankProfile,
    HNSW,
    ApplicationPackage,
    QueryProfile,
    QueryProfileType,
    QueryTypeField,
)

from vespa.query import QueryModel, AND, RankProfile as Ranking

class QuestionAnswering(ApplicationPackage):
    def __init__(self, name: str = "qa"):
        context_document = Document(
            fields=[
                Field(
                    name="questions",
                    type="array<int>",
                    indexing=["summary", "attribute"],
                ),
                Field(name="dataset", type="string", indexing=["summary", "attribute"]),
                Field(name="context_id", type="int", indexing=["summary", "attribute"]),
                Field(
                    name="text",
                    type="string",
                    indexing=["summary", "index"],
                    index="enable-bm25",
                ),
            ]
        )
        context_schema = Schema(
            name="context",
            document=context_document,
            fieldsets=[FieldSet(name="default", fields=["text"])],
            rank_profiles=[
                RankProfile(name="bm25", inherits="default", first_phase="bm25(text)"),
                RankProfile(
                    name="nativeRank",
                    inherits="default",
                    first_phase="nativeRank(text)",
                ),
            ],
        )
        sentence_document = Document(
            inherits="context",
            fields=[
                Field(
                    name="sentence_embedding",
                    type="tensor<float>(x[512])",
                    indexing=["attribute", "index"],
                    ann=HNSW(
                        distance_metric="euclidean",
                        max_links_per_node=16,
                        neighbors_to_explore_at_insert=500,
                    ),
                )
            ],
        )
        sentence_schema = Schema(
            name="sentence",
            document=sentence_document,
            fieldsets=[FieldSet(name="default", fields=["text"])],
            rank_profiles=[
                RankProfile(
                    name="semantic-similarity",
                    inherits="default",
                    first_phase="closeness(sentence_embedding)",
                ),
                RankProfile(name="bm25", inherits="default", first_phase="bm25(text)"),
                RankProfile(
                    name="bm25-semantic-similarity",
                    inherits="default",
                    first_phase="bm25(text) + closeness(sentence_embedding)",
                ),
            ],
        )
        super().__init__(
            name=name,
            schema=[context_schema, sentence_schema],
            query_profile=QueryProfile(),
            query_profile_type=QueryProfileType(
                fields=[
                    QueryTypeField(
                        name="ranking.features.query(query_embedding)",
                        type="tensor<float>(x[512])",
                    )
                ]
            ),
        )

app_package = QuestionAnswering()

## Deploy the application using Docker

Deploy `app_package` and wait for the _Finished deployment_ message:

In [2]:
import os
from vespa.deployment import VespaDocker

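# Fall back to the current directory when WORK_DIR is unset;
# os.path.join(None, ...) would raise a TypeError.
disk_folder = os.path.join(os.getenv("WORK_DIR", os.getcwd()), "sample_application")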
vespa_docker = VespaDocker(port=8081, disk_folder=disk_folder)
app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.


As part of deploying, pyvespa exports the configuration to an
[application package](https://docs.vespa.ai/en/reference/application-packages-reference.html) on disk.
These files can be deployed using [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html),
and it can be useful to check them into a source code repository.
Because the application package was named "qa" in the code above, look for the files in that directory:

In [3]:
!find qa -type f

qa/application/hosts.xml
qa/application/services.xml
qa/application/schemas/context.sd
qa/application/schemas/sentence.sd
qa/application/search/query-profiles/types/root.xml
qa/application/search/query-profiles/default.xml
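
On systems without `find`, the listing can be produced with the Python standard library. This is a minimal sketch; the helper name `list_files` is hypothetical, and it returns an empty list when the directory does not exist.

```python
from pathlib import Path
from typing import List


def list_files(root: str) -> List[str]:
    """Return all regular files below `root` as sorted POSIX-style paths."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(p.as_posix() for p in base.rglob("*") if p.is_file())


# "qa" is the application package name used above.
for path in list_files("qa"):
    print(path)
```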


Use [disk_folder](reference-api.rst#vespadocker) to configure the working directory.
Use [export_application_package](reference-api.rst#vespa.deployment.VespaDocker.export_application_package)
to export the application package from code to files.

Remember to [clean up](#Cleanup) after deploying to a local Docker container.

## Download, prepare and feed sample data

In [5]:
import json, requests

sentence_data = json.loads(
    requests.get("https://data.vespa.oath.cloud/blog/qa/sample_sentence_data_100.json").text
)
list(sentence_data[0].keys())

['text', 'dataset', 'questions', 'context_id', 'sentence_embedding']

Prepare the data as a list of dicts. Each dict has an `id` key holding a unique id for the data point, and a `fields` key holding a dict with the data fields required by the application:

In [6]:
batch_feed = [
    {
        "id": idx,
        "fields": sentence
    }
    for idx, sentence in enumerate(sentence_data)
]
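
Before feeding, it can help to confirm that every item has this shape. The sketch below is one possible check, not a pyvespa feature. The name `find_invalid_items` is hypothetical, and treating every key of the sample data as required is an assumption that is probably stricter than Vespa itself.

```python
from typing import Dict, List, Tuple

# Field names taken from the keys of the sample data shown above.
EXPECTED_FIELDS = {"text", "dataset", "questions", "context_id", "sentence_embedding"}


def find_invalid_items(batch: List[Dict]) -> List[Tuple[int, str]]:
    """Return (position, problem) pairs for items that do not match the expected shape."""
    problems = []
    seen_ids = set()
    for position, item in enumerate(batch):
        if "id" not in item or "fields" not in item:
            problems.append((position, "missing 'id' or 'fields' key"))
            continue
        if item["id"] in seen_ids:
            problems.append((position, f"duplicate id {item['id']!r}"))
        seen_ids.add(item["id"])
        missing = EXPECTED_FIELDS - set(item["fields"])
        if missing:
            problems.append((position, f"missing fields: {sorted(missing)}"))
    return problems


# Toy records (not the real sample data) to show the check in action.
toy_batch = [
    {"id": 0, "fields": {name: None for name in EXPECTED_FIELDS}},
    {"id": 0, "fields": {"text": "only text"}},
]
print(find_invalid_items(toy_batch))
```

With the real data, `find_invalid_items(batch_feed)` should return an empty list.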

Feed the batch using the `sentence` schema:

In [7]:
response = app.feed_batch(schema="sentence", batch=batch_feed)
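
A quick way to confirm that feeding succeeded is to tally the HTTP status codes of the responses. The sketch below assumes that each element returned by `feed_batch` exposes a `status_code` attribute; verify this against the installed pyvespa version. The helper name `count_status_codes` is hypothetical, and the stand-in objects only imitate that one attribute.

```python
from collections import Counter
from types import SimpleNamespace


def count_status_codes(responses) -> Counter:
    """Count responses per HTTP status code (assumes a `status_code` attribute)."""
    return Counter(r.status_code for r in responses)


# Stand-in objects for illustration; with a live application, pass `response`.
fake_responses = [
    SimpleNamespace(status_code=200),
    SimpleNamespace(status_code=200),
    SimpleNamespace(status_code=400),
]
counts = count_status_codes(fake_responses)
print(dict(counts))
```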

## Run a query

Query the application using the [Vespa Query Language](https://docs.vespa.ai/en/query-language.html):

In [8]:
result = app.query(body={
  'yql': 'select text from sources sentence where userQuery();',
  'query': 'What is in front of the Notre Dame Main Building?',
  'type': 'any',
  'hits': 5,
  'ranking.profile': 'bm25'
})

In [9]:
result.hits[0]

{'id': 'index:qa_content/0/a87ff679ab8603b42a4ffde2',
 'relevance': 11.194862200830393,
 'source': 'qa_content',
 'fields': {'text': 'Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes".'}}
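
To scan all returned hits rather than only the first, the relevant parts of each hit can be collected into a compact list. This is a sketch that relies only on the dict layout shown above (`relevance` and `fields.text`); the helper name `summarize_hits` is hypothetical.

```python
from typing import Dict, List, Tuple


def summarize_hits(hits: List[Dict]) -> List[Tuple[float, str]]:
    """Return (relevance, text) pairs, highest relevance first."""
    pairs = [(hit["relevance"], hit["fields"].get("text", "")) for hit in hits]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Stand-in hits shaped like the output above; with a live application,
# pass `result.hits` instead.
sample_hits = [
    {"relevance": 3.5, "fields": {"text": "second"}},
    {"relevance": 11.2, "fields": {"text": "first"}},
]
for relevance, text in summarize_hits(sample_hits):
    print(f"{relevance:8.3f}  {text}")
```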

## Get documents

Get the sentences with ids 0, 1 and 2, then inspect each response through its `json` attribute:

In [10]:
batch = [{"id": 0}, {"id": 1}, {"id": 2}]
response = app.get_batch(schema="sentence", batch=batch)

In [11]:
response

[<vespa.io.VespaResponse at 0x1126c7e20>,
 <vespa.io.VespaResponse at 0x1126af280>,
 <vespa.io.VespaResponse at 0x1126eefa0>]

In [12]:
response[0].json

{'pathId': '/document/v1/sentence/sentence/docid/0',
 'id': 'id:sentence:sentence::0',
 'fields': {'text': "Atop the Main Building's gold dome is a golden statue of the Virgin Mary.",
  'dataset': 'squad',
  'sentence_embedding': {'cells': [{'address': {'x': '0'},
     'value': -0.005731593817472458},
    {'address': {'x': '1'}, 'value': 0.007575507741421461},
    {'address': {'x': '2'}, 'value': -0.06413306295871735},
    {'address': {'x': '3'}, 'value': -0.007967847399413586},
    {'address': {'x': '4'}, 'value': -0.06464996933937073},
    {'address': {'x': '5'}, 'value': -0.07429644465446472},
    {'address': {'x': '6'}, 'value': 0.005069912411272526},
    {'address': {'x': '7'}, 'value': -0.019518841058015823},
    {'address': {'x': '8'}, 'value': -0.021434271708130836},
    {'address': {'x': '9'}, 'value': -0.06423905491828918},
    {'address': {'x': '10'}, 'value': 0.0652240440249443},
    {'address': {'x': '11'}, 'value': -0.06434165686368942},
    {'address': {'x': '12'}, 'valu
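
The embedding comes back as a list of cells, each with an address and a value. The sketch below converts that layout into a plain list of floats, which is usually easier to work with. The helper name `cells_to_dense` is hypothetical, and it assumes a single dimension named `x`, as in the schema above.

```python
from typing import Dict, List


def cells_to_dense(tensor: Dict, size: int) -> List[float]:
    """Convert a {'cells': [...]} tensor with one dimension `x` to a dense list."""
    dense = [0.0] * size
    for cell in tensor["cells"]:
        # Addresses are returned as strings, so convert them to list indices.
        dense[int(cell["address"]["x"])] = cell["value"]
    return dense


# The first three cells from the output above, deliberately out of order.
sample_tensor = {
    "cells": [
        {"address": {"x": "0"}, "value": -0.005731593817472458},
        {"address": {"x": "2"}, "value": -0.06413306295871735},
        {"address": {"x": "1"}, "value": 0.007575507741421461},
    ]
}
print(cells_to_dense(sample_tensor, size=3))
```

With a live application, the equivalent call would be `cells_to_dense(response[0].json["fields"]["sentence_embedding"], size=512)`.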

## Update a document

Update a data point by `id`. Optionally, `create` the data point if it does not exist:

In [13]:
batch_update = [
    {
        "id": 0,                               # data_id
        "fields": {"text": "this is a test"},  # fields to be updated
        "create": False,                       # optional: create the data point if it does not exist (default: False)
    }
]

In [14]:
response = app.update_batch(schema="sentence", batch=batch_update)
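
When updates are built in several places, a small constructor keeps the three keys consistent. This is only a convenience sketch; `make_update` is a hypothetical helper, not a pyvespa function, and id 100 is assumed not to exist among the fed documents.

```python
from typing import Any, Dict


def make_update(data_id: Any, fields: Dict[str, Any], create: bool = False) -> Dict[str, Any]:
    """Build one update item with the `id`, `fields` and `create` keys."""
    if not fields:
        raise ValueError("an update needs at least one field")
    return {"id": data_id, "fields": fields, "create": create}


updates = [
    make_update(0, {"text": "this is a test"}),               # update an existing data point
    make_update(100, {"text": "new sentence"}, create=True),  # create it if it is missing
]
print(updates)
```

The resulting list has the same shape as `batch_update` and can be passed as the `batch` argument of `app.update_batch`.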

## Delete documents

Delete the sentences with ids = 0, 1 and 2:

In [15]:
batch = [{"id": 0}, {"id": 1}, {"id": 2}]
response = app.delete_batch(schema="sentence", batch=batch)

## Cleanup

Stop and remove the Docker container, then delete the working directory:

In [16]:
from shutil import rmtree

vespa_docker.container.stop()
vespa_docker.container.remove()
rmtree(disk_folder, ignore_errors=True)