![Vespa Cloud logo](https://cloud.vespa.ai/assets/logos/vespa-cloud-logo-full-black.png)

# Text Search on Vespa Cloud - quickstart

This guide follows [getting-started-pyvespa](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html), but deploys to Vespa Cloud.

<div class="alert alert-info">
    Refer to <a href="https://pyvespa.readthedocs.io/en/latest/troubleshooting.html">troubleshooting</a>
    if you run into problems while running this guide.
</div>

Prerequisite: create a tenant at [cloud.vespa.ai](https://cloud.vespa.ai/) and save the tenant name.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/pyvespa/blob/master/docs/sphinx/source/getting-started-pyvespa-cloud.ipynb)

## Install

Install [pyvespa](https://pyvespa.readthedocs.io/) >= 0.35
and the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html).
The Vespa CLI is used for key management:

In [1]:
!pip3 install pyvespa

Install the Vespa CLI using homebrew:

In [1]:
!brew install vespa-cli

Alternatively, if running in Colab, download the Vespa CLI:

In [1]:
import os
import requests
res = requests.get(url="https://api.github.com/repos/vespa-engine/vespa/releases/latest").json()
os.environ["VERSION"] = res["tag_name"].replace("v", "")
!curl -fsSL https://github.com/vespa-engine/vespa/releases/download/v${VERSION}/vespa-cli_${VERSION}_linux_amd64.tar.gz | tar -zxf -
!ln -sf /content/vespa-cli_${VERSION}_linux_amd64/bin/vespa /usr/local/bin/vespa

## Configure application and keys
Create a Vespa Cloud data-plane cert/key pair.
We save the paths to the credentials for later data-plane access without using pyvespa APIs - see the example at the end of this notebook.

In [1]:
import os
tenant_name = "mytenant" # Your tenant name here
os.environ["TENANT_NAME"] = tenant_name

!vespa config set target cloud
!vespa config set application ${TENANT_NAME}.textsearch
!vespa auth cert -N

In [1]:
from os.path import exists

cert_path = "/Users/me/.vespa/" + tenant_name + ".textsearch.default/data-plane-public-cert.pem"
key_path  = "/Users/me/.vespa/" + tenant_name + ".textsearch.default/data-plane-private-key.pem"

if not exists(cert_path) or not exists(key_path):
    print("ERROR: credential files not found. Correct the paths above and rerun until this error is gone.")
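
The hard-coded home directory above has to be edited by hand. As a minimal sketch, the same paths can be derived from the current user's home directory instead. It assumes the Vespa CLI keeps credentials under `~/.vespa/<tenant>.<application>.<instance>/`, as the paths above suggest. The helper name `credential_paths` and the `example_*` variables are hypothetical and only used for illustration:

```python
from pathlib import Path


def credential_paths(tenant, application, instance="default", cli_home=None):
    """Return (cert_path, key_path) under the assumed Vespa CLI directory layout."""
    # Assumption: the CLI stores files in ~/.vespa unless told otherwise.
    home = Path(cli_home) if cli_home is not None else Path.home() / ".vespa"
    app_dir = home / f"{tenant}.{application}.{instance}"
    return (
        str(app_dir / "data-plane-public-cert.pem"),
        str(app_dir / "data-plane-private-key.pem"),
    )


example_cert, example_key = credential_paths("mytenant", "textsearch")

# Report which of the expected files are missing, if any.
missing = [p for p in (example_cert, example_key) if not Path(p).exists()]
if missing:
    print("Missing credential files:", missing)
```

If the files are reported missing, check the output of `vespa auth cert -N` for the location actually used.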

Note that the deploy call below adds `data-plane-public-cert.pem` to the application before deploying it.

Generate an API key for deployment and save its path:

In [1]:
!vespa auth api-key

from pathlib import Path
api_key_path = str(Path.home()) + "/.vespa/" + os.getenv("TENANT_NAME") + ".api-key.pem"

Follow the instructions in the output above and add the key in the console.

## Create an application package

The [application package](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage)
has all the Vespa configuration files -
create one from scratch:

In [1]:
from vespa.package import ApplicationPackage

app_name = "textsearch"
app_package = ApplicationPackage(name=app_name)

Note that the name cannot have `-` or `_`.

The above will create an empty schema with the same name as the application package.

## Add fields to the schema

Add [fields](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Field)
to the [schema](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Schema):

In [1]:
from vespa.package import Field

app_package.schema.add_fields(
    Field(name = "id",    type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "body",  type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)

* `id` holds the document ids, while `title` and `body` are the text fields of the documents.

* Setting `"index"` in `indexing` means that a searchable index for `title` and `body` is created.
  Read more about [indexing options](https://docs.vespa.ai/en/reference/schema-reference.html#indexing). 

* Setting `index = "enable-bm25"` will pre-compute quantities to make it fast to compute the BM25 score.
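
To make the BM25 bullet concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Vespa's implementation. The toy corpus, the function name `bm25_score`, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions for illustration. Term frequency, field length and document frequency are the kind of quantities a pre-computed index can make cheap to look up at query time:

```python
import math


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one text field for a list of query terms."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this field
        if tf == 0:
            continue
        n_t = doc_freq.get(term, 0)  # number of documents containing the term
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        length_norm = 1 - b + b * doc_len / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Toy corpus of tokenized fields (hypothetical data).
corpus = [
    "lift keeps planes in the air".split(),
    "lobster is a popular seafood".split(),
    "planes fly at high altitude".split(),
]

# Document frequency per term, and average field length.
doc_freq = {}
for doc in corpus:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1
avg_len = sum(len(doc) for doc in corpus) / len(corpus)

example_terms = "planes air".split()
scores = [bm25_score(example_terms, doc, doc_freq, len(corpus), avg_len) for doc in corpus]
print(scores)
```

The first document matches both query terms and scores highest, while the second matches none and scores zero.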

## Search multiple fields

A [FieldSet](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.FieldSet)
groups fields together for searching -
it configures queries to look for matches both in the `title` and `body` fields of the documents:

In [1]:
from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "body"])
)

## Define ranking

Specify how to rank the matched documents by defining a
[RankProfile](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.RankProfile).
Below are different rank profiles that can be selected in the query:

In [1]:
from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", first_phase = "bm25(title) + bm25(body)")
)
app_package.schema.add_rank_profile(
    RankProfile(name = "native_rank", first_phase = "nativeRank(title, body)")
)

## Deploy

The text search app with fields, a fieldset to group fields together, and rank profiles
is now defined and ready to deploy.
Deploy `app_package` to Vespa Cloud, by creating an instance of
[VespaCloud](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.deployment.VespaCloud):

In [1]:
from vespa.deployment import VespaCloud

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=app_name,
    key_location=api_key_path,
    application_package=app_package)

In [1]:
app = vespa_cloud.deploy(instance="default")

If the deployment fails, you may have forgotten to add the key in the Vespa Cloud Console in the `vespa auth api-key` step above. On successful deployment, you should see lines like:

> Deployment started in run 1 of dev-aws-us-east-1c for mytenant.textsearch.

> ...

> INFO    [06:04:21]  Found endpoints:
> dev.aws-us-east-1c
> https://textsearch-container.textsearch.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud/ (cluster 'textsearch_container')

The first deployment takes a few minutes while Vespa Cloud sets up the resources for your Vespa application.

`app` now holds a reference to a [Vespa](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa) instance.
Store the endpoint for later usage - set `endpoint` from the above output:

In [1]:
endpoint = "https://textsearch-container.textsearch.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud/"

## Feed

Download approximately 10,000 documents:

In [1]:
from pandas import read_csv

docs = read_csv(filepath_or_buffer="https://data.vespa.oath.cloud/blog/msmarco/sample_docs.csv").fillna('')
docs.head()

[Feed](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa.feed_df) the documents to the application:

In [1]:
feed_res = app.feed_df(docs, asynchronous=True, batch_size=100)

## Query

Query the text search app using the [Vespa Query language](https://docs.vespa.ai/en/query-language.html)
by passing the parameters in the `body` argument of
[Vespa.query](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa.query) -
here using the `bm25` rank profile:

In [1]:
query = {
    'yql': 'select * from sources * where userQuery()',
    'query': 'what keeps planes in the air',
    'ranking': 'bm25',
    'type': 'all',
    'hits': 10
}
res = app.query(body=query)
res.hits[0]
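
Both rank profiles defined earlier can be selected the same way. The sketch below wraps the query body from the cell above in a small helper so that only the `ranking` value changes. The helper name `build_query` is hypothetical and not part of pyvespa:

```python
def build_query(text, rank_profile="bm25", hits=10):
    """Build a query body like the one above for a given rank profile."""
    return {
        "yql": "select * from sources * where userQuery()",
        "query": text,
        "ranking": rank_profile,
        "type": "all",
        "hits": hits,
    }


bm25_query = build_query("what keeps planes in the air")
native_query = build_query("what keeps planes in the air", rank_profile="native_rank")

# With the deployed application from above, this would be sent as:
# res = app.query(body=native_query)
```

Comparing the top hits for the two bodies is a quick way to see how the choice of rank profile affects the result order.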

## Next steps

This is just an introduction to the capabilities of Vespa and pyvespa.
Browse the site to learn more about schemas, feeding and queries -
find more complex applications in
[examples](https://pyvespa.readthedocs.io/en/latest/examples.html).

## Example: Document operations using cert/key pair

Above, we deployed to Vespa Cloud, and as part of that, generated a cert/key pair.
This pair can be used to access the dataplane for reads/writes to documents and running queries.

Find the ID of the first document in the feed:

In [1]:
from vespa.application import df_to_vespafeed
import json

feed = json.loads(df_to_vespafeed(docs, app_name, "id", namespace=app_name))
doc_json = feed[0]
docid = doc_json["fields"]["id"]
doc_json

Set up a dataplane connection using the cert/key pair:

In [1]:
import requests

session = requests.Session()
session.cert = (cert_path, key_path)

Get a document from the endpoint returned when we deployed to Vespa Cloud above:

In [1]:
# Strip the trailing slash from the endpoint to avoid a double slash in the URL
url = "{0}/document/v1/{1}/{2}/docid/{3}".format(endpoint.rstrip("/"), app_name, app_name, docid)
doc = session.get(url).json()
doc
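
The URL above follows the pattern `<endpoint>/document/v1/<namespace>/<document type>/docid/<id>`, where this guide uses the application name for both namespace and document type. As a minimal sketch, the helper below (hypothetical name `document_v1_url`) builds such a URL, tolerates a trailing slash on the endpoint, and percent-encodes each path segment. The example endpoint and document ID are placeholders:

```python
from urllib.parse import quote


def document_v1_url(endpoint, namespace, doc_type, doc_id):
    """Build a /document/v1 URL for a single document."""
    base = endpoint.rstrip("/")  # avoid "//" when the endpoint ends with a slash
    segments = [quote(str(part), safe="") for part in (namespace, doc_type, doc_id)]
    return "{0}/document/v1/{1}/{2}/docid/{3}".format(base, *segments)


example_url = document_v1_url(
    "https://example.invalid/", "textsearch", "textsearch", "D123"
)
print(example_url)
```

Encoding matters when document IDs contain characters such as `/` or spaces, which would otherwise change the meaning of the path.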

Update the title and post the new version:

In [1]:
doc["fields"]["title"] = "Can you eat lobster?"
response = session.post(url, json=doc).json()
response

Get the doc again to see the updated title:

In [1]:
doc = session.get(url).json()
doc

## Example: Reconnect pyvespa using cert/key pair

Above, we stored the dataplane credentials for later use. Deployment of an application usually happens when the schema changes, whereas accessing the dataplane is for document updates and user queries.

Only the endpoint and the cert/key pair are needed to connect to a Vespa Cloud application:

In [1]:
# cert_path = "/Users/me/.vespa/mytenant.textsearch.default/data-plane-public-cert.pem"
# key_path  = "/Users/me/.vespa/mytenant.textsearch.default/data-plane-private-key.pem"

from vespa.application import Vespa

the_app = Vespa(endpoint, cert=cert_path, key=key_path)

res = the_app.query(body={
    'yql': 'select * from sources * where true',
    'hits': 1
})
res.hits[0]

A common problem is a cert mismatch: the cert/key pair used at deployment differs from the pair used when making queries. Make sure it is the same pair, or re-create it with `vespa auth cert -f` and redeploy as needed.
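
One cause can be ruled out locally. The sketch below uses Python's standard `ssl` module to check that a certificate file and a private key file can be loaded together, which fails if the key does not belong to the certificate. It cannot tell whether the certificate is the one that was deployed, so a passing check does not exclude a mismatch with the application. The function name `cert_and_key_match` is hypothetical:

```python
import ssl


def cert_and_key_match(cert_file, key_file):
    """Return True if the cert and private key files load as a matching pair."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # Raises ssl.SSLError if the key does not match the certificate,
        # and OSError (e.g. FileNotFoundError) if a file cannot be read.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    except (ssl.SSLError, OSError) as error:
        print(f"Could not load cert/key pair: {error}")
        return False
    return True


# Usage with the paths stored earlier in this notebook:
# cert_and_key_match(cert_path, key_path)
```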