# Build a basic text search application from Python with Vespa

> Introducing the simplified pyvespa API. Build a Vespa application from Python with a few lines of code.

- toc: true 
- badges: false
- comments: true
- categories: [vespa, pyvespa, cord19]

This post introduces the simplified `pyvespa` API, which lets us build a basic text search application from scratch with just a few lines of Python. A follow-up post will add a layer of complexity by deploying a BERT model to the search application built here, again with just a few lines of code.

`pyvespa` exposes a subset of the [Vespa](https://vespa.ai/) API in Python. The library’s primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to [connect and interact with running Vespa applications](https://towardsdatascience.com/how-to-connect-and-interact-with-search-applications-from-python-520118139f69) and [evaluate Vespa ranking functions from Python](https://towardsdatascience.com/how-to-evaluate-vespa-ranking-functions-from-python-7749650f6e1a). This time, we focus on building and deploying applications from scratch.

## Install

The simplified pyvespa API introduced here was released in version `0.2.0`:

`pip3 install pyvespa==0.2.0`

## Define the application

As an example, we are going to build an application to search through [CORD19 sample data](https://ir.nist.gov/covidSubmit/data.html).

### Create an application package

The first step is to create a Vespa [ApplicationPackage](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage):

In [4]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="cord19")

### Add fields to the Schema

We can then add [fields](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Field) to the application's [Schema](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Schema) created by default in `app_package`.

In [7]:
from vespa.package import Field

app_package.schema.add_fields(
    Field(name = "cord_uid", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "abstract", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)

* `cord_uid` will store the CORD19 document ids, while `title` and `abstract` are self-explanatory.

* All the fields in this case are of type `string`. 

* Including `"index"` in the `indexing` list means that Vespa will create a searchable index for `title` and `abstract`. You can read more about which options are available for `indexing` in the [Vespa documentation](https://docs.vespa.ai/documentation/reference/schema-reference.html#indexing).

* Setting `index = "enable-bm25"` makes Vespa pre-compute the statistics needed to compute the BM25 score quickly at query time. We will use that score to rank the retrieved documents.
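To make the last point more concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Vespa's implementation; the function name `bm25_score`, the toy titles, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration. It shows why statistics such as document frequencies and average field length are worth computing ahead of time: the score needs them for every query term.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 for one field of one document (illustrative only)."""
    term_counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = term_counts.get(term, 0)
        if tf == 0:
            continue  # terms absent from the field contribute nothing
        n = doc_freq.get(term, 0)  # number of documents containing the term
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        # Length normalization: longer-than-average fields are penalized.
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Toy "title" fields, made up for this example.
titles = [
    "coronavirus transmission in hospitals".split(),
    "vaccine trials for coronavirus".split(),
    "hospital staffing models".split(),
]

# These corpus-level statistics are the kind of quantities worth pre-computing.
doc_freq = Counter(term for doc in titles for term in set(doc))
avg_len = sum(len(doc) for doc in titles) / len(titles)

scores = [
    bm25_score(["coronavirus", "vaccine"], doc, doc_freq, len(titles), avg_len)
    for doc in titles
]
```

The second title matches both query terms and therefore scores highest, while the third title matches none and scores zero.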

### Search multiple fields when querying

In [8]:
from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)
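Conceptually, a field set groups fields so that a query against the set is matched against every field in it. The sketch below is a toy in-memory illustration of that idea, not how Vespa evaluates queries; the helper `matches_field_set` and the sample document are hypothetical.

```python
def matches_field_set(document, field_set, query_terms):
    """Return True if any query term occurs in any field of the field set."""
    for field in field_set:
        tokens = document.get(field, "").lower().split()
        if any(term.lower() in tokens for term in query_terms):
            return True
    return False


# Mirrors the "default" field set defined above.
default_field_set = ["title", "abstract"]

# Made-up document for illustration.
doc = {
    "cord_uid": "abc123",
    "title": "Hospital staffing models",
    "abstract": "We study coronavirus wards.",
}

# "coronavirus" appears in the abstract, which is part of the field set.
hit = matches_field_set(doc, default_field_set, ["coronavirus"])

# "abc123" appears only in cord_uid, which is not part of the field set.
miss = matches_field_set(doc, default_field_set, ["abc123"])
```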

### Define how to rank the documents matched

In [9]:
from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", first_phase = "bm25(title) + bm25(abstract)")
)

## Deploy your application

In [None]:
import os
from vespa.package import VespaDocker

vespa_docker = VespaDocker(port=8080)

os.environ["WORK_DIR"] = "/Users/tmartins"
disk_folder = os.path.join(os.getenv("WORK_DIR"), "sample_application")

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder=disk_folder
)

## Feed some data

In [None]:
from pandas import read_csv

parsed_feed = read_csv("/Users/tmartins/projects/sw/blog/_notebooks/data/2021-01-18-cord19-deploy-bert-from-pyvespa/parsed_feed.csv")
parsed_feed = parsed_feed.head(100)

In [None]:
parsed_feed

In [None]:
for idx, row in parsed_feed.iterrows():
    # Send only the fields declared in the schema above.
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
        "abstract": str(row["abstract"]),
    }
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )
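One thing to watch for when casting values with `str`: a missing value read from a CSV file typically arrives as a float `NaN`, and `str(float("nan"))` is the literal text `"nan"`, which would then be indexed as a word. Below is a minimal sketch of a cleaning step that could run before feeding. The helpers `clean_text` and `row_to_fields` are hypothetical, not part of pyvespa, and the sample row is made up.

```python
import math


def clean_text(value):
    """Return "" for missing values (None or NaN), otherwise the stripped string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value).strip()


def row_to_fields(row, field_names=("cord_uid", "title", "abstract")):
    """Build the fields dict for one row; works with any object that has .get()."""
    return {name: clean_text(row.get(name)) for name in field_names}


# Made-up row with a missing abstract, as pandas would represent it.
sample_row = {"cord_uid": "abc123", "title": " A title ", "abstract": float("nan")}
fields = row_to_fields(sample_row)
```

Inside the feeding loop, `fields = row_to_fields(row)` would then replace the hand-built dictionary.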

In [None]:
response.json()

## Query your application

In [None]:
from vespa.query import QueryModel, RankProfile as Ranking, OR

result = app.query(
    query="this is a test",
    query_model=QueryModel(
        # Match documents containing at least one query term,
        # then rank them with the bm25 rank profile defined above.
        match_phase = OR(),
        rank_profile = Ranking(name="bm25")
    )
)
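For readers who want to see what such a query looks like at the HTTP level, here is a rough sketch of a JSON body for Vespa's search endpoint that asks for "any term" matching and the `bm25` rank profile defined earlier. This is an approximation for illustration and not necessarily the exact request pyvespa sends; the helper `build_search_body` and the `hits` default are assumptions.

```python
import json


def build_search_body(query, rank_profile="bm25", hits=10):
    """Build a request body for Vespa's search endpoint (illustrative only)."""
    return {
        "yql": "select * from sources * where userQuery();",
        "query": query,
        "type": "any",            # match documents containing any of the query terms
        "ranking": rank_profile,  # rank profile used to score matched documents
        "hits": hits,             # number of hits to return
    }


body = build_search_body("this is a test")
payload = json.dumps(body)
```

The resulting `payload` could be sent as a POST request to the application's `/search/` endpoint with a `Content-Type: application/json` header.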

In [None]:
result.json

In [None]:
result.number_documents_retrieved
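The raw JSON can be verbose, so it is often handy to reduce it to one line per hit. Vespa search responses nest the hits under `root.children`, each with an `id`, a `relevance` score, and a `fields` dictionary. The sketch below relies on that layout; the helper `summarize_hits` is hypothetical and the sample response is made up rather than real output from this application.

```python
def summarize_hits(response_json, field="title"):
    """Return (id, relevance, field value) for each hit in a Vespa search response."""
    children = response_json.get("root", {}).get("children", [])
    return [
        (hit.get("id"), hit.get("relevance"), hit.get("fields", {}).get(field))
        for hit in children
    ]


# Made-up response in the shape of a Vespa search result.
sample_response = {
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {"totalCount": 2},
        "children": [
            {"id": "id:cord19:cord19::abc123", "relevance": 1.5,
             "fields": {"title": "First title"}},
            {"id": "id:cord19:cord19::def456", "relevance": 0.7,
             "fields": {"title": "Second title"}},
        ],
    }
}

summary = summarize_hits(sample_response)
```

With the query result above, the call would be `summarize_hits(result.json)`.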