# Text search

![Vespa logo](https://vespa.ai/assets/vespa-logo-color.png)

A quick-start guide to text search with pyvespa.
[Install pyvespa](https://pyvespa.readthedocs.io/), start Docker, and verify that at least 4 GB of memory is available to it:

In [16]:
!docker info | grep "Total Memory"

 Total Memory: 11.7GiB
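
The same check can be scripted. The sketch below is a minimal illustration, not part of pyvespa: the helper names `docker_memory_gib` and `has_enough_memory` are hypothetical, the `MiB` and `TiB` units are assumed by analogy with the `GiB` value shown above, and the 4 GB minimum is read as 4 GiB.

```python
import re


def docker_memory_gib(info_line: str) -> float:
    """Parse a line such as ' Total Memory: 11.7GiB' and return the size in GiB."""
    match = re.search(r"Total Memory:\s*([\d.]+)\s*(MiB|GiB|TiB)", info_line)
    if match is None:
        raise ValueError(f"unrecognised line: {info_line!r}")
    value, unit = float(match.group(1)), match.group(2)
    # Conversion factors to GiB; units other than GiB are an assumption.
    factor = {"MiB": 1 / 1024, "GiB": 1.0, "TiB": 1024.0}[unit]
    return value * factor


def has_enough_memory(info_line: str, minimum_gib: float = 4.0) -> bool:
    """Return True when the reported memory meets the assumed 4 GiB minimum."""
    return docker_memory_gib(info_line) >= minimum_gib


print(has_enough_memory(" Total Memory: 11.7GiB"))
```

The line passed in the last statement is the output captured above; on another machine, the text could come from running `docker info` through `subprocess`.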


## Create the application package

The first step is to create an [application package](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage):

In [1]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="textsearch")

## Add fields to the schema

We can then add [fields](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Field) to the application's [Schema](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Schema) created by default in `app_package`.

In [2]:
from vespa.package import Field

app_package.schema.add_fields(
    Field(name = "id",    type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "body",  type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)

* `id` will store the document ids, while `title` and `body` are self-explanatory.

* All the fields in this case are of type `string`. 

* Including `"index"` in the `indexing` list means that Vespa will create a searchable index for `title` and `body`.
  You can read more about which options are available for `indexing` in the
  [Vespa documentation](https://docs.vespa.ai/en/reference/schema-reference.html#indexing).

* Setting `index = "enable-bm25"` makes Vespa pre-compute the quantities needed to compute the BM25 score quickly.
  We will use BM25 to rank the documents retrieved.
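
To make the BM25 score more concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only and not Vespa's implementation: the function name `bm25`, the toy corpus, whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are all assumptions. It does show the kind of statistics (term frequencies, document frequencies, field lengths) that the score depends on.

```python
import math
from collections import Counter


def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one tokenized document for a tokenized query."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf[term]
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq.get(term, 0)  # number of documents containing the term
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        # Length normalisation: longer-than-average documents are penalised.
        norm = f + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * f * (k1 + 1) / norm
    return score


# Hypothetical three-document corpus, tokenized on whitespace.
corpus = {
    "d1": "what keeps airplanes in the air".split(),
    "d2": "the airport shopping challenge".split(),
    "d3": "motion diagrams for planes in the air".split(),
}
query = "planes air".split()

num_docs = len(corpus)
avg_len = sum(len(terms) for terms in corpus.values()) / num_docs
doc_freq = Counter(term for terms in corpus.values() for term in set(terms))

scores = {
    doc_id: bm25(query, terms, doc_freq, num_docs, avg_len)
    for doc_id, terms in corpus.items()
}
for doc_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(score, 3))
```

Documents that contain none of the query terms score zero, and rarer terms weigh more than common ones through the `idf` factor.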

## Search multiple fields when querying

A [Fieldset](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.FieldSet)
groups fields together for searching.
For example, the `default` fieldset defined below groups `title` and `body` together.

In [3]:
from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "body"])
)
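
Conceptually, searching a fieldset means that a query term may match in any of the grouped fields. The toy function below illustrates that idea in plain Python. It is a sketch under simplifying assumptions (lowercasing and whitespace splitting instead of Vespa's linguistic processing), and `matches_fieldset` is a hypothetical helper, not a pyvespa function.

```python
def matches_fieldset(doc, fieldset_fields, term):
    """Return True if `term` occurs in at least one of the grouped fields."""
    term = term.lower()
    return any(
        term in str(doc.get(field, "")).lower().split()
        for field in fieldset_fields
    )


# Hypothetical document with the three fields defined in the schema above.
doc = {
    "id": "D1",
    "title": "The Suffolk Resolves 1774",
    "body": "Across New England colonists met in county conventions",
}
default_fieldset = ["title", "body"]

print(matches_fieldset(doc, default_fieldset, "suffolk"))    # found in title
print(matches_fieldset(doc, default_fieldset, "colonists"))  # found in body
print(matches_fieldset(doc, default_fieldset, "D1"))         # id is not in the fieldset
```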

## Define how to rank the documents matched

We can specify how to rank the matched documents by defining a
[RankProfile](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.RankProfile).
In this case, we define the `bm25` rank profile, which sums the BM25 scores
computed over the `title` and `body` fields.
The cell also defines a `default` profile with the same expression and a `native_rank` profile based on `nativeRank`.

In [4]:
from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(name = "default", first_phase = "bm25(title) + bm25(body)")
)
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", first_phase = "bm25(title) + bm25(body)")
)
app_package.schema.add_rank_profile(
    RankProfile(name = "native_rank", first_phase = "nativeRank(title,body)")
)

## QueryModel

Define a query model that matches documents with `AND` and ranks them with the `bm25` rank profile:

In [5]:
from vespa.query import QueryModel, AND, RankProfile as Ranking

bm25_query_model = QueryModel(
    name="and_bm25",
    match_phase = AND(),
    rank_profile = Ranking(name="bm25")
)
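
As a rough mental model, and assuming that `AND()` keeps only documents matching every query term, the match phase can be pictured with the toy functions below. They are hypothetical helpers written in plain Python with whitespace tokenization, not pyvespa code, and an `or_match` variant is included only for contrast.

```python
def tokens(text):
    """Lowercase and split on whitespace; a stand-in for real tokenization."""
    return set(text.lower().split())


def field_terms(doc, fields):
    """Collect the terms of all searched fields into one set."""
    return set().union(*(tokens(doc.get(field, "")) for field in fields))


def and_match(query, doc, fields=("title", "body")):
    """Match only if every query term occurs in at least one searched field."""
    return tokens(query) <= field_terms(doc, fields)


def or_match(query, doc, fields=("title", "body")):
    """Match if any query term occurs in at least one searched field."""
    return bool(tokens(query) & field_terms(doc, fields))


# Hypothetical documents for illustration.
doc_a = {"title": "What keeps airplanes in the air", "body": "lift keeps planes in the air"}
doc_b = {"title": "The airport shopping challenge", "body": "buying presents in the airport"}

query = "what keeps planes in the air"
print(and_match(query, doc_a), and_match(query, doc_b))
print(or_match(query, doc_a), or_match(query, doc_b))
```

Ranking is a separate step: only the documents that survive the match phase are scored by the chosen rank profile.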

## Deploy

In [6]:
import os
from vespa.deployment import VespaDocker

disk_folder = os.path.join(os.getcwd(), "sample_application")
vespa_docker = VespaDocker(port=8089, disk_folder=disk_folder)
app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.


## Feed

Download approximately 10,000 documents:

In [10]:
from pandas import read_csv

docs = read_csv(
    filepath_or_buffer="https://data.vespa.oath.cloud/blog/msmarco/sample_docs.csv"
)
docs.head()

| Unnamed: 0 | id | title | body |
| --- | --- | --- | --- |
| 0 | D1712962 | Can you eat crab or imitation krab when you ha... | Answers com Wiki Answers Categories Health... |
| 1 | D1817294 | How long is a tax refund check good | Answers com Wiki Answers Categories Busine... |
| 2 | D1761039 | The Suffolk Resolves 1774 | The Suffolk Resolves 1774 Across New England ... |
| 3 | D2899268 | The eagle has flown | Download citation Share Download full text PDF... |
| 4 | D3278481 | 22b Cotton and African American Life | 22b Cotton and African American Life Two thi... |


In [12]:
responses = app.feed_df(docs)
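
It can be worth checking that every document was accepted before querying. The sketch below assumes that `responses` is a sequence of objects exposing an HTTP-style `status_code` attribute; that is an assumption about the return value of `feed_df`, so the example uses stand-in objects instead of real responses. `summarize_feed` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_feed(responses, ok_status=200):
    """Count responses and list the positions of those that did not succeed."""
    failed = [
        position
        for position, response in enumerate(responses)
        if response.status_code != ok_status
    ]
    return {"total": len(responses), "failed": failed}


# Stand-in objects for illustration; real code would pass `responses` instead.
fake_responses = [
    SimpleNamespace(status_code=200),
    SimpleNamespace(status_code=200),
    SimpleNamespace(status_code=507),
]
print(summarize_feed(fake_responses))
```

The positions in `failed` line up with the row order of `docs`, which makes it straightforward to retry only the rejected documents.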

## Query

In [13]:
response = app.query(query="what keeps planes in the air", query_model=bm25_query_model)

In [14]:
for hit in response.hits:
    print({
        "id": hit["fields"]["id"], 
        "title": hit["fields"]["title"], 
        "relevance": hit["relevance"]
    })

{'id': 'D1871659', 'title': 'What keeps airplanes in the air ', 'relevance': 25.629646778721742}
{'id': 'D684487', 'title': 'MP02  Motion Diagrams', 'relevance': 10.530621046528191}
{'id': 'D254873', 'title': ' ', 'relevance': 9.360245799821818}
{'id': 'D1765332', 'title': 'Fast   Furious 6', 'relevance': 8.074330525519938}
{'id': 'D1631044', 'title': 'First Cell', 'relevance': 7.633614797739616}
{'id': 'D3434656', 'title': 'A Global Guide to Pet Relocation Costs', 'relevance': 6.924114436169092}
{'id': 'D3157544', 'title': 'Thomas Cook to axe 2 600 jobs', 'relevance': 6.62022009705605}
{'id': 'D209851', 'title': 'The USS Scorpion Buried at Sea', 'relevance': 6.406117915168793}
{'id': 'D958861', 'title': 'George Harrison  close to death ', 'relevance': 6.008681916270936}
{'id': 'D1529248', 'title': 'The airport shopping challenge  Left buying Christmas presents til  the last minute like I did  You CAN do it all in two hours   and I m living proof', 'relevance': 5.991353085441334}
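
For further analysis it can be handy to flatten the hits into rows. The sketch below relies only on the hit structure used above (`hit["fields"]` and `hit["relevance"]`). The helper `hits_to_rows` is hypothetical, and the sample hits are copied from the first two results printed above so that the example runs without a Vespa instance.

```python
def hits_to_rows(hits, fields=("id", "title")):
    """Flatten hits into dicts holding the rank, rounded relevance and chosen fields."""
    rows = []
    for rank, hit in enumerate(hits, start=1):
        row = {"rank": rank, "relevance": round(hit["relevance"], 2)}
        for name in fields:
            # .get() tolerates hits that lack one of the requested fields.
            row[name] = hit["fields"].get(name)
        rows.append(row)
    return rows


# Sample hits copied from the query output above.
sample_hits = [
    {"fields": {"id": "D1871659", "title": "What keeps airplanes in the air "},
     "relevance": 25.629646778721742},
    {"fields": {"id": "D684487", "title": "MP02  Motion Diagrams"},
     "relevance": 10.530621046528191},
]

for row in hits_to_rows(sample_hits):
    print(row)
```

With a live application, `hits_to_rows(response.hits)` would produce one row per hit, which can then be passed to `pandas.DataFrame` if a table is preferred.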


## Cleanup

In [15]:
from shutil import rmtree

rmtree(os.path.join(os.getcwd(), "sample_application"), ignore_errors=True)
vespa_docker.container.stop()
vespa_docker.container.remove()