# Text search

![Vespa logo](https://vespa.ai/assets/vespa-logo-color.png)

This self-contained tutorial will create a basic text search application based on the MS MARCO dataset,
similar to Vespa's [text search tutorials](https://docs.vespa.ai/en/tutorials/text-search.html).

[Install pyvespa](https://pyvespa.readthedocs.io/) and start Docker, then verify that at least 4 GB of memory is available to Docker:

In [15]:
!docker info | grep "Total Memory"

 Total Memory: 11.7GiB


## Create an application package

Create an [application package](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage). The name must not contain a `-`:

In [16]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="textsearch")

## Add fields to the schema

Add [fields](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Field)
to the application's [schema](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.Schema):

In [17]:
from vespa.package import Field

app_package.schema.add_fields(
    Field(name = "id",    type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "body",  type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)

* `id` holds the document ids, while `title` and `body` are the text fields of the documents.

* Setting `"index"` in `indexing` means that a searchable index for `title` and `body` is created.
  Read more about [indexing options](https://docs.vespa.ai/en/reference/schema-reference.html#indexing). 

* Setting `index = "enable-bm25"` precomputes the quantities needed to compute BM25 scores quickly at query time.
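
To give a feel for what BM25 computes, here is a minimal pure-Python sketch of the textbook BM25 formula for a single field. It is an illustration rather than Vespa's implementation: the toy corpus, the helper name `bm25_score`, and the parameter values `k1=1.2` and `b=0.75` (common textbook defaults) are all assumptions. The per-term document frequencies and the average field length are examples of quantities that can be computed ahead of query time.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized field for a list of query terms."""
    tf = Counter(doc_terms)
    # Longer-than-average fields are penalized, shorter ones are boosted.
    length_norm = 1 - b + b * len(doc_terms) / avg_len
    score = 0.0
    for term in query_terms:
        n = doc_freq.get(term, 0)
        if tf[term] == 0 or n == 0:
            continue  # term absent from this field or from the corpus
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score

# Toy corpus of tokenized titles (made-up data, not the MS MARCO sample).
corpus = [
    ["what", "keeps", "airplanes", "in", "the", "air"],
    ["the", "eagle", "has", "flown"],
    ["how", "long", "is", "a", "tax", "refund", "check", "good"],
]
# Number of documents containing each term, and the average field length.
doc_freq = Counter(term for doc in corpus for term in set(doc))
avg_len = sum(len(doc) for doc in corpus) / len(corpus)

query = ["the", "air"]
scores = [bm25_score(query, doc, doc_freq, len(corpus), avg_len) for doc in corpus]
print(scores)
```

The first title contains both query terms and scores highest, the second contains only the common term `the`, and the third contains neither and scores zero.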

## Search multiple fields when querying

A [FieldSet](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.FieldSet)
groups fields together for searching -
it configures queries to look for matches both in the titles and bodies of the documents:

In [18]:
from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "body"])
)

## Define how to rank the documents matched

Specify how to rank the matched documents by defining a
[RankProfile](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.RankProfile).
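The cell below defines three profiles: `default` and `bm25` both sum the BM25 scores of `title` and `body`,
while `native_rank` uses Vespa's `nativeRank` feature over the same two fields: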

In [19]:
from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(name = "default", first_phase = "bm25(title) + bm25(body)")
)
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", first_phase = "bm25(title) + bm25(body)")
)
app_package.schema.add_rank_profile(
    RankProfile(name = "native_rank", first_phase = "nativeRank(title, body)")
)
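
As a rough mental model of the `bm25(title) + bm25(body)` expression, the sketch below sums hypothetical per-field scores for each matched document and sorts by the result. The document ids and score values are made up for illustration, and `first_phase` is a hypothetical helper name, not a pyvespa function.

```python
def first_phase(field_scores):
    """Mimic the expression bm25(title) + bm25(body) for one document."""
    return field_scores["title"] + field_scores["body"]

# Hypothetical per-field BM25 scores for three matched documents.
matched = {
    "doc-a": {"title": 4.0, "body": 1.5},
    "doc-b": {"title": 0.0, "body": 6.0},
    "doc-c": {"title": 2.0, "body": 2.0},
}

# Highest combined score first, as in a ranked result list.
ranked = sorted(matched, key=lambda doc_id: first_phase(matched[doc_id]), reverse=True)
print(ranked)  # ['doc-b', 'doc-a', 'doc-c']
```

Because the two field scores are simply added, a strong body match can outrank a title-only match, as `doc-b` does here.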

## Deploy

The text search app is now fully defined, with fields, a fieldset that groups them, and rank profiles that order matched documents, and is ready to deploy.
Deploy `app_package` on the local machine using Docker,
without leaving the notebook, by creating an instance of
[VespaDocker](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.deployment.VespaDocker):

In [20]:
import os
from vespa.deployment import VespaDocker

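# Fall back to the current directory if WORK_DIR is unset (os.getenv would return None).
disk_folder = os.path.join(os.getenv("WORK_DIR", os.getcwd()), "sample_application")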
vespa_docker = VespaDocker(port=8089, disk_folder=disk_folder)
app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Waiting for application status.
Finished deployment.


`app` now holds a [Vespa](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa) instance,
which is used to interact with the application.
`pyvespa` provides an API to define Vespa application packages from Python.
`vespa_docker.deploy` exports the Vespa configuration files to `disk_folder` -
going through these is a good way to learn about Vespa configuration.
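
One simple way to browse the exported files is to walk the folder with the standard library. The helper below is a minimal sketch; `list_files` is a hypothetical name introduced here, and the exact set of files it finds depends on the pyvespa version.

```python
import os

def list_files(folder):
    """Return the sorted relative paths of all files below `folder`."""
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full_path = os.path.join(root, name)
            paths.append(os.path.relpath(full_path, folder))
    return sorted(paths)
```

After a successful deployment, `list_files(disk_folder)` should return the paths of the generated configuration files, which can then be opened and read one by one.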

## Feed

Download a sample of approximately 10,000 documents:

In [21]:
from pandas import read_csv

docs = read_csv(
    filepath_or_buffer="https://data.vespa.oath.cloud/blog/msmarco/sample_docs.csv"
)
docs.head()

Unnamed: 0,id,title,body
0,D1712962,Can you eat crab or imitation krab when you ha...,Answers com Wiki Answers Categories Health...
1,D1817294,How long is a tax refund check good,Answers com Wiki Answers Categories Busine...
2,D1761039,The Suffolk Resolves 1774,The Suffolk Resolves 1774 Across New England ...
3,D2899268,The eagle has flown,Download citation Share Download full text PDF...
4,D3278481,22b Cotton and African American Life,22b Cotton and African American Life Two thi...


Feed the DataFrame to the application:

In [22]:
feed_res = app.feed_df(docs)
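
Conceptually, each row becomes one Vespa document whose id is derived from the `id` column. The sketch below shows that mapping with the standard library only. It assumes the `id:<namespace>:<document type>::<id>` pattern seen in the query results of this tutorial, uses two made-up rows, and is not a description of what `feed_df` does internally; `rows_to_documents` is a hypothetical helper.

```python
import csv
import io

# Two made-up rows with the same id/title/body columns as sample_docs.csv.
sample_csv = """id,title,body
D1,First title,First body text
D2,Second title,Second body text
"""

def rows_to_documents(csv_text, namespace="textsearch", doc_type="textsearch"):
    """Turn CSV rows into dicts holding a Vespa-style document id and the fields."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        documents.append({
            "documentid": f"id:{namespace}:{doc_type}::{row['id']}",
            "fields": {"id": row["id"], "title": row["title"], "body": row["body"]},
        })
    return documents

documents = rows_to_documents(sample_csv)
print(documents[0]["documentid"])  # id:textsearch:textsearch::D1
```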

## Query

Query the text search app using the [Vespa Query language](https://docs.vespa.ai/en/query-language.html)
by passing the query parameters as the `body` argument of
[app.query](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa.query):

In [23]:
query = {
    'yql': 'select * from sources * where userQuery();',
    'query': 'what keeps planes in the air',
    'ranking': 'bm25',
    'type': 'all',
    'hits': 10
}
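
The same parameters can also be expressed as a plain HTTP GET request against Vespa's `/search/` endpoint, which is handy for debugging with a browser or `curl`. The sketch below only builds the URL and sends nothing. The host `http://localhost` is an assumption for a local Docker deployment, the port matches the `port=8089` used at deployment, and `search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def search_url(params, host="http://localhost", port=8089):
    """Build a GET URL for the /search/ endpoint from query parameters."""
    return f"{host}:{port}/search/?{urlencode(params)}"

query = {
    'yql': 'select * from sources * where userQuery();',
    'query': 'what keeps planes in the air',
    'ranking': 'bm25',
    'type': 'all',
    'hits': 10
}

url = search_url(query)
print(url)
```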

In [24]:
res = app.query(body=query)
res.hits[0]

{'id': 'id:textsearch:textsearch::D1871659',
 'relevance': 25.629646778721725,
 'source': 'textsearch_content',
 'fields': {'sddocname': 'textsearch',
  'documentid': 'id:textsearch:textsearch::D1871659',
  'id': 'D1871659',
  'title': 'What keeps airplanes in the air ',
  'body': 'Answers com   Wiki Answers   Categories Cars   Vehicles Airplanes and Aircraft What keeps airplanes in the air  Flag What keeps airplanes in the air  Answer by Karin L  Confidence votes 95 0KThere s more to raising cattle than throwing them out to pasture  Know your soil and plants to earn profit above ground and wealth below  It is the combined forces of lift  thrust and weight that keeps an airplane in the air  Lift happens to be the largest force in this equation  and is dependent on the speed of the wing  or how fast an airplane is going   vertical velocity of air and air density  Well the elevator the rudder will help and something else I forgot what it was but don t judge me for that               And 

## Query with QueryModel

Using the Vespa Query Language as above gives access to Vespa's full query power and flexibility.
In contrast, the QueryModel abstraction focuses on specific use cases
and can be more useful for ML experiments.
Here, match using the `AND` operator and rank using the `bm25` rank profile:

In [25]:
from vespa.query import QueryModel, AND, RankProfile as Ranking

bm25_query_model = QueryModel(
    name="and_bm25",
    match_phase = AND(),
    rank_profile = Ranking(name="bm25")
)
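
As a rough illustration of what `AND` matching means, the sketch below compares "all terms must occur" with "any term may occur" on two made-up sentences. It uses exact, whitespace-separated tokens, whereas Vespa applies its own linguistic processing before matching, so treat this as a simplified model. `matches_all` and `matches_any` are hypothetical helpers.

```python
def matches_all(query, text):
    """True if every whitespace-separated query term occurs in the text (AND)."""
    doc_terms = set(text.lower().split())
    return all(term in doc_terms for term in query.lower().split())

def matches_any(query, text):
    """True if at least one query term occurs in the text (OR)."""
    doc_terms = set(text.lower().split())
    return any(term in doc_terms for term in query.lower().split())

# Made-up example texts.
texts = [
    "lift keeps planes in the air",
    "planes land at the airport",
]

and_matches = [matches_all("planes air", text) for text in texts]
or_matches = [matches_any("planes air", text) for text in texts]
print(and_matches)  # [True, False]
print(or_matches)   # [True, True]
```

`AND` returns fewer but more precise candidates. The second text is dropped because `airport` is a different token from `air`.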

In [26]:
response = app.query(query="what keeps planes in the air", query_model=bm25_query_model)

In [27]:
for hit in response.hits:
    print({
        "id": hit["fields"]["id"], 
        "title": hit["fields"]["title"], 
        "relevance": hit["relevance"]
    })

{'id': 'D1871659', 'title': 'What keeps airplanes in the air ', 'relevance': 25.629646778721725}
{'id': 'D684487', 'title': 'MP02  Motion Diagrams', 'relevance': 10.530621046528196}
{'id': 'D254873', 'title': ' ', 'relevance': 9.360245799821822}
{'id': 'D1765332', 'title': 'Fast   Furious 6', 'relevance': 8.074330525519944}
{'id': 'D1631044', 'title': 'First Cell', 'relevance': 7.633614797739622}
{'id': 'D3434656', 'title': 'A Global Guide to Pet Relocation Costs', 'relevance': 6.924114436169097}
{'id': 'D3157544', 'title': 'Thomas Cook to axe 2 600 jobs', 'relevance': 6.620220097056055}
{'id': 'D209851', 'title': 'The USS Scorpion Buried at Sea', 'relevance': 6.406117915168797}
{'id': 'D958861', 'title': 'George Harrison  close to death ', 'relevance': 6.00868191627094}
{'id': 'D1529248', 'title': 'The airport shopping challenge  Left buying Christmas presents til  the last minute like I did  You CAN do it all in two hours   and I m living proof', 'relevance': 5.991353085441339}


## Cleanup

In [28]:
from shutil import rmtree

vespa_docker.container.stop()
vespa_docker.container.remove()
rmtree(disk_folder, ignore_errors=True)