# Similarity search with BERT embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/pyvespa/blob/master/docs/sphinx/source/use_cases/msmarco/msmarco_independent_embeddings.ipynb)

This tutorial covers the use of independent vector embeddings for queries and documents. The more relevant case of paired document and query vectors will be covered elsewhere.

## Install

The library is available on PyPI and can be installed with `pip`.

In [None]:
!pip install pyvespa==0.1.7.dev1 --force-reinstall

## Application package API

**You need to add a tensor field for the document embedding**.

In [8]:
from vespa.package import Document, Field

document = Document(
    fields=[
        Field(name = "id", type = "string", indexing = ["attribute", "summary"]),
        Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
        Field(name = "title_bert", type = "tensor<float>(x[768])", indexing = ["attribute"])
    ]
)

**You can add a rank profile that scores documents by the dot product of the document tensor and the query tensor**.

In [9]:
from vespa.package import Schema, FieldSet, RankProfile

msmarco_schema = Schema(
    name = "msmarco", 
    document = document, 
    fieldsets = [FieldSet(name = "default", fields = ["title"])],
    rank_profiles = [
        RankProfile(name = "default", first_phase = "nativeRank(title)"), 
        RankProfile(
            name = "bert_title", 
            inherits="default", 
            first_phase = "sum(query(tensor_bert)*attribute(title_bert))"
        )
    ]
)
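
The `bert_title` rank expression multiplies the query and document tensors elementwise and sums the result, which is a dot product. A minimal pure-Python sketch of that arithmetic on toy 3-dimensional vectors follows; the helper name `dot_product` and the values are illustrative assumptions, not part of pyvespa or Vespa.

```python
def dot_product(query_vec, doc_vec):
    """Multiply elementwise, then sum: the arithmetic behind
    sum(query(tensor_bert) * attribute(title_bert))."""
    if len(query_vec) != len(doc_vec):
        raise ValueError("query and document vectors must have the same size")
    return sum(q * d for q, d in zip(query_vec, doc_vec))

# Toy vectors (assumed values); the tutorial's tensors have 768 dimensions.
toy_query = [1.0, 2.0, 2.0]
toy_doc = [2.0, 0.0, 1.0]
score = dot_product(toy_query, toy_doc)
print(score)
```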

In [10]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name = "msmarco", schema=msmarco_schema)

**You need to add a tensor field for the query embedding**.

In [11]:
app_package.add_query_profile_type_field(
    name="ranking.features.query(tensor_bert)",
    type="tensor<float>(x[768])"
)

## Deploy

In [12]:
from vespa.package import VespaDocker

vespa_docker = VespaDocker(port=8089)
app = vespa_docker.deploy(
    application_package=app_package, 
    disk_folder="/Users/tmartins/projects/sample_application"
)

Waiting for application status.
Waiting for application status.
Waiting for application status.


## Function to generate embeddings

Here is an example of a function that takes text as input and returns its embedding as a list of floats.

In [32]:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

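def create_embedding(text, normalize=True):
    # Encode a single text and keep the result as a NumPy array for the math below
    vector = model.encode([text])[0]
    if normalize:
        # Scale to unit length, guarding against a zero vector
        norm = np.linalg.norm(vector)
        if norm > 0.0:
            vector = vector / norm
    # Convert to a plain list of floats only at the end
    return vector.tolist()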
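
Normalizing matters because the `bert_title` rank profile scores with a plain dot product: once both vectors have unit length, that dot product equals their cosine similarity. A small sketch using only the standard library illustrates this; the helper `l2_normalize` is hypothetical, and toy 2-dimensional vectors stand in for real embeddings.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; return a zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]

# Toy vectors (assumed values) with norm 5 before scaling.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# After scaling, each vector has length 1 (up to rounding) ...
unit_length = math.sqrt(sum(v * v for v in a))
# ... so the dot product of a and b is their cosine similarity.
cosine = sum(x * y for x, y in zip(a, b))
```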

## Feed data to the app 

In [13]:
from pandas import read_csv

docs = read_csv("https://thigm85.github.io/data/msmarco/docs.tsv", sep = "\t")
docs.shape

(996, 3)

In [14]:
docs.head(2)

| | id | title | body |
|---|---|---|---|
| 0 | D2185715 | What Is an Appropriate Gift for a Bris | Hub Pages Religion and Philosophy Judaism... |
| 1 | D2819479 | lunge | 1lungenoun ˈlənj Popularity Bottom 40 of... |


In [17]:
len(create_embedding(text="this is a test title"))

768

**Note that you need to add "values" when sending tensors.**

In [22]:
response = app.feed_data_point(
    schema = "msmarco", 
    data_id = "test_id", 
    fields = {
        "id": "test_id", 
        "title": "this is a test title", 
        "title_bert": {
            "values": create_embedding(text="this is a test title")
        }
    }
)
response.status_code

200
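
Since every tensor field has to be wrapped this way, a small helper can keep feed payloads consistent. The sketch below is an assumption for illustration: `to_tensor_field` is a hypothetical name, not a pyvespa function, and its size check mirrors the `x[768]` dimension declared in the schema.

```python
def to_tensor_field(vector, expected_size=768):
    """Wrap a flat list of floats in the {"values": [...]} form used when feeding.

    Hypothetical helper: the size check mirrors the schema's x[768] dimension.
    """
    values = [float(v) for v in vector]
    if len(values) != expected_size:
        raise ValueError(f"expected {expected_size} values, got {len(values)}")
    return {"values": values}

# Toy example with a 3-dimensional vector instead of a real embedding.
payload = to_tensor_field([0.1, 0.2, 0.3], expected_size=3)
```

With the tutorial's function, the field value would then be `to_tensor_field(create_embedding(text="this is a test title"))`.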

## Query

### With ANN operator

In [36]:
from vespa.query import Query, ANN, RankProfile as Ranking

results = app.query(
    query="Where is my text?", 
    query_model = Query(
        match_phase=ANN(
            doc_vector="title_bert", 
            query_vector="tensor_bert", 
            embedding_model=create_embedding, 
            hits=10, 
            label="ann"
        ), 
        rank_profile=Ranking(name="bert_title")
    ),
)

In [37]:
results.json

{'root': {'id': 'toplevel',
  'relevance': 1.0,
  'fields': {'totalCount': 1},
  'coverage': {'coverage': 100,
   'documents': 1,
   'full': True,
   'nodes': 1,
   'results': 1,
   'resultsFull': 1},
  'children': [{'id': 'id:msmarco:msmarco::test_id',
    'relevance': 0.5120726823806763,
    'source': 'msmarco_content',
    'fields': {'sddocname': 'msmarco',
     'documentid': 'id:msmarco:msmarco::test_id',
     'id': 'test_id',
     'title': 'this is a test title'}}]}}
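
The hits sit under `root.children` in the response. Here is a short sketch of pulling out each hit's id and relevance, using a trimmed copy of the response above as a plain dictionary; the helper `extract_hits` is a hypothetical name, not part of pyvespa.

```python
def extract_hits(result_json):
    """Return (id, relevance) pairs from a result dictionary shaped like the one above."""
    children = result_json.get("root", {}).get("children", [])
    return [(hit["fields"]["id"], hit["relevance"]) for hit in children]

# Trimmed copy of the response shown above.
sample = {
    "root": {
        "id": "toplevel",
        "fields": {"totalCount": 1},
        "children": [
            {
                "id": "id:msmarco:msmarco::test_id",
                "relevance": 0.5120726823806763,
                "fields": {"id": "test_id", "title": "this is a test title"},
            }
        ],
    }
}

hits = extract_hits(sample)
print(hits)
```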

### Without the ANN operator

When not using the ANN operator, you need to create the `ranking.features.query(tensor_bert)` value yourself and send it with the query, as shown below. There will be a better way to do this in the future.

In [39]:
from vespa.query import OR

other_args = {
    "ranking.features.query(tensor_bert)": create_embedding(text="this is a test query")
}

results = app.query(
    query="Where is my text?", 
    query_model = Query(
        match_phase=OR(), 
        rank_profile=Ranking(name="default")
    ),
    hits = 2,
    **other_args
)

In [40]:
results.json

{'root': {'id': 'toplevel',
  'relevance': 1.0,
  'fields': {'totalCount': 1},
  'coverage': {'coverage': 100,
   'documents': 1,
   'full': True,
   'nodes': 1,
   'results': 1,
   'resultsFull': 1},
  'children': [{'id': 'id:msmarco:msmarco::test_id',
    'relevance': 0.02334839861715184,
    'source': 'msmarco_content',
    'fields': {'sddocname': 'msmarco',
     'documentid': 'id:msmarco:msmarco::test_id',
     'id': 'test_id',
     'title': 'this is a test title'}}]}}