# Text search app with BERT models from Python

> Introducing the pyvespa ML API: cutting-edge Vespa search with a few lines of code.

- toc: true 
- badges: false
- comments: true
- categories: [vespa, pyvespa, BERT, transformers]

## Define the application

Start with a basic text search app.

In [1]:
from vespa.package import ApplicationPackage, Field, FieldSet, RankProfile

app_package = ApplicationPackage(name="cord19")
app_package.schema.add_fields(
    Field(name = "cord_uid", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)
app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title"])
)
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", first_phase = "bm25(title)")
)
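To make the `bm25(title)` expression more concrete, here is a minimal pure-Python sketch of the standard BM25 formula. It is not Vespa's implementation: the whitespace tokenizer, the function name `bm25_scores`, and the sample titles are illustrative, and `k1=1.2`, `b=0.75` are assumed defaults.

```python
import math
from collections import Counter


def bm25_scores(query, titles, k1=1.2, b=0.75):
    """Score each title against the query with the textbook BM25 formula."""
    # Simplification: lowercase + whitespace split instead of a real tokenizer.
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the title contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical titles, only for illustration.
example_titles = ["coronavirus vaccine trial", "influenza outbreak report"]
example_scores = bm25_scores("coronavirus vaccine", example_titles)
```

Titles that share no term with the query score zero; titles that match rare terms score higher.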

Download the BERT tokenizer and model from the `transformers` library.

In [None]:
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("google/bert_uncased_L-2_H-128_A-2")
model = BertForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2")  # This could be any pytorch BERT model

Define your Vespa model configuration.

In [2]:
from vespa.ml import BertModelConfig

bert_config = BertModelConfig(
    model_id="pretrained_bert_tiny",
    tokenizer="google/bert_uncased_L-2_H-128_A-2",
    model="google/bert_uncased_L-2_H-128_A-2",    
    query_input_size=32,
    doc_input_size=96
)

Some weights of the model checkpoint at google/bert_uncased_L-2_H-128_A-2 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification w
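The two size parameters fix the length of the sequence the model sees: `query_input_size + doc_input_size = 128` positions in total. The sketch below shows one way such an input could be laid out. It assumes the query gets `query_input_size - 2` token slots (leaving room for `[CLS]` and one `[SEP]`) and the document gets `doc_input_size - 1` slots (leaving room for the final `[SEP]`). The special token ids are the ones commonly used by uncased BERT vocabularies, and the function names and sample token ids are hypothetical.

```python
CLS, SEP, PAD = 101, 102, 0  # assumed special token ids (uncased BERT vocab)


def pad_or_truncate(token_ids, size):
    """Return exactly `size` ids: cut long inputs, right-pad short ones."""
    return (list(token_ids) + [PAD] * size)[:size]


def build_input(query_ids, doc_ids, query_input_size=32, doc_input_size=96):
    """Lay out [CLS] query [SEP] doc [SEP] with a fixed total length."""
    query = pad_or_truncate(query_ids, query_input_size - 2)
    doc = pad_or_truncate(doc_ids, doc_input_size - 1)
    input_ids = [CLS] + query + [SEP] + doc + [SEP]
    # Segment 0 covers [CLS] + query + [SEP]; segment 1 covers doc + [SEP].
    token_type_ids = [0] * (len(query) + 2) + [1] * (len(doc) + 1)
    # Attend to everything except padding.
    attention_mask = [0 if token == PAD else 1 for token in input_ids]
    return input_ids, token_type_ids, attention_mask


# Hypothetical token ids, only for illustration.
input_ids, token_type_ids, attention_mask = build_input([2023, 2003], [1037, 3231, 6251])
```

Fixed sizes are what allow the exported model to be evaluated on tensors of a known shape at query time.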

Create your model-based rank profile.

In [3]:
from vespa.package import SecondPhaseRanking

app_package.add_model_ranking(
    model_config=bert_config,
    inherits="default",
    first_phase="bm25(title)",
    second_phase=SecondPhaseRanking(
        rerank_count=10, expression="logit1"
    ),
)

Using framework PyTorch: 1.7.1
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
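The rank profile above scores every matched document with the cheap first-phase expression and re-scores only the top `rerank_count` with the BERT output. A simplified sketch of that idea in plain Python follows. The function `two_phase_rank` and its scoring callables are hypothetical, and the sketch ignores details of the real engine such as re-ranking being applied per content node.

```python
def two_phase_rank(docs, first_phase, second_phase, rerank_count=10):
    """Order docs by first_phase, then re-order only the top rerank_count by second_phase."""
    ranked = sorted(docs, key=first_phase, reverse=True)
    head, tail = ranked[:rerank_count], ranked[rerank_count:]
    # Only the head pays the cost of the expensive second-phase model.
    head = sorted(head, key=second_phase, reverse=True)
    return head + tail


# Toy example: 20 integer "documents", a cheap score and an "expensive" score
# that disagree on the order of the top candidates.
toy_docs = list(range(20))
toy_ranking = two_phase_rank(
    toy_docs,
    first_phase=lambda d: d,
    second_phase=lambda d: -d,
    rerank_count=3,
)
```

Keeping `rerank_count` small bounds the number of model evaluations per query, which is the main reason to put the BERT model in the second phase.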



## Deploy your application

In [None]:
import os
from vespa.package import VespaDocker

vespa_docker = VespaDocker(port=8080)

os.environ["WORK_DIR"] = "/Users/tmartins"  # replace with a directory on your machine
disk_folder = os.path.join(os.getenv("WORK_DIR"), "sample_application")

app = vespa_docker.deploy(
    application_package = app_package,
    disk_folder=disk_folder
)

## Feed some data

In [None]:
from pandas import read_csv

# Adjust this path to wherever parsed_feed.csv lives on your machine
parsed_feed = read_csv("/Users/tmartins/projects/sw/blog/_notebooks/data/2021-01-18-cord19-deploy-bert-from-pyvespa/parsed_feed.csv")
parsed_feed = parsed_feed.head(100)

In [None]:
for idx, row in parsed_feed.iterrows():
    fields = {
        "cord_uid": str(row["cord_uid"]),
        "title": str(row["title"]),
    }
    fields.update(
        bert_config.doc_fields(text = str(row["title"]))
    )
    response = app.feed_data_point(
        schema = "cord19",
        data_id = str(row["cord_uid"]),
        fields = fields,
    )

In [None]:
response.json()
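The cell above inspects only the last response. When feeding many documents it can help to track every failure. The sketch below assumes each feed call returns an object with a `status_code` attribute, as an HTTP response does. The helper `feed_all` and the stand-in `fake_feed` are hypothetical and exist only so the example runs without a Vespa instance.

```python
from types import SimpleNamespace


def feed_all(rows, feed):
    """Feed every (data_id, fields) pair; return the ids whose response was not HTTP 200."""
    failed = []
    for data_id, fields in rows:
        response = feed(data_id, fields)
        if response.status_code != 200:
            failed.append(data_id)
    return failed


def fake_feed(data_id, fields):
    """Stand-in for a real feed call: rejects documents with an empty title."""
    return SimpleNamespace(status_code=200 if fields.get("title") else 400)


failed_ids = feed_all(
    [("a", {"title": "some title"}), ("b", {"title": ""})],
    fake_feed,
)
```

With a deployed app, `feed` could be a small wrapper around `app.feed_data_point(schema="cord19", data_id=..., fields=...)`.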

## Query your application

In [None]:
from vespa.query import QueryModel, RankProfile as Ranking, OR, QueryModelFeature

result = app.query(
    query="this is a test", 
    query_model=QueryModel(
        query_properties=[
            QueryModelFeature(bert_config)
        ],
        match_phase = OR(),
        rank_profile = Ranking(name="pretrained_bert_tiny")
    )
)

In [None]:
result.json

In [None]:
result.number_documents_retrieved