# Build a Question Answering Engine
This example walks through the code used to build a question answering system. It uses a modified BERT model (DPR) to extract features from questions and Milvus to search for similar questions and their answers.

## Prepare

### Install dependencies

In [15]:
# ! pip install -r requirements.txt

### Start Milvus Server

Uncomment the following cell if you haven't started the Milvus server yet.

In [16]:
# ! wget https://raw.githubusercontent.com/milvus-io/milvus/master/deployments/docker/standalone/docker-compose.yml -O docker-compose.yml
# ! docker-compose up -d

### Check running servers

In [17]:
! docker-compose ps

      Name                 Command             State               Ports        
--------------------------------------------------------------------------------
milvus-etcd         etcd -advertise-        Up             2379/tcp, 2380/tcp   
                    client-url ...                                              
milvus-minio        /usr/bin/docker-        Up (healthy)   9000/tcp             
                    entrypoint ...                                              
milvus-standalone   /tini -- milvus run     Up             0.0.0.0:19530->19530/
                    standalone                             tcp                  
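
Besides reading the `docker-compose ps` table, it can be handy to check from Python that the published port accepts TCP connections before trying to connect with `pymilvus`. The sketch below uses only the standard library. The helper name `is_port_open` is made up for this example, and a successful TCP connection only shows that something is listening on the port, not that Milvus has finished starting up.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Example call (19530 is the port published by milvus-standalone above):
# is_port_open("localhost", 19530)
```

If the check returns `False` right after `docker-compose up -d`, waiting a few seconds and retrying is usually enough.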


## Core Code

### Connect to Server

In [18]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='localhost', port='19530')

### Create Milvus Collection with index

A collection in Milvus is similar to a table in a relational database; it is where the vectors are stored.

Each collection has its own schema, which in this case includes two fields, `id` and `embedding`:
- `id`: The id of the inserted data
- `embedding`: The embedding of the text

In this case we create an `IVF_FLAT` index on the collection before inserting data. The index is built once the data has been inserted.
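
To give an intuition for what an inverted-file index such as `IVF_FLAT` does, here is a toy sketch in plain Python. It is not Milvus's implementation: centroids are simply sampled from the data instead of being trained with k-means, and the function names `build_ivf` and `search_ivf` are invented for this example. `nlist` is the number of buckets, and `nprobe` is the number of buckets scanned per query, mirroring the parameters of the same names used in this notebook.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_ivf(vectors, nlist, seed=0):
    """Toy inverted file: sample `nlist` vectors as centroids (a real IVF index
    trains them with k-means) and put every vector in its best centroid's bucket."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    buckets = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        best = max(range(nlist), key=lambda c: dot(vec, centroids[c]))
        buckets[best].append(idx)
    return centroids, buckets


def search_ivf(query, vectors, centroids, buckets, nprobe, limit):
    """Scan only the `nprobe` buckets whose centroids score highest for the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:nprobe] for i in buckets[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:limit]


# Small demo on random data (dimension 8 instead of 768 to keep it fast).
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
centroids, buckets = build_ivf(data, nlist=20)
query = data[0]
approx = search_ivf(query, data, centroids, buckets, nprobe=4, limit=5)
exact = search_ivf(query, data, centroids, buckets, nprobe=20, limit=5)
print("approximate ids:", approx)
print("exhaustive ids: ", exact)
```

With `nprobe` equal to `nlist`, every bucket is scanned and the result matches a brute-force search; smaller values trade recall for speed.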

In [19]:
def create_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        collection = Collection(name=collection_name)
        collection.drop()

    field1 = FieldSchema(name="id", dtype=DataType.INT64, description="ids", is_primary=True, auto_id=False)
    field2 = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, description="float vector", dim=dim, is_primary=False)
    schema = CollectionSchema(fields=[field1, field2], description="collection description")
    collection = Collection(name=collection_name, schema=schema)

    index_params = {
        "index_type": "IVF_FLAT",
        "metric_type": 'IP',
        "params": {"nlist": 200}
    }
    collection.create_index(field_name="embedding", index_params=index_params)

    return collection

In [20]:
collection = create_collection('question_answering', 768)

### Generate embedding and insert into collection

We use [Towhee](https://towhee.io/), a machine learning framework, to create data processing pipelines within a few lines of code. Apart from pre-trained deep learning models and data processing operators, Towhee also provides operators for inserting into and querying Milvus.

In [21]:
import towhee

In [22]:
dc = (
	towhee.read_csv('qa.csv')
		.runas_op['id', 'id'](func=lambda x: int(x))
		.text_embedding.dpr['question', 'qvec'](model_name="facebook/dpr-ctx_encoder-single-nq-base")
		.runas_op['qvec', 'qvec'](func=lambda x: x.squeeze(0))
		.tensor_normalize['qvec', 'qvec']()
		.to_milvus['id', 'qvec'](collection=collection)
)

In [23]:
dc.show()
collection.num_entities

id,question,answer,qvec
0,Is Disability Insurance Requi...,Not generally. There are five s...,"[0.06599887, 0.011836695, 0.02264631, ...] shape=(768,)"
1,Can Creditors Take Life Insu...,If the person who passed away w...,"[0.007935313, -0.012593831, 0.020739755, ...] shape=(768,)"
2,Does Travelers Insurance Have...,One of the insurance carriers I...,"[0.06452476, 0.028823059, 0.027821207, ...] shape=(768,)"
3,Can I Drive A New Car Home...,Most auto dealers will not let ...,"[0.030032441, 0.033972304, 0.012397564, ...] shape=(768,)"
4,Is The Cash Surrender Value ...,Cash surrender value comes only...,"[0.015002071, -0.0010270709, 0.001804623, ...] shape=(768,)"


99
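
The pipeline above normalizes each embedding before inserting it, and the index uses the inner product (`IP`) metric. A short sketch of why the two fit together: for unit-length vectors, the inner product equals the cosine similarity. The example assumes that `tensor_normalize` performs L2 normalization; the helper names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale `vec` to unit length (assumed to match what tensor_normalize does)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
ip_after_norm = inner_product(l2_normalize(a), l2_normalize(b))
print(ip_after_norm, cosine(a, b))  # the two values agree up to rounding
```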

### Search

When searching for answers to a query, we first turn the query into an embedding following the same procedure. We then search the collection for similar embeddings and look up the corresponding answers.
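
The flow described above can be sketched without Milvus or a real model. In the example below, `toy_embed` is a hypothetical stand-in for the DPR encoder (it just counts letters), the questions and answers are made-up placeholders, and the search is a brute-force scan rather than an index lookup. The point is only the shape of the procedure: embed the query the same way as the stored questions, rank by inner product, then map the returned ids to answers.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for the encoder: letter counts, L2-normalized."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    vec = [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def search(query, id_vector, id_answer, limit=2):
    """Rank stored vectors by inner product with the query, return their answers."""
    q = toy_embed(query)
    ranked = sorted(
        id_vector.items(),
        key=lambda kv: sum(a * b for a, b in zip(q, kv[1])),
        reverse=True,
    )
    return [id_answer[i] for i, _ in ranked[:limit]]


# Placeholder data, not taken from qa.csv.
questions = {0: "how do I file a claim", 1: "what is a deductible", 2: "zebra quiz"}
toy_answers = {0: "answer 0", 1: "answer 1", 2: "answer 2"}
id_vector = {i: toy_embed(q) for i, q in questions.items()}

print(search("what is a deductible", id_vector, toy_answers))
```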

In [24]:
id_answer = {}
for i in dc:
	id_answer[i.id] = i.answer 

In [25]:
from towhee import Entity
queries = ['What is AAA?']
search_params = {"metric_type": 'IP', "params": {"nprobe": 16}}

dc = (
	towhee.DataFrame([Entity(query=query) for query in queries])
		.text_embedding.dpr['query', 'qvec'](model_name="facebook/dpr-ctx_encoder-single-nq-base")
		.runas_op['qvec', 'qvec'](func=lambda x: x.squeeze(0))
		.tensor_normalize['qvec', 'qvec']()
		.milvus_search['qvec', 'results'](collection=collection, anns_field="embedding", param=search_params, limit=5)
		.runas_op['results', 'answers'](func=lambda res: [id_answer[x.id] for x in res])
		# .runas_op['results', 'answers'](func = lambda x: [{'answer': data[i.id], 'scores': i.score} for i in x])
		.select['query', 'answers']()
)

In [26]:
dc.show()

query,answers
What is AAA?,"[ AAA Home insurance, like all ot..., It is important to talk to your..., Ultrasound exams are usually a ..., A renter's insurance policy cov...,...] len=5"


In [27]:
dc[0].answers

[' AAA Home insurance, like all other major carriers, covers a wide variety of claims, including fire, theft, vandalism, and many other items. However, there are numerous types of policies offered, so it is best to determine the type of policy you have to accurately understand all of the benefits. An experienced broker can help.',
 ' It is important to talk to your insurance professional about the specific terms and conditions of your policy, but generally speaking, theft is a named covered peril in most Home Insurance policies. This would include theft of your personal property from within your home of course, but also includes theft of your property outside of your home. Personal property outside of your home is usually covered up to 10% of the total personal property amount listed in your policy. If covered, the loss settlement would be subject to your deductible. The claims specialist will also be looking for a copy or a police report.',
 ' Ultrasound exams are usually a few hundre

### Gradio

Here is a Gradio interface you can play with: type your question and get the answer.

In [28]:
import gradio

def chat(message, history):
    history = history or []
    with towhee.api() as api:
        qa_function = (
            api.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base")
                .runas_op(func=lambda x: x.squeeze(0))
                .tensor_normalize()
                .milvus_search(collection='question_answering', limit=3)
                .runas_op(func=lambda res: [id_answer[x.id]+'\n' for x in res])
                .as_function()
        )
    response = qa_function(message)[0]
    history.append((message, response))
    return history, history

chatbot = gradio.Chatbot(color_map=("green", "gray"))
interface = gradio.Interface(
    chat,
    ["text", "state"],
    [chatbot, "state"],
    allow_screenshot=False,
    allow_flagging="never",
)
interface.launch(inline=True, share=True)

Running on local URL:  http://127.0.0.1:7861/
Running on public URL: https://53059.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<gradio.routes.App at 0x7ff4684631c0>,
 'http://127.0.0.1:7861/',
 'https://53059.gradio.app')
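
The `chat` callback above follows a simple contract: it receives the new message plus the conversation state, appends a `(message, response)` pair, and returns the history twice, once for display and once as the new state. The sketch below isolates that contract with a stub in place of the Towhee/Milvus pipeline. `make_chat` and `fake_answers` are hypothetical names for this example. Passing the answer function in from outside is one possible way to avoid rebuilding the pipeline on every message; it is not what the notebook does.

```python
def make_chat(answer_fn):
    """Wrap any `question -> list of answers` function in a chat-style callback."""
    def chat_fn(message, history):
        history = history or []              # no history yet on the first call
        response = answer_fn(message)[0]     # keep only the top-ranked answer
        history.append((message, response))
        return history, history              # one copy for display, one for state
    return chat_fn


def fake_answers(question):
    """Stub standing in for the real question answering pipeline."""
    return [f"top answer for: {question}", "second answer"]


chat_fn = make_chat(fake_answers)
shown, state = chat_fn("What is AAA?", None)
shown, state = chat_fn("Is theft covered?", state)
print(state)
```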