<a href="https://colab.research.google.com/github/ruslanmv/Watsonx-Assistant-with-Milvus-as-Vector-Database/blob/master/notebooks/1_build_question_answering_engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Question Answering Engine in Minutes

This notebook illustrates how to build a question answering engine from scratch using [Milvus](https://milvus.io/) and [Towhee](https://towhee.io/). Milvus is an open-source vector database built for AI applications that supports nearest-neighbor embedding search across tens of millions of entries. Towhee is a framework that provides ETL for unstructured data using state-of-the-art machine learning models.

We will go through the question answering procedure and evaluate performance. With Towhee, the core functionality takes only about 10 lines of code, so you can start hacking on your own question answering engine right away.

## Preparations

### Install Dependencies

First we need to install dependencies such as towhee, towhee.models and gradio.

In [1]:
! python -m pip install -q towhee towhee.models gradio

### Prepare the Data

This demo uses a subset of the [InsuranceQA Corpus](https://github.com/shuzi/insuranceQA) (1000 question-answer pairs), which can be downloaded from [GitHub](https://github.com/towhee-io/examples/releases/download/data/question_answer.csv).

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/question_answer.csv -O



**question_answer.csv**: a file containing the questions and their answers.

Let's take a quick look:

In [3]:
import pandas as pd

df = pd.read_csv('question_answer.csv')
df.head()



|   | id | question | answer |
|---|----|----------|--------|
| 0 | 0 | Is Disability Insurance Required By Law? | Not generally. There are five states that requ... |
| 1 | 1 | Can Creditors Take Life Insurance After ... | If the person who passed away was the one with... |
| 2 | 2 | Does Travelers Insurance Have Renters Ins... | One of the insurance carriers I represent is T... |
| 3 | 3 | Can I Drive A New Car Home Without Ins... | Most auto dealers will not let you drive the c... |
| 4 | 4 | Is The Cash Surrender Value Of Life Ins... | Cash surrender value comes only with Whole Lif... |


To use the dataset to get answers, let's first define the dictionary:

- `id_answer`: a dictionary of id and corresponding answer

In [4]:
id_answer = df.set_index('id')['answer'].to_dict()
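
For readers who want to see what this lookup table contains without pandas, the same mapping can be sketched with the standard library. The sample rows below are made up for illustration, and `build_id_answer` is a hypothetical helper name, not part of the notebook. pandas infers integer keys for a numeric `id` column, so the sketch converts the ids with `int` as well, which matches the `int(...)` lookup used later in the query pipeline.

```python
import csv
import io

# Made-up sample in the same column layout as question_answer.csv.
SAMPLE_CSV = """id,question,answer
0,Is Disability Insurance Required By Law?,Not generally.
1,Can I Drive A New Car Home Without Insurance?,Most dealers will not allow it.
"""

def build_id_answer(csv_text):
    """Return a dict mapping integer question id -> answer text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The csv module yields strings, so convert ids to int to mirror the pandas result.
    return {int(row['id']): row['answer'] for row in reader}

sample_id_answer = build_id_answer(SAMPLE_CSV)
print(sample_id_answer[0])
```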

### Create Milvus Collection

Before getting started, please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

In [5]:
! python -m pip install -q pymilvus==2.2.11

Next, we define the function `create_milvus_collection`, which creates a collection in Milvus that uses the [L2 distance metric](https://milvus.io/docs/metric.md#Euclidean-distance-L2) and an [IVF_FLAT index](https://milvus.io/docs/index.md#IVF_FLAT).

### Setup Remote Server
Here we define the variable `REMOTE_SERVER` with the address of the Milvus server that was set up as described [here](https://github.com/ruslanmv/Watsonx-Assistant-with-Milvus-as-Vector-Database/blob/master/README.md).

In [6]:
REMOTE_SERVER='50.17.92.90'
LOCAL_SERVER='127.0.0.1'

In [7]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
connections.connect(host=REMOTE_SERVER, port='19530')

In [8]:
def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    fields = [
        FieldSchema(name='id', dtype=DataType.VARCHAR, description='ids', max_length=500, is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='question answering')
    collection = Collection(name=collection_name, schema=schema)

    # create IVF_FLAT index for collection.
    index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist":2048}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

collection = create_milvus_collection('question_answer', 768)
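
To give a feel for what an IVF_FLAT index does, here is a toy sketch in plain Python. It is not how Milvus implements the index: the centroids are hand-picked rather than trained, and `build_ivf` and `ivf_search` are hypothetical names used only for this illustration. The idea is that vectors are grouped into `nlist` buckets by nearest centroid, and a query scans only the `nprobe` closest buckets instead of the whole collection.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each (id, vector) pair to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        buckets[nearest].append((vec_id, vec))
    return buckets

def ivf_search(query, centroids, buckets, nprobe=1, limit=1):
    """Scan only the nprobe buckets closest to the query; return (id, distance) hits."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [(vec_id, l2(query, vec)) for i in probe for vec_id, vec in buckets[i]]
    return sorted(candidates, key=lambda hit: hit[1])[:limit]

# Toy 2-D data: two well-separated groups.
vectors = [('0', [0.0, 0.1]), ('1', [0.2, 0.0]), ('2', [5.0, 5.1]), ('3', [5.2, 4.9])]
centroids = [[0.0, 0.0], [5.0, 5.0]]  # hand-picked for the example (nlist = 2)

buckets = build_ivf(vectors, centroids)
hits = ivf_search([5.0, 5.0], centroids, buckets, nprobe=1, limit=1)
print(hits)
```

In this sketch, setting `nprobe` to the number of buckets makes the search exhaustive, while smaller values scan fewer candidates at the risk of missing a neighbor that fell into an unprobed bucket.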

## Question Answering Engine

In this section, we will show how to build our question answering engine using Milvus and Towhee. The basic idea is to use Towhee to generate an embedding for each question in the dataset, store those embeddings in Milvus, and then compare the embedding of an input question against the stored ones.

[Towhee](https://towhee.io/) is a machine learning framework that allows the creation of data processing pipelines, and it also provides predefined operators for implementing insert and query operations in Milvus.

<img src="./workflow.png" width = "60%" height = "60%" align=center />

### Load question embedding into Milvus

We first generate an embedding from each question's text with the [dpr](https://towhee.io/text-embedding/dpr) operator and insert it into Milvus. Towhee provides a [method-chaining style API](https://towhee.readthedocs.io/en/main/index.html) so that users can assemble a data processing pipeline with operators.

In [9]:
from IPython.display import clear_output

In [10]:
%%time
from towhee import pipe, ops
import numpy as np
from towhee.datacollection import DataCollection
from IPython.display import clear_output
insert_pipe = (
    pipe.input('id', 'question', 'answer')
        .map('question', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .map(('id', 'vec'), 'insert_status', ops.ann_insert.milvus_client(host=REMOTE_SERVER, port='19530', collection_name='question_answer'))
        .output()
)

import csv
with open('question_answer.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        insert_pipe(*row)
clear_output()

CPU times: user 26.9 s, sys: 2.03 s, total: 28.9 s
Wall time: 3min 29s


In [11]:
print('Total number of inserted data is {}.'.format(collection.num_entities))

Total number of inserted data is 0.

Note that the count reads 0 even though the later query returns results. `num_entities` most likely does not reflect data that has not yet been flushed; calling `collection.flush()` before reading it should report the inserted rows.


#### Explanation of Data Processing Pipeline

Here is a detailed explanation of each line of the code:

`pipe.input('id', 'question', 'answer')`: take three inputs, namely the question's id, the question's text and the question's answer;

`map('question', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))`: use the `facebook/dpr-ctx_encoder-single-nq-base` model to generate the question embedding vector with the [dpr operator](https://towhee.io/text-embedding/dpr) from the Towhee hub;

`map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))`: normalize the embedding vector;

`map(('id', 'vec'), 'insert_status', ops.ann_insert.milvus_client(host=REMOTE_SERVER, port='19530', collection_name='question_answer'))`: insert the question embedding vector into Milvus.
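
The normalization step matters because the collection uses the L2 metric: for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine similarity`, so ranking by L2 distance gives the same order as ranking by cosine similarity. Below is a small pure-Python check of that identity, using made-up vectors and hypothetical helper names (`normalize`, `squared_l2`, `dot`).

```python
import math

def normalize(vec):
    """Scale a vector to unit length, like x / np.linalg.norm(x) in the pipeline."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

cosine = dot(a, b)  # for unit vectors the dot product is the cosine similarity
print(squared_l2(a, b), 2 - 2 * cosine)  # the two values agree up to rounding
```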

### Ask Question with Milvus and Towhee

Now that the embeddings for the question dataset have been inserted into Milvus, we can ask questions with Milvus and Towhee. Again, we use Towhee to load the input question, compute an embedding, and use it as a query in Milvus. Because Milvus only outputs IDs and distance values, we use the `id_answer` dictionary to look up and display the answers based on the IDs.
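
The lookup step can be sketched in isolation. The `hits` list below is a made-up stand-in for a search result, assumed to be a list of `[id, distance]` entries as the `i[0]` indexing in the pipeline suggests. The ids are strings because the `id` field is declared as `VARCHAR`, hence the `int(...)` conversion. `answers_for_hits` is a hypothetical helper name.

```python
# Made-up lookup table and search result for illustration.
id_answer_sample = {0: 'Not generally.', 1: 'Most dealers will not allow it.'}
hits = [['1', 0.12], ['0', 0.58]]  # assumed shape: [id, distance], closest first

def answers_for_hits(hits, id_answer):
    """Map each hit's string id to its answer, skipping ids that are not in the table."""
    answers = []
    for hit in hits:
        key = int(hit[0])  # VARCHAR ids arrive as strings; the dict keys are ints
        if key in id_answer:
            answers.append(id_answer[key])
    return answers

print(answers_for_hits(hits, id_answer_sample))
```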

In [12]:
%%time
collection.load()
ans_pipe = (
    pipe.input('question')
        .map('question', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .map('vec', 'res', ops.ann_search.milvus_client(host=REMOTE_SERVER, port='19530', collection_name='question_answer', limit=1))
        .map('res', 'answer', lambda x: [id_answer[int(i[0])] for i in x])
        .output('question', 'answer')
)
ans = ans_pipe('Is  Disability  Insurance  Required  By  Law?')
ans = DataCollection(ans)
clear_output()

CPU times: user 125 ms, sys: 641 ms, total: 766 ms
Wall time: 23.9 s


In [13]:
ans.show()

question,answer
Is Disability Insurance Required By Law?,Not generally. There are five states that require most all employers carry short term disability insurance on their employees. T...


This gives us the answer to 'Is Disability Insurance Required By Law?'.

In [14]:
ans[0]['answer']

['Not generally. There are five states that require most all employers carry short term disability insurance on their employees. These states are: California, Hawaii, New Jersey, New York, and Rhode Island. Besides this mandatory short term disability law, there is no other legislative imperative for someone to purchase or be covered by disability insurance.']

## Release a Showcase

The core functionality of our question answering engine is now in place, so it's time to build a showcase with an interface. [Gradio](https://gradio.app/) is a great tool for building demos. With Gradio, we simply need to wrap the data processing pipeline in a `chat` function:

In [15]:
import towhee
def chat(message, history):
    history = history or []
    ans_pipe = (
        pipe.input('question')
            .map('question', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
            .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
            .map('vec', 'res', ops.ann_search.milvus_client(host=REMOTE_SERVER, port='19530', collection_name='question_answer', limit=1))
            .map('res', 'answer', lambda x: [id_answer[int(i[0])] for i in x])
            .output('question', 'answer')
    )
    response = ans_pipe(message).get()[1][0]
    history.append((message, response))
    return history, history
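
One design point worth noting: the function above assembles the pipeline on every call, which may repeat setup work for each message. A possible variation, sketched here with a stand-in `fake_answer` instead of the real Towhee pipeline (both `make_chat` and `fake_answer` are hypothetical names), is to build the pipeline once and close over it, so each message only runs the query.

```python
def make_chat(answer_fn):
    """Return a Gradio-style chat callback that reuses one answer function."""
    def chat(message, history):
        history = history or []          # the initial state may be None
        response = answer_fn(message)
        history.append((message, response))
        return history, history          # one copy for the chatbot, one for the state
    return chat

def fake_answer(question):
    """Stand-in for the real pipeline call, e.g. ans_pipe(question).get()[1][0]."""
    return 'echo: ' + question

chat_once = make_chat(fake_answer)
state, _ = chat_once('Is Disability Insurance Required By Law?', None)
state, _ = chat_once('Second question', state)
print(state)
```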

In [25]:
import gradio

collection.load()
chatbot = gradio.Chatbot()
interface = gradio.Interface(
    chat,
    ["text", "state"],
    [chatbot, "state"],
    allow_flagging="never",
)
clear_output()
interface.launch(inline=True, share=True)

Running on local URL:  http://127.0.0.1:7860




Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.

