# WatsonX BackEnd Pipeline

## Preparations

First, install the dependencies: `towhee`, `towhee.models`, `gradio`, `datasets`, and `ipywidgets`.

In [1]:
!python3 -m pip install -q towhee towhee.models gradio datasets ipywidgets

### Prepare the Data

In [2]:
from datasets import load_dataset
from IPython.display import clear_output
dataset = load_dataset("ruslanmv/ai-medical-chatbot")
clear_output()

**dataset**: the `ruslanmv/ai-medical-chatbot` dataset, in which each record holds a patient question and the doctor's answer.

Let's take a quick look:

In [3]:
column_names = dataset["train"].column_names
print(column_names)

['Description', 'Patient', 'Doctor']


In [4]:
train_data = dataset["train"]
for i in range(1):
    print(train_data[i])

{'Description': 'Q. What does abutment of the nerve root mean?', 'Patient': 'Hi doctor,I am just wondering what is abutting and abutment of the nerve root means in a back issue. Please explain. What treatment is required for\xa0annular bulging and tear?', 'Doctor': 'Hi. I have gone through your query with diligence and would like you to know that I am here to help you. For further information consult a neurologist online -->'}


For this demo, we use only the first few dialogues. The cell below takes the first 5; a larger slice such as `train_data[:1000]` works the same way.

In [5]:
import pandas as pd
df = pd.DataFrame(train_data[:5])

In [6]:
df.head()

Unnamed: 0,Description,Patient,Doctor
0,Q. What does abutment of the nerve root mean?,"Hi doctor,I am just wondering what is abutting...",Hi. I have gone through your query with dilige...
1,Q. What should I do to reduce my weight gained...,"Hi doctor, I am a 22-year-old female who was d...",Hi. You have really done well with the hypothy...
2,Q. I have started to get lots of acne on my fa...,Hi doctor! I used to have clear skin but since...,Hi there Acne has multifactorial etiology. Onl...
3,Q. Why do I have uncomfortable feeling between...,"Hello doctor,I am having an uncomfortable feel...",Hello. The popping and discomfort what you fel...
4,Q. My symptoms after intercourse threatns me e...,"Hello doctor,Before two years had sex with a c...",Hello. The HIV test uses a finger prick blood ...


For the development of the model, let us keep just the `Description` column (as the question) and the `Doctor` column (as the answer).

In [7]:
#df = df[["Patient", "Doctor"]].rename(columns={"Patient": "question", "Doctor": "answer"})
df = df[["Description", "Doctor"]].rename(columns={"Description": "question", "Doctor": "answer"})

In [8]:
# Add the 'id' column as the first column
df.insert(0, 'id', df.index)
# Reset the index and drop the previous index column
df = df.reset_index(drop=True)

In [9]:
import re
# Collapse runs of whitespace in the 'question' and 'answer' columns
df['question'] = df['question'].apply(lambda x: re.sub(r'\s+', ' ', x.strip()))
df['answer'] = df['answer'].apply(lambda x: re.sub(r'\s+', ' ', x.strip()))
# Remove the leading "Q." marker (the dot is escaped so that it matches only a literal period)
df['question'] = df['question'].str.replace(r'^Q\.', '', regex=True)
# Cap the question length, because the embedding model does not accept long inputs
max_length = 500
df['question'] = df['question'].str.slice(0, max_length)
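
The cleaning steps can be tried on a single string without loading the dataset. The sketch below is only an illustration: `clean_question` is a hypothetical helper, not part of the notebook, and it also strips the space after the `Q.` marker, which the notebook cell does not do. The last two lines show why the dot in the pattern should be escaped.

```python
import re

def clean_question(text, max_length=500):
    """Collapse whitespace, drop a leading 'Q.' marker, and cap the length.

    Hypothetical helper for illustration; the notebook applies the same
    steps column-wise with pandas.
    """
    text = re.sub(r'\s+', ' ', text.strip())   # \s also matches the non-breaking space \xa0
    text = re.sub(r'^Q\.\s*', '', text)        # escaped dot: only a literal "Q." is removed
    return text[:max_length]                   # cap by characters, not tokens

sample = 'Q.  What does abutment\xa0of the nerve root mean?'
print(clean_question(sample))

# An unescaped dot matches any character, so '^Q.' also eats the "u" here:
print(re.sub('^Q.', '', 'Quick question'))     # ick question
print(re.sub(r'^Q\.', '', 'Quick question'))   # Quick question
```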

In [10]:
df.head()

Unnamed: 0,id,question,answer
0,0,What does abutment of the nerve root mean?,Hi. I have gone through your query with dilige...
1,1,What should I do to reduce my weight gained d...,Hi. You have really done well with the hypothy...
2,2,I have started to get lots of acne on my face...,Hi there Acne has multifactorial etiology. Onl...
3,3,Why do I have uncomfortable feeling between t...,Hello. The popping and discomfort what you fel...
4,4,My symptoms after intercourse threatns me eve...,Hello. The HIV test uses a finger prick blood ...


To use the dataset to get answers, let's first define the dictionary:

- `id_answer`: a dictionary of id and corresponding answer

In [11]:
id_answer = df.set_index('id')['answer'].to_dict()


### Create Milvus Collection

Before getting started, please make sure that you have started a [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

In [12]:
!python3 -m pip install -q pymilvus==2.2.11 python-dotenv

Next, we define the function `create_milvus_collection`, which creates a collection in Milvus that uses the [L2 distance metric](https://milvus.io/docs/metric.md#Euclidean-distance-L2) and an [IVF_FLAT index](https://milvus.io/docs/index.md#IVF_FLAT).

### Setup Remote Server
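Here we define the variable `REMOTE_SERVER`, the address of the Milvus server created by following [this README](https://github.com/ruslanmv/Watsonx-Assistant-with-Milvus-as-Vector-Database/blob/master/README.md). If it is not set, the notebook falls back to the local server at `127.0.0.1`.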

In [13]:
LOCAL_SERVER='127.0.0.1'
from dotenv import load_dotenv
import os
load_dotenv()
host_milvus = os.environ.get("REMOTE_SERVER", LOCAL_SERVER)
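
The fallback logic in the cell above can be checked in isolation. This is a minimal sketch under assumptions: `resolve_milvus_host` is a hypothetical helper, the address `203.0.113.7` is a documentation-only placeholder, and, unlike the notebook cell, the helper treats an empty `REMOTE_SERVER` value as unset.

```python
import os

LOCAL_SERVER = '127.0.0.1'

def resolve_milvus_host(env=None, default=LOCAL_SERVER):
    """Return REMOTE_SERVER from the given mapping, or the local default.

    Hypothetical helper; passing a plain dict instead of os.environ makes
    the behaviour easy to test.
    """
    env = os.environ if env is None else env
    host = env.get('REMOTE_SERVER', '').strip()
    return host or default   # an empty value falls back to the default

print(resolve_milvus_host({}))                                 # 127.0.0.1
print(resolve_milvus_host({'REMOTE_SERVER': '203.0.113.7'}))   # 203.0.113.7
```

With `python-dotenv`, the variable would typically come from a `.env` file containing a line such as `REMOTE_SERVER=<address>`.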

In [14]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
connections.connect(host=host_milvus, port='19530')

In [15]:
collection_name = 'qa_medical_big'

def create_milvus_collection(collection_name, dim):
    # Drop any existing collection with the same name so that we start from scratch
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    fields = [
        FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='medical question answering')
    collection = Collection(name=collection_name, schema=schema)

    # Create an IVF_FLAT index on the embedding field
    index_params = {
        'metric_type': 'L2',
        'index_type': 'IVF_FLAT',
        'params': {'nlist': 2048}
    }
    collection.create_index(field_name='embedding', index_params=index_params)
    return collection

# 768 is the dimension of the DPR embeddings
collection = create_milvus_collection(collection_name, 768)
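
To give some intuition for the `IVF_FLAT` index and its `nlist` parameter, here is a toy, pure-Python sketch of the inverted-file idea: vectors are grouped into buckets around centroids, and a query scans only the closest buckets. It is a simplified illustration and not how Milvus is implemented. The centroids are hand-picked (a real index learns `nlist` of them from the data), and the function names and the `nprobe` argument of the toy search are assumptions made for this example.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(vid)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe=1, limit=1):
    """Scan only the nprobe buckets closest to the query, then rank by L2."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in buckets[c]]
    ranked = sorted(candidates, key=lambda vid: l2(query, vectors[vid]))
    return [(vid, l2(query, vectors[vid])) for vid in ranked[:limit]]

# Toy data: two well-separated groups of 2-D vectors
vectors = {0: [0.0, 0.1], 1: [0.2, 0.0], 2: [5.0, 5.1], 3: [5.2, 4.9]}
centroids = [[0.1, 0.05], [5.1, 5.0]]   # hand-picked; here nlist would be 2
buckets = build_ivf(vectors, centroids)
print(buckets)                           # {0: [0, 1], 1: [2, 3]}
print(ivf_search([5.0, 5.0], vectors, centroids, buckets, nprobe=1, limit=1))
```

The query only compares against the two vectors in the nearest bucket, which is the trade-off the index makes: fewer comparisons in exchange for a small risk of missing a neighbour that sits in an unprobed bucket.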

### Load question embedding into Milvus

We first generate an embedding for each question with the [dpr](https://towhee.io/text-embedding/dpr) operator and insert it into Milvus. Towhee provides a [method-chaining style API](https://towhee.readthedocs.io/en/main/index.html) so that users can assemble a data processing pipeline from operators.

In [16]:
from IPython.display import clear_output

In [17]:
%%time
from towhee import pipe, ops
import numpy as np
from towhee.datacollection import DataCollection
from IPython.display import clear_output
max_input_length = 500  # Maximum number of characters passed to the embedding model
insert_pipe = (
    pipe.input('id', 'question', 'answer')
        .map('question', 'vec', lambda x: x[:max_input_length])  # Truncate the question if longer than 500 characters
        .map('vec', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .map(('id', 'vec'), 'insert_status', ops.ann_insert.milvus_client(host=host_milvus, port='19530', collection_name=collection_name))
        .output()
)
clear_output()  

CPU times: user 828 ms, sys: 1.52 s, total: 2.34 s
Wall time: 11 s
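
The `insert_pipe` above divides each embedding by its norm before inserting it. One consequence worth noting: for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine similarity`, so ranking by L2 gives the same order as ranking by cosine similarity. The sketch below checks this identity on two small made-up vectors using only the standard library; the helper names are chosen for this example.

```python
import math

def normalize(vec):
    """Scale a vector to unit length, as the pipeline does with np.linalg.norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

lhs = squared_l2(a, b)
rhs = 2 - 2 * dot(a, b)
print(lhs, rhs)   # the two values agree up to floating-point rounding
```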


In [18]:
%%time
# Insert every row of the DataFrame into Milvus
for index, row in df.iterrows():
    question = row['question']
    # Questions were already capped at 500 characters above; this check is only a safeguard
    if len(question) > 500:
        row['question'] = question[:500]
    insert_pipe(*row)
# Clear the output
clear_output()

CPU times: user 688 ms, sys: 46.9 ms, total: 734 ms
Wall time: 5.36 s


In [20]:
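# Note: num_entities only counts flushed (persisted) entities, so it can report 0
# right after insertion; calling collection.flush() first makes the count reflect the inserts.
print('Total number of inserted data is {}.'.format(collection.num_entities))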

Total number of inserted data is 0.


### Ask Question with Milvus and Towhee

Now that the embeddings for the question dataset have been inserted into Milvus, we can ask questions with Milvus and Towhee. Again, we use Towhee to load the input question, compute an embedding, and use it as a query in Milvus. Because Milvus only outputs IDs and distance values, we use the `id_answer` dictionary to look up the answers by ID and display them.

In [21]:
from towhee import pipe, ops
import numpy as np
from towhee.datacollection import DataCollection
from IPython.display import clear_output
# Define the maximum input length for the question
max_input_length = 512

In [22]:
%%time
# Load the collection
collection.load()
# Create the combined pipe for question encoding and answer retrieval
combined_pipe = (
    pipe.input('question')
        .map('question', 'vec', lambda x: x[:max_input_length])  # Truncate the question if longer than 512 characters
        .map('vec', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .map('vec', 'res', ops.ann_search.milvus_client(host=host_milvus, port='19530', collection_name=collection_name, limit=1))
        .map('res', 'answer', lambda x: [id_answer[int(i[0])] for i in x])
        .output('question', 'answer')
)
clear_output()  

CPU times: user 46.9 ms, sys: 484 ms, total: 531 ms
Wall time: 5.07 s
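
The last `map` step of `combined_pipe` turns search hits into answers by looking up each hit's ID. The sketch below isolates that step on made-up data so it can be run without Milvus. It assumes each hit is a sequence whose first element is the ID and whose second is the distance, as the pipeline's `i[0]` access suggests. The helper `hits_to_answers`, the `max_distance` cut-off, and the toy dictionary are hypothetical additions, not part of the notebook.

```python
def hits_to_answers(hits, id_to_answer, max_distance=None):
    """Map (id, distance) search hits to answers.

    Hypothetical helper: unknown ids are skipped instead of raising KeyError,
    and hits farther than max_distance (if given) are dropped.
    """
    answers = []
    for hit in hits:
        hit_id, distance = int(hit[0]), float(hit[1])
        if max_distance is not None and distance > max_distance:
            continue
        if hit_id in id_to_answer:
            answers.append(id_to_answer[hit_id])
    return answers

# Made-up data standing in for id_answer and a Milvus search result
toy_id_answer = {0: 'answer zero', 1: 'answer one'}
toy_hits = [[0, 0.12], [7, 0.30], [1, 0.95]]

print(hits_to_answers(toy_hits, toy_id_answer))                    # ['answer zero', 'answer one']
print(hits_to_answers(toy_hits, toy_id_answer, max_distance=0.5))  # ['answer zero']
```

A distance cut-off of this kind is one possible way to avoid returning an unrelated answer when the nearest stored question is still far from the query; a suitable threshold would have to be chosen empirically.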


In [23]:
%%time
# Perform the encoding and retrieval for a specific question
ans = combined_pipe('What does abutment of the nerve root mean?')
ans = DataCollection(ans)

2024-02-19 12:25:09,757 - 140031972804160 - node.py-node:167 - INFO: Begin to run Node-_input
2024-02-19 12:25:09,758 - 140034343503424 - node.py-node:167 - INFO: Begin to run Node-lambda-0
2024-02-19 12:25:09,759 - 140031964350016 - node.py-node:167 - INFO: Begin to run Node-text-embedding/dpr-1
2024-02-19 12:25:09,759 - 140031955895872 - node.py-node:167 - INFO: Begin to run Node-lambda-2
2024-02-19 12:25:09,760 - 140031972804160 - node.py-node:167 - INFO: Begin to run Node-ann-search/milvus-client-3
2024-02-19 12:25:09,761 - 140031947441728 - node.py-node:167 - INFO: Begin to run Node-lambda-4
2024-02-19 12:25:09,761 - 140034343503424 - node.py-node:167 - INFO: Begin to run Node-_output


CPU times: user 0 ns, sys: 15.6 ms, total: 15.6 ms
Wall time: 434 ms


In [24]:
ans.show()

question,answer
What does abutment of the nerve root mean?,Hi. I have gone through your query with diligence and would like you to know that I am here to help you. For further information...


We can then read the answer to 'What does abutment of the nerve root mean?' directly from the result:

In [25]:
ans[0]['answer']

['Hi. I have gone through your query with diligence and would like you to know that I am here to help you. For further information consult a neurologist online -->']