# Medical Query Prototype 

## Description:
As per the document: "A prototype tool that allows users to query about medical drugs, search online medical news or research study sources via an API or any search strategies, index the relevant results, and use a Large Language Model (LLM) to summarize the information to answer the user's query."

## Summary: 
Suppose that we take medical papers and either (1) web-scrap relevant data (**Integration for News Search**) (2) similarly scrap from a pdf. First, we need API's that link the data or scrap our files in such a format that is required by many LLMS. The data collection is then followed by some pre-processing. It would be preferable that our pipeline be a Question Answering system that can not only answer within a good accuracy but tell us when it cannot answer the query. For instance, if a common question is that of "What are the main competitor drugs of [name] and what are their latest news now?" If no such token allows for an answer we would want it to tell us so, essentially reading the paper for us. If an answer exists we want to know how "good" of an answer it is. As such a relevancy assessment would be required (**Relevance Filtering**). The best tech for this task would be Retrieval Augmented Generation (RAG). (NOTE: Fine-tuning would also work, but RAGs are more dynamic and better for this task). RAG allow for our data to be structured as an embedding (**Information Indexing**), and combined with our query the LLM can provide an answer specific to our data (**Summarization**)! In my opinion,  Llama-Index provides the best library to tackle RAG. Finally, a user interface is suggested with the use of gradio (**User Interface**)

# Design Flow Structure 

Below is a presented flow design structure for the overall proposal. 

![title](img/Design.png)

# API Integration for News Search/API/PDF

Llama-index provides Data Loaders from LlamaHub which are pre-made APIs to communicate with certain data! As such this allows alot of functionality.

## PDF Loader

In [None]:
## Define our documents, LlamaIndex makes this particularly easy with it's Data Loaders on LlamaHub object and open source use.

from llama_index.readers.file import PDFReader #We have a Data Loader for our PDF papers

#PDF Loader
pdf_loader = PDFReader()

#Document Objects 
pdf_documents = pdf_loader.load_data("path/to/pdf/paper.pdf")

## Web Loader (News/Web)

In [None]:
## Define our documents, LlamaIndex makes this particularly easy with it's Data Loaders on LlamaHub object and open source use.
## This sort of web loader pattern can be extend towards News Sources instead of having a specific API for it. 

from llama_index.readers.web import BeautifulSoupWebReader #We have a Data Loader for a website (Note: This can potentially be difficult to do initially
#and would require some other custom changes for dynamic sites, as well as sites that require logins and so forth, but this is a simple example.)

#Web Loader
web_loader = BeautifulSoupWebReader()

#Document Objects 
web_documents = web_loader.load_data(urls=[URL])

## Web Scraper (Example on website)

This would be an example of how to create a simple webscraper for a more specialized website, one in which we must use the CSS selectors or XML paths to get specific data. This web scraper employs the use of playwright. The data extracted from these scrapers must then be preproccesed into a Document object for Llama index. 

In [None]:
## Example of a more specialized web scraper.
from playwright.async_api import async_playwright
import asyncio

async def process_locator(locator):
    count = await locator.count()
    if count > 1: #We have many elements and must resolve each inner text
        texts = ""
        for i in range(count):
            element = locator.nth(i)
            if await element.is_visible():
                inner_text = await element.inner_text()
                texts = texts + "," + inner_text
                return texts
    else:
        if await locator.is_visible():
            return await locator.inner_text()
        else:
            return "NA"
    

async def main():
   async with async_playwright() as pw:
       browser = await pw.chromium.launch(
           ##We'll employ the use of chromium for this webscraper
           ##Using a proxy creates HTTP errors.
          headless=False
      )

       #Beginning page: 
       page = await browser.new_page()
       await page.goto('PAPER_URL')
       await page.wait_for_timeout(5000)
       result = []
       nextPage_urls = []
       page_list = await page.query_selector_all('.list_paper_a') #Assuming we have a website of lists of papers
       for page in page_list:
           nextPage_urls.append(await page.get_attribute('href'))
           
       for page_url in nextPage_urls:
            paper_info = {}
            await page.goto(page_url)
           ##NOTE: These CSS selectors are highly dependent in the overall structure of the page, and the paper. However, making an assumption 
           #of some website that has drug information in a somewhat consistent manner we can extract relevant information as follows. 
            #Title: 
            title = page.locator(".title-1")
            paper_info['title'] = await process_locator(title)
            #Common Name:
            drug_name = page.locator("#field_generic_name_value")
            paper_info['drug_name'] = await process_locator(common_name)
            #Quantity:
            quantity = page.locator("#field_quantity_value")
            paper_info['quantity'] = await process_locator(quantity)
            #Packaging: 
            packaging = page.locator("#field_packaging_value")
            paper_info['packaging'] = await process_locator(packaging)
            #Brands:
            brand = page.locator("#field_brands_value")
            paper_info['brand'] = await process_locator(brand)
            #Categories:
            categories = page.locator("#field_categories_value")
            paper_info['categories'] = await process_locator(categories)
            #Certifications:
            certifications = page.locator("#field_labels_value")
            paper_info['certifications'] = await process_locator(certifications)
            #Origin:
            origin = page.locator("#field_origin_value")
            paper_info['origin'] = await process_locator(origin)
            #origin of ingredients:
            origin_of_ingredients = page.locator("#field_origins_value")
            paper_info['origin_of_ingredients'] = await process_locator(origin_of_ingredients)
            #Places of manufacturing:
            places_manufactured = page.locator("#field_manufacturing_places_value")
            paper_info['places_manufactured'] = await process_locator(places_manufactured)
            #Stores:
            stores = page.locator("#field_stores_value")
            paper_info['stores'] = await process_locator(stores)
        
            result.append(food_info)
            


       
       

       
           
           
           
       await browser.close()
       return result
if __name__ == '__main__':
   result = await main()
#CITATIONs: 
#Code cited from OxyLabs: https://github.com/oxylabs/playwright-web-scraping?tab=readme-ov-file
#,https://playwright.dev/python/docs/locat

In [None]:
## Concatenated the documents

documents = pdf_documents + web_documents

# Information Indexing & Relevance Filtering 

For this task, we employ the use of Retrieval Augmented Generation (RAG). Note that I am choosing this over fine-tuning at this moment due to the time constraint. Fine-tuning a specific open source model on Hugging Face, for example, could provide accurate results for a QA system, but is generally more expensive of a procedure. Additionally, fine-tuning provides a more static model, and requires a "re fine-tuning" with every new dataset. whereas RAG is more dynamic.

RAG's essentially work by creating a data structure (embeddings) of our documents, and allowing our LLM to mathematically provide an answer to our query based on the vector embeddings of both. Note: At this point we assume to have the pdfs and relevant websites.

![title](img/basic_rag.png)


High-level RAG image cited from Llama Index: https://docs.llamaindex.ai/en/stable/getting_started/concepts/#indexing-stage

## Indexing

We then take our documents and return vector emebeddings, allowing for a data structure that allows our LLMs to query the data. For this procedure we can employ the use of LLM's to embedded our data, OpenSource (HuggingFace) and OpenAI llms provide this functionality. Notice, in the Installation Instructions I provided a choice for both, this is the case as an API key is required for OpenAI use. 

In [None]:
## OpenAI usage: 
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# global
Settings.embed_model = OpenAIEmbedding()

# per-index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [None]:
##  Hugging Face usage (bge-small-en-v1.5):
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5" #The model that we use from Hugging Face provides alot of diversity of choice for one that is best suited for task
)

# per-index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

## Retriever 

Retrievers objects allow us to get the most relevant answer given a query. As such, this is a possible method to **Relevance Filtering**. For this, we use the VectorIndexRetriever. (Note: There are many selections and preferences for retriever) The retriever allows for fetching the maximum relevant context for our query. 

In [None]:
# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer (We use the default response synthesizer)
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
)

## Query by LLM

Note that this is the simplest implementation, Llama Index provides an extensive ability of control, and as such we can have more "low-level" code here. For instance, we can have a streaming response (similar to ChatGPT) that reduces the perceived latency of the response, a user experience choice. 

In [None]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query  
response = query_engine.query("What are the main competitor drugs of [drug name] and what are their latest news now?")

## BONUS: Agents

We can extend our query engine to act more "intelligently" on our data by further calling external services or our own defined tools on the data given. Agent use could be a great addition to the task, as for instance, we can create a better summarization process through the use of specifically created agent tools! 

https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/

## Evaluation (Summarization)

Llama provides an extensive library for evaluation of our pipeline. There are a few methods that would be the most beneficial for monitoring the performance of our pipeline. We would like to have our model tested on question by our documents before relying on the answer given for our actual wanted query. To do have this functionality we use llama-index Question Generation. We then use a RelevancyEvaluator, that allows use to evaluate the "relevancy" of the answers generated by our model. Below we have an example with gpt-4, again this can be extend towards open source models. 

In [None]:
## Suppose we have our document object prepared
from llama_index.core.evaluation import DatasetGenerator, RelevancyEvaluator
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Response
from llama_index.llms.openai import OpenAI
import pandas as pd

data_generator = DatasetGenerator.from_documents(documents)
eval_questions = data_generator.generate_questions_from_nodes() ## Here we generate questions!

In [None]:
# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")

evaluator_gpt4 = RelevancyEvaluator(llm=gpt4) ## Here we create the Relevancy Object

# create vector index
vector_index = VectorStoreIndex.from_documents(documents)

In [None]:
# define jupyter display function, that produces a visual our query and response
def display_eval_df(query: str, response: Response, eval_result: str) -> None:
    eval_df = pd.DataFrame(
        {
            "Query": query,
            "Response": str(response),
            "Source": (
                response.source_nodes[0].node.get_content()[:1000] + "..."
            ),
            "Evaluation Result": eval_result,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

The cell below produces a dataframe to show the relevancy of our answers. 

CITED: https://docs.llamaindex.ai/en/stable/examples/evaluation/QuestionGeneration/

In [None]:
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query(eval_questions[1])
eval_result = evaluator_gpt4.evaluate_response(
    query=eval_questions[1], response=response_vector
)

NOTICE: The use of the relevancy evaluator would be incredibly useful for an open-source model because if we decided to use fine-tuning with RAG we can then produce distribution of relevancy based on the each fine tune, or each model!

In [None]:
## We can then extend this to a open-source model, and finally evaluate on our query before accepting the answer. 
# create llm
llm = OpenAI(model="HUGGING_FACE_MODEL", temperature=0.0)

# build index
...

# define evaluator
evaluator = RelevancyEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
query = "Could you identify the primary competitve drugs to [NAME] and clarify whether they incorporated real-world evidence in their submissions to the FDA or EMA?"
response = query_engine.query(query)
response_str = response.response
for source_node in response.source_nodes:
    eval_result = evaluator.evaluate(
        query=query,
        response=response_str,
        contexts=[source_node.get_content()],
    )
    print(str(eval_result.passing))

Llama provides a great image for this usage pattern:

![title](img/eval_query_response_context.png)

# User Interface Design

I would like to pitch the use of Gradio for a simple user-interface. This library gives machine learning models/pipelines a great web interface. This open source python library creates quick web applications for ML models, such that they can be easily shared among the required users. This UI can then be hosted and used within the company easily. 

CITE: https://www.gradio.app/

In [1]:
import gradio as gr

def process_query(dataset, param1, param2, param3, query):
    # Placeholder for processing logic.
    # The pipeline that we have created will be called here: 
    #...
    #...
    answer = "Processed answer based on the dataset and parameters would appear here."
    return answer

# Placeholder for the save documentation function
def save_documentation(dataset, query, answer):
    # Placeholder for saving documentation logic.
    # This function should be expanded to actually save the documentation in a persistent store.
    print(f"Documentation saved for dataset: {dataset}, query: {query}, answer: {answer}")
    return "Documentation has been saved."

# Create the Gradio interface
with gr.Blocks() as demo:
    gr.Markdown("<h1>Paper Query</h1>")  # Title for the UI
    
    with gr.Row():
        with gr.Column():
            dataset = gr.Dropdown(["Paper Documentation 1", "Paper Documentation 2", "Paper Documentation 3"], label="Dataset")
            param1 = gr.Checkbox(label="Parameter 1")
            param2 = gr.Checkbox(label="Parameter 2")
            param3 = gr.Checkbox(label="Parameter 3")
            submit_button = gr.Button("Submit")
            save_button = gr.Button("Save this documentation")  # Button for saving documentation
        with gr.Column():
            answer = gr.Textbox(label="Answer", lines=10, placeholder="Your answer will appear here...")
            feedback_good = gr.Button("Good Answer")
            feedback_bad = gr.Button("Bad Answer")
            feedback_response = gr.Label()  # To display a response after feedback is submitted

    query = gr.Textbox(label="Query", placeholder="Type your query here...")
    
    # # Function bindings
    # submit_button.click(
    #     fn=process_query,
    #     inputs=[dataset, param1, param2, param3, query],
    #     outputs=[answer]
    # )
    
    # save_button.click(
    #     fn=save_documentation,
    #     inputs=[dataset, query, answer],
    #     outputs=[]
    # )

    # feedback_good.click(
    #     fn=lambda is_good, answer: "Thank you for your feedback!",
    #     inputs=[True, answer],
    #     outputs=[feedback_response]
    # )
    
    # feedback_bad.click(
    #     fn=lambda is_good, answer: "Thank you for your feedback!",
    #     inputs=[False, answer],
    #     outputs=[feedback_response]
    # )

# Launch the app
demo.launch()



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




## Gradio

Here is an image of the gradio design. Below we have a submit button to follow the text box for our query. Additionally, we have a feedback (similar to OpenAI gpt) where we can say if the answer is good or bad. Further, we can save our document and fix some parameter if they are present! The Label area would present visuals such as a distribution of relevancy etc. 

![title](img/gradio_design.png)

## Installation Instructions. (My machine is Ubuntu (Debian))
1. <code> pip install llama-index <code>
2. <code> pip install llama-index-readers-file <code>
3. <code> pip install llama-index-readers-web <code>
4. <code> pip install llama-index-embeddings-openai <code> OR <code> pip install llama-index-embeddings-huggingface <code>
5. <code> pip install llama-index-llms-openai <code>

**UI interface requirements**: 

<code> pip install gradio <code>

**Working Prototype requirements**:

<code> pip install git+https://github.com/huggingface/transformers <code>
<code> pip install bitsandbytes <code> 
<code> pip install accelerate <code>

**Notice! CUDA must also be installed for the specific system**

## Simple Working Prototype

Below is a working prototype of the design, run on a pdf of a paper I found that is potentially similar to a paper Evidinno may encounter. (Response Stream the answer below) (PDF USED: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10475455/#:~:text=In%202022%2C%20the%20FDA%20approved,(NBE)%20%5B1%5D.)

In [1]:
import llama_index.core



In [5]:
##Imports
from llama_index.readers.file.docs.base import PDFReader #We have a Data Loader for our PDF papers
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings
from llama_index.llms.huggingface import HuggingFaceLLM

In [6]:
#CREATE DOCUMENTS
#PDF Loader
pdf_loader = PDFReader()


#Document Objects 
pdf_documents = pdf_loader.load_data("s43556-023-00138-y.pdf")

In [7]:
pdf_documents

[Document(id_='4830b885-bf7a-4ad4-b1de-2187a6ea2b92', embedding=None, metadata={'page_label': '1', 'file_name': 's43556-023-00138-y.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Zhang\xa0 et\xa0al. Molecular Biomedicine            (2023) 4:26  \nhttps://doi.org/10.1186/s43556-023-00138-y\nREVIEW Open Access\n© The Author(s) 2023. Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which \npermits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the \noriginal author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or \nother third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line \nto the material. If material is not included in the article’s Creative Commons licence and yo

In [15]:
# setup prompts - specific to StableLM
from llama_index.core import PromptTemplate
#Quantize the model to save memory
from transformers import BitsAndBytesConfig

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

# quantize to save memory
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

In [28]:
# setup prompts - specific to StableLM
from llama_index.core import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM will answer the query as best as possible for the user. 
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

# quantize to save memory (This is a choice for Open Source LLMs, if we have an API this is not needed.)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


In [31]:
import torch
print(torch.cuda.is_available())

False


In [None]:
import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096, "quantization_config": quantization_config},
    # uncomment this if using CUDA to reduce memory usage
    model_kwargs={"torch_dtype": torch.float16}
)

Settings.llm = llm
Settings.chunk_size = 1024

In [None]:
##Embedding our documents
index = VectorStoreIndex.from_documents(pdf_documents)

In [None]:
query_engine = index.as_query_engine(streaming=True)

Suppose we see this in the pdf: 

![title](img/question.png)

Thus, we create the question: "When was Vonoprazan first launched and where?" as an initial test.

In [None]:
##Here we get our response
response_stream = query_engine.query("When was Vonoprazan first lanched and where?")

In [None]:
#Response Stream reduces latency
response = response_stream.get_response()

In [33]:
print(response)

Vonoprazan was first launched on February 2015 in Japan.<|endoftext|>
