# Medical Query Prototype 

## Description:
As per the document: "A prototype tool that allows users to query about medical drugs, search online medical news or research study sources via an API or any search strategies, index the relevant results, and use a Large Language Model (LLM) to summarize the information to answer the user's query."

## Discussion: 
Suppose we take medical papers and either (1) scrape the relevant data from the web or (2) scrape it from a PDF. Ideally, the pipeline would be a question-answering system that not only answers with good accuracy but also tells us when it cannot answer the query. For instance, a common question might be "What are the main competitor drugs of [name] and what are their latest news now?" If nothing in the indexed sources supports an answer, we want the system to say so; in effect, it reads the paper for us.

## Installation Instructions (my machine runs Ubuntu, a Debian-based distribution)
1. <code>pip install llama-index</code>
2. <code>pip install llama-index-readers-file</code>
3. <code>pip install llama-index-readers-web</code>
4. <code>pip install llama-index-embeddings-openai</code> OR <code>pip install llama-index-embeddings-huggingface</code>
5. <code>pip install playwright</code>, then <code>playwright install chromium</code> (needed only for the web scraper example)

# API Integration for News Search/API/PDF

## PDF Loader

In [None]:
## Define our documents. LlamaIndex makes this easy with its open-source data loaders from LlamaHub.

from pathlib import Path

from llama_index.readers.file import PDFReader  # Data loader for our PDF papers

# PDF loader
pdf_loader = PDFReader()

# Document objects (replace the placeholder path with a real PDF)
pdf_documents = pdf_loader.load_data(Path("path/to/pdf/paper.pdf"))

## Web Loader (News/Web)

In [None]:
## Define our documents. LlamaIndex makes this easy with its open-source data loaders from LlamaHub.
## This web loader pattern can be extended to news sources instead of using a dedicated news API.

# Data loader for a website. Note: dynamic sites and sites that require a login
# would need custom handling; this is a simple example.
from llama_index.readers.web import BeautifulSoupWebReader

# Placeholder: replace with the page you want to load
URL = "https://example.com/news-article"

# Web loader
web_loader = BeautifulSoupWebReader()

# Document objects
web_documents = web_loader.load_data(urls=[URL])

## Web Scraper (Example on website)

This is an example of a simple web scraper for a more specialized website, one where we must use CSS selectors or XPath expressions to extract specific data. The scraper uses Playwright.

In [None]:
## WEB SCRAPING CELL: 
from playwright.async_api import async_playwright
import asyncio

async def process_locator(locator):
    """Return the visible inner text of a locator; join multiple matches with commas."""
    count = await locator.count()
    if count > 1:  # Several elements match, so resolve each inner text
        texts = []
        for i in range(count):
            element = locator.nth(i)
            if await element.is_visible():
                texts.append(await element.inner_text())
        # Return only after the loop so every visible element is included
        return ",".join(texts) if texts else "NA"
    else:
        if await locator.is_visible():
            return await locator.inner_text()
        else:
            return "NA"


async def main():
   async with async_playwright() as pw:
       browser = await pw.chromium.launch(
           # We use Chromium for this web scraper.
           # Using a proxy creates HTTP errors.
           headless=False
       )

       #Beginning page: 
       page = await browser.new_page()
       await page.goto('https://world.openfoodfacts.org/')
       await page.wait_for_timeout(5000)
       result = []
       food_urls = []
       food_list = await page.query_selector_all('.list_product_a')
       for food in food_list:
           food_urls.append(await food.get_attribute('href'))
           
       for food_url in food_urls:
            food_info = {}
            await page.goto(food_url)
            #Title: 
            title = page.locator(".title-1")
            food_info['title'] = await process_locator(title)
            #Common Name:
            common_name = page.locator("#field_generic_name_value")
            food_info['common_name'] = await process_locator(common_name)
            #Quantity:
            quantity = page.locator("#field_quantity_value")
            food_info['quantity'] = await process_locator(quantity)
            #Packaging: 
            packaging = page.locator("#field_packaging_value")
            food_info['packaging'] = await process_locator(packaging)
            #Brands:
            brand = page.locator("#field_brands_value")
            food_info['brand'] = await process_locator(brand)
            #Categories:
            categories = page.locator("#field_categories_value")
            food_info['categories'] = await process_locator(categories)
            #Certifications:
            certifications = page.locator("#field_labels_value")
            food_info['certifications'] = await process_locator(certifications)
            #Origin:
            origin = page.locator("#field_origin_value")
            food_info['origin'] = await process_locator(origin)
            #origin of ingredients:
            origin_of_ingredients = page.locator("#field_origins_value")
            food_info['origin_of_ingredients'] = await process_locator(origin_of_ingredients)
            #Places of manufacturing:
            places_manufactured = page.locator("#field_manufacturing_places_value")
            food_info['places_manufactured'] = await process_locator(places_manufactured)
            #Stores:
            stores = page.locator("#field_stores_value")
            food_info['stores'] = await process_locator(stores)
            #Countries where Sold:
            countries_sold = page.locator("#field_countries_value")
            food_info['countries_sold'] = await process_locator(countries_sold)
           
            #HEALTH SECTION
            #Notice, because of the increasing complexity of the DOM elements in this area the CSS selectors don't follow a similarly nice pattern
            #Ingredients: 
            ingredients = page.locator("#panel_ingredients_content .panel_text")
            food_info['ingredients'] = await process_locator(ingredients)
            #NOVA score:
            nova_score = page.locator("ul#panel_nova li.accordion-navigation h4")
            food_info['nova_score'] = await process_locator(nova_score)
            # #Palm Status:
            # palm_status = page.locator(".accordion-navigation active .content panel_content active .panel_text")
            # food_info['palm_status'] = await process_locator(palm_status)
            # #Vegan Status:
            # vegan_status = page.locator("#panel_ingredients_analysis_en-vegan_content .panel_text")
            # food_info['vegan_status'] = await process_locator(vegan_status)
            # #Vegetarian Status:
            # vegetarian_status = page.locator("#panel_ingredients_analysis_en-vegetarian_content .panel_text")
            # food_info['vegetarian_status'] = await process_locator(vegetarian_status)
            #Nutrition grade:
            nutrition_grade = page.locator(".accordion-navigation .grade_a_title")
            food_info['nutrition_grade'] = await process_locator(nutrition_grade)

            # #NUTRITION FACTS
            # #
            # table_rows = await page.query_selector_all("#panel_nutrition_facts_table_content")
            # nutrition_facts = {}
            # for row in table_rows:
            #     columns = await row.query_selector_all('td')
            #     name = await process_locator(columns[0])
            #     value_per_100g = await process_locator(columns[1])
            #     nutrition_facts[name] = {
            #         "100g/100ml": value_per_100g
            #     }
                    
                    
            # food_info['nutrition_table'] = nutrition_facts
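            result.append(food_info)

       await browser.close()
       return result


# Top-level `await` works in a Jupyter cell; in a standalone script, use
# `result = asyncio.run(main())` instead.
if __name__ == '__main__':
   result = await main()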

#Problems & Changes:
#

#Citations:
#Code cited from OxyLabs: https://github.com/oxylabs/playwright-web-scraping?tab=readme-ov-file
#,https://playwright.dev/python/docs/locat
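
The scraper above passes each collected `href` straight to `page.goto`, which expects an absolute URL. If the listing page returns relative links (an assumption here, not something verified against the site), they could be resolved first. The sketch below uses only the standard library; `absolute_urls` and the sample `href` values are hypothetical.

```python
from urllib.parse import urljoin

# Base URL taken from the scraper's starting page.
BASE_URL = "https://world.openfoodfacts.org/"


def absolute_urls(hrefs, base_url=BASE_URL):
    """Resolve scraped href values against base_url, skipping missing values and duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:  # get_attribute returns None when the attribute is absent
            continue
        url = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical href values, for illustration only.
sample_hrefs = ["/product/123/example", None, "/product/123/example"]
print(absolute_urls(sample_hrefs))
```

In the scraper, the result of `absolute_urls(food_urls)` could replace `food_urls` in the second loop.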

In [None]:
## Concatenate the documents

documents = pdf_documents + web_documents
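
Note that the concatenation above covers only the PDF and web loader output; the dictionaries returned by the Playwright scraper are not included. One possible way to bring them in is to flatten each dictionary into plain text first. The sketch below is a minimal, standard-library illustration; `record_to_text` is a hypothetical helper and the sample record is made up.

```python
def record_to_text(record):
    """Render one scraped dict as 'field: value' lines, skipping "NA" placeholders."""
    lines = []
    for field, value in record.items():
        if value is None or value == "NA":
            continue  # the scraper uses "NA" for fields it could not find
        label = field.replace("_", " ")
        lines.append(f"{label}: {str(value).strip()}")
    return "\n".join(lines)


# Made-up record in the same shape as the scraper's food_info dict.
example = {"title": "Example Bar", "common_name": "NA", "countries_sold": " France,Germany "}
print(record_to_text(example))
```

Each resulting string could then be wrapped in a LlamaIndex `Document` (for example `Document(text=...)` from `llama_index.core`) and appended to `documents` before indexing.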

# Information Indexing & Relevance Filtering 

For this task, we use Retrieval-Augmented Generation (RAG). I am choosing RAG over fine-tuning for now because of the time constraint. Fine-tuning an open-source model from Hugging Face, for example, could give accurate results for a QA system, but it is generally a more expensive procedure. Fine-tuning also produces a more static model that must be fine-tuned again for every new dataset, whereas RAG is more dynamic.

RAG works by building a data structure (an index of vector embeddings) over our documents. At query time, the query is embedded in the same way, the document chunks whose embeddings are most similar to it are retrieved, and the LLM answers using those chunks as context. A particularly good library for RAG is LlamaIndex. Note: at this point we assume we already have the PDFs and relevant websites.
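
To make the retrieval step concrete, here is a toy sketch of similarity search over embeddings using only the standard library. The three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers, not LlamaIndex functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (score, chunk_id) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings for three document chunks and one query.
chunks = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.0, 0.0]

for score, chunk_id in top_k(query, chunks, k=2):
    print(chunk_id, round(score, 3))
```

In a real pipeline the retrieved chunks, not the scores, are what gets passed to the LLM as context.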

## Indexing

We then convert our documents into vector embeddings, producing a data structure that our LLM can query. Embedding models perform this step; both open-source (Hugging Face) and OpenAI models provide this functionality. The Installation Instructions list both options because OpenAI requires an API key.

In [None]:
## OpenAI usage (requires the OPENAI_API_KEY environment variable):
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()

# global
Settings.embed_model = embed_model

# per-index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [None]:
## Hugging Face usage (bge-small-en-v1.5):
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# global
Settings.embed_model = embed_model

# per-index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
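
Since the choice between the two embedding options comes down to whether an OpenAI API key is available, the decision could be automated. The following is a minimal sketch under that assumption; `choose_embedding_backend` is a hypothetical helper that only returns a label, and the key value shown is a placeholder.

```python
import os


def choose_embedding_backend(env=None):
    """Return "openai" when an OPENAI_API_KEY is set, otherwise "huggingface"."""
    env = os.environ if env is None else env
    return "openai" if env.get("OPENAI_API_KEY") else "huggingface"


# Placeholder environments, for illustration only.
print(choose_embedding_backend({"OPENAI_API_KEY": "sk-placeholder"}))
print(choose_embedding_backend({}))
```

The returned label could then decide which of the two cells above to run.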

## Retriever

Retriever objects fetch the most relevant context for a given query, which makes them one possible method of **Relevance Filtering**. Here we use the VectorIndexRetriever. (Note: there are many retriever types and configuration options.) With similarity_top_k=2, the retriever returns the two chunks most similar to the query.

In [None]:
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer (we use the tree_summarize response mode)
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
)

## Query by LLM

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the main competitor drugs of [drug name] and what are their latest news now?")
print(response)
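
The Discussion section asks that the system say so when it cannot answer. One possible approach is to drop retrieved chunks whose similarity score falls below a threshold and abstain when none remain. The sketch below illustrates the idea with the standard library only; `ScoredChunk`, `filter_or_abstain`, the 0.75 threshold, and the sample scores are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float  # similarity between the query and this chunk


NO_ANSWER = "I could not find this in the indexed sources."


def filter_or_abstain(chunks, min_score=0.75):
    """Return the chunks scoring at least min_score, or None to signal 'cannot answer'."""
    relevant = [chunk for chunk in chunks if chunk.score >= min_score]
    return relevant or None


# Made-up retrieval results, for illustration only.
retrieved = [
    ScoredChunk("Drug X competes with drug Y.", 0.82),
    ScoredChunk("Unrelated paragraph.", 0.31),
]

kept = filter_or_abstain(retrieved)
print([chunk.text for chunk in kept] if kept else NO_ANSWER)
print(filter_or_abstain([ScoredChunk("Unrelated paragraph.", 0.31)]) or NO_ANSWER)
```

LlamaIndex ships a comparable building block, `SimilarityPostprocessor` with a `similarity_cutoff` argument, which can be passed to the query engine through `node_postprocessors`; a suitable cutoff would need to be tuned on real queries.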

![title](img/basic_rag.png)


High-level RAG image from the LlamaIndex documentation: https://docs.llamaindex.ai/en/stable/getting_started/concepts/#indexing-stage