<font size=6>**Advanced RAG**</font>

<font size=5>**Multi-Modal RAG for Healthcare**</font>

## Setup

This notebook assumes the following:
- Python 3.10+
- An OpenAI API key set as an environment variable:
  ```bash
  export OPENAI_API_KEY="your-key"
  ```

**Solution Approach**

The solution is an advanced Retrieval-Augmented Generation (RAG) system built from three key components:

1. **Development of a Text and Image-Based Knowledge Base:** This will compile comprehensive information about vitamins and nutrients, supplemented with relevant images to enrich user understanding.
2. **Implementation of Reranking Methodology:** This process will refine the AI system’s ability to recommend the most relevant responses to user queries, thereby enhancing accuracy and user satisfaction.
3. **Query Expansion Techniques:** These will improve the AI’s comprehension of user queries, making it more efficient in providing precise information tailored to individual needs.

## **Setup**

In [None]:
# @title Run this cell => Restart the session => Start executing the below cells **(DO NOT EXECUTE THIS CELL AGAIN)**

!pip install -q langchain==0.3.21 \
                huggingface_hub==0.29.3 \
                openai==1.68.2 \
                chromadb==0.6.3 \
                langchain-community==0.3.20 \
                langchain_openai==0.3.10 \
                lark==1.2.2 \
                rank_bm25==0.2.2 \
                numpy==2.2.4 \
                scipy==1.15.2 \
                scikit-learn==1.6.1 \
                transformers==4.50.0 \
                pypdf==5.4.0 \
                markdown-pdf==1.7 \
                sentence_transformers==4.0.0 \
                torch==2.6.0+cu124


%pip install --upgrade chromadb
%pip install pillow
%pip install open-clip-torch
%pip install matplotlib

In [None]:
import zipfile
import os

zip_path = "path"
extract_to = "./"   # current notebook directory

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

print("Extraction complete!")


In [6]:
# @title Loading the `config.json` file
import json
import os

# Load the JSON file and extract values
file_name = 'config.json'
with open(file_name, 'r') as file:
    config = json.load(file)
    os.environ['OPENAI_API_KEY'] = config.get("API_KEY") # Loading the API Key
    os.environ["OPENAI_BASE_URL"] = config.get("OPENAI_API_BASE") # Loading the API Base Url
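
Assigning `None` to `os.environ` raises a `TypeError`, so a key missing from `config.json` fails with an unhelpful message. The following is a minimal sketch of a stricter loader, assuming the same two keys (`API_KEY` and `OPENAI_API_BASE`); the helper name `load_openai_config` is hypothetical and not part of the original notebook.

```python
import json
import os


def load_openai_config(path: str = "config.json") -> dict:
    """Read the config file and export the OpenAI settings as environment variables."""
    with open(path, "r") as fh:
        config = json.load(fh)

    # Map config keys to the environment variables used in this notebook.
    key_to_env = {"API_KEY": "OPENAI_API_KEY", "OPENAI_API_BASE": "OPENAI_BASE_URL"}

    # Fail early, and name the missing keys, instead of assigning None.
    missing = [key for key in key_to_env if not config.get(key)]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")

    for key, env_name in key_to_env.items():
        os.environ[env_name] = config[key]
    return config
```

Calling `load_openai_config("config.json")` would then replace the `with open(...)` block, under the assumption that both keys are required.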

In [7]:
# @title Defining the Embedding Model - Using `text-embedding-ada-002` Model
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [8]:
# @title Defining the LLM Model - Using `gpt-4o-mini` Model
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

## **Step 1 : Loading the PDF and storing it in the vectorDB**


In [10]:
from langchain_community.document_loaders import PyPDFLoader

# Reading the source PDF (the vitamins and nutrients knowledge base)
pdf_file = "path/to/your/document.pdf"
pdf_loader = PyPDFLoader(pdf_file)
text_data = pdf_loader.load()  # returns one Document per PDF page

In [None]:
from langchain_community.vectorstores import Chroma

text_vectorstore = Chroma(
    collection_name="vitamin_and_minerals",
    embedding_function=embeddings)

text_vectorstore.add_documents(documents=text_data)

In [12]:
vanilla_retriever = text_vectorstore.as_retriever(search_kwargs={"k": 10})
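
Setting `k` to 10 asks the vector store for the 10 chunks whose embeddings are closest to the query embedding. The sketch below illustrates that top-k idea with plain Python and toy 2-D vectors. It assumes cosine similarity purely for illustration; the metric Chroma actually uses depends on how the collection is configured, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=10):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors standing in for real embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k([1.0, 0.1], docs, k=2)
print(nearest)
```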

## **Step 2 : Loading the Images along with metadata and storing them in the vectorDB**

In [13]:
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
from matplotlib import pyplot as plt
import os

In [None]:
# Create the database in the folder "my_vectordb", or load it into the client if it already exists
chroma_client = chromadb.PersistentClient(path="my_vectordb")

# Instantiate image loader helper
image_loader = ImageLoader()

# Instantiate multimodal embedding function
image_embedding_function = OpenCLIPEmbeddingFunction()

# Get the collection if it already exists, otherwise create it. The embedding function and the data loader are attached to the collection here.
multimodal_db = chroma_client.get_or_create_collection(name="multimodal_db", embedding_function=image_embedding_function, data_loader=image_loader)

In [None]:
multimodal_db.count()

In [None]:
import os
print(os.listdir("path/to/your/image/folder"))



In [None]:
from PIL import Image
import numpy as np

import os

folder = "path/to/your/image/folder"

image_paths = [f"{folder}/{file}" for file in os.listdir(folder)]
image_paths


In [None]:
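# Defining Metadata
# Built from image_paths so that every metadata entry lines up with its URI.
Metadata = []
for path in image_paths:
    vitamin = os.path.basename(path).split("-")[0][-1]  # last character before the first "-"
    Metadata.append({'Vitamin': vitamin,
                     'info': f'The image shows the sources of Vitamin {vitamin}'})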

Metadata
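
The expression `split("-")[0][-1]` keeps only the last character before the first hyphen of the file name. That works for single-letter vitamins but truncates labels such as `B12`. The sketch below contrasts the two behaviours, assuming hypothetical file names of the form `Vitamin_A-sources.jpg` (the real naming scheme of the image folder is not shown in this notebook); `vitamin_label` is a hypothetical helper.

```python
import re


def last_char_label(filename: str) -> str:
    """Mirror the notebook's expression: last character before the first hyphen."""
    return filename.split("-")[0][-1]


def vitamin_label(filename: str) -> str:
    """Capture a capital letter plus optional digits, e.g. 'A' or 'B12'."""
    prefix = filename.split("-")[0]
    match = re.search(r"([A-Z]\d*)$", prefix)
    if match is None:
        raise ValueError(f"Cannot find a vitamin label in {filename!r}")
    return match.group(1)


# Hypothetical file names, used only to show the difference.
for name in ["Vitamin_A-sources.jpg", "Vitamin_B12-sources.jpg"]:
    print(name, last_char_label(name), vitamin_label(name))
```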

In [33]:
# Use .add() to add a new record or .update() to update existing record
multimodal_db.add(
    ids=[str(i) for i in range(len(image_paths))],  # one id per image URI
    uris = image_paths,
    metadatas=Metadata
)
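
Positional ids such as `"0"`, `"1"` depend on the order in which the folder is listed, so the same image can receive a different id if files are added or removed between runs against the persistent database. One possible alternative, sketched below, derives each id from the file name. `stable_id` is a hypothetical helper, and truncating the hash to 16 hex characters is an arbitrary choice.

```python
import hashlib
import os


def stable_id(path: str) -> str:
    """Derive a repeatable id from the file name, independent of folder order."""
    file_name = os.path.basename(path)
    return hashlib.sha1(file_name.encode("utf-8")).hexdigest()[:16]


# Hypothetical paths, used only to show that the id follows the file name.
example_paths = ["images/Vitamin_A-sources.jpg", "images/Vitamin_C-sources.jpg"]
example_ids = [stable_id(p) for p in example_paths]
print(example_ids)
```

Ids built this way could be passed to the collection's `upsert()` method instead of `add()`, so that re-running the cell overwrites existing records rather than relying on positions.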

In [None]:
multimodal_db.count()

In [35]:
# Simple function to print the results of a query.
# The 'results' is a dict {ids, distances, data, ...}
# Each item in the dict is a 2d list.
def print_query_results(query_list: list, query_results: dict)->None:
    result_count = len(query_results['ids'][0])

    for i in range(len(query_list)):
        for j in range(result_count):
            id       = query_results["ids"][i][j]
            distance = query_results['distances'][i][j]
            data     = query_results['data'][i][j]
            document = query_results['documents'][i][j]
            metadata = query_results['metadatas'][i][j]
            uri      = query_results['uris'][i][j]

            print(f'id: {id}, distance: {distance}, metadata: {metadata}, document: {document}')

            # Display image, the physical file must exist at URI.
            # (ImageLoader loads the image from file)
            print(f'uri: {uri}')
            plt.imshow(data)
            plt.axis("off")
            plt.show()
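
The nested layout of the query result (outer list per query, inner list per hit) can be easier to work with once flattened. The sketch below uses a hand-written mock dictionary with made-up values rather than a real Chroma response; `flatten_results` is a hypothetical helper.

```python
def flatten_results(query_list, query_results):
    """Turn the nested per-query lists into one flat record per (query, hit)."""
    records = []
    for i, query in enumerate(query_list):
        for j, hit_id in enumerate(query_results["ids"][i]):
            records.append({
                "query": query,
                "id": hit_id,
                "distance": query_results["distances"][i][j],
                "metadata": query_results["metadatas"][i][j],
            })
    return records


# Mock result for one query with two hits; all values are made up.
mock_results = {
    "ids": [["3", "7"]],
    "distances": [[0.42, 0.57]],
    "metadatas": [[{"Vitamin": "C"}, {"Vitamin": "A"}]],
}
records = flatten_results(["citrus fruits"], mock_results)
for record in records:
    print(record)
```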

In [36]:
def get_image(query_texts):
    # Query vector db
    query_results = multimodal_db.query(
        query_texts = query_texts,
        n_results=2,
        include=['documents', 'distances', 'metadatas', 'data', 'uris'],
    )

    print_query_results(query_texts, query_results)

In [None]:
get_image(['Citrus fruits have which common vitamins'])

## **Reranking**

In [67]:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
crossencoder = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")

In [68]:
from langchain.retrievers.document_compressors import CrossEncoderReranker
reranker = CrossEncoderReranker(model=crossencoder, top_n=5)

In [69]:
from langchain.retrievers import ContextualCompressionRetriever
reranker_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=vanilla_retriever
)
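
Conceptually, the reranking retriever scores every retrieved chunk against the query and keeps the `top_n` highest-scoring chunks. The sketch below shows that logic in plain Python. A toy word-overlap score stands in for the cross-encoder, so the numbers are not comparable to real model scores; `overlap_score` and `rerank` are hypothetical helpers.

```python
import re


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words) / max(len(query_words), 1)


def rerank(query, candidates, score_fn=overlap_score, top_n=5):
    """Score every candidate against the query and keep the top_n, best first."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


candidates = [
    "Iron is found in red meat and spinach.",
    "Vitamin C is abundant in citrus fruits.",
    "Citrus fruits such as oranges contain vitamin C and folate.",
]
best = rerank("vitamin c in citrus fruits", candidates, top_n=2)
print(best)
```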

## **User Query Expansion**

In [70]:
# Query Expansion for User_query

def query_expansion(user_query):
    query_enhancement = f"""
    You are an expert in information retrieval systems, particularly skilled in enhancing queries for document search efficiency.
    Perform query expansion on the received question by considering alternative phrasings or synonyms commonly used in document retrieval contexts.
    If there are multiple ways to phrase the user's question or common synonyms for key terms, provide several reworded versions.

    If there are acronyms or words you are not familiar with, do not try to rephrase them.

    Return at least 3 versions of the question as a list.
    Generate only a list of questions. Do not mention anything before or after the list.

    Question:
    {user_query}
    """


    new_question = llm.invoke(query_enhancement)
    return (new_question.content)
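
`query_expansion` returns the model's reply as a single string. If the individual questions are needed separately, the reply has to be parsed. The sketch below assumes the model writes one question per line, optionally prefixed by a number or bullet; the actual reply format is not guaranteed, and `parse_question_list` is a hypothetical helper.

```python
import re


def parse_question_list(raw: str) -> list[str]:
    """Split a reply into questions, dropping bullets, numbering, and quotes."""
    questions = []
    for line in raw.splitlines():
        # Remove leading list markers such as "1.", "2)", "-", or "*".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip().strip("\"'")
        if cleaned:
            questions.append(cleaned)
    return questions


# Made-up reply, used only to exercise the parser.
sample_reply = """1. Which vitamins are common in citrus fruits?
2. What vitamins do citrus fruits typically contain?
- Which nutrients are found in oranges and lemons?"""
parsed = parse_question_list(sample_reply)
print(parsed)
```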

## **Step 3 : Retrieving from the PDF and image stores and returning the answer to the user**

In [None]:
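# Query Expansion
# Example query (the same one used for the image search above); replace it with your own question
user_query = 'Citrus fruits have which common vitamins'
multipule_queries = query_expansion(user_query)
print(multipule_queries)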

In [None]:
# Calling the vanilla retriever
vanilla_responses = vanilla_retriever.invoke(multipule_queries)

# Scoring each retrieved chunk with the cross-encoder shows that the chunks are not ordered by relevance score
context_query_pairs_for_scoring = [[multipule_queries, doc_text.page_content] for doc_text in vanilla_responses]
crossencoder.score(context_query_pairs_for_scoring)

**The scores in the above cell show that the 10 retrieved chunks are not ordered by cross-encoder relevance. Next, the reranking retriever selects the 5 most relevant of these 10 chunks.**

In [99]:
# Calling the reranking retriever
reranked_responses = reranker_retriever.invoke(multipule_queries)

In [None]:
# The cross-encoder scores of the reranked chunks are in descending order
context_query_pairs_for_scoring = [[multipule_queries, doc_text.page_content] for doc_text in reranked_responses]
crossencoder.score(context_query_pairs_for_scoring)
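
Rather than inspecting the scores by eye, the ordering can be checked programmatically. A minimal sketch follows; `is_sorted_descending` is a hypothetical helper and the sample scores are made up, not actual cross-encoder output.

```python
def is_sorted_descending(scores) -> bool:
    """True when each score is greater than or equal to the one after it."""
    values = [float(s) for s in scores]
    return all(a >= b for a, b in zip(values, values[1:]))


# Made-up scores standing in for cross-encoder output.
unsorted_scores = [1.2, 4.7, -0.3, 3.1]
sorted_scores = [4.7, 3.1, 1.2, -0.3]
print(is_sorted_descending(unsorted_scores), is_sorted_descending(sorted_scores))
```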

In [101]:
prompt = f"""
You are an expert AI assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.

### Context:
{reranked_responses}

### Question:
{multipule_queries}
"""

In [102]:
llm_response = llm.invoke(prompt)
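
Interpolating a list of documents into an f-string inserts their full Python representation, including metadata, into the prompt. One option is to join only the chunk text. The sketch below uses a small stand-in class instead of the real document type so that it runs on its own; `Doc` and `format_context` are hypothetical names, and reading a `page` key from the metadata is an assumption about what the PDF loader records.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document: only the fields used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs) -> str:
    """Join chunk texts into one numbered context string."""
    parts = []
    for i, doc in enumerate(docs, start=1):
        page = doc.metadata.get("page", "?")
        parts.append(f"[{i}] (page {page})\n{doc.page_content.strip()}")
    return "\n\n".join(parts)


# Made-up chunks, used only to show the resulting layout.
docs = [
    Doc("Vitamin C supports immune function. ", {"page": 4}),
    Doc("Citrus fruits are rich in vitamin C."),
]
context = format_context(docs)
print(context)
```

With the real retriever output, `format_context(reranked_responses)` could be placed under the `### Context:` heading of the prompt in place of the raw list.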

In [None]:
print("Answer: \n", llm_response.content)
print("="*50)
print("Relevant Images: \n")
get_image([llm_response.content])