#### Multi-modal RAG with Captioning

This notebook implements one of several approaches to multi-modal Retrieval-Augmented Generation (RAG). It extracts text and images from a PDF, captions each image with a vision-capable model, and indexes the page text and the image captions together in a vector store so that both can be retrieved for question answering.
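
The core idea is that once an image has a text caption, it can be indexed and searched like any other text. The toy sketch below illustrates this with a plain word-overlap score in place of embeddings. The `tokenize` and `retrieve` helpers and the sample entries are hypothetical and are not part of the notebook's pipeline.

```python
# Toy illustration of caption-based multi-modal retrieval (not the notebook's real pipeline).
# Page text and image captions are both plain strings, so they can share one index.

def tokenize(text):
    """Lowercase and split on whitespace, stripping simple punctuation."""
    return {word.strip(".,?!()").lower() for word in text.split()}

def retrieve(query, entries, k=1):
    """Return the k entries whose content shares the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        entries,
        key=lambda entry: len(query_words & tokenize(entry["content"])),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical sample entries: one page of text and one image caption.
index = [
    {"source": "page 8", "content": "Table 2 reports BLEU scores and training cost."},
    {"source": "image_3_1.png", "content": "Diagram of the Transformer decoder architecture."},
]

top = retrieve("Show the decoder architecture diagram", index)
print(top[0]["source"])
```

A real system replaces the word-overlap score with embedding similarity, as the vector store section of this notebook does.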

In [15]:
import pymupdf  # PyMuPDF
from PIL import Image
import io
import os
from dotenv import load_dotenv

import google.generativeai as genai
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma


load_dotenv()

True

In [16]:
file_path="data/Attention.pdf"

##### Data Extraction

In [17]:
text_data=[]
image_data=[]

with pymupdf.open(file_path) as pdf_file:
    # Create the directory for the extracted images if it does not already exist
    os.makedirs("extracted_images", exist_ok=True)
    
    #loop through every page in pdf
    for page_number in range(len(pdf_file)):
        page = pdf_file[page_number]

        #get the text on page
        text = page.get_text().strip()
        text_data.append({'response':text,"name":page_number+1})

        #Get the list of images on the page
        images = page.get_images(full=True)

        #loop through all images on the page
        for image_index,img in enumerate(images):
            xref=img[0]
            base_image = pdf_file.extract_image(xref) #get base image
            image_bytes = base_image["image"]  #get images bytes
            image_ext = base_image["ext"] #get image extension

            #Load the image using PIL and save it
            image = Image.open(io.BytesIO(image_bytes))
            image.save(f"extracted_images/image_{page_number+1}_{image_index+1}.{image_ext}")
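
The saved file names encode the 1-based page number and image index (`image_{page}_{index}.{ext}`), so a caption can later be traced back to the page its image came from. A minimal sketch of recovering those values follows. `parse_image_name` and `IMAGE_NAME_PATTERN` are hypothetical names that the notebook does not define.

```python
import re

# Matches the naming scheme used when saving: image_{page}_{index}.{ext}
IMAGE_NAME_PATTERN = re.compile(r"^image_(\d+)_(\d+)\.(\w+)$")

def parse_image_name(filename):
    """Return (page_number, image_index, extension), or None if the name does not match."""
    match = IMAGE_NAME_PATTERN.match(filename)
    if match is None:
        return None
    page, index, ext = match.groups()
    return int(page), int(index), ext

print(parse_image_name("image_3_1.png"))
```

Storing the parsed page number in the caption's metadata would let text chunks and image captions from the same page be matched on a common key.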
            



In [21]:
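# Read the key from the environment (loaded from .env above) instead of hardcoding it.
# The variable name GOOGLE_API_KEY is an assumption; use whatever name your .env defines.
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))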
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

#### Image captioning

In [22]:
for img in os.listdir("extracted_images"):
    image = Image.open(f"extracted_images/{img}")
    response = model.generate_content([image, "You are an assistant tasked with summarizing tables, images and text for retrieval. "
        "These summaries will be embedded and used to retrieve the raw text or table elements. "
        "Give a concise summary of the table or text that is well optimized for retrieval. Table or text or image:"])
    image_data.append({"response": response.text, "name": img})

In [24]:
image_data

[{'response': 'This image depicts the architecture of a Transformer decoder.  It shows the flow of information from inputs through embedding, positional encoding, masked and unmasked multi-head attention layers, feed forward networks, add & norm layers, and finally to output probabilities via a linear and softmax layer. The decoder processes sequences sequentially, indicated by the "shifted right" outputs.  Key components include multi-head attention for context understanding and feed forward networks for transformation.\n',
  'name': 'image_3_1.png'},
 {'response': "This image shows a diagram of a scaled dot-product attention mechanism.  The process starts with three inputs (V, K, Q), which are each passed through linear layers. The outputs of these layers are fed into a scaled dot-product attention module. The attention module's output is concatenated and then passed through another linear layer to produce the final output (h).\n",
  'name': 'image_4_1.png'},
 {'response': 'This imag

#### Vectorstore

In [26]:
#Embeddings 
from langchain_huggingface import HuggingFaceEmbeddings
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

#Load the document
docs_list = [Document(page_content=text['response'],metadata={'name':text['name']}) for text in text_data]
img_list = [Document(page_content=img['response'],metadata={'name':img['name']}) for img in image_data]

#Split the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)

docs_split = text_splitter.split_documents(docs_list)
img_split = text_splitter.split_documents(img_list)
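
To build intuition for `chunk_size=400` and `chunk_overlap=50`, the sketch below cuts a string into fixed-size character windows that overlap by a set amount. This is a simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph breaks, newlines, and spaces, and treats `chunk_size` as an upper bound, so its real chunks are usually shorter and do not fall on exact offsets. `sliding_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_chunks(text, chunk_size=400, chunk_overlap=50):
    """Cut text into fixed-size windows where each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "x" * 1000  # placeholder text, 1000 characters long
pieces = sliding_chunks(sample)
print([len(piece) for piece in pieces])
```

With small chunks, a long table or paragraph is spread over several chunks, which matters when the retriever returns only one of them.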

In [27]:
#Add to vectorstore
vectorstore = Chroma.from_documents(
    documents=docs_split + img_split,
    collection_name="multi_model_rag",
    embedding=embeddings
)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={'k':1}
)

In [28]:
## Query
query = "What is the BLEU score of the Transformer (base model)?"
docs = retriever.invoke(query)

In [29]:
docs

[Document(metadata={'name': 8}, page_content='Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the\nEnglish-to-German and English-to-French newstest2014 tests at a fraction of the training cost.\nModel\nBLEU\nTraining Cost (FLOPs)\nEN-DE\nEN-FR\nEN-DE\nEN-FR\nByteNet [15]\n23.75\nDeep-Att + PosUnk [32]\n39.2\n1.0 · 1020\nGNMT + RL [31]\n24.6\n39.92\n2.3 · 1019\n1.4 · 1020\nConvS2S [8]\n25.16\n40.46\n9.6 · 1018')]

In [30]:
from langchain_core.output_parsers import StrOutputParser
from langchain_groq import ChatGroq

# Prompt
system = """You are an assistant for question-answering tasks. Answer the question based upon your knowledge. 
Use three-to-five sentences maximum and keep the answer concise."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Retrieved documents: \n\n <docs>{documents}</docs> \n\n User question: <question>{question}</question>"),
    ]
)

# LLM
groq_api_key=os.getenv("GROQ_API_KEY")
llm=ChatGroq(groq_api_key=groq_api_key,model_name="llama-3.1-8b-instant")

# Chain
rag_chain = prompt | llm | StrOutputParser()

# Run
generation = rag_chain.invoke({"documents":docs[0].page_content, "question": query})
print(generation)

Unfortunately, the BLEU score for the Transformer model is not explicitly mentioned in the given table. However, it is stated that the Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests.
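
The single retrieved chunk ends at the ConvS2S row of Table 2, so the rest of the table never reaches the LLM, which is consistent with the answer above. One possible adjustment, not tested in this notebook, is to retrieve several chunks (a larger `k`) and join them into one context string before invoking the chain. The sketch below shows only the joining step. `RetrievedChunk` is a hypothetical stand-in that mirrors the `page_content` and `metadata` attributes used above, and `format_docs` is an illustrative helper.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    """Stand-in for a retrieved document: text plus a metadata dict (hypothetical type)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(chunks):
    """Join several chunks into one context string, labelling each with its source name."""
    parts = []
    for chunk in chunks:
        name = chunk.metadata.get("name", "unknown")
        parts.append(f"[source: {name}]\n{chunk.page_content}")
    return "\n\n".join(parts)

# Hypothetical chunks standing in for the result of a retriever configured with k > 1.
chunks = [
    RetrievedChunk("First part of a table.", {"name": 8}),
    RetrievedChunk("Second part of the same table.", {"name": 8}),
]
context = format_docs(chunks)
print(context)
```

The joined string would then be passed as the `documents` value in place of `docs[0].page_content`. Whether this recovers the missing rows depends on how the table was split and which chunks rank highest.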
