## Multimodal RAG with GPT4V and LangChain

#### When do we need Multimodal RAG
Standard RAG works well with text-only files, but what if we want to use RAG with PDFs or slides that contain text, images, and tables? In that case we use Multimodal RAG.

#### Multimodal RAG explained
* Summarize the text with an LLM.
* Summarize the tables with an LLM.
* Summarize the images with a multimodal LLM (originally GPT4V; this notebook uses gpt-4o).
* Convert the summaries into numeric vectors (embeddings) and store them in the vector database behind a multi-vector retriever.
* Store the raw documents (the original text and tables, plus the image summaries) in a document store.
* When a question is asked, run a similarity search to retrieve the most relevant documents and send them, together with the question, to the LLM, which writes the answer in natural language.
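The steps above can be sketched without any external service. The following is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, and `add`, `retrieve`, `vector_index` and `docstore` are hypothetical names, not LangChain APIs. It shows the core idea: the summaries are searched, but the raw documents are returned.

```python
import uuid
from collections import Counter
from math import sqrt

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vector_index = []  # list of (summary embedding, doc_id): the "vector database"
docstore = {}      # doc_id -> raw content: the "document store"

def add(summary, raw):
    """Index the summary and store the raw document under the same id."""
    doc_id = str(uuid.uuid4())
    vector_index.append((embed(summary), doc_id))
    docstore[doc_id] = raw

def retrieve(question, k=1):
    """Rank summaries by similarity to the question, then return the raw documents."""
    q = embed(question)
    ranked = sorted(vector_index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [docstore[doc_id] for _, doc_id in ranked[:k]]

# Placeholder content, not taken from the real PDF
add("summary of total expenses table", "RAW TABLE: expenses ...")
add("summary of company overview text", "RAW TEXT: overview ...")
print(retrieve("what are the total expenses"))
```

In the real pipeline, the retrieved raw documents would then be passed to the LLM together with the question.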

In [None]:
#!pip install python-dotenv

In [4]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

#### Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain

## Connect with an LLM and start a conversation with it

If you are using the pre-loaded poetry shell, you do not need to install the following packages because they are already pre-loaded for you:

In [None]:
#!pip install langchain-openai

In [None]:
#!pip install langchain-community

* For this project, we will use OpenAI's gpt-3.5-turbo and gpt-4o. **The model gpt-4-vision-preview has been deprecated by OpenAI and is now replaced by gpt-4o**.

In [5]:
from langchain_openai import ChatOpenAI

chain_gpt_35 = ChatOpenAI(model="gpt-3.5-turbo", max_tokens=1024)
chain_gpt_4_vision = ChatOpenAI(model="gpt-4o", max_tokens=1024)

## Multimodal RAG App

If you are using the pre-loaded poetry shell, you do not need to install the following packages because they are already pre-loaded for you:

In [1]:
#!pip install openai

In [2]:
#!pip install pydantic lxml tiktoken

In [None]:
#!pip install langchain-chroma

In [3]:
#!pip install "unstructured[all-docs]"

* The unstructured module is the key here. We will use it to extract all the relevant parts of the document (text, tables and images).
* Chromadb will be our vector store.

#### In order to use the unstructured module, we will need to install two other modules: tesseract and poppler
* On macOS with Homebrew:
    * brew install tesseract
    * brew install poppler
* For other systems (Windows, etc.):
    * [info on how to install tesseract](https://tesseract-ocr.github.io/tessdoc/Installation.html)
    * [info on how to install poppler](https://pdf2image.readthedocs.io/en/latest/installation.html)
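Before running the notebook, it can help to confirm that both tools are reachable from Python. The sketch below is one possible check; it assumes that the executables are named `tesseract` and `pdftoppm` (one of the poppler utilities) and that they are expected on the PATH. The helper name `check_system_tools` is hypothetical.

```python
import shutil

def check_system_tools(tools=("tesseract", "pdftoppm")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_system_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as not found, either add its install directory to the PATH or point the relevant library at the full path of the executable.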


video: https://youtu.be/HNCypVfeTdw?si=o545WJ0FfMcM5IcC

https://youtu.be/IDu46GjahDs?si=8WLl6Wfu8yqGT1CQ

#### We will use a fake startupai-financial-report-v2.pdf file with text, tables and images

#### What we will do next:
* Import partition_pdf from the unstructured package.
* Set tesseract_cmd to the path of the Tesseract executable (the cell below uses the default Windows location of tesseract.exe; adjust it for your system).
* Set input_path and output_path.
* Call the partition_pdf function to create raw_pdf_elements:
    * pass the filename with its path
    * instruct unstructured to extract all the relevant parts of the file (text, tables and images)
    * set the chunking strategy
    * set the output path for the extracted images
* The following cell can take a few seconds to run.

In [3]:
import os

import pytesseract
from unstructured.partition.pdf import partition_pdf

# Windows example path; adjust it to where tesseract is installed on your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

input_path = os.getcwd()
output_path = os.path.join(os.getcwd(), "figures")

# Get elements
raw_pdf_elements = partition_pdf(
    filename=os.path.join(input_path, "startupai-financial-report-v2.pdf"),
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=output_path,
)



## See what we have in the raw_pdf_elements variable
* CompositeElement objects hold chunks of text.
* Table objects hold tables.
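For a longer document, listing every element is hard to read, so a count per element type gives a quicker overview. The sketch below uses two empty stand-in classes so that it runs on its own; they are placeholders for the real unstructured classes, and `count_element_types` is a hypothetical helper name.

```python
from collections import Counter

def count_element_types(elements):
    """Count how many elements of each class are in the list."""
    return Counter(type(element).__name__ for element in elements)

# Stand-in classes so the example is self-contained (NOT the real unstructured classes)
class CompositeElement:
    pass

class Table:
    pass

sample = [CompositeElement(), Table(), CompositeElement()]
print(count_element_types(sample))
```

In the notebook, the same helper could be called as `count_element_types(raw_pdf_elements)`.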

In [2]:
raw_pdf_elements

[<unstructured.documents.elements.CompositeElement at 0x31a909650>,
 <unstructured.documents.elements.Table at 0x31fa35250>,
 <unstructured.documents.elements.CompositeElement at 0x31fac7950>]

## Now we want to extract the relevant information
* We want to store the text, table and image elements in 3 lists.
* We cannot send the image files as they are; we need to encode their binary content as base64 text.
* For the text and table elements, we loop over raw_pdf_elements and append each one to its list.

In [3]:
import base64

text_elements = []
table_elements = []
image_elements = []

# Function to encode images
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# We create the text and table elements in 2 steps
# Step 1: append the entire class in the list
for element in raw_pdf_elements:
    # Text elements have "CompositeElement" in their type name
    if 'CompositeElement' in str(type(element)):
        text_elements.append(element)
    # Table elements have "Table" in their type name
    elif 'Table' in str(type(element)):
        table_elements.append(element)

# Step 2: extract just the text, we don't want to store the raw classes
table_elements = [i.text for i in table_elements]
text_elements = [i.text for i in text_elements]

# Tables
print("number of table elements in the pdf file: ", len(table_elements))

# Text
print("number of text elements in the pdf file: ", len(text_elements))

number of table elements in the pdf file:  1
number of text elements in the pdf file:  2
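To make the base64 step concrete, here is a small round trip on a few bytes. It is only an illustration: the 8 bytes used are the standard PNG file signature, not a complete image. The point is that base64 turns arbitrary binary data into plain ASCII text, which can be placed inside a JSON request, and that the original bytes can be recovered exactly.

```python
import base64

# The 8-byte PNG file signature, used here as a tiny piece of binary data
raw_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

encoded = base64.b64encode(raw_bytes).decode("utf-8")  # binary -> ASCII text
decoded = base64.b64decode(encoded)                    # ASCII text -> binary

print(encoded)
print(decoded == raw_bytes)
```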


#### Images
* They are currently stored in the "figures" folder.
* We will loop through that folder:
    * check whether the file name ends with .png, .jpg or .jpeg
    * pass the full path to the encode_image function to encode the file in base64 format
    * append the encoded result to the image list
* The following cell may take a few seconds to run.

In [4]:
for image_file in os.listdir(output_path):
    if image_file.endswith(('.png', '.jpg', '.jpeg')):
        image_path = os.path.join(output_path, image_file)
        encoded_image = encode_image(image_path)
        image_elements.append(encoded_image)
print("number of image elements in the pdf file: ",len(image_elements))

number of image elements in the pdf file:  8
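Two details of the loop above are worth noting: `os.listdir` returns files in arbitrary order, and `endswith` is case-sensitive, so a file named `chart.PNG` would be skipped. A possible variant, sketched here with a hypothetical helper `list_image_files`, sorts the files and compares suffixes in lower case. The demo uses a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}

def list_image_files(folder):
    """Return image paths in `folder`, sorted, matching suffixes case-insensitively."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )

# Demo on a temporary folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.PNG", "a.jpg", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    found = [path.name for path in list_image_files(tmp)]

print(found)
```

A stable order makes the image summaries line up with the same files on every run.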


## Now we can create 3 functions to summarize the texts, table and images
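* For the text and tables, the functions are very similar.
* For the images, we use the vision-capable model (gpt-4o, the replacement for GPT4V).
* Pay attention to how we set the URL: the base64 string is embedded in a data URL.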

In [6]:
from langchain_core.messages import HumanMessage, SystemMessage

# Function for text summaries
def summarize_text(text_element):
    prompt = f"Summarize the following text:\n\n{text_element}\n\nSummary:"
    response = chain_gpt_35.invoke([HumanMessage(content=prompt)])
    return response.content

# Function for table summaries
def summarize_table(table_element):
    prompt = f"Summarize the following table:\n\n{table_element}\n\nSummary:"
    response = chain_gpt_35.invoke([HumanMessage(content=prompt)])
    return response.content

# Function for image summaries
def summarize_image(encoded_image):
    prompt = [
        # The instruction that sets the model's role belongs in a system message
        SystemMessage(content="You are a bot that is good at analyzing images."),
        HumanMessage(content=[
            {
                "type": "text", 
                "text": "Describe the contents of this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encoded_image}"
                },
            },
        ])
    ]
    response = chain_gpt_4_vision.invoke(prompt)
    return response.content
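The data URL in summarize_image always declares `image/jpeg`, even though the figures folder may also contain PNG files. One way to declare the matching type is to derive it from the file name. The sketch below assumes the file name is available next to the encoded string; `to_data_url` is a hypothetical helper, not part of LangChain or OpenAI.

```python
import mimetypes

def to_data_url(image_path, encoded_image):
    """Build a data URL whose MIME type matches the image file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    return f"data:{mime_type};base64,{encoded_image}"

# "iVBORw0KGgo=" is just the base64 of the PNG signature, used as a placeholder
print(to_data_url("figure-1.png", "iVBORw0KGgo="))
```

Using it would require keeping each file name together with its encoded content, for example as a list of (name, encoded) pairs.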

## Now we will create a summary for each text, table and image element
* The following cells will take some time to run.
* Careful: the vision model is significantly more expensive than gpt-3.5-turbo.
* Note: if you try to summarize all of them, the Jupyter kernel may occasionally crash. If that happens, you will have to run the cell again.
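Because each summary costs an API call and a crash means starting over, it can be worth saving the summaries to disk as they are produced. The following is a minimal sketch of that idea, not part of the original notebook: `summarize_with_cache` and the JSON cache file are assumptions, and `fake_summarize` stands in for summarize_text, summarize_table or summarize_image.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def summarize_with_cache(elements, summarize_fn, cache_path):
    """Summarize each element once; reuse results stored in a JSON file."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    summaries = []
    for element in elements:
        # Key the cache on the content itself, so reordering does not matter
        key = hashlib.sha256(element.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = summarize_fn(element)
            cache_file.write_text(json.dumps(cache))  # persist after every new call
        summaries.append(cache[key])
    return summaries

# Demo with a fake summarizer that records how often it is called
calls = []

def fake_summarize(text):
    calls.append(text)
    return text.upper()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "summaries.json"
    first = summarize_with_cache(["a", "b"], fake_summarize, path)
    second = summarize_with_cache(["a", "b"], fake_summarize, path)

print(first, second, len(calls))
```

On the second pass, no new calls are made, so rerunning after a crash only pays for the elements that were not yet summarized.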

In [7]:
# Processing text elements, stopping at the 2nd
text_summaries = []
for i, te in enumerate(text_elements[0:2]):
    summary = summarize_text(te)
    text_summaries.append(summary)
    print(f"Text element {i + 1} processed.")

Text element 1 processed.
Text element 2 processed.


In [8]:
# Processing table elements, stopping at the 1st
table_summaries = []
for i, te in enumerate(table_elements[0:1]):
    summary = summarize_table(te)
    table_summaries.append(summary)
    print(f"Table element {i + 1} processed.")

Table element 1 processed.


In [10]:
# Processing image elements, stopping at the 8th
image_summaries = []
for i, ie in enumerate(image_elements[0:8]):
    summary = summarize_image(ie)
    image_summaries.append(summary)
    print(f"Image element {i + 1} processed.")

Image element 1 processed.
Image element 2 processed.
Image element 3 processed.
Image element 4 processed.
Image element 5 processed.
Image element 6 processed.
Image element 7 processed.
Image element 8 processed.


## After creating the summaries, we can now proceed with the RAG technique
* We will use LangChain's [Multi-Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector), which combines:
    * a vector database that holds the summaries and their embeddings,
    * a docstore that holds the raw (original) documents.
* The vector database will be Chroma.
* The docstore will be InMemoryStore from LangChain.
* id_key is just a string for now: the name of the metadata field that will hold each document's id. The ids themselves are generated later.
* Then we create the function to add documents to the retriever:
    * Create one uuid per document using the uuid4() function.
        * The uuid4 function is part of Python's uuid module, which generates unique identifiers according to the UUID (Universally Unique Identifier) standard.
        * uuid4 generates a random UUID based on the version 4 specification: 122 of its 128 bits are random, so a duplicate is extremely unlikely, now or in the future.
        * A UUID generated by uuid4 has the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx, where each x is a hexadecimal digit (0-9, a-f), the 4 marks the version, and y is one of 8, 9, a or b. Together they represent a 128-bit value.
        * Version 4 UUIDs are useful when you need uniqueness across different systems without a central coordinating mechanism.
    * Then create a list of documents using the Document class:
        * page_content holds the summary of the element,
        * metadata holds the uuid.
    * Add these documents to the vector database.
    * Store the raw documents in the docstore, each under its corresponding uuid.
    * As you can see, the uuids are the link between the vector database and the docstore.
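The pairing described above relies on the summaries and the raw documents being in the same order and of the same length. Python's `zip` silently stops at the shorter list, so a mismatch would go unnoticed. The sketch below adds an explicit check; `pair_with_ids` is a hypothetical helper and the strings are placeholders.

```python
import uuid

def pair_with_ids(summaries, originals):
    """Generate one uuid4 per item and pair it with both the summary and the original."""
    if len(summaries) != len(originals):
        raise ValueError(
            f"{len(summaries)} summaries but {len(originals)} originals"
        )
    doc_ids = [str(uuid.uuid4()) for _ in summaries]
    summary_pairs = list(zip(doc_ids, summaries))   # goes to the vector database
    original_pairs = list(zip(doc_ids, originals))  # goes to the docstore
    return doc_ids, summary_pairs, original_pairs

doc_ids, summary_pairs, original_pairs = pair_with_ids(
    ["summary of text 1", "summary of table 1"],
    ["full text 1", "full table 1"],
)
print(uuid.UUID(doc_ids[0]).version)  # version field of the generated id
print(len(doc_ids[0]))                # length of the canonical string form
```

The same id appears in both lists of pairs, which is what lets a hit on a summary be resolved to its raw document.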

In [11]:
import uuid

from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma

# Initialize the Chroma vector database and docstore
vectorstorev2 = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
storev2 = InMemoryStore()
id_key = "doc_id"

# Initialize the multi-vector retriever
retrieverv2 = MultiVectorRetriever(vectorstore=vectorstorev2, docstore=storev2, id_key=id_key)

# Function to add documents to the multi-vector retriever
def add_documents_to_retriever(summaries, original_contents):
    doc_ids = [str(uuid.uuid4()) for _ in summaries]
    summary_docs = [
        Document(page_content=s, metadata={id_key: doc_ids[i]})
        for i, s in enumerate(summaries)
    ]
    retrieverv2.vectorstore.add_documents(summary_docs)
    retrieverv2.docstore.mset(list(zip(doc_ids, original_contents)))

## Now we can add everything to the Multivector Retriever
* for text and tables:
    * summaries are stored in the vector database.
    * raw documents are stored in the docstore.
* for the images:
    * summaries are stored in the vector database.
    * summaries (not the raw images) are also stored in the docstore.   

In [12]:
# Add text summaries
add_documents_to_retriever(text_summaries, text_elements)

# Add table summaries
add_documents_to_retriever(table_summaries, table_elements)

# Add image summaries
add_documents_to_retriever(image_summaries, image_summaries)

## After adding that, we can now retrieve the information

In [15]:
retrieverv2.invoke(
    "What do you see in the images?"
)

['The image contains the word "STATEMENT" in capital letters, set against a solid blue background. The text is white, making it stand out prominently against the blue. The font is bold and sans-serif, which gives it a strong and clear appearance. There are no other graphics or text in the image.',
 "The image shows a modern graphics card, which is a piece of computer hardware responsible for rendering images, videos, and animations for the display. This card has a dual-fan cooling solution, with two large fans on the top to dissipate heat. You can see the heatsink fins beneath the fans, which help to increase the surface area for better thermal performance.\n\nOn the side facing us, there's a bracket with various ports that indicate this card's output options, which typically include DisplayPort, HDMI, and sometimes DVI or USB-C. The presence of multiple outputs suggests it can support multi-monitor setups.\n\nThe card is mounted on a black printed circuit board (PCB) with a PCIe (Peri

#### Now, if we use the multi-vector retriever as the context:

In [17]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

template = """Answer the question based only on the following context, which can include text, images and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

chain = (
    {"context": retrieverv2, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
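In the chain above, whatever the retriever returns is substituted into `{context}` as is. Since the docstore holds plain strings, the context is presumably rendered as a Python list of strings. If a cleaner layout is wanted, the documents can be joined explicitly first. The sketch below shows this with plain string formatting only; `format_context` and `build_prompt` are hypothetical helpers and do not call LangChain.

```python
def format_context(docs, separator="\n\n---\n\n"):
    """Join retrieved documents into one block of text, labelling each document."""
    return separator.join(
        f"[Document {index}]\n{doc}" for index, doc in enumerate(docs, start=1)
    )

def build_prompt(docs, question):
    """Fill the same template as the notebook, using the formatted context."""
    return (
        "Answer the question based only on the following context, "
        "which can include text, images and tables:\n"
        f"{format_context(docs)}\n"
        f"Question: {question}\n"
    )

# Placeholder documents, not taken from the real PDF
print(build_prompt(["first retrieved document", "second retrieved document"],
                   "What is the ROI?"))
```

Labelling each document also makes it easier to see, when debugging, which retrieved item an answer was based on.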

#### Then we can ask questions about text, images or tables in the document

In [18]:
chain.invoke(
     "What do you see on the images in the database?"
)

'Based on the context provided, the images in the database contain the following elements:\n1. A statement displayed in bold white text against a blue background.\n2. A financial figure of $22,000,000 in sales in bold blue or purple text on a white background.\n3. The text "FINANCIAL STATEMENT" in large orange letters on a navy blue background.\n4. A donut chart with two segments - one orange segment covering around 67% and one purple segment covering around 33%. In the center of the chart, there is text that reads "33% ROI" indicating the Return on Investment percentage.'

In [19]:
chain.invoke(
     "What is the name of the company?"
)

'The name of the company is StartupAI.'

In [20]:
chain.invoke(
     "What is the product displayed in the image?"
)

'The product displayed in the image is a modern graphics card.'

In [21]:
chain.invoke(
     "How much are the total expenses of the company?"
)

'The total expenses of the company are $2,000,000.'

In [22]:
chain.invoke(
     "What is the ROI?"
)

'The ROI is 33%.'

In [23]:
chain.invoke(
     "How much did the company sell in 2023?"
)

'The company sold $22 million in 2023.'

* Note: the previous answer can be seen as a mistake if we look at the bar chart, but the PDF is admittedly a bit confusing here, since it highlights the 22M sales figure in 2 different places.

In [24]:
chain.invoke(
     "And in 2022?"
)

'In 2022, the approximate value represented on the bar chart is 15 units.'

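* Note: the answer now uses the sales data from the bar chart. That data comes from the image summary written by gpt-4o; the final answer is produced by gpt-3.5-turbo from that summary. Impressive!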

## How to execute the code from Visual Studio Code
* In Visual Studio Code, see the file 001-multimodal.py
* In terminal, make sure you are in the directory of the file and run:
    * python 001-multimodal.py