## Multimodal RAG using LangChain Expression Language and GPT4-Vision
https://medium.aiplanet.com/multimodal-rag-using-langchain-expression-language-and-gpt4-vision-8a94c8b02d21

### Install required dependencies

In [3]:
# Note: unstructured is NOT pinned here (an earlier note said to lock it to 0.10.19); -U installs the latest release
! pip install --quiet -U pdf2image pytesseract "unstructured[all-docs]" pillow pydantic lxml matplotlib tiktoken open_clip_torch torch
! pip install --quiet -U langchain langchain-community langchain-openai openai chromadb langchain-experimental # (newest versions required for multi-modal)
#! apt install poppler-utils
#! apt install tesseract-ocr

### Data Loading

Use the `partition_pdf` function from Unstructured to extract text, tables, and images from the PDF.

In [4]:
path = "LLaVA/"
file_name = "LLaVA.pdf"

In [2]:
# Extract images, tables, and chunk text
import os
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename=os.path.join(path, file_name),
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    extract_image_block_output_dir=path,
)

tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))

print(f"Tables: {len(tables)}")
print(f"Texts: {len(texts)}")

Tables: 4
Texts: 31


### Multi-modal embeddings and Chroma storage

In [5]:
from langchain_community.vectorstores.chroma import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings


# Create chroma
vectorstore = Chroma(
    collection_name=str(path.replace("/", "_")) + "vectorstore",
    embedding_function=OpenCLIPEmbeddings(),
)

# Get image URIs with .jpg extension only
image_uris = sorted([os.path.join(path, image_name) for image_name in os.listdir(path) if image_name.endswith(".jpg")])

# Add images
vectorstore.add_images(uris=image_uris)

# Add documents
vectorstore.add_texts(texts=texts)

# Make retriever
retriever = vectorstore.as_retriever(
    # search_kwargs={"k": 3},
)
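
Because OpenCLIP embeds images and text into the same vector space, a single similarity search over the collection can return either kind of item. The following is a minimal sketch of that ranking step in plain Python, not Chroma's actual implementation; the three-dimensional vectors and the item ids are made-up toy values standing in for real CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the ids of the k items most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["id"] for item in ranked[:k]]


# Toy store mixing text and image entries (hypothetical ids and vectors).
store = [
    {"id": "text:experiments", "vec": [0.9, 0.1, 0.0]},
    {"id": "image:figure-1.jpg", "vec": [0.8, 0.3, 0.1]},
    {"id": "text:page-number", "vec": [0.0, 0.1, 0.9]},
]

query = [1.0, 0.2, 0.0]
print(top_k(query, store, k=2))  # ['text:experiments', 'image:figure-1.jpg']
```

Uncommenting `search_kwargs={"k": 3}` in the retriever above plays the role of `k` in this sketch.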

### Image processing

In [6]:
import base64
import io

from PIL import Image


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string.

    Args:
    base64_string (str): Base64 string of the original image.
    size (tuple): Desired size of the image as (width, height).

    Returns:
    str: Base64 string of the resized image.
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def is_base64(s):
    """Check if a string is Base64 encoded"""
    try:
        return base64.b64encode(base64.b64decode(s)) == s.encode()
    except Exception:
        return False


def split_image_text_types(docs):
    """Split retrieved documents into Base64-encoded images and plain texts"""
    images = []
    text = []
    for doc in docs:
        doc = doc.page_content  # Extract Document contents
        if is_base64(doc):
            # Resize image to avoid OAI server error
            images.append(resize_base64_image(doc, size=(250, 250)))  # base64 encoded str
        else:
            text.append(doc)
    return {"images": images, "texts": text}
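
One caveat with the round-trip check in `is_base64`: a short text chunk made only of Base64-alphabet characters, with a length that is a multiple of four, also passes it and would be routed to the image branch. A possible hardening, sketched below under the assumption that the stored images are JPEG or PNG, is to also check the decoded bytes for an image signature. `looks_like_base64_image` and `IMAGE_SIGNATURES` are hypothetical names, not part of the notebook.

```python
import base64
import binascii

# Magic-number prefixes for JPEG and PNG files (assumed image formats).
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def is_base64(s):
    """Same round-trip check as in the notebook."""
    try:
        return base64.b64encode(base64.b64decode(s)) == s.encode()
    except Exception:
        return False


def looks_like_base64_image(s):
    """Hypothetical stricter check: valid Base64 that decodes to image bytes."""
    try:
        raw = base64.b64decode(s, validate=True)
    except (binascii.Error, ValueError):
        return False
    return raw.startswith(IMAGE_SIGNATURES)


# A few bytes that merely start with the JPEG signature (not a real image).
fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 8).decode("utf-8")

print(is_base64("Abstract"))                # True: plain text passes the round trip
print(looks_like_base64_image("Abstract"))  # False
print(looks_like_base64_image(fake_jpeg))   # True
```

The signature test only inspects the first bytes, so it is a cheap filter rather than a full image validation.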

### Retrieval Augmented Generation chain

In [7]:
from operator import itemgetter

from langchain_openai.chat_models import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import (
    RunnableLambda,
    RunnablePassthrough,
    RunnableParallel,
)


def prompt_func(data_dict):
    # Joining the context texts into a single string
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        image_message = {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{data_dict['context']['images'][0]}"},
        }
        messages.append(image_message)

    # Adding the text message for analysis
    text_message = {
        "type": "text",
        "text": (
            "As a secretary, your task is to extract and interpret both textual and visual information from the document, leveraging the rich context provided. The content has been sourced based on specific keywords input by the user. "
            "**If the document does not contain direct references or clear data relevant to the user's query, you must clearly state 'No sufficient reference available to provide an answer' and refrain from answering further.**\n\n"
            f"Keywords provided by the user: {data_dict['question']}\n\n"
            "Extracted content:\n"
            f"{formatted_texts}"
        ),
    }
    messages.append(text_message)

    return [HumanMessage(content=messages)]
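
Note that `prompt_func` attaches only the first retrieved image (`images[0]`). If more than one image should reach the model, the content list can be built in a loop. The sketch below is a plain-Python illustration of that idea; `build_content_parts` and `max_images` are hypothetical names, and the cap of three images is an arbitrary assumption, not a documented limit.

```python
def build_content_parts(images, texts, question, max_images=3):
    """Build a message content list with up to `max_images` images plus one text part."""
    parts = [
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{img}"},
        }
        for img in images[:max_images]
    ]
    parts.append(
        {
            "type": "text",
            "text": (
                f"Keywords provided by the user: {question}\n\n"
                "Extracted content:\n" + "\n".join(texts)
            ),
        }
    )
    return parts


# Placeholder strings stand in for real Base64 image data.
parts = build_content_parts(["AAAA", "BBBB"], ["chunk one", "chunk two"], "related work")
print(len(parts))         # 3
print(parts[-1]["type"])  # text
```

The returned list could be passed as `HumanMessage(content=parts)` in place of `messages` in `prompt_func`.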

In [19]:
model = ChatOpenAI(temperature=0, model="gpt-4-vision-preview", max_tokens=4096)

# RAG pipeline
chain = {
    "context": retriever | RunnableLambda(split_image_text_types),
    "question": RunnablePassthrough(),
} | RunnableParallel(
    {
        "response": prompt_func | model | StrOutputParser(),
        "context": itemgetter("context"),
    }
)
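
To make the data flow of the chain easier to follow, here is a plain-Python stand-in that mimics the shape of the values at each stage. It is only an illustration: `fake_retrieve` and `fake_model` are stubs invented for this sketch and call neither Chroma nor OpenAI.

```python
def fake_retrieve(question):
    """Stub for `retriever | RunnableLambda(split_image_text_types)`."""
    return {"images": [], "texts": ["LLaVA is evaluated on ScienceQA."]}


def fake_model(prompt):
    """Stub for `prompt_func | model | StrOutputParser()`; returns a string."""
    return f"(answer based on {len(prompt)} prompt characters)"


def run_chain(question):
    # Stage 1: the leading dict runs both branches on the same input.
    stage1 = {"context": fake_retrieve(question), "question": question}

    # Stage 2: the parallel step produces the answer and forwards the context.
    prompt = f"Keywords: {stage1['question']}\n" + "\n".join(stage1["context"]["texts"])
    return {"response": fake_model(prompt), "context": stage1["context"]}


result = run_chain("ScienceQA results")
print(sorted(result))                # ['context', 'response']
print(result["context"]["texts"])    # ['LLaVA is evaluated on ScienceQA.']
```

Returning the context alongside the response is what lets a caller show the retrieved images and texts next to the answer.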

### Helper function for displaying information

In [20]:
from IPython.display import HTML, display


def plt_img_base64(img_base64):
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'

    # Display the image by rendering the HTML
    display(HTML(image_html))

In [21]:
from langchain_community.callbacks.manager import get_openai_callback


def displayRAG(question):
    print("###QUESTION###")
    print(question, end="\n\n")

    with get_openai_callback() as callback:
        response = chain.invoke(question)
        print(callback, end="\n\n")

    print("###ANSWER###")
    print(response["response"], end="\n\n")

    for i, image in enumerate(response["context"]["images"]):
        print(f"###IMAGE{i+1}###")
        plt_img_base64(image)

    for i, text in enumerate(response["context"]["texts"]):
        print(f"###TEXT{i+1}###")
        print(text, end="\n\n")

In [22]:
vectorstore.similarity_search("related work")

[Document(page_content='25'),
 Document(page_content='GPT-4 (text-only) judge: Since there is no context or image to refer to, I cannot independently analyze the rocking chair in question. However, I can try to analyze the reasoning given by the assistants. Assistant 1, who claims to have seen the image, mentions that the legs of the rocking chair are made of wood, and the back and seat are made of silk, implying that it’s a combination of two materials. However, this response seems less plausible, as silk is not a common material used for the construction of rocking chairs due to issues with stability and durability. Typically, one would expect the frame, legs, and seat or back to be made of wood, with cushions and padding made of fabrics, such as cotton or polyester. Assistant 2 argues that rocking chairs are typically made of sturdy materials, and wood is a common material used for making rocking chairs due to its strength, durability, and natural appearance. This reasoning stands t
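
The first hit above is the bare string '25', which looks like a page number rather than useful content. One option, sketched here as an assumption rather than something this notebook does, is to drop very short chunks before calling `add_texts`. `drop_short_chunks` and the `min_chars` threshold of 20 are hypothetical.

```python
def drop_short_chunks(chunks, min_chars=20):
    """Keep only chunks whose stripped length is at least `min_chars`."""
    return [c for c in chunks if len(c.strip()) >= min_chars]


sample = [
    "25",
    "   ",
    "We assess the performance of LLaVA in instruction-following.",
]
kept = drop_short_chunks(sample)
print(len(kept))  # 1
```

A suitable threshold depends on the document, so it is worth inspecting what gets dropped before indexing.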

### Q&A

In [23]:
displayRAG("What is the performance of LLaVA across multiple image domains / subjects?")

###QUESTION###
What is the performance of LLaVA across multiple image domains / subjects?

Tokens Used: 2265
	Prompt Tokens: 2256
	Completion Tokens: 9
Successful Requests: 1
Total Cost (USD): $0.022829999999999996

###ANSWER###
No sufficient reference available to provide an answer.

###TEXT1###
5 Experiments

We assess the performance of LLaVA in instruction-following and visual reasoning capabilities with two primary experimental settings: multimodal chatbot and the ScienceQA dataset, respectively. We train all models with 8× A100s, following Vicuna’s hyperparameters [9]. We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32. See Appendix for more training details.

5.1 Multimodal Chatbot

We developed a chatbot demo to show the image understanding and conversation abilities of LLaVA, and to stud

In [24]:
displayRAG("What are the related work?")

###QUESTION###
What are the related work?

Tokens Used: 2439
	Prompt Tokens: 2430
	Completion Tokens: 9
Successful Requests: 1
Total Cost (USD): $0.02457

###ANSWER###
No sufficient reference available to provide an answer.

###TEXT1###
What are the meals that | can cook with these? With the variety of food items stored in the refrigerator, you can prepare several meals. For example, you can create a fruit salad using the strawberries, blueberries, and carrots. Additionally, you can make a delicious smoothie using the fruits and vegetables, which would be a healthy and nutritious snack or breakfast option. The milk and eggs can be used to create various dishes, such as a milk-based smoothie, a LLaVA | baked casserole, or scrambled eggs. The juice can also be used to make a fresh fruit and juice mix or as a refreshing homemade popsicle. The possibilities are vast, and the meals you can create will depend on your preferences and dietary needs. | would like to do the fruit salad. Show me 

In [None]:
displayRAG("Moses and the Messengers from Canaan")