## Semi-structured and Multi-modal RAG

Many documents contain a mixture of content types, including text, tables, and images.

Semi-structured data can be challenging for conventional RAG for at least two reasons:

* Text splitting may break up tables, corrupting the data in retrieval
* Embedding tables may pose challenges for semantic similarity search

And the information captured in images is typically lost.
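To make the first point concrete, here is a minimal sketch in plain Python of how a fixed-size character splitter can cut a table in the middle of a row. The table contents, the `split_fixed` helper, and the chunk size are all made up for illustration; real text splitters are more sophisticated, but they can fail in the same way.

```python
# Hypothetical flattened table; the values are illustrative only.
table = (
    "Model | Accuracy\n"
    "A | 90.1\n"
    "B | 85.4\n"
    "C | 78.9\n"
)


def split_fixed(text, chunk_size):
    """Naive splitter: cut every `chunk_size` characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_fixed(table, chunk_size=20)

# A row is "broken" when a chunk boundary falls inside it,
# so the complete row appears in no single chunk.
rows = table.strip().split("\n")
broken_rows = [row for row in rows if not any(row in chunk for chunk in chunks)]

for chunk in chunks:
    print(repr(chunk))
print("Rows split across chunks:", broken_rows)
```

A retriever that returns only one of these chunks would hand the LLM a partial row, which is the kind of corruption described above.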

With the emergence of multimodal LLMs, like [GPT4-V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG:

`Option 1:`

* Use multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text
* Retrieve both using similarity search
* Pass raw images and text chunks to a multimodal LLM for answer synthesis

`Option 2:`

* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images
* Embed and retrieve text
* Pass text chunks to an LLM for answer synthesis

`Option 3:`

* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images
* Embed and retrieve image summaries with a reference to the raw image
* Pass raw images and text chunks to a multimodal LLM for answer synthesis   

This cookbook shows how we might tackle this:

* We will use [Unstructured](https://unstructured.io/) to parse images, text, and tables from documents (PDFs).
* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text, and (optionally) images along with their summaries for retrieval.
* We will demonstrate `Option 2`, and will follow up on the other approaches in future cookbooks.


## Packages

In [None]:
! pip install langchain unstructured[all-docs] pydantic lxml langchainhub langchain_openai chromadb langchain_community


Collecting langchain
  Downloading langchain-0.2.1-py3-none-any.whl (973 kB)
Collecting unstructured[all-docs]
  Downloading unstructured-0.14.2-py3-none-any.whl (2.0 MB)
Collecting langchainhub
  Downloading langchainhub-0.1.16-py3-none-any.whl (4.8 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.1.7-py3-none-any.whl (34 kB)
Collecting chromadb
  Downloading chromadb-0.5.0-py3-none-any.whl (526 kB)
Collecting langchain-core<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_core-0.2.1-py3-none-any.whl (308 kB)

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')


## Data Loading

### Partition PDF tables, text, and images
  
* `LLaVA` Paper: https://arxiv.org/pdf/2304.08485.pdf
* Use [Unstructured](https://unstructured-io.github.io/unstructured/) to partition elements

In [None]:
# !sudo apt install poppler-utils
# !sudo apt install tesseract-ocr
!apt-get install tesseract-ocr
!apt-get install libpoppler-cpp-dev
!apt-get install poppler-utils


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 1s (4,814 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 121918 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-

In [None]:
# path = "/Users/rlm/Desktop/Papers/LLaVA/"
path = "/content/"

In [None]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename=path + "llava.pdf",
    # Whether to extract embedded image blocks from the PDF (disabled here)
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard max on chunk size (characters)
    max_characters=4000,
    # Attempt to start a new chunk after 3800 chars
    new_after_n_chars=3800,
    # Attempt to keep chunks > 2000 chars by combining smaller ones
    combine_text_under_n_chars=2000,
    # Directory where extracted images would be written
    image_output_dir_path=path,
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/115M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/46.8M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# !pip install PyPDF2
import PyPDF2

def get_num_pages(pdf_path):
    with open(pdf_path, "rb") as file:
        pdf = PyPDF2.PdfReader(file)
        return len(pdf.pages)

pdf_path = "/content/llava.pdf"
print(get_num_pages(pdf_path))

25


In [None]:
#  import os
#  os.getcwd()

'/content'

In [None]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# Unique_categories will have unique elements
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 31,
 "<class 'unstructured.documents.elements.Table'>": 4}

In [None]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

4
31
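The counting and categorization cells above both match on the string form of `type(element)`. A slightly more compact variant, sketched here with stand-in classes so it runs without Unstructured installed, keys on the class name instead. The names `counts_by_name` and `label_for` are hypothetical and not part of the original notebook.

```python
from collections import Counter


# Stand-in classes (hypothetical) that mimic the two element types seen above.
class CompositeElement:
    pass


class Table:
    pass


elements = [CompositeElement(), Table(), CompositeElement()]

# Count by class name rather than by the full str(type(...)) string.
counts_by_name = Counter(type(el).__name__ for el in elements)

# Map class names to the labels used in this notebook.
label_for = {"Table": "table", "CompositeElement": "text"}
labels = [label_for[type(el).__name__] for el in elements]

print(counts_by_name)
print(labels)
```

With the real elements, an `isinstance(element, Table)` check against the class from `unstructured.documents.elements` would serve the same purpose.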


## Multi-vector retriever

Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary).

Summaries are used to retrieve raw tables and / or raw chunks of text.
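The idea can be sketched without any library: search runs over the summaries, but the lookup returns the raw content stored under the same id. The sketch below is a toy stand-in, not LangChain's implementation; `ToyMultiVectorRetriever`, the bag-of-words `embed` function, and the sample strings are all assumptions made for illustration.

```python
import math
import uuid
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words term counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyMultiVectorRetriever:
    def __init__(self):
        self.summary_index = []  # (summary embedding, doc_id) pairs: what gets searched
        self.docstore = {}       # doc_id -> raw content: what gets returned

    def add(self, summary, raw):
        doc_id = str(uuid.uuid4())
        self.summary_index.append((embed(summary), doc_id))
        self.docstore[doc_id] = raw

    def invoke(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.summary_index, key=lambda pair: cosine(q, pair[0]), reverse=True)
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


toy = ToyMultiVectorRetriever()
toy.add("table of accuracy scores per model", "Model | Acc\nA | 90.1\nB | 85.4")
toy.add("paragraph describing the training data", "We collect image-text pairs ...")
print(toy.invoke("accuracy scores table"))
```

The query matches the table's summary, yet the caller receives the raw table, which is the behavior the multi-vector retriever provides with a real vector store and docstore.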

### Text and Table summaries

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate


In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


Collecting unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-8_11zubv/unsloth_52e1a6e5924e4194aae796acd54c766d
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-8_11zubv/unsloth_52e1a6e5924e4194aae796acd54c766d
  Resolved https://github.com/unslothai/unsloth.git to commit b0781339f035c72b3028d846eb2261e8115cd375
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tyro (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading tyro-0.8.4-py3-none-any.whl (102 kB)
Collecting datasets>=2.16.0 (from unsloth[colab-ne

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import concurrent.futures

# Load a sequence-to-sequence summarization model and its tokenizer.
# The original run pointed at "meta-llama/Llama-2-7b", a gated, decoder-only
# checkpoint: it cannot be loaded with AutoModelForSeq2SeqLM and produced the
# gated-repo error recorded in the output below.
# Assumption: "facebook/bart-large-cnn" is swapped in here as a public seq2seq summarizer.
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define the summarization pipeline
summarize_chain = pipeline("summarization", model=model, tokenizer=tokenizer)

texts = [i.text for i in text_elements]


# Function to summarize a single text
def summarize_text(text):
    # Skip empty or whitespace-only inputs
    if not text or len(text.strip()) == 0:
        return "Empty or invalid text provided."

    try:
        # truncation=True keeps long inputs within the model's maximum input length
        summary = summarize_chain(
            text, max_length=50, min_length=25, do_sample=False, truncation=True
        )
        return summary[0]["summary_text"]
    except IndexError as e:
        return f"Error summarizing text: {e}"
    except Exception as e:
        return f"Unexpected error: {e}"


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b.
401 Client Error. (Request ID: Root=1-6653a1af-3c1c52166962082a774bfe08;63358356-979a-4e1d-8e75-dbe555b11971)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b is restricted. You must be authenticated to access it.

In [None]:

# Apply summarization in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    text_summaries = list(executor.map(summarize_text, texts))

# Print the summaries
for i, summary in enumerate(text_summaries):
    print(f"Summary {i + 1}: {summary}")

Your max_length is set to 50, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)


Summary 1: Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks. We present the first attempt to use language-only GPT-4 to
Summary 2: One of the core aspirations in artificial intelligence is to develop a general-purpose assistant that can effectively follow multi-modal vision-and-language instructions. In this line of work, each task is solved independently by one single large vision
Summary 3: In computer vision, existing works that build instruction- following agents can be broadly categorized into two classes: (i) End-to-end trained models, which are separately explored for each specific research topic. (ii) A system that
Summary 4: We use COCO images and generate three types of instruction-following data. One example per type is shown in the bottom block of Table 14. For each type of data, we manually design the annotations we have used as seed
Summary 5: We design a conver
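The `max_length` warning in the output above is emitted when an input is shorter than the requested summary length. One way to avoid it, sketched here under the assumption that the input's token count is known, is to cap `max_length` below the input length. The helper name `summary_lengths` and its defaults are hypothetical, chosen to match the values used in this notebook.

```python
def summary_lengths(n_input_tokens, max_cap=50, min_floor=25):
    """Pick (max_length, min_length) so the summary budget stays below the input length."""
    max_length = max(1, min(max_cap, n_input_tokens - 1))
    min_length = min(min_floor, max_length)
    return max_length, min_length


for n in (3, 40, 900):
    print(n, summary_lengths(n))
```

The token count could come from `len(tokenizer(text)["input_ids"])`, and the resulting pair can be passed as `max_length` and `min_length` when calling the summarization pipeline.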

In [None]:
# # Apply to text
# texts = [i.text for i in text_elements]
# text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

In [None]:
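# Build a summarization pipeline for the tables. The original call used
# model="meta-llama/Llama-2-7b", which failed with the OSError recorded below.
# Assumption: "facebook/bart-large-cnn" is swapped in as a public seq2seq summarizer.
summarize_chain = pipeline("summarization", model="facebook/bart-large-cnn")
tables = [i.text for i in table_elements]


# Function to summarize a single table (passed in as flattened text)
def summarize_table(text):
    # Skip empty or whitespace-only inputs
    if not text or len(text.strip()) == 0:
        return "Empty or invalid text provided."

    try:
        # truncation=True keeps long inputs within the model's maximum input length
        summary = summarize_chain(
            text, max_length=50, min_length=25, do_sample=False, truncation=True
        )
        return summary[0]["summary_text"]
    except IndexError as e:
        return f"Error summarizing text: {e}"
    except Exception as e:
        return f"Unexpected error: {e}"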




OSError: meta-llama/Llama-2-7b does not appear to have a file named config.json. Checkout 'https://huggingface.co/meta-llama/Llama-2-7b/tree/main' for available files.

In [None]:
from huggingface_hub import login
login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:

# Apply summarization in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    table_summaries = list(executor.map(summarize_table, tables))

# Print the summaries
for i, summary in enumerate(table_summaries):
    print(f"Summary {i + 1}: {summary}")


Summary 1: Conversation Detail description Complex reasoning All Full data Detail + Complex Conv + 5% Detail + 10% Complex Conversation No Instruction Tuning 83.1 81.5 (-1.6) 81.0 (-2.1) 76
Summary 2: OpenFlamingo [5] BLIP-2 [28] LLaVA LLa VA† 19.3 ± 0.5 54.6 ± 1.4 57.3± 1.9 58.8 ± 0
Summary 3: G1-6 G7-12 Average Representative & SoTA methods with numbers reported in the literature 90.23 Human [34] 74.64 GPT-3.5 w/ CoT [34) 75.44 G
Summary 4: Predict answer first Training from scratch 7B model size 89.96 (-0.96) 89.77 (-1.15) - -


### Images

We will implement `Option 2` discussed above:

* Use a multimodal LLM ([LLaVA](https://llava.hliu.cc/)) to produce text summaries from images
* Embed and retrieve text
* Pass text chunks to an LLM for answer synthesis

#### Image summaries

We will use [LLaVA](https://github.com/haotian-liu/LLaVA/), an open source multimodal model.

We will use [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) to run LLaVA locally (e.g., on a Mac laptop):

* Clone [llama.cpp](https://github.com/ggerganov/llama.cpp)
* Download the LLaVA model: `mmproj-model-f16.gguf` and one of `ggml-model-[f16|q5_k|q4_k].gguf` from [LLaVA 7b repo](https://huggingface.co/mys/ggml_llava-v1.5-7b/tree/main)
* Build
```
mkdir build && cd build && cmake ..
cmake --build .
```
* Run inference across images:
```
/Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p "Describe the image in detail. Be specific about graphs, such as bar plots." --image "$img" > "$output_file"
```

In [None]:
%%bash

# Define the directory containing the images, adjust the path as necessary
# IMG_DIR="/content/drive/My Drive/path_to_your_images/"
IMG_DIR="figures/"

# Check if the directory contains any .jpg files and then print their names
if ls "${IMG_DIR}"*.jpg 1> /dev/null 2>&1; then
    # Loop through each image in the directory
    for img in "${IMG_DIR}"*.jpg; do
        # Extract the base name of the image without extension
        base_name=$(basename "$img" .jpg)
        echo $base_name
    done
else
    echo "No JPG files found in the specified directory."
fi

figure-15-6
figure-16-7
figure-17-8
figure-17-9
figure-18-10
figure-18-11
figure-19-12
figure-19-13
figure-21-14
figure-21-15
figure-23-16
figure-3-1
figure-4-2
figure-6-3
figure-8-4
figure-8-5
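The listing above is in lexicographic order, so `figure-15-6` comes before `figure-3-1`. If the figures should be processed in document order, a numeric sort key is one option. The sketch below assumes the names follow a `figure-<page>-<index>` pattern, which is inferred from the listing rather than documented, and `figure_sort_key` is a hypothetical helper.

```python
import re


def figure_sort_key(name):
    """Sort key for names like 'figure-15-6': compare the numeric parts as integers."""
    return [int(n) for n in re.findall(r"\d+", name)]


names = ["figure-15-6", "figure-3-1", "figure-8-5", "figure-8-4"]
print(sorted(names, key=figure_sort_key))
```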


In [None]:
# !git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp
# !git checkout llava
# !mkdir build && cd build
# !cmake .. && cmake --build . --config Release
# !mkdir -p ~/.ai/bin/llava
# !cp bin/llava bin/ggml-metal.metal ~/.ai/bin/llava

fatal: not a git repository (or any parent up to mount point /home)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).


In [None]:
# %%bash

# # Define the directory containing the images
# IMG_DIR="/test/figures/"

# # Loop through each image in the directory
# for img in "${IMG_DIR}"*.jpg; do
#     # Extract the base name of the image without extension
#     base_name=$(basename "$img" .jpg)
#     echo $base_name
#     # Define the output file name based on the image name
#     output_file="${IMG_DIR}${base_name}.txt"

#     # Execute the command and save the output to the defined output file
#     /test/llama.cpp/bin/llava -m ../model/ggml-model-q5_k.gguf --mmproj ../model/mmproj-model-f16.gguf --temp 0.1 -p "Describe the image in detail. Be specific about graphs, such as bar plots." --image "$img" > "$output_file"

# done


Inference_with_LLaVa_for_multimodal_generation.ipynb LLaVA Llava_demo_4bit.ipynb Semi_structured_and_multi_modal_RAG_(2).ipynb Semi_structured_and_multi_modal_RAG_ori.ipynb Untitled.ipynb figures figures.zip llama.cpp llava.pdf


In [None]:
%%bash

# Define the directory containing the images
IMG_DIR="/test/figures/"

# Loop through each image in the directory
for img in "${IMG_DIR}"*.jpg; do
    # Extract the base name of the image without extension
    base_name=$(basename "$img" .jpg)
    echo $base_name
    # Define the output file name based on the image name
    output_file="${IMG_DIR}${base_name}.txt"

    # Execute the command and save the output to the defined output file
    /test/llama.cpp/bin/llava -m ../model/ggml-model-q5_k.gguf --mmproj ../model/mmproj-model-f16.gguf --temp 0.1 -p "Describe the image in detail. Be specific about graphs, such as bar plots." --image "$img" > "$output_file"

done


Inference_with_LLaVa_for_multimodal_generation.ipynb LLaVA Llava_demo_4bit.ipynb Semi_structured_and_multi_modal_RAG_(2).ipynb Semi_structured_and_multi_modal_RAG_ori.ipynb Untitled.ipynb figures figures.zip llama.cpp llava.pdf


In [None]:
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

llava_model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

2024-05-15 02:57:19.288500: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-15 02:57:22.115648: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
import os
import re

from PIL import Image

prompt = "USER: <image>\nWhat's the content of the image? Give more description and detail. ASSISTANT:"
img_dir = "figures/"

# Ensure the directory exists
os.makedirs(img_dir, exist_ok=True)

# Loop through each image in the directory
for img in os.listdir(img_dir):
    if img.endswith(".jpg"):
        # Use a separate name so the global `path` set earlier is not overwritten
        img_path = os.path.join(img_dir, img)
        print(img_path)
        image = Image.open(img_path)

        inputs = processor(text=prompt, images=image, return_tensors="pt")

        # Generate (max_new_tokens=30 caps the output, so descriptions are cut short)
        generate_ids = llava_model.generate(**inputs, max_new_tokens=30)

        generated_text = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        # Keep only the text after the "ASSISTANT:" marker
        match = re.search(r"ASSISTANT:(.*)", generated_text, re.DOTALL)

        if match:
            text = match.group(1).strip()
            print(text)
            # Write the summary to a .txt file next to the image
            base_name = os.path.splitext(img)[0]
            output_file = os.path.join(img_dir, f"{base_name}.txt")
            with open(output_file, "w") as f:
                f.write(text)

figures/figure-17-9.jpg
The image is a screenshot of a conversation between two people, likely discussing a trip to a scenic location. They are talking about the weather,
figures/figure-3-1.jpg
The image features a garage with a truck parked inside. The truck is surrounded by several suitcases, with some placed near the tr
figures/figure-16-7.jpg
The image features a split screen with two different web pages displayed. The first web page is a joke website, while the second one is a math
figures/figure-18-10.jpg
The image is a black and white photograph of a man standing in front of a beautiful sunset. The man is positioned in the center of the
figures/figure-21-15.jpg
The image displays a graph showing the number of orders received over time. The graph is divided into two sections, one for the number of orders and the
figures/figure-17-8.jpg
The image is a screenshot of a recipe for a fruit salad, likely taken from a cookbook or a digital recipe book. The reci
figures/figure-15-6.jpg
T

Note:

To run LLaVA with python bindings, we need a Python API to run the CLIP model.

CLIP support is likely to be added to `llama.cpp` in the future.

After running the above, we fetch and clean the image summaries.

In [None]:
import glob
import os
img_dir = "figures/"
# Get all .txt file summaries
file_paths = glob.glob(os.path.expanduser(os.path.join(img_dir, "*.txt")))

# Read each file and store its content in a list
img_summaries = []
for file_path in file_paths:
    with open(file_path, "r") as file:
        img_summaries.append(file.read())

# # Remove any logging prior to summary
# logging_header = "clip_model_load: total allocated memory: 201.27 MB\n\n"
# cleaned_img_summary = [s.split(logging_header, 1)[1].strip() for s in img_summaries]
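The commented-out cleanup above indexes `[1]` after `split`, which raises an `IndexError` for any summary that does not contain the logging header (for example, summaries written by the Python LLaVA loop rather than the `llama.cpp` binary). A more forgiving variant is sketched below; `strip_log_header` is a hypothetical helper and the sample strings are made up.

```python
def strip_log_header(raw, marker):
    """Return the text after the first occurrence of `marker`, or the whole text if absent."""
    _, found, rest = raw.partition(marker)
    return (rest if found else raw).strip()


# Marker copied from the commented-out line above; the samples are illustrative only.
marker = "clip_model_load: total allocated memory: 201.27 MB\n\n"
samples = [
    "loading model...\n" + marker + " The image shows a bar plot.\n",
    "The image shows a line chart.",  # no logging header at all
]
cleaned = [strip_log_header(s, marker) for s in samples]
print(cleaned)
```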

### Add to vectorstore

Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries.

In [None]:
pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-2.7.0


In [None]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from sentence_transformers import SentenceTransformer

# Load the sentence transformer model for embedding generation
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")


# Custom embedding class exposing the two methods the vector store calls
class CustomEmbeddingFunction:
    def __init__(self, model):
        self.model = model

    def embed_documents(self, texts):
        # Embed a list of texts at indexing time
        return self.model.encode(texts).tolist()

    def embed_query(self, text):
        # Embed a single query string at retrieval time; without this method,
        # retriever.invoke raises AttributeError (no attribute 'embed_query')
        return self.model.encode(text).tolist()


# Initialize the custom embedding function
embedding_function = CustomEmbeddingFunction(embedding_model)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=embedding_function)



# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]

In [None]:
for i, s in enumerate(text_summaries):
  print (i, " = ", s)

0  =  Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks. We present the first attempt to use language-only GPT-4 to
1  =  One of the core aspirations in artificial intelligence is to develop a general-purpose assistant that can effectively follow multi-modal vision-and-language instructions. In this line of work, each task is solved independently by one single large vision
2  =  In computer vision, existing works that build instruction- following agents can be broadly categorized into two classes: (i) End-to-end trained models, which are separately explored for each specific research topic. (ii) A system that
3  =  We use COCO images and generate three types of instruction-following data. One example per type is shown in the bottom block of Table 14. For each type of data, we manually design the annotations we have used as seed
4  =  We design a conversation between the assist

In [None]:

retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

For `option 2` (above):

* Store the image summary in the `docstore`, which we return to the LLM for answer generation.

In [None]:
# Add image summaries
img_ids = [str(uuid.uuid4()) for _ in img_summaries]
summary_img = [
    Document(page_content=s, metadata={id_key: img_ids[i]})
    for i, s in enumerate(img_summaries)
]
retriever.vectorstore.add_documents(summary_img)
retriever.docstore.mset(list(zip(img_ids, img_summaries)))

For `option 3` (above):

* Store the images in the `docstore`.
* Using the image in answer synthesis will require a multimodal LLM with Python API integration.
* GPT4-V is expected soon, and - as mentioned above - CLIP support is likely to be added to `llama.cpp` in the future.

In [None]:
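# Add images (run this cell instead of the option 2 cell above, not in addition
# to it, otherwise the same summaries are indexed twice)
import base64

img_ids = [str(uuid.uuid4()) for _ in img_summaries]
summary_img = [
    Document(page_content=s, metadata={id_key: img_ids[i]})
    for i, s in enumerate(img_summaries)
]
retriever.vectorstore.add_documents(summary_img)


# Fetch images
# Assumption: each summary file "<name>.txt" sits next to its image "<name>.jpg",
# as written by the LLaVA loop above, and raw images are stored base64-encoded.
def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


raw_images = [encode_image(os.path.splitext(p)[0] + ".jpg") for p in file_paths]
retriever.docstore.mset(list(zip(img_ids, raw_images)))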

### Sanity Check retrieval

The most complex table in the paper:

In [None]:
tables[2]

'Method NAT Subject SOC LAN Context Modality IMG TXT NO Grade G1-6 G7-12 Average Representative & SoTA methods with numbers reported in the literature 90.23 Human [34] 74.64 GPT-3.5 [34] 75.44 GPT-3.5 w/ CoT [34] 84.37 LLaMA-Adapter [59] 87.52 MM-CoTBase [61] MM-CoTLarge [61] 95.91 Results with our own experiment runs GPT-4† LLaVA LLaVA+GPT-4† (complement) LLaVA+GPT-4† (judge) 84.97 69.74 70.87 88.30 77.17 82.00 89.60 74.44 74.68 83.72 87.88 95.26 87.48 76.00 78.09 84.36 85.82 90.82 73.45 95.95 95.50 96.74 84.06 90.36 90.36 91.56 87.36 88.00 88.55 91.09 81.87 89.49 89.05 90.62 87.50 67.28 67.43 80.32 82.90 88.80 70.75 88.00 87.80 88.99 88.10 77.42 79.93 86.90 86.83 92.89 90.73 90.66 91.08 93.52 91.59 76.80 78.23 85.83 84.65 92.44 84.69 90.93 92.22 92.73 82.42 68.89 69.68 84.05 85.37 90.31 79.10 90.90 88.73 92.16 88.40 73.97 75.17 85.19 84.91 91.68 82.69 90.92 90.97 92.53'

Here is the summary, which is embedded:

In [None]:
table_summaries[2]

'G1-6 G7-12 Average Representative & SoTA methods with numbers reported in the literature 90.23 Human [34] 74.64 GPT-3.5 w/ CoT [34) 75.44 G'

Here is our retrieval of that table from the natural language query:

In [None]:
# We can retrieve this table
retriever.invoke("What is percentage of Visual features of Best variant and before")[1]

AttributeError: 'CustomEmbeddingFunction' object has no attribute 'embed_query'

Image:

![image.png](attachment:image.png)

We can retrieve this image summary:

In [None]:
retriever.invoke("Images / figures with playful and creative examples")[1]

'F Prompts\n\nThe prompt used to generate image-based conversation from ChatGPT/GPT-4 is shown in Table 13.\n\n21\n\nmessages = [ {"role":"system", "content": f"""You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Answer all questions as you are seeing the image.\n\nDesign a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers.\n\nInclude questions asking about the visual content of the image, including the object types, counting the objects, object actions, object locations, relative positions between objects, etc. Only include questions that have definite answers: (1) one can see the content in the image that the question asks about and can answer confidently; (2) one can determine confidently from the image t


## RAG

Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval).

For `Option 2` (above):

* Simply pass retrieved text chunks to an LLM, as usual.

For `Option 3` (above):

* We would pass retrieved text and images to the multimodal LLM.
* This should be possible soon, once [llama-cpp-python adds multi-modal support](https://github.com/abetlen/llama-cpp-python/issues/813).
* And, of course, this will be enabled by the GPT4-V API.

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Option 2: text-only LLM
model = ChatOpenAI(temperature=0, model="gpt-4")
# Option 3: multimodal LLM
# model = GPT4-V or LLaVA

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
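To see what the model receives, the first two steps of the chain can be imitated in plain Python: retrieved chunks are placed in `{context}` and the user's question in `{question}`. This is a simplified sketch, not LangChain's code; `build_prompt` is a hypothetical helper, the sample documents are made up, and joining the chunks with blank lines is an illustrative choice (how the real chain renders the retriever's list depends on LangChain's formatting).

```python
TEMPLATE = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def build_prompt(retrieved_docs, question):
    """Join retrieved chunks and fill the template with them and the question."""
    context = "\n\n".join(retrieved_docs)
    return TEMPLATE.format(context=context, question=question)


# Made-up retrieved content for illustration.
docs = ["Model | Acc\nA | 90.1", "Model A is evaluated on a held-out set."]
filled = build_prompt(docs, "What is the accuracy of model A?")
print(filled)
```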

In [None]:
chain.invoke(
    "What is the performance of LLaVa across multiple image domains / subjects?"
)

"The LLaVa model performs well across multiple image domains and subjects. In the LLaVa-Bench (COCO) benchmark, which uses 30 images from COCO-Val-2014 and generates three types of questions for each image, LLaVa's performance improves significantly with instruction tuning, detailed description, and complex reasoning questions. It achieves the best performance at 85.1% when all three types of data are used. In the LLaVa-Bench (In-the-Wild) benchmark, which uses a diverse set of 24 images with 60 questions in total, LLaVa outperforms other models, achieving significantly better performance compared to BLIP-2 (+29%) and OpenFlamingo (+48%). It also achieves an impressive 81.7% performance on complex reasoning questions, with an overall score of 67.3%. However, the model does have limitations and can sometimes fail to grasp the complex semantics within an image."

We can check the [trace](https://smith.langchain.com/public/85a7180e-0dd1-44d9-996f-6cb9c6f53205/r) to see retrieval of tables and text.

In [None]:
chain.invoke("Explain images / figures with playful and creative examples.")

'The text provides two examples of explaining images or figures in a playful and creative way. \n\n1. The first example is a meme featuring chicken nuggets. The meme starts with a phrase "Sometimes I just look at pictures of the Earth from space and I marvel at how beautiful it all is..." and then shows an image of chicken nuggets arranged to resemble the continents and islands on a world map. The punchline of the meme is "I mean, it’s not the real Earth, but how beautiful it is all is." This meme humorously suggests that the chicken nuggets represent the Earth, and the various locations depicted in the photo are actually chicken nugget versions of different places.\n\n2. The second example is a mock-up of a joke website. The website has a button that, when clicked, reveals a punchline to a joke. The joke is "Why was the math book sad? Because it had too many problems." This is a playful and creative way to engage website visitors and make them laugh.'