# Agentic Embedding

Retrieval-Augmented Generation (RAG) systems are gaining popularity in established industries such as law, life sciences, and finance, which hold massive amounts of unstructured, multimodal documents. Extracting insights from this documentation used to require manual searches and reading graphs and diagrams by hand, which is time-consuming and laborious even for experts. Well-funded companies such as Harvey and Hebbia exemplify how RAG systems can speed this up by not only finding relevant documents but also providing a GPT-like interface that answers user questions directly.

However, RAG systems often hallucinate, especially when they fail to find relevant answers in the pool of embedded documents. Closing the gap from 80% to 100% performance is extremely challenging but crucial, especially for vertical use cases where mistakes are costly and unforgiving.

While foundation models are often blamed, and guardrails built with hallucination-detection models (e.g., [Lynx](https://www.patronus.ai/blog/lynx-state-of-the-art-open-source-hallucination-detection-model)) are gaining popularity, the importance of embedding strategies and the limitations of multimodal embedding are discussed far less often.

**Agentic Embedding** is an AI engineering term I coined for using different prompts or methods to embed different modalities (e.g., text, tables, graphs, diagrams, and photos). The code is a simple demonstration of the concept, and it also explores the limitations of traditional OCR methods in processing unstructured multimodal documents.
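To make the idea concrete, here is a minimal sketch of routing each modality to its own prompt. The modality names, the prompt wording, and the `select_prompt` helper are all hypothetical and only illustrate the concept; the notebook itself lets the model choose between tools instead of using a lookup table.

```python
# Hypothetical prompts, one per modality; the wording is illustrative only.
MODALITY_PROMPTS = {
    "table": "Summarise the table and list its column headers.",
    "graph": "Describe the axes, the legend and the overall trend.",
    "diagram": "Explain the components of the diagram and how they relate.",
}
DEFAULT_PROMPT = "Describe the content so that it can be retrieved by a text query."


def select_prompt(modality: str) -> str:
    """Return the prompt for a modality, falling back to a generic one."""
    return MODALITY_PROMPTS.get(modality.lower(), DEFAULT_PROMPT)


print(select_prompt("graph"))
```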


First, we use the Mistral 7B research paper from arXiv: [https://arxiv.org/abs/2310.06825](https://arxiv.org/abs/2310.06825). This paper was selected for the demonstration because it is truly multimodal: it includes graphs, bar charts, and diagrams.


#### Example Page:

![Example page from the Mistral 7B paper](./files/sample.png)


In [None]:
%pip install llama-index
%pip install llama-index-core
%pip install llama-index-llms-anthropic llama-index-multi-modal-llms-anthropic
%pip install llama-index-embeddings-huggingface
%pip install llama-parse

In [None]:
%pip install llama-index-embeddings-openai

[LlamaParse](https://github.com/run-llama/llama_parse) is a GenAI-native document parsing tool.

First, we parse the PDF into JSON. The result includes a Markdown version of the text (useful for extracting tabular data) and the images, which are present in the JSON in the form of ImageNode.
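As a quick way to see what the parser returned, the sketch below summarises each page of the result. It assumes the shape used later in this notebook (`json_objs[0]["pages"]`, with an `md` field per page); the per-page `images` key is my assumption and may differ between LlamaParse versions, so the code runs on mock data rather than real output.

```python
def summarise_pages(parsed):
    """Return (page index, markdown length, image count) for each parsed page."""
    rows = []
    for index, page in enumerate(parsed[0]["pages"], start=1):
        md = page.get("md", "")
        images = page.get("images", [])  # assumed key; may differ by version
        rows.append((index, len(md), len(images)))
    return rows


# Mock result with the assumed shape, not real LlamaParse output
mock = [{"pages": [
    {"md": "# Title\nSome text", "images": []},
    {"md": "| a | b |", "images": [{"name": "img_p1_1.png"}]},
]}]
print(summarise_pages(mock))
```

A page with a long `md` field but zero images is worth a manual look if you know it contains a figure.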

In [2]:
import nest_asyncio
from llama_parse import LlamaParse 
import os
import json


nest_asyncio.apply()


# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

In [None]:
parser = LlamaParse(verbose=True)
file_path = "json_objs.json"

try:
    # Try to load the JSON file
    with open(file_path, "r") as json_file:
        json_objs = json.load(json_file)
    print("Loaded json_objs from file.")
except FileNotFoundError:
    # If the file does not exist, run the parser and save the result
    json_objs = parser.get_json_result("./files/Mistral_7B.pdf")
    
    # Save the json_objs to a file for future use
    with open(file_path, "w") as json_file:
        json.dump(json_objs, json_file)
    print("File not found. Parsed the PDF and saved the result.")

json_list = json_objs[0]["pages"]
json_list

#### Limitations of OCR / current document parsing
The cell below shows that LlamaParse failed to extract the image from page 6. In my experience, this is where most document parsers fail: charts embedded as vector graphics are not stored as one cohesive unit within the PDF structure, which makes them hard to parse. The limitation is significant because the formats of images, charts, and tables differ vastly within and across documents. An ideal extraction would use a vision-based image localization approach in which a figure and its legend are extracted together as a single image.

In [None]:
# Save the extracted images into the extracted_images directory
images = parser.get_images(json_objs, download_path="extracted_images")
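To spot pages like page 6 without scrolling through the output, a small helper can list the pages for which no image came back. This is only a sketch: it assumes each entry in `images` carries a `page_number` key (as used later in this notebook), and `pages_without_images` is a hypothetical name. A page in the result is not necessarily a parser failure, since it may simply contain no figure.

```python
def pages_without_images(images, page_count):
    """Return the 1-based page numbers for which no image was extracted."""
    pages_with_images = {image["page_number"] for image in images}
    return [n for n in range(1, page_count + 1) if n not in pages_with_images]


# Mock data: images were found on pages 2 and 4 of a 6-page document
mock_images = [{"page_number": 2}, {"page_number": 4}, {"page_number": 4}]
print(pages_without_images(mock_images, 6))
```

With the real data, the call would look like `pages_without_images(images, len(json_list))`.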

In [31]:
# Perform embedding augmentation on the images using Claude as the agent
from IPython.display import Image
from anthropic import Anthropic
import base64
import pprint
from llama_index.core.schema import TextNode


pp = pprint.PrettyPrinter(indent=4)


client = Anthropic(api_key="sk-...")
MODEL_NAME = "claude-3-opus-20240229"

#### Defining embedding tools
Here, I define three tool schemas: a generic one, one for diagrams, and one for graphs. Only the diagram and graph tools are passed to the model. This is an extremely simple agent: the model classifies the image by choosing a tool and extracts the fields that tool asks for. A more complex method could be implemented inside `process_tool_call`, such as adding metadata for pre-filtering based on the user query, depending on the specific use case.

In [32]:
# Define tools
image_label_tool_generic = {
    "name": "print_research_image_info",
    "description": "Extracts useful image information from a research paper.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Title of the figure."},
            "legend": {"type": "string", "description": "Legend of the figure."},
            "description": {"type": "string", "description": "Description of the image."},
            "keywords": {"type": "string", "description": "Several specific keywords that describe the image."},
            "x-axis": {"type": "string", "description": "X axis of the graph"},
            "y-axis": {"type": "string", "description": "Y axis of the graph"}
        },
        "required": ["title", "legend", "description", "keywords"]
    }
}

# Define specific tools
image_label_tool_diagram = {
    "name": "get_research_diagram_info",
    "description": "Interprets diagram information from a research paper.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Title of the figure."},
            "legend": {"type": "string", "description": "Legend of the figure."},
            "description": {"type": "string", "description": "Description of the diagram."},
            "keywords": {"type": "string", "description": "Several specific keywords that describe the diagram."}
        },
        "required": ["title", "legend", "description", "keywords"]
    }
}

image_label_tool_graph = {
    "name": "get_research_graph_info",
    "description": "Interprets graph information from a research paper.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Title of the figure."},
            "legend": {"type": "string", "description": "Legend of the figure."},
            "description": {"type": "string", "description": "Description of the graph."},
            "keywords": {"type": "string", "description": "Several specific keywords that describe the graph."},
            "trend": {"type": "string", "description": "Overall trend of the graph."},
            "x-axis": {"type": "string", "description": "X axis of the graph"},
            "y-axis": {"type": "string", "description": "Y axis of the graph"}
        },
        "required": ["title", "legend", "description", "keywords", "trend", "x-axis", "y-axis"]
    }
}

tools = [image_label_tool_diagram, image_label_tool_graph]


def process_tool_call(tool_name, tool_input, image_path):
    """Wrap the fields extracted by a tool in a TextNode tagged with the image type."""
    if tool_name == "get_research_diagram_info":
        # Relevant embedding function here with metadata
        return TextNode(text=str(tool_input), metadata={"path": image_path, "type": "diagram"})

    elif tool_name == "get_research_graph_info":
        return TextNode(text=str(tool_input), metadata={"path": image_path, "type": "graph"})

    # Unknown tool: return None so the caller can filter it out
    return None


#...
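As an illustration of the "metadata for pre-filtering" idea mentioned above, the sketch below builds a richer metadata dict from a tool call. It is not part of the notebook's pipeline: `build_image_metadata` is a hypothetical helper, and it assumes the model returns `keywords` as a single comma-separated string, which the schema does not guarantee.

```python
def build_image_metadata(tool_name, tool_input, image_path):
    """Build a flat metadata dict that a retriever could filter on."""
    image_type = {
        "get_research_diagram_info": "diagram",
        "get_research_graph_info": "graph",
    }.get(tool_name, "unknown")
    # Assumption: keywords arrive as one comma-separated string
    raw_keywords = tool_input.get("keywords", "")
    keywords = [k.strip().lower() for k in raw_keywords.split(",") if k.strip()]
    return {
        "path": image_path,
        "type": image_type,
        "title": tool_input.get("title", ""),
        "keywords": keywords,
    }


# Mock tool input, not real model output
example = build_image_metadata(
    "get_research_graph_info",
    {"title": "MMLU accuracy", "keywords": "MMLU, accuracy, Mistral 7B"},
    "extracted_images/example.png",
)
print(example)
```

The resulting dict could be passed as the `metadata` argument of `TextNode` in `process_tool_call`, so that a query about graphs only searches nodes whose `type` is `graph`.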

In [33]:
# Encode the image as base64 to pass it to the LLM
def get_base64_encoded_image(image_path):
    with open(image_path, "rb") as image_file:
        binary_data = image_file.read()
        base_64_encoded_data = base64.b64encode(binary_data)
        base64_string = base_64_encoded_data.decode('utf-8')
        return base64_string
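The request further down declares every image as `image/png`. If the parser ever returns JPEGs or other formats, the declared media type should match the file. Below is a minimal sketch that infers it from the file extension with the standard library; `encode_image_with_media_type` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import base64
import mimetypes


def encode_image_with_media_type(image_path):
    """Return (media_type, base64 string) for an image file."""
    # guess_type relies on the file extension, not on the file contents
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image media type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return media_type, encoded
```

The returned pair maps directly onto the `media_type` and `data` fields of the image block in the message.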
    

#### Context Augmentation

Providing the surrounding page context along with the image before embedding greatly improves the VLM's ability to interpret charts.

In [None]:
# Describe every image extracted by LlamaParse and collect the results in image_text_nodes
# Each image is sent together with the markdown of its page as context
def agentic_embedding(context, image_path):

    query = f'Print the description of the image provided from a research paper. Provided is the context of the image in markdown format: {context}.'
    message_list = [
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(image_path)}},
                {"type": "text", "text": query}
            ]
        }
    ]

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=4096,
        messages=message_list,
        tools=tools
    )

    # Find the tool call in the response; the model may emit text blocks before it
    for block in response.content:
        if block.type == "tool_use":
            print(f"\nTool Used: {block.name}")
            print("Tool Input:")
            pp.pprint(block.input)
            return process_tool_call(block.name, block.input, image_path)

    # The model did not call a tool; the caller filters out None results
    return None

image_text_nodes = []
for image in images:
    image_path = image['path']
    page_number = image['page_number']
    context = json_list[page_number - 1]['md']
    
    image_text_node = agentic_embedding(context, image_path)
    image_text_nodes.append(image_text_node)
    print("image text node:", image_text_node)
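The model does not always fill in every field a tool marks as required, so it can help to check the tool input before turning it into a node. The sketch below compares a tool call's input against the `required` list of the tool definition. `missing_required_fields` is a hypothetical helper, and the tool definition shown is a trimmed-down mock rather than the full schema above.

```python
def missing_required_fields(tool_schema, tool_input):
    """Return the required fields that are absent or empty in a tool call's input."""
    required = tool_schema["input_schema"].get("required", [])
    return [name for name in required if not tool_input.get(name)]


# Trimmed-down mock of a tool definition and an incomplete model output
mock_tool = {
    "name": "get_research_graph_info",
    "input_schema": {"required": ["title", "trend", "x-axis"]},
}
mock_input = {"title": "MMLU accuracy", "trend": ""}
print(missing_required_fields(mock_tool, mock_input))
```

Images whose description comes back incomplete could then be retried or flagged instead of being embedded as they are.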

In [38]:
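# Aliased so that it does not shadow the `Anthropic` client class imported earlier
from llama_index.llms.anthropic import Anthropic as LlamaIndexAnthropic

llm = LlamaIndexAnthropic(model=MODEL_NAME, temperature=0.0, api_key="sk...")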

In [39]:
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.llm = llm
# Settings.embed_model = "local:BAAI/bge-small-en-v1.5"

Settings.embed_model = OpenAIEmbedding(api_key="sk...")


In [47]:
filtered_image_text_node = [x for x in image_text_nodes if x is not None]
filtered_image_text_node


[TextNode(id_='0eab90d1-8ce2-4ad4-8a4a-a74057b400ca', embedding=None, metadata={'path': 'extracted_images/12810641-4017-4eb7-9ac3-4129c2421dbc-img_p1_1.png', 'type': 'diagram'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="{'title': 'Sliding Window Attention', 'legend': 'The matrices visualize attention patterns across transformer layers. Rows represent layers and columns represent token positions. Yellow indicates where attention is applied within the sliding window for each layer.', 'description': 'The diagram compares vanilla attention to sliding window attention across transformer layers. With vanilla attention, each token attends to all previous tokens. Sliding window attention restricts the attention to a fixed window that shifts with each layer. This allows tokens in higher layers to indirectly attend to information beyond the initial window, increasing the effective context length captured by the model as you move up the layer stack.',

In [48]:
# Embed these
from llama_index.core import VectorStoreIndex

# Here, we only embed the image nodes for the purpose of this project.
# For more control over embedding, you can use an ingestion pipeline from LlamaIndex:
# https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/
# index = VectorStoreIndex(text_nodes + image_text_nodes)
index = VectorStoreIndex(filtered_image_text_node)
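The commented-out line above refers to `text_nodes`, which this notebook never builds. One possible way to prepare them is sketched below with plain dicts so that it runs without any dependencies. `page_text_records` is a hypothetical helper, and it assumes each parsed page exposes its markdown under `md`, as `json_list` does here.

```python
def page_text_records(pages):
    """Turn parsed pages into text records, skipping pages with no markdown."""
    records = []
    for page_number, page in enumerate(pages, start=1):
        md = page.get("md", "").strip()
        if md:
            records.append({
                "text": md,
                "metadata": {"page_number": page_number, "type": "text"},
            })
    return records


# Mock pages, not real parser output; the second page is empty and is skipped
mock_pages = [{"md": "# Introduction"}, {"md": "   "}, {"md": "| model | score |"}]
print(page_text_records(mock_pages))
```

Each record could then be wrapped as `TextNode(text=record["text"], metadata=record["metadata"])` and indexed together with the image nodes.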


In [49]:
query_engine = index.as_query_engine()


In [50]:
response = query_engine.query(
    "How would mistral 7B perform against other Llama models in the MMLU benchmark?"
)
print(str(response))

Based on the performance comparison graphs, the Mistral 7B model would likely outperform the LLaMA 2 models of various sizes on the MMLU benchmark. The first graph shows Mistral 7B achieving a higher accuracy score than the LLaMA 2 models, including the larger 13B parameter version, on the MMLU task.

The second set of graphs also indicates that Mistral 7B has superior performance to LLaMA 2 13B on the MMLU benchmark despite having fewer parameters. This suggests that the Mistral 7B architecture and training allow it to be more efficient and effective at the MMLU task compared to the LLaMA 2 models.

So in summary, the Mistral 7B model would be expected to outperform LLaMA 2 models, even those with more parameters like the 13B version, when evaluated on the MMLU benchmark based on the performance data shown.


### What's Next

Two major limitations hinder the current OCR/document parser + VLM embedding approach:

1. **Speed of Embedding**: Every image needs its own VLM call before it can be embedded, which makes the process time-consuming and hurts overall efficiency.
2. **Dirty Output Format**: Current data extraction tools often produce unclean outputs, particularly for automatic schema inference tasks with multimodal documentation.
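For the speed limitation, one possible mitigation (not benchmarked here) is to issue the per-image calls concurrently rather than one after another. The sketch below uses a thread pool from the standard library. `describe_images_concurrently` is a hypothetical helper, `describe` stands in for a function like `agentic_embedding(context, image_path)`, and `max_workers` would need to respect your API rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def describe_images_concurrently(jobs, describe, max_workers=4):
    """Run describe(context, image_path) for each job in parallel threads.

    Results keep the order of `jobs`; a failed call yields None.
    """
    def safe_call(job):
        context, image_path = job
        try:
            return describe(context, image_path)
        except Exception as error:  # keep going if one image fails
            print(f"Failed on {image_path}: {error}")
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_call, jobs))


# Stand-in for the real VLM call, used only to show the call pattern
def fake_describe(context, image_path):
    return f"{context}: {image_path}"


print(describe_images_concurrently([("page 1", "a.png"), ("page 2", "b.png")], fake_describe))
```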

The ideal solution would extract each image as a whole, including its legend, and would also extract vector graphics from PDFs, which current parsing tools fail to do. The formats of these multimodal documents vary surprisingly widely, which makes it nearly impossible to enforce consistent rules that work across parsers.

This makes the new [ColPali](https://huggingface.co/blog/manu/colpali) model an interesting option to explore. ColPali takes a fully vision-based approach to embedding unstructured data, using only the image representation of document pages.
