In [34]:
# bring in our LLAMA_CLOUD_API_KEY
from dotenv import load_dotenv
load_dotenv()
import os


In [35]:
# llama-parse is async-first; running its async code inside a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()
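As a minimal stdlib-only sketch of why the patch is needed: a notebook kernel already runs an event loop, and plain `asyncio` refuses to start a second one inside it. The names `parse_job` and `notebook_cell` are hypothetical stand-ins and are not part of llama-parse.

```python
import asyncio


async def parse_job():
    """Hypothetical stand-in for an async parsing call."""
    await asyncio.sleep(0)
    return "parsed"


async def notebook_cell():
    """Simulates code running inside an already-active event loop, as in Jupyter."""
    coro = parse_job()
    try:
        # A nested asyncio.run() is rejected while a loop is already running.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

`nest_asyncio.apply()` patches the loop so that this kind of nested call is allowed, which is what lets the synchronous `load_data` call work in a notebook cell.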

In [63]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-4o-mini")

Settings.llm = llm
Settings.embed_model = embed_model

# Using the new ```LlamaParse``` PDF reader for PDF parsing
We also compare two retrieval/query engine strategies:

1. Using the raw Markdown text as nodes to build the index, and applying a simple query engine to generate the results;
2. Using ```MarkdownElementNodeParser``` to parse the ```LlamaParse``` Markdown output, and building a recursive retriever query engine for generation.

In [43]:
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data("./data/paper.pdf")

Started parsing the file under job_id 7ac8d9cc-6305-416f-96d4-6dcb31c9ca54


In [44]:
len(documents)

41

# Get page nodes

In [45]:
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex


def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes on the separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes
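The splitting step can be tried on its own without any parsed documents. The sketch below uses a made-up two-page Markdown string (`sample_markdown` and `split_pages` are hypothetical names) with the same default separator as `get_page_nodes`.

```python
# Hypothetical two-page Markdown string; the separator matches get_page_nodes' default.
sample_markdown = (
    "# Page 1\n\nIntro text.\n---\n# Page 2\n\n|a|b|\n|---|---|\n|1|2|"
)


def split_pages(text, separator="\n---\n"):
    """Split Markdown text into per-page chunks on the separator."""
    return text.split(separator)


pages = split_pages(sample_markdown)
for number, page in enumerate(pages, start=1):
    # Show the first line of each page chunk.
    print(f"page {number}: {page.splitlines()[0]}")
```

The table divider row `|---|---|` is not treated as a page break because the separator must be a bare `---` on its own line. A Markdown horizontal rule inside a page would be split, though, so the page count from this approach is only as reliable as the separator convention.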

In [46]:
page_nodes = get_page_nodes(documents)


In [47]:
page_nodes

[TextNode(id_='e71f7ef6-1e4d-4727-a2e1-884f0bd11541', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training\n\nBrandon McKinzie◦, Zhe Gan◦, Jean-Philippe Fauconnier⋆, Sam Dodge⋆, Bowen Zhang⋆, Philipp Dufter⋆, Dhruti Shah⋆, Xianzhi Du⋆, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch⋆, Alexander Toshev†, and Yinfei Yang†\n\nApple\n\n◦First authors; ⋆Core authors; †Senior authors\n\nbmckinzie@apple.com, zhe.gan@apple.com\n\n# Abstract\n\nIn this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and com

In [62]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4o-mini"), num_workers=8
)

In [49]:
nodes = node_parser.get_nodes_from_documents(documents)



In [50]:
len(nodes)

111

In [51]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [52]:
len(base_nodes)

69

In [53]:
len(objects)

21

In [54]:
objects[0].get_content()

'The table compares the estimated total price of beers on a table according to different models. MM1-30B-Chat estimates the price as 12, Emu-Chat-37B states the price as 15.99, and LLaVA-NeXT-34B provides a detailed calculation estimating the price as 44 based on assumed quantities and types of beer.,\nwith the following columns:\n- MM1-30B-Chat (Ours): None\n- Emu-Chat-37B: None\n- LLaVA-NeXT-34B: None\n'

In [55]:
base_nodes[0].get_content()

'MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training\n\nBrandon McKinzie◦, Zhe Gan◦, Jean-Philippe Fauconnier⋆, Sam Dodge⋆, Bowen Zhang⋆, Philipp Dufter⋆, Dhruti Shah⋆, Xianzhi Du⋆, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch⋆, Alexander Toshev†, and Yinfei Yang†\n\nApple\n\n◦First authors; ⋆Core authors; †Senior authors\n\nbmckinzie@apple.com, zhe.gan@apple.com\n\n Abstract\n\nIn this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, w

In [56]:
# dump both indexed tables and page text into the vector index
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)

In [58]:
print(page_nodes[7].get_content())

# 3.3 Pre-training Data Ablation

Large-scale and task-appropriate data is of paramount importance in training performant models. Typically, models are trained in two stages, pre-training and instruction tuning. In the former stage web-scale data is used while in the latter stage task-specific curated data is utilized. In the following, we focus on the pre-training stage and elaborate our data choices (see Figure 3, right).

|Data Type|Sources|Size|
|---|---|---|
|Captioned Images|CC3M [100], CC12M [13], HQIPT-204M [94], COYO [11], Web Image-Text-1B (Internal)|2B image-text pairs|
|Captioned Images (Synthetic)|VeCap [57]|300M image-text pairs|
|Interleaved Image-Text|OBELICS [58], Web Interleaved (Internal)|600M documents|
|Text-only|Webpages, Code, Social media, Books, Encyclopedic, Math|2T tokens|

Table 2: List of datasets for pre-training multimodal large language models.

Two types of data are commonly used to train MLLMs: captioning data consisting of images with paired text desc

In [59]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker



reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
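The query engine above retrieves candidates by embedding similarity and then passes them through a reranker. The following is a toy sketch of that two-stage pattern only: `retrieve_then_rerank`, `word_overlap`, and `phrase_match` are hypothetical helpers with deliberately simple scoring, not the embedding model or the BGE reranker used in the notebook.

```python
def retrieve_then_rerank(query, chunks, first_stage_score, rerank_score,
                         similarity_top_k=5, top_n=5):
    """Keep the similarity_top_k best chunks by the cheap score, then
    re-order those candidates with the second score and keep top_n."""
    candidates = sorted(
        chunks, key=lambda c: first_stage_score(query, c), reverse=True
    )[:similarity_top_k]
    reranked = sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )
    return reranked[:top_n]


def word_overlap(query, chunk):
    """Toy first-stage score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def phrase_match(query, chunk):
    """Toy second-stage score: 1 if the whole query appears verbatim, else 0."""
    return int(query.lower() in chunk.lower())


chunks = [
    "size of image data and text-only baselines",
    "Table 2: text-only data size: 2T tokens",
    "image encoder ablation",
]
top = retrieve_then_rerank(
    "text-only data size", chunks, word_overlap, phrase_match,
    similarity_top_k=2, top_n=1,
)
print(top)
```

In this toy run the first stage ranks the loosely related chunk highest, and the second stage promotes the chunk that contains the exact phrase. With `similarity_top_k=5` and `top_n=5`, as configured in the cell above, the reranker should only be able to reorder the five retrieved nodes rather than drop any of them.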

# Set up the baseline
For comparison, we set up a naive RAG pipeline with default parsing and standard chunking, indexing, and retrieval.

In [60]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=["./data/paper.pdf"])
base_docs = reader.load_data()
raw_index = VectorStoreIndex.from_documents(base_docs)
raw_query_engine = raw_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker]
)
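To illustrate what "standard chunking" means in general, here is a simplified sketch of fixed-size chunking with overlap. `chunk_words` is a hypothetical helper that counts words; it is an assumption for illustration and not the splitter LlamaIndex applies in `from_documents`.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample_text = " ".join(f"w{i}" for i in range(12))
sample_chunks = chunk_words(sample_text, chunk_size=5, overlap=1)
for chunk in sample_chunks:
    print(chunk)
```

A splitter of this kind is unaware of document structure, so a table row can end up divided between two chunks. That is the weakness the table-aware pipeline above is meant to address.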

## Using the new ```LlamaParse``` as the PDF parsing method and retrieving tables with two different methods
We compare the base query engine with the recursive query engine on questions about tables.

### Table Query Task: Queries for Table Question Answering

In [29]:
query = "Tell me over all summary of the document"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The document discusses the methods, analysis, and insights related to the pre-training of a multimodal large language model (LLM). It includes detailed sections on dataset construction, training processes, evaluation metrics, and qualitative examples of the model's capabilities. The dataset comprises interleaved image-text documents and text-only data, with specific filtering and de-duplication processes outlined. Training details cover both pre-training and supervised fine-tuning, while evaluation details include various benchmarks and qualitative assessments of the model's performance in tasks such as counting objects in images and extracting scene text. The document concludes with acknowledgments of contributions from various individuals involved in the research and development process.
[1;3;38;2;11;159;203mRetrieval entering d71a02c9-3a9b-4abc-98d7-d539c6cc582e: TextNode
[0m[1;3;38;2;237;90;200mRetrieving from object TextNode with query 

In [61]:
query = "Tell me about 4-shot result numbers  across all models of MM1 ablation across different image encoders"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The 4-shot results for MM1 ablation across different image encoders show varying performance metrics. For the AIM models, the results are 56.6 for AIM 600M, 59.5 for AIM 1B, and 60.9 for AIM 3B. For the CLIP models, the results are 58.7 for CLIP DFN+VeCap ViT-L, 57.0 for CLIP DFN ViT-H, 60.0 for CLIP DFN+VeCap ViT-H, 62.4 for CLIP DFN+VeCap ViT-L with 336 resolution, 62.6 for CLIP DFN+VeCap ViT-H, 62.2 for CLIP OpenAI ViT-L, and 62.5 for CLIP DFN ViT-H.

***********New LlamaParse+ Recursive Retriever Query Engine***********
The 4-shot results for the MM1 ablation across different image encoders are as follows:

- AIM600M: 56.6
- AIM1B: 59.5
- AIM3B: 60.9
- CLIPDFN+VeCap (ViT-L): 58.7
- CLIPDFN (ViT-H, 224): 57.0
- CLIPDFN+VeCap (ViT-H): 60.0
- CLIPDFN+VeCap (ViT-L): 62.6
- CLIPDFN+VeCap (ViT-H, 336): 62.4
- CLIPOpenAI (ViT-L): 62.2
- CLIPDFN (ViT-H, 378): 62.5

These results indicate the performance of different models when evaluated with four 

In [64]:
query = "what is the token size for test only data type  in pre training data ablation?"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The token size for text-only data in the pre-training data ablation is 2 trillion tokens.

***********New LlamaParse+ Recursive Retriever Query Engine***********
The token size for the text-only data type in the pre-training data ablation is 2 trillion tokens.
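The query cells in this section repeat the same run-and-print pattern. One possible way to factor it out is sketched below; `compare_engines` and `EchoEngine` are hypothetical names, and the stand-in engine exists only so the sketch runs without an index.

```python
def compare_engines(query, engines):
    """Run the same query against each named engine and print the answers."""
    answers = {}
    for name, engine in engines.items():
        answers[name] = str(engine.query(query))
        print(f"\n***********{name}***********")
        print(answers[name])
    return answers


class EchoEngine:
    """Hypothetical stand-in for a query engine; it only needs a .query() method."""

    def __init__(self, prefix):
        self.prefix = prefix

    def query(self, query):
        return f"{self.prefix}: {query}"


answers = compare_engines(
    "demo question",
    {
        "Basic Query Engine": EchoEngine("raw"),
        "Recursive Retriever Query Engine": EchoEngine("recursive"),
    },
)
```

In the notebook, the dictionary would hold `raw_query_engine` and `recursive_query_engine` instead of the stand-ins.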


In [33]:
response_1

Response(response='The 4-shot result numbers across all models in the MM1 ablation with different image encoders are as follows:\n\n- Recon.AIM 600M ViT/600M: 56.6\n- AIM 1B ViT/1B: 59.5\n- AIM 3B ViT/3B: 60.9\n- ContrastiveCLIP DFN+VeCap ViT-L: 58.7\n- CLIP DFN ViT-H DFN-5B: 57.0\n- CLIP DFN+VeCap ViT-H DFN-5B +VeCap: 60.0\n- CLIP DFN+VeCap ViT-L DFN-5B +VeCap: 62.4\n- CLIP DFN+VeCap ViT-H: 62.6\n- CLIP OpenAI ViT-L ImageText-400M: 62.2\n- CLIP DFN ViT-H DFN-5B: 62.5\n\nThese results indicate the performance of various models when utilizing 4-shot learning across different image encoders.', source_nodes=[NodeWithScore(node=TextNode(id_='69b5761b-7b3e-4158-9539-2d1ff0444773', embedding=None, metadata={'page_label': '6', 'file_name': 'paper.pdf', 'file_path': 'data\\paper.pdf', 'file_type': 'application/pdf', 'file_size': 18782815, 'creation_date': '2024-10-30', 'last_modified_date': '2024-10-29'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'la
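The raw `Response` repr is hard to read. A small helper can pull out the score and a few metadata fields for each source node. The sketch below relies only on the `source_nodes`, `score`, `node`, and `metadata` attributes visible in the output above; `summarize_sources` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so it runs without a real response.

```python
from types import SimpleNamespace


def summarize_sources(response):
    """Return (score, page_label, file_name) for each source node of a response."""
    rows = []
    for item in response.source_nodes:
        meta = item.node.metadata
        # Missing keys come back as None, e.g. for nodes with empty metadata.
        rows.append((item.score, meta.get("page_label"), meta.get("file_name")))
    return rows


# Stand-in response mimicking the attribute layout shown in the output above.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.9,
            node=SimpleNamespace(
                metadata={"page_label": "6", "file_name": "paper.pdf"}
            ),
        ),
        SimpleNamespace(score=0.5, node=SimpleNamespace(metadata={})),
    ]
)
rows = summarize_sources(fake_response)
for row in rows:
    print(row)
```

Nodes built by `get_page_nodes` from the LlamaParse output carry `metadata={}` in this notebook, so their page label and file name would come back as `None`.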

In [29]:
query = "Tell me about japanese documenty pipeline"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The Japanese document digitization pipeline utilizes LayoutParser to generate high-quality structured data from historical Japanese firm financial tables, which often feature complicated layouts. This pipeline employs two layout detection models to identify different levels of document structures and utilizes two customized OCR engines to enhance character recognition accuracy.

The documents typically contain vertically arranged text columns, a common format in Japanese writing. Due to issues like scanning noise and variations in printing technology, these columns can be skewed or have inconsistent widths, making them challenging to identify with traditional rule-based methods. The pipeline addresses these challenges by implementing a document reorganization algorithm that rearranges detected tokens based on their bounding boxes, improving character recognition recall.

Additionally, the pipeline is designed to handle unique fonts and glyphs u

In [30]:
response_1

Response(response='The Japanese document digitization pipeline utilizes LayoutParser to generate high-quality structured data from historical Japanese firm financial tables, which often feature complicated layouts. This pipeline employs two layout detection models to identify different levels of document structures and utilizes two customized OCR engines to enhance character recognition accuracy.\n\nThe documents typically contain vertically arranged text columns, a common format in Japanese writing. Due to issues like scanning noise and variations in printing technology, these columns can be skewed or have inconsistent widths, making them challenging to identify with traditional rule-based methods. The pipeline addresses these challenges by implementing a document reorganization algorithm that rearranges detected tokens based on their bounding boxes, improving character recognition recall.\n\nAdditionally, the pipeline is designed to handle unique fonts and glyphs used in historical d

In [31]:
response_2

Response(response='A comprehensive pipeline was developed to digitize historical Japanese firm financial tables, which often feature complicated layouts. This pipeline utilizes two layout models to identify various levels of document structures and incorporates two customized OCR engines to enhance character recognition accuracy. The documents typically contain vertically written text, a common format in Japanese, which can present challenges due to scanning noise and the variability in column widths. The pipeline effectively addresses these complexities to generate high-quality structured data from the historical documents.', source_nodes=[NodeWithScore(node=TextNode(id_='12361d36-1cb9-4f44-a5a0-89ccff52aff0', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# LayoutParser: A Unified Toolkit for DL-Based DIA\n\nfocuses on precision, efficiency, and robustness. The target documents may have complicated structures, and 