In [1]:
# bring in our LLAMA_CLOUD_API_KEY
from dotenv import load_dotenv
load_dotenv()
import os




In [3]:
# llama-parse is async-first; running its async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()
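
Why the patch is needed: a notebook kernel already runs an event loop, and plain asyncio refuses to start a second one from inside it. The stdlib-only sketch below reproduces that restriction. It is not LlamaParse's code, and `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are now inside a running event loop, as code in a notebook cell is.
    coro = inner()
    try:
        # A synchronous wrapper around async code would attempt this.
        asyncio.run(coro)
    except RuntimeError as exc:
        return str(exc)
    finally:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed.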

In [4]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-4o-mini")

Settings.llm = llm
Settings.embed_model = embed_model

# Using the new ```LlamaParse``` PDF reader for PDF parsing
We also compare two retrieval/query engine strategies:

1. Use the raw Markdown text as nodes to build the index, and apply a simple query engine to generate the results.
2. Use ```MarkdownElementNodeParser``` to parse the ```LlamaParse``` Markdown output, and build a recursive retriever query engine for generation.

In [5]:
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data("../data/layout-parser-paper.pdf")

Started parsing the file under job_id 633f83c0-cdaa-4c14-933b-2d8c74b0a276


In [6]:
len(documents)

16

# Get page nodes

In [7]:
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex


def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes using the separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes

In [8]:
page_nodes = get_page_nodes(documents)
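
To make the page split concrete, here is a small stand-alone sketch of the same idea on a plain string. The sample Markdown and the `split_pages` helper are made up for illustration. It assumes pages are joined by the `\n---\n` separator that `get_page_nodes` uses as its default.

```python
def split_pages(markdown_text, separator="\n---\n"):
    """Return the page-level chunks of a Markdown string."""
    return markdown_text.split(separator)


# Hypothetical three-page document joined by the page separator.
sample = "# Page one\nIntro text\n---\n# Page two\n|a|b|\n---\n# Page three"
pages = split_pages(sample)
for number, page in enumerate(pages, start=1):
    print(number, repr(page))
```

Because the separator includes the surrounding newlines, a table delimiter row such as `|---|---|` is not mistaken for a page break.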


In [9]:
page_nodes

[TextNode(id_='820a84c4-8bf5-4e71-8a2a-637f12d3b6d3', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n\nZejiang Shen1, Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5\n\n1 Allen Institute for AI, shannons@allenai.org\n\n2 Brown University, ruochen_zhang@brown.edu\n\n3 Harvard University, {melissadell,jacob_carlson}@fas.harvard.edu\n\n4 University of Washington, bcgl@cs.washington.edu\n\n5 University of Waterloo, w422li@uwaterloo.ca\n\n# Abstract\n\nRecent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im

In [11]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4o-mini"), num_workers=8
)

In [12]:
nodes = node_parser.get_nodes_from_documents(documents)



In [13]:
nodes

[TextNode(id_='fd2ef33b-b3d2-444c-985c-3e467aa7a251', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='84563c44-3d0f-4029-b29b-4fcd023948c9', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='77070be0f63285857de531e720b90547ee8442dd7e864ef89c121805ae12d047')}, text='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n\nZejiang Shen1, Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5\n\n1 Allen Institute for AI, shannons@allenai.org\n\n2 Brown University, ruochen_zhang@brown.edu\n\n3 Harvard University, {melissadell,jacob_carlson}@fas.harvard.edu\n\n4 University of Washington, bcgl@cs.washington.edu\n\n5 University of Waterloo, w422li@uwaterloo.ca\n\n Abstract\n\nRecent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research 

In [14]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)


In [15]:
objects[0].get_content()


'This table summarizes various datasets used for layout analysis in different types of documents, indicating the model compatibility and specific notes for each dataset.,\nwith the following table title:\nDataset,\nwith the following columns:\n- Dataset: Names of the datasets used for layout analysis.\n- Base Model1: Compatibility of the datasets with Base Model1.\n- Large Model: Compatibility of the datasets with Large Model.\n- Notes: Additional notes regarding the datasets.\n'
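
The string above is a generated summary of a table rather than the table itself. Below is a toy sketch of how a recursive retriever can use such summaries, assuming each summary object keeps a reference to the full table node. The classes, the keyword scorer, and the sample data are all hypothetical stand-ins, not LlamaIndex's implementation.

```python
from dataclasses import dataclass


@dataclass
class TableNode:
    text: str  # full Markdown table


@dataclass
class SummaryObject:
    summary: str       # short description that gets matched against the query
    target: TableNode  # node handed back when the summary is the best match


def keyword_overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_retrieve(query, objects):
    best = max(objects, key=lambda obj: keyword_overlap(query, obj.summary))
    return best.target  # follow the reference instead of returning the summary


objects = [
    SummaryObject(
        "datasets used for layout analysis",
        TableNode("|Dataset|Notes|\n|---|---|"),
    ),
    SummaryObject(
        "operations supported by layout elements",
        TableNode("|Operation Name|Description|\n|---|---|"),
    ),
]
hit = recursive_retrieve("which operations do layout elements support", objects)
print(hit.text)
```

The design point is that the short summary is what gets matched, while the full table is what reaches the LLM.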

In [18]:
base_nodes[0].get_content()

'LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n\nZejiang Shen1, Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5\n\n1 Allen Institute for AI, shannons@allenai.org\n\n2 Brown University, ruochen_zhang@brown.edu\n\n3 Harvard University, {melissadell,jacob_carlson}@fas.harvard.edu\n\n4 University of Washington, bcgl@cs.washington.edu\n\n5 University of Waterloo, w422li@uwaterloo.ca\n\n Abstract\n\nRecent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been ongoing efforts to improve reusability and simplify deep learning (DL) model development in disciplines like n

In [19]:
# dump both indexed tables and page text into the vector index
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)

In [23]:
print(page_nodes[7].get_content())


# Table 2: All operations supported by the layout elements. The same APIs are supported across different layout element classes including Coordinate types, TextBlock and Layout.

|Operation Name|Description|
|---|---|
|block.pad(top, bottom, right, left)|Enlarge the current block according to the input|
|block.scale(fx, fy)|Scale the current block given the ratio in x and y direction|
|block.shift(dx, dy)|Move the current block with the shift distances in x and y direction|
|block1.is in(block2)|Whether block1 is inside of block2|
|block1.intersect(block2)|Return the intersection region of block1 and block2.|
|block1.union(block2)|Return the union region of block1 and block2.|
|block1.relative to(block2)|Convert the absolute coordinates of block1 to relative coordinates to block2|
|block1.condition on(block2)|Calculate the absolute coordinates of block1 given the canvas block2’s absolute coordinates|
|block.crop image(image)|Obtain the image segments in the block region|

# 3.4 Storage

In [25]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
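
The query engine above retrieves `similarity_top_k` candidates and then passes them through the reranker, which keeps `top_n`. A minimal sketch of that two-stage pattern follows. Both scoring functions are toy stand-ins chosen for illustration (the real pipeline uses embedding similarity and the `BAAI/bge-reranker-large` model), and `retrieve_then_rerank` is a hypothetical helper.

```python
def cheap_score(query, text):
    """First-stage stand-in for embedding similarity: shared-word count."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def careful_score(query, text):
    """Second-stage stand-in for a reranker: reward the exact phrase."""
    phrase_bonus = 2 if query.lower() in text.lower() else 0
    return phrase_bonus + cheap_score(query, text) / 10


def retrieve_then_rerank(query, texts, similarity_top_k, top_n):
    # Stage 1: keep the best similarity_top_k candidates by the cheap score.
    candidates = sorted(
        texts, key=lambda t: cheap_score(query, t), reverse=True
    )[:similarity_top_k]
    # Stage 2: reorder the candidates by the careful score and keep top_n.
    return sorted(
        candidates, key=lambda t: careful_score(query, t), reverse=True
    )[:top_n]


texts = [
    "shift the block",
    "use block shift to move",
    "storage formats",
    "scale a block",
]
result = retrieve_then_rerank("block shift", texts, similarity_top_k=3, top_n=2)
print(result)
```

With `similarity_top_k=5` and `top_n=5`, as configured above, the reranker can only reorder the five retrieved nodes; it cannot drop any.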



# Setup Baseline
For comparison, we set up a naive RAG pipeline with default parsing and standard chunking, indexing, and retrieval.
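
As a rough illustration of what "standard chunking" means here, the sketch below splits text into fixed-size word windows with overlap. This is an assumption for illustration only: `chunk_words` is a hypothetical helper, and LlamaIndex's default splitter works differently in its details.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into windows of chunk_size words sharing overlap words."""
    words = text.split()
    step = chunk_size - overlap
    # Stop early enough that the last window is not just leftover overlap.
    last_start = max(len(words) - overlap, 1)
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, last_start, step)
    ]


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)
```

A splitter like this has no notion of table boundaries, so a table row can end up divided between two chunks.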

In [26]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=["../data/layout-parser-paper.pdf"])
base_docs = reader.load_data()
raw_index = VectorStoreIndex.from_documents(base_docs)
raw_query_engine = raw_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker]
)

## Using the new ```LlamaParse``` as the PDF parser and retrieving tables with two different methods
We compare the base query engine with the recursive query engine on queries about tables.

### Table Query Task: Queries for Table Question Answering

In [27]:
query = "Details of block.shift(dx, dy)"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The operation `block.shift(dx, dy)` is used to move the current block by specified distances in the x and y directions. The parameters `dx` and `dy` represent the shift distances along the horizontal and vertical axes, respectively.
Retrieval entering 9a78aaff-b4e8-4470-a5fa-a54ea88ac00c: TextNode
Retrieving from object TextNode with query Details of block.shift(dx, dy)
***********New LlamaParse+ Recursive Retriever Query Engine***********
The operation `block.shift(dx, dy)` is used to move the current block by specified distances in the x and y directions.


In [28]:
query = "Give me details of All operations supported by the layout elements."

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The operations supported by the layout elements include:

1. **block.pad(top, bottom, right, left)**: Enlarges the current block according to the specified padding values.
2. **block.scale(fx, fy)**: Scales the current block based on the given ratios in the x and y directions.
3. **block.shift(dx, dy)**: Moves the current block by the specified distances in the x and y directions.
4. **block1.is_in(block2)**: Checks whether block1 is inside block2.
5. **block1.intersect(block2)**: Returns the intersection region of block1 and block2, with the coordinate type determined by the inputs.
6. **block1.union(block2)**: Returns the union region of block1 and block2, with the coordinate type determined by the inputs.
7. **block1.relative_to(block2)**: Converts the absolute coordinates of block1 to relative coordinates with respect to block2.
8. **block1.condition_on(block2)**: Calculates the absolute coordinates of block1 based on the absolute coordinat

In [29]:
query = "Tell me about japanese documenty pipeline"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The Japanese document digitization pipeline utilizes LayoutParser to generate high-quality structured data from historical Japanese firm financial tables, which often feature complicated layouts. This pipeline employs two layout detection models to identify different levels of document structures and utilizes two customized OCR engines to enhance character recognition accuracy.

The documents typically contain vertically arranged text columns, a common format in Japanese writing. Due to issues like scanning noise and variations in printing technology, these columns can be skewed or have inconsistent widths, making them challenging to identify with traditional rule-based methods. The pipeline addresses these challenges by implementing a document reorganization algorithm that rearranges detected tokens based on their bounding boxes, improving character recognition recall.

Additionally, the pipeline is designed to handle unique fonts and glyphs u

In [30]:
response_1

Response(response='The Japanese document digitization pipeline utilizes LayoutParser to generate high-quality structured data from historical Japanese firm financial tables, which often feature complicated layouts. This pipeline employs two layout detection models to identify different levels of document structures and utilizes two customized OCR engines to enhance character recognition accuracy.\n\nThe documents typically contain vertically arranged text columns, a common format in Japanese writing. Due to issues like scanning noise and variations in printing technology, these columns can be skewed or have inconsistent widths, making them challenging to identify with traditional rule-based methods. The pipeline addresses these challenges by implementing a document reorganization algorithm that rearranges detected tokens based on their bounding boxes, improving character recognition recall.\n\nAdditionally, the pipeline is designed to handle unique fonts and glyphs used in historical d

In [31]:
response_2

Response(response='A comprehensive pipeline was developed to digitize historical Japanese firm financial tables, which often feature complicated layouts. This pipeline utilizes two layout models to identify various levels of document structures and incorporates two customized OCR engines to enhance character recognition accuracy. The documents typically contain vertically written text, a common format in Japanese, which can present challenges due to scanning noise and the variability in column widths. The pipeline effectively addresses these complexities to generate high-quality structured data from the historical documents.', source_nodes=[NodeWithScore(node=TextNode(id_='12361d36-1cb9-4f44-a5a0-89ccff52aff0', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# LayoutParser: A Unified Toolkit for DL-Based DIA\n\nfocuses on precision, efficiency, and robustness. The target documents may have complicated structures, and 