## Building Advanced RAG With LlamaParse

In this notebook we demonstrate the following:

1. Parsing a PDF with LlamaParse.
2. Using recursive retrieval over the LlamaParse output to query tables and text within a document hierarchically.

[LlamaParse Documentation](https://github.com/run-llama/llama_parse/)

#### Installation

In [None]:
!pip install llama-index
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse

Collecting llama-index
  Downloading llama_index-0.11.5-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_agent_openai-0.3.0-py3-none-any.whl.metadata (728 bytes)
Collecting llama-index-cli<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_cli-0.3.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.12.0,>=0.11.5 (from llama-index)
  Downloading llama_index_core-0.11.5-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-index-embeddings-openai<0.3.0,>=0.2.4 (from llama-index)
  Downloading llama_index_embeddings_openai-0.2.4-py3-none-any.whl.metadata (635 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.3.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.3.0-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_legacy-0.9.48.post3-py3-none-any.whl.metadata (8.5 kB)

#### Download Data

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10q/uber_10q_march_2022.pdf' -O './uber_10q_march_2022.pdf'

--2024-09-05 07:01:47--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10q/uber_10q_march_2022.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1260185 (1.2M) [application/octet-stream]
Saving to: ‘./uber_10q_march_2022.pdf’


2024-09-05 07:01:48 (77.6 MB/s) - ‘./uber_10q_march_2022.pdf’ saved [1260185/1260185]



#### Setting API Keys

In [None]:
# llama-parse is async-first; running its async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()

import os

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-..."
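
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that reads the keys from the environment and fails early if one is missing. It assumes the keys were exported in the shell before the notebook was started; `require_env` is a hypothetical helper written for this example, not part of llama-index or llama-parse.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes both keys were exported before starting the notebook):
# llama_cloud_key = require_env("LLAMA_CLOUD_API_KEY")
# openai_key = require_env("OPENAI_API_KEY")
```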

#### Setting LLM and Embedding Model

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125")

Settings.llm = llm
Settings.embed_model = embed_model

#### LlamaParse PDF reader for PDF Parsing

We compare two retrieval/query engine strategies:

1. Using the raw Markdown text as nodes to build an index, then applying a simple query engine to generate results.
2. Using `MarkdownElementNodeParser` to parse the Markdown output from LlamaParse, then building a recursive retriever query engine for generation.
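
To make the second strategy concrete, here is a toy, library-free sketch of the idea behind recursive retrieval: a short summary node is matched against the query, and the retriever then follows its reference to the full table. The `Node` class, the word-overlap score, and the fruit table are all made up for illustration; llama-index uses embeddings and its own node classes rather than anything shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    text: str
    # For a summary (index) node: id of the full node it points to.
    ref_id: Optional[str] = None


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_retrieve(
    query: str, nodes: List[Node], store: Dict[str, Node], top_k: int = 1
) -> List[Node]:
    """Rank nodes by the toy score, then swap each summary node for the node it references."""
    ranked = sorted(nodes, key=lambda n: overlap(query, n.text), reverse=True)[:top_k]
    return [store[n.ref_id] if n.ref_id else n for n in ranked]


# Made-up data: a table, a summary that points to it, and unrelated prose.
table = Node("table-1", "|fruit|count|\n|---|---|\n|apples|3|")
summary = Node("summary-1", "table of fruit counts", ref_id="table-1")
prose = Node("text-1", "an unrelated paragraph about weather")
store = {n.node_id: n for n in (table, summary, prose)}

# Only the summary and the prose are searched; the table is reached by reference.
hits = recursive_retrieve("fruit counts", [summary, prose], store)
print(hits[0].node_id)  # table-1
```

The point of the design is that the searchable text (the summary) can be short and descriptive, while the text handed to the LLM (the table) stays complete.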

In [None]:
# LlamaParse PDF reader for PDF Parsing
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data(
    "./uber_10q_march_2022.pdf"
)

Started parsing the file under job_id 0ef2f65b-9cab-4ca8-b221-d20f1f6d1336


In [None]:
print(documents[0].text[:1000] + "...")

# UNITED STATES SECURITIES AND EXCHANGE COMMISSION

# Washington, D.C. 20549

# FORM 10-Q

(Mark One)

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the quarterly period ended March 31, 2022

OR

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the transition period from_____ to _____

Commission File Number: 001-38902

# UBER TECHNOLOGIES, INC.

(Exact name of registrant as specified in its charter)

Not Applicable

(Former name, former address and former fiscal year, if changed since last report)

|Delaware|45-2647441|
|---|---|
|(State or other jurisdiction of incorporation or organization)|(I.R.S. Employer Identification No.)|
|1515 3rd Street|San Francisco, California 94158|
|(Address of principal executive offices, including zip code)|(415) 612-8582|
|(Registrant’s telephone number, including area code)| |

# Securities registered pursuant to Section 12(b) of the Act:

|Title of each c

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

nodes = node_parser.get_nodes_from_documents(documents)


In [None]:
text_nodes, index_nodes = node_parser.get_nodes_and_objects(nodes)

In [None]:
text_nodes[0]

TextNode(id_='c6ffea61-1221-40e3-b0e0-5b24cfbd02d5', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='33b7b29c-8eba-458b-a25f-bb8f88951e92', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='3d4ec5b02a042598b0ea47cdac56453869c17b531a10f60343e9598e05a9390e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='de618b65-c78a-4390-8536-4e9e295c0e49', node_type=<ObjectType.INDEX: '3'>, metadata={'col_schema': 'Column: Delaware\nType: string\nSummary: State or other jurisdiction of incorporation or organization\n\nColumn: 45-2647441\nType: string\nSummary: I.R.S. Employer Identification No.'}, hash='c008153189b8dd031a3e5e694239a50ebd21f42602676f072d9746241fcef858')}, text='UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n\n Washington, D.C. 20549\n\n FORM 10-Q\n\n(Mark One)\n\n☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\

In [None]:
index_nodes[0]

IndexNode(id_='de618b65-c78a-4390-8536-4e9e295c0e49', embedding=None, metadata={'col_schema': 'Column: Delaware\nType: string\nSummary: State or other jurisdiction of incorporation or organization\n\nColumn: 45-2647441\nType: string\nSummary: I.R.S. Employer Identification No.'}, excluded_embed_metadata_keys=['col_schema'], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='33b7b29c-8eba-458b-a25f-bb8f88951e92', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='3d4ec5b02a042598b0ea47cdac56453869c17b531a10f60343e9598e05a9390e'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='c6ffea61-1221-40e3-b0e0-5b24cfbd02d5', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='0cafbb2bbffe3085738e748c9ed19c5b88f6b300d876820fc3caa7afa8f0627f'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='c57f8dab-7b69-4850-8885-6a9cf0f531f9', node_type=<ObjectType.TEXT: '1'>, metadata={'table_df': "{'Delaware': {0: '(State or other jur

#### Build Index

In [None]:
recursive_index = VectorStoreIndex(nodes=text_nodes + index_nodes)
raw_index = VectorStoreIndex.from_documents(documents)

#### Create Query Engines

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)


In [None]:
recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15, node_postprocessors=[reranker], verbose=True
)

In [None]:
raw_query_engine = raw_index.as_query_engine(
    similarity_top_k=15, node_postprocessors=[reranker]
)
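
Both engines above follow the same two-stage pattern: fetch a generous shortlist (`similarity_top_k=15`), then let the reranker keep only the best few (`top_n=5`). The sketch below shows that pattern in plain Python under simplifying assumptions: `retrieve_then_rerank`, `word_overlap`, and `phrase_match` are hypothetical stand-ins for the vector search and the `BAAI/bge-reranker-large` model, and the documents are toy strings.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    candidates: List[str],
    cheap_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    similarity_top_k: int = 15,
    top_n: int = 5,
) -> List[str]:
    """Keep the best `similarity_top_k` by the cheap score, then the best `top_n` by the rerank score."""
    shortlist = sorted(
        candidates, key=lambda c: cheap_score(query, c), reverse=True
    )[:similarity_top_k]
    return sorted(shortlist, key=lambda c: rerank_score(query, c), reverse=True)[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Cheap first-stage score: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def phrase_match(query: str, text: str) -> float:
    """Stricter second-stage score: 1.0 only if the query appears as a phrase."""
    return 1.0 if query.lower() in text.lower() else 0.0


docs = ["cash flow table", "flow of cash", "weather report"]
result = retrieve_then_rerank(
    "cash flow", docs, word_overlap, phrase_match, similarity_top_k=2, top_n=1
)
print(result)  # ['cash flow table']
```

A wide first stage favors recall, while the second stage spends a more expensive score only on the shortlist.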

#### Querying with two different query engines

We compare the base query engine with the recursive query engine on questions that involve tables.

##### Table Query Task: Queries for Table Question Answering

In [None]:
query = "What is the change of free cash flow and what is the rate from the financial and operational highlights?"

response_1 = raw_query_engine.query(query)
print("\n************New LlamaParse+ Basic Query Engine************")
print(response_1)

response_2 = recursive_query_engine.query(query)
print(
    "\n************New LlamaParse+ Recursive Retriever Query Engine************"
)
print(response_2)


************New LlamaParse+ Basic Query Engine************
The change in free cash flow from the financial and operational highlights is an increase of $826 million, from a net cash used in operating activities of $611 million in 2021 to net cash provided by operating activities of $215 million in 2022. The rate of this change is a positive improvement.
Retrieval entering 015f9778-1f7c-44cd-9e26-90f2c9e21550: TextNode
Retrieving from object TextNode with query What is the change of free cash flow and what is the rate from the financial and operational highlights?
Retrieval entering 5e8febd0-0c43-4552-9499-9465674b8877: TextNode
Retrieving from object TextNode with query What is the change of free cash flow and what is the rate from the financial and operational highlights?
Retrieval entering d3c8d59b-9d7e-4088-94e9-3e58aba09f10: TextNode