# Llama Parser <> LlamaIndex

This notebook walks through using `LlamaParser` to parse a PDF for RAG applications built with `LlamaIndex`, and compares the results against a baseline PDF reader.

In [None]:
!pip install llama-index llama-parser sentence-transformers

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10q/uber_10q_march_2022.pdf' -O './uber_10q_march_2022.pdf'

## Using PDFReader from LlamaHub as a baseline

In [42]:
from llama_hub.file.pdf.base import PDFReader
from pathlib import Path
from llama_index import Document


loader = PDFReader()
docs0 = loader.load_data(file=Path('./uber_10q_march_2022.pdf'))
doc_text = "\n\n".join([d.get_content() for d in docs0])
baseline_docs = [Document(text=doc_text)]

### Build a vector index over the nodes parsed by PDFReader and run a basic query engine as the baseline approach

In [43]:
from llama_index.node_parser import SentenceSplitter
from llama_index.schema import IndexNode
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.query_engine import RetrieverQueryEngine


node_parser = SentenceSplitter(chunk_size=512)
baseline_base_nodes = node_parser.get_nodes_from_documents(baseline_docs)
# assign deterministic node ids (node-0, node-1, ...) so they stay the same across runs
for idx, node in enumerate(baseline_base_nodes):
    node.id_ = f"node-{idx}"


embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-4")
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)

baseline_index = VectorStoreIndex(baseline_base_nodes, service_context=service_context)
baseline_retriever = baseline_index.as_retriever(similarity_top_k=15)
baseline_pdf_query_engine = RetrieverQueryEngine.from_args(
    baseline_retriever, service_context=service_context
)
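
As a rough illustration of what `similarity_top_k=15` asks the retriever to do, the sketch below ranks a few toy vectors by cosine similarity and keeps the best `k`. The vectors and the `cosine` and `top_k` helpers are made up for illustration; this is not LlamaIndex's implementation, which operates on real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-d "embeddings", keyed by node ids shaped like the ones set above.
toy_vecs = {
    "node-0": [1.0, 0.0],
    "node-1": [0.7, 0.7],
    "node-2": [0.0, 1.0],
}
top_hits = top_k([1.0, 0.1], toy_vecs, k=2)
print(top_hits)
```

Asking for more results than there are nodes simply returns every node, ranked.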

## Using `LlamaParser` for PDF parsing
We also compare two other retrieval strategies built on the parsed output:
1. Using the raw Markdown text to build an index and applying a simple query engine to synthesize the results.
2. Using `MarkdownElementNodeParser` to parse the Markdown output and building a recursive retriever query engine.
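
To make the second strategy more concrete, here is a minimal sketch of the idea of separating tables from prose in Markdown. The `split_markdown_elements` helper is hypothetical and far simpler than `MarkdownElementNodeParser`, whose internals are not reproduced here: it only assumes that consecutive lines starting with `|` form a table.

```python
def split_markdown_elements(markdown_text):
    """Group non-blank lines into ("table", text) and ("text", text) elements."""
    elements = []
    current_kind, current_lines = None, []
    for line in markdown_text.splitlines():
        if not line.strip():
            continue  # blank lines are skipped, so adjacent paragraphs merge
        kind = "table" if line.lstrip().startswith("|") else "text"
        if kind != current_kind and current_lines:
            # the element type changed: close the element collected so far
            elements.append((current_kind, "\n".join(current_lines)))
            current_lines = []
        current_kind = kind
        current_lines.append(line)
    if current_lines:
        elements.append((current_kind, "\n".join(current_lines)))
    return elements


# Sample input; the closing sentence is invented for illustration.
sample = """### Securities registered pursuant to Section 12(b) of the Act:

|Title of each class|Trading Symbol(s)|Name of each exchange on which registered|
|---|---|---|
|Common Stock, par value $0.00001 per share|UBER|New York Stock Exchange|

Some prose that follows the table."""

elements = split_markdown_elements(sample)
for kind, text in elements:
    print(kind, "->", text.splitlines()[0])
```

Keeping tables as separate elements is what lets a later step treat them differently from prose, for example by indexing a summary of each table.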

In [14]:
# llama-parser is async-first; running its sync code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()

import os
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"
os.environ["OPENAI_API_KEY"] = "sk-"

In [15]:
from llama_parser import LlamaParser
from llama_index.schema import Document
documents = LlamaParser(result_type="markdown").load_data('./uber_10q_march_2022.pdf')

In [16]:
print(documents[0].text[:1000] + '...')

# SEC Form 10-Q

## UNITED STATES SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549

### FORM 10-Q

(Mark One)

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the quarterly period ended March 31, 2022

Commission File Number: 001-38902

### UBER TECHNOLOGIES, INC.

(Exact name of registrant as specified in its charter)

Delaware 45-2647441

(State or other jurisdiction of incorporation or organization) (I.R.S. Employer Identification No.)

1515 3rd Street

San Francisco, California 94158

(Address of principal executive offices, including zip code)

(415) 612-8582

(Registrant’s telephone number, including area code)

### Securities registered pursuant to Section 12(b) of the Act:

|Title of each class|Trading Symbol(s)|Name of each exchange on which registered|
|---|---|---|
|Common Stock, par value $0.00001 per share|UBER|New York Stock Exchange|

Indicate by check mark whether the registrant (1) has filed all reports required 

In [17]:
from llama_index.node_parser import MarkdownElementNodeParser
from llama_index.llms import OpenAI

node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-4"))

In [18]:
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, node_mapping = node_parser.get_base_nodes_and_mappings(nodes)

Embeddings have been explicitly disabled. Using MockEmbedding.


56it [14:23, 15.42s/it]


In [19]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.embeddings import OpenAIEmbedding

ctx = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"), embed_model=OpenAIEmbedding(model="text-embedding-3-small"))

index = VectorStoreIndex(nodes=base_nodes, service_context=ctx)
base_index = VectorStoreIndex.from_documents(documents, service_context=ctx)

In [20]:
from llama_index.retrievers import RecursiveRetriever

retriever = RecursiveRetriever(
    "vector", 
    retriever_dict={
        "vector": index.as_retriever(similarity_top_k=15)
    },
    node_dict=node_mapping,
)
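
Conceptually, the recursive step follows references from retrieved nodes to the nodes they stand for. The sketch below shows that idea with plain dictionaries. The `resolve_hits` helper, the node ids, and the texts are all hypothetical, and it assumes `node_mapping` plays the role of a lookup from ids to underlying nodes; it is not how `RecursiveRetriever` is implemented.

```python
def resolve_hits(hit_ids, references, node_store):
    """Replace any hit that references another node with that node's text."""
    resolved = []
    for node_id in hit_ids:
        target_id = references.get(node_id, node_id)  # follow the link if one exists
        text = node_store[target_id]
        if text not in resolved:  # avoid returning the same target twice
            resolved.append(text)
    return resolved


# Hypothetical data: a short table summary node points at the full table node.
node_store = {
    "text-1": "Prose about Mobility revenue.",
    "table-1-summary": "Summary: table of MAPCs by quarter.",
    "table-1": "|Quarter|MAPCs|\n|---|---|\n|Q1 2022|115 million|",
}
references = {"table-1-summary": "table-1"}

resolved = resolve_hits(["table-1-summary", "text-1"], references, node_store)
print(resolved)
```

The short summary is what gets matched during retrieval, while the full table is what reaches the LLM.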

In [28]:
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(top_n=5, model="BAAI/bge-reranker-large")

recursive_query_engine = RetrieverQueryEngine.from_args(retriever, node_postprocessors=[reranker], service_context=ctx)

base_query_engine = base_index.as_query_engine(similarity_top_k=15, node_postprocessors=[reranker], service_context=ctx)
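
Both engines retrieve 15 candidates and let the reranker keep the best 5. The retrieve-then-rerank pattern can be sketched as below, using a toy word-overlap score in place of the cross-encoder model. The `rerank` and `word_overlap` helpers and the candidate sentences are illustrative assumptions, not part of `SentenceTransformerRerank`.

```python
def rerank(query, candidates, score_fn, top_n):
    """Re-score candidates against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


def word_overlap(query, text):
    """Toy relevance score: number of distinct query words found in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split()))


# Hypothetical candidates, standing in for the nodes returned by a retriever.
candidates = [
    "Freight revenue increased after the acquisition of Transplace.",
    "Monthly active platform consumers were 115 million.",
    "The company is incorporated in Delaware.",
]
best = rerank("monthly active platform consumers", candidates, word_overlap, top_n=1)
print(best)
```

A wide first pass followed by a narrow second pass keeps recall high while limiting how much context is sent to the LLM.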

### Table Query

In [44]:
response = baseline_pdf_query_engine.query("What was the change in monthly active platform consumers?")
print(str(response))

The Monthly Active Platform Consumers (MAPCs) increased by 17% compared to the same period in 2021. The MAPCs were 115 million in the first quarter of 2022, declining 3 million, or 3%, quarter-over-quarter.


In [29]:
response = base_query_engine.query("What was the change in monthly active platform consumers?")
print(str(response))

The Monthly Active Platform Consumers (MAPCs) for Uber decreased from 118 million in Q4 2021 to 115 million in Q1 2022. This represents a decrease of 3 million MAPCs, or a 3% decline quarter-over-quarter. However, compared to the same period in 2021, the MAPCs grew by 17%.


In [30]:
response = recursive_query_engine.query("What was the change in monthly active platform consumers?")
print(str(response))

The number of Monthly Active Platform Consumers (MAPCs) increased by 17% from 98 million in the first quarter of 2021 to 115 million in the first quarter of 2022. However, there was a quarter-over-quarter decline of 3 million, or 3%, from the fourth quarter of 2021 to the first quarter of 2022.
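
The figures quoted in the answers above can be cross-checked with a little arithmetic. This is only a sanity check on the numbers as they appear in the responses (98, 118, and 115 million); `pct_change` is a helper introduced here for illustration.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# MAPC figures (in millions) as quoted in the answers above.
q1_2021, q4_2021, q1_2022 = 98, 118, 115

yoy = pct_change(q1_2021, q1_2022)
qoq = pct_change(q4_2021, q1_2022)
print(f"year-over-year: {yoy:.1f}%")
print(f"quarter-over-quarter: {qoq:.1f}%")
```

The results, roughly 17.3% and -2.5%, round to the 17% increase and 3% decline reported by the query engines.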


In [45]:
response = baseline_pdf_query_engine.query("Which was the primary driver of revenue in the past 3 months for both region and offerings?")
print(str(response))

The primary driver of revenue in the past three months was the Mobility segment, which saw an increase in revenue from $853 million to $2,518 million. This was primarily due to an increase in Mobility Gross Bookings as the business recovers from the impacts of COVID-19 and business model changes in the UK. In terms of geographical region, the United States and Canada saw the highest revenue increase from $1,849 million to $4,562 million.


In [31]:
response = base_query_engine.query("Which was the primary driver of revenue in the past 3 months for both region and offerings?")
print(str(response))

The primary driver of revenue in the past three months was an increase in Gross Bookings, primarily driven by increases in Mobility Trip volumes as the business recovers from the impacts of COVID-19. Additionally, there was a significant increase in Freight revenue resulting primarily from the acquisition of Transplace in the fourth quarter of 2021. In terms of regions, the United States was the largest contributor to revenue, followed by all other countries combined.


In [32]:
response = recursive_query_engine.query("Which was the primary driver of revenue in the past 3 months for both region and offerings?")
print(str(response))

The primary driver of revenue in the past three months was the increase in Mobility Trip volumes as the business recovers from the impacts of COVID-19, and a significant increase in Freight revenue resulting primarily from the acquisition of Transplace in the fourth quarter of 2021. In terms of regions, the United States and Canada (US&CAN) generated the highest revenue, with a significant increase from $1,849 million to $4,562 million between the first quarters of 2021 and 2022.


### General Query

In [46]:
response = baseline_pdf_query_engine.query("What is the impact of the COVID-19 pandemic on business?")
print(str(response))

The COVID-19 pandemic has significantly impacted businesses globally. It has led to reduced demand for mobility offerings due to travel restrictions, business restrictions, school closures, and limitations on social or public gatherings implemented by various governments. Even as restrictions have been lifted in many regions, end-user behavior and demand may not recover to pre-pandemic levels. The pandemic has also led to driver supply constraints, with consumer demand for mobility recovering faster than driver availability. To support social distancing, shared rides offerings have been temporarily suspended in many regions. The pandemic has adversely affected near-term financial results and may continue to impact long-term financial results, necessitating significant response actions such as workforce reductions and changes to pricing models. The pandemic's impact on business partners and third-party vendors is also unpredictable and could have adverse effects. The extent of the pande

In [33]:
response = base_query_engine.query("What is the impact of the COVID-19 pandemic on business?")
print(str(response))

The COVID-19 pandemic has significantly affected business operations. Various governments have implemented measures such as travel restrictions, business restrictions, school closures, and limitations on social gatherings to limit the spread of the virus. These measures have reduced the demand for mobility offerings globally and affected travel behavior. Even as restrictions have been lifted, end-user behavior and demand may not recover to pre-pandemic levels. There have been driver supply constraints, with consumer demand for mobility recovering faster than driver availability. To support social distancing, shared rides offerings have been temporarily suspended in many regions. The pandemic has also adversely affected financial results and required significant actions in response, such as workforce reductions and changes to pricing models. The future impact of the pandemic on business operations, liquidity, financial condition, and results of operations is uncertain and depends on fut

In [34]:
response = recursive_query_engine.query("What is the impact of the COVID-19 pandemic on business?")
print(str(response))

The COVID-19 pandemic has had a significant impact on business operations. It has led to uncertainties and volatility in global financial markets and economies. The pandemic has also affected the assumptions and inputs supporting certain estimates, assumptions, and judgments, particularly those related to the impairment assessment related to the determination of the fair values of certain investments and equity method investments, as well as goodwill and the recoverability of long-lived assets. 

Government-imposed restrictions, such as those on business activities and travel, have adversely impacted business operations by reducing global demand for certain services, while accelerating the growth of others. The pandemic has also created uncertainty around the world, making it difficult to predict its cumulative and ultimate impact on future business operations, financial position, liquidity, and cash flows. 

The extent of the pandemic's impact on business and financial results largely