<a href="https://colab.research.google.com/github/ranzhang/Documentation/blob/master/LlamaParse_A_Tool_for_Parsing_Complex_Documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LlamaParse - Parsing Complex Documents

With the release of [LlamaParse](https://github.com/run-llama/llama_parse) and [LlamaCloud](https://cloud.llamaindex.ai), LlamaIndex is demonstrating the next step of evolution for its offerings!

From the repository:

> LlamaParse is an API created by LlamaIndex to efficiently parse and represent files for efficient retrieval and context augmentation using LlamaIndex frameworks.

LlamaIndex has created an API endpoint (currently free for up to 10,000 pages of PDFs a day) that parses PDF files into either plain text or Markdown. The Markdown option gives us a way to retain structural information that can be leveraged for more structured queries!

They've also [recently released](https://www.llamaindex.ai/blog/llamaindex-v0-10-838e735948f8) their v0.10 which, similar to LangChain's v0.1.0, provides some stability and methodological changes to move LlamaIndex into the production-ready space. (seeyah later `ServiceContext`!)

Let's dive in and see what we can do with this new tool!

## Load and Parse PDFs

We'll start, as always, by grabbing some dependencies.

In [None]:
!pip install -qU llama-index llama-parse

We'll need to provide a LlamaCloud API key to continue.

You can find this by following these steps:

1. Sign in with one of their many SSO options.
  - ![image](https://i.imgur.com/WFH6CPK.png)
2. Navigate to the References in the bottom left hand corner of the screen and select `API Key`.
  - ![image](https://i.imgur.com/nlw1mo2.png)
3. Generate a new key, name it, and keep it in a safe place!
  - ![image](https://i.imgur.com/Rxshpeq.png)


Now that we have our API Key - let's provide it as an environment variable below.

You can also pass the key directly into the `LlamaParse` object we'll create later.
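If you re-run the notebook a lot, it can be handy to prompt only when the variable isn't already set. Here's a minimal sketch using just the standard library; `ensure_env_key` is a hypothetical helper name for illustration, not part of LlamaParse or LlamaIndex.

```python
import getpass
import os


def ensure_env_key(name, prompt_fn=getpass.getpass):
    """Return the value of env var `name`, prompting only if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Nothing usable in the environment yet, so ask for it once.
        value = prompt_fn(f"{name}: ").strip()
        if not value:
            raise ValueError(f"No value provided for {name}")
        os.environ[name] = value
    return value
```

Calling `ensure_env_key("LLAMA_CLOUD_API_KEY")` would then do the same job as the cell below on a first run, and skip the prompt on later runs in the same session.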

In [None]:
import os
import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass("LlamaCloud API Key:")

LlamaCloud API Key:··········


Since we'll be using OpenAI as our LLM today - we'll need to pass that API key as well.

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


Let's make sure we can run async in our Colab instance.

In [None]:
import nest_asyncio

nest_asyncio.apply()

### LlamaParse Initialization

Here we can initialize our `LlamaParse` object.

Notice that there are a few parameters worth paying attention to:

- `result_type` - at time of writing this notebook the options are limited to `"text"` and `"markdown"`. Markdown will be our choice as it will retain structured information quite nicely.
- `num_workers` - this will let us set how many workers we'll need. Generally we'll want to set this to the number of files we're going to need to parse. (the maximum is `10`)
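The `num_workers` rule of thumb above (one worker per file, capped at `10`) can be sketched as a tiny helper. `pick_num_workers` and `MAX_WORKERS` are hypothetical names used only for illustration; the cap of 10 is taken from the note above, not checked against the API.

```python
MAX_WORKERS = 10  # upper bound quoted in the text above (assumption, not verified here)


def pick_num_workers(file_paths, max_workers=MAX_WORKERS):
    """One worker per file, clamped to the range [1, max_workers]."""
    return max(1, min(len(file_paths), max_workers))
```

For the two files used in this notebook, `pick_num_workers(["./nvidia-earnings.pdf", "./ai-report.pdf"])` gives `2`, which matches the value passed in the cell below.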

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
    language="en",
    num_workers=2,
)

### Uploading Files

Next, we'll need to upload some files to test out the parser!

Let's use [NVIDIA's 10-K](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/1cbe8fe7-e08a-46e3-8dcc-b429fc06c1a4.pdf) and the [Office of Educational Technology's AI and the Future of Learning report](https://www2.ed.gov/documents/ai-report/ai-report.pdf).

You can upload them below - make sure the file names used in the parsing cell match the names of the files you've actually uploaded!

In [None]:
from google.colab import files

nvidia_earnings_report = files.upload()

Saving nvidia-earnings.pdf to nvidia-earnings (1).pdf


In [None]:
ai_report = files.upload()

Saving ai-report.pdf to ai-report (1).pdf


### Parsing Our Files

Now that we've uploaded our files and set up our `LlamaParse` parser, we're ready to parse some files!

Running this cell seems very inconsistent - some files take ~6 min. and others take ~4 s. There seems to be some level of caching, but you can expect medium to long wait times for this next cell.

> NOTE: As of time of writing, only `.pdf` files are accepted.
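Since the upload output above shows Colab saving a re-uploaded file under a different name (`nvidia-earnings (1).pdf`), it can be worth checking the paths before kicking off a slow parse. This is a minimal sketch using only the standard library; `check_pdf_paths` is a hypothetical helper, and the `.pdf` check simply mirrors the note above.

```python
from pathlib import Path


def check_pdf_paths(paths):
    """Split paths into (usable, problems); problems are (path, reason) pairs."""
    usable, problems = [], []
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() != ".pdf":
            problems.append((raw, "not a .pdf file"))
        elif not path.is_file():
            problems.append((raw, "file not found"))
        else:
            usable.append(raw)
    return usable, problems
```

Something like `usable, problems = check_pdf_paths(["./nvidia-earnings.pdf", "./ai-report.pdf"])` could run right before the parsing cell, so a mismatched file name shows up in `problems` straight away.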

In [None]:
documents = parser.load_data(["./nvidia-earnings.pdf", "./ai-report.pdf"])

Started parsing the file under job_id 559673ec-f538-4afc-bd2b-9ca13d8627f8
Started parsing the file under job_id 0348c448-f308-44a0-aaa3-cc7055f9014e


Let's look at our 10-K example!

In [None]:
print(documents[0].text[:1000])

|Content|Page Number|
|---|---|
|Table of Contents| |
|UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549| |
|FORM 10-K| |
|ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended January 28, 2024| |
|Commission file number: 0-23985| |
|NVIDIA NVIDIA CORPORATION (Exact name of registrant as specified in its charter)| |
|Delaware 94-3177549| |
|2788 San Tomas Expressway, Santa Clara, California 95051 (Address of principal executive offices)| |
|Registrant’s telephone number, including area code: (408) 486-2000| |
|Securities registered pursuant to Section 12(b) of the Act:| |
|Title of each class|Trading Symbol(s)|Name of each exchange on which registered|
|Common Stock, $0.001 par value per share|NVDA|The Nasdaq Global Select Market|
|Securities registered pursuant to Section 12(g) of the Act:|None|
|Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securiti

Right away we can see that some kind of structure is being retained!

In [None]:
print(documents[1].text[:1000])

# OFFICE OF Artificial Intelligence Educational Technology and the Future of Teaching and Learning Insights and Recommendations May 2023
---
## Artificial Intelligence and the Future of Teaching and Learning

Miguel A. Cardona, Ed.D.
Secretary, U.S. Department of Education

Roberto J. Rodríguez
Assistant Secretary, Office of Planning, Evaluation, and Policy Development

Kristina Ishmael
Deputy Director, Office of Educational Technology

May 2023

Examples Are Not Endorsements

This document contains examples and resource materials that are provided for the user’s convenience. The inclusion of any material is not intended to reflect its importance nor is it intended to endorse any views expressed or products or services offered. These materials may contain the views and recommendations of various subject matter experts as well as hypertext links, contact addresses, and websites to information created and maintained by other public and private organizations. The opinions expressed in any

The same is true of our AI Education report!
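To put a rough number on "structure is being retained", we could count the Markdown headings and table rows in each parsed document. This is only a sketch: `summarize_markdown_structure` is a hypothetical helper, and the line-prefix checks are a simplification rather than a full Markdown parser.

```python
def summarize_markdown_structure(text):
    """Count heading lines and table rows in a Markdown string."""
    headings = 0
    table_rows = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            headings += 1
        elif stripped.startswith("|") and stripped.endswith("|"):
            # Skip separator rows such as |---|---|
            if set(stripped) <= set("|-: "):
                continue
            table_rows += 1
    return {"headings": headings, "table_rows": table_rows}
```

Running it as `summarize_markdown_structure(documents[0].text)` and `summarize_markdown_structure(documents[1].text)` would give a quick side-by-side view of how table-heavy the 10-K is compared to the report.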

## LlamaIndex Recursive Query Engine

Now that we have some parsed objects - let's see how well we can leverage them using one of the [example query engines](https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb).

### Setting our...Settings

That's right! `ServiceContext` is dead, long live `Settings`.

Let's set our default LLM to `gpt-3.5-turbo` and our default embedding model to `text-embedding-3-small`.

> NOTE: You'll notice we're pulling `Settings` out of `llama_index.core`, which is a major part of their `v0.10` update!

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

We're going to use a `MarkdownElementNodeParser` to help make sense of our Markdown objects so we can leverage the potentially structured information in the parsed documents.

- Check out the [docs](https://docs.llamaindex.ai/en/stable/api/llama_index.core.node_parser.MarkdownElementNodeParser.html)

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"), num_workers=8)

Let's parse!

> NOTE: There appear to be intermittent errors - but the node parser is largely able to extract and understand the structured data within the document produced by LlamaParse.

In [None]:
nodes = node_parser.get_nodes_from_documents(documents=[documents[0]])

Embeddings have been explicitly disabled. Using MockEmbedding.


108it [00:00, 5435.12it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 439, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 76, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 235, in astructured_predict
    return await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 220, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 64, in _parse_tool_calls
    output = output_cls.parse_raw(function_call.arguments)
  File "/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py", l

Now we can extract our `base_nodes` and `objects` to create our `VectorStoreIndex`.

In [None]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

Let's build the index!

In [None]:
from llama_index.core import VectorStoreIndex

recursive_index = VectorStoreIndex(nodes=base_nodes+objects)

### Recursive Query Engine

Now we can build our Recursive Query Engine with reranking!

We'll need to do a few steps:

1. Initialize our reranker using `FlagEmbeddingReranker` powered by the `BAAI/bge-reranker-large` model.
2. Set up our recursive query engine!

First, let's install some requirements.

In [None]:
!pip install -qU llama-index-postprocessor-flag-embedding-reranker git+https://github.com/FlagOpen/FlagEmbedding.git


First up, we'll initialize our reranker - we'll be using [this](https://github.com/FlagOpen/FlagEmbedding) repo to run the [`BAAI/bge-reranker-large`](https://huggingface.co/BAAI/bge-reranker-large) model.

Once that's done - we can follow a fairly standard flow of creating our query engine!

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
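The cell above wires up a two-stage flow: pull 15 candidates by similarity (`similarity_top_k=15`), then let the reranker keep the best 5 (`top_n=5`). Here's a toy, pure-Python sketch of that idea. The function names are hypothetical, and the word-overlap scores are stand-ins for illustration only; they are not what embedding similarity or the BGE reranker actually compute.

```python
def retrieve_then_rerank(query, docs, first_stage_score, rerank_score,
                         similarity_top_k=15, top_n=5):
    """Keep the top `similarity_top_k` docs by a cheap score, then the top `top_n` by a costlier one."""
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)
    candidates = candidates[:similarity_top_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]


def word_overlap(query, doc):
    """Toy first-stage score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def overlap_ratio(query, doc):
    """Toy rerank score: shared words divided by document length in words."""
    words = doc.lower().split()
    return word_overlap(query, doc) / max(len(words), 1)
```

The design point is that the expensive scorer only ever sees `similarity_top_k` candidates, which is why the reranker can afford to be a heavier model than the first-stage retriever.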


## NVIDIA 10-K Test

Now we can test this on our documents! Let's start with our 10-K document.

In [None]:
query = "Who is the E-VP, Operations - and how old are they?"
response = recursive_query_engine.query(query)

Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_42_table: TextNode
Retrieving from object TextNode with query Who is the E-VP, Operations - and how old are they?
Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_44_table: TextNode
Retrieving from object TextNode with query Who is the E-VP, Operations - and how old are they?
Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_270_table: TextNode
Retrieving from object TextNode with query Who is the E-VP, Operations - and how old are they?
Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_190_table: TextNode
Retrieving from object TextNode with query Who is the E-VP, Operations - and how old are they?
Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_14_table: TextNod

In [None]:
print(response)

Debora Shoquist is the Executive Vice President of Operations, and she is 69 years old.


![image](https://i.imgur.com/OZcPlJw.png)

As you can see - this information was retrieved extremely well!

> NOTE: The full query took roughly 2-3 min., most likely because this instance was running on CPU, which makes the reranking step a bottleneck. You may find better performance running this notebook on a GPU-enabled instance.
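If you want to measure this yourself rather than eyeball it, a small timing wrapper is enough. This is a minimal standard-library sketch; `timed` is a hypothetical helper name.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `response, seconds = timed(recursive_query_engine.query, query)` would return the same response as the query cells in this notebook, along with the wall-clock time, which makes it easy to compare a CPU run against a GPU run.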

In [None]:
query = "What is the gross carrying amount of Total Amortizable Intangible Assets for Jan 29, 2023?"
response = recursive_query_engine.query(query)

Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_198_table: TextNode
Retrieving from object TextNode with query What is the gross carrying amount of Total Amortizable Intangible Assets for Jan 29, 2023?
Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_214_table: TextNode
Retrieving from object TextNode with query What is the gross carrying amount of Total Amortizable Intangible Assets for Jan 29, 2023?
Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_256_table: TextNode
Retrieving from object TextNode with query What is the gross carrying amount of Total Amortizable Intangible Assets for Jan 29, 2023?
Retrieval entering id_95ebb8c0-3296-49ef-af82-192aa1916fc1_228_table: TextNode
Retrieving from object TextNode with query What is the gross carrying amount o

In [None]:
print(response)

The gross carrying amount of Total Amortizable Intangible Assets for Jan 29, 2023 is $3,539 million.


![image](https://i.imgur.com/9jwFpWk.png)

Another big win for LlamaParse!

## Testing it on the AI Education Report

The results for the 10-K were incredible - but will the AI Education Report hold up?

In [None]:
ai_report_nodes = node_parser.get_nodes_from_documents(documents=[documents[1]])

Embeddings have been explicitly disabled. Using MockEmbedding.


19it [00:00, 6814.76it/s]
100%|██████████| 19/19 [00:13<00:00,  1.42it/s]


In [None]:
ai_base_nodes, ai_objects = node_parser.get_nodes_and_objects(ai_report_nodes)

In [None]:
ai_recursive_index = VectorStoreIndex(nodes=ai_base_nodes+ai_objects)

In [None]:
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

ai_recursive_query_engine = ai_recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)

In [None]:
query = "How many AI publications on pattern recognition was there in 2020?"
response = ai_recursive_query_engine.query(query)

Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_12_table: TextNode
Retrieving from object TextNode with query How many AI publications on pattern recognition was there in 2020?
Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_78_table: TextNode
Retrieving from object TextNode with query How many AI publications on pattern recognition was there in 2020?
Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_4_table: TextNode
Retrieving from object TextNode with query How many AI publications on pattern recognition was there in 2020?
Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_28_table: TextNode
Retrieving from object TextNode with query How many AI publications on pattern recognition was there in 2020?
Retrieval entering

In [None]:
print(response)

There were 30.07 AI publications on pattern recognition in 2020.


![image](https://i.imgur.com/tbGtUX2.png)

While the query engine *did* retrieve context that was literally on the figure - it was not the correct information, in any way.

In [None]:
query = "Can you describe what Figure 14 is related to?"
response = ai_recursive_query_engine.query(query)

Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_52_table: TextNode
Retrieving from object TextNode with query Can you describe what Figure 14 is related to?
Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_28_table: TextNode
Retrieving from object TextNode with query Can you describe what Figure 14 is related to?
Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_100_table: TextNode
Retrieving from object TextNode with query Can you describe what Figure 14 is related to?
Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_56_table: TextNode
Retrieving from object TextNode with query Can you describe what Figure 14 is related to?
Retrieval entering id_b7e493fe-7605-4671-ad3c-c676cc460d63_38_table: TextNode

In [None]:
print(response)

Figure 14 is related to the long tail of learner variability in the context of AI in education. It illustrates how learners vary in their strengths and needs, emphasizing the importance of addressing a wider spectrum of strengths and needs beyond just the most typical cases. The figure highlights the potential of AI to cater to a diverse range of learners by focusing on the long tail of learner variability rather than solely targeting the most common learning profiles.


![image](https://i.imgur.com/T7nVQj8.png)

As you can see - the query engine did not successfully retrieve context related to the correct Figure. If you read the report, you'll notice that it found information related to Fig. 13.
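One cheap way to catch this kind of mix-up is to check whether any of the retrieved chunks mention the figure we asked about at all. This is a minimal sketch: `texts_mentioning_figure` is a hypothetical helper that works on plain strings, and gathering those strings from the response's source nodes is an assumption about how you'd hook it up.

```python
import re


def texts_mentioning_figure(texts, figure_number):
    """Return indices of texts that mention 'Figure N' or 'Fig. N' for exactly this N."""
    # (?!\d) stops "Figure 1" from matching "Figure 14", and "Figure 14" from matching "Figure 140".
    pattern = re.compile(rf"\bfig(?:ure|\.)?\s*{int(figure_number)}(?!\d)", re.IGNORECASE)
    return [i for i, text in enumerate(texts) if pattern.search(text)]


# Assumption: `texts` would be collected from the retrieved context of a response,
# e.g. one string per source node; that wiring is not shown here.
```

If the list comes back empty for Figure 14 but non-empty for Figure 13, that's a strong hint the answer was built from the wrong figure's context.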