<a href="https://colab.research.google.com/github/terenceou/ai-makerspace/blob/main/Unstructured_io_%2B_LlamaIndex_Llama_Pack_for_Complex_PDF_Retrieval_Llama_Pack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Unstructured + LlamaIndex to Do Complex PDF Retrieval

Let's get this out of the way first: there is no simple, out-of-the-box solution to this problem!

We'll be leveraging a file conversion (PDF to HTML) today to enable the *lowest-friction* version of the implementation.

>NOTE: You could also use image processing or OCR as another approach to this complex problem, though results were found to be better with the file-conversion approach.

In [9]:
!pip install llama-index llama-hub unstructured==0.10.18 lxml beautifulsoup4 typing-extensions cohere -qU
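Since the install pins `unstructured` to a specific version, it can help to confirm what actually ended up in the environment before going further. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip command above.
print(installed_versions(["llama-index", "llama-hub", "unstructured"]))
```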

## Boilerplate

Stop potential async bugs in Colab, and input the ole OpenAI API key.

In [2]:
import nest_asyncio

nest_asyncio.apply()

In [3]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key: ")

OpenAI API Key: ··········
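If you re-run the notebook often, you may prefer to prompt for the key only when it is not already set. This is a small sketch of that idea; the helper name `ensure_api_key` and its `prompt` parameter are assumptions made for illustration, not part of the original notebook.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only when it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # Nothing set yet: ask the user and store the answer for later cells.
        key = prompt(f"{var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the cell above should behave the same on a fresh runtime and skip the prompt on later runs.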


## Downloading Our Llama Pack

We're going to be using the `EmbeddedTablesUnstructuredRetrieverPack` provided by llama-hub today!

This is a tool powered by LlamaIndex and Unstructured, and it does what its name says: it retrieves over documents that contain embedded tables!

In [10]:
from llama_index.llama_pack import download_llama_pack

EmbeddedTablesUnstructuredRetrieverPack = download_llama_pack(
    "EmbeddedTablesUnstructuredRetrieverPack",
    "./embedded_tables_unstructured_pack",
)

## Data Preprocessing

Because the pack works on HTML input and does not natively support PDFs, we'll need to convert our PDF to HTML first.

We'll leverage the `pdf2htmlEX` tool to accomplish this task.

You can find more information about this project [here](https://pdf2htmlex.github.io/pdf2htmlEX/)! The performance is best-in-class and it can be run locally!

Check out their [GitHub](https://github.com/pdf2htmlEX/pdf2htmlEX) as well!

In [11]:
!wget https://github.com/pdf2htmlEX/pdf2htmlEX/releases/download/v0.18.8.rc1/pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb


In [22]:
!sudo apt install "./pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb" -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'pdf2htmlex' instead of './pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb'
pdf2htmlex is already the newest version (0.0.18.8.rc1.master.bionic.20200630-0).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


Let's download the PDF we'll be using for this example, as well!

In [80]:
!wget https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/7df4dbdc-eb62-4d53-bc27-d334bfcb2335.pdf -O quarterly-nvidia.pdf

--2024-01-13 13:29:54--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/7df4dbdc-eb62-4d53-bc27-d334bfcb2335.pdf
Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 13.35.37.166, 13.35.37.47, 13.35.37.63, ...
Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|13.35.37.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 473863 (463K) [application/pdf]
Saving to: ‘quarterly-nvidia.pdf’


2024-01-13 13:29:54 (24.8 MB/s) - ‘quarterly-nvidia.pdf’ saved [473863/473863]



Now we can use our tool to convert the PDF to the desired format.

In [81]:
import subprocess

def convert_pdf_to_html(pdf_path):
    # With no extra options, pdf2htmlEX writes <pdf name>.html to the current directory.
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(["pdf2htmlEX", pdf_path], check=True)

input_pdf = "quarterly-nvidia.pdf"

convert_pdf_to_html(input_pdf)
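The next section loads `quarterly-nvidia.html`, so it is worth confirming that the conversion really produced that file. Below is a hedged sketch: it assumes `pdf2htmlEX` writes its output to the current directory under the PDF's base name (which is what this notebook relies on), and the helper names `expected_html_path` and `convert_and_check` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def expected_html_path(pdf_path):
    """Path where pdf2htmlEX is assumed to write its output (cwd, same stem)."""
    return Path(Path(pdf_path).stem + ".html")


def convert_and_check(pdf_path):
    """Run pdf2htmlEX and raise if the binary or the output file is missing."""
    if shutil.which("pdf2htmlEX") is None:
        raise FileNotFoundError("pdf2htmlEX is not on PATH")
    subprocess.run(["pdf2htmlEX", str(pdf_path)], check=True)
    html_path = expected_html_path(pdf_path)
    if not html_path.exists():
        raise FileNotFoundError(f"expected output {html_path} was not created")
    return html_path
```

Failing early here gives a clearer error than a missing-file exception raised later from inside the pack.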

## Initializing the Index and Query Engine

Now that we have our PDF in HTML format, we can load the `EmbeddedTablesUnstructuredRetrieverPack` and away we go!

In [82]:
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia.html",
    nodes_save_path="nvidia-quarterly.pkl"
)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Embeddings have been explicitly disabled. Using MockEmbedding.


0it [00:00, ?it/s]


That's it. Yes, that's completely it. There are no more steps: we can now use the tool to query our documents, including tables within those documents!

Let's see what this is built out of!

In [83]:
modules = embedded_tables_unstructured_pack.get_modules()
display(modules)

{'node_parser': UnstructuredElementNodeParser(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7f16ba23ec50>, id_func=<function default_id_func at 0x7f16bbdc6560>, llm=None, summary_query_str='What is this table about? Give a very concise summary (imagine you are adding a caption), and also output whether or not the table should be kept.'),
 'recursive_retriever': <llama_index.retrievers.recursive_retriever.RecursiveRetriever at 0x7f16bb234dc0>,
 'query_engine': <llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine at 0x7f16aac5d8a0>}
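The `recursive_retriever` module is what lets a query reach table contents. As a rough mental model only (this is a toy illustration, not LlamaIndex's implementation), the top-level search can return a short summary node that points at a fuller node, and the retriever follows that pointer. All names and node contents below are hypothetical placeholders.

```python
def recursive_retrieve(hit_id, nodes, node_mappings):
    """Follow references from a retrieved node until a node with no reference is reached."""
    seen = set()
    while hit_id in node_mappings and hit_id not in seen:
        seen.add(hit_id)  # guard against accidental reference cycles
        hit_id = node_mappings[hit_id]
    return nodes[hit_id]


# Placeholder nodes: one plain text chunk, one table summary, one full table.
nodes = {
    "text-1": "<plain text chunk>",
    "table-1-summary": "<short caption describing a table>",
    "table-1": "<full table contents>",
}
# The summary node refers to the full table it describes.
node_mappings = {"table-1-summary": "table-1"}

# A hit on the summary resolves to the full table; a text hit is returned as-is.
print(recursive_retrieve("table-1-summary", nodes, node_mappings))
print(recursive_retrieve("text-1", nodes, node_mappings))
```

The vector search that would pick `hit_id` in the first place is left out here to keep the sketch self-contained.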

## Testing!

Now we can test our pipeline - and see some truly remarkable results, despite the limitations.

In [84]:
response = embedded_tables_unstructured_pack.run("Revenue?")

Retrieving with query id None: Revenue?
Retrieving text node: Gaming revenue for the third quarter of fiscal year 2024 was $2.86 billion, up 81% from year ago and up 15% from the previous quarter We launched DLSS 3.5 Ray Reconstruction; released ensorRT LLM for Windows; added 56 DLSS games and over 15 Reflex games; and surpassed 1,700 games on GeForce NOW 28 Professional Visualization revenue for the third quarter of fiscal year 2024 was $416 million, up 108% from year ago and up 10% from the previous quarter. We announced new line of desktop workstations with NVIDIA RTX 6000 Ada Generation GPUs and NVIDIA ConnectX smart interface cards. Automotive revenue for the third quarter of fiscal year 2024 was $261 million, up 4% from year ago and up 3% from the previous quarter We furthered our collaboration with Foxconn to develop next-generation el

In [85]:
response.response

'The revenue for the third quarter of fiscal year 2024 was $2.86 billion for Gaming, $416 million for Professional Visualization, and $261 million for Automotive.'

In [86]:
response = embedded_tables_unstructured_pack.run("Revenue from sales outside of USA?")

Retrieving with query id None: Revenue from sales outside of USA?
Retrieving text node: We sell our products internationally and we also have operations and conduct business internationally Our semiconductor wafers are manufactured, assembled, tested and packaged by third parties located outside of the United States, and we generated 65% and 62% of our revenue during the third quarter and first nine months of fiscal year 2024 from sales outside of the United States, respectively Due to recent USG licensing requirements, we expect that our sales to China and other affected destinations will decline significantly in the fourth quarter of fiscal year 2024. The global nature of our business subjects us to number of risks and uncertainties, which have had in the past and could in the future have material adverse effect on our business, fi

In [87]:
response.response

'The company generated 65% and 62% of its revenue during the third quarter and first nine months of fiscal year 2024, respectively, from sales outside of the United States.'
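The raw text nodes logged during retrieval come straight from the converted HTML, so their whitespace can be irregular. If you want to inspect retrieved text yourself, a small helper along these lines collapses any run of spaces or line breaks into a single space. The name `normalize_whitespace` is hypothetical and not part of the pack.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


sample = "Gaming \n\nrevenue\n\nfor\n\nthe\n\nthird  quarter"
print(normalize_whitespace(sample))
```

This only tidies the display; it cannot restore characters or spaces that were lost during conversion.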

In [None]:
response = embedded_tables_unstructured_pack.run("Any policy changes?")

Retrieving with query id None: Any policy changes?
Retrieving text node: Changes to the laws, rules and regulations to which weare subject, or changes to their interpretation and enforcement, could lead to materially greater compliance and other costs and/or furtherrestrictions on our ability to manufacture and supply our products and operate our business. For example, we may face increased compliancecosts as a result of changes or increases in antitrust legislation, regulation, administrative rule making, increased focus from regulators oncybersecurity vulnerabilities and risks. Our position in markets relating to AI has led to increased interest in our business from regulatorsworldwide, including the European Union, the United States, and China. For example, the French Competition Authority collected informationfrom us regarding our business and competition in the graphics card and

In [None]:
response.response

"There may be policy changes that could impact the company's operations. Changes to laws, regulations, and their interpretation and enforcement could lead to increased compliance costs and further restrictions on manufacturing and supplying products. Additionally, revisions to laws or regulations could result in increased taxation, trade sanctions, import duties or tariffs, and other retaliatory actions. Government actions, including trade protection and national security policies, could affect the company's ability to ship products and provide services. The increasing focus on the risks and strategic importance of AI technologies has also resulted in regulatory restrictions that may impact some or all of the company's product and service offerings."

In [None]:
response = embedded_tables_unstructured_pack.run("Unallocated cost of revenue and operating expenses?")

In [None]:
response.response

'The unallocated cost of revenue and operating expenses is not provided in the given context information.'

## Modifying our LlamaPack!

We can edit the actual application logic to modify our pack if desired.

Let's look at an example!

```python
def __init__(
        self,
        html_path: str,
        nodes_save_path: Optional[str] = None,
        **kwargs: Any,
    ) -> None:
        """Init params."""
        self.reader = FlatReader()

        docs = self.reader.load_data(Path(html_path))

        self.node_parser = UnstructuredElementNodeParser()
        if nodes_save_path is None or not os.path.exists(nodes_save_path):
            raw_nodes = self.node_parser.get_nodes_from_documents(docs)
            pickle.dump(raw_nodes, open(nodes_save_path, "wb"))
        else:
            raw_nodes = pickle.load(open(nodes_save_path, "rb"))

        base_nodes, node_mappings = self.node_parser.get_base_nodes_and_mappings(
            raw_nodes
        )
        # construct top-level vector index + query engine
        vector_index = VectorStoreIndex(base_nodes)
        vector_retriever = vector_index.as_retriever(similarity_top_k=1)
        self.recursive_retriever = RecursiveRetriever(
            "vector",
            retriever_dict={"vector": vector_retriever},
            node_dict=node_mappings,
            verbose=True,
        )
        self.query_engine = RetrieverQueryEngine.from_args(self.recursive_retriever)
```
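One detail of the constructor worth noting is the node cache: parsed nodes are pickled to `nodes_save_path` and reloaded on later runs. As quoted, the code reaches `open(nodes_save_path, "wb")` even when `nodes_save_path` is `None`, and it leaves file handles unclosed. If you are editing the pack anyway, a caching helper could look something like this sketch; `load_or_build` is a hypothetical name, and this is one possible approach rather than the pack's own code.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Load a pickled result if the cache file exists, otherwise build and cache it."""
    if cache_path is not None and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    if cache_path is not None:  # skip saving when no cache path was given
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
    return result


# Small demonstration with placeholder data: the builder runs only once.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nodes.pkl")
    calls = []

    def build():
        calls.append(1)
        return ["node-a", "node-b"]

    first = load_or_build(path, build)
    second = load_or_build(path, build)
```

In the pack, `build_fn` would wrap the `get_nodes_from_documents(docs)` call. Keep in mind that pickle files should only be loaded from sources you trust.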

Now we can add a different LLM to our QueryEngine - as follows:

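```python
# At the top of the pack's base.py:
from llama_index import ServiceContext
from llama_index.llms import OpenAI

# Inside __init__, in place of the final query-engine line:
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4-1106-preview")
)
self.query_engine = RetrieverQueryEngine.from_args(
    self.recursive_retriever,
    service_context=service_context,
)
```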

In [88]:
from embedded_tables_unstructured_pack.base import EmbeddedTablesUnstructuredRetrieverPack

In [89]:
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia.html",
    nodes_save_path="nvidia-quarterly.pkl"
)

In [90]:
response = embedded_tables_unstructured_pack.run("Any policy changes?")

Retrieving with query id None: Any policy changes?
Retrieving text node: Changes to the laws, rules and regulations to which weare subject, or changes to their interpretation and enforcement, could lead to materially greater compliance and other costs and/or furtherrestrictions on our ability to manufacture and supply our products and operate our business. For example, we may face increased compliancecosts as a result of changes or increases in antitrust legislation, regulation, administrative rule making, increased focus from regulators oncybersecurity vulnerabilities and risks. Our position in markets relating to AI has led to increased interest in our business from regulatorsworldwide, including the European Union, the United States, and China. For example, the French Competition Authority collected informationfrom us regarding our business and competition in the graphics card and

In [None]:
response.response

'Yes, there have been changes to laws, rules, and regulations that have affected compliance costs and imposed further restrictions on the ability to manufacture and supply products and operate the business. These changes include increased antitrust legislation, regulation, and a greater focus on cybersecurity. Additionally, there have been revisions to laws and regulations that have led to increased taxation, trade sanctions, and import/export restrictions. Trade protection and national security policies have also been enacted, affecting the ability to ship products and provide services. Moreover, in response to the war in Ukraine, economic sanctions and export control measures have been implemented, which have halted the passage of products, services, and support into certain regions and resulted in the cessation of direct sales and business operations in Russia. There is also an increasing regulatory focus on AI technologies, leading to restrictions on products and services capable o

In [None]:
response = embedded_tables_unstructured_pack.run("What are the research and development expenses, and what percentage of the net revenue do they represent?")

In [None]:
response.response

'The research and development expenses for the three months ended October 29, 2023, were $2,294 million, representing 12.7% of net revenue. For the nine months ended on the same date, the expenses were $6,210 million, which was 16.0% of net revenue.'

As can be seen, switching to GPT-4 Turbo (`gpt-4-1106-preview`) noticeably improves the responses, though it relies on a more expensive model to achieve this!