<a href="https://colab.research.google.com/github/sdossou/DSA_Agentic_RAG/blob/main/DSA_LlamaParse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LlamaParse

Advanced retrieval-augmented generation (RAG) for the EU Digital Services Act (DSA) legislation, using LlamaParse from LlamaIndex on the DSA Delegated Act and Annex PDF documents, with OpenAI `gpt-3.5-turbo`.


## Load and Parse PDFs

Installing dependencies.

In [None]:
!pip install -qU llama-index llama-parse

Providing a LlamaCloud API key

In [None]:
import os
import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass("LLamaParse API Key:")

LLamaParse API Key:··········


Providing an OpenAI API key

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········
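When cells are re-run, it can be convenient to prompt only if a key is not already set. The following is a minimal sketch under that assumption; `ensure_env_key` and its `ask` parameter are hypothetical names introduced here, not part of LlamaIndex or Colab.

```python
import getpass
import os


def ensure_env_key(name, prompt, ask=getpass.getpass):
    """Return the value of env var `name`, prompting only if it is unset or empty.

    Hypothetical helper for this sketch; `ask` defaults to getpass.getpass
    so the key is not echoed in the notebook output.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = ask(prompt).strip()
        if not value:
            raise ValueError(f"No value provided for {name}")
        os.environ[name] = value
    return value
```

In the notebook this could be called as `ensure_env_key("OPENAI_API_KEY", "OpenAI API Key:")`, which leaves an existing value untouched.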




Running async in the Colab instance.

In [None]:
import nest_asyncio

nest_asyncio.apply()

### LlamaParse Initialisation

Initialising the `LlamaParse` object.

Note the following key parameters:

- `result_type` - the options are currently `"text"` and `"markdown"`. Markdown is our choice as it retains structured information.
- `num_workers` - sets how many workers parse files in parallel. It is generally set to the number of files to parse (the maximum is `10`).
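The `num_workers` guidance can be sketched as a small helper: one worker per file, capped at the stated maximum of `10`. `choose_num_workers` is a hypothetical name used only for illustration, and the file names are placeholders.

```python
MAX_WORKERS = 10  # maximum stated in the parameter notes


def choose_num_workers(file_paths, max_workers=MAX_WORKERS):
    """One worker per file, at least 1, capped at the maximum."""
    return max(1, min(len(file_paths), max_workers))


print(choose_num_workers(["delegated_act.pdf", "annex.pdf"]))  # 2
```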

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
    language="en",
    num_workers=2,
)

### Uploading Files

Uploading the two DSA PDF files to parse:
1. The Delegated Act
2. The Annex

In [None]:
from google.colab import files

delegated_report = files.upload()

Saving C_2023_6807_1_EN_ACT_part1_v2_6zoLEmP6RODW1Ujvl3uOmUHWPI_99436.pdf to C_2023_6807_1_EN_ACT_part1_v2_6zoLEmP6RODW1Ujvl3uOmUHWPI_99436.pdf


In [None]:
Annex1_report = files.upload()

Saving C_2023_6807_1_EN_annexe_acte_autonome_part1_v5_BKaPiHWc5nVWA6yjgxUcnGNb7U_99435.pdf to C_2023_6807_1_EN_annexe_acte_autonome_part1_v5_BKaPiHWc5nVWA6yjgxUcnGNb7U_99435.pdf


### Parsing the Files

Note that the run time of this cell varies widely: some files take about 6 minutes, while others take about 4 seconds.

> NOTE: At time of writing, only `.pdf` files are accepted.
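Given the `.pdf`-only restriction, it may help to filter the file list before parsing. This is a sketch that checks the extension only (not the file contents); `split_by_pdf` and the example paths are hypothetical.

```python
from pathlib import Path


def split_by_pdf(paths):
    """Partition paths into (pdf_paths, other_paths) by file extension only."""
    pdfs, others = [], []
    for p in paths:
        # Compare case-insensitively so ".PDF" is accepted as well.
        (pdfs if Path(p).suffix.lower() == ".pdf" else others).append(str(p))
    return pdfs, others


pdfs, skipped = split_by_pdf(["./act.pdf", "./annex.PDF", "./notes.docx"])
print(pdfs)     # ['./act.pdf', './annex.PDF']
print(skipped)  # ['./notes.docx']
```

Only the first list would then be passed to `parser.load_data`.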

In [None]:
documents = parser.load_data(
    [
        "./C_2023_6807_1_EN_ACT_part1_v2_6zoLEmP6RODW1Ujvl3uOmUHWPI_99436.pdf",
        "./C_2023_6807_1_EN_annexe_acte_autonome_part1_v5_BKaPiHWc5nVWA6yjgxUcnGNb7U_99435.pdf",
    ]
)

Parsing files: 100%|██████████| 2/2 [00:26<00:00, 13.31s/it]


Looking at the Delegated Act document

In [None]:
print(documents[0].text[:1000])

# EUROPEAN COMMISSION

Brussels, 20.10.2023

C(2023) 6807 final

COMMISSION DELEGATED REGULATION (EU) …/... of 20.10.2023 supplementing Regulation (EU) 2022/2065 of the European Parliament and of the Council, by laying down rules on the performance of audits for very large online platforms and very large online search engines
---
## EXPLANATORY MEMORANDUM

1. CONTEXT OF THE DELEGATED ACT

On 16 November 2022, Regulation (EU) 2022/2065 of the European Parliament and of the Council (the Digital Services Act) entered into force. That regulation provides a harmonised legal framework applicable to all online intermediary services provided in the Union and seeks to create a safer digital space, in which the fundamental rights of all users of digital services are protected.

Regulation (EU) 2022/2065 includes a special set of obligations for providers of very large online platforms and of very large online search engines, proportionate to their particular role and societal impact in the Union

We can see that some structure is being retained.

In [None]:
print(documents[1].text[:1000])

# EUROPEAN COMMISSION

Brussels, 20.10.2023 C(2023) 6807 final ANNEXES 1 to 2

## ANNEXES

to COMMISSION DELEGATED REGULATION (EU) …/... supplementing Regulation (EU) 2022/2065 of the European Parliament and of the Council, by laying down rules on the performance of audits for very large online platforms and very large online search engines

EN
---
|Content|Page Number|
|---|---|
|SECTION A: General Information| |
|1. Audited service:| |
|2. Audited provider:| |
|3. Address of the audited provider:| |
|4. Point of contact of the audited provider:| |
|5. Scope of the audit:| |
|a. Does the audit report include an assessment of compliance with all the obligations and commitments referred to in Article 37(1) of Regulation (EU) 2022/2065 applicable to the audited provider?| |
|Yes/No i. Compliance with Regulation (EU) 2022/2065| |
|Obligations set out in Chapter III of Regulation (EU) 2022/2065:| |
|Audited obligation|Period covered|
|Indicate the precise obligation audited|(DD/MM/YYYY) to

The same is true for the Annex document.
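One rough way to check how much structure survived parsing is to count Markdown headings and table rows in the parsed text. The sketch below is plain Python with a hypothetical helper name; treating a bare `---` line as a separator is an assumption based on the output shown above.

```python
def summarise_markdown_structure(text):
    """Count Markdown headings, table content rows and `---` separators."""
    counts = {"headings": 0, "table_rows": 0, "separators": 0}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            counts["headings"] += 1
        elif stripped.startswith("|") and stripped.endswith("|"):
            # Skip the |---|---| delimiter row so only content rows are counted.
            if set(stripped) <= set("|-: "):
                continue
            counts["table_rows"] += 1
        elif stripped == "---":
            counts["separators"] += 1
    return counts


# Sample lines taken from the Annex output above.
sample = """# EUROPEAN COMMISSION

## ANNEXES

EN
---
|Content|Page Number|
|---|---|
|SECTION A: General Information| |
|1. Audited service:| |
"""
print(summarise_markdown_structure(sample))
# {'headings': 2, 'table_rows': 3, 'separators': 1}
```

In the notebook, the same function could be applied to `documents[0].text` and `documents[1].text`.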

## LlamaIndex Recursive Query Engine

Now that we have some parsed objects, let's see how well we can leverage them using one of the [example query engines](https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb).

### Setting the Settings

Setting the default LLM to `gpt-3.5-turbo` and the default embedding model to `text-embedding-3-small`.

`Settings` is imported from `llama_index.core`, which is part of the LlamaIndex `v0.10` update.

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

Using the `MarkdownElementNodeParser` to interpret the Markdown output, so that structured elements such as tables in the parsed documents can be accessed.
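To illustrate the general idea of treating tables and running text as separate elements, here is a plain-Python sketch. It is not the library's implementation; `split_markdown_elements` is a hypothetical helper that only looks for lines starting with `|`.

```python
def split_markdown_elements(text):
    """Group consecutive lines into ("table", block) or ("text", block) elements."""
    elements = []
    current_kind, current_lines = None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are dropped; neighbouring text stays in one block
        kind = "table" if line.lstrip().startswith("|") else "text"
        if kind != current_kind and current_lines:
            # The element type changed, so close the block collected so far.
            elements.append((current_kind, "\n".join(current_lines)))
            current_lines = []
        current_kind = kind
        current_lines.append(line)
    if current_lines:
        elements.append((current_kind, "\n".join(current_lines)))
    return elements


sample = "## ANNEXES\n\nEN\n|Content|Page Number|\n|---|---|\n|1. Audited service:| |\n\nNotes"
for kind, block in split_markdown_elements(sample):
    print(kind, len(block.splitlines()))
# text 2
# table 3
# text 1
```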

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"), num_workers=8)

Parsing the documents

Note that the output includes a logged validation error (`columns`: `field required`), but the parser can still get information from the document.

In [None]:
nodes = node_parser.get_nodes_from_documents(documents=[documents[0]])

5it [00:00, 8789.40it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 482, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 92, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py", line 307, in async_wrapper
    result = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 391, in astructured_predict
    result = await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 220, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.

Extracting the `base_nodes` and `objects` to create the `VectorStoreIndex`.

In [None]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

Building the index

In [None]:
from llama_index.core import VectorStoreIndex

recursive_index = VectorStoreIndex(nodes=base_nodes+objects)

### Recursive Query Engine

Building the Recursive Query Engine with reranking.

Applying the following 2 steps:

1. Initialise the reranker using `FlagEmbeddingReranker`, powered by the `BAAI/bge-reranker-large` model
2. Set up the recursive query engine
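The two steps follow a retrieve-then-rerank pattern: a cheap first pass keeps a generous candidate set, and a second scorer reorders only those candidates. The sketch below shows the pattern with toy word-overlap scores standing in for embedding similarity and for the reranker model; all function names are hypothetical, and the defaults of 15 and 5 mirror the `similarity_top_k` and `top_n` values used in this notebook.

```python
def word_overlap(query, text):
    """Cheap first-stage score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def jaccard(query, text):
    """Stand-in second-stage score: overlap normalised by the combined vocabulary."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_then_rerank(query, chunks, similarity_top_k=15, top_n=5):
    # Stage 1: keep a generous candidate set using the cheap score.
    candidates = sorted(chunks, key=lambda c: word_overlap(query, c), reverse=True)
    candidates = candidates[:similarity_top_k]
    # Stage 2: re-score only the candidates and keep the best few.
    reranked = sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)
    return reranked[:top_n]


chunks = [
    "Article 4 selection of the auditing organisation",
    "Article 4",
    "Entry into force of this Regulation",
    "Annex 1 template for the audit report",
]
print(retrieve_then_rerank("article 4", chunks, similarity_top_k=3, top_n=1))
# ['Article 4']
```

The first stage cannot tell the two "Article 4" chunks apart (both match two query words), while the second stage prefers the tighter match.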

Installing some requirements

In [None]:
!pip install -qU llama-index-postprocessor-flag-embedding-reranker git+https://github.com/FlagOpen/FlagEmbedding.git



Initialising the reranker [`BAAI/bge-reranker-large`](https://huggingface.co/BAAI/bge-reranker-large).

Creating the recursive query engine.

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)


## Delegated Act Test

Testing with the Delegated Act document.

In [None]:
query = "What is in Section 2 Article 4?"
response = recursive_query_engine.query(query)

Retrieval entering id_6b0dcd4b-c505-4208-bd55-4292c9859016_176_table: TextNode
Retrieving from object TextNode with query What is in Section 2 Article 4?

In [None]:
print(response)

Section II Article 4 details the criteria for choosing the auditing organization. It mandates that prior to selecting an auditing organization for the audit, the audited provider must confirm that the organization satisfies the criteria specified in Article 37(3) of Regulation (EU) 2022/2065. If the auditing organization comprises multiple legal entities or intends to engage subcontractors, the audited provider must validate that each legal entity or subcontractor meets particular requirements individually and collectively fulfills another requirement outlined in Article 37(3) of the regulation.


This information was retrieved as expected.

In [None]:
query = "When is this Regulation entering into force ?"
response = recursive_query_engine.query(query)

Retrieval entering id_6b0dcd4b-c505-4208-bd55-4292c9859016_148_table: TextNode
Retrieving from object TextNode with query When is this Regulation entering into force ?

In [None]:
print(response)

This Regulation will enter into force on the twentieth day following its publication in the Official Journal of the European Union.


This was also an accurate response.

## Testing on the Annex Document

In [None]:
annex1_report_nodes = node_parser.get_nodes_from_documents(documents=[documents[1]])

5it [00:00, 6771.56it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 482, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 92, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py", line 307, in async_wrapper
    result = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 391, in astructured_predict
    result = await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 220, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.

In [None]:
annex1_base_nodes, annex1_objects = node_parser.get_nodes_and_objects(annex1_report_nodes)

In [None]:
annex1_recursive_index = VectorStoreIndex(nodes=annex1_base_nodes+annex1_objects)

In [None]:
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

annex1_recursive_query_engine = annex1_recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)

In [None]:
query = "List the audit conclusion options?"
response = annex1_recursive_query_engine.query(query)

Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_56_table: TextNode
Retrieving from object TextNode with query List the audit conclusion options?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_36_table: TextNode
Retrieving from object TextNode with query List the audit conclusion options?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_44_table: TextNode
Retrieving from object TextNode with query List the audit conclusion options?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_64_table: TextNode
Retrieving from object TextNode with query List the audit conclusion options?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_4_table: TextNode
Retrieving from object TextNode with query 

In [None]:
print(response)

The audit conclusion options are:
- Positive
- Positive with comments
- Negative


The query engine successfully retrieved this information as expected.

In [None]:
query = "What information needs to be filled in section F.1?"
response = annex1_recursive_query_engine.query(query)

Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_64_table: TextNode
Retrieving from object TextNode with query What information needs to be filled in section F.1?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_4_table: TextNode
Retrieving from object TextNode with query What information needs to be filled in section F.1?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_44_table: TextNode
Retrieving from object TextNode with query What information needs to be filled in section F.1?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_56_table: TextNode
Retrieving from object TextNode with query What information needs to be filled in section F.1?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_36_table: TextNode


In [None]:
print(response)

Name of third party consulted, representative and contact information of consulted third party, date(s) of consultation, input provided by third-party.


The query engine successfully retrieved this information from the table as expected.

In [None]:
query = "In annex 2, what reasons must be filled in Reasons for not implementing the recommendation ?"
response = annex1_recursive_query_engine.query(query)

Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_56_table: TextNode
Retrieving from object TextNode with query In annex 2, what reasons must be filled in Reasons for not implementing the recommendation ?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_44_table: TextNode
Retrieving from object TextNode with query In annex 2, what reasons must be filled in Reasons for not implementing the recommendation ?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_4_table: TextNode
Retrieving from object TextNode with query In annex 2, what reasons must be filled in Reasons for not implementing the recommendation ?
Retrieval entering id_287ff845-616a-4777-a941-b6604acf1c9c_64_table: TextNode
Retrieving from object TextNode with query In annex 2, what reasons must be f

In [None]:
print(response)

Justification for not implementing the recommendation and alternative measure(s) taken to achieve compliance must be filled in the Reasons for not implementing the recommendation section in Annex 2.


The query engine successfully retrieved this information from the table as expected.
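The manual "as expected" checks above could be made repeatable with a simple keyword test. This is a rough sketch, not an evaluation framework; `missing_keywords` is a hypothetical helper, and the response text is copied from the audit conclusion query above.

```python
def missing_keywords(response_text, expected_keywords):
    """Return the expected keywords that do not appear in the response (case-insensitive)."""
    haystack = response_text.lower()
    return [kw for kw in expected_keywords if kw.lower() not in haystack]


# Response text copied from the audit conclusion query above.
audit_response = (
    "The audit conclusion options are:\n"
    "- Positive\n"
    "- Positive with comments\n"
    "- Negative"
)
expected = ["Positive", "Positive with comments", "Negative"]
print(missing_keywords(audit_response, expected))  # []
```

In the notebook, `str(response)` could be passed in place of the copied text; an empty list means every expected phrase was found.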