<a href="https://colab.research.google.com/github/sdossou/CSRD_Legislation_RAG/blob/main/CSRD_LlamaParse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LlamaParse

Advanced retrieval-augmented generation (RAG) for the EU ESG CSRD legislation: using LlamaParse from LlamaIndex to parse the CSRD Delegated Act and Annex 1 PDF documents, with OpenAI `gpt-3.5-turbo` as the LLM.

## Load and Parse PDFs

Installing dependencies.

In [None]:
!pip install -qU llama-index llama-parse

Providing a LlamaCloud API key

In [None]:
import os
import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass("LLamaParse API Key:")

LLamaParse API Key:··········


Providing an OpenAI API key

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········
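When the notebook is re-run in the same session, prompting again for keys that are already set is unnecessary. A minimal sketch of a helper that prompts only when the variable is missing; `ensure_api_key` and its `prompt_fn` parameter are hypothetical names introduced here and are not part of LlamaIndex or LlamaParse.

```python
import getpass
import os
from typing import Callable


def ensure_api_key(
    name: str,
    prompt: str,
    prompt_fn: Callable[[str], str] = getpass.getpass,
) -> str:
    """Return the key stored in os.environ[name], prompting only if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Nothing usable in the environment, so ask the user without echoing input.
        value = prompt_fn(prompt).strip()
        if not value:
            raise ValueError(f"No value provided for {name}")
        os.environ[name] = value
    return value
```

It could be called as `ensure_api_key("LLAMA_CLOUD_API_KEY", "LlamaParse API Key:")` and again with `"OPENAI_API_KEY"`.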


Running async in the Colab instance.

In [None]:
import nest_asyncio

nest_asyncio.apply()
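A notebook kernel already runs an asyncio event loop, and the standard library refuses to start a second one from inside it; `nest_asyncio.apply()` patches asyncio to allow that nesting. The sketch below reproduces the underlying error using only the standard library (it deliberately does not apply the patch). The function names are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are inside a running event loop here, just like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the first
    except RuntimeError as exc:
        coro.close()  # the coroutine was never awaited, so close it explicitly
        return f"RuntimeError: {exc}"
    return "no error"


def demo_nested_run() -> str:
    """Run outer() in a fresh loop and report what the nested asyncio.run call did."""
    return asyncio.run(outer())
```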

### LlamaParse Initialisation

Initialising the `LlamaParse` object.

Note the following key parameters:

- `result_type` - the options are currently `"text"` and `"markdown"`. Markdown is our choice as it retains structured information.
- `num_workers` - sets how many workers parse files in parallel. It is generally set to the number of files to parse (the maximum is `10`).
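The rule of thumb for `num_workers` (one worker per file, up to a cap) can be written as a small helper. This is only a sketch: `choose_num_workers` is a hypothetical name, and the default cap of 10 is taken from the note above, so it is worth checking against the installed `llama-parse` version.

```python
from typing import Sequence


def choose_num_workers(file_paths: Sequence[str], max_workers: int = 10) -> int:
    """Pick one worker per file, never fewer than 1 and never more than max_workers."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return max(1, min(len(file_paths), max_workers))
```

For the two CSRD files this yields `2`, which matches the value passed to `LlamaParse` in this notebook.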

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
    language="en",
    num_workers=2,
)

### Uploading Files

Uploading the two CSRD PDF files to parse:
1. The Delegated Act
2. The Annex 1


In [None]:
from google.colab import files

delegated_report = files.upload()

Saving Delegated regulation - C(2023)5303.pdf to Delegated regulation - C(2023)5303 (1).pdf


In [None]:
Annex1_report = files.upload()

Saving Annex - C(2023)5303.pdf to Annex - C(2023)5303 (1).pdf


### Parsing the Files

Note that the run time of this cell varies widely: some files take ~6 min to parse while others take ~4 s.

> NOTE: At time of writing, only `.pdf` files are accepted.

In [None]:
documents = parser.load_data(["./Delegated regulation - C(2023)5303 (1).pdf", "./Annex - C(2023)5303 (1).pdf"])

Parsing files: 100%|██████████| 2/2 [01:27<00:00, 43.92s/it]
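Colab renames a re-uploaded file (note the ` (1)` suffix in the upload output), so hard-coded paths can go stale, and only `.pdf` files are accepted. A pre-flight check before calling the parser can catch both problems early. The helper below is a sketch with a hypothetical name, `check_pdf_paths`; it uses only the standard library.

```python
from pathlib import Path
from typing import Iterable, List


def check_pdf_paths(paths: Iterable[str]) -> List[str]:
    """Return the paths unchanged in order, after verifying each is an existing .pdf file."""
    checked: List[str] = []
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() != ".pdf":
            raise ValueError(f"Not a .pdf file: {raw}")
        if not path.is_file():
            raise FileNotFoundError(f"File not found: {raw}")
        checked.append(str(path))
    return checked
```

The returned list could then be passed to `parser.load_data(...)` in place of the literal list.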


Looking at the Delegated Act document

In [None]:
print(documents[0].text[:1000])

# EUROPEAN COMMISSION

Brussels, 31.7.2023

C(2023) 5303 final

## COMMISSION DELEGATED REGULATION (EU) …/...

of 31.7.2023

supplementing Directive 2013/34/EU of the European Parliament and of the Council as regards sustainability reporting standards

(Text with EEA relevance)

EN
---
# EXPLANATORY MEMORANDUM

The Accounting Directive (2013/34/EU) as amended by the Corporate Sustainability Reporting Directive (CSRD - 2022/2464) requires large companies and listed small and medium-sized companies (SMEs), as well as parent companies of large groups, to include in a dedicated section of their management report the information necessary to understand the company’s impacts on sustainability matters, and the information necessary to understand how sustainability matters affect the company’s development, performance, and position. This information must be reported in accordance with European Sustainability Reporting Standards (ESRS), to be adopted by the Commission by means of delegated acts

We can see that some structure is being retained.

In [None]:
print(documents[1].text[:1000])

# EUROPEAN COMMISSION

# EUROPEAN COMMISSION

Brussels, 31.7.2023

C(2023) 5303 final

## ANNEX 1

ANNEX to the Commission Delegated Regulation (EU) .../... supplementing Directive 2013/34/EU of the European
Parliament and of the Council as regards sustainability reporting standards

EN
---
## ANNEX I

EUROPEAN SUSTAINABILITY REPORTING STANDARDS (ESRS)

|ESRS 1|General requirements|
|---|---|
|ESRS 2|General disclosures|
|ESRS E1|Climate change|
|ESRS E2|Pollution|
|ESRS E3|Water and marine resources|
|ESRS E4|Biodiversity and ecosystems|
|ESRS E5|Resource use and circular economy|
|ESRS S1|Own workforce|
|ESRS S2|Workers in the value chain|
|ESRS S3|Affected communities|
|ESRS S4|Consumers and end-users|
|ESRS G1|Business conduct|

### ESRS 1 - GENERAL REQUIREMENTS

Table of contents
Objective
1. Categories of ESRS Standards, reporting areas and drafting conventions
1.1 Categories of ESRS Standards
1.2 Reporting areas and minimum content disclosure requirements on policies, actions, t

The same is true for the Annex 1 document.
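To go beyond eyeballing the first 1000 characters, the retained structure can be counted. The sketch below tallies Markdown headings, tables, table rows and `---` rules (which appear between pages in the output above) in a string. `summarise_markdown_structure` is a hypothetical helper and its regular expressions are simplifying assumptions, not a full Markdown parser.

```python
import re
from typing import Dict

HEADING_RE = re.compile(r"^#{1,6}\s+\S")
TABLE_SEPARATOR_RE = re.compile(r"^\|(\s*:?-+:?\s*\|)+\s*$")


def summarise_markdown_structure(text: str) -> Dict[str, int]:
    """Count headings, tables (via separator rows), table rows and '---' rules."""
    counts = {"headings": 0, "tables": 0, "table_rows": 0, "rules": 0}
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            counts["headings"] += 1
        elif TABLE_SEPARATOR_RE.match(stripped):
            counts["tables"] += 1  # each table has exactly one |---|---| row
        elif stripped.startswith("|") and stripped.endswith("|"):
            counts["table_rows"] += 1
        elif stripped == "---":
            counts["rules"] += 1
    return counts
```

Calling it on `documents[0].text` and `documents[1].text` would give a rough comparison of how much structure each parsed document carries.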

## LlamaIndex Recursive Query Engine


### Setting the Settings

Setting the default LLM to `gpt-3.5-turbo` and the default embedding model to `text-embedding-3-small`.

`Settings` is imported from `llama_index.core`, which is part of the LlamaIndex `v0.10` update.

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

Using the `MarkdownElementNodeParser` to split the parsed Markdown into text and table elements, so that the structured information in the parsed documents remains accessible.

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"), num_workers=8)

Parsing the Delegated Act document.

Note that the parser logs some errors (see the traceback in the cell output), but it still extracts nodes from the document.

In [None]:
nodes = node_parser.get_nodes_from_documents(documents=[documents[0]])

3it [00:00, 14429.94it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 482, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 92, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py", line 307, in async_wrapper
    result = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 391, in astructured_predict
    result = await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 220, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base

Extracting the `base_nodes` and `objects` to create the `VectorStoreIndex`.

In [None]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

Building the index

In [None]:
from llama_index.core import VectorStoreIndex

recursive_index = VectorStoreIndex(nodes=base_nodes+objects)
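`get_nodes_and_objects` separates plain text nodes from index objects that stand in for tables; at query time, a retrieved object is followed to the table it refers to. The following is a toy, standard-library sketch of that idea and not LlamaIndex's implementation: every class and function name is hypothetical, and scoring is plain word overlap instead of embeddings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    text: str
    target: Optional["ToyNode"] = None  # set on summary nodes that point to a table


def overlap(query: str, text: str) -> int:
    """Count distinct query words that also occur in the text (stand-in for similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_retrieve(query: str, nodes: List[ToyNode], top_k: int = 2) -> List[str]:
    """Rank nodes by overlap, then swap each summary node for the table it references."""
    ranked = sorted(nodes, key=lambda n: overlap(query, n.text), reverse=True)[:top_k]
    return [(n.target or n).text for n in ranked]


table = ToyNode("|ESRS 1|General requirements|\n|ESRS 2|General disclosures|")
toy_nodes = [
    ToyNode("The Commission adopts the standards by means of delegated acts."),
    ToyNode("Summary: table listing the cross-cutting standards", target=table),
]
# The summary node wins the ranking, and its table text is what gets returned.
hits = recursive_retrieve("which standards are cross-cutting", toy_nodes, top_k=1)
```

The design point is that the short summary is what gets matched against the query, while the full table is what reaches the LLM.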

### Recursive Query Engine

Building the Recursive Query Engine with reranking.

Applying the following 2 steps:

1. Initialise the reranker using `FlagEmbeddingReranker`, powered by `BAAI/bge-reranker-large`
2. Set up the recursive query engine
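The two steps above form a retrieve-then-rerank pipeline: a cheap first pass selects a wide candidate set (`similarity_top_k=15` in this notebook) and a more precise scorer keeps only the best few (`top_n=5`). A minimal standard-library sketch of the pattern follows; the two scoring functions are hypothetical stand-ins for the embedding search and for `BAAI/bge-reranker-large`.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    passages: List[str],
    coarse_score: Scorer,
    fine_score: Scorer,
    similarity_top_k: int = 15,
    top_n: int = 5,
) -> List[str]:
    """Keep the similarity_top_k best passages by coarse_score, then the top_n of those by fine_score."""
    candidates = sorted(passages, key=lambda p: coarse_score(query, p), reverse=True)
    candidates = candidates[:similarity_top_k]
    reranked = sorted(candidates, key=lambda p: fine_score(query, p), reverse=True)
    return reranked[:top_n]


def word_overlap(query: str, passage: str) -> float:
    """Coarse stand-in: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def phrase_match(query: str, passage: str) -> float:
    """Fine stand-in: 1.0 if the whole query occurs as a phrase, else 0.0."""
    return 1.0 if query.lower() in passage.lower() else 0.0
```

Retrieving more candidates than are finally kept gives the reranker room to promote a passage that the first pass ranked low.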

Installing some requirements

In [None]:
!pip install -qU llama-index-postprocessor-flag-embedding-reranker git+https://github.com/FlagOpen/FlagEmbedding.git


Initialising the reranker [`BAAI/bge-reranker-large`](https://huggingface.co/BAAI/bge-reranker-large).

Creating the recursive query engine.

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)


## Delegated Act Test

Testing with the Delegated Act document.

In [None]:
query = "What are the Cross-cutting standards?"
response = recursive_query_engine.query(query)

Retrieval entering id_acf4946b-c8f8-4454-bba8-4b06b3555253_16_table: TextNode
Retrieving from object TextNode with query What are the Cross-cutting standards?
Retrieval entering id_acf4946b-c8f8-4454-bba8-4b06b3555253_8_table: TextNode
Retrieving from object TextNode with query What are the Cross-cutting standards?
Retrieval entering id_acf4946b-c8f8-4454-bba8-4b06b3555253_18_table: TextNode
Retrieving from object TextNode with query What are the Cross-cutting standards?

In [None]:
print(response)

The Cross-cutting standards are ESRS 1 General requirements and ESRS 2 General disclosures.


This information was retrieved as expected.


In [None]:
query = "In which articles of the directive are the sustainability reporting requirements for large undertakings and listed SMEs set out?"
response = recursive_query_engine.query(query)

Retrieval entering id_acf4946b-c8f8-4454-bba8-4b06b3555253_8_table: TextNode
Retrieving from object TextNode with query In which articles of the directive are the sustainability reporting requirements for large undertakings and listed SMEs set out?
Retrieval entering id_acf4946b-c8f8-4454-bba8-4b06b3555253_16_table: TextNode
Retrieving from object TextNode with query In which articles of the directive are the sustainability reporting requirements for large undertakings and listed SMEs set out?
Retrieval entering id_acf4946b-c8f8-4454-bba8-4b06b3555253_18_table: TextNode
Retrieving from object TextNode with query In which articles of the directive are the sustainability reporting requirements for large undertakings and listed SMEs set out?

In [None]:
print(response)

The sustainability reporting requirements for large undertakings and listed SMEs are set out in Articles 19a and 29a of the Accounting Directive.


This was also an accurate response.

## Testing on the Annex 1 Document


In [None]:
annex1_report_nodes = node_parser.get_nodes_from_documents(documents=[documents[1]])

108it [00:00, 16609.89it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 482, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 92, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py", line 307, in async_wrapper
    result = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 391, in astructured_predict
    result = await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 220, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/ba

In [None]:
annex1_base_nodes, annex1_objects = node_parser.get_nodes_and_objects(annex1_report_nodes)

In [None]:
annex1_recursive_index = VectorStoreIndex(nodes=annex1_base_nodes+annex1_objects)

In [None]:
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

annex1_recursive_query_engine = annex1_recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)

In [None]:
query = "What is the double materiality principle?"
response = annex1_recursive_query_engine.query(query)

Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_184_table: TextNode
Retrieving from object TextNode with query What is the double materiality principle?
Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_1030_table: TextNode
Retrieving from object TextNode with query What is the double materiality principle?

In [None]:
print(response)

The double materiality principle involves assessing impact materiality and financial materiality. Impact materiality considers the effects on people or the environment, while financial materiality evaluates information crucial for financial decision-making.


The query engine successfully retrieved this information as expected.

In [None]:
query = "What are the subtopics of ESRS E2 pollution?"
response = annex1_recursive_query_engine.query(query)

Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_152_table: TextNode
Retrieving from object TextNode with query What are the subtopics of ESRS E2 pollution?
Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_772_table: TextNode
Retrieving from object TextNode with query What are the subtopics of ESRS E2 pollution?
Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_1066_table: TextNode
Retrieving from object TextNode with query What are the subtopics of ESRS E2 pollution?
Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_8_table: TextNode
Retrieving from object TextNode with query What are the subtopics of ESRS E2 pollution?

In [None]:
print(response)

The subtopics of ESRS E2 pollution are:
- Pollution of air
- Pollution of water
- Pollution of soil
- Pollution of living organisms and food resources
- Substances of concern
- Substances of very high concern
- Microplastics


The query engine successfully retrieved this information from the table as expected.



In [None]:
query = "What are the Sub-sub-topics of ESRS S4 Consumers and end-users?"
response = annex1_recursive_query_engine.query(query)

Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_174_table: TextNode
Retrieving from object TextNode with query What are the Sub-sub-topics of ESRS S4 Consumers and end-users?
Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_1774_table: TextNode
Retrieving from object TextNode with query What are the Sub-sub-topics of ESRS S4 Consumers and end-users?
Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_152_table: TextNode
Retrieving from object TextNode with query What are the Sub-sub-topics of ESRS S4 Consumers and end-users?
Retrieval entering id_bcfe7dfb-72f8-4b80-9ed0-1e0a1de9e274_172_table: TextNode
Retrieving from object TextNode with query What are the Sub-sub-topics of ESRS S4 Consumers and end-users?

In [None]:
print(response)

The sub-sub-topics of ESRS S4 Consumers and end-users are as follows:
1. Policies related to consumers and end-users
2. Processes for engaging with consumers and end-users about impacts
3. Processes to remediate negative impacts and channels for consumers and end-users to raise concerns
4. Taking action on material impacts on consumers and end-users, and approaches to managing material risks and pursuing material opportunities related to consumers and end-users, and effectiveness of those actions
5. Targets related to managing material negative impacts, advancing positive impacts, and managing material risks and opportunities


The query engine did not retrieve this information from the table in the document; instead, it appears to have produced its own answer, which is not in the document.
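A failure like this one can be flagged automatically by checking whether each listed item actually occurs in the retrieved source text. The sketch below is a simple verbatim check with hypothetical helper names; it works on plain strings so that it runs on its own. In the notebook, the source texts could presumably be collected from `response.source_nodes`, which is an assumption about how one would wire it up rather than something shown above.

```python
import re
from typing import Iterable, List


def normalise(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def unsupported_items(items: Iterable[str], source_texts: Iterable[str]) -> List[str]:
    """Return the answer items that do not occur as whole phrases in any source text."""
    # Pad with spaces so that matches respect word boundaries.
    haystack = " " + " ".join(normalise(t) for t in source_texts) + " "
    return [item for item in items if f" {normalise(item)} " not in haystack]
```

A verbatim check is strict: a correct but paraphrased item would also be flagged, so the result is a prompt for manual review and not a verdict.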