<a href="https://colab.research.google.com/github/sdossou/CSRD_Legislation_RAG/blob/main/CSRD_Legislation_RAG_Part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## CSRD Legislation RAG with LLaMA 2

This notebook uses Retrieval Augmented Generation (RAG) to query the EU Corporate Sustainability Reporting Directive (CSRD) and its delegated act, using LLaMA 2 and LangChain.

This is the first notebook in a series which will seek to improve the performance of this model through various reinforcement learning techniques.


Install all relevant dependencies


In [None]:
!pip install -U -q "langchain" "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1"


### Data Preparation

Downloading the CSRD and the delegated act from the EUR-Lex website.

In [None]:
!wget https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32022L2464 -O "direct.htm"

--2024-03-26 15:15:47--  https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32022L2464
Resolving eur-lex.europa.eu (eur-lex.europa.eu)... 3.163.165.88, 3.163.165.37, 3.163.165.94, ...
Connecting to eur-lex.europa.eu (eur-lex.europa.eu)|3.163.165.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘direct.htm’

direct.htm              [   <=>              ] 612.67K  1.01MB/s    in 0.6s    

2024-03-26 15:15:48 (1.01 MB/s) - ‘direct.htm’ saved [627371]



In [None]:
!wget https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202302772 -O "deleg.htm"

--2024-03-26 15:15:58--  https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202302772
Resolving eur-lex.europa.eu (eur-lex.europa.eu)... 3.163.165.88, 3.163.165.37, 3.163.165.94, ...
Connecting to eur-lex.europa.eu (eur-lex.europa.eu)|3.163.165.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘deleg.htm’

deleg.htm               [     <=>            ]   5.14M  5.44MB/s    in 0.9s    

2024-03-26 15:16:01 (5.44 MB/s) - ‘deleg.htm’ saved [5392043]



#### Data Parsing


Parsing the downloaded HTML files (saved as `.htm`) with `BSHTMLLoader`.

In [None]:
!pip install beautifulsoup4 -q

In [None]:
from langchain_community.document_loaders import BSHTMLLoader

direct_bshtml_loader = BSHTMLLoader("direct.htm")

direct_data = direct_bshtml_loader.load()



In [None]:
len(direct_data)

1

In [None]:
from langchain_community.document_loaders import BSHTMLLoader

deleg_bshtml_loader = BSHTMLLoader("deleg.htm")

deleg_data = deleg_bshtml_loader.load()

In [None]:
len(deleg_data)

1

Splitting the text into chunks using the `RecursiveCharacterTextSplitter`.


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000, # the character length of the chunk
    chunk_overlap = 100, # the character length of the overlap between chunks
    length_function = len, # the length function - in this case, character length (aka the python len() fn.)
)

In [None]:
direct_documents = text_splitter.transform_documents(direct_data)

In [None]:
len(direct_documents)

410

In [None]:
deleg_documents = text_splitter.transform_documents(deleg_data)

In [None]:
len(deleg_documents)

1158

In [None]:
combined_documents = direct_documents + deleg_documents

The two documents are now split into chunks of at most 1,000 characters: 410 from the directive and 1,158 from the delegated act, 1,568 in total.
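To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the `RecursiveCharacterTextSplitter` algorithm (which also tries to break on separators such as paragraphs and sentences). The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk (if short enough).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy input: 2,500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

With the settings used in this notebook, each chunk repeats the last 100 characters of the previous one.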

### Index

The chunks are embedded and indexed so that they can be queried, retrieved, and passed to the LLM in the application stack.

#### Installing Dependencies and FAISS

Installing the relevant dependencies, including FAISS (Facebook AI Similarity Search).

In [None]:
!pip install -q -U faiss-cpu tiktoken sentence-transformers




Setting up the embeddings using HuggingFaceEmbeddings and the VectorStore using FAISS.

In [None]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.from_documents(combined_documents, embedder)



Testing that the VectorStore is working by retrieving information from the two embedded documents.

In [None]:
query = "What are the existing information gaps in sustainability?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

(15)


The Commission report on the review clauses and its accompanying fitness check also identified a significant increase in requests to undertakings for information about sustainability matters aimed at addressing the existing information gap between users’ information needs and the available corporate sustainability information. In addition, ongoing expectations on undertakings to use a variety of different frameworks and standards are likely to continue and may even intensify as the value placed on sustainability information continues to grow. In the absence of policy action to build consensus on the information that undertakings should report, there will be significant increases in terms of cost and burden for reporting undertakings and for users of such information.












(16)
7

Preparation and presentation of sustainability information











7.1

Presenting comparative information











7.2

Sources of estimation and outcome uncertainty











7.3

Updatin
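The similarity search above can be pictured as a nearest-neighbour lookup over embedding vectors. The sketch below is pure Python with made-up three-dimensional vectors. It assumes a brute-force Euclidean (L2) comparison, and the names `l2_distance`, `top_k` and `toy_index` are hypothetical. The real index holds 384-dimensional vectors from `all-MiniLM-L6-v2` and uses FAISS's optimised routines.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Toy "index": (chunk label, embedding) pairs with invented values.
toy_index = [
    ("recital on information gaps", [0.9, 0.1, 0.0]),
    ("article on delegated acts", [0.1, 0.9, 0.0]),
    ("annex on comparative information", [0.5, 0.5, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]

for distance, text in top_k(toy_query, toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```

A smaller distance means the chunk's embedding is closer to the query's embedding, which is why the first result above is the recital that mentions the information gap.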

Checking how much time the `CacheBackedEmbeddings` pattern saves:

In [None]:
%%timeit -n 1 -r 1
query = "What date will the commission adopt delegated acts in accordance with article 49?"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

12.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [None]:
%%timeit
query = "What date will the commission adopt delegated acts in accordance with article 49"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

6.87 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


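In this run, the repeated query averages 6.87 ms per loop, roughly half the 12.9 ms of the single first run. Note that the two query strings are not identical (the second omits the trailing question mark), and `CacheBackedEmbeddings` may not cache query embeddings unless configured to do so, so part of the difference may come from warm-up rather than from a cache hit.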
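The caching idea itself can be sketched in a few lines: store each embedding under a key derived from the exact input text, and only call the underlying model on a miss. This is an illustration of the pattern, not LangChain's implementation. `ToyCachedEmbedder` and `fake_embed` are hypothetical stand-ins.

```python
import hashlib


class ToyCachedEmbedder:
    """Wrap an embedding function with an in-memory cache keyed by a hash of the text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.misses = 0  # number of times the underlying model had to run

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def fake_embed(text):
    # Stand-in for a real embedding model: two cheap numeric features.
    return [float(len(text)), float(text.count(" "))]


toy = ToyCachedEmbedder(fake_embed)
toy.embed("What date will the commission adopt delegated acts?")
toy.embed("What date will the commission adopt delegated acts?")  # identical text: cache hit
toy.embed("What date will the commission adopt delegated acts")   # no "?": different key, miss
print(toy.misses)
```

Because the key depends on the exact string, two queries that differ by a single character are embedded separately.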

### Building the Retrieval Chain

The following Retrieval Chain allows us to ask semantic questions on the data.





#### A Basic RetrievalQA Chain

Setting `return_source_documents=True` makes the chain return the retrieved chunks alongside the answer, so the end user can verify which passages the answer was drawn from.


#### LLM

This notebook uses Meta's LLaMA 2.

Specifically, `meta-llama/Llama-2-13b-chat-hf`.

Loaded with 4-bit quantization (as configured below), this 13B-parameter model runs in less than 15 GB of GPU RAM.

More information on this model can be found [here](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
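A rough back-of-the-envelope check of the memory figure, counting weights only. It ignores activations, the KV cache, quantization constants and any layers kept in higher precision, so real usage will be higher. The parameter count is rounded to 13 billion, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 13e9  # rounded parameter count for a "13B" model

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(n_params, bits):.1f} GB")
```

At 16 bits per weight the model would need about 26 GB for weights alone, while at 4 bits it needs about 6.5 GB, which leaves headroom within a 15 GB GPU.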

In [None]:
!pip install huggingface-hub -q

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Using Tim Dettmers' `bitsandbytes`, together with `accelerate` and `transformers` from Hugging Face, to load the model in 4-bit precision so that it fits in limited GPU memory.

In [None]:
import torch
import transformers

model_id = "meta-llama/Llama-2-13b-chat-hf"

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

model.eval()


You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id
)


Packing it into a `pipeline` for compatibility with `langchain`.

In [None]:
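generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,  # the prompt is echoed back at the start of the generated text
    temperature=0.01,  # near-greedy decoding; only takes effect when sampling is enabled
    max_new_tokens=256  # upper bound on the number of generated tokens
)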
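To see why a temperature of `0.01` behaves almost like greedy decoding, the sketch below applies a temperature-scaled softmax to three made-up logits. The logits and the helper name `softmax_with_temperature` are hypothetical. In `transformers`, temperature only influences generation when sampling is switched on.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.5, 0.5]  # invented scores for three candidate tokens

for t in (1.0, 0.01):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 1.0 the probability mass is spread over all three tokens. At 0.01 practically all of it lands on the highest-scoring token, so sampling would pick the same token nearly every time.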

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

Setting up the chain.

In [None]:
retriever = vector_store.as_retriever()

In [None]:
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True
)
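Conceptually, a retrieval QA chain of this kind fetches the most relevant chunks, pastes them into a prompt template, and sends the result to the LLM. The sketch below mimics that flow with stand-in functions. It is not LangChain's code: `toy_retrieval_qa`, `fake_retrieve` and `fake_generate` are hypothetical, and the tail of the template (`Question:` / `Helpful Answer:`) is an assumption about the default prompt, whose opening sentence is visible in the outputs of this notebook.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"  # assumed template tail
)


def toy_retrieval_qa(question, retrieve, generate, k=4):
    """Retrieve up to k chunks, stuff them into one prompt, and call the generator."""
    chunks = retrieve(question)[:k]
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    return {"query": question, "result": generate(prompt), "source_documents": chunks}


def fake_retrieve(question):
    # Stand-in for vector_store.as_retriever(): returns placeholder chunks.
    return ["chunk A about information gaps", "chunk B about delegated acts"]


def fake_generate(prompt):
    # Stand-in for the LLM: echoes the prompt, as return_full_text=True does.
    return prompt + " [model output would follow here]"


out = toy_retrieval_qa("What are the existing information gaps?", fake_retrieve, fake_generate)
print(out["result"])
```

Because every retrieved chunk goes into a single prompt, the number and size of chunks must fit within the model's context window.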

Testing the chain.

In [None]:
qa_with_sources_chain({"query" : "What are the existing information gaps in sustainability??"})

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What are the existing information gaps in sustainability??',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n(15)\n\n\nThe Commission report on the review clauses and its accompanying fitness check also identified a significant increase in requests to undertakings for information about sustainability matters aimed at addressing the existing information gap between users’ information needs and the available corporate sustainability information. In addition, ongoing expectations on undertakings to use a variety of different frameworks and standards are likely to continue and may even intensify as the value placed on sustainability information continues to grow. In the absence of policy action to build consensus on the information that undertakings should report, there will be significant increases in terms of cost and burden for reporting undertakin

In [None]:
qa_with_sources_chain({"query" : "What date will the commission adopt delegated acts in accordance with article 49?"})



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What date will the commission adopt delegated acts in accordance with article 49?',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n(17)\n\n\nArticle\xa049 is amended as follows:\n\n\n\n\n\n\n(a)\n\n\nparagraphs 2 and\xa03 are replaced by the following:\n\n‘2.\xa0\xa0\xa0The power to adopt delegated acts referred to in Article\xa01(2), Article\xa03(13), Articles 29b, 29c and\xa040b, and Article\xa046(2) shall be conferred on the Commission for a period of 5 years from 5\xa0January 2023. The Commission shall draw up a report in respect of the delegation of power not later than nine months before the end of the 5-year period. The delegation of power shall be tacitly extended for periods of an identical duration, unless the European Parliament or the Council opposes such extension not later than three months before the end of each period.\n\n(29)\n\n

In [None]:
qa_with_sources_chain({"query" : "which domains of ESG do the directive and the delegated acts apply to?"})



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'which domains of ESG do the directive and the delegated acts apply to?',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n3.\xa0\xa0\xa0The delegation of power referred to in Article\xa01(2), Article\xa03(13), Articles 29b, 29c and\xa040b, and Article\xa046(2) may be revoked at any time by the European Parliament or by the Council. A decision to revoke shall put an end to the delegation of the power specified in that decision. It shall take effect the day following the publication of that decision in the Official Journal of the European Union or at a later date specified therein. It shall not affect the validity of any delegated acts already in force.’;\n\n\n\n\n\n\n\n\n\n\n\n(b)\n\n\nthe following paragraph is inserted:\n\n‘3b.\xa0\xa0\xa0When adopting delegated acts pursuant to Articles 29b and\xa029c, the Commission shall take into consideration
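In the outputs above, `result` begins with the prompt itself because the pipeline was created with `return_full_text=True`. A small post-processing helper can separate the answer from the echoed prompt and list the source files. This is a sketch under assumptions: the `Helpful Answer:` marker is assumed to end the default prompt, `ToyDocument` is a stand-in for LangChain's document class, and the `source` metadata key is assumed to hold the file path set by the loader.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Minimal stand-in with the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def extract_answer(result: dict, marker: str = "Helpful Answer:") -> str:
    """Return the text after the last marker, or the whole text if the marker is absent."""
    text = result["result"]
    _, found, tail = text.rpartition(marker)
    return tail.strip() if found else text.strip()


def list_sources(result: dict) -> list[str]:
    """Collect the `source` metadata of each returned document."""
    return [doc.metadata.get("source", "unknown") for doc in result.get("source_documents", [])]


# Toy result shaped like the chain output; the content is invented.
toy_result = {
    "query": "toy question",
    "result": "Use the following pieces of context ...\n\nQuestion: toy question\nHelpful Answer: toy answer",
    "source_documents": [ToyDocument("chunk text", {"source": "direct.htm"})],
}

print(extract_answer(toy_result))
print(list_sources(toy_result))
```

An alternative is to set `return_full_text=False` in the pipeline so that the prompt is not echoed in the first place.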

This notebook is adapted from a notebook developed by AI Makerspace, which originally used Barbie and Oppenheimer movie reviews.