<a href="https://colab.research.google.com/github/sdossou/Ragas/blob/main/CSRD_Direct_RAG_Evaluation_LangChain_%26_Ragas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSRD Directive RAG Evaluation

This notebook covers the following steps:

- Creating a RAG pipeline with LangChain v0.1
- Evaluating the pipeline with the Ragas library
- Making an adjustment to the RAG pipeline
- Evaluating the adjusted pipeline against the baseline


Installing dependencies

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai ragas tiktoken cohere faiss_cpu


In [2]:
import langchain
print(f"LangChain Version: {langchain.__version__}")

LangChain Version: 0.1.14


Importing an OpenAI API key

In [3]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········


## Building the RAG pipeline

- Creating an index
- Using an LLM to generate responses based on the retrieved context


#### Loading Data

Loading the CSRD data

In [4]:
!wget "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32022L2464" -O "direct.htm"

--2024-04-03 08:54:22--  https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32022L2464
Resolving eur-lex.europa.eu (eur-lex.europa.eu)... 18.160.213.87, 18.160.213.122, 18.160.213.110, ...
Connecting to eur-lex.europa.eu (eur-lex.europa.eu)|18.160.213.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘direct.htm’

direct.htm              [ <=>                ] 612.61K  --.-KB/s    in 0.05s   

2024-04-03 08:54:23 (13.2 MB/s) - ‘direct.htm’ saved [627311]



#### Transforming Data

Loading the document with the `BSHTMLLoader` class from `langchain_community`.



In [5]:
!pip install beautifulsoup4 -q

In [6]:
from langchain_community.document_loaders import BSHTMLLoader

direct_bshtml_loader = BSHTMLLoader("direct.htm")

direct_data = direct_bshtml_loader.load()




Chunking the document using `RecursiveCharacterTextSplitter`.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

direct_documents = text_splitter.transform_documents(direct_data)
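To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, so its real chunks will differ. The helper `chunk_text` and the sample string are hypothetical and not part of the notebook.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # toy input, not the CSRD text
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.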


Checking how many chunks the split produced.

In [8]:
len(direct_documents)

586

#### Loading OpenAI Embeddings Model


Using OpenAI's `text-embedding-ada-002` model to convert each chunk into a vector that can be compared with the query vector.

In [9]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)
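As an illustration of what "comparing vectors" means, the sketch below ranks two toy vectors against a toy query vector using cosine similarity. The three-dimensional vectors and chunk labels are made up for the example (real `text-embedding-ada-002` vectors are much longer), and cosine similarity is only one common choice: the vector store used later may apply a different distance measure.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, for illustration only.
toy_chunk_vectors = {
    "chunk about auditors": [0.9, 0.1, 0.0],
    "chunk about reporting standards": [0.1, 0.9, 0.2],
}
toy_query_vector = [0.2, 0.8, 0.1]

# Rank chunk labels from most to least similar to the query.
ranked = sorted(
    toy_chunk_vectors,
    key=lambda name: cosine_similarity(toy_chunk_vectors[name], toy_query_vector),
    reverse=True,
)
print(ranked)
```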

Creating a FAISS VectorStore


In [10]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(direct_documents, embeddings)


vector_store.docstore._dict

{'87564c38-d6ed-4891-983e-b43c3383004e': Document(page_content='L_2022322EN.01001501.xml\n\n\n\n\n\n\n\n\n\n\n\n16.12.2022\xa0\xa0\xa0\n\n\nEN\n\n\nOfficial Journal of the European Union\n\n\nL 322/15\n\n\n\n\n\n\n\n\n            DIRECTIVE (EU) 2022/2464 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\n         \nof 14\xa0December 2022\n         \namending Regulation (EU) No\xa0537/2014, Directive 2004/109/EC, Directive 2006/43/EC and Directive 2013/34/EU, as regards corporate sustainability reporting\n(Text with EEA relevance)\n\n\nTHE EUROPEAN PARLIAMENT AND THE COUNCIL OF THE EUROPEAN UNION,\n\nHaving regard to the Treaty on the Functioning of the European Union, and in particular Articles 50 and\xa0114 thereof,\n\n\nHaving regard to the proposal from the European Commission,', metadata={'source': 'direct.htm', 'title': 'L_2022322EN.01001501.xml'}),
 '898c7681-2255-4425-a612-068bc332d696': Document(page_content='After transmission of the draft legislative act to the national parliamen

Creating a Retriever to complete the index


In [11]:
retriever = vector_store.as_retriever()

Testing the Retriever


In [12]:
retrieved_documents = retriever.invoke("What is Article 40b in Chapter 9a about?")

In [13]:
for doc in retrieved_documents:
  print(doc)

page_content='(14)\n\n\nthe following chapter is inserted:\n\n\n                                       ‘CHAPTER 9a\n\n\nREPORTING CONCERNING THIRD-COUNTRY UNDERTAKINGS\n\n\n\nArticle\xa040a\nSustainability reports concerning third-country undertakings' metadata={'source': 'direct.htm', 'title': 'L_2022322EN.01001501.xml'}
page_content='(15)\n\n\nthe title of Chapter 11 is replaced by the following:\n\n\n                                       ‘CHAPTER 11\n\n\nTRANSITIONAL AND FINAL PROVISIONS\n                                       ’;\n\n\n\n\n\n\n\n\n\n\n\n\n(16)\n\n\nThe following article is inserted:\n\n‘Article\xa048i\nTransitional provisions' metadata={'source': 'direct.htm', 'title': 'L_2022322EN.01001501.xml'}
page_content='5.\xa0\xa0\xa0The coordination measures prescribed by Articles 40a to 40d shall also apply to the laws, regulations and administrative provisions of the Member States relating to subsidiary undertakings and branches of undertakings which are not governed by th

Setting up the RAG Chain



#### Creating a Prompt Template

Two options are shown below:
- pulling a prompt from the prompt hub
- creating a custom prompt template

In [14]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [15]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


Creating a custom prompt template to be more specific

In [16]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
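The sketch below approximates what happens when the template is filled in: the `{context}` and `{question}` placeholders are replaced with concrete values. It uses plain `str.format` rather than `ChatPromptTemplate`, which additionally wraps the result in a chat message, and the shortened template, context and question are placeholders chosen for the example.

```python
# Shortened stand-in for the notebook's template (assumption: same placeholders).
toy_template = (
    "Answer the question based only on the following context.\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n"
)

# Hypothetical values; in the pipeline the context comes from the retriever.
filled = toy_template.format(
    context="CHAPTER 9a REPORTING CONCERNING THIRD-COUNTRY UNDERTAKINGS",
    question="What is Chapter 9a about?",
)
print(filled)
```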

#### Setting Up the QA Chain

Instantiating a basic RAG chain with the LangChain Expression Language (LCEL).

The retrieved context is passed through to the chain's output because Ragas needs it for evaluation.

In [17]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # Invoke the chain with: {"question": "<user question>"}
    # "question": the value of the input's "question" key
    # "context" : the documents returned by passing that question to the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # Pass both keys through unchanged; assign() re-sets "context" to the value
    # it already has, so the retrieved documents stay available downstream
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response": the prompt is formatted with "context" and "question", then sent to the LLM
    # "context" : the retrieved documents, returned alongside the response for evaluation
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
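The data flow of the chain can be sketched with plain Python functions. This is only a mental model of the steps, not how LCEL is implemented; `toy_retriever`, `toy_llm` and `toy_chain` are hypothetical stand-ins that make no API calls.

```python
def toy_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns a fake list of chunks."""
    return [f"chunk relevant to: {question}"]


def toy_llm(prompt_text: str) -> str:
    """Stand-in for the chat model: returns a canned string."""
    return f"answer derived from {len(prompt_text)} prompt characters"


def toy_chain(inputs: dict) -> dict:
    question = inputs["question"]
    context = toy_retriever(question)  # step 1: retrieve with the question
    prompt_text = f"Context:\n{context}\n\nQuestion:\n{question}\n"  # step 2: fill the prompt
    # step 3: return the model output together with the retrieved context
    return {"response": toy_llm(prompt_text), "context": context}


out = toy_chain({"question": "What does CSRD stand for?"})
print(out)
```

Returning the context next to the response is the design choice that matters here: the evaluation later needs both for every question.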

Testing the QA chain

In [18]:
question = "what does CSRD stand for?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Carbon Disclosure Standards Board


In [19]:
question = "what are the key components of the CSRD?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(page_content='on Climate-related Financial Disclosures, the Carbon Disclosure Standards Board, and CDP, formerly known as the Carbon Disclosure Project.', metadata={'source': 'direct.htm', 'title': 'L_2022322EN.01001501.xml'}), Document(page_content='and key performance indicators relevant to the business.', metadata={'source': 'direct.htm', 'title': 'L_2022322EN.01001501.xml'}), Document(page_content='Sustainability reporting standards should also take account of internationally recognised principles and frameworks on responsible business conduct, corporate social responsibility, and sustainable development, including the SDGs, the UN Guiding Principles on Business and Human Rights, the OECD Guidelines for Multinational Enterprises, the OECD Due Diligence Guidance for Responsible Business Conduct and related sectoral guidelines, the Global Compact, the International Labour Organization’s (ILO) Tripartite Declaration of Principles concerning Multinational Enterp

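Neither answer is satisfactory. CSRD stands for Corporate Sustainability Reporting Directive, but the first response names an organisation that appears in the retrieved context, and the second response is "I don't know." The pipeline has room for improvement.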

## Ragas Evaluation

The Ragas library evaluates the RAG pipeline by collecting question/answer/context triplets and calculating metrics relating to different aspects of the pipeline.

This notebook evaluates five Ragas metrics on a synthetic test set generated with Ragas and OpenAI GPT models.

#### Synthetic Test Set Generation

Using Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate synthetic question-context pairs and synthetic ground-truth answers.

This process uses `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic. If you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

Creating a new set of chunks with a different chunk size and overlap, so that the generated test set does not excessively favour the chunking used by the baseline pipeline.

In [20]:
# Reload the source document; the result is split into new chunks below
direct_data = direct_bshtml_loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
direct_documents = text_splitter.split_documents(direct_data)




In [21]:
len(direct_documents)

418

In [22]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(direct_documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})


Checking the output.

In [23]:
testset.test_data[0]

DataRow(question='What is the definition of a "company incorporated in a third country"?', contexts=['4.\n\n\n“third-country audit entity” means an entity, regardless of its legal form, which carries out audits of the annual or consolidated financial statements, or, where applicable, the assurance of sustainability reporting of a company incorporated in a third country, other than an entity which is registered as an audit firm in any Member State as a consequence of approval in accordance with Article\xa03;\n\n\n\n\n\n\n\n\n\n\n5.\n\n\n“third-country auditor” means a natural person who carries out audits of the annual or consolidated financial statements or, where applicable, the assurance of sustainability reporting of a company incorporated in a third country, other than a person who is registered as a statutory auditor in any Member State as a consequence of approval in accordance with Articles 3 and\xa044;\n\n\n\n\n\n\n\n\n\n\n6.'], ground_truth='nan', evolution_type='simple', meta

#### Generating Responses with RAG Pipeline

Evaluating the RAG pipeline using Ragas.

Extracting the questions and ground truths from the created testset.

Converting the test dataset into a Pandas DataFrame.

In [24]:
test_df = testset.to_pandas()

In [25]:
test_df

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the definition of a "company incorpora... | [4.\n\n\n“third-country audit entity” means an... | NaN | simple | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 1 | What rights do shareholders have regarding vot... | [For undertakings subject to the sustainabilit... | Shareholders which represent more than 5% of t... | simple | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 2 | What are the consequences of the gap between u... | [In the absence of policy action, the gap betw... | The consequences of the gap between users' inf... | simple | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 3 | What information should be covered in a sustai... | [A Member State shall require that a branch lo... | The sustainability report at the group level o... | simple | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 4 | What is the purpose of Directive 2004/109/EC a... | [Directive 2004/109/EC, as amended by this ame... | Directive 2004/109/EC, as amended by this amen... | simple | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 5 | What challenges do undertakings face in gather... | [4. Sustainability reporting standards shall... | Undertakings may face difficulties in gatherin... | reasoning | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 6 | Why is it important to consult experts during ... | [(80)\n\n\nIn order to specify the requirement... | It is important to consult experts during the ... | reasoning | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 7 | What information should be included in sustain... | [commercial position of the undertaking. Repor... | The sustainability reports for the undertaking... | multi_context | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 8 | What is the role of Article 8 in the reporting... | [Therefore, a progressive approach to enhancin... | The role of Article 8 in the reporting require... | multi_context | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |
| 9 | What obligations should be extended to statuto... | [To ensure the independence of the statutory a... | Statutory auditors and audit firms should have... | simple | [{'source': 'direct.htm', 'title': 'L_2022322E... | True |


In [26]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()
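The first generated row has no usable ground truth (`ground_truth='nan'` in the `DataRow` shown earlier), so metrics that compare against the ground truth may be unreliable for that question. One option, which this notebook does not apply, is to drop such rows before evaluation. The sketch below shows the idea on toy lists; `has_ground_truth` and the `toy_*` names are hypothetical.

```python
import math


def has_ground_truth(value) -> bool:
    """True when value is a non-empty answer rather than None, NaN, '' or 'nan'."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return str(value).strip().lower() not in {"", "nan"}


# Toy data standing in for test_questions and test_groundtruths.
toy_questions = ["q0", "q1", "q2"]
toy_groundtruths = [float("nan"), "Shareholders may table resolutions.", "nan"]

# Keep only the question/ground-truth pairs that can actually be scored.
kept = [
    (question, truth)
    for question, truth in zip(toy_questions, toy_groundtruths)
    if has_ground_truth(truth)
]
print(kept)
```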

Generating a response for each test question with the RAG pipeline and collecting the contexts retrieved for it.

In [27]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Wrapping the information in a Hugging Face dataset for use in the Ragas library.

In [28]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Checking the data.

In [29]:
response_dataset[0]

{'question': 'What is the definition of a "company incorporated in a third country"?',
 'answer': 'A company incorporated in a third country is defined as an entity, regardless of its legal form, which carries out audits of the annual or consolidated financial statements, or the assurance of sustainability reporting.',
 'contexts': ['4.\n\n\n“third-country audit entity” means an entity, regardless of its legal form, which carries out audits of the annual or consolidated financial statements, or, where applicable, the assurance of sustainability reporting of a company incorporated in a third country, other than an entity which is registered as an audit firm in any Member State as a consequence of approval in accordance with Article\xa03;\n\n\n\n\n\n\n\n\n\n\n5.',
  'Article where such undertaking and its subsidiary undertakings are included in the consolidated sustainability reporting of that parent undertaking that is established in a third country and where that consolidated sustainab

#### Evaluating with Ragas

Importing the desired metrics to use for evaluating the created dataset

The following metrics are used:

- Faithfulness
- Answer Relevancy
- Context Precision
- Context Recall
- Answer Correctness


In [30]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]
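As a rough mental model, and a simplification of how the Ragas documentation describes it, faithfulness can be read as the share of claims in the answer that are supported by the retrieved context; context recall is described similarly as the share of ground-truth statements that can be attributed to the context. The real metrics use LLM calls to extract and judge the claims. The sketch below only shows the final ratio, with hypothetical judgements as input.

```python
def claim_support_ratio(claim_supported: list[bool]) -> float:
    """Fraction of claims judged as supported (True) out of all claims."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical judgements: two of three claims are backed by the context.
score = claim_support_ratio([True, True, False])
print(round(score, 6))
```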

Evaluating using the `evaluate` function.

In [31]:
results = evaluate(response_dataset, metrics)


In [32]:
results

{'faithfulness': 0.9667, 'answer_relevancy': 0.9307, 'context_recall': 0.8583, 'context_precision': 0.8389, 'answer_correctness': 0.6832}

In [33]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the definition of a "company incorpora... | A company incorporated in a third country is d... | [4.\n\n\n“third-country audit entity” means an... | NaN | 1.0 | 0.974557 | 0.25 | 0.25 | 0.181092 |
| 1 | What rights do shareholders have regarding vot... | Shareholders who represent more than 5% of the... | [Member States shall ensure that shareholders ... | Shareholders which represent more than 5% of t... | 1.0 | 0.879029 | 1.0 | 1.0 | 0.73793 |
| 2 | What are the consequences of the gap between u... | Investors are unable to take sufficient accoun... | [In the absence of policy action, the gap betw... | The consequences of the gap between users' inf... | 1.0 | 0.877619 | 0.666667 | 1.0 | 0.735861 |
| 3 | What information should be covered in a sustai... | The sustainability report at the group level o... | [parent undertaking reporting at group level i... | The sustainability report at the group level o... | 1.0 | 0.994279 | 1.0 | 0.638889 | 0.999823 |
| 4 | What is the purpose of Directive 2004/109/EC a... | The purpose of Directive 2004/109/EC is to ens... | [(23)\n\n\nDirective 2004/109/EC of the Europe... | Directive 2004/109/EC, as amended by this amen... | 1.0 | 0.914347 | 0.666667 | 1.0 | 0.669464 |
| 5 | What challenges do undertakings face in gather... | Undertakings face challenges in gathering info... | [Sustainability reporting standards should als... | Undertakings may face difficulties in gatherin... | 1.0 | 0.968744 | 1.0 | 1.0 | 0.547723 |
| 6 | Why is it important to consult experts during ... | It is important to consult experts during the ... | [In order to specify the requirements set out ... | It is important to consult experts during the ... | 0.666667 | 0.973681 | 1.0 | 1.0 | 0.490675 |
| 7 | What information should be included in sustain... | The information that should be included in sus... | [comparable and based on uniform indicators wh... | The sustainability reports for the undertaking... | 1.0 | 0.954948 | 1.0 | 1.0 | 0.743566 |
| 8 | What is the role of Article 8 in the reporting... | Article 8 of Regulation (EU) 2020/852 outlines... | [reporting. The auditor should also assess whe... | The role of Article 8 in the reporting require... | 1.0 | 0.833371 | 1.0 | 1.0 | 0.736875 |
| 9 | What obligations should be extended to statuto... | The obligation to report irregularities to the... | [The rules on the approval and recognition of ... | Statutory auditors and audit firms should have... | 1.0 | 0.936387 | 1.0 | 0.5 | 0.989013 |
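The aggregate scores printed by `results` are consistent with a simple mean over the per-question scores. The sketch below checks this for two metrics, with the values copied by hand from the per-question table above; the list names are introduced here for illustration.

```python
from statistics import mean

# Per-question scores copied from the results table (rows 0 to 9).
faithfulness_scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.666667, 1.0, 1.0, 1.0]
context_recall_scores = [0.25, 1.0, 0.666667, 1.0, 0.666667, 1.0, 1.0, 1.0, 1.0, 1.0]

mean_faithfulness = round(mean(faithfulness_scores), 4)
mean_context_recall = round(mean(context_recall_scores), 4)

print(mean_faithfulness)    # 0.9667, matching the aggregate faithfulness
print(mean_context_recall)  # 0.8583, matching the aggregate context_recall
```

Because the aggregate is a mean over only ten questions, a single low-scoring row such as row 0 shifts it noticeably.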


## Testing a More Performant Retriever

Testing how a different retriever affects the Ragas metrics.

In [34]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
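Conceptually, a multi-query retriever asks an LLM for several rephrasings of the question, retrieves documents for each rephrasing, and returns the unique union of the results. The sketch below illustrates only the merging step, with hard-coded query variants and a toy lookup table in place of the LLM and the vector store; all names in it are hypothetical.

```python
# Toy "index": maps each query variant to the chunks it would retrieve.
toy_index = {
    "What does CSRD stand for?": ["chunk A", "chunk B"],
    "What is the full name of the CSRD?": ["chunk B", "chunk C"],
    "Which directive is abbreviated as CSRD?": ["chunk C", "chunk D"],
}


def toy_search(query: str) -> list[str]:
    """Stand-in for the base retriever."""
    return toy_index.get(query, [])


def unique_union(queries: list[str]) -> list[str]:
    """Merge the results of all queries, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for query in queries:
        for doc in toy_search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


merged_docs = unique_union(list(toy_index))
print(merged_docs)
```

The union can surface chunks that a single phrasing of the question would miss, at the cost of extra LLM and retrieval calls per question.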

Re-creating the RAG pipeline using the abstractions in LangChain v0.1.0

Creating a chain to "stuff" our documents into our context.

In [35]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)
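"Stuffing" means concatenating all retrieved chunks into a single string and placing it in the prompt's `{context}` slot. The sketch below mimics that step with plain strings; the separator and the helper name `stuff_documents` are assumptions made for the example, and the prompt text is the hub template printed earlier in this notebook.

```python
def stuff_documents(chunks: list[str], separator: str = "\n\n") -> str:
    """Join all chunks into one context string."""
    return separator.join(chunks)


# System prompt text as printed from the hub prompt earlier in the notebook.
toy_prompt = (
    "Answer any use questions based solely on the context below:\n\n"
    "<context>\n{context}\n</context>"
)

# Hypothetical retrieved chunks.
toy_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
stuffed_prompt = toy_prompt.format(context=stuff_documents(toy_chunks))
print(stuffed_prompt)
```

Stuffing is simple, but every retrieved chunk counts against the model's context window, which matters when a retriever returns many documents.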

Creating the retrieval chain.

In [36]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [40]:
response = retrieval_chain.invoke({"input": "What does the acronym CSRD stand for?"})

In [41]:
print(response["answer"])

The context provided does not mention the specific acronym "CSRD."


In [43]:
response = retrieval_chain.invoke({"input": "What is article 29a about?"})

In [44]:
print(response["answer"])

Article 29a of Directive 2013/34/EU pertains to the disclosure of non-financial information by undertakings. It requires undertakings to include references to and additional explanations of amounts reported in their annual financial statements. Additionally, it specifies reporting areas that must be covered in the non-financial reporting, such as business model, policies, outcomes of policies, risks and risk management, and key performance indicators relevant to the business. The article aims to ensure consistency in the information disclosed in the management report.


Evaluating the new chain.

Collecting the pipeline's contexts and answers.

In [45]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Converting it into a dataset.

In [46]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Evaluating on the same metrics as the first pipeline.

In [47]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)


### Comparing Results

Comparing the results to see in which direction each metric changed.

Initial metrics from the first pipeline.

In [48]:
results

{'faithfulness': 0.9667, 'answer_relevancy': 0.9307, 'context_recall': 0.8583, 'context_precision': 0.8389, 'answer_correctness': 0.6832}

Metrics from the advanced retrieval chain.

In [49]:
advanced_retrieval_results

{'faithfulness': 1.0000, 'answer_relevancy': 0.9561, 'context_recall': 0.9333, 'context_precision': 0.8511, 'answer_correctness': 0.6545}

In [50]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.966667 | 1.0 | 0.033333 |
| 1 | answer_relevancy | 0.930696 | 0.956124 | 0.025428 |
| 2 | context_recall | 0.858333 | 0.933333 | 0.075 |
| 3 | context_precision | 0.838889 | 0.851052 | 0.012163 |
| 4 | answer_correctness | 0.683202 | 0.654532 | -0.028671 |


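Faithfulness, answer relevancy, context recall and context precision all improved with the `MultiQueryRetriever`, with context recall showing the largest gain (+0.075), while answer correctness fell by about 0.029.

The test set contains only 10 questions, so these differences are small, and more experimentation is needed before concluding that the adjusted pipeline is better.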

This notebook is adapted from the notebook developed by AI Makerspace