<a href="https://colab.research.google.com/github/sdossou/CSRD_Legislation_RAG/blob/main/CSRD_Semantic_Chunking_LangChain_%26_RAGAS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Chunking with LangChain and RAGAs using CSRD Directive

Installing dependencies

In [1]:
!pip install -qU langchain_experimental langchain_openai langchain_community langchain ragas

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/193.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━[0m [32m143.4/193.4 kB[0m [31m4.1 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.4/193.4 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m43.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m817.7/817.7 kB[0m [31m49.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.2/81.2 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.1/289.1 kB[0m [31m21.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m292.8/292.8 kB[0m [31m22.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━

In [2]:
!pip install -qU faiss-cpu tiktoken

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.0/27.0 MB[0m [31m31.4 MB/s[0m eta [36m0:00:00[0m
[?25h

Importing CSRD Directive, Delegated Act and Annex

In [3]:
!wget https://raw.githubusercontent.com/sdossou/CSRD_Legislation_RAG/main/Combined_CSRD_Doc.txt -O "direct.txt"

--2024-04-17 11:22:58--  https://raw.githubusercontent.com/sdossou/CSRD_Legislation_RAG/main/Combined_CSRD_Doc.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1220263 (1.2M) [text/plain]
Saving to: ‘direct.txt’


2024-04-17 11:22:59 (110 MB/s) - ‘direct.txt’ saved [1220263/1220263]



In [4]:
with open("./direct.txt") as f:
  direct_data = f.read()

## RecursiveCharacterTextSplitter or Naive Chunking

Using a traditional non-semantic chunking strategy.



In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)

In [6]:
naive_chunks = text_splitter.split_text(direct_data)

In [7]:
for chunk in naive_chunks[10:15]:
  print(chunk + "\n")

Directive 2013/34/EU of the European Parliament and of the Council of 26 June 2013 on the annual financial statements,

consolidated financial statements and related reports of certain types of undertakings, amending Directive 2006/43/EC of the

European Parliament and of the Council and repealing Council Directives 78/660/EEC and 83/349/EEC (OJ L 182, 29.6.2013, p. 19).

. The Green Deal is the new growth strategy of the 
Union. It aims to transform the Union into a modern, resource-efficient and competitive economy with no net

emissions of greenhouse gases (GHG) by 2050. It also aims to protect, conserve and enhance the Union's natural



This demonstrates that naive chunking splits across sentence and context can be lost.

The semantic chunking strategy can help with this issue

## Semantic Chunking

Importing an OpenAI API key

In [8]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

Enter your OpenAI API Key:··········


Using the `percentile` threshold.
In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split.

The approach is as follows:

1. Split our document into sentences (based on `.`, `?`, and `!`)
2. Index each sentence based on position
3. Add a `buffer_size` (`int`) of sentences on either side of our selected sentence
4. Calculate distances between groups of sentences
5. Merge groups based on similarity of the above thresholds





In [9]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

semantic_chunker = SemanticChunker(OpenAIEmbeddings(model="text-embedding-3-large"), breakpoint_threshold_type="percentile")

Creating semantic chunks

In [10]:
semantic_chunks = semantic_chunker.create_documents([direct_data])

Viewing the naive chunks.


In [11]:
for semantic_chunk in semantic_chunks:
  if "The subsidiary undertakings and branches" in semantic_chunk.page_content:
    print(semantic_chunk.page_content)
    print(len(semantic_chunk.page_content))

3. The Commission shall, at least every three years after their date of application, review the delegated acts adopted 
pursuant to this Article, taking into consideration the technical advice of the EFRAG and, where necessary, it shall 
amend such delegated acts to take into account relevant developments, including developments with regard to 
international standards. Note (<supNote>1</supNote>)*
Commission Delegated Regulation (EU) 2020/1816 of 17 July 2020 supplementing Regulation (EU) 2016/1011 
of the European Parliament and of the Council as regards the explanation in the benchmark statement of how 
environmental, social and governance factors are reflected in each benchmark provided and published (OJ L 406, 
3.12.2020, p. 1). Note (<supNote>2</supNote>)*
Commission Delegated Regulation (EU) 2020/1817 of 17 July 2020 supplementing Regulation (EU) 2016/1011 
of the European Parliament and of the Council as regards the minimum content of the explanation on how 
environmental, socia

## Creating a RAG Pipeline Utilising Semantic Chunking

Creating a RAG LCEL chain using Semantic Chunks.


### Retrieval

Using Meta's FAISS vectorstore, and `text-embedding-3-large`


In [12]:
from langchain_community.vectorstores import FAISS

semantic_chunk_vectorstore = FAISS.from_documents(semantic_chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-large"))

The semantic retriever is set to `k = 1` to compare the semantic and naive chunking approaches.

In [13]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})

In [14]:
semantic_chunk_retriever.invoke("What is Article 40b in Chapter 9a about?")

[Document(page_content='5 Articles 19a(2), second subparagraph, and 29a(2), second subparagraph Accounting Directive.')]

### Augmented

Creating a RAG prompt to augment the question with the retrieved context

In [15]:
from langchain_core.prompts import ChatPromptTemplate

rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)

### Generation

Using `ChatOpenAI` as base model

In [16]:
from langchain_openai import ChatOpenAI

base_model = ChatOpenAI()

### LCEL Chain
Creating the LCEL semantic chain here to compare to the naive chunking LCEL chain

In [17]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

semantic_rag_chain = (
    {"context" : semantic_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | base_model
    | StrOutputParser()
)

Testing semantic chunking RAG chain

In [18]:
semantic_rag_chain.invoke("What is Article 40b in Chapter 9a about?")

"I don't know."

In [19]:
semantic_rag_chain.invoke("what are the key components of the CSRD?")

'The key components of the CSRD include cross-cutting standards, topical standards (Environmental, Social and Governance standards), and sector-specific standards.'


Testing the naive chunking RAG chain

In [20]:
naive_chunk_vectorstore = FAISS.from_texts(naive_chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-large"))

Note:  `k = 15` to make it a fair comparison between the two strategies

In [21]:
naive_chunk_retriever = naive_chunk_vectorstore.as_retriever(search_kwargs={"k" : 15})

In [22]:
naive_rag_chain = (
    {"context" : naive_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | base_model
    | StrOutputParser()
)

In [23]:
naive_rag_chain.invoke("What is Article 40b in Chapter 9a about??")

'Article 40b in Chapter 9a is about sustainability reporting standards for third-country undertakings.'

In [24]:
naive_rag_chain.invoke("What are the key components of the CSRD?")

"I don't know."

## Ragas Assessment Comparison

Setting up a new naive chunking strategy


In [25]:
synthetic_data_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)

In [26]:
synthetic_data_chunks = synthetic_data_splitter.create_documents([direct_data])

Creating:

- Questions - synthetically generated (`gpt-3.5-turbo`)
- Contexts - created above
- Ground Truths - synthetically generated (`gpt-4-turbo-preview`)
- Answers - generated from our Semantic RAG Chain

In [27]:
questions = []
ground_truths_semantic = []
contexts = []
answers = []

question_prompt = """\
You are an expert in the corporate sustainability reporting directive or CSRD. Please create a question that can be answered by referencing the following context.

Context:
{context}
"""

question_prompt = ChatPromptTemplate.from_template(question_prompt)

ground_truth_prompt = """\
Use the following context and question to answer this question using *only* the provided context.

Question:
{question}

Context:
{context}
"""

ground_truth_prompt = ChatPromptTemplate.from_template(ground_truth_prompt)

question_chain = question_prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser()
ground_truth_chain = ground_truth_prompt | ChatOpenAI(model="gpt-4-turbo-preview") | StrOutputParser()

for chunk in synthetic_data_chunks[10:20]:
  questions.append(question_chain.invoke({"context" : chunk.page_content}))
  contexts.append([chunk.page_content])
  ground_truths_semantic.append(ground_truth_chain.invoke({"question" : questions[-1], "context" : contexts[-1]}))
  answers.append(semantic_rag_chain.invoke(questions[-1]))

Formating into a dataset.

In [28]:
from datasets import load_dataset, Dataset

qagc_list = []

for question, answer, context, ground_truth in zip(questions, answers, contexts, ground_truths_semantic):
  qagc_list.append({
      "question" : question,
      "answer" : answer,
      "contexts" : context,
      "ground_truth" : ground_truth
  })

eval_dataset = Dataset.from_list(qagc_list)

In [29]:
eval_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 10
})

Implementing Ragas metrics and evaluating the created dataset.

In [30]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

In [31]:
from ragas import evaluate

result = evaluate(
    eval_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]



In [32]:
result

{'context_precision': 0.7000, 'faithfulness': 0.8750, 'answer_relevancy': 0.7401, 'context_recall': 0.8333}

In [33]:
results_df = result.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,faithfulness,answer_relevancy,context_recall
0,What specific sustainability goals does the Eu...,I don't know.,[makes the objective of climate neutrality by ...,The European Commission aims to achieve the fo...,0.0,,0.0,0.333333
1,Question: How does the Corporate Sustainabilit...,The Corporate Sustainability Reporting Directi...,[adequately protected. That strategy aims to p...,The provided context does not contain specific...,0.0,1.0,0.91196,1.0
2,What are the key differences between the curre...,The key differences between the current Non-Fi...,[(2)],"Given the limited context provided as ""(2),"" I...",1.0,1.0,0.996318,0.5
3,Question: How does the Action Plan on Financin...,I don't know.,[In its Communication of 8 March 2018entitled ...,The Action Plan on Financing Sustainable Growt...,1.0,,0.0,0.5
4,What is the main purpose of the Corporate Sust...,The main purpose of the Corporate Sustainabili...,"[risks stemming from climate change, resource ...",The main purpose of the Corporate Sustainabili...,1.0,1.0,0.957181,1.0
5,Question: What is the specific regulation adop...,The specific regulation adopted by the Europea...,[objectives. The European Parliament and the C...,The specific regulation adopted by the Europea...,1.0,1.0,0.907852,1.0
6,Question: How does the Corporate Sustainabilit...,The Corporate Sustainability Reporting Directi...,"[in the financial services sector (OJ L 317, 9...","Based on the provided context, it is not possi...",0.0,1.0,0.931392,1.0
7,Question: How does the Corporate Sustainabilit...,The Corporate Sustainability Reporting Directi...,[governs how financial market participants and...,The Corporate Sustainability Reporting Directi...,1.0,1.0,0.912469,1.0
8,Question: What is the official title and date ...,The official title of the Regulation is Commis...,[Regulation (EU) 2020/852 of the European Parl...,The official title and date of the Regulation ...,1.0,0.0,0.880026,1.0
9,Question:\nHow does the Corporate Sustainabili...,The Corporate Sustainability Reporting Directi...,[creates a classification system of \nenvironm...,The Corporate Sustainability Reporting Directi...,1.0,1.0,0.904049,1.0


The metrics results show some improvements can be made

In [34]:
for chunk in synthetic_data_chunks[10:20]:
  questions.append(question_chain.invoke({"context" : chunk.page_content}))
  contexts.append([chunk.page_content])
  ground_truths_semantic.append(ground_truth_chain.invoke({"question" : questions[-1], "context" : contexts[-1]}))
  answers.append(naive_rag_chain.invoke(questions[-1]))

In [35]:
naive_result = evaluate(
    eval_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]



In [36]:
naive_result

{'context_precision': 0.7000, 'faithfulness': 0.8750, 'answer_relevancy': 0.7409, 'context_recall': 0.8333}

In [37]:
naive_results_df = result.to_pandas()
naive_results_df

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,faithfulness,answer_relevancy,context_recall
0,What specific sustainability goals does the Eu...,I don't know.,[makes the objective of climate neutrality by ...,The European Commission aims to achieve the fo...,0.0,,0.0,0.333333
1,Question: How does the Corporate Sustainabilit...,The Corporate Sustainability Reporting Directi...,[adequately protected. That strategy aims to p...,The provided context does not contain specific...,0.0,1.0,0.91196,1.0
2,What are the key differences between the curre...,The key differences between the current Non-Fi...,[(2)],"Given the limited context provided as ""(2),"" I...",1.0,1.0,0.996318,0.5
3,Question: How does the Action Plan on Financin...,I don't know.,[In its Communication of 8 March 2018entitled ...,The Action Plan on Financing Sustainable Growt...,1.0,,0.0,0.5
4,What is the main purpose of the Corporate Sust...,The main purpose of the Corporate Sustainabili...,"[risks stemming from climate change, resource ...",The main purpose of the Corporate Sustainabili...,1.0,1.0,0.957181,1.0
5,Question: What is the specific regulation adop...,The specific regulation adopted by the Europea...,[objectives. The European Parliament and the C...,The specific regulation adopted by the Europea...,1.0,1.0,0.907852,1.0
6,Question: How does the Corporate Sustainabilit...,The Corporate Sustainability Reporting Directi...,"[in the financial services sector (OJ L 317, 9...","Based on the provided context, it is not possi...",0.0,1.0,0.931392,1.0
7,Question: How does the Corporate Sustainabilit...,The Corporate Sustainability Reporting Directi...,[governs how financial market participants and...,The Corporate Sustainability Reporting Directi...,1.0,1.0,0.912469,1.0
8,Question: What is the official title and date ...,The official title of the Regulation is Commis...,[Regulation (EU) 2020/852 of the European Parl...,The official title and date of the Regulation ...,1.0,0.0,0.880026,1.0
9,Question:\nHow does the Corporate Sustainabili...,The Corporate Sustainability Reporting Directi...,[creates a classification system of \nenvironm...,The Corporate Sustainability Reporting Directi...,1.0,1.0,0.904049,1.0


As we can see the results are fairly similar

In [38]:
naive_result

{'context_precision': 0.7000, 'faithfulness': 0.8750, 'answer_relevancy': 0.7409, 'context_recall': 0.8333}

In [39]:
result

{'context_precision': 0.7000, 'faithfulness': 0.8750, 'answer_relevancy': 0.7401, 'context_recall': 0.8333}