# Semantic Chunking for RAG

https://github.com/pavanbelagatti/Semantic-Chunking-RAG/blob/main/Semantic%20Chunking%20Tutorial.ipynb

LLMs are prone to hallucination, and several strategies exist to mitigate it. One of them is Retrieval-Augmented Generation (RAG): the LLM is given a knowledge base to retrieve information from, so its answers are grounded in that material. This reduces hallucination, although it does not rule it out entirely.

RAG is a step-by-step process: load the documents, split them into chunks using a framework such as LangChain or LlamaIndex, generate vector embeddings for the chunks, and store those embeddings in a vector database.

So, broadly, we can divide RAG into two main parts: storing and retrieval.
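To make the two parts concrete, here is a minimal, framework-free sketch. The `ToyVectorStore` class and the bag-of-words `embed` function are made up for illustration; they stand in for a real vector database and a real embedding model, and only the store-then-retrieve flow is the point.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, vector) pairs in a list."""

    def __init__(self):
        self.rows = []

    def add(self, chunks):
        # Storing: embed every chunk once and keep the vector next to the text.
        for chunk in chunks:
            self.rows.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        # Retrieval: embed the query and return the k most similar chunks.
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add([
    "BERT is pre-trained with a masked language model objective",
    "SQuAD is a question answering dataset",
    "The vector database stores embeddings",
])
top = store.search("which dataset is used for question answering", k=1)
print(top)  # ['SQuAD is a question answering dataset']
```

The quality of what `search` returns depends directly on how the text was cut into chunks before `add` was called, which is why chunking matters so much for retrieval.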

When improving a RAG pipeline, one thing we need to look at is the retrieval strategy and the techniques involved.
A suitable chunking strategy can improve retrieval, but finding the right chunk size for a given text is a hard problem in general.

Today, we will see how semantic chunking works.

Semantic Chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information’s integrity during retrieval, leading to a more accurate and contextually appropriate outcome.
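The idea can be sketched in a few lines of plain Python: split the text into sentences, embed each one, measure the distance between neighbouring sentences, and start a new chunk wherever that distance is unusually large. This is an illustration only, not LangChain's implementation. The bag-of-words `embed`, the `percentile` helper and the `pct=80` cut-off are all assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter


def embed(sentence):
    # Toy stand-in for an embedding model (assumption): bag-of-words counts.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    # Percentile with linear interpolation between the two nearest ranks.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(text, pct=80):
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


text = (
    "BERT is a language model. BERT is a bidirectional language model. "
    "Postgres stores vectors. Postgres stores vectors with pgvector."
)
chunks = semantic_chunks(text)
for chunk in chunks:
    print(chunk)
```

With this toy input the only break lands between the BERT sentences and the Postgres sentences, so each chunk stays on one topic regardless of its length in characters.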

Let's experiment with semantic chunking and see the results.

## Tech Stack Used
#### LangChain - Open-source AI framework used to load and split the data and to create embeddings
#### PostgreSQL with pgvector (via `langchain_postgres`) - Vector database used to store the embeddings
#### OpenAI and Amazon Bedrock - Providers of the embedding model and the LLM

### Download data

In [1]:
# ! wget "https://arxiv.org/pdf/1810.04805.pdf"

## Process the PDF Content

In [49]:
import os
import logging
from dotenv import load_dotenv

In [34]:
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

load_dotenv()

True

In [1]:
from langchain_community.document_loaders import PyPDFLoader, UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [12]:
py_loader = PyPDFLoader("data/1810.04805v2.pdf")

In [13]:
py_documents = py_loader.load()

In [14]:
print(len(py_documents))

16


In [18]:
for py_document in py_documents:
    print(py_document.id)
    print(py_document.metadata)
    print(py_document.page_content)
    print("\n-----\n")

None
{'source': 'data/1810.04805v2.pdf', 'page': 0}
BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout }@google.com
Abstract
We introduce a new language representa-
tion model called BERT , which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language repre-
sentation models (Peters et al., 2018a; Rad-
ford et al., 2018), BERT is designed to pre-
train deep bidirectional representations from
unlabeled text by jointly conditioning on both
left and right context in all layers. As a re-
sult, the pre-trained BERT model can be ﬁne-
tuned with just one additional output layer
to create state-of-the-art models for a wide
range of tasks, such as question answering and
language inference, without substantial task-
speciﬁc architecture modiﬁcations.
BERT is conceptually simple and empirically
powerful. It obtains

In [3]:
loader = UnstructuredPDFLoader("data/1810.04805v2.pdf")
documents = loader.load()

In [5]:
print(len(documents))

1


In [19]:
document = documents[0]

In [20]:
print(document.id)
print(document.metadata)

None
{'source': 'data/1810.04805v2.pdf'}


In [21]:
print(document.page_content)

9 1 0 2

y a M 4 2

] L C . s c [

2 v 5 0 8 4 0 . 0 1 8 1 : v i X r a

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract

We introduce a new language representa- tion model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language repre- sentation models (Peters et al., 2018a; Rad- ford et al., 2018), BERT is designed to pre- train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a re- sult, the pre-trained BERT model can be ﬁne- tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task- speciﬁc architecture modiﬁcations.

There are two existing strategies for appl

## Perform Naive Chunking (RecursiveCharacterTextSplitter)

In [23]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)

In [25]:
naive_chunks = text_splitter.split_documents(py_documents)
print(len(naive_chunks))

72
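For intuition, recursive character splitting can be sketched as follows: try the coarsest separator first (paragraph breaks), and only fall back to finer ones (newlines, spaces, single characters) for pieces that are still too long. The `recursive_split` function below is a simplified, hypothetical version written for this tutorial. It ignores `chunk_overlap`, does not re-merge pieces after recursing, and assumes the separator tuple ends with `""`, so its output will not match LangChain's splitter exactly.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Simplified sketch: coarse separators are tried first, finer ones only
    when a piece is still too long. The separators tuple must end with "".
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
pieces = recursive_split(sample, chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

Notice that the second paragraph is cut purely because of its length, in the middle of a sentence. That is the behaviour semantic chunking tries to avoid.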


In [26]:
for chunk in naive_chunks[10:15]:
    print(chunk.page_content + "\n")

BERT BERT 
E[CLS] E1 E[SEP] ... ENE1’... EM’
C
T1
T[SEP] ...
 TN
T1’...
 TM’
[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM 
Question Paragraph Start/End Span 
BERT 
E[CLS] E1 E[SEP] ... ENE1’... EM’
C
T1
T[SEP] ...
 TN
T1’...
 TM’
[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM 
Masked Sentence A Masked Sentence B 
Pre-training Fine-Tuning NSP Mask LM Mask LM 
Unlabeled Sentence A and B Pair SQuAD 
Question Answer Pair NER MNLI Figure 1: Overall pre-training and ﬁne-tuning procedures for BERT. Apart from output layers, the same architec-
tures are used in both pre-training and ﬁne-tuning. The same pre-trained model parameters are used to initialize
models for different down-stream tasks. During ﬁne-tuning, all parameters are ﬁne-tuned. [CLS] is a special
symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating ques-
tions/answers).
ing and auto-encoder objectives have been used
for pre-training such models (Howard and Ruder,

2018; Radford et al., 201

## Instantiate Embedding Model

In [35]:
# pip install sentence-transformers --quiet

In [60]:
from langchain_aws import BedrockEmbeddings
from langchain_openai import OpenAIEmbeddings

# embed_model = BedrockEmbeddings(model_id="cohere.embed-multilingual-v3")
embed_model = OpenAIEmbeddings(model="text-embedding-3-small")
# embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

### Set Up the LLM

In [87]:
from langchain_aws import ChatBedrockConverse

## Perform Semantic Chunking

In [63]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

semantic_chunker = SemanticChunker(embed_model, breakpoint_threshold_type="percentile")

In [64]:
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in py_documents])

In [65]:
len(semantic_chunks)

56

In [66]:
print(semantic_chunks[0].page_content)

BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout }@google.com
Abstract
We introduce a new language representa-
tion model called BERT , which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language repre-
sentation models (Peters et al., 2018a; Rad-
ford et al., 2018), BERT is designed to pre-
train deep bidirectional representations from
unlabeled text by jointly conditioning on both
left and right context in all layers. As a re-
sult, the pre-trained BERT model can be ﬁne-
tuned with just one additional output layer
to create state-of-the-art models for a wide
range of tasks, such as question answering and
language inference, without substantial task-
speciﬁc architecture modiﬁcations. BERT is conceptually simple and empirically
powerful. It obtains new state-of-the-art re-
sults on eleven natural la

In [67]:
for semantic_chunk in semantic_chunks:
    print(len(semantic_chunk.page_content))

3764
188
2846
794
883
244
629
2821
651
3059
1120
1632
18
2182
957
2657
814
2311
1087
303
1097
3367
17
2680
118
1311
732
358
897
1340
319
794
113
169
1365
160
71
145
1744
1129
193
232
1370
2104
169
6
3508
2946
136
987
1135
815
399
2595
59
457
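A long list of lengths is hard to read at a glance, so a small summary helps when comparing chunking strategies. The `summarize_lengths` helper below is a hypothetical convenience function, and the `limit=1000` default is an assumption borrowed from the `chunk_size` used for the naive splitter. The sample input is just the first seven lengths printed above.

```python
import statistics


def summarize_lengths(lengths, limit=1000):
    """Summarize chunk sizes (in characters) for a quick comparison."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        # How many chunks are longer than the naive splitter would allow.
        "over_limit": sum(1 for n in lengths if n > limit),
    }


# First seven semantic-chunk lengths from the output above.
sample_lengths = [3764, 188, 2846, 794, 883, 244, 629]
summary = summarize_lengths(sample_lengths)
print(summary)
```

In the notebook, the same helper could be called with `[len(c.page_content) for c in semantic_chunks]` to summarize all chunks. The wide spread between the smallest and largest values is typical for semantic chunking, because breaks follow the content rather than a fixed size.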


In [68]:
for semantic_chunk in semantic_chunks:
  if "Effect of Pre-training Tasks" in semantic_chunk.page_content:
    print(semantic_chunk.page_content)
    print(len(semantic_chunk.page_content))

Dev Set
Tasks MNLI-m QNLI MRPC SST-2 SQuAD
(Acc) (Acc) (Acc) (Acc) (F1)
BERT BASE 84.4 88.4 86.7 92.7 88.5
No NSP 83.9 84.9 86.5 92.6 87.9
LTR & No NSP 82.1 84.3 77.5 92.1 77.8
+ BiLSTM 82.1 84.1 75.7 91.6 84.9
Table 5: Ablation over the pre-training tasks using the
BERT BASE architecture. “No NSP” is trained without
the next sentence prediction task. “LTR & No NSP” is
trained as a left-to-right LM without the next sentence
prediction, like OpenAI GPT. “+ BiLSTM” adds a ran-
domly initialized BiLSTM on top of the “LTR + No
NSP” model during ﬁne-tuning. ablation studies can be found in Appendix C. 5.1 Effect of Pre-training Tasks
We demonstrate the importance of the deep bidi-
rectionality of BERT by evaluating two pre-
training objectives using exactly the same pre-
training data, ﬁne-tuning scheme, and hyperpa-
rameters as BERT BASE :
No NSP : A bidirectional model which is trained
using the “masked LM” (MLM) but without the
“next sentence prediction” (NSP) task. LTR & No NSP : A left

In [98]:
unstructured_semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])

In [99]:
len(unstructured_semantic_chunks)

34

In [100]:
for unstructured_semantic_chunk in unstructured_semantic_chunks:
    print(len(unstructured_semantic_chunk.page_content))

27
9417
19516
8990
670
56
300
894
41
1127
38
127
197
238
675
282
429
169
398
359
158
69
144
290
1449
139
488
344
344
232
1364
9945
2021
3869


### Store the chunks in our database

In [69]:
# !pip install langchain-postgres --quiet

In [70]:
from langchain_postgres import PGVector

In [109]:
POSTGRES_USER = POSTGRES_PASSWORD = POSTGRES_DB = ""  # fill in your own credentials
POSTGRES_LOCAL_PORT = 5445

In [108]:
connection = f"postgresql+psycopg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@localhost:{POSTGRES_LOCAL_PORT}/{POSTGRES_DB}"
collection_name = "py_semantic_chunks"
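One practical detail when building a connection URL with an f-string: a password containing characters such as `@` or `/` will break the URL unless it is percent-encoded. The sketch below shows one way to handle that with the standard library. The `build_connection_url` helper and the credentials in the example are made up for illustration; the port matches the one used above.

```python
from urllib.parse import quote_plus


def build_connection_url(user, password, db, host="localhost", port=5445):
    """Build a SQLAlchemy-style Postgres URL, escaping user and password."""
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# Hypothetical credentials, only to show the escaping.
url = build_connection_url("rag_user", "p@ss/word", "rag_db")
print(url)  # postgresql+psycopg://rag_user:p%40ss%2Fword@localhost:5445/rag_db
```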

In [84]:
vectorstore = PGVector(
    embeddings=embed_model,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

In [83]:
semantic_chunk_vectorstore = PGVector.from_documents(documents=semantic_chunks, embedding=embed_model, connection=connection, collection_name=collection_name,)

### Instantiate Retrieval Step

In [101]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 2})
semantic_chunk_retriever.invoke("Describe the Feature-based Approach with BERT?")

[Document(id='be9182e3-c842-48e8-8cc9-0b34ebfcf658', metadata={}, page_content='For the feature-based approach,\nwe concatenate the last 4 layers of BERT as the\nfeatures, which was shown to be the best approach\nin Section 5.3. From the table it can be seen that ﬁne-tuning is\nsurprisingly robust to different masking strategies. However, as expected, using only the M ASK strat-\negy was problematic when applying the feature-\nbased approach to NER. Interestingly, using only\nthe R NDstrategy performs much worse than our\nstrategy as well.'),
 Document(id='212305f5-a1c9-4d4c-ac15-6b6f2d2ff931', metadata={}, page_content='We use the representation of\nthe ﬁrst sub-token as the input to the token-level\nclassiﬁer over the NER label set. To ablate the ﬁne-tuning approach, we apply the\nfeature-based approach by extracting the activa-\ntions from one or more layers without ﬁne-tuning\nany parameters of BERT. These contextual em-\nbeddings are used as input to a randomly initial-\nized two-

### Instantiate Augmentation Step (for Content Augmentation)

In [102]:
from langchain_core.prompts import ChatPromptTemplate

rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)
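Whatever is passed as `context` gets turned into a string when the template is filled in. If a list of document objects is passed directly, their full representation (ids and metadata included) ends up in the prompt. A common pattern is to join only the text of each chunk first. The sketch below shows that step with plain string formatting; the `Doc` dataclass is a stand-in for LangChain's document class and `format_docs` is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    # Keep only the text of each chunk, separated by blank lines.
    return separator.join(doc.page_content for doc in docs)


RAG_TEMPLATE = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

docs = [Doc("chunk one"), Doc("chunk two", {"page": 3})]
prompt = RAG_TEMPLATE.format(question="What is BERT?", context=format_docs(docs))
print(prompt)
```

Passing clean text keeps the prompt shorter and avoids spending tokens on metadata the model does not need.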

### Instantiate the Generation Step

In [103]:
chat_model = ChatBedrockConverse(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    temperature=0,
    max_tokens=None,
)

## RAG Pipeline Utilizing Semantic Chunking

In [104]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

semantic_rag_chain = (
    {"context" : semantic_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | chat_model
    | StrOutputParser()
)
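Conceptually, the chain above runs four steps in order: retrieve context for the question, fill in the prompt, call the model, and extract the answer text. The following sketch spells that out with ordinary functions. Every function here is a stub invented for illustration (there is no real retriever or LLM behind it), so it shows the data flow only.

```python
def retrieve(question):
    # Stub retriever: a real one would query the vector store.
    return ["stub chunk about BERT"]


def build_prompt(inputs):
    # Mirrors the prompt step: combine the question with the retrieved context.
    context = "\n\n".join(inputs["context"])
    return f"Question: {inputs['question']}\nContext: {context}"


def fake_llm(prompt):
    # Stub model: wraps the prompt in a message-like dict.
    return {"content": f"ANSWER based on -> {prompt}"}


def parse(message):
    # Mirrors the output parser: keep only the text of the reply.
    return message["content"]


def rag_chain(question):
    # The question is passed through unchanged and also used for retrieval.
    inputs = {"context": retrieve(question), "question": question}
    return parse(fake_llm(build_prompt(inputs)))


answer = rag_chain("What is BERT?")
print(answer)
```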

### Ask Question 1

In [105]:
semantic_rag_chain.invoke("Describe the Feature-based Approach with BERT?")

"Based on the provided context, I can describe the Feature-based Approach with BERT as follows:\n\n1. In the feature-based approach, the last 4 layers of BERT are concatenated to create features. This was found to be the best approach according to Section 5.3 of the referenced study.\n\n2. These contextual embeddings (features) are then used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.\n\n3. This approach is used without fine-tuning any parameters of BERT.\n\n4. For Named Entity Recognition (NER) tasks, the representation of the first sub-token is used as input to the token-level classifier over the NER label set.\n\n5. The feature-based approach performs competitively with state-of-the-art methods. In fact, using the token representations from the top four hidden layers of the pre-trained Transformer was only 0.3 F1 behind fine-tuning the entire model.\n\n6. This demonstrates that BERT is effective for both fine-tuning and featur

### Ask Question 2

In [106]:
semantic_rag_chain.invoke("What is SQuADv2.0?")

'SQuAD v2.0 is an extension of the Stanford Question Answering Dataset (SQuAD). It extends the SQuAD 1.1 problem by allowing for the possibility that no short answer exists in the provided paragraph, making the task more realistic. \n\nIn SQuAD v2.0, the model needs to determine not only the correct answer span when an answer exists, but also recognize when there is no answer given in the text. This makes the task more challenging and closer to real-world question answering scenarios.\n\nThe context provides some information about how BERT was adapted to handle this task, including treating questions without answers as having an answer span at the [CLS] token and comparing scores to determine whether to predict an answer or "no answer."'

### Ask Question 3

In [107]:
semantic_rag_chain.invoke("What is the purpose of Ablation Studies?")

"Ablation studies in machine learning and AI research serve several important purposes:\n\n1. Evaluating Component Importance: They help researchers understand which components or features of a model are most important for its performance. By systematically removing or altering different parts of the model, researchers can see how each component contributes to the overall effectiveness.\n\n2. Model Simplification: Ablation studies can reveal if certain parts of a model are unnecessary, allowing for potential simplification without significant loss in performance.\n\n3. Understanding Model Behavior: They provide insights into how different aspects of the model interact and contribute to its decision-making process.\n\n4. Validating Design Choices: Researchers can use ablation studies to justify specific design choices in their models by demonstrating the impact of those choices on performance.\n\n5. Identifying Overfitting: By removing certain components, researchers can sometimes ident

## Implement a RAG pipeline using Naive Chunking Strategy

### Store the naive chunks and build the chain

In [93]:
naive_collection_name = "py_naive_chunks"
naive_chunk_vectorstore = PGVector.from_documents(documents=naive_chunks, embedding=embed_model, connection=connection, collection_name=naive_collection_name)

In [94]:
naive_chunk_retriever = naive_chunk_vectorstore.as_retriever(search_kwargs={"k": 5})
naive_rag_chain = (
        {"context": naive_chunk_retriever, "question": RunnablePassthrough()}
        | rag_prompt
        | chat_model
        | StrOutputParser()
)

### Ask Question 1

In [95]:
naive_rag_chain.invoke("Describe the Feature-based Approach with BERT?")

'Based on the provided context, I can describe the Feature-based Approach with BERT as follows:\n\nThe feature-based approach with BERT involves extracting activations from one or more layers of the pre-trained BERT model without fine-tuning any of its parameters. These extracted contextual embeddings are then used as input features for a downstream task.\n\nSpecifically, the context mentions:\n\n1. The activations are extracted from one or more layers of BERT.\n\n2. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.\n\n3. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer.\n\n4. This approach performs competitively with state-of-the-art methods, achieving results only 0.3 F1 behind fine-tuning the entire model.\n\n5. This demonstrates that BERT is effective for both fine-tuning and feature-based approaches.\n\nThe feature-

### Ask Question 2

In [96]:
naive_rag_chain.invoke("What is SQuADv2.0?")

'SQuAD v2.0 is an extension of the SQuAD (Stanford Question Answering Dataset) v1.1 task. The key difference in v2.0 is that it allows for the possibility that no short answer exists in the provided paragraph, making the problem more realistic. \n\nSpecifically:\n\n1. It extends the SQuAD 1.1 problem definition by including questions that may not have an answer in the given text.\n\n2. This makes the task more challenging and closer to real-world question answering scenarios.\n\n3. For BERT models adapted to this task, questions without answers are treated as having an answer span with start and end at the [CLS] token.\n\n4. The probability space for answer span positions is extended to include the position of the [CLS] token to account for no-answer cases.\n\n5. According to the context, BERT LARGE achieved state-of-the-art performance on SQuAD 2.0, with 80.0 Exact Match (EM) and 83.1 F1 score on the test set.'

### Ask Question 3

In [97]:
naive_rag_chain.invoke("What is the purpose of Ablation Studies?")

"Ablation studies are used to systematically evaluate the impact of different components or design choices in a machine learning model or system. The purpose of ablation studies is to:\n\n1. Understand the contribution of individual components: By removing or modifying specific parts of a model, researchers can determine how much each component contributes to the overall performance.\n\n2. Identify critical elements: Ablation studies help pinpoint which aspects of a model are most crucial for its success.\n\n3. Validate design choices: By comparing different configurations, researchers can justify why certain design decisions were made.\n\n4. Optimize model architecture: The results of ablation studies can guide researchers in refining and improving their models by focusing on the most impactful components.\n\n5. Increase interpretability: By isolating the effects of different parts of a model, ablation studies can provide insights into how the model works.\n\nIn the context provided, 
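Since the same three questions are asked of both pipelines, a small helper makes the side-by-side comparison less repetitive. The `compare_chains` function below is a hypothetical utility, and the two lambdas are stubs standing in for the real chains. When reading the results, keep in mind that the two retrievers above use different `k` values (2 for semantic chunks, 5 for naive chunks), so the comparison is not strictly like-for-like.

```python
def compare_chains(questions, chains):
    """Run every question through every chain.

    chains maps a label to a callable that takes a question and returns an answer.
    Returns {question: {label: answer}}.
    """
    return {
        question: {label: chain(question) for label, chain in chains.items()}
        for question in questions
    }


# Stub chains for illustration; they only echo which pipeline was called.
stub_chains = {
    "semantic": lambda q: f"[semantic] {q}",
    "naive": lambda q: f"[naive] {q}",
}
questions = ["What is SQuADv2.0?", "What is the purpose of Ablation Studies?"]
results = compare_chains(questions, stub_chains)

for question, answers in results.items():
    print(question)
    for label, text in answers.items():
        print(f"  {label}: {text}")
```

In the notebook, the stubs could be replaced with `semantic_rag_chain.invoke` and `naive_rag_chain.invoke`.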