
# 🔍 Retrieval-Augmented Generation (RAG) with Custom Semantic Chunking

This notebook implements a RAG pipeline over a long scientific report (the EEAP 2022 Assessment Report). The system includes:
- PDF parsing and cleaning
- Two chunking strategies: sequential semantic chunking and a custom anchor-based semantic chunking
- Embedding and indexing using ChromaDB
- Retrieval with OpenAI GPT-4 for grounded Q&A

---




## ✨ Explainability & Methodology Overview



### 📄 Text Cleaning & Preprocessing

The uploaded EEAP report PDF was parsed using `pdfplumber`. Key steps:
- Removed hyperlinks (to avoid non-informative tokens)
- Normalized whitespace and joined hyphenated line breaks
- Skipped table of contents pages
- Lowercased the text and removed the literal phrase "executive summary"

This results in a clean, structured body of text for downstream processing.
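
The per-page cleaning steps listed above can be sketched as a small standalone function. This is a minimal illustration, not the exact code of the cleaning cell: `clean_page_text` is a hypothetical helper name, and the hyphen pattern is an assumption that only targets hyphens sitting at the end of a line, so in-line hyphens such as `UV-B` are left alone.

```python
import re

def clean_page_text(text):
    """Clean the raw text of one PDF page (hypothetical helper)."""
    text = re.sub(r'https?://\S+', '', text)             # drop hyperlinks
    text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)   # join words split by a hyphen at a line break
    text = re.sub(r'\s+', ' ', text)                     # collapse whitespace, including newlines
    return text.strip()

sample = "Strato-\nspheric ozone   data at https://example.org/report today.\nUV-B stays."
cleaned = clean_page_text(sample)
print(cleaned)
```

The order matters: the hyphen join has to run before whitespace is collapsed, because after collapsing there is no newline left to tell a line-break hyphen from an ordinary one.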


### 🧠 Custom Semantic Chunking Strategy

Unlike traditional sequential or fixed-size chunking, this approach:
- Uses an `anchor_stride` to select anchor sentences
- Computes cosine similarity between each anchor and all sentences (pre-combined in `comb`)
- Selects the most semantically similar, non-overlapping sentences until a chunk size limit is reached

This allows the model to retrieve **cross-paragraph** and **contextually linked** ideas — critical for scientific or long-form content.
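
To make the selection step concrete, here is a dependency-free toy version of the idea. It is a sketch under stated assumptions: the 2-D vectors stand in for real sentence embeddings, `cosine` and `anchor_chunks` are hypothetical names, and the tiny `chunk_limit` of 6 words is chosen only so the example fits on screen.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def anchor_chunks(texts, vectors, anchor_stride=2, chunk_limit=6):
    """Group texts around every `anchor_stride`-th item by similarity."""
    used, chunks = set(), []
    for a in range(0, len(texts), anchor_stride):
        # Rank every item by similarity to the anchor, most similar first.
        ranked = sorted(range(len(texts)),
                        key=lambda j: -cosine(vectors[a], vectors[j]))
        chunk, words = [], 0
        for j in ranked:
            if j in used:
                continue  # each item may belong to one chunk only
            wc = len(texts[j].split())
            if words + wc <= chunk_limit:
                chunk.append(texts[j])
                used.add(j)
                words += wc
            if words >= chunk_limit:
                break
        if chunk:
            chunks.append(" ".join(chunk))
    return chunks

texts = ["ozone layer thins", "skin cancer risk",
         "ozone hole grows", "cataract cases rise"]
vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]  # toy embeddings
chunks = anchor_chunks(texts, vectors)
print(chunks)
```

In this toy run the two ozone sentences end up together even though they are not adjacent, and the two health sentences form the second chunk.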



### 🧪 Embedding + Retrieval + LLM (RAG)

- `SentenceTransformer` is used for embedding chunks
- ChromaDB is used for vector indexing and fast approximate retrieval
- GPT-4 (via OpenAI API) is used to answer questions grounded in the top-k retrieved chunks

This architecture enables accurate, explainable, and flexible QA over large documents.

You can now evaluate the performance differences between chunking strategies using a set of benchmark questions or LLM scoring.
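
One simple way to set up such a comparison is to run the same questions through each strategy and collect the answers side by side. The sketch below is an assumption about how this could look, not part of the notebook: `compare_strategies` and `refusal_rate` are hypothetical helpers, and the stub answer functions stand in for real calls to the question-answering function. The refusal string is the one the prompt asks the model to use.

```python
REFUSAL = "Not enough information in the document."

def compare_strategies(questions, answer_fns):
    """Return one row per question with the answer from every strategy."""
    rows = []
    for q in questions:
        row = {"question": q}
        for name, fn in answer_fns.items():
            row[name] = fn(q)
        rows.append(row)
    return rows

def refusal_rate(rows, name, refusal=REFUSAL):
    """Fraction of questions for which a strategy returned the refusal string."""
    if not rows:
        return 0.0
    return sum(row[name].strip() == refusal for row in rows) / len(rows)

# Stub answer functions so the sketch runs without an API key.
stub_fns = {
    "sequential": lambda q: REFUSAL if "MCP" in q else "some grounded answer",
    "anchor": lambda q: "some grounded answer",
}
rows = compare_strategies(
    ["How does ozone depletion affect UV exposure?", "What is MCP?"], stub_fns
)
print(refusal_rate(rows, "sequential"), refusal_rate(rows, "anchor"))
```

In the notebook, the stubs would be replaced by functions that query each collection. The refusal rate is only a coarse signal; judging factual correctness still needs reference answers or a separate scoring step.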


In [2]:
!pip install pdfplumber
!pip install PyPDF2
!pip install chromadb --upgrade
!pip install --upgrade openai
import re
import openai
import chromadb
import pdfplumber
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


Collecting openai
  Downloading openai-1.82.0-py3-none-any.whl.metadata (25 kB)
Downloading openai-1.82.0-py3-none-any.whl (720 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.78.1
    Uninstalling openai-1.78.1:
      Successfully uninstalled openai-1.78.1
Successfully installed openai-1.82.0


# **PDF Preprocessing**

In [4]:
pdf_path = "EEAP-2022-Assessment-Report-May2023-1-30.pdf"
cleaned_pages = []


# Zero-based page indices to skip (index 2 and indices 7-10), e.g. table-of-contents pages.
skip_pages = set(range(7, 11)).union(range(2, 3))
with pdfplumber.open(pdf_path) as pdf:
    for i, page in enumerate(pdf.pages):
        if i in skip_pages:
            continue

        text = page.extract_text()
        if text:
            text = re.sub(r'https?://\S+', '', text)             # remove hyperlinks
            text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)   # join words hyphenated across a line break
            text = re.sub(r'\s+', ' ', text)                     # collapse all whitespace runs
            cleaned_pages.append(text.strip())

full_clean_text = "\n\n".join(cleaned_pages)


In [5]:
print(full_clean_text)

Environmental Effects of Stratospheric Ozone Depletion, UV Radiation, and Interactions with Climate Change 2022 Assessment Report Montreal Protocol on Substances that Deplete the Ozone Layer 1

Montreal Protocol On Substances that Deplete the Ozone Layer UNEP 2022 Assessment Report of the Environmental Effects Assessment Panel The text of this report is composed in Gibson. Co-ordination: Environmental Effects Assessment Panel Reproduction: UNEP Nairobi, Ozone Secretariat Date: March 2023 This document is available in electronic form from No copyright involved. This publication may be freely copied, abstracted and cited, with acknowledgement of the source of the material. ISBN: 978-9914-733-91-4 2

Highlights HIGHLIGHTS Environmental Effects Assessment Panel 2022 Quadrennial Assessment Environmental effects of stratospheric ozone depletion, UV radiation, and interactions with climate change The highlights of the 2022 Quadrennial Assessment focus on major findings since the last assessme

In [6]:
full_clean_text=full_clean_text.lower().replace('executive summary','')

In [7]:
print(full_clean_text)

environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1

montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson. co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved. this publication may be freely copied, abstracted and cited, with acknowledgement of the source of the material. isbn: 978-9914-733-91-4 2

highlights highlights environmental effects assessment panel 2022 quadrennial assessment environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change the highlights of the 2022 quadrennial assessment focus on major findings since the last assessme

In [8]:
s=re.split(r'(?<=[.!?])\s+|\n{2,}', full_clean_text.strip())
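
The split pattern has two alternatives: whitespace that follows sentence-ending punctuation, and blank lines (the page separator used when joining pages). The toy strings below are made up to show both cases and one known limitation, namely that abbreviations ending in a period also trigger a split.

```python
import re

SPLIT_PATTERN = r'(?<=[.!?])\s+|\n{2,}'  # same pattern as in the cell above

# Whitespace after '.', '!' or '?' ends a sentence.
by_punctuation = re.split(SPLIT_PATTERN, "ozone is thinning. uv rises!\n\nnext page text")

# A page boundary splits even without closing punctuation.
by_page_break = re.split(SPLIT_PATTERN, "page one ends 1\n\npage two starts")

# Limitation: an abbreviation such as "fig." is treated as a sentence end.
abbreviation = re.split(SPLIT_PATTERN, "see fig. 2 for details.")

print(by_punctuation, by_page_break, abbreviation, sep="\n")
```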

In [9]:
len(s)

707

In [10]:
s[0]

'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1'

In [11]:
sentences=[]
for i, j in enumerate(s):
  sentences.append({'sentences':j, 'index':i})

In [12]:
len(sentences)

707

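`func` prepares the text for semantic chunking. For each sentence it builds a window made of that sentence plus up to `buffer_size` sentences before and after it, and collects the windows in a list, `comb`, which the chunking functions below take as input. The sentences are concatenated without a separator, which is why the outputs below show fused words such as `1montreal`.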

In [13]:
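def func(sentences, buffer_size):
  combined=[]
  for index,value in enumerate(sentences):
    # Up to buffer_size preceding sentences, followed by the sentence itself.
    val=''
    for o1 in range(index-min(index,buffer_size),index+1):
      val=val+sentences[o1]['sentences']
    k=val
    # Up to buffer_size following sentences.
    val=''
    for o2 in range(index+1,min(len(sentences),buffer_size+1+index)):
      val=val+sentences[o2]['sentences']
    # Note: the pieces are concatenated with no separator between sentences.
    k=k+val

    combined.append(k)
  return combined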
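
The same windows can be built with list slicing and an explicit separator. This is a sketch rather than the notebook's function: `build_windows` and its `sep` parameter are hypothetical names. With `sep=""` it reproduces the fused concatenation of `func`, while the default `sep=" "` keeps a space between neighbouring sentences.

```python
def build_windows(sentences, buffer_size=1, sep=" "):
    """Join each sentence with up to `buffer_size` neighbours on each side."""
    texts = [s["sentences"] for s in sentences]
    windows = []
    for i in range(len(texts)):
        start = max(0, i - buffer_size)
        end = min(len(texts), i + buffer_size + 1)
        windows.append(sep.join(texts[start:end]))
    return windows

# Toy input in the same shape as the `sentences` list used in the notebook.
demo = [{"sentences": t, "index": i} for i, t in enumerate(["a.", "b.", "c.", "d."])]
windows = build_windows(demo, buffer_size=1)
fused = build_windows(demo, buffer_size=1, sep="")
print(windows)
```

Keeping the separator mainly affects word counts and tokenization: fused boundaries such as `1montreal` count as a single word and are unlikely to be a word the embedding model has seen.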


In [14]:
comb=func(sentences,buffer_size= 1)

In [15]:
sentences[0]

{'sentences': 'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1',
 'index': 0}

In [16]:
sentences[1]

{'sentences': 'montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.',
 'index': 1}

In [17]:
comb[0]

'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.'

In [18]:
comb[1]

'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved.'

In [19]:
comb

['environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.',
 'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved.',
 'montreal protocol on substances that deplete the ozo

## 📚 Chunking Strategy Descriptions

### 🧱 Sequential + Buffer Chunking
This strategy breaks the document into chunks by moving linearly through the text, combining each sentence with a fixed number of neighboring sentences (the buffer) on either side. It preserves local coherence and is simple to implement, making it effective when important context is typically found nearby. However, it may miss deeper semantic relationships between sentences that aren't adjacent, especially in documents with dispersed or cross-referenced content.

### 🧠 Anchor-Based Semantic Chunking
In this approach, selected anchor sentences serve as the center of each chunk. For each anchor, the system computes semantic similarity with all surrounding sentence groups (pre-computed via the buffer logic) and gathers the most relevant ones, regardless of their original order in the document. This allows the chunk to contain high-context, meaningfully related content even from non-contiguous sections. It results in richer and more focused retrieval, particularly useful in complex documents with interrelated topics.


We now use the `all-MiniLM-L6-v2` model to embed the sentence windows in `comb`. Consecutive windows are placed in the same chunk when their cosine similarity exceeds `threshold` and the chunk stays within `chunk_limit`. The limit is counted in whitespace-separated words and defaults to 512.

In [20]:
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunking(comb, threshold=0.75, chunk_limit=None):
    if chunk_limit is None:
        chunk_limit = 512

    embeddings = model.encode(comb)
    chunks = []
    current_chunk = [comb[0]]
    current_len = len(comb[0].split())

    for i in range(1, len(comb)):
        sim = cosine_similarity([embeddings[i]], [embeddings[i - 1]])[0][0]
        next_len = len(comb[i].split())

        if sim > threshold and current_len + next_len <= chunk_limit:
            current_chunk.append(comb[i])
            current_len += next_len
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [comb[i]]
            current_len = next_len

    chunks.append(" ".join(current_chunk))
    return chunks



In [21]:
semantic_chunks=semantic_chunking(comb)

In [22]:
len(semantic_chunks)

212

In [23]:
type(semantic_chunks)

list

In [24]:
k=[{i: len(i.split())} for i in semantic_chunks]

In [25]:
k[0]

{'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson. environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved. montreal protocol on substances that deplete the ozone layer

In [26]:
k[1]

{'co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved.this publication may be freely copied, abstracted and cited, with acknowledgement of the source of the material.isbn: 978-9914-733-91-4 2': 42}

In [27]:
k[3]

{'1 ultraviolet radiation, stratospheric ozone depletion, and climate change • concentrations of stratospheric ozone in the future will depend on the decrease in ozone-depleting substances (odss) controlled by the montreal protocol, other substances currently not controlled, and on emissions of greenhouse gases, such as carbon dioxide, methane and nitrous oxide.the trajectory of these emissions depends greatly on policy decisions.• large increases in uv radiation were observed during the 2020 antarctic and arctic springs, when the uv index rose by to 80% and 70%, respectively, above the historical means. the trajectory of these emissions depends greatly on policy decisions.• large increases in uv radiation were observed during the 2020 antarctic and arctic springs, when the uv index rose by to 80% and 70%, respectively, above the historical means.• in the antarctic, these anomalously high amounts of uv radiation extended over spring and the start of summer, and may have had negative co

In [28]:
k[4]

{'• thawing of permafrosts will result in the release of uv-absorbing organic carbon into aquatic ecosystems and enhanced emissions of carbon dioxide and methane to the atmosphere.• the concurrence of heat waves with drought and high uv-b irradiance (280-315 nm) may negatively affect food security and biodiversity of crops and animals.these climatic conditions can disrupt formerly favourable habitats and may shift habitats to locations with different conditions, to which plants and animals may not be adapt. • the concurrence of heat waves with drought and high uv-b irradiance (280-315 nm) may negatively affect food security and biodiversity of crops and animals.these climatic conditions can disrupt formerly favourable habitats and may shift habitats to locations with different conditions, to which plants and animals may not be adapt.tropical coral reefs under naturally high uv irradiance are of particular concern, since an increase in sea surface temperatures of 1 °c to 2 °c can cause 

In [29]:
k[6]

{'4highlights • the montreal protocol may have benefits for uv-induced inflammatory skin disorders.in some people these lead to large decreases in quality of life. highlights • the montreal protocol may have benefits for uv-induced inflammatory skin disorders.in some people these lead to large decreases in quality of life.many diuretic and anti-inflammatory drugs can cause photosensitivity when skin is exposed to uv radiation, although the global incidence of drug-induced photosensitivity is unclear. in some people these lead to large decreases in quality of life.many diuretic and anti-inflammatory drugs can cause photosensitivity when skin is exposed to uv radiation, although the global incidence of drug-induced photosensitivity is unclear.some drugs used for decreasing blood pressure may increase the risk of keratinocyte skin cancer through uv-induced dna damage. many diuretic and anti-inflammatory drugs can cause photosensitivity when skin is exposed to uv radiation, although the gl

In [30]:
max(len(chunk.split()) for chunk in semantic_chunks)  # word count of the longest chunk

509

In [31]:
client = openai.OpenAI(api_key= "YOUR API KEY" )

In [32]:
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection_seq = chroma_client.get_or_create_collection("seq_chunking")
collection_sem = chroma_client.get_or_create_collection("semantic_chunking")


In [33]:
def chroma_activate(chunks, collection):
  for i, text in enumerate(chunks):
    embedding = model.encode([text])[0].tolist()
    collection.add(
        documents=[text],
        embeddings=[embedding],
        ids=[f"id-{i}"]
    )
  return collection
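
Because the collections live in a persistent store, re-running the indexing cell sends the same ids again; depending on the Chroma version, such re-added ids may be rejected or skipped rather than refreshed (an assumption worth checking against your installed version). A possible variant is to embed all chunks in one call and use the collection's `upsert` method. In this sketch `index_chunks`, `encode` and `prefix` are hypothetical names; `encode` is expected to behave like `model.encode` on a list of strings.

```python
def index_chunks(chunks, collection, encode, prefix="id"):
    """Embed all chunks in one batch and upsert them into a Chroma collection."""
    chunks = list(chunks)
    # Convert each embedding row to plain Python floats.
    embeddings = [[float(x) for x in row] for row in encode(chunks)]
    ids = [f"{prefix}-{i}" for i in range(len(chunks))]
    # upsert inserts new ids and overwrites existing ones.
    collection.upsert(documents=chunks, embeddings=embeddings, ids=ids)
    return collection

# Possible usage in this notebook (not executed here):
# collection1 = index_chunks(semantic_chunks, collection_seq, model.encode)
```

Encoding in one batch also avoids calling the embedding model once per chunk.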


In [34]:
collection1= chroma_activate(semantic_chunks, collection_seq)

In [35]:
results = collection1.get(include=['documents', 'embeddings'])
results['ids'][0], results['documents'][0], results['embeddings'][0]
# for doc, emb in zip(results['documents'], results['embeddings']):
#     print(f"Document:\n{doc}\n\nEmbedding (first 5 dims):\n{emb[:5]}\n{'-'*40}")


('id-0',
 'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson. environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved. montreal protocol on substances that deplete the oz

In [36]:
results = collection1.get(include=["documents", "embeddings"])
print(results["ids"][:1])
print(results["documents"][:1])
print(len(results["embeddings"]))


['id-0']
['environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson. environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved. montreal protocol on substances that deplete the oz

In [37]:
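def semantic_cluster_chunking(comb, anchor_stride=5, chunk_limit=512):
    # Embed every sentence window once.
    embeddings = model.encode(comb)
    chunks = []
    used = set()  # indices of windows already assigned to a chunk

    # Every `anchor_stride`-th window acts as an anchor.
    for i in range(0, len(comb), anchor_stride):
        anchor_emb = embeddings[i]

        # Rank all windows by cosine similarity to the anchor, most similar first.
        similarities = cosine_similarity([anchor_emb], embeddings)[0]
        ranked = sorted(enumerate(similarities), key=lambda x: -x[1])

        chunk = []
        word_count = 0

        # Greedily add unused windows that still fit within the word budget.
        for idx, sim in ranked:
            if idx in used:
                continue
            wc = len(comb[idx].split())
            if word_count + wc <= chunk_limit:
                chunk.append(comb[idx])
                used.add(idx)
                word_count += wc
            if word_count >= chunk_limit:
                break

        # An anchor yields no chunk once every window has been used.
        if chunk:
            chunks.append(" ".join(chunk))

    return chunks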


In [38]:
def query_with_llm(query, collection, top_k=3):
    query_emb = model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_emb], n_results=top_k)
    context = "\n".join(results['documents'][0])

    prompt = f"""
            You are an expert assistant. Use the following context to answer the question concisely and accurately.
            If the answer is not in the context, say 'Not enough information in the document.'

            Context:
            {context}

            Question: {query}
            Answer:
            """


    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5
    )

    return response.choices[0].message.content
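
The indented triple-quoted f-string above sends its leading spaces to the model as part of the prompt. A possible alternative is to assemble the prompt from explicit lines, as in the sketch below. `build_prompt` is a hypothetical helper; the wording is the same as in the prompt above.

```python
def build_prompt(context_chunks, query):
    """Assemble the grounded-QA prompt without stray indentation."""
    context = "\n".join(context_chunks)
    return (
        "You are an expert assistant. Use the following context to answer "
        "the question concisely and accurately.\n"
        "If the answer is not in the context, say "
        "'Not enough information in the document.'\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_prompt(["chunk a", "chunk b"], "What is UV-B?")
print(prompt)
```

Keeping prompt construction in its own function also makes it easy to inspect exactly what was sent for a given query when comparing the two collections.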



In [39]:
query_with_llm("How does ozone depletion affect UV exposure?", collection=collection1)


'Ozone depletion can lead to large increases in UV radiation. For example, during the 2020 Antarctic and Arctic springs, the UV index rose by up to 80% and 70%, respectively, above the historical means. This increased exposure to UV radiation can have negative effects on ecosystems on land and in water bodies, especially in polar and high-elevation regions. Moreover, the concurrence of heat waves with drought and high UV-B irradiance may negatively affect food security and biodiversity of crops and animals.'

In [40]:
query_with_llm("What is MCP?", collection=collection1)


'Not enough information in the document.'

In [41]:
new_semantic_chunks=semantic_cluster_chunking(comb)

In [42]:
new_semantic_chunks[0]

'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson. environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.co-ordination: environmental effects assessment panel reproduction: unep nairobi, ozone secretariat date: march 2023 this document is available in electronic form from no copyright involved. montreal protocol on substances that deplete the ozone layer 

In [43]:
comb[0]

'environmental effects of stratospheric ozone depletion, uv radiation, and interactions with climate change 2022 assessment report montreal protocol on substances that deplete the ozone layer 1montreal protocol on substances that deplete the ozone layer unep 2022 assessment report of the environmental effects assessment panel the text of this report is composed in gibson.'

In [44]:
collection2= chroma_activate(new_semantic_chunks, collection_sem)

In [45]:
query_with_llm("How does ozone depletion affect UV exposure?", collection = collection2)

'Ozone depletion contributes to increases in UV radiation. Particular attention is given to the linkages between stratospheric ozone depletion and UV radiation. Research has shown that increases in UV radiation caused by stratospheric ozone depletion over 1980-2020 have contributed a small increase to the concentration of globally averaged OH. Decreases in UV-B radiation, as a result of the Montreal Protocol, would be expected to result in a reduced net production of O near sources of pollution and a slower consumption of O with increasing distance from polluted areas. The Antarctic ozone hole has resulted in large increases in surface UV-B radiation, with peak irradiances sometimes exceeding those observed in subtropical locations. In the Arctic, some of the highest UV-B irradiances on record were measured in March and April 2020.'

In [46]:
query_with_llm("What are the main health benefits attributed to the Montreal Protocol, and how were they quantified?", collection = collection1)

'The main health benefits attributed to the Montreal Protocol include avoiding large increases in UV-B radiation, which could have led people to spend more time indoors where the risk of Covid-19 infection is much higher. The document does not provide specific methods or data used to quantify these benefits.'

In [47]:
query_with_llm("What are the main health benefits attributed to the Montreal Protocol, and how were they quantified?", collection = collection2)

'The main health benefits attributed to the Montreal Protocol are significant reductions in UV-related diseases. It is estimated that due to the Montreal Protocol, 11 million cases of melanoma, 432 million cases of keratinocyte skin cancers, and 63 million cases of cataract will have been avoided for those born between 1890 and 2100 in the United States. These estimates were quantified by the United States Environmental Protection Agency.'

In [48]:
query_with_llm("How does UV-B radiation interact with climate factors to affect terrestrial or aquatic ecosystems?", collection = collection1)

'UV-B radiation interacts with climate factors in several ways to affect terrestrial and aquatic ecosystems. Exposure to UV-B radiation is generally limited to the surface layer of aquatic ecosystems. This exposure is controlled by factors such as surface warming, inputs of fresh water, surface winds and currents. In the Anthropocene, there is generally more warming and wind, which increases the mixed layer depth, while also sharpening the density barrier to nutrient transport from deep water. Ice melt reduces shielding and freshens the polar ocean, reducing the mixed layer depth. Terrestrial runoff from rain events can lower the transparency to UV-B radiation and warm surface waters due to enhanced absorption of solar radiation. This can result in shallower mixed layers in lakes. Drought would have the opposite effect. UV radiation, combined with anthropogenic factors, exacerbates stress on aquatic ecosystems, especially tropical coral reef ecosystems. Thawing of permafrosts releases 

In [49]:
query_with_llm("How does UV-B radiation interact with climate factors to affect terrestrial or aquatic ecosystems?", collection = collection2)

"UV-B radiation's interaction with climate factors affects terrestrial and aquatic ecosystems in several ways. Climate change alters the depth of the mixed layer in oceans, the thickness of ice cover, the duration of ice-free conditions, and inputs of dissolved organic matter, which can either increase or decrease exposure to UV radiation. In terrestrial ecosystems, UV radiation and climate interact to affect the cycling of nutrients such as carbon and nitrogen, with potential consequences for food security, biodiversity, and climate. Exposure to extreme solar UV radiation can have deleterious effects on many plants, animals, and microorganisms. Changes in stratospheric ozone and climate can also have complex effects on different types of air pollution. Solar UV radiation plays a role in the breakdown of contaminants in aquatic and terrestrial ecosystems. Increased warming can lead to more ice melt and increased exposure of ecosystems to UV radiation. Thawing of permafrosts can result 

## 📊 Comparison of Chunking Strategies

### 1. Sequential + Buffer Chunking
- ✅ Easy to implement
- ✅ Maintains local sentence continuity
- ❌ Can miss cross-paragraph context
- ❌ Chunks may contain filler or loosely related content

### 2. Anchor-Based Semantic Chunking (Proposed)
- ✅ Groups the most similar context around each anchor
- ✅ Pulls in semantically similar content even across sections
- ✅ Produces tighter, high-signal chunks
- ❌ Slightly more computational overhead (semantic similarity matrix), which added only a few seconds here. The cost is paid once, when the chunks and the collection are built

### 🏆 Why the Second One Wins
- Better relevance in retrieval: in the queries above, LLM responses based on these chunks were more specific and better grounded
- Captures dispersed but related concepts (important in scientific or legal docs)

### 🔬 Empirical Observation
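When tested across questions like "What are the combined effects of UV radiation and climate?" or "How does the Montreal Protocol influence health outcomes?", the anchor-semantic strategy provided richer grounding for GPT-4. For example, on the question about the health benefits of the Montreal Protocol, the answer from the first collection contained no quantification, while the answer from the anchor-based collection cited the United States Environmental Protection Agency estimates of avoided melanoma, keratinocyte skin cancer, and cataract cases. These observations come from a handful of queries, not a systematic benchmark.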

