
# 🔍 Retrieval-Augmented Generation (RAG) with Custom Semantic Chunking

This notebook implements a RAG pipeline over a long scientific report. The system includes:
- PDF parsing and cleaning
- Two chunking strategies: sequential semantic chunking and a custom anchor-based semantic chunking
- Embedding and indexing using ChromaDB
- Retrieval with OpenAI GPT-4 for grounded Q&A

---




## ✨ Explainability & Methodology Overview



### 📄 Text Cleaning & Preprocessing

The uploaded EEAP report PDF was parsed using `pdfplumber`. Key steps:
- Removed hyperlinks (to avoid non-informative tokens)
- Normalized whitespace and joined hyphenated line breaks
- Skipped table of contents pages
- Lowercased the text and removed occurrences of the phrase "executive summary"

This results in a clean, structured body of text for downstream processing.
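
The cleaning steps listed above can be sketched on a raw string, independent of `pdfplumber`. This is a minimal illustration rather than the notebook's exact code: `clean_page_text` and the sample text are hypothetical, and the hyphen rule assumes a word was split across a line break.

```python
import re

def clean_page_text(raw: str) -> str:
    """Apply the cleaning steps to the text of a single page."""
    # 1. Drop hyperlinks, which add tokens without adding meaning.
    text = re.sub(r"https?://\S+", "", raw)
    # 2. Rejoin words hyphenated across a line break ("Strato-\nspheric").
    #    This must run before newlines are collapsed into spaces.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # 3. Collapse every run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = (
    "Strato-\nspheric ozone   recovery continues,\n"
    "see https://example.org/report for details."
)
print(clean_page_text(sample))
```

Doing the hyphen repair before whitespace normalization keeps ordinary hyphens such as "UV-B" intact, because only a hyphen followed by a line break is treated as a split word.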


### 🧠 Custom Semantic Chunking Strategy

Unlike traditional sequential or fixed-size chunking, this approach:
- Uses an `anchor_stride` to select anchor sentences
- Computes cosine similarity between each anchor and all sentences (pre-combined in `comb`)
- Selects the most semantically similar, non-overlapping sentences until a chunk size limit is reached

This allows the model to retrieve **cross-paragraph** and **contextually linked** ideas — critical for scientific or long-form content.
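
The anchor-based selection described above can be traced on toy data. The sketch below uses hand-made 2-D vectors in place of real embeddings and a pure-Python cosine similarity; `anchor_chunks`, the toy texts, and the vectors are illustrative assumptions, not the notebook's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def anchor_chunks(texts, vectors, anchor_stride=2, chunk_limit=6):
    """Build one chunk per anchor from the most similar unused texts."""
    used, chunks = set(), []
    for anchor in range(0, len(texts), anchor_stride):
        # Rank every text by similarity to the anchor, most similar first.
        ranked = sorted(
            range(len(texts)),
            key=lambda j: -cosine(vectors[anchor], vectors[j]),
        )
        chunk, words = [], 0
        for j in ranked:
            if j in used:
                continue  # each text may belong to only one chunk
            wc = len(texts[j].split())
            if words + wc <= chunk_limit:
                chunk.append(texts[j])
                used.add(j)
                words += wc
            if words >= chunk_limit:
                break
        if chunk:
            chunks.append(" ".join(chunk))
    return chunks

texts = ["ozone layer thins", "skin cancer risk",
         "ozone hole recovery", "sunburn and cataracts"]
vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
print(anchor_chunks(texts, vectors))
```

In this toy run the two ozone sentences end up in one chunk and the two health sentences in the other, even though they are interleaved in the input order.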



### 🧪 Embedding + Retrieval + LLM (RAG)

- `SentenceTransformer` is used for embedding chunks
- ChromaDB is used for vector indexing and fast approximate retrieval
- GPT-4 (via OpenAI API) is used to answer questions grounded in the top-k retrieved chunks

This architecture enables accurate, explainable, and flexible QA over large documents.
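
The retrieval step can be pictured as a top-k nearest-neighbour search over chunk embeddings. The sketch below is an in-memory stand-in for what the vector index does, using toy 3-D vectors; it is not ChromaDB's API, and `top_k_chunks` and the data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

chunks = [
    "ozone depletion raises UV-B at the surface",
    "the Montreal Protocol phased out CFCs",
    "UV-B exposure increases skin cancer risk",
]
chunk_vecs = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.0), (0.7, 0.0, 0.7)]
query_vec = (1.0, 0.0, 0.2)  # toy embedding of a UV-related question

retrieved = top_k_chunks(query_vec, chunk_vecs, chunks, k=2)
print(retrieved)
```

The retrieved chunks are then concatenated into the context portion of the prompt that is sent to the LLM.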

You can now evaluate the performance differences between chunking strategies using a set of benchmark questions or LLM scoring.


In [None]:
!pip install pdfplumber
!pip install chromadb --upgrade
!pip install --upgrade openai
!pip install sentence-transformers scikit-learn
import re
import openai
import chromadb
import pdfplumber
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


# **PDF Preprocessing**

In [None]:
pdf_path = "EEAP-2022-Assessment-Report-May2023-1-30.pdf"
cleaned_pages = []


skip_pages = set(range(7, 11)).union(range(2, 3))
with pdfplumber.open(pdf_path) as pdf:
    for i, page in enumerate(pdf.pages):
        if i in skip_pages:
            continue

        text = page.extract_text()
        if text:
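            text = re.sub(r'https?://\S+', '', text)  # drop hyperlinks
            # Rejoin words hyphenated across a line break before newlines are collapsed,
            # so ordinary hyphens such as "UV-B" are left alone.
            text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)
            text = re.sub(r'\s+', ' ', text)  # collapse all whitespace, including newlines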
            cleaned_pages.append(text.strip())

full_clean_text = "\n\n".join(cleaned_pages)


In [None]:
print(full_clean_text)

In [None]:
full_clean_text=full_clean_text.lower().replace('executive summary','')

In [None]:
print(full_clean_text)

In [None]:
s=re.split(r'(?<=[.!?])\s+|\n{2,}', full_clean_text.strip())
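
The split pattern above breaks on whitespace that follows `.`, `!` or `?`, and on blank lines between pages. A small sketch on toy strings shows both behaviours, including one pitfall: abbreviations such as "Fig." are treated as sentence ends. The sample strings are made up for illustration.

```python
import re

SPLIT_PATTERN = r'(?<=[.!?])\s+|\n{2,}'

# Sentence-ending punctuation followed by whitespace triggers a split.
parts = re.split(SPLIT_PATTERN, "UV-B rose. Why? See Fig. 2.\n\nNext page")
print(parts)

# A blank line splits even when no punctuation precedes it (e.g. a heading).
heading_parts = re.split(SPLIT_PATTERN, "Title\n\nBody text.")
print(heading_parts)
```

If abbreviation splits turn out to matter for a given document, a dedicated sentence tokenizer may be worth considering instead of a regular expression.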

In [None]:
len(s)

In [None]:
s[0]

In [None]:
sentences=[]
for i, j in enumerate(s):
  sentences.append({'sentences':j, 'index':i})

In [None]:
len(sentences)

`func` prepares the text for semantic chunking. For each sentence, it builds a window made of that sentence plus up to `buffer_size` sentences on either side, and collects these windows in a list (stored as `comb` below) that the chunking functions take as input.

In [None]:
def func(sentences, buffer_size):
  # For each sentence, join it with up to `buffer_size` neighbours on each side.
  combined=[]
  for index in range(len(sentences)):
    start=max(0, index-buffer_size)
    end=min(len(sentences), index+buffer_size+1)
    window=[sentences[o]['sentences'] for o in range(start, end)]
    # Join with a space so words at sentence boundaries do not run together.
    combined.append(' '.join(window))
  return combined
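
To see what the buffer windows look like, the same idea can be run on a short list of plain strings. This is a minimal sketch: `combine_with_neighbors` is a hypothetical stand-alone helper that takes strings rather than the notebook's list of dicts.

```python
def combine_with_neighbors(sentences, buffer_size=1):
    """Return one window per sentence: the sentence plus its neighbours."""
    combined = []
    for i in range(len(sentences)):
        lo = max(0, i - buffer_size)                   # clamp at document start
        hi = min(len(sentences), i + buffer_size + 1)  # clamp at document end
        combined.append(" ".join(sentences[lo:hi]))
    return combined

toy = ["A.", "B.", "C.", "D."]
windows = combine_with_neighbors(toy, buffer_size=1)
print(windows)  # ['A. B.', 'A. B. C.', 'B. C. D.', 'C. D.']
```

Note that neighbouring windows overlap, so a chunk built by joining several windows will repeat some sentences.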


In [None]:
comb=func(sentences,buffer_size= 1)

In [None]:
sentences[0]

In [None]:
sentences[1]

In [None]:
comb[0]

In [None]:
comb[1]

In [None]:
comb

## 📚 Chunking Strategy Descriptions

### 🧱 Sequential + Buffer Chunking
This strategy breaks the document into chunks by moving linearly through the text, combining each sentence with a fixed number of neighboring sentences (the buffer) on either side. It preserves local coherence and is simple to implement, making it effective when important context is typically found nearby. However, it may miss deeper semantic relationships between sentences that aren't adjacent, especially in documents with dispersed or cross-referenced content.

### 🧠 Anchor-Based Semantic Chunking
In this approach, selected anchor sentences serve as the center of each chunk. For each anchor, the system computes semantic similarity with all surrounding sentence groups (pre-computed via the buffer logic) and gathers the most relevant ones, regardless of their original order in the document. This allows the chunk to contain high-context, meaningfully related content even from non-contiguous sections. It results in richer and more focused retrieval, particularly useful in complex documents with interrelated topics.


We now use the `all-MiniLM-L6-v2` model to embed the sentence windows in `comb`. Adjacent windows are merged into the same chunk when their cosine similarity exceeds a threshold (0.75 by default) and the merged chunk stays within the chunk limit, which defaults to 512 words.

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunking(comb, threshold=0.75, chunk_limit=None):
    if chunk_limit is None:
        chunk_limit = 512

    embeddings = model.encode(comb)
    chunks = []
    current_chunk = [comb[0]]
    current_len = len(comb[0].split())

    for i in range(1, len(comb)):
        sim = cosine_similarity([embeddings[i]], [embeddings[i - 1]])[0][0]
        next_len = len(comb[i].split())

        if sim > threshold and current_len + next_len <= chunk_limit:
            current_chunk.append(comb[i])
            current_len += next_len
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [comb[i]]
            current_len = next_len

    chunks.append(" ".join(current_chunk))
    return chunks
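
The threshold-and-limit rule in `semantic_chunking` can be traced by hand on toy data. The sketch below mirrors that rule with made-up 2-D vectors instead of model embeddings; `merge_adjacent` and the sample data are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_adjacent(texts, vectors, threshold=0.75, chunk_limit=512):
    """Merge each text into the current chunk if it resembles its predecessor."""
    chunks = []
    current, current_len = [texts[0]], len(texts[0].split())
    for i in range(1, len(texts)):
        sim = cosine(vectors[i], vectors[i - 1])
        next_len = len(texts[i].split())
        if sim > threshold and current_len + next_len <= chunk_limit:
            current.append(texts[i])
            current_len += next_len
        else:
            # Topic shift or size limit reached: close the chunk, start a new one.
            chunks.append(" ".join(current))
            current, current_len = [texts[i]], next_len
    chunks.append(" ".join(current))
    return chunks

texts = ["ozone is thinning", "ozone may recover", "skin cancer rises"]
vectors = [(1.0, 0.0), (0.9, 0.2), (0.0, 1.0)]
merged = merge_adjacent(texts, vectors)
print(merged)
```

Lowering `chunk_limit` to 5 words in this example forces three separate chunks, because the first two texts together contain 6 words.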


In [None]:
semantic_chunks=semantic_chunking(comb)

In [None]:
len(semantic_chunks)

In [None]:
type(semantic_chunks)

In [None]:
k=[{i: len(i.split())} for i in semantic_chunks]

In [None]:
k[0]

In [None]:
k[1]

In [None]:
k[3]

In [None]:
k[4]

In [None]:
k[6]

In [None]:
max(len(chunk.split()) for chunk in semantic_chunks)  # word count of the longest chunk

In [None]:
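# Placeholder: supply your own key. If api_key is omitted, the client reads OPENAI_API_KEY from the environment.
client = openai.OpenAI(api_key="YOUR API KEY")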

In [None]:
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection_seq = chroma_client.get_or_create_collection("seq_chunking")
collection_sem = chroma_client.get_or_create_collection("semantic_chunking")


In [None]:
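def chroma_activate(chunks, collection):
  # Embed all chunks in one batch, then store text, vector and a positional id together.
  embeddings = model.encode(chunks).tolist()
  ids = [f"id-{i}" for i in range(len(chunks))]
  # upsert (rather than add) so that re-running the notebook against the
  # persistent store overwrites existing ids instead of colliding with them.
  collection.upsert(documents=chunks, embeddings=embeddings, ids=ids)
  return collection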


In [None]:
collection1= chroma_activate(semantic_chunks, collection_seq)

In [None]:
results = collection1.get(include=['documents', 'embeddings'])
results['ids'][0], results['documents'][0], results['embeddings'][0]
# for doc, emb in zip(results['documents'], results['embeddings']):
#     print(f"Document:\n{doc}\n\nEmbedding (first 5 dims):\n{emb[:5]}\n{'-'*40}")


In [None]:
results = collection1.get(include=["documents", "embeddings"])
print(results["ids"][:1])
print(results["documents"][:1])
print(len(results["embeddings"]))


In [None]:
def semantic_cluster_chunking(comb, anchor_stride=5, chunk_limit=512):
    embeddings = model.encode(comb)
    chunks = []
    used = set()

    for i in range(0, len(comb), anchor_stride):
        anchor_emb = embeddings[i]

        similarities = cosine_similarity([anchor_emb], embeddings)[0]
        ranked = sorted(enumerate(similarities), key=lambda x: -x[1])

        chunk = []
        word_count = 0

        for idx, sim in ranked:
            if idx in used:
                continue
            wc = len(comb[idx].split())
            if word_count + wc <= chunk_limit:
                chunk.append(comb[idx])
                used.add(idx)
                word_count += wc
            if word_count >= chunk_limit:
                break

        if chunk:
            chunks.append(" ".join(chunk))

    return chunks


In [None]:
def query_with_llm(query, collection, top_k=3):
    query_emb = model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_emb], n_results=top_k)
    context = "\n".join(results['documents'][0])

    prompt = f"""
            You are an expert assistant. Use the following context to answer the question concisely and accurately.
            If the answer is not in the context, say 'Not enough information in the document.'

            Context:
            {context}

            Question: {query}
            Answer:
            """


    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5
    )

    return response.choices[0].message.content
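
One detail of the prompt above is that the indented triple-quoted string sends its leading spaces to the model. A possible variant, sketched here under the assumption that the same wording is kept, builds the prompt from a dedented template; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names.

```python
from textwrap import dedent

# Dedent the template once, before any context is substituted, so that
# multi-line chunks cannot interfere with the indentation detection.
PROMPT_TEMPLATE = dedent("""\
    You are an expert assistant. Use the following context to answer the question concisely and accurately.
    If the answer is not in the context, say 'Not enough information in the document.'

    Context:
    {context}

    Question: {query}
    Answer:""")

def build_prompt(query, retrieved_chunks):
    """Join the retrieved chunks and fill them into the template."""
    context = "\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, query=query)

prompt = build_prompt("What is UV-B?", ["chunk one", "chunk two"])
print(prompt)
```

Keeping prompt construction in its own function also makes it easy to inspect exactly what the model receives for a given query.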



In [None]:
query_with_llm("How does ozone depletion affect UV exposure?", collection=collection1)


In [None]:
query_with_llm("What is MCP?", collection=collection1)


In [None]:
new_semantic_chunks=semantic_cluster_chunking(comb)

In [None]:
new_semantic_chunks[0]

In [None]:
comb[0]

In [None]:
collection2= chroma_activate(new_semantic_chunks, collection_sem)

In [None]:
query_with_llm("How does ozone depletion affect UV exposure?", collection = collection2)

In [None]:
query_with_llm("What are the main health benefits attributed to the Montreal Protocol, and how were they quantified?", collection = collection1)

In [None]:
query_with_llm("What are the main health benefits attributed to the Montreal Protocol, and how were they quantified?", collection = collection2)

In [None]:
query_with_llm("How does UV-B radiation interact with climate factors to affect terrestrial or aquatic ecosystems?", collection = collection1)

In [None]:
query_with_llm("How does UV-B radiation interact with climate factors to affect terrestrial or aquatic ecosystems?", collection = collection2)

## 📊 Comparison of Chunking Strategies

### 1. Sequential + Buffer Chunking
- ✅ Easy to implement
- ✅ Maintains local sentence continuity
- ❌ Can miss cross-paragraph context
- ❌ Chunks may contain filler or loosely related content

### 2. Anchor-Based Semantic Chunking (Proposed)
- ✅ Selects the most relevant context for each anchor sentence
- ✅ Pulls in semantically similar content even across sections
- ✅ Produces tighter, high-signal chunks
- ❌ Slightly more computational overhead (a similarity computation per anchor), on the order of a few seconds here. This is a small trade-off, since the cost is paid once when the chunks and the collection are created

### 🏆 Why the Second One Wins
- Better retrieval relevance: in the test queries above, LLM responses based on these chunks were more accurate and better grounded
- Captures dispersed but related concepts (important in scientific or legal docs)

### 🔬 Empirical Observation
When tested across questions like "What are the combined effects of UV radiation and climate?" or "How does the Montreal Protocol influence health outcomes?", the anchor-semantic strategy provided richer and more concise grounding for GPT-4, resulting in clearer and more factually correct answers.
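
To make this kind of side-by-side comparison repeatable, the same questions can be run through each strategy and collected in one structure. The sketch below is a hypothetical harness: `compare_strategies` is not part of the notebook, and stub functions stand in for the LLM calls so that it runs without an API key.

```python
def compare_strategies(questions, answer_fns):
    """Ask every question of every strategy and collect the answers per question."""
    rows = []
    for question in questions:
        row = {"question": question}
        for name, answer_fn in answer_fns.items():
            row[name] = answer_fn(question)
        rows.append(row)
    return rows

# Stubs standing in for the real LLM-backed answer functions.
def stub_sequential(question):
    return f"[sequential] answer to: {question}"

def stub_anchor(question):
    return f"[anchor] answer to: {question}"

questions = ["How does ozone depletion affect UV exposure?"]
rows = compare_strategies(
    questions,
    {"sequential": stub_sequential, "anchor": stub_anchor},
)
for row in rows:
    print(row)
```

In this notebook the stubs could be replaced with callables such as `lambda q: query_with_llm(q, collection=collection1)` and `lambda q: query_with_llm(q, collection=collection2)`, and the resulting rows reviewed manually or scored by an LLM.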

