# Semantic Chunking

- SemanticChunker is a document splitter that uses embedding similarity between sentences to decide chunk boundaries.

- It aims to keep each chunk semantically coherent, instead of cutting off mid-thought the way fixed-size character/token splitters can.
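
The boundary rule described in the bullets above can be sketched without any model. This is a minimal illustration, assuming made-up 2-D vectors in place of real sentence embeddings; `cosine`, `boundaries`, and `toy_vectors` are hypothetical names used only here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def boundaries(vectors, threshold):
    """Indices i where a new chunk starts, i.e. where sentence i is
    less similar than `threshold` to sentence i-1."""
    return [
        i for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]

# Hypothetical 2-D "embeddings": two sentences on one topic, two on another.
toy_vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
print(boundaries(toy_vectors, threshold=0.7))  # [2]
```

The only boundary falls at index 2, where the direction of the vectors changes sharply. The cells below apply the same rule to real embeddings.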

In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


## Document segmentation

In [2]:
with open("langchain_intro.txt","r") as f:
    text = f.read()

text

'LangChain is a framework for building applications with LLMs.\nLangchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.\nYou can create chains, agents, memory, and retrievers.\nThe Eiffel Tower is located in Paris.\nFrance is a popular tourist destination.'

In [3]:
model = SentenceTransformer(
    model_name_or_path="all-MiniLM-L6-v2"
)

In [4]:
# split into sentences

sentences = [s.strip() for s in text.split("\n") if s.strip()]
sentences

['LangChain is a framework for building applications with LLMs.',
 'Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.',
 'You can create chains, agents, memory, and retrievers.',
 'The Eiffel Tower is located in Paris.',
 'France is a popular tourist destination.']

In [9]:
# embed each sentence

embeddings = model.encode(sentences)
embeddings

array([[-0.02109222, -0.04472178,  0.01087076, ..., -0.01217802,
         0.08605651,  0.02890729],
       [-0.0341802 , -0.10210427,  0.00366989, ..., -0.01398786,
         0.04454356,  0.00551362],
       [-0.02442169, -0.05424953, -0.13623357, ...,  0.03656349,
         0.07216296, -0.03104779],
       [ 0.06605352,  0.03884848,  0.01661559, ...,  0.03093833,
         0.07991003,  0.05157556],
       [ 0.10403012, -0.03097695,  0.02524884, ...,  0.07805593,
         0.01353772, -0.026849  ]], shape=(5, 384), dtype=float32)

In [6]:
# initialize threshold parameter
threshold = 0.7
chunks = []

current_chunk = [sentences[0]]

In [7]:
# semantic grouping based on threshold

for i in range(1, len(sentences)):
    sim = cosine_similarity(
        [embeddings[i-1]],
        [embeddings[i]]
    )[0][0]

    if sim >= threshold:
        current_chunk.append(sentences[i])
    else:
        chunks.append(" ".join(current_chunk))
        current_chunk = [sentences[i]]

chunks.append(" ".join(current_chunk))


In [8]:
print("Semantic chunks: ")
for idx, chunk in enumerate(chunks):
    print(f"\nChunk {idx+1}:\n{chunk}")

Semantic chunks: 

Chunk 1:
LangChain is a framework for building applications with LLMs. Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.

Chunk 2:
You can create chains, agents, memory, and retrievers.

Chunk 3:
The Eiffel Tower is located in Paris.

Chunk 4:
France is a popular tourist destination.
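
The result above depends heavily on the threshold: with 0.7, even the two travel-related sentences end up in separate chunks. A rough way to see that sensitivity is to count chunks for several thresholds. The similarity values below are hypothetical placeholders, not the values measured in this notebook; `count_chunks` and `example_sims` are illustrative names.

```python
def count_chunks(adjacent_sims, threshold):
    """Number of chunks produced by the adjacent-threshold rule:
    one chunk to start with, plus one for every similarity below the threshold."""
    return 1 + sum(1 for s in adjacent_sims if s < threshold)

# Hypothetical adjacent similarities for a five-sentence text (not measured values).
example_sims = [0.78, 0.52, 0.18, 0.61]

for t in (0.3, 0.5, 0.7):
    print(f"threshold={t}: {count_chunks(example_sims, t)} chunks")
```

On real data, the same sweep can be run over the `sim` values computed in the loop above before settling on a threshold.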


# RAG with modular coding

In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_classic.schema import Document
from langchain_classic.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chat_models import init_chat_model
from langchain_classic.schema.runnable import RunnableLambda, RunnableMap
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os
from dotenv import load_dotenv

load_dotenv()


True

In [2]:
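# load_dotenv() already populated os.environ; fail early if the key is missing
assert os.getenv("GROQ_API_KEY"), "GROQ_API_KEY is not set; add it to your .env file"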

In [6]:
with open("langchain_intro.txt","r") as f:
    text = f.read()

text

'LangChain is a framework for building applications with LLMs.\nLangchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.\nYou can create chains, agents, memory, and retrievers.\nThe Eiffel Tower is located in Paris.\nFrance is a popular tourist destination.'

In [11]:
# custom semantic chunker with threshold

class ThresholdSemanticChunker:
    def __init__(self,
                 model_name: str = "all-MiniLM-L6-v2",
                 threshold: float = 0.7):
        self.threshold = threshold
        self.model = SentenceTransformer(model_name)

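    def split(self, text: str):
        # one sentence per non-empty line
        sentences = [s.strip() for s in text.split("\n") if s.strip()]
        if not sentences:
            return []  # nothing to embed; avoids IndexError on sentences[0]
        embeddings = self.model.encode(sentences)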

        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            sim = cosine_similarity(
                [embeddings[i-1]],
                [embeddings[i]]
            )[0][0]
            if sim >= self.threshold:
                current_chunk.append(sentences[i])
            else:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i]]

        chunks.append(" ".join(current_chunk))
        return chunks

    def split_documents(self, docs):
        results = []
        for doc in docs:
            for chunk in self.split(doc.page_content):
                results.append(
                    Document(page_content=chunk, metadata=doc.metadata)
                )
        return results

In [12]:
chunker = ThresholdSemanticChunker()

In [13]:
chunks = chunker.split_documents([Document(page_content=text)])

In [14]:
chunks

[Document(metadata={}, page_content='LangChain is a framework for building applications with LLMs. Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.'),
 Document(metadata={}, page_content='You can create chains, agents, memory, and retrievers.'),
 Document(metadata={}, page_content='The Eiffel Tower is located in Paris.'),
 Document(metadata={}, page_content='France is a popular tourist destination.')]

In [15]:
# Vector store (OpenAIEmbeddings reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

vectorstore = FAISS.from_documents(chunks, embeddings)

retriever = vectorstore.as_retriever()

In [17]:
# Prompt template

template = """Answer the question based on the following context

{context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
prompt

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based on the following context\n\n{context}\n\nQuestion: {question}\n')

In [23]:
# LLM

llm = init_chat_model(model="groq:llama-3.1-8b-instant",
                      temperature=0.4)

In [24]:
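rag_chain = (
    RunnableMap(
        {
            # join retrieved chunks into plain text instead of passing Document reprs
            "context": lambda x: "\n\n".join(
                d.page_content for d in retriever.invoke(x["question"])
            ),
            "question": lambda x: x["question"],
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)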

In [26]:
query = {"question": "What is the purpose of langchain? Explain in brief"}
result = rag_chain.invoke(query)
result

'The purpose of LangChain is to build applications with Large Language Models (LLMs) by providing modular abstractions to combine LLMs with tools like OpenAI and Pinecone.'

# Semantic chunker with langchain

In [29]:
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.document_loaders import TextLoader

In [35]:
# load the document
loader = TextLoader("langchain_intro.txt")
docs = loader.load()

In [32]:
# initialize embedding model
embedding = OpenAIEmbeddings()

In [33]:
# create semantic chunker
chunker = SemanticChunker(embedding)

In [36]:
# split the documents
chunks = chunker.split_documents(docs)

In [38]:
# print result

for i, chunk in enumerate(chunks):
    print(f"\n chunk {i+1}: \n {chunk.page_content}")


 chunk 1: 
 LangChain is a framework for building applications with LLMs. Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.

 chunk 2: 
 You can create chains, agents, memory, and retrievers. The Eiffel Tower is located in Paris. France is a popular tourist destination.
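
This output has two chunks where the fixed 0.7 threshold produced four. A likely reason, stated here as an assumption about the installed `langchain_experimental` version, is that `SemanticChunker` by default does not use a fixed similarity threshold: it computes distances between neighbouring sentences and breaks only where a distance exceeds a high percentile of all distances. The sketch below illustrates that idea in plain Python. It is not the library's code, and the distances and helper names are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def percentile_breakpoints(adjacent_distances, q=95):
    """Indices of the gaps whose distance exceeds the q-th percentile.
    Gap i sits between sentence i and sentence i+1."""
    cutoff = percentile(adjacent_distances, q)
    return [i for i, d in enumerate(adjacent_distances) if d > cutoff]

# Hypothetical cosine distances (1 - similarity) between 5 consecutive sentences,
# chosen so that the largest gap falls after the second sentence.
example_distances = [0.22, 0.82, 0.48, 0.39]
print(percentile_breakpoints(example_distances))  # [1]
```

With only four gaps, a 95th-percentile cutoff leaves room for a single break, which would explain why a five-sentence text yields exactly two chunks.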


### Alternative: compare with all previous sentences, not just the adjacent one
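
Before the full run on real embeddings, the grouping rule can be checked on toy data. This is a minimal sketch, assuming hypothetical 2-D vectors; `group_by_best_match` is an illustrative name, and it keeps an index-to-chunk map so the chunk of the best match is found without scanning every chunk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def group_by_best_match(vectors, threshold):
    """Greedy grouping: each item joins the chunk of its most similar
    predecessor if that similarity reaches the threshold, else starts a new chunk."""
    chunks = []    # list of lists of indices
    chunk_of = {}  # index -> position of its chunk in `chunks`
    for i, vec in enumerate(vectors):
        best_idx, best_sim = None, -1.0
        for j in range(i):
            sim = cosine(vec, vectors[j])
            if sim > best_sim:
                best_idx, best_sim = j, sim
        if best_idx is not None and best_sim >= threshold:
            chunk_of[i] = chunk_of[best_idx]
            chunks[chunk_of[i]].append(i)
        else:
            chunk_of[i] = len(chunks)
            chunks.append([i])
    return chunks

# Hypothetical vectors: topics alternate, so adjacent items are dissimilar.
toy_vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
print(group_by_best_match(toy_vectors, threshold=0.4))  # [[0, 2], [1, 3]]
```

An adjacent-only comparison would put all four items in separate chunks here. Comparing against every earlier item recovers the two interleaved topics, at the cost of chunks that are no longer contiguous in the text.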

In [15]:
text2 = """
The old lighthouse on the cliff blinked twice before going completely dark for the first time in seventy years.
Electric cars rarely match their advertised range when driven across steep mountain roads.
A famous violinist once claimed that silence between notes carries more emotion than the music itself.
Deep-sea creatures often rely on bioluminescence to attract prey in absolute darkness.
Modern CPUs internally reorder instructions to maximize throughput without breaking program semantics.
Some people collect postcards because they feel like tiny frozen memories of distant worlds.
In deserts, temperature drops so sharply at night that metal surfaces can form thin layers of frost.
Quantum entanglement still feels like magic even to researchers who study it daily.
Tourists in Norway often underestimate how quickly the weather can switch from sunny to stormy.
The flavor of coffee changes noticeably depending on the altitude at which beans are grown.
Ancient libraries used clay tablets long before the invention of paper or parchment.
A dog’s ability to understand human pointing gestures is surprisingly unique among animals.
Many classic video games used palette swapping to give characters multiple color variations.
The number of satellites in low-Earth orbit has increased dramatically in the last five years.
High-precision robots can assemble mechanical watches faster than most skilled artisans.
A single poorly optimized SQL query can slow down an entire microservice ecosystem.
Researchers found that bees can solve simple arithmetic tasks under controlled experiments.
People in crowded cities often walk faster without realizing it.
Old sailing ships measured speed using a weighted rope with knots tied at fixed intervals.
Not all volcanic eruptions are explosive; some create slow-moving rivers of molten rock.
The earliest train stations were designed more like luxury hotels than transportation hubs.
Coral reefs act as natural breakwaters, reducing the energy of waves before they reach shore.
Some languages, like Finnish, can express complex ideas using extremely long compound words.
A perfectly ripe mango has a distinct fruity aroma that can be recognized even from a distance.
Even simple board games can become mathematically complex when analyzed as decision trees.
Meteor showers occur when Earth passes through clouds of debris left by ancient comets.
Most modern keyboards are designed with a slight tilt to minimize wrist strain.
The shape of a bird’s wings reveals a lot about its typical flight speed and maneuverability.
Shipping containers revolutionized global trade more than any digital innovation of the last century.
A classical hologram encodes information in the interference pattern of light waves.
Children often learn new languages faster because they rely less on translation.
The taste of chocolate can vary dramatically depending on its tempering process.
Tree rings preserve a readable history of droughts, volcanic activity, and seasonal cycles.
Some historians believe that early maps exaggerated sea monsters to discourage exploration.
The blue tint in some lakes comes from sunlight scattering off very fine glacial sediments.
Ancient astronomers predicted eclipses long before they understood why they occurred.
A single strong magnet can erase data from old magnetic tape recordings.
Astronauts often describe spacewalks as both peaceful and terrifying.
Chess engines evaluate millions of board positions per second to choose the best move.
The pattern on a giraffe’s skin is unique, much like a human fingerprint.
Certain mushrooms glow faintly at night due to natural bioluminescent chemicals.
Urban planners increasingly use digital twins to simulate city traffic flows.
Rainbows always appear opposite to the position of the sun.
Old manuscripts sometimes contain hidden notes written in lemon juice, revealed by heat.
Ant colonies collectively make decisions faster than many human committees.
Crystal radios can operate without any external power source apart from received waves.
Fresh snow absorbs sound, making winter nights unusually quiet.
3D printers can manufacture complex shapes that would be impossible to carve by hand.
Some species of turtles can breathe through their skin in the water during hibernation.
A well-designed user interface can make a complex system feel intuitive instantly.
"""

In [20]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# split into sentences
sentences = [s.strip() for s in text2.split("\n") if s.strip()]
print(f"Total sentences before chunking: {len(sentences)}")

# get embeddings
embeddings = model.encode(sentences)  # shape: (N, D)

threshold = 0.4  # try 0.3–0.5 for typical sentence-transformer models
chunks = []      # list of lists of sentence indices

# start with first sentence in its own chunk
chunks.append([0])

for i in range(1, len(sentences)):
    emb_i = embeddings[i].reshape(1, -1)

    # compute similarity with ALL previous sentences (0..i-1)
    prev_embs = embeddings[:i]
    sims = cosine_similarity(emb_i, prev_embs)[0]  # shape: (i,)

    # find most similar previous sentence
    best_idx = int(np.argmax(sims))
    best_sim = float(sims[best_idx])

    if best_sim >= threshold:
        # add to the chunk that contains best_idx
        for ch in chunks:
            if best_idx in ch:
                ch.append(i)
                break
    else:
        # start a new chunk
        chunks.append([i])

# convert index chunks to text chunks
text_chunks = [" ".join(sentences[j] for j in ch) for ch in chunks]

print(f"\nTotal semantic chunks: {len(text_chunks)}")
for idx, chunk in enumerate(text_chunks, start=1):
    print(f"\nChunk {idx}:\n{chunk}")


Total sentences before chunking: 50

Total semantic chunks: 47

Chunk 1:
The old lighthouse on the cliff blinked twice before going completely dark for the first time in seventy years.

Chunk 2:
Electric cars rarely match their advertised range when driven across steep mountain roads.

Chunk 3:
A famous violinist once claimed that silence between notes carries more emotion than the music itself.

Chunk 4:
Deep-sea creatures often rely on bioluminescence to attract prey in absolute darkness. Certain mushrooms glow faintly at night due to natural bioluminescent chemicals.

Chunk 5:
Modern CPUs internally reorder instructions to maximize throughput without breaking program semantics.

Chunk 6:
Some people collect postcards because they feel like tiny frozen memories of distant worlds.

Chunk 7:
In deserts, temperature drops so sharply at night that metal surfaces can form thin layers of frost.

Chunk 8:
Quantum entanglement still feels like magic even to researchers who study it daily.

C