### Semantic Chunking
- SemanticChunker is a document splitter that uses embedding similarity between sentences to decide chunk boundaries.

- It aims to keep each chunk semantically coherent, rather than cutting text mid-thought as fixed-size character or token splitters can.
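
As a minimal sketch of the idea (not the library's implementation), the snippet below computes cosine similarity between consecutive sentence vectors and starts a new chunk wherever that similarity is at or below a threshold. The toy 2-D vectors, the helper names `adjacent_similarities` and `boundaries`, and the 0.7 threshold are illustrative assumptions; real vectors would come from a sentence-embedding model.

```python
import numpy as np

def adjacent_similarities(vectors):
    """Cosine similarity between each vector and the one after it."""
    v = np.asarray(vectors, dtype=float)
    # Normalize rows to unit length so a dot product equals cosine similarity.
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return np.sum(unit[:-1] * unit[1:], axis=1)

def boundaries(similarities, threshold=0.7):
    """Indices of the sentences that open a new chunk (sentence 0 is implicit)."""
    return [i + 1 for i, s in enumerate(similarities) if s <= threshold]

# Toy vectors: the first two point one way, the last two another.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
sims = adjacent_similarities(toy_vectors)
starts = boundaries(sims)
print(sims.round(3), starts)
```

Here only the gap between the second and third vector falls below the threshold, so the four items form two chunks.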

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chat_models import init_chat_model
from langchain.schema.runnable import RunnableLambda, RunnableMap
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os
from dotenv import load_dotenv
load_dotenv()



True

In [3]:
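## Re-export the API keys loaded from .env, skipping any that are missing
## (assigning None to os.environ raises a TypeError)
for key in ("OPENAI_API_KEY", "GROQ_API_KEY"):
    if os.getenv(key):
        os.environ[key] = os.getenv(key)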


In [56]:
## Initialize the model
model  = SentenceTransformer("all-MiniLM-L6-v2")
## Sample text
text="""
LangChain is a framework for building applications with LLMs.
Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.
You can create chains, agents, memory, and retrievers.
The Eiffel Tower is located in Paris.
France is a popular tourist destination.
"""


In [66]:
text1 = """
LangChain is a powerful open-source framework designed for building applications that integrate Large Language Models (LLMs) with external data and tools. It provides a modular architecture that simplifies connecting models like OpenAI’s GPT, Anthropic’s Claude, or Meta’s LLaMA with components such as databases, APIs, and vector stores.

LangChain introduces key concepts including Chains, Agents, Retrievers, and Memory.

A Chain is a sequence of operations or prompts that process input and produce output.

An Agent uses reasoning to decide which tool or action to execute next based on model output.

A Retriever fetches the most relevant information from a data source, often using embeddings and vector similarity search.


"""
sentences = [s.strip() for s in text1.split("\n") if s.strip()]
type(sentences)

list

In [67]:
## Step 1: Split into sentence units (one per non-empty line)

sentences = [s.strip() for s in text1.split("\n") if s.strip()]


## Step 2: Embed each sentence

embedding = model.encode(sentences)

## Step 3: initialize parameters
threshold = 0.7
chunks = []
current_chunk = [sentences[0]]


## Step 4: Semantic grouping based on threshold
for i in range(1, len(sentences)):
    sim = cosine_similarity(
        [embedding[i-1]],
        [embedding[i]]
    )[0][0]
    print(sim)
    if sim>threshold:
        current_chunk.append(sentences[i])
    else:
        chunks.append(" ".join(current_chunk))
        current_chunk = [sentences[i]]

## Append the last chunk
chunks.append(" ".join(current_chunk))

for idx, chunk in enumerate(chunks):
    print(f"\n Chunk {idx+1}: \n {chunk}")

0.6541584
0.40541422
0.37255263
0.16015358

 Chunk 1: 
 LangChain is a powerful open-source framework designed for building applications that integrate Large Language Models (LLMs) with external data and tools. It provides a modular architecture that simplifies connecting models like OpenAI’s GPT, Anthropic’s Claude, or Meta’s LLaMA with components such as databases, APIs, and vector stores.

 Chunk 2: 
 LangChain introduces key concepts including Chains, Agents, Retrievers, and Memory.

 Chunk 3: 
 A Chain is a sequence of operations or prompts that process input and produce output.

 Chunk 4: 
 An Agent uses reasoning to decide which tool or action to execute next based on model output.

 Chunk 5: 
 A Retriever fetches the most relevant information from a data source, often using embeddings and vector similarity search.
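
Every adjacent similarity printed above is below the 0.7 threshold, so each line ends up in its own chunk. The sketch below replays the same grouping rule on those four recorded values to show how the chunk count depends on the threshold. `chunk_sizes` is a hypothetical helper name, and the alternative thresholds 0.6 and 0.3 are illustrative choices rather than recommendations.

```python
def chunk_sizes(similarities, threshold):
    """Apply the same rule as the loop: extend the chunk while sim > threshold."""
    sizes = [1]  # the first sentence always opens a chunk
    for sim in similarities:
        if sim > threshold:
            sizes[-1] += 1   # similar enough: stay in the current chunk
        else:
            sizes.append(1)  # too different: start a new chunk
    return sizes

# Similarities copied from the printed output above.
recorded = [0.6541584, 0.40541422, 0.37255263, 0.16015358]
for threshold in (0.7, 0.6, 0.3):
    print(threshold, chunk_sizes(recorded, threshold))
```

Lowering the threshold merges more neighbors, so the value is best tuned against the embedding model and the text at hand.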


## RAG Pipeline Modular DC

In [12]:
## Custom Semantic Chunker With Threshold

class ThresholSemanticChunker:
    def __init__(self, model_name = "all-MiniLM-L6-v2", threshold = 0.7):
        self.model = SentenceTransformer(model_name)
        self.threshold = threshold

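    def split(self, text:str):
        ## Naive sentence split on '.'; this drops the periods themselves
        sentences = [s.strip() for s in text.split('.') if s.strip()]
        if not sentences:
            return []
        embedding = self.model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            ## cosine_similarity returns a 1x1 matrix; take the scalar
            sim = cosine_similarity([embedding[i-1]],[embedding[i]])[0][0]
            ## Use the instance threshold, not a global variable
            if sim > self.threshold:
                current_chunk.append(sentences[i])
            else:
                chunks.append(" ".join(current_chunk) + '.')
                current_chunk = [sentences[i]]
        ## Join the last chunk the same way as the others
        chunks.append(" ".join(current_chunk) + '.')
        return chunks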

    def split_documents(self, docs):
        result = []
        for doc in docs:
            for chunk in self.split(doc.page_content):
                result.append(Document(page_content = chunk, metadata = doc.metadata))
        return result  
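
Splitting on `'.'` discards the periods and would also break on decimals such as `0.7`. One possible alternative, shown here as a sketch rather than as part of the original chunker, splits on whitespace that follows sentence-ending punctuation, so each sentence keeps its own terminator. The `split_sentences` name and the regular expression are assumptions, and abbreviations such as `e.g.` followed by a space would still be split incorrectly.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace; keep the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = """
LangChain is a framework for building applications with LLMs.
The threshold is 0.7 by default. France is a popular tourist destination.
"""
for sentence in split_sentences(sample):
    print(sentence)
```

With this splitter, chunks could be rebuilt with a plain `" ".join(...)` and no trailing period added.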

In [13]:
## Document

doc = Document(page_content=text)
doc

Document(metadata={}, page_content='\nLangChain is a framework for building applications with LLMs.\nLangchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.\nYou can create chains, agents, memory, and retrievers.\nThe Eiffel Tower is located in Paris.\nFrance is a popular tourist destination.\n')

In [14]:
chunker = ThresholSemanticChunker(threshold = 0.7)
chunks = chunker.split_documents([doc])
chunks

[Document(metadata={}, page_content='LangChain is a framework for building applications with LLMs Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.'),
 Document(metadata={}, page_content='You can create chains, agents, memory, and retrievers.'),
 Document(metadata={}, page_content='The Eiffel Tower is located in Paris.'),
 Document(metadata={}, page_content='France is a popular tourist destination.')]

In [15]:
## VectorStore

embedding = OpenAIEmbeddings()
vectorStore = FAISS.from_documents(chunks, embedding)
retriever = vectorStore.as_retriever()


In [16]:
## Prompt Template

template = """ Answer the question based on the following context:
    {context}
    Question: {question}
    """
prompt = PromptTemplate.from_template(template)
prompt

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template=' Answer the question based on the following context:\n    {context}\n    Question: {question}\n    ')

In [38]:
## LLM 
# llm = init_chat_model(model = "groq:gemma2-9b-it", temperature = 0.4)
from langchain_groq import ChatGroq

llm = ChatGroq(model="meta-llama/llama-4-maverick-17b-128e-instruct", api_key=os.getenv("GROQ_API_KEY"))

## LCEL Chain With retrieval

rag_chain = (
    RunnableMap(
        {
            "context": lambda x: retriever.invoke(x["question"]),
            "question": lambda x: x["question"],
        }
    )
    |prompt
    |llm
    |StrOutputParser()
    
)

## Run Query

query = {"question": "What is LangChain used for?"}
result = rag_chain.invoke(query)
result

'LangChain is a framework used for building applications with Large Language Models (LLMs). It provides modular abstractions that allow LLMs to be combined with various tools, such as OpenAI and Pinecone. Additionally, it enables the creation of chains, agents, memory, and retrievers, suggesting a comprehensive toolkit for developing complex applications that leverage LLMs.'
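
In the chain above, `context` receives the raw list of `Document` objects from the retriever, so the prompt contains their string representation, metadata included. A common refinement, sketched here under that assumption, is to join only the `page_content` fields before filling the prompt. `format_docs` is a hypothetical helper, and `SimpleNamespace` merely stands in for retrieved documents so the snippet runs without a vector store.

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for retriever output; real items would be LangChain Document objects.
retrieved = [
    SimpleNamespace(page_content="LangChain is a framework for building applications with LLMs."),
    SimpleNamespace(page_content="You can create chains, agents, memory, and retrievers."),
]
context = format_docs(retrieved)
print(context)
```

In the chain, the `context` entry would then become `lambda x: format_docs(retriever.invoke(x["question"]))`.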

## Semantic Chunker with LangChain

In [42]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain.document_loaders import TextLoader

In [68]:
## Load the documents
loader = TextLoader("langchain_intro.txt")
docs = loader.load()

## Initialize embedding model
embedding = OpenAIEmbeddings()

## Create the semantic chunker
chunker = SemanticChunker(embedding)

## Split the documents
chunks = chunker.split_documents(docs)

## result
for i, chunk in enumerate(chunks):
    print(f"\n Chunk {i+1}: \n{chunk.page_content}")


 Chunk 1: 
LangChain is a framework for building applications with LLMs. Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.

 Chunk 2: 
You can create chains, agents, memory, and retrievers. The Eiffel Tower is located in Paris. France is a popular tourist destination.


In [72]:
loader = TextLoader("langchain_intro.txt.txt")
docs = loader.load()

## Initialize embedding model
embedding = OpenAIEmbeddings(model = "text-embedding-3-small")


## Create the semantic chunker
chunker = SemanticChunker(embedding, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=70)
print(chunker)


## Split the documents
chunks = chunker.split_documents(docs)

## result
for i, chunk in enumerate(chunks):
    print(f"\n Chunk {i+1}: \n{chunk.page_content}")

<langchain_experimental.text_splitter.SemanticChunker object at 0x7f2568797790>

 Chunk 1: 
FAISS (Facebook AI Similarity Search) is an open-source library. It supports CPU and GPU acceleration for high-dimensional data.

 Chunk 2: 
FAISS is a core component in many RAG pipelines. It is used for efficient similarity search and clustering of vectors.
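
With `breakpoint_threshold_type="percentile"`, the splitter breaks at gaps whose embedding distance exceeds the given percentile of all adjacent-sentence distances, so a value of `70` places roughly the largest 30% of gaps at chunk boundaries. The sketch below illustrates that rule with NumPy. It approximates the behavior and is not the library's code; the distance values and the `percentile_breakpoints` name are made up for illustration.

```python
import numpy as np

def percentile_breakpoints(distances, percentile=70):
    """Return the gap indices whose distance exceeds the percentile cutoff."""
    cutoff = float(np.percentile(distances, percentile))
    breaks = [i for i, d in enumerate(distances) if d > cutoff]
    return breaks, cutoff

# Hypothetical distances between 6 consecutive sentences (5 gaps).
distances = [0.1, 0.5, 0.2, 0.9, 0.3]
breaks, cutoff = percentile_breakpoints(distances)
print(breaks, round(cutoff, 2))
```

A break at gap index `i` separates sentence `i` from sentence `i + 1`, so this example would yield three chunks.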
