# 3.5 Semantic chunker


## Setup

### Install dependencies

In [None]:
%pip install python-dotenv~=1.0 docarray~=0.40.0 pypdf~=5.1 --upgrade --quiet
%pip install chromadb~=0.5.18 sentence-transformers~=3.3 --upgrade --quiet 
%pip install langchain~=0.3.7 langchain_openai~=0.2.6 langchain_community~=0.3.5 langchain-chroma~=0.1.4 langchainhub~=0.1.21 --upgrade --quiet
%pip install langchain_experimental~=0.3.3 --upgrade --quiet

# If running locally, you can do this instead:
#%pip install -r ../requirements.txt

### Load environment variables

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# If running in Google Colab, you can use this code instead:
# from google.colab import userdata
# os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
# os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")

### Setup models

In [None]:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
api_version = "2024-10-01-preview"
llm = AzureChatOpenAI(deployment_name="gpt-4o", temperature=0.0, api_version=api_version)
embedding_model = AzureOpenAIEmbeddings(model="text-embedding-3-large", api_version=api_version)

### Setup LangSmith tracing for this notebook

In [None]:
import os

# The LangSmith API key and related settings are read from the .env file.
# Uncomment the lines below to enable tracing.
# my_name = "Totoro"
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_PROJECT"] = f"tokyo24-test-{my_name}"

# How to split text based on semantic similarity

Taken from Greg Kamradt's wonderful notebook:
[**5_Levels_Of_Text_Splitting**](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

All credit to him.

This guide covers how to split text into chunks based on semantic similarity. Where the embeddings of neighboring passages are sufficiently far apart, the text is split at that point.

At a high level, the splitter breaks the text into sentences, combines each sentence with its neighbors into groups of 3
sentences, and then merges the groups that are similar in the embedding space.
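
The sentence-grouping step described above can be sketched in plain Python. This is a minimal illustration, assuming that "groups of 3" means each sentence joined with one neighbor on either side. The helper names `split_sentences` and `sentence_windows` and the `buffer_size` argument are hypothetical and are not taken from the library's internals.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Simple heuristic: split on whitespace that follows '.', '?' or '!'.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def sentence_windows(sentences: list[str], buffer_size: int = 1) -> list[str]:
    # Join each sentence with up to `buffer_size` neighbors on each side.
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


text = "Cats purr. Cats nap a lot. Stocks fell today. Bonds rallied."
sentences = split_sentences(text)
windows = sentence_windows(sentences)
for window in windows:
    print(window)
```

In this sketch the windows, rather than the bare sentences, would be passed to the embedding model, so that very short sentences still carry some surrounding context.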

### Some benefits of this approach are:
1.	Enhanced Retrieval Accuracy: By segmenting documents into semantically coherent chunks, retrieval systems can more effectively identify and extract relevant information, leading to more precise responses.
2.	Improved Context Preservation: Semantic chunking ensures that each segment maintains its contextual integrity, preventing the disruption of ideas that can occur with fixed-size chunking methods.
3.	Reduced Hallucinations: Coherent, on-topic chunks give the model less irrelevant context to misread, which can reduce unsupported answers. Focusing on meaningful segments can also make indexing and retrieval more efficient.

### Four breakpoint types are available for semantic splitting:
   - 'percentile': Splits where the difference exceeds the X-th percentile of all differences.
   - 'standard_deviation': Splits where the difference exceeds the mean by more than X standard deviations.
   - 'interquartile': Uses the interquartile range of the differences to determine split points.
   - 'gradient': Applies a percentile threshold to the gradient of the differences.


## Load Example Data

In [None]:
# This is a long document we can split up.
with open("../data/state_of_the_union.txt") as f:
    state_of_the_union = f.read()

## Create Text Splitter

To instantiate a [SemanticChunker](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html), we must specify an embedding model. Below we reuse the `AzureOpenAIEmbeddings` instance (`embedding_model`) created in the Setup section; other LangChain embedding classes, such as [OpenAIEmbeddings](https://python.langchain.com/api_reference/community/embeddings/langchain_community.embeddings.openai.OpenAIEmbeddings.html), can be passed in the same way.

In [None]:
from langchain_experimental.text_splitter import SemanticChunker

text_splitter = SemanticChunker(embedding_model)

## Split Text

We split text in the usual way, e.g., by invoking `.create_documents` to create LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) objects:

In [None]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

## Breakpoints

This chunker works by determining when to "break" apart sentences. It does this by measuring the difference between the embeddings of adjacent sentences. When that difference exceeds some threshold, the text is split at that point.

There are a few ways to determine that threshold, selected with the `breakpoint_threshold_type` keyword argument.
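
The break rule described above can be illustrated with a small, self-contained sketch. It assumes cosine distance as the measure of "difference" and uses hand-written 2-dimensional vectors in place of real embeddings. The function names and the fixed `threshold=0.5` are hypothetical choices for this toy example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    # 1 - cosine similarity; 0 means same direction, 1 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_at_breakpoints(sentences, embeddings, threshold):
    # Start a new chunk whenever two adjacent sentences are too far apart.
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today.", "Bonds rallied."]
# Toy vectors: the first two point one way, the last two another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
chunks = split_at_breakpoints(sentences, toy_embeddings, threshold=0.5)
print(chunks)
```

The threshold types below differ only in how they pick the threshold from the observed distances, instead of hard-coding it as this sketch does.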

### Percentile

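The default way to split is based on percentile. In this method, the differences between all adjacent sentences are calculated, and the text is split wherever a difference exceeds the X-th percentile of those differences. The value of X can be set with the `breakpoint_threshold_amount` keyword argument.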
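
As a rough illustration of the percentile rule, the sketch below computes a threshold from a made-up list of distances and reports which positions would become breakpoints. The `percentile` helper uses linear interpolation, and the value 80 is chosen only to suit this toy data; both are assumptions of the example rather than a statement of the library's exact computation.

```python
def percentile(values: list[float], q: float) -> float:
    # Linear-interpolation percentile for q in [0, 100].
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.11, 0.80, 0.09, 0.13, 0.75, 0.10, 0.12, 0.11, 0.14]

threshold = percentile(distances, 80)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

A higher percentile yields fewer, larger chunks, because fewer distances exceed the threshold.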

In [None]:
text_splitter = SemanticChunker(
    embedding_model, breakpoint_threshold_type="percentile"
)

In [None]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

In [None]:
print(len(docs))

### Standard Deviation

In this method, the text is split wherever a difference exceeds the mean difference by more than X standard deviations.
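
A small sketch of this rule on made-up distances follows. It assumes the threshold is the mean plus `num_std` population standard deviations; `num_std` is a hypothetical name, and the value 2 is picked only so the single outlier in this short list clears the threshold.

```python
import statistics

# Made-up distances between adjacent sentences, with one clear outlier.
distances = [0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]

num_std = 2.0  # illustrative value for this toy data
threshold = statistics.mean(distances) + num_std * statistics.pstdev(distances)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

With few sentences, a single outlier cannot sit many standard deviations above the mean, so a large multiplier may produce no breakpoints at all on short texts.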

In [None]:
text_splitter = SemanticChunker(
    embedding_model, breakpoint_threshold_type="standard_deviation"
)

In [None]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

In [None]:
print(len(docs))

### Interquartile

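In this method, the interquartile range of the differences (the spread between the 25th and 75th percentiles), scaled by a factor X, is used to set the split threshold.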
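
The sketch below shows one way such a threshold could be computed from made-up distances. It assumes the threshold is the mean distance plus a multiple of the interquartile range; the name `iqr_scale` and the factor 1.5 are assumptions of this example, so check the library source for the exact formula in your version.

```python
import statistics

# Made-up distances between adjacent sentences, with one clear outlier.
distances = [0.10, 0.12, 0.90, 0.14, 0.16]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
iqr = q3 - q1

iqr_scale = 1.5  # illustrative scaling factor
threshold = statistics.mean(distances) + iqr_scale * iqr
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

Because quartiles ignore the extreme values, this threshold tends to be less sensitive to a few very large distances than one based on the standard deviation.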

In [None]:
text_splitter = SemanticChunker(
    embedding_model, breakpoint_threshold_type="interquartile"
)

In [None]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

In [None]:
print(len(docs))

### Gradient

In this method, the gradient of the distances is used together with the percentile method to split chunks.
This method is useful when chunks are highly correlated with each other or specific to a domain, e.g., legal or medical text. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.
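
To make the idea concrete, the sketch below takes made-up distances that are all fairly similar, computes a numerical gradient (central differences in the interior, one-sided differences at the ends), and applies a percentile threshold to the gradient values. The `gradient` helper and the 90th percentile are assumptions chosen for this toy data, not a copy of the library's code.

```python
import statistics


def gradient(values: list[float]) -> list[float]:
    # One-sided differences at the ends, central differences in between.
    if len(values) < 2:
        raise ValueError("need at least two values")
    grads = [values[1] - values[0]]
    for i in range(1, len(values) - 1):
        grads.append((values[i + 1] - values[i - 1]) / 2)
    grads.append(values[-1] - values[-2])
    return grads


# Made-up distances: closely clustered, with one modest bump.
distances = [0.50, 0.52, 0.51, 0.53, 0.70, 0.54, 0.52]

grads = gradient(distances)
# quantiles(..., n=100) returns the 1st..99th percentiles; index 89 is the 90th.
threshold = statistics.quantiles(grads, n=100, method="inclusive")[89]
breakpoints = [i for i, g in enumerate(grads) if g > threshold]
print(threshold, breakpoints)
```

Note that the flagged position is where the distance is rising fastest, which in this toy data is one step before the largest distance itself.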

In [None]:
text_splitter = SemanticChunker(
    embedding_model, breakpoint_threshold_type="gradient"
)

In [None]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

In [None]:
print(len(docs))