# [How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)

Taken from Greg Kamradt's wonderful notebook: [5_Levels_Of_Text_Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

At a high level, this splits the text into sentences, groups them into windows of 3 sentences, and then merges the groups that are similar in embedding space.
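The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the sentence splitter is a naive regex, and the threshold is a fixed number chosen by hand.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive split after '.', '?' or '!' followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def windows(sentences, buffer_size=1):
    # Combine each sentence with its neighbors (groups of up to 3 sentences).
    out = []
    for i in range(len(sentences)):
        lo = max(0, i - buffer_size)
        hi = min(len(sentences), i + buffer_size + 1)
        out.append(" ".join(sentences[lo:hi]))
    return out


def toy_embed(text):
    # Hypothetical stand-in for an embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0


def semantic_chunks(text, threshold):
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return sentences
    vectors = [toy_embed(w) for w in windows(sentences)]
    # Distance between each window and the next one.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    chunks, start = [], 0
    for i, d in enumerate(distances):
        if d > threshold:  # large jump: close the current chunk here
            chunks.append(" ".join(sentences[start : i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


sample = "Cats purr. Cats nap. Cats play. Stocks fell. Stocks rose. Stocks crashed."
for chunk in semantic_chunks(sample, threshold=0.2):
    print(chunk)
```

In practice the threshold is not fixed by hand but derived from the distribution of distances, which is what the breakpoint settings below control.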

In [None]:
import rich
from dotenv import load_dotenv

# Load environment variables (such as OPENAI_API_KEY) from a local .env file.
load_dotenv()

In [None]:
# This is a long document we can split up.
with open("../../text_files/state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

In [None]:
docs = text_splitter.create_documents([state_of_the_union])

In [None]:
print(docs[0].page_content)

### Breakpoints
This chunker works by deciding where to "break" the text between sentences. It does this by measuring the difference between the embeddings of consecutive sentences. When that difference exceeds some threshold, the text is split at that point.

There are a few ways to determine that threshold, controlled by the `breakpoint_threshold_type` kwarg.

#### Percentile (default)
The default way to split is based on percentile. In this method, all differences between consecutive sentences are calculated, and the text is split wherever a difference is greater than the Xth percentile.

default `breakpoint_threshold_amount` = 95 (in source code)
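To make the percentile rule concrete, here is a small sketch on made-up distances. The `percentile` helper is a hypothetical pure-Python stand-in that uses linear interpolation (the default behavior of `numpy.percentile`); the numbers are invented for illustration and are not output from the chunker.

```python
def percentile(values, q):
    # Linear-interpolation percentile, with q in [0, 100].
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def percentile_breakpoints(distances, q):
    # Indices whose distance exceeds the q-th percentile of all distances.
    threshold = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up distances between consecutive sentences (two obvious jumps).
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.13, 0.60, 0.10, 0.12]

print(percentile_breakpoints(distances, 95))  # only the largest jump
print(percentile_breakpoints(distances, 80))  # both jumps
```

Lowering the percentile lowers the threshold, so more positions qualify as breakpoints and the resulting chunks get smaller.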


In [None]:
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")

In [None]:
docs = text_splitter.create_documents([state_of_the_union])

In [None]:
print(len(docs), len(docs[0].page_content))
print(docs[0].page_content)

In [None]:
for doc in docs:
    print(len(doc.page_content), end=",")

#### Standard Deviation
In this method, the text is split wherever a difference is more than X standard deviations above the mean difference.

default `breakpoint_threshold_amount` = 3 (in source code)
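A minimal sketch of this rule, assuming the threshold is the mean distance plus X population standard deviations. The distances are made up for illustration, and `std_breakpoints` is a hypothetical helper, not part of the library.

```python
import statistics


def std_breakpoints(distances, amount):
    # Threshold: mean plus `amount` population standard deviations.
    mean = statistics.mean(distances)
    std = statistics.pstdev(distances)
    threshold = mean + amount * std
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up distances between consecutive sentences (two obvious jumps).
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.13, 0.60, 0.10, 0.12]

for amount in (3, 2, 1):
    print(amount, std_breakpoints(distances, amount))
```

On a short list like this one, the large jumps themselves inflate the standard deviation, so a threshold of 3 standard deviations can end up above every distance. That is one reason to try a smaller amount on short or noisy inputs.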

In [None]:
# Use 2 standard deviations instead of the default 3, which lowers the threshold.
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=2,
)

In [None]:
docs = text_splitter.create_documents([state_of_the_union])

In [None]:
print(len(state_of_the_union), len(docs[0].page_content))
print(docs[0].page_content)

In [None]:
for doc in docs:
    print(len(doc.page_content), end=",")

#### Interquartile
In this method, the interquartile range (IQR) of the differences, scaled by `breakpoint_threshold_amount`, is used to set the threshold for splitting chunks.

default `breakpoint_threshold_amount` = 1.5 (in source code)
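A minimal sketch of an IQR-based threshold, assuming the cutoff is the mean distance plus X times the interquartile range. The distances are made up, and `iqr_breakpoints` is a hypothetical helper used only for illustration.

```python
import statistics


def iqr_breakpoints(distances, amount):
    # Quartiles with linear interpolation over the observed values.
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    iqr = q3 - q1
    threshold = statistics.mean(distances) + amount * iqr
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up distances between consecutive sentences (two obvious jumps).
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.13, 0.60, 0.10, 0.12]

print(iqr_breakpoints(distances, 1.5))
```

Because the IQR only looks at the middle half of the values, a few very large jumps do not inflate the spread estimate the way they inflate a standard deviation.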

In [None]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5,
)

In [None]:
docs = text_splitter.create_documents([state_of_the_union])

In [None]:
print(len(state_of_the_union), len(docs[0].page_content))
print(docs[0].page_content)

In [None]:
for doc in docs:
    print(len(doc.page_content), end=",")

#### Gradient
In this method, the gradient of the distances is used together with the percentile method to split chunks. This method is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical text. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.

default `breakpoint_threshold_amount` = 95 (in source code)
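A rough sketch of the idea, assuming the gradient is a central finite difference over the distance array (one-sided at the ends, as `numpy.gradient` does) and that the percentile threshold is then applied to the gradient values rather than to the raw distances. The data and the helper names are made up for illustration.

```python
import statistics


def gradient(values):
    # Central differences in the interior, one-sided differences at the ends.
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    grad = [values[1] - values[0]]
    grad += [(values[i + 1] - values[i - 1]) / 2 for i in range(1, n - 1)]
    grad.append(values[-1] - values[-2])
    return grad


def gradient_breakpoints(distances, q=95):
    # q must be an integer from 1 to 99: with n=100 cut points,
    # index q - 1 is the q-th percentile (linear interpolation).
    grad = gradient(distances)
    threshold = statistics.quantiles(grad, n=100, method="inclusive")[q - 1]
    return [i for i, g in enumerate(grad) if g > threshold]


# Made-up distances from a homogeneous document: all close together,
# with one modest bump at index 4.
distances = [0.20, 0.21, 0.22, 0.21, 0.30, 0.22, 0.21, 0.20]

print(gradient_breakpoints(distances))
```

With this kind of data the raw distances sit in a narrow band, but the rate of change just before the bump stands out clearly, which is what the percentile cut on the gradient picks up.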

In [None]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="gradient"
)

In [None]:
docs = text_splitter.create_documents([state_of_the_union])

In [None]:
print(len(state_of_the_union), len(docs[0].page_content))
print(docs[0].page_content)

In [None]:
for doc in docs:
    print(len(doc.page_content), end=",")