# [How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)

Taken from Greg Kamradt's wonderful notebook: [5_Levels_Of_Text_Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

At a high level, this approach splits the text into sentences, groups them into windows of 3 sentences, and then merges the windows that are similar in the embedding space.
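The three steps above can be sketched without any API calls. This is only an illustration, not the library's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the regex sentence splitter and the one-sentence buffer on each side are assumptions made for the example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Split after '.', '?' or '!' followed by whitespace (a simple heuristic).
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def window_sentences(sentences, buffer_size=1):
    # Join each sentence with its neighbours: up to 3 sentences per window.
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


def toy_embed(text):
    # Hypothetical stand-in for an embedding model: raw word counts.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


text = (
    "Cats purr. Cats nap all day. Cats chase mice. "
    "Stocks fell today. Stocks may fall again. Markets closed early."
)
sentences = split_sentences(text)
windows = window_sentences(sentences)
embeddings = [toy_embed(w) for w in windows]

# One distance per pair of neighbouring windows.
distances = [
    cosine_distance(embeddings[i], embeddings[i + 1])
    for i in range(len(embeddings) - 1)
]
print(len(sentences), len(distances))  # 6 5
```

Neighbouring windows with a small distance end up in the same chunk; a large distance marks a candidate split point.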

In [5]:
import rich
from dotenv import load_dotenv

# Load environment variables (e.g. OPENAI_API_KEY) from a local .env file.
load_dotenv()

In [1]:
# This is a long document we can split up.
with open("../../text_files/state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [3]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

In [4]:
docs = text_splitter.create_documents([state_of_the_union])

In [6]:
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students t

### Breakpoints
This chunker works by deciding where to "break" the text between sentences. It does so by measuring the difference between the embeddings of adjacent sentences. When that difference exceeds some threshold, the text is split at that point.

There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg.
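Whichever threshold type is chosen, the splitting step itself can be sketched as follows. The helper `split_at_breakpoints` and the distance values are hypothetical and exist only for this example; the real chunker computes its distances from embeddings.

```python
def split_at_breakpoints(sentences, distances, threshold):
    # distances[i] is the gap between sentences[i] and sentences[i + 1].
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            # Close the current chunk right after sentence i.
            chunks.append(" ".join(sentences[start : i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


sentences = ["A1.", "A2.", "B1.", "B2.", "B3."]
distances = [0.10, 0.80, 0.15, 0.20]  # made-up values for illustration
chunks = split_at_breakpoints(sentences, distances, threshold=0.5)
print(chunks)  # ['A1. A2.', 'B1. B2. B3.']
```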

#### Percentile (default)
The default way to split is based on percentile. In this method, the differences between all adjacent sentences are calculated, and a split is made wherever a difference is greater than the Xth percentile.

default `breakpoint_threshold_amount` = 95 (in source code)
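As a rough sketch of the percentile rule, the snippet below computes the 95th percentile of some made-up distances with the standard library and marks every position above it. The distance values are invented for the example, and the linear-interpolation percentile is an assumption about how the cut-off is computed.

```python
import statistics

# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.11, 0.70, 0.13, 0.09, 0.14, 0.65, 0.10, 0.12]

# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
threshold = statistics.quantiles(distances, n=100, method="inclusive")[94]

breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [3]
```

With X = 95, only the largest 5% of the distances can trigger a split, so in this small sample only the biggest spike (index 3) qualifies.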


In [42]:
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")

In [43]:
docs = text_splitter.create_documents([state_of_the_union])

In [44]:
print(len(docs), len(docs[0].page_content))
print(docs[0].page_content)

26 1601
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from st

In [45]:
for doc in docs:
    print(len(doc.page_content), end=",")

1601,523,20,2123,2018,726,363,1266,4109,755,1324,861,602,207,2264,980,988,791,2371,6628,3342,74,362,520,1259,1633,

#### Standard Deviation
In this method, a split is made wherever a difference is more than X standard deviations above the mean difference.

default `breakpoint_threshold_amount` = 3 (in source code)
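A minimal sketch of this rule, assuming the threshold is the mean distance plus X population standard deviations. The helper `std_breakpoints` and the distance values are hypothetical.

```python
import statistics

# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.11, 0.70, 0.13, 0.09, 0.14, 0.65, 0.10, 0.12]


def std_breakpoints(distances, amount):
    # Threshold: mean + amount * (population) standard deviation.
    mean = statistics.fmean(distances)
    std = statistics.pstdev(distances)
    threshold = mean + amount * std
    return [i for i, d in enumerate(distances) if d > threshold]


print(std_breakpoints(distances, 2))  # [3]
print(std_breakpoints(distances, 3))  # []
```

Raising X makes the chunker more conservative: with 3 standard deviations, no distance in this small sample is extreme enough to cause a split.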

In [46]:
# Use 2 standard deviations here instead of the default of 3.
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation", breakpoint_threshold_amount=2)

In [47]:
docs = text_splitter.create_documents([state_of_the_union])

In [48]:
print(len(state_of_the_union), len(docs[0].page_content))
print(docs[0].page_content)

38539 1601
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from

In [49]:
for doc in docs:
    print(len(doc.page_content), end=",")

1601,2668,2018,726,363,1266,4109,755,1324,1464,207,2264,1969,791,2371,6628,3342,437,1780,1633,

#### Interquartile
In this method, the interquartile range of the differences is used to set the threshold at which chunks are split.

default `breakpoint_threshold_amount` = 1.5 (in source code)
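The following sketch assumes the threshold is the mean distance plus X times the interquartile range (Q3 minus Q1); that formula and the distance values are assumptions made for illustration.

```python
import statistics

# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.11, 0.70, 0.13, 0.09, 0.14, 0.65, 0.10, 0.12]

# Quartiles with linear interpolation; the median is not needed here.
q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
iqr = q3 - q1

amount = 1.5
threshold = statistics.fmean(distances) + amount * iqr

breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [3, 7]
```

Because the interquartile range ignores the extreme values, the two spikes barely move the threshold, and both are flagged as breakpoints.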

In [50]:
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="interquartile", breakpoint_threshold_amount=1.5)

In [51]:
docs = text_splitter.create_documents([state_of_the_union])

In [52]:
print(len(state_of_the_union), len(docs[0].page_content))
print(docs[0].page_content)

38539 1601
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from

In [53]:
for doc in docs:
    print(len(doc.page_content), end=",")

1601,523,2144,2018,726,363,1266,4109,755,1324,861,602,207,2264,980,988,791,2371,6628,3342,74,362,520,1259,1633,

#### Gradient
In this method, the gradient of the distances is used together with the percentile method to split chunks. It is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical text. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.

default `breakpoint_threshold_amount` = 95 (in source code)
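As a sketch of the idea, the snippet below takes a numerical gradient of some made-up distances and applies the 95th-percentile cut-off to the gradient rather than to the raw distances. The `gradient` helper is hypothetical (central differences inside, one-sided differences at the ends), and the distance values are invented for the example.

```python
import statistics

# Made-up distances between adjacent sentences.
distances = [0.10, 0.12, 0.11, 0.70, 0.13, 0.09, 0.14, 0.65, 0.10, 0.12]


def gradient(values):
    # Central differences inside, one-sided differences at the two ends.
    n = len(values)
    grads = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        grads.append((values[hi] - values[lo]) / (hi - lo))
    return grads


grads = gradient(distances)

# Percentile threshold on the gradient array, not on the raw distances.
threshold = statistics.quantiles(grads, n=100, method="inclusive")[94]

breakpoints = [i for i, g in enumerate(grads) if g > threshold]
print(breakpoints)  # [2]
```

The gradient is largest where the distance is rising fastest, so the breakpoint here lands on the step leading into the spike (index 2) rather than on the spike itself.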

In [36]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="gradient"
)

In [37]:
docs = text_splitter.create_documents([state_of_the_union])

In [38]:
print(len(state_of_the_union), len(docs[0].page_content))
print(docs[0].page_content)

38539 73
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.


In [41]:
for doc in docs:
    print(len(doc.page_content), end=",")

73,36,1472,1781,2861,767,336,1299,803,3571,388,363,2481,227,779,428,993,2425,2766,1162,689,1499,3297,3352,2198,1664,