# How to split text based on semantic similarity

Adapted from Greg Kamradt's excellent notebook:
[5_Levels_Of_Text_Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

All credit to him.

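This guide covers how to split text into chunks based on semantic similarity. If the embeddings of neighboring pieces of text are sufficiently far apart, the text is split between them.

At a high level, this splits the text into sentences, groups the sentences into groups of 3, and then merges groups that are similar in the embedding space.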
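The first two steps can be sketched with the standard library alone. This is a minimal illustration, not the library's code: the regex splitter, the helper names `split_sentences` and `window_sentences`, and the `buffer_size` parameter are assumptions made for this example. Each sentence is joined with one neighbor on either side, which gives groups of at most 3 sentences.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after '.', '?' or '!' followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def window_sentences(sentences: list[str], buffer_size: int = 1) -> list[str]:
    # Join each sentence with up to `buffer_size` neighbors on each side,
    # so buffer_size=1 yields groups of at most 3 sentences.
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


text = (
    "Last year COVID-19 kept us apart. This year we are finally together again. "
    "But he badly miscalculated. He met the Ukrainian people."
)
sentences = split_sentences(text)
windows = window_sentences(sentences)
print(len(sentences), len(windows))
print(windows[1])
```

Each window would then be embedded, and neighboring windows compared to decide where to split.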

## Install Dependencies

In [None]:
!pip install --quiet langchain_experimental langchain_openai

## Load Example Data

In [1]:
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

## Create Text Splitter

To instantiate a [SemanticChunker](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html), we must specify an embedding model. Below we will use [OpenAIEmbeddings](https://python.langchain.com/api_reference/community/embeddings/langchain_community.embeddings.openai.OpenAIEmbeddings.html).

In [4]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

## Split Text

We split text in the usual way, for example by calling `.create_documents` to create LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) objects:

In [5]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students t

## Breakpoints

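This chunker works by determining when to "break" sentences apart. It does this by measuring the difference in embeddings between adjacent sentences. When that difference exceeds a threshold, the text is split at that point.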
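The idea can be sketched as follows, assuming cosine distance as the measure of difference and a threshold that is already known. The toy 2-D vectors stand in for real embeddings, and the helper names are hypothetical, not part of the `SemanticChunker` API.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_at_breakpoints(sentences, embeddings, threshold):
    # Start a new chunk whenever the distance to the previous sentence
    # exceeds the threshold.
    if not sentences:
        return []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: two sentences about pets, two about markets.
toy_sentences = ["Cats purr.", "Kittens meow.", "Stocks fell.", "Bonds rallied."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = split_at_breakpoints(toy_sentences, toy_embeddings, threshold=0.5)
print(chunks)
```

Only the jump between the second and third sentence exceeds the threshold, so the sketch produces two chunks.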

There are a few ways to determine that threshold, controlled by the `breakpoint_threshold_type` keyword argument.

Note: if the resulting chunks are too small or too large, the additional keyword arguments `breakpoint_threshold_amount` and `min_chunk_size` can be used to adjust them.

### Percentile

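The default way to split is based on percentile. In this method, all differences between sentences are calculated, and the text is split wherever a difference is greater than the Xth percentile. The default value for X is 95.0; it can be adjusted with the keyword argument `breakpoint_threshold_amount`, which expects a number between 0.0 and 100.0.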
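As a rough illustration of the percentile rule, the sketch below computes a 95th-percentile threshold over a made-up list of distances. The `percentile` helper uses linear interpolation between ranks, which is an assumption about how the percentile is computed; the distance values are invented for the example.

```python
def percentile(values, q):
    # Linear-interpolation percentile for q in [0, 100]; assumes a non-empty list.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.60, 0.10, 0.13, 0.07]

threshold = percentile(distances, 95.0)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [6]
```

With only ten distances, the 95th percentile sits just below the largest value, so a single breakpoint survives. Lowering the amount (for example to 80.0) would also admit the spike at index 3.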

In [12]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)

In [13]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students t

In [14]:
print(len(docs))

26


### Standard Deviation

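In this method, the text is split wherever a difference is more than X standard deviations above the mean. The default value for X is 3.0; it can be adjusted with the keyword argument `breakpoint_threshold_amount`.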
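A minimal sketch of this rule, assuming the threshold is the mean distance plus X population standard deviations. The distances are invented, and `std_breakpoints` is a hypothetical helper, not a library function.

```python
import statistics


def std_breakpoints(distances, amount=3.0):
    # Threshold (assumption): mean + amount * population standard deviation.
    threshold = statistics.mean(distances) + amount * statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.60, 0.10, 0.13, 0.07]

print(std_breakpoints(distances, amount=3.0))  # []
print(std_breakpoints(distances, amount=1.0))  # [3, 6]
```

On this toy data the default of 3.0 is strict enough that nothing is split, which is consistent with this method producing far fewer chunks than the percentile method in the runs shown here.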

In [15]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation"
)

In [16]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students t

In [17]:
print(len(docs))

4


### Interquartile

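In this method, the interquartile range is used to split chunks. The interquartile range can be scaled with the keyword argument `breakpoint_threshold_amount`; the default value is 1.5.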
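One way to picture this rule is sketched below, assuming the threshold is the mean distance plus the scaled interquartile range (Q3 minus Q1). Both that formula and the linear-interpolation `percentile` helper are assumptions for illustration, and the distances are made up.

```python
import statistics


def percentile(values, q):
    # Linear-interpolation percentile for q in [0, 100]; assumes a non-empty list.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def iqr_breakpoints(distances, amount=1.5):
    # Threshold (assumption): mean + amount * (Q3 - Q1).
    iqr = percentile(distances, 75.0) - percentile(distances, 25.0)
    threshold = statistics.mean(distances) + amount * iqr
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.60, 0.10, 0.13, 0.07]

print(iqr_breakpoints(distances))  # [3, 6]
```

Because the interquartile range ignores the extreme values, the two spikes barely affect the spread estimate, and both are flagged as breakpoints.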

In [18]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="interquartile"
)

In [19]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students t

In [20]:
print(len(docs))

25


### Gradient

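In this method, the gradient of the distances is used together with the percentile method to split chunks. It is useful when chunks are highly correlated with each other or specific to a domain such as legal or medical text. The idea is to apply anomaly detection to the gradient array, so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.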

As with the percentile method, the split can be adjusted with the keyword argument `breakpoint_threshold_amount`, which expects a number between 0.0 and 100.0; the default value is 95.0.
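The sketch below illustrates the idea under two assumptions: the gradient is a finite-difference estimate (central differences in the interior, one-sided at the ends), and the percentile threshold is applied to the gradient values rather than to the raw distances. The helper names and the distance values are hypothetical.

```python
def percentile(values, q):
    # Linear-interpolation percentile for q in [0, 100]; assumes a non-empty list.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def gradient(values):
    # Finite differences with unit spacing; assumes at least two values.
    n = len(values)
    grads = [values[1] - values[0]]
    for i in range(1, n - 1):
        grads.append((values[i + 1] - values[i - 1]) / 2.0)
    grads.append(values[-1] - values[-2])
    return grads


# Hypothetical distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.60, 0.10, 0.13, 0.07]

grads = gradient(distances)
threshold = percentile(grads, 95.0)
breakpoints = [i for i, g in enumerate(grads) if g > threshold]
print(breakpoints)  # [5]
```

Note that the flagged index marks the steepest rise in distance, which sits just before the raw-distance peak at index 6, so the gradient rule reacts to changes in distance rather than to its absolute size.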

In [None]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="gradient"
)

In [6]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.


In [8]:
print(len(docs))

26
