# SemanticChunker

- Author: [Wonyoung Lee](https://github.com/BaBetterB)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)




## Overview

This tutorial covers `SemanticChunker`, a text splitter that splits text based on semantic similarity.

Rather than dividing text at fixed intervals, `SemanticChunker` embeds the text and places chunk boundaries where the meaning shifts. This tutorial uses OpenAI's embedding model to measure how similar neighboring pieces of text are. The splitter supports several ways of choosing the split threshold, including percentile, standard deviation, and interquartile range. Because it breaks the text at natural topic boundaries, the resulting chunks tend to be more coherent and preserve more of the original document's context than fixed-size chunks, which can improve results when the chunks are passed to a large language model.

The approach follows [Greg Kamradt's Notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb).

The method splits the text into sentences, combines each sentence with its neighbors into groups of three, embeds each group, and then merges adjacent groups that are close in the embedding space.
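
The sentence-grouping step can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper names `split_sentences` and `group_with_neighbors`, the `buffer_size` parameter, and the sentence-splitting regex are assumptions made for this example.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Assumed rule: a sentence ends at '.', '?', or '!' followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def group_with_neighbors(sentences: list[str], buffer_size: int = 1) -> list[str]:
    # Combine each sentence with up to `buffer_size` neighbors on each side.
    # With buffer_size=1, interior sentences form groups of three.
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        groups.append(" ".join(sentences[start:end]))
    return groups


text = "Cats purr. Dogs bark. Birds sing. Stocks fell."
sentences = split_sentences(text)
groups = group_with_neighbors(sentences)

for group in groups:
    print(group)
```

Each group, rather than each bare sentence, would then be embedded, so the embedding of a sentence also reflects its immediate context.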

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Creating a SemanticChunker](#creating-a-semanticchunker)
- [Text Splitting](#text-splitting)
- [Breakpoints](#breakpoints)

### References

- [Greg Kamradt's Notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)


----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

Install the required packages, set the API keys, and then load the sample text and print part of its content.

In [1]:
%%capture --no-stderr
!pip install langchain-opentutorial




In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        "langchain_experimental",  # provides SemanticChunker
    ],
    verbose=False,
    upgrade=False,
)

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "SemanticChunker",  # title
    }
)

Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

**[Note]** This step is not necessary if you have already set `OPENAI_API_KEY` in the previous step.

In [7]:
# Configuration File for Managing API Keys as Environment Variables
from dotenv import load_dotenv

# Load API Key Information
load_dotenv(override=True)

True

In [8]:
# Open data/appendix-keywords.txt and read its entire contents into `file`.
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()


# Print part of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


## Creating a SemanticChunker

`SemanticChunker` is one of LangChain's experimental features (it lives in the `langchain_experimental` package). It divides text into chunks whose sentences are semantically related, which makes the chunks easier to process and analyze downstream.

Create a `SemanticChunker` by passing it an embedding model.


In [9]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize a semantic chunk splitter using OpenAI embeddings.
text_splitter = SemanticChunker(OpenAIEmbeddings())

## Text Splitting

- Use `text_splitter.split_text()` to divide the `file` text into a list of chunks (plain strings).

In [23]:
chunks = text_splitter.split_text(file)

Check the divided chunks.

In [11]:
# Print the first chunk among the divided chunks.
print(chunks[0])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks. Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.


To get `Document` objects instead of plain strings, use the `create_documents()` method, which takes a list of texts.


In [12]:
# Split the text and wrap each chunk in a Document.
docs = text_splitter.create_documents([file])
# Print the content of the first document.
print(docs[0].page_content)

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks. Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.


## Breakpoints
This chunker works by deciding where to split between sentences.
It does so by measuring the distance between the embeddings of adjacent sentences.
If the distance exceeds a certain threshold, the text is split at that point.

- Reference video: https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580
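
The idea can be illustrated with a small, self-contained sketch. It uses hand-written two-dimensional vectors in place of real embeddings and a fixed threshold; the function names `cosine_distance` and `split_at_breakpoints` are hypothetical, and cosine distance is an assumption about how the "difference" is measured.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, larger for dissimilar vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_at_breakpoints(sentences, embeddings, threshold):
    # Start a new chunk whenever two adjacent sentences are farther apart
    # than `threshold` in the embedding space.
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


# Toy data: the first two vectors point one way, the last two another way.
sentences = ["Cats purr.", "Dogs bark.", "Stocks fell.", "Bonds rose."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

chunks = split_at_breakpoints(sentences, embeddings, threshold=0.5)
print(chunks)
```

Here the only large distance is between the second and third sentences, so the text is split into two chunks at that point. In `SemanticChunker` the threshold is not a fixed number but is derived from the distribution of distances, as the following sections describe.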

### Percentile
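The default splitting method is based on percentiles (`breakpoint_threshold_type="percentile"`).
In this method, the distances between all adjacent sentences are calculated, and a split is made wherever the distance exceeds the percentile given by `breakpoint_threshold_amount`. The example below uses the 70th percentile, so roughly the largest 30% of distances become split points.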
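
As a rough sketch of how a percentile threshold turns distances into split points, consider the following. The distances are made-up values, and the `percentile` helper (linear interpolation between ranks) is an assumption for illustration; the library may compute the percentile differently.

```python
def percentile(values, q):
    # Percentile with linear interpolation between neighboring ranks (q in 0..100).
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Hypothetical distances between adjacent sentences.
distances = [0.10, 0.12, 0.55, 0.11, 0.60, 0.09]

threshold = percentile(distances, 70)
# Indices of the gaps whose distance exceeds the threshold.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

print(f"threshold={threshold:.3f}, breakpoints={breakpoints}")
```

Raising the percentile raises the threshold, which yields fewer and larger chunks; lowering it does the opposite.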


In [13]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model
    OpenAIEmbeddings(),
    # Set the split breakpoint type to percentile
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)

Check the split results.


In [14]:
docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of the current document.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
[Chunk 1]

Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
[Chunk 2]

Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing

To

Print the length of `docs`.

In [15]:
print(len(docs))  # Print the length of docs.

27


### Standard Deviation

In this method, a split occurs wherever the distance between adjacent sentences exceeds the mean distance by more than `breakpoint_threshold_amount` standard deviations.

- Set the `breakpoint_threshold_type` parameter to `"standard_deviation"` to split chunks based on standard deviation.
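
The standard deviation criterion can be sketched as follows. The distances are made-up values, and the formula (mean plus a multiple of the population standard deviation) reflects an assumption about how the threshold is computed, so treat it as an illustration rather than the library's exact code.

```python
import statistics

# Hypothetical distances between adjacent sentences.
distances = [0.10, 0.12, 0.55, 0.11, 0.60, 0.09]


def std_threshold(values, amount):
    # Assumed rule: mean distance plus `amount` population standard deviations.
    return statistics.fmean(values) + amount * statistics.pstdev(values)


def breakpoints_for(values, amount):
    threshold = std_threshold(values, amount)
    return [i for i, d in enumerate(values) if d > threshold]


print(breakpoints_for(distances, 1.25))
print(breakpoints_for(distances, 1.5))
```

With these values, a multiplier of 1.25 marks both large gaps as split points, while 1.5 keeps only the largest one. A larger `breakpoint_threshold_amount` therefore produces fewer chunks.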

In [16]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Use standard deviation as the splitting criterion.
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,
)

Check the split results.

In [17]:
# Split using text_splitter.
docs = text_splitter.create_documents([file])

In [18]:
# `docs` was created in the previous cell, so there is no need to split again.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of the current document.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks. Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
[Chunk 1]

Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing

Tokenizer

De

Print the length of `docs`.

In [19]:
print(len(docs))  # Print the length of docs.

15


### Interquartile

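This method splits where the distance between adjacent sentences is unusually large relative to the interquartile range (IQR) of all distances, with `breakpoint_threshold_amount` scaling the IQR.

- Set the `breakpoint_threshold_type` parameter to `"interquartile"` to split chunks based on the interquartile range.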
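
A minimal sketch of an IQR-based threshold is shown below. The distances are made-up values, and both the `percentile` helper and the formula (mean plus a multiple of the IQR) are assumptions made for illustration; check the library source for the exact rule.

```python
import statistics


def percentile(values, q):
    # Percentile with linear interpolation between neighboring ranks (q in 0..100).
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Hypothetical distances between adjacent sentences.
distances = [0.10, 0.12, 0.55, 0.11, 0.60, 0.09]

# IQR: spread of the middle half of the distances.
iqr = percentile(distances, 75) - percentile(distances, 25)

amount = 0.5  # plays the role of breakpoint_threshold_amount
threshold = statistics.fmean(distances) + amount * iqr  # assumed rule
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

print(f"iqr={iqr:.3f}, threshold={threshold:.3f}, breakpoints={breakpoints}")
```

Because the IQR ignores the extreme values, this criterion is less sensitive to a few very large distances than the standard deviation criterion.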


In [20]:
text_splitter = SemanticChunker(
    # Initialize the semantic chunk splitter using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Set the breakpoint threshold type to interquartile range.
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5,
)

In [21]:
# Split using text_splitter.
docs = text_splitter.create_documents([file])

# Print the results.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of the current document.
    print("===" * 20)

[Chunk 0]

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
[Chunk 1]

Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
[Chunk 2]

Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing

To

Print the length of `docs`.


In [22]:
print(len(docs))  # Print the length of docs.

23
