# Text Splitting Methods in NLP

- Author: [Ilgyun Jeong](https://github.com/johnny9210)
- Peer Review : [JoonHo Kim](https://github.com/jhboyo), [Sunyoung Park (architectyou)](https://github.com/Architectyou)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb)

## Overview
Text splitting is a crucial preprocessing step in Natural Language Processing (NLP). This tutorial covers various text splitting methods and tools, exploring their advantages, disadvantages, and appropriate use cases.

Main approaches to text splitting:

1. **Token-based Splitting**
   - Tiktoken: OpenAI's high-performance BPE tokenizer
   - Hugging Face tokenizers: Tokenizers for various pre-trained models
   
2. **Sentence-based Splitting**
   - SentenceTransformers: Splits text while maintaining semantic coherence
   - NLTK: Natural language processing based sentence and word splitting
   - spaCy: Text splitting utilizing advanced language processing capabilities

3. **Language-specific Tools**
   - KoNLPy: Specialized splitting tool for Korean text processing

Each tool has its unique characteristics and advantages:
- ```Tiktoken``` offers fast processing speed and compatibility with OpenAI models
- ```SentenceTransformers``` splits text into chunks that fit the token window of a sentence-transformer model
- ```NLTK``` and ```spaCy``` implement linguistic rule-based splitting
- ```KoNLPy``` specializes in Korean morphological analysis and splitting

Through this tutorial, you will understand the characteristics of each tool and learn to choose the most suitable text splitting method for your project.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Basic Usage of Tiktoken](#basic-usage-of-tiktoken)
- [Basic Usage of TokenTextSplitter](#basic-usage-of-tokentextsplitter)
- [Basic Usage of spaCy](#basic-usage-of-spacy)
- [Basic Usage of SentenceTransformers](#basic-usage-of-sentencetransformers)
- [Basic Usage of NLTK](#basic-usage-of-nltk)
- [Basic Usage of KoNLPy](#basic-usage-of-konlpy)
- [Basic Usage of Hugging Face tokenizers](#basic-usage-of-hugging-face-tokenizers)

### References

- [LangChain: How to split text by tokens](https://python.langchain.com/docs/how_to/split_by_token/)
- [LangChain: TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- See [langchain-opentutorial](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_text_splitters",
        "tiktoken",
        "spacy",
        "sentence-transformers",
        "nltk",
        "konlpy",
    ],
    verbose=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "TokenTextSplitter",
    }
)

Environment variables have been set successfully.


Alternatively, you can set ```OPENAI_API_KEY``` in a ```.env``` file and load it.

[Note] This step is not necessary if you have already set ```OPENAI_API_KEY``` in the previous steps.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Basic Usage of ```tiktoken```

tiktoken is a fast BPE tokenizer created by OpenAI.
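
To make the term BPE (byte pair encoding) concrete, the following toy sketch repeatedly merges the most frequent adjacent pair of symbols. It is an illustration of the idea only, not tiktoken's implementation, which works on bytes with a pre-trained merge table. The helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_pair(tokens: List[str]) -> Tuple[str, str]:
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("abababc")  # start from single characters
for _ in range(2):  # two merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # → ['abab', 'ab', 'c']
```

Frequent character sequences end up as single tokens, which is why common words usually cost fewer tokens than rare ones.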

- Open the file ./data/appendix-keywords.txt and read its contents.
- Store the read content in the file variable.

In [5]:
# Open the file data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

Print a portion of the content read from the file.

In [6]:
# Print a portion of the content read from the file.
print(file[:500])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders


Use the CharacterTextSplitter to split the text.

- Initialize the text splitter using the from_tiktoken_encoder method, which is based on the Tiktoken encoder.

In [7]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    # Set the chunk size to 300 tokens.
    chunk_size=300,
    # Set the overlap between chunks to 50 tokens.
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

Print the number of divided chunks.

In [8]:
print(len(texts))  # Output the number of divided chunks.

10


Print the first element of the texts list.

In [9]:
# Print the first element of the texts list.
print(texts[0])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.
Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.
Associated keywords: tokenization, natural language processing, parsing

Tokenizer


Reference
- When using CharacterTextSplitter.from_tiktoken_encoder, the text is split solely by CharacterTextSplitter, and the Tiktoken tokenizer is only used to measure and merge the divided text. (This means that the split text might exceed the chunk size as measured by the Tiktoken tokenizer.)
- When using RecursiveCharacterTextSplitter.from_tiktoken_encoder, the divided text is ensured not to exceed the chunk size allowed by the language model. If a split text exceeds this size, it is recursively divided. Additionally, you can directly load the Tiktoken splitter, which guarantees that each split is smaller than the chunk size.
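
The difference between the two notes above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `count_tokens` is a toy length function that counts whitespace-separated words, the function names are hypothetical, and the merging step that real splitters perform afterwards is omitted.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Toy length function (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_on_separator(text: str, separator: str) -> List[str]:
    """Split once on a single separator; pieces are never subdivided further."""
    return [piece for piece in text.split(separator) if piece.strip()]


def split_recursively(
    text: str,
    separators: List[str],
    chunk_size: int,
    length: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split on the first separator, then recurse into any piece that is still too long."""
    if length(text) <= chunk_size or not separators:
        return [text]
    first, rest = separators[0], separators[1:]
    pieces: List[str] = []
    for piece in split_on_separator(text, first):
        pieces.extend(split_recursively(piece, rest, chunk_size, length))
    return pieces


sample = "short paragraph\n\none two three four five six seven eight"
single = split_on_separator(sample, "\n\n")
recursive = split_recursively(sample, ["\n\n", " "], chunk_size=4)

print([count_tokens(p) for p in single])     # → [2, 8], the second piece exceeds 4
print(max(count_tokens(p) for p in recursive))  # → 1, every piece fits within 4
```

With a single separator, a long paragraph stays in one piece no matter how many tokens it contains. The recursive variant falls back to finer separators until each piece fits.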

## Basic Usage of ```TokenTextSplitter```

Use the TokenTextSplitter class to split the text into token-based chunks.

In [10]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=200,  # Set the chunk size to 200 tokens.
    chunk_overlap=50,  # Set the overlap between chunks to 50 tokens.
)

# Split the file text into chunks.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first chunk of the divided text.

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.
Example: Split the sentence “I am going to school
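
The first chunk above stops in the middle of a sentence because the cut is made at a token boundary, not at a sentence boundary. A minimal sketch of this sliding-window idea follows. It assumes a toy tokenizer (`toy_encode`, one token per word) in place of tiktoken, and `split_by_tokens` is a hypothetical helper, not the library's code.

```python
from typing import List


def toy_encode(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut the token sequence into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    tokens = toy_encode(text)
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start : start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


chunks = split_by_tokens("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(chunks)  # → ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last token of the previous one, which is the role `chunk_overlap` plays in the cell above.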


## Basic Usage of ```spaCy```

spaCy is an open-source software library for advanced natural language processing, written in Python and Cython programming languages.

The spaCy tokenizer can be used as an alternative to NLTK for splitting text.

1. How the text is divided: The text is split using the spaCy tokenizer.
2. How the chunk size is measured: It is measured by the number of characters.

Download the en_core_web_sm model.

In [11]:
!python -m spacy download en_core_web_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Open the appendix-keywords.txt file and read its contents.

In [12]:
# Open the file data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

Verify by printing a portion of the content.

In [13]:
# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


Create a text splitter using the SpacyTextSplitter class.


In [14]:
import warnings
from langchain_text_splitters import SpacyTextSplitter

# Suppress warning messages.
warnings.filterwarnings("ignore")

# Create the SpacyTextSplitter.
text_splitter = SpacyTextSplitter(
    chunk_size=200,  # Set the chunk size to 200.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

Use the **split_text** method of the **text_splitter** object to split the ```file``` text.

In [15]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 241, which is longer than the specified 200
Created a chunk of size 225, which is longer than the specified 200
Created a chunk of size 211, which is longer than the specified 200
Created a chunk of size 231, which is longer than the specified 200
Created a chunk of size 230, which is longer than the specified 200
Created a chunk of size 219, which is longer than the specified 200
Created a chunk of size 214, which is longer than the specified 200
Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 203, which is longer than the specified 200
Created a chunk of size 211, which is longer than the specified 200
Created a chunk of size 218, which is longer than the specified 200
Created a chunk of size 230, which is longer than the specified 200


Semantic Search

Definition: A vector store is a system that stores data converted to vector format.

It is used for search, classification, and other data analysis tasks.
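
The "Created a chunk of size ..." messages above appear when a single sentence is already longer than `chunk_size` characters: whole sentences are packed into chunks and are not cut further. The sketch below illustrates this under stated assumptions. A crude regular expression stands in for the spaCy sentence tokenizer, the helper names are hypothetical, and chunk overlap is left out.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Crude stand-in for a sentence tokenizer (assumption): break after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def merge_sentences(sentences: List[str], chunk_size: int, joiner: str = "\n\n") -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + joiner + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the current chunk and start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Short one. " + "A much longer sentence " * 3 + "ends here. Tiny."
chunks = merge_sentences(split_sentences(text), chunk_size=40)
oversized = [len(c) for c in chunks if len(c) > 40]
print(len(chunks))  # → 3
print(oversized)    # → [79], the long sentence cannot be shortened by merging
```

If such oversized chunks are a problem, a larger `chunk_size` or a recursive splitter is the usual remedy.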


## Basic Usage of ```SentenceTransformers```

SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models.

Its default behavior is to split text into chunks that fit within the token window of the sentence-transformer model being used.


In [16]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Create a sentence splitter and set the overlap between chunks to 50.
splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=50)

Check the sample text.

In [17]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


The following code counts the number of tokens in the text stored in the `file` variable, excluding the count of start and stop tokens, and prints the result.

In [18]:
count_start_and_stop_tokens = 2  # Set the number of start and stop tokens to 2.

# Subtract the count of start and stop tokens from the total number of tokens in the text.
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)  # Print the calculated number of tokens in the text.

2231
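
The subtraction of 2 reflects that many transformer tokenizers wrap each encoded sequence in one start token and one stop token. A toy sketch of that bookkeeping, assuming a hypothetical encoder `toy_encode` with made-up token ids:

```python
from typing import List

START_ID, STOP_ID = 0, 2  # hypothetical special-token ids


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder (assumption): word ids wrapped in a start and a stop token."""
    body = [10 + i for i, _ in enumerate(text.split())]
    return [START_ID, *body, STOP_ID]


def count_content_tokens(text: str) -> int:
    """Token count without the two special tokens added by the encoder."""
    return len(toy_encode(text)) - 2


print(len(toy_encode("vector stores hold embeddings")))    # → 6, includes start and stop
print(count_content_tokens("vector stores hold embeddings"))  # → 4
```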


Use the ```splitter.split_text()``` function to split the text stored in the ```file``` variable into chunks.

In [19]:
text_chunks = splitter.split_text(text=file)  # Split the text into chunks.

Print the second chunk.


In [20]:
# Print the chunk at index 1.
print(text_chunks[1])  # Print the second chunk from the divided text chunks.

a database for quick access. related keywords : embedding, database, vectorization, vectorization sql definition : sql ( structured query language ) is a programming language for managing data in a database. you can query, modify, insert, delete, and more data. example : select * from users where age > 18 ; looks up information about users who are 18 years old or older. associated keywords : database, query, data management, data management csv definition : csv ( comma - separated values ) is a file format for storing data, where each data value is separated by a comma. it is used for simple storage and exchange of tabular data. example : a csv file with the headers name, age, and occupation might contain data such as hong gil - dong, 30, developer. related keywords : data format, file processing, data exchange json definition : json ( javascript object notation ) is a lightweight data interchange format that represents data objects using text that is readable to both humans and machin

## Basic Usage of ```NLTK```

The Natural Language Toolkit (NLTK) is a library and a collection of programs for English natural language processing (NLP), written in the Python programming language.

Instead of simply splitting on "\n\n", you can split text with NLTK tokenizers.
1. Text splitting method: The text is split using the NLTK tokenizer.
2. Chunk size measurement: The size is measured by the number of characters.

NLTK supports various NLP tasks such as text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.

Before using NLTK, you need to run nltk.download('punkt_tab').

The reason for running nltk.download('punkt_tab') is to allow the NLTK (Natural Language Toolkit) library to download the necessary data files required for tokenizing text.

Specifically, punkt_tab is a tokenization model capable of splitting text into words or sentences for multiple languages, including English.

In [21]:
import nltk

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ilgyun/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Verify the sample text.


In [23]:
# Open the data/appendix-keywords_kr.txt file (Korean sample) and create a file object named f.
with open("./data/appendix-keywords_kr.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.
예시: 사용자가 "태양계 행성"이라고 검색하면, "목성", "화성" 등과 같이 관련된 행성에 대한 정보를 반환합니다.
연관키워드: 자연어 처리, 검색 알고리즘, 데이터 마이닝

Embedding

정의: 임베딩은 단어나 문장 같은 텍스트 데이터를 저차원의 연속적인 벡터로 변환하는 과정입니다. 이를 통해 컴퓨터가 텍스트를 이해하고 처리할 수 있게 합니다.
예시: "사과"라는 단어를 [0.65, -0.23, 0.17]과 같은 벡터로 표현합니다.
연관키워드: 자연어 처


- Create a text splitter using the NLTKTextSplitter class.
- Set the chunk_size parameter to 200 to split the text into chunks of up to 200 characters, with an overlap of 50 characters.

In [24]:
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=200,  # Set the chunk size to 200.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

Use the split_text method of the text_splitter object to split the `file` text.

In [25]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Semantic Search

정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.

예시: 사용자가 "태양계 행성"이라고 검색하면, "목성", "화성" 등과 같이 관련된 행성에 대한 정보를 반환합니다.


## Basic Usage of ```KoNLPy```

KoNLPy (Korean NLP in Python) is a Python package for Korean Natural Language Processing (NLP).

Tokenization involves the process of dividing text into smaller, more manageable units called tokens. 
These tokens often represent meaningful elements such as words, phrases, symbols, or other components crucial for further processing and analysis.

In languages like English, tokenization typically involves separating words based on spaces and punctuation.
The effectiveness of tokenization largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens.

Tokenizers designed for English lack the ability to comprehend the unique semantic structure of other languages, such as Korean, and therefore cannot be effectively used for Korean text processing.

### Korean Tokenization Using ```KoNLPy```'s ```Kkma``` Analyzer

For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer).

Kkma provides detailed morphological analysis for Korean text.
It breaks sentences into words and further decomposes words into their morphemes while identifying the part of speech for each token.
It can also split text blocks into individual sentences, which is particularly useful for processing lengthy texts.

### Considerations When Using ```Kkma```
Kkma is known for its detailed analysis. However, this precision can affect processing speed.
Therefore, Kkma is best suited for applications that prioritize analytical depth over rapid text processing.

KoNLPy as a whole offers features such as morphological analysis, part-of-speech tagging, and syntactic parsing.

Verify the sample text.

In [26]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


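The following example splits text using KonlpyTextSplitter. Note that the sample loaded above is the English file; the same calls apply to Korean text such as ./data/appendix-keywords_kr.txt.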

In [27]:
from langchain_text_splitters import KonlpyTextSplitter

# Create a text splitter object using KonlpyTextSplitter.
text_splitter = KonlpyTextSplitter()

Use the text_splitter to split the ```file``` content into chunks.

In [28]:
texts = text_splitter.split_text(file)  # Split the file content into chunks.
print(texts[0])  # Print the first chunk of the divided text.

Semantic Search Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks. Example: Vectors of word embeddings can be stored in a database for quick access. Related keywords: embedding, database, vectorization, vectorization Embedding Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text. Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17]. Related keywords: natural language processing, vectorization, deep learning Token Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases. Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”. Associated keywords: tokenization, natural language processing, parsing Tokenizer Definition: A tokenizer is a to

## Basic Usage of ```Hugging Face tokenizers```

Hugging Face provides various tokenizers.

This example measures the token length of a text using one of Hugging Face's tokenizers, GPT2TokenizerFast.

The text splitting approach is as follows:

- The text is split on the character separator used by CharacterTextSplitter.

The chunk size measurement is determined as follows:

- It is based on the number of tokens calculated by the Hugging Face tokenizer.

The tokenizer is set up as follows:

- A tokenizer object is created using the GPT2TokenizerFast class.
- The from_pretrained method is called to load the pre-trained gpt2 tokenizer.

In [29]:
from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer.
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [30]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


The from_huggingface_tokenizer method initializes a text splitter that uses a Hugging Face tokenizer to measure chunk length.

In [31]:
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    # Use the Hugging Face tokenizers to create a CharacterTextSplitter object.
    hf_tokenizer,
    chunk_size=300,
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

Check the second chunk of the split result.

In [32]:
print(texts[1])  # Print the second element of the texts list.

Tokenizer

Definition: A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural language processing.
Example: Split the sentence “I love programming.” into [“I”, “love”, “programming”, “.”].
Associated keywords: tokenization, natural language processing, parsing

VectorStore

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

SQL

Definition: SQL(Structured Query Language) is a programming language for managing data in a database. You can query, modify, insert, delete, and more data.
Example: SELECT * FROM users WHERE age > 18; looks up information about users who are 18 years old or older.
Associated keywords: database, query, data management, data management

CSV
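
The idea of measuring chunk size in tokens instead of characters can be sketched as a pluggable length function. This is an illustration under stated assumptions: `ToyTokenizer` is a hypothetical stand-in that cuts each word into 3-character pieces, not the GPT-2 tokenizer, and `make_length_function` is not part of any library.

```python
from typing import Callable, List


class ToyTokenizer:
    """Stand-in for a Hugging Face tokenizer (assumption): 3-character word pieces."""

    def encode(self, text: str) -> List[str]:
        pieces: List[str] = []
        for word in text.split():
            pieces.extend(word[i : i + 3] for i in range(0, len(word), 3))
        return pieces


def make_length_function(tokenizer: ToyTokenizer) -> Callable[[str], int]:
    """Build a length function that reports size in tokens instead of characters."""
    return lambda text: len(tokenizer.encode(text))


token_length = make_length_function(ToyTokenizer())
paragraph = "Embedding converts text into vectors"
print(len(paragraph))           # → 36, size in characters
print(token_length(paragraph))  # → 13, size in toy tokens
```

The same text has very different sizes depending on the measure, so a `chunk_size` of 300 tokens allows far more text per chunk than 300 characters would.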
