# Advanced Document Splitting

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

Long documents can exceed the context window of a large language model (LLM). For example, the original GPT-3.5 Turbo had a context window of 4,096 tokens. To work with such documents, we need to split them into smaller chunks.

In LangChain, there are several strategies to split a document:

1. Character Text Splitter
2. Sentence Transformers Token Text Splitter
3. Token Text Splitter
4. Recursive Text Splitter


1. **Character Text Splitter**: 

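- Splits text on a single separator (by default `"\n\n"`) and merges the pieces into chunks of at most `chunk_size` characters. With `separator=""`, as in the example below, the text is cut purely by character count.
- This approach is useful for creating consistent chunk sizes regardless of the content structure, but it can cut words and sentences in the middle.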

In [2]:
text_to_split = "Hello there! My name is John. I have a dog, named Snowball."

In [3]:
from langchain.text_splitter import CharacterTextSplitter

# Split the document into chunks
splitter = CharacterTextSplitter(separator="", chunk_size=10, chunk_overlap=2)
splitter.split_text(text_to_split)

['Hello ther',
 'ere! My na',
 'name is Jo',
 'John. I ha',
 'have a dog',
 'og, named',
 'd Snowball',
 'll.']
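
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a fixed-size sliding window over characters. The function name `split_by_characters` is hypothetical and this is not LangChain's implementation; it only mirrors the behaviour seen above for `separator=""`, including the stripped whitespace.

```python
def split_by_characters(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        # Strip surrounding whitespace, as the output above does.
        chunks.append(text[start:start + chunk_size].strip())
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "Hello there! My name is John. I have a dog, named Snowball."
print(split_by_characters(sample, chunk_size=10, chunk_overlap=2))
```

Each window starts 8 characters after the previous one (`chunk_size - chunk_overlap`), which is why the last 2 characters of one chunk reappear at the start of the next.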

2. **Sentence Transformers Token Text Splitter:** 
- Splits text into chunks based on a number of tokens, counted with the tokenizer of a Sentence Transformers embedding model. 
- This method is ideal when the chunks will be embedded with such a model, because each chunk stays within the model's token window. 
- Note that, despite the name, it does not split at sentence boundaries: "Sentence Transformers" is the name of the embedding library. As the output below shows, chunks can start and end in the middle of a sentence, and in this example the text comes back lowercased by the model's tokenizer.

In [4]:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(tokens_per_chunk=5, chunk_overlap=2)
splitter.split_text(text_to_split)

['hello there! my name',
 'my name is john.',
 'john. i have a',
 'have a dog, named',
 ', named snowball.']

3. **Token Text Splitter:** 
- Splits text into chunks based on a specified number of tokens.
- This approach ensures that each chunk fits within the token limit of the model, provided the splitter uses the same encoding as the model.

- Note: before text is turned into an embedding vector, it is first converted into a sequence of integer token IDs. This step is called encoding (tokenization). `tiktoken` is the tokenizer library used here, and `cl100k_base` and `gpt2` are two of the encodings it provides.

In [5]:
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=2, encoding_name="cl100k_base")
splitter.split_text(text_to_split)

['Hello there! My name is John. I have', ' I have a dog, named Snowball.']
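
The encode, window, decode flow behind token-based splitting can be sketched without any tokenizer library. The example below is only an illustration: the toy encoder treats every whitespace-separated word as one token, which is an assumption made for readability and not how `cl100k_base` tokenizes text. The helper names `build_vocab`, `encode`, `decode` and `split_by_tokens` are hypothetical.

```python
def build_vocab(text):
    """Toy vocabulary (assumption): one integer ID per distinct whitespace-separated word."""
    return {word: idx for idx, word in enumerate(dict.fromkeys(text.split()))}


def encode(text, vocab):
    """Text -> list of integer token IDs."""
    return [vocab[word] for word in text.split()]


def decode(token_ids, vocab):
    """List of integer token IDs -> text."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in token_ids)


def split_by_tokens(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens over the encoded text, then decode each window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    vocab = build_vocab(text)
    token_ids = encode(text, vocab)
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(decode(token_ids[start:start + chunk_size], vocab))
        if start + chunk_size >= len(token_ids):
            break  # the window already covers the last token
        start += step
    return chunks


sample = "Hello there! My name is John. I have a dog, named Snowball."
print(split_by_tokens(sample, chunk_size=5, chunk_overlap=2))
```

With a real tokenizer the chunk boundaries differ, because tokens are usually sub-word pieces, but the windowing logic is the same idea.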

4. **Recursive Text Splitter:** 
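- Splits text recursively: it tries each separator in the `separators` list in order (by default `["\n\n", "\n", " ", ""]`) and splits any piece that is still longer than `chunk_size` with the next separator. This strategy helps in balancing chunk size and content coherence. In the example below only `'.'` is given, so the resulting chunks stay longer than `chunk_size=10` because there is no finer separator to fall back on.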

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(separators=['.'], keep_separator=False, chunk_size=10, chunk_overlap=0)
splitter.split_text(text_to_split)

['Hello there! My name is John', ' I have a dog, named Snowball']
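
The recursive idea can be illustrated with a simplified pure-Python sketch. It is not LangChain's implementation: the hypothetical `recursive_split` below has no overlap, drops the separator it splits on, and falls back to fixed-size slices when it runs out of separators (an assumption of this sketch).

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the next one on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left (assumption of this sketch): cut into fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        # Try to merge the piece into the chunk being built.
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Hello there! My name is John. I have a dog, named Snowball."
print(recursive_split(sample, separators=[". ", " "], chunk_size=20))
```

Here the text is first split into sentences on `". "`. Both sentences are longer than 20 characters, so each is split again on spaces and the words are merged back into chunks of at most 20 characters.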

All the splitters discussed above can also split `Document` objects. Just use `.split_documents` instead of `.split_text`.
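
Conceptually, splitting documents means splitting each document's text and carrying its metadata over to every chunk. The sketch below shows that idea with a hypothetical `Doc` dataclass and `split_docs` helper; these are stand-ins for illustration, not LangChain's `Document` class or its actual method.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, split_text):
    """Apply a text-splitting function to each document and copy its metadata to every chunk."""
    chunks = []
    for doc in docs:
        for piece in split_text(doc.page_content):
            chunks.append(Doc(page_content=piece, metadata=dict(doc.metadata)))
    return chunks


docs = [Doc("first part. second part", {"source": "example.txt"})]
for chunk in split_docs(docs, lambda text: text.split(". ")):
    print(chunk)
```

Because every chunk keeps a copy of the metadata, each one can still be traced back to the file it came from.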

In [11]:
# Load the text file as a list of Document objects
from langchain_community.document_loaders import TextLoader

text_loader = TextLoader('./sources/sangkuriang.txt')
document_to_split = text_loader.load()

# Split each document into chunks of at most 250 tokens, with no overlap
splitter = TokenTextSplitter(chunk_size=250, chunk_overlap=0, encoding_name="cl100k_base")
d = splitter.split_documents(document_to_split)
d

[Document(metadata={'source': './sources/sangkuriang.txt'}, page_content='Sangkuriang Story\nThe legend tells that, long ago, there lived a beautiful woman named Dayang Sumbi, the daughter of the king of Sumbing Perbangkara. Her beautiful face made Dayang Sumbi contested by the princes.\nAs a princess from the kingdom, Dayang Sumbi has a weaving hobby. One time, when she was busy weaving cloth, suddenly her loom fell. Instead of taking it herself, Dayang Sumbi said an oath: if the one who took the loom were a man, then she would take him as her husband, but if the one who took the loom were a woman, she would make her a sister.\nUnexpectedly, sometime later, there came a male dog named Si Tumang, which brought Dayang Sumbi’s loom. Finally, to fulfill her oath, Dayang Sumbi married Tumang (long story short, Tumang was a god who was expelled from heaven). From that marriage, a son named Sangkuriang was born.\nTime went on until Sangkuriang grew into a handsome boy. One day, Sangkuriang f