Document splitting happens after the data is loaded but before it is stored in the vector store.



In [1]:
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

Text has to be split into chunks in a way that does not lose information or context.

Splitters take two key parameters: `chunk_size` and `chunk_overlap`.

Chunk overlap is a sliding window of text shared between the end of one chunk and the start of the next.
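
To make the overlap idea concrete, here is a minimal pure-Python sketch of fixed-size chunking with a sliding window. The helper name `sliding_window_chunks` is made up for illustration, and this is not how the LangChain splitters are implemented (they also look at separators).

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into chunks of chunk_size characters, each sharing
    chunk_overlap characters with the previous chunk (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4)
print(chunks)  # ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second chunk starts with `wxyz`, the last 4 characters of the first chunk.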

### `langchain.text_splitter` provides the following splitters

- *CharacterTextSplitter* : splits text by looking at characters

- *MarkdownHeaderTextSplitter* : splits Markdown files based on specified headers

- *TokenTextSplitter* : splits text by looking at tokens

- *SentenceTransformersTokenTextSplitter* : splits text by looking at tokens, using a sentence-transformers tokenizer

- *RecursiveCharacterTextSplitter* : splits text by looking at characters, recursively trying a list of separators

- *NLTKTextSplitter* : splits text by looking at sentences, using NLTK (Natural Language Toolkit)

- *SpacyTextSplitter* : splits text by looking at sentences, using spaCy


In [2]:
# import text splitters

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

Toy use cases

In [3]:
chunk_size = 26
chunk_overlap = 4

In [4]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz' 
r_splitter.split_text(text1)  # no split: the text length (26) equals chunk_size

['abcdefghijklmnopqrstuvwxyz']

In [7]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)  # splits after the 26th char `z`; the next chunk starts with the 4 overlapping chars `wxyz`

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

In [10]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [12]:
c_splitter.split_text(text3)  # CharacterTextSplitter splits on "\n\n" by default, so nothing is split here

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [13]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text.
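
A rough pure-Python sketch of the recursive idea: try the first separator, merge the pieces greedily up to `chunk_size`, and fall back to the next separator only for pieces that are still too long. The function `recursive_split` is a made-up helper for illustration. It ignores overlap and whitespace stripping, so it is a simplification and not the LangChain implementation.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Simplified recursive splitting (hypothetical helper, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # piece alone is too long: retry it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


seps = ["\n\n", "\n", " ", ""]
spaced = recursive_split("a b c d e f g h i j k l m n o p q r s t u v w x y z", 26, seps)
unspaced = recursive_split("abcdefghijklmnopqrstuvwxyzabcdefg", 26, seps)
print(spaced)    # ['a b c d e f g h i j k l m', 'n o p q r s t u v w x y z']
print(unspaced)  # ['abcdefghijklmnopqrstuvwxyz', 'abcdefg']
```

Coarse separators (paragraphs) are tried first, so related text stays together whenever it fits.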

In [14]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [15]:
len(some_text)

496

In [16]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [19]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [21]:
r_splitter.split_text(some_text)  # splits on \n\n first, even though both chunks end up smaller than chunk_size

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [23]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
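    # r"(?<=\. )" is a lookbehind regex meant to split after a sentence end; it only takes effect
    # when separators are treated as regexes (is_separator_regex=True in newer LangChain versions)
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]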
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

Now apply splitting to an actual PDF

In [24]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("ml_doc.pdf")
pages = loader.load()

In [25]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [26]:
split_docs = text_splitter.split_documents(pages)

In [29]:
split_docs[0]

Document(page_content="MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we'll start to  talk a bit about machine learning.  \nBy way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so \nI personally work in machine learning, and I' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I'm actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learni

In [27]:
len(split_docs) > len(pages)

True

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in tokens.

Tokens are often ~4 characters.
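
As a back-of-the-envelope sketch, the ~4 characters per token rule can be used to guess token counts or to size character-based chunks. The helper names and the constant below are assumptions for illustration. Real counts depend on the tokenizer and the text, so use a real tokenizer when the count matters.

```python
import math

CHARS_PER_TOKEN = 4  # rough average from the note above; varies by tokenizer and language


def estimate_tokens(text: str) -> int:
    """Rough token estimate from the character count (hypothetical helper)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def chars_for_token_budget(max_tokens: int) -> int:
    """Approximate character chunk_size that fits a token budget (hypothetical helper)."""
    return max_tokens * CHARS_PER_TOKEN


print(estimate_tokens("foo bar bazzyfoo"))  # 16 characters -> estimate of 4
print(chars_for_token_budget(250))          # 1000
```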

In [30]:
from langchain.text_splitter import TokenTextSplitter

In [32]:
token_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [33]:
token_text = "foo bar bazzyfoo"


In [34]:
token_splitter.split_text(token_text)  # the last word is split into 4 tokens

['foo', ' bar', ' b', 'az', 'zy', 'foo']

In [35]:
token_splitter = TokenTextSplitter(encoding_name="cl100k_base",
                                   model_name="gpt-3.5-turbo",
                                   chunk_size = 10,
                                   chunk_overlap = 0)

In [36]:
docs_tokens = token_splitter.split_documents(pages)

In [40]:
docs_tokens[1]

Document(page_content=' Ng):  Okay. Good morning. Welcome to', metadata={'source': 'ml_doc.pdf', 'page': 0})

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [44]:
# from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [45]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [46]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [47]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [48]:
md_header_splits = markdown_splitter.split_text(markdown_document)

In [49]:
md_header_splits[0]

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

In [51]:
md_header_splits[1]  # metadata has changed

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
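
To see why the metadata changes between chunks, here is a rough pure-Python sketch of header-aware splitting. `split_by_headers` is a made-up helper that returns plain dicts; it is not the `MarkdownHeaderTextSplitter` implementation and it skips edge cases such as headers inside code blocks.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Group text under its active headers (hypothetical helper)."""
    chunks = []
    current_meta = {}   # headers that apply to the text being collected
    current_lines = []  # text lines collected since the last header

    def flush():
        if current_lines:
            chunks.append({"content": "\n".join(current_lines),
                           "metadata": dict(current_meta)})
            current_lines.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        for marker, name in headers:
            if line.startswith(marker + " "):
                flush()  # a new header closes the current chunk
                # a header resets every header at the same or a deeper level
                for other_marker, other_name in headers:
                    if len(other_marker) >= len(marker):
                        current_meta.pop(other_name, None)
                current_meta[name] = line[len(marker):].strip()
                break
        else:
            current_lines.append(line)
    flush()
    return chunks


doc = ("# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
       "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly")
levels = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
sections = split_by_headers(doc, levels)
for section in sections:
    print(section)
```

Each chunk carries the headers it sits under, so a `## Chapter 2` header drops the earlier `Header 3` entry.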

The point of a text splitter is to get semantically related chunks of data.

The chunks need to be moved into a vector store before they can be queried.