# Document Splitting

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [3]:
chunk_size = 26
chunk_overlap = 4

In [4]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below?

In [5]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [6]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [7]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [8]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

OK, this splits the string, and with `chunk_overlap = 4` the second chunk repeats the last four characters of the first (`wxyz`). Try other overlap values to see how the boundary moves.
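To see where the chunk boundaries in the two alphabet examples come from, here is a minimal sketch of fixed-size windowing with overlap. It is an illustration only, not LangChain's implementation: `sliding_window_chunks` is a hypothetical helper, and it assumes the text contains no usable separator, so it simply cuts by character position.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
        start += step
    return chunks


print(sliding_window_chunks('abcdefghijklmnopqrstuvwxyz', 26, 4))
# ['abcdefghijklmnopqrstuvwxyz']
print(sliding_window_chunks('abcdefghijklmnopqrstuvwxyzabcdefg', 26, 4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

A 26-character string fits in a single window of size 26, which is why `text1` comes back unsplit.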

In [9]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [10]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [11]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [12]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
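The difference between the two `c_splitter` calls comes down to the separator: `CharacterTextSplitter` splits on a single separator (by default `"\n\n"`, which does not occur in `text3`) and then merges the pieces back together up to `chunk_size`. The sketch below is a simplified approximation of that split-then-merge idea in plain Python. `split_and_merge` is a hypothetical helper, the separator is assumed to be a non-empty literal string, and details such as whitespace stripping and warnings are left out.

```python
def split_and_merge(text, separator, chunk_size, chunk_overlap):
    """Split on one literal separator, then greedily merge pieces into chunks."""
    pieces = [p for p in text.split(separator) if p]
    sep_len = len(separator)
    chunks, current, total = [], [], 0  # total = length of the joined current pieces

    for piece in pieces:
        extra = sep_len if current else 0
        if current and total + extra + len(piece) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until what remains fits the overlap
            # budget and leaves room for the incoming piece.
            while total > chunk_overlap or (
                total > 0
                and total + (sep_len if current else 0) + len(piece) > chunk_size
            ):
                total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                current.pop(0)
        current.append(piece)
        total += len(piece) + (sep_len if len(current) > 1 else 0)

    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(split_and_merge(text3, "\n\n", 26, 4))
# ['a b c d e f g h i j k l m n o p q r s t u v w x y z']
print(split_and_merge(text3, " ", 26, 4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Because whole pieces are kept or dropped, the overlap here is `'l m'` (3 characters): adding one more letter plus its space would exceed the overlap of 4.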

Try your own examples!

## Recursive splitting details

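`RecursiveCharacterTextSplitter` is recommended for generic text. It tries each separator in order and moves on to the next one only for pieces that are still larger than `chunk_size`.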

In [13]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [14]:
len(some_text)

496

In [15]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [17]:
import pprint

In [22]:
pprint.pp(c_splitter.split_text(some_text))

['When writing documents, writers will use document structure to group '
 "content. This can convey to the reader, which idea's are related. For "
 'example, closely related ideas are in sentances. Similar ideas are in '
 'paragraphs. Paragraphs form a document. \n'
 '\n'
 ' Paragraphs are often delimited with a carriage return or two carriage '
 'returns. Carriage returns are the "backslash n" you see embedded in this '
 'string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']


In [21]:
pprint.pp(r_splitter.split_text(some_text))

['When writing documents, writers will use document structure to group '
 "content. This can convey to the reader, which idea's are related. For "
 'example, closely related ideas are in sentances. Similar ideas are in '
 'paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage '
 'returns. Carriage returns are the "backslash n" you see embedded in this '
 'string. Sentences have a period at the end, but also, have a space.and words '
 'are separated by space.']
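The recursive output above breaks at the paragraph boundary first, whereas the character splitter only knows about spaces. A rough sketch of that "coarsest separator first, recurse on oversized pieces" idea is shown below. `recursive_split` is a hypothetical helper written for this note, it ignores overlap, and it is not LangChain's code.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator found; recurse on pieces that are too big."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []

    for i, sep in enumerate(separators):
        if sep and sep in text:
            pieces = text.split(sep)
            remaining = separators[i + 1:]
            break
    else:
        # No separator applies: fall back to cutting by character position.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks, buffer = [], ""
    for piece in pieces:
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep merging small pieces
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) > chunk_size:
            # Still too large: retry this piece with the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            buffer = ""
        else:
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", ["\n\n", "\n", " ", ""], 7))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits in one chunk and is kept whole; only the second, oversized paragraph is split again on spaces.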


Let's reduce the chunk size a bit and add a period to our separators:

In [23]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [24]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
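Both outputs above are identical and still break at spaces (the first chunk ends at "For example,"), which is what one would expect if the period patterns were matched as literal text rather than as regular expressions. Recent LangChain releases have an `is_separator_regex` argument on `RecursiveCharacterTextSplitter` that controls this. Whether it is needed here depends on the installed version, so treat that as something to verify. Independently of LangChain, the difference between the two patterns can be shown with the standard `re` module (splitting on a zero-width match needs Python 3.7 or later). The `sample` string is made up for this illustration.

```python
import re

sample = "One. Two. Three."

# A plain pattern consumes the period and the space it matches.
consumed = re.split(r"\. ", sample)
print(consumed)
# ['One', 'Two', 'Three.']

# A lookbehind matches the empty position after ". ", so nothing is consumed
# and each period stays with the sentence it ends.
kept = re.split(r"(?<=\. )", sample)
print(kept)
# ['One. ', 'Two. ', 'Three.']
```

This is the motivation for the lookbehind form: the period stays at the end of its sentence instead of being dropped or pushed to the start of the next chunk.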

In [25]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/W1.pdf")
pages = loader.load()

In [26]:
len(pages)

159

In [27]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [28]:
docs = text_splitter.split_documents(pages)

In [29]:
len(docs)

159

In [30]:
len(pages)

159
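`len(docs)` equals `len(pages)` here. One plausible reading is that every page of the PDF already fits within 1000 characters, so each page yields exactly one chunk, since documents are split one at a time and chunks never span two pages. The sketch below illustrates that per-document behaviour with a hypothetical `Page` class and `split_pages` helper. It cuts by character position for brevity and is not LangChain's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_pages(pages, chunk_size):
    """Split each page on its own; every chunk inherits its page's metadata."""
    chunks = []
    for page in pages:
        text = page.page_content
        for start in range(0, len(text), chunk_size):
            chunks.append(Page(text[start:start + chunk_size], dict(page.metadata)))
    return chunks


short_pages = [Page("slide one", {"page": 0}), Page("slide two", {"page": 1})]
print(len(split_pages(short_pages, chunk_size=1000)))
# 2

long_pages = [Page("x" * 2500, {"page": 0})]
print([c.metadata for c in split_pages(long_pages, chunk_size=1000)])
# [{'page': 0}, {'page': 0}, {'page': 0}]
```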

In [31]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/dsp_essentials")
notion_db = loader.load()

In [41]:
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)
docs = text_splitter.split_documents(notion_db)

In [42]:
len(notion_db)

1

In [43]:
pprint.pp(notion_db[0])

Document(metadata={'source': 'docs/dsp_essentials/DSP Essentials 0dcaa8ebc71947e8a78e2589f9a07d36.md'}, page_content="# DSP Essentials\n\nThere are four levels:\n\n1. Advertiser\n2. Campaign\n3. Order\n4. Line item\n\nCampaign Settings:\n\n1. Campaign details\n2. Schedule\n3. *ID Fusion Settings*\n4. *Programmatic Budget*\n5. *Ad Serving Budget*\n6. *Frequency Capping*\n7. *Brand Safety*\n8. *Conversion Attribution*\n9. *Viewability*\n10. *Ad Tag Type*\n11. *URL Append Rules*\n12. *Campaign Labels*\n\nCampaign Setup:\n\n1. Campaign budget amount\n2. Campaign pacing: Evenly, ASAP, Ahead\n\nYou can also set the campaign budget field to 0. This means an overall campaign budget is not applied, but\xa0**all line items**\xa0will spend their\xa0**individually**\xa0defined budgets.\n\nFor more budget flexibility, you can set this amount to be a lower number than the sum of all the RTB line item goals. But once this amount is met\xa0*all*\xa0the line items in your campaign will stop delivering.

In [44]:
len(docs)

12

In [48]:
pprint.pp(docs[0])

Document(metadata={'source': 'docs/dsp_essentials/DSP Essentials 0dcaa8ebc71947e8a78e2589f9a07d36.md'}, page_content='# DSP Essentials\n\nThere are four levels:\n\n1. Advertiser\n2. Campaign\n3. Order\n4. Line item\n\nCampaign Settings:\n\n1. Campaign details\n2. Schedule\n3. *ID Fusion Settings*\n4. *Programmatic Budget*\n5. *Ad Serving Budget*\n6. *Frequency Capping*\n7. *Brand Safety*\n8. *Conversion Attribution*\n9. *Viewability*\n10. *Ad Tag Type*\n11. *URL Append Rules*\n12. *Campaign Labels*\n\nCampaign Setup:\n\n1. Campaign budget amount\n2. Campaign pacing: Evenly, ASAP, Ahead\n\nYou can also set the campaign budget field to 0. This means an overall campaign budget is not applied, but\xa0**all line items**\xa0will spend their\xa0**individually**\xa0defined budgets.')


## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in tokens.

A token is often about 4 characters of English text.
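As a quick back-of-the-envelope check, the 4-characters-per-token rule of thumb can be turned into an estimate. `estimate_tokens` is a hypothetical helper, and the real count depends on the tokenizer and the language of the text.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb only; varies by tokenizer and language


def estimate_tokens(text):
    """Rough token estimate from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# A 496-character string (the length of `some_text` earlier) is roughly:
print(estimate_tokens("x" * 496))
# 124
```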

In [49]:
from langchain.text_splitter import TokenTextSplitter

In [50]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [51]:
text1 = "foo bar bazzyfoo"

In [52]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
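The output above shows sub-word pieces (`' b'`, `'az'`, `'zy'`) because the real tokenizer works below the word level. The windowing idea itself is simple: tokenize, slide a window of `chunk_size` tokens, and join each window back into text. Here is a sketch under a loud assumption: `toy_tokenize` is a made-up whitespace-based tokenizer for illustration, not the tokenizer `TokenTextSplitter` uses, and `split_on_tokens` is a hypothetical helper.

```python
import re


def toy_tokenize(text):
    """Toy tokenizer: each token is a word plus the whitespace before it."""
    return re.findall(r"\s*\S+", text)


def split_on_tokens(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens over the token list."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks


print(split_on_tokens("foo bar bazzyfoo", chunk_size=1, chunk_overlap=0))
# ['foo', ' bar', ' bazzyfoo']
print(split_on_tokens("foo bar bazzyfoo", chunk_size=2, chunk_overlap=1))
# ['foo bar', ' bar bazzyfoo']
```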

In [53]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [54]:
docs = text_splitter.split_documents(pages)

In [55]:
docs[0]

Document(metadata={'source': 'docs/W1.pdf', 'page': 0}, page_content='Copyright Notice \nThese slides are distributed under the')

In [56]:
pages[0].metadata

{'source': 'docs/W1.pdf', 'page': 0}

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [57]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [58]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [59]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [60]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [61]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

In [62]:
md_header_splits[1]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}, page_content='Hi this is Lance')
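The metadata in the two results above follows the header hierarchy: a chunk carries every header currently "open" above it, and a new header closes any header at the same or a deeper level. A minimal sketch of that bookkeeping in plain Python follows. `split_by_headers` is a hypothetical helper that returns plain dicts, and its whitespace handling is not guaranteed to match `MarkdownHeaderTextSplitter` exactly.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers that are open above them."""
    # Check longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    chunks = []
    active = {}  # header level -> (metadata key, header title)
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"metadata": metadata, "page_content": text})
        body.clear()

    for raw_line in markdown.split("\n"):
        line = raw_line.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes headers at the same or a deeper level.
                for closed in [lvl for lvl in active if lvl >= level]:
                    del active[closed]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return chunks


sample_md = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_by_headers(sample_md, headers):
    print(chunk["metadata"], repr(chunk["page_content"]))
# {'Header 1': 'Title', 'Header 2': 'Chapter 1'} 'Hi this is Jim\nHi this is Joe'
# {'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'} 'Hi this is Lance'
# {'Header 1': 'Title', 'Header 2': 'Chapter 2'} 'Hi this is Molly'
```

Note how "Chapter 2" drops the `Header 3` entry: the section belonged to Chapter 1 and is closed when a new level-2 header starts.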

Try it on a real Markdown file, like a Notion database.

In [63]:
loader = NotionDirectoryLoader("docs/dsp_essentials")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [64]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [65]:
md_header_splits = markdown_splitter.split_text(txt)

In [69]:
md_header_splits[0].metadata

{'Header 1': 'There are four levels:'}

In [70]:
md_header_splits[1].metadata

{'Header 1': 'Campaign Settings:'}

In [71]:
md_header_splits[2].metadata

{'Header 1': 'Campaign Setup:'}