# Document Splitting

# Splitting Documents into smaller chunks
## Retaining meaningful relationships!


### Document Loading
### Splitting
### Storage

# langchain.text_splitter

1. CharacterTextSplitter() - Implementation of splitting text that looks at characters.
2. MarkdownHeaderTextSplitter() - Implementation of splitting Markdown files based on specified headers.
3. TokenTextSplitter() - Implementation of splitting text that looks at tokens.
4. SentenceTransformersTokenTextSplitter() - Implementation of splitting text that looks at tokens.
5. RecursiveCharacterTextSplitter() - Implementation of splitting text that looks at characters. It recursively tries to split by different characters to find one that works.
6. Language() - for CPP, Python, Ruby, Markdown, etc.
7. NLTKTextSplitter() - Implementation of splitting text that looks at sentences using NLTK (Natural Language Toolkit).
8. SpacyTextSplitter() - Implementation of splitting text that looks at sentences using spaCy.

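A naive split can separate information that belongs together. In the example below, the horsepower figure lands in a different chunk from the car it describes:

On this model. The Toyota Camry has a head-snapping 80 HP and an eight-speed automatic transmission that will

1. Chunk 1: On this model. The Toyota Camry has a head-snapping
2. Chunk 2: 80 HP and an eight-speed automatic transmission that will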

## Document Splitting

In [2]:
# !pip install openai
# !pip install python-dotenv
# !pip install langchain
# !pip install langchain_community
# !pip install pypdf
# !pip install tiktoken

In [4]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv

# Read the local .env file so OPENAI_API_KEY is available in os.environ.
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [6]:
chunk_size = 26
chunk_overlap = 4

In [7]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Why doesn't this split the string below?

In [8]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [9]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [10]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [11]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
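The two results above follow from the sizes involved: `text1` is exactly 26 characters, so it fits in a single chunk of `chunk_size=26`, while `text2` is 33 characters, so a second chunk starts 4 characters (`chunk_overlap`) before the end of the first. Below is a minimal plain-Python sketch of that sliding window for text with no separators. `fixed_size_chunks` is a hypothetical helper written for illustration, not LangChain's implementation.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap.

    Hypothetical helper for illustration only (not LangChain's code).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop as soon as the current window reaches the end of the text,
        # otherwise the overlap alone would be emitted as an extra chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


one_chunk = fixed_size_chunks('abcdefghijklmnopqrstuvwxyz', 26, 4)
two_chunks = fixed_size_chunks('abcdefghijklmnopqrstuvwxyzabcdefg', 26, 4)
print(one_chunk)
print(two_chunks)
```

Under these assumptions the sketch reproduces both outputs above: one chunk for the 26-character string, and a second chunk beginning with the overlapping `wxyz` for the 33-character string.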

In [12]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [13]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [14]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [15]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
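The first `c_splitter` call returned the string whole because `CharacterTextSplitter` splits on a single separator, which defaults to `"\n\n"`, and `text3` contains none. With `separator=' '` the text is split on spaces and the words are merged back into chunks of at most `chunk_size` characters. Here is a rough plain-Python sketch of that merge step. `merge_with_overlap` is a hypothetical name, and the logic is a simplification rather than LangChain's actual code.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap, separator=" "):
    """Greedily join pieces into chunks no longer than chunk_size characters.

    Hypothetical helper for illustration only. After emitting a chunk, the
    trailing pieces that fit inside chunk_overlap are carried into the next one.
    """
    chunks = []
    current = []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remainder fits in the
            # overlap budget and still leaves room for the next piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


alphabet = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
merged = merge_with_overlap(alphabet.split(" "), chunk_size=26, chunk_overlap=4)
print(merged)
```

Because the overlap is measured in characters but applied to whole words, each chunk repeats the last two letters of the previous one (`l m`, then `w x`), as in the outputs above.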

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text.

In [16]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [17]:
len(some_text)

496

In [18]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

In [19]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [20]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
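The character splitter cut the text at the last space that fit, in the middle of a sentence, while the recursive splitter tried `"\n\n"` first and broke at the paragraph boundary. The idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions (no overlap, separators matched literally), and `recursive_split` is a hypothetical helper, not the library's implementation.

```python
def recursive_split(text, chunk_size, separators):
    """Split on the first separator found in text; recurse on oversized pieces.

    Hypothetical, simplified sketch: no overlap, literal separators only.
    """
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        return [text]  # no separator left to try; give up and keep the piece
    pieces = list(text) if sep == "" else text.split(sep)
    remaining = separators[i + 1:]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = "aaa bbb\n\nccc ddd eee"
result = recursive_split(sample, chunk_size=8, separators=["\n\n", "\n", " ", ""])
print(result)
```

The first paragraph fits in one chunk and is kept whole. Only the second paragraph, which is too long, falls back to splitting on spaces.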

## Let's reduce the chunk size a bit and add a period to our separators:

In [21]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
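    # The raw string avoids an invalid-escape warning. Recent LangChain versions match
    # separators literally unless is_separator_regex=True, which would explain why
    # the output below still breaks mid-sentence.
    separators=["\n\n", "\n", r"\. ", " ", ""]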
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [22]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
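    # A lookbehind keeps the period with the preceding sentence, but only when the
    # separator is treated as a regex (is_separator_regex=True in recent LangChain versions).
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]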
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
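The difference between the two period separators is easier to see with Python's `re` module directly. This sketch uses a made-up sentence rather than `some_text`, and it shows only how the two regular expressions behave, not how LangChain applies them.

```python
import re

sentence_text = "First sentence. Second sentence. Third."

# Splitting on the period-plus-space pattern consumes the separator,
# so the periods disappear from all but the last piece.
consumed = re.split(r"\. ", sentence_text)

# A lookbehind matches the empty position just after ". ",
# so each piece keeps its own period.
kept = re.split(r"(?<=\. )", sentence_text)

print(consumed)
print(kept)
```

The lookbehind version is usually the one you want for sentence splitting, since each chunk still ends with its punctuation.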

In [25]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("sample_data/Pret.pdf")
pages = loader.load()

In [26]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [27]:
docs = text_splitter.split_documents(pages)

In [28]:
len(docs)

14

In [29]:
docs[0]

Document(metadata={'source': 'sample_data/Pret.pdf', 'page': 0}, page_content='The\nHistory\nof\nPret\nA\nManger:\nFrom\na\nSingle\nShop\nto\na\nGlobal\nPhenomenon\nIntroduction\nPret\nA\nManger,\noften\nsimply\nknown\nas\nPret,\nis\na\nglobally\nrecognized\ncoffee\nshop\nand\nsandwich\nchain\nwith\na\nunique\napproach\nto\nfood\nand\nservice.\nWith\nits\nroots\nin\nthe\nbustling\nstreets\nof\nLondon,\nPret\nhas\ngrown\nto\nbecome\na\nfavorite\namong\nthose\nseeking\nfresh,\nhealthy,\nand\nconvenient\nmeals.\nThis\narticle\ndelves\ninto\nthe\nhistory\nof\nPret,\ntracing\nits\njourney\nfrom\na\nsingle\nshop\nto\na\nglobal\nbrand.\nThe\nBeginning:\nA\nSimple\nIdea\nThe\nstory\nof\nPret\nA\nManger\nbegins\nin\n1983\nwhen\ntwo\ncollege\nfriends,\nSinclair\nBeecham\nand\nJulian\nMetcalfe,\nnoticed\na\ngap\nin\nthe\nmarket\nfor\nfresh,\nnatural\nfood\nthat\ncould\nbe\nserved\nquickly\nto\nbusy\nLondoners.\nInspired\nby\nthe\nidea\nof\nproviding\nan\nalternative\nto\nthe\nprocessed\nand\nunhe

In [30]:
docs[0].metadata

{'source': 'sample_data/Pret.pdf', 'page': 0}

In [31]:
len(pages)

6
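The 6 pages became 14 chunks, and `docs[0].metadata` shows that each chunk keeps the metadata of the page it came from. A minimal sketch of that behaviour, using a stand-in `Doc` dataclass and a hypothetical `split_docs` helper rather than LangChain's `Document` and `split_documents`:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a document object (not LangChain's Document class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, split_text):
    """Apply split_text to each document and copy its metadata to every chunk."""
    chunks = []
    for doc in docs:
        for piece in split_text(doc.page_content):
            # Copy the dict so chunks do not share one mutable metadata object.
            chunks.append(Doc(page_content=piece, metadata=dict(doc.metadata)))
    return chunks


toy_pages = [
    Doc("aaaa bbbb", {"source": "toy.pdf", "page": 0}),
    Doc("cccc", {"source": "toy.pdf", "page": 1}),
]
toy_chunks = split_docs(toy_pages, str.split)
print([(c.page_content, c.metadata["page"]) for c in toy_chunks])
```

Keeping the source and page on every chunk is what later lets a retrieved chunk be traced back to where it came from.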

# Now, try Markdown format

In [46]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("Habit_Tracker.md")
course_material_md = loader.load()

In [44]:
docs = text_splitter.split_documents(course_material_md)

In [49]:
# len(course_material_md), len(docs)

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLM context windows are usually measured in tokens.

A token is often about 4 characters.
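The 4-characters-per-token rule of thumb gives a quick way to check whether a piece of text is likely to fit in a context window. The sketch below is only a rough estimate under that assumption. Real tokenizers vary by model and language, and `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb; actual tokenizers vary


def estimate_tokens(text):
    """Estimate the token count from the character count (approximation only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text, context_window_tokens):
    """Return True if the estimated token count fits in the given window."""
    return estimate_tokens(text) <= context_window_tokens


print(estimate_tokens("a" * 400))
print(fits_in_context("a" * 401, 100))
```

For an exact count, split with the model's own tokenizer, as the following cells do.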

In [61]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("sample_data/Pret.pdf")
pages = loader.load()
page = pages[0]
print(page.page_content[:10])


def reprocess_text(text):
    # Collapse every run of whitespace (including the per-word newlines
    # left by the PDF extraction) into a single space.
    cleaned_text = ' '.join(text.split())
    return cleaned_text


reprocessed_page_content = reprocess_text(page.page_content)
print(reprocessed_page_content[:100])

The
Histor
The History of Pret A Manger: From a Single Shop to a Global Phenomenon Introduction Pret A Manger, 


In [62]:
print(reprocessed_page_content[:50])

The History of Pret A Manger: From a Single Shop t


In [50]:
from langchain.text_splitter import TokenTextSplitter

In [51]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [52]:
text1 = "foo bar bazzyfoo"

In [53]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
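With `chunk_size=1` every chunk is a single token, which makes the tokenizer's boundaries visible: `bazzyfoo` is split into several subword tokens, and tokens carry their leading space. The grouping step itself is simple and can be sketched without a real tokenizer. The `toy_tokenize` function below is a whitespace-based stand-in, not the byte-pair encoding that `TokenTextSplitter` uses, so it keeps `bazzyfoo` whole. Both helper names are hypothetical.

```python
import re


def toy_tokenize(text):
    """Stand-in tokenizer: each token is a word with its leading whitespace.

    Not a real BPE tokenizer; joining the tokens reproduces the input text.
    """
    return re.findall(r"\s*\S+|\s+$", text)


def token_chunks(text, chunk_size, chunk_overlap=0):
    """Group tokens into windows of chunk_size tokens, overlapping by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks


single = token_chunks("foo bar bazzyfoo", chunk_size=1)
overlapping = token_chunks("foo bar bazzyfoo", chunk_size=2, chunk_overlap=1)
print(single)
print(overlapping)
```

The leading spaces in the chunks mirror the real output above, where the space belongs to the token that follows it.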

In [54]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [70]:
len(pages)

6

In [74]:
reprocessed_page_content

'The History of Pret A Manger: From a Single Shop to a Global Phenomenon Introduction Pret A Manger, often simply known as Pret, is a globally recognized coffee shop and sandwich chain with a unique approach to food and service. With its roots in the bustling streets of London, Pret has grown to become a favorite among those seeking fresh, healthy, and convenient meals. This article delves into the history of Pret, tracing its journey from a single shop to a global brand. The Beginning: A Simple Idea The story of Pret A Manger begins in 1983 when two college friends, Sinclair Beecham and Julian Metcalfe, noticed a gap in the market for fresh, natural food that could be served quickly to busy Londoners. Inspired by the idea of providing an alternative to the processed and unhealthy fast food options that dominated the market, they decided to create a place where people could find fresh sandwiches, salads, and coffee made from high-quality ingredients. The first Pret A Manger shop opened

In [72]:
docs = text_splitter.split_text(reprocessed_page_content)

In [76]:
docs[0:5]

['The History of Pret A Manger: From a',
 ' Single Shop to a Global Phenomenon Introduction Pret',
 ' A Manger, often simply known as Pret,',
 ' is a globally recognized coffee shop and sandwich chain with',
 ' a unique approach to food and service. With its']

In [77]:
docs[1]

' Single Shop to a Global Phenomenon Introduction Pret'

In [79]:
type(docs[3])

str

In [94]:
docs = text_splitter.split_documents(pages)
# Each page is a Document: use page.page_content for the text and page.metadata for the source and page number.
page.page_content, page.page_content[:10], page.metadata, type(page.page_content)

('The\nHistory\nof\nPret\nA\nManger:\nFrom\na\nSingle\nShop\nto\na\nGlobal\nPhenomenon\nIntroduction\nPret\nA\nManger,\noften\nsimply\nknown\nas\nPret,\nis\na\nglobally\nrecognized\ncoffee\nshop\nand\nsandwich\nchain\nwith\na\nunique\napproach\nto\nfood\nand\nservice.\nWith\nits\nroots\nin\nthe\nbustling\nstreets\nof\nLondon,\nPret\nhas\ngrown\nto\nbecome\na\nfavorite\namong\nthose\nseeking\nfresh,\nhealthy,\nand\nconvenient\nmeals.\nThis\narticle\ndelves\ninto\nthe\nhistory\nof\nPret,\ntracing\nits\njourney\nfrom\na\nsingle\nshop\nto\na\nglobal\nbrand.\nThe\nBeginning:\nA\nSimple\nIdea\nThe\nstory\nof\nPret\nA\nManger\nbegins\nin\n1983\nwhen\ntwo\ncollege\nfriends,\nSinclair\nBeecham\nand\nJulian\nMetcalfe,\nnoticed\na\ngap\nin\nthe\nmarket\nfor\nfresh,\nnatural\nfood\nthat\ncould\nbe\nserved\nquickly\nto\nbusy\nLondoners.\nInspired\nby\nthe\nidea\nof\nproviding\nan\nalternative\nto\nthe\nprocessed\nand\nunhealthy\nfast\nfood\noptions\nthat\ndominated\nthe\nmarket,\nthey\ndecided\nto\

In [95]:
pages[0].metadata

{'source': 'sample_data/Pret.pdf', 'page': 0}

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [96]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [97]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n
## Chapter 2\n\n \
Hi this is Molly"""

In [98]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [99]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [100]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

In [101]:
md_header_splits[1]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}, page_content='Hi this is Lance')
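Each chunk's metadata records the chain of headers above it, so "Hi this is Lance" is tagged with the title, the chapter, and the section. A simplified plain-Python sketch of how such header tracking can work is shown below. `split_by_headers` is a hypothetical helper that returns plain dicts. It is an approximation of the behaviour shown above, not the library's implementation.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group Markdown lines under their enclosing headers.

    Hypothetical sketch: returns dicts with 'metadata' and 'page_content'.
    """
    # Test longer prefixes first so "###" is matched before "#".
    headers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    sections = []
    metadata = {}  # header names -> titles currently in effect
    body = []      # content lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"metadata": dict(metadata), "page_content": content})
        body.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        for prefix, name in headers:
            if line.startswith(prefix + " "):
                flush()
                # A new header closes any header at the same or a deeper level.
                for other_prefix, other_name in headers:
                    if len(other_prefix) >= len(prefix):
                        metadata.pop(other_name, None)
                metadata[name] = line[len(prefix):].strip()
                break
        else:
            if line:
                body.append(line)
    flush()
    return sections


sample_md = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
header_levels = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
sections = split_by_headers(sample_md, header_levels)
for section in sections:
    print(section)
```

When "Chapter 2" begins, the sketch discards both the old chapter and its section from the metadata, so "Hi this is Molly" is tagged only with the title and the new chapter.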

# Try on a real Markdown file, like a Notion database.

In [102]:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [103]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [104]:
md_header_splits = markdown_splitter.split_text(txt)

In [105]:
# md_header_splits[0]