# Document Splitting

### Importing Dependencies

In [1]:
import os
import openai
import sys
sys.path.append('../..')
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

#### Configuring The Text Separation

In [3]:
chunk_size = 26
chunk_overlap = 4

In [4]:
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
character_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

```
Why doesn't this split the string below?
```

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
recursive_splitter.split_text(text1)
# No split: text1 is 26 characters long, which equals chunk_size, so it fits in one chunk

['abcdefghijklmnopqrstuvwxyz']

In [7]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
recursive_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

```
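The string below is split, but the overlap is specified as 4 and adjacent chunks share only 3 characters ('l m'). The text is split on spaces, so the overlap is built from whole pieces, and the next larger tail ('k l m', 5 characters) would exceed the limit.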
```

In [9]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
recursive_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
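The 3-character overlap in the output above can be reproduced with a small greedy merge. This is a simplified sketch in plain Python, not LangChain's implementation; the function name `merge_with_overlap` and its exact trimming rule are assumptions made for illustration.

```python
def merge_with_overlap(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks, carrying a tail of whole pieces forward."""
    chunks = []
    window = []  # pieces that make up the chunk being built
    for piece in pieces:
        candidate = separator.join(window + [piece])
        if window and len(candidate) > chunk_size:
            chunks.append(separator.join(window))
            # Drop pieces from the front until the remaining tail fits in the
            # overlap budget and leaves room for the incoming piece.
            while window and (
                len(separator.join(window)) > chunk_overlap
                or len(separator.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
result = merge_with_overlap(text3.split(" "), " ", chunk_size=26, chunk_overlap=4)
print(result)
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

The tail `l m` is 3 characters long. Keeping one more piece would give `k l m`, which is 5 characters and exceeds the overlap of 4, so the effective overlap is smaller than the configured value.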

#### Splitting Text with CharacterTextSplitter

In [10]:
character_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']
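The unsplit result above follows from the separator. The sketch below uses plain `str.split` rather than the splitter itself, and it assumes that `CharacterTextSplitter` falls back to `"\n\n"` when no separator is passed.

```python
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

# Assumption: "\n\n" is the separator used when none is given.
default_pieces = text3.split("\n\n")
# An explicit single-space separator finds a boundary between every letter.
space_pieces = text3.split(" ")

print(len(default_pieces))  # 1
print(len(space_pieces))    # 26
```

With no blank line in the text there is only one piece, so there is nothing to regroup into smaller chunks even though the text is longer than `chunk_size`.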

In [12]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '  # split on a single space
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text. 

In [15]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentences. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

# length
len(some_text)

496

In [16]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [17]:
# CharacterTextSplitter
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [18]:
# RecursiveCharacterTextSplitter
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
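The difference between the two outputs above comes from trying separators in priority order. The following is a simplified sketch of that idea in plain Python; `recursive_split` is a hypothetical helper, it ignores overlap, and it does not claim to match LangChain's internals.

```python
def recursive_split(text, separators, chunk_size):
    """Split text, falling back to finer separators only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the next separator.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "para one has words\n\npara two"
chunks = recursive_split(sample, ["\n\n", " ", ""], chunk_size=10)
print(chunks)  # ['para one', 'has words', 'para two']
```

The paragraph break is tried first, so the second paragraph stays intact, and only the first paragraph, which is too long on its own, is broken on spaces.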

```
Let's reduce the chunk size a bit and add a period to our separators:
```

In [19]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]  # raw string: "\." is a regex for a literal period
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [20]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # lookbehind: split after ". " so the period stays with its sentence
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentences. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
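The effect of the lookbehind can be seen in isolation with the standard `re` module. This is a sketch of the regex behaviour only, using a made-up sentence; how the splitter reattaches separators to chunks is not reproduced here.

```python
import re

sentence = "First idea. Second idea. Third idea."

# Plain pattern: the matched ". " is consumed, so the periods are lost.
plain = re.split(r"\. ", sentence)

# Lookbehind: matches the empty position right after ". ", so nothing is
# consumed and each period stays attached to the sentence it ends.
lookbehind = re.split(r"(?<=\. )", sentence)

print(plain)       # ['First idea', 'Second idea', 'Third idea.']
print(lookbehind)  # ['First idea. ', 'Second idea. ', 'Third idea.']
```

Splitting on an empty match requires Python 3.7 or later.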

#### Load PDF Doc

In [21]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [25]:
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [26]:
docs = text_splitter.split_documents(pages)

In [27]:
len(docs)

77

In [28]:
len(pages)

22

#### Working with NOTION_DB Data

In [29]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("data/Notion_DB")
notion_db = loader.load()

In [30]:
docs = text_splitter.split_documents(notion_db)

In [31]:
len(notion_db)

1

In [32]:
len(docs)

2

## Token splitting

We can also split on token count explicitly.

This is useful because LLM context windows are usually measured in tokens, not characters.

As a rule of thumb, a token is about 4 characters of English text.
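The 4-characters-per-token figure allows a quick back-of-the-envelope conversion between character counts and token counts. The helpers below are hypothetical and only apply that rough ratio; a real tokenizer will give different numbers.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def chars_for_budget(token_budget, chars_per_token=4):
    """Approximate number of characters that fit in a token budget."""
    return token_budget * chars_per_token


print(estimate_tokens("x" * 496))  # 124
print(chars_for_budget(1000))      # 4000
```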

In [33]:
from langchain.text_splitter import TokenTextSplitter

In [34]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [36]:
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
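Once text is tokenized, splitting by token count amounts to grouping tokens into fixed-size windows. The sketch below reuses the six tokens from the output above as a plain list; `window_tokens` is a hypothetical helper, not part of `TokenTextSplitter`, and it works on token strings rather than token ids.

```python
def window_tokens(tokens, chunk_size, chunk_overlap=0):
    """Group tokens into windows of chunk_size, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return ["".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


tokens = ['foo', ' bar', ' b', 'az', 'zy', 'foo']

one_per_chunk = window_tokens(tokens, chunk_size=1)
four_per_chunk = window_tokens(tokens, chunk_size=4)

print(one_per_chunk)   # ['foo', ' bar', ' b', 'az', 'zy', 'foo']
print(four_per_chunk)  # ['foo bar baz', 'zyfoo']
```

Note that the 16-character string produced 6 tokens, so the 4-characters-per-token rule of thumb is only an approximation.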

In [37]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [39]:
docs = text_splitter.split_documents(pages)

In [40]:
docs[0]

Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'data/MachineLearning-Lecture01.pdf', 'page': 0})

In [41]:
pages[0].metadata

{'source': 'data/MachineLearning-Lecture01.pdf', 'page': 0}

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [42]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [43]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [45]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [46]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [47]:
md_header_splits[0]

{'content': 'Hi this is Jim  \nHi this is Joe',
 'metadata': {'Header 1': 'Title', 'Header 2': 'Chapter 1'}}

In [48]:
md_header_splits[1]

{'content': 'Hi this is Lance',
 'metadata': {'Header 1': 'Title',
  'Header 2': 'Chapter 1',
  'Header 3': 'Section'}}
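The header metadata in the outputs above can be understood as a running record of the most recent header at each level. The following is a minimal plain-Python sketch of that bookkeeping; `split_by_headers` is a hypothetical function written for illustration and is not how `MarkdownHeaderTextSplitter` is implemented.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Split Markdown at the given headers and attach the active headers as metadata."""
    # Check longer markers first so "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    chunks = []
    active = {}  # header level -> (metadata key, header title)
    body = []    # text lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"content": text, "metadata": metadata})
        body.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes any header at the same or a deeper level.
                for old in [lvl for lvl in active if lvl >= level]:
                    del active[old]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return chunks


doc = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
splits = split_by_headers(doc, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")])
print(splits[1]["metadata"])
# {'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}
```

When `## Chapter 2` is reached, the `Header 3` entry is dropped, so the last chunk carries only `Header 1` and `Header 2`.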

```
Try on a real Markdown file, like a Notion database.
```

In [49]:
loader = NotionDirectoryLoader("data/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [50]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [51]:
md_header_splits = markdown_splitter.split_text(txt)

In [52]:
md_header_splits[0]

{'content': '👋 Welcome to Notion!  \nHere are the basics:  \n- [ ]  Click anywhere and just start typing\n- [ ]  Hit **/** to see all the types of content you can add - headers, videos, sub pages, etc.\n- [ ]  Highlight any text, and use the menu that pops up to **style** *your* ~~writing~~ `however` [you](https://www.notion.so/product) like\n- [ ]  See the **⋮⋮** to the left of this checkbox on hover? Click and drag to move this line\n- [ ]  Click the **+ New Page** button at the bottom of your sidebar to add a new page\n- [ ]  Click **Templates** in your sidebar to get started with pre-built pages\n- This is a toggle block. Click the little triangle to see more useful tips!\n- [**notion.com/templates**](https://www.notion.so/templates): More templates built by the Notion community\n- [**notion.com/help**](https://www.notion.so/help): ****Guides and FAQs for everything in Notion\n- [**notion.com/guides**](http://notion.com/guides): Watch videos and read tutorials to become a Notion ex