## Load Document $\rightarrow$ Split Document $\rightarrow$ Storage 

![steps](image.png)

### Split Document 

In [2]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

In [3]:
chunk_size = 26
chunk_overlap = 4

In [5]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [7]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [8]:
text2 = text1 + 'veronica'
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzveronica']

In [9]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [10]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [11]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [12]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=" "
)

In [13]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
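The outputs above suggest a simple mental model: split on a separator, then greedily pack the pieces into chunks of at most `chunk_size` characters, carrying a tail of at most `chunk_overlap` characters into the next chunk. The sketch below is a plain-Python illustration of that idea, not LangChain's actual implementation; `merge_with_overlap` is a hypothetical helper name introduced only for this example.

```python
from typing import List


def merge_with_overlap(
    pieces: List[str], separator: str, chunk_size: int, chunk_overlap: int
) -> List[str]:
    """Greedily pack pieces into chunks, repeating a short tail between chunks.

    Hypothetical helper for illustration only; it is not a LangChain API.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only the longest tail that fits within chunk_overlap.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"

# Character-level pieces (empty separator), as with text2 above.
print(merge_with_overlap(list(alphabet + "veronica"), "", 26, 4))

# Space-separated pieces, as with text3 above.
print(merge_with_overlap(" ".join(alphabet).split(" "), " ", 26, 4))
```

With `chunk_size=26` and `chunk_overlap=4`, both calls reproduce the chunk boundaries shown in the cells above for these particular inputs; other inputs may differ from what the library produces.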

In [14]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [15]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150, chunk_overlap=5, separators=["\n\n", "\n", r"\. ", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [16]:
len(r_splitter.split_text(some_text))

4

In [17]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150, chunk_overlap=5, separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [40]:
len(r_splitter.split_text(some_text))

4

#### Split PDF Document

In [18]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("docs/machinelearning-lecture01.pdf")
pages = loader.load()

In [46]:
len(pages)

22

In [19]:
text_splitter = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=150, length_function=len
)

In [20]:
docs = text_splitter.split_documents(pages)

In [49]:
print(len(docs))

77


In [54]:
docs[0].metadata

{'source': 'docs/machinelearning-lecture01.pdf', 'page': 0}

In [51]:
print(docs[10].page_content)

Similarly, every time you write a check, I ac tually don't know the number for this, but a 
significant fraction of checks that you write are processed by a learning algorithm that's 
learned to read the digits, so the dolla r amount that you wrote down on your check. So 
every time you write a check, there's anot her learning algorithm that you're probably 
using without even being aware of it.  
If you use a credit card, or I know at least one phone compan y was doing this, and lots of 
companies like eBay as well that do electr onic transactions, there's a good chance that 
there's a learning algorithm in the backgr ound trying to figure out if, say, your credit 
card's been stolen or if someone's engaging in a fraudulent transaction.  
If you use a website like Amazon or Netflix that will often recommend books for you to 
buy or movies for you to rent or whatever , these are other examples of learning


In [52]:
print(docs[10].metadata)

{'source': 'docs/machinelearning-lecture01.pdf', 'page': 3}


#### Split Notion Document

In [21]:
from langchain.document_loaders import NotionDirectoryLoader

loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [56]:
len(notion_db)

1

In [57]:
docs = text_splitter.split_documents(notion_db)
len(docs)

7

In [59]:
print(docs[1].page_content)

We've made this document public because we want to learn from you. We're very much interested in your feedback (including weeding out typo's and Dunglish ;)). Email us at hr@blendle.com. If you're starting your own company or if you're curious as to how we do things at Blendle, we hope that our employee handbook inspires you.
If you want to work at Blendle you can check our [job ads here](https://blendle.homerun.co/). If you want to be kept in the loop about Blendle, you can sign up for [our behind the scenes newsletter](https://blendle.homerun.co/yes-keep-me-posted/tr/apply?token=8092d4128c306003d97dd3821bad06f2).
## Blendle general
*Information gap closing in 3... 2... 1...*
---
[To Do/Read in your first week](https://www.notion.so/To-Do-Read-in-your-first-week-f0279ca808514905bcce4514a4905d90?pvs=21)
[History](https://www.notion.so/History-1bdf308cf4f84b9484af3ece21930110?pvs=21)
[DNA & culture](https://www.notion.so/DNA-culture-2e6d462451ee4100a0854fe4307b99aa?pvs=21)


#### Context-Aware Splitting

In [60]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [61]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [63]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [64]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [67]:
md_header_splits[1].page_content

'Hi this is Lance'

In [68]:
md_header_splits[1].metadata 

{'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}
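The metadata above shows each chunk carrying the headers it sits under. As a rough sketch of how such header tracking could work (an assumption for illustration, not the internals of `MarkdownHeaderTextSplitter`), the hypothetical function `split_by_headers` below walks the lines, remembers the most recent header at each level, and attaches a copy of that state to every block of text.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    markdown: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[dict]:
    """Group non-header lines into sections tagged with their enclosing headers.

    Hypothetical helper for illustration only; it is not a LangChain API.
    """
    # Check longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    current: Dict[str, str] = {}  # header name -> header title
    levels: Dict[str, int] = {}   # header name -> header depth
    buffer: List[str] = []
    sections: List[dict] = []

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:
            sections.append({"content": content, "metadata": dict(current)})
        buffer.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                depth = len(marker)
                # A new header closes every header at the same or a deeper level.
                for stale in [n for n, d in levels.items() if d >= depth]:
                    del current[stale]
                    del levels[stale]
                current[name] = line[len(marker):].strip()
                levels[name] = depth
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return sections


sample = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for section in split_by_headers(sample, headers):
    print(section["metadata"], "->", repr(section["content"]))
```

Note that starting `## Chapter 2` drops the `Header 3` entry, so the last section is tagged only with `Title` and `Chapter 2`.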

In [69]:
notion_db

[Document(page_content="# Blendle's Employee Handbook\n\nThis is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. \n\n**Everything related to working at Blendle and the people of Blendle, made public.**\n\nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.\n\nWe've made this document 

In [71]:
notion_docs = markdown_splitter.split_text(notion_db[0].page_content)

In [74]:
print(notion_docs[1].page_content)
print("\n")
print(notion_docs[1].metadata)

*Information gap closing in 3... 2... 1...*  
---  
[To Do/Read in your first week](https://www.notion.so/To-Do-Read-in-your-first-week-f0279ca808514905bcce4514a4905d90?pvs=21)  
[History](https://www.notion.so/History-1bdf308cf4f84b9484af3ece21930110?pvs=21)  
[DNA & culture](https://www.notion.so/DNA-culture-2e6d462451ee4100a0854fe4307b99aa?pvs=21)  
[General & practical ](https://www.notion.so/General-practical-3325144f20664d3abe9d4833e4945912?pvs=21)


{'Header 1': "Blendle's Employee Handbook", 'Header 2': 'Blendle general'}


#### What About Multiple Markdown Files in the Notion DB?

In [75]:
from langchain.document_loaders import NotionDirectoryLoader

loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [76]:
len(notion_db)

2

In [87]:
notion_txt = " ".join([db.page_content for db in notion_db])
print(notion_txt)

# Title

## Chapter 1

Hi this is Jim\n\n Hi this is Joe

### Section

Hi this is Lance \n\n

## Chapter 2

Hi this is Molly
 # Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. 

**Everything related to working at Blendle and the people of Blendle, made public.**

These are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedbac

In [88]:
notion_docs = markdown_splitter.split_text(notion_txt)

In [90]:
print(notion_docs[3].page_content)
print("\n")
print(notion_docs[3].metadata)

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  
**Everything related to working at Blendle and the people of Blendle, made public.**  
These are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  
We've made this document public because we want to learn from you. We're very much int

#### Token Splitting

In [106]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=3, chunk_overlap=0)
a_splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=0)

In [107]:
text_splitter.split_text("Veronica is a senior data scientist")

['Veronica is', ' a senior data', ' scientist']

In [103]:
a_splitter.split_text("Veronica is a senior data scientist")

['Veronica', 'is a', 'senior', 'data', 'scientist']