#### Monday, February 12, 2024

This notebook reads the CSV file containing the text of every chapter and splits that text into chunks for embedding and vector storage.

Load the CSV file containing all the text of the book, one page per record.

In [1]:
import pandas as pd

In [2]:
dataFolder = '../data'

In [3]:
df12Rules = pd.read_csv(dataFolder + '/12Rules.csv')

In [4]:
df12Rules.head()

Unnamed: 0.1,Unnamed: 0,ChapterName,PageNumber,PageText
0,0,Forward,4,Foreword\nRules? More rules? Really? Isn’t lif...
1,1,Forward,5,reminds us that without rules we quickly becom...
2,2,Forward,6,"liberating us, and more laughs, and making the..."
3,3,Forward,7,One might hear such questions discussed at par...
4,4,Forward,8,"I was always especially fond of mid-Western, P..."
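
The two `Unnamed` columns in the output above are most likely index columns that were written to the file when the CSV was saved, and then saved again. That is an assumption based on their names and values, not something this notebook confirms. A minimal sketch of one way to drop them after loading, using a small in-memory stand-in (`csv_text` is hypothetical) rather than the real file:

```python
import io

import pandas as pd

# Hypothetical stand-in for 12Rules.csv: two leftover index columns plus the real data.
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,ChapterName,PageNumber,PageText\n"
    "0,0,Forward,4,first page\n"
    "1,1,Forward,5,second page\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
df_clean = df.loc[:, ~df.columns.str.startswith("Unnamed")]

print(list(df_clean.columns))
```

Writing the file with `to_csv(..., index=False)` in the first place would avoid the extra columns, if the saved index is indeed where they come from.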


Let's identify which page in the dataframe contains the most text, then grab that text to begin exploring how to split the data.

In [5]:
# Find the row with the maximum length in the 'PageText' column
longest_page_id = df12Rules['PageText'].apply(len).idxmax()

In [6]:
# Access only the 'PageText' column for the row with the maximum length
longest_page_text = df12Rules.loc[longest_page_id, 'PageText']

In [7]:
len(longest_page_text)

3262

In [8]:
longest_page_text[:64]

'One day passed, however, another and another; she did not come a'

Let's also load the other 12 Rules CSV file we created, which contains an entire chapter in every record.

In [9]:
df12RulesChapters = pd.read_csv(dataFolder + '/12RulesChapters.csv')

In [10]:
df12RulesChapters.head()

Unnamed: 0.1,Unnamed: 0,ChapterName,ChapterText
0,0,Forward,Foreword\n\nRules? More rules? Really? Isn’t l...
1,1,Overture,Overture\n\nThis book has a short history and ...
2,2,RULE 1: Stand up straight with your shoulders ...,R U L E 1\n\nSTAND UP STRAIGHT WITH YOUR\nSH...
3,3,RULE 2: Treat yourself like someone you are re...,R U L E 2\n\nTREAT YOURSELF LIKE SOMEONE YOU...
4,4,RULE 3: Make friends with people who want the ...,R U L E 3\n\nMAKE FRIENDS WITH PEOPLE WHO WA...


See [LangChain Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) for background on what we want to do with the text from the PDF document.

And yes, I also want to implement something myself, to better understand what we are doing and why.

Which chapter has the most text?

In [11]:
longest_chapter_id = df12RulesChapters['ChapterText'].apply(len).idxmax()
longest_chapter_id

12

In [12]:
longest_chapter_text = df12RulesChapters.loc[longest_chapter_id, 'ChapterText']
len(longest_chapter_text)

106955

In [13]:
longest_chapter_text[:128]

'R U L E   11\n\nDO NOT BOTHER CHILDREN WHEN THEY\nARE SKATEBOARDING\n\nDANGER AND MASTERY\n\nThere was a time when kids skateboarded on'

What happens when we apply the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter) by LangChain to this chapter of text?

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

1) How the text is split: by list of characters.
2) How the chunk size is measured: by number of characters.
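
To make the description above concrete, here is a simplified sketch of the same idea in plain Python. It is not LangChain's implementation: the names `recursive_split` and `merge_pieces` are made up for this illustration, and it ignores `chunk_overlap` and separator-keeping entirely. It splits on the first separator, recurses into any piece that is still too long using the remaining separators, and greedily merges small neighbouring pieces back together up to `chunk_size`. The sample text is the opening of the longest chapter.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join neighbouring pieces with sep while the result fits in chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive character splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, small = [], []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the small pieces collected so far, then go one separator deeper.
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, chunk_size, rest))
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks


header = "R U L E   11\n\nDO NOT BOTHER CHILDREN WHEN THEY\nARE SKATEBOARDING\n\nDANGER AND MASTERY"
paragraph = (
    "There was a time when kids skateboarded on the west side of Sidney Smith\n"
    "Hall, at the University of Toronto, where I work. Sometimes I stood there and\n"
    "watched them. There are rough, wide, shallow concrete steps there, leading\n"
    "up from the street to the front entrance, accompanied by tubular iron"
)
sample = header + "\n\n" + paragraph

chunks = recursive_split(sample, chunk_size=200)
for chunk in chunks:
    print(len(chunk), repr(chunk[:40]))
```

On this sample the heading block stays together as one chunk because it fits within 200 characters, while the paragraph is too long and is split on single newlines, with adjacent lines merged back together as long as they fit.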

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([longest_chapter_text])
print(texts[0])
print(texts[1])


page_content='R U L E   11\n\nDO NOT BOTHER CHILDREN WHEN THEY\nARE SKATEBOARDING\n\nDANGER AND MASTERY'
page_content='There was a time when kids skateboarded on the west side of Sidney Smith\nHall, at the University of Toronto, where I work. Sometimes I stood there and'


In [15]:
texts[:6]

[Document(page_content='R U L E   11\n\nDO NOT BOTHER CHILDREN WHEN THEY\nARE SKATEBOARDING\n\nDANGER AND MASTERY'),
 Document(page_content='There was a time when kids skateboarded on the west side of Sidney Smith\nHall, at the University of Toronto, where I work. Sometimes I stood there and'),
 Document(page_content='watched them. There are rough, wide, shallow concrete steps there, leading\nup from the street to the front entrance, accompanied by tubular iron'),
 Document(page_content='handrails, about two and a half inches in diameter and twenty feet long. The\ncrazy kids, almost always boys, would pull back about fifteen yards from the'),
 Document(page_content='top of the steps. Then they would place a foot on their boards, and skate like\nmad to get up some speed. Just before they collided with the handrail, they'),
 Document(page_content='would reach down, grab their board with a single hand and jump onto the top\nof the rail, boardsliding their way down its length, propelling themselves off')]

In [16]:
text_splitter.split_text(longest_chapter_text)[:6]

['R U L E   11\n\nDO NOT BOTHER CHILDREN WHEN THEY\nARE SKATEBOARDING\n\nDANGER AND MASTERY',
 'There was a time when kids skateboarded on the west side of Sidney Smith\nHall, at the University of Toronto, where I work. Sometimes I stood there and',
 'watched them. There are rough, wide, shallow concrete steps there, leading\nup from the street to the front entrance, accompanied by tubular iron',
 'handrails, about two and a half inches in diameter and twenty feet long. The\ncrazy kids, almost always boys, would pull back about fifteen yards from the',
 'top of the steps. Then they would place a foot on their boards, and skate like\nmad to get up some speed. Just before they collided with the handrail, they',
 'would reach down, grab their board with a single hand and jump onto the top\nof the rail, boardsliding their way down its length, propelling themselves off']

In [17]:
# Print the character length of each of the first six chunks
for chunk in text_splitter.split_text(longest_chapter_text)[:6]:
    print(len(chunk))

84
150
144
153
154
154
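
All six lengths above are below `chunk_size=200`, and none of the six chunks repeats text from its neighbour, so `chunk_overlap=20` is not visible in this sample. A plausible reason is that the splitter only carries whole pieces across a chunk boundary and the lines here are much longer than 20 characters. That is an assumption about LangChain's behaviour, not something verified in this notebook. To show what overlap means in its simplest form, here is a sketch of a fixed-width sliding window. This is a different, cruder technique than the recursive splitter, and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks where each chunk repeats the last
    chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


window_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(window_chunks)
```

Each chunk starts with the final character of the one before it. A window like this cuts through words and sentences, which is what splitting on separators is meant to avoid.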
