# **Document Splitting**

Document splitting matters because retrieval and question answering operate on chunks rather than whole documents. A good splitter keeps semantically related content together in the same chunk, so that each chunk carries enough context to be useful on its own.



In [None]:
%%capture
# update or install the necessary libraries
!pip install --upgrade langchain langchain_community pypdf tiktoken python-dotenv

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

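# load_dotenv() has already copied the .env values into os.environ.
# Assigning os.getenv(...) back would raise a TypeError for any missing
# variable, so only report what is absent.
for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"):
    if os.getenv(key) is None:
        print(f"Warning: environment variable {key} is not set")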

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

## **RecursiveCharacterTextSplitter**

The `RecursiveCharacterTextSplitter` is recommended for generic text. It tries a hierarchy of separators in order: double newlines (`\n\n`), then single newlines (`\n`), then spaces, and finally individual characters, moving to the next separator only when a piece is still larger than `chunk_size`. Because it prefers the coarsest boundary that fits, paragraphs, then lines, then words are kept intact for as long as possible.
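To make the separator hierarchy concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not LangChain's implementation: the names `recursive_split` and `merge_pieces` are hypothetical, overlap is ignored, and separators are dropped rather than kept.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join small pieces with `sep` without exceeding chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort: cut at fixed character positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)
    chunks, small = [], []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the pieces collected so far, then split the big one more finely.
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, chunk_size, finer))
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks


sample = "para one is here\n\nword1 word2 word3"
print(recursive_split(sample, chunk_size=12))
# ['para one is', 'here', 'word1 word2', 'word3']
```

The paragraph break is tried first; only because each paragraph is still longer than 12 characters does the sketch fall back to splitting on spaces.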

In [None]:
chunk_size = 15
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [None]:
text1 = 'employeemanagementsystem'

r_splitter.split_text(text1)

In [None]:
text2 = 'employeemanagementsystemarch'

r_splitter.split_text(text2)
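Neither string above contains a separator, so the recursive splitter ends up cutting at character level, and `chunk_size` and `chunk_overlap` behave like a sliding window. The sketch below approximates that behaviour in plain Python, assuming each chunk starts `chunk_size - chunk_overlap` characters after the previous one. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("employeemanagementsystem", 15, 4))
# ['employeemanagem', 'agementsystem']
print(sliding_chunks("employeemanagementsystemarch", 15, 4))
# ['employeemanagem', 'agementsystemar', 'emarch']
```

Each chunk after the first begins with the last four characters of the previous one, which is the overlap. Comparing these lists with the output of the two cells above is a quick way to check your understanding of the parameters.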

## **CharacterTextSplitter**

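The `CharacterTextSplitter` is a more basic splitter: it splits the text on a single separator (by default `\n\n`) and merges the resulting pieces into chunks of up to `chunk_size` characters. If the separator never occurs in the text, the text is returned as one chunk, even when it is longer than `chunk_size`. This splitter is useful when the text has no clear structure or when you want to split at one specific delimiter.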



In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [None]:
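text3 = "e m p l o y e e m a n a g e m e n t s y s t e m"

# The recursive splitter falls back to splitting on spaces
print(r_splitter.split_text(text3))

# The default separator ("\n\n") never occurs in text3, so nothing is split
print(c_splitter.split_text(text3))

# Splitting on a single space lets the character splitter produce chunks too
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
print(c_splitter.split_text(text3))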
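The difference between the two `c_splitter` calls comes down to whether the separator occurs in the text. The following is a rough pure-Python sketch of single-separator splitting, assuming no overlap for simplicity. `split_on_separator` is a hypothetical name, and the real splitter additionally handles overlap and emits a warning for oversized chunks.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily re-join pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text3 = "e m p l o y e e m a n a g e m e n t s y s t e m"

# "\n\n" never occurs, so the whole text comes back as one oversized chunk.
print(len(split_on_separator(text3, "\n\n", 15)))
# 1

# A space occurs between every letter, so the pieces can be regrouped.
print(split_on_separator(text3, " ", 15))
# ['e m p l o y e e', 'm a n a g e m e', 'n t s y s t e m']
```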

# **Recursive splitting details**


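`RecursiveCharacterTextSplitter` is recommended for generic text. The cells below compare it with `CharacterTextSplitter` on a longer passage that has paragraph and sentence structure.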



In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader which ideas are related. For example, closely related ideas \
are in sentences. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space, \
and words are separated by spaces."""

In [None]:
len(some_text)

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=343,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

# Compare how each splitter handles the same text
print(c_splitter.split_text(some_text))
print(r_splitter.split_text(some_text))


In [None]:
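# Add ". " to the hierarchy so the splitter can break between sentences.
# Recent LangChain versions treat separators as literal strings by default,
# so no backslash is needed here.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", ". ", " ", ""]
)
print(r_splitter.split_text(some_text))

# A regex lookbehind keeps the period with the sentence it ends.
# Regex separators need is_separator_regex=True; a raw string avoids
# invalid-escape warnings.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True
)
print(r_splitter.split_text(some_text))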
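The two sentence separators above differ only in whether the period is consumed by the split. This can be seen in isolation with Python's `re` module, independent of LangChain. `sentence_text` is an illustrative string, and splitting on an empty match requires Python 3.7 or later.

```python
import re

sentence_text = "Ideas form sentences. Sentences form paragraphs. Paragraphs form a document."

# Splitting on the separator itself consumes the period and the space.
plain = re.split(r"\. ", sentence_text)
print(plain)
# ['Ideas form sentences', 'Sentences form paragraphs', 'Paragraphs form a document.']

# A lookbehind matches the empty position just after ". ", so nothing is consumed.
lookbehind = re.split(r"(?<=\. )", sentence_text)
print(lookbehind)
# ['Ideas form sentences. ', 'Sentences form paragraphs. ', 'Paragraphs form a document.']
```

With the lookbehind, every piece still ends in its own period, which is why the second configuration yields cleaner sentence boundaries.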



In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; each page becomes one Document
loader = PyPDFLoader("./content/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [None]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

docs = text_splitter.split_documents(pages)

# Compare the number of chunks with the number of pages
print(len(docs), len(pages))

# **Token splitting**

The `TokenTextSplitter` splits text by token count rather than character count. This is useful because language model context windows are measured in tokens, not characters. For English text a token is roughly four characters on average, so sizing chunks in tokens gives a closer match to how the model will actually consume the text.
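The four-characters-per-token rule of thumb can be turned into a quick back-of-the-envelope estimate. The sketch below is only a heuristic with hypothetical helper names (`estimate_tokens`, `max_chars_for`). It does not run a tokenizer, and real counts depend on the model's encoding and on the language of the text.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def max_chars_for(token_budget, chars_per_token=4):
    """Approximate character budget that fits within a token budget."""
    return token_budget * chars_per_token


print(estimate_tokens("Hello World"))  # 3  (11 characters / 4, rounded up)
print(max_chars_for(1000))             # 4000
```

An estimate like this is fine for choosing a starting `chunk_size`; for hard limits, count tokens with the actual tokenizer instead.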

In [None]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [None]:
text1 = "Hello World"
text_splitter.split_text(text1)

In [None]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
docs[0]

In [None]:
pages[0].metadata

# **Let's Do an Activity**

## **Objective**

Practice document splitting techniques with LangChain to manage large text content effectively. You will learn to use different splitters to break down text into manageable chunks, which is essential for tasks like text analysis, summarization, and feeding content into language models.

## **Scenario**

You are developing a text analysis module that processes large documents. This activity will help you understand how to use various text splitting techniques in LangChain to handle large text inputs efficiently.

## **Steps**

* Load a Sample Document
* RecursiveCharacterTextSplitter

  * Use `RecursiveCharacterTextSplitter` to split the text based on a hierarchy of separators.
  * Set a `chunk_size` and `chunk_overlap` to see how the text is divided.
  * Experiment with different separators to observe how the splitting changes.

* CharacterTextSplitter

  * Use `CharacterTextSplitter` to split the text based on a single character separator.
  * Compare the results with the recursive splitter to understand the differences.

* TokenTextSplitter

  * Use `TokenTextSplitter` to split the text based on token count.
  * Set a `chunk_size` and `chunk_overlap` to see how the text is divided.
  * Understand the importance of token-based splitting for language model processing.

* Exploration and Analysis

  * Experiment with different `chunk_size` and `chunk_overlap` settings to see how they affect the splitting.
  * Use various text samples and documents to explore the effectiveness of each splitting technique.
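For the exploration step, a small helper that summarizes chunk lengths makes it easier to compare settings side by side. This is one possible sketch: `summarize_chunks` is a hypothetical name, and it works on any list of strings regardless of which splitter produced them.

```python
def summarize_chunks(chunks):
    """Return count and length statistics for a list of text chunks."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(sum(lengths) / len(lengths), 1),
    }


example = ["employeemanagem", "agementsystem"]
print(summarize_chunks(example))
# {'count': 2, 'min': 13, 'max': 15, 'mean': 14.0}
```

Pass it the result of `split_text` directly, or `[d.page_content for d in docs]` for the output of `split_documents`.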