#### Text Splitting from Documents - RecursiveCharacterTextSplitter
This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This keeps paragraphs (then lines, which often correspond to sentences, and then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.

- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
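
The idea described above can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: the function name `recursive_split` is hypothetical, chunk overlap is ignored, and separators are dropped rather than kept. It only shows the core behaviour of trying each separator in order and falling back to the next one when a piece is still too long.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Simplified sketch: try the first separator, merge adjacent pieces while
    they fit, and recurse with the remaining separators on oversized pieces.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into single characters".
    pieces = list(text) if sep == "" else text.split(sep)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=8)
print(chunks)
```

The first paragraph fits in one chunk and stays whole, while the second paragraph is too long and is split further on spaces.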


In [12]:
from docx import Document as DocxDocument
from langchain_core.documents.base import Document
from pathlib import Path

# Path to your .docx file
docx_file = "CA-Inter-P5 - Chapter 1 - Nature, Objective and Scope of Audit.docx"

# Load the .docx file using python-docx
doc = DocxDocument(docx_file)

# Extract text from paragraphs
text = "\n".join([para.text for para in doc.paragraphs])

# Split by the delimiter '****' to identify sections
sections = text.split("****")

# Clean up sections by removing extra whitespace
sections = [section.strip() for section in sections]

# Create Document objects with metadata for each section
docs = []
for i, section in enumerate(sections):
    doc_metadata = {
        "source": docx_file,
        "page": i,  # For simplicity, using section index as page number
    }
    docs.append(Document(metadata=doc_metadata, page_content=section))

docs

[Document(metadata={'source': 'CA-Inter-P5 - Chapter 1 - Nature, Objective and Scope of Audit.docx', 'page': 0}, page_content=''),
 Document(metadata={'source': 'CA-Inter-P5 - Chapter 1 - Nature, Objective and Scope of Audit.docx', 'page': 1}, page_content="CHAPTER 1 -  NATURE, OBJECTIVE AND SCOPE OF ADULT\nINTRODUCTION\nWhat do such real-life situations highlight? Such instances underline importance of auditing in today's complex business environment. Be it investors desirous of investing their money in companies, shareholders anxious to know financial position of companies they have invested in, banks or financial institutions willing to lend funds to credit-worthy organizations, governments desirous of collecting taxes from trade and industry in accordance with applicable laws, trade unions negotiating with corporate managements for better wages or insurance companies wanting to settle property claims caused by fire or other disasters - range of diverse users in equally diverse fiel
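
The first `Document` in the output above has an empty `page_content`, which means the text before the first `****` delimiter was empty or whitespace only. A minimal sketch of filtering such sections out before building documents follows; `raw_text` is made-up sample data and not the contents of the chapter file.

```python
# Hypothetical sample text that starts and ends with the delimiter
raw_text = "****\nFirst section\n****\nSecond section\n****  \n"

# Same split-and-strip step as in the cell above
sections = [section.strip() for section in raw_text.split("****")]

# Drop sections that are empty after stripping
non_empty = [section for section in sections if section]

print(sections)
print(non_empty)
```

If the filtering is done before `enumerate`, the section indices stored in the metadata stay consecutive.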

In [10]:
## Reading a PDF file
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents.base import Document
from pathlib import Path


# Path to the .docx chapter; it is reused in later cells
chapter1 = "CA-Inter-P5 - Chapter 1 - Nature, Objective and Scope of Audit.docx"
loader = PyPDFLoader(file_path="attention.pdf")
docs: list[Document] = loader.load()
docs

[Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Tran

In [2]:
type(docs[0])

langchain_core.documents.base.Document

##### How to recursively split text by characters

In [None]:
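from typing import List
from langchain_core.documents.base import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# Pitfall: `texts` expects a list of strings. `chapter1` is a single string
# (the file name, not the file contents), so the splitter iterates over it
# character by character and returns the one-character documents shown below.
final_documents: List[Document] = text_splitter.create_documents(texts=chapter1)
final_documents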

[Document(metadata={}, page_content='C'),
 Document(metadata={}, page_content='A'),
 Document(metadata={}, page_content='-'),
 Document(metadata={}, page_content='I'),
 Document(metadata={}, page_content='n'),
 Document(metadata={}, page_content='t'),
 Document(metadata={}, page_content='e'),
 Document(metadata={}, page_content='r'),
 Document(metadata={}, page_content='-'),
 Document(metadata={}, page_content='P'),
 Document(metadata={}, page_content='5'),
 Document(metadata={}, page_content='-'),
 Document(metadata={}, page_content='C'),
 Document(metadata={}, page_content='h'),
 Document(metadata={}, page_content='a'),
 Document(metadata={}, page_content='p'),
 Document(metadata={}, page_content='t'),
 Document(metadata={}, page_content='e'),
 Document(metadata={}, page_content='r'),
 Document(metadata={}, page_content='1'),
 Document(metadata={}, page_content='-'),
 Document(metadata={}, page_content='N'),
 Document(metadata={}, page_content='a'),
 Document(metadata={}, page_conten
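
The one-character documents above are what you would expect when a bare string is passed where a list of strings is expected. The sketch below reproduces the effect in plain Python. `collect_texts` is a hypothetical stand-in for any function that loops over its `texts` argument, and the file name is a shortened, made-up example.

```python
from typing import Iterable, List


def collect_texts(texts: Iterable[str]) -> List[str]:
    """Return the items seen when looping over `texts`."""
    return [item for item in texts]


file_name = "CA-Inter-P5.docx"  # hypothetical, shortened file name

# A bare string is itself an iterable of single characters
as_string = collect_texts(file_name)

# Wrapping the string in a list yields it as one item
as_list = collect_texts([file_name])

print(as_string[:5])
print(as_list)
```

Passing the loaded text wrapped in a list, or a list of sections, avoids the problem.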

In [14]:
from typing import List
from langchain_core.documents.base import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_documents: List[Document] = text_splitter.split_documents(documents=docs)
final_documents

[Document(metadata={'source': 'CA-Inter-P5 - Chapter 1 - Nature, Objective and Scope of Audit.docx', 'page': 1}, page_content='CHAPTER 1 -  NATURE, OBJECTIVE AND SCOPE OF ADULT\nINTRODUCTION'),
 Document(metadata={'source': 'CA-Inter-P5 - Chapter 1 - Nature, Objective and Scope of Audit.docx', 'page': 1}, page_content="What do such real-life situations highlight? Such instances underline importance of auditing in today's complex business environment. Be it investors desirous of investing their money in companies, shareholders anxious to know financial position of companies they have invested in, banks or financial institutions willing to lend funds to credit-worthy organizations, governments desirous of collecting taxes from trade and industry in accordance with applicable laws, trade unions negotiating with"),
 Document(metadata={'source': 'CA-Inter-P5 - Chapter 1 - Nature, Objective and Scope of Audit.docx', 'page': 1}, page_content='applicable laws, trade unions negotiating with cor

In [18]:
from typing import List
from langchain_core.documents.base import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from docx import Document as DocxDocument


# Step 1: Load the DOCX file and join its paragraphs into one string
def load_docx(file_path: str) -> str:
    doc = DocxDocument(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return "\n".join(full_text)


# Load DOCX content
docx_text = load_docx(chapter1)

# Step 2: Split the text by '****'
sections = docx_text.split("****")

# Step 3: Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Step 4: Create documents for each section
final_documents: List[Document] = text_splitter.create_documents(texts=sections)

# Final output
final_documents


[Document(metadata={}, page_content='CHAPTER 1 -  NATURE, OBJECTIVE AND SCOPE OF ADULT\nINTRODUCTION'),
 Document(metadata={}, page_content="What do such real-life situations highlight? Such instances underline importance of auditing in today's complex business environment. Be it investors desirous of investing their money in companies, shareholders anxious to know financial position of companies they have invested in, banks or financial institutions willing to lend funds to credit-worthy organizations, governments desirous of collecting taxes from trade and industry in accordance with applicable laws, trade unions negotiating with"),
 Document(metadata={}, page_content='applicable laws, trade unions negotiating with corporate managements for better wages or insurance companies wanting to settle property claims caused by fire or other disasters - range of diverse users in equally diverse fields rely upon audited financial statements.'),
 Document(metadata={}, page_content='Can you figu

In [19]:
print(final_documents[0])
print(final_documents[1])

page_content='CHAPTER 1 -  NATURE, OBJECTIVE AND SCOPE OF ADULT
INTRODUCTION'
page_content='What do such real-life situations highlight? Such instances underline importance of auditing in today's complex business environment. Be it investors desirous of investing their money in companies, shareholders anxious to know financial position of companies they have invested in, banks or financial institutions willing to lend funds to credit-worthy organizations, governments desirous of collecting taxes from trade and industry in accordance with applicable laws, trade unions negotiating with'


In [20]:
## Text Loader

from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
docs = loader.load()
docs


[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness be

In [21]:
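# Read the whole speech into a single string
with open("speech.txt") as f:
    speech = f.read()


text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
# create_documents expects a list of strings, so wrap the single string in a list
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])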

page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of'
page_content='foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no'
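
The two chunks printed above share the words "foundations of", which is the effect of `chunk_overlap=20`. The following sketch measures that shared text in plain Python. `shared_overlap` is a hypothetical helper, not part of LangChain, and the two strings are copied from the printed output.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


first = (
    "The world must be made safe for democracy. Its peace must be "
    "planted upon the tested foundations of"
)
second = (
    "foundations of political liberty. We have no selfish ends to serve. "
    "We desire no conquest, no"
)

overlap = shared_overlap(first, second)
print(repr(overlap), len(overlap))
```

The overlap here is 14 characters rather than the full 20. That is consistent with the overlap being built from whole words: adding the preceding word "tested" would exceed the 20-character limit.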


In [22]:
type(text[0])

langchain_core.documents.base.Document