#### Text Splitting from Documents: Recursive Character Text Splitter
This text splitter is the recommended one for generic text. It is parameterized by a list of separator characters and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`, where the final empty string means "split between individual characters" as a last resort. The effect is to keep paragraphs (and then sentences, and then words) together as long as possible, since those tend to be the most strongly semantically related pieces of text.

- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
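
To make the "try each separator in order" idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not LangChain's implementation: the function name `recursive_split` and the sample string are made up for this example, `chunk_overlap` is ignored, and the separator is dropped at chunk boundaries. Chunk size is measured with `len`, i.e. in characters.

```python
# Simplified illustration only: NOT LangChain's implementation.
# It ignores chunk_overlap and drops the separator at chunk boundaries.

def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut between arbitrary characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=8))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits in 8 characters and stays whole; the second does not, so it is split again on spaces, and words are merged back together while they still fit.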


In [3]:
# Read a PDF file; the output below shows one Document per page
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('syllabus.pdf')
docs = loader.load()
docs

[Document(metadata={'source': 'syllabus.pdf', 'page': 0, 'page_label': '1'}, page_content='MACHINE\nLEARNING\nDEEP\nLEARNING\nPYTHON +\nSTATS\nCOMPUTER VISIONNATURAL LANGUAGE PROCESSING\nGENERATIVE AI\nRETRIEVAL AUGUMENT GENERATION\nVECTOR DB'),
 Document(metadata={'source': 'syllabus.pdf', 'page': 1, 'page_label': '2'}, page_content='This course is designed for aspiring data scientists, machine learning enthusiasts, and\nprofessionals looking to build expertise in Python programming, data analysis, machine learning,\nand deep learning. Whether you are just starting or have some experience, this comprehensive\ncourse will equip you with the skills needed to work with real-world datasets, apply machine\nlearning algorithms, and deploy AI solutions. By the end of the course, you’ll have a solid\nfoundation in AI, a portfolio of end-to-end projects, and the confidence to tackle complex\nchallenges in data science and AI.\nLearning Objectives\nMaster Python Programming: Understand Python f

In [4]:
type(docs[0])

langchain_core.documents.base.Document

##### How to recursively split text by characters

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunks of at most 500 characters, with up to 50 characters shared between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_documents = text_splitter.split_documents(docs)
final_documents

[Document(metadata={'source': 'syllabus.pdf', 'page': 0, 'page_label': '1'}, page_content='MACHINE\nLEARNING\nDEEP\nLEARNING\nPYTHON +\nSTATS\nCOMPUTER VISIONNATURAL LANGUAGE PROCESSING\nGENERATIVE AI\nRETRIEVAL AUGUMENT GENERATION\nVECTOR DB'),
 Document(metadata={'source': 'syllabus.pdf', 'page': 1, 'page_label': '2'}, page_content='This course is designed for aspiring data scientists, machine learning enthusiasts, and\nprofessionals looking to build expertise in Python programming, data analysis, machine learning,\nand deep learning. Whether you are just starting or have some experience, this comprehensive\ncourse will equip you with the skills needed to work with real-world datasets, apply machine\nlearning algorithms, and deploy AI solutions. By the end of the course, you’ll have a solid'),
 Document(metadata={'source': 'syllabus.pdf', 'page': 1, 'page_label': '2'}, page_content='foundation in AI, a portfolio of end-to-end projects, and the confidence to tackle complex\nchallenges
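
In the output above, the short first page stays as a single chunk, while page 2 is cut into several chunks that all carry the same metadata. The sketch below mimics that behaviour in plain Python. The `Doc` class and `split_docs` function are hypothetical stand-ins (not LangChain classes), and the fixed-size cut replaces the real separator-aware splitting for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, chunk_size):
    """Cut each document into fixed-size pieces, copying its metadata to every piece."""
    out = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            out.append(Doc(piece, dict(doc.metadata)))  # each chunk gets its own copy
    return out


pages = [
    Doc("short page", {"source": "syllabus.pdf", "page": 0}),
    Doc("a longer page that needs several chunks", {"source": "syllabus.pdf", "page": 1}),
]
chunks = split_docs(pages, chunk_size=16)
for chunk in chunks:
    print(chunk.metadata["page"], repr(chunk.page_content))
```

Because every chunk keeps `source` and `page`, a retrieved chunk can still be traced back to the page it came from.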

In [6]:
print(final_documents[0])
print(final_documents[1])

page_content='MACHINE
LEARNING
DEEP
LEARNING
PYTHON +
STATS
COMPUTER VISIONNATURAL LANGUAGE PROCESSING
GENERATIVE AI
RETRIEVAL AUGUMENT GENERATION
VECTOR DB' metadata={'source': 'syllabus.pdf', 'page': 0, 'page_label': '1'}
page_content='This course is designed for aspiring data scientists, machine learning enthusiasts, and
professionals looking to build expertise in Python programming, data analysis, machine learning,
and deep learning. Whether you are just starting or have some experience, this comprehensive
course will equip you with the skills needed to work with real-world datasets, apply machine
learning algorithms, and deploy AI solutions. By the end of the course, you’ll have a solid' metadata={'source': 'syllabus.pdf', 'page': 1, 'page_label': '2'}


In [7]:
# Load a plain-text file; UTF-8 is stated explicitly so characters such as "…" decode correctly

from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt', encoding='utf-8')
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness 

In [8]:
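# Read the raw text; the encoding is explicit so the result does not depend on the platform default
with open("speech.txt", encoding="utf-8") as f:
    speech = f.read()

# create_documents takes a list of strings and returns Document objects
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])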

page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of'
page_content='foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no'
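
The two chunks printed above share some text at their boundary, which is the effect of `chunk_overlap`. The small check below measures that shared text directly from the two printed strings. The helper `shared_overlap` is a made-up name for this illustration and is not part of LangChain.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# The two chunks exactly as printed in the output above.
chunk_0 = ("The world must be made safe for democracy. Its peace must be "
           "planted upon the tested foundations of")
chunk_1 = ("foundations of political liberty. We have no selfish ends to serve. "
           "We desire no conquest, no")

overlap = shared_overlap(chunk_0, chunk_1)
print(repr(overlap), len(overlap))
# 'foundations of' 14
```

The overlap is 14 characters rather than the full 20. A likely reason is that the overlap is assembled from whole words: also carrying over "tested" would make it 21 characters, which exceeds `chunk_overlap=20`. Both chunks also stay within `chunk_size=100`.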


In [9]:
type(text[0])

langchain_core.documents.base.Document