#### Text Splitting from Documents: RecursiveCharacterTextSplitter
This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This keeps paragraphs (then lines, then words) together for as long as possible, since those tend to be the most strongly related pieces of text.

- How the text is split: by a list of characters.
- How the chunk size is measured: by number of characters.
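
To make the "try each separator in order" idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `recursive_split` is a hypothetical helper written for illustration, it ignores chunk overlap, and it handles edge cases more crudely than `RecursiveCharacterTextSplitter` does.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the first separator, recurse into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or nothing left to try): fixed-size slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to grow the current chunk by one more piece.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: close the current chunk first.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Short paragraphs survive whole, and only the paragraph that exceeds `chunk_size` is broken further, at word boundaries.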


In [2]:
# Reading a PDF file
from langchain_community.document_loaders import PyPDFLoader
loader=PyPDFLoader('attention.pdf')
docs=loader.load()

##### How to recursively split text by characters

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size: the maximum length of each chunk, in characters.
# chunk_overlap: the maximum number of characters shared by two consecutive chunks.
# With chunk_overlap=50, up to the last 50 characters of one chunk are repeated at the start of the next.
# The actual overlap can be shorter, because the splitter only cuts at separators.
text_splitter=RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50)
final_documents=text_splitter.split_documents(docs)

In [5]:
# split_documents returns a plain Python list; each item is a LangChain Document
type(final_documents)

list

In [6]:
final_documents[0].page_content[:100]

'Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and'

In [7]:
print(final_documents[0].page_content[-100:])
print("\t")
print(final_documents[1].page_content[:100])

Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
	
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia
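
The output above shows the end of the first chunk repeated at the start of the second. A small sketch for measuring that overlap follows. `shared_overlap` is a hypothetical helper (not part of LangChain), and the two strings are excerpts copied from the printed output rather than the full chunks.

```python
def shared_overlap(previous_chunk: str, next_chunk: str) -> str:
    """Return the longest suffix of previous_chunk that is also a prefix of next_chunk."""
    limit = min(len(previous_chunk), len(next_chunk))
    for size in range(limit, 0, -1):
        if previous_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


# Excerpts taken from the printed chunk boundaries (not the complete chunks).
tail = "Aidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu"
head = "University of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain"

overlap = shared_overlap(tail, head)
print(repr(overlap), len(overlap))
```

On the real data the same check would be `shared_overlap(final_documents[0].page_content, final_documents[1].page_content)`. The shared text here is shorter than 50 characters, which fits the reading that `chunk_overlap` is an upper bound and not an exact length.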


In [8]:
## Text Loader

from langchain_community.document_loaders import TextLoader

loader=TextLoader('speech.txt')
docs=loader.load()

### Using create_documents with raw text

Here we pass plain Python strings (the text read from the file) instead of LangChain Document objects.

In [10]:
# speech holds plain text (a str), not a Document
speech=""
with open("speech.txt") as f:
    speech=f.read()

# speech is not a Document this time, so we use create_documents, which takes a list of strings
text_splitter=RecursiveCharacterTextSplitter(chunk_size=100,chunk_overlap=20)
text=text_splitter.create_documents([speech])

In [12]:
# same as before: each item in the list is a LangChain Document
type(text[0])

langchain_core.documents.base.Document

In [20]:
print(text[0])
print(text[1])

page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of'
page_content='foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no'
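
The difference between `split_documents` and `create_documents` is only the input type: the first takes Document objects, the second takes raw strings (plus optional metadata). The sketch below mimics that relationship with stand-ins so it runs without LangChain. `Doc`, `split_text`, and both functions are hypothetical simplifications, and the fixed-size slicing is a placeholder for the real separator-based splitting.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text, chunk_size):
    """Placeholder splitter: fixed-size slices, no separators, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def create_documents(texts, metadatas=None, chunk_size=20):
    """Takes raw strings and wraps every chunk in a Doc."""
    metadatas = metadatas or [{} for _ in texts]
    return [
        Doc(chunk, dict(meta))
        for text, meta in zip(texts, metadatas)
        for chunk in split_text(text, chunk_size)
    ]


def split_documents(docs, chunk_size=20):
    """Takes Doc objects, unpacks text and metadata, then delegates to create_documents."""
    texts = [d.page_content for d in docs]
    metadatas = [d.metadata for d in docs]
    return create_documents(texts, metadatas, chunk_size)


raw = "The world must be made safe for democracy."
from_text = create_documents([raw])
from_docs = split_documents([Doc(raw, {"source": "speech.txt"})])

print([d.page_content for d in from_text])
print(from_docs[0].metadata)
```

Both paths produce the same chunk text. The practical difference is that chunks made from Document objects keep the metadata of the document they came from, while chunks made from raw strings start with empty metadata unless you supply it.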
