## Understanding Document Structure in LangChain

In [3]:
from langchain_core.documents import Document

In [7]:

# page_content holds the text; metadata holds arbitrary key-value details about it
doc = Document(
    page_content="This is the place where actual data will be saved",
    metadata={
        "source": "file.pdf",
        "author": "unknown",
        "page": 1,
        "custom_field": "value",
    },
)

print(doc)
print(doc.metadata)

page_content='This is the place where actual data will be saved' metadata={'source': 'file.pdf', 'author': 'unknown', 'page': 1, 'custom_field': 'value'}
{'source': 'file.pdf', 'author': 'unknown', 'page': 1, 'custom_field': 'value'}
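
A document record pairs the text itself (`page_content`) with a free-form `metadata` dictionary. The sketch below uses only the standard library to show why that pairing is useful: metadata lets you filter or trace records without touching the text. `SimpleDocument` and `filter_by_metadata` are hypothetical names written for illustration; they are not LangChain's own class or API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document record: text plus metadata."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs, key, value):
    """Return the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


docs = [
    SimpleDocument("Intro text", {"source": "file.pdf", "page": 1}),
    SimpleDocument("More text", {"source": "file.pdf", "page": 2}),
    SimpleDocument("Other text", {"source": "notes.txt", "page": 1}),
]

# Keep only the records that came from file.pdf
pdf_docs = filter_by_metadata(docs, "source", "file.pdf")
print(len(pdf_docs))
print([d.metadata["page"] for d in pdf_docs])
```

Keeping the source and page alongside each piece of text is what later makes it possible to trace a retrieved chunk back to where it came from.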


### TEXT LOADER AND DIRECTORY LOADER (SIMPLEST EXAMPLE)

In [10]:
import os

os.makedirs("../data/text_files", exist_ok=True)

# Sample texts, keyed by the file path each one is written to

text_data={
    "../data/text_files/python_intro.txt":"""Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. Created by Guido van Rossum in the early 1990s, Python emphasizes clean and concise syntax, making it an ideal choice for beginners and professionals alike. Its extensive standard library and vast ecosystem of third-party packages allow developers to build applications in areas such as web development, data science, artificial intelligence, automation, and more. Python’s cross-platform nature and active community support have contributed to its rise as one of the most widely used languages in the world. Whether one is scripting simple tasks or developing complex systems, Python offers both ease of use and powerful capabilities.""",
    "../data/text_files/machine_learning_intro.txt":"""Machine Learning (ML) is a branch of artificial intelligence that enables computers to learn patterns and make predictions or decisions without being explicitly programmed. Instead of following rigid rules, ML models are trained using data, gradually improving their performance through experience. This technology underpins many modern applications, from recommendation systems and fraud detection to self-driving cars and natural language processing. Machine learning techniques are broadly classified into supervised, unsupervised, and reinforcement learning, each addressing different types of problems. As the availability of data and computing power continues to grow, ML has become a driving force behind innovation across industries, shaping the way we interact with technology and the world around us."""
}


# Write as UTF-8 so the files match the encoding the loaders read them with
for path, content in text_data.items():
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)

In [16]:
# Load a single text file
from langchain_community.document_loaders import TextLoader

text_loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
loaded_text = text_loader.load()

print("Metadata For Loaded Text",loaded_text[0].metadata)
print("Content For Loaded Text",loaded_text[0].page_content)

Metadata For Loaded Text {'source': '../data/text_files/python_intro.txt'}
Content For Loaded Text Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. Created by Guido van Rossum in the early 1990s, Python emphasizes clean and concise syntax, making it an ideal choice for beginners and professionals alike. Its extensive standard library and vast ecosystem of third-party packages allow developers to build applications in areas such as web development, data science, artificial intelligence, automation, and more. Python’s cross-platform nature and active community support have contributed to its rise as one of the most widely used languages in the world. Whether one is scripting simple tasks or developing complex systems, Python offers both ease of use and powerful capabilities.
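
The output above shows the two things a single-file load produces: the file's text and a `source` entry in the metadata. The following is a minimal standard-library sketch of that idea, not LangChain's implementation; `load_text_file` and the dictionary layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_text_file(path, encoding="utf-8"):
    """Read one file and wrap it as a single-record list with its source path."""
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    # The curly apostrophe is non-ASCII, so writing and reading must agree on the encoding
    sample_path.write_text("Python’s syntax is concise.", encoding="utf-8")
    loaded = load_text_file(sample_path)

print(loaded[0]["metadata"])
print(loaded[0]["page_content"])
```

Passing the encoding explicitly on both the write and the read side avoids depending on the platform default, which is not UTF-8 everywhere.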


In [19]:
# Load every .txt file in a directory
from langchain_community.document_loaders import DirectoryLoader

directory_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",                      # match .txt files in all subdirectories
    loader_cls=TextLoader,                # loader used for each matched file
    loader_kwargs={"encoding": "utf-8"},  # passed through to TextLoader
    show_progress=True,
)

documents = directory_loader.load()

print(f"Loaded {len(documents)} documents")
for i, doc in enumerate(documents):
    print(f"Document {i+1}")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content: {doc.page_content[:100]}")

100%|██████████| 2/2 [00:00<00:00, 2820.65it/s]

Loaded 2 documents
Document 1
Source: ../data/text_files/python_intro.txt
Content: Python is a high-level, interpreted programming language known for its simplicity, readability, and 
Document 2
Source: ../data/text_files/machine_learning_intro.txt
Content: Machine Learning (ML) is a branch of artificial intelligence that enables computers to learn pattern
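
Loading a directory boils down to matching files against a glob pattern and loading each match. Here is a rough standard-library sketch of that pattern, assuming a hypothetical `load_directory` helper and a plain dictionary per record; it is not how `DirectoryLoader` is actually implemented.

```python
import tempfile
from pathlib import Path


def load_directory(root, pattern="*.txt", encoding="utf-8"):
    """Recursively read every file matching pattern under root."""
    records = []
    # rglob walks subdirectories too; sorting makes the order deterministic
    for path in sorted(Path(root).rglob(pattern)):
        records.append({
            "page_content": path.read_text(encoding=encoding),
            "metadata": {"source": str(path)},
        })
    return records


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("first file", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("second file", encoding="utf-8")
    (root / "skip.md").write_text("not matched", encoding="utf-8")
    records = load_directory(root)

print(f"Loaded {len(records)} records")
for record in records:
    print(Path(record["metadata"]["source"]).name, "->", record["page_content"][:100])
```

The `.md` file is skipped because it does not match the pattern, while the file in the nested folder is picked up by the recursive search.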





### TEXT SPLITTER EXAMPLES

In [3]:
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)
from langchain_community.document_loaders import TextLoader

In [7]:

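# Assumes text_splitter_example.txt already exists; the cells above do not create it
loader = TextLoader("../data/text_files/text_splitter_example.txt", encoding="utf-8")
extracted_file = loader.load()

extract_file_context = extracted_file[0].page_content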

In [13]:
# Split on spaces into chunks of up to 100 characters, overlapping by up to 10
character_splitter = CharacterTextSplitter(separator=" ", chunk_size=100, chunk_overlap=10)

# Try paragraph breaks first, then line breaks, then spaces, then single characters
recursive_text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=20,
)

character_chunks = character_splitter.split_text(extract_file_context)
recursive_chunks = recursive_text_splitter.split_text(extract_file_context)

print(len(character_chunks))
print(len(recursive_chunks))

125
55
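
To see what `chunk_size` and `chunk_overlap` control, the sketch below splits text on spaces and greedily packs words into chunks, carrying a few trailing words into the next chunk as overlap. It is a simplified illustration under stated assumptions, not LangChain's implementation: `split_on_spaces` is a hypothetical helper, lengths are counted in characters, and a single word longer than `chunk_size` is emitted as its own oversized chunk.

```python
def split_on_spaces(text, chunk_size, chunk_overlap):
    """Greedily pack space-separated words into chunks with trailing-word overlap."""
    chunks, current = [], []
    for word in text.split(" "):
        candidate = " ".join(current + [word])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(" ".join(current))
            # Drop leading words until the carried-over tail fits within the
            # overlap budget and still leaves room for the next word
            while current and (
                len(" ".join(current)) > chunk_overlap
                or len(" ".join(current + [word])) > chunk_size
            ):
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "one two three four five six"
chunks = split_on_spaces(sample, chunk_size=13, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
# 'one two three'
# 'three four'
# 'four five six'
```

Each chunk repeats the last word of the previous one, which is the overlap at work. Smaller `chunk_size` values yield more chunks, which is consistent with the 125 versus 55 counts above, although the two splitters also differ in the separators they use.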
