### Types of Documents Loader

- PDF Loaders
- Text and Markdown Files
- Microsoft Office Documents
- CSV and Structured Data
- Web-Based Document Loaders

#### PDF Loaders

In [1]:
from langchain.document_loaders import PyPDFLoader, UnstructuredPDFLoader, PDFMinerLoader

In [5]:
# Basic PDF Loader
loader = PyPDFLoader("../docs/attention.pdf")
documents = loader.load()

print(documents)

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-08-03T00:07:29+00:00', 'author': '', 'keywords': '', 'moddate': '2023-08-03T00:07:29+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../docs/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\

In [3]:
# # For complex layouts and better text extraction
# loader = UnstructuredPDFLoader("../docs/attention.pdf")
# documents = loader.load()

# print(documents)

In [18]:
# # For detailed control over PDF parsing
# loader = PDFMinerLoader("../docs/attention.pdf")
# documents = loader.load()

# print(documents)

#### Text and Markdown Files

In [23]:
from langchain.document_loaders import TextLoader, UnstructuredMarkdownLoader

# Simple text files
loader = TextLoader("../docs/documentation.txt", encoding="utf-8")
documents = loader.load()

print(documents)

[Document(metadata={'source': '../docs/documentation.txt'}, page_content="Introduction\nLangChain is a framework for developing applications powered by large language models (LLMs).\n\nLangChain simplifies every stage of the LLM application lifecycle:\n\nDevelopment: Build your applications using LangChain's open-source components and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.\nProductionization: Use LangSmith to inspect, monitor and evaluate your applications, so that you can continuously optimize and deploy with confidence.\nDeployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Platform.\n\nLangChain implements a standard interface for large language models and related technologies, such as embedding models and vector stores, and integrates with hundreds of providers. See the integrations page for more.\n\n")]


In [45]:
# Markdown files with structure preservation
loader = UnstructuredMarkdownLoader("../docs/note.md")
documents = loader.load()

print(documents)

[Document(metadata={'source': '../docs/note.md'}, page_content='Document Loaders Overview\n\nThis repository explores different LangChain document loaders and their technical use cases.\n\nTypes of Document Loaders\n\n📄 PDF Loaders\n\nExtract text from PDF files.\n\nCan handle scanned vs. text-based PDFs differently.\n\nOften combined with chunking for better LLM performance.\n\n📝 Text and Markdown Files\n\nLoad plain .txt or .md documents.\n\nSimple and lightweight.\n\nUseful for notes, logs, and documentation.\n\n🏢 Microsoft Office Documents\n\nSupport for .docx, .pptx, .xlsx.\n\nExtracts text and metadata from structured Office files.\n\nUseful in enterprise settings with legacy document storage.\n\n📊 CSV and Structured Data\n\nLoads tabular data into structured formats.\n\nOften transformed into documents row-by-row.\n\nGood for financial, product, or log data ingestion.\n\n🌐 Web-Based Document Loaders\n\nWebBaseLoader: Scrapes and loads content from arbitrary web pages.\n\nWikiped

#### Microsoft Office Documents

In [35]:
# from langchain.document_loaders import UnstructuredWordDocumentLoader

# # Word documents
# loader = UnstructuredWordDocumentLoader("../docs/file-sample_100kB.doc")
# documents = loader.load()

# print(documents)

In [37]:
# from langchain.document_loaders import UnstructuredExcelLoader

# # Excel files
# loader = UnstructuredExcelLoader("../docs/file_example_XLS_10.xls")
# documents = loader.load()

# print(documents)


In [41]:
# from langchain.document_loaders import UnstructuredPowerPointLoader


# # PowerPoint presentations
# loader = UnstructuredPowerPointLoader("../docs/file_example_PPT_250kB.ppt")
# documents = loader.load()

# print(documents)

####  CSV and Structured Data

In [42]:
from langchain.document_loaders import CSVLoader

# Basic CSV loading
loader = CSVLoader("../docs/iris-data.csv")
documents = loader.load()

print(documents)

[Document(metadata={'source': '../docs/iris-data.csv', 'row': 0}, page_content='sepal_length: 5.1\nsepal_width: 3.5\npetal_length: 1.4\npetal_width: 0.2\nspecies: setosa'), Document(metadata={'source': '../docs/iris-data.csv', 'row': 1}, page_content='sepal_length: 4.9\nsepal_width: 3.0\npetal_length: 1.4\npetal_width: 0.2\nspecies: setosa'), Document(metadata={'source': '../docs/iris-data.csv', 'row': 2}, page_content='sepal_length: 4.7\nsepal_width: 3.2\npetal_length: 1.3\npetal_width: 0.2\nspecies: setosa'), Document(metadata={'source': '../docs/iris-data.csv', 'row': 3}, page_content='sepal_length: 4.6\nsepal_width: 3.1\npetal_length: 1.5\npetal_width: 0.2\nspecies: setosa'), Document(metadata={'source': '../docs/iris-data.csv', 'row': 4}, page_content='sepal_length: 5.0\nsepal_width: 3.6\npetal_length: 1.4\npetal_width: 0.2\nspecies: setosa'), Document(metadata={'source': '../docs/iris-data.csv', 'row': 5}, page_content='sepal_length: 5.4\nsepal_width: 3.9\npetal_length: 1.7\npeta

In [44]:
# from langchain.document_loaders import UnstructuredCSVLoader

# # Advanced CSV with custom formatting
# loader = UnstructuredCSVLoader("../docs/iris-data.csv", mode="elements")
# documents = loader.load()

#### Web-Based Document Loaders

In [None]:
from langchain.document_loaders import WebBaseLoader

# Simple web scraping
url = "https://www.bbc.com/news/science-environment-66759065"  # Example science article
loader = WebBaseLoader(url)
documents = loader.load()

print(documents)

[Document(metadata={'source': 'https://www.bbc.com/news/science-environment-66759065', 'title': 'BBC', 'language': 'en-GB'}, page_content='BBCSkip to contentBritish Broadcasting CorporationHomeNewsSportBusinessInnovationCultureArtsTravelEarthAudioVideoLiveHomeNewsIsrael-Gaza WarWar in UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN. Ireland PoliticsScotlandScotland PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC InDepthBBC VerifySportBusinessExecutive LoungeTechnology of BusinessFuture of BusinessInnovationTechnologyScience & HealthArtificial IntelligenceAI v the MindCultureFilm & TVMusicArt & DesignStyleBooksEntertainment NewsArtsArts in MotionTravelDestinationsAfricaAntarcticaAsiaAustralia and PacificCaribbean & BermudaCentral AmericaEuropeMiddle EastNorth AmericaSouth AmericaWorld’s TableCulture & ExperiencesAdventuresThe SpeciaListEarthNatural WondersWeather & ScienceClimate SolutionsSustainable BusinessGreen LivingAudioPodcast

In [30]:
# from langchain.document_loaders import WikipediaLoader

# loader = WikipediaLoader(query="Agentic AI", load_max_docs=2)
# documents = loader.load()

# print(documents)

## Text Splitters

#### Character-Based Splitters

In [None]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\\n\\n",
    chunk_size=100,
    chunk_overlap=20
)
chunks = splitter.split_text(documents)

In [51]:
chunks

[Document(metadata={'source': '../docs/note.md'}, page_content='Document Loaders Overview\n\nThis repository explores different LangChain document loaders and their technical use cases.\n\nTypes of Document Loaders\n\n📄 PDF Loaders\n\nExtract text from PDF files.\n\nCan handle scanned vs. text-based PDFs differently.\n\nOften combined with chunking for better LLM performance.\n\n📝 Text and Markdown Files\n\nLoad plain .txt or .md documents.\n\nSimple and lightweight.\n\nUseful for notes, logs, and documentation.\n\n🏢 Microsoft Office Documents\n\nSupport for .docx, .pptx,'),
 Document(metadata={'source': '../docs/note.md'}, page_content='Office Documents\n\nSupport for .docx, .pptx, .xlsx.\n\nExtracts text and metadata from structured Office files.\n\nUseful in enterprise settings with legacy document storage.\n\n📊 CSV and Structured Data\n\nLoads tabular data into structured formats.\n\nOften transformed into documents row-by-row.\n\nGood for financial, product, or log data ingestion.

#### Recursive Character Splitters

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\\n\\n", "\\n", " ", ""]
)

chunks = splitter.split_documents(documents)

In [50]:
chunks

[Document(metadata={'source': '../docs/note.md'}, page_content='Document Loaders Overview\n\nThis repository explores different LangChain document loaders and their technical use cases.\n\nTypes of Document Loaders\n\n📄 PDF Loaders\n\nExtract text from PDF files.\n\nCan handle scanned vs. text-based PDFs differently.\n\nOften combined with chunking for better LLM performance.\n\n📝 Text and Markdown Files\n\nLoad plain .txt or .md documents.\n\nSimple and lightweight.\n\nUseful for notes, logs, and documentation.\n\n🏢 Microsoft Office Documents\n\nSupport for .docx, .pptx,'),
 Document(metadata={'source': '../docs/note.md'}, page_content='Office Documents\n\nSupport for .docx, .pptx, .xlsx.\n\nExtracts text and metadata from structured Office files.\n\nUseful in enterprise settings with legacy document storage.\n\n📊 CSV and Structured Data\n\nLoads tabular data into structured formats.\n\nOften transformed into documents row-by-row.\n\nGood for financial, product, or log data ingestion.

####  Token-Based Splitters

In [53]:
from langchain.text_splitter import TokenTextSplitter

# # Using tiktoken for OpenAI models
# text_splitter = TokenTextSplitter(
#     encoding_name="cl100k_base",  # GPT-4 encoding
#     chunk_size=1000,              # tokens, not characters
#     chunk_overlap=200
# )

# chunks = text_splitter.split_text(documents)

#### Semantic Splitters

In [56]:
# from langchain.text_splitter import SpacyTextSplitter

# splitter = SpacyTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=200,
#     separator="\\n\\n"
# )

# chunks = splitter.split_text(documents)

#### HTML and Markdown Splitters

In [57]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on=[
    ("h1","Header 1"),
    ("h2","Header 2"),
    ("h3","Header 3")
]

html_splitter=HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits=html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Bar main section'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 

In [58]:
from langchain.text_splitter import MarkdownTextSplitter

# Sample Markdown text about Machine Learning
markdown_text = """
# Introduction to Machine Learning

## What is Machine Learning?
Machine learning is a subset of artificial intelligence that enables systems to learn from data.

## Types of Machine Learning
### Supervised Learning
In supervised learning, the model learns from labeled data.

### Unsupervised Learning
Unsupervised learning deals with finding patterns in data without labels.

### Reinforcement Learning
Reinforcement learning is about learning by trial and error through rewards.

## Applications
- Image recognition
- Natural language processing
- Recommendation systems
"""

# Initialize the MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size=50, chunk_overlap=0)

# Create documents from the Markdown content
documents = splitter.create_documents([markdown_text])

# Display the split document chunks
for i, doc in enumerate(documents, start=1):
    print(f"--- Chunk {i} ---")
    print(doc.page_content)
    print()

--- Chunk 1 ---
# Introduction to Machine Learning

--- Chunk 2 ---
## What is Machine Learning?

--- Chunk 3 ---
Machine learning is a subset of artificial

--- Chunk 4 ---
intelligence that enables systems to learn from

--- Chunk 5 ---
data.

--- Chunk 6 ---
## Types of Machine Learning

--- Chunk 7 ---
### Supervised Learning

--- Chunk 8 ---
In supervised learning, the model learns from

--- Chunk 9 ---
labeled data.

--- Chunk 10 ---
### Unsupervised Learning

--- Chunk 11 ---
Unsupervised learning deals with finding patterns

--- Chunk 12 ---
in data without labels.

--- Chunk 13 ---
### Reinforcement Learning

--- Chunk 14 ---
Reinforcement learning is about learning by trial

--- Chunk 15 ---
and error through rewards.

--- Chunk 16 ---
## Applications
- Image recognition

--- Chunk 17 ---
- Natural language processing

--- Chunk 18 ---
- Recommendation systems

