### Document Loader

Document loaders are LangChain components that read data from many formats and sources and convert it into a standardized Document object. This common representation simplifies content handling in LLM workflows.

The Document class is a core LangChain component that pairs textual content with contextual metadata. It has two key attributes:

- **page_content**: This attribute stores the actual textual data as a string.
- **metadata**: A dictionary holding additional contextual information about the document, such as the source URL or file path (source, url, filename), creation timestamp, author, or any custom attributes relevant to a specific use case.
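
To make the two attributes concrete, here is a minimal sketch that uses only the standard library. `SimpleDocument` is a hypothetical stand-in written for illustration, not LangChain's own class (which can be imported from `langchain_core.documents`); it only mirrors the `page_content` and `metadata` attributes described above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two attributes described above."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


doc = SimpleDocument(
    page_content="ProductID: 1001\nProductName: Wireless Mouse",
    metadata={"source": "./small_data.csv", "row": 0},
)

# Text and metadata travel together, so later steps can filter or cite by source.
print(doc.page_content.splitlines()[0])
print(doc.metadata["source"])
```

Keeping the metadata next to the text is what lets later stages, such as retrieval, report where a passage came from.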


#### BaseLoader
All LangChain document loaders implement the BaseLoader interface, which gives them a consistent loading API.
The two primary loading methods are load() and lazy_load():

- **load()**: Eagerly loads all documents into memory. Useful for prototyping or interactive work.
- **lazy_load()**: Loads documents one at a time and yields them from a generator. Recommended for production code, because it handles large datasets without holding every document in memory.
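
The difference between the two methods can be pictured with a small standard-library sketch. `ToyCSVLoader` is a hypothetical class written for illustration and is not LangChain's CSVLoader; it uses plain dictionaries in place of Document objects and assumes that eager loading is simply the lazy generator collected into a list.

```python
import csv
import io
from typing import Iterator


class ToyCSVLoader:
    """Hypothetical loader illustrating load() vs lazy_load(); not LangChain's CSVLoader."""

    def __init__(self, csv_text: str) -> None:
        self.csv_text = csv_text

    def lazy_load(self) -> Iterator[dict]:
        # Yield one record per row; nothing is read until the caller iterates.
        reader = csv.DictReader(io.StringIO(self.csv_text))
        for row_index, row in enumerate(reader):
            content = "\n".join(f"{key}: {value}" for key, value in row.items())
            yield {"page_content": content, "metadata": {"row": row_index}}

    def load(self) -> list[dict]:
        # Eager loading drains the lazy generator into a list.
        return list(self.lazy_load())


data = "ProductID,ProductName\n1001,Wireless Mouse\n1002,Notebook\n"
loader = ToyCSVLoader(data)

all_docs = loader.load()              # every row is held in memory at once
first_doc = next(loader.lazy_load())  # only the header and the first row are read
print(len(all_docs), first_doc["metadata"])
```

With a large file, iterating over `lazy_load()` lets each record be processed and discarded before the next one is read.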

In [1]:
from langchain_community.document_loaders.csv_loader import CSVLoader

## loading data in one go
loader = CSVLoader(file_path="./small_data.csv")
doc = loader.load()
print(doc)

[Document(metadata={'source': './small_data.csv', 'row': 0}, page_content='ProductID: 1001\nProductName: Wireless Mouse\nCategory: Electronics\nPrice: 19.99'), Document(metadata={'source': './small_data.csv', 'row': 1}, page_content='ProductID: 1002\nProductName: Notebook\nCategory: Stationery\nPrice: 2.49'), Document(metadata={'source': './small_data.csv', 'row': 2}, page_content='ProductID: 1003\nProductName: Water Bottle\nCategory: Home\nPrice: 8.99'), Document(metadata={'source': './small_data.csv', 'row': 3}, page_content='ProductID: 1004\nProductName: Desk Lamp\nCategory: Electronics\nPrice: 24.50'), Document(metadata={'source': './small_data.csv', 'row': 4}, page_content='ProductID: 1005\nProductName: Backpack\nCategory: Accessories\nPrice: 34.99'), Document(metadata={'source': './small_data.csv', 'row': 5}, page_content='ProductID: 1006\nProductName: Running Shoes\nCategory: Sports\nPrice: 59.95'), Document(metadata={'source': './small_data.csv', 'row': 6}, page_content='Produc

In [2]:
## loading data one by one
lazy_loader = CSVLoader(file_path='./large_data.csv')
for doc in lazy_loader.lazy_load():
    print(f"Loaded document:{doc.metadata.get('row')}")

Loaded document:0
Loaded document:1
Loaded document:2
Loaded document:3
Loaded document:4
Loaded document:5
Loaded document:6
Loaded document:7
Loaded document:8
Loaded document:9
Loaded document:10
Loaded document:11
Loaded document:12
Loaded document:13
Loaded document:14
Loaded document:15
Loaded document:16
Loaded document:17
Loaded document:18
Loaded document:19
Loaded document:20
Loaded document:21
Loaded document:22
Loaded document:23
Loaded document:24
Loaded document:25
Loaded document:26
Loaded document:27
Loaded document:28
Loaded document:29
Loaded document:30
Loaded document:31
Loaded document:32
Loaded document:33
Loaded document:34
Loaded document:35
Loaded document:36
Loaded document:37
Loaded document:38
Loaded document:39
Loaded document:40
Loaded document:41
Loaded document:42
Loaded document:43
Loaded document:44
Loaded document:45
Loaded document:46
Loaded document:47
Loaded document:48
Loaded document:49
Loaded document:50
Loaded document:51
Loaded document:52


#### CSVLoader

CSVLoader converts each row of a CSV file into a Document object. Columns listed in metadata_columns are stored in the document's metadata instead of its page_content.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
import os

# Create a dummy CSV file for demonstration
csv_file_path = "products.csv"
with open(csv_file_path, "w") as f:
    f.write("id,name,category,price\n")
    f.write("1,Laptop,Electronics,1200\n")
    f.write("2,Keyboard,Electronics,75\n")
    f.write("3,Desk,Furniture,300\n")

# Load the CSV file, storing the 'category' column in metadata
loader = CSVLoader(
    file_path=csv_file_path,
    ## column names to be used as metadata
    metadata_columns=['category'],
    #csv_args={"delimiter": ","}
)
documents = loader.load()


print(f"Number of documents loaded: {len(documents)}")
for doc in documents:
    print(f"Page Content: {doc.page_content.strip()}")
    print(f"Metadata: {doc.metadata}\n")

# Clean up dummy file
os.remove(csv_file_path)

Number of documents loaded: 3
Page Content: id: 1
name: Laptop
price: 1200
Metadata: {'source': 'products.csv', 'row': 0, 'category': 'Electronics'}

Page Content: id: 2
name: Keyboard
price: 75
Metadata: {'source': 'products.csv', 'row': 1, 'category': 'Electronics'}

Page Content: id: 3
name: Desk
price: 300
Metadata: {'source': 'products.csv', 'row': 2, 'category': 'Furniture'}



#### Text Loader

TextLoader is the simplest loader: it reads a plain text file and places its entire content into a single Document object.

In [4]:
from langchain_community.document_loaders import TextLoader
import os

## create a dummy txt file
text_file_path = 'mydoc.txt'
with open(text_file_path, 'w') as f:
    f.write("This is a simple text document.\n")
    f.write("It contains some random text for demo purpose")

## Load the text file
loader = TextLoader(text_file_path)
documents = loader.load()

print(f"Number of documents loaded: {len(documents)}")
for doc in documents:
    print(f"Page Content: {doc.page_content.strip()}")
    print(f"Metadata: {doc.metadata}\n")

## clean up
os.remove(text_file_path)

Number of documents loaded: 1
Page Content: This is a simple text document.
It contains some random text for demo purpose
Metadata: {'source': 'mydoc.txt'}



#### PDF

**UnstructuredPDFLoader**: This loader uses the unstructured library for layout-aware content extraction. It suits PDFs with complex layouts, scanned documents (via OCR), or mixed content, because it attempts to preserve structural elements. The mode parameter ('single', 'elements', or 'paged') controls how the output is split into Documents, and the strategy parameter ('auto', 'fast', 'hi_res', or 'ocr_only') controls how text is extracted.


**PyPDFLoader**: This loader is simpler and often faster for straightforward, text-based PDFs. It extracts text page by page, with each page typically becoming a separate Document object. It's a good default for PDFs where complex layout parsing isn't critical.   

In [47]:

from langchain_community.document_loaders import UnstructuredPDFLoader

pdf_path = 'attention.pdf'

# Caution: UnstructuredPDFLoader takes partition options as direct keyword
# arguments (**unstructured_kwargs). Passed as one dict named
# `unstructured_kwargs`, as here, they appear not to take effect, which is
# consistent with the unchunked 'UncategorizedText' element in the output below.
loader = UnstructuredPDFLoader(
    pdf_path,
    mode='elements',
    unstructured_kwargs={
        'strategy': 'hi_res',
        'extract_image_block_types': ["Image"],   # Add 'Table' to list to extract tables as images
        'extract_image_block_to_payload': True,   # Extract metadata element containing base64 object of the image
        'chunking_strategy': "by_title",          # Chunking strategy to use, can be 'by_title', 'basic'
        'max_characters': 2000,                   # Maximum number of characters per chunk
        'combine_text_under_n_chars': 500,        # Combine text blocks under this number of characters with previous text block
        'new_after_n_chars': 6000,                # Soft limit: prefer starting a new chunk after this many characters
    },
)
doc = loader.load()
doc[0]


Document(metadata={'source': 'attention.pdf', 'coordinates': {'points': ((16.34, 213.92000000000007), (16.34, 253.92000000000007), (36.34, 253.92000000000007), (36.34, 213.92000000000007)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'attention.pdf', 'languages': ['eng'], 'last_modified': '2025-03-26T23:07:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'da4d57fbf8f55a96700ed365e2d347f3'}, page_content='3 2 0 2')

In [70]:

from langchain_community.document_loaders import UnstructuredPDFLoader, PyPDFLoader
import os
pdf_path = 'attention.pdf'

print(f"--- Loading with UnstructuredPDFLoader ({pdf_path}) ---")
# Loading a PDF in 'elements' mode with automatic strategy selection
loader_unstructured = UnstructuredPDFLoader(pdf_path, mode='elements')
document_unstructured = loader_unstructured.load()

print(f"Number of documents (elements) loaded: {len(document_unstructured)}")
if document_unstructured:
    print(f"First document content : {document_unstructured[0].page_content[:200]}")
    print(f'First document metadata : {document_unstructured[0].metadata}')
    
print(f"\n--- Loading with PyPDFLoader ({pdf_path}) ---")
# Loading a PDF page by page
loader_pypdf = PyPDFLoader(pdf_path)
documents_pypdf = loader_pypdf.load()

print(f"Number of documents (pages) loaded: {len(documents_pypdf)}")
if documents_pypdf:
    print(f"First page content (partial): {documents_pypdf[0].page_content[:200]}...")
    print(f"First page metadata: {documents_pypdf[0].metadata}")


--- Loading with UnstructuredPDFLoader (attention.pdf) ---
Number of documents (elements) loaded: 393
First document content : 3 2 0 2
First document metadata : {'source': 'attention.pdf', 'coordinates': {'points': ((16.34, 213.92000000000007), (16.34, 253.92000000000007), (36.34, 253.92000000000007), (36.34, 213.92000000000007)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'attention.pdf', 'languages': ['eng'], 'last_modified': '2025-03-26T23:07:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'da4d57fbf8f55a96700ed365e2d347f3'}

--- Loading with PyPDFLoader (attention.pdf) ---
Number of documents (pages) loaded: 15
First page content (partial): Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
...
First page metadata: {'producer': 'pdfTeX-1.40.25', '

In [71]:
documents_pypdf[0].metadata

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2024-04-10T21:11:43+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2024-04-10T21:11:43+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': 'attention.pdf',
 'total_pages': 15,
 'page': 0,
 'page_label': '1'}

In [55]:
document_unstructured[0].metadata

{'source': 'attention.pdf',
 'coordinates': {'points': ((16.34, 213.92000000000007),
   (16.34, 253.92000000000007),
   (36.34, 253.92000000000007),
   (36.34, 213.92000000000007)),
  'system': 'PixelSpace',
  'layout_width': 612,
  'layout_height': 792},
 'filename': 'attention.pdf',
 'languages': ['eng'],
 'last_modified': '2025-03-26T23:07:27',
 'page_number': 1,
 'filetype': 'application/pdf',
 'category': 'UncategorizedText',
 'element_id': 'da4d57fbf8f55a96700ed365e2d347f3'}
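
The two metadata layouts number pages differently: the PyPDFLoader output above uses a 0-based `page` key (with `page_label` '1'), while the Unstructured element uses a 1-based `page_number`. When page-level text is needed from element-level output, the elements can be grouped on that key. The following is a minimal sketch using plain dictionaries in place of Document objects; `group_elements_by_page` is a hypothetical helper and the sample elements are made up for illustration.

```python
from collections import defaultdict


def group_elements_by_page(elements: list[dict]) -> dict[int, str]:
    """Join element texts that share a page_number, preserving element order."""
    pages: dict[int, list[str]] = defaultdict(list)
    for element in elements:
        pages[element["metadata"]["page_number"]].append(element["page_content"])
    return {page: "\n".join(texts) for page, texts in pages.items()}


# Made-up sample elements; only the 'page_number' key follows the metadata shown above.
elements = [
    {"page_content": "Attention Is All You Need", "metadata": {"page_number": 1}},
    {"page_content": "Abstract", "metadata": {"page_number": 1}},
    {"page_content": "1 Introduction", "metadata": {"page_number": 2}},
]

pages = group_elements_by_page(elements)
print(pages[1])
```

With real Document objects the same idea applies, reading `d.metadata["page_number"]` and `d.page_content` instead of dictionary keys.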

In [75]:
for d in documents_pypdf:
    if d.metadata['page']==2:
        print(d.metadata)   
        print(d)
        print("\n")

{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3'}
page_content='Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around 

#### WebBaseLoader

WebBaseLoader fetches pages directly from URLs and uses BeautifulSoup to extract their text. The result often still contains navigation menus and runs of blank lines, so further post-processing is usually needed.

In [None]:
from langchain_community.document_loaders import WebBaseLoader
from bs4 import BeautifulSoup

## load a web page from a URL
loader = WebBaseLoader("https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html")
web_document = loader.load()

In [78]:
web_document[0]

Document(metadata={'source': 'https://www.langchain.com/blog/langchain-expression-language', 'title': '404', 'description': '404. Oops! page not found.', 'language': 'en'}, page_content='404\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProducts\n\nFrameworksLangGraphLangChainPlatformsLangSmithLangGraph PlatformResources\n\nResources HubBlogCustomer StoriesLangChain AcademyCommunityExpertsChangelogDocs\n\nPythonLangGraphLangSmithLangChainJavaScriptLangGraphLangSmithLangChainCompany\n\nAboutCareersPricing\n\nLangSmithLangGraph PlatformGet a demoSign up404Oops! page not found.The page you are looking for might have been removed hadÂ\xa0its name changed or is temporarily unavailable.Go to Home Page \n\n\n\n\n\n\n')
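
The long runs of `\n` in the page_content above are typical of scraped pages. One possible clean-up step, sketched here with the standard library only, is to normalise whitespace before passing the text on. `collapse_whitespace` is a hypothetical helper, not part of LangChain, and the sample string is a shortened version of the output above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace non-breaking spaces, trim each line, and squeeze runs of blank lines."""
    text = text.replace("\xa0", " ")
    lines = [line.strip() for line in text.splitlines()]
    squeezed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return squeezed.strip()


raw = "404\n\n\n\n\n\nProducts\n\nAboutCareersPricing\n\n\n"
print(collapse_whitespace(raw))
```

This only tidies whitespace; removing navigation text such as menu labels generally requires selecting specific HTML elements at parse time.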