# Document Loaders in LangChain

**Document loaders** are LangChain components that load data from various sources into a standardized format (usually `Document` objects), which can then be used for chunking, embedding, retrieval, and generation.
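
To make the "standardized format" concrete, here is a minimal stand-in for a `Document`, assuming only the two fields this notebook relies on later: `page_content` and `metadata`. `SimpleDocument` is a hypothetical name used for illustration; it is not LangChain's own class.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str                              # the extracted text
    metadata: dict = field(default_factory=dict)   # e.g. source path, page number


doc = SimpleDocument(
    page_content="Once upon a time...",
    metadata={"source": "Story_Book/salt-water.txt"},
)

print(doc.page_content)
print(doc.metadata["source"])
```

Every loader below produces objects of roughly this shape, which is why the same chain can consume text, PDF, web, and CSV content.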


In [11]:
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from dotenv import load_dotenv
load_dotenv()

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    temperature=0.7,
    max_new_tokens=256,
)

MetaLlama = ChatHuggingFace(llm=llm)

## TextLoader

**TextLoader** loads plain text (`.txt`) files and converts them into LangChain `Document` objects.

It helps bring raw text into a LangChain pipeline for embedding, search, and question answering.

**Limit:** Reads files as plain text only; it does not parse formats such as PDF or DOCX

In [17]:
from langchain_community.document_loaders import TextLoader
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

stringParser = StrOutputParser()

prompt = PromptTemplate(
    template="Give the storyline and main character of the following story: \n {story}",
    input_variables=['story']
)

chain = prompt | MetaLlama | stringParser

textLoader = TextLoader('Story_Book/salt-water.txt')

saltWater = textLoader.load()

res = chain.invoke({'story': saltWater[0].page_content})

print(res)

**Main Character:**

The main character of the story is Neil D'Arcy, a young boy who is about to start his life at sea as a sailor in the English Navy. The story is written in the first person from Neil's perspective, allowing the reader to experience his thoughts, feelings, and adventures firsthand.

**Story Line:**

The story begins with Neil's early life in Ireland, where he is born and raised in a family with a strong maritime tradition. His mother dies when he is young, and his father, a wounded soldier, follows her to the grave. Neil is left in the care of Larry Harrigan, the family's loyal butler and a seasoned sailor who teaches him the nautical arts.

As Neil grows up, he becomes increasingly fascinated with the sea and decides to pursue a career in the navy. With the support of his uncle, a Dublin barrister, and Larry, Neil begins to prepare himself for life at sea.

The story takes Neil to various locations, including Cork, Dublin, London, and Portsmouth, where he meets new 

## PyPDFLoader

**PyPDFLoader** loads PDF (`.pdf`) files and converts each page into a LangChain `Document` object.

It is used to bring PDF content into a LangChain pipeline for search, embeddings, and question answering.

**Limit:** Works only with text-based PDFs (OCR needed for scanned files)
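
One rough way to spot a scanned PDF is to check how many pages come back with no extractable text. The sketch below assumes a list of per-page strings, like the `page_content` values a loader returns. `pages` is made-up sample data and `empty_page_ratio` is a hypothetical helper, not part of LangChain.

```python
def empty_page_ratio(page_texts):
    """Return the fraction of pages whose extracted text is blank."""
    if not page_texts:
        return 0.0
    empty = sum(1 for text in page_texts if not text.strip())
    return empty / len(page_texts)


# Made-up sample: two pages with text, two that came back blank.
pages = ["Chapter I", "   \n", "The Southern boy readily pulled out", ""]

ratio = empty_page_ratio(pages)
print(f"{ratio:.0%} of pages have no text")
if ratio > 0.5:
    # The 0.5 threshold is an arbitrary choice for this sketch.
    print("Mostly blank: the PDF may be scanned and need OCR.")
```

With the variables from the cell below, the same check could be run as `empty_page_ratio([d.page_content for d in fastNine])`.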

In [18]:
from langchain_community.document_loaders import PyPDFLoader

pdfloader = PyPDFLoader('Story_Book/fast-nine.pdf')

fastNine = pdfloader.load()

print(fastNine)



In [63]:
print(fastNine[2].page_content[:177])

"What time is it, Chatz; since you seem to be the only one in the lot who 
had the good sense and also the decency to fetch a watch along?" 
The Southern boy readily pulled out 


## DirectoryLoader

**DirectoryLoader** loads all supported files from a folder and converts them into LangChain `Document` objects.

It is useful for quickly importing large collections of documents into a LangChain pipeline.

**Limit:** Needs a loader class (`loader_cls`) that can read every file matched by `glob`
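
The `glob` pattern decides which files are picked up. The sketch below uses only the standard library and assumes the pattern behaves like `pathlib.Path.glob`. The file names and the `LOADER_FOR_SUFFIX` mapping are hypothetical and exist only to show which files a pattern would match and which would have no suitable loader.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from file suffix to a loader name (illustration only).
LOADER_FOR_SUFFIX = {".pdf": "PyPDFLoader", ".txt": "TextLoader"}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create a few empty sample files, one of them in a subfolder.
    (root / "nested").mkdir()
    for name in ["a.pdf", "b.txt", "notes.md", "nested/c.pdf"]:
        (root / name).touch()

    # "*.pdf" matches only PDFs directly inside the folder.
    top_level_pdfs = sorted(p.name for p in root.glob("*.pdf"))
    # "**/*.pdf" also descends into subfolders.
    all_pdfs = sorted(p.name for p in root.glob("**/*.pdf"))
    # Files whose suffix has no entry in the mapping above.
    unsupported = sorted(
        p.name for p in root.rglob("*")
        if p.is_file() and p.suffix not in LOADER_FOR_SUFFIX
    )

print(top_level_pdfs)
print(all_pdfs)
print(unsupported)
```

A narrow pattern such as `*.pdf` keeps unsupported files out of the run, which is the approach the cell below takes.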

In [32]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

directoryLoader = DirectoryLoader(
    path='Story_Book',
    glob='*.pdf',
    loader_cls=PyPDFLoader
)

story = directoryLoader.load()

In [62]:
print(story[114].page_content[:177])

thus far he had not made much progress toward reaching the ordinary 
height of a lad of fifteen. Still, he clung to hope and tried to fill his position 
as Number Four in the Be


## Load vs Lazy Load in LangChain

### `load()`
**load()** loads all documents at once into memory and returns them as a list.  
Best for small datasets where memory usage is not a concern.

### `lazy_load()`
**lazy_load()** loads documents one by one only when needed.  
Best for large datasets and memory-efficient processing.

### Key Differences

| Feature                | load()                         | lazy_load()                       |
|------------------------|--------------------------------|-----------------------------------|
| Loading style          | All at once                    | One by one                        |
| Memory usage           | High                           | Low                               |
| Time to first document | After every document is loaded | As soon as the first one is ready |
| Return type            | List of Documents              | Generator of Documents            |
| Best for               | Small datasets                 | Large datasets                    |
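
The difference is the same as that between a Python list and a generator. The sketch below is a plain-Python illustration, not LangChain code. `read_page`, `load_all`, and `lazy_load_all` are hypothetical functions that log each read so the order of work is visible.

```python
events = []  # records the order in which reads happen


def read_page(n):
    """Pretend to read one page; log the call so the order is visible."""
    events.append(f"read {n}")
    return f"page {n}"


def load_all(count):
    """Eager: read every page, then return the full list."""
    return [read_page(n) for n in range(count)]


def lazy_load_all(count):
    """Lazy: a generator that reads a page only when asked for it."""
    for n in range(count):
        yield read_page(n)


pages = load_all(3)             # all three reads happen here
eager_reads = len(events)

events.clear()
lazy_pages = lazy_load_all(3)   # nothing has been read yet
reads_before_iteration = len(events)
first = next(lazy_pages)        # exactly one read happens now
reads_after_first = len(events)

print(eager_reads, reads_before_iteration, reads_after_first)
```

Note that a generator can be consumed only once; to loop over the documents a second time, call `lazy_load()` again or collect them into a list.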


In [60]:
# With `load()`
directoryLoader = DirectoryLoader(
    path='Story_Book',
    glob='*.pdf',
    loader_cls=PyPDFLoader
)

story = directoryLoader.load()

for doc in story:
    print(doc.metadata['author'], end=' | ')

Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Dougl

In [59]:
# With `lazy_load()`
directoryLoader = DirectoryLoader(
    path='Story_Book',
    glob='*.pdf',
    loader_cls=PyPDFLoader
)

story = directoryLoader.lazy_load()

for doc in story:
    print(doc.metadata['author'], end=' | ')

Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Douglas | Alan Dougl

## WebBaseLoader

**WebBaseLoader** is a document loader in LangChain used to load and extract text content from web pages (URLs).

It fetches each page over HTTP and uses **BeautifulSoup** under the hood to parse the HTML and extract its text.

### When to Use
- Blogs  
- News articles  
- Public websites with mostly static, text-based content  

### Limitations
- Does not handle JavaScript-heavy pages well  
- Loads only static HTML content (not dynamically rendered content)  
- Use `SeleniumURLLoader` for dynamic websites
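
The static-HTML limitation can be illustrated without any network access. The sketch below is not how WebBaseLoader is implemented: it swaps in the standard library's `html.parser` in place of BeautifulSoup, and the HTML snippet is made up. It shows that text a script would insert at runtime never appears, because the script is never executed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a script or style element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


# Made-up page: the price is only filled in by JavaScript at runtime.
html = """
<html><body>
  <h1>Sample RAM Kit</h1>
  <p id="price"></p>
  <script>document.getElementById("price").textContent = "Rs. 999";</script>
</body></html>
"""

parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print(text)
```

The heading is extracted, but the price is missing from `text`, which is the kind of gap a browser-driven loader is meant to close.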


In [50]:
from langchain_community.document_loaders import WebBaseLoader

url = "https://www.amazon.in/CORSAIR-Vengeance-5200MHz-Compatible-Computer/dp/B0D9PRVBRZ/ref=sr_1_10?crid=32XTEJWO0MY4V&dib=eyJ2IjoiMSJ9.82cktfcHlx_lZ4-4jLDPben7FnsTTQzOe_YpoSdIr4EorlWcnnHKWebKIXhsYDaIsTqkUbDEiYJFgNRW--wtcs9cdp3Q1hKDryynNavDvb2OdV_q7_E7w0uiKfK1f5eyggS7KGd5_9uHSFoHyqRMUGmU6PeC8_Aw0r8k9U-0alwWQVKpntCrkxT3AIuScQpBvX0g1KQPwu8xxo47g-o847SFc_3jatU92_hATYFmErQ.vnqm8taIp5kexz8QW8oKekzwQzbItWj-1yxK_NWNbu4&dib_tag=se&keywords=ram%2B16gb&qid=1767863591&sprefix=ram%2Caps%2C455&sr=8-10&th=1"

webLoader = WebBaseLoader(url)

product_detail = webLoader.load()

prompt = PromptTemplate(
    template="Answer the following question:\n {question}\n using the given product detail:\n {product_detail}",
    input_variables=['question', 'product_detail']
)

chain = prompt | MetaLlama | stringParser

res = chain.invoke({
    'question': 'Which product is this, and what is its price?',
    'product_detail': product_detail[0].page_content
})

print(res)

The product is CORSAIR Vengeance RGB DDR5 16GB (2 x 8GB) DDR5 5200 CL40-40-40-77 1.25V Intel XMP - Black.

The price of the product is ₹24,692.00.


In [57]:
product_detail[0].page_content.strip().replace('\n', '').replace('\t', ' ').replace('\r', ' ')

'Amazon.in: Buy CORSAIR Vengeance RGB DDR5 16GB (2 x 8GB) DDR5 5200 CL40-40-40-77 1.25V Intel XMP - Black Online at Low Prices in India | Corsair Reviews &amp; RatingsSkip to        Main content              About this item              About this item              About this item              Buying options              Compare with similar items              Videos              Reviews            Keyboard shortcuts  Searchalt+/Cartshift+alt+CHomeshift+alt+HOrdersshift+alt+OAdd to cartshift+alt+KShow/Hide shortcutsshift+alt+ZTo move between items, use your keyboard\'s up or down arrows..in                   Delivering to Patna 800020                                   Update location                AllSelect the department you want to search inAll CategoriesAlexa SkillsAmazon DevicesAmazon FashionAmazon FreshAmazon PharmacyAppliancesApps & GamesAudible AudiobooksBabyBeautyBooksCar & MotorbikeClothing & AccessoriesCollectiblesComputers & AccessoriesDealsElectronicsFurnitureGarden & Outd
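
In the output above, some words run together (for example `RatingsSkip`) because deleting `\n` outright removes the only separator between them. A possible alternative, sketched here on a made-up string, is to split on any run of whitespace and rejoin with single spaces.

```python
# Made-up sample resembling scraped page text.
raw = "Ratings\nSkip to \n\n   Main content\t\tAbout this item  "

# Same chain of replacements as the cell above: newlines are deleted,
# so words separated only by a newline get glued together.
glued = raw.strip().replace("\n", "").replace("\t", " ").replace("\r", " ")

# Alternative: split on any whitespace run and rejoin with single spaces.
normalized = " ".join(raw.split())

print(repr(glued))
print(repr(normalized))
```

The second form also collapses the long runs of spaces, which reduces the number of tokens sent to the model.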

## CSVLoader

**CSVLoader** loads CSV (`.csv`) files and converts each row into a LangChain `Document` object.

It is used to bring structured tabular data into a LangChain pipeline for search and question answering.

**Limit:** Works only with CSV files
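
The row-to-document conversion can be approximated with the standard `csv` module. This is a sketch of the idea rather than CSVLoader's actual code. The first data row mirrors the output shown below; the second row is made up, and plain dictionaries stand in for `Document` objects.

```python
import csv
import io

# First data row mirrors this notebook's output; the second row is made up.
csv_text = """farmer_id,name,age,village,district,state,crop,land_acres
1,Ramesh Kumar,45,Rampur,Varanasi,Uttar Pradesh,Wheat,3.5
2,Example Farmer,38,Example Village,Patna,Bihar,Rice,2.0
"""

documents = []
for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # One document per row: "column: value" lines, plus the row index as metadata.
    content = "\n".join(f"{key}: {value}" for key, value in row.items())
    documents.append({"page_content": content, "metadata": {"row": i}})

print(documents[0]["page_content"])
print(documents[1]["metadata"])
```

Because each row becomes its own document, a retriever can return a single matching record instead of the whole file.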


In [72]:
from langchain_community.document_loaders import CSVLoader

csvLoader = CSVLoader(file_path='farmers_detail.csv')

detail = csvLoader.load()

print(detail[0].page_content)

farmer_id: 1
name: Ramesh Kumar
age: 45
village: Rampur
district: Varanasi
state: Uttar Pradesh
crop: Wheat
land_acres: 3.5
