```{contents}
```
## Document Loader

### What a Document Loader Is

A **Document Loader** in LangChain is a component that **loads raw data from a source and converts it into standardized `Document` objects**.

> It is the **entry point of any RAG pipeline**.

Document loaders do **not embed, chunk, or retrieve** data.
They only **read and normalize data**.

---

### Why Document Loaders Exist

Data comes in many formats:

* PDFs
* Text files
* HTML pages
* CSVs
* Databases
* APIs

LLMs cannot consume these formats directly.

Document loaders:

* Read source data
* Normalize content
* Attach metadata
* Output LangChain `Document` objects
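The four responsibilities listed above can be sketched without any dependencies. This is only an illustration of the idea, not LangChain's implementation: `SimpleDocument` and `load_text_file` are hypothetical names, and the dataclass is a stand-in for LangChain's `Document`.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: Path) -> list[SimpleDocument]:
    raw = path.read_bytes().decode("utf-8")       # 1. read source data
    text = raw.replace("\r\n", "\n").strip()      # 2. normalize content
    metadata = {"source": str(path)}              # 3. attach metadata
    return [SimpleDocument(text, metadata)]       # 4. output document objects


# Write a small sample file with Windows line endings, then load it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"Hello loaders.\r\nSecond line.\r\n")
    docs = load_text_file(sample)

print(docs[0].page_content)
print(docs[0].metadata)
```

The loader returns a list even for a single file, which mirrors the convention that loaders always produce a list of documents.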

---

### What a Document Object Is

#### Document Structure



```python
from langchain_core.documents import Document

Document(
    page_content="Actual text content",
    metadata={"source": "file.pdf", "page": 1}
)
```

Output:

```python
Document(metadata={'source': 'file.pdf', 'page': 1}, page_content='Actual text content')
```



Every loader outputs a **list of `Document` objects**.

---

### Where Document Loaders Fit in RAG

```
Raw Data Source
   ↓
Document Loader
   ↓
Documents
   ↓
Text Splitter
   ↓
Chunks
   ↓
Embeddings
   ↓
Vector Store
   ↓
Retriever
```

Document loaders are **always the first step**.
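To make the ordering of the stages concrete, here is a toy, dependency-free sketch of the same flow. Everything in it is an assumption made for illustration: the sample texts, the sentence-based splitter, the bag-of-words "embedding", and the plain list used as a vector store are stand-ins, not LangChain components.

```python
import math
from collections import Counter

# Hypothetical raw sources; a real loader would read these from disk or the web.
RAW_SOURCES = {
    "pets.txt": "Cats sleep most of the day. Dogs enjoy long walks.",
    "food.txt": "Bread needs flour and yeast. Soup needs stock and salt.",
}


def load(sources):
    """Loader stage: one document (text + metadata) per source."""
    return [{"text": text, "metadata": {"source": name}} for name, text in sources.items()]


def split(documents):
    """Splitter stage: one chunk per sentence, metadata copied from the parent."""
    chunks = []
    for doc in documents:
        for sentence in doc["text"].split(". "):
            chunks.append({"text": sentence.strip(". "), "metadata": dict(doc["metadata"])})
    return chunks


def embed(text):
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Ingestion time: loader -> splitter -> embeddings -> vector store (a plain list here).
store = [(embed(chunk["text"]), chunk) for chunk in split(load(RAW_SOURCES))]


def retrieve(query, k=1):
    """Retriever stage (query time): rank stored chunks by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


top = retrieve("what do dogs enjoy")[0]
print(top["text"], top["metadata"])
```

Note that the `source` metadata attached by the loader stage is still available on the chunk returned at query time.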

---

### Common Types of Document Loaders

#### File-Based Loaders

* TextLoader
* PyPDFLoader
* CSVLoader
* JSONLoader

#### Web Loaders

* WebBaseLoader
* SitemapLoader
* UnstructuredURLLoader

#### Code & Repo Loaders

* GitLoader
* DirectoryLoader

#### Database Loaders

* SQLDatabaseLoader
* Custom DB loaders

---

### Basic Document Loader Demonstration

#### Loading a Text File



```python
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/sample.txt")
documents = loader.load()
```

Output:

```python
[Document(page_content="...", metadata={"source": "data/sample.txt"})]
```

---

### Loading Multiple Files (DirectoryLoader)



```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
    path="data/",
    glob="**/*.txt",
    loader_cls=TextLoader
)

documents = loader.load()
```




This loads **every file that matches the glob**; the `**/` prefix in the pattern makes the match recursive.
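The effect of the `**/` prefix can be checked with the standard library alone. The sketch below assumes that `DirectoryLoader` matches files with `pathlib`-style glob semantics; the directory layout and file names are hypothetical.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested" / "deeper").mkdir(parents=True)
    (root / "a.txt").write_text("top level", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("one level down", encoding="utf-8")
    (root / "nested" / "deeper" / "c.txt").write_text("two levels down", encoding="utf-8")
    (root / "nested" / "notes.md").write_text("wrong extension", encoding="utf-8")

    # "**/*.txt" descends into subdirectories; "*.txt" stays at the top level.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))
    top_only = sorted(p.relative_to(root).as_posix() for p in root.glob("*.txt"))

print(recursive)
print(top_only)
```

Comparing the two lists shows which files a non-recursive pattern would miss.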

---

### PDF Document Loader



```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/report.pdf")
documents = loader.load()
```




Each page becomes a separate `Document`.

Metadata includes:

* `page`: the zero-based page index
* `source`: the path of the PDF file

---

### CSV Loader



```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    file_path="tickets.csv",
    source_column="ticket_id"  # value of this column becomes each Document's "source"
)

documents = loader.load()
```




Each row becomes a `Document`, and `source_column` sets that Document's `source` metadata from the named column.
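The row-to-document mapping can be approximated with the standard `csv` module. This is a sketch of the idea rather than `CSVLoader`'s actual code: `SimpleDocument`, `load_csv`, and the ticket data are hypothetical, and the `column: value` layout of `page_content` is an assumption about how a row might be flattened into text.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up sample data standing in for tickets.csv.
CSV_TEXT = "ticket_id,subject,status\nT-100,Login fails,open\nT-101,Slow export,closed\n"


def load_csv(text: str, source_column: str) -> list[SimpleDocument]:
    documents = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(text))):
        # Flatten the row into "column: value" lines so it reads as plain text.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        metadata = {"source": row[source_column], "row": row_number}
        documents.append(SimpleDocument(content, metadata))
    return documents


docs = load_csv(CSV_TEXT, source_column="ticket_id")
print(docs[0].page_content)
print(docs[1].metadata)
```

Using a stable identifier such as a ticket ID as the source makes each row individually citable later in the pipeline.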

---

### Web Page Loader



```python
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com")
documents = loader.load()
```

Output:

```
USER_AGENT environment variable not set, consider setting it to identify your requests.
```

Setting the `USER_AGENT` environment variable removes this warning.




Metadata includes:

* `source`: the page URL
* `title` (if available)

---

### Loader + Text Splitter (Typical Pattern)



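```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,   # maximum characters per chunk
    chunk_overlap=50  # characters shared between neighbouring chunks
)

# `documents` is the list returned by any of the loaders above
chunks = splitter.split_documents(documents)
```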




Loaders do not chunk text for retrieval; **splitters do**.
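To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simple fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper that only illustrates the two parameters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the overlap that helps preserve context across chunk boundaries.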

---

### Metadata Handling (Critical)

Metadata travels through the entire pipeline:

```python
Document(
    page_content="...",
    metadata={
        "source": "report.pdf",
        "page": 3,
        "category": "finance"
    }
)
```

Used later for:

* Filtering
* Citations
* Debugging
* Access control
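A short sketch of how those uses might look in practice. The documents are represented as plain dictionaries with made-up values, and `filter_by_metadata` and `cite` are hypothetical helpers, not LangChain functions.

```python
# Hypothetical documents; the text and metadata values are made up for illustration.
documents = [
    {"page_content": "Revenue summary for the quarter.",
     "metadata": {"source": "report.pdf", "page": 3, "category": "finance"}},
    {"page_content": "Onboarding checklist for new hires.",
     "metadata": {"source": "handbook.pdf", "page": 12, "category": "hr"}},
    {"page_content": "Budget forecast for next year.",
     "metadata": {"source": "report.pdf", "page": 7, "category": "finance"}},
]


def filter_by_metadata(docs, **criteria):
    """Filtering: keep documents whose metadata matches every given key/value pair."""
    return [d for d in docs if all(d["metadata"].get(k) == v for k, v in criteria.items())]


def cite(doc):
    """Citations: build a human-readable reference from source and page."""
    meta = doc["metadata"]
    return f"{meta['source']}, page {meta['page']}"


finance_docs = filter_by_metadata(documents, category="finance")
citations = [cite(d) for d in finance_docs]

# Access control: only expose documents in categories the current user may read.
allowed_categories = {"hr"}
visible = [d for d in documents if d["metadata"]["category"] in allowed_categories]

print(citations)
print([d["metadata"]["source"] for d in visible])
```

None of this works if metadata is dropped during loading or splitting, which is why it needs to be preserved at every stage.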

---

### Custom Document Loader

#### When Built-in Loaders Are Not Enough



```python
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


def fetch_from_api():
    """Placeholder function to fetch data from an API."""
    # Replace this with actual API call logic
    return [
        {"id": 1, "text": "Sample document 1"},
        {"id": 2, "text": "Sample document 2"}
    ]


class MyAPILoader(BaseLoader):
    """Wraps each API record in a Document."""

    def load(self):
        data = fetch_from_api()
        # Keep the record id in metadata so each Document can be traced back
        return [
            Document(
                page_content=item["text"],
                metadata={"id": item["id"]}
            )
            for item in data
        ]
```




---

### Document Loader vs Text Splitter

| Aspect    | Document Loader | Text Splitter     |
| --------- | --------------- | ----------------- |
| Purpose   | Read data       | Chunk data        |
| Input     | Raw source      | Documents         |
| Output    | Documents       | Smaller Documents |
| LLM usage | ❌               | ❌                 |

---

### Document Loader vs Retriever

| Aspect           | Loader         | Retriever  |
| ---------------- | -------------- | ---------- |
| Timing           | Ingestion time | Query time |
| Reads raw source | ✅              | ❌          |
| Performs search  | ❌              | ✅          |

---

### Common Mistakes

#### Expecting loader to chunk text

❌ Loaders only read data

#### Dropping metadata

❌ Breaks traceability

#### Loading entire huge files blindly

❌ `load()` holds every `Document` in memory at once; prefer `lazy_load()` for large sources
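One way to avoid the memory problem is to yield documents one at a time instead of building the full list. The sketch below is dependency-free: `SimpleDocument` and `LineLoader` are hypothetical, and the `lazy_load` / `load` method names simply mirror the ones on LangChain's `BaseLoader`.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that produces one document per non-empty line."""

    def __init__(self, path):
        self.path = Path(path)

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Reads the file line by line, so only one line is held at a time.
        with self.path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                text = line.strip()
                if text:
                    yield SimpleDocument(text, {"source": str(self.path), "line": line_number})

    def load(self) -> list[SimpleDocument]:
        # Eager variant: materializes everything, fine for small sources only.
        return list(self.lazy_load())


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "big.txt"
    sample.write_text("alpha\n\nbeta\ngamma\n", encoding="utf-8")
    loader = LineLoader(sample)

    stream = loader.lazy_load()
    first = next(stream)   # only the first line has been read so far
    stream.close()         # release the file handle before cleanup

    all_docs = loader.load()

print(first.page_content, first.metadata["line"])
print(len(all_docs))
```

With the lazy variant, downstream steps such as splitting and embedding can process each document as it arrives instead of waiting for the whole source.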

#### Using loaders at query time

❌ Should be ingestion-only

---

### Best Practices

* Use loaders only during ingestion
* Preserve meaningful metadata
* Choose loader based on data format
* Combine with appropriate text splitters
* Validate file encoding and clean up the extracted text
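As an illustration of the last point, the sketch below decodes bytes with a fallback encoding and tidies whitespace before the text is wrapped in a document. `decode_with_fallback` and `clean_text` are hypothetical helpers, and the choice of `cp1252` as the fallback is an assumption that suits many Western-language files but not all data.

```python
import re


def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")) -> tuple[str, str]:
    """Try each encoding in order and report which one succeeded."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def clean_text(text: str) -> str:
    """Normalize line endings and collapse redundant whitespace."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # no spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


decoded, used_encoding = decode_with_fallback("café".encode("cp1252"))
cleaned = clean_text("  Hello   world \r\n\r\n\r\n\r\nNext\tline  ")

print(decoded, used_encoding)
print(repr(cleaned))
```

Recording which encoding was actually used (for example in metadata) makes it easier to debug garbled text later.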

---

### When to Use Which Loader

| Data Source    | Loader          |
| -------------- | --------------- |
| Plain text     | TextLoader      |
| PDF            | PyPDFLoader     |
| CSV            | CSVLoader       |
| Website        | WebBaseLoader   |
| Repository     | GitLoader       |
| Multiple files | DirectoryLoader |

---

### Interview-Ready Summary

> “A Document Loader in LangChain reads raw data from various sources and converts it into standardized `Document` objects with content and metadata. It is the first step in any RAG ingestion pipeline.”

---

### Rule of Thumb

* **Ingestion time → Document Loader**
* **Chunking → Text Splitter**
* **Search → Retriever**
* **Answering → LLM**
