```{contents}
```
## Document Loaders — PDF / CSV / HTML


Document loaders are components that **read raw files** and convert them into a unified internal format (`Document`) that LLM pipelines (RAG, QA, agents) can process.

They handle:

* File reading
* Text extraction
* Metadata attachment
* Standardization

In practice they are the **entry point of every RAG pipeline**.
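
To make the "unified internal format" concrete, here is a minimal sketch in plain Python. The `Document` dataclass and the `load_text` function are hypothetical stand-ins, not LangChain's own classes; they only mirror the two fields used throughout this section, `page_content` and `metadata`.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Document:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str) -> list[Document]:
    """Read one plain-text file and wrap it in a single Document."""
    text = Path(path).read_text(encoding="utf-8")
    return [Document(page_content=text, metadata={"source": path})]


# Usage: write a small file, then load it back as a Document.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("Reset the router before calling support.", encoding="utf-8")
    docs = load_text(str(sample))

print(docs[0].page_content)
print(docs[0].metadata)
```

Every loader in this section follows the same shape: read a file, extract text, attach metadata, and return a list of documents.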

---

### Why Loaders Matter in LLM Systems

Without loaders:

* No structured ingestion
* No metadata for retrieval
* No reproducible pipelines

With loaders:

* Clean ingestion
* Source traceability
* Scalable data pipelines

---

### Common Loader Types

| Loader                                | Used For                |
| ------------------------------------- | ----------------------- |
| PyPDFLoader                           | PDFs                    |
| CSVLoader                             | Structured tabular data |
| UnstructuredHTMLLoader / BSHTMLLoader | HTML files              |
| TextLoader                            | Plain text              |
| DirectoryLoader                       | Bulk ingestion          |

---

### PDF Loader

#### Purpose

Extracts text and metadata from PDF files.

---

#### Demonstration

```python
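# Recent LangChain releases ship loaders in the langchain-community package;
# PyPDFLoader also requires the pypdf package.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("manual.pdf")
documents = loader.load()  # one Document per page

print(documents[0].page_content)  # text of the first page
print(documents[0].metadata)      # source path and page index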
```

**Output Structure**

```text
Document(
  page_content="Extracted text from page...",
  metadata={"source": "manual.pdf", "page": 0}
)
```
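
Because each page becomes its own `Document`, the `page` metadata can be turned into a citation. The sketch below uses plain dictionaries as stand-ins for loaded pages; the sample text and the `cite` helper are hypothetical, and the code assumes the `page` value is a zero-based index.

```python
# Stand-ins for the Documents a PDF loader returns, one per page.
# Assumption: the "page" value is a zero-based index.
pages = [
    {"page_content": "Safety instructions ...",
     "metadata": {"source": "manual.pdf", "page": 0}},
    {"page_content": "Installation steps ...",
     "metadata": {"source": "manual.pdf", "page": 1}},
]


def cite(doc: dict) -> str:
    """Build a readable citation, converting to a 1-based page number."""
    meta = doc["metadata"]
    return f'{meta["source"]}, p. {meta["page"] + 1}'


citations = [cite(d) for d in pages]
print(citations)
```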

---

### CSV Loader

#### Purpose

Loads structured rows and converts each row into a `Document`.

---

#### Demonstration

```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="tickets.csv")
documents = loader.load()  # one Document per CSV row

print(documents[0].page_content)
```

**Result**

```text
id: 101
issue: login failed
priority: high
```

Each row becomes a searchable knowledge chunk.
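
The row-to-document conversion can be sketched with the standard library alone. This is an illustration of the idea rather than `CSVLoader`'s actual implementation: the sample data, the `rows_to_documents` helper, and the `row` metadata key are assumptions made for the example, and the sketch writes one `column: value` pair per line.

```python
import csv
import io

# Hypothetical sample data standing in for tickets.csv.
raw = "id,issue,priority\n101,login failed,high\n102,slow dashboard,low\n"


def rows_to_documents(text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # One "column: value" pair per line keeps the header with each value.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs


docs = rows_to_documents(raw, "tickets.csv")
print(docs[0]["page_content"])
```

Repeating the column names in every document means a chunk still makes sense when it is retrieved on its own, away from the header row.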

---

### HTML Loader

#### Purpose

Extracts readable text from local HTML files.

---

#### Demonstration

```python
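# BSHTMLLoader parses a local HTML file with BeautifulSoup (requires beautifulsoup4).
# To fetch a page by URL, use a web loader such as WebBaseLoader instead.
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("docs.html")
documents = loader.load()

print(documents[0].page_content)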
```

The loader strips the HTML tags and keeps the text; `BSHTMLLoader` also records the page title in the metadata.
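
What "keeping the text" means can be shown with Python's built-in `html.parser`. This is a simplified sketch, not how `BSHTMLLoader` works internally; the `TextExtractor` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


html = """<html><head><title>Docs</title><style>p {color: red;}</style></head>
<body><h1>Setup</h1><p>Install the <b>agent</b> first.</p>
<script>console.log("ignored");</script></body></html>"""

parser = TextExtractor()
parser.feed(html)
parser.close()
text = " ".join(parser.parts)
print(text)
```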

---

### Bulk Loading with DirectoryLoader

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "data/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)

documents = loader.load()
```

Loads **entire document collections**: every file under `data/` that matches the glob pattern is passed to `PyPDFLoader`.
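
The glob-and-dispatch idea behind bulk loading can be sketched with `pathlib`. The `load_directory` helper below is hypothetical and reads plain-text files so that it runs without extra dependencies; it is not `DirectoryLoader` itself.

```python
from pathlib import Path
import tempfile


def load_directory(root: str, pattern: str) -> list[dict]:
    """Load every file under root that matches pattern, one document each."""
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            docs.append({
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": path.name},
            })
    return docs


# Usage: build a small directory tree, then load only the .txt files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(tmp) / "skip.csv").write_text("x,y", encoding="utf-8")
    docs = load_directory(tmp, "**/*.txt")

print([d["metadata"]["source"] for d in docs])
```

The `**` in the pattern makes the search recursive, so files in subdirectories are picked up as well.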

---

### Typical RAG Ingestion Pipeline

```text
Files → Loaders → Documents → Chunking → Embeddings → Vector DB
```
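
The chunking stage of this pipeline can be sketched as a sliding character window that copies the parent document's metadata into every chunk. The `chunk_document` helper, the `start` metadata key, and the sizes used here are assumptions chosen for illustration; production splitters usually cut on sentence or paragraph boundaries instead.

```python
def chunk_document(doc: dict, size: int, overlap: int) -> list[dict]:
    """Split one document into overlapping character windows."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    text = doc["page_content"]
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "page_content": text[start:start + size],
            # Copy the parent metadata so each chunk stays traceable.
            "metadata": {**doc["metadata"], "start": start},
        })
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


doc = {"page_content": "abcdefghij",
       "metadata": {"source": "manual.pdf", "page": 0}}
chunks = chunk_document(doc, size=4, overlap=1)
print([c["page_content"] for c in chunks])
```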

---

### Best Practices

| Practice            | Reason                                         |
| ------------------- | ---------------------------------------------- |
| Preserve metadata   | Enables citations and metadata filtering       |
| Chunk after loading | Smaller passages match queries more precisely  |
| Normalize text      | Removes whitespace and encoding noise          |
| Store source info   | Lets answers be traced back to the source file |
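
As one possible reading of "normalize text", the sketch below applies Unicode NFKC normalization and collapses whitespace before the text is chunked or embedded. The `normalize_text` helper and the sample string are hypothetical; the right normalization depends on the corpus.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


# The sample contains a ligature (U+FB01), a no-break space, and stray newlines,
# all of which are common in text extracted from PDFs.
raw = "Con\ufb01gure\u00a0the   agent\n\nbefore  use. "
clean = normalize_text(raw)
print(clean)
```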

---

### Mental Model

```text
Loader = Translator between raw data and LLM pipelines
```

Without a loading step, a RAG pipeline has nothing to retrieve from.

---

### Key Takeaways

* Loaders standardize all data into `Document` objects
* PDF, CSV, and HTML loaders cover many common enterprise data formats
* They are the foundation of every RAG system
* Correct loading = better retrieval = better answers
