```{contents}
```

## Data Source Loader

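LangChain ships **many** document loaders, but they fall into a few clear **categories**. You do not need to memorize every class name; understand the groups and what each one loads. In current releases most of these classes are imported from `langchain_community.document_loaders`, and many need an extra package installed (for example, `pypdf` for `PyPDFLoader`).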

---

### **1) File-Based Loaders**

Used to load **documents from local files** into LangChain `Document` objects.

| Format             | Loader                                             |
| ------------------ | -------------------------------------------------- |
| Plain text         | `TextLoader`                                       |
| PDF                | `PyPDFLoader`, `PDFPlumberLoader`, `PyMuPDFLoader` |
| Word files (.docx) | `Docx2txtLoader`                                   |
| PowerPoint (.pptx) | `UnstructuredPowerPointLoader`                     |
| Excel (.xlsx)      | `UnstructuredExcelLoader`                          |
| CSV                | `CSVLoader`, `UnstructuredCSVLoader`               |
| Markdown           | `UnstructuredMarkdownLoader`                       |
| EML/MSG emails     | `UnstructuredEmailLoader`                          |
| HTML files         | `UnstructuredHTMLLoader`, `BSHTMLLoader`           |

Key point: “Unstructured” loaders rely on the **unstructured** library for extraction.
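To make the file-to-`Document` step concrete, here is a minimal sketch of what a plain-text loader does conceptually. It is not LangChain's implementation: the `Document` dataclass is a stand-in for LangChain's class, and `load_text_file` is a hypothetical helper written only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[Document]:
    """Hypothetical helper: read one file and wrap it as a single Document."""
    text = Path(path).read_text(encoding=encoding)
    # Recording the origin in metadata lets later stages cite the source.
    return [Document(page_content=text, metadata={"source": str(path)})]


# Demo on a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("Loaders turn files into documents.", encoding="utf-8")
    docs = load_text_file(str(sample))

print(docs[0].page_content)
```

Format-specific loaders follow the same shape; they only swap the "read the file" step for a PDF, Word, or spreadsheet parser.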

---

### **2) Directory / Folder Loaders**

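Used to load **many files** at once.

| Loader            | Purpose                                                                     |
| ----------------- | --------------------------------------------------------------------------- |
| `DirectoryLoader` | Loads every file in a folder that matches a glob pattern                    |
| `GenericLoader`   | Pairs a blob loader (finds files) with a parser (turns them into documents) |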

Usage example:

```python
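# Current releases import loaders from langchain_community, not langchain.document_loaders.
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# "**/*.txt" also matches subfolders; "*.txt" would match the top level only.
loader = DirectoryLoader("data/", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()  # one Document per matched file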
```
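The `glob` argument decides which files are picked up. The sketch below uses only the standard library and assumes `DirectoryLoader` applies the pattern with `pathlib`-style semantics; it shows the difference between a top-level pattern and a recursive one on a throwaway folder.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("top level", encoding="utf-8")
    (root / "sub" / "b.txt").write_text("nested", encoding="utf-8")

    # "*.txt" only looks at the folder itself.
    top_only = sorted(p.name for p in root.glob("*.txt"))
    # "**/*.txt" also descends into subfolders.
    recursive = sorted(p.name for p in root.glob("**/*.txt"))

print(top_only)   # ['a.txt']
print(recursive)  # ['a.txt', 'b.txt']
```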

---

### **3) Web / Online Content Loaders**

Load data from **websites**, media platforms, and APIs.

| Source                | Loader              |
| --------------------- | ------------------- |
| Webpages              | `WebBaseLoader`     |
| Sitemap of whole site | `SitemapLoader`     |
| YouTube transcripts   | `YoutubeLoader`     |
| Notion pages          | `NotionDBLoader`    |
| Confluence            | `ConfluenceLoader`  |
| SharePoint            | `SharePointLoader`  |
| Google Drive          | `GoogleDriveLoader` |
| GitHub repos          | `GitLoader`         |

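Key idea: these loaders **fetch online data**, convert it to text, and wrap it in `Document` objects. The platform loaders (Notion, Confluence, SharePoint, Google Drive) typically also need API credentials.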
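The convert-and-wrap steps can be sketched with the standard library alone. This is an illustration under assumptions, not how `WebBaseLoader` is implemented: the `Document` dataclass is a stand-in, `html_to_document` is a hypothetical helper, and a hard-coded string replaces the HTTP response so the example runs offline.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Document:
    """Stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_document(html: str, url: str) -> Document:
    """Hypothetical helper: strip markup and keep the URL as the source."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return Document(page_content="\n".join(parser.parts), metadata={"source": url})


# A fixed string stands in for the fetched page.
page = (
    "<html><head><title>Docs</title><style>p{}</style></head>"
    "<body><h1>Loaders</h1><p>Fetch, convert, wrap.</p></body></html>"
)
doc = html_to_document(page, "https://example.com/docs")
print(doc.page_content)
```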

---

### **4) Database Loaders**

Load documents directly from **databases**.

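| DB Type                       | Loader              |
| ----------------------------- | ------------------- |
| SQL (Postgres / MySQL / etc.) | `SQLDatabaseLoader` |
| MongoDB                       | `MongodbLoader`     |

Elasticsearch and vector stores such as Pinecone or Chroma are usually not read through document loaders. They are queried through their vector store or retriever classes (for example `ElasticsearchStore`), and `VectorStoreIndexWrapper` is a query helper around an existing vector store, not a loader.

These are useful when building RAG over enterprise data.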
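A database loader typically maps each row to one `Document`: one column becomes the text and the remaining columns become metadata. The sketch below shows that mapping with an in-memory SQLite database. The `Document` dataclass, the `rows_to_documents` helper, and the `tickets` table are all hypothetical, made up for this example.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(conn, query, content_column):
    """Hypothetical helper: one Document per row of the query result."""
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    docs = []
    for row in cursor:
        record = dict(zip(columns, row))
        # The chosen column is the text; every other column is metadata.
        content = str(record.pop(content_column))
        docs.append(Document(page_content=content, metadata=record))
    return docs


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, team TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO tickets (team, body) VALUES (?, ?)",
    [("billing", "Invoice total is wrong."), ("infra", "Disk usage alert on db-1.")],
)
docs = rows_to_documents(conn, "SELECT id, team, body FROM tickets ORDER BY id", "body")
conn.close()

print(docs[0])
```

Keeping identifiers such as `id` and `team` in metadata makes it possible to filter or trace retrieved chunks back to the original record.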

---

### **5) Cloud Storage Loaders**

| Storage           | Loader                            |
| ----------------- | --------------------------------- |
| AWS S3            | `S3DirectoryLoader`               |
| Azure Blob        | `AzureBlobStorageContainerLoader` |
| GCP Cloud Storage | `GCSDirectoryLoader`              |

Used to load large corpora stored remotely.

---

### **6) “Unstructured” Family (Important)**

These delegate parsing to the `unstructured` library, which suits formats where layout carries meaning (headings, lists, tables):

* `UnstructuredFileLoader`
* `UnstructuredPDFLoader`
* `UnstructuredPowerPointLoader`
* `UnstructuredHTMLLoader`

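They extract the text **while preserving readability**. By default each file becomes one `Document`; with `mode="elements"` each detected element (title, paragraph, list item, table) becomes its own `Document`.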

---

### **7) Specialized / Legacy / Misc**

| Loader                     | Purpose                            |
| -------------------------- | ---------------------------------- |
| `WikipediaLoader`          | Load from Wikipedia                |
| `HuggingFaceDatasetLoader` | Load datasets from HuggingFace Hub |
| `BigQueryLoader`           | Load from Google BigQuery          |
| `SlackDirectoryLoader`     | Load Slack export data             |
| `EverNoteLoader`           | Load Evernote notes                |

---

**Key Insight**

All loaders convert input → `Document` objects, each holding `page_content` (the text) and `metadata` (a dict, typically including the source).

```
source data → loader → Document → text splitter → embeddings → vector store
```

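Loaders differ **mainly** in how they read the source. They also differ in granularity (`PyPDFLoader` yields one `Document` per page, `CSVLoader` one per row) and in the metadata they attach.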
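The hand-off from loader to text splitter can be sketched as follows. This is a simplified fixed-size character splitter, not one of LangChain's splitter classes; `Document` is a stand-in and `split_document` is a hypothetical helper. The point it illustrates is that each chunk inherits the metadata the loader attached.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_document(doc, chunk_size=40, overlap=10):
    """Hypothetical splitter: overlapping character windows over one Document."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(doc.page_content), step):
        text = doc.page_content[start:start + chunk_size]
        # Copy the loader's metadata and record where the chunk begins.
        chunks.append(Document(text, {**doc.metadata, "start_index": start}))
        if start + chunk_size >= len(doc.page_content):
            break  # the last window already reached the end of the text
    return chunks


# Pretend a PDF loader produced this page-level Document.
doc = Document("x" * 100, {"source": "report.pdf", "page": 3})
chunks = split_document(doc, chunk_size=40, overlap=10)

for chunk in chunks:
    print(chunk.metadata, len(chunk.page_content))
```

The chunks would then go to the embedding model and the vector store, with `source` and `page` still available for citations.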

---

**Simplified Memory Rule**

| Category    | Examples                                  |
| ----------- | ----------------------------------------- |
| Local files | `TextLoader`, `PyPDFLoader`               |
| Directories | `DirectoryLoader`                         |
| Web         | `WebBaseLoader`, `YoutubeLoader`          |
| Databases   | `SQLDatabaseLoader`, `MongodbLoader`      |
| Cloud       | `S3DirectoryLoader`, `GCSDirectoryLoader` |
