```{contents}
```
## API-Based Loader 

### What Is an API-Based Loader?

An **API-based loader** ingests data **directly from an external API** (instead of files or web pages), transforms the response into structured `Document` objects, and feeds them into your LLM / RAG pipeline.

It is used when the **source of truth is a service**, not a file.

---

### Why API-Based Loaders Matter

Most enterprise data lives behind APIs:

* Ticketing systems (Jira, ServiceNow)
* Knowledge bases (Confluence, Notion)
* CRMs (Salesforce)
* Internal microservices
* Public APIs

API loaders let your LLM system pull current data straight from the system of record, so the knowledge base stays **as fresh as your sync schedule allows**.

---

#### Where It Fits in the RAG Pipeline

```
API → Loader → Documents → Chunking → Embeddings → Vector DB → RAG
```



---

### Simple API Loader (Custom Implementation)

#### Example: Loading Data from a REST API

```python
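import requests
from langchain_core.documents import Document  # older releases: langchain.schema

def api_loader(url, timeout=10):
    """Fetch a JSON array from `url` and wrap each item in a Document."""
    response = requests.get(url, timeout=timeout)  # never wait forever on a slow API
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
    data = response.json()

    documents = []
    for item in data:
        # Combine the fields worth searching into one text block
        content = f"Title: {item['title']}\nBody: {item['body']}"
        documents.append(Document(
            page_content=content,
            metadata={"source": url, "id": item["id"]}
        ))
    return documents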
```

---

#### Using the Loader

```python
docs = api_loader("https://jsonplaceholder.typicode.com/posts")
print(docs[0])
```

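Each entry in `docs` is a `Document` with the post text in `page_content` and its origin in `metadata`. It becomes searchable knowledge once it is chunked, embedded, and indexed.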
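
`api_loader` reads a single response, which on a paginated API is only the first page. The sketch below shows one way to collect every page, assuming a 1-based page number and a fixed page size. `fetch_all_pages` and `fetch_page` are hypothetical names, and many real APIs use cursors or `Link` headers instead, so treat this as an illustration rather than a drop-in client.

```python
def fetch_all_pages(fetch_page, page_size=100, max_pages=1000):
    """Collect items from a page-numbered API until a short or empty page appears.

    `fetch_page(page, page_size)` is supplied by the caller and returns a list.
    `max_pages` is a safety cap in case the API never returns a short page.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, page_size)
        items.extend(batch)
        if len(batch) < page_size:
            break  # a short (or empty) page means nothing is left
    return items


# Offline demo: a fake endpoint serving 250 records.
RECORDS = [{"id": i} for i in range(250)]

def fake_fetch_page(page, page_size):
    start = (page - 1) * page_size
    return RECORDS[start:start + page_size]

all_items = fetch_all_pages(fake_fetch_page)
print(len(all_items))  # 250
```

In a real loader, `fetch_page` would wrap a `requests.get` call that passes the page number and size as query parameters.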

---

### Authenticated API Loader

```python
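import os

# Read the key from the environment rather than hard-coding it in source.
# API_KEY is an assumed variable name; use whatever your deployment defines.
headers = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

def secure_api_loader(url):
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # surfaces 401/403 when the token is wrong or expired
    data = response.json()
    return [Document(page_content=str(item), metadata={"source": url}) for item in data]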
```
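
Calling `str(item)` puts Python dict syntax (braces, quotes, `None`) into the text that gets embedded. A possible alternative is to flatten each record into plain `key: value` lines first. The helpers `flatten_record` and `record_to_text` below are hypothetical names for illustration, and the sample record is made up.

```python
def flatten_record(record, prefix=""):
    """Flatten nested JSON into a list of (dotted_key, value) pairs."""
    pairs = []
    if isinstance(record, dict):
        for key, value in record.items():
            pairs.extend(flatten_record(value, f"{prefix}{key}."))
    elif isinstance(record, list):
        for index, value in enumerate(record):
            pairs.extend(flatten_record(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated prefix.
        pairs.append((prefix.rstrip("."), record))
    return pairs


def record_to_text(record):
    """Render a record as one 'key: value' line per leaf, skipping nulls."""
    return "\n".join(
        f"{key}: {value}"
        for key, value in flatten_record(record)
        if value is not None
    )


sample = {
    "id": 7,
    "title": "VPN outage",
    "owner": {"name": "Ana", "team": None},
    "tags": ["net", "p1"],
}
print(record_to_text(sample))
```

The result can be passed as `page_content` in place of `str(item)`, while identifiers such as `id` stay in `metadata`.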

---

### Real-World Example — Jira-Like Loader

```python
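def ticket_loader(api_url, token):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(api_url, headers=headers, timeout=10)
    response.raise_for_status()
    data = response.json()

    # Simplified shape: {"issues": [{"id", "summary", "description", "priority"}, ...]}.
    # The real Jira REST API nests these fields under each issue's "fields" object.
    return [
        Document(
            page_content=f"Ticket: {t['summary']}\nDescription: {t['description']}",
            metadata={"ticket_id": t["id"], "priority": t["priority"], "source": api_url}
        )
        for t in data["issues"]
    ]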
```

---

### Incremental Sync (Important in Production)

```python
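def incremental_loader(url, last_updated):
    # `updated_since` is a placeholder; check your API's docs for its real filter parameter.
    response = requests.get(url, params={"updated_since": last_updated}, timeout=10)
    response.raise_for_status()
    return response.json()  # raw records; wrap them in Document objects before chunking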
```

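Fetching only the records changed since the last sync avoids re-embedding unchanged data, which cuts both API traffic and embedding cost.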
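
For this to work across runs, the timestamp of the last sync has to be stored somewhere. The sketch below keeps it in a small JSON file. The names `load_watermark`, `save_watermark`, and `sync_once` are hypothetical, and the file-based state is an assumption; a database row or key-value store would serve equally well.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def load_watermark(state_file):
    """Return the last successful sync time, or None on the first run."""
    path = Path(state_file)
    if not path.exists():
        return None
    return json.loads(path.read_text())["last_synced"]


def save_watermark(state_file, timestamp):
    Path(state_file).write_text(json.dumps({"last_synced": timestamp}))


def sync_once(fetch_changes, state_file):
    """Fetch records changed since the stored watermark, then advance it.

    `fetch_changes(since)` is supplied by the caller, for example a thin
    wrapper around `incremental_loader`.
    """
    since = load_watermark(state_file)
    # Capture the time before fetching so edits made during the request
    # are picked up by the next run instead of being skipped.
    started_at = datetime.now(timezone.utc).isoformat()
    changed = fetch_changes(since)
    # Advance the watermark only after the fetch succeeded.
    save_watermark(state_file, started_at)
    return changed


# Offline demo with a fake API and a temporary state file.
RECORDS = [
    {"id": 1, "updated_at": "2020-01-01T00:00:00+00:00"},
    {"id": 2, "updated_at": "2021-06-01T00:00:00+00:00"},
]

def fake_fetch(since):
    # ISO-8601 strings in the same UTC format compare correctly as text.
    return [r for r in RECORDS if since is None or r["updated_at"] > since]

state_file = os.path.join(tempfile.mkdtemp(), "sync_state.json")
print(len(sync_once(fake_fetch, state_file)))  # first run: 2 (everything)
print(len(sync_once(fake_fetch, state_file)))  # second run: 0 (nothing changed)
```

Saving the watermark only after a successful fetch means a failed run is simply retried from the same point next time.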

---

### API Loader → RAG Integration

```python
docs = api_loader("https://api.company.com/kb")

# Chunk → Embed → Store in vector DB
```
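
The chunking step can be sketched as follows. To keep the example runnable without extra dependencies it uses a small `Doc` dataclass as a stand-in for LangChain's `Document`; `chunk_documents` is a hypothetical helper, and the character-based window sizes are arbitrary assumptions. Embedding and storage are left out because they depend on the model and vector DB you choose.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for LangChain's Document (same two fields)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_documents(docs, chunk_size=200, overlap=50):
    """Split each document into overlapping character windows.

    Every chunk copies the parent's metadata and adds a `chunk` index,
    so a search hit can be traced back to its source record.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for doc in docs:
        text = doc.page_content
        for index, start in enumerate(range(0, len(text), step)):
            piece = text[start:start + chunk_size]
            chunks.append(Doc(piece, {**doc.metadata, "chunk": index}))
            if start + chunk_size >= len(text):
                break  # this window already reached the end of the text
    return chunks


docs = [Doc("x" * 450, {"source": "https://api.company.com/kb", "id": 1})]
chunks = chunk_documents(docs)
print([len(c.page_content) for c in chunks])  # [200, 200, 150]
print(chunks[-1].metadata)
```

In practice a token-aware or sentence-aware splitter usually produces better chunks than fixed character windows; the point here is only that metadata travels with each chunk.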

---

### Best Practices

| Practice          | Reason              |
| ----------------- | ------------------- |
| Delta sync        | Reduce cost         |
| Preserve metadata | Source traceability |
| Normalize JSON    | Better embeddings   |
| Rate limiting     | Avoid bans          |
| Retry + backoff   | Reliability         |
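
The retry and backoff row can be illustrated with a small wrapper. This is a minimal sketch, not a library API: `TransientError` and `with_retries` are hypothetical names, and it assumes the caller's fetch function raises `TransientError` for retryable failures such as HTTP 429 or 5xx responses.

```python
import random
import time


class TransientError(Exception):
    """Raised by the caller's fetch function for retryable failures."""


def with_retries(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    The wait doubles after each failure (base_delay * 2**attempt) and a
    random jitter of up to half that delay is added so that many clients
    do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Offline demo: a fetch that fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated 503")
    return [{"id": 1}]

result = with_retries(flaky_fetch, sleep=lambda seconds: None)  # skip real waiting
print(result, calls["count"])
```

Spacing retries out this way also eases pressure on rate-limited endpoints. If the API sends a `Retry-After` header, honoring it is generally preferable to a computed delay.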

---

### API Loader vs Web Scraping

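| Aspect         | API Loader            | Web Scraping               |
| -------------- | --------------------- | -------------------------- |
| Data shape     | Structured            | Unstructured               |
| Stability      | Reliable              | Fragile                    |
| Speed          | Fast                  | Slower                     |
| Legal standing | Governed by API terms | Sometimes risky            |
| Role           | Preferred             | Fallback                   |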

---

### Mental Model

```
API Loader = Bridge between live systems and your LLM knowledge base
```

---

### Key Takeaways

* API-based loaders keep knowledge **fresh and authoritative**
* They are the backbone of enterprise RAG, where most data sits behind service APIs
* Production loaders must handle auth, pagination, retries, and delta updates
* Always convert responses into `Document` objects before chunking