```{contents}
```
## Incremental Ingestion


**Incremental ingestion** is the process of **updating only the changed or new data** in your LLM knowledge base instead of reprocessing everything.

It ensures that:

* Knowledge stays fresh
* Embedding costs stay low
* Ingestion is fast and scalable

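Without incremental ingestion, every update means reprocessing the whole corpus, which is **slow, expensive, and risky**: a single failed run can leave the entire index stale or inconsistent.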

---

### 2. Where It Fits in the Pipeline

```
Data Source → Change Detection → Incremental Ingestion → Chunking → Embeddings → Vector DB → RAG
```

---

### 3. Why Incremental Ingestion Is Critical

| Without It        | With It                       |
| ----------------- | ----------------------------- |
| Full re-embedding | Only changed docs re-embedded |
| High cost         | Low cost                      |
| Slow updates      | Near-real-time updates        |
| High downtime     | Continuous availability       |


---

### A. Track Document Versions

```python
documents = {
    "doc1": {"content": "RAG improves QA.", "version": 1},
}
```

---

### B. Incoming Update

```python
incoming = {
    "doc1": {"content": "RAG improves QA by combining retrieval and generation.", "version": 2},
    "doc2": {"content": "Vector DBs store embeddings.", "version": 1}
}
```

---

### C. Detect Changes

```python
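def detect_changes(existing, incoming):
    """Return (new, updated) lists of document IDs found in `incoming`."""
    new = []
    updated = []
    for doc_id, doc in incoming.items():
        if doc_id not in existing:
            # Never seen before: ingest from scratch
            new.append(doc_id)
        elif existing[doc_id]["content"] != doc["content"]:
            # Known document whose text changed: re-embed
            updated.append(doc_id)
    return new, updated

new_docs, updated_docs = detect_changes(documents, incoming)
# With the sample data above: new_docs == ["doc2"], updated_docs == ["doc1"]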
```
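The comparison above looks only at `content`, although each document also carries a `version`. A minimal sketch of a version-based variant follows, assuming versions are integers that only increase at the source. The function name `detect_changes_by_version` and the `stale` bucket are illustrative additions, not part of the walkthrough.

```python
def detect_changes_by_version(existing, incoming):
    """Classify incoming document IDs as new, updated, or stale by version."""
    new, updated, stale = [], [], []
    for doc_id, doc in incoming.items():
        if doc_id not in existing:
            new.append(doc_id)
        elif doc["version"] > existing[doc_id]["version"]:
            updated.append(doc_id)
        elif doc["version"] < existing[doc_id]["version"]:
            # Older than what is already stored (e.g. out-of-order delivery)
            stale.append(doc_id)
        # Equal versions are treated as unchanged and skipped
    return new, updated, stale

# Hypothetical sample data for illustration
existing = {
    "doc1": {"content": "RAG improves QA.", "version": 1},
    "doc3": {"content": "Chunking splits text.", "version": 3},
}
incoming = {
    "doc1": {"content": "RAG improves QA by combining retrieval and generation.", "version": 2},
    "doc2": {"content": "Vector DBs store embeddings.", "version": 1},
    "doc3": {"content": "Chunking splits.", "version": 2},
}

new_ids, updated_ids, stale_ids = detect_changes_by_version(existing, incoming)
```

Version checks are cheap but trust the source to bump the version on every edit; content or checksum comparison catches edits that forgot to do so.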

---

### D. Selective Processing

```python
docs_to_process = new_docs + updated_docs
```

---

### E. Update Knowledge Base

```python
for doc_id in docs_to_process:
    documents[doc_id] = incoming[doc_id]
```

Only these documents go through the rest of the pipeline:

* Cleaning
* Chunking
* Embeddings
* Vector DB update

---

### F. Production-Style Change Detection (Checksum)

```python
import hashlib

def checksum(text):
    # MD5 serves here only as a fast content fingerprint, not for security
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

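Store each document's checksum alongside its ID and compare it with the checksum of the incoming content; a mismatch means the document was edited. This avoids keeping or loading the full previous text just to detect a change. If MD5 is unavailable or disallowed in your environment, `hashlib.sha256` is a drop-in alternative.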
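A minimal sketch of checksum-based detection, assuming checksums are kept in a simple `doc_id -> checksum` mapping. The names `stored_checksums` and `detect_changes_by_checksum` are hypothetical; in practice the mapping would live in a database or metadata store.

```python
import hashlib

def checksum(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def detect_changes_by_checksum(stored_checksums, incoming):
    """Return (new, updated) IDs by comparing stored and incoming checksums."""
    new, updated = [], []
    for doc_id, doc in incoming.items():
        digest = checksum(doc["content"])
        if doc_id not in stored_checksums:
            new.append(doc_id)
        elif stored_checksums[doc_id] != digest:
            updated.append(doc_id)
    return new, updated

# Checksums recorded during a previous ingestion run (illustrative data)
stored_checksums = {
    "doc1": checksum("RAG improves QA."),
    "doc2": checksum("Vector DBs store embeddings."),
}
incoming = {
    "doc1": {"content": "RAG improves QA by combining retrieval and generation."},
    "doc2": {"content": "Vector DBs store embeddings."},
    "doc3": {"content": "Chunking splits documents."},
}

new_ids, updated_ids = detect_changes_by_checksum(stored_checksums, incoming)

# Persist the new fingerprints so the next run sees these docs as unchanged
for doc_id in new_ids + updated_ids:
    stored_checksums[doc_id] = checksum(incoming[doc_id]["content"])
```

Running the detection again on the same `incoming` after the checksums are persisted yields two empty lists, which is the property that makes repeated ingestion runs cheap.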

---

### G. Deletion Handling

```python
deleted = set(documents) - set(incoming)
```

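Remove the embeddings of these documents from the vector DB. Note that this set difference is only valid when `incoming` is a full snapshot of the source; if `incoming` is a partial delta, absent documents are merely unchanged, and deletions must arrive as explicit delete markers.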
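A rough sketch of the removal step, assuming the incoming data is a full snapshot and that the vector store is a dict keyed by `(doc_id, chunk_index)`. Both the store layout and the helper names `find_deleted` and `delete_document` are assumptions for illustration; a real vector DB would expose its own delete-by-ID or delete-by-filter call.

```python
def find_deleted(existing, snapshot):
    # IDs known locally but absent from the full source snapshot
    return set(existing) - set(snapshot)

def delete_document(doc_id, documents, vector_store):
    """Remove a document's record and all of its chunks; return chunks removed."""
    documents.pop(doc_id, None)
    stale_keys = [key for key in vector_store if key[0] == doc_id]
    for key in stale_keys:
        del vector_store[key]
    return len(stale_keys)

# Illustrative state: two documents indexed, three chunks in total
documents = {
    "doc1": {"content": "RAG improves QA.", "version": 1},
    "doc2": {"content": "Vector DBs store embeddings.", "version": 1},
}
vector_store = {
    ("doc1", 0): [0.1, 0.2],
    ("doc1", 1): [0.3, 0.4],
    ("doc2", 0): [0.5, 0.6],
}
snapshot = {"doc1": {"content": "RAG improves QA.", "version": 1}}

deleted = find_deleted(documents, snapshot)
removed_chunks = sum(delete_document(doc_id, documents, vector_store) for doc_id in deleted)
```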

---

### 4. Typical Enterprise Flow

```
API / Files → Delta detection → Selective re-embed → Vector DB update → Live RAG
```

---

### 5. Best Practices

* Track `doc_id`, version, and checksum for every document
* Never re-embed unchanged documents
* Log ingestion history
* Support rollback and audit
* Schedule periodic consistency checks between the source and the vector DB
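Several of these practices can be combined in a small manifest plus an append-only history. The sketch below is one possible shape, not a prescribed design: `manifest`, `history`, `record_ingestion`, and `consistency_check` are hypothetical names, and the set of indexed IDs would normally be queried from the vector DB rather than hard-coded.

```python
import hashlib
from datetime import datetime, timezone

def checksum(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

manifest = {}  # doc_id -> latest {"version", "checksum"}
history = []   # append-only log of every ingestion event, for audit

def record_ingestion(doc_id, doc, action):
    entry = {
        "doc_id": doc_id,
        "version": doc["version"],
        "checksum": checksum(doc["content"]),
        "action": action,  # e.g. "new" or "updated"
        "at": datetime.now(timezone.utc).isoformat(),
    }
    manifest[doc_id] = {"version": entry["version"], "checksum": entry["checksum"]}
    history.append(entry)

def consistency_check(manifest, indexed_doc_ids):
    """Compare tracked documents with what the vector DB actually holds."""
    missing = set(manifest) - set(indexed_doc_ids)   # tracked but not indexed
    orphaned = set(indexed_doc_ids) - set(manifest)  # indexed but not tracked
    return missing, orphaned

record_ingestion("doc1", {"content": "RAG improves QA.", "version": 1}, "new")
record_ingestion("doc2", {"content": "Vector DBs store embeddings.", "version": 1}, "new")

# Pretend the vector DB reports these document IDs
missing, orphaned = consistency_check(manifest, {"doc1", "doc9"})
```

Because `history` keeps every prior version and checksum, it also gives a starting point for rollback: the previous entry for a `doc_id` identifies which version to restore.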

---

### 6. Mental Model

```
Incremental Ingestion = Git pull for your knowledge base
```

---

### Key Takeaways

* Essential for scalable RAG systems
* Saves cost and time
* Keeps knowledge continuously fresh
* Enables near-real-time updates without taking the knowledge base offline

