```{contents}
```

## CDC (Change Data Capture) Style Updates


**CDC (Change Data Capture)** is a pattern in which **every change in the source system** (insert, update, or delete) is captured as an **event**, and the LLM knowledge system reacts to those events in **near real time**.

Instead of polling the source periodically for changes, the ingestion pipeline is **event-driven**: it does work only when something actually changes.

---

### 2. Why CDC Matters for RAG Systems

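CDC enables:

* Near real-time knowledge updates
* Lower embedding cost, because only changed documents are re-embedded
* Ingestion that scales with the change rate rather than with total corpus size
* Eventual consistency between the source and the vector store, with short lag

Without CDC, the knowledge base is only as fresh as the last batch run, so answers can lag behind the source.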

---

### 3. Where CDC Fits

```
Database → CDC Stream → Ingestion Service → Vector DB → RAG
```

---

### A. Source Database Change Event

```json
{
  "event": "UPDATE",
  "doc_id": "kb-101",
  "old_content": "RAG improves QA.",
  "new_content": "RAG improves QA using retrieval and generation.",
  "timestamp": "2025-02-10T10:45:00"
}
```
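
Before acting on an event like the one above, a consumer usually validates it. The following is a minimal sketch, not a fixed schema: the `ChangeEvent` dataclass and `parse_change_event` helper are hypothetical names, and treating a timestamp without an offset as UTC is an assumption that should be checked against the actual source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

VALID_EVENTS = {"INSERT", "UPDATE", "DELETE"}


@dataclass(frozen=True)
class ChangeEvent:
    """Typed view of the JSON change event (hypothetical helper type)."""
    event: str
    doc_id: str
    old_content: Optional[str]
    new_content: Optional[str]
    timestamp: datetime


def parse_change_event(raw: dict) -> ChangeEvent:
    """Validate a raw event dict and convert it to a ChangeEvent."""
    kind = raw["event"]
    if kind not in VALID_EVENTS:
        raise ValueError(f"unknown event type: {kind!r}")
    # Inserts and updates must carry the content to embed.
    if kind != "DELETE" and raw.get("new_content") is None:
        raise ValueError(f"{kind} event requires new_content")
    ts = datetime.fromisoformat(raw["timestamp"])
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ChangeEvent(
        event=kind,
        doc_id=raw["doc_id"],
        old_content=raw.get("old_content"),
        new_content=raw.get("new_content"),
        timestamp=ts,
    )
```

Rejecting malformed events at the boundary keeps the downstream embedding code free of defensive checks.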

---

### B. CDC Consumer Logic

```python
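def process_event(event):
    """Route a change event to the matching vector-store operation."""
    event_type = event["event"]
    if event_type in ("INSERT", "UPDATE"):
        # Inserts and updates both (re-)embed the latest content.
        update_vector_store(event["doc_id"], event["new_content"])
    elif event_type == "DELETE":
        # Deletes carry no new content; remove the document's vectors.
        remove_from_vector_store(event["doc_id"])
    else:
        raise ValueError(f"Unsupported event type: {event_type!r}")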

---

### C. Selective Re-Embedding

```python
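def update_vector_store(doc_id, content):
    cleaned = clean(content)    # normalize the raw text
    chunks = chunk(cleaned)     # split into retrieval-sized pieces
    embeddings = embed(chunks)  # one vector per chunk
    # Drop the document's old vectors first so stale chunks cannot linger
    # when the new version has fewer chunks than the old one.
    remove_from_vector_store(doc_id)
    store(doc_id, embeddings)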
```

Only the changed document is cleaned, chunked, and re-embedded; the rest of the index is left untouched.

---

### D. Handling Deletes

```python
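def remove_from_vector_store(doc_id):
    # Remove every chunk tagged with this doc_id.
    # The exact filter syntax depends on the vector database in use.
    vector_db.delete(filter={"doc_id": doc_id})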
```

---

### E. Example Kafka-Based Flow

```
Postgres → Debezium → Kafka → Ingestion Worker → Vector DB
```
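
In this flow, the ingestion worker receives Debezium's change envelope, not the simplified event shown earlier. The sketch below translates one into the other. It relies on the envelope fields `op`, `before`, `after`, and `ts_ms`, and assumes a hypothetical table with `id` and `content` columns. The `translate_debezium` name is illustrative, and no Kafka client is shown.

```python
import json
from typing import Optional

# Debezium op codes: c = create, r = snapshot read, u = update, d = delete.
OP_TO_EVENT = {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}


def translate_debezium(raw_value: Optional[bytes]) -> Optional[dict]:
    """Map a Debezium message value to the simplified change event.

    Returns None for records the worker should skip.
    """
    if raw_value is None:
        # Tombstone record emitted after a delete (used for log compaction).
        return None
    message = json.loads(raw_value)
    # With schemas enabled the envelope sits under "payload".
    envelope = message.get("payload", message)
    op = envelope.get("op")
    if op not in OP_TO_EVENT:
        return None  # e.g. truncate events; handle separately if needed
    before = envelope.get("before") or {}
    after = envelope.get("after") or {}
    row = before if op == "d" else after
    return {
        "event": OP_TO_EVENT[op],
        "doc_id": row["id"],                  # assumed primary-key column
        "old_content": before.get("content"),  # assumed text column
        "new_content": after.get("content"),
        "ts_ms": envelope.get("ts_ms"),
    }
```

How much of `before` is populated in Postgres depends on the table's replica identity setting, so `old_content` may be missing even for updates.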

---

### F. Idempotency Protection

```python
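# In-memory for illustration only; use a durable store in production
# so the record of processed events survives restarts.
processed_events = set()

def safe_process(event_id, event):
    """Skip events that were already handled (e.g. redelivered by the broker)."""
    if event_id in processed_events:
        return
    process_event(event)
    # Mark as done only after success so a failed event can be retried.
    processed_events.add(event_id)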
```
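
Because an in-memory set is lost on restart, a durable ledger is usually needed. The following is a minimal sketch using SQLite from the standard library; the table name `processed_events` and the helpers `open_ledger` and `process_once` are hypothetical, and a production system might keep this ledger in the same database as the vector metadata instead.

```python
import sqlite3
from typing import Callable


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table that records processed event IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.commit()
    return conn


def process_once(
    conn: sqlite3.Connection,
    event_id: str,
    event: dict,
    handler: Callable[[dict], None],
) -> bool:
    """Run handler(event) unless event_id was already recorded.

    Returns True if the handler ran, False if the event was a duplicate.
    """
    seen = conn.execute(
        "SELECT 1 FROM processed_events WHERE event_id = ?", (event_id,)
    ).fetchone()
    if seen:
        return False
    handler(event)
    # Record the ID only after the handler succeeds, so failures are retried.
    conn.execute(
        "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
    )
    conn.commit()
    return True
```

A crash between the handler call and the insert still causes one reprocessing, so the handler itself should be safe to repeat, for example by replacing vectors keyed on `doc_id` instead of appending them.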

---

### 4. Why CDC Beats Batch Ingestion

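| Aspect         | Batch                           | CDC                           |
| -------------- | ------------------------------- | ----------------------------- |
| Latency        | Bounded by the batch interval   | Near real-time                |
| Embedding cost | Re-embeds unchanged documents   | Embeds only changed documents |
| Freshness      | Stale between runs              | Tracks the source closely     |
| Scope          | Full re-ingestion               | Delta only                    |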

---

### 5. Production Design Principles

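* Use a durable message queue (Kafka, Pulsar)
* Add retries and a dead-letter queue (DLQ) for events that keep failing
* Aim for effectively-once processing: at-least-once delivery combined with idempotent writes
* Preserve event ordering per document (for example, by partitioning on `doc_id`)
* Store ingestion history for auditing and replay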
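
Two of these principles, retries with a DLQ and per-document ordering, can be sketched in a few lines. This is an illustration under stated assumptions: `dead_letters` is a plain list standing in for a real DLQ topic, and the `version` field is a hypothetical monotonically increasing value per document (such as a log sequence number) that the original event format does not include.

```python
import time
from typing import Callable, Dict, List

dead_letters: List[dict] = []      # hypothetical stand-in for a DLQ topic
last_applied: Dict[str, int] = {}  # doc_id -> highest version applied so far


def handle_with_retry(
    event: dict,
    handler: Callable[[dict], None],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> bool:
    """Call handler(event), retrying with exponential backoff.

    After the final failed attempt the event is parked in dead_letters.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                dead_letters.append({"event": event, "error": repr(exc)})
                return False
            time.sleep(base_delay * 2 ** (attempt - 1))
    return False


def apply_in_order(
    event: dict,
    handler: Callable[[dict], None],
    **retry_options,
) -> bool:
    """Apply an event only if it is newer than what was already applied."""
    doc_id, version = event["doc_id"], event["version"]
    if version <= last_applied.get(doc_id, -1):
        return False  # duplicate or out-of-order event: skip it
    if handle_with_retry(event, handler, **retry_options):
        last_applied[doc_id] = version
        return True
    return False
```

The version check makes late or duplicated events harmless, which is what lets at-least-once delivery behave like effectively-once processing.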

---

### 6. Mental Model

```
CDC = Live nervous system of your knowledge base
```

---

### Key Takeaways

* CDC enables near real-time RAG updates
* It avoids expensive full re-ingestion by processing only deltas
* It keeps the vector DB in sync with source systems, with short lag
* It is a strong default for production LLM platforms whose source data changes frequently