```{contents}
```
## Change Data Capture (CDC)

**Change Data Capture (CDC) in LLM workflows** refers to detecting and processing **only the data that has changed**—instead of reprocessing the entire dataset—so your LLM, embeddings, vector store, or RAG system stays up-to-date efficiently.

LLMs themselves don’t *perform* CDC, but CDC is used in **data pipelines feeding LLMs**, especially in **continuous ingestion RAG pipelines**, enterprise systems, and real-time AI applications.

---

### What CDC Means

CDC is a technique that captures **insert**, **update**, and **delete** events in a source system (e.g., a database, files, or logs) and sends only the changed records to downstream systems.
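
To make the three event types concrete, here is a minimal sketch that derives insert, update, and delete events by comparing two snapshots of a table keyed by primary key. `ChangeEvent` and `diff_snapshots` are hypothetical names used only for illustration, and the sample rows are made up. Log-based CDC tools read the database's change log rather than diffing snapshots, but the events they emit carry the same kind of information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str                 # "insert", "update", or "delete"
    key: int                # primary key of the affected row
    before: Optional[str]   # row value before the change (None for inserts)
    after: Optional[str]    # row value after the change (None for deletes)


def diff_snapshots(old: dict, new: dict) -> list:
    """Return the change events that turn snapshot `old` into snapshot `new`."""
    events = []
    for key in sorted(new.keys() - old.keys()):
        events.append(ChangeEvent("insert", key, None, new[key]))
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            events.append(ChangeEvent("update", key, old[key], new[key]))
    for key in sorted(old.keys() - new.keys()):
        events.append(ChangeEvent("delete", key, old[key], None))
    return events


old = {1: "User cannot login", 2: "Payment gateway is failing"}
new = {1: "User cannot login due to reset failure", 3: "Invoice PDF is blank"}

events = diff_snapshots(old, new)
for event in events:
    print(event.op, event.key)
```

Only the three rows that differ produce events; a row that is identical in both snapshots produces nothing.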

---

### Why CDC Matters in LLM / RAG Systems

LLM pipelines often require:

* Rebuilding embeddings
* Updating indexes
* Keeping vector stores synced with source data
* Updating RAG responses based on newest information

Without CDC:

* You would reprocess the entire dataset on every refresh (slow and expensive).
* Unchanged records would be re-embedded again and again.
* Vector stores would accumulate duplicate or orphaned vectors, or serve stale content.

With CDC:

* Only changed records get re-embedded
* Vector stores update incrementally
* RAG answers stay aligned with the latest data
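
The difference can be made concrete by counting embedding calls. The sketch below uses made-up numbers (10,000 documents, 25 changed rows per sync, 24 syncs) purely for illustration, and it assumes the dataset size stays constant; the function names are hypothetical.

```python
def embedding_calls_full(total_docs: int, syncs: int) -> int:
    """Initial load plus a full re-embed of the dataset on every sync."""
    return total_docs * (1 + syncs)


def embedding_calls_cdc(total_docs: int, changed_per_sync: int, syncs: int) -> int:
    """Initial load, then only inserted or updated rows are embedded."""
    return total_docs + changed_per_sync * syncs


total_docs, changed_per_sync, syncs = 10_000, 25, 24
full = embedding_calls_full(total_docs, syncs)
incremental = embedding_calls_cdc(total_docs, changed_per_sync, syncs)
print(full, incremental)  # 250000 10600
```

Under these assumptions the incremental approach makes roughly 4% of the embedding calls; the real ratio depends on how often the source data changes.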

---

### How CDC Integrates with LLM Pipelines

Below is the typical flow:

#### **Source Database → CDC Engine → Event Stream → LLM Pipeline → Vector Store**

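Common CDC tools and change sources:

* Debezium (log-based connectors, typically run on Kafka Connect)
* PostgreSQL WAL (read through logical replication)
* MySQL binlog
* Kafka Connect source connectors
* DynamoDB Streams
* Snowflake Streams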

#### The LLM pipeline consumes:

* New rows → generate embeddings → insert into vector DB
* Updated rows → regenerate embeddings → update record
* Deleted rows → remove embeddings from vector DB

This keeps the knowledge base **fresh** and **consistent** with the source in **near real time**.

---

### **Example: CDC in a Practical RAG Setup**

#### When a new customer ticket is added:

1. CDC captures the new row
2. Sends message to Kafka topic
3. Worker reads event, embeds text
4. Adds to FAISS / Chroma / Pinecone
5. The next RAG query retrieves from the updated index

#### When a ticket is updated:

1. CDC captures UPDATE
2. Worker re-embeds only the changed content
3. Index entry is replaced

#### When a ticket is deleted:

1. CDC captures DELETE
2. Worker removes the vector entry

---

### Benefits of Using CDC for LLM

#### Fresh, real-time knowledge

LLMs often answer questions about:

* Logs
* CRM data
* Tickets
* Policies
* Product catalogs

CDC ensures answers reflect the latest state.

#### Lower cost

Embedding large datasets repeatedly is expensive.
CDC eliminates unnecessary reprocessing.
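
An update event does not always mean the text changed (for example, only a timestamp column was touched). One way to avoid paying for an embedding in that case is to keep a content hash per record and embed only when the hash differs. This is a sketch of that idea, not something a CDC tool provides out of the box; `needs_embedding` and `seen_hashes` are hypothetical names.

```python
import hashlib

seen_hashes = {}  # record id -> SHA-256 of the text that was last embedded


def needs_embedding(record_id: str, text: str) -> bool:
    """Return True (and remember the new hash) only if the text changed."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if seen_hashes.get(record_id) == digest:
        return False
    seen_hashes[record_id] = digest
    return True


first = needs_embedding("1", "User cannot login")
repeat = needs_embedding("1", "User cannot login")
edited = needs_embedding("1", "User cannot login due to reset failure")
print(first, repeat, edited)  # True False True
```

A worker using this check would also drop the record's hash when it processes a delete event, so a later re-insert is embedded again.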

#### Better consistency

Vector store and source DB stay aligned.

#### Event-driven architecture

Fits naturally into streaming ingestion pipelines that feed LLMs.

---

### Example Code Flow (High-Level)

```python
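import hashlib

vector_db = {}  # primary key -> vector; in-memory stand-in for a real vector store

def embed(text):
    # Toy deterministic embedding so the sketch runs offline; use a real model in practice
    return [b / 255 for b in hashlib.sha256(text.encode("utf-8")).digest()[:8]]

def handle_cdc_event(event):
    # event exposes .type, .primary_key, and .data (the row text)
    if event.type in ("insert", "update"):
        vector_db[event.primary_key] = embed(event.data)  # upsert replaces the old vector
    elif event.type == "delete":
        vector_db.pop(event.primary_key, None)  # no-op if already removed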
```

This runs inside a worker subscribed to a CDC stream.

---

### Where CDC + LLM Is Used

* Enterprise knowledge bases
* Customer support copilots
* Real-time analytics assistants
* AI over ERP/CRM systems
* Compliance and audit LLMs
* Automated document intelligence

---

### Demonstration
Below is a demonstration of CDC using **Debezium + Kafka → Embeddings → Vector DB → RAG**, a common pattern for keeping a RAG index in sync with a relational database.

The walkthrough covers:

* Debezium + Kafka setup
* CDC event stream
* Python consumer
* Embedding + vector DB updates
* RAG answering
* Live incremental knowledge refresh

---

#### Architecture (Debezium → Kafka → Python → RAG Pipeline)

```
        ┌──────────┐
        │ Postgres │
        │  Table   │
        └─────┬────┘
              │  CDC (WAL)
              ▼
       ┌─────────────┐
       │  Debezium   │
       │ PostgreSQL  │
       └──────┬──────┘
              │ Change Events
              ▼
       ┌─────────────┐
       │   Kafka     │
       │   Topic     │
       └──────┬──────┘
              │ JSON CDC Payload
              ▼
      ┌────────────────┐
      │ Python Worker  │
      │ (Kafka Consumer)
      └──────┬─────────┘
             │
             ├── Insert → embed → upsert vector DB
             ├── Update → re-embed → update vector DB
             └── Delete → remove vector from index
             ▼
       ┌──────────────┐
       │ Vector Store │
       └──────┬───────┘
              ▼
         ┌──────────┐
         │   RAG    │
         └──────────┘
```

This keeps your vector database **synchronized** with Postgres in near real time.

---

#### Debezium + Kafka Setup (docker-compose)

Create a file named **docker-compose.yml**:

```yaml
version: '3.7'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

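  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"   # lets the Python worker on the host reach the broker
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: kafka:29092 for other containers, localhost:9092 for the host
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  postgres:
    image: debezium/example-postgres:2.5
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: mydb
    ports:
      - "5432:5432"

  debezium:
    image: debezium/connect:2.5
    depends_on:
      - kafka
      - postgres
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:29092   # internal listener
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: debezium_config
      OFFSET_STORAGE_TOPIC: debezium_offsets
      STATUS_STORAGE_TOPIC: debezium_status
```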

Start it:

```bash
docker-compose up -d
```

---

#### Register Debezium Postgres Connector

Run:

```bash
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
  "name": "pg-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "mydb",
    "topic.prefix": "pg",
    "plugin.name": "pgoutput",
    "slot.name": "debezium",
    "table.include.list": "public.tickets"
  }
}'
```
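
Before producing changes, it can help to confirm that the connector actually started. The sketch below reads the Kafka Connect REST endpoint `GET /connectors/<name>/status` and checks that the connector and its tasks report `RUNNING`. The helper names `fetch_status` and `is_running` are hypothetical, and the sample response is an illustrative shape rather than captured output.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str, name: str) -> dict:
    """GET /connectors/<name>/status from the Kafka Connect REST API."""
    with urlopen(f"{base_url}/connectors/{name}/status", timeout=10) as resp:
        return json.load(resp)


def is_running(status: dict) -> bool:
    """True when the connector and all of its tasks report RUNNING."""
    tasks = status.get("tasks", [])
    return (
        status.get("connector", {}).get("state") == "RUNNING"
        and len(tasks) > 0
        and all(task.get("state") == "RUNNING" for task in tasks)
    )


# Illustrative response shape (not captured from a real run)
sample = {
    "name": "pg-connector",
    "connector": {"state": "RUNNING", "worker_id": "connect:8083"},
    "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "connect:8083"}],
}
print(is_running(sample))

# Against the running stack:
# print(is_running(fetch_status("http://localhost:8083", "pg-connector")))
```

If a task reports `FAILED`, the same status response normally includes a `trace` field with the error.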

Change events for the table are published to the Kafka topic named `<prefix>.<schema>.<table>`:

```
pg.public.tickets
```
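
A trimmed sketch of what the value of an insert event on this topic might look like, assuming the JSON converter with schemas enabled (which wraps the event in `schema` and `payload`). The `schema` and `source` blocks are shortened and the timestamps are made up; the `op` codes shown are the ones the worker later in this section handles.

```python
import json

raw = """
{
  "schema": {"type": "struct", "name": "pg.public.tickets.Envelope"},
  "payload": {
    "before": null,
    "after": {"id": 1, "text": "User cannot login", "updated_at": 1700000000000000},
    "source": {"connector": "postgresql", "db": "mydb", "schema": "public", "table": "tickets"},
    "op": "c",
    "ts_ms": 1700000000000
  }
}
"""

# Debezium operation codes used in this walkthrough
OPS = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}

payload = json.loads(raw)["payload"]
# Deletes carry the row in "before"; every other op carries it in "after"
row = payload["after"] if payload["after"] is not None else payload["before"]
print(OPS[payload["op"]], row["id"], row["text"])  # create 1 User cannot login
```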

---

#### Prepare the Postgres Table

```sql
CREATE TABLE tickets (
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    updated_at TIMESTAMP DEFAULT now()
);

INSERT INTO tickets (text) VALUES ('User cannot login');
```

Run this against the `mydb` database. From this point on, every INSERT, UPDATE, or DELETE on `tickets` produces a change event on the topic.

---

#### Python Kafka Consumer + Embedding + Vector DB

Install the required libraries (the OpenAI classes also expect an `OPENAI_API_KEY` environment variable):

```bash
pip install kafka-python langchain langchain-openai langchain-community faiss-cpu
```

---

#### Python Worker: Consume CDC + Update VectorDB

```python
import json

from kafka import KafkaConsumer
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize embeddings + LLM (both read OPENAI_API_KEY from the environment)
emb = OpenAIEmbeddings(model="text-embedding-3-large")
llm = ChatOpenAI(model="gpt-4.1-mini")

# from_texts needs at least one text, so the index is seeded with a placeholder
vectordb = FAISS.from_texts(["initial"], embedding=emb)

retriever = vectordb.as_retriever(search_kwargs={"k": 3})
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Kafka CDC topic. Debezium follows each delete with a tombstone (null value),
# so the deserializer has to tolerate None.
consumer = KafkaConsumer(
    "pg.public.tickets",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m is not None else None,
)

def remove_if_present(doc_id):
    # FAISS.delete raises if the id is unknown, so check the id map first
    if doc_id in vectordb.index_to_docstore_id.values():
        vectordb.delete([doc_id])

def handle_cdc(payload):
    op = payload["op"]

    if op in ("c", "r", "u"):   # create, snapshot read, or update
        doc_id = str(payload["after"]["id"])
        text = payload["after"]["text"]
        remove_if_present(doc_id)   # keeps inserts and replays idempotent
        vectordb.add_texts([text], ids=[doc_id])
        print(f"[{'UPDATE' if op == 'u' else 'INSERT'}] id={doc_id}")

    elif op == "d":  # delete
        doc_id = str(payload["before"]["id"])
        remove_if_present(doc_id)
        print(f"[DELETE] id={doc_id}")

    else:
        print("Unknown event:", op)


print("\nListening to CDC events...")
for msg in consumer:
    if msg.value is None:   # tombstone emitted after a delete
        continue
    handle_cdc(msg.value["payload"])
```

---

#### Test RAG Answering After CDC Update

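The FAISS index lives in the worker's memory, so the query has to run in the same Python process as the consumer (for example from a second thread, or between polls of the consumer). A separate process would need a shared or persisted vector store instead: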

```python
query = "Why is the user unable to log in?"
result = qa.invoke({"query": query})
print(result["result"])
```

As rows change in Postgres and the worker applies the events, subsequent answers reflect the **updated index**.

---

#### Example CDC in Action

##### 1. Insert a new ticket:

```sql
INSERT INTO tickets (text) VALUES ('Payment gateway is failing');
```

Terminal output:

```
[INSERT] id=2
```

##### 2. Update ticket:

```sql
UPDATE tickets SET text='User cannot login due to reset failure' WHERE id=1;
```

Terminal output:

```
[UPDATE] id=1
```

##### 3. Delete ticket:

```sql
DELETE FROM tickets WHERE id=2;
```

Terminal output:

```
[DELETE] id=2
```

Once the worker has applied these events, retrieval reflects the **current contents of the source table**.

---
