```{contents}
```
## Document Versioning 

**Document versioning** is the practice of **tracking changes to documents over time** so that your LLM system always knows:

* Which version is current
* Which version produced a given answer
* How knowledge has evolved
* When to re-embed or invalidate old data

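Without versioning, a RAG system gradually becomes **inconsistent and untrustworthy**: stale and current chunks are retrieved side by side, and nothing records which version an answer was based on.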

---

### Where Document Versioning Fits

```
Data Ingestion → Cleaning → Metadata Enrichment → Versioning → Chunking → Embeddings → Vector DB → RAG
```

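Versioning sits before chunking and embedding, so every chunk and vector produced downstream carries the version of the document it came from. This is what keeps **knowledge correct over time**.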

---

### Why Versioning Is Critical

| Problem                        | How Versioning Solves It         |
| ------------------------------ | -------------------------------- |
| Wrong answers from old docs    | Track the latest version         |
| No audit trail                 | Keep historical versions         |
| Stale embeddings after edits   | Re-embed only changed docs       |
| Compliance failures            | Traceable knowledge              |


---

### Initial Document (v1)

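```python
# Recent LangChain releases expose Document from langchain_core;
# older releases used `from langchain.schema import Document`.
from langchain_core.documents import Document

doc_v1 = Document(
    page_content="RAG improves accuracy using document retrieval.",
    metadata={
        "doc_id": "kb-101",       # stable identity across versions
        "version": 1,
        "status": "active",
        "updated_at": "2025-01-01"
    }
)
```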

Store and embed this version.

---

### Updated Document (v2)

```python
doc_v2 = Document(
    page_content="RAG improves accuracy by combining retrieval with generation.",
    metadata={
        "doc_id": "kb-101",
        "version": 2,
        "status": "active",
        "updated_at": "2025-02-10"
    }
)
```

---

### Mark Old Version as Deprecated

```python
doc_v1.metadata["status"] = "deprecated"
```
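Changing `doc_v1.metadata` in memory does not by itself update anything already written to a store, so the status change has to be applied to the stored records as well. A minimal sketch of that step, assuming the store is a plain list of metadata dictionaries (the helper `publish_new_version` is hypothetical and not part of any library):

```python
def publish_new_version(store, new_record):
    """Deprecate active records with the same doc_id, then add the new one as active."""
    for record in store:
        same_doc = record["doc_id"] == new_record["doc_id"]
        if same_doc and record["status"] == "active":
            record["status"] = "deprecated"
    # The new record is always stored as the active version.
    store.append({**new_record, "status": "active"})


# Stand-in for stored metadata; a real vector store would hold this per chunk.
store = [{"doc_id": "kb-101", "version": 1, "status": "active"}]
publish_new_version(store, {"doc_id": "kb-101", "version": 2})

statuses = {record["version"]: record["status"] for record in store}
```

With a real vector store, the equivalent step is updating or rewriting the metadata of the old version's chunks. How that is done depends on the store.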

---

### Selective Re-Embedding Logic

```python
def needs_reembedding(old_doc, new_doc):
    return old_doc.page_content != new_doc.page_content
```

Only changed documents get re-embedded.
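Comparing full text requires keeping the old content around. One alternative is to store a checksum with each version and compare hashes instead. The sketch below assumes SHA-256 over whitespace-normalized text; the function names and the normalization rule are illustrative choices, not a fixed convention.

```python
import hashlib


def content_checksum(text: str) -> str:
    """Return a SHA-256 hex digest of the text with whitespace runs collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reembedding_by_checksum(stored_checksum: str, new_text: str) -> bool:
    """True when the new text hashes differently from the stored version."""
    return stored_checksum != content_checksum(new_text)


old_checksum = content_checksum("RAG improves accuracy using document retrieval.")

# A real content change is detected.
changed = needs_reembedding_by_checksum(
    old_checksum, "RAG improves accuracy by combining retrieval with generation."
)

# A whitespace-only change is ignored because of the normalization step.
whitespace_only = needs_reembedding_by_checksum(
    old_checksum, "RAG improves accuracy  using document retrieval.\n"
)
```

Normalizing before hashing avoids re-embedding when only formatting changed. Drop the normalization if whitespace is meaningful in your documents.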

---

### Vector Store Storage

Each chunk inherits the version metadata of its parent document:

```text
doc_id = kb-101
version = 2
status = active
```
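As a rough illustration of that inheritance, the sketch below splits text into fixed-size character chunks and copies the parent metadata onto each one. The helper `chunk_with_metadata` and the `chunk_index` field are assumptions made for this example. A production pipeline would more likely use a proper text splitter.

```python
def chunk_with_metadata(text, metadata, chunk_size=40):
    """Split text into fixed-size chunks, each carrying a copy of the parent metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunks.append(
            {
                "text": text[start:start + chunk_size],
                # Copy rather than share, so later edits to one chunk stay local.
                "metadata": {**metadata, "chunk_index": index},
            }
        )
    return chunks


parent_text = "RAG improves accuracy by combining retrieval with generation."
parent_metadata = {"doc_id": "kb-101", "version": 2, "status": "active"}

chunks = chunk_with_metadata(parent_text, parent_metadata)
```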

---

### Version-Aware Retrieval

```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "filter": {"doc_id": "kb-101", "status": "active"}
    }
)
```

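This restricts retrieval for `kb-101` to chunks marked `active`, so deprecated versions are not returned, provided the old chunks were actually re-tagged in the store. The exact `filter` syntax depends on the vector store backend.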
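A status filter alone assumes exactly one active version per document. As a defensive sketch, assuming retrieved chunk metadata is available as plain dictionaries (the helper `latest_active` is hypothetical), results can be post-filtered to the highest active version per `doc_id`:

```python
def latest_active(records):
    """Keep only the highest-versioned active record for each doc_id."""
    latest = {}
    for record in records:
        if record["status"] != "active":
            continue
        current = latest.get(record["doc_id"])
        if current is None or record["version"] > current["version"]:
            latest[record["doc_id"]] = record
    return list(latest.values())


retrieved = [
    {"doc_id": "kb-101", "version": 1, "status": "deprecated"},
    {"doc_id": "kb-101", "version": 2, "status": "active"},
    # Simulates a bug where an older version was never deprecated.
    {"doc_id": "kb-102", "version": 1, "status": "active"},
    {"doc_id": "kb-102", "version": 3, "status": "active"},
]

selected = latest_active(retrieved)
selected_versions = {record["doc_id"]: record["version"] for record in selected}
```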

---

### Audit Trail Example

```text
Question: What is RAG?
Answer: ...
Source: kb-101 (version 2, updated 2025-02-10)
```
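A source line like this can be built directly from the metadata of the retrieved chunks. The sketch below assumes each chunk's metadata is a dictionary with `doc_id`, `version`, and `updated_at`. The helper `format_sources` is illustrative only.

```python
def format_sources(chunk_metadata):
    """Build one source line per distinct (doc_id, version) pair, in first-seen order."""
    seen = set()
    lines = []
    for meta in chunk_metadata:
        key = (meta["doc_id"], meta["version"])
        if key in seen:
            continue  # several chunks from the same version yield a single line
        seen.add(key)
        lines.append(
            f"Source: {meta['doc_id']} "
            f"(version {meta['version']}, updated {meta['updated_at']})"
        )
    return lines


retrieved_metadata = [
    {"doc_id": "kb-101", "version": 2, "updated_at": "2025-02-10"},
    {"doc_id": "kb-101", "version": 2, "updated_at": "2025-02-10"},
]

source_lines = format_sources(retrieved_metadata)
```

Logging these lines next to each question and answer is what makes it possible to tell later which version produced a given response.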

---

### Production Versioning Schema (Recommended)

| Field      | Purpose                |
| ---------- | ---------------------- |
| doc_id     | Stable identity        |
| version    | Incremental version    |
| status     | active / deprecated    |
| checksum   | Detect content changes |
| updated_at | Freshness              |
| source     | Traceability           |
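One way to keep these fields consistent is to define them once as a small record type. The sketch below uses a standard-library dataclass. The class name `VersionRecord`, the helper `make_record`, and the `source` value `"internal-wiki"` are hypothetical placeholders.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VersionRecord:
    doc_id: str       # stable identity across versions
    version: int      # incremented on every content change
    status: str       # "active" or "deprecated"
    checksum: str     # hash of the content, used to detect changes
    updated_at: str   # ISO date string, used for freshness
    source: str       # where the document came from


def make_record(doc_id, version, text, updated_at, source, status="active"):
    """Create a record, deriving the checksum from the document text."""
    checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return VersionRecord(doc_id, version, status, checksum, updated_at, source)


record = make_record(
    doc_id="kb-101",
    version=2,
    text="RAG improves accuracy by combining retrieval with generation.",
    updated_at="2025-02-10",
    source="internal-wiki",  # hypothetical source label
)

# Plain dict form, suitable for attaching to a document or chunk as metadata.
metadata = asdict(record)
```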

---

### Common Versioning Mistakes

* Overwriting documents in place
* Losing old embeddings
* No version metadata in chunks
* Re-embedding the entire corpus when only a few documents changed
* No retrieval filters

---

### Mental Model

```
Document Versioning = Git for your knowledge base
```

---

### Key Takeaways

* Versioning is mandatory for production RAG
* Enables correctness, auditing, and rollback
* Prevents old knowledge from polluting answers
* Reduces cost by re-embedding only the documents that changed