```{contents}
```
## Document Stores 

### 1. Definition

A **Document Store** is a specialized data system designed to store, index, retrieve, and serve large collections of unstructured or semi-structured documents for **Generative AI** applications, especially **Retrieval-Augmented Generation (RAG)**.

**Core purpose:**

> Provide relevant context to language models by retrieving the most useful documents for a given query.

---

### 2. Why Document Stores Matter in Generative AI

Large language models:

* Do **not** have access to your private or dynamic data
* Have a **fixed knowledge cutoff**
* Cannot fit large external corpora into a limited context window

Document stores address these limits by enabling:

| Capability                      | Benefit                          |
| ------------------------------- | -------------------------------- |
| Fast semantic search            | Finds meaning, not just keywords |
| Long-term knowledge persistence | Model answers from your data     |
| Dynamic updates                 | New data without retraining      |
| Controlled grounding            | Reduces hallucinations           |

---

### 3. Position in the RAG Architecture

```
User Query
   ↓
Embedding Model
   ↓
Vector Search in Document Store
   ↓
Top-k Relevant Documents
   ↓
Prompt Construction
   ↓
LLM Generation
```
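The prompt-construction step in the flow above can be sketched as follows. This is a minimal illustration, not a prescribed template: the function name `build_prompt`, the instruction wording, and the `[n]` source labels are all assumptions made for the example.

```python
def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble a grounded prompt from retrieved documents (hypothetical template)."""
    # Number each document so the model can refer back to its sources.
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How do transformers work?",
    ["Transformers use self-attention", "CNNs use convolutions"],
)
print(prompt)
```

The resulting string is what would be sent to the LLM in the final step; the retrieved documents appear verbatim so the answer can be checked against them.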

---

### 4. Types of Document Stores

| Type               | Purpose                       | Examples                  |
| ------------------ | ----------------------------- | ------------------------- |
| **Vector Store**   | Semantic similarity search    | FAISS, Pinecone, Weaviate |
| **Keyword Store**  | Exact / boolean search        | Elasticsearch, OpenSearch |
| **Hybrid Store**   | Combine both                  | Vespa, OpenSearch Hybrid  |
| **Metadata Store** | Filters, permissions, routing | SQL, MongoDB              |

Production systems often combine all three: **vector search + keyword search + metadata filtering**.
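Combining keyword and vector results means merging two ranked lists into one. The text does not name a fusion method; the sketch below uses reciprocal rank fusion (RRF) as one common choice. The document IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document ranked highly in any list gets a larger contribution.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic-search results
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical keyword-search results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF only needs ranks, not raw scores, which is convenient because keyword scores and vector distances are not on comparable scales.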

---

### 5. Internal Data Model

Each stored document (in practice, usually one chunk of a larger file) can be represented as a record like this:

```json
{
  "id": "doc_001",
  "text": "Transformers use self-attention...",
  "embedding": [0.012, -0.33, ...],
  "metadata": {
      "source": "ml_book.pdf",
      "page": 42,
      "date": "2024-01-01"
  }
}
```
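In application code, the same record could be modeled as a small typed structure. The sketch below assumes a Python `dataclass` named `Document` (a hypothetical name) and uses a two-element placeholder vector purely for illustration; real embeddings have hundreds of dimensions.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Document:
    id: str
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


doc = Document(
    id="doc_001",
    text="Transformers use self-attention...",
    embedding=[0.012, -0.33],  # truncated placeholder, not real model output
    metadata={"source": "ml_book.pdf", "page": 42, "date": "2024-01-01"},
)

# Round-trip through JSON, as would happen when persisting or sending the record.
record = json.dumps(asdict(doc))
restored = Document(**json.loads(record))
```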

---

### 6. Core Operations

| Operation    | Description                                 |
| ------------ | ------------------------------------------- |
| **Ingest**   | Chunk documents, generate embeddings        |
| **Index**    | Organize vectors for fast similarity search |
| **Retrieve** | Find top-k relevant chunks                  |
| **Filter**   | Apply metadata constraints                  |
| **Update**   | Insert, delete, refresh documents           |
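The operations in the table can be illustrated with a toy in-memory store. Everything here is an assumption made for the sketch: the class name `InMemoryStore`, its method names, the file name `cv_notes.pdf`, and above all the bag-of-words `embed` function, which stands in for a real embedding model. There is no real index either; retrieval is a linear scan.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryStore:
    def __init__(self) -> None:
        self.docs: dict[str, dict] = {}

    def upsert(self, doc_id: str, text: str, metadata: dict) -> None:
        # Ingest / update: embed the text and (re)store it under its ID.
        self.docs[doc_id] = {"text": text, "vec": embed(text), "metadata": metadata}

    def delete(self, doc_id: str) -> None:
        self.docs.pop(doc_id, None)

    def retrieve(self, query: str, k: int = 3, where: Optional[dict] = None) -> list[str]:
        q = embed(query)
        # Filter: keep only documents whose metadata matches every constraint.
        candidates = [
            (doc_id, d) for doc_id, d in self.docs.items()
            if all(d["metadata"].get(key) == val for key, val in (where or {}).items())
        ]
        # Retrieve: rank the candidates by similarity and return the top-k IDs.
        ranked = sorted(candidates, key=lambda item: cosine(q, item[1]["vec"]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]


store = InMemoryStore()
store.upsert("d1", "transformers use self-attention", {"source": "ml_book.pdf"})
store.upsert("d2", "cnns use convolutions", {"source": "cv_notes.pdf"})
top = store.retrieve("how do transformers work", k=1)
filtered = store.retrieve("how do transformers work", k=1, where={"source": "cv_notes.pdf"})
```

Here `top` contains `d1`, while `filtered` contains `d2` because the metadata constraint removes `d1` before ranking.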

---

### 7. Document Ingestion Workflow

```
Raw Files → Text Extraction → Chunking → Embedding → Store
```

**Chunking guidelines:**

* 200–1000 tokens per chunk
* Overlap: 10–20%
* Preserve semantic boundaries (paragraphs, sections)
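The size and overlap guidelines can be sketched as a sliding window over tokens. This is a simplified illustration: the function name `chunk_tokens` is hypothetical, the synthetic tokens stand in for real tokenizer output, and the sketch does not attempt to preserve paragraph or section boundaries.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 200, overlap: int = 30) -> list[list[str]]:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# Synthetic tokens stand in for the output of a real tokenizer.
tokens = [f"tok{i}" for i in range(500)]
chunks = chunk_tokens(tokens, chunk_size=200, overlap=30)  # 30 / 200 = 15% overlap
print([len(c) for c in chunks])  # [200, 200, 160]
```

A boundary-aware splitter would first cut on paragraphs or headings and only fall back to a fixed window when a section exceeds the size limit.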

---

### 8. Retrieval Workflow

```python
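from sentence_transformers import SentenceTransformer
import faiss

# Load a small general-purpose sentence embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the corpus; encode() returns a float32 NumPy array of shape (n_docs, dim).
docs = ["Transformers use self-attention", "CNNs use convolutions"]
embeddings = model.encode(docs)

# Build an exact (brute-force) L2 index sized to the embedding dimension.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Embed the query with the same model so it lives in the same vector space.
query = "How do transformers work?"
q_emb = model.encode([query])

# D holds squared L2 distances; I holds the row indices of the nearest documents.
D, I = index.search(q_emb, 1)
print(docs[I[0][0]])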
```

---

### 9. How Document Stores Improve Generation

| Without Document Store                 | With Document Store                   |
| -------------------------------------- | ------------------------------------- |
| Higher risk of hallucination           | Answers grounded in retrieved text    |
| Knowledge frozen at the training cutoff | Knowledge updated as documents change |
| Memory limited to the context window   | External long-term memory             |
| Generic output                         | Context-specific output               |

---

### 10. Evaluation Metrics

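| Metric       | Meaning                                                         |
| ------------ | --------------------------------------------------------------- |
| Recall@k     | What fraction of the relevant docs appear in the top k?         |
| MRR          | Mean reciprocal rank: how high does the first relevant doc rank? |
| Faithfulness | Is the answer supported by the retrieved context?               |
| Latency      | Time to answer a query                                          |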
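The two retrieval metrics are simple to compute once each query has a set of known relevant document IDs. The sketch below uses made-up IDs and relevance labels; the function names are illustrative, not taken from a specific library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant document per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


# Each entry: (ranked retrieval results, set of truly relevant IDs), all hypothetical.
queries = [
    (["d1", "d7", "d3"], {"d1", "d3"}),
    (["d9", "d2", "d5"], {"d2"}),
]
r = recall_at_k(queries[0][0], queries[0][1], k=2)  # 1 of 2 relevant docs in top 2 -> 0.5
mrr = mean_reciprocal_rank(queries)                 # (1/1 + 1/2) / 2 -> 0.75
```

Faithfulness and latency are not covered here: the former usually needs human or model-based judgment of the generated answer, and the latter is measured on the running system.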

---

### 11. Common Design Patterns

| Pattern                 | Use Case               |
| ----------------------- | ---------------------- |
| **RAG**                 | QA, chatbots, copilots |
| **Agent Memory Store**  | Tool-using agents      |
| **Knowledge Base**      | Enterprise search      |
| **Conversation Memory** | Long-term dialogue     |

---

### 12. Summary

A **Document Store** is the **external memory and knowledge substrate** of Generative AI systems.
It lets models work with real, private, and continuously evolving information, which improves grounding and reduces (but does not eliminate) hallucination.
