```{contents}
```

## Latency vs Consistency

**Latency** = how fast a document moves through ingestion → embedding → indexing → becomes retrievable.

**Consistency** = how correct, complete, and synchronized the indexed data is across:

* chunks
* embeddings
* metadata
* dedup/versioning
* vector store shards
* caches

**Trade-off:**
To ingest fast (low latency), we often accept temporary inconsistencies.
To guarantee perfect consistency, ingestion slows down.

In Generative-AI systems, both cannot be maximized simultaneously.

---

### 2. Why the trade-off exists

GenAI ingestion pipelines involve multiple asynchronous steps:

```
Extract → Clean → Chunk → Dedup → Embed → Store → Refresh Index
```

Each step introduces a delay or inconsistency risk:

* batching
* parallel writes
* eventual consistency of vector DB
* metadata updates
* retries
* re-embedding due to model upgrades

Optimizing for one side hurts the other.

---

### 3. Examples of Latency-Optimized vs Consistency-Optimized Behavior

#### A. Latency-Optimized (Faster Ingestion)

Characteristics:

* Push documents quickly into vector DB.
* Skip heavy quality checks.
* Allow stale embeddings temporarily.
* Accept partial indexing during failure recovery.
* Rely on **eventual consistency** across shards.

Outcome:

* New documents appear in search faster.
* Retrieval may briefly return inconsistent or duplicated chunks.

Used for:

* customer support search
* troubleshooting documentation
* systems where freshness matters more than completeness

---

#### B. Consistency-Optimized (Stronger Guarantees)

Characteristics:

* Strict dedup before indexing
* Validate all embeddings
* Atomic upserts (all chunks succeed or none)
* Index versioning
* Two-phase commit style writes (staging → publish)

Outcome:

* Retrieval is stable, correct, reproducible
* Ingestion pipeline is slower (more blocking)

Used for:

* legal documents
* policy/finance domains
* regulated industries
* LLM fine-tuning datasets

---

### 4. Where the Trade-Off Appears in Each Ingestion Stage

#### 4.1 Parsing → Cleaning

**Low latency:**

* Use fast parsing; skip heavy noise-filtering.

**High consistency:**

* Slow exhaustive parsing, OCR fallback, metadata verification.

---

#### 4.2 Chunking

**Low latency:**

* Chunk with simple size-based splits.

**High consistency:**

* Use semantic chunking, language detection, quality scoring.
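
As a rough illustration of the two chunking styles, the sketch below compares a fixed-size split with a sentence-aware split. Both helper names (`split_fixed`, `split_sentences`) are hypothetical, and the regex-based sentence packing is only a cheap stand-in for real semantic chunking, not an implementation of it.

```python
import re
from typing import List

def split_fixed(text: str, max_chars: int = 40) -> List[str]:
    """Low-latency path: cut every max_chars characters, ignoring sentence boundaries."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def split_sentences(text: str, max_chars: int = 40) -> List[str]:
    """Slower path: pack whole sentences into chunks of at most max_chars.
    A single sentence longer than max_chars is kept whole rather than cut."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)      # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "Unpack the device. Connect the cable. Run setup. Restart when prompted."
fixed_chunks = split_fixed(text)          # may cut a sentence in half
sentence_chunks = split_sentences(text)   # every chunk ends on a sentence boundary
```

The fixed split does one pass with no parsing, which is why it suits the latency side; the sentence-aware split does extra work per document but avoids chunks that start or end mid-sentence.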

---

#### 4.3 Deduplication

**Low latency:**

* Only exact dedup via hashing.

**High consistency:**

* Near-duplicate checks (embeddings), LSH, similarity thresholds.

Near-dedup increases latency significantly.
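
The cost difference can be seen in a small sketch. The exact check is one hash lookup per chunk, while a near-duplicate check has to compare content. Here word-shingle Jaccard similarity is used as an assumed, simple similarity measure in place of embeddings or LSH; the function names and the 0.8 threshold are illustrative only.

```python
import hashlib
from typing import Set, Tuple

def exact_key(text: str) -> str:
    """Exact dedup key: hash of whitespace-normalized text."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Set of k-word shingles over the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shingle overlap between two texts, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

a = "Unpack the device, connect the cable, and run the setup wizard."
b = "Please unpack the device, connect the cable, and run the setup wizard."

exact_dup = exact_key(a) == exact_key(b)   # hashes differ, so exact dedup misses it
near_dup = jaccard(a, b) >= 0.8            # illustrative threshold
```

Comparing every new chunk against every stored chunk this way grows with corpus size, which is where LSH or an approximate nearest-neighbor index comes in.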

---

#### 4.4 Embedding Generation

**Low latency:**

* Smaller batches with parallel GPU fan-out; faster, but more expensive.

**High consistency:**

* Wait for accumulated batch, align embedding versions, enforce deterministic settings.
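
One way to express the "wait for an accumulated batch" behavior is a micro-batcher that flushes on size or on age. The sketch below is an assumption about how that could look: `MicroBatcher`, `max_size`, and `max_wait_s` are hypothetical names, and timestamps are passed in explicitly so the example stays deterministic.

```python
from typing import List, Optional

class MicroBatcher:
    """Collect items and release them when the batch is full or its oldest item is too old."""

    def __init__(self, max_size: int, max_wait_s: float):
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.pending: List[str] = []
        self.oldest_ts: Optional[float] = None

    def add(self, item: str, now: float) -> Optional[List[str]]:
        """Add an item; return a batch to embed if a flush condition is met, else None."""
        if not self.pending:
            self.oldest_ts = now
        self.pending.append(item)
        full = len(self.pending) >= self.max_size
        expired = now - self.oldest_ts >= self.max_wait_s
        if full or expired:
            batch, self.pending = self.pending, []
            self.oldest_ts = None
            return batch
        return None

batcher = MicroBatcher(max_size=3, max_wait_s=1.0)
r1 = batcher.add("a", now=0.0)   # waits
r2 = batcher.add("b", now=0.2)   # waits
r3 = batcher.add("c", now=0.4)   # flushes: batch is full
r4 = batcher.add("d", now=1.0)   # waits
r5 = batcher.add("e", now=2.5)   # flushes: oldest item waited 1.5 s
```

A small `max_size` and `max_wait_s` push toward the low-latency side; larger values trade waiting time for fuller batches. This sketch only checks age when a new item arrives, so a real pipeline would also need a timer to flush an idle batch.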

---

#### 4.5 Vector DB Indexing

**Low latency:**

* Write quickly, accept temporary inconsistencies across shards.
* Use eventual consistency: some replicas may be stale for seconds/minutes.

**High consistency:**

* Strict write barriers.
* Use transaction logs.
* Disable search until index rebuild completes.

---

#### 4.6 Re-embedding / Model Upgrades

**Low latency:**

* Lazy re-embedding (only re-embed on retrieval).
* On-read fallback to older embeddings.

**High consistency:**

* Full offline re-embed and re-index before switching version.
* Synchronous upgrade.
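
A minimal sketch of the lazy path, assuming each stored chunk carries its embedding version alongside the original text. `embed_v3`, `CURRENT_VERSION`, and the store layout are hypothetical stand-ins, not a real embedding API.

```python
from typing import Dict, List, Tuple

CURRENT_VERSION = "v3"  # hypothetical target model version

def embed_v3(text: str) -> List[float]:
    """Stand-in for the new model; a real system would call an embedding service."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

# chunk_id -> (embedding_version, vector, source_text)
store: Dict[str, Tuple[str, List[float], str]] = {
    "doc1#c0": ("v2", [0.1, 0.2], "Unpack the device."),
    "doc1#c1": ("v3", [18.0, 5.0], "Connect the cable."),
}

def read_vector(chunk_id: str) -> List[float]:
    """Return the vector for a chunk, re-embedding it on first read if it is stale."""
    version, vector, text = store[chunk_id]
    if version != CURRENT_VERSION:
        vector = embed_v3(text)                       # pay the embedding cost on read
        store[chunk_id] = (CURRENT_VERSION, vector, text)
    return vector

stale_before = [cid for cid, (v, _, _) in store.items() if v != CURRENT_VERSION]
upgraded = read_vector("doc1#c0")   # stale: re-embedded and written back
untouched = read_vector("doc1#c1")  # already current: returned as stored
stale_after = [cid for cid, (v, _, _) in store.items() if v != CURRENT_VERSION]
```

Until every chunk has been read at least once, the store holds vectors from two models. Vectors from different models are generally not comparable, which is the inconsistency the offline re-embed path avoids.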

---

### 5. Concrete Scenarios Demonstrating the Trade-off

#### Scenario 1: 10,000 new documents dropped into S3 at once

* **Latency mode**: stream them and index as soon as embedding is done.
* **Consistency mode**: hold ingestion until chunking, dedup, quality checks complete.

#### Scenario 2: Embedding model upgrade from v2 → v3

* **Latency mode**: migrate incrementally (eventual consistency).
* **Consistency mode**: reprocess entire corpus before switching index version.

#### Scenario 3: Vector DB sharding

* **Latency mode**: allow out-of-sync shards; queries eventually converge.
* **Consistency mode**: force write quorums; slower ingest.
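
To illustrate Scenario 3, the sketch below models a write that succeeds only when enough replicas acknowledge it. `Replica` and `quorum_write` are hypothetical names, and the replicas are plain in-memory objects rather than a real vector database.

```python
from typing import Any, Dict, List

class Replica:
    """Hypothetical shard replica; healthy=False simulates a node that cannot acknowledge."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.data: Dict[str, Any] = {}

    def write(self, key: str, value: Any) -> bool:
        if not self.healthy:
            return False
        self.data[key] = value
        return True

def quorum_write(replicas: List[Replica], key: str, value: Any, quorum: int) -> bool:
    """Return True only if at least `quorum` replicas acknowledged the write."""
    acks = sum(1 for r in replicas if r.write(key, value))
    return acks >= quorum

replicas = [Replica("a"), Replica("b"), Replica("c", healthy=False)]
fast_ok = quorum_write(replicas, "doc1#c0", [0.1, 0.2], quorum=1)      # any single ack is enough
majority_ok = quorum_write(replicas, "doc1#c0", [0.1, 0.2], quorum=2)  # majority of three
strict_ok = quorum_write(replicas, "doc1#c0", [0.1, 0.2], quorum=3)    # all replicas required
```

A higher quorum makes a successful write visible on more replicas before ingestion moves on, at the cost of waiting for the slowest required node. Note that a rejected quorum write here still leaves data on the replicas that did acknowledge; a real system needs rollback or repair for that case.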

---

### 6. Techniques to Manage the Trade-off (Practical)

#### A. Dual-Written Index (Shadow Index)

Write to a new index version in background:

```
index_v2 (active)  
index_v3 (building)
```

Switch traffic when ready.

Balances:

* fast ingestion
* high consistency for final published index
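
The switch itself can be sketched as an alias that points queries at one index while another is built. `IndexAlias` and its methods are hypothetical, and the readiness check (an expected chunk count) is an assumed, deliberately simple gate.

```python
import threading
from typing import Dict, List

class IndexAlias:
    """Queries always read the active index; writes may target any index by name."""

    def __init__(self, active: str):
        self._lock = threading.Lock()
        self._indexes: Dict[str, Dict[str, List[float]]] = {active: {}}
        self._active = active

    def active_name(self) -> str:
        with self._lock:
            return self._active

    def write(self, index_name: str, chunk_id: str, vector: List[float]) -> None:
        with self._lock:
            self._indexes.setdefault(index_name, {})[chunk_id] = vector

    def search_ids(self) -> List[str]:
        with self._lock:
            return sorted(self._indexes[self._active])

    def switch(self, index_name: str, expected_count: int) -> bool:
        """Flip the alias only if the shadow index holds the expected number of chunks."""
        with self._lock:
            if len(self._indexes.get(index_name, {})) != expected_count:
                return False
            self._active = index_name
            return True

alias = IndexAlias(active="index_v2")
alias.write("index_v2", "doc1#c0", [0.1, 0.2])
alias.write("index_v3", "doc1#c0", [0.3, 0.4])      # shadow build in progress
early = alias.switch("index_v3", expected_count=2)   # refused: only 1 of 2 chunks present
alias.write("index_v3", "doc1#c1", [0.5, 0.6])
ready = alias.switch("index_v3", expected_count=2)   # accepted: shadow index is complete
```

Readers never observe a half-built `index_v3`, because the alias only moves once the build passes the readiness check.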

---

#### B. Micro-Batch Staging Layer

```
Chunks → staging table → (validated) → vector index
```

Enables:

* fast ingestion into staging
* strict validation for index
* rollback support

---

#### C. Eventual Consistency with Time-Based SLO

Common pattern: **fresh within 30–120 seconds**.

Used when slight delays are acceptable.
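
A freshness SLO like this is only useful if it is measured. The sketch below assumes the pipeline records two timestamps per document, when it was ingested and when it became searchable; `freshness_report` and the field names in its result are hypothetical.

```python
from typing import Dict, List, Tuple

def freshness_report(
    events: List[Tuple[float, float]], slo_seconds: float = 120.0
) -> Dict[str, float]:
    """events holds (ingested_at, searchable_at) timestamps in seconds."""
    lags = [searchable - ingested for ingested, searchable in events]
    within = sum(1 for lag in lags if lag <= slo_seconds)
    return {
        "max_lag_s": max(lags),
        "fraction_within_slo": within / len(lags),
    }

# Three documents with indexing lags of 30 s, 90 s, and 180 s.
events = [(0.0, 30.0), (10.0, 100.0), (20.0, 200.0)]
report = freshness_report(events)
```

Tracking the fraction of documents inside the window, rather than only the average lag, makes it visible when a minority of documents stay unsearchable for much longer than the target.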

---

#### D. Strong Consistency with Write Barriers

Use:

* atomic upserts
* version-controlled metadata
* shard-wide sync points

Slower, but consistent.

---

#### E. Quality Gates (Configurable)

Switchable at runtime:

```
FAST_MODE:
    skip_near_dedup = true
    skip_quality_scoring = true

SAFE_MODE:
    enforce_dedup = true
    verify_embeddings = true
    require_metadata = true
```

---

### 7. Industry Examples

#### Slack Search / Helpdesk Search

Optimized for latency → content visible almost immediately.

#### Financial Policy Document Systems

Optimized for consistency → ingestion pipelines run slower with strict validation.

#### RAG Systems in Production

Often adopt **mixed** approach:

* ingestion: eventually consistent
* retrieval: strongly consistent
  (using version barriers + shadow indexes)
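
A version barrier for the mixed approach might look like the sketch below: writes land immediately, but queries only see chunks at or below the last published version. `VersionedStore` and its methods are hypothetical and keep everything in memory.

```python
from typing import Dict, List

class VersionedStore:
    """Writes are accepted at any time; reads are limited to published versions."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []
        self.published_version = 0

    def write(self, chunk_id: str, version: int) -> None:
        # Ingestion side: append without waiting for anything (eventually consistent).
        self.rows.append({"id": chunk_id, "version": version})

    def publish(self, version: int) -> None:
        # Barrier: everything up to `version` becomes visible in one step.
        self.published_version = max(self.published_version, version)

    def visible_ids(self) -> List[str]:
        # Retrieval side: ignore rows newer than the published version.
        return [r["id"] for r in self.rows if r["version"] <= self.published_version]

store = VersionedStore()
store.write("doc1#c0", version=1)
store.write("doc1#c1", version=1)
store.publish(1)
store.write("doc2#c0", version=2)   # ingested, but not yet visible
before = store.visible_ids()
store.publish(2)
after = store.visible_ids()
```

Ingestion never blocks on readers, while a query either sees all of a published version or none of it.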

---

**Summary Table**

| Ingestion decision    | Latency focus | Consistency focus  |
| --------------------- | ------------- | ------------------ |
| Parsing               | fast, shallow | strict, deep       |
| Cleaning              | minimal       | heavy              |
| Dedup                 | exact only    | exact + near       |
| Embeddings            | immediate     | batched, versioned |
| Indexing              | eventual      | strong             |
| Updates               | incremental   | full rebuild       |
| RAG freshness         | high          | moderate           |
| Risk of inconsistency | higher        | minimal            |

---

**Final Summary**

**Latency-optimized ingestion** prioritizes speed and data freshness, accepting temporary inconsistencies (e.g., some chunks not yet indexed, older embedding versions in use).

**Consistency-optimized ingestion** prioritizes correctness, version alignment, and deterministic outputs at the cost of slower ingestion.

In practice, production GenAI systems use **hybrid strategies**:

* fast ingestion to staging
* consistent publishing to index
* versioned embeddings
* partial eventual consistency with bounded staleness



The following runnable **Python example** is a demonstration, not production code. It implements two operational modes for a GenAI ingestion pipeline:

* **FAST_MODE** — low-latency path (fewer checks, streaming into index/staging)
* **SAFE_MODE** — consistency-first path (strict validation, dedup, staged publish, atomic upsert)

Features:

* single `ModeConfig` to switch behaviour
* pipeline stages (parse → clean → chunk → dedup → embed → upsert)
* staging area + atomic publish for SAFE_MODE
* tunable batching, retries, and thresholds
* simple in-memory vector store and DLQ for demonstration
* example runs showing both modes

Save as `mode_ingest_pipeline.py` and run.

```python
# mode_ingest_pipeline.py
import time
import hashlib
import threading
from typing import List, Dict, Any
import numpy as np

# -------------------------
# Mode configuration
# -------------------------
class ModeConfig:
    def __init__(
        self,
        name: str,
        skip_near_dedup: bool,
        enforce_quality_checks: bool,
        embed_batch_size: int,
        use_staging_publish: bool,
        require_all_chunks_atomic: bool
    ):
        self.name = name
        self.skip_near_dedup = skip_near_dedup
        self.enforce_quality_checks = enforce_quality_checks
        self.embed_batch_size = embed_batch_size
        self.use_staging_publish = use_staging_publish
        self.require_all_chunks_atomic = require_all_chunks_atomic

FAST_MODE = ModeConfig(
    name="FAST_MODE",
    skip_near_dedup=True,
    enforce_quality_checks=False,
    embed_batch_size=64,
    use_staging_publish=False,
    require_all_chunks_atomic=False,
)

SAFE_MODE = ModeConfig(
    name="SAFE_MODE",
    skip_near_dedup=False,
    enforce_quality_checks=True,
    embed_batch_size=16,
    use_staging_publish=True,
    require_all_chunks_atomic=True,
)

# -------------------------
# Simple in-memory stores
# -------------------------
class InMemoryVectorStore:
    def __init__(self):
        self.store = {}           # vector_id -> (vector, metadata)
        self.lock = threading.Lock()
        self.index_version = 0

    def upsert(self, id: str, vector: List[float], metadata: Dict[str, Any]):
        with self.lock:
            self.store[id] = (vector, metadata)

    def bulk_upsert(self, rows: List[Dict[str, Any]]):
        with self.lock:
            for r in rows:
                self.store[r["id"]] = (r["vector"], r.get("metadata", {}))

    def fetch(self, id: str):
        return self.store.get(id)

    def count(self):
        return len(self.store)

    def snapshot(self):
        with self.lock:
            return dict(self.store)

vector_db = InMemoryVectorStore()
staging_store = {}   # doc_id -> list of chunk records
DLQ = []

# -------------------------
# Utilities
# -------------------------
def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def simple_clean(text: str) -> str:
    return " ".join(text.replace("\n", " ").split())

def simple_chunk(text: str, max_chars=300) -> List[str]:
    return [text[i:i+max_chars] for i in range(0, len(text), max_chars)]

def fake_embed_batch(texts: List[str], dim=1536) -> List[List[float]]:
    # deterministic pseudo-embeds using hashing seed
    out = []
    for t in texts:
        h = int(sha256(t)[:8], 16)
        rng = np.random.RandomState(h)
        v = rng.randn(dim).astype(float)
        v = (v / (np.linalg.norm(v) + 1e-9)).tolist()
        out.append(v)
    return out

def cosine_sim(a, b):
    a = np.array(a); b = np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-12))

# -------------------------
# Dedup structures (simple)
# -------------------------
exact_hash_set = set()

def exact_dedup_check(text: str) -> bool:
    h = sha256(" ".join(text.split()))
    if h in exact_hash_set:
        return True
    exact_hash_set.add(h)
    return False

# Near-dedup: brute-force cosine scan over the whole vector DB (demo only; O(n) per chunk)
def near_dedup_check(embedding, threshold=0.995) -> bool:
    # Production: replace this scan with an ANN search (e.g. HNSW).
    for vid, (vec, meta) in vector_db.snapshot().items():
        if cosine_sim(vec, embedding) >= threshold:
            return True
    return False

# -------------------------
# Pipeline class
# -------------------------
class IngestPipeline:
    def __init__(self, mode: ModeConfig):
        self.mode = mode

    def ingest_document(self, doc: Dict[str, Any]) -> bool:
        """Single-document ingestion orchestration, returns True on (attempted) success."""
        doc_id = doc.get("doc_id") or f"doc_{int(time.time()*1000)}"
        text = doc.get("text", "")
        if not text.strip():
            DLQ.append((doc, "EMPTY_TEXT"))
            return False

        # Parse & clean
        clean_text = simple_clean(text)
        if self.mode.enforce_quality_checks and len(clean_text) < 20:
            DLQ.append((doc, "TOO_SHORT"))
            return False

        # Chunk
        chunks = simple_chunk(clean_text, max_chars=400)   # tune per model
        chunk_records = []
        for idx, ch in enumerate(chunks):
            if exact_dedup_check(ch):
                # exact duplicate skip
                continue
            chunk_id = f"{doc_id}#c{idx}"
            chunk_records.append({
                "doc_id": doc_id,
                "chunk_id": chunk_id,
                "chunk_text": ch,
                "chunk_index": idx,
                "metadata": doc.get("metadata", {})
            })

        if not chunk_records:
            # nothing to ingest after dedup
            return True

        # If the mode publishes via staging (SAFE_MODE), stage the chunks first
        if self.mode.use_staging_publish:
            staging_store[doc_id] = {"chunks": chunk_records, "status": "staged", "created": time.time()}

        # Embedding in batches (mode controls batch sizes)
        texts = [c["chunk_text"] for c in chunk_records]
        batch_size = self.mode.embed_batch_size
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            emb = fake_embed_batch(batch_texts)
            embeddings.extend(emb)

        # Near-dedup (optional; expensive)
        rows_to_upsert = []
        for c, emb in zip(chunk_records, embeddings):
            if not self.mode.skip_near_dedup:
                if near_dedup_check(emb, threshold=0.995):
                    # treat as duplicate; skip indexing
                    continue
            rows_to_upsert.append({"id": c["chunk_id"], "vector": emb, "metadata": c["metadata"]})

        # Upsert strategy:
        if self.mode.use_staging_publish:
            # store into staging area first
            staging_store[doc_id]["embeddings"] = rows_to_upsert
            staging_store[doc_id]["status"] = "ready"
            # Publish atomically if policy requires
            if self.mode.require_all_chunks_atomic:
                return self._publish_staging_atomic(doc_id)
            else:
                # best-effort publish: push immediately but keep staging record
                vector_db.bulk_upsert(rows_to_upsert)
                return True
        else:
            # FAST_MODE: upsert directly (streaming)
            try:
                vector_db.bulk_upsert(rows_to_upsert)
                return True
            except Exception as e:
                DLQ.append((doc, f"UPSERT_FAIL:{str(e)}"))
                return False

    def _publish_staging_atomic(self, doc_id: str) -> bool:
        """Publish staged doc atomically: all-or-none semantics (demo)."""
        record = staging_store.get(doc_id)
        if not record or record.get("status") != "ready":
            DLQ.append(({"doc_id": doc_id}, "STAGING_NOT_READY"))
            return False
        rows = record.get("embeddings", [])
        # Basic atomicity: validate vectors then perform one bulk upsert
        # If validation fails, do not upsert and leave staged for manual inspection
        for r in rows:
            v = r["vector"]
            if not self._validate_vector(v):
                DLQ.append(({"doc_id": doc_id}, "INVALID_EMBEDDING"))
                return False
        try:
            vector_db.bulk_upsert(rows)
            record["status"] = "published"
            # increment index_version to indicate consistent publish (demo)
            vector_db.index_version += 1
            return True
        except Exception as e:
            DLQ.append(({"doc_id": doc_id}, f"PUBLISH_FAIL:{str(e)}"))
            return False

    def _validate_vector(self, v):
        arr = np.array(v)
        if arr.ndim != 1:
            return False
        if np.isnan(arr).any() or np.isinf(arr).any():
            return False
        return True

# -------------------------
# Demo usage
# -------------------------
def demo_run():
    docs = [
        {"doc_id": "doc_fast_1", "text": "How to install the Acme Widget? Unpack device, connect cable, run setup."},
        {"doc_id": "doc_fast_2", "text": "Short"},  # will fail in SAFE_MODE (quality)
        {"doc_id": "doc_fast_3", "text": "Repeated content. " * 50}
    ]

    print("=== Running FAST_MODE ===")
    p_fast = IngestPipeline(FAST_MODE)
    for d in docs:
        ok = p_fast.ingest_document(d)
        print("FAST ingest", d["doc_id"], "ok=", ok)
    print("Vector DB count (FAST):", vector_db.count())
    print("DLQ:", DLQ)

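    # Reset shared state so SAFE_MODE starts from an empty index; otherwise
    # near-dedup would treat every chunk as a duplicate of the FAST_MODE run.
    print("\nClearing exact_hash_set, vector_db and DLQ for safe demo")
    exact_hash_set.clear()
    vector_db.store.clear()
    DLQ.clear()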

    print("\n=== Running SAFE_MODE ===")
    p_safe = IngestPipeline(SAFE_MODE)
    for d in docs:
        ok = p_safe.ingest_document(d)
        print("SAFE ingest", d["doc_id"], "ok=", ok)
    print("Vector DB count (SAFE):", vector_db.count())
    print("Staging store keys:", list(staging_store.keys()))
    print("Index version:", vector_db.index_version)
    print("DLQ:", DLQ)

if __name__ == "__main__":
    demo_run()
```

---

### How the code maps FAST vs SAFE decisions

* **Deduplication**

  * FAST: exact dedup only (fast hash check)
  * SAFE: exact + near-dedup using cosine similarity (slower; the demo scans the whole store, production would use ANN search)

* **Quality checks**

  * FAST: skip heavy quality checks
  * SAFE: enforce minimum length and any other validators

* **Embedding batch size**

  * FAST: larger per-call batch of 64 (fewer embedding calls per document, higher throughput)
  * SAFE: smaller per-call batch of 16 (lower throughput, easier validation)

* **Publishing**

  * FAST: streaming direct upsert to vector DB
  * SAFE: stage then atomic publish (all-or-none) with index version increment

* **Error handling**

  * FAST: best-effort; failures go to DLQ but do not always block
  * SAFE: strict validation, failures kept in staging until resolved

---

### Production notes and improvements to adopt

* Replace `fake_embed_batch` with real embedding service; batch aggregation logic remains the same.
* Replace naive `near_dedup_check` with an ANN index (HNSW/FAISS/Chroma) for speed.
* Staging should be durable (S3 / DB) and support transactional publish (e.g., swap index alias).
* Use queues (Kafka/SQS) between stages to decouple latency and scale each stage independently.
* Add observability: metrics for docs/sec, chunks/sec, queue lag, embedding latency, DLQ rates.
* Implement configurable time-bounded staging (expire stale staged entries).
* Add per-tenant access controls and encryption on metadata before upsert.
* For SAFE_MODE atomic publish at scale, use shadow index pattern: build new index/version and switch alias atomically.

---

### Quick run explanation

* Run the file: `python mode_ingest_pipeline.py`
* It prints behavior differences:

  * FAST_MODE ingests quickly and upserts directly.
  * SAFE_MODE stages items, validates embeddings, and publishes atomically (or fails to DLQ if validation fails).

This demonstrates practical switches you can flip at runtime to trade **latency** for **consistency**.