```{contents}
```

## Scalability & throughput optimization

Below is a compact, actionable guide covering where ingestion systems bottleneck, concrete strategies to scale each stage, configuration examples, metrics to track, and operational trade-offs.

---

### 1 — Common bottlenecks

* **Extraction/IO**: reading many files from S3 or over the network.
* **Parsing/CPU**: PDF/OCR/HTML cleaning.
* **Dedup / similarity checks**: expensive similarity ops at scale.
* **Embedding generation**: GPU/remote API latency and rate limits.
* **Vector DB writes**: upsert throughput and index rebuild costs.
* **Orchestration**: scheduler overhead and task startup latency.
* **Missing backpressure**: upstream stages saturating downstream services (vector DB, embedder).

---

### 2 — High-level strategies (principles)

* **Push compute down**: do expensive ops (OCR, dedup) with parallel workers close to data.
* **Batch work**: group documents/chunks into optimal batches for embeddings and DB writes.
* **Streaming + micro-batch**: combine streaming ingestion (Kafka) with timed micro-batches for downstream heavy ops.
* **Autoscale resources**: scale GPU workers and CPU workers independently.
* **Idempotent operations**: enable safe retries and replays.
* **Asynchronous pipelines**: decouple stages with durable queues (Kafka/SQS).
* **Cache & reuse**: cache embeddings for identical texts / reuse existing vectors.
* **Backpressure / circuit breakers**: prevent overload when external services degrade.
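
As one illustration of the circuit-breaker principle above, here is a minimal sketch. The class name, thresholds, and injectable `clock` are assumptions made for the example, not part of any specific library.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream call skipped")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping embedder or vector DB calls in `breaker.call(...)` lets upstream workers fail fast and leave messages on the queue instead of piling retries onto a degraded service.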

---

### 3 — Stage-by-stage optimizations

#### Extraction & IO

* Use **parallel listing & range GETs** for object stores.
* Use **S3 Transfer Acceleration / multi-part downloads** where applicable.
* Use workers colocated in the same region and VPC endpoints.
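
A rough sketch of parallel ranged downloads follows. `fetch_range` is a hypothetical callable (for example, a thin wrapper around an object-store ranged GET); the part size and worker count are placeholder values to tune.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size, part_size):
    """Split [0, size) into inclusive (start, end) pairs, as used by HTTP Range headers."""
    return [(start, min(start + part_size, size) - 1) for start in range(0, size, part_size)]


def parallel_download(fetch_range, size, part_size=8 * 1024 * 1024, max_workers=8):
    """Fetch all ranges concurrently and reassemble them in order.

    `fetch_range(start, end)` is assumed to return the bytes of the inclusive range.
    """
    ranges = byte_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so the parts join back correctly.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)
```

Threads are sufficient here because the work is network-bound; the same range splitting also applies to listing prefixes in parallel.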

#### Parsing & Cleaning (CPU)

* Parallelize by file; use **multiprocessing** or Kubernetes pods.
* Use optimized parsers (pdfminer / tika) with streaming parsing.
* Offload OCR to GPU pods only for scanned docs (detect first).

#### Deduplication & Noise Removal

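* **Exact dedup**: hash each document or chunk (e.g., SHA-256) and put a Bloom filter in front of the hash store, so that definitely-new items skip the lookup and only possible hits are checked exactly.
* **Near-dedup**: run an approximate LSH / MinHash prefilter, then compute exact cosine similarity on the small candidate set.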
* Maintain a **cache of recent chunk signatures** to avoid recomputation.
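
The two dedup levels can be sketched with the standard library alone. This is an illustrative sketch, not a production implementation: it omits the Bloom filter and LSH banding, and the shingle size and number of hash functions are assumed defaults.

```python
import hashlib
import re


def exact_key(text):
    """SHA-256 of lowercased, whitespace-normalized text, used for exact dedup."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of k-word shingles; short texts collapse to a single shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text, num_perm=128, k=3):
    """MinHash signature: for each seed, the smallest salted hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(
            min(hashlib.blake2b(item, digest_size=8, salt=salt).digest() for item in items)
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Storing `exact_key` values and MinHash signatures alongside each chunk is one way to realize the signature cache mentioned above, since neither needs to be recomputed on later runs.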

#### Chunking

* Chunk in parallel; emit metadata (doc_id, offsets) to preserve lineage.
* Choose chunk sizes tuned to embedding model token window.
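
A minimal chunker that keeps lineage metadata might look like the sketch below. It sizes chunks in characters for simplicity; sizing by tokens would require the embedding model's tokenizer, which is not assumed here.

```python
def chunk_text(doc_id, text, size=1000, overlap=100):
    """Yield overlapping character windows together with their source offsets."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        yield {"doc_id": doc_id, "start": start, "end": end, "text": text[start:end]}
        if end == len(text):
            break  # the last window already reached the end of the document
```

Because each chunk carries `doc_id`, `start`, and `end`, documents can be chunked in parallel and any vector can still be traced back to its exact source span.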

#### Embedding Generation (the primary hotspot)

* Use **batching**: group N chunks per call. Typical batch sizes: 16–512 depending on model & GPU memory.
* For local models:

  * Use **GPU pods** via KubernetesPodOperator / K8s jobs; utilize mixed precision and batch inference.
  * Use model servers (TorchServe, Triton) that accept batched requests.
* For remote API providers:

  * Batch multiple inputs into a single API call if supported.
  * Implement **rate limiters** and **global token-based queues**.
* Use **quantized / distilled embedding models** (e.g., 8-bit weights) for throughput on CPU when the quality loss is acceptable.
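
Batching combines naturally with the embedding cache mentioned earlier. The sketch below assumes a hypothetical `embed_fn(list_of_str) -> list_of_vectors`, standing in for a local model or a remote API client, and any dict-like `cache`.

```python
import hashlib


def embed_with_cache(texts, embed_fn, cache, batch_size=64):
    """Return one vector per input text, calling `embed_fn` only for cache misses."""
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]

    # Collect unique misses in first-seen order so identical texts are embedded once.
    missing = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text

    miss_keys = list(missing)
    for i in range(0, len(miss_keys), batch_size):
        batch_keys = miss_keys[i:i + batch_size]
        vectors = embed_fn([missing[k] for k in batch_keys])  # one call per batch
        for k, v in zip(batch_keys, vectors):
            cache[k] = v

    return [cache[k] for k in keys]
```

In practice the cache key should probably also include the model name and version, so that a model upgrade does not serve stale vectors.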

#### Embedding Validation & Dedup (after embedding)

* Compute candidate similarity against a small **ANN index** (FAISS/HNSW) with tuned recall; avoid full pairwise comparisons.
* Use incremental indexing: upsert small batches, merge indices offline.

#### Vector DB Writes

* Use **bulk upserts** and tuned write parallelism.
* Prefer vector stores that support sharding and horizontal scaling (Pinecone, Qdrant, Weaviate).
* Use async writes with a retry/DLQ mechanism and ensure idempotent upsert keys.
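
Idempotent keys and a retry/DLQ loop can be sketched as follows. `upsert_fn` is a hypothetical callable wrapping whichever vector store client is in use, and the namespace UUID is an arbitrary fixed value chosen for the example.

```python
import uuid

# Arbitrary but fixed namespace: changing it changes every generated ID.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def vector_id(doc_id, start, end, model_version):
    """Deterministic ID: the same chunk and model version always map to the same key."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}:{start}:{end}:{model_version}"))


def upsert_with_retry(upsert_fn, records, batch_size=256, max_attempts=3):
    """Send records in bulk; batches that keep failing are returned as a dead-letter list."""
    dead_letter = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        for attempt in range(1, max_attempts + 1):
            try:
                upsert_fn(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letter.extend(batch)
                # A real worker would sleep with exponential backoff here.
    return dead_letter
```

Because the IDs are deterministic, replaying a batch after a partial failure overwrites the same keys rather than creating duplicates.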

#### Orchestration

* Use dynamic task mapping (Airflow) or partitioned assets (Dagster) to fan-out chunk processing.
* Use task pools & resource quotas to limit concurrent heavy tasks.
* Prefer **KubernetesPodOperator** or launching worker containers for GPU work.

---

### 4 — Architectural patterns

#### Micro-batch streaming (recommended)

```
Producer (S3 crawler / API) → Kafka topic (raw) →
Consumer group 1: parsing/cleaning → emits clean docs to `clean` topic →
Consumer group 2: chunker/dedup → emits chunk batches to `embed` topic →
Embedding pool (GPU pods) consumes embed topic → emits embeddings →
Vector DB writer consumes embedding topic → upsert
```

Advantages: high throughput, backpressure support, independent scaling per stage.

#### Batch + reindex pattern

* Bulk ingest raw into landing zone → run parallel map jobs (Spark / Dask) to transform → generate embeddings in parallel on GPU cluster → bulk upsert into vector DB.

---

### 5 — Autoscaling & resource configs (examples)

#### Kubernetes HPA (GPU worker)

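```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: embedder-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: embedder-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  # Resource metrics cover only cpu and memory, so GPU utilization has to come
  # from a custom metric. The name below assumes the NVIDIA DCGM exporter is
  # exposed to the HPA through a metrics adapter (e.g., prometheus-adapter).
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
```

(`Resource` metrics support only `cpu` and `memory`. GPU utilization or queue length must be exposed as custom or external metrics through a metrics adapter.)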

#### Airflow dynamic task mapping (embedding fan-out)

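```python
# Sketch for Airflow 2.3+, inside a DAG; load_batches and embed_batch are assumed to be @task functions.
batches = load_batches()                      # returns a list of chunk batches
embedded = embed_batch.expand(batch=batches)  # one mapped task instance per batch
```

Cap concurrency with pools, `max_active_tasks_per_dag`, or the task-level `max_active_tis_per_dag` so that mapped tasks do not overload the embedder or the vector DB.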

---

### 6 — Batching, concurrency & sizing guidance

* **Batch size** for embedding API: start 32–64; tune for latency/throughput.
* **Parallel workers**: start with #workers ≈ (docs_per_minute / batch_size) × avg_processing_time_per_batch (in minutes), then iterate.
* **Vector DB upsert**: bulk upsert 256–4096 vectors per request depending on provider.
* Use **profiling** and adjust batch size to maximize GPU utilization without causing OOM.
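
The worker estimate is an application of Little's law (concurrency = arrival rate × service time). A small worked sketch, where the `headroom` factor is an assumed safety margin rather than something prescribed above:

```python
import math


def workers_needed(docs_per_minute, batch_size, seconds_per_batch, headroom=1.3):
    """Estimate concurrent workers: batches arriving per second x seconds per batch."""
    batches_per_second = docs_per_minute / 60 / batch_size
    return max(1, math.ceil(batches_per_second * seconds_per_batch * headroom))


# 6000 docs/min in batches of 50 is 2 batches/s; at 1.5 s per batch that keeps
# 3 workers busy, and 30% headroom rounds up to 4.
print(workers_needed(6000, 50, 1.5))
```

Treat the result as a starting point and adjust it against observed queue depth and GPU utilization.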

---

### 7 — Cost & performance tradeoffs

* Larger batches → better throughput, but higher memory usage and per-item latency.
* More parallel pods → lower latency but higher cost.
* Local models on GPUs → usually cheaper at high volume than per-call API pricing, but with higher operational overhead.
* Quantization reduces cost/latency but may reduce quality.

---

### 8 — Monitoring & SLOs (what to track)

* **Throughput**: docs/sec, chunks/sec, embeddings/sec, vectors/sec.
* **Latency**: parse latency, embed latency (per batch), upsert latency.
* **Queue depth**: Kafka topic lag, task queue length.
* **Resource utilization**: GPU/CPU/memory, pod restarts.
* **Error rates**: validation fails, embedding fails, upsert fails.
* **Duplicate rate** after dedup (helps tune thresholds).

Set SLOs, e.g., 95% of batches processed within X seconds.
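
A percentile-based SLO like this one can be checked with a few lines. The sketch uses the nearest-rank percentile definition, which is one of several common conventions; the function names are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_met(batch_latencies_s, threshold_s, pct=95):
    """True if at least `pct` percent of batches finished within `threshold_s` seconds."""
    return percentile(batch_latencies_s, pct) <= threshold_s
```

In a real deployment the same check would more likely live in the metrics system as a histogram quantile, with this kind of function used for offline analysis of batch logs.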

---

### 9 — Large-scale operational tips (1M+ docs)

* **Shard data** by tenant/namespace and process shards independently.
* Precompute and store **signatures** to avoid reprocessing duplicates across runs.
* Use **tiered storage**: hot index for recent docs, cold snapshot for archival.
* Implement **lazy reembedding**: only reembed on-read or for high-impact docs when embedding model upgrades.
* Use **backfills** for historical reprocessing with controlled rate to avoid overload.
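
For sharding by tenant, a stable hash keeps shard assignment consistent across runs and processes. A minimal sketch, with function names chosen for the example:

```python
import hashlib
from collections import defaultdict


def shard_for(tenant_id, num_shards):
    """Stable shard index; unlike the built-in hash() on strings, it does not vary per process."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition_by_tenant(docs, num_shards):
    """Group (tenant_id, doc) pairs by shard so each shard can be processed independently."""
    shards = defaultdict(list)
    for tenant_id, doc in docs:
        shards[shard_for(tenant_id, num_shards)].append(doc)
    return dict(shards)
```

Note that changing `num_shards` reassigns most tenants; consistent hashing is the usual remedy if the shard count is expected to change.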

---

### 10 — Quick code pattern: async batching worker (pseudo-Python)

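```python
import asyncio

BATCH_SIZE = 64       # tune to the model and GPU memory
FLUSH_AFTER_S = 0.5   # flush a partial batch after this long without new input


async def ingest_producer(docs, queue):
    for d in docs:
        await queue.put(d)  # blocks when the queue is full, which provides backpressure


async def embed_worker(embed_model, queue, out_queue):
    batch = []
    while True:
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=FLUSH_AFTER_S))
        except asyncio.TimeoutError:
            pass  # no new input: fall through and flush whatever has accumulated
        else:
            if len(batch) < BATCH_SIZE:
                continue
        if batch:
            # embed_model.embed is assumed to be synchronous, so run it off the event loop.
            emb = await asyncio.to_thread(embed_model.embed, batch)
            await out_queue.put((batch, emb))
            for _ in batch:
                queue.task_done()  # lets callers wait on queue.join()
            batch = []

# Create queue = asyncio.Queue(maxsize=1000) inside the running loop, then spawn
# N embed_worker tasks with asyncio.create_task; tune N by GPU count and batch size.
```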

---
