```{contents}
```

## Storage Targets

This section covers **data lakes, data warehouses, OLTP databases, and OLAP systems**, and how each fits into a GenAI architecture.

---

### 1. Why storage targets matter in GenAI ingestion

A GenAI ingestion pipeline must store:

* **raw data** (PDFs, HTML, text, tickets, logs)
* **cleaned data**
* **chunked data**
* **embeddings**
* **metadata and lineage**
* **training/fine-tuning datasets**
* **analytics metrics**
* **audit logs and governance data**

Each category has different **performance**, **cost**, **consistency**, and **query** requirements, which is why they end up in different storage systems.

---

### 2. Storage targets and their roles

### **A. Data Lake**

Examples: **S3, GCS, Azure Blob, MinIO, HDFS**

#### Purpose in GenAI ingestion

The data lake is the central **landing zone** for:

* raw documents (PDF, HTML, DOCX, images)
* OCR output
* cleaned/normalized text
* chunk files
* embeddings (optional)
* intermediate transformation artifacts
* structured/unstructured hybrid data

#### Why it’s used

* Low cost per GB and effectively unlimited capacity
* Stores any format (binary, text, parquet, JSONL)
* Ideal for large unstructured corpora
* Enables versioned storage (raw v1, parsed v2, chunks v3…)
* Used for lineage + reproducibility

#### Typical structures

```
/raw/2025/02/17/docs/*.pdf
/clean/...
/chunks/...
/embeddings/...
/fine_tuning/...
/lineage/...
```
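A layout like this is usually produced by a small key-building helper. The following is a minimal sketch, not a prescribed scheme: the `lake_key` function and `STAGES` set are hypothetical names, and reusing the date-partitioned `docs/` pattern of the `raw` prefix for the other stages is an assumption, since the layout above elides their structure.

```python
from datetime import date
from pathlib import PurePosixPath

# Stage prefixes taken from the layout above.
STAGES = {"raw", "clean", "chunks", "embeddings", "fine_tuning", "lineage"}


def lake_key(stage: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key, e.g. raw/2025/02/17/docs/a.pdf.

    Assumption: every stage uses the same partitioning as the raw prefix.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    # Keep only the final path component so a caller cannot escape the prefix.
    safe_name = PurePosixPath(filename).name
    return f"{stage}/{day:%Y/%m/%d}/docs/{safe_name}"


print(lake_key("raw", date(2025, 2, 17), "contract.pdf"))
```

Centralizing key construction in one function keeps every pipeline stage writing to predictable prefixes, which is what makes later lineage and reprocessing jobs straightforward.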

#### Tradeoff

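* Slower queries than a warehouse or database (object scans, no indexes)
* No multi-object transactions, and consistency guarantees vary by store (S3 and GCS now offer strong read-after-write consistency for individual objects)
* Not suitable for frequent, small real-time updates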

#### Summary

**Best location for raw → processed → model-ready unstructured data.**

---

### **B. Data Warehouse**

Examples: **Snowflake, BigQuery, Redshift, Databricks SQL Warehouse**

#### Purpose in Generative-AI ingestion

Stores **structured and semi-structured analytics data**:

* metadata for each document/chunk
* quality scores
* ingestion metrics (latency, throughput)
* lineage tables
* PII classification results
* per-tenant access control metadata
* training dataset catalogs
* evaluation metrics

#### Why it’s used

* Fast SQL queries
* Well suited to monitoring ingestion
* Good for reporting and dashboards
* Enforces strong consistency
* Ideal for governance & compliance

#### Typical tables

```
documents (doc_id, source, version, hash, ingest_time, pii_flags, classification)
chunks    (chunk_id, doc_id, index, offsets, quality_score)
embeddings_metadata (chunk_id, model_version, dim, timestamp)
pipeline_runs
pii_audit
```
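To make the table shapes concrete, here is a hedged sketch that uses in-memory SQLite purely as a local stand-in for a warehouse. The column types, the text encoding of `offsets`, and the sample rows are all assumptions for illustration; the hypothetical `quality_by_source` helper shows the kind of oversight query these tables enable.

```python
import sqlite3

# In-memory SQLite is only a local stand-in; a real warehouse has its own types.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        doc_id TEXT PRIMARY KEY,
        source TEXT,
        version INTEGER,
        hash TEXT,
        ingest_time TEXT,
        pii_flags TEXT,
        classification TEXT
    );
    CREATE TABLE chunks (
        chunk_id TEXT PRIMARY KEY,
        doc_id TEXT REFERENCES documents(doc_id),
        "index" INTEGER,   -- quoted because INDEX is an SQL keyword
        offsets TEXT,      -- assumed "start-end" encoding
        quality_score REAL
    );
    """
)

# Illustrative rows only; not real data.
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("d1", "wiki", 1, "abc", "2025-02-17T10:00:00", "", "public"),
        ("d2", "tickets", 1, "def", "2025-02-17T11:00:00", "email", "internal"),
    ],
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", "d1", 0, "0-512", 0.75),
        ("c2", "d1", 1, "512-900", 0.25),
        ("c3", "d2", 0, "0-300", 0.5),
    ],
)


def quality_by_source(conn: sqlite3.Connection) -> list:
    """Return (source, chunk_count, mean quality_score) per document source."""
    return conn.execute(
        """
        SELECT d.source, COUNT(*), AVG(c.quality_score)
        FROM chunks AS c
        JOIN documents AS d ON d.doc_id = c.doc_id
        GROUP BY d.source
        ORDER BY d.source
        """
    ).fetchall()


for row in quality_by_source(conn):
    print(row)
```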

#### Tradeoff

* Not ideal for storing large vectors or binary content
* More expensive than data lake

#### Summary

**Best for structured oversight, metadata, metrics, governance.**

---

### **C. OLTP Databases (Operational DBs)**

Examples: **PostgreSQL, MySQL, CockroachDB, DynamoDB, Firestore**

#### Purpose in GenAI ingestion

Stores **fast-changing operational state**:

* document registry (tracking ingest state)
* user permissions
* chunk status flags (parsed / deduped / embedded / indexed)
* ingestion jobs & offsets
* index versioning
* active document revisions

#### Why it’s used

* Fast read/write
* Strong consistency
* Low latency for pipeline control logic
* Good for storing "current state" of ingestion

#### Example table: `ingest_status`

```
doc_id
status (raw, parsed, chunked, embedded, indexed)
retry_count
checksum
last_updated
```
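The value of an OLTP store here is that status changes can be made atomic. The sketch below, again using in-memory SQLite as a stand-in for PostgreSQL or similar, shows a compare-and-set transition: a worker advances a document only if it is still in the status the worker expects. The `advance` helper, the column types, and the linear `STATUS_ORDER` are assumptions for illustration.

```python
import sqlite3

# Linear status progression taken from the example table above.
STATUS_ORDER = ["raw", "parsed", "chunked", "embedded", "indexed"]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ingest_status (
        doc_id TEXT PRIMARY KEY,
        status TEXT NOT NULL,
        retry_count INTEGER NOT NULL DEFAULT 0,
        checksum TEXT,
        last_updated TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def advance(conn: sqlite3.Connection, doc_id: str, expected: str) -> bool:
    """Move doc_id to the next status only if it is still in `expected`.

    Returns False when another worker already moved the document on,
    or when `expected` is the terminal status.
    """
    idx = STATUS_ORDER.index(expected)
    if idx == len(STATUS_ORDER) - 1:
        return False
    next_status = STATUS_ORDER[idx + 1]
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE ingest_status "
            "SET status = ?, last_updated = CURRENT_TIMESTAMP "
            "WHERE doc_id = ? AND status = ?",
            (next_status, doc_id, expected),
        )
    return cur.rowcount == 1


conn.execute(
    "INSERT INTO ingest_status (doc_id, status, checksum) VALUES (?, ?, ?)",
    ("d1", "raw", "abc"),
)
print(advance(conn, "d1", "raw"))  # first worker wins
print(advance(conn, "d1", "raw"))  # stale worker is rejected
```

The `WHERE ... AND status = ?` guard is what prevents two workers from processing the same document twice; the same idea carries over to any database with atomic conditional updates.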

#### Tradeoff

* Not suitable for large blobs
* Not suitable for analytical queries over millions of rows

#### Summary

**Controls ingestion workflow and state tracking.**

---

### **D. OLAP Systems (Analytics Engines)**

Examples: **ClickHouse, BigQuery, Druid, DuckDB, Databricks Delta Engine**

#### Purpose in GenAI ingestion

OLAP systems help run **heavy analytics on ingestion pipelines**, such as:

* dedup analytics
* embedding similarity metrics
* ingestion performance
* quality scoring pipelines
* drift detection
* curator dashboards (find bad chunks)

#### Why it’s used

* High throughput analytical scans
* Excellent for time-series ingestion metrics
* Useful for continuous evaluation of RAG quality

#### Example use cases

* “How many documents were ingested in the last 24 hours?”
* “Which embeddings changed for model v3?”
* “Which chunks produce the lowest retrieval quality?”
* “What is the near-duplicate ratio?”
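Two of these questions can be expressed as simple aggregate queries. This is a sketch only: SQLite stands in for an OLAP engine, and the `ingest_events` table, its columns, and the sample rows are hypothetical. The current time is passed in explicitly so the example is deterministic.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
# Hypothetical append-only event table; timestamps stored as ISO-8601 text.
conn.execute(
    "CREATE TABLE ingest_events ("
    "doc_id TEXT, ingested_at TEXT, is_near_duplicate INTEGER)"
)
conn.executemany(
    "INSERT INTO ingest_events VALUES (?, ?, ?)",
    [
        ("d1", "2025-02-17T11:00:00", 0),
        ("d2", "2025-02-17T02:00:00", 1),
        ("d3", "2025-02-15T09:00:00", 0),
        ("d4", "2025-02-16T13:00:00", 0),
    ],
)


def ingested_since(conn, now: datetime, window: timedelta = timedelta(hours=24)) -> int:
    """Count documents ingested within `window` before `now`."""
    cutoff = (now - window).isoformat()
    # ISO-8601 strings in one format sort chronologically, so >= works on text.
    row = conn.execute(
        "SELECT COUNT(*) FROM ingest_events WHERE ingested_at >= ?", (cutoff,)
    ).fetchone()
    return row[0]


def near_duplicate_ratio(conn):
    """Fraction of events flagged as near-duplicates (None if the table is empty)."""
    return conn.execute(
        "SELECT AVG(is_near_duplicate) FROM ingest_events"
    ).fetchone()[0]


now = datetime(2025, 2, 17, 12, 0, 0)
print(ingested_since(conn, now))
print(near_duplicate_ratio(conn))
```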

#### Tradeoff

* Not for transactional ingestion state
* Typically append-only

#### Summary

**Used for monitoring + analytics + QA over large ingestion volumes.**

---

### 3. How these storage systems work together in a GenAI pipeline

#### Complete pipeline with correct storage choices

```
          (Raw PDFs, HTMLs)
                  ↓
            Data Lake (raw)
                  ↓
        Parsing / Cleaning cluster
                  ↓
            Data Lake (clean)
                  ↓
           Chunking & Dedup
                  ↓
            Data Lake (chunks)
                  ↓
    Embedding service (GPU / API)
                  ↓
             Vector Database
            (Pinecone, Qdrant)
                  ↓
  Metadata to Data Warehouse + OLTP
```

#### Vector DB (important note)

Vector DBs (e.g., Pinecone, Qdrant, Chroma) are **not** a general-purpose storage target; they store:

* embeddings
* chunk metadata

But **raw content** must always live in the data lake to ensure reproducibility and lineage.
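One way to honor this split is for each vector record to carry a pointer back to the lake rather than the content itself. The sketch below is an assumption-laden illustration: `VectorRecord`, its `lake_key` field, and the brute-force `search` function are hypothetical, and the linear scan merely stands in for the approximate nearest-neighbor index a real vector DB would use.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorRecord:
    chunk_id: str
    embedding: tuple          # vector of floats
    lake_key: str             # where the chunk text lives in the data lake
    model_version: str


def cosine(a, b) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query, k: int = 1):
    """Brute-force top-k by cosine similarity (stand-in for an ANN index)."""
    ranked = sorted(records, key=lambda r: cosine(r.embedding, query), reverse=True)
    return ranked[:k]


# Illustrative records with made-up 2-d embeddings and hypothetical keys.
records = [
    VectorRecord("c1", (1.0, 0.0), "chunks/d1/c1.txt", "v3"),
    VectorRecord("c2", (0.0, 1.0), "chunks/d1/c2.txt", "v3"),
]
best = search(records, (0.9, 0.1))[0]
print(best.chunk_id, best.lake_key)
```

Because the record stores only a key, the index can be rebuilt with a new embedding model from the lake without losing the link between a vector and its source text.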

---

### 4. A practical separation of concerns

| Storage Type  | What goes here                                                   | Why                              |
| ------------- | ---------------------------------------------------------------- | -------------------------------- |
| **Data Lake** | raw files, cleaned text, chunks, embeddings (optional), datasets | cheap, scalable, versioned       |
| **OLTP**      | ingestion state, retries, pipeline run states                    | fast updates, strong consistency |
| **Warehouse** | metadata, lineage, audit logs, pipeline metrics                  | analytics, governance            |
| **OLAP**      | quality, dedup, drift, and ingestion-performance analytics       | fast analytics                   |
| **Vector DB** | embeddings + chunk metadata                                      | similarity search                |

---

### 5. Design patterns for GenAI data ingestion using these storage targets

#### Pattern A: “Medallion Architecture” for RAG

```
Bronze  (raw)   → Data Lake
Silver  (clean) → Data Lake
Gold    (chunks + metadata) → Data Lake + Warehouse
Vectors (embeddings) → Vector DB
```

#### Pattern B: “Hybrid Control Plane”

```
OLTP = job control + statuses
Warehouse = analytics + lineage
Lake = data products
Vector DB = retrieval engine
```

---

### 6. Example metadata flow (showing each store)

#### When processing a document:

1. Raw PDF → **Data Lake** (`raw/`)
2. Parsed text → **Data Lake** (`clean/`)
3. Chunk metadata → **Warehouse** (`chunks` table)
4. Chunk state → **OLTP** (`ingest_status`)
5. Final embeddings → **Vector DB**
6. Index version → **OLTP**
7. Ingestion metrics → **OLAP**

Each artifact lands in the store whose access pattern fits it.
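The seven steps above amount to a routing table from artifact kind to store. A minimal sketch follows; the `Store` enum, the artifact-kind strings, and the `route` function are hypothetical names chosen for illustration, not part of any library.

```python
from enum import Enum


class Store(Enum):
    LAKE = "data_lake"
    WAREHOUSE = "warehouse"
    OLTP = "oltp"
    OLAP = "olap"
    VECTOR_DB = "vector_db"


# One entry per step in the flow above.
ROUTES = {
    "raw_pdf": Store.LAKE,
    "parsed_text": Store.LAKE,
    "chunk_metadata": Store.WAREHOUSE,
    "chunk_state": Store.OLTP,
    "embedding": Store.VECTOR_DB,
    "index_version": Store.OLTP,
    "ingestion_metric": Store.OLAP,
}


def route(artifact_kind: str) -> Store:
    """Return the storage target for an artifact kind, failing loudly if unknown."""
    try:
        return ROUTES[artifact_kind]
    except KeyError:
        raise ValueError(f"no storage target for {artifact_kind!r}") from None


for kind in ROUTES:
    print(f"{kind} -> {route(kind).value}")
```

Failing on unknown kinds, rather than defaulting to the lake, makes it harder for a new artifact type to be stored somewhere by accident.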

---

### 7. Practical guidance for choosing where to store what

#### Store raw data in the **data lake**

Because:

* reproducibility
* low cost
* independent of model version

#### Store metadata + lineage in the **warehouse**

Because:

* you need SQL analytics
* governance + compliance

#### Store ingestion state in the **OLTP** database

Because:

* low-latency workflow control

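#### Store embeddings for retrieval in the **vector DB**

Because:

* optimized for similarity search
* an archived copy in the data lake remains optional, for re-indexing and reproducibility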

#### Store analytics in an **OLAP** system

Because:

* fast analytical queries
* time-series metrics

---

**Summary**

**Data Lake**

* Master store for raw & processed unstructured data.
* Best for versioning, lineage, cheap storage.

**Data Warehouse**

* Stores structured metadata, lineage, PII classifications, metrics.
* Enables governance and analytics.

**OLTP (operational store)**

* Tracks ingest status, job state, retries, versioning.
* Ensures consistent pipeline coordination.

**OLAP**

* High-performance analytics for quality, dedup, drift, performance metrics.

**Vector DB**

* Final store for embeddings + semantic metadata to enable RAG.

Together, these form the complete storage architecture required for production GenAI ingestion.