
## Metadata

Metadata is **descriptive information attached to documents, chunks, embeddings, or model outputs**.
It enables traceability, quality control, searchability, and reproducibility.

### Types of metadata

#### **A. Document Metadata**

* `doc_id`
* `source` (URL, S3 path, SharePoint path)
* `filename`
* `ingestion_timestamp`
* `language`
* `document_type` (FAQ, manual, ticket)
* `author` (if available)
* `version`
* `hash` (raw document checksum)

#### **B. Chunk Metadata**

* `chunk_id`
* `doc_id`
* `chunk_index`
* `section_title`
* `page_number`
* `char_offset`
* `token_count`
* `quality_score`
* `chunk_version` (if chunking strategy changes)

#### **C. Embedding Metadata**

* `embedding_model_version`
* `dimensions`
* `vector_id`
* `embedding_timestamp`
* `similarity_threshold_used`

#### **D. Processing Metadata**

* `parser_used` (pdfminer, tika, html cleaner)
* `ocr_engine` (tesseract, paddleocr)
* `cleaning_version`
* `chunking_version`
* `dedup_strategy_version`
* `safety_filter_version`

#### **E. Operational Metadata**

* `ingestion_batch_id`
* `retry_count`
* `worker_id`
* `pipeline_id` (Airflow/Dagster run ID)
* `latency` per stage

This metadata lets you reconstruct *exactly how* the data was produced.
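As a rough illustration of how several of the fields above might travel together, here is a minimal sketch of a per-chunk record. The `ChunkMetadata` class and `build_chunk_metadata` helper are hypothetical names introduced for this example, and the `chunk_id` format and the use of SHA-256 for the checksum are assumptions, not requirements.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical record combining chunk, embedding, and processing fields."""
    doc_id: str
    chunk_id: str
    chunk_index: int
    source: str
    chunking_version: str
    embedding_model_version: str
    text_hash: str
    ingestion_timestamp: str


def build_chunk_metadata(doc_id, chunk_index, text, source,
                         chunking_version, embedding_model_version):
    """Derive the ID, checksum, and timestamp so callers cannot forget them."""
    return ChunkMetadata(
        doc_id=doc_id,
        chunk_id=f"{doc_id}_{chunk_index}",  # assumed ID convention
        chunk_index=chunk_index,
        source=source,
        chunking_version=chunking_version,
        embedding_model_version=embedding_model_version,
        text_hash=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        ingestion_timestamp=datetime.now(timezone.utc).isoformat(),
    )


meta = build_chunk_metadata(
    "doc_001", 2, "Install the bracket first.",
    "s3://bucket/manual.pdf", "v4", "embed-v3",
)
record = asdict(meta)  # plain dict, ready to attach to a vector or a table row
```

Freezing the dataclass is a design choice: once a record is written, it should describe what happened rather than be edited later.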

---

### 2. What Lineage Tracking Means

**Lineage tracking** is the process of recording every transformation step that data undergoes from raw → cleaned → chunked → embedded → indexed.

It answers:

* Where did this chunk come from?
* What raw PDF created this embedding?
* Which chunking strategy was used?
* Which model created this embedding?
* When was it processed?
* Can we reproduce it exactly if needed?

Lineage is essential for:

* debugging
* audits
* model drift analysis
* re-ingesting with new embedding models
* ensuring reproducibility

---

### 3. Lineage Tracking Model (Simple Practical Graph)

```
RAW_DOCUMENT
  |
  ├── parsed_by = pdf_parser_v2
  ↓
PARSED_TEXT
  |
  ├── cleaned_by = cleaner_v3
  ↓
CLEAN_TEXT
  |
  ├── chunked_by = chunker_v4
  ↓
CHUNKS
  |
  ├── embedded_by = embedding_model_v3
  ├── vector_id = x123
  ↓
EMBEDDINGS
  |
  └── indexed_in = pinecone_index_v7
```

Each arrow represents a **lineage edge** with metadata.
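One way to make the graph above queryable is to store each arrow as an explicit edge record. The sketch below is only an illustration: `LineageEdge`, `trace_upstream`, and the `stage:id` naming scheme are hypothetical, and it assumes every artifact has exactly one parent.

```python
from typing import NamedTuple


class LineageEdge(NamedTuple):
    parent: str   # upstream artifact ID
    child: str    # downstream artifact ID
    step: str     # transformation that produced the child
    version: str  # version of that transformation


# Hypothetical edges for one document, mirroring the graph above.
EDGES = [
    LineageEdge("raw:doc_001", "parsed:doc_001", "parse", "pdf_parser_v2"),
    LineageEdge("parsed:doc_001", "clean:doc_001", "clean", "cleaner_v3"),
    LineageEdge("clean:doc_001", "chunk:doc_001_2", "chunk", "chunker_v4"),
    LineageEdge("chunk:doc_001_2", "vector:x123", "embed", "embedding_model_v3"),
]


def trace_upstream(artifact_id, edges):
    """Walk child -> parent links and return the edges from raw to artifact."""
    by_child = {e.child: e for e in edges}  # assumes one parent per artifact
    path = []
    current = artifact_id
    while current in by_child:
        edge = by_child[current]
        path.append(edge)
        current = edge.parent
    return list(reversed(path))


path = trace_upstream("vector:x123", EDGES)
steps = [(e.step, e.version) for e in path]
```

Here `steps` lists every transformation and its version, in order, that led to vector `x123`.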

---

### 4. Why Metadata + Lineage Are Critical in Generative AI

#### A. Embedding Model Upgrades

If you switch the embedding model from `v1` to `v2`, you must know:

* which chunks were embedded with the old model
* which chunks need re-embedding
* which vector store index version was used

Without lineage, you cannot tell which vectors are stale, so the only safe option is to re-embed and re-index everything.
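With the model version recorded per vector, finding what to re-embed becomes a simple filter. This is a minimal sketch over in-memory rows; `find_stale_vectors` and the sample records are hypothetical, and a real pipeline would more likely run the same filter as a query against its metadata store.

```python
def find_stale_vectors(records, target_model):
    """Return vector IDs whose recorded model differs from the target model."""
    return [
        r["vector_id"]
        for r in records
        if r["embedding_model_version"] != target_model
    ]


# Hypothetical embedding metadata rows.
records = [
    {"vector_id": "x1", "chunk_id": "doc_001_0", "embedding_model_version": "v1"},
    {"vector_id": "x2", "chunk_id": "doc_001_1", "embedding_model_version": "v2"},
    {"vector_id": "x3", "chunk_id": "doc_002_0", "embedding_model_version": "v1"},
]

stale = find_stale_vectors(records, target_model="v2")
```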

---

#### B. Chunking Strategy Changes

If you update chunk size from **200 tokens → 500 tokens**, lineage lets you:

* rollback
* regenerate
* compare retrieval metrics between versions
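To illustrate the comparison step, the sketch below scores two chunking versions with recall@k. Relevance is judged at the document level because chunk IDs change when the chunking strategy changes. The `recall_at_k` helper, the version labels, and the sample results are all made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


# Hypothetical ranked doc IDs returned for the same query by two index builds.
results_by_version = {
    "chunk_v3": ["doc_002", "doc_001", "doc_007"],
    "chunk_v4": ["doc_001", "doc_004", "doc_002"],
}
relevant_docs = {"doc_001", "doc_004"}

scores = {
    version: recall_at_k(ranked, relevant_docs, k=2)
    for version, ranked in results_by_version.items()
}
```

In practice the scores would be averaged over a labeled query set rather than computed for a single query.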

---

#### C. Safety / Compliance

Lineage provides:

* proof of document origin
* GDPR/PII deletion tracking
* dataset auditability for regulations
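For deletion tracking in particular, lineage lets you enumerate everything derived from a document before removing it. The following is a minimal sketch over in-memory rows; `plan_deletion` and the sample data are hypothetical, and the actual delete calls depend on the stores involved.

```python
def plan_deletion(doc_id, chunks, embeddings):
    """List every chunk and vector derived from one document."""
    chunk_ids = {c["chunk_id"] for c in chunks if c["doc_id"] == doc_id}
    vector_ids = sorted(
        e["vector_id"] for e in embeddings if e["chunk_id"] in chunk_ids
    )
    return {
        "doc_id": doc_id,
        "chunk_ids": sorted(chunk_ids),
        "vector_ids": vector_ids,
    }


# Hypothetical lineage rows.
chunks = [
    {"chunk_id": "doc_001_0", "doc_id": "doc_001"},
    {"chunk_id": "doc_001_1", "doc_id": "doc_001"},
    {"chunk_id": "doc_002_0", "doc_id": "doc_002"},
]
embeddings = [
    {"vector_id": "x1", "chunk_id": "doc_001_0"},
    {"vector_id": "x2", "chunk_id": "doc_001_1"},
    {"vector_id": "x3", "chunk_id": "doc_002_0"},
]

plan = plan_deletion("doc_001", chunks, embeddings)
```

Keeping the returned plan alongside the deletion timestamp gives an audit record of what was removed and why.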

---

#### D. Debugging RAG Failures

If retrieval produces irrelevant results, lineage tells you what caused the issue:

* which chunk
* which embedding model
* which preprocessing step
* which version of the dedup strategy

---

### 5. How to Store Metadata and Lineage

#### A. Vector Databases

Add metadata per vector:

```
{
  "doc_id": "doc_001",
  "chunk_id": "doc_001_2",
  "source": "s3://bucket/manual.pdf",
  "page": 12,
  "section": "Installation",
  "chunk_version": 4,
  "embedding_version": "embed-v3",
  "ingested_at": "2025-02-10T10:22:00Z"
}
```

#### B. Document Store (SQL / NoSQL)

Use tables:

##### `documents`

```
doc_id | source | hash | language | version | parser_version
```

##### `chunks`

```
chunk_id | doc_id | chunk_index | text | chunk_version | cleaning_version
```

##### `embeddings`

```
vector_id | chunk_id | model_version | dim | timestamp | index_version
```
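The three tables above could be sketched in SQLite as follows, with a join that traces a vector back to its source document. The column types, the foreign keys, and the sample rows are assumptions made for this example; a production schema would likely add indexes and stricter constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        doc_id TEXT PRIMARY KEY, source TEXT, hash TEXT,
        language TEXT, version TEXT, parser_version TEXT
    );
    CREATE TABLE chunks (
        chunk_id TEXT PRIMARY KEY,
        doc_id TEXT REFERENCES documents(doc_id),
        chunk_index INTEGER, text TEXT,
        chunk_version TEXT, cleaning_version TEXT
    );
    CREATE TABLE embeddings (
        vector_id TEXT PRIMARY KEY,
        chunk_id TEXT REFERENCES chunks(chunk_id),
        model_version TEXT, dim INTEGER, timestamp TEXT, index_version TEXT
    );
    """
)

# Hypothetical sample rows.
conn.execute(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?)",
    ("doc_001", "s3://bucket/manual.pdf", "abc123", "en", "1", "pdf_parser_v2"),
)
conn.execute(
    "INSERT INTO chunks VALUES (?, ?, ?, ?, ?, ?)",
    ("doc_001_2", "doc_001", 2, "Install the bracket first.", "4", "clean_v3"),
)
conn.execute(
    "INSERT INTO embeddings VALUES (?, ?, ?, ?, ?, ?)",
    ("x123", "doc_001_2", "embed-v3", 1536, "2025-02-10T10:22:00Z", "v7"),
)

# Trace one vector back to its source document and processing versions.
row = conn.execute(
    """
    SELECT d.source, c.chunk_version, e.model_version
    FROM embeddings AS e
    JOIN chunks AS c ON c.chunk_id = e.chunk_id
    JOIN documents AS d ON d.doc_id = c.doc_id
    WHERE e.vector_id = ?
    """,
    ("x123",),
).fetchone()
```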

#### C. Data Lake (S3/GCS)

Store intermediate states:

```
raw/       doc_001.pdf
clean/     doc_001.clean.json
chunks/    doc_001.chunks.json
embed/     doc_001.embed.json
```
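A small helper can keep the layout above consistent across pipeline stages. This is only a sketch: `stage_key` and the `STAGE_SUFFIX` mapping are hypothetical names, and the suffixes simply mirror the listing above.

```python
from pathlib import PurePosixPath

# Stage prefix -> file suffix, following the layout shown above.
STAGE_SUFFIX = {
    "raw": ".pdf",
    "clean": ".clean.json",
    "chunks": ".chunks.json",
    "embed": ".embed.json",
}


def stage_key(stage, doc_id):
    """Build the object key for one document at one pipeline stage."""
    if stage not in STAGE_SUFFIX:
        raise ValueError(f"unknown stage: {stage}")
    return str(PurePosixPath(stage) / f"{doc_id}{STAGE_SUFFIX[stage]}")


keys = [stage_key(stage, "doc_001") for stage in STAGE_SUFFIX]
```

Because every stage derives its key from the same `doc_id`, any intermediate file can be located from the lineage record alone.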

#### D. Orchestrator Metadata (Airflow / Dagster)

* Airflow: `run_id`, task logs, XCom
* Dagster: materializations + asset lineage view

---

### 6. Lineage Tracking in Airflow (Simplified)

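A task can record lineage metadata by pushing it to XCom:

```python
def record_lineage(d, **context):
    # `d` is the document record the task just processed
    context["ti"].xcom_push(key="lineage", value={
        "doc_id": d["doc_id"],
        "cleaner_version": "3.1",
        "embedder_model": "openai-embed-v3",
        "run_id": context["run_id"],
    })
```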

The record can then be written to:

* S3 JSON lineage file
* DynamoDB lineage table
* Elasticsearch index for search
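As a sketch of the first option, lineage records can be appended as JSON Lines. The example writes to a local temporary file as a stand-in for an S3 object, so no cloud client is needed; `append_lineage`, `load_lineage`, and the record fields are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def append_lineage(path, record):
    """Append one lineage record as a single JSON line."""
    line = json.dumps(record, sort_keys=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def load_lineage(path):
    """Read all lineage records back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Local file used here in place of an S3 object.
lineage_file = Path(tempfile.mkdtemp()) / "lineage.jsonl"

append_lineage(lineage_file, {
    "doc_id": "doc_001", "cleaner_version": "3.1",
    "embedder_model": "openai-embed-v3", "run_id": "run_001",
})
append_lineage(lineage_file, {
    "doc_id": "doc_002", "cleaner_version": "3.1",
    "embedder_model": "openai-embed-v3", "run_id": "run_001",
})

records = load_lineage(lineage_file)
```

An append-only format suits audits because earlier records are never rewritten.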

---

### 7. Lineage Tracking in Dagster (Native Support)

Dagster provides **software-defined assets** with built-in lineage.

```python
from dagster import asset

@asset
def clean_text(raw_document):  # parameter name refers to an upstream asset
    ...
    return cleaned_text    # Dagster tracks lineage automatically

@asset
def chunks(clean_text):
    ...
    return chunk_list

@asset
def embeddings(chunks):
    ...
```

Dagster UI automatically shows:

* dependencies
* materialization history
* metadata for each step

This is ideal for GenAI ingestion.

---

### 8. Metadata + Lineage Best Practices for Generative-AI Data Pipelines

#### Mandatory fields to store for each chunk

* `doc_id`
* `chunk_id`
* `chunk_index`
* `cleaning_version`
* `chunking_version`
* `embedding_model_version`
* `ingestion_timestamp`
* `source_uri`
* `hash_raw`
* `hash_clean`

#### Recommended workflows

* Store hashes at each stage to detect unintended changes
* Use consistent versioning (clean_v3, chunk_v4, embed_v3)
* Maintain index_version for vector DB rebuild
* Keep full lineage in S3 as JSON for audits
* Use Dagster for asset lineage if possible
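The first recommendation, hashing at each stage, might look like the sketch below. The `sha256_text` and `changed_stages` helpers are hypothetical, SHA-256 is an assumed choice of digest, and the `hash_raw` / `hash_clean` keys follow the field names listed above.

```python
import hashlib


def sha256_text(text):
    """Hex digest of a text payload, used as a stage fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_stages(previous, current):
    """Return the stage keys whose hash differs from the previous run."""
    return sorted(
        stage for stage, digest in current.items()
        if previous.get(stage) != digest
    )


# Hypothetical hashes from two runs over the same raw document.
previous = {
    "hash_raw": sha256_text("Page 1 of 2 Acme manual"),
    "hash_clean": sha256_text("Acme manual"),
}
current = {
    "hash_raw": sha256_text("Page 1 of 2 Acme manual"),
    "hash_clean": sha256_text("Acme manual "),  # cleaner output drifted
}

changed = changed_stages(previous, current)
```

An unchanged `hash_raw` with a changed `hash_clean` points at the cleaning step rather than the source document.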

---

**Summary**

**Metadata** gives context to data at every step: source, document type, chunk info, model versions, timestamps.

**Lineage** tracks how data moves and transforms through the ingestion pipeline: raw → parsed → cleaned → chunked → embedded → indexed.

Together, they ensure:

* reproducibility
* debuggability
* compliance
* rollbacks
* quality control
* scalable ingestion


Software-defined assets are the **recommended Dagster pattern** for RAG / LLM data pipelines because assets give you:

* automatic lineage
* versioning
* metadata tracking
* materialization history
* orchestration
* retries
* partitions
* observability

---

### 1. Folder Structure (recommended)

```
genai_ingest/
    ├── assets/
    │     ├── raw_docs.py
    │     ├── clean_docs.py
    │     ├── chunks.py
    │     ├── embeddings.py
    │     └── vector_store.py
    ├── resources/
    │     ├── s3.py
    │     ├── embedder.py
    │     └── vector_db.py
    ├── __init__.py
    └── workspace.yaml
```

---

### 2. Dagster Asset-Based Pipeline (Full Example)

#### A. **raw_docs asset**

Extracts documents from S3 / DB / API.

```python
# raw_docs.py
from dagster import asset, Output, MetadataValue

@asset
def raw_documents():
    docs = [
        {"doc_id": "doc_001", "text": "Acme manual text...", "metadata": {"source":"s3://bucket/acme.pdf"}},
        {"doc_id": "doc_002", "text": "Install guide steps...", "metadata": {"source":"s3://bucket/install.pdf"}}
    ]

    return Output(
        docs,
        metadata={"count": len(docs), "source": MetadataValue.text("S3 bucket")}
    )
```

Dagster automatically tracks lineage and metadata here.

---

#### B. **clean_docs asset**

Performs cleaning, normalization, boilerplate removal.

```python
# clean_docs.py
from dagster import asset, Output, MetadataValue
import re

def clean_text(t: str) -> str:
    t = re.sub(r"Page \d+ of \d+", "", t)
    t = re.sub(r"\s+", " ", t).strip()
    return t

@asset
def clean_documents(raw_documents):
    cleaned = []
    for d in raw_documents:
        if not d["text"].strip():
            continue  # skip empty documents
        # Copy so the upstream asset's records are not mutated in place
        cleaned.append({**d, "clean_text": clean_text(d["text"])})

    return Output(
        cleaned,
        metadata={"cleaned_count": len(cleaned), "cleaning_version": "v3"}
    )
```

Dagster UI now shows:
`raw_documents → clean_documents` lineage.

---

#### C. **chunks asset**

Chunk using semantic or size-based splitting.

```python
# chunks.py
from dagster import asset, Output, MetadataValue

def chunk_text(text, max_char=300):
    return [text[i:i+max_char] for i in range(0, len(text), max_char)]

@asset
def document_chunks(clean_documents):
    all_chunks = []

    for d in clean_documents:
        chunks = chunk_text(d["clean_text"])
        for i, ch in enumerate(chunks):
            all_chunks.append({
                "doc_id": d["doc_id"],
                "chunk_id": f"{d['doc_id']}_chunk_{i}",
                "chunk_text": ch,
                "chunk_index": i,
                "metadata": d["metadata"]
            })

    return Output(
        all_chunks,
        metadata={
            "total_chunks": len(all_chunks),
            "chunking_version": "v4"
        }
    )
```

Dagster lineage:
`raw_documents → clean_documents → document_chunks`

---

#### D. **embeddings asset** (GPU or CPU)

Connects to an embedding model (OpenAI or local).

```python
# embeddings.py
from dagster import asset, Output, MetadataValue
import numpy as np

def fake_embed(text):
    # Placeholder for real model call
    return np.random.rand(1536).tolist()

@asset
def chunk_embeddings(document_chunks):
    embedded = []

    for ch in document_chunks:
        # Copy so the upstream chunk records are not mutated in place
        embedded.append({**ch, "embedding": fake_embed(ch["chunk_text"])})

    return Output(
        embedded,
        metadata={
            "embedding_model": "embed-v3",
            "count": len(embedded),
            "dim": 1536
        }
    )
```

Dagster lineage:
`raw_documents → clean_documents → document_chunks → chunk_embeddings`

---

#### E. **vector_store asset**

Writes to FAISS / Pinecone / Chroma / Weaviate.

```python
# vector_store.py
from dagster import asset, Output, MetadataValue

@asset
def upsert_to_vector_store(chunk_embeddings):
    for rec in chunk_embeddings:
        # In production: pinecone.upsert(...)
        pass

    return Output(
        "completed",
        metadata={"vector_count": len(chunk_embeddings), "index_version": "v7"}
    )
```

Final lineage:

```
raw_documents
    → clean_documents
        → document_chunks
            → chunk_embeddings
                → upsert_to_vector_store
```

Dagster visualizes this DAG automatically.

---

### 3. Unified Job (asset group execution)

In `__init__.py` or `repository.py`:

```python
from dagster import Definitions, load_assets_from_package_module
from . import assets  # the assets/ subpackage of genai_ingest

all_assets = load_assets_from_package_module(assets)

defs = Definitions(assets=all_assets)
```

Dagster's UI automatically:

* maps assets
* draws lineage graph
* shows metadata
* shows materialization history

---

### 4. Scheduling the Asset Pipeline

You can schedule ingestion:

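```python
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# Job that materializes every asset. Register it and the schedule with
# Definitions(assets=..., jobs=[all_assets_job], schedules=[ingest_schedule]).
all_assets_job = define_asset_job("all_assets_job", selection=AssetSelection.all())

ingest_schedule = ScheduleDefinition(
    job=all_assets_job,
    cron_schedule="0 * * * *",  # hourly
)
```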

---

### 5. Advantages of Asset-Based Dagster for GenAI

#### 1. **Automatic lineage tracking**

Dagster visually links:

* raw → clean → chunks → embeddings → vector store

#### 2. **Version tracking**

You can version:

* cleaning versions
* chunking versions
* embedding models
* index versions

#### 3. **Materialization history**

Every run stores metadata such as:

* timestamp
* data size
* model version
* quality scores

#### 4. **Backfills**

You can easily re-chunk or re-embed historical data after upgrading your embedding model.

#### 5. **Observability**

Dagster UI shows:

* asset dependencies
* failed ops
* retries
* per-asset metadata

#### 6. **Parallel execution**

The usual hotspots can be scaled out independently:

* chunking
* embeddings

---

### 6. Add Optional Metadata for Governance

Dagster supports rich asset metadata:

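```python
import hashlib
from dagster import MetadataValue, Output

# Inside an asset function; `data` and `data_bytes` come from that asset.
return Output(
    data,
    metadata={
        "record_count": len(data),
        # Checksum computed with hashlib and stored as text metadata
        "hash": MetadataValue.text(hashlib.md5(data_bytes).hexdigest()),
        "quality_score": 0.94,
        "source": MetadataValue.url("https://docs.example.com")
    }
)
```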

---

**Summary**

Dagster asset-based pipelines are ideal for GenAI ingestion because they automatically give:

* **lineage**
* **metadata tracking**
* **version control**
* **observability**
* **modular ingestion stages**
* **easy incremental updates**
* **support for backfills and model upgrades**