```{contents}
```

## Error Handling and Retry Policies

Data ingestion for Generative AI involves **parsing, cleaning, chunking, dedup, embeddings, storage, and orchestration**.
Each step can fail due to schema issues, bad documents, API errors, GPU failures, or rate limits.
Robust **error handling + retry strategy** ensures that ingestion is *reliable, fault-tolerant, and recoverable* at scale.

---

### 1. Why Error Handling Matters in GenAI Ingestion

Generative-AI pipelines process:

* PDFs, HTML, DOCX, OCR
* Logs, tickets, emails
* Large batches of heterogeneous documents
* External APIs (OpenAI, inference servers)
* Vector stores (Pinecone, FAISS, Qdrant)

Errors are inevitable:

* Parsing failures (bad PDFs, OCR corruptions)
* Chunking/text errors (unicode issues)
* Validation failures (missing fields, unsafe content)
* Embedding generation outages
* API rate limits
* Vector DB write failures
* Network timeouts

Without structured handling:

* Broken pipeline
* Partial ingestion
* Orphaned embeddings
* Inconsistent vector index
* Repeated work / wasted cost

---

### 2. Types of Errors in Generative-AI Ingestion

#### **A. Document-level errors**

Affect only one document.

* corrupted PDF
* empty text
* missing metadata
* invalid format

**Action**: send to Dead-Letter Queue (DLQ), mark failed, continue pipeline.
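
A minimal sketch of this isolation pattern, assuming documents are plain dicts with a `doc_id` key. The names `process_batch` and `require_text` are hypothetical, and the in-memory `dead_letters` list stands in for a real DLQ.

```python
from typing import Callable


def process_batch(docs: list[dict], handler: Callable[[dict], dict]) -> tuple[list[dict], list[dict]]:
    """Run handler on each document; isolate failures instead of aborting the batch."""
    succeeded, dead_letters = [], []
    for doc in docs:
        try:
            succeeded.append(handler(doc))
        except Exception as exc:  # document-level failure: record it and move on
            dead_letters.append({"doc_id": doc.get("doc_id"), "reason": repr(exc)})
    return succeeded, dead_letters


def require_text(doc: dict) -> dict:
    # Hypothetical validation rule: reject documents with no text.
    if not doc.get("text", "").strip():
        raise ValueError("empty text")
    return doc


ok, dlq = process_batch(
    [{"doc_id": "a", "text": "hello"}, {"doc_id": "b", "text": ""}],
    require_text,
)
print(len(ok), len(dlq))  # 1 1
```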

---

#### **B. Pipeline-stage errors**

Specific step fails.

Examples:

* cleaning function crashes
* chunking produces empty results
* dedup hash failure

**Action**: retry the step, or fall back to an alternative method (e.g., OCR fallback).

---

#### **C. External service errors**

* Embedding API rate limit
* OpenAI 429 / 5xx
* Vector database unresponsive

**Action**: exponential retries, circuit breaker, queue for delayed retry.
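
One way to sketch the circuit-breaker part of this action is shown below. The class name, the threshold, and the cooldown are illustrative assumptions, not values taken from any particular library.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial call after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open: skipping call")
        # Either closed, or the cooldown has elapsed and one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers can park documents in a delayed-retry queue instead of sending more requests to a service that is already struggling.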

---

#### **D. System-level errors**

* Out of memory
* GPU crash
* Disk full
* Worker node failure

**Action**: automated retry on new worker, checkpoint restart.

---

### 3. Retry Policies (Core Patterns)

#### **1. Fixed Delay Retry**

Retry after fixed interval.

```
Retry every 30 seconds, up to 3 attempts.
```

Used for predictable transient issues.

---

#### **2. Exponential Backoff**

Wait time increases exponentially.

```
Wait: 10s → 20s → 40s → 80s
```

Used for:

* rate limits
* external API overload
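
As a rough illustration, the wait before each retry can be computed up front. The `cap` parameter is an added assumption that keeps late retries from waiting indefinitely.

```python
def backoff_schedule(base: float, factor: float, attempts: int, cap: float = 300.0) -> list[float]:
    """Return the wait before each retry: base, base*factor, base*factor**2, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


print(backoff_schedule(base=10, factor=2, attempts=4))  # [10, 20, 40, 80]
```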

---

#### **3. Jitter (Randomized Backoff)**

Adds randomness to avoid thundering herd.

```
Wait = base * 2^retry + random(0,1)
```
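
The formula above can be sketched in Python as follows. The second function is a different variant, often called full jitter, included only for comparison; the `cap` value is an assumption.

```python
import random


def additive_jitter(base: float, retry: int, rng: random.Random) -> float:
    # The formula above: exponential delay plus up to one second of noise.
    return base * 2 ** retry + rng.uniform(0, 1)


def full_jitter(base: float, retry: int, cap: float, rng: random.Random) -> float:
    # Alternative: pick uniformly between 0 and the capped exponential delay.
    return rng.uniform(0, min(cap, base * 2 ** retry))


rng = random.Random(42)  # seeded only to make the sketch reproducible
for retry in range(4):
    print(retry, round(additive_jitter(2, retry, rng), 2), round(full_jitter(2, retry, 60, rng), 2))
```

Additive jitter keeps a guaranteed minimum wait, while full jitter spreads clients across the whole interval, which tends to flatten retry spikes more aggressively.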

---

#### **4. Max Retry With Fallback**

After N failures, try fallback logic.

Example:

* If PDF parser fails 3 times → use OCR parser
* If embedding API fails → use backup embedding model
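
A small sketch of this pattern, using hypothetical stand-ins (`parse_pdf`, `parse_with_ocr`) for a real PDF text extractor and OCR engine:

```python
def with_fallback(primary, fallback, max_attempts: int = 3):
    """Return a callable that tries `primary` up to max_attempts times, then `fallback` once."""
    def run(*args, **kwargs):
        for _ in range(max_attempts):
            try:
                return primary(*args, **kwargs)
            except Exception:
                continue  # a real pipeline would log and back off here
        return fallback(*args, **kwargs)
    return run


calls = {"pdf": 0}


def parse_pdf(path: str) -> str:
    calls["pdf"] += 1
    raise ValueError("no text layer")  # simulate a parser that always fails


def parse_with_ocr(path: str) -> str:
    return f"ocr text from {path}"


parse = with_fallback(parse_pdf, parse_with_ocr)
print(parse("scan.pdf"), calls["pdf"])  # ocr text from scan.pdf 3
```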

---

#### **5. Dead-Letter Queue (DLQ)**

Documents that repeatedly fail are routed to DLQ for manual inspection.

---

#### **6. Idempotency**

Ensures the same document can be retried **without corrupting the vector store**.

Example idempotent rule:

* Upsert by chunk_id
* Delete previous version before re-embedding
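
Both rules can be sketched with a plain dict standing in for the vector store. The ID scheme (document ID, chunk position, content hash) is an assumption chosen for illustration.

```python
import hashlib


def chunk_id(doc_id: str, index: int, text: str) -> str:
    """Deterministic ID: the same document, position, and text always map to the same key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{doc_id}:{index}:{digest}"


def upsert_document(store: dict, doc_id: str, chunks: list[str]) -> None:
    # Delete every chunk from the previous version, then write the new ones.
    for key in [k for k in store if k.startswith(f"{doc_id}:")]:
        del store[key]
    for i, text in enumerate(chunks):
        store[chunk_id(doc_id, i, text)] = text


store: dict[str, str] = {}
upsert_document(store, "doc_001", ["intro", "body"])
upsert_document(store, "doc_001", ["intro", "body"])  # replay: no duplicates
print(len(store))  # 2
upsert_document(store, "doc_001", ["intro only"])     # new version replaces the old one
print(len(store))  # 1
```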

---

### 4. Error Handling in Each Step (Detailed)

#### **A. Extraction**

Errors:

* network timeouts
* corrupted downloads

Handling:

* retries (3–5)
* fallback mirror locations
* hash integrity check
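
These three measures could be combined roughly as follows. The fetchers are hypothetical zero-argument callables (one per mirror), and the expected SHA-256 digest is assumed to be known in advance, for example from a manifest.

```python
import hashlib


def verify_download(payload: bytes, expected_sha256: str) -> bytes:
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256[:8]}, got {actual[:8]}")
    return payload


def fetch_with_mirrors(fetchers, expected_sha256: str, attempts_per_source: int = 3) -> bytes:
    """Try each source in order; return the first payload that passes the checksum."""
    last_error = None
    for fetch in fetchers:
        for _ in range(attempts_per_source):
            try:
                return verify_download(fetch(), expected_sha256)
            except (OSError, ValueError) as exc:  # network errors or a corrupted download
                last_error = exc
    raise RuntimeError("all sources failed") from last_error
```

A production version would also wait between attempts; the delay is left out here to keep the control flow visible.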

---

#### **B. Parsing**

Errors:

* PDF unparseable
* HTML malformed

Handling:

* try alternate parser
* fall back to OCR
* move to DLQ if both fail

---

#### **C. Cleaning / Noise Removal**

Errors:

* regex crashes
* Unicode decode issues

Handling:

* catch exceptions
* run normalization (replace invalid chars)
* skip faulty sections
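
A sketch of these three handling steps, assuming the raw input arrives as UTF-8 bytes split into sections. `clean` is whatever cleaning function the pipeline uses, passed in as a parameter.

```python
import unicodedata


def safe_decode(raw: bytes) -> str:
    # Invalid byte sequences become U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    # NFKC folds compatibility forms (such as the "fi" ligature) into plain characters.
    return unicodedata.normalize("NFKC", text)


def clean_sections(sections: list[bytes], clean) -> list[str]:
    cleaned = []
    for raw in sections:
        try:
            cleaned.append(clean(safe_decode(raw)))
        except Exception:
            continue  # skip the faulty section, keep the rest of the document
    return cleaned


print(safe_decode(b"caf\xc3\xa9 \xff ok"))
```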

---

#### **D. Validation**

Errors:

* missing fields
* unsupported language
* unsafe content

Handling:

* mark invalid → DLQ
* don’t retry (not transient)
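
The retry/no-retry decision is easier to enforce when permanent and transient failures are separate exception types. The class names and the `REQUIRED_FIELDS` schema below are assumptions for illustration.

```python
class PermanentError(Exception):
    """Failure that will not succeed on retry (bad data)."""


class TransientError(Exception):
    """Failure that may succeed on retry (timeouts, rate limits)."""


REQUIRED_FIELDS = ("doc_id", "text", "source")  # hypothetical schema


def validate(doc: dict) -> dict:
    missing = [f for f in REQUIRED_FIELDS if not doc.get(f)]
    if missing:
        raise PermanentError(f"missing fields: {', '.join(missing)}")
    return doc


def route(doc: dict, step) -> str:
    """Return 'ok', 'retry', or 'dlq' for one document."""
    try:
        step(doc)
        return "ok"
    except PermanentError:
        return "dlq"    # retrying cannot fix bad data
    except TransientError:
        return "retry"


print(route({"doc_id": "d1", "text": "hi"}, validate))  # dlq
```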

---

#### **E. Chunking**

Errors:

* empty chunks
* chunk too large

Handling:

* retry with default chunk size
* fall back to a sentence-based chunker
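
A rough sketch of a chunker wrapper that detects empty or oversized output and switches to sentence-based chunking. The regex sentence splitter is deliberately naive, and `max_chars` is an assumed limit.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars; hard-split long sentences."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(fixed_size_chunks(sentence, max_chars))
        elif current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def chunk_with_fallback(text: str, chunker, max_chars: int = 200) -> list[str]:
    try:
        chunks = [c for c in chunker(text) if c.strip()]
        if chunks and all(len(c) <= max_chars for c in chunks):
            return chunks
    except Exception:
        pass  # treat a crash like any other bad result
    return sentence_chunks(text, max_chars)
```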

---

#### **F. Dedup**

Errors:

* hashing failure
* similarity computation errors

Handling:

* skip chunk
* record logs

---

#### **G. Embedding Generation**

Errors:

* API rate limit
* internal model server error
* GPU OOM

Handling:

* exponential backoff
* retry on different worker
* batch smaller inputs
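
The "batch smaller inputs" step might look like the sketch below. It assumes the embedding call signals memory exhaustion with `MemoryError`; a real GPU framework typically raises its own out-of-memory exception, which you would catch instead.

```python
def embed_with_shrinking_batches(texts: list[str], embed_batch, batch_size: int = 32) -> list:
    """Embed all texts; when a batch runs out of memory, halve the batch size and retry it."""
    vectors = []
    i = 0
    while i < len(texts):
        batch = texts[i:i + batch_size]
        try:
            vectors.extend(embed_batch(batch))
            i += len(batch)
        except MemoryError:
            if batch_size == 1:
                raise  # a single input does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
    return vectors


# Hypothetical embedder that cannot handle more than 4 inputs at once.
def fake_embed(batch: list[str]) -> list[list[float]]:
    if len(batch) > 4:
        raise MemoryError("simulated GPU OOM")
    return [[float(len(t))] for t in batch]


vecs = embed_with_shrinking_batches([f"text {n}" for n in range(10)], fake_embed, batch_size=16)
print(len(vecs))  # 10
```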

---

#### **H. Vector DB Write**

Errors:

* network timeout
* index not ready
* dimension mismatch

Handling:

* retry transient failures (timeouts, index not ready)
* validate embedding before write
* auto-delete bad entries
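
Validation before the write can be as simple as the following sketch. The expected dimension is an assumption that must match how the target index was created.

```python
import math


def validate_embedding(vector, expected_dim: int) -> list[float]:
    """Reject vectors that the index would refuse or that would corrupt similarity search."""
    if len(vector) != expected_dim:
        raise ValueError(f"dimension mismatch: got {len(vector)}, expected {expected_dim}")
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        raise ValueError("embedding contains NaN, infinity, or non-numeric values")
    return [float(x) for x in vector]


print(validate_embedding([0.1, 0.2, 0.3], expected_dim=3))  # [0.1, 0.2, 0.3]
```

A dimension mismatch is a permanent error, so it belongs in the DLQ rather than in the retry loop.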

---

### 5. Error Handling in Orchestration (Airflow / Dagster)

#### **Airflow**

* `retries=5`
* `retry_delay=timedelta(minutes=2)`
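* A task that exhausts its retries is marked failed and triggers alerting
* XCom passes small intermediate results between tasks; keep large payloads (documents, embeddings) in external storage
* Use `on_failure_callback` to send Slack/email alerts
* Implement the Dead-Letter Queue as an S3 prefix or a DB table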
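
These settings are usually collected in a `default_args` dict that is passed to the DAG. The sketch below builds that dict without importing Airflow. `notify_on_failure` is a hypothetical callback that only prints; `retry_exponential_backoff` and `max_retry_delay` are standard operator arguments added here as an assumption about what the pipeline wants.

```python
from datetime import timedelta


def notify_on_failure(context: dict) -> None:
    # Airflow calls this with the task context once the task has finally failed.
    ti = context.get("task_instance")
    print(f"ALERT: task {getattr(ti, 'task_id', '?')} failed: {context.get('exception')}")


default_args = {
    "retries": 5,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,         # grow the delay between attempts
    "max_retry_delay": timedelta(minutes=30),  # upper bound on that growth
    "on_failure_callback": notify_on_failure,
}
```

In a real deployment the callback would post to Slack or send an email instead of printing.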

---

#### **Dagster**

* Retries with `RetryPolicy(max_retries=3)`
* Automatic logging / materialization of failing ops
* Sensors for failure detection
* Assets keep lineage so partial ingestion is safe

---

### 6. Common Retry Policy for Embedding Services

```
max_retries = 5
backoff = exponential (5s, 10s, 20s, 40s, 80s)
jitter = random(0–2s)
retry_on = [429, 500, 502, 503, connection errors]
fallback = switch to alternate embedding endpoint
```
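
One possible Python rendering of this policy, assuming five retries. `ServiceError` is a hypothetical exception carrying an HTTP status code; real client libraries expose status codes in their own exception types.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503}


class ServiceError(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_policy(primary, fallback=None, max_retries=5, base=5.0, jitter=2.0, sleep=time.sleep):
    """Call primary; retry retryable failures with backoff + jitter, then use the fallback."""
    for attempt in range(max_retries + 1):
        try:
            return primary()
        except ServiceError as exc:
            if exc.status not in RETRYABLE_STATUS:
                raise  # e.g. 400 or 401: retrying cannot help
        except ConnectionError:
            pass  # connection errors are always treated as retryable
        if attempt < max_retries:
            sleep(base * 2 ** attempt + random.uniform(0, jitter))  # 5s, 10s, 20s, ... plus jitter
    if fallback is not None:
        return fallback()  # switch to the alternate embedding endpoint
    raise RuntimeError("retries exhausted and no fallback configured")
```

The `sleep` parameter is injectable so the policy can be exercised in tests without actually waiting.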

---

### 7. Dead-Letter Queue (DLQ)

DLQ stores permanent failures in:

* S3 bucket: `dlq/ingestion/<timestamp>/<doc_id>.json`
* DynamoDB/Firestore
* Kafka DLQ topic

Contains:

* document ID
* raw content
* failure reason
* stack trace
* timestamps

This avoids stopping the pipeline for bad data.
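
A DLQ record with these fields could be assembled as sketched below. The `stage` field and the function names are additions for illustration; the key layout follows the S3 pattern listed above.

```python
import json
import traceback
from datetime import datetime, timezone


def build_dlq_record(doc: dict, stage: str, exc: BaseException) -> dict:
    return {
        "doc_id": doc.get("doc_id"),
        "raw_content": doc.get("text"),
        "stage": stage,  # assumed extra field: which pipeline step failed
        "failure_reason": f"{type(exc).__name__}: {exc}",
        "stack_trace": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }


def dlq_key(record: dict) -> str:
    # Mirrors the layout above: dlq/ingestion/<timestamp>/<doc_id>.json
    return f"dlq/ingestion/{record['failed_at']}/{record['doc_id']}.json"


try:
    raise ValueError("empty text")
except ValueError as err:
    record = build_dlq_record({"doc_id": "doc_bad", "text": ""}, "clean", err)

print(dlq_key(record))
print(json.dumps(record)[:80])
```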

---

### 8. Example: Simple Retry Logic (Python)

```python
import time
import random

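def retry_with_exponential_backoff(func, retries=5):
    delay = 2  # base delay in seconds
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # Double the wait on each attempt (2s, 4s, 8s, ...) and add jitter
            wait = delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            time.sleep(wait)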
```

Used for:

* embedding calls
* vector DB writes
* network fetch

---

### 9. Summary

#### Error Handling

* Detect and classify failures
* Recover when possible
* Route irrecoverable data to DLQ
* Preserve pipeline consistency
* Log everything for debugging

#### Retry Policies

* Fixed retry for predictable issues
* Exponential backoff for API/rate limits
* Fallback logic for alternate parsers or embedding models
* Idempotent writes to ensure safe replays

Together, they make the GenAI ingestion pipeline **resilient, scalable, and production-grade**.

---

### 1. Airflow Example (with retries, backoff, DLQ, fallback)

#### File: `genai_ingest_dag.py`

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
import random
import time
import json
import boto3

DLQ_BUCKET = "genai-dlq-bucket"
s3 = boto3.client("s3")

# -------------------------
# Helper: Write to DLQ
# -------------------------
def send_to_dlq(doc, reason):
    key = f"dlq/{doc['doc_id']}_{int(time.time())}.json"
    payload = {"doc": doc, "reason": reason}
    s3.put_object(Bucket=DLQ_BUCKET, Key=key, Body=json.dumps(payload))

# -------------------------
# Extract Task
# -------------------------
def extract(**context):
    docs = [
        {"doc_id": "doc_001", "text": "Acme manual...", "metadata": {}},
        {"doc_id": "doc_bad", "text": "", "metadata": {}},  # bad doc
    ]
    context["ti"].xcom_push(key="docs", value=docs)

# -------------------------
# Clean Task With Fallback + DLQ
# -------------------------
def clean(**context):
    docs = context["ti"].xcom_pull(key="docs")
    cleaned = []
    for d in docs:
        if not d["text"].strip():     # invalid content
            send_to_dlq(d, "EMPTY_TEXT")
            continue

        # cleaning step: a failure sends the document to the DLQ instead of failing the task
        try:
            cleaned_text = d["text"].replace("Page 1 of 10", "").strip()
        except Exception as e:
            send_to_dlq(d, f"CLEANING_FAILED: {str(e)}")
            continue

        d["clean_text"] = cleaned_text
        cleaned.append(d)

    context["ti"].xcom_push(key="clean_docs", value=cleaned)

# -------------------------
# Embedding Task With Exponential Backoff
# -------------------------
def embed(**context):
    docs = context["ti"].xcom_pull(key="clean_docs")
    out = []

    for d in docs:
        retries = 5
        delay = 2

        for attempt in range(1, retries+1):
            try:
                # simulate API failure
                if random.random() < 0.3:
                    raise ConnectionError("API timeout")

                emb = [0.1] * 1536   # Fake embedding
                d["embedding"] = emb
                out.append(d)
                break

            except Exception as e:
                if attempt == retries:
                    send_to_dlq(d, f"EMBED_FAIL: {str(e)}")
                else:
                    # exponential backoff: 2s, 4s, 8s, 16s
                    wait = delay * 2 ** (attempt - 1)
                    time.sleep(wait)

    context["ti"].xcom_push(key="embedded_docs", value=out)

# -------------------------
# Load Task
# -------------------------
def load(**context):
    docs = context["ti"].xcom_pull(key="embedded_docs")
    for d in docs:
        print(f"Upsert vector for {d['doc_id']}")
    return "done"

# -------------------------
# DAG Definition
# -------------------------
default_args = {
    "owner": "genai-team",
    "retries": 3,
    "retry_delay": timedelta(seconds=10),
}

with DAG(
    "genai_ingestion_with_retry",
    default_args=default_args,
    start_date=datetime(2025, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:

    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="clean", python_callable=clean)
    t3 = PythonOperator(task_id="embed", python_callable=embed)
    t4 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3 >> t4
```

#### What this Airflow DAG demonstrates

* Document-level validation
* DLQ routing
* Exponential backoff (manual implementation)
* Automatic Airflow retries per-task
* Partial success ingestion
* No pipeline crash due to a bad document

---

### 2. Dagster Example (RetryPolicy + failure sensors + DLQ)

#### File: `genai_ingest_job.py`

```python
from dagster import op, job, RetryPolicy, Failure, fs_io_manager
import time, random, json

DLQ_FILE = "dlq.json"

# -------------------------
# DLQ Writer
# -------------------------
def write_dlq(doc, reason):
    with open(DLQ_FILE, "a") as f:
        f.write(json.dumps({"doc": doc, "reason": reason}) + "\n")

# -------------------------
# Ops
# -------------------------

@op
def extract_docs():
    return [
        {"doc_id": "doc_001", "text": "Acme manual...", "metadata": {}},
        {"doc_id": "doc_bad", "text": "", "metadata": {}},
    ]

@op
def clean_docs(docs):
    out = []
    for d in docs:
        if not d["text"].strip():
            write_dlq(d, "EMPTY_TEXT")
            continue
        d["clean_text"] = d["text"].replace("Page 1 of 10", "").strip()
        out.append(d)
    return out

@op(
    retry_policy=RetryPolicy(max_retries=5, delay=2)
)
def embed_chunks(docs):
    out = []
    for d in docs:
        if random.random() < 0.3:
            raise Failure(description=f"Embedding API failed for {d['doc_id']}")
        d["embedding"] = [0.1] * 1536
        out.append(d)
    return out

@op
def load_vectors(docs):
    for d in docs:
        print(f"Upsert vector for {d['doc_id']}")

# -------------------------
# Job
# -------------------------
@job(resource_defs={"io_manager": fs_io_manager})
def genai_ingestion_job():
    docs = extract_docs()
    cleaned = clean_docs(docs)
    embedded = embed_chunks(cleaned)
    load_vectors(embedded)
```

---

### What the Dagster job demonstrates

#### A. RetryPolicy with built-in retry handling

```
RetryPolicy(max_retries=5, delay=2)
```

Dagster automatically:

* retries failed ops
* logs failure metadata
* records each retry attempt in the run's event log, visible in the UI

#### B. Failure handling via `Failure()`

You can classify failures and provide messages.

#### C. DLQ support

Manual DLQ writing for failed ops:

```
write_dlq(doc, "EMPTY_TEXT")
```

#### D. Strong lineage tracking

Dagster automatically tracks:

* asset materialization
* partial runs
* upstream/downstream dependencies

---

**Summary Comparison**

| Concern            | Airflow                                              | Dagster                                                 |
| ------------------ | ---------------------------------------------------- | ------------------------------------------------------- |
| Retry policies     | Built-in + custom code                               | Built-in `RetryPolicy`                                  |
| Backoff            | Built-in `retry_exponential_backoff=True`, or manual | Built-in via `RetryPolicy(backoff=Backoff.EXPONENTIAL)` |
| DLQ handling       | S3/DB/Kafka in tasks                                 | File/DB/Kafka via ops                                   |
| Failure visibility | Task logs                                            | Rich UI + event logs                                    |
| Best for           | ETL-heavy GenAI ingestion                            | ML/GenAI pipelines with assets                          |

---

### 1. KubernetesPodOperator Example for GenAI Embedding

#### File: `genai_k8s_ingest_dag.py`

```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

# --------------------------------------------------
# Airflow DAG: CPU tasks → GPU embedding pod → load
# --------------------------------------------------

default_args = {
    "owner": "genai-team",
    "retries": 2,
    "retry_delay": timedelta(seconds=20)
}

with DAG(
    dag_id="genai_ingestion_k8s_gpu",
    start_date=datetime(2025, 1, 1),
    default_args=default_args,
    schedule_interval="@hourly",
    catchup=False,
) as dag:

    # ---------------------------------------------
    # CPU task: extract documents from S3/DB/API
    # ---------------------------------------------
    def extract(**context):
        docs = [
            {"doc_id": "doc_001", "text": "Acme manual text...", "metadata": {}}
        ]
        context["ti"].xcom_push(key="docs", value=docs)

    t1_extract = PythonOperator(
        task_id="extract_docs",
        python_callable=extract
    )

    # ---------------------------------------------
    # GPU task: run embedding inside KubernetesPod
    # ---------------------------------------------
    t2_embed_gpu = KubernetesPodOperator(
        task_id="embed_gpu",
        name="embed-gpu-job",
        namespace="genai",
        image="myrepo/genai-embedder:latest",         # custom embedding image
        cmds=["python3", "embed.py"],                # entrypoint in the container
        arguments=[
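            # tojson renders valid JSON; the default rendering is a Python repr that json.loads rejects
            "--input-xcom", "{{ ti.xcom_pull(task_ids='extract_docs', key='docs') | tojson }}",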
            "--output-path", "/output/embeddings.json"
        ],
        env_vars={
            "EMBED_MODEL": "my-embedding-v3",
            "BATCH_SIZE": "32"
        },
        volumes=[
            # example: mount persistent volume for output
        ],
        volume_mounts=[
        ],
        container_resources={
            "limits": {"nvidia.com/gpu": "1"},        # request 1 GPU
            "requests": {"memory": "4Gi", "cpu": "1"}
        },
        get_logs=True,
        is_delete_operator_pod=True,                  # auto-clean pod
        in_cluster=True,                              # running inside cluster
    )

    # ---------------------------------------------
    # CPU task: load embeddings into vector DB
    # ---------------------------------------------
    def load_embeddings():
        print("loading embeddings into vector DB (FAISS / Pinecone / Chroma)")

    t3_load = PythonOperator(
        task_id="load_embeddings",
        python_callable=load_embeddings
    )

    t1_extract >> t2_embed_gpu >> t3_load
```

---

### 2. What This KubernetesPodOperator Setup Demonstrates

#### A. GPU Execution

```
container_resources = {
    "limits": {"nvidia.com/gpu": "1"}
}
```

This ensures the embedding workload runs on a GPU-enabled node.

---

#### B. Pod runs *only* the heavy step

Airflow handles lightweight orchestrating tasks; heavy work (embeddings, OCR, vectorization) is passed to a dedicated K8s pod.

---

#### C. Fully isolated environment

Embedding container contains:

* embedding model
* tokenizer
* dependencies
* CUDA runtime libraries (if needed; the GPU driver itself is installed on the node)

You avoid overloading Airflow workers.

---

#### D. Auto-cleanup

```
is_delete_operator_pod=True
```

Pods are deleted after completion.

---

#### E. XCom passing → Pod arguments

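You can pass small document payloads from Airflow to the pod as templated arguments. XCom values and command-line arguments are both size-limited, so for larger batches pass a storage path (S3/GCS) instead.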

---

#### F. Retries (Airflow-level)

`retries=2` makes Airflow re-launch the pod up to two more times if it fails.

---

### 3. Example: embed.py

This script runs *inside* the Kubernetes pod.

```python
import json
import argparse
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument("--input-xcom")
parser.add_argument("--output-path")
args = parser.parse_args()

docs = json.loads(args.input_xcom)

# Fake GPU embedding (replace with real model)
embeddings = []
for d in docs:
    emb = np.random.rand(1536).tolist()
    embeddings.append({"doc_id": d["doc_id"], "embedding": emb})

with open(args.output_path, "w") as f:
    json.dump(embeddings, f)

print("Embedding completed inside GPU pod")
```

In reality, replace this with:

* OpenAI embedding calls
* Local embedding model (e.g., Jina or SentenceTransformers, optionally loaded through LlamaIndex)
* TensorRT optimized encoder
* Custom PyTorch model

---

### 4. Typical Use Cases in GenAI Ingestion

| Task                      | Run on CPU? | Run in GPU pod?       |
| ------------------------- | ----------- | --------------------- |
| PDF parsing               | Yes         | No                    |
| OCR                       | Yes         | Yes (GPU recommended) |
| Noise removal             | Yes         | No                    |
| Chunking                  | Yes         | No                    |
| Dedup & similarity checks | Yes         | Sometimes             |
| Embedding generation      | No          | **Yes**               |
| Vector DB writes          | Yes         | No                    |
| Reranking                 | No          | Optional              |

Embedding is typically the heavy GPU-bound stage → KubernetesPodOperator is ideal.

---

### 5. Production Considerations

#### A. Node selectors for GPU nodes

```
node_selector={"gpu": "true"}
```

#### B. Tolerations for GPU node pools

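Tolerations let the embedding pod be scheduled onto tainted GPU nodes. They do not restrict placement by themselves; the node selector above is what confines the pod to GPU nodes.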

#### C. Use Kubernetes Secrets

Inject OpenAI keys or auth tokens securely.

#### D. Artifact storage

Use S3/GCS for passing embeddings back to Airflow.

#### E. Logging

`get_logs=True` streams pod logs to Airflow UI.

#### F. Batch splitting

Large ingestion → multiple embedding pods launched in parallel.
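
A minimal sketch of the splitting step, assuming each resulting batch is handed to its own embedding pod. The batch size of 3 is arbitrary.

```python
def split_into_batches(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


doc_ids = [f"doc_{n:03d}" for n in range(7)]
for n, batch in enumerate(split_into_batches(doc_ids, 3)):
    print(n, batch)
```

Because each batch is retried independently, a failed pod only repeats its own slice of the work.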

