```{contents}
```

## Schema Design and Evolution

### Schema Design in Generative AI

Schema design defines **how data is structured, stored, and accessed** for training, evaluation, and inference in generative AI systems. A well-defined schema keeps records consistent across pipelines, makes data-quality checks enforceable, and enables efficient retrieval.

Key principles:

1. **Clear structure for each modality**

   * Text: `{id, source, content, metadata}`
   * Image: `{id, caption, image_path, labels}`
   * Multimodal: `{id, text, image, audio, metadata}`

2. **Consistent metadata fields**
   Examples: `timestamp`, `language`, `tags`, `source_url`, `annotations`.

3. **Embedding schema**

   * `{id, vector, dims, doc_id, chunk_id}`
     Needed for RAG, search, and vector DBs.

4. **Chunking and indexing structure (for RAG)**

   * `{doc_id, chunk_id, chunk_text, position, embedding}`
     Ensures contextual retrieval.

5. **Versioning**

   * Keep `v1`, `v2`, `v3` schemas as the dataset evolves.
   * Avoid mixing old/new formats without clear identifiers.
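
As a rough illustration of the principles above, the text and embedding schemas could be written as Python dataclasses. This is only a sketch: the class names `TextRecord` and `EmbeddingRecord`, the `schema_version` default, and the dimension check are assumptions made for the example, not part of any particular library.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TextRecord:
    """Text schema: {id, source, content, metadata} plus a version tag."""
    id: str
    source: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    schema_version: str = "v1"  # assumed identifier so formats are never mixed silently


@dataclass
class EmbeddingRecord:
    """Embedding schema: {id, vector, dims, doc_id, chunk_id}."""
    id: str
    doc_id: str
    chunk_id: str
    vector: list[float]
    dims: int

    def __post_init__(self) -> None:
        # Reject vectors whose length disagrees with the declared dimension.
        if len(self.vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(self.vector)}")
```

Validating `dims` at construction time is one way to catch records produced by a different embedding model before they reach a vector index.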

---

### Schema Evolution in Generative AI

Schema evolution means **changing the schema over time without breaking existing data or pipelines**. This is crucial when scaling large LLM/RAG systems, where re-ingesting or re-embedding an entire corpus for every change is expensive.

Drivers for schema evolution:

* Adding new fields (e.g., quality score)
* Changing data modalities (text → multimodal)
* Improving chunking strategy
* Adding embeddings or reranker scores
* Changing annotation structure

Approaches:

1. **Backward-Compatible Changes**
   New readers can still read old data after the schema changes.
   Example: Add `quality_score` with a default value, so records written before the change remain valid.

2. **Forward-Compatible Changes**
   Old readers can still handle records written with a newer schema.
   Example: An existing pipeline ignores fields it does not recognize instead of failing.

3. **Schema Versioning**
   Store a field:

   * `"schema_version": "v2"`
     Allows models and pipelines to interpret data correctly.

4. **Migration Strategies**

   * Re-processing old data (slow but clean)
   * On-the-fly migration when data is accessed
   * Dual-write: write both old and new formats temporarily

5. **Evolution in Vector Databases (RAG)**
   Updating schema for embeddings:

   * Add `rerank_score`
   * Add new `model_embedding_v2`
   * Change dimension size (requires re-embedding and backfilling existing records, since vectors of different dimensions cannot be compared)

6. **Evolution during model fine-tuning**

   * New instruction formats
   * New output schema (e.g., JSON-mode LLM outputs)
   * Updated prompt templates
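
The compatibility and versioning approaches above can be sketched as a small "tolerant reader". It fills defaults for missing fields, so older records stay readable, and drops fields it does not know, so records from a newer schema do not break it. The field set, the default values, and the function name `read_record` are hypothetical choices for this example.

```python
from typing import Any

# Fields this (hypothetical) reader understands, and defaults for optional ones.
KNOWN_FIELDS = {"id", "text", "quality_score", "schema_version"}
DEFAULTS = {"quality_score": 1.0, "schema_version": "v1"}


def read_record(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a record with defaults filled in and unknown fields dropped."""
    record = {**DEFAULTS, **raw}  # values present in raw override the defaults
    return {k: v for k, v in record.items() if k in KNOWN_FIELDS}


old = read_record({"id": "1", "text": "hello"})
newer = read_record({"id": "2", "text": "hi", "schema_version": "v3", "rerank_score": 0.7})
```

Here `old` gains `quality_score` and `schema_version` from the defaults, while `newer` keeps its own version tag and loses the unrecognized `rerank_score`. Whether to drop or pass through unknown fields is a design choice; dropping them is simply the safest option for a reader that must not fail.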

---

### Examples

**1. Text dataset evolving to include labels**

* v1: `{id, text}`
* v2: `{id, text, label}`
* v3: `{id, text, label, metadata}`

**2. RAG pipeline schema evolution**

* Add chunk-level `citations`
* Add `approx_position`
* Change chunk size 200 → 512 tokens (requires re-chunking and re-embedding existing documents)

**3. Multimodal model evolution**

* Add `audio_transcript` field
* Add `image_embedding`

---


### Schema Design (Initial Version)

Assume you start with a simple RAG system that stores documents and their embeddings.

#### **Schema v1**

```
Document {
    doc_id: string
    text: string
}

Embedding {
    embed_id: string
    doc_id: string
    vector: float[]
}
```

#### Explanation

* Minimal fields
* Only supports basic text retrieval
* No metadata, no chunking, no ranking
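
To make the v1 schema concrete, here is a minimal Python sketch of the two record types together with a brute-force cosine-similarity lookup. The `search` function and its in-memory list are assumptions for illustration; a real system would typically delegate this to a vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class Embedding:
    embed_id: str
    doc_id: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: list[float], embeddings: list[Embedding], top_k: int = 1) -> list[str]:
    """Return the doc_ids of the top_k embeddings most similar to the query."""
    ranked = sorted(embeddings, key=lambda e: cosine(query_vector, e.vector), reverse=True)
    return [e.doc_id for e in ranked[:top_k]]
```

Because each embedding points at a whole document, retrieval can only return entire documents, which is the limitation that motivates chunking in the next version.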

---

#### Evolving Requirements

After using the system for a while, you realize you need:

1. Chunking (to improve retrieval)
2. Metadata (author, date, tags)
3. Reranker scores
4. Multiple embedding models (v1 vs v2)
5. Support for multimodal data (e.g., images)

This triggers **schema evolution**.

---

### Schema Evolution (Backward Compatible)

#### **Schema v2**

Add chunking + metadata, but keep old fields readable.

```
Document {
    doc_id: string
    title: string
    text: string
    metadata: {
        author: string
        created_at: string
        tags: string[]
    }
    schema_version: "v2"
}

Chunk {
    chunk_id: string
    doc_id: string
    chunk_text: string
    chunk_index: int     // position of the chunk in the document
}

Embedding {
    embed_id: string
    chunk_id: string
    vector: float[]
    model_version: "embed_v1"
}
```

#### Changes

* Introduced metadata
* Introduced chunking
* Added schema_version
* Added model_version
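* Older v1 documents are still valid as long as readers treat `title`, `metadata`, and `schema_version` as optional; v1 embeddings, which reference `doc_id` rather than `chunk_id`, must be re-linked or regenerated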
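
A v1 document could be brought into the v2 shape roughly as follows. This is a sketch under stated assumptions: chunks are split on whitespace-separated words rather than model tokens, chunk IDs are derived as `<doc_id>-<index>`, and missing metadata is filled with `None` or empty values. The function names are hypothetical.

```python
from typing import Any


def chunk_text(text: str, size: int) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def migrate_v1_document(doc: dict[str, Any], chunk_size: int = 200):
    """Return a (v2_document, chunks) pair built from a v1 document."""
    v2_doc = {
        "doc_id": doc["doc_id"],
        "title": doc.get("title", ""),
        "text": doc["text"],
        # v1 has no metadata, so use empty placeholders (assumed convention).
        "metadata": doc.get("metadata", {"author": None, "created_at": None, "tags": []}),
        "schema_version": "v2",
    }
    chunks = [
        {
            "chunk_id": f"{doc['doc_id']}-{i}",  # assumed ID scheme
            "doc_id": doc["doc_id"],
            "chunk_text": piece,
            "chunk_index": i,
        }
        for i, piece in enumerate(chunk_text(doc["text"], chunk_size))
    ]
    return v2_doc, chunks
```

Embeddings are deliberately left out: since v2 embeddings point at chunks, they would be regenerated from `chunk_text` with whatever embedding model the system uses.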

---

### Schema Evolution (Forward + Backward Compatible)

Requirements increase again as your system needs to support:

* Reranking
* Multi-embedding support (OpenAI + local model)
* Quality scores for chunks
* Image + text (multimodal)

#### **Schema v3**

```
Document {
    doc_id: string
    title: string
    text: string
    metadata: {
        author: string
        created_at: string
        tags: string[]
        source_url: string
        language: string
    }
    schema_version: "v3"
}

Chunk {
    chunk_id: string
    doc_id: string
    chunk_text: string
    chunk_index: int
    quality_score: float         // new field
    has_image: bool              // multimodal support
    image_path: string?          // nullable
}

Embedding {
    embed_id: string
    chunk_id: string
    vector_openai: float[]       // new
    vector_local: float[]        // new
    rerank_score: float?         // optional
    model_version: {
        base: "embed_v1"         // model behind vector_openai
        local: "local_v2"        // model behind vector_local
    }
}
```

#### Highlights of Evolution

* New metadata fields
* Optional multimodal fields
* Multiple embedding vectors
* Reranking support
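* Backward-compatible for documents and chunks, since every new field is optional or can take a default; v2 embeddings remain usable only if readers fall back from `vector_openai` / `vector_local` to the old `vector` field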
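
Because v3 replaces the single `vector` field with named vectors, a reader that has to serve both v2 and v3 embedding records needs a fallback. A minimal sketch, assuming records are plain dictionaries and using a hypothetical helper name `get_vector`:

```python
from typing import Any, Optional


def get_vector(embedding: dict[str, Any], prefer: str = "openai") -> Optional[list[float]]:
    """Return the preferred v3 vector, falling back to the pre-v3 `vector` field."""
    named = embedding.get(f"vector_{prefer}")  # v3: vector_openai / vector_local
    if named is not None:
        return named
    return embedding.get("vector")  # v1/v2 records carry a single vector


v2_record = {"embed_id": "e1", "chunk_id": "c1", "vector": [0.1, 0.2]}
v3_record = {"embed_id": "e2", "chunk_id": "c2", "vector_openai": [1.0], "vector_local": [2.0]}
```

One caveat worth keeping in mind: a fallback vector may come from a different model than the preferred one, so callers should check `model_version` before comparing vectors across records.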

---

#### When Schema Migration Happens

In practice, you migrate data in one of three ways:

**Full migration:**
Re-run chunking logic → regenerate embeddings → backfill missing fields.

**Lazy migration:**
Add defaults (`quality_score = 1.0`), fill missing fields on demand.

**Dual writes:**
Write to both v2 and v3 schema during transition.
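
Lazy migration and dual writes can be sketched with plain dictionaries standing in for the stores. Everything here is an assumption for illustration: the store layout, the function names, and the default values other than `quality_score = 1.0`.

```python
from typing import Any

# Defaults for the fields v3 adds to Chunk.
V3_CHUNK_DEFAULTS = {"quality_score": 1.0, "has_image": False, "image_path": None}


def read_chunk(store: dict[str, dict[str, Any]], chunk_id: str) -> dict[str, Any]:
    """Lazy migration: fill v3 defaults on read and persist the upgraded record."""
    chunk = store[chunk_id]
    missing = {k: v for k, v in V3_CHUNK_DEFAULTS.items() if k not in chunk}
    if missing:
        chunk = {**chunk, **missing}
        store[chunk_id] = chunk  # write back so the next read is already v3
    return chunk


def dual_write(v2_store: dict, v3_store: dict, chunk: dict[str, Any]) -> None:
    """Dual write: store the v3 record and a v2 projection of the same record."""
    v3_store[chunk["chunk_id"]] = chunk
    v2_store[chunk["chunk_id"]] = {
        k: v for k, v in chunk.items() if k not in V3_CHUNK_DEFAULTS
    }
```

Lazy migration spreads the cost over normal reads but leaves rarely accessed records in the old shape, while dual writes keep v2 consumers working until they are switched over.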

---

**Final Demonstration Summary**

| Version | Capabilities Added                                                    |
| ------- | --------------------------------------------------------------------- |
| v1      | Simple text + embedding                                               |
| v2      | Chunking, metadata, schema versioning                                 |
| v3      | Multimodal, reranker score, multiple embedding models, extra metadata |

This shows how schema **design** starts small and how **schema evolution** scales the system without breaking anything already running.



### Summary

**Schema design** ensures structured, consistent, high-quality data for generative AI.
**Schema evolution** ensures the system can grow and adapt without breaking existing pipelines.



