```{contents}
```

## Data Validation and Quality Checks

Data ingestion must ensure that **all data entering the system is accurate, complete, consistent, and safe**. Poor-quality data degrades model performance and contributes to hallucinations, bias, and failures in downstream LLM/RAG pipelines.

---

### What is Data Validation in Generative AI?

Data validation ensures that incoming data **conforms to expected formats, schema, and integrity rules** before storage or processing.

It checks:

1. Structure
2. Schema
3. Format
4. Completeness
5. Consistency
6. Safety
7. Relevance

---

### What is Data Quality Checking?

Data quality checks ensure that data is **usable and reliable** for generative AI tasks such as RAG, fine-tuning, embedding creation, and multimodal processing.

It checks:

1. Accuracy
2. Noise level
3. Duplicates
4. Clarity
5. Annotation correctness
6. Toxic or unsafe content
7. Content freshness

---

### Validation & Quality Checks During Ingestion

The checks below cover a typical generative-AI ingestion pipeline end to end.

---

### **A. Structural Validation**

#### 1. Schema Validation

Check whether data matches the required schema.

**Example:**

```
{ doc_id, text, metadata }
```

If any field is missing, reject the document or send it to an error queue.

#### 2. Type Validation

* `text` must be a string
* `vector` must be a list of floats (`float[]`)
* images must be PNG or JPEG
* `metadata` must be a dict

#### 3. Length Validation

* Document length not zero
* Token count within limits (e.g., max 8k tokens)
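
The structural checks above can be sketched as a single function. This is a minimal sketch, assuming documents arrive as plain dictionaries; `validate_structure`, `REQUIRED_FIELDS`, `MAX_TOKENS`, and the whitespace-based token estimate are illustrative assumptions, and a real pipeline would count tokens with the tokenizer of its target model.

```python
from typing import Any

REQUIRED_FIELDS = ("doc_id", "text", "metadata")
MAX_TOKENS = 8000  # assumed limit; set this to the model's actual limit


def validate_structure(doc: dict[str, Any]) -> list[str]:
    """Return a list of structural problems; an empty list means the doc passed."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in doc]
    if errors:
        return errors  # the remaining checks need every field to be present

    if not isinstance(doc["text"], str):
        errors.append("text must be a string")
    elif not doc["text"].strip():
        errors.append("text is empty")
    elif len(doc["text"].split()) > MAX_TOKENS:
        # Whitespace splitting is only a rough stand-in for a real tokenizer.
        errors.append("text exceeds token limit")

    if not isinstance(doc["metadata"], dict):
        errors.append("metadata must be a dict")
    return errors
```

Returning every problem at once, rather than stopping at the first, makes the error queue more useful for debugging upstream sources.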

---

### **B. Content Validation**

#### 1. Language Detection

Reject or route unsupported languages.

#### 2. Encoding Check

Ensure UTF-8; fix corrupted characters.
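
One way to approach the encoding check is to attempt a strict UTF-8 decode first and fall back to a lossy decode that counts the damage. The helper name `decode_utf8` is hypothetical, and the decision about how many replacement characters are tolerable is left to the caller.

```python
def decode_utf8(raw: bytes) -> tuple[str, int]:
    """Decode bytes as UTF-8 and report how many invalid sequences were replaced."""
    try:
        return raw.decode("utf-8"), 0
    except UnicodeDecodeError:
        # Replace undecodable bytes with U+FFFD so the rest of the text survives.
        text = raw.decode("utf-8", errors="replace")
        return text, text.count("\ufffd")
```

A document with a high replacement count was probably written in a different encoding and may be better re-decoded than repaired.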

#### 3. Format Validation

* PDF correctly parsed
* HTML cleaned
* Audio duration valid
* Image resolution valid

---

### **C. Quality Checks**

#### 1. Duplicate Detection

Detect:

* Duplicate documents
* Duplicate chunks
* Duplicate embeddings

Use:

* Hashing
* Similarity search (FAISS, Chroma)
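
The hashing approach can be sketched with the standard library alone. This is an illustrative sketch: `content_hash` and `drop_exact_duplicates` are hypothetical names, and the normalization (lowercasing and collapsing whitespace) is an assumption about what should count as "the same" text. Near-duplicates still require similarity search, which this does not cover.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct normalized text."""
    seen: set[str] = set()
    unique = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```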

#### 2. Noise Reduction

Remove:

* Boilerplate
* Headers/footers
* Repeated sentences
* Random characters
* Very short low-information text
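
Headers and footers tend to repeat on most pages of a document, which suggests a simple frequency heuristic. The sketch below assumes the parser yields one string per page; `strip_repeated_lines` and the `min_fraction` threshold are illustrative choices, not a standard.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that appear on at least `min_fraction` of the pages."""
    if len(pages) < 2:
        return list(pages)  # repetition cannot be judged from a single page

    counts: Counter = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() and ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```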

#### 3. Text Quality

Check:

* Grammar noise
* Missing punctuation
* Nonsensical phrases
* Empty paragraphs

#### 4. Dead Links / Invalid Paths

Validate URLs, file paths, blob storage references.
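
A cheap first pass is to check that a URL is syntactically well formed before spending a network request on it. The sketch below uses only `urllib.parse`; `is_wellformed_url` and the default scheme list are assumptions, and the function says nothing about whether the link is actually reachable.

```python
from urllib.parse import urlparse


def is_wellformed_url(url: str, allowed_schemes: tuple[str, ...] = ("http", "https")) -> bool:
    """Check URL syntax only: an allowed scheme and a non-empty host."""
    parsed = urlparse(url)
    return parsed.scheme in allowed_schemes and bool(parsed.netloc)
```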

---

### **D. Safety Validation**

#### 1. Toxicity / Hate / Violence Filters

Mark or remove harmful or unsafe content.

#### 2. PII Detection

Detect:

* Phone numbers
* Emails
* Identifiable personal info

Decide whether to mask, remove, or keep each item based on how the data will be used.
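
As a rough illustration of masking, the sketch below replaces email addresses and phone-like digit runs with placeholders. Both regular expressions are deliberately simple assumptions: they will miss many real formats and can produce false positives (for example on long numeric IDs), so a dedicated PII detection tool is the more realistic choice for production data.

```python
import re

# Simplistic patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```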

#### 3. IP / Copyright Check

Remove copyrighted text when its use is not permitted.

---

### **E. Semantic Validation**

#### 1. Relevance Checking

Ensure document is relevant to domain.

Example:
A legal RAG system should not ingest random blogs.

#### 2. Chunk-Level Coherence

Chunk should contain:

* Complete sentences
* No abrupt cutoff
* No mixed topics
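
The first two coherence criteria can be approximated with a punctuation heuristic. This is only a sketch: `looks_complete` is a hypothetical helper, it assumes English-style sentence punctuation, and it does not attempt to detect mixed topics, which would need a semantic model.

```python
# Endings that suggest the chunk stops at a sentence boundary.
SENTENCE_ENDINGS = (".", "!", "?", '."', '!"', '?"')


def looks_complete(chunk: str) -> bool:
    """Heuristic: starts with an uppercase letter or digit, ends with terminal punctuation."""
    stripped = chunk.strip()
    if not stripped:
        return False
    starts_cleanly = stripped[0].isupper() or stripped[0].isdigit()
    return starts_cleanly and stripped.endswith(SENTENCE_ENDINGS)
```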

#### 3. Metadata-Content Alignment

Check if metadata matches content.

Example:
metadata.language = "en" but content is French → reject.

---

### **F. Embedding Validation (for RAG)**

#### 1. Vector Dimension Check

The vector must have the dimensionality the embedding model produces (e.g., 768, 1536, or 3072).

#### 2. Vector Range Check

Vector values must be finite (no NaN or infinite values).

#### 3. Vector–Text Linking

Every embedding must map to a valid chunk.
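
The linking check amounts to a set comparison between chunk IDs and the chunk IDs referenced by embeddings. The sketch assumes each embedding record carries the ID of its chunk; `find_orphans` and the key names in its result are hypothetical.

```python
def find_orphans(chunk_ids: set[str], embedding_chunk_ids: list[str]) -> dict[str, list[str]]:
    """Report embeddings that point at no chunk, and chunks that have no embedding."""
    embedded = set(embedding_chunk_ids)
    return {
        "embeddings_without_chunk": sorted(embedded - chunk_ids),
        "chunks_without_embedding": sorted(chunk_ids - embedded),
    }
```

Checking both directions also surfaces chunks that were silently skipped during embedding generation.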

---

### **G. Versioning Checks**

#### 1. Schema Version

Require every record to carry a recognized `schema_version` (e.g., `v1`, `v2`).

#### 2. Embedding Model Version

Ensure embeddings were created with the expected model version.
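
Both version checks can be expressed as simple comparisons against configured values. In this sketch the record layout (`schema_version` and `embedding_model` keys) and the model name are assumptions made for illustration.

```python
SUPPORTED_SCHEMA_VERSIONS = {"v1", "v2"}
CURRENT_EMBEDDING_MODEL = "example-embedding-model-v2"  # hypothetical model name


def check_versions(record: dict) -> list[str]:
    """Return version problems for a stored record; an empty list means it is current."""
    problems = []
    schema_version = record.get("schema_version")
    if schema_version not in SUPPORTED_SCHEMA_VERSIONS:
        problems.append(f"unsupported schema_version: {schema_version!r}")
    model = record.get("embedding_model")
    if model != CURRENT_EMBEDDING_MODEL:
        problems.append(f"embedding_model mismatch: {model!r}")
    return problems
```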

---

### **H. Monitoring & Logging**

#### 1. Data Quality Metrics

* Fill rate
* Error rate
* Duplicate rate
* Quality score distribution
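
The first three metrics can be computed per batch with a few counters. This sketch assumes each record is a dictionary, that failed records carry a truthy `error` field, and that duplicates are exact text matches; all three are illustrative assumptions, as is the function name `quality_metrics`.

```python
def quality_metrics(records: list[dict], required: tuple[str, ...] = ("doc_id", "text")) -> dict[str, float]:
    """Compute fill, error, and duplicate rates for one ingestion batch."""
    total = len(records)
    if total == 0:
        return {"fill_rate": 0.0, "error_rate": 0.0, "duplicate_rate": 0.0}

    # A record is "filled" when every required field is present and non-empty.
    filled = sum(all(r.get(f) not in (None, "") for f in required) for r in records)
    errors = sum(bool(r.get("error")) for r in records)
    texts = [r["text"] for r in records if r.get("text")]
    duplicates = len(texts) - len(set(texts))

    return {
        "fill_rate": filled / total,
        "error_rate": errors / total,
        "duplicate_rate": duplicates / total,
    }
```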

#### 2. Data Drift Detection

Identify changes in:

* Language patterns
* Tone
* Topic distribution
* Sentence length
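
Of the signals listed, sentence length is the easiest to track without a model. The sketch below compares the mean sentence length of a new batch against a baseline; the function names, the naive sentence splitting, and the 30% relative tolerance are assumptions chosen for illustration rather than recommended values.

```python
import re
from statistics import mean


def mean_sentence_length(texts: list[str]) -> float:
    """Average number of words per sentence across all texts."""
    lengths = []
    for text in texts:
        # Naive split on terminal punctuation; abbreviations will be miscounted.
        for sentence in re.split(r"[.!?]+", text):
            words = sentence.split()
            if words:
                lengths.append(len(words))
    return mean(lengths) if lengths else 0.0


def sentence_length_drifted(baseline: list[str], batch: list[str], tolerance: float = 0.3) -> bool:
    """Flag drift when the batch mean deviates from the baseline by more than `tolerance`."""
    base = mean_sentence_length(baseline)
    if base == 0:
        return False  # no baseline to compare against
    return abs(mean_sentence_length(batch) - base) / base > tolerance
```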

---

### Example: Ingestion Pipeline with Validation

```
Raw Data → Parser → Schema Validation → Content Quality Checks
        → Safety Filters → Deduplication → Chunking
        → Embedding Generation → Embedding Validation → Storage
```
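
The flow above can be modeled as an ordered list of stage functions, where each stage either returns the (possibly modified) document or rejects it. This is a minimal sketch: `run_pipeline`, the `Stage` type, and the two toy stages are hypothetical stand-ins for the real parsing, validation, and embedding steps.

```python
from typing import Callable, Optional

# A stage returns the (possibly modified) document, or None to reject it.
Stage = Callable[[dict], Optional[dict]]


def run_pipeline(doc: dict, stages: list[tuple[str, Stage]]) -> tuple[Optional[dict], Optional[str]]:
    """Run stages in order; return (doc, None) on success or (None, stage_name) on rejection."""
    for name, stage in stages:
        result = stage(doc)
        if result is None:
            return None, name
        doc = result
    return doc, None


def schema_validation(doc: dict) -> Optional[dict]:
    """Toy stage: reject documents that lack the required fields."""
    return doc if doc.keys() >= {"doc_id", "text"} else None


def normalize_whitespace(doc: dict) -> Optional[dict]:
    """Toy stage: collapse runs of whitespace in the text."""
    return {**doc, "text": " ".join(doc["text"].split())}


STAGES = [
    ("schema_validation", schema_validation),
    ("normalize_whitespace", normalize_whitespace),
]
```

Returning the name of the rejecting stage makes it straightforward to route failures to an error queue with a reason attached.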

---

**Summary**

Data validation and quality checks ensure:

* Clean input
* Reliable text
* Safe content
* Accurate embeddings
* High-quality chunks
* Better LLM/RAG performance

The three sections below provide a complete validation ruleset, a Python validator class, and a text architecture diagram.

### 1. Complete Validation Ruleset (Structured)

#### **A. Structural Validation**

1. `doc_id` must be present and unique
2. `text` must be non-empty
3. `metadata` must be a dictionary
4. File formats must be valid:

   * PDFs must parse correctly
   * Images must be PNG/JPEG
   * Audio must be WAV/MP3
5. Length rules:

   * Text length ≥ 20 characters
   * Token count ≤ configured limit
6. Embedding dimension must match model (e.g., 1536)

---

#### **B. Content Quality Rules**

1. Remove boilerplate (headers, footers, navigation links)
2. Remove duplicate sentences
3. Reject documents with gibberish, corrupted text, random characters
4. Ensure clear sentence boundaries
5. Reject documents with excessive empty lines or whitespace

---

#### **C. Safety Rules**

1. Detect and remove/flag:

   * Hate speech
   * Violence
   * Self-harm content
   * NSFW content
2. PII detection:

   * Mask email, phone, address if policy requires
3. Copyright/IP checks

---

#### **D. Relevance Rules**

1. Document must belong to the selected domain (e.g., finance, legal, healthcare)
2. Reject off-topic content
3. Enforce metadata-content alignment (metadata.language must match detected language)

---

#### **E. Embedding Validation Rules**

1. No NaN or infinite values
2. Confirm vector dimension
3. Embedding must link to a valid chunk
4. Reject embeddings produced with obsolete model versions

---

#### **F. Metadata Validation**

1. Required fields: `source`, `language`, `timestamp`
2. Validate timestamp format
3. Validate language using fastText/CLD3
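
The first two metadata rules can be sketched with the standard library. The sketch assumes timestamps are ISO 8601 strings and uses `datetime.fromisoformat`, which accepts only a subset of ISO 8601 on older Python versions (for example, a trailing `Z` is accepted only from Python 3.11). The function name `validate_metadata` is hypothetical, and language validation with fastText or CLD3 is not shown because it depends on external models.

```python
from datetime import datetime

REQUIRED_METADATA = ("source", "language", "timestamp")


def validate_metadata(metadata: dict) -> list[str]:
    """Return metadata problems; an empty list means the metadata passed."""
    problems = [f"missing metadata field: {k}" for k in REQUIRED_METADATA if k not in metadata]

    timestamp = metadata.get("timestamp")
    if timestamp is not None:
        try:
            datetime.fromisoformat(str(timestamp))
        except ValueError:
            problems.append(f"timestamp is not ISO 8601: {timestamp!r}")
    return problems
```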

---

### 2. Python Class for Validation (Illustrative Example)

```python
import re

import numpy as np
from langdetect import DetectorFactory, LangDetectException, detect

# langdetect is non-deterministic by default; a fixed seed makes results repeatable.
DetectorFactory.seed = 0


class DocumentValidator:
    """Basic ingestion checks. Each validate_* method returns (ok, message)."""

    REQUIRED_FIELDS = ("doc_id", "text", "metadata")
    MIN_TEXT_LENGTH = 20

    def __init__(self, embedding_dim=1536, allowed_langs=("en",)):
        self.embedding_dim = embedding_dim
        # Copy into a set so a caller's list is never shared between instances.
        self.allowed_langs = set(allowed_langs)
        # Crude noise heuristic: a very long unbroken run of letters.
        # 30 rather than 20 so that long but legitimate words are not flagged.
        self._gibberish_re = re.compile(r"[A-Za-z]{30,}")
        # Placeholder keyword list; whole-word matching avoids hits such as "skill".
        self._unsafe_re = re.compile(r"\b(?:kill|suicide|bomb|hate)\b", re.IGNORECASE)

    # ---------------------------
    # Structural Validation
    # ---------------------------
    def validate_schema(self, doc):
        for field in self.REQUIRED_FIELDS:
            if field not in doc:
                return False, f"Missing field: {field}"

        if not isinstance(doc["text"], str):
            return False, "text must be a string"

        if not isinstance(doc["metadata"], dict):
            return False, "metadata must be a dictionary"

        if len(doc["text"].strip()) < self.MIN_TEXT_LENGTH:
            return False, "Text too short"

        return True, "OK"

    # ---------------------------
    # Language Validation
    # ---------------------------
    def validate_language(self, text):
        try:
            lang = detect(text)
        except LangDetectException:
            # Raised when the text has no usable features (e.g. only digits).
            return False, "Language detection failed"
        if lang not in self.allowed_langs:
            return False, f"Unsupported language: {lang}"
        return True, "OK"

    # ---------------------------
    # Noise Check
    # ---------------------------
    def has_gibberish(self, text):
        return bool(self._gibberish_re.search(text))

    # ---------------------------
    # Safety Check (basic)
    # ---------------------------
    def contains_disallowed_content(self, text):
        # A keyword list is only a stand-in for a real safety classifier.
        return bool(self._unsafe_re.search(text))

    # ---------------------------
    # Embedding Validation
    # ---------------------------
    def validate_embedding(self, vector):
        try:
            arr = np.asarray(vector, dtype=float)
        except (TypeError, ValueError):
            return False, "Embedding is not numeric"
        # Must be a flat vector of exactly the expected length.
        if arr.ndim != 1 or arr.shape[0] != self.embedding_dim:
            return False, "Incorrect embedding dimension"
        if not np.isfinite(arr).all():
            return False, "Embedding contains invalid values"
        return True, "OK"

    # ---------------------------
    # Master Validation Function
    # ---------------------------
    def validate(self, doc, embedding=None):
        # Schema first: later checks assume doc["text"] is a string.
        ok, msg = self.validate_schema(doc)
        if not ok:
            return False, msg

        # Language
        ok, msg = self.validate_language(doc["text"])
        if not ok:
            return False, msg

        # Gibberish check
        if self.has_gibberish(doc["text"]):
            return False, "Contains noise/gibberish"

        # Safety
        if self.contains_disallowed_content(doc["text"]):
            return False, "Unsafe content detected"

        # Embedding (optional: only checked when one is supplied)
        if embedding is not None:
            ok, msg = self.validate_embedding(embedding)
            if not ok:
                return False, msg

        return True, "Document is valid"
```

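This class can serve as a starting point for a RAG ingestion pipeline. The keyword-based safety check and the gibberish heuristic are placeholders that a real deployment would replace with dedicated classifiers.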

---

### 3. Full Architecture Diagram (Text Representation)

```
                  ┌──────────────────────────────┐
                  │          RAW DATA            │
                  │ (pdf, html, images, audio)   │
                  └──────────────┬───────────────┘
                                 │
                                 ▼
                ┌──────────────────────────────────┐
                │          PARSING LAYER            │
                │ PDF parsers, OCR, HTML cleaners   │
                └──────────────┬───────────────────┘
                               │
                               ▼
         ┌─────────────────────────────────────────────────┐
         │         DATA VALIDATION & QUALITY CHECKS         │
         │  - Schema Validation                             │
         │  - Format Validation                             │
         │  - Language Detection                            │
         │  - Noise & Boilerplate Removal                   │
         │  - Duplicate Detection                           │
         │  - Safety Filters                                │
         │  - Metadata Validation                           │
         └──────────────┬──────────────────────────────────┘
                        │
                        ▼
             ┌──────────────────────────────┐
             │      CLEAN & VALID DATA      │
             └──────────────┬───────────────┘
                            │
                            ▼
    ┌────────────────────────────────────────────────────────┐
    │                 CHUNKING & NORMALIZATION               │
    │  - Split into semantic chunks                          │
    │  - Remove irrelevant parts                             │
    │  - Add chunk metadata                                  │
    └───────────────┬──────────────────────────────────────┘
                    │
                    ▼
         ┌──────────────────────────────────────────┐
         │           EMBEDDING GENERATION           │
         │    OpenAI / local embedding models        │
         └───────────────┬──────────────────────────┘
                         │
                         ▼
              ┌─────────────────────────────┐
              │     EMBEDDING VALIDATION    │
              │ - Dimension check            │
              │ - NaN/Inf check              │
              │ - Version match              │
              └─────────────┬──────────────┘
                            │
                            ▼
            ┌─────────────────────────────────┐
            │      VECTOR STORE / DATABASE    │
            │  (FAISS / Chroma / Pinecone)    │
            └─────────────────────────────────┘
```

---

