```{contents}
```

## Deduplication and Noise Removal

Deduplication and noise removal ensure that the input data for LLMs, RAG pipelines, and embeddings is **clean, unique, and semantically meaningful**. Poor-quality or repetitive data leads to weak retrieval, increased hallucinations, and degraded fine-tuning output.

Both processes are core components of the ingestion pipeline.

---

### 1. Deduplication

Deduplication removes **exact or near-duplicate documents, chunks, sentences, or embeddings.**

#### Why Deduplication Matters

1. Reduces the risk of the model **overfitting** to repeated content
2. Reduces **vector DB size**
3. Improves **retrieval diversity**
4. Removes spam content
5. Avoids **skewed fine-tuning** from repeated examples

---

### A. Types of Deduplication

#### **1. Exact Deduplication**

Checks if two documents or chunks are **identical**.

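Methods:

* SHA-256 hash of the full text
* Hash of normalized text (for example, after collapsing whitespace)

Fuzzy fingerprints such as MinHash and SimHash detect similar rather than identical text, so they belong with near-duplicate detection.

Use case: Remove multiple copies of the same file.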

---

#### **2. Near-Duplicate Deduplication**

Detects content that is *not identical* but *very similar*.

Examples:

* Same paragraph with slight punctuation changes
* Duplicated bullet points
* Repetitive logs or boilerplate

Methods:

* MinHash + Locality-Sensitive Hashing (LSH)
* TF-IDF cosine similarity
* Embedding cosine similarity (e.g., > 0.95 threshold), for example with SentenceTransformer embeddings

Use case: RAG ingestion, cleaning scraped websites.
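
The MinHash idea listed above can be sketched with the standard library alone. This is a minimal illustration, assuming character 5-gram shingles and 64 salted BLAKE2b hash functions (both arbitrary choices, not taken from the text). The function names are hypothetical, and the LSH bucketing step that makes the approach scale to large corpora is omitted.

```python
import hashlib

def shingles(text, k=5):
    """Return the set of character k-grams of lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(len(norm) - k + 1, 1))}

def minhash_signature(items, num_perm=64):
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_1 = minhash_signature(shingles("Step 1: Unpack the device."))
sig_2 = minhash_signature(shingles("Step 1: Unpack the device!"))
print("Estimated Jaccard:", estimated_jaccard(sig_1, sig_2))
```

Pairs whose estimated similarity exceeds a chosen cutoff would then be treated as near-duplicates. The cutoff is corpus-dependent and needs tuning.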

---

#### **3. Chunk-Level Deduplication**

Even if documents differ, chunks may repeat.

Detection:

* Compare embeddings
* Compare n-grams
* Compare sentences

Use case: Big PDFs, legal docs, policies, product manuals.
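
Comparing n-grams, one of the detection options above, might look like the following sketch. It assumes word trigrams and a Jaccard cutoff of 0.8, neither of which comes from the text, and it keeps the first occurrence of each repeated chunk. The helper names are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in a chunk."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup_chunks(chunks, threshold=0.8):
    """Keep a chunk only if it is not too similar to any chunk kept so far."""
    kept, kept_grams = [], []
    for chunk in chunks:
        grams = word_ngrams(chunk)
        if any(jaccard(grams, g) >= threshold for g in kept_grams):
            continue  # repeated chunk, skip it
        kept.append(chunk)
        kept_grams.append(grams)
    return kept
```

The pairwise comparison is quadratic in the number of chunks, so this form suits a single large document rather than a whole corpus.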

---

#### **4. Cross-Document Deduplication**

Identify repeated content across multiple sources.

Example:
FAQ answers repeated across sources.
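
One way to surface such repeats is to record which sources each chunk hash came from. The sketch below assumes the input is a mapping from a source name to its list of chunks. That input shape and the function name are illustrative assumptions.

```python
import hashlib
from collections import defaultdict

def find_cross_document_duplicates(docs):
    """docs maps a source name to a list of chunk strings.

    Returns a mapping from each repeated chunk (whitespace-normalized)
    to the sorted list of sources that contain it.
    """
    sources_by_hash = defaultdict(set)
    text_by_hash = {}
    for source, chunks in docs.items():
        for chunk in chunks:
            normalized = " ".join(chunk.split())
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            sources_by_hash[digest].add(source)
            text_by_hash.setdefault(digest, normalized)
    return {
        text_by_hash[digest]: sorted(sources)
        for digest, sources in sources_by_hash.items()
        if len(sources) > 1  # present in more than one source
    }
```

Keeping the source list alongside the surviving chunk preserves provenance even after the copies are dropped.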

---

### B. Deduplication Pipeline Example

```
Raw Text → Normalize → Hash → Exact Dedup → Embedding → Near Dedup → Store
```

---

### 2. Noise Removal

Noise removal removes content that is **irrelevant, meaningless, low-quality, or harmful**.

#### Why Noise Removal Matters

1. Reduces garbage chunks in RAG
2. Improves embedding quality
3. Reduces hallucinations caused by irrelevant retrieved context
4. Enhances chunk coherence
5. Prevents indexing junk data

---

### A. Types of Noise

#### **1. Boilerplate Noise**

* Headers, footers
* Navigation menus
* Copyright lines
* Page numbers
* Repeated watermarks
* “Back to top” links

#### **2. OCR/Text Extraction Noise**

* Misrecognized characters
* Random symbols
* Garbled text from scanned PDFs

#### **3. Formatting Noise**

* Excessive whitespace
* HTML tags
* Broken sentences

#### **4. Semantic Noise**

* Very short sentences with no meaning
* Random paragraphs unrelated to topic
* Duplicate disclaimers on every page

#### **5. Log/Technical Noise**

* Debug logs
* Time stamps
* Unstructured terminal outputs

---

### B. Noise Removal Techniques

#### **1. Text Normalization**

* Lowercasing (only where case carries no meaning, e.g., for hashing or matching)
* Removing non-printable characters
* Removing repeated spaces
* Fixing broken lines

#### **2. HTML Cleaners**

* Strip tags
* Remove scripts, styles
* Keep only semantic text
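
A minimal HTML cleaner can be built on the standard library's `html.parser`. The sketch below assumes that `script`, `style`, `noscript`, `nav`, and `footer` subtrees carry no useful text and that a fixed set of block-level tags marks line breaks. Both sets, and the class and function names, are illustrative choices. Malformed pages with unclosed tags may need a more forgiving parser.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style", "noscript", "nav", "footer"}   # subtrees to drop
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block boundary becomes a line break

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(self._skip_depth - 1, 0)
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```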

#### **3. OCR Post-Processing**

* Spell correction
* Remove invalid Unicode sequences

#### **4. Boilerplate Removal**

* Use heuristic patterns
* Domain-specific rules
* Readability scores
* Page template detection
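
Page template detection can be approximated by counting lines that recur across pages. The sketch below assumes the document is already split into per-page strings, masks digits so that "Page 1 of 32" and "Page 2 of 32" count as the same line, and treats a line as template text when it appears on at least 60% of pages. The cutoff and the function names are assumptions for illustration.

```python
import re
from collections import Counter

def _line_key(line):
    """Comparison key: stripped line with digit runs masked."""
    return re.sub(r"\d+", "#", line.strip())

def find_repeated_lines(pages, min_fraction=0.6):
    """Return keys of lines that appear on at least min_fraction of pages."""
    if len(pages) < 2:
        return set()  # nothing to compare against
    counts = Counter()
    for page in pages:
        counts.update({_line_key(l) for l in page.splitlines() if l.strip()})
    cutoff = min_fraction * len(pages)
    return {key for key, n in counts.items() if n >= cutoff}

def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove template lines (headers, footers, page counters) from each page."""
    repeated = find_repeated_lines(pages, min_fraction)
    return [
        "\n".join(l for l in page.splitlines() if _line_key(l) not in repeated)
        for page in pages
    ]
```

Legitimate content that recurs on most pages would also be removed, so the flagged lines are worth reviewing before applying the filter in bulk.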

#### **5. Content-Length Filters**

* Remove chunks below min length (e.g., < 20 characters)
* Remove ultra-long garbage sections
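
A length filter is only a few lines. In this sketch the 20-character minimum follows the example above, while the 5000-character maximum is an assumed placeholder that should be tuned to the chunking strategy.

```python
def filter_by_length(chunks, min_chars=20, max_chars=5000):
    """Keep chunks whose stripped length lies within [min_chars, max_chars]."""
    kept = []
    for chunk in chunks:
        length = len(chunk.strip())
        if min_chars <= length <= max_chars:
            kept.append(chunk)
    return kept
```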

#### **6. ML-Based Noise Detection**

* Classifiers trained to detect:

  * Spam
  * Junk paragraphs
  * Unusable OCR text

---

### C. Noise Removal Pipeline Example

```
Raw Document → Clean HTML / OCR → Remove Boilerplate → Normalize → Filter by Quality → Chunk
```

---

### 3. Combined Dedup + Noise Removal Workflow

Below is a typical RAG ingestion sequence.

```
1. Parse document
2. Normalize text
3. Remove noise and boilerplate
4. Deduplicate exact matches (hash-based)
5. Generate temporary embeddings
6. Deduplicate near-duplicates (embedding similarity)
7. Keep only clean, unique chunks
8. Generate final embeddings
9. Store in vector DB
```
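
Steps 2 to 7 of this sequence can be sketched as a single loop. The example is deliberately simplified: cleaning is reduced to whitespace normalization plus a length filter, `toy_embed` is a letter-count stand-in for a real embedding model, and the function names are hypothetical. Parsing, final embedding, and storage are left out.

```python
import hashlib
import math

def toy_embed(text):
    """Stand-in embedding (letter counts); a real pipeline would call a model."""
    lowered = text.lower()
    return [lowered.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ingest(chunks, embed=toy_embed, threshold=0.95, min_chars=20):
    seen_hashes = set()
    kept = []  # list of (text, embedding) pairs
    for raw in chunks:
        text = " ".join(raw.split())               # steps 2-3: normalize, basic cleanup
        if len(text) < min_chars:
            continue                               # too short to be useful
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                               # step 4: exact duplicate
        seen_hashes.add(digest)
        vector = embed(text)                       # step 5: temporary embedding
        if any(cosine(vector, v) >= threshold for _, v in kept):
            continue                               # step 6: near-duplicate
        kept.append((text, vector))                # step 7: clean, unique chunk
    return kept
```

Running the cheap hash check before the embedding step means exact copies never incur embedding cost.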

---

### 4. Example: Deduplication Implementation (Python)

#### **Exact Dedup**

```python
import hashlib

def hash_text(text):
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode()).hexdigest()
```
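
To apply the hash to a list of chunks, a set of already-seen digests is enough. The sketch below repeats the hashing helper so that it runs on its own. `dedup_exact` is a hypothetical name, and keeping the first occurrence is an assumed policy.

```python
import hashlib

def hash_text(text):
    normalized = " ".join(text.split())  # collapse whitespace before hashing
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup_exact(chunks):
    """Return chunks in original order, keeping the first copy of each."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hash_text(chunk)
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```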

#### **Near-Duplicate with Embeddings**

```python
import numpy as np

def is_near_duplicate(emb1, emb2, threshold=0.95):
    # Cosine similarity: dot product divided by the product of the vector norms
    cos_sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    return cos_sim > threshold
```
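
For a whole batch of chunks, the pairwise check can be run greedily over a matrix of embeddings. This is a sketch assuming the embeddings fit in memory as a NumPy array. The function name is hypothetical, and earlier chunks win ties. Large collections would normally use an approximate nearest-neighbor index instead.

```python
import numpy as np

def dedup_by_embedding(embeddings, threshold=0.95):
    """Return indices of embeddings to keep, dropping near-duplicates of earlier ones."""
    vectors = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors = vectors / np.clip(norms, 1e-12, None)  # unit length, so dot product = cosine
    kept = []
    for i in range(len(vectors)):
        if kept and np.max(vectors[kept] @ vectors[i]) > threshold:
            continue  # too similar to a chunk already kept
        kept.append(i)
    return kept
```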

---

**Summary**

### Deduplication

* Removes exact and near copies
* Prevents redundancy, improves retrieval quality

### Noise Removal

* Removes irrelevant, corrupted, low-quality text
* Ensures chunks are clean and meaningful

---

### 1. Raw Input Example (Realistic)

```
==================== PAGE 1 ====================
Acme Product Manual – Version 4.2
Copyright © 2024 Acme Corp.
Page 1 of 32

How to install the Acme Widget?
--------------------------------
Step 1: Unpack the device.
Step 2: Connect the power cable.
Step 3: Download the Acme setup app.

Back to top
================================================
```

Notice the noise:

* Page number
* Headers / footers
* Copyright lines
* “Back to top”
* Hyphens and separators

---

### 2. Noise Removal Demonstration

#### Step A: Remove boilerplate

```
Acme Product Manual – Version 4.2

How to install the Acme Widget?

Step 1: Unpack the device.
Step 2: Connect the power cable.
Step 3: Download the Acme setup app.
```

Removed:

* “Page 1 of 32”
* “Copyright © 2024…”
* “Back to top”
* Page separators

---

#### Step B: Normalize text

* Fix spacing
* Remove repeated newlines
* Lowercase if needed
* Remove random symbols

Result:

```
Acme Product Manual – Version 4.2
How to install the Acme Widget?
Step 1: Unpack the device.
Step 2: Connect the power cable.
Step 3: Download the Acme setup app.
```

---

### 3. Chunking

```
chunk_1:
Acme Product Manual – Version 4.2

chunk_2:
How to install the Acme Widget?

chunk_3:
Step 1: Unpack the device.
Step 2: Connect the power cable.
Step 3: Download the Acme setup app.
```

---

### 4. Exact Deduplication Demonstration

Assume 3 documents contain the same installation steps.

#### Before exact-dedup

```
chunk_3 from doc A
chunk_3 from doc B
chunk_3 from doc C
```

We hash each normalized chunk:

```
chunk_3 hash = 6f94c5c681bf3c...
chunk_3 hash = 6f94c5c681bf3c...
chunk_3 hash = 6f94c5c681bf3c...
```

All match → keep only 1.

#### After exact-dedup

```
chunk_3 (unique)
```

---

### 5. Near-Duplicate Dedup (Embedding Similarity Demonstration)

Assume we have two chunks:

```
chunk_1: "Step 1: Unpack the device."
chunk_2: "Unpack the device first."
```

Not exact duplicates, but very similar.

#### Embedding similarity

```
similarity(chunk_1, chunk_2) = 0.98
```

With a threshold of **0.95**, the pair is treated as near-duplicates. The 0.98 score is illustrative; actual values depend on the embedding model.

Keep only the more complete or higher-quality version:

```
chunk_1 kept
chunk_2 discarded
```

---

### 6. Final Output After Cleaning + Deduplication

**Final unique, noise-free chunks:**

```
1. Acme Product Manual – Version 4.2
2. How to install the Acme Widget?
3. Step 1: Unpack the device.
   Step 2: Connect the power cable.
   Step 3: Download the Acme setup app.
```

These three chunks are now clean, readable, and ready for embedding.

---

### 7. Full Demonstration Code (Python Example)

```python
import hashlib
import re

import numpy as np

# ----------------------------------------
# Noise removal
# ----------------------------------------
# Each pattern must match a whole (whitespace-normalized) line.
BOILERPLATE_PATTERNS = [
    r"Page \d+ of \d+",   # page counters
    r"Copyright.*",       # copyright lines
    r"Back to top",       # navigation links
    r"=+.*=+",            # page separators, e.g. "==== PAGE 1 ===="
    r"-{3,}",             # heading underlines
]

def remove_noise(text):
    kept = []
    for line in text.splitlines():
        line = " ".join(line.split())  # normalize whitespace within the line
        if not line:
            continue  # drop blank lines
        if any(re.fullmatch(p, line) for p in BOILERPLATE_PATTERNS):
            continue  # drop boilerplate lines
        kept.append(line)
    return "\n".join(kept)

# ----------------------------------------
# Exact deduplication
# ----------------------------------------
def text_hash(text):
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# ----------------------------------------
# Near-duplicate check
# ----------------------------------------
def is_near_duplicate(emb1, emb2, threshold=0.95):
    emb1, emb2 = np.array(emb1), np.array(emb2)
    sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    return sim >= threshold

# ----------------------------------------
# Example demonstration
# ----------------------------------------
raw_text = """
==================== PAGE 1 ====================
Acme Product Manual – Version 4.2
Copyright © 2024 Acme Corp.
Page 1 of 32

How to install the Acme Widget?
--------------------------------
Step 1: Unpack the device.
Step 2: Connect the power cable.
Step 3: Download the Acme setup app.

Back to top
================================================
"""

clean_text = remove_noise(raw_text)
print("Cleaned Text:\n" + clean_text)

# Exact duplicates differ only in whitespace, so their hashes match
print("Exact duplicate:",
      text_hash("Step 1:  Unpack the device.") == text_hash("Step 1: Unpack the device."))

# Example chunks
chunks = [
    "Step 1: Unpack the device.",
    "Unpack the device first."
]

# Fake embeddings for demonstration: the second vector is the first plus
# small random noise, so the two stay close in cosine similarity.
# A real pipeline would embed each entry of `chunks` with a model.
rng = np.random.default_rng(0)
base = rng.random(1536)
embeddings = [
    base,
    base + rng.normal(scale=0.01, size=1536),
]

print("Near duplicate:", is_near_duplicate(embeddings[0], embeddings[1], 0.95))
```

---

**Summary of Demonstration**

### Noise Removal

* Remove boilerplate
* Normalize text
* Clean OCR/HTML junk

### Deduplication

* Exact dedup → hash matching
* Near-dedup → embedding similarity
* Keep only clean, unique, high-quality chunks

### Result

You end up with **clean, meaningful, non-repetitive data** optimized for RAG and generative-AI training.

