**Deduplication in LLM pre-training datasets** is the process of identifying and removing duplicate or near-duplicate text from the corpus before training a large language model (LLM).

### Why is Deduplication Important?

1. **Avoids overfitting to repeated data**: LLMs trained on duplicated text tend to memorize passages that recur often, such as Wikipedia articles or popular books, which leads to verbatim regurgitation and weaker generalization.
2. **Improves model quality**: Redundant data lowers the effective diversity of the corpus, which can hurt performance on downstream tasks.
3. **Reduces training cost**: Duplicate content wastes compute because the model spends training steps on text it has already seen.

---

### Types of Duplicates

1. **Exact duplicates**: Identical documents or paragraphs repeated multiple times.
2. **Near-duplicates**: Slight variations of the same text (e.g., with typos, formatting changes, added/removed sentences).
3. **Template-based duplicates**: Repeated boilerplate text with minor changes (common in web data).
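
Template-based duplicates are often easier to catch below the document level. As a rough sketch, assuming boilerplate shows up as identical lines repeated across many pages, one can count in how many documents each line occurs and drop the lines that recur too often. The function name `drop_repeated_lines` and the `max_docs` cutoff are illustrative choices, not taken from any specific pipeline.

```python
import hashlib
from collections import Counter


def _line_key(line: str) -> str:
    # Hash a whitespace-trimmed, lowercased line so trivial variations collide.
    return hashlib.sha1(line.strip().lower().encode("utf-8")).hexdigest()


def drop_repeated_lines(documents: list[str], max_docs: int = 2) -> list[str]:
    """Remove lines that occur in more than `max_docs` documents (hypothetical cutoff)."""
    doc_freq = Counter()
    for doc in documents:
        # Count each distinct line once per document.
        keys = {_line_key(line) for line in doc.splitlines() if line.strip()}
        doc_freq.update(keys)

    cleaned = []
    for doc in documents:
        kept = [
            line for line in doc.splitlines()
            if line.strip() and doc_freq[_line_key(line)] <= max_docs
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Accept all cookies\nCats sleep most of the day.",
    "Accept all cookies\nRust has no garbage collector.",
    "Accept all cookies\nThe Nile flows north.",
]
cleaned_pages = drop_repeated_lines(pages)
```

Here the cookie banner appears in all three pages and is removed, while the page-specific sentences survive. On a real crawl the cutoff would need tuning so that legitimately common lines are not lost.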

---

### Deduplication Techniques

1. **Hashing-based methods**:

   * Use cryptographic hash functions (such as SHA-256) to detect exact matches.
   * **MinHash** and **SimHash** produce compact signatures that support approximate (near-duplicate) detection.

2. **Embedding similarity**:

   * Generate vector embeddings for documents and compute cosine similarity.
   * High-similarity pairs are considered near-duplicates.

3. **Shingling + Jaccard similarity**:

   * Break text into overlapping character or word n-grams (shingles).
   * Compare the resulting shingle sets with Jaccard similarity (intersection size divided by union size) to measure overlap.

4. **Clustering**:

   * Cluster similar documents and keep only representative samples.
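
To make the shingling and Jaccard step concrete, here is a minimal pure-Python sketch. The shingle size `k = 5` and the use of character-level shingles are assumptions for illustration; word-level shingles are just as common.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of overlapping character k-grams of whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection size over union size; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "The quick brown fox jumps over the lazy dog."
edited = "The quick brown fox jumped over the lazy dog."
unrelated = "Gradient descent minimizes a loss function."

sim_near = jaccard(shingles(original), shingles(edited))
sim_far = jaccard(shingles(original), shingles(unrelated))
```

A one-word edit leaves most shingles intact, so `sim_near` stays high while `sim_far` is close to zero. Comparing every pair this way is quadratic in the number of documents, which is why large pipelines approximate it with MinHash and LSH.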

---

### In Practice

For example, the GPT-3 paper describes fuzzy document-level deduplication of its Common Crawl data using MinHash LSH, and the LLaMA paper describes line-level deduplication of Common Crawl via the CCNet pipeline. Deduplication is especially critical for large web crawls (e.g., Common Crawl), where duplication is rampant.

---

Let me know if you want to see code or tools used for deduplication.


Here are **code-level techniques and tools** commonly used for deduplication in LLM pretraining pipelines, categorized by approach:

---

### ✅ **1. Exact Deduplication using Hashing**

Fast and simple for **exact duplicates**.

```python
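import hashlib

def hash_document(text: str) -> str:
    # Strip surrounding whitespace so a trailing newline does not change the hash.
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

# Example corpus; replace with your own iterable of strings.
documents = ["Hello world.", "Hello world.\n", "Something else."]

# Map each hash to the first document that produced it.
unique_docs = {}
for doc in documents:
    h = hash_document(doc)
    if h not in unique_docs:
        unique_docs[h] = doc

deduplicated = list(unique_docs.values())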
```
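
Exact hashing misses copies that differ only in case, spacing, or Unicode form. A common remedy, sketched here under the assumption that such differences should not count as distinct content, is to normalize text before hashing. The helper names `normalize` and `normalized_hash` are illustrative.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize Unicode (NFKC), lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def normalized_hash(text: str) -> str:
    # Hash the normalized form so formatting-only variants share one digest.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


a = "Hello,   World!\n"
b = "hello, world!"
c = "Hello, there!"
same = normalized_hash(a) == normalized_hash(b)
different = normalized_hash(a) != normalized_hash(c)
```

How aggressive the normalization should be is a judgment call: stripping punctuation or digits catches more variants but also merges texts that genuinely differ.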

---

### ✅ **2. MinHash (Approximate Deduplication)**

Efficient for **near-duplicate detection**: MinHash signatures estimate **Jaccard similarity**, and an LSH index finds likely matches without comparing every pair.

```python
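from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def get_minhash(text, num_perm=NUM_PERM, ngram=3):
    # Word n-gram shingles; a plain set of words ignores order and over-matches.
    words = text.lower().split()
    shingles = {
        " ".join(words[i:i + ngram])
        for i in range(max(1, len(words) - ngram + 1))
    }
    m = MinHash(num_perm=num_perm)
    for shingle in shingles:
        m.update(shingle.encode("utf8"))
    return m

# Example corpus; replace with your own iterable of strings.
documents = ["the cat sat on the mat today", "the cat sat on the mat today", "a different text"]

# LSH index: query() returns keys of likely matches (approximate, so errors are possible).
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
kept = {}

for i, doc in enumerate(documents):
    m = get_minhash(doc)
    # Keep the document only if no already-kept document is a likely near-duplicate.
    if not lsh.query(m):
        key = f"doc_{i}"
        lsh.insert(key, m)
        kept[key] = doc

deduplicated = list(kept.values())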
```

Install with:

```bash
pip install datasketch
```
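
For intuition about what the LSH `threshold` controls: MinHash LSH splits each signature into bands of several rows, and two documents become candidates if any band matches exactly. With Jaccard similarity `s`, `b` bands, and `r` rows per band, the candidate probability is `1 - (1 - s^r)^b`. The sketch below evaluates that curve; the 16 x 8 split is an illustrative assumption, not the split `datasketch` would necessarily choose for a given threshold.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity `s` share at least one LSH band."""
    return 1.0 - (1.0 - s ** rows) ** bands


# Illustrative split of a 128-permutation signature: 16 bands x 8 rows.
BANDS, ROWS = 16, 8
curve = {s: candidate_probability(s, BANDS, ROWS) for s in (0.3, 0.5, 0.7, 0.8, 0.9)}
for s, p in curve.items():
    print(f"similarity={s:.1f} -> candidate probability={p:.3f}")
```

The curve is S-shaped: pairs well below the threshold are rarely retrieved and pairs well above it almost always are, with a transition region in between where both false positives and false negatives occur.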

---

### ✅ **3. Embedding-based Deduplication**

Best for **semantic similarity** (e.g., paraphrased or reworded duplicates).

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Example corpus; replace with your own list of strings.
documents = ["How do I reset my password?", "How can I reset my password?", "Best pizza in town"]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents, convert_to_tensor=True)

# Pairwise cosine similarity: an N x N matrix, so this only scales to small corpora.
cos_sim = util.cos_sim(embeddings, embeddings).cpu().numpy()

# Greedy filter: keep a document, then drop every later one that is too similar to it.
THRESHOLD = 0.9
mask = np.ones(len(documents), dtype=bool)

for i in range(len(documents)):
    if not mask[i]:
        continue
    # Vectorized over all j > i instead of an inner Python loop.
    mask[i + 1:] &= cos_sim[i, i + 1:] <= THRESHOLD

deduplicated = [doc for doc, keep in zip(documents, mask) if keep]
```
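
The greedy loop above keeps whichever document comes first. An alternative, closer to the clustering approach, is to treat every similar pair as an edge and keep one representative per connected component. The sketch below assumes the similar pairs have already been found by any of the methods above; `cluster_duplicates` and the sample pairs are hypothetical.

```python
def cluster_duplicates(n_docs: int, similar_pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Group document indices into connected components of the similarity graph."""
    parent = list(range(n_docs))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in similar_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root to the smaller so the lowest index stays the root.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters: dict[int, list[int]] = {}
    for idx in range(n_docs):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())


# Hypothetical pairs: docs 0~1 and 1~3 are similar, so 0, 1, and 3 form one cluster.
pairs = [(0, 1), (1, 3)]
clusters = cluster_duplicates(5, pairs)
representatives = [members[0] for members in clusters]
```

One caveat with this design: connected components are transitive, so a chain of pairwise-similar documents can merge two texts that are not similar to each other. A stricter threshold or a cap on cluster size is a common mitigation.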

---

### ✅ **4. Tools Used in Research Pipelines**

| Tool / Library | Purpose | Notes |
| --- | --- | --- |
| `cc_net` ([GitHub](https://github.com/facebookresearch/cc_net)) | Common Crawl processing with hash-based paragraph-level deduplication | The LLaMA paper reports using the CCNet pipeline |
| `datasketch` | MinHash and Locality Sensitive Hashing | Convenient for prototyping near-duplicate detection |
| `Cleanlab` | Dataset cleaning, includes near-duplicate detection | Focused on data quality |
| `Apache Spark` + `LSH` | Scalable deduplication on large datasets | Spark MLlib provides `MinHashLSH`; the GPT-3 paper reports using it |
| `FAISS` | Vector similarity search | Efficient for embedding-based deduplication |

---

Let me know if you want:

* A scalable pipeline example (e.g., Spark-based).
* Deduplication tailored for JSONL datasets used in LLMs.
* Integration into your current tokenizer or pretraining pipeline.
