```{contents}
```
## Data Cleaning 


**Data cleaning** is the process of **transforming raw data into high-quality, reliable input** for LLM pipelines.

In LLM / RAG systems, data cleaning directly impacts:

* Retrieval accuracy
* Hallucination rate
* Answer faithfulness
* Embedding quality

Bad data = bad answers.

---

### Where Data Cleaning Fits in the Pipeline

```
Raw Data → Cleaning → Chunking → Embeddings → Vector DB → RAG
```

If cleaning fails, **every stage after it degrades**, because chunking, embedding, and retrieval all operate on the cleaned text.

---

### What Data Cleaning Fixes

| Issue                   | Effect if not fixed            |
| ----------------------- | ------------------------------ |
| HTML tags               | Noisy embeddings               |
| Boilerplate text        | Irrelevant retrieval           |
| Encoding errors         | Corrupted content              |
| Duplicated data         | Duplicates crowd top-k results |
| Inconsistent formatting | Poor chunking                  |
| PII leakage             | Security and privacy risk      |


---

### Raw Data Example

```python
raw_text = """
<div>   Welcome to Support Portal </div>
Contact us at support@company.com
\n\n\n
This    product   supports    RAG.
© 2024 Company Inc.
"""
```

---

### Remove HTML Tags

```python
from bs4 import BeautifulSoup

text = BeautifulSoup(raw_text, "html.parser").get_text()
```
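
Where adding a dependency is not an option, a rough standard-library sketch can do a similar job. This swaps in Python's built-in `html.parser` in place of BeautifulSoup; the `TextExtractor` class and `strip_html` function are illustrative names, not part of any library. It skips `<script>` and `<style>` contents explicitly and joins text fragments with spaces so that text from adjacent elements does not run together:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so "<p>a</p><p>b</p>" becomes "a b", not "ab".
    return " ".join(part.strip() for part in parser.parts if part.strip())


print(strip_html("<p>Reset your password</p><script>track()</script><p>Contact support</p>"))
```

This is not a full HTML sanitizer; malformed markup is handled more gracefully by a dedicated parser such as BeautifulSoup.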

---

### Normalize Whitespace

```python
import re

text = re.sub(r"\s+", " ", text).strip()
```
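
Collapsing every run of whitespace also erases paragraph breaks, which a later chunking step may want to split on. A minimal sketch of a gentler variant, assuming blank lines mark paragraph boundaries in the source (the function name `normalize_whitespace` is illustrative):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse spacing inside lines but keep paragraph breaks."""
    # Runs of spaces and tabs within a line become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim the space left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Two or more line breaks count as one paragraph break.
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text.strip()


sample = "First   paragraph,\tline one.\n\n\n\nSecond    paragraph."
print(normalize_whitespace(sample))
```

Whether to keep paragraph breaks depends on the chunker: a fixed-size splitter does not care, while a paragraph-aware splitter does.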

---

### Remove Boilerplate & Noise

```python
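# Match whole boilerplate phrases; removing single words such as "Welcome"
# would leave fragments like "to Support Portal" behind.
noise_patterns = [r"Welcome to Support Portal", r"Contact us at \S+", r"© \d{4} Company Inc\."]

for pattern in noise_patterns:
    text = re.sub(pattern, "", text)

text = re.sub(r"\s+", " ", text).strip()  # collapse the gaps left behind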
```

---

### Remove PII (Basic Example)

```python
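# A stricter pattern than \S+@\S+, which would also swallow trailing
# punctuation such as the period in "mail me@example.com."
text = re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[EMAIL_REDACTED]", text)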
```
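
Email addresses are only one kind of PII. The sketch below extends the same idea to a small table of patterns with typed placeholders. The patterns and the `scrub_pii` name are illustrative assumptions; regexes like these miss many real-world formats, so treat this as a starting point and not as a complete scrubber.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # North American style numbers such as 555-123-4567 or (555) 123-4567.
    "PHONE": re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"),
}


def scrub_pii(text: str) -> str:
    """Replace each match with a placeholder naming the PII type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text


print(scrub_pii("Mail jane.doe@example.com or call 555-123-4567."))
```

Typed placeholders keep the sentence readable for the embedding model while removing the sensitive value itself.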

---

### Deduplicate Content

```python
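# Split into sentences and drop repeats while preserving order;
# set() would shuffle the sentences.
sentences = re.split(r"(?<=[.!?])\s+", text)
text = " ".join(dict.fromkeys(sentences))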
```
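
Duplicates also show up across documents, for example when the same page is ingested twice. A minimal sketch of exact-duplicate removal across documents, assuming documents are plain strings and that differences in case and spacing should not count as distinct content (`fingerprint` and `dedupe_documents` are hypothetical helpers):

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-insensitive form of the text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_documents(docs: list[str]) -> list[str]:
    seen = set()
    unique = []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:  # keep the first occurrence, preserve order
            seen.add(key)
            unique.append(doc)
    return unique


docs = [
    "This product supports RAG.",
    "this  product supports RAG. ",
    "Billing is handled monthly.",
]
print(dedupe_documents(docs))
```

Hashing only catches exact matches after normalization. Near-duplicates, such as two versions of a page that differ by a sentence, need a similarity-based method instead.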

---

### Final Cleaned Output

```python
print(text)
```

Result:

```
This product supports RAG.
```

---

### Convert to LLM Document

```python
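# Recent LangChain releases expose Document from langchain_core;
# older releases used `from langchain.schema import Document`.
from langchain_core.documents import Document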

doc = Document(
    page_content=text,
    metadata={"source": "support_page"}
)
```

Ready for:

* Chunking
* Embeddings
* Vector DB

---

### Production Cleaning Checklist

| Step                     | Required |
| ------------------------ | -------- |
| HTML stripping           | ✅        |
| Boilerplate removal      | ✅        |
| Whitespace normalization | ✅        |
| Encoding normalization   | ✅        |
| PII scrubbing            | ✅        |
| Deduplication            | ✅        |
| Language normalization   | ✅        |
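
Encoding normalization appears in the checklist but not in the walkthrough. A minimal sketch using only the standard library, assuming the input arrives as UTF-8 bytes (the `normalize_encoding` name is illustrative):

```python
import unicodedata


def normalize_encoding(raw: bytes) -> str:
    # Decode as UTF-8; undecodable bytes become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    # NFKC folds compatibility forms such as ligatures, no-break spaces,
    # and full-width letters into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters (categories Cc and Cf),
    # keeping newline and tab.
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


# "fi" ligature, zero-width space, no-break space, full-width "ok"
raw = "\ufb01le\u200b name\u00a0\uff4f\uff4b".encode("utf-8")
print(normalize_encoding(raw))
```

Dropping all format characters is a blunt choice: it also removes joiners that some scripts and emoji sequences rely on, so multilingual corpora may need a narrower filter.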

---

### Common Data Cleaning Mistakes

* Over-cleaning (losing meaning)
* Leaving navigation text
* Not preserving metadata
* Cleaning after chunking
* Skipping deduplication
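
Over-cleaning is hard to spot by eye across thousands of documents. One rough guard is to compare text length before and after cleaning and flag documents that lost most of their content. The helper names and the `min_ratio` default below are illustrative assumptions, and the threshold needs tuning per data source:

```python
def retention_ratio(raw: str, cleaned: str) -> float:
    """Fraction of the raw character count that survived cleaning."""
    return len(cleaned) / len(raw) if raw else 1.0


def flag_over_cleaned(raw: str, cleaned: str, min_ratio: float = 0.2) -> bool:
    # min_ratio is an arbitrary illustrative threshold, not a recommendation.
    return retention_ratio(raw, cleaned) < min_ratio


raw = "<div>Welcome</div> This product supports RAG. © 2024 Company Inc."
print(flag_over_cleaned(raw, "This product supports RAG."))
print(flag_over_cleaned(raw, "RAG"))
```

A low ratio is not proof of a problem, since markup-heavy pages legitimately shrink a lot, but flagged documents are worth a manual look.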

---

### Mental Model

```
Data Cleaning = Noise Reduction for the Brain of the System
```

---

### Key Takeaways

* Cleaning is often one of the **highest-ROI steps** in a RAG pipeline
* It directly improves accuracy and faithfulness
* Always clean **before** chunking and embedding
* Treat cleaning as a core ML operation, not preprocessing trivia