```{contents}
```
## Toxicity Filtering

### 1. Motivation & Intuition

**Toxicity filtering** is the process of detecting and controlling harmful, offensive, abusive, or unsafe content in user inputs or model outputs.

Why it matters in ML systems:

| Risk              | Impact                            |
| ----------------- | --------------------------------- |
| Hate speech       | Legal, ethical, reputational harm |
| Harassment        | User safety, product trust        |
| Extremism         | Security & compliance             |
| Self-harm content | Human safety                      |
| Profanity         | Platform moderation               |

Large Language Models (LLMs) deployed in user-facing products generally need some form of toxicity filtering, because unfiltered outputs expose the operator to every risk listed above.

---

### 2. What is "Toxicity"?

Toxicity is typically defined across multiple dimensions:

| Category         | Examples              |
| ---------------- | --------------------- |
| Hate             | racism, sexism, slurs |
| Harassment       | insults, bullying     |
| Threats          | violence, coercion    |
| Sexual content   | explicit, abusive     |
| Self-harm        | suicide encouragement |
| Illegal activity | drugs, terrorism      |

This is **multi-label classification**, not a single yes/no decision.
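To make the multi-label point concrete, the sketch below encodes a message's categories as a multi-hot vector, where several entries can be 1 at once. The category identifiers and their order are assumptions chosen to mirror the table above, not a standard taxonomy.

```python
from typing import List, Set

# Category names mirror the table above; the order is an arbitrary choice.
CATEGORIES: List[str] = [
    "hate", "harassment", "threats",
    "sexual_content", "self_harm", "illegal_activity",
]

def to_multi_hot(active: Set[str]) -> List[int]:
    """Return one 0/1 entry per category; several entries may be 1 at once."""
    unknown = active - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if name in active else 0 for name in CATEGORIES]

# A message that is both a threat and harassment activates two labels.
print(to_multi_hot({"threats", "harassment"}))  # [0, 1, 1, 0, 0, 0]
# A benign message activates none.
print(to_multi_hot(set()))                      # [0, 0, 0, 0, 0, 0]
```

A single yes/no label would lose the distinction between these cases, which matters later when different categories get different thresholds or actions.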

---

### 3. System Architecture

A modern toxicity filtering pipeline:

```
User Input
   ↓
Pre-Filter (Fast, rule-based)
   ↓
Neural Toxicity Classifier
   ↓
Policy Engine (thresholding, action)
   ↓
Response Control / Blocking / Rewrite
   ↓
Final Output
```

At generation time, the filter sits between the model and the user:

```
Prompt → LLM → Candidate Output → Toxicity Filter → Safe Output
```
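The input-side pipeline above can be sketched in a few lines. This is a minimal illustration, not a production design: `BLOCKLIST`, `Decision`, `run_pipeline`, and `stub_classifier` are hypothetical names, and the stub stands in for a real neural classifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical blocklist for the fast pre-filter; real lists are far larger.
BLOCKLIST = re.compile(r"\b(idiot|moron)\b", re.IGNORECASE)

@dataclass
class Decision:
    action: str               # "ALLOW" or "BLOCK"
    stage: str                # which stage made the decision
    scores: Dict[str, float]  # per-label scores (empty if pre-filter fired)

def run_pipeline(text: str,
                 classify: Callable[[str], Dict[str, float]],
                 threshold: float = 0.7) -> Decision:
    # Stage 1: cheap rule-based pre-filter short-circuits obvious cases.
    if BLOCKLIST.search(text):
        return Decision("BLOCK", "pre_filter", {})
    # Stage 2: the classifier returns one score per label.
    scores = classify(text)
    # Stage 3: the policy engine turns scores into an action.
    if scores and max(scores.values()) > threshold:
        return Decision("BLOCK", "policy", scores)
    return Decision("ALLOW", "policy", scores)

def stub_classifier(text: str) -> Dict[str, float]:
    # Stand-in for a real model: treats the word "hurt" as a likely threat.
    return {"toxic": 0.1, "threat": 0.9 if "hurt" in text.lower() else 0.05}

print(run_pipeline("you idiot", stub_classifier).stage)         # pre_filter
print(run_pipeline("I will hurt you", stub_classifier).action)  # BLOCK
print(run_pipeline("hello there", stub_classifier).action)      # ALLOW
```

Putting the rule-based stage first keeps the expensive model call off the path for inputs that a regex can already reject.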

---

### 4. Types of Toxicity Filtering

| Type                    | Description                          |
| ----------------------- | ------------------------------------ |
| Rule-based              | Regex, keyword blacklists            |
| Statistical ML          | Logistic regression, SVM             |
| Deep Learning           | Transformers fine-tuned for toxicity |
| Hybrid                  | Rules + neural model                 |
| Reinforcement Filtering | Reward models penalize toxicity      |

---

### 5. Neural Toxicity Classifiers

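Most production systems use **fine-tuned Transformer encoders** (for example, BERT-style models) trained on labeled toxicity data.

Common data sources:

* Jigsaw Toxic Comment (Wikipedia talk-page comments)
* Hatebase (a hate-speech lexicon rather than a labeled corpus)
* Civil Comments
* OpenAI moderation data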

Typical labels (the first six are the Jigsaw Toxic Comment labels):

| Label         |
| ------------- |
| toxic         |
| severe_toxic  |
| insult        |
| threat        |
| obscene       |
| identity_hate |
| sexual        |
| self_harm     |

---

### 6. Training Workflow

```
Text → Tokenizer → Transformer Encoder → Multi-label classification head → Sigmoid outputs
```

Loss:

```
Binary Cross Entropy (multi-label)
```
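As a rough illustration of the loss named above, here is multi-label binary cross-entropy computed directly from logits in plain Python. The function name `multilabel_bce` is hypothetical; in PyTorch the same quantity is typically obtained with `torch.nn.BCEWithLogitsLoss`. The formula used is the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))`, which equals `-[y*log(sigmoid(z)) + (1-y)*log(1 - sigmoid(z))]`.

```python
import math
from typing import Sequence

def multilabel_bce(logits: Sequence[float], targets: Sequence[int]) -> float:
    """Mean binary cross-entropy over the labels of a single example."""
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Each label is scored independently, so one example can be "threat" and
# "insult" at the same time. Targets are multi-hot vectors.
targets = [1, 0, 1]
good = multilabel_bce([4.0, -4.0, 4.0], targets)   # confident and correct
bad = multilabel_bce([-4.0, 4.0, -4.0], targets)   # confident and wrong
print(good < bad)  # True
```

Because every label has its own sigmoid and its own loss term, adding a new category does not force the existing ones to compete for probability mass, as a softmax would.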

---

### 7. Demonstration (PyTorch)

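```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Publicly hosted multi-label toxicity classifier.
model_name = "unitary/toxic-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

text = "I hate you and want to hurt you"
# Truncate so long inputs do not exceed the model's maximum sequence length.
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
    # Sigmoid, not softmax: each label is scored independently (multi-label).
    probs = torch.sigmoid(logits)

# Expected label order for this checkpoint; verify against model.config.id2label.
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
for label, p in zip(labels, probs[0]):
    print(label, float(p))
```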

---

### 8. Policy Layer (Decision Logic)

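```python
THRESHOLD = 0.7

# `probs` is the tensor of per-label probabilities from the demonstration above.
# Block if any single label exceeds the threshold.
if probs.max().item() > THRESHOLD:
    action = "BLOCK"
else:
    action = "ALLOW"
```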

More advanced policies:

* Soft rewrite (rephrase safely)
* Partial masking
* Human review escalation
* Safety warnings
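One way to move beyond a single global threshold is to give each label its own limit and add a middle tier that escalates borderline cases to human review. The sketch below assumes scores arrive as a plain dict; the threshold values, `REVIEW_MARGIN`, and the `decide` function are illustrative assumptions, not recommended settings.

```python
from typing import Dict

# Hypothetical per-label block thresholds: higher-severity categories
# are blocked at lower confidence.
BLOCK_THRESHOLDS: Dict[str, float] = {
    "threat": 0.5,
    "self_harm": 0.4,
    "identity_hate": 0.6,
}
DEFAULT_BLOCK = 0.8   # used for labels without an explicit entry
REVIEW_MARGIN = 0.2   # scores this close below a block limit go to a human

def decide(scores: Dict[str, float]) -> str:
    """Return "BLOCK", "REVIEW", or "ALLOW" for one message."""
    action = "ALLOW"
    for label, p in scores.items():
        limit = BLOCK_THRESHOLDS.get(label, DEFAULT_BLOCK)
        if p >= limit:
            return "BLOCK"       # any label over its limit blocks outright
        if p >= limit - REVIEW_MARGIN:
            action = "REVIEW"    # borderline: escalate unless another label blocks
    return action

print(decide({"toxic": 0.10, "threat": 0.55}))  # BLOCK
print(decide({"toxic": 0.65, "threat": 0.05}))  # REVIEW
print(decide({"toxic": 0.10, "threat": 0.05}))  # ALLOW
```

In practice the thresholds would be tuned per label on a validation set, since the cost of a missed threat differs from the cost of a missed profanity.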

---

### 9. Filtering at Generation Time

During decoding:

```
LLM logits → Apply toxicity penalty → Sample next token
```

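Here a toxicity signal (from a classifier or a blocklist) lowers the logits of tokens that are likely to lead to toxic continuations before the next token is sampled.

A related but distinct technique works at training time: in **RLHF safety alignment**, a reward model penalizes toxic outputs so that the fine-tuned LLM becomes less likely to produce them in the first place.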
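The decoding-time penalty in the diagram above can be illustrated with a toy vocabulary. This is a deliberately simplified sketch: real decoders work on token IDs and often score whole continuations rather than single tokens, and the names `penalize_logits` and `softmax` here are hypothetical helpers, not a library API. Hugging Face `generate` offers related hooks such as `bad_words_ids` and custom logits processors.

```python
import math
from typing import Dict, Set

def penalize_logits(logits: Dict[str, float],
                    flagged: Set[str],
                    penalty: float) -> Dict[str, float]:
    """Subtract a fixed penalty from the logit of every flagged token."""
    return {tok: (z - penalty if tok in flagged else z)
            for tok, z in logits.items()}

def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert logits to probabilities (max-subtracted for stability)."""
    m = max(logits.values())
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy next-token logits; the flagged set would come from a toxicity signal.
logits = {"friend": 1.0, "idiot": 1.2, "person": 0.8}
before = softmax(logits)
after = softmax(penalize_logits(logits, flagged={"idiot"}, penalty=5.0))

print(max(before, key=before.get))  # idiot
print(max(after, key=after.get))    # friend
```

A finite penalty makes the flagged token unlikely rather than impossible; setting the logit to negative infinity would ban it outright, at the cost of more brittle output.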

---

### 10. Evaluation Metrics

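| Metric       | Meaning                                         |
| ------------ | ----------------------------------------------- |
| Precision    | Share of flagged content that is actually toxic |
| Recall       | Share of toxic content that gets flagged        |
| F1           | Harmonic mean of precision and recall           |
| AUROC        | Ranking quality across all thresholds           |
| Human review | Real-world validity                             |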
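For a single label, precision, recall, and F1 follow directly from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels and predictions purely for illustration; `precision_recall_f1` is a hypothetical helper name.

```python
from typing import Sequence, Tuple

def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Illustrative data: 1 = toxic, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# tp=2, fp=1, fn=2 -> precision 2/3, recall 1/2, F1 4/7
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```

In a multi-label setting these metrics are usually computed per label and then averaged, because rare labels such as threats can be hidden by strong performance on common ones.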

---

### 11. Limitations

* Context dependence (sarcasm, quoting)
* Cultural & linguistic bias
* Adversarial attacks ("h@te", spacing, emojis)
* Over-filtering harms helpfulness
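The adversarial spellings mentioned above ("h@te", spaced-out letters) are often countered by normalizing text before it reaches a blocklist or classifier. The sketch below is a coarse illustration under stated assumptions: the substitution table `SUBSTITUTIONS` is a small hypothetical sample, the output is meant only as a matching key (never for display), and it does not handle emojis.

```python
import re
import unicodedata

# Hypothetical, deliberately small table of look-alike substitutions.
SUBSTITUTIONS = str.maketrans({"@": "a", "0": "o", "3": "e", "1": "i", "$": "s"})

# Runs of single characters split by spaces, dots, hyphens, or asterisks.
SPACED_OUT = re.compile(r"\b(?:\w[\s.\-*]+){2,}\w\b")

def normalize(text: str) -> str:
    """Return a matching key with common obfuscations undone."""
    # NFKC folds compatibility forms such as full-width letters to ASCII.
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(SUBSTITUTIONS)
    # Collapse "h a t e" or "h.a.t.e" into "hate".
    return SPACED_OUT.sub(
        lambda m: re.sub(r"[\s.\-*]+", "", m.group(0)), text)

print(normalize("h@te"))         # hate
print(normalize("H A T E you"))  # hate you
print(normalize("h.a.t.e"))      # hate
```

Normalization of this kind trades precision for recall: it also rewrites innocent text (digits become letters, short initials get merged), which is one reason it is applied to a copy of the input rather than to the text itself.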

---

### 12. Where It Is Used

* Chatbots & assistants
* Social media moderation
* Comment filtering
* Search engines
* Online gaming chat
* Education platforms

---

### 13. Key Takeaways

* Toxicity filtering is a **core safety layer** of modern AI systems.
* Implemented using **multi-label Transformer classifiers** plus policy logic.
* Should run on **both user inputs (before generation) and model outputs (after generation)**.
* Balances **safety, fairness, and usability**.
