```{contents}
```
## Content Moderation

### 1. Definition and Motivation

**Content Moderation** is the process of automatically or semi-automatically identifying, filtering, and managing user-generated content that violates platform policies, legal requirements, or ethical standards.

**Why it matters:**

| Risk                     | Consequence                       |
| ------------------------ | --------------------------------- |
| Toxic or abusive content | User harm, platform degradation   |
| Misinformation           | Public trust erosion              |
| Illegal content          | Legal liability                   |
| Unsafe AI outputs        | Model misuse, reputational damage |

Modern moderation systems are **machine-learning pipelines** designed to operate at scale with strict accuracy, latency, and fairness constraints.

---

### 2. Moderation Taxonomy (What is moderated)

| Category             | Examples                              |
| -------------------- | ------------------------------------- |
| Hate & Harassment    | Slurs, threats, bullying              |
| Violence & Extremism | Terrorist propaganda, violent threats |
| Adult & Sexual       | Explicit content, exploitation        |
| Self-harm            | Suicide ideation, encouragement       |
| Drugs & Weapons      | Trafficking, manufacturing            |
| Misinformation       | Medical, political falsehoods         |
| Privacy Violations   | Leaks of personal data                |

Each category typically contains **subclasses** with different severity levels.

---

### 3. System Architecture

```
User Content
     ↓
Preprocessing (cleaning, language detection)
     ↓
Feature Extraction / Embedding
     ↓
Moderation Models (multi-task classifiers)
     ↓
Policy Engine (thresholds + rules)
     ↓
Action (allow / warn / block / escalate)
```

Key properties:

* **Low latency**
* **High recall for severe harms**
* **Human-in-the-loop for edge cases**
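
The stages in the diagram can be wired together as plain functions. The following is a minimal sketch, not a production design: `preprocess`, `toy_scorer`, `policy`, and `moderate` are hypothetical names, the scorer is a keyword stub standing in for the embedding and classifier stages, and language detection is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Decision:
    action: str
    scores: Dict[str, float]


def preprocess(text: str) -> str:
    # Cleaning only: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def toy_scorer(text: str) -> Dict[str, float]:
    # Hypothetical stand-in for "Feature Extraction" + "Moderation Models".
    return {"violence": 0.9 if "hurt" in text else 0.05}


def policy(scores: Dict[str, float], block_at: float = 0.6) -> str:
    # Simplest possible policy: block if any category reaches the cutoff.
    return "block" if max(scores.values()) >= block_at else "allow"


def moderate(text: str, scorer: Callable[[str], Dict[str, float]] = toy_scorer) -> Decision:
    cleaned = preprocess(text)
    scores = scorer(cleaned)
    return Decision(action=policy(scores), scores=scores)
```

Passing the scorer as an argument keeps the pipeline testable without loading a model, and lets a real classifier be swapped in later.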

---

### 4. Modeling Approaches

| Approach      | Description                                         |
| ------------- | --------------------------------------------------- |
| Rule-based    | Regex, keyword lists                                |
| Classical ML  | TF-IDF + logistic regression                        |
| Deep Learning | Transformers (BERT, RoBERTa, GPT-style classifiers) |
| Hybrid        | Model + rules + heuristics                          |

Modern systems typically use **multi-label transformers**: one shared encoder feeds a separate output per category.
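
As an illustration of the hybrid row, the sketch below lets a regex rule override a model score when it fires. The patterns in `RULES` and the functions `rule_flags` and `hybrid_scores` are hypothetical, and forcing a score to 1.0 on a rule hit is only one possible combination strategy.

```python
import re

# Hypothetical patterns; a real list would be much larger and policy-reviewed.
RULES = {
    "violence": re.compile(r"\b(kill|hurt|stab)\b", re.IGNORECASE),
    "drugs": re.compile(r"\bbuy\s+(cocaine|heroin)\b", re.IGNORECASE),
}


def rule_flags(text):
    """Return the set of categories whose rule matches the text."""
    return {label for label, pattern in RULES.items() if pattern.search(text)}


def hybrid_scores(text, model_scores):
    """Keep the model score per category, but force 1.0 where a rule fires."""
    flagged = rule_flags(text)
    labels = set(model_scores) | flagged
    return {
        label: 1.0 if label in flagged else model_scores.get(label, 0.0)
        for label in labels
    }
```

Rules give deterministic coverage of known patterns, while the model handles paraphrases the rules miss.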

---

### 5. Multi-Label Moderation Formulation

Given input text `x`, predict a binary label vector:

```
y = [hate, violence, sexual, self_harm, drugs, misinformation, ...]
```

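Each label is scored independently with its own sigmoid, so labels are not mutually exclusive. Here `h_x` is the encoder representation of `x` and `W_i` is the weight vector for label `i`: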

```
P(y_i = 1 | x) = σ(W_i · h_x)
```

The loss sums binary cross-entropy over the labels, with `ŷ_i = P(y_i = 1 | x)`:

```
L = Σ BinaryCrossEntropy(y_i, ŷ_i)
```
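
To make the formulation concrete, here is a small worked example in plain Python. The logits and targets are made-up numbers, and `sigmoid` and `multilabel_bce` are illustrative helpers rather than a library API.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Sum of per-label binary cross-entropy, as in the loss above."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)  # P(y_i = 1 | x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


# Made-up logits W_i · h_x for three labels, and their true labels.
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 0]
loss = multilabel_bce(logits, targets)  # about 1.1333
```

Note that PyTorch's `BCEWithLogitsLoss` averages over elements by default rather than summing, so its value differs from this sum by a constant factor.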

---

### 6. Example: Training a Moderation Classifier

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=5, problem_type="multi_label_classification"
)

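text = "I will hurt you if you come here again."
# Multi-hot targets must be float: the multi-label loss is BCEWithLogitsLoss.
# Label order: hate, violence, sexual, self_harm, drugs
labels = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0]])

inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs, labels=labels)  # forward pass; the loss is computed from the labels
loss = outputs.loss  # scalar tensor; call loss.backward() inside a training loop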
```

---

### 7. Decision & Policy Layer

| Score Range | Action               |
| ----------- | -------------------- |
| < 0.3       | Allow                |
| 0.3 – 0.6   | Soft warning         |
| 0.6 – 0.85  | Block                |
| > 0.85      | Immediate escalation |

Thresholds are tuned per category based on **risk tolerance**.
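
A threshold table like this maps naturally onto a sorted list of cut points. The sketch below assumes that a score exactly on a boundary takes the stricter action, and the `self_harm` override in `CATEGORY_CUTS` is a hypothetical example of per-category tuning, not a recommended setting.

```python
from bisect import bisect_right

ACTIONS = ["allow", "warn", "block", "escalate"]  # ordered by severity
SEVERITY = {action: rank for rank, action in enumerate(ACTIONS)}

DEFAULT_CUTS = (0.3, 0.6, 0.85)
# Hypothetical override: stricter cut points for a severe category.
CATEGORY_CUTS = {"self_harm": (0.2, 0.4, 0.7)}


def action_for(category, score):
    """Map one category score to an action using that category's cut points."""
    cuts = CATEGORY_CUTS.get(category, DEFAULT_CUTS)
    return ACTIONS[bisect_right(cuts, score)]


def decide(scores):
    """Return the most severe action across all category scores."""
    actions = (action_for(cat, score) for cat, score in scores.items())
    return max(actions, key=SEVERITY.__getitem__, default="allow")
```

For example, `decide({"hate": 0.1, "self_harm": 0.5})` returns `"block"` because the stricter `self_harm` cut points apply.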

---

### 8. Human-in-the-Loop Workflow

```
Model Prediction
     ↓
Uncertain or High-Risk?
     ↓ Yes
Human Review
     ↓
Policy Feedback → Dataset Update → Model Retraining
```

This loop helps counter data drift and improves rare-class recall, because reviewed edge cases flow back into the training set.
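
The "Uncertain or High-Risk?" branch can be sketched as a routing function. Everything here is an assumption for illustration: the `HIGH_RISK` set, the 0.6 threshold, the 0.15 margin that defines "uncertain", and the in-memory `review_queue`.

```python
HIGH_RISK = {"self_harm", "violence"}  # hypothetical severe categories


def needs_review(scores, threshold=0.6, margin=0.15):
    """Return the reasons an item should go to a human reviewer, if any."""
    reasons = []
    for category, score in scores.items():
        if abs(score - threshold) <= margin:
            # Score is close to the decision boundary.
            reasons.append(f"uncertain:{category}")
        elif category in HIGH_RISK and score >= threshold:
            # Confident positive in a severe category.
            reasons.append(f"high_risk:{category}")
    return reasons


review_queue = []  # stand-in for a real review tool


def route(item_id, scores):
    reasons = needs_review(scores)
    if reasons:
        review_queue.append((item_id, reasons))
        return "human_review"
    return "auto"
```

Reviewer decisions on the queued items would then be added to the training data before the next retraining run.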

---

### 9. Evaluation Metrics

| Metric      | Why                        |
| ----------- | -------------------------- |
| Recall      | Must catch harmful content |
| Precision   | Minimize false bans        |
| ROC-AUC     | Ranking quality            |
| Calibration | Reliable probabilities     |
| Latency     | Real-time performance      |

For severe categories, thresholds are set to favor **recall over precision**, since a missed violation costs more than a false positive.
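
The recall-first trade-off can be made explicit by picking the highest threshold that still meets a recall target. This is a sketch on made-up data: `precision_recall` and `highest_threshold_for_recall` are illustrative helpers, and returning 1.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for one label at a given score threshold."""
    pred = [s >= threshold for s in scores]
    tp = sum(1 for t, p in zip(y_true, pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def highest_threshold_for_recall(y_true, scores, target_recall):
    """Scan candidate thresholds from strict to lenient; stop at the first
    one whose recall meets the target. Returns None if none does."""
    for threshold in sorted(set(scores), reverse=True):
        if precision_recall(y_true, scores, threshold)[1] >= target_recall:
            return threshold
    return None


# Made-up validation data for a single category.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
```

On this data, full recall requires lowering the threshold to 0.4, which admits one false positive and drops precision to 0.75.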

---

### 10. Deployment Challenges

* Class imbalance (toxic content is rare)
* Adversarial attacks (obfuscation, code-switching)
* Cultural & linguistic variation
* Policy evolution
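
One common mitigation for obfuscation is to normalize text before classification. The sketch below is deliberately aggressive and assumes the normalized copy is only fed to the classifier, never shown to users: the `LEET` map is a small hypothetical substitution table, and it will also rewrite legitimate digits and hyphenated words.

```python
import re
import unicodedata

# Hypothetical, incomplete map of common character substitutions.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(text):
    # Fold Unicode look-alikes such as fullwidth letters to plain forms.
    text = unicodedata.normalize("NFKC", text).lower()
    # Undo simple character substitutions.
    text = text.translate(LEET)
    # Drop separators placed between word characters: "h.a.t.e" becomes "hate".
    text = re.sub(r"(?<=\w)[.\-_*]+(?=\w)", "", text)
    # Collapse runs of three or more repeated characters to one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return text
```

Keeping the original text alongside the normalized copy allows reviewers to see exactly what the user wrote.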

---

### 11. Advanced Topics

* **Contrastive safety training**
* **Active learning for rare harms**
* **Adversarial data augmentation**
* **Cross-lingual moderation**
* **Context-aware moderation (conversation-level)**
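
As one example from this list, active learning is often implemented as uncertainty sampling: send the items the model is least sure about to annotators first. The sketch below assumes a single-label probability per item; `uncertainty` and `select_for_labeling` are illustrative names, and the pool is made-up data.

```python
def uncertainty(p):
    """1.0 when p == 0.5 (least certain), 0.0 when p is 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def select_for_labeling(pool, budget):
    """Pick the `budget` most uncertain items from (item_id, probability) pairs."""
    ranked = sorted(pool, key=lambda item: uncertainty(item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:budget]]


# Made-up unlabeled pool with model probabilities for one rare category.
pool = [("a", 0.02), ("b", 0.48), ("c", 0.91), ("d", 0.60)]
selected = select_for_labeling(pool, budget=2)  # ["b", "d"]
```

For rare harms this concentrates a limited labeling budget on borderline cases instead of the large mass of clearly benign content.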

---

### 12. Summary

Content moderation is a **high-stakes multi-task ML system** combining:

* Transformer classifiers
* Policy engines
* Human feedback loops
* Continuous evaluation
