```{contents}
```
## Knowledge Distillation

**Knowledge Distillation** is a model compression and training paradigm where a **large, powerful model (teacher)** transfers its learned knowledge to a **smaller, faster model (student)** so the student achieves high performance with much lower computational cost.

---

### 1. Why Distillation Exists

Modern models are accurate but expensive:

| Problem         | Effect                       |
| --------------- | ---------------------------- |
| Large models    | High latency, memory, energy |
| Edge deployment | Often impossible             |
| Cloud cost      | Expensive at scale           |

**Goal of KD:**

> Preserve **teacher-level performance** while achieving **student-level efficiency**.

---

### 2. Core Intuition

Instead of training the student only on hard labels:

```
cat → 1
dog → 0
```

we train it using the **teacher’s probability distribution**:

```
Teacher: [cat: 0.72, tiger: 0.18, fox: 0.06, dog: 0.04]
```

This distribution contains **dark knowledge**:

* inter-class similarity
* decision boundaries
* uncertainty information

The student learns **how** the teacher thinks, not just what it predicts.

---

### 3. Formal Definition

Let:

* Teacher: $T(x)$, with logits $z_T$
* Student: $S(x)$, with logits $z_S$
* Ground truth: $y$
* Temperature: $\tau$

**Soft targets:**

$$
p_T = \text{softmax}(z_T / \tau)
\quad
p_S = \text{softmax}(z_S / \tau)
$$
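
To see what the temperature does to the soft targets, here is a minimal pure-Python sketch. The logits are hypothetical: they are chosen so that $\tau = 1$ reproduces the teacher distribution from the cat/tiger/fox/dog example, and the helper name `softmax_with_temperature` is an assumption for illustration only.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau), computed in a numerically stable way."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "tiger", "fox", "dog"]
# Hypothetical logits: log-probabilities of the teacher output shown earlier,
# so tau = 1 recovers [0.72, 0.18, 0.06, 0.04].
logits = [math.log(p) for p in (0.72, 0.18, 0.06, 0.04)]

for tau in (1.0, 4.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}:", {c: round(p, 3) for c, p in zip(classes, probs)})
```

Raising $\tau$ flattens the distribution while keeping the class ranking, so the small probabilities on `tiger`, `fox`, and `dog` contribute more to the loss.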

**Distillation Loss:**

$$
\mathcal{L} = \alpha \cdot \mathrm{CE}(y, S(x)) + (1-\alpha) \cdot \tau^2 \cdot \mathrm{KL}(p_T \,\|\, p_S)
$$
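
As a worked illustration of this loss, the sketch below evaluates it for a single example in pure Python. The logits, the label index, and the default values of `tau` and `alpha` are hypothetical, and the cross-entropy term is assumed to use the student's untempered ($\tau = 1$) prediction.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(z_s, z_t, label, tau=4.0, alpha=0.7):
    # Hard-label term: cross-entropy of the student's tau = 1 prediction.
    ce = -math.log(softmax(z_s)[label])
    # Soft-label term: KL(p_T || p_S) between the temperature-softened outputs.
    p_t = softmax(z_t, tau)
    p_s = softmax(z_s, tau)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * tau**2 * kl

# Hypothetical logits for one 4-class example whose true class is index 0.
teacher_logits = [5.0, 3.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.2, 0.1]
print(distillation_loss(student_logits, teacher_logits, label=0))
```

When the student's logits equal the teacher's, the KL term is zero and only the weighted cross-entropy remains.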

---

### 4. Training Workflow

```
1. Train Teacher model
2. Freeze Teacher
3. For each batch:
      - Teacher produces soft labels
      - Student predicts
      - Compute distillation loss
      - Update Student
```
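
The workflow above can be sketched as a short PyTorch loop. Everything here is a toy assumption for illustration: the two models, the random data, the optimizer settings, and the use of a single fixed batch instead of a `DataLoader`. The teacher is also left untrained, since step 1 is assumed to have happened elsewhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy setup: 20 input features, 4 classes, one fixed batch.
teacher = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4))
student = nn.Linear(20, 4)
inputs = torch.randn(32, 20)
labels = torch.randint(0, 4, (32,))

# Step 2: freeze the teacher (step 1, training it, is assumed already done).
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
tau, alpha = 4.0, 0.7

losses = []
for step in range(20):  # step 3: in practice, iterate over batches
    with torch.no_grad():
        teacher_logits = teacher(inputs)   # teacher produces soft labels
    student_logits = student(inputs)       # student predicts

    # Distillation loss: weighted hard-label CE plus temperature-scaled KL.
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau**2
    loss = alpha * hard + (1 - alpha) * soft

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                       # only student parameters are updated
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Running the teacher under `torch.no_grad()` and passing only `student.parameters()` to the optimizer are two independent safeguards that keep the teacher fixed.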

---

### 5. Minimal PyTorch Example

```python
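import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Hard-label term: standard cross-entropy against the ground truth
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soften both outputs with temperature T (tau in the formula).
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits.detach() / T, dim=1)  # no gradient into the teacher

    # Multiplying by T * T keeps the soft-term gradients on the same scale as the hard term
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (T * T)

    # alpha weights the hard loss, (1 - alpha) the soft loss
    return alpha * hard_loss + (1 - alpha) * soft_loss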
```

---

### 6. Types of Distillation

| Type                | Description                             |
| ------------------- | --------------------------------------- |
| Response-based      | Match output probabilities (classic KD) |
| Feature-based       | Match internal representations          |
| Relation-based      | Match relationships between samples     |
| Self-distillation   | Model serves as its own teacher         |
| Online distillation | Teacher and student trained together    |
| Multi-teacher       | Ensemble of teachers                    |

---

### 7. Feature Distillation Example

```python
import torch.nn.functional as F
loss = F.mse_loss(student_feature_map, teacher_feature_map.detach())  # shapes must match
```

Used heavily in **CNN compression** and **vision transformers**.
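
Student and teacher feature maps often differ in channel count, so a direct MSE is not always possible. One common workaround, sketched below, is a small learnable projection on the student side. The tensor shapes, the `adapter` name, and the choice of a 1x1 convolution are assumptions for illustration, not a prescribed recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical intermediate feature maps: (batch, channels, height, width).
teacher_feature_map = torch.randn(8, 256, 14, 14)
student_feature_map = torch.randn(8, 64, 14, 14, requires_grad=True)

# Learnable 1x1 convolution that maps 64 student channels to 256 teacher channels.
adapter = nn.Conv2d(64, 256, kernel_size=1)

projected = adapter(student_feature_map)
# Detach the teacher features so gradients flow only into the student and adapter.
loss = F.mse_loss(projected, teacher_feature_map.detach())
loss.backward()

print(projected.shape, loss.item())
```

The adapter is trained together with the student and is typically discarded after distillation.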

---

### 8. Where Distillation Is Used

| Domain  | Use Case                 |
| ------- | ------------------------ |
| NLP     | BERT → DistilBERT        |
| Vision  | ResNet-152 → ResNet-18   |
| Speech  | Large ASR → mobile ASR   |
| LLMs    | GPT → on-device LLMs     |
| Edge AI | Cloud → embedded devices |

---

### 9. Distillation vs Fine-tuning

| Aspect             | Distillation | Fine-tuning |
| ------------------ | ------------ | ----------- |
| Teacher used       | Yes          | No          |
| Model size change  | Yes          | No          |
| Knowledge transfer | Explicit     | Implicit    |
| Goal               | Compression  | Adaptation  |

---

### 10. Benefits and Limitations

**Advantages**

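* Faster inference, since the student has fewer parameters and operations
* Lower memory footprint
* Often better generalization than training the same student on hard labels alone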

**Limitations**

* Requires a strong, already-trained teacher
* Needs careful tuning of $\tau$ and $\alpha$
* Student architecture still matters

---

### 11. Conceptual Summary

```
Big model → distilled knowledge → Small model
Accuracy largely preserved, cost reduced
```
