```{contents}
```

---

### Intrinsic (Model-Centric) Metrics

*Used during training & research*

| Metric                  | Purpose                                    |
| ----------------------- | ------------------------------------------ |
| Perplexity              | How well the model predicts tokens         |
| Cross-Entropy Loss      | Training optimization objective            |
| Negative Log-Likelihood | Negative log probability of correct tokens |
| Bits-per-Token          | Compression efficiency                     |
| Entropy                 | Uncertainty of predictions                 |

These answer:

> *Is this a good language model mathematically?*


**Model intrinsic metrics** measure the **internal quality of the language model itself**, independent of any downstream task.

They answer one question:

> **How good is this model as a probabilistic language model?**

They are mainly used during:

* Pretraining
* Fine-tuning
* Research benchmarking

They are not meant for evaluating a product in production.

---

### Core Model Intrinsic Metrics

| Metric                  | What It Measures                           |
| ----------------------- | ------------------------------------------ |
| Perplexity              | How well the model predicts the next token |
| Cross-Entropy Loss      | Training optimization objective            |
| Negative Log-Likelihood | Negative log probability of correct tokens |
| Bits-per-Token          | Compression efficiency                     |
| Entropy                 | Uncertainty of predictions                 |

---

### Why Intrinsic Metrics Matter

Intrinsic metrics:

* Track training progress
* Compare base models
* Diagnose under/overfitting
* Guide architecture changes

They **do not** measure usefulness, safety, or business success.
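
As a rough illustration of the "diagnose under/overfitting" use, the sketch below labels a run from its per-epoch training and validation losses. The function name `diagnose_fit`, the `gap_tol` threshold, and the sample loss values are all hypothetical; this is one simple heuristic, not a standard procedure.

```python
def diagnose_fit(train_losses, val_losses, gap_tol=0.1):
    """Heuristic label from per-epoch loss histories.

    `gap_tol` is an assumed tolerance: how far validation loss may rise
    above its best value before the run is flagged.
    """
    if len(train_losses) != len(val_losses) or len(train_losses) < 2:
        raise ValueError("need two equally long histories with 2+ epochs")

    train_improving = train_losses[-1] < train_losses[0]
    # Validation loss has climbed back up from its minimum.
    val_rising = val_losses[-1] > min(val_losses) + gap_tol

    if train_improving and val_rising:
        return "possible overfitting"
    if not train_improving:
        return "possible underfitting or optimization issue"
    return "no obvious issue"


# Illustrative histories: training keeps improving, validation turns upward.
print(diagnose_fit([2.0, 1.5, 1.0, 0.6], [2.1, 1.7, 1.6, 1.9]))
```

In practice the threshold would be tuned per task, and the curves are usually inspected visually as well.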

---

### Key Metric: Perplexity (Demonstration)

#### Formula

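$$
\text{Perplexity} = e^{\text{Cross-Entropy Loss}} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
$$

Here the cross-entropy loss is the mean negative log-likelihood per token, measured in nats (natural logarithm), and $N$ is the number of tokens. Lower perplexity means better language modeling.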

---

#### Example Demonstration (PyTorch-style)

```python
import torch
import torch.nn.functional as F
import math

# Example unnormalized scores (logits) over a 3-token vocabulary
logits = torch.tensor([[2.0, 0.5, 0.1]])  # model output
target = torch.tensor([0])               # correct token index

# Compute cross-entropy loss
loss = F.cross_entropy(logits, target)

# Compute perplexity
perplexity = math.exp(loss.item())

print("Cross-Entropy Loss:", loss.item())
print("Perplexity:", perplexity)
```

#### Interpretation

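* Lower loss → lower perplexity → better model
* A perplexity of *k* roughly means the model is as uncertain as if it were choosing uniformly among *k* tokens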
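
The single-token example extends to a whole sequence by averaging the per-token negative log-likelihood before exponentiating. A minimal sketch in plain Python, assuming the probability the model assigned to each correct token is already available; the helper name `sequence_perplexity` and the sample values are illustrative only.

```python
import math


def sequence_perplexity(token_probs):
    """Perplexity from the probabilities assigned to each correct token.

    `token_probs` is a hypothetical list with one entry per position:
    p(correct token | preceding context).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Mean negative log-likelihood in nats, i.e. the cross-entropy loss.
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that always gives the correct token probability 0.25 behaves
# like a uniform guess among 4 options, so perplexity is about 4.
print(sequence_perplexity([0.25, 0.25, 0.25]))
```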

---

### Negative Log-Likelihood (NLL)

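```python
# Reuses `logits` and `target` from the perplexity example
log_probs = F.log_softmax(logits, dim=1)  # log-probabilities over the vocabulary
nll = -log_probs[0, target]               # same value as F.cross_entropy here
```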

Measures how surprised the model is by the correct token.

---

### Bits-per-Token

```python
# `loss` is in nats; dividing by ln(2) converts it to bits
bits_per_token = loss / math.log(2)
```

Fewer bits per token means the model compresses the text more efficiently.
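
Bits-per-token and perplexity carry the same information in different units: perplexity equals 2 raised to the bits-per-token. The sketch below checks this on a made-up loss value; `nats_to_bits` is an illustrative helper, not a library function.

```python
import math


def nats_to_bits(loss_nats):
    """Convert a cross-entropy loss from nats to bits."""
    return loss_nats / math.log(2)


loss_nats = math.log(8)            # illustrative loss: ln(8) nats
bits = nats_to_bits(loss_nats)     # about 3 bits per token

ppl_from_nats = math.exp(loss_nats)
ppl_from_bits = 2 ** bits

# Both routes give the same perplexity (about 8), up to rounding.
print(bits, ppl_from_nats, ppl_from_bits)
```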

---

### Entropy (Uncertainty)

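```python
# Reuses `logits` from the perplexity example
log_probs = F.log_softmax(logits, dim=1)
# log_softmax avoids log(0) = -inf when a probability underflows to zero
entropy = -(log_probs.exp() * log_probs).sum(dim=1)
```

* High entropy: the model is unsure (probability is spread across many tokens)
* Low entropy: the model is confident (probability is concentrated on a few tokens)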
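
To make the contrast concrete, the following plain-Python sketch compares a uniform distribution with a sharply peaked one. The helper `entropy_nats` and both example distributions are made up for illustration.

```python
import math


def entropy_nats(probs):
    """Shannon entropy in nats of a discrete distribution."""
    # Zero-probability entries contribute nothing (p * log p -> 0 as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally unsure over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # confident in the first token

# The uniform case reaches the maximum, ln(4); the peaked case is much lower.
print(entropy_nats(uniform), entropy_nats(peaked))
```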

---

### When to Use Intrinsic Metrics

| Phase                       | Used? |
| --------------------------- | ----- |
| Pretraining                 | ✅     |
| Fine-tuning                 | ✅     |
| Model architecture research | ✅     |
| Prompt evaluation           | ❌     |
| RAG systems                 | ❌     |
| Chatbot quality             | ❌     |

---

### Why Intrinsic Metrics Fail for Real Systems

A model with low perplexity can still:

* Hallucinate
* Be unsafe
* Be irrelevant
* Be useless for business tasks

Hence intrinsic metrics ≠ product quality.

---

### Mental Model

```
Intrinsic Metrics → Model Fitness
Extrinsic Metrics → System Success
```

---

### Task & Quality Metrics

*Used in real applications*

| Metric       | Purpose                      |
| ------------ | ---------------------------- |
| Correctness  | Factual accuracy             |
| Relevance    | Question–answer alignment    |
| Completeness | Coverage of requirements     |
| Fluency      | Grammar & naturalness        |
| Coherence    | Logical structure            |
| Consistency  | Stability across runs        |
| Specificity  | Avoids generic answers       |
| Conciseness  | Avoids unnecessary verbosity |
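
Of the metrics above, consistency is the easiest to approximate automatically. One crude option, sketched here, is the share of output pairs from repeated runs that match exactly after whitespace and case normalization. The function name `pairwise_agreement` and the sample outputs are hypothetical, and exact match will undercount answers that are paraphrases of each other.

```python
from itertools import combinations


def pairwise_agreement(outputs):
    """Fraction of output pairs that are identical after normalization."""
    # Lowercase and collapse whitespace so trivial differences are ignored.
    normalized = [" ".join(text.lower().split()) for text in outputs]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree with
    return sum(a == b for a, b in pairs) / len(pairs)


# Three runs of the same prompt: two agree, one differs.
runs = ["Paris", "paris ", "Lyon"]
print(pairwise_agreement(runs))  # 1 matching pair out of 3
```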

---

### RAG & Knowledge Grounding Metrics

| Metric                      | Purpose                                     |
| --------------------------- | ------------------------------------------- |
| Retrieval Precision@K       | Fraction of top-K results that are relevant |
| Retrieval Recall@K          | Fraction of relevant docs found in top K    |
| MRR                         | Mean Reciprocal Rank of first relevant doc  |
| Context Relevance           | Context usefulness                          |
| Faithfulness / Groundedness | Hallucination detection                     |
| Hallucination Rate          | % unsupported claims                        |
| Context Coverage            | Info completeness                           |
| Attribution Accuracy        | Traceability to sources                     |
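
The three retrieval metrics at the top of the table have simple closed-form definitions. The sketch below implements them in plain Python, assuming each query comes with a ranked list of retrieved document IDs and a set of IDs judged relevant; the function names and document IDs are illustrative.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved docs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


retrieved = ["d3", "d1", "d7", "d2"]   # ranked results for one query
relevant = {"d1", "d2", "d9"}          # ground-truth relevant docs

print(precision_at_k(retrieved, relevant, 3))   # 1 hit in top 3
print(recall_at_k(retrieved, relevant, 3))      # 1 of 3 relevant docs found
print(reciprocal_rank(retrieved, relevant))     # first hit at rank 2
```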

---

### Safety & Alignment Metrics

| Metric               | Purpose                |
| -------------------- | ---------------------- |
| Toxicity Score       | Harmful language       |
| Bias Score           | Fairness               |
| Policy Compliance    | Rule adherence         |
| Jailbreak Resistance | Attack robustness      |
| Refusal Accuracy     | Proper denial behavior |
| PII Leakage Rate     | Privacy risk           |
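
Refusal accuracy can be computed once each test prompt is labeled with whether it should have been refused and whether the model actually refused it. The sketch below shows one possible definition; the function name `refusal_metrics`, the record layout, and the sample labels are assumptions made for illustration.

```python
def refusal_metrics(records):
    """Summarize refusal behavior.

    `records` is an iterable of (should_refuse, did_refuse) boolean pairs,
    a hypothetical labeling format.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one labeled record")

    correct = sum(1 for should, did in records if should == did)
    benign = [did for should, did in records if not should]
    harmful = [did for should, did in records if should]

    return {
        "refusal_accuracy": correct / len(records),
        # Benign prompts that were refused anyway.
        "over_refusal_rate": sum(benign) / len(benign) if benign else 0.0,
        # Harmful prompts that were answered instead of refused.
        "under_refusal_rate": (
            sum(not did for did in harmful) / len(harmful) if harmful else 0.0
        ),
    }


labeled = [(True, True), (True, False), (False, False), (False, True), (False, False)]
print(refusal_metrics(labeled))
```

Reporting over-refusal and under-refusal separately helps because a single accuracy number hides which direction the errors go.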

---

### Performance Metrics

| Metric          | Purpose                                        |
| --------------- | ---------------------------------------------- |
| Latency         | Response time                                  |
| p50 / p95 / p99 | Median and tail latency percentiles            |
| TTFT            | Time to first token (streaming responsiveness) |
| Throughput      | Requests/sec                                   |
| Error Rate      | Failure ratio                                  |
| Availability    | Uptime                                         |
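
Latency percentiles can be computed from raw request timings in a few lines. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the helper name `percentile` and the latency samples are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Illustrative request latencies in milliseconds.
latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 2500]

print(percentile(latencies_ms, 50))  # 180
print(percentile(latencies_ms, 95))  # 2500
print(percentile(latencies_ms, 99))  # 2500
```

With only ten samples the p95 and p99 values collapse onto the single slowest request, which is why tail percentiles are normally reported over large windows of traffic.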

---

### Cost & Efficiency Metrics

| Metric           | Purpose               |
| ---------------- | --------------------- |
| Token Usage      | Cost driver           |
| Cost per Request | Dollar spend          |
| Cost per User    | Monetization          |
| Retry Cost       | Reliability indicator |
| Fallback Cost    | Capacity indicator    |
| Cache Hit Ratio  | Cost optimization     |
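
Token usage translates into cost per request through per-token pricing, and cache hits reduce how many requests incur that cost. A minimal sketch, assuming separate prices per 1,000 input and output tokens; the function names, prices, and counts below are placeholders rather than any provider's real rates.

```python
def cost_per_request(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request given token counts and per-1k-token prices."""
    return (
        input_tokens / 1000 * price_in_per_1k
        + output_tokens / 1000 * price_out_per_1k
    )


def cache_hit_ratio(hits, misses):
    """Share of lookups served from cache."""
    total = hits + misses
    return hits / total if total else 0.0


# Placeholder numbers: 1200 input tokens, 300 output tokens,
# hypothetical prices of 0.50 and 1.50 currency units per 1k tokens.
cost = cost_per_request(1200, 300, 0.50, 1.50)   # about 1.05
ratio = cache_hit_ratio(hits=30, misses=70)      # 30 of 100 lookups

# Expected spend per incoming request if cache hits cost nothing.
effective_cost = cost * (1 - ratio)
print(cost, ratio, effective_cost)
```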

---

### Reliability & Stability Metrics

| Metric                 | Purpose             |
| ---------------------- | ------------------- |
| Regression Delta       | Quality change      |
| Consistency Drift      | Stability over time |
| Model Degradation Rate | Long-term health    |
| Retry Rate             | Infra health        |
| Fallback Rate          | Capacity health     |
| Timeout Rate           | SLA risk            |

---

### User Experience Metrics

| Metric                  | Purpose                  |
| ----------------------- | ------------------------ |
| User Satisfaction Score | Human rating             |
| Resolution Rate         | Task success             |
| Escalation Rate         | Automation effectiveness |
| Conversation Length     | Efficiency               |
| Abandonment Rate        | UX health                |

---

### Final Mental Model

```
Training Phase → Intrinsic Metrics (Perplexity, Loss)
Production Phase → All Other Metrics
```

Perplexity belongs **only** in **Intrinsic Metrics**.