# Transformer Architecture Deep Dive

**Reference**: [HuggingFace LLM Course — Chapter 1.4](https://huggingface.co/learn/llm-course/chapter1/4)  
**Design doc**: [docs/transformer_architecture_experiment.md](../docs/transformer_architecture_experiment.md)

This notebook walks through 6 probes that empirically verify the theoretical claims from
Chapter 1.4 about attention mechanisms, architecture families, language modeling objectives,
and transfer learning.

## The Three Architecture Families

```
┌─────────────────────┐  ┌──────────────────────┐  ┌─────────────────────────────┐
│  Encoder-Only       │  │  Decoder-Only        │  │  Encoder-Decoder            │
│  (BERT-like)        │  │  (GPT-like)          │  │  (T5-like)                  │
│                     │  │                      │  │                             │
│  Bidirectional      │  │  Causal (L→R)        │  │  Bidir enc + causal dec     │
│  attention          │  │  attention           │  │  + cross-attention          │
│                     │  │                      │  │                             │
│  → Classification   │  │  → Text generation   │  │  → Translation              │
│  → NER, QA          │  │  → Code completion   │  │  → Summarization            │
└─────────────────────┘  └──────────────────────┘  └─────────────────────────────┘
```

## Probe Overview

| # | Probe | What We Measure |
|---|-------|-----------------|
| 1 | Transformer Timeline | Family distribution, scale trends |
| 2 | Causal vs Masked LM | Token predictions, context directionality |
| 3 | Transfer Learning | Pretrained vs scratch accuracy gap |
| 4 | Model Anatomy | Parameter counts, layer breakdown |
| 5 | Attention Visualization | Attention matrices, coreference resolution |
| 6 | Architecture Comparison | Hidden states, masking patterns, output formats |

> **Note**: Probes 2–6 download models on first run (~2 GB total). Subsequent runs use the local cache.

## Setup

In [None]:
import json
import os
import sys
import time
from pathlib import Path

import torch
from dotenv import load_dotenv

# Load .env from project root
load_dotenv(Path.cwd().parent / ".env")

# Ensure project root is on the path so `src.architecture_deepdive` is importable
sys.path.insert(0, str(Path.cwd().parent))

# Detect device
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {DEVICE}")
if DEVICE == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Create results directory
os.makedirs("../results/figures", exist_ok=True)

In [None]:
# Helper: pretty-print results
def show(title, data, indent=2):
    print(f"\n{'─' * 60}")
    print(f"  {title}")
    print(f"{'─' * 60}")
    print(json.dumps(data, indent=indent, default=str, ensure_ascii=False))

---
## Probe 1 — Transformer History Timeline

**Module**: `p1_model_timeline.py`  
**Models used**: None (data only)  

From the course: *"A bit of Transformer history"*

The original Transformer was introduced in June 2017. Since then, models have been
organized into three families:
- **GPT-like** (auto-regressive / decoder-only)
- **BERT-like** (auto-encoding / encoder-only)
- **T5-like** (sequence-to-sequence / encoder-decoder)

This probe builds a structured timeline and analyzes the family distribution over time.
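
As a rough illustration of the kind of analysis this probe performs, the sketch below counts models per family with `collections.Counter`. The `TimelineEntry` dataclass and the four-entry `SAMPLE` list are assumptions made for this example; the real `TIMELINE` in `p1_model_timeline.py` may be structured differently. The dates are the ones given in the course timeline.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:  # hypothetical stand-in for the probe's own record type
    name: str
    date: str    # "Month YYYY", the format the plotting cell parses
    family: str


# Illustrative subset only, not the probe's full timeline.
SAMPLE = [
    TimelineEntry("GPT", "June 2018", "decoder-only"),
    TimelineEntry("BERT", "October 2018", "encoder-only"),
    TimelineEntry("GPT-2", "February 2019", "decoder-only"),
    TimelineEntry("T5", "October 2019", "encoder-decoder"),
]

# Count how many models belong to each architecture family.
family_distribution = Counter(entry.family for entry in SAMPLE)
first_year = min(datetime.strptime(e.date, "%B %Y").year for e in SAMPLE)

print(dict(family_distribution))
print(f"Earliest model in sample: {first_year}")
```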

In [None]:
from src.architecture_deepdive.probes.p1_model_timeline import TIMELINE, run_experiment

result_p1 = run_experiment()

# Display the timeline
print(f"{'Model':<25} {'Date':<18} {'Params':<12} {'Family'}")
print("─" * 75)
for m in TIMELINE:
    print(f"{m.name:<25} {m.date:<18} {m.params:<12} {m.family}")

In [None]:
# Family distribution
print("\nFamily Distribution:")
for family, count in result_p1["family_distribution"].items():
    bar = "█" * (count * 3)
    print(f"  {family:<18} {bar} {count}/{len(TIMELINE)}")

print(f"\nObservation: {result_p1['observation']}")

In [None]:
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Visualize: family distribution pie chart + timeline
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
dist = result_p1["family_distribution"]
colors = {"decoder-only": "#2196F3", "encoder-only": "#4CAF50", "encoder-decoder": "#FF9800"}
ax1.pie(
    dist.values(),
    labels=[f"{k}\n({v})" for k, v in dist.items()],
    colors=[colors.get(k, "#999") for k in dist],
    autopct="%1.0f%%",
    startangle=90,
    textprops={"fontsize": 10},
)
ax1.set_title("Architecture Family Distribution\n(Models from Course Timeline)")

# Timeline scatter
for m in TIMELINE:
    try:
        date = datetime.strptime(m.date, "%B %Y")
    except ValueError:
        continue
    color = colors.get(m.family, "#999")
    ax2.scatter(date, m.family, color=color, s=100, zorder=3)
    ax2.annotate(
        m.name,
        (date, m.family),
        textcoords="offset points",
        xytext=(5, 8),
        fontsize=7,
        rotation=30,
    )

ax2.set_title("Transformer Model Timeline")
ax2.set_xlabel("Date")
ax2.grid(True, alpha=0.3)
ax2.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

plt.tight_layout()
plt.savefig("../results/figures/p1_timeline.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved: results/figures/p1_timeline.png")

---
## Probe 2 — Causal vs Masked Language Modeling

**Module**: `p2_language_modeling.py`  
**Models**: GPT-2 (Causal LM) and BERT-base (Masked LM)  

From the course:
> *"Causal language modeling: predicting the next word having read the n previous words"*  
> *"Masked language modeling: the model predicts a masked word in the sentence"*

### Key Investigation
How does context directionality affect predictions?  
- **CLM (GPT)**: Only sees tokens to the LEFT of the current position
- **MLM (BERT)**: Sees tokens to the LEFT and RIGHT of the masked position
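
The directionality difference can be sketched without any model. The helper below is hypothetical (it is not part of `p2_language_modeling.py`) and uses whole words where real tokenizers would produce subword tokens; it only shows which positions each objective may condition on.

```python
def visible_context(tokens, position, objective):
    """Return the tokens a model may condition on when predicting `position`."""
    left = tokens[:position]
    right = tokens[position + 1:]
    if objective == "clm":
        return left              # causal: past tokens only
    if objective == "mlm":
        return left + right      # masked: both sides of the gap
    raise ValueError(f"unknown objective: {objective!r}")


tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
target = tokens.index("Paris")

clm_context = visible_context(tokens, target, "clm")
mlm_context = visible_context(tokens, target, "mlm")
print("CLM sees:", clm_context)
print("MLM sees:", mlm_context)
```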

In [None]:
from src.architecture_deepdive.probes.p2_language_modeling import (
    run_causal_lm_probe,
    run_masked_lm_probe,
)

# --- 2.1: Causal LM (GPT-2) ---
print("═" * 60)
print("  CAUSAL LANGUAGE MODELING (GPT-2)")
print("  Context: left-to-right only")
print("═" * 60)

clm_result = run_causal_lm_probe(device=DEVICE)

for r in clm_result["results"]:
    print(f"\nPrompt: {r['prompt']!r}")
    print(f"  Direction: {r['context_direction']}")
    print("  Top 5 next-token predictions:")
    for p in r["top_5_predictions"]:
        bar = "█" * int(p["probability"] * 40)
        print(f"    {p['token']:15s} {bar} {p['probability']:.4f}")

In [None]:
# --- 2.2: Masked LM (BERT) ---
print("═" * 60)
print("  MASKED LANGUAGE MODELING (BERT)")
print("  Context: bidirectional (sees LEFT and RIGHT)")
print("═" * 60)

mlm_result = run_masked_lm_probe(device=DEVICE)

for r in mlm_result["results"]:
    print(f"\nSentence: {r['sentence']!r}")
    print(f"  Direction: {r['context_direction']}")
    print("  Top 5 predictions for [MASK]:")
    for p in r["top_5_predictions"]:
        bar = "█" * int(p["probability"] * 40)
        print(f"    {p['token']:15s} {bar} {p['probability']:.4f}")

In [None]:
# --- 2.3: Comparison ---
print("\n" + "═" * 60)
print("  KEY COMPARISON: CLM vs MLM")
print("═" * 60)
print("""
Shared test: Predicting the word after 'The capital of France is'

  CLM (GPT-2):
    Sees: 'The capital of France is' → predicts next token
    Can only use LEFT context (past tokens)

  MLM (BERT):
    Sees: 'The capital of France is [MASK] .' → predicts masked token
    Uses BOTH left and right context (including the period)

  Key difference:
    BERT conditions on both sides of the gap; GPT-2 conditions only on the prefix.
    This is why BERT is better suited to understanding tasks (classification, NER)
    and GPT to generation tasks (text completion, chat).
""")

---
## Probe 3 — Transfer Learning: Pretrained vs From-Scratch

**Module**: `p3_transfer_learning.py`  
**Model**: BERT-Tiny (`google/bert_uncased_L-2_H-128_A-2`: 2 layers, hidden size 128, 2 attention heads)  

From the course:
> *"The pretrained model was already trained on a dataset that has some similarities
> with the fine-tuning dataset. The fine-tuning process is thus able to take advantage
> of knowledge acquired during pretraining."*

### Experiment Design
- **Dataset**: 8 training + 4 test sentiment examples (tiny!)
- **Pretrained**: Fine-tune a pretrained BERT on this data
- **From-scratch**: Train a randomly initialized BERT on the same data
- **Hypothesis**: Pretrained should converge faster and achieve higher accuracy
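
One consequence of the tiny test split is worth making explicit: with 4 examples, accuracy can only take a handful of values. The short calculation below is a sketch of that arithmetic, not part of the probe itself; `NUM_TEST_EXAMPLES` is a name introduced here for illustration.

```python
from fractions import Fraction

NUM_TEST_EXAMPLES = 4  # matches the test split size stated above

# Every accuracy value a 4-example test set can report.
possible_accuracies = [
    Fraction(k, NUM_TEST_EXAMPLES) for k in range(NUM_TEST_EXAMPLES + 1)
]
step = possible_accuracies[1] - possible_accuracies[0]

print([f"{float(a):.0%}" for a in possible_accuracies])
print(f"One test example changes accuracy by {float(step):.0%}")
```

A single flipped prediction therefore moves the reported accuracy by a full step, so the learning curves are best read qualitatively.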

In [None]:
from src.architecture_deepdive.data import TRANSFER_LEARNING_DATA

# Show the dataset
print("Training Data (8 examples):")
for text, label in TRANSFER_LEARNING_DATA["train"]:
    sentiment = "positive" if label == 1 else "negative"
    print(f"  [{sentiment:>8}] {text}")

print("\nTest Data (4 examples):")
for text, label in TRANSFER_LEARNING_DATA["test"]:
    sentiment = "positive" if label == 1 else "negative"
    print(f"  [{sentiment:>8}] {text}")

In [None]:
from src.architecture_deepdive.probes.p3_transfer_learning import train_and_evaluate

model_name = "google/bert_uncased_L-2_H-128_A-2"  # BERT-Tiny: 2 layers, hidden size 128
num_epochs = 10

# --- 3.1: Pretrained fine-tuning ---
print("Training pretrained model...")
t0 = time.perf_counter()
pretrained_result = train_and_evaluate(
    model_name, from_scratch=False, num_epochs=num_epochs, device=DEVICE
)
pt_time = time.perf_counter() - t0
print(f"  Done in {pt_time:.1f}s — Final accuracy: {pretrained_result['final_accuracy']:.2%}")

# --- 3.2: From-scratch training ---
print("\nTraining from-scratch model...")
t0 = time.perf_counter()
scratch_result = train_and_evaluate(
    model_name, from_scratch=True, num_epochs=num_epochs, device=DEVICE
)
sc_time = time.perf_counter() - t0
print(f"  Done in {sc_time:.1f}s — Final accuracy: {scratch_result['final_accuracy']:.2%}")

In [None]:
# --- 3.3: Epoch-by-epoch comparison table ---
pt_hist = pretrained_result["history"]
sc_hist = scratch_result["history"]

print(
    f"{'Epoch':>6} │ {'Pretrained Acc':>15} {'Pretrained Loss':>16} │ {'Scratch Acc':>12} {'Scratch Loss':>14}"
)
print("─" * 72)
for i in range(num_epochs):
    print(
        f"{pt_hist['epochs'][i]:>6} │ "
        f"{pt_hist['test_accuracy'][i]:>14.2%} {pt_hist['train_loss'][i]:>15.4f} │ "
        f"{sc_hist['test_accuracy'][i]:>11.2%} {sc_hist['train_loss'][i]:>13.4f}"
    )

gap = pretrained_result["final_accuracy"] - scratch_result["final_accuracy"]
print(f"\nAccuracy gap: {gap:+.2%}")

In [None]:
# --- 3.4: Learning curves plot ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

epochs = pt_hist["epochs"]

# Accuracy
ax1.plot(
    epochs,
    pt_hist["test_accuracy"],
    "o-",
    color="#4CAF50",
    label="Pretrained + fine-tuned",
    linewidth=2,
)
ax1.plot(
    epochs, sc_hist["test_accuracy"], "s--", color="#F44336", label="From scratch", linewidth=2
)
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Test Accuracy")
ax1.set_title("Transfer Learning: Accuracy Comparison")
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 1.05)

# Loss
ax2.plot(
    epochs,
    pt_hist["train_loss"],
    "o-",
    color="#4CAF50",
    label="Pretrained + fine-tuned",
    linewidth=2,
)
ax2.plot(epochs, sc_hist["train_loss"], "s--", color="#F44336", label="From scratch", linewidth=2)
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Training Loss")
ax2.set_title("Transfer Learning: Loss Comparison")
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.suptitle(
    "Probe 3: Transfer Learning — Pretrained vs From-Scratch\n"
    f"(Model: {model_name}, {num_epochs} epochs, dataset: 8 train / 4 test)",
    fontsize=12,
    y=1.03,
)
plt.tight_layout()
plt.savefig("../results/figures/p3_transfer_learning.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved: results/figures/p3_transfer_learning.png")

---
## Probe 4 — Model Anatomy: Architecture vs Checkpoint

**Module**: `p4_model_anatomy.py`  
**Models**: BERT-base, GPT-2, T5-small  

From the course:
> *"Architecture: the skeleton of the model — the definition of each layer
> and each operation that happens within the model."*  
> *"Checkpoints: the weights that will be loaded in a given architecture."*

We inspect one representative model from each family to understand their internal structure.
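
To make the architecture/checkpoint distinction concrete: the architecture alone fixes how many parameters exist, while the checkpoint supplies their values. The sketch below counts parameters for a BERT-style encoder purely from hyperparameters, assuming the standard BERT layout (token, position, and segment embeddings; 12 identical layers; a pooler). The function name is introduced here for illustration and is not part of `model_inspector`.

```python
def bert_encoder_param_count(vocab_size, hidden, layers, intermediate,
                             max_positions, type_vocab_size=2):
    """Count weights and biases in a BERT-style encoder (embeddings + layers + pooler)."""
    layer_norm = 2 * hidden  # scale + shift
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm

    # Q, K, V, and output projections, each a hidden x hidden matrix plus bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers: hidden -> intermediate -> hidden, each with bias.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    per_layer = attention + feed_forward

    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Published bert-base-uncased hyperparameters.
total = bert_encoder_param_count(
    vocab_size=30522, hidden=768, layers=12, intermediate=3072, max_positions=512
)
print(f"{total:,}")  # 109,482,240
```

No weights were loaded to get this number, which is why the same architecture can host many different checkpoints.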

In [None]:
from src.architecture_deepdive.data import MODEL_REGISTRY
from src.architecture_deepdive.utils.model_inspector import format_param_count, inspect_model

anatomies = {}
for family_key, info in MODEL_REGISTRY.items():
    model_name = info["primary"]
    print(f"Inspecting {model_name} ({family_key})...")
    anatomy = inspect_model(model_name, family_key)
    anatomies[family_key] = anatomy
    print(
        f"  → {format_param_count(anatomy.num_parameters)} parameters, "
        f"{anatomy.num_layers} layers, hidden={anatomy.hidden_size}"
    )

In [None]:
# --- 4.1: Side-by-side comparison table ---
print(f"{'Property':<30} {'BERT-base':>15} {'GPT-2':>15} {'T5-small':>15}")
print("─" * 78)

enc = anatomies["encoder_only"]
dec = anatomies["decoder_only"]
encdec = anatomies["encoder_decoder"]

rows = [
    (
        "Architecture class",
        enc.architecture_class,
        dec.architecture_class,
        encdec.architecture_class,
    ),
    (
        "Parameters",
        format_param_count(enc.num_parameters),
        format_param_count(dec.num_parameters),
        format_param_count(encdec.num_parameters),
    ),
    ("Layers", str(enc.num_layers), str(dec.num_layers), str(encdec.num_layers)),
    ("Hidden size", str(enc.hidden_size), str(dec.hidden_size), str(encdec.hidden_size)),
    (
        "Attention heads",
        str(enc.num_attention_heads),
        str(dec.num_attention_heads),
        str(encdec.num_attention_heads),
    ),
    ("Vocab size", f"{enc.vocab_size:,}", f"{dec.vocab_size:,}", f"{encdec.vocab_size:,}"),
    (
        "Max positions",
        f"{enc.max_position_embeddings:,}",
        f"{dec.max_position_embeddings:,}",
        f"{encdec.max_position_embeddings:,}",
    ),
    ("Has encoder", str(enc.has_encoder), str(dec.has_encoder), str(encdec.has_encoder)),
    ("Has decoder", str(enc.has_decoder), str(dec.has_decoder), str(encdec.has_decoder)),
]

for prop, v1, v2, v3 in rows:
    print(f"{prop:<30} {v1:>15} {v2:>15} {v3:>15}")

In [None]:
# --- 4.2: Layer breakdown for each model ---
for family_key, anatomy in anatomies.items():
    info = MODEL_REGISTRY[family_key]
    print(f"\n{'═' * 50}")
    print(f"  {info['primary']} ({info['family']})")
    print(f"  Objective: {info['objective']}")
    print(f"  Attention: {info['attention']}")
    print(f"{'═' * 50}")
    print(f"{'Module':<30} {'Tensors':>10} {'Parameters':>15}")
    print("─" * 58)
    for module, breakdown in anatomy.layer_breakdown.items():
        print(
            f"{module:<30} {breakdown['count']:>10} {format_param_count(breakdown['params']):>15}"
        )

In [None]:
from matplotlib.patches import Patch

# --- 4.3: Parameter comparison chart ---
model_data = [
    {"name": a.name, "family": fk, "params": a.num_parameters} for fk, a in anatomies.items()
]

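# Keyed by MODEL_REGISTRY family keys (underscores). Named separately so it does not
# overwrite the hyphen-keyed `colors` dict used by the Probe 1 plotting cell.
family_colors = {
    "encoder_only": "#4CAF50",
    "decoder_only": "#2196F3",
    "encoder_decoder": "#FF9800",
}
names = [d["name"] for d in model_data]
params_m = [d["params"] / 1e6 for d in model_data]
bar_colors = [family_colors[d["family"]] for d in model_data]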

fig, ax = plt.subplots(figsize=(10, 4))
bars = ax.barh(names, params_m, color=bar_colors, edgecolor="white")
ax.bar_label(bars, fmt="%.1fM", padding=4)
ax.set_xlabel("Parameters (Millions)")
ax.set_title("Parameter Count by Architecture Family")

legend_elements = [
    Patch(facecolor="#4CAF50", label="Encoder-only"),
    Patch(facecolor="#2196F3", label="Decoder-only"),
    Patch(facecolor="#FF9800", label="Encoder-decoder"),
]
ax.legend(handles=legend_elements, loc="lower right")
plt.tight_layout()
plt.savefig("../results/figures/p4_parameter_comparison.png", dpi=150)
plt.show()
print("Saved: results/figures/p4_parameter_comparison.png")

---
## Probe 5 — Attention Layer Visualization

**Module**: `p5_attention_viz.py`  
**Models**: BERT-base (bidirectional) and GPT-2 (causal)  

From the course:
> *"This layer will tell the model to pay specific attention to certain words
> in the sentence you passed it (and more or less ignore the others)."*

> *"A translation model will need to also attend to the adjacent word 'You'
> to get the proper translation for the word 'like'."*

### What We'll Verify
1. **Bidirectional vs Causal masks** — theoretical comparison
2. **BERT attention** — full matrices (every token sees every token)
3. **Coreference resolution** — does "it" attend to "animal" vs "street"?
4. **GPT-2 attention** — lower-triangular causal pattern
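
The lower-triangular pattern in item 4 follows directly from how a causal mask interacts with softmax. The sketch below is a minimal pure-Python illustration, not the implementation in `attention_tools`; `masked_softmax_rows` is a hypothetical helper, and the uniform scores are chosen only to make the effect of the mask easy to read.

```python
import math


def masked_softmax_rows(scores, causal):
    """Row-wise softmax; with causal=True, query i ignores keys j > i."""
    n = len(scores)
    weights = []
    for i in range(n):
        allowed = [j for j in range(n) if not causal or j <= i]
        peak = max(scores[i][j] for j in allowed)  # subtract max for stability
        exps = {j: math.exp(scores[i][j] - peak) for j in allowed}
        total = sum(exps.values())
        # Masked positions receive exactly zero weight.
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights


scores = [[0.0] * 4 for _ in range(4)]
bidirectional = masked_softmax_rows(scores, causal=False)
causal = masked_softmax_rows(scores, causal=True)

# Same check the GPT-2 cell performs on real attention weights.
upper_sum = sum(causal[i][j] for i in range(4) for j in range(i + 1, 4))
print("bidirectional row 0:", bidirectional[0])
print("causal row 0:       ", causal[0])
print("causal upper-triangle sum:", upper_sum)
```

Each row still sums to 1 in both cases; the mask only changes which keys share that probability mass.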

In [None]:
import numpy as np
import seaborn as sns

from src.architecture_deepdive.data import ATTENTION_SENTENCES
from src.architecture_deepdive.utils.attention_tools import (
    compare_causal_vs_bidirectional_mask,
    extract_attention_weights,
    get_attention_to_token,
)

# --- 5.1: Attention mask comparison (theoretical) ---
tokens_demo = ["You", "like", "this", "course"]
masks = compare_causal_vs_bidirectional_mask(seq_len=len(tokens_demo))

fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))

for ax, mask, title in zip(
    axes,
    [masks["bidirectional"], masks["causal"]],
    ["Bidirectional (Encoder / BERT)", "Causal (Decoder / GPT)"],
    strict=True,
):
    sns.heatmap(
        mask,
        xticklabels=tokens_demo,
        yticklabels=tokens_demo,
        cmap="Blues",
        vmin=0,
        vmax=1,
        cbar=False,
        ax=ax,
        linewidths=0.5,
        linecolor="gray",
        annot=True,
        fmt=".0f",
    )
    ax.set_title(title, fontsize=12, fontweight="bold")
    ax.set_xlabel("Key position")
    ax.set_ylabel("Query position")

plt.suptitle(
    "Attention Mask Patterns — The Key Architectural Difference",
    fontsize=13,
    fontweight="bold",
    y=1.02,
)
plt.tight_layout()
plt.savefig("../results/figures/p5_mask_comparison.png", dpi=150, bbox_inches="tight")
plt.show()

print(f"\n{masks['note']}")
print(f"Masked positions in causal model: {masks['causal_masked_positions']}")

In [None]:
# --- 5.2: BERT attention on key sentences ---
bert_model = "bert-base-uncased"

# Extract attention for coreference sentences
sentences_to_plot = [
    ("coref_animal", ATTENTION_SENTENCES["coref_animal"]),
    ("coref_street", ATTENTION_SENTENCES["coref_street"]),
]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for ax, (key, sentence) in zip(axes, sentences_to_plot, strict=True):
    attn_data = extract_attention_weights(bert_model, sentence, DEVICE)
    # Last layer, first head (more task-specific patterns)
    attn = attn_data["attentions"][-1, 0]  # (seq, seq)
    tokens = attn_data["tokens"]

    sns.heatmap(
        attn,
        xticklabels=tokens,
        yticklabels=tokens,
        cmap="YlOrRd",
        vmin=0,
        vmax=None,
        ax=ax,
        annot=len(tokens) <= 15,
        fmt=".2f",
    )
    ax.set_title(f"BERT (Last Layer): {key}\n{sentence}", fontsize=9)
    ax.set_xlabel("Key")
    ax.set_ylabel("Query")
    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)

plt.suptitle(
    "BERT Bidirectional Attention — Coreference Resolution", fontsize=13, fontweight="bold"
)
plt.tight_layout()
plt.savefig("../results/figures/p5_bert_coreference.png", dpi=150, bbox_inches="tight")
plt.show()

In [None]:
# --- 5.3: Coreference test — "it" → "animal" vs "it" → "street" ---
coref_tired = extract_attention_weights(bert_model, ATTENTION_SENTENCES["coref_animal"], DEVICE)
coref_wide = extract_attention_weights(bert_model, ATTENTION_SENTENCES["coref_street"], DEVICE)

it_to_animal = get_attention_to_token(
    coref_tired["attentions"], coref_tired["tokens"], "animal", layer=-1
)
it_to_street = get_attention_to_token(
    coref_wide["attentions"], coref_wide["tokens"], "street", layer=-1
)

print("Coreference Resolution Test")
print("═" * 60)
print(f"\nSentence 1: {ATTENTION_SENTENCES['coref_animal']!r}")
print("  Hypothesis: 'it' refers to 'animal' (because it was too tired)")
print("  Attention FROM each token TO 'animal' (last layer, avg over heads):")
for tok, attn in it_to_animal.items():
    bar = "█" * int(attn * 50)
    marker = " ◀" if tok == "it" else ""
    print(f"    {tok:12s} {bar} {attn:.4f}{marker}")

print(f"\nSentence 2: {ATTENTION_SENTENCES['coref_street']!r}")
print("  Hypothesis: 'it' refers to 'street' (because it was too wide)")
print("  Attention FROM each token TO 'street' (last layer, avg over heads):")
for tok, attn in it_to_street.items():
    bar = "█" * int(attn * 50)
    marker = " ◀" if tok == "it" else ""
    print(f"    {tok:12s} {bar} {attn:.4f}{marker}")

In [None]:
# --- 5.4: GPT-2 attention (causal) vs BERT (bidirectional) ---
test_sentence = ATTENTION_SENTENCES["agreement_short"]
print(f"Sentence: {test_sentence!r}\n")

bert_attn = extract_attention_weights(bert_model, test_sentence, DEVICE)
gpt_attn = extract_attention_weights("gpt2", test_sentence, DEVICE)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# BERT — bidirectional (full matrix)
ax = axes[0]
attn = bert_attn["attentions"][0, 0]
sns.heatmap(
    attn,
    xticklabels=bert_attn["tokens"],
    yticklabels=bert_attn["tokens"],
    cmap="YlOrRd",
    vmin=0,
    ax=ax,
    annot=True,
    fmt=".2f",
)
ax.set_title("BERT (Layer 0, Head 0) — Bidirectional", fontsize=11)
ax.set_xlabel("Key")
ax.set_ylabel("Query")
ax.tick_params(axis="x", rotation=45)

# GPT-2 — causal (lower triangular)
ax = axes[1]
attn = gpt_attn["attentions"][0, 0]
sns.heatmap(
    attn,
    xticklabels=gpt_attn["tokens"],
    yticklabels=gpt_attn["tokens"],
    cmap="YlOrRd",
    vmin=0,
    ax=ax,
    annot=True,
    fmt=".2f",
)
ax.set_title("GPT-2 (Layer 0, Head 0) — Causal", fontsize=11)
ax.set_xlabel("Key")
ax.set_ylabel("Query")
ax.tick_params(axis="x", rotation=45)

plt.suptitle(
    "Bidirectional vs Causal Attention — Real Attention Weights",
    fontsize=13,
    fontweight="bold",
)
plt.tight_layout()
plt.savefig("../results/figures/p5_bert_vs_gpt2.png", dpi=150, bbox_inches="tight")
plt.show()

# Verify causal pattern numerically
gpt_attn_matrix = gpt_attn["attentions"][0, 0]
upper_sum = float(np.triu(gpt_attn_matrix, k=1).sum())
print(f"\nGPT-2 upper triangle sum: {upper_sum:.8f} (should be ~0 for causal)")
print(f"Causal pattern verified: {upper_sum < 1e-6}")

---
## Probe 6 — Architecture Comparison: Encoder vs Decoder vs Encoder-Decoder

**Module**: `p6_arch_comparison.py`  
**Models**: BERT-base, GPT-2, T5-small  

From the course:
> *"Encoder-only models: Good for tasks that require understanding of the input"*  
> *"Decoder-only models: Good for generative tasks such as text generation"*  
> *"Encoder-decoder models: Good for generative tasks that require an input"*

We run all three architecture types on the **same input** and compare their
hidden representations, output formats, and task-specific behaviors.
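
As a rough guide to what "different output formats" means in practice, the sketch below lists the shape of the primary output for each family. It is an illustration under stated assumptions, not the probe's code: `output_shapes` is a hypothetical helper, the sequence lengths are made up, and `hidden=768` / `vocab=50257` are GPT-2's published sizes used as placeholders.

```python
def output_shapes(batch, src_len, hidden, vocab, generated_len):
    """Shape of the primary output tensor for each architecture family."""
    return {
        # One contextual vector per input token.
        "encoder_only": (batch, src_len, hidden),
        # One next-token score vector (logits over the vocabulary) per position.
        "decoder_only": (batch, src_len, vocab),
        # Token ids of a newly generated sequence.
        "encoder_decoder": (batch, generated_len),
    }


shapes = output_shapes(batch=1, src_len=6, hidden=768, vocab=50257, generated_len=9)
for family, shape in shapes.items():
    print(f"{family:<16} {shape}")
```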

In [None]:
from src.architecture_deepdive.probes.p6_arch_comparison import (
    SHARED_INPUT,
    probe_decoder_only,
    probe_encoder_decoder,
    probe_encoder_only,
)

print(f"Shared input: {SHARED_INPUT!r}\n")

# --- 6.1: Encoder-only (BERT) ---
print("Running encoder-only probe (BERT)...")
enc_result = probe_encoder_only(device=DEVICE)
show(
    "Encoder-Only: BERT",
    {
        "model": enc_result["model"],
        "output_type": enc_result["output_type"],
        "hidden_state_shape": enc_result["hidden_state_shape"],
        "tokens": enc_result["tokens"],
        "attention_is_bidirectional": enc_result["attention_is_bidirectional"],
        "cls_embedding_norm": enc_result["cls_embedding_norm"],
        "typical_use": enc_result["typical_use"],
    },
)

In [None]:
# --- 6.2: Decoder-only (GPT-2) ---
print("Running decoder-only probe (GPT-2)...")
dec_result = probe_decoder_only(device=DEVICE)
show(
    "Decoder-Only: GPT-2",
    {
        "model": dec_result["model"],
        "output_type": dec_result["output_type"],
        "logits_shape": dec_result["logits_shape"],
        "vocab_size": dec_result["vocab_size"],
        "attention_is_causal": dec_result["attention_is_causal"],
        "next_token_predictions": dec_result["next_token_predictions"],
        "typical_use": dec_result["typical_use"],
    },
)

In [None]:
# --- 6.3: Encoder-decoder (T5) ---
print("Running encoder-decoder probe (T5)...")
encdec_result = probe_encoder_decoder(device=DEVICE)
show(
    "Encoder-Decoder: T5",
    {
        "model": encdec_result["model"],
        "output_type": encdec_result["output_type"],
        "encoder_hidden_shape": encdec_result["encoder_hidden_shape"],
        "generated_text": encdec_result["generated_text"],
        "key_feature": encdec_result["key_feature"],
        "typical_use": encdec_result["typical_use"],
    },
)

In [None]:
# --- 6.4: Synthesis comparison table ---
print("\n" + "═" * 80)
print("  ARCHITECTURE COMPARISON SYNTHESIS")
print("═" * 80)

print(f"\n{'':30} {'Encoder-only':>16} {'Decoder-only':>16} {'Encoder-Decoder':>16}")
print("─" * 80)

comparison_rows = [
    ("Model", enc_result["model"], dec_result["model"], encdec_result["model"]),
    ("Attention", "Bidirectional", "Causal (L→R)", "Bidir + Causal"),
    ("Output", "Embeddings", "Logits", "Generated seq"),
    ("Best for", "Understanding", "Generation", "Seq2Seq"),
]

for label, v1, v2, v3 in comparison_rows:
    print(f"{label:<30} {v1:>16} {v2:>16} {v3:>16}")

print('\n"Each of these parts can be used independently, depending on the task." — Course')

---
## Key Takeaways

1. **Decoder-only dominates**: 8 of 11 models in the course timeline are decoder-only,
   reflecting the industry shift toward autoregressive LLMs that accelerated after 2022.

2. **CLM vs MLM context**: BERT (MLM) predicts "Paris" with high confidence because it conditions
   on both the left and right context of the mask. GPT-2 (CLM) can condition only on the left context.

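3. **Transfer learning works**: Pretrained models achieve higher accuracy on small datasets in
   this experiment, consistent with the course's claim that fine-tuning "takes advantage of
   knowledge acquired during pretraining." With only 4 test examples, treat the measured gap as
   illustrative rather than statistically robust.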

4. **Architecture anatomy**: BERT-base has ~110M parameters, GPT-2 ~124M, and T5-small ~60M.
   All three are small by current standards, yet they have fundamentally different structures
   (encoder-only vs decoder-only vs both).

5. **Attention patterns are verifiable**: BERT's attention matrix is full (bidirectional),
   GPT-2's is lower-triangular (causal). The coreference test shows BERT can link "it" to its
   correct antecedent using both left and right context.

6. **Same input, different outputs**: Given the same text, the encoder produces embeddings,
   the decoder produces next-token logits, and the encoder-decoder generates a new sequence.

---
*Reference*: [HuggingFace LLM Course — Chapter 1.4](https://huggingface.co/learn/llm-course/chapter1/4)