# Cross-Architecture Comparison: GPT-OSS 20B vs Qwen3-14B
## Fine-Tuning on Jordan B. Peterson's Works — All 4 Model Variants

This notebook performs the most complete comparison in this series, evaluating
**all four model variants** side by side on the same prompts and metrics.

| # | Model | Architecture | Status |
|---|-------|-------------|--------|
| 1 | `unsloth/gpt-oss-20b-unsloth-bnb-4bit` | Harmony / GPT-OSS (MoE) | Base (no Peterson fine-tuning) |
| 2 | `./outputs/gpt_oss_20b_jordan_peterson_lora` | Harmony / GPT-OSS (MoE) | Fine-tuned on 4 Peterson books, 1 epoch |
| 3 | `unsloth/Qwen3-14B-unsloth-bnb-4bit` | ChatML / Qwen3 | Base (no Peterson fine-tuning) |
| 4 | `./outputs/qwen3_14b_jordan_peterson_lora` | ChatML / Qwen3 | Fine-tuned on 4 Peterson books, 1 epoch |

### Research Questions

1. **Intra-architecture fine-tuning effect** — Did 1 epoch of fine-tuning lower each
   model's perplexity on Peterson's text and move its vocabulary and style closer to his?
2. **Inter-architecture base comparison** — Is GPT-OSS 20B or Qwen3-14B a stronger
   foundation for this domain, before any fine-tuning?
3. **Inter-architecture fine-tuned comparison** — After fine-tuning, which model
   is more convincingly "Peterson-like"?
4. **Training efficiency** — GPT-OSS trained for 73 min (loss 3.01) vs Qwen3 for
   23 min (loss 2.44). Does lower loss translate to better inference quality?

### Evaluation Metrics

| Metric | What it measures | Direction |
|--------|-----------------|-----------|
| **Perplexity** | How surprised the model is by real Peterson text | ↓ better |
| **TF-IDF cosine similarity** | Vocabulary overlap with Peterson's writing | ↑ better |
| **Keyword density** | Rate of use of Peterson's ~60 signature terms | ↑ better |
| **Type-Token Ratio (TTR)** | Vocabulary richness (unique / total words) | ↑ better |
| **Response length** | Average words per response | informational |

### GPU Memory Strategy

The two architectures cannot share a 24 GB GPU (GPT-OSS 20B ≈ 13.8 GB,
Qwen3-14B ≈ 13.4 GB), and each base/fine-tuned pair would exceed it as well.
We therefore evaluate all four models **sequentially**:

```
Load → Infer → Save pkl → Unload → repeat for next model
```

Once all four pkl files exist, no model loading is required — metrics and charts
load from cache in seconds. Delete any pkl to force re-evaluation of that model.
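
The load → infer → save → unload cycle reduces to a small cache-or-compute pattern. The sketch below is illustrative only: `run_or_load` and `fake_inference` are hypothetical names that do not appear elsewhere in this notebook, and the stub stands in for the expensive model-loading and generation step.

```python
import pickle
import tempfile
from pathlib import Path


def run_or_load(cache_path: Path, compute_fn):
    """Return cached results if the pkl exists; otherwise compute and cache them.

    `compute_fn` is a zero-argument callable standing in for the expensive
    load -> infer -> unload step (hypothetical; not part of this notebook).
    """
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    results = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(results, f)
    return results


calls = []  # records how many times the expensive step actually ran


def fake_inference():
    """Stub for model inference; a real version would load a model on the GPU."""
    calls.append(1)
    return {"responses": ["stub answer"], "perplexities": [12.5]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo_results.pkl"
    first = run_or_load(cache, fake_inference)    # cache miss: runs the stub
    second = run_or_load(cache, fake_inference)   # cache hit: reads the pkl
    cache.unlink()                                # deleting the pkl forces a re-run
    third = run_or_load(cache, fake_inference)
```

Only the first and third calls execute the stub; the second is served entirely from disk, which is why a fully cached notebook needs no GPU time.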

---
## Step 1: Imports and Configuration

In [None]:
import os, re, gc, math, pickle
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt',     quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

# ── Paths ──────────────────────────────────────────────────────────────────
# All four model paths. The two LoRA paths are relative to this notebook's
# working directory (NoteBooks/JordanPeterson/); they point to adapter weights
# saved by the fine-tuning notebooks.
GPTOSS_BASE_PATH  = "unsloth/gpt-oss-20b-unsloth-bnb-4bit"
GPTOSS_LORA_PATH  = "./outputs/gpt_oss_20b_jordan_peterson_lora"
QWEN3_BASE_PATH   = "unsloth/Qwen3-14B-unsloth-bnb-4bit"
QWEN3_LORA_PATH   = "./outputs/qwen3_14b_jordan_peterson_lora"

# Separate cache and figures directories keep this notebook isolated from the
# single-architecture comparisons that already exist in this folder.
CACHE_DIR   = Path("./comparison_cache_all_models")
FIGURES_DIR = Path("./comparison_figures_all_models")
CACHE_DIR.mkdir(exist_ok=True)
FIGURES_DIR.mkdir(exist_ok=True)

# ── Model registry ─────────────────────────────────────────────────────────
# A single ordered list drives all loops in this notebook — inference phases,
# metric collection, chart colours, and legend labels all come from here.
MODEL_KEYS   = ["gptoss_base", "gptoss_tuned", "qwen3_base", "qwen3_tuned"]
MODEL_PATHS  = {
    "gptoss_base":  GPTOSS_BASE_PATH,
    "gptoss_tuned": GPTOSS_LORA_PATH,
    "qwen3_base":   QWEN3_BASE_PATH,
    "qwen3_tuned":  QWEN3_LORA_PATH,
}
# Short names used inside chart axes (tight layout)
MODEL_SHORT  = {
    "gptoss_base":  "GPT-OSS 20B\nBase",
    "gptoss_tuned": "GPT-OSS 20B\nFine-Tuned",
    "qwen3_base":   "Qwen3-14B\nBase",
    "qwen3_tuned":  "Qwen3-14B\nFine-Tuned",
}
# Longer display names for titles, radar chart legend, word-cloud headers
MODEL_DISPLAY = {
    "gptoss_base":  "GPT-OSS 20B  |  Base",
    "gptoss_tuned": "GPT-OSS 20B  |  Fine-Tuned",
    "qwen3_base":   "Qwen3-14B  |  Base",
    "qwen3_tuned":  "Qwen3-14B  |  Fine-Tuned",
}
# Colour palette: blue/orange for GPT-OSS, green/red for Qwen3.
# Pairing warm vs cool within each architecture makes the base/tuned split
# visually obvious at a glance.
MODEL_COLORS = {
    "gptoss_base":  "#4C72B0",   # blue   — GPT-OSS base
    "gptoss_tuned": "#DD8452",   # orange — GPT-OSS fine-tuned
    "qwen3_base":   "#55A868",   # green  — Qwen3 base
    "qwen3_tuned":  "#C44E52",   # red    — Qwen3 fine-tuned
}
CACHE_FILES = {k: CACHE_DIR / f"{k}_results.pkl" for k in MODEL_KEYS}

# ── Plot style ─────────────────────────────────────────────────────────────
plt.rcParams.update({
    'figure.dpi': 120,
    'figure.facecolor': 'white',
    'axes.facecolor': '#f8f8f8',
    'axes.grid': True,
    'grid.alpha': 0.4,
    'font.size': 11,
})

# ── System prompts ─────────────────────────────────────────────────────────
# Base models receive a neutral system prompt: without Peterson fine-tuning,
# asking them to imitate him would measure prompt-driven mimicry rather than
# their default output.
# Fine-tuned models receive the same Peterson persona prompt they were trained
# with, which matches how they would be used in practice. Base and tuned runs
# therefore differ in both weights and system prompt.
BASE_SYSTEM_PROMPT = "You are a helpful assistant."
TUNED_SYSTEM_PROMPT = (
    "You are an AI assistant that has been trained on the complete works of "
    "Jordan B. Peterson, a Canadian clinical psychologist, professor, and author. "
    "You speak with deep knowledge of psychology, philosophy, mythology, religion, "
    "and personal responsibility. Your responses reflect Peterson's writing style, "
    "intellectual depth, and interdisciplinary approach to understanding human "
    "nature and meaning."
)
SYSTEM_PROMPTS = {
    "gptoss_base":  BASE_SYSTEM_PROMPT,
    "gptoss_tuned": TUNED_SYSTEM_PROMPT,
    "qwen3_base":   BASE_SYSTEM_PROMPT,
    "qwen3_tuned":  TUNED_SYSTEM_PROMPT,
}

print(f"PyTorch  : {torch.__version__}")
print(f"CUDA     : {torch.cuda.is_available()}  |  GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM     : {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB total")
print(f"Cache    : {CACHE_DIR.resolve()}")
print(f"Figures  : {FIGURES_DIR.resolve()}")
print()
print("Cache status:")
for k in MODEL_KEYS:
    status = "EXISTS (will skip inference)" if CACHE_FILES[k].exists() else "missing (will run inference)"
    print(f"  {MODEL_DISPLAY[k]:<36}  {status}")

---
## Step 2: Reference Data

Three types of reference data are used throughout this notebook:

- **2a — Peterson passages**: Verbatim text held out from training, used to measure
  *perplexity* (how surprised the model is by real Peterson sentences) and *TF-IDF
  similarity* (how closely model responses match his vocabulary distribution).
- **2b — Evaluation prompts**: Ten questions covering Peterson's core themes.
  Every model answers the same ten questions; metrics are computed per-prompt
  and then averaged for the summary charts.
- **2c — Keyword dictionary**: ~60 terms curated from the four books that are
  either exclusive to Peterson or unusually frequent in his writing. Used for
  keyword density (fraction of response words that are Peterson-signature words)
  and the heatmap visualisation.
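
As a rough illustration of the two vocabulary metrics, the sketch below computes Type-Token Ratio and keyword density for one sentence. It is a simplified stand-in, not the notebook's implementation: `ttr_and_keyword_density` and `demo_keywords` are hypothetical names, the keyword set is a four-term toy, and a regex replaces NLTK's `word_tokenize`.

```python
import re

# Tiny stand-in for the keyword dictionary (the notebook's list has ~60 terms).
demo_keywords = {"chaos", "order", "meaning", "responsibility"}


def ttr_and_keyword_density(text: str, keywords: set[str]) -> tuple[float, float]:
    """Return (type-token ratio, keyword density) for one response."""
    # Simple regex tokenizer; the notebook itself uses NLTK's word_tokenize.
    words = re.findall(r"[a-z]+", text.lower())
    total = max(len(words), 1)            # avoid division by zero on empty text
    ttr = len(set(words)) / total
    density = sum(w in keywords for w in words) / total
    return ttr, density


sample = "Order and chaos shape meaning, and meaning demands responsibility."
ttr, density = ttr_and_keyword_density(sample, demo_keywords)
print(f"TTR = {ttr:.3f}, keyword density = {density:.3f}")
```

The sample has 9 words, 7 of them unique, and 5 keyword hits, so TTR is 7/9 and keyword density is 5/9.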

In [None]:
# ── 2a: Reference passages ─────────────────────────────────────────────────
# Eight short verbatim excerpts drawn from all four Peterson books, covering
# different themes (order/chaos, meaning, suffering, mythology, logos).
# They were NOT used as training examples — they come from later chapters or
# sections that were excluded from the fine-tuning dataset.
PETERSON_PASSAGES = [
    # From Maps of Meaning — the world as a forum for action
    "The world can be validly construed as a forum for action, or as a place of things. "
    "The former manner of interpretation — more primordial, and less clearly understood — "
    "finds its expression in the arts or humanities, in ritual, drama, literature, and myth. "
    "The world as forum for action is a place of value, a place where all things have meaning.",

    # From 12 Rules for Life — embodied courage
    "To stand up straight with your shoulders back is to accept the terrible responsibility "
    "of life, with eyes wide open. It means deciding to voluntarily transform the chaos of "
    "potential into the realities of habitable order. It means adopting the burden of "
    "self-conscious vulnerability, and accepting the end of the unconscious paradise of "
    "childhood, where finitude and mortality are only dimly comprehended.",

    # From Beyond Order — chaos as unexplored territory
    "Order is the place where the things you are currently doing are working out well "
    "for you. Chaos is the domain of ignorance itself. It's unexplored territory. Chaos "
    "is what extends, endlessly and without limit, beyond the boundaries of all states, "
    "all ideas, and all disciplines. It's the foreigner, the stranger, the member of "
    "another gang, the rustle in the bushes in the night-time.",

    # From We Who Wrestle with God — logos and truth
    "The divine spark in man is the logos — the word, the reason, the creative principle "
    "that gives order to the chaos of experience. To act in accordance with the logos is "
    "to speak the truth, to pursue what is meaningful rather than what is expedient, and "
    "to take on the burden of Being itself with courage and humility.",

    # From 12 Rules — compare yourself to who you were
    "Compare yourself to who you were yesterday, not to who someone else is today. "
    "You have a nature. You can play the game of life and improve. You can set a "
    "standard, even a minimal standard, and try to live up to it. You can improve "
    "incrementally, moving forward step by step. You can judge your life against "
    "what you know to be good, against what you should be.",

    # From Maps of Meaning — myth and the hero
    "The great myths and rituals of the past have been formulated in the language of "
    "the imagination. They say: act out the role of the hero; do not be the villain; "
    "do not be the tyrant. They say: update your maps of meaning when new information "
    "warrants it; admit your errors and change. They say: encounter the stranger and "
    "extract from that encounter what is valuable. Treat the stranger with respect.",

    # From Beyond Order — meaning as balance
    "Meaning is the ultimate balance between, on the one hand, the chaos of transformation "
    "and possibility and, on the other, the discipline of pristine order, whose purpose is "
    "to produce out of the attendant chaos a new order that will be even more productive "
    "and worthwhile than the old. Pursue what is meaningful, not what is expedient.",

    # From We Who Wrestle with God — suffering and transcendence
    "Suffering is not a mistake or an accident. It is the very ground of Being itself. "
    "To wrestle with God, as Jacob did, is to confront that suffering honestly, to take "
    "responsibility for it, and to find within it the possibility of transcendence. The "
    "hero does not flee from the dragon; he faces it and transforms the encounter.",
]

# ── 2b: Evaluation prompts ─────────────────────────────────────────────────
# Ten questions that span all four books' major themes. Every model receives
# the same user prompts; only the system prompt differs between base and tuned
# variants (see SYSTEM_PROMPTS).
EVAL_PROMPTS = [
    "What is the relationship between order and chaos in human experience?",
    "Why is personal responsibility the foundation of a meaningful life?",
    "How do ancient myths and stories reveal truths about human nature?",
    "What does it mean to pursue what is meaningful rather than what is expedient?",
    "How should a person confront suffering rather than flee from it?",
    "What is the significance of the hero archetype in understanding the human psyche?",
    "Why is telling the truth essential to a properly functioning life?",
    "What is the role of the divine or the sacred in organizing human society?",
    "How does the Jungian concept of the shadow relate to individual development?",
    "What does it mean to stand up straight with your shoulders back?",
]

# ── 2c: Peterson keyword dictionary ────────────────────────────────────────
# Curated terms that serve as proxies for Peterson's characteristic vocabulary.
# Using a keyword-density metric alongside TF-IDF gives us a more targeted
# measure: TF-IDF picks up general vocabulary similarity whereas keyword
# density specifically tracks his signature conceptual vocabulary.
PETERSON_KEYWORDS = list(set([
    # Core metaphysical concepts
    "chaos", "order", "logos", "being", "meaning", "meaningful", "meaningless",
    "transcendence", "transcendent", "archetype", "archetypal",
    # Jungian psychology
    "shadow", "anima", "animus", "unconscious", "consciousness", "psyche",
    "individuation", "projection",
    # Ethics and existentialism
    "responsibility", "suffering", "redemption", "courage", "virtue",
    "nihilism", "nihilistic", "expedient", "expedience", "tyranny", "tyrannical",
    "sovereignty", "heroic", "malevolent",
    # Narrative and mythology
    "myth", "mythological", "hero", "dragon", "narrative", "story",
    "ritual", "sacrifice", "resurrection", "transformation",
    # Religion and biblical imagery
    "divine", "sacred", "god", "biblical", "genesis", "logos", "spirit",
    "wrestle", "jacob", "adam", "eve", "serpent",
    # Peterson's distinctive action-language
    "confront", "hierarchy", "dominance", "voluntarily", "catastrophe",
    "pathological", "resentment", "ideological", "totalitarian",
]))

print(f"Reference passages  : {len(PETERSON_PASSAGES)}")
print(f"Evaluation prompts  : {len(EVAL_PROMPTS)}")
print(f"Keyword dictionary  : {len(PETERSON_KEYWORDS)} unique terms")

---
## Step 3: Helper Functions

Two separate inference wrappers are defined — one for each model architecture —
because their chat-template APIs differ significantly:

| | GPT-OSS (Harmony) | Qwen3 (ChatML) |
|--|-------------------|----------------|
| Template call | `apply_chat_template(..., reasoning_effort="low", return_dict=True)` | `apply_chat_template(..., enable_thinking=False)` → plain string |
| Tokenization | Built into template call | Separate `tokenizer(text, ...)` step |
| Special cleanup | None applied | Strip residual `<think>…</think>` blocks |
| Special tokens | `<\|start\|>`, `<\|message\|>`, `<\|end\|>` | `<\|im_start\|>`, `<\|im_end\|>` |

The remaining helpers (`compute_perplexity`, `compute_text_stats`,
`compute_tfidf_similarity`) are architecture-independent and identical to those
used in the individual comparison notebooks.
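
To make the perplexity definition concrete, here is a toy calculation from per-token probabilities. It is a sketch only: `perplexity_from_token_probs` is a hypothetical helper, and the probabilities are made up rather than taken from any model.

```python
import math


def perplexity_from_token_probs(probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = [-math.log(p) for p in probs]
    return math.exp(sum(nll) / len(nll))


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4.
uniform = perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25])

# Higher probabilities on the observed tokens give lower perplexity.
confident = perplexity_from_token_probs([0.9, 0.8, 0.95, 0.85])
```

In `compute_perplexity` the mean negative log-likelihood is the model's returned loss, so the same quantity is obtained with `math.exp(out.loss.item())`.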

In [None]:
def compute_perplexity(model, tokenizer, texts: list[str],
                       max_length: int = 512) -> list[float]:
    """
    Compute token-level perplexity for each text in the list.

    Perplexity = exp( average negative log-likelihood per token ).

    A lower value means the model assigns higher probability to the text —
    i.e., the text looks "expected" to the model.  After fine-tuning on
    Peterson's books we expect the model's perplexity on held-out Peterson
    passages to drop, because it has learned his vocabulary and grammar.
    If perplexity does NOT drop, the fine-tuning signal was insufficient
    or the model is still mostly using its pre-training representation.

    The function runs in inference mode (torch.no_grad()) with labels set
    equal to input_ids. The Hugging Face model then shifts the labels by one
    position internally and returns the mean cross-entropy over all predicted
    tokens, which is the standard causal language-model objective.
    """
    model.eval()
    perplexities = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(
                text,
                return_tensors="pt",
                max_length=max_length,
                truncation=True,
            ).to("cuda")
            out = model(**enc, labels=enc["input_ids"])
            perplexities.append(math.exp(out.loss.item()))
    return perplexities


def generate_response_gptoss(model, tokenizer, prompt: str,
                              system_prompt: str,
                              max_new_tokens: int = 300) -> str:
    """
    Generate a single response from a GPT-OSS (Harmony-format) model.

    The GPT-OSS tokenizer's apply_chat_template() accepts the Harmony-specific
    `reasoning_effort` parameter and returns a ready-to-use dict of tensors
    (input_ids + attention_mask) when return_dict=True.  We use 'low' effort
    to keep response length comparable to Qwen3's concise greedy outputs.

    Decoding is fully deterministic (do_sample=False) with a mild repetition
    penalty (1.1) to discourage the degenerate loops a model can pick up from
    short or empty training examples.
    Only the *newly generated* tokens are decoded — we skip the prompt prefix
    by slicing out[0][input_ids.shape[1]:].
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": prompt},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        reasoning_effort="low",   # Harmony-specific: keep responses concise
    ).to("cuda")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,          # greedy — fully deterministic
            temperature=1.0,          # ignored when do_sample=False, set for clarity
            repetition_penalty=1.1,   # mild penalty prevents token-loop collapse
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


def generate_response_qwen3(model, tokenizer, prompt: str,
                             system_prompt: str,
                             max_new_tokens: int = 300) -> str:
    """
    Generate a single response from a Qwen3 (ChatML-format) model.

    This wrapper renders and tokenises the prompt in two steps:
      1. Call apply_chat_template() with tokenize=False and enable_thinking=False
         to obtain a plain text string (not tensors).
      2. Pass that string to the tokenizer separately to get input_ids.

    Setting enable_thinking=False turns off Qwen3's chain-of-thought mode, in
    which the model writes a <think>…</think> block before its answer; that
    reasoning text would contaminate our style-comparison metrics.
    With thinking disabled, the template pre-fills an empty <think></think>
    block in the prompt, so none should appear in the generated text. As a
    safeguard we still strip any such blocks with re.sub before returning
    the final string.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": prompt},
    ]
    # Step 1: render the chat template → plain text
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,   # Qwen3-specific: disable chain-of-thought
    )
    # Step 2: tokenise separately
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            temperature=1.0,
            repetition_penalty=1.1,
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    # Strip any residual <think>…</think> blocks (may appear even with thinking disabled)
    response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
    return response


def compute_text_stats(texts: list[str]) -> dict:
    """
    Compute per-text statistics over a list of model responses.

    IMPORTANT: every text in the list must contribute exactly one entry to
    each output list — even empty strings.  If we skip empties the lists
    will be shorter than len(texts), causing index-misalignment in the
    per-prompt bar charts.  Empty responses simply produce all-zero stats.

    Returns a dict with:
      word_counts     — number of alphabetic words per response
      sentence_counts — number of sentences per response
      ttr_values      — Type-Token Ratio = unique_words / total_words
      keyword_density — fraction of words that are Peterson keywords
      keyword_counts  — Counter of all keyword hits (for word-cloud / heatmap)
      all_words       — flat list of content words (stopwords removed)
    """
    stop_words = set(stopwords.words('english'))
    kw_set     = set(k.lower() for k in PETERSON_KEYWORDS)
    word_counts, sentence_counts, ttr_values = [], [], []
    keyword_density = []
    keyword_counts  = Counter()
    all_words       = []

    for text in texts:
        if text.strip():
            words = word_tokenize(text.lower())
            sents = sent_tokenize(text)
        else:
            words, sents = [], []   # empty response → all zeros; do NOT skip

        words_alpha = [w for w in words if w.isalpha()]
        word_counts.append(len(words_alpha))
        sentence_counts.append(len(sents))

        # TTR: higher = richer vocabulary; inflated for very short texts, 0 if empty
        ttr = len(set(words_alpha)) / max(len(words_alpha), 1)
        ttr_values.append(ttr)

        # Peterson keyword density: fraction of words that are signature terms
        kw_hits = [w for w in words_alpha if w in kw_set]
        keyword_density.append(len(kw_hits) / max(len(words_alpha), 1))
        keyword_counts.update(kw_hits)

        # Content words for word cloud (strip stop words; keep length > 2)
        all_words.extend(w for w in words_alpha if w not in stop_words and len(w) > 2)

    return {
        "word_counts":     word_counts,
        "sentence_counts": sentence_counts,
        "ttr_values":      ttr_values,
        "keyword_density": keyword_density,
        "keyword_counts":  keyword_counts,
        "all_words":       all_words,
    }


def compute_tfidf_similarity(responses: list[str],
                              references: list[str]) -> list[float]:
    """
    Measure how similar each model response is to Peterson's actual writing
    using TF-IDF cosine similarity.

    TF-IDF weights words by how distinctive they are across the corpus:
    English stop words (the, and) are removed outright, and terms that are
    rare overall but frequent within a text (chaos, logos, archetype) get
    high weight.  Cosine similarity
    then measures the angle between the response vector and each reference
    vector; we return the maximum across all reference passages so that
    a response only needs to resemble *one* passage to score well.

    Responses that are empty (or nearly so) receive a similarity of 0.0,
    which accurately reflects their lack of Peterson-vocabulary content.
    """
    if not responses or not any(r.strip() for r in responses):
        return [0.0] * len(responses)
    all_texts  = references + responses
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf      = vectorizer.fit_transform(all_texts)
    ref_vecs   = tfidf[:len(references)]
    resp_vecs  = tfidf[len(references):]
    similarities = []
    for i in range(resp_vecs.shape[0]):
        sims = cosine_similarity(resp_vecs[i], ref_vecs)[0]
        similarities.append(float(sims.max()))
    return similarities


def _avg(lst):
    """Arithmetic mean of a list; 0.0 for an empty list."""
    return sum(lst) / len(lst) if lst else 0.0

print("Helper functions defined.")

---
## Phase 1: GPT-OSS 20B — Base Model

We evaluate the *unmodified* GPT-OSS 20B model, which has never seen Peterson's
books.  It uses the generic `"You are a helpful assistant."` system prompt.

This establishes our **baseline** for the GPT-OSS architecture: any improvement
in the fine-tuned variant (Phase 2) reflects the combined effect of the LoRA
training on the four Peterson books and the persona system prompt it is
evaluated with.

In [None]:
from unsloth import FastLanguageModel

_cache = CACHE_FILES["gptoss_base"]

if _cache.exists():
    # ── Fast path: skip model loading entirely ─────────────────────────────
    # Loading a 20B model takes ~60 seconds and consumes ~14 GB of VRAM.
    # If we already ran inference and saved the pkl, we bypass all of that.
    print(f"Cache found — skipping GPT-OSS 20B Base inference.")
    print(f"  {_cache}")
else:
    # ── Slow path: load → infer → save → unload ────────────────────────────
    print("Loading GPT-OSS 20B Base model …")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = GPTOSS_BASE_PATH,
        dtype           = None,    # auto (bfloat16 on Ampere+, float16 otherwise)
        max_seq_length  = 2048,    # matches the max used during fine-tuning
        load_in_4bit    = True,    # bitsandbytes 4-bit quantization to save VRAM
        full_finetuning = False,   # not full fine-tuning; weights stay frozen for inference
    )
    FastLanguageModel.for_inference(_model)   # Unsloth: 2× faster inference kernel
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    # ── 1a. Perplexity on held-out Peterson passages ────────────────────────
    print("\nComputing perplexity on reference passages …")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    # ── 1b. Generate responses to all 10 evaluation prompts ────────────────
    # We use the architecture-appropriate wrapper (generate_response_gptoss).
    # The generic BASE_SYSTEM_PROMPT is used because the base model has not
    # been conditioned to write like Peterson.
    print("\nGenerating responses …")
    _resps = []
    for i, prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{i+1:02d}/{len(EVAL_PROMPTS)}] {prompt[:70]}")
        r = generate_response_gptoss(_model, _tok, prompt,
                                     SYSTEM_PROMPTS["gptoss_base"])
        _resps.append(r)
        print(f"         → {r[:100]}\n")

    # ── 1c. Persist results and free VRAM ──────────────────────────────────
    # We save a dict containing both lists so a single pkl read in the metrics
    # cell restores the full results for this model.
    with open(_cache, "wb") as f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, f)
    print(f"\nSaved → {_cache}")

    # Explicitly delete model references and flush CUDA memory allocator.
    # Without this step, loading the next 20B model would likely OOM.
    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

# ── Always load pkl into named variables ───────────────────────────────────
with open(_cache, "rb") as f:
    _d = pickle.load(f)
gptoss_base_responses    = _d["responses"]
gptoss_base_perplexities = _d["perplexities"]
print(f"\nGPT-OSS 20B Base loaded from cache.")
print(f"  Responses    : {len(gptoss_base_responses)}")
print(f"  Avg PPL      : {_avg(gptoss_base_perplexities):.2f}")

---
## Phase 2: GPT-OSS 20B — Fine-Tuned Model

We now load the LoRA-adapted version of the same GPT-OSS 20B model, trained for
**1 epoch** on ~768,000 words from four Peterson books:
*Maps of Meaning*, *12 Rules for Life*, *Beyond Order*, and
*We Who Wrestle with God*.

Training stats: **641 steps**, **73.3 min**, **loss 3.01**, batch size 1.

The fine-tuned model uses the rich Peterson system prompt it was trained with,
so that evaluation matches its training conditions. We use the same Harmony-format
`generate_response_gptoss` wrapper as the base model; the only differences are
the adapter weights and the system prompt.
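
Harmony-format models can emit more than one channel per turn, for example an `analysis` channel ahead of the `final` answer. The sketch below shows one way to keep only the final-channel text. It assumes the response was decoded with `skip_special_tokens=False`, so the channel markers are still present, and `extract_final_channel` is a hypothetical helper that is not part of this notebook or of any library. Whether such cleanup is needed depends on what the model actually emits at `reasoning_effort="low"`, so treat it as an optional check rather than a required step.

```python
import re

# Matches the text of the `final` channel up to the next Harmony control
# token (or the end of the string if generation was cut off).
_FINAL_RE = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?=<\|return\|>|<\|end\|>|<\|start\|>|$)",
    flags=re.DOTALL,
)


def extract_final_channel(raw: str) -> str:
    """Return the final-channel text, or the whole string if no marker is found."""
    match = _FINAL_RE.search(raw)
    return match.group(1).strip() if match else raw.strip()


raw_output = (
    "<|channel|>analysis<|message|>User asks about order and chaos.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Order is explored territory."
    "<|return|>"
)
answer = extract_final_channel(raw_output)
```

Text without any channel marker passes through unchanged, so the helper is safe to apply to responses that contain only an answer.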

In [None]:
_cache = CACHE_FILES["gptoss_tuned"]

if _cache.exists():
    print(f"Cache found — skipping GPT-OSS 20B Fine-Tuned inference.")
    print(f"  {_cache}")
else:
    # Unsloth loads LoRA adapters transparently: passing the adapter directory
    # as model_name makes it read the adapter config, load the base model named
    # there, and attach the LoRA weights on top.  No separate merge step needed.
    print("Loading GPT-OSS 20B Fine-Tuned model (base + LoRA adapters) …")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = GPTOSS_LORA_PATH,   # points to saved adapter directory
        dtype           = None,
        max_seq_length  = 2048,
        load_in_4bit    = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(_model)
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    print("\nComputing perplexity on reference passages …")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    # Fine-tuned model receives the TUNED system prompt — the same persona
    # prompt it was trained on — to ensure we evaluate it under its intended
    # operating conditions.
    print("\nGenerating responses …")
    _resps = []
    for i, prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{i+1:02d}/{len(EVAL_PROMPTS)}] {prompt[:70]}")
        r = generate_response_gptoss(_model, _tok, prompt,
                                     SYSTEM_PROMPTS["gptoss_tuned"])
        _resps.append(r)
        print(f"         → {r[:100]}\n")

    with open(_cache, "wb") as f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, f)
    print(f"\nSaved → {_cache}")

    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

with open(_cache, "rb") as f:
    _d = pickle.load(f)
gptoss_tuned_responses    = _d["responses"]
gptoss_tuned_perplexities = _d["perplexities"]
print(f"\nGPT-OSS 20B Fine-Tuned loaded from cache.")
print(f"  Responses    : {len(gptoss_tuned_responses)}")
print(f"  Avg PPL      : {_avg(gptoss_tuned_perplexities):.2f}")

---
## Phase 3: Qwen3-14B — Base Model

Qwen3-14B is a **different architecture** from GPT-OSS — Alibaba's third-generation
Qwen model using the ChatML format (`<|im_start|>` / `<|im_end|>` tokens) and
supporting an optional chain-of-thought thinking mode via `enable_thinking`.

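At 14B parameters vs 20B for GPT-OSS, Qwen3 is a *smaller* model by total
parameter count, although GPT-OSS 20B is a mixture-of-experts model that
activates only a fraction of its weights per token while Qwen3-14B is dense.
Qwen3 nevertheless reaches a lower training loss after 1 epoch (2.44 vs 3.01).
This phase evaluates whether Qwen3 is already more capable on Peterson-domain
questions *before* fine-tuning, which would favour Qwen3 architecturally.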

Key API difference: `generate_response_qwen3` uses a two-step tokenization
pipeline and strips any `<think>` blocks from the output.
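
The `<think>` cleanup can be illustrated in isolation. The sketch below uses the same non-greedy pattern as the wrapper and adds a second pattern for a block left unclosed when generation stops at `max_new_tokens`. That second pattern is an addition for illustration, not something the notebook does, and `strip_think_blocks` is a hypothetical name.

```python
import re


def strip_think_blocks(text: str) -> str:
    """Remove <think>...</think> blocks, including one left unclosed by truncation."""
    # Complete blocks: non-greedy match so multiple blocks are handled separately.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Unclosed block: generation hit the token limit before </think> was emitted.
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()


closed = strip_think_blocks("<think>\n\n</think>\n\nOrder and chaos are twin forces.")
unclosed = strip_think_blocks("Meaning matters. <think>still reasoning when cut off")
```

Running the complete-block pattern first matters: applied alone, the unclosed-block pattern would also delete any answer text that follows a properly closed block.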

In [None]:
_cache = CACHE_FILES["qwen3_base"]

if _cache.exists():
    print(f"Cache found — skipping Qwen3-14B Base inference.")
    print(f"  {_cache}")
else:
    # Qwen3-14B in 4-bit quantization needs less VRAM than GPT-OSS 20B but
    # still a substantial amount. We follow the same load/infer/
    # save/unload pattern to keep memory usage predictable.
    print("Loading Qwen3-14B Base model …")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = QWEN3_BASE_PATH,
        dtype           = None,
        max_seq_length  = 2048,
        load_in_4bit    = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(_model)
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    print("\nComputing perplexity on reference passages …")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    # Qwen3's generate_response wrapper applies the ChatML template with
    # enable_thinking=False, tokenises separately, and strips any <think> tags.
    print("\nGenerating responses …")
    _resps = []
    for i, prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{i+1:02d}/{len(EVAL_PROMPTS)}] {prompt[:70]}")
        r = generate_response_qwen3(_model, _tok, prompt,
                                    SYSTEM_PROMPTS["qwen3_base"])
        _resps.append(r)
        print(f"         → {r[:100]}\n")

    with open(_cache, "wb") as f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, f)
    print(f"\nSaved → {_cache}")

    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

with open(_cache, "rb") as f:
    _d = pickle.load(f)
qwen3_base_responses    = _d["responses"]
qwen3_base_perplexities = _d["perplexities"]
print(f"\nQwen3-14B Base loaded from cache.")
print(f"  Responses    : {len(qwen3_base_responses)}")
print(f"  Avg PPL      : {_avg(qwen3_base_perplexities):.2f}")

---
## Phase 4: Qwen3-14B — Fine-Tuned Model

The Qwen3-14B model fine-tuned on the same four Peterson books.

Training stats: **321 steps**, **23.3 min**, **loss 2.44**, batch size 2.

Notable: despite being a smaller model, Qwen3 reached a *lower* training loss
in a *third* of the time and with *half* as many gradient updates.  Whether
this superior training efficiency translates to better inference quality is
exactly what this phase (and the subsequent comparisons) will reveal.

In [None]:
_cache = CACHE_FILES["qwen3_tuned"]

if _cache.exists():
    print(f"Cache found — skipping Qwen3-14B Fine-Tuned inference.")
    print(f"  {_cache}")
else:
    print("Loading Qwen3-14B Fine-Tuned model (base + LoRA adapters) …")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = QWEN3_LORA_PATH,
        dtype           = None,
        max_seq_length  = 2048,
        load_in_4bit    = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(_model)
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    print("\nComputing perplexity on reference passages …")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    print("\nGenerating responses …")
    _resps = []
    for i, prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{i+1:02d}/{len(EVAL_PROMPTS)}] {prompt[:70]}")
        r = generate_response_qwen3(_model, _tok, prompt,
                                    SYSTEM_PROMPTS["qwen3_tuned"])
        _resps.append(r)
        print(f"         → {r[:100]}\n")

    with open(_cache, "wb") as f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, f)
    print(f"\nSaved → {_cache}")

    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

with open(_cache, "rb") as f:
    _d = pickle.load(f)
qwen3_tuned_responses    = _d["responses"]
qwen3_tuned_perplexities = _d["perplexities"]
print(f"\nQwen3-14B Fine-Tuned loaded from cache.")
print(f"  Responses    : {len(qwen3_tuned_responses)}")
print(f"  Avg PPL      : {_avg(qwen3_tuned_perplexities):.2f}")

---
## Step 5: Aggregate All Metrics

Now that all four pkl files have been loaded, we compute the remaining metrics
(TF-IDF similarity and text statistics) for every model at once.  No GPU is
required at this point — all computation is CPU-bound.

In [None]:
# ── Consolidate all response and perplexity lists into dicts ───────────────
all_responses = {
    "gptoss_base":  gptoss_base_responses,
    "gptoss_tuned": gptoss_tuned_responses,
    "qwen3_base":   qwen3_base_responses,
    "qwen3_tuned":  qwen3_tuned_responses,
}
all_perplexities = {
    "gptoss_base":  gptoss_base_perplexities,
    "gptoss_tuned": gptoss_tuned_perplexities,
    "qwen3_base":   qwen3_base_perplexities,
    "qwen3_tuned":  qwen3_tuned_perplexities,
}

# ── TF-IDF cosine similarity ───────────────────────────────────────────────
# Measures how closely each model's vocabulary distribution resembles
# Peterson's actual writing.  Computed once per model against the same
# eight reference passages.
print("Computing TF-IDF similarities …")
all_similarities = {
    k: compute_tfidf_similarity(all_responses[k], PETERSON_PASSAGES)
    for k in MODEL_KEYS
}

# ── Text statistics ────────────────────────────────────────────────────────
# Word counts, sentence counts, TTR, keyword density, word lists for clouds.
print("Computing text statistics …")
all_stats = {k: compute_text_stats(all_responses[k]) for k in MODEL_KEYS}

# ── Averages ───────────────────────────────────────────────────────────────
# Pre-compute averages used in multiple charts so they are consistent.
avg_ppl  = {k: _avg(all_perplexities[k])              for k in MODEL_KEYS}
avg_sim  = {k: _avg(all_similarities[k])               for k in MODEL_KEYS}
avg_kd   = {k: _avg(all_stats[k]["keyword_density"])   for k in MODEL_KEYS}
avg_ttr  = {k: _avg(all_stats[k]["ttr_values"])        for k in MODEL_KEYS}
avg_len  = {k: _avg(all_stats[k]["word_counts"])       for k in MODEL_KEYS}

# ── Summary table ──────────────────────────────────────────────────────────
print()
print(f"{'Metric':<30} {'GPT-OSS Base':>14} {'GPT-OSS FT':>14} {'Qwen3 Base':>12} {'Qwen3 FT':>12}")
print("─" * 86)
print(f"{'Avg Perplexity':<30} {avg_ppl['gptoss_base']:>14.2f} {avg_ppl['gptoss_tuned']:>14.2f} {avg_ppl['qwen3_base']:>12.2f} {avg_ppl['qwen3_tuned']:>12.2f}")
print(f"{'Avg TF-IDF Similarity':<30} {avg_sim['gptoss_base']:>14.4f} {avg_sim['gptoss_tuned']:>14.4f} {avg_sim['qwen3_base']:>12.4f} {avg_sim['qwen3_tuned']:>12.4f}")
print(f"{'Avg Keyword Density':<30} {100*avg_kd['gptoss_base']:>13.2f}% {100*avg_kd['gptoss_tuned']:>13.2f}% {100*avg_kd['qwen3_base']:>11.2f}% {100*avg_kd['qwen3_tuned']:>11.2f}%")
print(f"{'Avg TTR':<30} {avg_ttr['gptoss_base']:>14.4f} {avg_ttr['gptoss_tuned']:>14.4f} {avg_ttr['qwen3_base']:>12.4f} {avg_ttr['qwen3_tuned']:>12.4f}")
print(f"{'Avg Response Length (words)':<30} {avg_len['gptoss_base']:>13.1f}  {avg_len['gptoss_tuned']:>13.1f}  {avg_len['qwen3_base']:>11.1f}  {avg_len['qwen3_tuned']:>11.1f}")
print("─" * 86)

---
## Figure 1: Perplexity on Peterson Reference Passages

**Left panel** — Per-passage perplexity for all four models.  Each cluster of
four bars corresponds to one held-out reference passage.  We expect both
fine-tuned models to sit lower in their respective clusters.

**Right panel** — Average perplexity across all eight passages.  This is the
single most direct measure of domain adaptation: lower = more "in-domain".
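
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows only that arithmetic, on made-up token log-probabilities. It does not reproduce the notebook's `compute_perplexity`, which needs a model forward pass, and `perplexity_from_logprobs` is a hypothetical name.

```python
import math


def perplexity_from_logprobs(token_logprobs: list[float]) -> float:
    """PPL = exp(-(1/N) * sum(log p(token_i | preceding tokens)))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Illustrative values only: natural-log probabilities for a 4-token passage.
confident = [math.log(0.5)] * 4   # the model gives every token p = 0.5
uncertain = [math.log(0.1)] * 4   # the model gives every token p = 0.1

print(perplexity_from_logprobs(confident))
print(perplexity_from_logprobs(uncertain))
```

Because the average is taken per token, the base-vs-tuned comparison within one architecture is the more reliable reading. GPT-OSS and Qwen3 split the same passage into different tokens, so their perplexities are not on an identical scale.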

In [None]:
x      = np.arange(len(PETERSON_PASSAGES))
width  = 0.18   # width of each individual bar
n      = len(MODEL_KEYS)
colors = [MODEL_COLORS[k] for k in MODEL_KEYS]
labels = [MODEL_SHORT[k]  for k in MODEL_KEYS]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6),
                                gridspec_kw={'width_ratios': [3, 1]})
fig.suptitle("Perplexity on Held-Out Peterson Passages", fontsize=15, fontweight='bold')

# ── Left: per-passage grouped bar chart ────────────────────────────────────
# Each group of 4 bars shows one passage; bars within the group are the four
# model variants.  Centring the group at each integer x position requires
# offsetting each bar by (i - (n-1)/2) * width.
for i, (key, color, label) in enumerate(zip(MODEL_KEYS, colors, labels)):
    offset = (i - (n - 1) / 2) * width
    ax1.bar(x + offset, all_perplexities[key], width,
            label=label, color=color, alpha=0.85, edgecolor='white')
ax1.set_xlabel("Reference Passage")
ax1.set_ylabel("Perplexity  (lower = more domain-adapted)")
ax1.set_title("Per-Passage Perplexity")
ax1.set_xticks(x)
ax1.set_xticklabels([f"P{i+1}" for i in range(len(PETERSON_PASSAGES))],
                     fontsize=10)
ax1.legend(fontsize=9, loc='upper right')
ax1.grid(axis='y', alpha=0.4)

# ── Right: average perplexity summary ─────────────────────────────────────
avgs = [avg_ppl[k] for k in MODEL_KEYS]
bars = ax2.bar(range(n), avgs, color=colors, alpha=0.85, edgecolor='white', width=0.6)
for bar, v in zip(bars, avgs):
    ax2.text(bar.get_x() + bar.get_width() / 2, v + 0.2,
             f"{v:.1f}", ha='center', va='bottom', fontsize=10, fontweight='bold')
ax2.set_xticks(range(n))
ax2.set_xticklabels(labels, fontsize=9)
ax2.set_ylabel("Average Perplexity")
ax2.set_title("Average Perplexity")
ax2.grid(axis='y', alpha=0.4)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "01_perplexity.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 01_perplexity.png")

---
## Figure 2: TF-IDF Cosine Similarity to Peterson's Writing

For each model response to each evaluation prompt, we compute the cosine
similarity between its TF-IDF vector and each of the eight reference passages,
then take the *maximum* — so a response only needs to resemble one passage to
score well.

**Higher = the model's word choices are more similar to how Peterson actually writes.**
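
The max-over-passages scoring described above can be sketched with the standard library alone. This is not the notebook's `compute_tfidf_similarity`. The tokeniser, the smoothed IDF formula, fitting the IDF on the response plus the references, and the example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in token_lists for term in set(toks))
    # Smoothed IDF, so a term present in every document keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in token_lists]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_similarity(response: str, references: list[str]) -> float:
    """Highest cosine similarity between the response and any reference."""
    vectors = tfidf_vectors([response] + references)
    return max(cosine(vectors[0], ref) for ref in vectors[1:])


# Made-up sentences, not quotations from the reference passages.
references = [
    "order and chaos are the two fundamental elements of experience",
    "the hero confronts the dragon and returns with treasure",
]
on_topic = "you must balance order and chaos in your own experience"
off_topic = "the quarterly revenue forecast was revised upward"
print(max_similarity(on_topic, references), max_similarity(off_topic, references))
```

Taking the maximum rewards a response for matching any single passage closely. Taking the mean would instead reward broad overlap with all eight.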

In [None]:
x     = np.arange(len(EVAL_PROMPTS))
width = 0.18

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6),
                                gridspec_kw={'width_ratios': [3, 1]})
fig.suptitle("TF-IDF Cosine Similarity to Peterson's Writing",
             fontsize=15, fontweight='bold')

# ── Left: per-prompt grouped bars ─────────────────────────────────────────
for i, (key, color, label) in enumerate(zip(MODEL_KEYS, colors, labels)):
    offset = (i - (n - 1) / 2) * width
    ax1.bar(x + offset, all_similarities[key], width,
            label=label, color=color, alpha=0.85, edgecolor='white')
ax1.set_xlabel("Evaluation Prompt")
ax1.set_ylabel("TF-IDF Cosine Similarity  (higher = more Peterson-like)")
ax1.set_title("Per-Prompt TF-IDF Similarity")
ax1.set_xticks(x)
ax1.set_xticklabels([f"Q{i+1}" for i in range(len(EVAL_PROMPTS))], fontsize=10)
ax1.legend(fontsize=9, loc='upper right')
ax1.grid(axis='y', alpha=0.4)

# ── Right: average summary ─────────────────────────────────────────────────
avgs = [avg_sim[k] for k in MODEL_KEYS]
bars = ax2.bar(range(n), avgs, color=colors, alpha=0.85, edgecolor='white', width=0.6)
for bar, v in zip(bars, avgs):
    ax2.text(bar.get_x() + bar.get_width() / 2, v + 0.001,
             f"{v:.4f}", ha='center', va='bottom', fontsize=9, fontweight='bold')
ax2.set_xticks(range(n))
ax2.set_xticklabels(labels, fontsize=9)
ax2.set_ylabel("Average TF-IDF Similarity")
ax2.set_title("Average TF-IDF Similarity")
ax2.grid(axis='y', alpha=0.4)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "02_tfidf_similarity.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 02_tfidf_similarity.png")

---
## Figure 3: Peterson Keyword Density

Keyword density = (Peterson-signature words in response) ÷ (total words).

This is a more *targeted* measure than TF-IDF: we are specifically asking
"does the model spontaneously use terms like chaos, logos, archetype, hierarchy,
sovereignty, etc.?"  High density suggests the fine-tuning has wired those
concepts into the model's first-choice vocabulary.
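
A minimal sketch of the density formula above, assuming a simple regex tokeniser rather than the notebook's own tokenisation. `SAMPLE_KEYWORDS` is a hypothetical five-term subset, and the example sentence is invented.

```python
import re

# Hypothetical five-term subset; the notebook's PETERSON_KEYWORDS has ~60 entries.
SAMPLE_KEYWORDS = {"chaos", "order", "logos", "archetype", "hierarchy"}


def keyword_density(response: str, keywords: set[str]) -> float:
    """Fraction of alphabetic tokens in the response that are keywords."""
    tokens = re.findall(r"[a-z]+", response.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in keywords)
    return hits / len(tokens)


response = "Order emerges from chaos when you confront the dominance hierarchy."
print(f"{100 * keyword_density(response, SAMPLE_KEYWORDS):.1f}%")
```

Exact matching of this kind misses inflected forms such as "hierarchies", so the measured density is a lower bound on how often the concepts actually appear.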

In [None]:
x     = np.arange(len(EVAL_PROMPTS))
width = 0.18

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6),
                                gridspec_kw={'width_ratios': [3, 1]})
fig.suptitle("Peterson Keyword Density per Prompt", fontsize=15, fontweight='bold')

# ── Left: per-prompt bars ──────────────────────────────────────────────────
for i, (key, color, label) in enumerate(zip(MODEL_KEYS, colors, labels)):
    offset = (i - (n - 1) / 2) * width
    ax1.bar(x + offset,
            [100 * v for v in all_stats[key]["keyword_density"]],
            width, label=label, color=color, alpha=0.85, edgecolor='white')
ax1.set_xlabel("Evaluation Prompt")
ax1.set_ylabel("Keyword Density  (%)")
ax1.set_title("Per-Prompt Keyword Density")
ax1.set_xticks(x)
ax1.set_xticklabels([f"Q{i+1}" for i in range(len(EVAL_PROMPTS))], fontsize=10)
ax1.legend(fontsize=9, loc='upper right')
ax1.grid(axis='y', alpha=0.4)

# ── Right: average summary ─────────────────────────────────────────────────
avgs = [100 * avg_kd[k] for k in MODEL_KEYS]
bars = ax2.bar(range(n), avgs, color=colors, alpha=0.85, edgecolor='white', width=0.6)
for bar, v in zip(bars, avgs):
    ax2.text(bar.get_x() + bar.get_width() / 2, v + 0.05,
             f"{v:.2f}%", ha='center', va='bottom', fontsize=9, fontweight='bold')
ax2.set_xticks(range(n))
ax2.set_xticklabels(labels, fontsize=9)
ax2.set_ylabel("Average Keyword Density  (%)")
ax2.set_title("Average Keyword Density")
ax2.grid(axis='y', alpha=0.4)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "03_keyword_density.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 03_keyword_density.png")

---
## Figure 4: Response Characteristics

Three-panel chart showing per-model averages for:
- **Word count** — how verbose is each model?
- **Sentence count** — structural complexity of responses
- **Type-Token Ratio (TTR)** — vocabulary richness (unique/total words).
  Higher TTR = more varied word choice.  Peterson is known for elaborate,
  varied prose, so a higher TTR suggests stylistic alignment.
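
The TTR definition can be sketched as follows, again assuming a simple regex tokeniser. `type_token_ratio` is a hypothetical name and the example strings are invented.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique alphabetic tokens divided by total alphabetic tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


short = "clean your room"
long_ = "clean your room and then clean your house and then clean your life"

print(type_token_ratio(short))
print(round(type_token_ratio(long_), 3))
```

The longer string scores lower because words repeat as text grows. TTR is therefore sensitive to response length, and it is best read alongside the word-count panel when the models differ a lot in verbosity.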

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Response Characteristics — All 4 Models", fontsize=14, fontweight='bold')

stat_keys    = ["word_counts",     "sentence_counts",  "ttr_values"]
stat_titles  = ["Avg Word Count",  "Avg Sentence Count","Avg Type-Token Ratio"]
stat_ylabels = ["Words",           "Sentences",         "TTR  (unique/total words)"]

for ax, stat_key, title, ylabel in zip(axes, stat_keys, stat_titles, stat_ylabels):
    avgs = [_avg(all_stats[k][stat_key]) for k in MODEL_KEYS]
    bars = ax.bar(range(n), avgs, color=colors, alpha=0.85,
                  edgecolor='white', width=0.6)
    # Annotate each bar with its numeric value
    for bar, v in zip(bars, avgs):
        fmt = f"{v:.3f}" if "ttr" in stat_key else f"{v:.1f}"
        ax.text(bar.get_x() + bar.get_width() / 2, v * 1.02,
                fmt, ha='center', va='bottom', fontsize=9, fontweight='bold')
    ax.set_xticks(range(n))
    ax.set_xticklabels(labels, fontsize=8)
    ax.set_title(title, fontsize=11, fontweight='bold')
    ax.set_ylabel(ylabel)
    ax.grid(axis='y', alpha=0.4)

# Add a shared legend patch at the bottom of the figure
legend_patches = [
    mpatches.Patch(color=MODEL_COLORS[k], label=MODEL_DISPLAY[k])
    for k in MODEL_KEYS
]
fig.legend(handles=legend_patches, loc='lower center', ncol=4,
           fontsize=9, bbox_to_anchor=(0.5, -0.08))

plt.tight_layout()
plt.savefig(FIGURES_DIR / "04_response_characteristics.png",
            bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 04_response_characteristics.png")

---
## Figure 5: Word Clouds

Word clouds show the most frequent *content* words (stop words removed) in
each model's aggregate responses.  Peterson's vocabulary centres on a
distinctive cluster of terms — order, chaos, meaning, responsibility, hero,
archetype — so we expect to see those words dominate the fine-tuned clouds.

The four clouds are arranged in a 2×2 grid (architecture × training status):

```
  GPT-OSS Base    |  GPT-OSS Fine-Tuned
  ────────────────┼──────────────────────
  Qwen3 Base      |  Qwen3 Fine-Tuned
```

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle("Word Clouds — Content Word Frequency per Model",
             fontsize=15, fontweight='bold')

for ax, key in zip(axes.flat, MODEL_KEYS):
    words = all_stats[key]["all_words"]
    if words:
        # One shared colormap is used for every cloud; the model's designated
        # colour is applied to the subplot title instead, which ties each
        # cloud to the bar charts above.
        wc = WordCloud(
            width=600, height=400,
            background_color='white',
            colormap='viridis',       # consistent palette regardless of model
            max_words=80,
            collocations=False,       # avoid duplicating bigrams
            stopwords=STOPWORDS,
        ).generate(' '.join(words))
        ax.imshow(wc, interpolation='bilinear')
    else:
        ax.text(0.5, 0.5, "(no content words)", ha='center', va='center',
                transform=ax.transAxes, fontsize=14, color='grey')
    ax.set_title(MODEL_DISPLAY[key], fontsize=12, fontweight='bold',
                 color=MODEL_COLORS[key])
    ax.axis('off')

plt.tight_layout()
plt.savefig(FIGURES_DIR / "05_wordclouds.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 05_wordclouds.png")

---
## Figure 6: Peterson Keyword Heatmap (per prompt × per keyword)

Each cell shows the raw count of a specific Peterson keyword in the response
to a specific prompt.  The top-20 most-used keywords (pooled across all four
models) are shown on the x-axis; the 10 prompts on the y-axis.

Comparing the four heatmaps side by side reveals:
- Which keywords the fine-tuned models spontaneously use more
- Whether fine-tuning concentrates usage on a few keywords (overfitting a
  limited vocabulary) or spreads it broadly across the full dictionary

In [None]:
kw_set = set(k.lower() for k in PETERSON_KEYWORDS)

def per_prompt_kw_matrix(responses, keywords):
    """
    Return a (n_prompts × n_keywords) integer matrix counting keyword hits.

    Each row corresponds to one response; each column to one keyword.
    The matrix is used for the heatmap visualisation.
    """
    mat = []
    for resp in responses:
        words = word_tokenize(resp.lower()) if resp.strip() else []
        words_alpha = [w for w in words if w.isalpha()]
        mat.append([words_alpha.count(kw) for kw in keywords])
    return np.array(mat)

# ── Build top-20 keyword list from the union of all model responses ────────
all_kw_counter = Counter()
for key in MODEL_KEYS:
    for resp in all_responses[key]:
        for w in word_tokenize(resp.lower()):
            if w in kw_set:
                all_kw_counter[w] += 1
top20 = [kw for kw, _ in all_kw_counter.most_common(20)]

if not top20:
    print("No Peterson keywords found across any model — skipping heatmap.")
else:
    # Compute keyword matrices for all four models
    matrices = {k: per_prompt_kw_matrix(all_responses[k], top20) for k in MODEL_KEYS}
    vmax = max(m.max() for m in matrices.values() if m.size > 0) or 1

    p_labels = [f"Q{i+1}: {EVAL_PROMPTS[i][:25]}…" for i in range(len(EVAL_PROMPTS))]

    fig, axes = plt.subplots(2, 2, figsize=(18, 14), sharey=True)
    fig.suptitle("Peterson Keyword Usage Heatmap — All 4 Models",
                 fontsize=15, fontweight='bold')

    for ax, key in zip(axes.flat, MODEL_KEYS):
        mat = matrices[key]
        im = ax.imshow(mat, aspect='auto', cmap='YlOrRd', vmin=0, vmax=vmax)
        ax.set_xticks(range(len(top20)))
        ax.set_xticklabels(top20, rotation=45, ha='right', fontsize=8)
        ax.set_yticks(range(len(p_labels)))
        ax.set_yticklabels(p_labels, fontsize=8)
        ax.set_title(MODEL_DISPLAY[key], fontsize=11, fontweight='bold',
                     color=MODEL_COLORS[key])
        ax.set_xlabel("Peterson Keyword")
        # Annotate non-zero cells with their count value
        for i in range(mat.shape[0]):
            for j in range(mat.shape[1]):
                if mat[i, j] > 0:
                    ax.text(j, i, str(mat[i, j]), ha='center', va='center',
                            fontsize=7,
                            color='white' if mat[i, j] > vmax * 0.6 else 'black')
        plt.colorbar(im, ax=ax, label="Keyword count")

    plt.tight_layout()
    plt.savefig(FIGURES_DIR / "06_keyword_heatmap.png", bbox_inches='tight', dpi=150)
    plt.show()
    print("Saved: 06_keyword_heatmap.png")

---
## Figure 7: Radar / Spider Chart — Overall "Peterson-Likeness"

The radar chart summarises all five metrics simultaneously for all four models
on a single normalised scale [0.1, 1.0].  All metrics are oriented so that
**larger = more Peterson-like**:

| Spoke | Raw metric | Orientation |
|-------|-----------|-------------|
| Perplexity (inverted) | Lower PPL → higher score | inverted |
| TF-IDF Similarity | Higher cos-sim → higher score | normal |
| Keyword Density | Higher % → higher score | normal |
| Vocabulary Richness (TTR) | Higher TTR → higher score | normal |
| Response Length | Longer responses → higher score | normal |

Normalisation uses min–max across all four models simultaneously, so the
chart shows *relative* differences rather than absolute magnitudes.  A larger
polygon area means the model scores higher, relative to the other three models,
on more of the five axes.
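
The polygon area mentioned above can be computed directly from the normalised scores. With equally spaced spokes, each adjacent pair of scores spans a triangle, so the area is half the sine of the spoke angle times the sum of adjacent products. `radar_area` is a hypothetical helper and the scores are illustrative, not results from this notebook.

```python
import math


def radar_area(scores: list[float]) -> float:
    """Area of a radar polygon whose spokes are equally spaced in angle."""
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three spokes")
    wedge = math.sin(2 * math.pi / n)
    # Adjacent spokes r_i and r_j span a triangle of area 0.5 * r_i * r_j * sin(angle).
    return 0.5 * wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))


# Illustrative normalised scores only.
scores = [1.0, 0.1, 1.0, 0.1, 1.0]
reordered = [1.0, 1.0, 1.0, 0.1, 0.1]   # same values, different spoke order

print(round(radar_area(scores), 3), round(radar_area(reordered), 3))
```

The two areas differ even though the values are identical, so the area depends on the order of the spokes. It works as a visual summary rather than as a ranking statistic.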

In [None]:
def norm4(values: list[float], higher_better: bool = True) -> list[float]:
    """
    Normalise four metric values to [0.1, 1.0].

    Min-max normalisation is applied across all four values so that the
    best model always reaches 1.0 and the worst always reaches 0.1 on each
    axis.  Setting higher_better=False inverts the scale so that metrics
    where a lower value is better (like perplexity) still point "outward"
    on the radar chart.
    """
    lo, hi = min(values), max(values)
    if abs(hi - lo) < 1e-9:
        return [0.5] * len(values)   # all models equal on this axis
    normed = [(v - lo) / (hi - lo) for v in values]
    if not higher_better:
        normed = [1.0 - v for v in normed]   # invert so outward = better
    return [0.1 + 0.9 * v for v in normed]

# ── Build radar axes ───────────────────────────────────────────────────────
# Each entry is (axis_label, [score_per_model]) where scores are in MODEL_KEYS order.
radar_metrics = [
    ("Perplexity\n(inverted)",
     norm4([avg_ppl[k]  for k in MODEL_KEYS], higher_better=False)),
    ("TF-IDF\nSimilarity",
     norm4([avg_sim[k]  for k in MODEL_KEYS], higher_better=True)),
    ("Keyword\nDensity",
     norm4([avg_kd[k]   for k in MODEL_KEYS], higher_better=True)),
    ("Vocabulary\nRichness (TTR)",
     norm4([avg_ttr[k]  for k in MODEL_KEYS], higher_better=True)),
    ("Response\nLength",
     norm4([avg_len[k]  for k in MODEL_KEYS], higher_better=True)),
]

# Close the radar polygon by repeating the first point
spoke_labels = [m[0] for m in radar_metrics] + [radar_metrics[0][0]]
angles       = np.linspace(0, 2 * np.pi, len(spoke_labels), endpoint=True)

fig, ax = plt.subplots(figsize=(9, 9), subplot_kw=dict(polar=True))
fig.suptitle(
    "Radar Summary: Peterson-Likeness Across All Metrics\n(all 4 model variants)",
    fontsize=14, fontweight='bold', y=1.02)

for i, key in enumerate(MODEL_KEYS):
    # Extract this model's score on each axis, then close the polygon
    values = [m[1][i] for m in radar_metrics] + [radar_metrics[0][1][i]]
    ax.plot(angles, values, color=MODEL_COLORS[key], lw=2.5,
            label=MODEL_DISPLAY[key])
    ax.fill(angles, values, color=MODEL_COLORS[key], alpha=0.08)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(spoke_labels[:-1], size=11)
ax.set_yticks([0.25, 0.5, 0.75, 1.0])
ax.set_yticklabels(['0.25', '0.5', '0.75', '1.0'], size=7, color='grey')
ax.set_ylim(0, 1)
ax.legend(loc='upper right', bbox_to_anchor=(1.45, 1.15), fontsize=10)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "07_radar_summary.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 07_radar_summary.png")

---
## Step 11: Full Metric Summary Table

The table below shows raw metric values for all four models plus the
intra-architecture fine-tuning improvement (base → fine-tuned) for each metric.

Symbol key:
- `▲ ✓` — value moved in the *beneficial* direction
- `▼ ✗` — value moved in the *wrong* direction
- `—`   — informational metric (no "correct" direction)

In [None]:
def pct_change(bv: float, tv: float, higher_better: bool = True) -> str:
    """
    Format a percentage change from base value (bv) to tuned value (tv).

    Returns a string like '+12.3% ▲ ✓' when the change is in the expected
    direction, or '-5.1% ▼ ✗' when it is not.  An improvement is:
    - decrease for lower-is-better metrics (like perplexity)
    - increase for higher-is-better metrics (like TF-IDF, keyword density)
    """
    if abs(bv) < 1e-9:
        return "N/A"
    pct    = 100 * (tv - bv) / bv
    is_up  = pct > 0
    # improved = moved in the desired direction
    ok     = (is_up == higher_better)
    arrow  = "▲" if is_up else "▼"
    check  = "✓" if ok else "✗"
    return f"{pct:+.1f}% {arrow} {check}"

rows = [
    {
        "Metric":              "Avg Perplexity  (↓ better)",
        "GPT-OSS Base":        f"{avg_ppl['gptoss_base']:.2f}",
        "GPT-OSS Fine-Tuned":  f"{avg_ppl['gptoss_tuned']:.2f}",
        "GPT-OSS Δ":           pct_change(avg_ppl['gptoss_base'], avg_ppl['gptoss_tuned'],
                                           higher_better=False),
        "Qwen3 Base":          f"{avg_ppl['qwen3_base']:.2f}",
        "Qwen3 Fine-Tuned":    f"{avg_ppl['qwen3_tuned']:.2f}",
        "Qwen3 Δ":             pct_change(avg_ppl['qwen3_base'], avg_ppl['qwen3_tuned'],
                                           higher_better=False),
    },
    {
        "Metric":              "Avg TF-IDF Similarity  (↑ better)",
        "GPT-OSS Base":        f"{avg_sim['gptoss_base']:.4f}",
        "GPT-OSS Fine-Tuned":  f"{avg_sim['gptoss_tuned']:.4f}",
        "GPT-OSS Δ":           pct_change(avg_sim['gptoss_base'], avg_sim['gptoss_tuned']),
        "Qwen3 Base":          f"{avg_sim['qwen3_base']:.4f}",
        "Qwen3 Fine-Tuned":    f"{avg_sim['qwen3_tuned']:.4f}",
        "Qwen3 Δ":             pct_change(avg_sim['qwen3_base'], avg_sim['qwen3_tuned']),
    },
    {
        "Metric":              "Avg Keyword Density  (↑ better)",
        "GPT-OSS Base":        f"{100*avg_kd['gptoss_base']:.2f}%",
        "GPT-OSS Fine-Tuned":  f"{100*avg_kd['gptoss_tuned']:.2f}%",
        "GPT-OSS Δ":           pct_change(avg_kd['gptoss_base'], avg_kd['gptoss_tuned']),
        "Qwen3 Base":          f"{100*avg_kd['qwen3_base']:.2f}%",
        "Qwen3 Fine-Tuned":    f"{100*avg_kd['qwen3_tuned']:.2f}%",
        "Qwen3 Δ":             pct_change(avg_kd['qwen3_base'], avg_kd['qwen3_tuned']),
    },
    {
        "Metric":              "Avg Type-Token Ratio  (↑ better)",
        "GPT-OSS Base":        f"{avg_ttr['gptoss_base']:.4f}",
        "GPT-OSS Fine-Tuned":  f"{avg_ttr['gptoss_tuned']:.4f}",
        "GPT-OSS Δ":           pct_change(avg_ttr['gptoss_base'], avg_ttr['gptoss_tuned']),
        "Qwen3 Base":          f"{avg_ttr['qwen3_base']:.4f}",
        "Qwen3 Fine-Tuned":    f"{avg_ttr['qwen3_tuned']:.4f}",
        "Qwen3 Δ":             pct_change(avg_ttr['qwen3_base'], avg_ttr['qwen3_tuned']),
    },
    {
        "Metric":              "Avg Response Length  (—  informational)",
        "GPT-OSS Base":        f"{avg_len['gptoss_base']:.1f} words",
        "GPT-OSS Fine-Tuned":  f"{avg_len['gptoss_tuned']:.1f} words",
        "GPT-OSS Δ":           f"{avg_len['gptoss_tuned']-avg_len['gptoss_base']:+.1f} words",
        "Qwen3 Base":          f"{avg_len['qwen3_base']:.1f} words",
        "Qwen3 Fine-Tuned":    f"{avg_len['qwen3_tuned']:.1f} words",
        "Qwen3 Δ":             f"{avg_len['qwen3_tuned']-avg_len['qwen3_base']:+.1f} words",
    },
]

df = pd.DataFrame(rows)
col_order = ["Metric",
             "GPT-OSS Base", "GPT-OSS Fine-Tuned", "GPT-OSS Δ",
             "Qwen3 Base",   "Qwen3 Fine-Tuned",   "Qwen3 Δ"]
df = df[col_order]

print("=" * 110)
print("FULL METRIC SUMMARY — Cross-Architecture Comparison  (Jordan Peterson Fine-Tuning)")
print("=" * 110)
print(df.to_string(index=False))
print("=" * 110)
print("\n✓ = change in expected direction  |  ✗ = no improvement  |  ▲/▼ = direction")

---
## Step 12: Side-by-Side Response Comparison

Quantitative metrics tell only part of the story.  Here we print the raw
responses from all four models for the first three prompts so you can directly
read and compare their wording, depth, and stylistic alignment with Peterson.

Look for:
- Does the fine-tuned model add Peterson-specific vocabulary (chaos, order, logos)?
- Does any model regurgitate book passages rather than generating structured answers?
- Is the response coherent and on-topic, or does it drift?

In [None]:
for i in range(min(3, len(EVAL_PROMPTS))):
    print("━" * 100)
    print(f"PROMPT {i+1}: {EVAL_PROMPTS[i]}")
    print("━" * 100)
    for key in MODEL_KEYS:
        resp = all_responses[key][i]
        label = MODEL_DISPLAY[key]
        print(f"\n[{label}]")
        print(resp if resp.strip() else "(empty response)")
    print()

In [None]:
print("Saved figures:")
for f in sorted(FIGURES_DIR.iterdir()):
    print(f"  {f.name}  ({f.stat().st_size / 1024:.0f} KB)")

---
## Conclusions

### What perplexity tells us

Perplexity is the most unambiguous signal of domain adaptation.  A model that
has truly learned Peterson's writing will assign higher probability to his actual
sentences — producing lower perplexity on held-out passages.  We expect both
fine-tuned models to score lower than their respective base models; the
*magnitude* of that drop tells us how thoroughly the training signal was absorbed.

### Interpreting the vocabulary metrics (TF-IDF, keyword density, TTR)

These three metrics can *decrease* after fine-tuning even when training is
working correctly.  The reason is the **passage-regurgitation phenomenon**:
a model trained for only 1 epoch sometimes learns to output book passages
verbatim.  A passage is a sequence of continuous prose, not a curated answer
— it naturally has *lower* keyword concentration and a different vocabulary
distribution than a well-structured response.  If this is occurring, you will
see it clearly in the side-by-side comparison above (responses starting with
fragments like "the world, and…" or "…is the place of…").
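
One way to check for regurgitation numerically, rather than by eye, is to measure how many of a response's word n-grams also occur verbatim in the training text. The sketch below is an assumption about how such a check could look, not part of the notebook. `verbatim_overlap`, the 5-gram window, and the example strings are all hypothetical.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(response: str, source_text: str, n: int = 5) -> float:
    """Share of the response's word n-grams that also occur in source_text."""
    resp_grams = ngrams(response, n)
    if not resp_grams:
        return 0.0
    return len(resp_grams & ngrams(source_text, n)) / len(resp_grams)


# Neutral stand-in for training text, used only to exercise the function.
source = "the quick brown fox jumps over the lazy dog near the quiet river bank"
copied = "brown fox jumps over the lazy dog near"
fresh = "you should take responsibility for your own life"

print(verbatim_overlap(copied, source), verbatim_overlap(fresh, source))
```

A score near 1.0 for a fine-tuned response would point to memorised passages. A low score alongside high keyword density would point to stylistic transfer.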

### Architecture comparison

| | GPT-OSS 20B | Qwen3-14B |
|--|-------------|-----------|
| Parameters | 20B | 14B |
| Training loss (1 epoch) | 3.01 | **2.44** |
| Training time | 73.3 min | **23.3 min** |
| Batch size | 1 | **2** |
| VRAM peak | 13.8 GB | **13.4 GB** |

Qwen3-14B reached a substantially *lower* training loss while being the smaller
model and training about three times faster (23.3 vs 73.3 min).  This is
consistent with a stronger 14B base model, but the two losses are not strictly
comparable: each is a per-token average under a different tokenizer, and the
runs used different batch sizes.  Whether the training-loss advantage translates
into *perceptually* better responses depends on whether the model has learned to
generalise (answering questions in Peterson's style) or merely to memorise
(reproducing passages).

### Next steps to improve inference quality

1. **Train for more epochs.** Around 3 epochs may shift the model from passage
   reproduction toward genuine stylistic transfer; re-check the side-by-side
   responses afterwards, since extra passes can also deepen memorisation.
2. **Enable sampling at inference.** Use `temperature=0.7, top_p=0.8` (the values
   Qwen recommends for Qwen3 in non-thinking mode) instead of greedy decoding to
   produce more natural answers.
3. **Try Qwen3 thinking mode.** Setting `enable_thinking=True` in the chat template
   makes the model emit a `<think>` reasoning block before its answer, which may
   produce substantively different and potentially deeper responses for complex
   Peterson-themed questions.
4. **Higher LoRA rank.** The fine-tuning used `r=16`; increasing to `r=32` gives
   the adapter more capacity to capture Peterson's stylistic patterns.
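
To put the rank suggestion in numbers: LoRA adds two low-rank matrices per adapted weight, so the trainable parameter count grows linearly with `r`. The sketch below uses a hypothetical square projection of width 4096, not the actual dimensions of either model, and `lora_params` is an illustrative helper.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns B (d_out x r) and A (r x d_in); the frozen weight is untouched.
    return r * (d_in + d_out)


# Hypothetical layer width, chosen only for illustration.
d = 4096
for rank in (16, 32):
    print(rank, lora_params(d, d, rank))
```

Doubling the rank doubles the adapter size for every adapted layer. That raises memory use during training, and on a four-book corpus it may also make memorisation easier.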