# Qwen3-14B All Versions Comparison: Jordan B. Peterson Fine-Tuning
## Base → V1 → V2 → V3: Tracking Training Task and Data-Quality Effects

This notebook evaluates all four Qwen3-14B checkpoints side by side on the same
Peterson-domain prompts and metrics, answering three specific research questions:

1. **Task change (V1→V2)** — Does switching from passage-completion to synthetic Q&A
   training improve answering ability, TF-IDF similarity, and keyword density?
2. **Data quality (V2→V3)** — Does removing front-matter pages from the training set
   produce measurably different metrics, given nearly identical training loss?
3. **Fine-tuning vs Base** — How do all three fine-tuned versions compare to the
   unmodified base model on Peterson-domain vocabulary and perplexity metrics?

### Evaluation Metrics

| Metric | What it measures | Direction |
|--------|-----------------|-----------|
| **Perplexity** | How surprised the model is by real Peterson text | ↓ better |
| **TF-IDF cosine similarity** | Vocabulary overlap with Peterson's writing | ↑ better |
| **Keyword density** | Rate of use of Peterson's ~60 signature terms | ↑ better |
| **Type-Token Ratio (TTR)** | Vocabulary richness (unique / total words) | ↑ better |
| **Response length** | Average words per response | informational |
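
As a rough illustration of the perplexity row, the sketch below computes perplexity from a list of per-token probabilities. The probabilities are made-up values rather than model outputs, and `perplexity_from_probs` is a hypothetical helper that is not part of this notebook.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up probabilities for illustration only.
confident = perplexity_from_probs([0.5, 0.5, 0.5, 0.5])  # text the model finds likely
surprised = perplexity_from_probs([0.1, 0.1, 0.1, 0.1])  # text the model finds unlikely
print(f"confident: {confident:.2f}, surprised: {surprised:.2f}")
```

A uniform per-token probability of 0.5 gives a perplexity of about 2, and 0.1 gives about 10, which is why lower values indicate text the model considers more in-domain.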

### GPU Memory Strategy

All four Qwen3-14B models (~13–16 GB VRAM each) are evaluated **sequentially**:

```
Load → Infer → Save pkl → Unload → repeat for next model
```

Once all four pkl files exist in `comparison_cache_qwen3_versions/`, no model loading
is required — metrics and charts load from cache in seconds. Delete any pkl to force
re-evaluation of that model.
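
The cache-or-compute pattern can be sketched in isolation as follows. This is a minimal sketch, not the notebook's actual cell code: `load_or_compute` and `fake_inference` are hypothetical names, and the fake function stands in for the real load, infer, unload sequence so the example runs without a GPU.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(pkl_path, compute_fn):
    """Return cached results if pkl_path exists; otherwise compute and cache them."""
    pkl_path = Path(pkl_path)
    if not pkl_path.exists():
        payload = compute_fn()          # expensive step: load model, infer, unload
        with open(pkl_path, "wb") as f:
            pickle.dump(payload, f)
    with open(pkl_path, "rb") as f:
        return pickle.load(f)

# Stand-in for real inference; records how often it is called.
calls = []
def fake_inference():
    calls.append(1)
    return {"responses": ["placeholder"], "perplexities": [12.3]}

with tempfile.TemporaryDirectory() as tmp:
    pkl = Path(tmp) / "base_results.pkl"
    first = load_or_compute(pkl, fake_inference)   # cache miss: runs fake_inference
    second = load_or_compute(pkl, fake_inference)  # cache hit: reads the pkl only
    pkl.unlink()                                   # deleting the pkl forces a re-run
    third = load_or_compute(pkl, fake_inference)

print(f"inference ran {len(calls)} times for 3 requests")
```

Always reading the result back from disk, even right after writing it, means the cached and freshly computed paths return exactly the same object shape.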

---
## Model Registry

| Key | Path | Training Task | Epochs | LoRA | Training Stats |
|-----|------|---------------|--------|------|----------------|
| `base` | `unsloth/Qwen3-14B-unsloth-bnb-4bit` | None (base) | — | — | — |
| `v1` | `./outputs/qwen3_14b_jordan_peterson_lora` | Passage completion | 1 | r=16 | 321 steps, 23.3 min, loss 2.44 |
| `v2` | `./outputs/qwen3_14b_peterson_v2_lora` | Synthetic Q&A (5,029 pairs, w/ front matter) | 3 | r=32 | 1,887 steps, 144.0 min, loss 1.5058 |
| `v3` | `./outputs/qwen3_14b_peterson_v3_lora` | Synthetic Q&A (4,867 pairs, clean) | 3 | r=32 | 1,827 steps, 136.9 min, loss 1.5258 |

### Notes

- **V1 training task**: V1 was trained on passage-completion (user = passage fragment,
  assistant = continuation). At inference we apply the Peterson persona system prompt
  for fair comparison, even though V1 never saw that prompt during training. Expect
  lower Q&A quality — V1 learned to continue passages, not answer questions.
- **V2 vs V3 comparability**: Identical architecture (r=32, 3 epochs, same LR). The
  only difference is the training data: V2 includes front-matter pages; V3 uses the
  DataPrep notebook's front-matter heuristic to remove them.
- **V3 data reduction**: Net −162 pairs (5,029 → 4,867). Largest reduction was in
  *We Who Wrestle with God* (−103 pairs), not *Maps of Meaning* as predicted. The
  V3 training loss is +0.02 vs V2 — negligible, within noise.
- **V3 index contamination**: Evaluation showed Q2 produced raw index text
  ("143–74 and Genesis story, 160–68…"), confirming that back-matter content
  is still present. Front-matter removal alone is insufficient.
- **System prompts**: `base` receives `"You are a helpful assistant."`; `v1`, `v2`,
  `v3` all receive the Peterson persona prompt from `peterson_config.json`.
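
The V2 to V3 figures quoted above can be checked with a few lines of arithmetic. All numbers are copied from the registry table and the notes; nothing here is measured anew.

```python
v2_pairs, v3_pairs = 5029, 4867
v2_loss, v3_loss = 1.5058, 1.5258
wwwg_removed = 103  # pairs removed from We Who Wrestle with God

pair_delta = v3_pairs - v2_pairs
pair_pct = 100 * pair_delta / v2_pairs
loss_delta = v3_loss - v2_loss
loss_pct = 100 * loss_delta / v2_loss
wwwg_share = 100 * wwwg_removed / abs(pair_delta)

print(f"pairs: {pair_delta:+d} ({pair_pct:+.1f}%)")
print(f"loss : {loss_delta:+.4f} ({loss_pct:+.1f}%)")
print(f"share of removed pairs from We Who Wrestle with God: {wwwg_share:.1f}%")
```

The data shrinks by roughly 3% while the final loss rises by a little over 1%, and close to two thirds of the removed pairs come from a single book.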

In [None]:
import os, re, gc, math, pickle
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt',     quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

# ── System prompts ─────────────────────────────────────────────────────────
import json as _json
with open("peterson_config.json", encoding="utf-8") as _f:
    _cfg = _json.load(_f)
TUNED_SYSTEM_PROMPT = _cfg["system_prompt"]
BASE_SYSTEM_PROMPT  = "You are a helpful assistant."

# ── Model registry ─────────────────────────────────────────────────────────
MODEL_KEYS   = ["base", "v1", "v2", "v3"]
MODEL_PATHS  = {
    "base": "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "v1":   "./outputs/qwen3_14b_jordan_peterson_lora",
    "v2":   "./outputs/qwen3_14b_peterson_v2_lora",
    "v3":   "./outputs/qwen3_14b_peterson_v3_lora",
}
MODEL_DISPLAY = {
    "base": "Qwen3-14B  |  Base",
    "v1":   "Qwen3-14B  |  V1 (1-ep completion)",
    "v2":   "Qwen3-14B  |  V2 (3-ep Q&A)",
    "v3":   "Qwen3-14B  |  V3 (3-ep clean)",
}
MODEL_SHORT = {
    "base": "Base",
    "v1":   "V1\n(1-ep completion)",
    "v2":   "V2\n(3-ep Q&A)",
    "v3":   "V3\n(3-ep clean)",
}
MODEL_COLORS = {
    "base": "#4C72B0",   # blue
    "v1":   "#DD8452",   # orange
    "v2":   "#55A868",   # green
    "v3":   "#C44E52",   # red
}
MODEL_SYSTEM = {
    "base": BASE_SYSTEM_PROMPT,
    "v1":   TUNED_SYSTEM_PROMPT,
    "v2":   TUNED_SYSTEM_PROMPT,
    "v3":   TUNED_SYSTEM_PROMPT,
}

# ── Directories ────────────────────────────────────────────────────────────
CACHE_DIR   = Path("./comparison_cache_qwen3_versions")
FIGURES_DIR = Path("./comparison_figures_qwen3_versions")
CACHE_DIR.mkdir(exist_ok=True)
FIGURES_DIR.mkdir(exist_ok=True)

# ── Plot style ─────────────────────────────────────────────────────────────
plt.rcParams.update({
    'figure.dpi': 120,
    'figure.facecolor': 'white',
    'axes.facecolor': '#f8f8f8',
    'axes.grid': True,
    'grid.alpha': 0.4,
    'font.size': 11,
})

# ── Reference passages ─────────────────────────────────────────────────────
# Eight verbatim excerpts from Peterson's books, held out from training.
# Used to compute perplexity and TF-IDF similarity.
PETERSON_PASSAGES = [
    # From Maps of Meaning — the world as a forum for action
    "The world can be validly construed as a forum for action, or as a place of things. "
    "The former manner of interpretation — more primordial, and less clearly understood — "
    "finds its expression in the arts or humanities, in ritual, drama, literature, and myth. "
    "The world as forum for action is a place of value, a place where all things have meaning.",

    # From 12 Rules for Life — embodied courage
    "To stand up straight with your shoulders back is to accept the terrible responsibility "
    "of life, with eyes wide open. It means deciding to voluntarily transform the chaos of "
    "potential into the realities of habitable order. It means adopting the burden of "
    "self-conscious vulnerability, and accepting the end of the unconscious paradise of "
    "childhood, where finitude and mortality are only dimly comprehended.",

    # From Beyond Order — chaos as unexplored territory
    "Order is the place where the things you are currently doing are working out well "
    "for you. Chaos is the domain of ignorance itself. It's unexplored territory. Chaos "
    "is what extends, endlessly and without limit, beyond the boundaries of all states, "
    "all ideas, and all disciplines. It's the foreigner, the stranger, the member of "
    "another gang, the rustle in the bushes in the night-time.",

    # From We Who Wrestle with God — logos and truth
    "The divine spark in man is the logos — the word, the reason, the creative principle "
    "that gives order to the chaos of experience. To act in accordance with the logos is "
    "to speak the truth, to pursue what is meaningful rather than what is expedient, and "
    "to take on the burden of Being itself with courage and humility.",

    # From 12 Rules — compare yourself to who you were
    "Compare yourself to who you were yesterday, not to who someone else is today. "
    "You have a nature. You can play the game of life and improve. You can set a "
    "standard, even a minimal standard, and try to live up to it. You can improve "
    "incrementally, moving forward step by step. You can judge your life against "
    "what you know to be good, against what you should be.",

    # From Maps of Meaning — myth and the hero
    "The great myths and rituals of the past have been formulated in the language of "
    "the imagination. They say: act out the role of the hero; do not be the villain; "
    "do not be the tyrant. They say: update your maps of meaning when new information "
    "warrants it; admit your errors and change. They say: encounter the stranger and "
    "extract from that encounter what is valuable. Treat the stranger with respect.",

    # From Beyond Order — meaning as balance
    "Meaning is the ultimate balance between, on the one hand, the chaos of transformation "
    "and possibility and, on the other, the discipline of pristine order, whose purpose is "
    "to produce out of the attendant chaos a new order that will be even more productive "
    "and worthwhile than the old. Pursue what is meaningful, not what is expedient.",

    # From We Who Wrestle with God — suffering and transcendence
    "Suffering is not a mistake or an accident. It is the very ground of Being itself. "
    "To wrestle with God, as Jacob did, is to confront that suffering honestly, to take "
    "responsibility for it, and to find within it the possibility of transcendence. The "
    "hero does not flee from the dragon; he faces it and transforms the encounter.",
]

# ── Evaluation prompts ─────────────────────────────────────────────────────
EVAL_PROMPTS = [
    "What is the relationship between order and chaos in human experience?",
    "Why is personal responsibility the foundation of a meaningful life?",
    "How do ancient myths and stories reveal truths about human nature?",
    "What does it mean to pursue what is meaningful rather than what is expedient?",
    "How should a person confront suffering rather than flee from it?",
    "What is the significance of the hero archetype in understanding the human psyche?",
    "Why is telling the truth essential to a properly functioning life?",
    "What is the role of the divine or the sacred in organizing human society?",
    "How does the Jungian concept of the shadow relate to individual development?",
    "What does it mean to stand up straight with your shoulders back?",
]

# ── Peterson keyword dictionary ────────────────────────────────────────────
PETERSON_KEYWORDS = list(set([
    "chaos", "order", "logos", "being", "meaning", "meaningful", "meaningless",
    "transcendence", "transcendent", "archetype", "archetypal",
    "shadow", "anima", "animus", "unconscious", "consciousness", "psyche",
    "individuation", "projection",
    "responsibility", "suffering", "redemption", "courage", "virtue",
    "nihilism", "nihilistic", "expedient", "expedience", "tyranny", "tyrannical",
    "sovereignty", "heroic", "malevolent",
    "myth", "mythological", "hero", "dragon", "narrative", "story",
    "ritual", "sacrifice", "resurrection", "transformation",
    "divine", "sacred", "god", "biblical", "genesis", "logos", "spirit",
    "wrestle", "jacob", "adam", "eve", "serpent",
    "confront", "hierarchy", "dominance", "voluntarily", "catastrophe",
    "pathological", "resentment", "ideological", "totalitarian",
]))

# ── Results container ─────────────────────────────────────────────────────
# Populated by the four inference cells; consumed by the metrics cell.
results = {}

print(f"PyTorch  : {torch.__version__}")
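print(f"CUDA     : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Device queries raise if no GPU is present, so guard them.
    print(f"GPU      : {torch.cuda.get_device_name(0)}")
    print(f"VRAM     : {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB total")
else:
    print("GPU      : none detected")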
print(f"Cache    : {CACHE_DIR.resolve()}")
print(f"Figures  : {FIGURES_DIR.resolve()}")
print()
print("Cache status:")
for _k in MODEL_KEYS:
    _pkl = CACHE_DIR / f"{_k}_results.pkl"
    _status = "EXISTS (will skip inference)" if _pkl.exists() else "missing (will run inference)"
    print(f"  {MODEL_DISPLAY[_k]:<42}  {_status}")
print()
print(f"Reference passages  : {len(PETERSON_PASSAGES)}")
print(f"Evaluation prompts  : {len(EVAL_PROMPTS)}")
print(f"Keyword dictionary  : {len(PETERSON_KEYWORDS)} unique terms")

---
## Part 1: Helper Functions

All four models share the same evaluation pipeline — only the model weights differ.
Five helper functions are defined below:

| Function | Purpose |
|----------|---------|
| `generate_response()` | Qwen3 ChatML inference wrapper (strips `<think>` blocks) |
| `compute_perplexity()` | Token-level cross-entropy on held-out Peterson passages |
| `compute_text_stats()` | Word count, TTR, keyword density per response |
| `compute_tfidf_similarity()` | TF-IDF cosine vs reference passages |
| `_avg()` | Simple list average helper |
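
The `<think>` stripping step can be tried on its own. The sketch below applies the same regular expression used in `generate_response()` to a hand-written string; `strip_think` is a hypothetical wrapper name and the sample text is invented.

```python
import re

# re.DOTALL lets "." match newlines; the non-greedy ".*?" stops at the first
# closing tag so text between two separate blocks is preserved.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(text: str) -> str:
    """Remove every <think>...</think> block, including ones that span lines."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>\n\n</think>\n\nOrder and chaos are the two fundamental domains."
cleaned = strip_think(raw)
print(cleaned)
```

Without `re.DOTALL` the empty block in `raw` would survive, because its opening and closing tags sit on different lines.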

### Why the same evaluation for V1?

V1 was trained on a **passage-completion** task (assistant = continuation of a
passage fragment) rather than Q&A. At inference we apply the same Peterson persona
prompt as V2/V3 — even though V1 never saw that prompt during training — to keep
the evaluation conditions identical. The Q&A prompts are foreign to V1's training
distribution, so expect lower-quality, more passage-like responses.

### Critical pattern — `compute_text_stats()`

This function **must** append exactly one value per text in the input list, even
for empty strings. If any text is skipped, the output lists will be shorter than
`len(texts)`, causing index-misalignment crashes in the per-prompt bar charts.
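
A stripped-down version of this alignment rule is sketched below. It is an illustration only: `toy_text_stats` is a hypothetical function, and a whitespace split stands in for `nltk.word_tokenize` so the example has no dependencies.

```python
def toy_text_stats(texts, keywords):
    """Return one TTR and one keyword-density value per input text."""
    ttr_values, keyword_density = [], []
    for text in texts:
        # Simplified tokenizer: split on whitespace, trim punctuation, lowercase.
        words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
        words = [w for w in words if w.isalpha()]
        denom = max(len(words), 1)  # avoids ZeroDivisionError on empty text
        ttr_values.append(len(set(words)) / denom)
        keyword_density.append(sum(w in keywords for w in words) / denom)
    return {"ttr_values": ttr_values, "keyword_density": keyword_density}

texts = ["Chaos and order. Order and chaos.", "", "Meaning requires responsibility."]
stats = toy_text_stats(texts, keywords={"chaos", "order", "meaning", "responsibility"})

# Every output list stays aligned with the inputs, including the empty response.
assert all(len(v) == len(texts) for v in stats.values())
print(stats)
```

The empty second response contributes zeros instead of being skipped, so index `i` in every list still refers to prompt `i`.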

In [None]:
def generate_response(model, tokenizer, prompt: str,
                      system_prompt: str,
                      max_new_tokens: int = 300) -> str:
    '''
    Generate a single response from a Qwen3 (ChatML-format) model.

    This wrapper calls Qwen3's apply_chat_template() in two steps:
      1. Call with tokenize=False and enable_thinking=False to get a plain string.
      2. Pass that string to the tokenizer separately to get input_ids.

    Even with thinking disabled, the template sometimes adds empty
    <think>\n\n</think> artifacts, so we strip any such blocks with re.sub.
    '''
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            temperature=1.0,
            repetition_penalty=1.1,
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
    return response


def compute_perplexity(model, tokenizer, texts: list,
                       max_length: int = 256) -> list:
    '''
    Compute token-level perplexity for each text in the list.

    Perplexity = exp( average negative log-likelihood per token ).
    Lower value = model assigns higher probability to the text = more "in-domain".

    NOTE: We intentionally do NOT pass labels= to the model. Unsloth patches
    the forward pass to use a fused CE loss that allocates a ~1.5 GB gradient
    buffer for lm_head (vocab_size x hidden_dim) even under torch.no_grad(),
    causing OOM on 24 GB cards when the model is already using ~11 GB.
    Instead we obtain the logits and compute CE loss manually on CPU.
    '''
    import torch.nn.functional as F
    model.eval()
    perplexities = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(
                text,
                return_tensors="pt",
                max_length=max_length,
                truncation=True,
            ).to("cuda")
            out    = model(**enc)                           # no labels → no fused CE
            logits = out.logits.cpu().float()               # move to CPU to save VRAM
            ids    = enc["input_ids"].cpu()
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = ids[..., 1:].contiguous()
            loss = F.cross_entropy(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1),
            )
            perplexities.append(math.exp(loss.item()))
            del out, logits, shift_logits, shift_labels, ids, enc
    return perplexities


def compute_text_stats(texts: list) -> dict:
    '''
    Compute per-text statistics over a list of model responses.

    IMPORTANT: every text in the list must contribute exactly one entry to
    each output list — even empty strings — or per-prompt bar charts will
    crash with shape mismatches.
    '''
    stop_words = set(stopwords.words('english'))
    kw_set     = set(k.lower() for k in PETERSON_KEYWORDS)
    word_counts, sentence_counts, ttr_values = [], [], []
    keyword_density = []
    keyword_counts  = Counter()
    all_words       = []

    for text in texts:
        if text.strip():
            words = word_tokenize(text.lower())
            sents = sent_tokenize(text)
        else:
            words, sents = [], []   # empty response → all zeros; do NOT skip

        words_alpha = [w for w in words if w.isalpha()]
        word_counts.append(len(words_alpha))
        sentence_counts.append(len(sents))

        ttr = len(set(words_alpha)) / max(len(words_alpha), 1)
        ttr_values.append(ttr)

        kw_hits = [w for w in words_alpha if w in kw_set]
        keyword_density.append(len(kw_hits) / max(len(words_alpha), 1))
        keyword_counts.update(kw_hits)

        all_words.extend(w for w in words_alpha if w not in stop_words and len(w) > 2)

    return {
        "word_counts":     word_counts,
        "sentence_counts": sentence_counts,
        "ttr_values":      ttr_values,
        "keyword_density": keyword_density,
        "keyword_counts":  keyword_counts,
        "all_words":       all_words,
    }


def compute_tfidf_similarity(responses: list, references: list) -> list:
    '''
    Measure how similar each model response is to Peterson's actual writing
    using TF-IDF cosine similarity. Returns the maximum similarity across
    all reference passages for each response.
    '''
    if not responses or not any(r.strip() for r in responses):
        return [0.0] * len(responses)
    all_texts  = references + responses
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf      = vectorizer.fit_transform(all_texts)
    ref_vecs   = tfidf[:len(references)]
    resp_vecs  = tfidf[len(references):]
    similarities = []
    for i in range(resp_vecs.shape[0]):
        sims = cosine_similarity(resp_vecs[i], ref_vecs)[0]
        similarities.append(float(sims.max()))
    return similarities


def _avg(lst): return sum(lst) / len(lst) if lst else 0.0

print("Helper functions defined.")

---
## Phase 1: Qwen3-14B — Base Model

This phase evaluates the unmodified Qwen3-14B base model, with no Peterson fine-tuning.
It receives the generic `"You are a helpful assistant."` system prompt: giving the
Peterson persona prompt to a model that has not been fine-tuned would encourage
superficial mimicry rather than show the model's out-of-the-box performance on these topics.

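This phase establishes the **baseline**. Perplexity is computed on the raw passage text
with no system prompt, so any perplexity change in V1/V2/V3 is attributable to training
alone. The generation metrics also reflect the different system prompt, so those deltas
show the combined effect of fine-tuning and the persona prompt. Perplexity on held-out
Peterson passages measures how "surprised" the model is by his characteristic prose.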

In [None]:
from unsloth import FastLanguageModel

_key = "base"
_pkl = CACHE_DIR / f"{_key}_results.pkl"

if _pkl.exists():
    print(f"Cache found — skipping {MODEL_DISPLAY[_key]} inference.")
    print(f"  {_pkl}")
else:
    print(f"Loading {MODEL_DISPLAY[_key]} model ...")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = MODEL_PATHS[_key],
        dtype           = None,
        max_seq_length  = 2048,
        load_in_4bit    = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(_model)
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    print("\nComputing perplexity on reference passages ...")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    print("\nGenerating responses ...")
    _resps = []
    for _i, _prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{_i+1:02d}/{len(EVAL_PROMPTS)}] {_prompt[:70]}")
        _r = generate_response(_model, _tok, _prompt, MODEL_SYSTEM[_key])
        _resps.append(_r)
        print(f"         -> {_r[:100]}\n")

    with open(_pkl, "wb") as _f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, _f)
    print(f"\nSaved -> {_pkl}")

    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

with open(_pkl, "rb") as _f:
    results[_key] = pickle.load(_f)
print(f"\n{MODEL_DISPLAY[_key]} ready.")
print(f"  Responses    : {len(results[_key]['responses'])}")
print(f"  Avg PPL      : {_avg(results[_key]['perplexities']):.2f}")

---
## Phase 2: Qwen3-14B — V1 Fine-Tuned (Passage Completion)

**Training task**: The V1 adapter was trained with user = passage fragment, assistant =
passage continuation. This is a fundamentally different task from Q&A: the model learned
to predict what comes next in Peterson's prose, not to answer a question about his ideas.

**System prompt at inference**: We apply the Peterson persona prompt (same as V2/V3) to
keep the evaluation fair. V1 never saw this prompt during training, so it operates somewhat
outside its training distribution at inference time.

**Expected behaviour**: V1 responses may look more like passage continuations than structured
answers — longer sentences, possibly starting mid-thought, potentially regurgitating verbatim
passages. The Q&A prompts are foreign to its training objective.

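**LoRA**: r=16, alpha=16 (half the rank of V2/V3's r=32). Training ran for 1 epoch
(321 steps, 23.3 min, final loss 2.44). That loss is much higher than V2/V3's (about
1.51 to 1.53), but the figures are not directly comparable: V2/V3 train on a different
task for 3 epochs and take roughly 6× as many optimizer steps (1,887 and 1,827 vs 321).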
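
LoRA's trainable parameter count grows linearly with the rank `r`, which is why r=16 gives half the adapter size of r=32. The sketch below counts the parameters added to a single weight matrix. The 5120 × 5120 shape is an assumption chosen for illustration, and real layer shapes vary by module; `lora_params` is a hypothetical helper.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one (d_out x d_in) weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Assumed square projection for illustration only.
d_in = d_out = 5120
for r in (16, 32):
    print(f"r={r}: {lora_params(d_in, d_out, r):,} trainable params per adapted matrix")
```

Both configurations use alpha equal to r, so under the standard LoRA scaling of alpha / r the adapter update is scaled by the same factor in V1 and in V2/V3; only the rank differs.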

In [None]:
_key = "v1"
_pkl = CACHE_DIR / f"{_key}_results.pkl"

if _pkl.exists():
    print(f"Cache found — skipping {MODEL_DISPLAY[_key]} inference.")
    print(f"  {_pkl}")
else:
    print(f"Loading {MODEL_DISPLAY[_key]} model ...")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = MODEL_PATHS[_key],
        dtype           = None,
        max_seq_length  = 2048,
        load_in_4bit    = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(_model)
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    print("\nComputing perplexity on reference passages ...")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    print("\nGenerating responses ...")
    _resps = []
    for _i, _prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{_i+1:02d}/{len(EVAL_PROMPTS)}] {_prompt[:70]}")
        _r = generate_response(_model, _tok, _prompt, MODEL_SYSTEM[_key])
        _resps.append(_r)
        print(f"         -> {_r[:100]}\n")

    with open(_pkl, "wb") as _f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, _f)
    print(f"\nSaved -> {_pkl}")

    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

with open(_pkl, "rb") as _f:
    results[_key] = pickle.load(_f)
print(f"\n{MODEL_DISPLAY[_key]} ready.")
print(f"  Responses    : {len(results[_key]['responses'])}")
print(f"  Avg PPL      : {_avg(results[_key]['perplexities']):.2f}")

---
## Phase 3: Qwen3-14B — V2 Fine-Tuned (Synthetic Q&A, 5,029 pairs)

**Training task**: V2 uses synthetic Q&A pairs generated by Claude Haiku — the training
task now *matches* the inference task. Each example is (system: Peterson persona, user:
question, assistant: verbatim passage that answers it). This is the root fix for the
passage-regurgitation problem in V1.

**Dataset**: 5,029 Q&A pairs generated from ~2,500 passage chunks across all 4 books.
Front-matter pages were **included** — some training examples may contain publisher/
copyright text, ISBN numbers, or other non-content material.

**LoRA**: r=32, alpha=32 — double the capacity of V1. 3 epochs (1,887 steps, 144.0 min,
loss 1.5058, peak VRAM 15.5 GB).

**Expected improvement over V1**: Better TF-IDF similarity and keyword density because
the model now produces structured answers instead of passage continuations. However,
front-matter contamination may produce occasional noise responses.
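
The three-part example structure described above might look like the following in chat-message form. This is a sketch under assumptions: the on-disk schema of the V2 dataset is not shown in this notebook, `make_qa_example` is a hypothetical helper, and the strings are placeholders.

```python
def make_qa_example(system_prompt: str, question: str, passage: str) -> list:
    """Build one chat-format example: system persona, user question, assistant passage."""
    return [
        {"role": "system",    "content": system_prompt},
        {"role": "user",      "content": question},
        {"role": "assistant", "content": passage},
    ]

# Placeholder strings; the real persona prompt lives in peterson_config.json.
example = make_qa_example(
    system_prompt="<Peterson persona prompt>",
    question="What is the relationship between order and chaos?",
    passage="<verbatim passage from the source text>",
)
for msg in example:
    print(f"{msg['role']:>9}: {msg['content']}")
```

Because the system and user turns have the same shape as the messages built in `generate_response()`, the training and inference inputs line up, which is the point of the V1 to V2 task change.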

In [None]:
_key = "v2"
_pkl = CACHE_DIR / f"{_key}_results.pkl"

if _pkl.exists():
    print(f"Cache found — skipping {MODEL_DISPLAY[_key]} inference.")
    print(f"  {_pkl}")
else:
    print(f"Loading {MODEL_DISPLAY[_key]} model ...")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = MODEL_PATHS[_key],
        dtype           = None,
        max_seq_length  = 2048,
        load_in_4bit    = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(_model)
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    print("\nComputing perplexity on reference passages ...")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    print("\nGenerating responses ...")
    _resps = []
    for _i, _prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{_i+1:02d}/{len(EVAL_PROMPTS)}] {_prompt[:70]}")
        _r = generate_response(_model, _tok, _prompt, MODEL_SYSTEM[_key])
        _resps.append(_r)
        print(f"         -> {_r[:100]}\n")

    with open(_pkl, "wb") as _f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, _f)
    print(f"\nSaved -> {_pkl}")

    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

with open(_pkl, "rb") as _f:
    results[_key] = pickle.load(_f)
print(f"\n{MODEL_DISPLAY[_key]} ready.")
print(f"  Responses    : {len(results[_key]['responses'])}")
print(f"  Avg PPL      : {_avg(results[_key]['perplexities']):.2f}")

---
## Phase 4: Qwen3-14B — V3 Fine-Tuned (Synthetic Q&A, 4,867 pairs, clean)

**Training task**: Identical to V2 — synthetic Q&A pairs, same architecture, same
hyperparameters (r=32, alpha=32, 3 epochs, 2e-4 LR). The *only* difference is the
training data.

**Dataset**: 4,867 Q&A pairs — 162 fewer than V2 — generated from passages that passed
the DataPrep front-matter heuristic. The biggest reduction was in *We Who Wrestle with
God* (−103 pairs), indicating that book had the most front-matter contamination.

**Training stats**: 1,827 steps, 136.9 min, loss 1.5258, peak VRAM 15.3 GB.
Loss is +0.02 vs V2 — negligible, within noise. Fewer training pairs means slightly
less gradient signal but the difference is immaterial.

**V3 vs V2**: This is the cleanest A/B test in the series — identical everything except
data cleanliness. If front-matter removal helps, we expect V3 to show slightly better
metrics. If the difference is negligible, it confirms that data purity at this scale
has minimal impact compared to the task formulation change (V1→V2).

**Known remaining issue**: V3 still contains back-matter content (index pages, figure
lists). The front-matter fix alone is insufficient; a back-matter removal step is the
next highest-leverage data improvement.
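
One possible starting point for a back-matter filter is to flag text that contains several page ranges, as index entries do. The sketch below is a hypothetical heuristic, not the DataPrep notebook's method: `looks_like_index` and its `min_ranges` threshold are assumptions, and the index-like sample is taken from the contaminated response quoted in the notes.

```python
import re

# Page references such as "143–74" or "160-68": digits, a hyphen or en dash, digits.
PAGE_RANGE = re.compile(r"\b\d{1,4}\s*[-\u2013]\s*\d{1,4}\b")

def looks_like_index(text: str, min_ranges: int = 2) -> bool:
    """Flag text containing several page ranges, a common signature of index pages."""
    return len(PAGE_RANGE.findall(text)) >= min_ranges

index_like = "143–74 and Genesis story, 160–68"
prose = "Order is the place where the things you are currently doing are working out well."
print(looks_like_index(index_like), looks_like_index(prose))
```

A filter like this would also flag prose that mentions several year ranges, so the threshold would need tuning against real chunks before it is used to drop training data.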

In [None]:
_key = "v3"
_pkl = CACHE_DIR / f"{_key}_results.pkl"

if _pkl.exists():
    print(f"Cache found — skipping {MODEL_DISPLAY[_key]} inference.")
    print(f"  {_pkl}")
else:
    print(f"Loading {MODEL_DISPLAY[_key]} model ...")
    _model, _tok = FastLanguageModel.from_pretrained(
        model_name      = MODEL_PATHS[_key],
        dtype           = None,
        max_seq_length  = 2048,
        load_in_4bit    = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(_model)
    print(f"  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

    print("\nComputing perplexity on reference passages ...")
    _ppls = compute_perplexity(_model, _tok, PETERSON_PASSAGES)
    print(f"  Per-passage: {[round(p,1) for p in _ppls]}")
    print(f"  Average    : {_avg(_ppls):.2f}")

    print("\nGenerating responses ...")
    _resps = []
    for _i, _prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{_i+1:02d}/{len(EVAL_PROMPTS)}] {_prompt[:70]}")
        _r = generate_response(_model, _tok, _prompt, MODEL_SYSTEM[_key])
        _resps.append(_r)
        print(f"         -> {_r[:100]}\n")

    with open(_pkl, "wb") as _f:
        pickle.dump({"responses": _resps, "perplexities": _ppls}, _f)
    print(f"\nSaved -> {_pkl}")

    del _model, _tok, _resps, _ppls
    gc.collect()
    torch.cuda.empty_cache()
    print(f"Model unloaded.  VRAM reserved: {torch.cuda.memory_reserved()/1e9:.1f} GB")

with open(_pkl, "rb") as _f:
    results[_key] = pickle.load(_f)
print(f"\n{MODEL_DISPLAY[_key]} ready.")
print(f"  Responses    : {len(results[_key]['responses'])}")
print(f"  Avg PPL      : {_avg(results[_key]['perplexities']):.2f}")

---
## Part 3: Metrics Aggregation

All four pkl files are now loaded. We compute TF-IDF similarity and text statistics
(CPU-only — no GPU required) and print a formatted summary table comparing all
metrics across all four versions.

**Delta columns** pair a direction arrow (`▲` increase, `▼` decrease) with `✓` or `✗`
to flag whether the change moved in the expected direction. For perplexity (lower is
better), a decrease is `✓`; for all other metrics (higher is better), an increase is `✓`.

In [None]:
# ── Extract responses and perplexities from results dict ──────────────────
all_responses    = {k: results[k]["responses"]    for k in MODEL_KEYS}
all_perplexities = {k: results[k]["perplexities"] for k in MODEL_KEYS}

# ── TF-IDF cosine similarity ───────────────────────────────────────────────
print("Computing TF-IDF similarities ...")
all_similarities = {
    k: compute_tfidf_similarity(all_responses[k], PETERSON_PASSAGES)
    for k in MODEL_KEYS
}

# ── Text statistics ────────────────────────────────────────────────────────
print("Computing text statistics ...")
all_stats = {k: compute_text_stats(all_responses[k]) for k in MODEL_KEYS}

# ── Averages ───────────────────────────────────────────────────────────────
avg_ppl = {k: _avg(all_perplexities[k])              for k in MODEL_KEYS}
avg_sim = {k: _avg(all_similarities[k])               for k in MODEL_KEYS}
avg_kd  = {k: _avg(all_stats[k]["keyword_density"])   for k in MODEL_KEYS}
avg_ttr = {k: _avg(all_stats[k]["ttr_values"])        for k in MODEL_KEYS}
avg_len = {k: _avg(all_stats[k]["word_counts"])       for k in MODEL_KEYS}


def pct_change(bv, tv, higher_better=True):
    if abs(bv) < 1e-9:
        return "N/A"
    pct   = 100 * (tv - bv) / bv
    is_up = pct > 0
    ok    = (is_up == higher_better)
    return f"{pct:+.1f}% {'▲' if is_up else '▼'} {'✓' if ok else '✗'}"


# ── Summary table ──────────────────────────────────────────────────────────
print()
print(f"{'Metric':<30} {'Base':>12} {'V1':>14} {'V2':>14} {'V3':>14}")
print("─" * 88)
print(f"{'Avg Perplexity (↓)':<30} {avg_ppl['base']:>12.2f} {avg_ppl['v1']:>14.2f} {avg_ppl['v2']:>14.2f} {avg_ppl['v3']:>14.2f}")
print(f"{'Avg TF-IDF Sim (↑)':<30} {avg_sim['base']:>12.4f} {avg_sim['v1']:>14.4f} {avg_sim['v2']:>14.4f} {avg_sim['v3']:>14.4f}")
print(f"{'Avg Keyword Density (↑)':<30} {100*avg_kd['base']:>11.2f}% {100*avg_kd['v1']:>13.2f}% {100*avg_kd['v2']:>13.2f}% {100*avg_kd['v3']:>13.2f}%")
print(f"{'Avg TTR (↑)':<30} {avg_ttr['base']:>12.4f} {avg_ttr['v1']:>14.4f} {avg_ttr['v2']:>14.4f} {avg_ttr['v3']:>14.4f}")
print(f"{'Avg Response Length (words)':<30} {avg_len['base']:>11.1f}  {avg_len['v1']:>13.1f}  {avg_len['v2']:>13.1f}  {avg_len['v3']:>13.1f}")
print("─" * 88)
print()
# Delta vs base
print(f"{'Delta vs Base:':<30} {'':>12} {'V1 Δ':>14} {'V2 Δ':>14} {'V3 Δ':>14}")
print("─" * 88)
print(f"{'Perplexity (↓ is ✓)':<30} {'—':>12} {pct_change(avg_ppl['base'], avg_ppl['v1'], False):>14} {pct_change(avg_ppl['base'], avg_ppl['v2'], False):>14} {pct_change(avg_ppl['base'], avg_ppl['v3'], False):>14}")
print(f"{'TF-IDF Sim (↑ is ✓)':<30} {'—':>12} {pct_change(avg_sim['base'], avg_sim['v1']):>14} {pct_change(avg_sim['base'], avg_sim['v2']):>14} {pct_change(avg_sim['base'], avg_sim['v3']):>14}")
print(f"{'Keyword Density (↑ is ✓)':<30} {'—':>12} {pct_change(avg_kd['base'], avg_kd['v1']):>14} {pct_change(avg_kd['base'], avg_kd['v2']):>14} {pct_change(avg_kd['base'], avg_kd['v3']):>14}")
print(f"{'TTR (↑ is ✓)':<30} {'—':>12} {pct_change(avg_ttr['base'], avg_ttr['v1']):>14} {pct_change(avg_ttr['base'], avg_ttr['v2']):>14} {pct_change(avg_ttr['base'], avg_ttr['v3']):>14}")
print("─" * 88)

---
## Figure 1: Perplexity on Peterson Reference Passages

**Left panel** — Per-passage perplexity for all four model versions. Each cluster of
four bars corresponds to one held-out reference passage. Fine-tuned models should sit
lower in their respective clusters, indicating they assign higher probability to Peterson's
prose — i.e., they have "learned" his writing.

**Right panel** — Average perplexity across all eight passages. This is the most direct
measure of domain adaptation. The percentage annotation shows change vs the base model.

Note: perplexity is only comparable between models that tokenize text the same way.
All four models share the same Qwen3-14B base and the same tokenizer, so the numbers
are directly comparable.
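
As a reminder of what the bars measure, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below is a minimal illustration, assuming you already have per-token natural-log probabilities; `perplexity_from_logprobs` is a hypothetical helper, not the function this notebook uses to score passages.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to each of four tokens is exactly as
# uncertain as a uniform choice among four options, so perplexity is about 4.
uniform = [math.log(0.25)] * 4
print(perplexity_from_logprobs(uniform))
```

A lower value means the model assigned higher probability to the reference text, which is why the fine-tuned bars are expected to sit below the base bar.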

In [None]:
x     = np.arange(len(PETERSON_PASSAGES))
width = 0.18
n     = len(MODEL_KEYS)
colors = [MODEL_COLORS[k] for k in MODEL_KEYS]
labels = [MODEL_SHORT[k]  for k in MODEL_KEYS]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6),
                                gridspec_kw={'width_ratios': [3, 1]})
fig.suptitle("Perplexity on Held-Out Peterson Passages", fontsize=15, fontweight='bold')

for i, (key, color, label) in enumerate(zip(MODEL_KEYS, colors, labels)):
    offset = (i - (n - 1) / 2) * width
    ax1.bar(x + offset, all_perplexities[key], width,
            label=label, color=color, alpha=0.85, edgecolor='white')
ax1.set_xlabel("Reference Passage")
ax1.set_ylabel("Perplexity  (lower = more domain-adapted)")
ax1.set_title("Per-Passage Perplexity")
ax1.set_xticks(x)
ax1.set_xticklabels([f"P{i+1}" for i in range(len(PETERSON_PASSAGES))], fontsize=10)
ax1.legend(fontsize=9, loc='upper right')
ax1.grid(axis='y', alpha=0.4)

avgs = [avg_ppl[k] for k in MODEL_KEYS]
bars = ax2.bar(range(n), avgs, color=colors, alpha=0.85, edgecolor='white', width=0.6)
base_ppl = avg_ppl["base"]
for bar, v, k in zip(bars, avgs, MODEL_KEYS):
    ann = f"{v:.1f}"
    if k != "base":
        ann += f"\n({100*(v-base_ppl)/base_ppl:+.1f}%)"
    ax2.text(bar.get_x() + bar.get_width() / 2, v + 0.2,
             ann, ha='center', va='bottom', fontsize=9, fontweight='bold')
ax2.set_xticks(range(n))
ax2.set_xticklabels(labels, fontsize=9)
ax2.set_ylabel("Average Perplexity")
ax2.set_title("Average Perplexity\n(% change vs Base)")
ax2.grid(axis='y', alpha=0.4)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "01_perplexity.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 01_perplexity.png")

---
## Figure 2: TF-IDF Cosine Similarity to Peterson's Writing

For each model response to each evaluation prompt, we compute the cosine similarity
between its TF-IDF vector and that of each of the eight reference passages, then take
the **maximum**, so a response only needs to resemble one passage to score well.

**Higher = the model's word choices are more similar to how Peterson actually writes.**

We expect V2/V3 to improve substantially over V1 on this metric because Q&A training
produces structured responses with Peterson vocabulary, whereas V1's passage continuations
may have different vocabulary distributions.
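
The max-over-passages scoring described above can be sketched without any third-party dependencies. This is an illustration only: `max_tfidf_similarity`, its regex tokenizer, and the choice to fit the vocabulary on the response plus the passages are all assumptions, and the notebook's own `compute_tfidf_similarity` may differ in each of those details.

```python
import math
import re
from collections import Counter


def _tokens(text):
    """Lowercase word tokens (a simple regex stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def max_tfidf_similarity(response, passages):
    """Highest TF-IDF cosine similarity between one response and any passage."""
    docs = [_tokens(response)] + [_tokens(p) for p in passages]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(doc):
        tf = Counter(doc)
        return {t: count * idf[t] for t, count in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    resp_vec = vec(docs[0])
    return max((cosine(resp_vec, vec(d)) for d in docs[1:]), default=0.0)


passages = ["order and chaos shape meaning", "the hero confronts the dragon"]
print(max_tfidf_similarity("chaos and the hero", passages))
```

Because only the best-matching passage counts, a response that echoes a single passage closely can outscore one that overlaps modestly with all eight.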

In [None]:
x     = np.arange(len(EVAL_PROMPTS))
width = 0.18

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6),
                                gridspec_kw={'width_ratios': [3, 1]})
fig.suptitle("TF-IDF Cosine Similarity to Peterson's Writing",
             fontsize=15, fontweight='bold')

for i, (key, color, label) in enumerate(zip(MODEL_KEYS, colors, labels)):
    offset = (i - (n - 1) / 2) * width
    ax1.bar(x + offset, all_similarities[key], width,
            label=label, color=color, alpha=0.85, edgecolor='white')
ax1.set_xlabel("Evaluation Prompt")
ax1.set_ylabel("TF-IDF Cosine Similarity  (higher = more Peterson-like)")
ax1.set_title("Per-Prompt TF-IDF Similarity")
ax1.set_xticks(x)
ax1.set_xticklabels([f"Q{i+1}" for i in range(len(EVAL_PROMPTS))], fontsize=10)
ax1.legend(fontsize=9, loc='upper right')
ax1.grid(axis='y', alpha=0.4)

avgs = [avg_sim[k] for k in MODEL_KEYS]
bars = ax2.bar(range(n), avgs, color=colors, alpha=0.85, edgecolor='white', width=0.6)
for bar, v in zip(bars, avgs):
    ax2.text(bar.get_x() + bar.get_width() / 2, v + 0.001,
             f"{v:.4f}", ha='center', va='bottom', fontsize=9, fontweight='bold')
ax2.set_xticks(range(n))
ax2.set_xticklabels(labels, fontsize=9)
ax2.set_ylabel("Average TF-IDF Similarity")
ax2.set_title("Average TF-IDF Similarity")
ax2.grid(axis='y', alpha=0.4)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "02_tfidf_similarity.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 02_tfidf_similarity.png")

---
## Figure 3: Peterson Keyword Density

Keyword density = (Peterson-signature words in response) ÷ (total words).

This is a more *targeted* measure than TF-IDF: we are specifically asking "does the
model spontaneously use terms like chaos, logos, archetype, hierarchy, sovereignty,
etc.?" High density suggests the fine-tuning has wired those concepts into the model's
first-choice vocabulary.

For Q&A-trained models (V2/V3), we expect higher density because the model has learned
to answer questions *using* Peterson's conceptual vocabulary. V1 (passage completion)
may score lower if it primarily outputs narrative prose that relies on different words.
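
The density formula above amounts to a few lines of code. The sketch below is a minimal version, assuming a regex tokenizer and a three-word keyword subset chosen purely for illustration; `keyword_density` here is a hypothetical stand-in, not the implementation inside `compute_text_stats`.

```python
import re


def keyword_density(text, keywords):
    """Fraction of alphabetic tokens in `text` that appear in `keywords`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # an empty response has no density to speak of
    keyword_set = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in keyword_set)
    return hits / len(words)


# Illustrative subset only; the notebook uses the full PETERSON_KEYWORDS list.
sample_keywords = ["chaos", "order", "archetype"]
print(keyword_density("Chaos and order define meaning", sample_keywords))  # 2 of 5 words match: 0.4
```

Note that the denominator is the total word count, so a long response needs proportionally more keyword hits to reach the same density as a short one.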

In [None]:
x     = np.arange(len(EVAL_PROMPTS))
width = 0.18

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6),
                                gridspec_kw={'width_ratios': [3, 1]})
fig.suptitle("Peterson Keyword Density per Prompt", fontsize=15, fontweight='bold')

for i, (key, color, label) in enumerate(zip(MODEL_KEYS, colors, labels)):
    offset = (i - (n - 1) / 2) * width
    ax1.bar(x + offset,
            [100 * v for v in all_stats[key]["keyword_density"]],
            width, label=label, color=color, alpha=0.85, edgecolor='white')
ax1.set_xlabel("Evaluation Prompt")
ax1.set_ylabel("Keyword Density  (%)")
ax1.set_title("Per-Prompt Keyword Density")
ax1.set_xticks(x)
ax1.set_xticklabels([f"Q{i+1}" for i in range(len(EVAL_PROMPTS))], fontsize=10)
ax1.legend(fontsize=9, loc='upper right')
ax1.grid(axis='y', alpha=0.4)

avgs = [100 * avg_kd[k] for k in MODEL_KEYS]
bars = ax2.bar(range(n), avgs, color=colors, alpha=0.85, edgecolor='white', width=0.6)
for bar, v in zip(bars, avgs):
    ax2.text(bar.get_x() + bar.get_width() / 2, v + 0.05,
             f"{v:.2f}%", ha='center', va='bottom', fontsize=9, fontweight='bold')
ax2.set_xticks(range(n))
ax2.set_xticklabels(labels, fontsize=9)
ax2.set_ylabel("Average Keyword Density  (%)")
ax2.set_title("Average Keyword Density")
ax2.grid(axis='y', alpha=0.4)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "03_keyword_density.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 03_keyword_density.png")

---
## Figure 4: Response Characteristics

Three-panel chart showing per-model averages for:

- **Word count** — how verbose is each version?
- **Sentence count** — structural complexity of responses
- **Type-Token Ratio (TTR)** — vocabulary richness (unique/total words).
  Higher TTR = more varied word choice. Peterson is known for elaborate, varied
  prose, so higher TTR suggests stylistic alignment. TTR tends to fall as responses
  get longer, so compare it alongside word count.

Note that V1's passage-completion training may produce longer responses (full passage
continuations) while V2/V3 may produce more concise, structured Q&A answers. This
difference in response style is a direct consequence of the training task change.
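
Since word count and TTR are shown side by side, it helps to see how strongly TTR depends on length. The sketch below assumes a simple regex tokenizer; `type_token_ratio` is a hypothetical helper and may not match the tokenization used by `compute_text_stats`.

```python
import re


def type_token_ratio(text):
    """Unique words divided by total words; 0.0 for an empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


short = "the cat and the dog"
print(type_token_ratio(short))                  # 4 unique / 5 total = 0.8
print(type_token_ratio(" ".join([short] * 2)))  # 4 unique / 10 total = 0.4
```

Doubling the text adds no new word types, so the ratio halves. A version that writes longer responses can therefore show a lower TTR without using a poorer vocabulary.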

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Response Characteristics — All 4 Qwen3-14B Versions",
             fontsize=14, fontweight='bold')

stat_keys    = ["word_counts",     "sentence_counts",  "ttr_values"]
stat_titles  = ["Avg Word Count",  "Avg Sentence Count","Avg Type-Token Ratio"]
stat_ylabels = ["Words",           "Sentences",         "TTR  (unique/total words)"]

for ax, stat_key, title, ylabel in zip(axes, stat_keys, stat_titles, stat_ylabels):
    avgs = [_avg(all_stats[k][stat_key]) for k in MODEL_KEYS]
    bars = ax.bar(range(n), avgs, color=colors, alpha=0.85,
                  edgecolor='white', width=0.6)
    for bar, v in zip(bars, avgs):
        fmt = f"{v:.3f}" if "ttr" in stat_key else f"{v:.1f}"
        ax.text(bar.get_x() + bar.get_width() / 2, v * 1.02,
                fmt, ha='center', va='bottom', fontsize=9, fontweight='bold')
    ax.set_xticks(range(n))
    ax.set_xticklabels(labels, fontsize=8)
    ax.set_title(title, fontsize=11, fontweight='bold')
    ax.set_ylabel(ylabel)
    ax.grid(axis='y', alpha=0.4)

legend_patches = [
    mpatches.Patch(color=MODEL_COLORS[k], label=MODEL_DISPLAY[k])
    for k in MODEL_KEYS
]
fig.legend(handles=legend_patches, loc='lower center', ncol=4,
           fontsize=9, bbox_to_anchor=(0.5, -0.08))

plt.tight_layout()
plt.savefig(FIGURES_DIR / "04_response_characteristics.png",
            bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 04_response_characteristics.png")

---
## Figure 5: Word Clouds

Word clouds show the most frequent *content* words (stop words removed) in each
model's aggregate responses. Peterson's vocabulary centres on a distinctive cluster
— order, chaos, meaning, responsibility, hero, archetype — so we expect to see those
words dominate the fine-tuned clouds.

A visual comparison of all four clouds reveals:
- Whether the base model spontaneously uses Peterson-specific vocabulary
- Whether V1's passage-completion training produced different word distributions than
  V2/V3's Q&A training
- Whether V2 and V3 clouds look nearly identical (expected given similar training loss)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle("Word Clouds — Content Word Frequency per Model Version",
             fontsize=15, fontweight='bold')

for ax, key in zip(axes.flat, MODEL_KEYS):
    words = all_stats[key]["all_words"]
    if words:
        wc = WordCloud(
            width=600, height=400,
            background_color='white',
            colormap='viridis',
            max_words=80,
            collocations=False,
            stopwords=STOPWORDS,
        ).generate(' '.join(words))
        ax.imshow(wc, interpolation='bilinear')
    else:
        ax.text(0.5, 0.5, "(no content words)", ha='center', va='center',
                transform=ax.transAxes, fontsize=14, color='grey')
    ax.set_title(MODEL_DISPLAY[key], fontsize=12, fontweight='bold',
                 color=MODEL_COLORS[key])
    ax.axis('off')

plt.tight_layout()
plt.savefig(FIGURES_DIR / "05_wordclouds.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 05_wordclouds.png")

---
## Figure 6: Peterson Keyword Heatmap (per prompt × per keyword)

Each cell shows the raw count of a specific Peterson keyword in the response to a
specific prompt. The top-20 most-used keywords (pooled across all four models) are
shown on the x-axis; the 10 prompts on the y-axis.

Comparing the four heatmaps side by side reveals:
- Which keywords the fine-tuned models spontaneously use more
- Whether V1's passage-completion training concentrates on different keywords than V2/V3
- Whether V2 and V3 have similar keyword distributions (expected)
- Any anomalous rows where a model produces the wrong type of content (e.g., V3's
  Q2 index contamination would show as a row with no keyword hits)

In [None]:
kw_set = set(k.lower() for k in PETERSON_KEYWORDS)


def per_prompt_kw_matrix(responses, keywords):
    '''Return (n_prompts x n_keywords) integer matrix of keyword hit counts.'''
    mat = []
    for resp in responses:
        words = word_tokenize(resp.lower()) if resp.strip() else []
        words_alpha = [w for w in words if w.isalpha()]
        mat.append([words_alpha.count(kw) for kw in keywords])
    return np.array(mat)


all_kw_counter = Counter()
for key in MODEL_KEYS:
    for resp in all_responses[key]:
        for w in word_tokenize(resp.lower()):
            if w in kw_set:
                all_kw_counter[w] += 1
top20 = [kw for kw, _ in all_kw_counter.most_common(20)]

if not top20:
    print("No Peterson keywords found — skipping heatmap.")
else:
    matrices = {k: per_prompt_kw_matrix(all_responses[k], top20) for k in MODEL_KEYS}
    vmax = max(m.max() for m in matrices.values() if m.size > 0) or 1

    p_labels = [f"Q{i+1}: {EVAL_PROMPTS[i][:28]}..." for i in range(len(EVAL_PROMPTS))]

    fig, axes = plt.subplots(2, 2, figsize=(18, 14), sharey=True)
    fig.suptitle("Peterson Keyword Usage Heatmap — All 4 Model Versions",
                 fontsize=15, fontweight='bold')

    for ax, key in zip(axes.flat, MODEL_KEYS):
        mat = matrices[key]
        im = ax.imshow(mat, aspect='auto', cmap='YlOrRd', vmin=0, vmax=vmax)
        ax.set_xticks(range(len(top20)))
        ax.set_xticklabels(top20, rotation=45, ha='right', fontsize=8)
        ax.set_yticks(range(len(p_labels)))
        ax.set_yticklabels(p_labels, fontsize=8)
        ax.set_title(MODEL_DISPLAY[key], fontsize=11, fontweight='bold',
                     color=MODEL_COLORS[key])
        ax.set_xlabel("Peterson Keyword")
        for i in range(mat.shape[0]):
            for j in range(mat.shape[1]):
                if mat[i, j] > 0:
                    ax.text(j, i, str(mat[i, j]), ha='center', va='center',
                            fontsize=7,
                            color='white' if mat[i, j] > vmax * 0.6 else 'black')
        plt.colorbar(im, ax=ax, label="Keyword count")

    plt.tight_layout()
    plt.savefig(FIGURES_DIR / "06_keyword_heatmap.png", bbox_inches='tight', dpi=150)
    plt.show()
    print("Saved: 06_keyword_heatmap.png")

---
## Figure 7: Radar / Spider Chart — Overall "Peterson-Likeness"

The radar chart summarises all five metrics simultaneously for all four model versions
on a single normalised scale [0.1, 1.0]. All metrics are oriented so that
**larger = more Peterson-like**:

| Spoke | Raw metric | Orientation |
|-------|-----------|-------------|
| Perplexity (inverted) | Lower PPL → higher score | inverted |
| TF-IDF Similarity | Higher cos-sim → higher score | normal |
| Keyword Density | Higher % → higher score | normal |
| Vocabulary Richness (TTR) | Higher TTR → higher score | normal |
| Response Length | Longer responses → higher score | normal |

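Normalisation uses min–max across all four models simultaneously, so the chart shows
*relative* differences rather than absolute magnitudes: on every spoke the lowest-scoring
model lands at 0.1 and the highest at 1.0 (unless all four values tie), however small
the raw gap. A larger polygon means the model ranks higher across the five axes, not
that its scores are more consistent. Response length is informational only, so its
spoke shows verbosity rather than quality.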
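
To see why the radar shows relative rather than absolute differences, the sketch below applies the same kind of min-max scaling to four made-up values that differ only in the third decimal place. `minmax_scaled` is a hypothetical name that mirrors the scaling described above, and the input numbers are invented for illustration, not results from this notebook.

```python
def minmax_scaled(values, lo_out=0.1, hi_out=1.0):
    """Min-max scale `values` into [lo_out, hi_out]; ties map to 0.5."""
    lo, hi = min(values), max(values)
    if abs(hi - lo) < 1e-9:
        return [0.5] * len(values)
    span = hi_out - lo_out
    return [lo_out + span * (v - lo) / (hi - lo) for v in values]


# Invented values with a total spread of only 0.003.
nearly_equal = [0.610, 0.612, 0.611, 0.613]
print([round(s, 2) for s in minmax_scaled(nearly_equal)])
```

The smallest value maps to 0.1 and the largest to 1.0, so a spread of 0.003 fills the whole spoke. Check the summary table before reading a large gap on the radar as a large raw difference.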

In [None]:
def norm4(values, higher_better=True):
    '''Normalise four values to [0.1, 1.0] with optional inversion.'''
    lo, hi = min(values), max(values)
    if abs(hi - lo) < 1e-9:
        return [0.5] * len(values)
    normed = [(v - lo) / (hi - lo) for v in values]
    if not higher_better:
        normed = [1.0 - v for v in normed]
    return [0.1 + 0.9 * v for v in normed]


radar_metrics = [
    ("Perplexity\n(inverted)",
     norm4([avg_ppl[k] for k in MODEL_KEYS], higher_better=False)),
    ("TF-IDF\nSimilarity",
     norm4([avg_sim[k] for k in MODEL_KEYS], higher_better=True)),
    ("Keyword\nDensity",
     norm4([avg_kd[k]  for k in MODEL_KEYS], higher_better=True)),
    ("Vocabulary\nRichness (TTR)",
     norm4([avg_ttr[k] for k in MODEL_KEYS], higher_better=True)),
    ("Response\nLength",
     norm4([avg_len[k] for k in MODEL_KEYS], higher_better=True)),
]

spoke_labels = [m[0] for m in radar_metrics] + [radar_metrics[0][0]]
angles       = np.linspace(0, 2 * np.pi, len(spoke_labels), endpoint=True)

fig, ax = plt.subplots(figsize=(9, 9), subplot_kw=dict(polar=True))
fig.suptitle(
    "Radar Summary: Peterson-Likeness Across All Metrics\n(Qwen3-14B: Base vs V1 vs V2 vs V3)",
    fontsize=14, fontweight='bold', y=1.02)

for i, key in enumerate(MODEL_KEYS):
    values = [m[1][i] for m in radar_metrics] + [radar_metrics[0][1][i]]
    ax.plot(angles, values, color=MODEL_COLORS[key], lw=2.5,
            label=MODEL_DISPLAY[key])
    ax.fill(angles, values, color=MODEL_COLORS[key], alpha=0.08)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(spoke_labels[:-1], size=11)
ax.set_yticks([0.25, 0.5, 0.75, 1.0])
ax.set_yticklabels(['0.25', '0.5', '0.75', '1.0'], size=7, color='grey')
ax.set_ylim(0, 1)
ax.legend(loc='upper right', bbox_to_anchor=(1.45, 1.15), fontsize=10)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "07_radar_summary.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 07_radar_summary.png")

---
## Part 5: Version Progression Analysis

This section is unique to this notebook. Instead of treating the four models as
independent groups (like the bar charts above), we examine them as a **progression**:
Base → V1 → V2 → V3 — each step representing a deliberate training intervention.

**Line charts**: one panel per metric (perplexity, TF-IDF similarity, keyword density),
with V1 → V2 → V3 on the x-axis. Each metric is min–max normalised across all four
models so that 1 is the best score, and the base model is drawn as a horizontal dashed
reference line.

**Delta table**: metric × version transition (Base→V1, V1→V2, V2→V3, Base→V3) with
directional arrows and expected-direction checks. This makes it easy to answer the
three research questions:
1. **V1→V2**: did the switch to Q&A training help?
2. **V2→V3**: did cleaner data produce measurably different results?
3. **Base→V1, Base→V3**: how much did fine-tuning help overall?

In [None]:
# ── Normalise all metrics to [0, 1] for the line chart ────────────────────
def _norm01(values, higher_better=True):
    lo, hi = min(values), max(values)
    if abs(hi - lo) < 1e-9:
        return [0.5] * len(values)
    normed = [(v - lo) / (hi - lo) for v in values]
    if not higher_better:
        normed = [1.0 - v for v in normed]
    return normed


tuned_keys  = ["v1", "v2", "v3"]
x_pos       = [1, 2, 3]
x_labels    = ["V1\n(1-ep completion)", "V2\n(3-ep Q&A)", "V3\n(3-ep clean)"]
all_keys    = ["base"] + tuned_keys

metric_specs = [
    # (label, avg_dict, higher_better)
    ("Perplexity\n(↓ better)", avg_ppl, False),
    ("TF-IDF Similarity\n(↑ better)", avg_sim, True),
    ("Keyword Density\n(↑ better)", avg_kd, True),
]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Version Progression: Base Reference + V1 → V2 → V3",
             fontsize=14, fontweight='bold')

for ax, (metric_label, metric_dict, higher_better) in zip(axes, metric_specs):
    all_vals  = [metric_dict[k] for k in all_keys]
    norm_vals = _norm01(all_vals, higher_better)

    base_norm   = norm_vals[0]
    tuned_norms = norm_vals[1:]  # v1, v2, v3

    # Base model as horizontal dashed reference line
    ax.axhline(y=base_norm, color=MODEL_COLORS["base"], linestyle='--', lw=2,
               label="Base (ref)")

    # Line through V1 → V2 → V3
    ax.plot(x_pos, tuned_norms, 'o-', color='#7B2D8B', lw=2.5, ms=8,
            label="Fine-tuned")

    raw_vals = [metric_dict[k] for k in tuned_keys]
    # Use xp rather than x so the x array from the bar-chart cells is not overwritten
    for xp, y_norm, y_raw, k in zip(x_pos, tuned_norms, raw_vals, tuned_keys):
        ax.annotate(f"{y_raw:.3f}", (xp, y_norm),
                    textcoords="offset points", xytext=(6, 4),
                    fontsize=8, color=MODEL_COLORS[k])
    # Also annotate base
    ax.annotate(f"{metric_dict['base']:.3f}", (0.5, base_norm),
                textcoords="offset points", xytext=(4, 4),
                fontsize=8, color=MODEL_COLORS["base"])

    ax.set_xticks(x_pos)
    ax.set_xticklabels(x_labels, fontsize=9)
    ax.set_xlim(0.5, 3.5)
    ax.set_ylim(-0.1, 1.2)
    ax.set_title(metric_label, fontsize=11, fontweight='bold')
    ax.set_ylabel("Normalised score (0=worst, 1=best)")
    ax.legend(fontsize=9)
    ax.grid(axis='y', alpha=0.4)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "08_version_progression.png", bbox_inches='tight', dpi=150)
plt.show()
print("Saved: 08_version_progression.png")

# ── Delta table ────────────────────────────────────────────────────────────
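def fmt_delta(old_v, new_v, higher_better=True):
    """Format the % change from old_v to new_v with ▲/▼ direction and ✓/✗ expected-direction flags."""
    if abs(old_v) < 1e-9:
        return "N/A"  # relative change is undefined for a zero baseline
    pct = 100 * (new_v - old_v) / old_v
    if abs(pct) < 0.05:
        return "0.0% ="  # rounds to zero: neither an improvement nor a regression
    is_up = pct > 0
    ok    = (is_up == higher_better)
    return f"{pct:+.1f}% {'▲' if is_up else '▼'} {'✓' if ok else '✗'}"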


transitions = [
    ("Base → V1", "base", "v1"),
    ("V1  → V2",  "v1",   "v2"),
    ("V2  → V3",  "v2",   "v3"),
    ("Base → V3", "base", "v3"),
]

print("\nVersion Transition Delta Table")
print("=" * 90)
print(f"{'Transition':<14} {'Perplexity (↓✓)':>18} {'TF-IDF (↑✓)':>16} {'Kwd Density (↑✓)':>18} {'TTR (↑✓)':>14}")
print("-" * 90)
for label, from_k, to_k in transitions:
    ppl_d = fmt_delta(avg_ppl[from_k], avg_ppl[to_k], higher_better=False)
    sim_d = fmt_delta(avg_sim[from_k], avg_sim[to_k], higher_better=True)
    kd_d  = fmt_delta(avg_kd[from_k],  avg_kd[to_k],  higher_better=True)
    ttr_d = fmt_delta(avg_ttr[from_k], avg_ttr[to_k], higher_better=True)
    print(f"{label:<14} {ppl_d:>18} {sim_d:>16} {kd_d:>18} {ttr_d:>14}")
print("=" * 90)
print("✓ = change in expected direction  |  ✗ = unexpected  |  ▲/▼ = direction")

---
## Part 6: Side-by-Side Response Comparison

Quantitative metrics tell only part of the story. Below we print all four model
responses to three selected prompts to allow direct qualitative comparison.

**Prompts chosen to maximise observable differences:**

1. *"What is the relationship between order and chaos in human experience?"* A core
   Peterson theme; all models should show a strong signal on this one.
2. *"Why is personal responsibility the foundation of a meaningful life?"* The prompt
   where V3 was observed to produce raw index text in a prior evaluation; it shows
   whether a model returns back-matter contamination or a substantive response.
3. *"How should a person confront suffering rather than flee from it?"* Suffering and
   transcendence are key themes of *We Who Wrestle with God*, the book whose
   front-matter contamination differs most between the V2 and V3 training sets.

**What to look for:**
- Does V1 produce passage-like continuations vs structured Q&A answers?
- Do V2 and V3 produce qualitatively similar responses (suggesting data cleanliness
  has minimal impact on output style)?
- Does any model produce index/figure-list text (indicating back-matter contamination)?

In [None]:
COMPARISON_PROMPT_INDICES = [0, 1, 4]  # Q1, Q2, Q5

for idx in COMPARISON_PROMPT_INDICES:
    print("=" * 100)
    print(f"PROMPT Q{idx+1}: {EVAL_PROMPTS[idx]}")
    print("=" * 100)
    for key in MODEL_KEYS:
        resp = all_responses[key][idx]
        print(f"\n[{MODEL_DISPLAY[key]}]")
        print(resp if resp.strip() else "(empty response)")
    print()

---
## Part 7: Conclusions

### Research Question 1 — Did Q&A training (V1→V2) improve answering ability?

The V1→V2 transition changed the training task (completion → Q&A), the LoRA rank
(r=16 → r=32), *and* the epoch count (1 → 3). Any improvement in TF-IDF similarity
and keyword density from V1 to V2 therefore reflects all three changes together; this
notebook cannot separate them. The task change is the most plausible dominant factor,
since V1 learned to continue passages while V2 learned to answer questions using
Peterson's vocabulary, but that remains a hypothesis.

If TF-IDF similarity and keyword density improved from V1 to V2, the answer to
Research Question 1 is **yes**, subject to that confound. If V1's passage-completion
model scores similarly on these metrics, it would suggest that domain vocabulary is
learned regardless of task framing, an interesting null result in its own right.

### Research Question 2 — Did clean data (V2→V3) produce measurably different results?

V2 and V3 are the cleanest A/B comparison: identical architecture, identical
hyperparameters, and training loss within 0.02 of each other (negligible). Metric
differences are therefore most plausibly due to the 162-pair data reduction and
front-matter cleanup, although with only 10 prompts and eight passages small gaps may
also be noise.

If V2 and V3 scores are nearly identical, the conclusion is that **front-matter removal
at this dataset scale has minimal impact on inference quality**, and the next
highest-leverage data improvement is back-matter removal (index pages, figure lists).

### Research Question 3 — How do fine-tuned versions compare to the base model?

All fine-tuned models share the same Qwen3-14B base architecture and tokenizer,
making perplexity directly comparable. Any version showing lower perplexity on
held-out Peterson passages has learned something about his writing style.

### V3 Index Contamination

The known V3 Q2 anomaly (raw index text output) should be visible in the response
comparison above. If it reproduces, it confirms that the training data still contains
index/back-matter pages: the front-matter heuristic in `JordanPeterson_DataPrep.ipynb`
removes content *before* Chapter 1 but does not handle content *after* the final chapter.

**Next step**: Add a back-matter removal heuristic (stop extracting at the last chapter
header or first "Index" section header) and regenerate the Q&A cache as V4 training data.
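
One way the "first Index header" rule could look is sketched below. It assumes each book is available as a list of page strings and that the index starts on a page whose first non-empty line is just the word "Index"; both are assumptions, and `strip_back_matter` is a hypothetical helper rather than code from `JordanPeterson_DataPrep.ipynb`.

```python
import re

# Hypothetical header pattern: a line containing only the word "Index".
_INDEX_HEADER = re.compile(r"^\s*index\s*$", re.IGNORECASE)


def strip_back_matter(pages):
    """Return the pages before the first one that opens with an Index header."""
    for i, page in enumerate(pages):
        # First non-empty line of the page, or "" for a blank page.
        first_line = next((ln for ln in page.splitlines() if ln.strip()), "")
        if _INDEX_HEADER.match(first_line):
            return pages[:i]
    return pages  # no Index header found: leave the book unchanged


pages = [
    "Chapter 1\nOrder and chaos ...",
    "Chapter 2\nResponsibility ...",
    "Index\nchaos, 12, 48",
    "hero, 77",
]
print(len(strip_back_matter(pages)))  # 2
```

Matching only a bare "Index" line keeps ordinary prose that mentions an index from triggering the cut, at the cost of missing books that label the section differently. Real page layouts would need to be checked before relying on it.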

### Adding any of these models to AllModels comparison

To include any Qwen3 version in `AllModels_JordanPeterson_Comparison.ipynb`:
1. Add the key to `MODEL_KEYS` in the AllModels config cell
2. Add its path to `MODEL_PATHS`
3. Add display name, colour, and system prompt
4. The pkl format is identical — copy from `comparison_cache_qwen3_versions/` to
   `comparison_cache_all_models/` to skip re-inference
5. Adjust figure layout if needed (currently 2×2 → may need 2×3 or 3×2 for 6 models)
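
Step 4 can be scripted rather than done by hand. The sketch below copies every `.pkl` file between the two cache directories named above; `copy_cached_results` is a hypothetical helper, and it assumes the AllModels notebook expects the same file names as this one, which should be checked against its cache-loading cell.

```python
import shutil
from pathlib import Path


def copy_cached_results(src_dir, dst_dir, overwrite=False):
    """Copy every .pkl file from src_dir into dst_dir; return the names copied."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pkl in sorted(src.glob("*.pkl")):
        target = dst / pkl.name
        if target.exists() and not overwrite:
            continue  # keep an existing cache entry rather than clobbering it
        shutil.copy2(pkl, target)
        copied.append(pkl.name)
    return copied


# Example call (left commented out so importing this cell has no side effects):
# copy_cached_results("comparison_cache_qwen3_versions", "comparison_cache_all_models")
```

By default an existing file in the destination is left alone, so a stale cache entry has to be deleted or `overwrite=True` passed to refresh it.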