# Comparing Base vs. Fine-Tuned Qwen3-14B: Jordan Peterson Analysis

## Overview

This notebook provides a quantitative comparison between:
- **Base model**: `unsloth/Qwen3-14B-unsloth-bnb-4bit`, i.e. Qwen3-14B with no domain specialization
- **Fine-tuned model**: The same base with LoRA adapters trained on ~768,000 words from four Jordan Peterson books

It is the companion to `GPT_OSS_20B_JordanPeterson_Comparison.ipynb` and uses the identical evaluation framework, allowing meaningful cross-model comparisons.

### How This Notebook Differs from the GPT-OSS Comparison

| Aspect | GPT-OSS 20B | Qwen3-14B |
|--------|-------------|-----------|
| Chat template | Harmony (`reasoning_effort`) | **ChatML** (`enable_thinking`) |
| Inference API | `apply_chat_template(return_dict=True)` | **`apply_chat_template` → text → tokenize** |
| Thinking mode | Via `reasoning_effort="low"` | Via **`enable_thinking=False`** |
| Expected base perplexity | Higher (larger model, different training) | Different baseline |
| Fine-tune loss | 3.01 | **2.44** (better: smaller model, larger batch) |
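
To make the ChatML row concrete, the sketch below hand-renders roughly what a ChatML-style prompt looks like for a system and user message in non-thinking mode. `render_chatml` is a hypothetical helper written only for illustration; the tokenizer's own `apply_chat_template` remains the source of truth, and the exact whitespace and the pre-filled empty `<think>` block are assumptions about the template rather than guarantees.

```python
def render_chatml(messages, enable_thinking=False):
    """Approximate ChatML rendering (illustrative; not the real template)."""
    parts = []
    for msg in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Generation prompt: the model continues from the assistant header.
    parts.append("<|im_start|>assistant\n")
    if not enable_thinking:
        # Assumption: non-thinking mode pre-fills an empty reasoning block.
        parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is order?"},
]
prompt_text = render_chatml(messages)
print(prompt_text)
```

Printing the string from the real tokenizer (`tokenize=False`) is the reliable way to confirm what the model actually sees.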

### Metrics (Same as GPT-OSS Comparison)

| Metric | What It Measures | Expected Result |
|--------|-----------------|----------------|
| **Perplexity** | How surprised the model is by Peterson's actual writing | Fine-tuned = lower |
| **TF-IDF Similarity** | Vocabulary overlap with real Peterson passages | Fine-tuned = higher |
| **Keyword Density** | Frequency of Peterson's characteristic vocabulary | Fine-tuned = higher |
| **Type-Token Ratio** | Vocabulary richness | Fine-tuned ≈ higher |
| **Response Length** | Words per response (Peterson is verbose) | Fine-tuned = longer |
| **Word Distribution** | Word clouds of dominant vocabulary | Fine-tuned = Peterson-specific terms |

### Memory Management

Qwen3-14B in 4-bit uses ~10-11 GB of VRAM. We still can't fit two models simultaneously (would require ~22 GB active at once, plus overhead), so we evaluate them sequentially, caching results to disk between phases.
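
The evaluate-then-cache pattern described above can be sketched without any GPU code. `run_phase` and `fake_eval` are hypothetical names used only for this illustration, and the numbers in `fake_eval` are placeholders, not results from either model.

```python
import pickle
import tempfile
from pathlib import Path


def run_phase(cache_path, compute_fn):
    """Return cached results if present; otherwise compute, save, and return them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    results = compute_fn()  # stands in for the expensive, GPU-bound step
    with open(cache_path, "wb") as f:
        pickle.dump(results, f)
    return results


calls = []


def fake_eval():
    """Placeholder for 'load model, measure, unload'; values are made up."""
    calls.append(1)
    return {"perplexities": [12.5, 9.0], "responses": ["a", "b"]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "base_results.pkl"
    first = run_phase(path, fake_eval)   # computes and writes the cache
    second = run_phase(path, fake_eval)  # served from disk, no recompute

print(first == second, len(calls))
```

Because each phase writes its results before the next model is loaded, only one set of weights needs to be resident at a time.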

---
## Step 1: Setup and Configuration

In [None]:
import os, re, gc, math, pickle
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt',     quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

# ── Paths ─────────────────────────────────────────────────────────────
BASE_MODEL_NAME = "unsloth/Qwen3-14B-unsloth-bnb-4bit"
LORA_MODEL_PATH = "./outputs/qwen3_14b_jordan_peterson_lora"
CACHE_DIR       = Path("./comparison_cache_qwen3")
FIGURES_DIR     = Path("./comparison_figures_qwen3")
CACHE_DIR.mkdir(exist_ok=True)
FIGURES_DIR.mkdir(exist_ok=True)

# ── Plot style ────────────────────────────────────────────────────────
plt.rcParams.update({
    'figure.dpi': 120,
    'figure.facecolor': 'white',
    'axes.facecolor': '#f8f8f8',
    'axes.grid': True,
    'grid.alpha': 0.4,
    'font.size': 11,
})
BASE_COLOR  = '#4C72B0'   # blue:   base model
TUNED_COLOR = '#DD8452'   # orange: fine-tuned model

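print(f"PyTorch : {torch.__version__}")
print(f"CUDA    : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Guarded so a CPU-only session prints a clear message instead of raising.
    print(f"GPU     : {torch.cuda.get_device_name(0)}")
    print(f"VRAM    : {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("GPU     : none detected (Steps 4 and 5 require CUDA)")
print(f"Cache   : {CACHE_DIR.resolve()}")
print(f"Figures : {FIGURES_DIR.resolve()}")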

---
## Step 2: Reference Data

This notebook reuses the same three reference datasets as the GPT-OSS comparison (held-out Peterson passages, evaluation prompts, and a keyword dictionary), so the results are directly comparable across model families.

In [None]:
PETERSON_PASSAGES = [
    # "\u2014" is an em dash, written as an escape so it survives re-encoding.
    "The world can be validly construed as a forum for action, or as a place of things. "
    "The former manner of interpretation \u2014 more primordial, and less clearly understood \u2014 "
    "finds its expression in the arts or humanities, in ritual, drama, literature, and myth. "
    "The world as forum for action is a place of value, a place where all things have meaning.",

    "To stand up straight with your shoulders back is to accept the terrible responsibility "
    "of life, with eyes wide open. It means deciding to voluntarily transform the chaos of "
    "potential into the realities of habitable order. It means adopting the burden of "
    "self-conscious vulnerability, and accepting the end of the unconscious paradise of "
    "childhood, where finitude and mortality are only dimly comprehended.",

    "Order is the place where the things you are currently doing are working out well "
    "for you. Chaos is the domain of ignorance itself. It's unexplored territory. Chaos "
    "is what extends, endlessly and without limit, beyond the boundaries of all states, "
    "all ideas, and all disciplines. It's the foreigner, the stranger, the member of "
    "another gang, the rustle in the bushes in the night-time.",

    "The divine spark in man is the logos \u2014 the word, the reason, the creative principle "
    "that gives order to the chaos of experience. To act in accordance with the logos is "
    "to speak the truth, to pursue what is meaningful rather than what is expedient, and "
    "to take on the burden of Being itself with courage and humility.",

    "Compare yourself to who you were yesterday, not to who someone else is today. "
    "You have a nature. You can play the game of life and improve. You can set a "
    "standard, even a minimal standard, and try to live up to it. You can improve "
    "incrementally, moving forward step by step. You can judge your life against "
    "what you know to be good, against what you should be.",

    "The great myths and rituals of the past have been formulated in the language of "
    "the imagination. They say: act out the role of the hero; do not be the villain; "
    "do not be the tyrant. They say: update your maps of meaning when new information "
    "warrants it; admit your errors and change. They say: encounter the stranger and "
    "extract from that encounter what is valuable. Treat the stranger with respect.",

    "Meaning is the ultimate balance between, on the one hand, the chaos of transformation "
    "and possibility and, on the other, the discipline of pristine order, whose purpose is "
    "to produce out of the attendant chaos a new order that will be even more productive "
    "and worthwhile than the old. Pursue what is meaningful, not what is expedient.",

    "Suffering is not a mistake or an accident. It is the very ground of Being itself. "
    "To wrestle with God, as Jacob did, is to confront that suffering honestly, to take "
    "responsibility for it, and to find within it the possibility of transcendence. The "
    "hero does not flee from the dragon; he faces it and transforms the encounter.",
]

EVAL_PROMPTS = [
    "What is the relationship between order and chaos in human experience?",
    "Why is personal responsibility the foundation of a meaningful life?",
    "How do ancient myths and stories reveal truths about human nature?",
    "What does it mean to pursue what is meaningful rather than what is expedient?",
    "How should a person confront suffering rather than flee from it?",
    "What is the significance of the hero archetype in understanding the human psyche?",
    "Why is telling the truth essential to a properly functioning life?",
    "What is the role of the divine or the sacred in organizing human society?",
    "How does the Jungian concept of the shadow relate to individual development?",
    "What does it mean to stand up straight with your shoulders back?",
]

PETERSON_KEYWORDS = list(set([
    "chaos", "order", "logos", "being", "meaning", "meaningful", "meaningless",
    "transcendence", "transcendent", "archetype", "archetypal",
    "shadow", "anima", "animus", "unconscious", "consciousness", "psyche",
    "individuation", "projection",
    "responsibility", "suffering", "redemption", "courage", "virtue",
    "nihilism", "nihilistic", "expedient", "expedience", "tyranny", "tyrannical",
    "sovereignty", "heroic", "malevolent",
    "myth", "mythological", "hero", "dragon", "narrative", "story",
    "ritual", "sacrifice", "resurrection", "transformation",
    "divine", "sacred", "god", "biblical", "genesis", "logos", "spirit",
    "wrestle", "jacob", "adam", "eve", "serpent",
    "confront", "hierarchy", "dominance", "voluntarily", "catastrophe",
    "pathological", "resentment", "ideological", "totalitarian",
]))

print(f"Reference passages  : {len(PETERSON_PASSAGES)}")
print(f"Evaluation prompts  : {len(EVAL_PROMPTS)}")
print(f"Keyword dictionary  : {len(PETERSON_KEYWORDS)} terms")

---
## Step 3: Helper Functions

The metric functions are identical to the GPT-OSS comparison notebook. The key difference is in `generate_response()`, which must use Qwen3's ChatML chat template:

```python
# GPT-OSS approach (reasoning_effort parameter, return_dict=True):
inputs = tokenizer.apply_chat_template(messages, return_dict=True,
             reasoning_effort="low", ...).to("cuda")

# Qwen3 approach (enable_thinking parameter, two-step):
text   = tokenizer.apply_chat_template(messages, enable_thinking=False, tokenize=False, ...)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
```

With `tokenize=False`, `apply_chat_template` returns the formatted prompt as a plain string (not a dict of tensors), so we tokenize in a second step. We also strip any `<think>...</think>` blocks from the output so that the measured response text contains only the visible answer.
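
The stripping step can be tried in isolation. The sketch below uses the same regular expression as `generate_response()`; the helper name `strip_think` and the sample strings are made up for illustration. It also shows one limitation: a reasoning block that is cut off before `</think>` does not match the pattern and is left in place.

```python
import re

# Same pattern as in generate_response(): non-greedy, spanning newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text):
    """Remove complete <think>...</think> blocks and trim whitespace."""
    return THINK_BLOCK.sub("", text).strip()


closed = "<think>\nplan the answer\n</think>\n\nOrder is explored territory."
unclosed = "<think>\nplan the answer that never finis"

cleaned_closed = strip_think(closed)      # block removed, answer kept
cleaned_unclosed = strip_think(unclosed)  # no closing tag, so nothing is removed

print(repr(cleaned_closed))
print(repr(cleaned_unclosed))
```

If thinking mode were ever enabled with a small `max_new_tokens`, truncated blocks like the second case would need separate handling.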

In [None]:
def compute_perplexity(model, tokenizer, texts: list, max_length: int = 512) -> list:
    """
    Compute perplexity of a language model on each text.

    Perplexity = exp(average negative log-likelihood per token).
    Lower perplexity = model is less surprised by the text = more domain-adapted.
    """
    model.eval()
    perplexities = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            max_length=max_length, truncation=True).to("cuda")
            out = model(**enc, labels=enc["input_ids"])
            perplexities.append(math.exp(out.loss.item()))
    return perplexities


def generate_response(model, tokenizer, prompt: str,
                      system_prompt: str, max_new_tokens: int = 300) -> str:
    """
    Generate a response from a Qwen3 model for a given prompt.

    Key difference from GPT-OSS: we call apply_chat_template with tokenize=False,
    so it returns a plain string, not a tensor dict, and we tokenize in a second
    step. We also pass enable_thinking=False to suppress chain-of-thought and get
    direct Peterson-style responses.

    Any residual <think>...</think> blocks are stripped from the output so that only
    the visible response is measured.

    Uses greedy decoding (do_sample=False) for deterministic, fair comparison.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": prompt},
    ]
    # Step 1: Apply chat template → plain text string
    text = tokenizer.apply_chat_template(
        messages,
        tokenize             = False,
        add_generation_prompt= True,
        enable_thinking      = False,   # Non-thinking mode: direct response
    )
    # Step 2: Tokenize the formatted string
    inputs = tokenizer(text, return_tensors="pt").to("cuda")

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens    = max_new_tokens,
            do_sample         = False,   # Greedy: fully deterministic
            temperature       = 1.0,
            repetition_penalty= 1.1,
        )

    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    response   = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Strip any <think>...</think> blocks (may appear even in non-thinking mode)
    response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
    return response


def compute_text_stats(texts: list) -> dict:
    """
    Compute text statistics over a list of responses.

    Every text always contributes exactly one entry to each output list
    (empty texts get 0 counts) so that per-prompt plots never have length mismatches.
    """
    stop_words = set(stopwords.words('english'))
    kw_set     = set(k.lower() for k in PETERSON_KEYWORDS)

    word_counts, sentence_counts, ttr_values, keyword_density = [], [], [], []
    keyword_counts = Counter()
    all_words = []

    for text in texts:
        if text.strip():
            words = word_tokenize(text.lower())
            sents = sent_tokenize(text)
        else:
            words, sents = [], []

        words_alpha = [w for w in words if w.isalpha()]
        word_counts.append(len(words_alpha))
        sentence_counts.append(len(sents))
        ttr_values.append(len(set(words_alpha)) / max(len(words_alpha), 1))

        kw_hits = [w for w in words_alpha if w in kw_set]
        keyword_density.append(len(kw_hits) / max(len(words_alpha), 1))
        keyword_counts.update(kw_hits)

        content_words = [w for w in words_alpha if w not in stop_words and len(w) > 2]
        all_words.extend(content_words)

    return {
        "word_counts":     word_counts,
        "sentence_counts": sentence_counts,
        "ttr_values":      ttr_values,
        "keyword_density": keyword_density,
        "keyword_counts":  keyword_counts,
        "all_words":       all_words,
    }


def compute_tfidf_similarity(responses: list, references: list) -> list:
    """
    TF-IDF cosine similarity: how similar each response is to Peterson's actual writing.
    Returns one similarity score per response (0.0 for empty responses).
    """
    if not responses or not any(r.strip() for r in responses):
        return [0.0] * len(responses)
    all_texts  = references + responses
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf      = vectorizer.fit_transform(all_texts)
    ref_vecs   = tfidf[:len(references)]
    resp_vecs  = tfidf[len(references):]
    return [float(cosine_similarity(resp_vecs[i], ref_vecs)[0].max())
            for i in range(resp_vecs.shape[0])]


print("Helper functions defined.")

---
## Step 4: Evaluate the Base Model

We load `unsloth/Qwen3-14B-unsloth-bnb-4bit`, compute perplexity on Peterson's passages, generate responses to all 10 evaluation prompts, save results to `comparison_cache_qwen3/`, then fully unload the model before loading the fine-tuned version.

**System prompt for base model**: A generic helpful-assistant prompt, the same framing a user would naturally give a model that has not been fine-tuned.

**System prompt for fine-tuned model**: The Peterson-expert persona the model was trained on.

In [None]:
from unsloth import FastLanguageModel

BASE_SYSTEM_PROMPT = "You are a helpful assistant."

TUNED_SYSTEM_PROMPT = (
    "You are an AI assistant that has been trained on the complete works of Jordan B. Peterson, "
    "a Canadian clinical psychologist, professor, and author. You speak with deep knowledge of "
    "psychology, philosophy, mythology, religion, and personal responsibility. Your responses "
    "reflect Peterson's writing style, intellectual depth, and interdisciplinary approach to "
    "understanding human nature and meaning."
)

# ── Cache check ───────────────────────────────────────────────────────
_base_cache_exists = (CACHE_DIR / "base_results.pkl").exists()

if _base_cache_exists:
    print("Base model cache found; skipping inference.")
    print(f"  (Delete {CACHE_DIR / 'base_results.pkl'} to force re-run)")
else:
    print("Loading BASE model (Qwen3-14B 4-bit)…")
    base_model, base_tokenizer = FastLanguageModel.from_pretrained(
        model_name    = BASE_MODEL_NAME,
        dtype         = None,
        max_seq_length= 2048,
        load_in_4bit  = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(base_model)
    print(f"Base model loaded. VRAM: {torch.cuda.memory_reserved()/1e9:.1f} GB")

In [None]:
if not _base_cache_exists:
    print("Computing base model perplexity on Peterson passages…")
    base_perplexities = compute_perplexity(base_model, base_tokenizer, PETERSON_PASSAGES)
    for i, (txt, ppl) in enumerate(zip(PETERSON_PASSAGES, base_perplexities)):
        print(f"  Passage {i+1}: PPL = {ppl:.2f}  |  '{txt[:55]}…'")

In [None]:
if not _base_cache_exists:
    print("Generating base model responses…\n")
    base_responses = []
    for i, prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{i+1}/{len(EVAL_PROMPTS)}] {prompt[:70]}")
        resp = generate_response(base_model, base_tokenizer, prompt, BASE_SYSTEM_PROMPT)
        base_responses.append(resp)
        print(f"         → {resp[:100]}…\n")
    print(f"Done. {len(base_responses)} responses collected.")

In [None]:
if not _base_cache_exists:
    base_results = {"perplexities": base_perplexities, "responses": base_responses}
    with open(CACHE_DIR / "base_results.pkl", "wb") as f:
        pickle.dump(base_results, f)
    print(f"Saved. Avg PPL: {sum(base_perplexities)/len(base_perplexities):.2f}  "
          f"| Responses: {len(base_responses)}")

In [None]:
if not _base_cache_exists:
    del base_model, base_tokenizer
    gc.collect()
    torch.cuda.empty_cache()
    free = (torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_reserved()) / 1e9
    print(f"Base model unloaded. VRAM free: {free:.1f} GB")
else:
    print("Base model not loaded (used cache).")

---
## Step 5: Evaluate the Fine-Tuned Model

We load the LoRA-adapted Qwen3-14B from the `outputs/` directory. Unsloth loads the base weights and applies the LoRA adapters on top transparently, so the resulting model behaves like a single fine-tuned model from the caller's perspective.
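
Why an adapted model can behave like a single model follows from the standard LoRA formulation: the adapter adds a low-rank update `(alpha / r) * B @ A` to a frozen weight `W`, so applying the adapter on the fly gives the same output as a layer whose weight is `W + (alpha / r) * B @ A`. The sketch below checks this on toy matrices in plain Python. All shapes and values (`W`, `A`, `B`, `r`, `alpha`) are invented for illustration and are not taken from the trained adapter.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Toy layer: 3 inputs, 2 outputs, adapter rank r = 1 (all values hypothetical).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, shape 2x3
B = [[0.5], [-0.25]]                     # LoRA "up" matrix, shape 2xr
A = [[1.0, 2.0, 0.0]]                    # LoRA "down" matrix, shape rx3
r, alpha = 1, 2
scale = alpha / r
x = [1.0, 1.0, 1.0]

# Adapter applied at inference time: y = W x + scale * B (A x)
delta = matvec(B, matvec(A, x))
y_adapter = [base + scale * d for base, d in zip(matvec(W, x), delta)]

# Equivalent single weight matrix: W' = W + scale * (B A)
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

print(y_adapter, y_merged)
```

Both paths give the same output vector, which is why the caller does not need to care whether the adapter is kept separate or folded into the weights.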

In [None]:
_tuned_cache_exists = (CACHE_DIR / "tuned_results.pkl").exists()

if _tuned_cache_exists:
    print("Fine-tuned model cache found; skipping inference.")
    print(f"  (Delete {CACHE_DIR / 'tuned_results.pkl'} to force re-run)")
else:
    print("Loading FINE-TUNED model (base + LoRA adapters)…")
    tuned_model, tuned_tokenizer = FastLanguageModel.from_pretrained(
        model_name    = LORA_MODEL_PATH,
        dtype         = None,
        max_seq_length= 2048,
        load_in_4bit  = True,
        full_finetuning = False,
    )
    FastLanguageModel.for_inference(tuned_model)
    print(f"Fine-tuned model loaded. VRAM: {torch.cuda.memory_reserved()/1e9:.1f} GB")

In [None]:
if not _tuned_cache_exists:
    print("Computing fine-tuned model perplexity…")
    tuned_perplexities = compute_perplexity(tuned_model, tuned_tokenizer, PETERSON_PASSAGES)
    for i, (txt, ppl) in enumerate(zip(PETERSON_PASSAGES, tuned_perplexities)):
        print(f"  Passage {i+1}: PPL = {ppl:.2f}  |  '{txt[:55]}…'")

In [None]:
if not _tuned_cache_exists:
    print("Generating fine-tuned model responses…\n")
    tuned_responses = []
    for i, prompt in enumerate(EVAL_PROMPTS):
        print(f"  [{i+1}/{len(EVAL_PROMPTS)}] {prompt[:70]}")
        resp = generate_response(tuned_model, tuned_tokenizer, prompt, TUNED_SYSTEM_PROMPT)
        tuned_responses.append(resp)
        print(f"         → {resp[:100]}…\n")
    print(f"Done. {len(tuned_responses)} responses collected.")

In [None]:
if not _tuned_cache_exists:
    tuned_results = {"perplexities": tuned_perplexities, "responses": tuned_responses}
    with open(CACHE_DIR / "tuned_results.pkl", "wb") as f:
        pickle.dump(tuned_results, f)
    print(f"Saved. Avg PPL: {sum(tuned_perplexities)/len(tuned_perplexities):.2f}  "
          f"| Responses: {len(tuned_responses)}")

In [None]:
if not _tuned_cache_exists:
    del tuned_model, tuned_tokenizer
    gc.collect()
    torch.cuda.empty_cache()
    print("Fine-tuned model unloaded. Beginning analysis…")
else:
    print("Fine-tuned model not loaded (used cache). Beginning analysis…")

---
## Step 6: Compute All Derived Metrics

We load both result sets from cache and compute all derived metrics in one place. Re-running this and subsequent cells is fast and needs no GPU; only Steps 4 and 5 use the GPU.
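
Because the analysis reads from pickled caches, a stale file (for example, one written before the prompt list changed) would silently skew the plots. One possible guard is a shape check before computing metrics. `check_results` is a hypothetical helper, not part of the original notebook; the expected sizes of 8 passages and 10 prompts match the reference data defined in Step 2.

```python
def check_results(results, n_passages, n_prompts):
    """Raise ValueError if a cached result dict has the wrong shape."""
    problems = []
    if len(results.get("perplexities", [])) != n_passages:
        problems.append("perplexities")
    if len(results.get("responses", [])) != n_prompts:
        problems.append("responses")
    if problems:
        raise ValueError(f"stale or incomplete cache: {', '.join(problems)}")
    return True


# Placeholder dicts with the right and wrong shapes (contents are dummies).
good = {"perplexities": [0.0] * 8, "responses": [""] * 10}
stale = {"perplexities": [0.0] * 8, "responses": [""] * 7}

ok = check_results(good, n_passages=8, n_prompts=10)
try:
    check_results(stale, n_passages=8, n_prompts=10)
    stale_detected = False
except ValueError:
    stale_detected = True

print(ok, stale_detected)
```

In the notebook this would be called with `len(PETERSON_PASSAGES)` and `len(EVAL_PROMPTS)` right after the two `pickle.load` calls.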

In [None]:
with open(CACHE_DIR / "base_results.pkl",  "rb") as f: base_results  = pickle.load(f)
with open(CACHE_DIR / "tuned_results.pkl", "rb") as f: tuned_results = pickle.load(f)

base_perplexities  = base_results["perplexities"]
tuned_perplexities = tuned_results["perplexities"]
base_responses     = base_results["responses"]
tuned_responses    = tuned_results["responses"]

print(f"Base  responses : {len(base_responses)}  |  non-empty: {sum(1 for r in base_responses  if r.strip())}")
print(f"Tuned responses : {len(tuned_responses)} |  non-empty: {sum(1 for r in tuned_responses if r.strip())}")
print()

print("Computing text statistics…")
base_stats  = compute_text_stats(base_responses)
tuned_stats = compute_text_stats(tuned_responses)

print("Computing TF-IDF similarity…")
base_similarities  = compute_tfidf_similarity(base_responses,  PETERSON_PASSAGES)
tuned_similarities = compute_tfidf_similarity(tuned_responses, PETERSON_PASSAGES)

# ── Summary table ─────────────────────────────────────────────────────
def _avg(lst): return sum(lst) / max(len(lst), 1)

rows = {
    "Metric": [
        "Avg Perplexity (↓ better)",
        "Avg TF-IDF Similarity (↑ better)",
        "Avg Keyword Density % (↑ better)",
        "Avg Type-Token Ratio (↑ better)",
        "Avg Response Length (words)",
    ],
    "Base Model": [
        f"{_avg(base_perplexities):.2f}",
        f"{_avg(base_similarities):.4f}",
        f"{100*_avg(base_stats['keyword_density']):.2f}%",
        f"{_avg(base_stats['ttr_values']):.4f}",
        f"{_avg(base_stats['word_counts']):.1f}",
    ],
    "Fine-Tuned Model": [
        f"{_avg(tuned_perplexities):.2f}",
        f"{_avg(tuned_similarities):.4f}",
        f"{100*_avg(tuned_stats['keyword_density']):.2f}%",
        f"{_avg(tuned_stats['ttr_values']):.4f}",
        f"{_avg(tuned_stats['word_counts']):.1f}",
    ],
}
print("\n" + pd.DataFrame(rows).to_string(index=False))

---
## Step 7: Visualization - Perplexity

**Perplexity** measures how "surprised" the model is by Peterson's actual sentences. A fine-tuned model that has learned his vocabulary and sentence patterns assigns higher probability to his words, yielding lower perplexity. This is the most direct measure of domain adaptation.

$$\text{PPL}(\text{text}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i | w_1,\ldots,w_{i-1})\right)$$
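
A small worked example of the formula, using made-up per-token probabilities rather than model outputs: if a model assigns probabilities 0.5, 0.25 and 0.125 to three successive tokens, the mean negative log-likelihood is 2 ln 2, so the perplexity is 4. The function name `perplexity` is just for this sketch.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Geometric mean of the probabilities is 0.25, so perplexity is 1 / 0.25.
ppl_mixed = perplexity([0.5, 0.25, 0.125])

# A model that always spreads probability evenly over 10 options.
ppl_uniform = perplexity([0.1] * 5)

print(round(ppl_mixed, 6), round(ppl_uniform, 6))
```

Perplexity can be read as the effective number of equally likely choices per token, which is why the uniform 1-in-10 case comes out at 10.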

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle("Perplexity on Jordan Peterson Reference Passages (Qwen3-14B)",
             fontsize=14, fontweight='bold', y=1.01)

x = np.arange(len(PETERSON_PASSAGES))
w = 0.35
labels = [f"P{i+1}" for i in range(len(PETERSON_PASSAGES))]

ax = axes[0]
bb = ax.bar(x - w/2, base_perplexities,  w, label="Base",        color=BASE_COLOR,  alpha=0.85)
bt = ax.bar(x + w/2, tuned_perplexities, w, label="Fine-Tuned",  color=TUNED_COLOR, alpha=0.85)
for bar in bb: ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+.3,
                       f"{bar.get_height():.1f}", ha='center', va='bottom', fontsize=7, color=BASE_COLOR)
for bar in bt: ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+.3,
                       f"{bar.get_height():.1f}", ha='center', va='bottom', fontsize=7, color=TUNED_COLOR)
ax.set_xticks(x); ax.set_xticklabels(labels)
ax.set_xlabel("Peterson Passage"); ax.set_ylabel("Perplexity  (lower = better)")
ax.set_title("Per-Passage Perplexity"); ax.legend()

ax2 = axes[1]
avg_b = _avg(base_perplexities); avg_t = _avg(tuned_perplexities)
bars  = ax2.bar(["Base Model", "Fine-Tuned"], [avg_b, avg_t],
                color=[BASE_COLOR, TUNED_COLOR], alpha=0.85, width=0.45)
for bar, val in zip(bars, [avg_b, avg_t]):
    ax2.text(bar.get_x()+bar.get_width()/2, bar.get_height()+.1,
             f"{val:.2f}", ha='center', va='bottom', fontsize=12, fontweight='bold')
pct = 100 * (avg_b - avg_t) / avg_b
ax2.annotate(f"{pct:+.1f}%\nimprovement",
             xy=(1, avg_t), xytext=(0.5, (avg_b+avg_t)/2), fontsize=10, ha='center',
             color='green' if pct > 0 else 'red', fontweight='bold',
             arrowprops=dict(arrowstyle='->', color='green' if pct > 0 else 'red', lw=1.5))
ax2.set_ylabel("Average Perplexity  (lower = better)")
ax2.set_title("Average Perplexity Across All Passages")

plt.tight_layout()
plt.savefig(FIGURES_DIR / "01_perplexity.png", bbox_inches='tight', dpi=150)
plt.show()
print(f"Base: {avg_b:.2f}  |  Fine-tuned: {avg_t:.2f}  |  Reduction: {pct:+.1f}%")

---
## Step 8: Visualization - TF-IDF Semantic Similarity

**TF-IDF cosine similarity** measures how similar each model's response vocabulary is to Peterson's actual writing. Rare, distinctive words (chaos, logos, archetype, sovereignty) get higher TF-IDF weight than common words, so this metric rewards using Peterson's characteristic vocabulary rather than generic language.
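
The mechanics can be sketched in plain Python. This is a simplified stand-in for the scikit-learn pipeline used in the notebook: it uses whitespace tokenization, unigrams only, and no stop-word removal, whereas the notebook uses `stop_words='english'` and `ngram_range=(1, 2)`. The smoothed IDF formula is intended to mirror scikit-learn's documented default. The function names and sample sentences are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """L2-normalised TF-IDF vectors (as dicts) for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF: rarer terms receive larger weights.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for toks in tokenized:
        vec = {t: c * idf[t] for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = [
    "chaos and order shape meaning",   # reference passage
    "chaos and order shape meaning",   # response identical to the reference
    "the weather is pleasant today",   # response with no shared terms
    "order and meaning matter",        # response with partial overlap
]
ref, resp_same, resp_none, resp_some = tfidf_vectors(docs)
sim_same = cosine(ref, resp_same)
sim_none = cosine(ref, resp_none)
sim_some = cosine(ref, resp_some)

print(sim_same, sim_none, sim_some)
```

Because the score depends only on shared terms, it captures lexical overlap; two texts that express the same idea in different words would still score low.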

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle("TF-IDF Semantic Similarity to Peterson's Actual Writing (Qwen3-14B)",
             fontsize=14, fontweight='bold', y=1.01)

prompt_labels = [f"Q{i+1}" for i in range(len(EVAL_PROMPTS))]

ax = axes[0]
ax.plot(prompt_labels, base_similarities,  'o-', color=BASE_COLOR,  label="Base",       lw=2, ms=7)
ax.plot(prompt_labels, tuned_similarities, 's-', color=TUNED_COLOR, label="Fine-Tuned", lw=2, ms=7)
ax.fill_between(range(len(prompt_labels)), base_similarities,  alpha=0.15, color=BASE_COLOR)
ax.fill_between(range(len(prompt_labels)), tuned_similarities, alpha=0.15, color=TUNED_COLOR)
ax.set_xticks(range(len(prompt_labels))); ax.set_xticklabels(prompt_labels)
ax.set_xlabel("Evaluation Prompt"); ax.set_ylabel("Cosine Similarity  (higher = better)")
ax.set_title("Per-Prompt Similarity"); ax.legend(); ax.set_ylim(0, 1)
for i, (b, t) in enumerate(zip(base_similarities, tuned_similarities)):
    ax.annotate(f"{b:.2f}", (i, b), textcoords="offset points", xytext=(0, 8),
                fontsize=7, ha='center', color=BASE_COLOR)
    ax.annotate(f"{t:.2f}", (i, t), textcoords="offset points", xytext=(0, -14),
                fontsize=7, ha='center', color=TUNED_COLOR)

ax2 = axes[1]
bp = ax2.boxplot([base_similarities, tuned_similarities], labels=["Base", "Fine-Tuned"],
                 patch_artist=True, medianprops=dict(color='black', lw=2))
bp['boxes'][0].set_facecolor(BASE_COLOR);  bp['boxes'][0].set_alpha(0.7)
bp['boxes'][1].set_facecolor(TUNED_COLOR); bp['boxes'][1].set_alpha(0.7)
ax2.set_ylabel("Cosine Similarity"); ax2.set_title("Similarity Distribution"); ax2.set_ylim(0, 1)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "02_tfidf_similarity.png", bbox_inches='tight', dpi=150)
plt.show()
avg_b2 = _avg(base_similarities); avg_t2 = _avg(tuned_similarities)
print(f"Base: {avg_b2:.4f}  |  Fine-tuned: {avg_t2:.4f}  |  Change: {100*(avg_t2-avg_b2)/max(avg_b2,1e-6):+.1f}%")

---
## Step 9: Visualization - Peterson Keyword Density

**Keyword density** measures what fraction of words in each response belong to Peterson's characteristic vocabulary (~60 terms: chaos, order, meaning, hero, archetype, responsibility, logos, suffering, …). This directly tests whether the model has adopted his conceptual vocabulary.
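
A minimal sketch of the calculation, assuming a tiny four-word stand-in for `PETERSON_KEYWORDS` and a regex tokenizer in place of NLTK's `word_tokenize`. Like the notebook's version, it counts exact matches only, so inflected forms such as "heroes" do not count as hits for "hero".

```python
import re

# Tiny stand-in for PETERSON_KEYWORDS (illustrative only).
KEYWORDS = {"chaos", "order", "meaning", "hero"}


def keyword_density(text, keywords=KEYWORDS):
    """Fraction of alphabetic words that are exact keyword matches."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    return hits / len(words)


dense = keyword_density("Chaos and order give meaning")    # 3 hits in 5 words
sparse = keyword_density("The weather is pleasant today")  # no hits
inflected = keyword_density("heroes")                      # no exact match
empty = keyword_density("")                                # guarded division

print(dense, sparse, inflected, empty)
```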

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle("Peterson Keyword Density (Qwen3-14B)", fontsize=14, fontweight='bold', y=1.01)

base_kd  = [v * 100 for v in base_stats['keyword_density']]
tuned_kd = [v * 100 for v in tuned_stats['keyword_density']]
ql = [f"Q{i+1}" for i in range(len(EVAL_PROMPTS))]
x  = np.arange(len(ql)); w = 0.35

ax = axes[0]
ax.bar(x - w/2, base_kd,  w, label="Base",       color=BASE_COLOR,  alpha=0.85)
ax.bar(x + w/2, tuned_kd, w, label="Fine-Tuned", color=TUNED_COLOR, alpha=0.85)
ax.set_xticks(x); ax.set_xticklabels(ql)
ax.set_xlabel("Evaluation Prompt"); ax.set_ylabel("Keyword Density (%)")
ax.set_title("Fraction of Response Words in Peterson's Vocabulary"); ax.legend()

ax2 = axes[1]
all_kw  = set(base_stats['keyword_counts']) | set(tuned_stats['keyword_counts'])
top_kw  = sorted(all_kw,
                 key=lambda k: base_stats['keyword_counts'].get(k, 0) +
                               tuned_stats['keyword_counts'].get(k, 0),
                 reverse=True)[:15]
bkc = [base_stats['keyword_counts'].get(k, 0)  for k in top_kw]
tkc = [tuned_stats['keyword_counts'].get(k, 0) for k in top_kw]
y   = np.arange(len(top_kw))
ax2.barh(y + 0.2, bkc, 0.4, label="Base",       color=BASE_COLOR,  alpha=0.85)
ax2.barh(y - 0.2, tkc, 0.4, label="Fine-Tuned", color=TUNED_COLOR, alpha=0.85)
ax2.set_yticks(y); ax2.set_yticklabels(top_kw)
ax2.set_xlabel("Uses Across All Responses"); ax2.set_title("Top Peterson Keywords Used")
ax2.legend(); ax2.invert_yaxis()

plt.tight_layout()
plt.savefig(FIGURES_DIR / "03_keyword_density.png", bbox_inches='tight', dpi=150)
plt.show()
print(f"Avg keyword density - Base: {_avg(base_kd):.2f}%  |  Fine-tuned: {_avg(tuned_kd):.2f}%  "
      f"|  Change: {_avg(tuned_kd)-_avg(base_kd):+.2f}pp")

---
## Step 10: Visualization - Response Characteristics

**Type-Token Ratio (TTR)** measures vocabulary richness: what fraction of the words in a response are unique. Peterson's writing is known for its elaborate, varied vocabulary. **Response length** reflects verbosity: Peterson's explanations tend to be extensive and multi-layered.
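
A short sketch of TTR on a made-up sentence, again using a regex tokenizer as a stand-in for NLTK. It also illustrates a general property of the metric worth keeping in mind when reading the plots: TTR tends to fall as a text gets longer, because common words repeat, so a longer response can score lower without being less varied.

```python
import re


def type_token_ratio(text):
    """Unique words divided by total words (0.0 for empty text)."""
    words = re.findall(r"[a-z]+", text.lower())
    return len(set(words)) / max(len(words), 1)


short = "the hero faces the dragon"

ttr_short = type_token_ratio(short)                  # 4 unique of 5 words
ttr_doubled = type_token_ratio(short + " " + short)  # 4 unique of 10 words

print(ttr_short, ttr_doubled)
```

Doubling the text adds no new word types, so the ratio halves even though the vocabulary is unchanged.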

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("Response Characteristics: Vocabulary Richness & Length (Qwen3-14B)",
             fontsize=14, fontweight='bold')

ql2 = [f"Q{i+1}" for i in range(len(EVAL_PROMPTS))]

ax = axes[0, 0]
ax.plot(ql2, base_stats['ttr_values'],  'o-', color=BASE_COLOR,  label="Base",       lw=2, ms=7)
ax.plot(ql2, tuned_stats['ttr_values'], 's-', color=TUNED_COLOR, label="Fine-Tuned", lw=2, ms=7)
ax.set_ylabel("Type-Token Ratio"); ax.set_title("Vocabulary Richness (TTR) per Prompt")
ax.set_ylim(0, 1); ax.legend(); ax.set_xlabel("Prompt")

ax2 = axes[0, 1]
bp = ax2.boxplot([base_stats['ttr_values'], tuned_stats['ttr_values']],
                 labels=["Base", "Fine-Tuned"], patch_artist=True,
                 medianprops=dict(color='black', lw=2))
bp['boxes'][0].set_facecolor(BASE_COLOR);  bp['boxes'][0].set_alpha(0.7)
bp['boxes'][1].set_facecolor(TUNED_COLOR); bp['boxes'][1].set_alpha(0.7)
ax2.set_ylabel("Type-Token Ratio"); ax2.set_title("TTR Distribution"); ax2.set_ylim(0, 1)

x2 = np.arange(len(ql2)); w2 = 0.35
ax3 = axes[1, 0]
ax3.bar(x2 - w2/2, base_stats['word_counts'],  w2, label="Base",       color=BASE_COLOR,  alpha=0.85)
ax3.bar(x2 + w2/2, tuned_stats['word_counts'], w2, label="Fine-Tuned", color=TUNED_COLOR, alpha=0.85)
ax3.set_xticks(x2); ax3.set_xticklabels(ql2)
ax3.set_xlabel("Prompt"); ax3.set_ylabel("Words in Response")
ax3.set_title("Response Length (Word Count) per Prompt"); ax3.legend()

ax4 = axes[1, 1]
bp2 = ax4.boxplot([base_stats['word_counts'], tuned_stats['word_counts']],
                  labels=["Base", "Fine-Tuned"], patch_artist=True,
                  medianprops=dict(color='black', lw=2))
bp2['boxes'][0].set_facecolor(BASE_COLOR);  bp2['boxes'][0].set_alpha(0.7)
bp2['boxes'][1].set_facecolor(TUNED_COLOR); bp2['boxes'][1].set_alpha(0.7)
ax4.set_ylabel("Words in Response"); ax4.set_title("Response Length Distribution")

plt.tight_layout()
plt.savefig(FIGURES_DIR / "04_response_characteristics.png", bbox_inches='tight', dpi=150)
plt.show()

print(f"Avg TTR   - Base: {_avg(base_stats['ttr_values']):.4f}  "
      f"|  Fine-tuned: {_avg(tuned_stats['ttr_values']):.4f}")
print(f"Avg words - Base: {_avg(base_stats['word_counts']):.1f}  "
      f"|  Fine-tuned: {_avg(tuned_stats['word_counts']):.1f}")

---
## Step 11: Visualization - Word Clouds

Word clouds give an immediate visual of each model's dominant vocabulary (stop words removed). The base model should show generic language; the fine-tuned model should prominently feature Peterson's vocabulary: *meaning, chaos, order, responsibility, suffering, hero, myth, logos*.

In [None]:
stop_wc = set(STOPWORDS) | {'also', 'one', 'may', 'much', 'even', 'way', 'well',
                             'get', 'make', 'like', 'us', 'would', 'could', 'time',
                             'thing', 'things', 'many', 'something', 'often'}

def make_wc(words, title, ax):
    freq = Counter(w for w in words if w.lower() not in stop_wc and len(w) > 2)
    if not freq:
        ax.text(0.5, 0.5, "(no content words)", ha='center', va='center', transform=ax.transAxes)
        ax.set_title(title); return
    wc = WordCloud(width=800, height=450, background_color='white',
                   colormap='Blues' if 'Base' in title else 'Oranges',
                   max_words=80, prefer_horizontal=0.8, stopwords=stop_wc,
                   min_font_size=8).generate_from_frequencies(freq)
    ax.imshow(wc, interpolation='bilinear'); ax.axis('off')
    ax.set_title(title, fontsize=13, fontweight='bold', pad=10)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle("Word Clouds: Most Frequent Content Words in Responses (Qwen3-14B)",
             fontsize=14, fontweight='bold')
make_wc(base_stats['all_words'],  "Base Model",        axes[0])
make_wc(tuned_stats['all_words'], "Fine-Tuned Model",  axes[1])
plt.tight_layout()
plt.savefig(FIGURES_DIR / "05_wordclouds.png", bbox_inches='tight', dpi=150)
plt.show()

---
## Step 12: Visualization - Keyword Heatmap

Each panel is a heatmap for one model: rows are evaluation prompts, columns are the 20 Peterson keywords used most often across both models' responses, and each cell is the number of times that keyword appears in that response. Warmer colors indicate more keyword usage, so the fine-tuned model's panel should be noticeably warmer if domain adaptation has taken hold.

In [None]:
kw_set = set(k.lower() for k in PETERSON_KEYWORDS)

def per_prompt_kw_matrix(responses, keywords):
    mat = []
    for resp in responses:
        words = word_tokenize(resp.lower()) if resp.strip() else []
        words_alpha = [w for w in words if w.isalpha()]
        mat.append([words_alpha.count(kw) for kw in keywords])
    return np.array(mat)

all_kw_c = Counter()
for r in base_responses + tuned_responses:
    for w in word_tokenize(r.lower()):
        if w in kw_set:
            all_kw_c[w] += 1

top20 = [kw for kw, _ in all_kw_c.most_common(20)]

if top20:
    bm = per_prompt_kw_matrix(base_responses,  top20)
    tm = per_prompt_kw_matrix(tuned_responses, top20)

    fig, axes = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
    fig.suptitle("Peterson Keyword Usage per Prompt: Heatmap (Qwen3-14B)",
                 fontsize=14, fontweight='bold')
    p_labels = [f"Q{i+1}: {EVAL_PROMPTS[i][:28]}…" for i in range(len(EVAL_PROMPTS))]
    vmax = max(bm.max(), tm.max(), 1)

    for ax, mat, title in [(axes[0], bm, "Base Model"), (axes[1], tm, "Fine-Tuned Model")]:
        im = ax.imshow(mat, aspect='auto', cmap='YlOrRd', vmin=0, vmax=vmax)
        ax.set_xticks(range(len(top20))); ax.set_xticklabels(top20, rotation=45, ha='right', fontsize=9)
        ax.set_yticks(range(len(p_labels))); ax.set_yticklabels(p_labels, fontsize=8)
        ax.set_title(title, fontsize=12, fontweight='bold'); ax.set_xlabel("Peterson Keyword")
        for i in range(mat.shape[0]):
            for j in range(mat.shape[1]):
                if mat[i, j] > 0:
                    ax.text(j, i, str(mat[i, j]), ha='center', va='center', fontsize=7,
                            color='white' if mat[i, j] > vmax * 0.6 else 'black')
        plt.colorbar(im, ax=ax, label="Count")

    plt.tight_layout()
    plt.savefig(FIGURES_DIR / "06_keyword_heatmap.png", bbox_inches='tight', dpi=150)
    plt.show()
else:
    print("No Peterson keywords found; skipping heatmap.")

---
## Step 13: Side-by-Side Response Comparison

Quantitative metrics capture aggregate behavior, but it's equally important to read the actual responses. Here we display the first three prompts answered by both models.

In [None]:
for i in range(min(3, len(EVAL_PROMPTS))):
    print("━" * 90)
    print(f"PROMPT {i+1}: {EVAL_PROMPTS[i]}")
    print("━" * 90)
    print(f"\n🔵 BASE MODEL:")
    print(base_responses[i]  if base_responses[i].strip()  else "(empty response)")
    print(f"\n🟠 FINE-TUNED MODEL:")
    print(tuned_responses[i] if tuned_responses[i].strip() else "(empty response)")
    print()

---
## Step 14: Radar Chart Summary

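A radar chart lets us compare both models across all five metrics at once. Each metric is min-max normalized over just the two model averages and mapped to [0.1, 1.0], so on every axis the better model plots at 1.0 and the other at 0.1 (both at 0.5 if they tie). The chart therefore shows which model leads on each metric, not by how much; the summary table in Step 15 gives the magnitudes. A larger orange area means the fine-tuned model leads on more metrics.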

All metrics are oriented so that **larger = more Peterson-like**:
- Perplexity is inverted (lower perplexity → higher radar score)
- All other metrics are already in "higher = better" direction
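
One consequence of normalizing over only two values is worth seeing concretely. The sketch below applies the same min-max logic as the cell that follows (copied here under the illustrative name `norm_pair`) to made-up numbers. Whichever model is ahead lands at 1.0 and the other at 0.1, regardless of how large the gap is.

```python
def norm_pair(bv, tv, higher_better=True):
    """Min-max normalize a (base, tuned) pair into [0.1, 1.0]."""
    lo, hi = min(bv, tv), max(bv, tv)
    if abs(hi - lo) < 1e-9:
        return 0.5, 0.5          # tie: both models sit mid-axis
    bn = (bv - lo) / (hi - lo)
    tn = (tv - lo) / (hi - lo)
    if not higher_better:        # flip so that "better" is always larger
        bn, tn = 1 - bn, 1 - tn
    return 0.1 + 0.9 * bn, 0.1 + 0.9 * tn

# Illustrative perplexities only, not results from this notebook.
small_gain = norm_pair(20.0, 19.9, higher_better=False)
large_gain = norm_pair(20.0, 5.0, higher_better=False)
print("small gain:", small_gain)
print("large gain:", large_gain)
```

Both calls return the same pair of radar scores, which is why the radar chart is best read as a per-metric win/loss summary.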

In [None]:
def norm(bv, tv, higher_better=True):
    lo, hi = min(bv, tv), max(bv, tv)
    if abs(hi - lo) < 1e-9: return 0.5, 0.5
    bn = (bv - lo) / (hi - lo)
    tn = (tv - lo) / (hi - lo)
    if not higher_better: bn, tn = 1 - bn, 1 - tn
    return 0.1 + 0.9 * bn, 0.1 + 0.9 * tn

avg_b_ppl = _avg(base_perplexities);  avg_t_ppl = _avg(tuned_perplexities)
avg_b_sim = _avg(base_similarities);  avg_t_sim = _avg(tuned_similarities)
avg_b_kd  = _avg(base_stats['keyword_density']); avg_t_kd = _avg(tuned_stats['keyword_density'])
avg_b_ttr = _avg(base_stats['ttr_values']);       avg_t_ttr= _avg(tuned_stats['ttr_values'])
avg_b_len = _avg(base_stats['word_counts']);      avg_t_len= _avg(tuned_stats['word_counts'])

metrics = [
    ("Perplexity\n(inverted)",     *norm(avg_b_ppl, avg_t_ppl, higher_better=False)),
    ("TF-IDF\nSimilarity",         *norm(avg_b_sim, avg_t_sim, higher_better=True)),
    ("Keyword\nDensity",           *norm(avg_b_kd,  avg_t_kd,  higher_better=True)),
    ("Vocabulary\nRichness (TTR)", *norm(avg_b_ttr, avg_t_ttr, higher_better=True)),
    ("Response\nLength",           *norm(avg_b_len, avg_t_len, higher_better=True)),
]
labels  = [m[0] for m in metrics] + [metrics[0][0]]
base_v  = [m[1] for m in metrics] + [metrics[0][1]]
tuned_v = [m[2] for m in metrics] + [metrics[0][2]]
angles  = np.linspace(0, 2 * np.pi, len(labels), endpoint=True)

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
fig.suptitle("Radar Summary: How Peterson-Like is Each Qwen3-14B Model?",
             fontsize=15, fontweight='bold', y=1.01)

ax.plot(angles, base_v,  color=BASE_COLOR,  lw=2.5, label="Base Model")
ax.fill(angles, base_v,  color=BASE_COLOR,  alpha=0.15)
ax.plot(angles, tuned_v, color=TUNED_COLOR, lw=2.5, label="Fine-Tuned Model")
ax.fill(angles, tuned_v, color=TUNED_COLOR, alpha=0.15)

ax.set_xticks(angles[:-1]); ax.set_xticklabels(labels[:-1], size=11)
ax.set_ylim(0, 1)
ax.set_yticks([0.25, 0.5, 0.75, 1.0])
ax.set_yticklabels(['0.25', '0.5', '0.75', '1.0'], size=7, color='grey')
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.15), fontsize=11)

plt.tight_layout()
plt.savefig(FIGURES_DIR / "07_radar_summary.png", bbox_inches='tight', dpi=150)
plt.show()

---
## Step 15: Final Summary Table

In [None]:
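def pct_change(bv, tv, higher_better=True):
    """Format the relative change from base to tuned with direction and verdict marks."""
    if abs(bv) < 1e-9: return "N/A"
    pct = 100 * (tv - bv) / bv
    # The arrow shows the raw direction of change, independent of which direction is better.
    symbol = "▲" if pct > 0 else ("▼" if pct < 0 else "=")
    # An improvement is a non-zero change in the expected direction.
    ok     = "✓" if pct != 0 and (pct > 0) == higher_better else "✗"
    return f"{pct:+.1f}% {symbol} {ok}"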

rows = [
    {"Metric": "Avg Perplexity on Peterson text (↓ better)",
     "Direction": "↓", "Base": f"{avg_b_ppl:.2f}", "Fine-Tuned": f"{avg_t_ppl:.2f}",
     "Change": pct_change(avg_b_ppl, avg_t_ppl, higher_better=False)},
    {"Metric": "Avg TF-IDF Similarity to Peterson (↑ better)",
     "Direction": "↑", "Base": f"{avg_b_sim:.4f}", "Fine-Tuned": f"{avg_t_sim:.4f}",
     "Change": pct_change(avg_b_sim, avg_t_sim)},
    {"Metric": "Avg Keyword Density (↑ better)",
     "Direction": "↑", "Base": f"{100*avg_b_kd:.2f}%", "Fine-Tuned": f"{100*avg_t_kd:.2f}%",
     "Change": pct_change(avg_b_kd, avg_t_kd)},
    {"Metric": "Avg Type-Token Ratio (↑ better)",
     "Direction": "↑", "Base": f"{avg_b_ttr:.4f}", "Fine-Tuned": f"{avg_t_ttr:.4f}",
     "Change": pct_change(avg_b_ttr, avg_t_ttr)},
    {"Metric": "Avg Response Length (informational)",
     "Direction": "-", "Base": f"{avg_b_len:.1f} words", "Fine-Tuned": f"{avg_t_len:.1f} words",
     "Change": f"{avg_t_len - avg_b_len:+.1f} words"},
]
df = pd.DataFrame(rows)
print("=" * 95)
print("FINAL COMPARISON SUMMARY  -  Qwen3-14B  (Jordan Peterson Fine-Tuning)")
print("=" * 95)
print(df.to_string(index=False))
print("=" * 95)
print("\n✓ = improvement in expected direction  |  ✗ = no improvement  |  ▲/▼ = direction of change")

---
## Step 16: Saved Figures

In [None]:
print("Saved figures:")
for f in sorted(FIGURES_DIR.iterdir()):
    print(f"  {f.name}  ({f.stat().st_size / 1024:.0f} KB)")

---
## Conclusions

### Interpreting the Metrics for Qwen3-14B

**Perplexity** is the most direct indicator of domain adaptation. After fine-tuning on ~768,000 words of Peterson's writing, the model should assign higher probability to his actual sentences, producing lower perplexity. The magnitude of improvement reflects how thoroughly the 14B model has internalized his patterns in a single epoch.
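
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below computes it from a list of token log-probabilities. The function name and the probabilities are illustrative assumptions, not values from this notebook, and how the log-probabilities are obtained from the model is outside the scope of the sketch.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical probabilities assigned to the same four tokens by two models.
base_lp  = [math.log(p) for p in (0.05, 0.10, 0.02, 0.20)]
tuned_lp = [math.log(p) for p in (0.10, 0.20, 0.05, 0.25)]

print(f"base:  {perplexity(base_lp):.2f}")
print(f"tuned: {perplexity(tuned_lp):.2f}")
```

A model that assigned probability 0.25 to every token would have a perplexity of exactly 4, which is one way to read the number: the model is, on average, as uncertain as if it were choosing uniformly among that many options.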

**TF-IDF Similarity** tests whether the model's responses use Peterson's *distinctive* vocabulary: not just common words, but the specific terms (chaos, logos, archetype, sovereignty) that characterize his writing across all four books.
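
To make the metric concrete, here is a small standard-library sketch of TF-IDF cosine similarity. It is an illustration under stated assumptions, not the notebook's implementation: the function names, the whitespace tokenizer, the smoothed IDF formula, and the toy sentences are all choices made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: a smoothed IDF, so rarer terms receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy texts, invented for illustration.
reference = "chaos and order are the archetypal elements of experience"
generic   = "balance work and rest to stay healthy and productive"
stylistic = "confront chaos voluntarily and order will emerge from experience"

ref_vec, gen_vec, sty_vec = tfidf_vectors([reference, generic, stylistic])
print(f"generic vs reference:   {cosine(ref_vec, gen_vec):.3f}")
print(f"stylistic vs reference: {cosine(ref_vec, sty_vec):.3f}")
```

The generic sentence overlaps with the reference only on a common word, while the stylistic one shares several rarer terms, so it scores higher.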

**Keyword Density** is the simplest check: does the model use Peterson's signature words when answering his central questions?
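
A minimal sketch of the density calculation follows, assuming density is defined as keyword tokens divided by total word tokens. The keyword set is a short stand-in for `PETERSON_KEYWORDS`, and `keyword_density` is a hypothetical helper rather than the function used earlier in this notebook.

```python
import re

# Stand-in keyword set for illustration; the notebook uses PETERSON_KEYWORDS.
SAMPLE_KEYWORDS = {"chaos", "order", "meaning", "responsibility", "suffering"}

def keyword_density(text, keywords):
    """Fraction of alphabetic word tokens that are in the keyword set."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    return hits / len(words)

# Toy sentence, invented for illustration.
text = "Meaning emerges when you take responsibility and confront chaos."
density = keyword_density(text, SAMPLE_KEYWORDS)
print(f"{100 * density:.1f}% of words are keywords")
```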

**Type-Token Ratio** captures the richness of the model's vocabulary, a property Peterson is known for in his elaborate, multi-layered prose.

### Qwen3-14B vs. GPT-OSS 20B: Training Comparison

| Metric | GPT-OSS 20B | Qwen3-14B |
|--------|-------------|-----------|
| Training loss | 3.01 | **2.44** |
| Training time | 73.3 min | **23.3 min** |
| Batch size | 1 | **2** |
| Optimizer steps | 641 | **321** |

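The Qwen3-14B model reached a **lower training loss** despite being smaller and training faster. Because the two models use different tokenizers and chat templates, per-token losses are not directly comparable between them, and a lower training loss does not necessarily mean better inference quality. The base-vs-fine-tuned perplexity and response metrics above are the more reliable guide.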

### Limitations

- **One epoch**: More epochs would likely deepen style adoption. Three epochs typically produce noticeably more Peterson-like outputs.
- **Greedy decoding**: Generation is deterministic for a fair comparison. Sampling (`temperature=0.7`, `top_p=0.8`), as recommended for Qwen3 in non-thinking mode, would likely produce richer responses.
- **Thinking mode not tested**: We use `enable_thinking=False` throughout for a clean apples-to-apples comparison. Enabling thinking at inference might produce substantively different (and potentially better) outputs for the fine-tuned model.