# Qwen3-32B Jordan Peterson Fine-Tuning
## Maximum Quality on a 24 GB GPU — Conservative VRAM Strategy

This notebook fine-tunes `unsloth/Qwen3-32B-unsloth-bnb-4bit` on Jordan B.
Peterson's four books using the **synthetic Q&A methodology** developed in the
Qwen3-14B V2 notebook.

The 32B model is the largest dense Qwen3 variant, and it can plausibly fit on a
single 24 GB GPU.  It offers the highest quality ceiling available on this
hardware: more parameters mean more capacity to encode Peterson's
distinctive vocabulary, rhetorical structure, and conceptual framing.

**Prerequisite**: This notebook uses the same Q&A dataset as the 14B V2
notebook.  If you have already run that notebook, the dataset cache at
`qa_dataset/peterson_qa.jsonl` will be detected and reused automatically —
no additional API calls needed.

---
## Why Qwen3-32B? The Case for a Larger Dense Model

### More Parameters = More Stylistic Capacity

A language model's LoRA adapter can only encode stylistic patterns in a
compressed low-rank subspace of the full weight space.  With r=32, we give
the adapter 32 independent directions per attention projection to capture
Peterson's style.  But those directions are projected into the *full model
weight space*, whose size determines how expressively they can be represented.

Qwen3-32B has the same hidden dimension as Qwen3-14B (~5,120); the two models
differ in depth and in the number of attention heads.  More layers and more
attention heads mean more places where the LoRA update can reinforce
Peterson's vocabulary choices and rhetorical patterns.

### Why Dense > MoE for Style Fine-Tuning

The other 30B-class model available is `Qwen3-30B-A3B` — a Mixture of Experts
(MoE) model with 30B total parameters but only **3B active per forward pass**.
While MoE models excel at knowledge-intensive tasks, they have a structural
disadvantage for stylistic fine-tuning:

- Each token is routed to a different subset of experts
- Style is a *global* property: Peterson's voice should be consistent across
  every token prediction, not just those that happen to route to "Peterson-experts"
- Fine-tuning MoE with LoRA requires teaching all expert groups consistently,
  which needs more training data and more epochs to converge

**The Qwen3-32B dense model is the right choice**: every token prediction passes
through the same weights, so every training example uniformly teaches the style.

### The VRAM Challenge

The 32B model is the most capable model that can *potentially* fit on a 24 GB
GPU.  It requires careful settings.  The next section explains exactly how.

---
## VRAM Strategy: Fitting 32B into 24 GB

### The Math

Our measured baseline from the Qwen3-14B V2 run:

| Component | Qwen3-14B V2 | Qwen3-32B (estimate) |
|-----------|-------------|---------------------|
| 4-bit model weights | ~7.7 GB | ~17.6 GB |
| Training overhead* | ~5.7 GB | ~3.0 GB (after reductions) |
| **Total peak** | **13.4 GB** (measured) | **~20–22 GB** (estimated) |

*Training overhead = activations + KV cache + LoRA adapter + 8-bit optimizer states

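The activation share of this overhead scales with `batch_size × max_seq_length`,
so halving both cuts per-layer activation memory to roughly a quarter.  The 32B
estimate stays at ~3.0 GB rather than a quarter of 5.7 GB because the deeper
model stores activations for more layers and carries a larger adapter and
optimizer state.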
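
As a quick sanity check on the arithmetic in the table, the sketch below adds the weight and overhead estimates and reports the remaining headroom. The helper names are hypothetical, and the inputs are the estimates from the table, not new measurements.

```python
# Rough VRAM budget check using the estimates from the table above.
GPU_VRAM_GB = 24.0  # nominal capacity of the card


def peak_vram_gb(weights_gb: float, overhead_gb: float) -> float:
    """Estimated peak = quantized weights + training overhead."""
    return weights_gb + overhead_gb


def headroom_gb(weights_gb: float, overhead_gb: float,
                capacity_gb: float = GPU_VRAM_GB) -> float:
    """Capacity left over after the estimated peak."""
    return capacity_gb - peak_vram_gb(weights_gb, overhead_gb)


budget_14b = peak_vram_gb(7.7, 5.7)    # measured components from the 14B V2 run
budget_32b = peak_vram_gb(17.6, 3.0)   # estimated components for 32B
headroom_32b = headroom_gb(17.6, 3.0)

print(f"14B V2 : {budget_14b:.1f} GB peak")
print(f"32B    : {budget_32b:.1f} GB peak, {headroom_32b:.1f} GB headroom")
```

The 32B figure lands at the low end of the ~20–22 GB range, so the headroom here should be read as a best case.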

### Four Levers We Pull

| Setting | V1 14B | V2 14B | **32B** | Saving vs 14B V2 |
|---------|--------|--------|---------|-----------------|
| `per_device_train_batch_size` | 2 | 2 | **1** | ~2.5 GB |
| `max_seq_length` | 2048 | 2048 | **1024** | ~1.5 GB |
| `gradient_accumulation_steps` | 4 | 4 | **8** | compensates for batch=1 |
| `gradient_checkpointing` | unsloth | unsloth | unsloth | already maximised |

**Effective batch size stays at 8**: `batch_size=1 × grad_accum=8 = 8`.
Each optimizer update still averages over 8 examples, as in V2; the work is
just spread over more forward-backward passes.
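
The equivalence can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean-squared-error loss (an illustrative stand-in, not the notebook's model) and assumes the loss of each micro-batch is divided by the number of accumulation steps, which is the usual convention for mean-reduced losses.

```python
# Toy check: accumulating 8 micro-batches of size 1 gives the same gradient
# as one batch of 8 when the loss is mean-reduced.
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ys = [1.0, 0.0, 3.0, 2.0, -1.0, 5.0, 0.5, -3.0]
w = 0.7

# batch_size=8, grad_accum=1
full_batch = grad_mse(w, xs, ys)

# batch_size=1, grad_accum=8: scale each micro-batch gradient by 1/8 and sum
accum_steps = 8
accumulated = 0.0
for x, y in zip(xs, ys):
    accumulated += grad_mse(w, [x], [y]) / accum_steps

print(abs(full_batch - accumulated) < 1e-12)
```

With variable-length sequences the per-token weighting can differ slightly between the two setups, so this is the idealized case.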

### What Happens if It Still OOMs

Despite these reductions, 24 GB is a hard limit.  If CUDA raises an
out-of-memory error, this notebook will catch it and print an explicit
fallback plan rather than dying silently.  The training cell is wrapped in a
`try/except` so you get actionable guidance immediately.
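
The pattern can be sketched in isolation as follows. `is_oom_error` and `run_with_oom_guidance` are hypothetical names used only for illustration; the sketch matches on the error message, on the assumption that CUDA out-of-memory errors surface as a `RuntimeError` whose text contains "out of memory".

```python
def is_oom_error(exc: BaseException) -> bool:
    """Heuristic: treat an exception as OOM only if its message says so."""
    return "out of memory" in str(exc).lower()


def run_with_oom_guidance(train_fn, fallback_message: str):
    """Run train_fn(); print guidance on OOM, re-raise anything else."""
    try:
        return train_fn()
    except RuntimeError as exc:
        if not is_oom_error(exc):
            raise                      # unrelated errors keep their traceback
        print(fallback_message)
        return None


def _fake_oom():
    # Stand-in for a training call that runs out of memory
    raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")


result = run_with_oom_guidance(_fake_oom, "Fallback: reduce max_seq_length to 768.")
```

Matching only on "out of memory" keeps other CUDA failures, such as device-side asserts, from being reported as memory problems.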

### Sequence Length Adequacy

Our Q&A pairs are:
- Question: ~20 words → ~30 tokens
- Answer (Peterson passage): ~350 words → ~500 tokens
- System prompt + ChatML tokens: ~150 tokens
- **Total per example: ~680 tokens**

`max_seq_length=1024` should cover 99%+ of examples, with roughly 340 tokens of
headroom over the typical example.  Only the very longest passages (the
statistical tail) will be truncated.  Truncation removes the end of the
passage, so the opening and middle of each answer are still seen in full.
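
The budget above can be reproduced with a few lines of arithmetic. The sketch assumes ~1.4 tokens per English word (the same heuristic the dataset statistics cell uses) and the ~150-token template overhead listed above; real counts depend on the tokenizer.

```python
TOKENS_PER_WORD = 1.4   # rough heuristic for English prose (assumption)
TEMPLATE_TOKENS = 150   # system prompt + ChatML markers, from the list above
MAX_SEQ_LEN = 1024


def estimate_tokens(question_words: int, answer_words: int) -> int:
    """Estimated tokens for one formatted Q&A example."""
    return (round(question_words * TOKENS_PER_WORD)
            + round(answer_words * TOKENS_PER_WORD)
            + TEMPLATE_TOKENS)


typical = estimate_tokens(20, 350)
headroom = MAX_SEQ_LEN - typical

# Longest answer (in words) that still fits next to a 20-word question
answer_budget = MAX_SEQ_LEN - TEMPLATE_TOKENS - round(20 * TOKENS_PER_WORD)
max_answer_words = int(answer_budget / TOKENS_PER_WORD)

print(f"typical example  : ~{typical} tokens")
print(f"headroom         : ~{headroom} tokens")
print(f"max answer length: ~{max_answer_words} words")
```

The typical estimate comes out slightly under the ~680 tokens quoted above because the list rounds each component up.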

---
# Part 1: Q&A Dataset

This section is identical to the Qwen3-14B V2 notebook.  The Q&A cache at
`qa_dataset/peterson_qa.jsonl` is **shared** between the 14B and 32B notebooks:
if you have already run V2, this entire section will complete in seconds
by loading from cache.  If not, it will call the Claude Haiku API and generate
the dataset (roughly $1 at the ~$0.00068-per-passage estimate used in the
generation cell, ~15–20 minutes).

In [None]:
import subprocess, sys
_result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "anthropic", "-q"],
    capture_output=True, text=True,
)
print("anthropic SDK ready." if _result.returncode == 0 else _result.stderr[:300])

import os, re, json, time, math
from pathlib import Path
from collections import Counter

import fitz
import anthropic

print(f"anthropic : {anthropic.__version__}")

# ── API key ────────────────────────────────────────────────────────────────
_api_key = os.environ.get("ANTHROPIC_API_KEY", "")
if not _api_key:
    _env_file = Path.home() / ".env"
    if _env_file.exists():
        for _line in _env_file.read_text().splitlines():
            if _line.startswith("ANTHROPIC_API_KEY="):
                _api_key = _line.split("=", 1)[1].strip().strip('"').strip("'")
                break
if not _api_key:
    raise EnvironmentError(
        "ANTHROPIC_API_KEY not found.\n"
        "Set it with:  export ANTHROPIC_API_KEY='sk-ant-...'\n"
        "Or add it to ~/.env as:  ANTHROPIC_API_KEY=sk-ant-..."
    )
print("API key found.")

BOOKS_DIR        = Path("../../Books/JordanPeterson")
QA_DIR           = Path("./qa_dataset")
QA_CACHE         = QA_DIR / "peterson_qa.jsonl"   # shared with 14B V2 notebook
QA_DIR.mkdir(exist_ok=True)
GENERATION_MODEL = "claude-haiku-4-5-20251001"

if QA_CACHE.exists():
    with open(QA_CACHE) as f:
        _n = sum(1 for line in f if line.strip())
    print(f"\nQ&A cache found: {_n:,} records — generation will be skipped.")
else:
    print("\nQ&A cache not found — will generate from PDFs.")

In [None]:
def clean_text(raw: str) -> str:
    """Strip PDF artefacts: control chars, ligatures, excess whitespace."""
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', raw)
    text = re.sub(r'\s+', ' ', text)
    return text.replace('\ufb01', 'fi').replace('\ufb02', 'fl').strip()


def extract_chunks(pdf_path: Path, chunk_words: int = 350,
                   overlap_words: int = 50) -> list[str]:
    """
    Extract clean text from a PDF and split into overlapping word-level chunks.

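    chunk_words=350 produces passages that fit well within max_seq_length=1024
    after applying the ChatML system+user template overhead (~150 tokens).
    overlap_words=50 repeats the last 50 words of each chunk at the start of
    the next one, so a sentence cut at a chunk boundary usually appears whole
    in at least one chunk.  It does not align chunks to sentence boundaries.
    Chunks shorter than 100 words (typically the final remainder) are dropped.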
    """
    doc   = fitz.open(str(pdf_path))
    pages = [clean_text(page.get_text()) for page in doc]
    doc.close()
    words  = ' '.join(pages).split()
    step   = chunk_words - overlap_words
    return [
        ' '.join(words[s : s + chunk_words])
        for s in range(0, len(words), step)
        if len(words[s : s + chunk_words]) >= 100
    ]


pdf_files  = sorted(BOOKS_DIR.glob("*.pdf"))
all_chunks = []
chunk_meta = []

print(f"Extracting from {len(pdf_files)} PDFs…")
for pdf in pdf_files:
    fname = pdf.name.lower()
    if   "maps"    in fname: label = "Maps of Meaning"
    elif "12 rules" in fname or "antidote" in fname: label = "12 Rules for Life"
    elif "beyond"  in fname: label = "Beyond Order"
    else:                    label = "We Who Wrestle with God"

    chunks = extract_chunks(pdf)
    all_chunks.extend(chunks)
    chunk_meta.extend([{"book": label}] * len(chunks))
    print(f"  {label:<35}  {len(chunks):4d} chunks")

print(f"\nTotal: {len(all_chunks):,} passages")

In [None]:
client = anthropic.Anthropic(api_key=_api_key)


def generate_questions(passage: str, book: str, max_retries: int = 3) -> list[str]:
    """
    Call Claude Haiku to generate 2 questions answered by this passage.

    Returns only the questions (not answers) — the passage itself is used as
    the answer in the training dataset.  This keeps output tokens minimal
    and cost low (~$0.00068 per passage, ~$0.82 for all 1,200 passages).

    Retries with exponential backoff on API errors or malformed JSON.
    """
    prompt = (
        f"You are building a training dataset for a Jordan B. Peterson AI model.\n\n"
        f"The passage below is from Peterson's book '{book}'.  Generate exactly 2 questions that:\n"
        f"1. This passage directly and substantively answers\n"
        f"2. Someone interested in Peterson's ideas might genuinely ask\n"
        f"3. Cover different angles of the passage (e.g. one concrete, one philosophical)\n\n"
        f"Peterson's topics: order vs chaos, meaning, personal responsibility, suffering, "
        f"mythology, archetypes, shadow, logos, truth, religion, Jungian psychology, "
        f"hierarchy, heroism, sacrifice, being.\n\n"
        f"Return ONLY a JSON array of exactly 2 question strings.  No other text.\n"
        f"Example: [\"Why is confronting chaos necessary for meaning?\", "
        f"\"What role does suffering play in personal development?\"]\n\n"
        f"Passage:\n{passage}"
    )
    for attempt in range(max_retries):
        try:
            resp  = client.messages.create(
                model=GENERATION_MODEL, max_tokens=150,
                messages=[{"role": "user", "content": prompt}],
            )
            raw   = resp.content[0].text.strip()
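            # Greedy match: a question may itself contain ']'
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            if not match:
                raise ValueError(f"No JSON array in: {raw[:80]}")
            qs = json.loads(match.group())
            # The resume logic in the next cell assumes 2 questions per passage
            if not isinstance(qs, list) or len(qs) < 2:
                raise ValueError(f"Expected 2 questions, got: {raw[:80]}")
            return [str(q).strip() for q in qs[:2]]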
        except Exception as e:
            wait = 2 ** attempt
            if attempt < max_retries - 1:
                print(f"    [retry {attempt+1}] {e} — wait {wait}s")
                time.sleep(wait)
            else:
                print(f"    [FAILED] passage skipped")
                return []
    return []


print("generate_questions() defined.")

In [None]:
# ── Check existing cache ───────────────────────────────────────────────────
existing = []
if QA_CACHE.exists():
    with open(QA_CACHE) as f:
        existing = [json.loads(l) for l in f if l.strip()]

_expected = len(all_chunks) * 2
_coverage = len(existing) / _expected if _expected else 0
print(f"Cache coverage: {len(existing):,} / {_expected:,}  ({100*_coverage:.1f}%)")

if _coverage >= 0.90:
    print("Cache ≥90% complete — skipping generation.")
    print(f"  (Delete {QA_CACHE} to force regeneration)")
else:
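    # Resume point: assumes 2 records per passage and no skipped passages.
    # Each skipped passage moves this offset back by one, so a resumed run
    # may regenerate (and duplicate) a few passages near the resume point.
    already_done   = len(existing) // 2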
    remaining      = all_chunks[already_done:]
    remaining_meta = chunk_meta[already_done:]
    print(f"Generating {len(remaining):,} passages (from #{already_done+1})…")
    print(f"Estimated cost: ~${len(remaining)*0.00068:.2f}\n")

    skipped = 0
    with open(QA_CACHE, "a") as out_f:
        for i, (passage, meta) in enumerate(zip(remaining, remaining_meta)):
            if i % 50 == 0:
                pct = 100 * (already_done + i) / len(all_chunks)
                print(f"  [{already_done+i+1:4d}/{len(all_chunks):4d}]  "
                      f"{pct:5.1f}%  {meta['book']}")
            questions = generate_questions(passage, meta["book"])
            if not questions:
                skipped += 1
                continue
            for q in questions:
                out_f.write(json.dumps({"question": q, "answer": passage,
                                        "book": meta["book"]}) + "\n")
            out_f.flush()
            time.sleep(0.3)
    print(f"\nDone.  Processed: {len(remaining)-skipped:,}  Skipped: {skipped}")

with open(QA_CACHE) as f:
    qa_records = [json.loads(l) for l in f if l.strip()]
print(f"Total Q&A pairs: {len(qa_records):,}")

In [None]:
book_counts = Counter(r["book"] for r in qa_records)
print("Q&A pairs by book:")
for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
    print(f"  {book:<35}  {count:4d}  ({100*count/len(qa_records):.1f}%)")

ans_len = [len(r["answer"].split()) for r in qa_records]
print(f"\nAnswer length: min={min(ans_len)}  max={max(ans_len)}  "
      f"mean={sum(ans_len)/len(ans_len):.0f} words")

# Token budget sanity check
# ~1.4 tokens/word × mean answer + ~180 tokens overhead < 1024 limit
mean_tokens_est = sum(ans_len) / len(ans_len) * 1.4 + 180
print(f"Estimated mean tokens per example: ~{mean_tokens_est:.0f}  "
      f"(limit: 1024 — {'OK' if mean_tokens_est < 900 else 'WARNING: close to limit'})")

---
# Part 2: Fine-Tuning Qwen3-32B

## Configuration

| Parameter | Qwen3-14B V2 | **Qwen3-32B** | Rationale |
|-----------|-------------|--------------|-----------|
| Base model | `Qwen3-14B-unsloth-bnb-4bit` | **`Qwen3-32B-unsloth-bnb-4bit`** | ~2.3× the parameters |
| Training data | Synthetic Q&A | Same | Shared cache |
| Epochs | 3 | **3** | Same generalisation target |
| LoRA rank | r=32 | **r=32** | Same adapter capacity ratio |
| LoRA alpha | 32 | **32** | alpha/rank = 1.0 |
| Batch size | 2 | **1** | Primary VRAM reduction |
| Gradient accumulation | 4 | **8** | Keeps effective batch = 8 |
| Max seq length | 2048 | **1024** | Secondary VRAM reduction |
| Warmup steps | 30 | **30** | Same absolute warmup |
| LR | 2e-4 | **2e-4** | Standard Qwen3 LoRA |
| Estimated peak VRAM | ~13.4 GB | **~20–22 GB** | Right up to the 24 GB limit |
| Output dir | `qwen3_14b_peterson_v2_lora` | **`qwen3_32b_peterson_lora`** | Separate adapter |

## Risk Acknowledgement

This notebook will attempt training.  If the GPU runs out of memory, a
structured error handler will print an explicit fallback plan rather than
a raw CUDA traceback.

In [None]:
import os, re, gc, math, json
from pathlib import Path

import torch
from datasets import Dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import train_on_responses_only

# ── Paths ──────────────────────────────────────────────────────────────────
QA_CACHE   = Path("./qa_dataset/peterson_qa.jsonl")
OUTPUT_DIR = Path("./outputs/qwen3_32b_peterson_lora")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Model ──────────────────────────────────────────────────────────────────
BASE_MODEL  = "unsloth/Qwen3-32B-unsloth-bnb-4bit"
MAX_SEQ_LEN = 1024    # halved from 2048; secondary VRAM lever after batch size

# ── LoRA ──────────────────────────────────────────────────────────────────
LORA_RANK  = 32       # same as 14B V2; adapter capacity ratio is identical
LORA_ALPHA = 32       # alpha/rank = 1.0

# ── Training ─────────────────────────────────────────────────────────────
BATCH_SIZE    = 1     # halved — primary VRAM saving
GRAD_ACCUM    = 8     # doubled — effective batch stays at 8
NUM_EPOCHS    = 3
LEARNING_RATE = 2e-4
WARMUP_STEPS  = 30
WEIGHT_DECAY  = 0.01

SYSTEM_PROMPT = (
    "You are an AI assistant that has been trained on the complete works of "
    "Jordan B. Peterson, a Canadian clinical psychologist, professor, and author. "
    "You speak with deep knowledge of psychology, philosophy, mythology, religion, "
    "and personal responsibility.  Your responses reflect Peterson's writing style, "
    "intellectual depth, and interdisciplinary approach to understanding human "
    "nature and meaning."
)

# ── Hardware summary ───────────────────────────────────────────────────────
total_vram = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU      : {torch.cuda.get_device_name(0)}")
print(f"VRAM     : {total_vram:.1f} GB total")
print()
if total_vram < 23.5:
    print(f"WARNING: This GPU has {total_vram:.1f} GB VRAM.")
    print("Qwen3-32B requires ~20-22 GB during training.")
    print("Proceeding, but OOM is possible — see the training cell for fallback.")
    print()

print("Configuration:")
print(f"  Model            : {BASE_MODEL}")
print(f"  max_seq_length   : {MAX_SEQ_LEN}  (halved vs 14B V2 to save VRAM)")
print(f"  batch_size       : {BATCH_SIZE}   (halved vs 14B V2)")
print(f"  grad_accum       : {GRAD_ACCUM}   (doubled — effective batch = {BATCH_SIZE*GRAD_ACCUM})")
print(f"  epochs           : {NUM_EPOCHS}")
print(f"  LoRA             : r={LORA_RANK}, alpha={LORA_ALPHA}")
print(f"  Output           : {OUTPUT_DIR.resolve()}")

In [None]:
if not QA_CACHE.exists():
    raise FileNotFoundError(
        f"Q&A cache not found at {QA_CACHE}.\nRun Part 1 first."
    )

with open(QA_CACHE) as f:
    records = [json.loads(l) for l in f if l.strip()]

raw_dataset = Dataset.from_list([
    {"question": r["question"], "answer": r["answer"]}
    for r in records
])

print(f"Dataset: {len(raw_dataset):,} Q&A pairs  |  columns: {raw_dataset.column_names}")

---
## Step 3: Load Qwen3-32B

This is the most memory-intensive step.  The 4-bit quantized weights alone
occupy ~17.6 GB.  We monitor VRAM before and after loading so we know exactly
how much headroom remains for training.

In [None]:
# ── Pre-load VRAM check ────────────────────────────────────────────────────
# Clear any stale allocations from previous runs in this kernel session
gc.collect()
torch.cuda.empty_cache()
# mem_get_info() reports free bytes on the device, including memory held by
# other processes, which total_memory - memory_reserved() would miss.
free_before = torch.cuda.mem_get_info()[0] / 1e9
print(f"Free VRAM before model load : {free_before:.1f} GB")
if free_before < 18.0:
    print("WARNING: less than 18 GB free; the ~17.6 GB of weights may not fit.")
    print("Restart the kernel to free all allocations and try again.")

print(f"\nLoading {BASE_MODEL} …")
print("(This downloads ~18 GB on first run — may take several minutes.)")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = BASE_MODEL,
    dtype           = None,          # auto: bfloat16 on Ampere+
    max_seq_length  = MAX_SEQ_LEN,   # 1024 — critical for VRAM management
    load_in_4bit    = True,          # 4-bit quantization via bitsandbytes
    full_finetuning = False,
)

vram_after = torch.cuda.memory_reserved() / 1e9
vram_free  = total_vram - vram_after
print(f"\nModel loaded.")
print(f"  VRAM reserved : {vram_after:.1f} GB")
print(f"  VRAM free     : {vram_free:.1f} GB  (training needs ~3-5 GB more)")
if vram_free < 3.0:
    print("  WARNING: very little headroom — training will likely OOM.")
    print("  Consider reducing max_seq_length to 768 before proceeding.")

---
## Step 4: Add LoRA Adapters

Same configuration as the 14B V2 notebook: r=32, alpha=32, all attention and
MLP projections targeted.  Each adapted projection adds `r × (d_in + d_out)`
parameters.  Qwen3-32B has the same hidden dimension as the 14B model (~5,120)
but more layers and attention heads, so the same rank yields a larger adapter
with more places to encode style.
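
To make the adapter size concrete, the sketch below counts LoRA parameters from layer shapes, using `r × (d_in + d_out)` per adapted projection. The dimensions passed in are assumptions taken from the published Qwen3 model configs; check `config.json` of the checkpoint you actually load. The function name is hypothetical.

```python
def lora_params(r, hidden, q_out, kv_out, intermediate, n_layers):
    """Trainable LoRA parameters when all seven projections are targeted.

    Each adapted linear layer (d_in -> d_out) gets A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) parameters.
    """
    shapes = [
        (hidden, q_out),         # q_proj
        (hidden, kv_out),        # k_proj
        (hidden, kv_out),        # v_proj
        (q_out, hidden),         # o_proj
        (hidden, intermediate),  # gate_proj
        (hidden, intermediate),  # up_proj
        (intermediate, hidden),  # down_proj
    ]
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers


# Assumed dimensions: (r, hidden, q_out, kv_out, intermediate, n_layers)
qwen3_14b = lora_params(32, 5120, 5120, 1024, 17408, 40)
qwen3_32b = lora_params(32, 5120, 8192, 1024, 25600, 64)

print(f"14B adapter: {qwen3_14b:,} parameters")
print(f"32B adapter: {qwen3_32b:,} parameters  ({qwen3_32b / qwen3_14b:.2f}x)")
```

Under these assumptions the 32B adapter is roughly twice the size of the 14B one at the same rank. The count printed by the next cell is the authoritative figure.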

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r                          = LORA_RANK,
    lora_alpha                 = LORA_ALPHA,
    lora_dropout               = 0,
    target_modules             = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing = "unsloth",   # saves ~30% VRAM vs PyTorch default
    random_state               = 42,
    use_rslora                 = False,
    loftq_config               = None,
)

total_p     = sum(p.numel() for p in model.parameters())
trainable_p = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters    : {total_p:,}")
print(f"Trainable (LoRA)    : {trainable_p:,}  ({100*trainable_p/total_p:.4f}%)")
print(f"VRAM after LoRA     : {torch.cuda.memory_reserved()/1e9:.1f} GB")

---
## Steps 5–6: Format Dataset and Apply Response Mask

In [None]:
def format_example(batch):
    """
    Convert Q&A pairs to Qwen3 ChatML format.

    The training signal is entirely in the assistant (answer) tokens.
    The system prompt and user question are masked out by train_on_responses_only
    so the model only learns what to say, not the structural tokens around it.
    """
    texts = []
    for q, a in zip(batch["question"], batch["answer"]):
        conv = [
            {"role": "system",    "content": SYSTEM_PROMPT},
            {"role": "user",      "content": q},
            {"role": "assistant", "content": a},
        ]
        texts.append(tokenizer.apply_chat_template(
            conv, tokenize=False, add_generation_prompt=False,
            enable_thinking=False,
        ))
    return {"text": texts}


dataset = raw_dataset.map(
    format_example, batched=True, batch_size=256,
    remove_columns=raw_dataset.column_names,
)
print(f"Formatted: {len(dataset):,} examples")

# Verify token budget: count how many examples fit within max_seq_length
token_counts = [len(ids) for ids in tokenizer(list(dataset["text"]))["input_ids"]]
n_fit = sum(c <= MAX_SEQ_LEN for c in token_counts)
print(f"Token counts        : max={max(token_counts)}  (limit: {MAX_SEQ_LEN})")
print(f"Within limit        : {n_fit:,} / {len(token_counts):,} examples")
if n_fit < len(token_counts):
    print("WARNING: examples longer than max_seq_length will be truncated.")

# ── Detect response boundary tokens ────────────────────────────────────────
_sample = dataset[0]["text"]
if "<|im_start|>assistant\n" in _sample:
    instruction_part = "<|im_start|>user\n"
    response_part    = "<|im_start|>assistant\n"
else:
    raise RuntimeError("Could not detect Qwen3 ChatML tokens in formatted data.")
print(f"\nResponse boundary   : {repr(response_part)}")

---
## Step 7: Configure SFTTrainer

The key difference from the 14B V2 notebook is `per_device_train_batch_size=1`
and `gradient_accumulation_steps=8`.  The optimizer sees the same quality of
gradient signal (effective batch = 8) but processes one example at a time,
which halves the peak activation memory during each forward-backward pass.
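
The same cell also wraps the trainer with `train_on_responses_only`. As a rough picture of what response masking does, the toy sketch below sets the label of every token outside an assistant turn to `-100`, the value PyTorch's cross-entropy loss ignores by default. It works on string tokens with made-up markers and is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_non_response(tokens, response_marker, turn_end):
    """Return labels that keep only tokens inside assistant turns.

    tokens          : list of string tokens (a stand-in for token ids)
    response_marker : token that opens an assistant turn
    turn_end        : token that closes any turn
    """
    labels, in_response = [], False
    for tok in tokens:
        if tok == response_marker:
            labels.append(IGNORE_INDEX)   # the marker itself is not a target
            in_response = True
        elif in_response:
            labels.append(tok)            # answer tokens and the closing tag
            if tok == turn_end:
                in_response = False
        else:
            labels.append(IGNORE_INDEX)   # system and user tokens are masked
    return labels


# Hypothetical markers standing in for the ChatML boundary tokens
tokens = ["<sys>", "You", "are", "<end>",
          "<user>", "Why", "order", "?", "<end>",
          "<assistant>", "Because", "chaos", "<end>"]
labels = mask_non_response(tokens, "<assistant>", "<end>")
print(labels)
```

Only the answer tokens contribute to the loss, which is why the system prompt and question can be long without diluting the style signal.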

In [None]:
from trl import SFTTrainer, SFTConfig

sft_trainer = SFTTrainer(
    model         = model,
    tokenizer     = tokenizer,
    train_dataset = dataset,
    args          = SFTConfig(
        dataset_text_field          = "text",
        max_seq_length              = MAX_SEQ_LEN,
        dataset_num_proc            = 2,

        per_device_train_batch_size = BATCH_SIZE,   # 1 — primary VRAM saving
        gradient_accumulation_steps = GRAD_ACCUM,   # 8 — effective batch = 8
        num_train_epochs            = NUM_EPOCHS,   # 3

        learning_rate               = LEARNING_RATE,
        warmup_steps                = WARMUP_STEPS,
        lr_scheduler_type           = "cosine",
        weight_decay                = WEIGHT_DECAY,

        optim                       = "adamw_8bit",
        fp16                        = not torch.cuda.is_bf16_supported(),
        bf16                        = torch.cuda.is_bf16_supported(),

        logging_steps               = 25,
        save_strategy               = "epoch",
        output_dir                  = str(OUTPUT_DIR),
        report_to                   = "none",
        seed                        = 42,
        packing                     = False,
    ),
)

sft_trainer = train_on_responses_only(
    sft_trainer,
    instruction_part = instruction_part,
    response_part    = response_part,
)

# ── Estimate total optimizer steps ─────────────────────────────────────────
_micro_batches = math.ceil(len(dataset) / BATCH_SIZE)
_steps = math.ceil(_micro_batches / GRAD_ACCUM) * NUM_EPOCHS
print(f"Estimated gradient updates : {_steps:,}")
print(f"  = ceil({len(dataset):,} examples / (batch {BATCH_SIZE} × accum {GRAD_ACCUM})) × {NUM_EPOCHS} epochs")
print()
print("Ready to train.  Current VRAM state:")
print(f"  Reserved : {torch.cuda.memory_reserved()/1e9:.1f} GB")
print(f"  Free     : {(total_vram - torch.cuda.memory_reserved()/1e9):.1f} GB")

---
## Step 8: Train

Training is wrapped in a structured error handler.  If CUDA runs out of memory
the handler prints an explicit fallback plan — no raw traceback.

In [None]:
import time

torch.cuda.reset_peak_memory_stats()
print(f"VRAM before training  : {torch.cuda.memory_reserved()/1e9:.1f} GB")
print(f"Starting — {NUM_EPOCHS} epochs, ~{_steps:,} gradient updates…")
print()

t0 = time.time()

try:
    train_result = sft_trainer.train()
    elapsed_min  = (time.time() - t0) / 60
    vram_peak    = torch.cuda.max_memory_reserved() / 1e9

    print()
    print("═" * 65)
    print("TRAINING COMPLETE")
    print("═" * 65)
    print(f"  Gradient updates   : {train_result.global_step:,}")
    print(f"  Training loss      : {train_result.training_loss:.4f}")
    print(f"  Elapsed            : {elapsed_min:.1f} min")
    print(f"  Peak VRAM          : {vram_peak:.1f} GB  /  {total_vram:.1f} GB")
    print()
    print("Reference — Qwen3-14B V2 (same data, same epochs):")
    print("  Expected: ~900 steps  |  ~90 min  |  loss ~2.2-2.4  |  ~13.5 GB")

except RuntimeError as e:
    elapsed_min = (time.time() - t0) / 60
    err_str     = str(e).lower()

    # Match OOM specifically; other CUDA errors should keep their traceback
    if isinstance(e, torch.cuda.OutOfMemoryError) or "out of memory" in err_str:
        print()
        print("╔" + "═" * 63 + "╗")
        print("║  OUT OF MEMORY — Qwen3-32B exceeded 24 GB VRAM              ║")
        print("╚" + "═" * 63 + "╝")
        print()
        print(f"  Failed after {elapsed_min:.1f} min.")
        print(f"  Peak VRAM before OOM: {torch.cuda.max_memory_reserved()/1e9:.1f} GB")
        print()
        print("FALLBACK OPTIONS (in order of recommendation):")
        print()
        print("  Option 1 — Reduce max_seq_length to 768 and retry:")
        print("    • Change MAX_SEQ_LEN = 768 in the config cell")
        print("    • Restart the kernel and rerun from Step 3")
        print("    • ~95% of Q&A examples fit within 768 tokens")
        print()
        print("  Option 2 — Use Qwen3-30B-A3B (MoE, fits ~18-20 GB):")
        print("    • Model: unsloth/Qwen3-30B-A3B-bnb-4bit")
        print("    • 30B total params, 3B active per step — lower activation memory")
        print("    • Change BASE_MODEL and OUTPUT_DIR in the config cell")
        print("    • Keep all other settings the same")
        print()
        print("  Option 3 — Fall back to Qwen3-14B V2:")
        print("    • Proven to work at 13.4 GB peak")
        print("    • Notebook: Qwen3_14B_JordanPeterson_V2_FineTuning.ipynb")
        print()
        # Free VRAM so the user can retry without a kernel restart if possible
        del sft_trainer
        gc.collect()
        torch.cuda.empty_cache()
        print(f"  VRAM after cleanup: {torch.cuda.memory_reserved()/1e9:.1f} GB")
    else:
        raise   # re-raise non-OOM errors

---
## Step 9: Save the LoRA Adapter

The r=32 adapter for the 32B model will be larger than for 14B: each targeted
projection adds r × (d_in + d_out) parameters, so adapter size grows with the
number of layers and the width of the targeted projections.

In [None]:
print(f"Saving adapter to {OUTPUT_DIR} …")
model.save_pretrained(str(OUTPUT_DIR))
tokenizer.save_pretrained(str(OUTPUT_DIR))

adapter_files = list(OUTPUT_DIR.glob("*.safetensors")) + list(OUTPUT_DIR.glob("*.bin"))
total_mb = sum(f.stat().st_size for f in adapter_files) / 1e6
print(f"\nAdapter files:")
for f in sorted(adapter_files):
    print(f"  {f.name}  ({f.stat().st_size/1e6:.1f} MB)")
print(f"\nTotal: {total_mb:.0f} MB")

---
## Step 10: Test Inference

Run the same 5 evaluation prompts used across all comparison notebooks.
Results here give a first qualitative check of response quality before
running the full quantitative comparison.

In [None]:
FastLanguageModel.for_inference(model)

EVAL_PROMPTS = [
    "What is the relationship between order and chaos in human experience?",
    "Why is personal responsibility the foundation of a meaningful life?",
    "How do ancient myths and stories reveal truths about human nature?",
    "What does it mean to pursue what is meaningful rather than what is expedient?",
    "How should a person confront suffering rather than flee from it?",
]


def ask(question: str, max_new_tokens: int = 300) -> str:
    """
    Generate a response from the fine-tuned 32B model.

    Identical to the 14B inference wrapper: Qwen3 two-step tokenisation,
    enable_thinking=False, greedy decoding (do_sample=False) for
    deterministic outputs comparable across notebooks.
    """
    msgs  = [{"role": "system", "content": SYSTEM_PROMPT},
             {"role": "user",   "content": question}]
    text  = tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False,
    )
    inp   = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(
            **inp, max_new_tokens=max_new_tokens,
            do_sample=False, temperature=1.0, repetition_penalty=1.1,
        )
    resp = tokenizer.decode(out[0][inp["input_ids"].shape[1]:],
                             skip_special_tokens=True).strip()
    return re.sub(r'<think>.*?</think>', '', resp, flags=re.DOTALL).strip()


print("Testing 32B fine-tuned model (greedy decoding)…\n")
for i, prompt in enumerate(EVAL_PROMPTS):
    print(f"{'─'*70}")
    print(f"Q{i+1}: {prompt}")
    print(f"{'─'*70}")
    answer = ask(prompt)
    print(answer if answer.strip() else "(empty response)")
    print()

---
## Step 11: Thinking Mode Test

The 32B model's chain-of-thought reasoning is expected to be stronger than
the 14B model's.  Testing it here shows whether that reasoning capacity is
preserved after fine-tuning and how it combines with Peterson's domain
knowledge.

In [None]:
def ask_thinking(question: str, max_new_tokens: int = 600) -> tuple[str, str]:
    """Return (thinking_content, final_answer) using Qwen3 reasoning mode."""
    msgs = [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": question}]
    text = tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True, enable_thinking=True,
    )
    inp = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(
            **inp, max_new_tokens=max_new_tokens,
            do_sample=True, temperature=0.6, top_p=0.95, top_k=20,
            repetition_penalty=1.1,
        )
    full  = tokenizer.decode(out[0][inp["input_ids"].shape[1]:],
                              skip_special_tokens=True).strip()
    match = re.search(r'<think>(.*?)</think>', full, re.DOTALL)
    think = match.group(1).strip() if match else ""
    ans   = re.sub(r'<think>.*?</think>', '', full, flags=re.DOTALL).strip()
    return think, ans


q = "What is the relationship between order and chaos in human experience?"
print(f"Q: {q}\n")
thinking, answer = ask_thinking(q)
if thinking:
    print(f"[Chain-of-Thought]\n{thinking[:500]}{'…' if len(thinking)>500 else ''}\n")
print(f"[Answer]\n{answer}")

---
## Conclusions

### What This Run Achieved

A Qwen3-32B model fine-tuned on synthetic Q&A pairs from Peterson's four books
for 3 epochs is the **highest-capacity configuration** attempted for this
domain on a single 24 GB GPU; the quantitative comparison is what confirms
whether that capacity translates into quality.

The three improvements that drive this quality ceiling:

1. **Synthetic Q&A data** — the model learned to answer questions in Peterson's
   voice, not just continue passages.  The training task matches the inference task.

2. **3 epochs** — enough passes over the ~2,400 training examples to generalise
   Peterson's style to unseen questions rather than memorise specific passages.

3. **Qwen3-32B dense architecture** — 2.3× the parameters of the 14B model,
   all of which participate in every token prediction.  More weight space means
   more nuanced style encoding with the same r=32 LoRA rank.

### Updating the All-Models Comparison

To include this model in `AllModels_JordanPeterson_Comparison.ipynb`:

1. Add `"qwen3_32b"` to `MODEL_KEYS`
2. Add the path, display name, and colour to the respective dicts:
   ```python
   MODEL_PATHS  ["qwen3_32b"] = "./outputs/qwen3_32b_peterson_lora"
   MODEL_DISPLAY["qwen3_32b"] = "Qwen3-32B  |  Fine-Tuned"
   MODEL_COLORS ["qwen3_32b"] = "#9467BD"   # purple
   MODEL_SHORT  ["qwen3_32b"] = "Qwen3-32B\nFine-Tuned"
   ```
3. Add a Phase 5 inference block (using `generate_response_qwen3`) with cache
   file `comparison_cache_all_models/qwen3_32b_results.pkl`
4. Re-run the chart cells.  They already iterate over `MODEL_KEYS`, so the new
   entry is the only change needed per chart.

### Further Improvement Options

| Option | Expected gain | Prerequisite |
|--------|--------------|-------------|
| 5 epochs | Marginal; diminishing returns | Just increase `NUM_EPOCHS` |
| `r=64` LoRA | More style capacity | Likely OOMs on 32B; test with care |
| More Q&A variety (3 questions/passage) | Broader coverage | Costs ~$1-2 more |
| Human-edited Q&A answers | Highest quality ceiling | Manual effort |