# Fine-Tuning Qwen3-14B on Jordan Peterson's Books Using Unsloth

## Overview

This notebook fine-tunes the **`unsloth/Qwen3-14B-unsloth-bnb-4bit`** model on text from four Jordan Peterson books using **Unsloth** and **LoRA**. It is the companion to the GPT-OSS 20B fine-tuning notebook, and demonstrates how working with a different model family requires adapting to different chat templates, inference patterns, and model capabilities.

### What is Qwen3?

**Qwen3** is the third generation of Alibaba's open-source Qwen language model family. The 14B variant has roughly 14 billion parameters and is one of the strongest open models in its size class. Qwen3 has a built-in **"thinking" mode** that can be switched on or off per request: when enabled, the model produces an internal chain-of-thought inside `<think>` tags before giving its final answer, similar to OpenAI's o1 or DeepSeek-R1 models. GPT-OSS also reasons before answering, but it controls the depth with a `reasoning_effort` level rather than an on/off switch.

### Qwen3 vs GPT-OSS: Key Differences

| Feature | Qwen3-14B | GPT-OSS 20B |
|---------|-----------|-------------|
| Parameters | 14 billion | 20 billion |
| Chat format | **ChatML** (`<|im_start|>` / `<|im_end|>`) | Harmony (`<|start|>` / `<|message|>`) |
| Thinking mode | ✓ `enable_thinking=True/False` | ✓ `reasoning_effort="low/medium/high"` |
| Sampling settings (chat) | temperature=0.7, top_p=0.8, top_k=20 | temperature=0.7, top_p=0.9 |
| Sampling settings (thinking) | temperature=0.6, top_p=0.95, top_k=20 | N/A (uses reasoning_effort) |
| VRAM (4-bit) | ~8-9 GB | ~12-13 GB |

### Qwen3's Thinking Mode

This is Qwen3's most distinctive feature. When `enable_thinking=True` is passed to the chat template, the model wraps its reasoning process in `<think>...</think>` tags before the visible response. This can significantly improve answer quality on complex questions — the model "thinks out loud" before committing to an answer.

For our **fine-tuning**, we use `enable_thinking=False`. This keeps the training data in standard chat format (no thinking tokens), which is appropriate for learning Jordan Peterson's writing style — we want the model to respond directly in his voice, not reason through each answer like a problem-solver.

For **inference**, we demonstrate both modes so you can see the difference.

### What is LoRA and 4-bit Quantization?

These are the same techniques used in the GPT-OSS notebook. See that notebook for a full explanation. In brief:
- **4-bit quantization**: Compresses 14B parameters to fit in ~8-9 GB of VRAM
- **LoRA**: Trains only small adapter matrices (well under 1% of the parameters; roughly 0.4% with the settings in Step 6) instead of updating the full model

### Hardware

Designed for an **NVIDIA RTX 4090** (24 GB VRAM). The Qwen3-14B model uses significantly less VRAM than GPT-OSS 20B, so we can use a larger batch size.


---
## Step 1: Verify Environment and Imports

We verify all required packages are available in the `.finetuning` virtual environment before doing any GPU-heavy work. Catching a missing package early saves time.

In [None]:
import importlib

required_packages = {
    'unsloth':        'Unsloth — fast fine-tuning library (2x speedup)',
    'torch':          'PyTorch — deep learning framework',
    'transformers':   'HuggingFace Transformers — model loading & tokenization',
    'peft':           'PEFT — LoRA adapter management',
    'trl':            'TRL — SFTTrainer for supervised fine-tuning',
    'datasets':       'HuggingFace Datasets — dataset utilities',
    'fitz':           'PyMuPDF — PDF text extraction',
}

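# Import names that differ from the package name to install.
pip_names = {'fitz': 'pymupdf'}

print("Checking required packages:\n")
for pkg, description in required_packages.items():
    try:
        m = importlib.import_module(pkg)
        # str() guards against modules whose __version__ is not a string
        version = str(getattr(m, '__version__', 'installed'))
        print(f"  ✓  {pkg:15s} {version:12s} | {description}")
    except ImportError:
        print(f"  ✗  {pkg:15s} NOT FOUND    | {description}")
        print(f"       Install with: uv pip install {pip_names.get(pkg, pkg)}")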

In [None]:
import torch

print(f"PyTorch version : {torch.__version__}")
print(f"CUDA available  : {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version    : {torch.version.cuda}")
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        mem  = torch.cuda.get_device_properties(i).total_memory / 1024**3
        print(f"GPU {i}           : {name}  ({mem:.1f} GB VRAM)")
else:
    print("WARNING: No CUDA GPU detected — fine-tuning requires a GPU.")

---
## Step 2: Extract Text from Jordan Peterson's Books

We use the exact same PDF extraction and chunking pipeline as the GPT-OSS notebook. The only difference is that Qwen3's chat template adds a different amount of overhead, so token counts will differ slightly from the GPT-OSS run.

### Why These Chunk Sizes?

The Qwen3 ChatML template adds overhead per conversation (system + user prompt + template tokens). With `chunk_size=350` words (~470 tokens) and `max_seq_length=2048`, we leave ample room for:
- System prompt: ~70 tokens
- User prompt: ~20 tokens
- ChatML template tokens: ~20 tokens
- Book passage: ~470 tokens
- Total: ~580 tokens — well within 2048
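
The budget above can be written out as a short calculation. This is only a sketch built from the estimates in the list (they are not measured token counts); `max_words_that_fit` is a hypothetical figure derived from the same ~1.34 tokens-per-word ratio.

```python
# Rough per-example token budget, using the estimates quoted above.
MAX_SEQ_LENGTH = 2048
budget = {
    "system_prompt": 70,
    "user_prompt": 20,
    "chatml_markers": 20,
    "book_passage": 470,   # ~350 words at ~1.34 tokens per word
}

total = sum(budget.values())
headroom = MAX_SEQ_LENGTH - total
print(f"Estimated tokens per example : {total}")
print(f"Headroom under max_seq_length: {headroom}")

# Hypothetical: the largest chunk (in words) that would still fit,
# assuming the same tokens-per-word ratio holds.
tokens_per_word = 470 / 350
overhead = total - budget["book_passage"]
max_words_that_fit = int((MAX_SEQ_LENGTH - overhead) / tokens_per_word)
print(f"Largest chunk that would fit : ~{max_words_that_fit} words")
```

The real counts depend on the tokenizer, which is why Step 7 measures a sample of formatted examples directly.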

### The Four Books

1. **Maps of Meaning: The Architecture of Belief** — Peterson's academic masterwork on myth and psychological meaning
2. **12 Rules for Life: An Antidote to Chaos** — His bestselling self-help book
3. **Beyond Order: 12 More Rules for Life** — The sequel
4. **We Who Wrestle with God** — His most recent work on biblical narrative

In [None]:
import fitz  # PyMuPDF
import re
from pathlib import Path

BOOKS_DIR = Path("../../Books/JordanPeterson")


def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract all text from a PDF file using PyMuPDF."""
    doc = fitz.open(pdf_path)
    parts = []
    for page in doc:
        text = page.get_text()
        if text.strip():
            parts.append(text)
    doc.close()
    return "\n".join(parts)


def clean_text(text: str) -> str:
    """Remove PDF extraction artifacts and normalize whitespace."""
    text = re.sub(r'[^\x20-\x7E\n\t]', ' ', text)   # keep printable ASCII only (also drops curly quotes, dashes, accents)
    text = text.replace('\t', ' ')
    text = re.sub(r' +', ' ', text)                      # collapse spaces
    text = re.sub(r'\n{3,}', '\n\n', text)             # collapse excess newlines
    text = re.sub(r'^\s*\d{1,4}\s*$', '', text, flags=re.MULTILINE)  # page numbers
    lines = [line.strip() for line in text.split('\n')]
    text = '\n'.join(lines)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


def chunk_text(text: str, chunk_size: int = 350, overlap: int = 50) -> list:
    """
    Split text into overlapping word-count chunks.

    350 words ≈ 470 tokens, comfortably within our 2048-token limit when combined
    with the ChatML template overhead (~110 tokens for system + user + markers).
    The 50-word overlap ensures continuity across chunk boundaries.
    """
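    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end         = start + chunk_size
        chunk_words = words[start:end]
        if len(chunk_words) >= 50:
            chunks.append(' '.join(chunk_words))
        if end >= len(words):
            break   # reached the end; avoids a trailing chunk made only of overlap
        start = end - overlap
    return chunks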


# ── Extract and clean all PDFs ─────────────────────────────────────────────
print(f"Looking for PDFs in: {BOOKS_DIR.resolve()}\n")
books = {}
pdf_files = sorted(BOOKS_DIR.glob("*.pdf"))

if not pdf_files:
    raise FileNotFoundError(f"No PDFs found in {BOOKS_DIR.resolve()}")

for pdf_path in pdf_files:
    print(f"Processing: {pdf_path.name}")
    raw   = extract_text_from_pdf(str(pdf_path))
    clean = clean_text(raw)
    books[pdf_path.stem] = clean
    words = len(clean.split())
    chars = len(clean)
    print(f"  → {words:,} words  {chars:,} chars\n")

total_words = sum(len(t.split()) for t in books.values())
print(f"Total across all books : {total_words:,} words")
print(f"Books processed        : {len(books)}")

---
## Step 3: Create Training Dataset

We format each book passage as a three-turn conversation:

```
system  : "You are an AI trained on Jordan Peterson's works…"
user    : "Please share your thoughts on the following topic…"
assistant : <350-word passage from one of the books>
```

This is identical in structure to the GPT-OSS training data. The difference is that **the chat template applied in Step 7 is Qwen3-specific** — it uses ChatML tokens (`<|im_start|>`, `<|im_end|>`) instead of Harmony tokens (`<|start|>`, `<|message|>`).

By rotating through eight different user prompts we add variety that prevents the model from overfitting to any single prompt phrasing.

### Why `enable_thinking=False` for Training?

When we apply the Qwen3 chat template in Step 7, we pass `enable_thinking=False`. This produces clean training examples without thinking-mode markers. Here is why this matters:

- We are fine-tuning for **style** (Jordan Peterson's writing voice), not for **reasoning** (solving math problems or logical puzzles).
- Thinking-mode training would add `<think>` / `</think>` tokens around empty or nonsensical "thoughts" since our book passages have no reasoning steps to show.
- Using non-thinking mode keeps the training signal clean and directly tied to Peterson's prose.

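You can still use `enable_thinking=True` at inference time (Step 12). The thinking capability comes from the base model weights, and a single epoch of LoRA training on non-thinking data should leave it largely intact, although it may weaken somewhat because none of our examples exercise it.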

In [None]:
USER_PROMPTS = [
    "Please share your thoughts on the following topic from your writings.",
    "Can you elaborate on this idea from your work?",
    "Explain this concept in detail.",
    "What are your views on this subject?",
    "Continue discussing this topic.",
    "Tell me more about this idea.",
    "Share your perspective on this.",
    "Discuss the following in depth.",
]

SYSTEM_PROMPT = (
    "You are an AI assistant that has been trained on the complete works of Jordan B. Peterson, "
    "a Canadian clinical psychologist, professor, and author. You speak with deep knowledge of "
    "psychology, philosophy, mythology, religion, and personal responsibility. Your responses "
    "reflect Peterson's writing style, intellectual depth, and interdisciplinary approach to "
    "understanding human nature and meaning."
)


def create_training_examples(books: dict) -> list:
    """
    Convert each book's text into conversational training examples.

    Each example is a dict with a 'messages' key containing a list of role/content dicts.
    This is the standard format consumed by Unsloth's standardize_sharegpt() and the
    tokenizer's apply_chat_template().
    """
    examples, prompt_idx = [], 0
    for book_name, text in books.items():
        chunks = chunk_text(text)
        print(f"  {book_name}: {len(chunks)} chunks")
        for chunk in chunks:
            examples.append({
                "messages": [
                    {"role": "system",    "content": SYSTEM_PROMPT},
                    {"role": "user",      "content": USER_PROMPTS[prompt_idx % len(USER_PROMPTS)]},
                    {"role": "assistant", "content": chunk},
                ]
            })
            prompt_idx += 1
    return examples


print("Creating training examples…\n")
training_data = create_training_examples(books)
print(f"\nTotal training examples: {len(training_data)}")

In [None]:
# Preview one training conversation
print("=" * 80)
print("SAMPLE TRAINING CONVERSATION (first example):")
print("=" * 80)
for msg in training_data[0]["messages"]:
    role    = msg["role"].upper()
    content = msg["content"]
    if len(content) > 300:
        content = content[:300] + "…[truncated]"
    print(f"\n[{role}]:")
    print(content)

---
## Step 4: Convert to HuggingFace Dataset

We convert our list of conversation dictionaries into a HuggingFace `Dataset`. This gives us efficient batching, shuffling, and the `.map()` interface needed to apply the chat template across all examples.

We shuffle the dataset so that examples from all four books are interleaved throughout training, rather than all of one book appearing consecutively. This mitigates catastrophic forgetting — the tendency for the model to overwrite earlier learning when trained on later data.

In [None]:
from datasets import Dataset

dataset = Dataset.from_list(training_data)
dataset = dataset.shuffle(seed=3407)

print(f"Dataset: {len(dataset)} examples")
print(f"Features: {dataset.features}")

---
## Step 5: Load Qwen3-14B with Unsloth

We load `unsloth/Qwen3-14B-unsloth-bnb-4bit` — the 4-bit quantized version of Qwen3-14B. This model has already been quantized by Unsloth and uploaded to HuggingFace, so it downloads ready to use without any quantization step on our end.

### Parameter Notes

- **`model_name`**: The Unsloth-quantized 4-bit version of Qwen3-14B. Compared to the non-quantized `unsloth/Qwen3-14B` loaded in 16-bit, this needs roughly 70% less VRAM at a small accuracy cost.
- **`max_seq_length`**: 2048 tokens. Qwen3 natively supports up to 32K tokens, but we use 2048 for training efficiency. Our ~350-word chunks fit comfortably within this limit.
- **`load_in_4bit = True`**: Activates bitsandbytes 4-bit quantization. The 14B model occupies approximately **8-9 GB** of VRAM in this mode (vs roughly 28-30 GB for the 16-bit weights).
- **`full_finetuning = False`**: We use LoRA (parameter-efficient), not full fine-tuning.
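
As a sanity check on these figures, weight memory can be estimated as parameters × bits ÷ 8. The sketch below assumes a parameter count of about 14.8 billion for Qwen3-14B (an assumption, not a value read from the loaded model) and counts the weights only.

```python
# Back-of-the-envelope weight memory at different precisions.
# PARAMS is an assumption: Qwen3-14B is usually quoted at ~14.8 billion parameters.
PARAMS = 14.8e9

def weight_gib(bits_per_param: float, params: float = PARAMS) -> float:
    """Memory for the weights alone, in GiB (no activations, KV cache, or optimizer)."""
    return params * bits_per_param / 8 / 1024**3

for label, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_gib(bits):5.1f} GiB")
```

The 4-bit estimate comes out below the ~8-9 GB observed in practice, since a loaded model also carries quantization constants, any layers kept in higher precision, and CUDA runtime overhead.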

### How Qwen3-14B Compares to GPT-OSS 20B

| | Qwen3-14B | GPT-OSS 20B |
|--|-----------|-------------|
| VRAM (4-bit) | ~8-9 GB | ~12-13 GB |
| Available VRAM headroom | ~15 GB | ~10 GB |
| Batch size possible | 2 | 1 |
| Training speed | Faster | Slower |

The extra headroom in Qwen3-14B lets us use `per_device_train_batch_size=2` in Step 8, doubling the number of examples processed per optimizer step compared to the GPT-OSS run.

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048   # Training context length (Qwen3 supports up to 32K)
dtype         = None    # Auto-detect best dtype (bfloat16 on RTX 4090)

print("Loading Qwen3-14B (4-bit quantized)…")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    dtype           = dtype,
    max_seq_length  = max_seq_length,
    load_in_4bit    = True,         # 4-bit quantization (~8-9 GB VRAM)
    load_in_8bit    = False,        # 8-bit: more accurate but uses 2x VRAM
    full_finetuning = False,        # LoRA, not full fine-tuning
)

vram_used = torch.cuda.memory_reserved() / 1024**3
print(f"\nModel loaded. VRAM reserved: {vram_used:.1f} GB")
print(f"Model type: {type(model).__name__}")
print(f"Tokenizer: {type(tokenizer).__name__}")

---
## Step 6: Add LoRA Adapters

We inject **LoRA (Low-Rank Adaptation)** adapter matrices into the model's attention and feed-forward layers. Only these adapters are trained — the 14B base weights remain frozen.

### How LoRA Works

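For each target weight matrix `W` of shape `d_out × d_in` (e.g., a 5120×5120 query projection in Qwen3-14B), LoRA adds two small matrices, `A` (r×d_in) and `B` (d_out×r), and trains only these. During the forward pass, the effective weight becomes `W + (alpha / r)·B·A`. Each adapted module therefore adds `r × (d_in + d_out)` trainable parameters, summed over all seven target modules in every Transformer block. For r=16, assuming Qwen3-14B's published dimensions (hidden size 5120, FFN width 17408, 40 blocks, 8 key/value heads of dimension 128):

```
per block: 16 × (2 × 10,240 + 2 × 6,144 + 3 × 22,528) = 1,605,632
40 blocks × 1,605,632 ≈ 64.2 million parameters
vs the base model's ~14,800 million parameters → ~0.43% of total
```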
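
The adapter size can also be worked out from the layer shapes in a few lines. This is a minimal sketch: the dimensions are assumed from the published Qwen3-14B configuration and should be checked against `model.config`, and `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Assumed Qwen3-14B dimensions (verify against model.config before relying on them).
HIDDEN = 5120          # hidden_size
INTERMEDIATE = 17408   # intermediate_size (FFN width)
N_LAYERS = 40          # num_hidden_layers
Q_OUT = 40 * 128       # num_attention_heads * head_dim
KV_OUT = 8 * 128       # num_key_value_heads * head_dim

# (in_features, out_features) for each targeted projection in one block
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """A is (rank x in) and B is (out x rank), so each module adds rank * (in + out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * N_LAYERS

for r in (16, 32):
    print(f"r={r}: {lora_param_count(r):,} trainable parameters")
```

Because the count is linear in the rank, moving from r=16 to r=32 doubles the number of trainable parameters.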

### Parameter Choices

- **`r = 16`**: LoRA rank. The reference Qwen3 notebook uses r=32; we use r=16 for consistency with the GPT-OSS run. r=32 would give more adaptation capacity at the cost of more VRAM.
- **`lora_alpha = 32`**: Scaling factor. The effective learning rate multiplier is `alpha / r = 32 / 16 = 2.0`. This is a standard configuration for style learning.
- **`target_modules`**: All seven linear projections in each Transformer block. Targeting all of them (vs. just q/v as in some approaches) gives the adapters access to the full information flow through the model.
- **`use_gradient_checkpointing = "unsloth"`**: Unsloth's custom implementation saves ~30% additional VRAM vs. standard gradient checkpointing by recomputing activations more cleverly.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r                        = 16,    # LoRA rank (try 32 for more capacity)
    target_modules           = [
        "q_proj",    # Query projection (attention)
        "k_proj",    # Key projection (attention)
        "v_proj",    # Value projection (attention)
        "o_proj",    # Output projection (attention)
        "gate_proj", # Gate projection (FFN)
        "up_proj",   # Up projection (FFN)
        "down_proj", # Down projection (FFN)
    ],
    lora_alpha               = 32,    # Scaling: effective lr multiplier = alpha/r = 2x
    lora_dropout             = 0,     # 0 is optimized by Unsloth (no dropout)
    bias                     = "none",
    use_gradient_checkpointing = "unsloth",  # ~30% less VRAM
    random_state             = 3407,
    use_rslora               = False,
    loftq_config             = None,
)

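# Count trainable parameters.
# Note: bitsandbytes packs 4-bit weights into bytes, so total_params may under-count
# the base model and the printed percentage may look higher than the true ratio.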
total_params     = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters     : {total_params:,}")
print(f"Trainable parameters : {trainable_params:,}")
print(f"Trainable %          : {100 * trainable_params / total_params:.4f}%")

---
## Step 7: Format Dataset with Qwen3 Chat Template

This step converts our `messages` format into the exact text the model expects, using Qwen3's **ChatML** template. This is one of the most important differences from the GPT-OSS notebook.

### Qwen3 ChatML Format

After applying the template, each training example looks like:

```
<|im_start|>system
You are an AI assistant…<|im_end|>
<|im_start|>user
Please share your thoughts on the following topic…<|im_end|>
<|im_start|>assistant
Book passage text here…<|im_end|>
```

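The tokens `<|im_start|>` and `<|im_end|>` are special tokens that delimit conversation turns in the ChatML format used by Qwen and other models. Depending on the template version, the rendered assistant turn may also contain an empty `<think>` block, so inspect a formatted example rather than relying on this schematic alone.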
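
To make the layout concrete, here is a simplified stand-in for the template. `render_chatml` is a hypothetical helper that reproduces only the basic turn structure; the real Qwen3 template is applied through `tokenizer.apply_chat_template()` and handles additional cases (tools, thinking blocks), so its output may differ.

```python
def render_chatml(messages: list[dict]) -> str:
    """Render role/content messages in basic ChatML (simplified illustration)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

demo = [
    {"role": "system", "content": "You are an AI assistant."},
    {"role": "user", "content": "Explain this concept in detail."},
    {"role": "assistant", "content": "Book passage text here."},
]
print(render_chatml(demo))
```

Each turn opens with `<|im_start|>` plus the role on its own line and closes with `<|im_end|>`, which is what makes the assistant turn easy to locate later for response-only training.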

### The `enable_thinking` Parameter

Qwen3 introduces a special `enable_thinking` parameter to `apply_chat_template()`:

- **`enable_thinking=False`** (what we use for training): Requests a direct answer. When a generation prompt is added, the template pre-fills an empty `<think></think>` block so the model skips its reasoning preamble.
- **`enable_thinking=True`** (the template default, used for complex reasoning at inference): Leaves the assistant turn open so the model writes its own `<think>…</think>` block before the final answer.

We always use `enable_thinking=False` for training data to keep the signal clean and consistent with what we want the model to produce (direct Peterson-style prose).

### Why `standardize_sharegpt()`?

Unsloth's `standardize_sharegpt()` normalizes different possible field names (`from`/`value` vs `role`/`content`) in the dataset to the standard HuggingFace format, ensuring compatibility with `apply_chat_template()`.
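
The kind of normalization involved can be sketched as follows. This illustrates the idea only and is not Unsloth's implementation; `ROLE_ALIASES` and `to_role_content` are hypothetical names.

```python
# Hypothetical helper: maps ShareGPT-style turns onto role/content dicts.
ROLE_ALIASES = {"human": "user", "gpt": "assistant", "system": "system"}

def to_role_content(turn: dict) -> dict:
    """Return the turn in {'role', 'content'} form, converting from/value if needed."""
    if "role" in turn and "content" in turn:
        return {"role": turn["role"], "content": turn["content"]}
    role = ROLE_ALIASES.get(turn["from"], turn["from"])
    return {"role": role, "content": turn["value"]}

print(to_role_content({"from": "human", "value": "Tell me more about this idea."}))
print(to_role_content({"role": "assistant", "content": "Book passage text here."}))
```

Our examples already use `role`/`content`, so for this dataset the call should act mainly as a safeguard.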

In [None]:
from unsloth.chat_templates import standardize_sharegpt

def formatting_prompts_func(examples):
    """
    Apply the Qwen3 ChatML template to each conversation.

    Key difference from GPT-OSS: we pass enable_thinking=False.
    This suppresses the thinking-mode prefix (<think> tags) and produces
    standard ChatML output suitable for style-learning fine-tuning.
    """
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize             = False,   # Return text, not token IDs
            add_generation_prompt = False,  # Full conversation (no open-ended prompt)
            enable_thinking      = False,   # Suppress thinking mode for training
        )
        for convo in convos
    ]
    return {"text": texts}


# Standardize field names, then apply the template
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"Dataset formatted. Columns: {dataset.column_names}")
print(f"Examples: {len(dataset)}")

In [None]:
# Inspect the first formatted example — this is the exact text fed to the model.
# Notice the <|im_start|> and <|im_end|> tokens that mark turn boundaries.
print("=" * 80)
print("FORMATTED TRAINING EXAMPLE (first 1200 characters):")
print("=" * 80)
print(dataset[0]['text'][:1200])
print("…")

In [None]:
# Diagnostic: verify that examples fit within max_seq_length.
# If too many exceed the limit, SFTTrainer will filter them out,
# potentially resulting in an empty (or very small) training set.
import statistics

token_counts = []
for i in range(min(50, len(dataset))):
    tokens = tokenizer.encode(dataset[i]['text'])
    token_counts.append(len(tokens))

print(f"Token count stats (sample of {len(token_counts)} examples):")
print(f"  Min    : {min(token_counts)}")
print(f"  Max    : {max(token_counts)}")
print(f"  Mean   : {statistics.mean(token_counts):.0f}")
print(f"  Median : {statistics.median(token_counts):.0f}")
print(f"  Limit  : {max_seq_length}")
over = sum(1 for t in token_counts if t > max_seq_length)
print(f"  Over   : {over}/{len(token_counts)}")
if over:
    print(f"  WARNING: {over} examples exceed max_seq_length and will be truncated/dropped!")
else:
    print("  All sampled examples fit within max_seq_length.")

---
## Step 8: Configure SFTTrainer

We use `SFTTrainer` from the `trl` library, the same trainer used for the GPT-OSS run. The configuration is similar but with a few Qwen3-specific adjustments.

### Differences from the GPT-OSS Configuration

| Parameter | GPT-OSS 20B | Qwen3-14B | Reason |
|-----------|-------------|-----------|--------|
| `per_device_train_batch_size` | 1 | **2** | Qwen3-14B uses less VRAM, giving us room for larger batches |
| `gradient_accumulation_steps` | 4 | **4** | Unchanged; with the larger per-device batch, the effective batch size doubles from 4 (1 × 4) to 8 (2 × 4) |
| `weight_decay` | 0.01 | **0.001** | Reference notebook value; lighter regularization |
| `dataset_text_field` | (implicit) | **"text"** | Explicitly tells SFTTrainer which column to use |

### Why `per_device_train_batch_size=2`?

The Qwen3-14B model in 4-bit uses ~8-9 GB of VRAM for the model weights, leaving ~15 GB free for training. This is enough to fit 2 examples per step, which improves GPU utilization and produces more stable gradient estimates. With GPT-OSS 20B, the model alone took ~12-13 GB, leaving only ~10 GB, which was not enough for batch_size=2 once activations and optimizer states were accounted for.

### Effective Batch Size

```
effective_batch_size = per_device_train_batch_size × gradient_accumulation_steps
                     = 2 × 4 = 8
```

A larger effective batch size gives more stable gradient estimates but yields fewer weight updates per epoch. With ~2563 training examples and effective_batch_size=8, we get ~321 optimizer steps per epoch (vs ~641 with the GPT-OSS configuration).
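
The step counts follow directly from the example count and the effective batch size. A minimal sketch, assuming a final partial batch still counts as one optimizer step (the trainer's own rounding may differ by one depending on the library version); `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(n_examples: int, per_device_batch: int, grad_accum: int) -> int:
    """Optimizer updates in one epoch, counting a final partial batch as a step."""
    effective_batch = per_device_batch * grad_accum
    return math.ceil(n_examples / effective_batch)

N_EXAMPLES = 2563  # approximate example count quoted above
print("Qwen3-14B  :", optimizer_steps(N_EXAMPLES, per_device_batch=2, grad_accum=4))
print("GPT-OSS 20B:", optimizer_steps(N_EXAMPLES, per_device_batch=1, grad_accum=4))
```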

In [None]:
from trl import SFTConfig, SFTTrainer

OUTPUT_DIR = "./outputs/qwen3_14b_jordan_peterson"

trainer = SFTTrainer(
    model         = model,
    tokenizer     = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field          = "text",  # Column name in our dataset
        per_device_train_batch_size = 2,   # 2 examples per step (more VRAM headroom vs GPT-OSS)
        gradient_accumulation_steps = 4,   # Effective batch size = 2 × 4 = 8
        warmup_steps                = 10,  # Ramp up LR over first 10 steps
        num_train_epochs            = 1,   # One full pass through the dataset
        # max_steps = 30,                  # Uncomment to limit steps for quick testing
        learning_rate               = 2e-4, # Standard LoRA learning rate
        logging_steps               = 1,    # Log every step
        optim                       = "adamw_8bit",  # 8-bit AdamW saves optimizer memory
        weight_decay                = 0.001,          # Light regularization (reference value)
        lr_scheduler_type           = "cosine",       # Cosine decay schedule
        seed                        = 3407,
        output_dir                  = OUTPUT_DIR,
        report_to                   = "none",         # No WandB / TensorBoard
        fp16 = not torch.cuda.is_bf16_supported(),   # Use fp16 if bf16 unavailable
        bf16 = torch.cuda.is_bf16_supported(),       # Use bf16 on RTX 4090
        save_strategy               = "steps",
        save_steps                  = 100,
        save_total_limit            = 3,
    ),
)

---
## Step 9: Apply Response-Only Training

As with the GPT-OSS notebook, we apply `train_on_responses_only()` so the model only computes loss on the **assistant's book passages**, not on the system prompt or user prompts. This focuses all learning capacity on what matters: generating text that sounds like Peterson.

### Qwen3 ChatML Response Tokens

For Qwen3's ChatML format (with `enable_thinking=False`), the response boundary tokens are:

```
instruction_part = "<|im_start|>user\n"
response_part    = "<|im_start|>assistant\n"
```

These are different from the GPT-OSS tokens:
- GPT-OSS: `<|start|>user<|message|>` / `<|start|>assistant<|message|>`
- Qwen3:   `<|im_start|>user\n`      / `<|im_start|>assistant\n`

We auto-detect the response token from the formatted data to ensure we use the exact string that appears in the actual training text, avoiding the silent failure mode where a wrong response_part causes all labels to be masked (resulting in an empty training set).
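
The masking idea, and the failure mode when the marker is wrong, can be illustrated on toy token IDs. This is a simplified single-turn sketch, not Unsloth's implementation; `find_sub` and `mask_before_response` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def find_sub(seq: list[int], sub: list[int]) -> int:
    """Index of the first occurrence of sub in seq, or -1 if absent."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_before_response(input_ids: list[int], response_ids: list[int]) -> list[int]:
    """Keep labels only for tokens after the response marker; mask the rest."""
    pos = find_sub(input_ids, response_ids)
    if pos == -1:
        return [IGNORE_INDEX] * len(input_ids)   # marker missing: nothing is trained on
    cut = pos + len(response_ids)
    return [IGNORE_INDEX] * cut + input_ids[cut:]

# Toy token IDs: 1,2 = user marker, 3,4 = assistant marker, others = content
ids = [1, 2, 10, 11, 3, 4, 20, 21, 22]
print(mask_before_response(ids, [3, 4]))   # only the assistant content keeps its labels
print(mask_before_response(ids, [9, 9]))   # wrong marker: every label is masked
```

The second call shows why a mismatched `response_part` is dangerous: every label becomes `-100`, so the example contributes nothing to the loss.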

In [None]:
from unsloth.chat_templates import train_on_responses_only

# Auto-detect the correct token strings from the actual formatted data.
# Qwen3 ChatML uses <|im_start|>assistant\n as the response boundary.
sample_text = dataset[0]['text']
print("Detecting response boundary tokens from formatted data…")
print(f"Sample text (first 400 chars):\n{sample_text[:400]}\n")

if "<|im_start|>assistant\n" in sample_text:
    instruction_part = "<|im_start|>user\n"
    response_part    = "<|im_start|>assistant\n"
    print("✓ Detected Qwen3 ChatML tokens")
else:
    # Fallback: inspect what's actually in the data
    raise ValueError(
        f"Could not find expected Qwen3 assistant token in formatted text.\n"
        f"First 500 chars of formatted text:\n{sample_text[:500]}"
    )

print(f"  instruction_part = {repr(instruction_part)}")
print(f"  response_part    = {repr(response_part)}")

trainer = train_on_responses_only(
    trainer,
    instruction_part = instruction_part,
    response_part    = response_part,
)

print("\nResponse-only training configured.")
print("Loss will only be computed on assistant (book passage) tokens.")

In [None]:
# Verify that the masking is working correctly.
# Lines replaced by spaces below are masked (system + user prompts).
# The visible text is what the model actually trains on (book passages).

dataset_size = len(trainer.train_dataset)
print(f"Training examples after tokenization: {dataset_size}")

if dataset_size == 0:
    print("\nERROR: Training dataset is empty after tokenization!")
    print("Possible causes:")
    print("  1. All examples exceeded max_seq_length and were dropped.")
    print("  2. The response_part token string doesn't match the formatted data.")
    print("     (When the token isn't found, all labels are -100, all examples filtered.)")
else:
    print("\n" + "=" * 80)
    print("FULL INPUT (what the model sees):")
    print("=" * 80)
    full = tokenizer.decode(trainer.train_dataset[0]["input_ids"])
    print(full[:600] + "…")

    print("\n" + "=" * 80)
    print("MASKED LABELS (spaces = masked out, visible text = trained on):")
    print("=" * 80)
    labels      = trainer.train_dataset[0]["labels"]
    masked_text = tokenizer.decode(
        [tokenizer.pad_token_id if x == -100 else x for x in labels]
    ).replace(tokenizer.pad_token, " ")
    print(masked_text[:600] + "…")

---
## Step 10: Check Memory Before Training

A snapshot of GPU memory usage before training starts. This helps us understand whether we have enough headroom and whether the batch size is sustainable.

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
max_memory       = round(gpu_stats.total_memory / 1024**3, 3)

print(f"GPU           : {gpu_stats.name}")
print(f"Total VRAM    : {max_memory} GB")
print(f"Reserved now  : {start_gpu_memory} GB")
print(f"Available     : {max_memory - start_gpu_memory:.3f} GB")

---
## Step 11: Train the Model

The trainer iterates through the dataset, computes loss only on the assistant responses (book passages), accumulates gradients across 4 micro-steps, and updates the LoRA adapter weights every 8 examples (2 per device × 4 accumulation steps).

### What to Watch

- **Loss** should decrease steadily. Starting values around 3-5 are normal; the model is initially uncertain about Peterson-style text.
- **Loss shouldn't reach near 0** — that would indicate memorization, not learning.
- A healthy final loss for 1 epoch of book-style fine-tuning is typically **1.5–3.0**.
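
One way to read these numbers: the reported loss is the mean per-token cross-entropy (in nats), so `exp(loss)` gives the perplexity, roughly the number of equally likely next tokens the model is choosing between. A small sketch, with `perplexity` as a hypothetical helper:

```python
import math

def perplexity(loss: float) -> float:
    """Convert mean token cross-entropy (in nats) to perplexity."""
    return math.exp(loss)

for loss in (3.0, 1.5, 0.1):
    print(f"loss={loss:<4} -> perplexity={perplexity(loss):.2f}")
```

A loss near 0 corresponds to a perplexity near 1, meaning the model predicts almost every token with certainty, which on book text points to memorization.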

### Training Time Estimate

With ~2563 examples, batch_size=2, gradient_accumulation=4:
- ~321 optimizer steps
- RTX 4090: roughly **1.5–2 seconds per step**
- Estimated training time: **~8–11 minutes**

This is significantly faster than the GPT-OSS 20B run (~73 minutes, 641 steps) for two reasons:
1. The model is smaller (14B vs 20B parameters)
2. We use batch_size=2 instead of 1, halving the number of optimizer steps per epoch

In [None]:
# Start training!
# Progress bar will show step count, loss, and learning rate.
trainer_stats = trainer.train()

In [None]:
used_memory          = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
used_for_training    = round(used_memory - start_gpu_memory, 3)
used_pct             = round(used_memory / max_memory * 100, 3)
training_pct         = round(used_for_training / max_memory * 100, 3)

runtime_sec  = trainer_stats.metrics['train_runtime']
runtime_min  = round(runtime_sec / 60, 2)
train_loss   = trainer_stats.metrics.get('train_loss', 'N/A')

print("=" * 60)
print("TRAINING COMPLETE")
print("=" * 60)
print(f"  Time          : {runtime_sec:.1f} s ({runtime_min} min)")
print(f"  Final loss    : {train_loss}")
print(f"  Peak VRAM     : {used_memory} GB  ({used_pct}% of {max_memory} GB)")
print(f"  Training VRAM : {used_for_training} GB  ({training_pct}%)")

---
## Step 12: Test the Fine-Tuned Model (Inference)

Here Qwen3 works differently from GPT-OSS: instead of choosing a `reasoning_effort` level, we can run the model in **two distinct modes** and compare the output quality.

### Mode 1: Standard Chat (Non-Thinking)

Pass `enable_thinking=False` to the chat template. The model responds directly, using the same format it was trained on. Recommended settings: `temperature=0.7, top_p=0.8, top_k=20`.

### Mode 2: Thinking Mode

Pass `enable_thinking=True`. The model first produces a chain-of-thought in `<think>` tags before its visible response. Even though we did **not train on thinking data**, this capability is present in the base model weights and is accessible after fine-tuning. Recommended settings: `temperature=0.6, top_p=0.95, top_k=20`.
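
To see what these sampling settings do, the sketch below applies temperature, then top-k, then top-p to a toy list of logits in plain Python. It is a conceptual illustration only (`filtered_distribution` is a hypothetical helper); the actual filtering happens inside `model.generate()` and may differ in details.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(xs)
    weights = [math.exp(x - peak) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]

def filtered_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k, then top-p."""
    scaled = [x / temperature for x in logits]           # lower temperature sharpens
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    probs = softmax([scaled[i] for i in order])          # renormalize over top-k survivors
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):                     # smallest set reaching top_p
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filtered_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.8))
print(filtered_distribution(toy_logits, temperature=0.1, top_k=3, top_p=0.8))
```

With the lower temperature, nearly all probability mass sits on the top token, so the top-p cutoff keeps only that one candidate.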

**Important**: In thinking mode, the model generates its reasoning tokens before the final response. The `TextStreamer` prints them as they are generated (including the `<think>` tags). To show only the final answer, post-process the output to remove everything between `<think>` and `</think>`. Thinking tokens also count against `max_new_tokens`, so the 512-token default may be used up before the answer begins.
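
A minimal sketch of that post-processing step is shown below. It assumes you have the generated text as a string (for example by decoding the IDs returned from `model.generate()` rather than relying on the streamer); `strip_thinking` is a hypothetical helper.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(generated: str) -> str:
    """Return the text with any <think>...</think> block removed."""
    without_blocks = _THINK_BLOCK.sub("", generated)
    # If generation stopped mid-thought, an unclosed <think> may remain; drop it too.
    unclosed = without_blocks.find("<think>")
    if unclosed != -1:
        without_blocks = without_blocks[:unclosed]
    return without_blocks.strip()

sample = "<think>\nThe user asks about order.\n</think>\n\nOrder is explored territory."
print(strip_thinking(sample))
```

If the token budget runs out before `</think>` is emitted, the function returns an empty string, which is a useful signal to raise `max_tokens`.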

### Expected Outputs

After only 1 epoch of training, the model may produce short or incomplete responses — this is normal. The fine-tuned model should show increased use of Peterson's vocabulary (chaos, order, meaning, responsibility, myth, hero) compared to the untuned base model. Multiple epochs of training would improve output coherence significantly.

In [None]:
from transformers import TextStreamer

# Put model in inference mode: disables dropout, enables fast generation path
FastLanguageModel.for_inference(model)


def ask_model(question: str,
              enable_thinking: bool = False,
              temperature: float = None,
              top_p: float = None,
              top_k: int = 20,
              max_tokens: int = 512):
    """
    Ask the fine-tuned Qwen3 model a question and stream the response.

    Args:
        question       : The question to ask
        enable_thinking: If True, model produces <think> reasoning before answering
                         Recommended temp/top_p differ per mode (see below)
        temperature    : Sampling temperature. Defaults to 0.7 (chat) or 0.6 (thinking)
        top_p          : Nucleus sampling. Defaults to 0.8 (chat) or 0.95 (thinking)
        top_k          : Top-k sampling. Qwen3 team recommends 20 for both modes.
        max_tokens     : Maximum tokens to generate
    """
    # Qwen3-recommended defaults per mode
    if temperature is None:
        temperature = 0.6 if enable_thinking else 0.7
    if top_p is None:
        top_p = 0.95 if enable_thinking else 0.8

    mode_label = "THINKING MODE" if enable_thinking else "CHAT MODE"
    print(f"\n{'='*70}")
    print(f"[{mode_label}]  temp={temperature}  top_p={top_p}  top_k={top_k}")
    print(f"QUESTION: {question}")
    print(f"{'='*70}\n")

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": question},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize              = False,
        add_generation_prompt = True,   # Append the assistant header so the model writes the reply
        enable_thinking       = enable_thinking,
    )
    inputs = tokenizer(text, return_tensors="pt").to("cuda")

    streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(
        **inputs,
        max_new_tokens     = max_tokens,
        streamer           = streamer,
        do_sample          = True,   # Explicit: greedy decoding would ignore temperature/top_p/top_k
        temperature        = temperature,
        top_p              = top_p,
        top_k              = top_k,
        repetition_penalty = 1.1,
    )

In [None]:
# Test 1: Core Peterson theme — "12 Rules for Life"
ask_model(
    "What is the relationship between order and chaos in human experience?",
    enable_thinking = False,
)

In [None]:
# Test 2: A theme from "Maps of Meaning"
ask_model(
    "How do myths and archetypal stories illuminate the structure of the human psyche?",
    enable_thinking = False,
)

In [None]:
# Test 3: Same question as Test 1, but with thinking mode ENABLED.
# Watch for the <think> section before the actual response.
# The model reasons through its answer before committing — a capability
# that was present in the base weights and survives fine-tuning.
ask_model(
    "What is the relationship between order and chaos in human experience?",
    enable_thinking = True,
    max_tokens      = 1024,  # Allow more tokens for the thinking chain
)

In [None]:
# Test 4: Responsibility — Peterson's central practical theme
ask_model(
    "Why is taking responsibility for your own life the necessary precondition for meaning?",
    enable_thinking = False,
)

In [None]:
# Test 5: Peterson's take on suffering — from "We Who Wrestle with God"
# Using thinking mode to see how the model reasons about this deep topic.
ask_model(
    "What is the significance of suffering in the biblical narratives, and what does it mean for how we should face hardship?",
    enable_thinking = True,
    max_tokens      = 1024,
)

---
## Step 13: Save the Fine-Tuned Model

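We save only the LoRA adapter weights. The adapter is small relative to the base model (on the order of 100-250 MB at rank 16 across all attention and FFN projections, depending on the dtype it is saved in; the cell below prints the exact file sizes) and can be loaded on top of the base Qwen3-14B model at any time. The base model weights do not need to be stored alongside it, because they are downloaded from Hugging Face when needed.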
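
As a rough cross-check on adapter size, the number of LoRA parameters can be estimated from the layer shapes. The sketch below assumes the published Qwen3-14B configuration (hidden size 5120, FFN size 17408, 40 layers, 40 query heads and 8 key/value heads of dimension 128) and rank 16 on all seven projections. These values are assumptions, so verify them against `model.config` and your LoRA settings.

```python
# Assumed Qwen3-14B shapes; verify with model.config before relying on them.
HIDDEN, INTERMEDIATE, LAYERS = 5120, 17408, 40
Q_OUT = 40 * 128    # query heads * head_dim
KV_OUT = 8 * 128    # key/value heads * head_dim
RANK = 16           # LoRA rank used in this notebook

# (in_features, out_features) for each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (rank x in) and B (out x rank) for every adapted projection.
per_layer = sum(RANK * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
lora_params = per_layer * LAYERS

print(f"LoRA parameters: {lora_params:,}")
for label, nbytes in (("16-bit", 2), ("32-bit", 4)):
    print(f"  saved as {label}: {lora_params * nbytes / 1024**2:.0f} MB")
```

Under these assumptions the adapter has about 64 million parameters, so its file size depends mainly on the dtype it is saved in. Tokenizer files saved to the same directory add a few more megabytes. If the size printed by the save cell differs substantially, the rank or target modules differ from what is assumed here.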

### Loading Later

To use the fine-tuned model in a future session:
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "./outputs/qwen3_14b_jordan_peterson_lora",
    max_seq_length = 2048,
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)
```

### Optional: Save Full Merged Model or GGUF

The disabled cells below (wrapped in `if False:`) show how to save the full merged model (for vLLM deployment) or a GGUF file (for use with llama.cpp or Ollama). Change `if False:` to `if True:` to run one.

In [None]:
import os

LORA_OUTPUT_DIR = "./outputs/qwen3_14b_jordan_peterson_lora"

model.save_pretrained(LORA_OUTPUT_DIR)
tokenizer.save_pretrained(LORA_OUTPUT_DIR)

print(f"LoRA adapters saved to: {LORA_OUTPUT_DIR}\n")
total_size = 0
for fname in sorted(os.listdir(LORA_OUTPUT_DIR)):
    fpath = os.path.join(LORA_OUTPUT_DIR, fname)
    if os.path.isfile(fpath):
        size = os.path.getsize(fpath)
        total_size += size
        print(f"  {fname}: {size / 1024**2:.2f} MB")
print(f"\nTotal: {total_size / 1024**2:.2f} MB")

In [None]:
# ── Optional: Save as merged 16-bit model (large, ~28 GB) ─────────────────
# Use this for vLLM or other deployment frameworks that expect standalone weights (no separate adapter).
# WARNING: Requires substantial disk space.
if False:
    model.save_pretrained_merged(
        "./outputs/qwen3_14b_jordan_peterson_merged_16bit",
        tokenizer,
        save_method = "merged_16bit",
    )

# ── Optional: Save as GGUF for llama.cpp / Ollama ──────────────────────────
# q4_k_m is a good balance of quality and size.
if False:
    model.save_pretrained_gguf(
        "./outputs/qwen3_14b_jordan_peterson_gguf",
        tokenizer,
        quantization_method = "q4_k_m",
    )

---
## Step 14: How to Load the Fine-Tuned Model Later

The code cell below shows how to load the fine-tuned model in a new Python session without re-training. Set `LOAD_SAVED_MODEL = True` to activate it.
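
Before reloading in a fresh session, it can help to confirm that the adapter directory is complete. The check below is a small sketch: `missing_adapter_files` is a hypothetical helper, and the two file names are the ones PEFT writes by default in recent versions, which is an assumption about the installed library.

```python
from pathlib import Path

# Default PEFT output names (assumed; older versions may write a .bin file).
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_dir: str) -> list:
    """List the expected adapter files that are absent from adapter_dir."""
    root = Path(adapter_dir)
    return [name for name in EXPECTED_ADAPTER_FILES if not (root / name).is_file()]


missing = missing_adapter_files("./outputs/qwen3_14b_jordan_peterson_lora")
print("Adapter directory looks complete." if not missing else f"Missing: {missing}")
```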

In [None]:
LOAD_SAVED_MODEL = False  # Set to True to load from the saved LoRA adapters

if LOAD_SAVED_MODEL:
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name      = "./outputs/qwen3_14b_jordan_peterson_lora",
        max_seq_length  = 2048,
        dtype           = None,
        load_in_4bit    = True,
    )
    FastLanguageModel.for_inference(model)
    print("Fine-tuned Qwen3-14B loaded successfully!")
else:
    print("Using the model from the current training session.")
    print("Set LOAD_SAVED_MODEL = True to reload from saved adapters.")

---
## Summary

### What We Did

In this notebook we:

1. **Extracted text** from four Jordan Peterson books using PyMuPDF
2. **Chunked** the text into ~350-word passages with 50-word overlap
3. **Formatted** the passages as three-turn conversations (system / user / assistant)
4. **Loaded Qwen3-14B** in 4-bit quantization using Unsloth (~8-9 GB VRAM)
5. **Added LoRA adapters** (rank 16) targeting all attention and FFN projections
6. **Applied the Qwen3 ChatML template** with `enable_thinking=False` for clean training data
7. **Configured response-only training** using the ChatML boundary tokens (`<|im_start|>assistant\n`)
8. **Trained** for 1 epoch (~320 steps, ~8-11 minutes on RTX 4090)
9. **Tested inference** in both chat mode and thinking mode
10. **Saved** the LoRA adapters (a small fraction of the base model's size)
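
For reference, the chunking in step 2 can be sketched as a sliding window over words. This is a minimal illustration under the stated sizes (350 words per chunk, 50 words of overlap), not necessarily the exact code used earlier in the notebook; `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, size: int = 350, overlap: int = 50) -> list:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(1000))
parts = chunk_words(demo)
print(len(parts), "chunks; last chunk has", len(parts[-1].split()), "words")
```

The overlap means a sentence cut at one chunk boundary usually appears whole in the neighbouring chunk, at the cost of some duplicated text in the training set.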

### Qwen3-Specific Takeaways

- **ChatML format** (`<|im_start|>` / `<|im_end|>`) is Qwen3's native chat template — it's simpler and more widely adopted than GPT-OSS's Harmony format.
- **`enable_thinking=False`** in `apply_chat_template()` is essential for training — it prevents the model from expecting reasoning tokens that our book passages don't contain.
- **Thinking mode is preserved after fine-tuning** — even though we trained in non-thinking mode, the base model's thinking capability remains accessible via `enable_thinking=True` at inference.
- **Faster training than GPT-OSS 20B**: The smaller model (14B vs 20B) and larger batch size (2 vs 1) together yield roughly a 7-9x speedup in total training time (~73 minutes vs ~8-11 minutes).
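
To make the ChatML layout and the response-only boundary concrete, here is a hand-rolled sketch. It is a simplification: the real template lives in the tokenizer and may add further tokens (for example, an empty `<think>` block in the assistant turn), and `render_chatml` and `assistant_span` are hypothetical helpers. The assumption that the loss span includes the closing `<|im_end|>` is also ours.

```python
ASSISTANT_HEADER = "<|im_start|>assistant\n"
TURN_END = "<|im_end|>"


def render_chatml(messages: list) -> str:
    """Simplified ChatML rendering of a list of {"role", "content"} dicts."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}{TURN_END}\n" for m in messages
    )


def assistant_span(rendered: str) -> str:
    """Text after the last assistant header, i.e. what response-only training targets."""
    start = rendered.rindex(ASSISTANT_HEADER) + len(ASSISTANT_HEADER)
    end = rendered.index(TURN_END, start) + len(TURN_END)
    return rendered[start:end]


conversation = [
    {"role": "system", "content": "You write in the style of the source books."},
    {"role": "user", "content": "Write a passage about order and chaos."},
    {"role": "assistant", "content": "Order is explored territory."},
]
rendered = render_chatml(conversation)
print(rendered)
print("Loss is computed on:", repr(assistant_span(rendered)))
```

Everything before the assistant header (the system and user turns) is masked out, so the model is only penalized for how it writes the reply, not for how it predicts the prompt.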

### Differences From the GPT-OSS Notebook

| Aspect | GPT-OSS 20B | Qwen3-14B |
|--------|-------------|-----------|
| Chat template | Harmony | **ChatML** |
| Inference control | `reasoning_effort="low/medium/high"` | **`enable_thinking=True/False`** |
| Sampling settings | temperature=0.7, top_p=0.9 | **temperature=0.7, top_p=0.8, top_k=20** (chat) or **temperature=0.6, top_p=0.95, top_k=20** (thinking) |
| Batch size | 1 | **2** |
| Steps / epoch | ~641 | **~320** |
| Training time | ~73 min | **~8–11 min** |
| LoRA adapter size | ~30 MB | **~100-250 MB** (estimate; see the Step 13 output for the exact size) |

### Next Steps

- **More epochs**: Try `num_train_epochs=3` to deepen the style adaptation
- **Higher LoRA rank**: Try `r=32` (as in the reference notebook) for more capacity
- **Thinking mode fine-tuning**: Add reasoning-style examples to teach the model to think *like* Peterson before responding
- **GRPO reinforcement learning**: Use GRPO with a Peterson-vocabulary reward function to align generation toward his distinctive style
- **Comparison notebook**: Run the companion comparison notebook to quantitatively measure how much more Peterson-like this model is compared to the base
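
As a starting point for the GRPO idea, a vocabulary reward could look like the sketch below. It assumes the reward-function convention used by TRL's `GRPOTrainer` (a callable that receives `completions` plus extra keyword arguments and returns one float per completion); check this against the TRL version you have installed. `vocabulary_reward`, `STYLE_TERMS`, and the 5% saturation point are illustrative choices, not tuned values.

```python
import re

STYLE_TERMS = frozenset(
    {"chaos", "order", "meaning", "responsibility", "myth", "hero"}
)
SATURATION_DENSITY = 0.05  # reward stops growing past 5 hits per 100 words


def vocabulary_reward(completions, **kwargs) -> list:
    """Score each completion in [0, 1] by how densely it uses STYLE_TERMS."""
    rewards = []
    for completion in completions:
        # Conversational datasets pass a list of message dicts; plain ones pass strings.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        words = re.findall(r"[a-z']+", text.lower())
        hits = sum(w in STYLE_TERMS for w in words)
        density = hits / max(len(words), 1)
        # Cap the reward so that repeating keywords is not rewarded indefinitely.
        rewards.append(min(density / SATURATION_DENSITY, 1.0))
    return rewards


print(vocabulary_reward(["The cat sat on the mat.", "Order emerges from chaos."]))
```

A keyword reward on its own is easy to game (the model can learn to repeat the terms), so in practice it would likely need to be combined with a fluency or length signal.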