# Fine-Tuning GPT-OSS 20B on Jordan Peterson's Books Using Unsloth

## Overview

This notebook fine-tunes the **`unsloth/gpt-oss-20b-unsloth-bnb-4bit`** model using text extracted from Jordan Peterson's books. We use **Unsloth** to make training 2x faster and use significantly less VRAM than standard approaches.

### What is Fine-Tuning?

Fine-tuning is the process of taking a pre-trained language model (one that already understands language) and training it further on a specific dataset so it learns the style, knowledge, and patterns from that data. Think of it like this: the base model went to "general school" and learned language broadly. Fine-tuning is like sending it to a "specialized course" on Jordan Peterson's writings.

### What is Unsloth?

**[Unsloth](https://github.com/unslothai/unsloth)** is an open-source library that makes fine-tuning large language models (LLMs) significantly faster and more memory-efficient. According to the project, it trains about 2x faster and uses up to 80% less VRAM than standard methods. This matters because LLMs are enormous and typically require expensive hardware to train.

### What is LoRA (Low-Rank Adaptation)?

Instead of updating ALL 20 billion parameters in the model (which would require enormous amounts of memory and compute), LoRA adds small "adapter" layers that contain only a fraction of the parameters. We train only these adapters while keeping the original model frozen. This means we train well under 1% of the total parameters (on the order of hundredths of a percent, depending on the rank and the layers targeted) while still getting meaningful improvements in the model's behavior.

### What is 4-bit Quantization?

Normally, each model parameter is stored as a 16-bit or 32-bit floating-point number. 4-bit quantization compresses each parameter down to just 4 bits, reducing the model's memory footprint by 4x relative to 16-bit storage and 8x relative to 32-bit. The `bnb-4bit` in the model name means it uses the **bitsandbytes** library for this quantization. This allows a 20B parameter model to fit in ~12GB of VRAM instead of the ~40GB it would need at 16-bit precision.
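
The arithmetic behind these figures can be checked with a short sketch. It estimates the memory needed for the weights alone, assuming a round 20 billion parameters and 1 GB = 10^9 bytes; the function name is illustrative and not part of any library. The 4-bit estimate comes out slightly below the ~12GB quoted above. The gap is plausibly quantization metadata and any layers kept at higher precision, though that is an assumption rather than a measured breakdown.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


NUM_PARAMS = 20e9  # assumed round figure for a "20B" model

for bits in (32, 16, 4):
    gb = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gb:.0f} GB")
```

Activations, optimizer state, and the LoRA adapters come on top of this, which is why the weights-only number is a lower bound on actual VRAM use.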

### Our Data Source

We will extract text from four Jordan Peterson books (PDFs):
1. **Maps of Meaning: The Architecture of Belief** (1999)
2. **12 Rules for Life: An Antidote to Chaos** (2018)
3. **Beyond Order: 12 More Rules for Life** (2021)
4. **We Who Wrestle with God: Perceptions of the Divine** (2024)

The extracted text will be formatted into a conversational dataset suitable for supervised fine-tuning (SFT).

### Hardware

This notebook is designed to run on an **NVIDIA RTX 4090** (24GB VRAM).

---

## Step 1: Verify Environment and Imports

Before we start, let's verify that our Python environment has everything we need. We're using the `.finetuning` virtual environment which already has Unsloth and its dependencies installed.

We also need **PyMuPDF** (`fitz`) to extract text from PDF files. PyMuPDF is one of the fastest and most reliable PDF text extraction libraries for Python.

In [None]:
# Verify key packages are available
import importlib

required_packages = {
    'unsloth': 'Unsloth - fast fine-tuning library',
    'torch': 'PyTorch - deep learning framework',
    'transformers': 'HuggingFace Transformers - model loading and tokenization',
    'peft': 'PEFT - parameter-efficient fine-tuning (LoRA)',
    'trl': 'TRL - transformer reinforcement learning / SFT trainer',
    'datasets': 'HuggingFace Datasets - dataset handling',
    'fitz': 'PyMuPDF - PDF text extraction',
}

print("Checking required packages:\n")
for pkg, description in required_packages.items():
    try:
        m = importlib.import_module(pkg)
        version = getattr(m, '__version__', 'installed')
        print(f"  {pkg}: {version} -- {description}")
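    except ImportError:
        # PyMuPDF is imported as `fitz` but installed from PyPI as `pymupdf`
        pip_name = 'pymupdf' if pkg == 'fitz' else pkg
        print(f"  {pkg}: NOT FOUND -- {description}")
        print(f"    -> Install with: pip install {pip_name}")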

In [None]:
# Check GPU availability and specs
# This is critical - fine-tuning requires a CUDA-capable GPU
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU Memory: {gpu_mem:.1f} GB")
else:
    print("WARNING: No GPU detected! Fine-tuning will not work without a GPU.")

---

## Step 2: Extract Text from Jordan Peterson's Books

We need to convert the PDF books into plain text that the model can learn from. Here's what happens in this step:

1. **Read each PDF** using PyMuPDF (`fitz`), which parses the PDF format and extracts raw text from each page.
2. **Clean the text** by removing excessive whitespace, headers/footers, and other artifacts that commonly appear in PDF extraction.
3. **Chunk the text** into passages of manageable length. LLMs have a maximum sequence length (we'll use 2048 tokens), so we need to break the books into pieces that fit within this limit.

### Why Chunk the Text?

The model can only process a limited number of tokens at a time (our `max_seq_length` of 2048 tokens). A whole book would be hundreds of thousands of tokens. By splitting into chunks, each training example fits within the model's context window. We use overlapping chunks so that ideas that span chunk boundaries aren't lost.

**Important note on token budget:** The GPT-OSS chat template adds a substantial system preamble (~150 tokens) on top of our system prompt and user message. We need to account for this overhead when sizing our chunks, so we keep chunks to ~350 words (~470 tokens) to stay safely within the 2048 limit.
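
A rough budget check makes the headroom concrete. This is a sketch under stated assumptions: the tokens-per-word ratio is the one implied by the figures above (470 tokens for 350 words), and the prompt word counts are hypothetical placeholders. The real numbers should come from the tokenizer once the model is loaded.

```python
MAX_SEQ_LENGTH = 2048
CHUNK_WORDS = 350
TOKENS_PER_WORD = 470 / 350        # ratio implied by the estimate above
TEMPLATE_PREAMBLE_TOKENS = 150     # approximate figure quoted above


def estimate_tokens(num_words: int) -> int:
    """Very rough word-to-token conversion; not a substitute for the tokenizer."""
    return round(num_words * TOKENS_PER_WORD)


# Hypothetical word counts for the prompts (assumptions for illustration)
system_prompt_words = 60
user_prompt_words = 12

used = (
    TEMPLATE_PREAMBLE_TOKENS
    + estimate_tokens(system_prompt_words)
    + estimate_tokens(user_prompt_words)
    + estimate_tokens(CHUNK_WORDS)
)
headroom = MAX_SEQ_LENGTH - used

print(f"Estimated tokens per example: {used}")
print(f"Estimated headroom: {headroom}")
```

Under these assumptions each example uses well under half of the 2048-token window, so the chunk size could be raised if longer passages are preferred.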

In [None]:
import fitz  # PyMuPDF
import os
import re
from pathlib import Path

# Path to the Jordan Peterson books
BOOKS_DIR = Path("../../Books/JordanPeterson")

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extract all text from a PDF file using PyMuPDF.
    
    PyMuPDF reads each page of the PDF and extracts the text content.
    This works well for text-based PDFs (not scanned images).
    
    Args:
        pdf_path: Path to the PDF file
        
    Returns:
        The full text content of the PDF as a single string
    """
    text_parts = []

    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if text.strip():  # Only include pages that have text
                text_parts.append(text)

    return "\n".join(text_parts)


def clean_text(text: str) -> str:
    """
    Clean extracted PDF text by removing common artifacts.
    
    PDF extraction often introduces artifacts like:
    - Excessive newlines from page layouts
    - Page numbers and headers/footers
    - Multiple consecutive spaces
    - Non-printable characters
    
    This function normalizes the text while preserving paragraph structure.
    
    Args:
        text: Raw text extracted from PDF
        
    Returns:
        Cleaned text with normalized whitespace
    """
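    # Map common typographic punctuation to ASCII first; otherwise the
    # next step would turn curly apostrophes into spaces ("don't" -> "don t")
    text = text.translate(str.maketrans({
        '\u2018': "'", '\u2019': "'", '\u201c': '"', '\u201d': '"',
        '\u2013': '-', '\u2014': '-',
    }))

    # Replace remaining non-ASCII and control characters with a space
    # (newlines and tabs are kept)
    text = re.sub(r'[^\x20-\x7E\n\t]', ' ', text)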
    
    # Replace tabs with spaces
    text = text.replace('\t', ' ')
    
    # Collapse multiple spaces into one
    text = re.sub(r' +', ' ', text)
    
    # Collapse 3+ newlines into 2 (preserving paragraph breaks)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Remove lines that are just page numbers (common PDF artifact)
    text = re.sub(r'^\s*\d{1,4}\s*$', '', text, flags=re.MULTILINE)
    
    # Strip leading/trailing whitespace from each line
    lines = [line.strip() for line in text.split('\n')]
    text = '\n'.join(lines)
    
    # Collapse multiple newlines again after line stripping
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    return text.strip()


# Extract text from all PDFs
print("Extracting text from Jordan Peterson's books...\n")
print(f"Looking in: {BOOKS_DIR.resolve()}\n")

books = {}
pdf_files = sorted(BOOKS_DIR.glob("*.pdf"))

if not pdf_files:
    raise FileNotFoundError(f"No PDF files found in {BOOKS_DIR.resolve()}")

for pdf_path in pdf_files:
    print(f"Processing: {pdf_path.name}")
    raw_text = extract_text_from_pdf(str(pdf_path))
    cleaned = clean_text(raw_text)
    books[pdf_path.stem] = cleaned
    
    # Show stats for this book
    word_count = len(cleaned.split())
    char_count = len(cleaned)
    print(f"  -> {word_count:,} words, {char_count:,} characters\n")

# Total stats
total_words = sum(len(text.split()) for text in books.values())
total_chars = sum(len(text) for text in books.values())
print(f"\nTotal across all books: {total_words:,} words, {total_chars:,} characters")
print(f"Number of books processed: {len(books)}")

---

## Step 3: Create Training Dataset from Book Text

Now we need to convert the raw book text into a format the model can learn from. For **Supervised Fine-Tuning (SFT)**, we need the data in a conversational format — specifically, a list of messages with roles like `"system"`, `"user"`, and `"assistant"`.

### Our Approach: Passage-Based Q&A Format

We will structure each training example as:
- **System message**: Sets the context — tells the model it is an expert on Jordan Peterson's ideas
- **User message**: Asks the model to discuss, explain, or continue a passage from the book
- **Assistant message**: Contains the actual book text passage

This format teaches the model to produce text that sounds like Jordan Peterson when asked about topics from his books.
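
As a minimal sketch, one training record in this format looks like the following. The helper name `make_example` and the placeholder strings are illustrative only; the passage is a stand-in, not real book text.

```python
def make_example(system_prompt: str, user_prompt: str, passage: str) -> dict:
    """Build one conversation record with system, user, and assistant turns."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": passage},
        ]
    }


example = make_example(
    "You are an AI assistant trained on the works of Jordan B. Peterson.",  # shortened placeholder
    "Explain this concept in detail.",
    "<book passage goes here>",  # placeholder, not real book text
)

roles = [message["role"] for message in example["messages"]]
print(roles)
```

Only the assistant turn carries book text; the system and user turns are scaffolding that tells the model when to produce it.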

### Why This Format?

The GPT-OSS model uses OpenAI's [Harmony](https://github.com/openai/harmony) chat format. To fine-tune it effectively, our training data must match this conversational structure. The `tokenizer.apply_chat_template()` function handles all the special tokens and formatting automatically.

### Chunking Strategy

We split the text into chunks of approximately 350 words (~470 tokens) with 50-word overlaps between consecutive chunks. This ensures:
- Each chunk fits within our 2048-token sequence length (with room to spare for the chat template overhead)
- Ideas that span chunk boundaries are partially represented in adjacent chunks
- We leave room for the system/user prompt tokens and the GPT-OSS system preamble (~150 tokens)
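
The sliding window can be illustrated with word indices alone. The sketch below uses a hypothetical helper, `chunk_windows`, to list the `(start, end)` positions of each chunk for a text of a given length. With 350-word chunks and a 50-word overlap, the window advances 300 words at a time.

```python
def chunk_windows(total_words: int, chunk_size: int = 350,
                  overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) word indices for each overlapping chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each time
    return [
        (start, min(start + chunk_size, total_words))
        for start in range(0, total_words, stride)
    ]


windows = chunk_windows(1000)
for start, end in windows:
    print(f"words {start:>4} to {end:>4} ({end - start} words)")
```

For a 1000-word text this yields four windows, and each consecutive pair shares exactly 50 words. The last window is shorter than the rest, which is why a minimum-length filter on chunks is useful.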

In [None]:
def chunk_text(text: str, chunk_size: int = 350, overlap: int = 50) -> list[str]:
    """
    Split text into overlapping chunks by word count.
    
    We chunk by words rather than characters because the model processes
    tokens (which roughly correspond to words). A chunk of ~350 words
    translates to roughly 470 tokens, leaving plenty of room for the 
    system/user prompt and chat template overhead within our 2048 token limit.
    
    The overlap ensures continuity between chunks — a sentence that gets
    split at a chunk boundary will appear (at least partially) in both
    the current chunk and the next one.
    
    Args:
        text: The full text to chunk
        chunk_size: Number of words per chunk
        overlap: Number of overlapping words between consecutive chunks
        
    Returns:
        List of text chunks
    """
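    if overlap >= chunk_size:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    chunks = []
    start = 0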
    
    while start < len(words):
        end = start + chunk_size
        chunk = ' '.join(words[start:end])
        
        # Only include chunks that have meaningful content (at least 50 words)
        if len(chunk.split()) >= 50:
            chunks.append(chunk)
        
        # Move the window forward, but overlap with the previous chunk
        start = end - overlap
    
    return chunks


# Define the prompts that will be paired with each chunk.
# We rotate through these to add variety to the training data.
# Each prompt frames the book passage differently, teaching the model
# to respond to various types of requests about Peterson's ideas.
USER_PROMPTS = [
    "Please share your thoughts on the following topic from your writings.",
    "Can you elaborate on this idea from your work?",
    "Explain this concept in detail.",
    "What are your views on this subject?",
    "Continue discussing this topic.",
    "Tell me more about this idea.",
    "Share your perspective on this.",
    "Discuss the following in depth.",
]

# The system prompt that will be used for all training examples.
# This establishes the persona the model should adopt after fine-tuning.
SYSTEM_PROMPT = (
    "You are an AI assistant that has been trained on the complete works of Jordan B. Peterson, "
    "a Canadian clinical psychologist, professor, and author. You speak with deep knowledge of "
    "psychology, philosophy, mythology, religion, and personal responsibility. Your responses "
    "reflect Peterson's writing style, intellectual depth, and interdisciplinary approach to "
    "understanding human nature and meaning."
)


def create_training_examples(books: dict[str, str]) -> list[dict]:
    """
    Convert book text into conversational training examples.
    
    Each example is a conversation with:
    - A system message setting the Peterson-expert persona
    - A user message with a rotating prompt
    - An assistant message containing the book passage
    
    Args:
        books: Dictionary mapping book names to their full text
        
    Returns:
        List of conversation dictionaries in the format expected by
        the tokenizer's chat template
    """
    examples = []
    prompt_idx = 0
    
    for book_name, text in books.items():
        chunks = chunk_text(text)
        print(f"  {book_name}: {len(chunks)} chunks")
        
        for chunk in chunks:
            # Create a conversational training example
            messages = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": USER_PROMPTS[prompt_idx % len(USER_PROMPTS)]},
                {"role": "assistant", "content": chunk},
            ]
            examples.append({"messages": messages})
            prompt_idx += 1
    
    return examples


print("Creating training examples from book chunks...\n")
training_data = create_training_examples(books)
print(f"\nTotal training examples: {len(training_data)}")

In [None]:
# Let's look at one example to understand the format
print("=" * 80)
print("EXAMPLE TRAINING CONVERSATION (first example):")
print("=" * 80)

example = training_data[0]
for msg in example["messages"]:
    role = msg["role"].upper()
    content = msg["content"]
    # Truncate long content for display
    if len(content) > 300:
        content = content[:300] + "...[truncated]"
    print(f"\n[{role}]:\n{content}")

---

## Step 4: Convert to HuggingFace Dataset

The `SFTTrainer` from the `trl` library expects data in the HuggingFace `Dataset` format. We convert our list of conversation dictionaries into this format.

We also apply the **chat template** using the tokenizer. The GPT-OSS model uses a specific format with special tokens like `<|start|>`, `<|message|>`, `<|end|>`, and `<|channel|>`. The `tokenizer.apply_chat_template()` function handles all of this formatting automatically — we just need to provide conversations in the standard `messages` format.

In [None]:
from datasets import Dataset

# Convert our list of dictionaries to a HuggingFace Dataset
# The Dataset class provides efficient data handling, shuffling,
# and integration with the SFTTrainer.
dataset = Dataset.from_list(training_data)

# Shuffle the dataset so examples from different books are interleaved.
# This reduces the risk of the model drifting toward whichever book it
# saw last (an effect related to "catastrophic forgetting").
dataset = dataset.shuffle(seed=3407)

print(f"Dataset created with {len(dataset)} examples")
print(f"\nDataset features: {dataset.features}")
print(f"\nFirst example keys: {list(dataset[0].keys())}")

---

## Step 5: Load the Model with Unsloth

Now we load the GPT-OSS 20B model using Unsloth's `FastLanguageModel`. Here's what each parameter does:

- **`model_name`**: The HuggingFace model ID. We use `unsloth/gpt-oss-20b-unsloth-bnb-4bit` which is pre-quantized to 4-bit, meaning it's already compressed for efficient loading.
- **`max_seq_length`**: The maximum number of tokens the model processes at once. We use 2048, which gives ample room for our ~350-word chunks plus the system/user prompts and the chat template overhead that GPT-OSS adds.
- **`dtype`**: Set to `None` for automatic detection. Unsloth will pick the best precision for your GPU (bfloat16 if supported, otherwise float16).
- **`load_in_4bit`**: Enables 4-bit quantization via bitsandbytes. This compresses the model to fit in GPU memory.
- **`full_finetuning`**: Set to `False` because we're using LoRA (parameter-efficient), not full fine-tuning.

In [None]:
from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048  # Maximum token length per training example
dtype = None           # Auto-detect best dtype for the GPU

# Load the pre-quantized 4-bit model
# This downloads the model from HuggingFace (cached after first download)
# and loads it into GPU memory in 4-bit quantized format.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    dtype = dtype,
    max_seq_length = max_seq_length,
    load_in_4bit = True,         # Use 4-bit quantization to reduce memory
    full_finetuning = False,     # We use LoRA, not full fine-tuning
    # token = "YOUR_HF_TOKEN",  # Uncomment if model requires authentication
)

---

## Step 6: Add LoRA Adapters

Now we add **LoRA (Low-Rank Adaptation)** adapters to the model. This is the key to parameter-efficient fine-tuning.

### How LoRA Works

In a standard neural network layer, the weight matrix might be 4096x4096, or about 16.8 million parameters. LoRA leaves that matrix frozen and learns the update as the product of two much smaller matrices: one that is 4096xr and another that is rx4096, where r is the "rank" (our `r` parameter). With r = 16, as used in this notebook, that is 2 x 4096 x 16 = 131,072 trainable parameters (about 131K) instead of 16.8M; with r = 8 it would be about 65K.
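
The parameter counts follow directly from the matrix shapes. This short sketch computes them for a single 4096x4096 layer at ranks 8 and 16; the function names are illustrative.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters touched when updating the full weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two LoRA factors: (d_out x r) and (r x d_in)."""
    return d_out * r + r * d_in


d = 4096
full = full_update_params(d, d)
print(f"Full update: {full:,} parameters")

for r in (8, 16):
    n = lora_params(d, d, r)
    print(f"LoRA r={r:>2}: {n:,} parameters ({n / full:.2%} of the full matrix)")
```

Doubling the rank doubles the adapter size, so the choice of `r` trades capacity against memory linearly.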

### Parameter Explanation

- **`r = 16`**: The rank of the LoRA adapters. Higher values = more parameters to train = more capacity to learn, but also more memory. 16 is a good balance for learning writing style and knowledge. We use 16 instead of the reference notebook's 8 because we want the model to capture more nuance from Peterson's writing style.
- **`target_modules`**: Which layers in the model to add LoRA adapters to. We target all the attention layers (q/k/v/o projections) and feed-forward layers (gate/up/down projections). This gives the adapters access to both "what the model pays attention to" and "how it transforms information."
- **`lora_alpha = 32`**: A scaling factor for LoRA. The adapter's contribution is multiplied by `lora_alpha / r`, so with alpha=32 and r=16 the update is scaled by 2. A larger ratio acts much like a larger learning rate for the adapters.
- **`lora_dropout = 0`**: No dropout on LoRA layers. Unsloth optimizes for dropout=0.
- **`use_gradient_checkpointing = "unsloth"`**: Unsloth's custom gradient checkpointing uses 30% less VRAM than standard gradient checkpointing, allowing larger batch sizes.
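
To make the role of `lora_alpha / r` concrete, here is a toy LoRA forward pass in plain Python. It is a sketch for intuition only, not how Unsloth or PEFT implement it: the matrices are tiny, hand-picked, and use rank 1 with alpha 2 so that the scale matches the 32/16 = 2 used in this notebook.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha: float, r: int) -> list[float]:
    """Compute W x + (lora_alpha / r) * B (A x); W stays frozen."""
    scale = lora_alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (2x2 identity)
A = [[1.0, 1.0]]               # down-projection, shape (r x d_in) = (1 x 2)
B = [[0.5], [0.0]]             # up-projection, shape (d_out x r) = (2 x 1)
x = [1.0, 2.0]

y = lora_forward(W, A, B, x, lora_alpha=2, r=1)
print(y)

# With B set to zero, the adapter contributes nothing and the base output is unchanged
B_zero = [[0.0], [0.0]]
y_initial = lora_forward(W, A, B_zero, x, lora_alpha=2, r=1)
print(y_initial)
```

In the original LoRA formulation, `B` starts at zero, so training begins from the unmodified base model and the adapter's influence grows from there.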

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank: controls adapter capacity (8, 16, 32, 64, 128)
    target_modules = [
        "q_proj",    # Query projection in attention
        "k_proj",    # Key projection in attention
        "v_proj",    # Value projection in attention
        "o_proj",    # Output projection in attention
        "gate_proj", # Gate projection in feed-forward
        "up_proj",   # Up projection in feed-forward
        "down_proj", # Down projection in feed-forward
    ],
    lora_alpha = 32,    # Scaling factor (effective lr ~ alpha/r)
    lora_dropout = 0,   # No dropout (optimized by Unsloth)
    bias = "none",      # Don't train bias terms (optimized)
    use_gradient_checkpointing = "unsloth",  # 30% less VRAM!
    random_state = 3407,
    use_rslora = False,  # Rank-stabilized LoRA (not needed here)
    loftq_config = None, # LoftQ initialization (not needed here)
)

---

## Step 7: Format Dataset with Chat Template

The model needs the training data formatted with its specific chat template. The GPT-OSS format uses special tokens to delineate different parts of the conversation. The `tokenizer.apply_chat_template()` function converts our simple `messages` format into the model's native format.

For example, a message like `{"role": "user", "content": "Hello"}` gets converted to something like:
```
<|start|>user<|message|>Hello<|end|>
```

We also use `standardize_sharegpt()` from Unsloth to ensure our data conforms to the expected format before applying the template.
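
The basic shape of that conversion can be sketched without the tokenizer. The function below is a deliberately simplified, hypothetical renderer: it only wraps each message in the `<|start|>`, `<|message|>`, and `<|end|>` tokens shown above. The real template applied by `tokenizer.apply_chat_template()` does more, including the system preamble and channel handling mentioned earlier, so this is for intuition only.

```python
def render_simplified(messages: list[dict]) -> str:
    """Wrap each message in start/message/end markers (simplified, not the real template)."""
    return "".join(
        f"<|start|>{m['role']}<|message|>{m['content']}<|end|>"
        for m in messages
    )


conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]

rendered = render_simplified(conversation)
print(rendered)
```

Seeing the markers laid out this way also explains why the response-only masking later needs the exact marker string that precedes the assistant's text.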

In [None]:
from unsloth.chat_templates import standardize_sharegpt

def formatting_prompts_func(examples):
    """
    Apply the model's chat template to each conversation.
    
    This function is called by the dataset.map() method. It takes a batch
    of examples (each with a 'messages' field) and converts them into
    the model's native text format using the tokenizer's chat template.
    
    The resulting 'text' field is what the SFTTrainer will use for training.
    
    Args:
        examples: A batch of examples from the dataset
        
    Returns:
        Dictionary with 'text' field containing formatted conversations
    """
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(
            convo, 
            tokenize=False,            # Return text, not token IDs
            add_generation_prompt=False  # Don't add prompt at the end (we have the full conversation)
        ) 
        for convo in convos
    ]
    return {"text": texts}


# Standardize the dataset format to match what Unsloth expects
dataset = standardize_sharegpt(dataset)

# Apply the chat template to create the 'text' column
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"Dataset formatted. Columns: {dataset.column_names}")
print(f"Number of examples: {len(dataset)}")

In [None]:
# Let's examine what a formatted example looks like.
# This shows the exact text the model will be trained on,
# including all the special tokens.
print("=" * 80)
print("FORMATTED TRAINING EXAMPLE (first 1000 chars):")
print("=" * 80)
print(dataset[0]['text'][:1000])
print("...")

In [None]:
# Diagnostic: Check token counts to ensure examples fit within max_seq_length.
# If too many examples exceed the limit, they'll be dropped during training,
# resulting in an empty dataset. This cell helps catch that problem early.

token_counts = []
for i in range(min(50, len(dataset))):  # Sample first 50 examples
    tokens = tokenizer.encode(dataset[i]['text'])
    token_counts.append(len(tokens))

import statistics
print(f"Token count statistics (sampled from {len(token_counts)} examples):")
print(f"  Min:    {min(token_counts)}")
print(f"  Max:    {max(token_counts)}")
print(f"  Mean:   {statistics.mean(token_counts):.0f}")
print(f"  Median: {statistics.median(token_counts):.0f}")
print(f"  Max seq length: {max_seq_length}")
print()
over_limit = sum(1 for tc in token_counts if tc > max_seq_length)
print(f"  Examples over limit: {over_limit}/{len(token_counts)}")
if over_limit > 0:
    print(f"  WARNING: {over_limit} examples exceed max_seq_length and may be truncated or dropped!")

---

## Step 8: Configure the Trainer

We use `SFTTrainer` (Supervised Fine-Tuning Trainer) from the `trl` library. This trainer is specifically designed for fine-tuning language models on conversational data.

### Training Configuration Explained

- **`per_device_train_batch_size = 1`**: Process 1 example at a time per GPU. With a 20B model, even in 4-bit, we can only fit 1 example in memory at a time.
- **`gradient_accumulation_steps = 4`**: Accumulate gradients over 4 batches before updating weights. This simulates a batch size of 4 without needing 4x the memory. Larger effective batch sizes lead to more stable training.
- **`warmup_steps = 10`**: Gradually increase the learning rate from 0 to the target over the first 10 steps. This prevents the model from making too-large updates at the start when it hasn't "seen" much data yet.
- **`num_train_epochs = 1`**: Go through the entire dataset once. For book-based fine-tuning, 1-3 epochs is typical. More epochs risk "overfitting" (memorizing the text rather than learning the style).
- **`learning_rate = 2e-4`**: How much to adjust weights per update. 2e-4 is a standard rate for LoRA fine-tuning — small enough to avoid destroying the model's existing knowledge, large enough to learn new patterns.
- **`optim = "adamw_8bit"`**: The optimizer algorithm, using 8-bit quantization to save memory. AdamW is the standard optimizer for transformer training.
- **`weight_decay = 0.01`**: A regularization technique that slightly penalizes large weights, helping prevent overfitting.
- **`lr_scheduler_type = "cosine"`**: After warmup, the learning rate follows a cosine curve from its peak down toward zero. Cosine decay is a common choice for fine-tuning; linear decay is a reasonable alternative.
- **`output_dir`**: Where to save training checkpoints.
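
How these settings interact can be sketched in a few lines. The first function estimates the number of optimizer steps from the dataset size and the effective batch size; the second applies the commonly used formula for linear warmup followed by cosine decay. Both are approximations written for illustration: the trainer's own step counting and scheduler may differ in detail, and the dataset size of 4,000 examples is a hypothetical figure.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch_size: int,
                    grad_accum_steps: int, epochs: int = 1) -> int:
    """Approximate number of weight updates over the whole run."""
    effective_batch = per_device_batch_size * grad_accum_steps
    return math.ceil(num_examples / effective_batch) * epochs


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4,
               warmup_steps: int = 10) -> float:
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = optimizer_steps(num_examples=4000, per_device_batch_size=1, grad_accum_steps=4)
print(f"Optimizer steps: {total}")

for step in (0, 5, 10, total // 2, total):
    print(f"step {step:>4}: lr = {lr_at_step(step, total):.2e}")
```

The learning rate reaches its peak at the end of warmup and is roughly half the peak midway through the remaining steps.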

In [None]:
from trl import SFTConfig, SFTTrainer

# Define the output directory for checkpoints and the final model
OUTPUT_DIR = "./outputs/gpt_oss_20b_jordan_peterson"

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,   # 1 example per GPU at a time
        gradient_accumulation_steps = 4,    # Effective batch size = 1 * 4 = 4
        warmup_steps = 10,                  # Gradual learning rate warmup
        num_train_epochs = 1,               # 1 full pass through the data
        # max_steps = 30,                   # Uncomment to limit steps for testing
        learning_rate = 2e-4,               # Standard LoRA learning rate
        logging_steps = 1,                  # Log metrics every step
        optim = "adamw_8bit",               # Memory-efficient optimizer
        weight_decay = 0.01,                # Regularization
        lr_scheduler_type = "cosine",       # Cosine learning rate decay
        seed = 3407,                        # Reproducibility
        output_dir = OUTPUT_DIR,            # Checkpoint directory
        report_to = "none",                 # Disable WandB/TensorBoard logging
        fp16 = not torch.cuda.is_bf16_supported(),  # Use fp16 if bf16 not available
        bf16 = torch.cuda.is_bf16_supported(),      # Use bf16 if available (RTX 4090 supports it)
        save_strategy = "steps",            # Save checkpoints by step count
        save_steps = 100,                   # Save every 100 steps
        save_total_limit = 3,               # Keep only the 3 most recent checkpoints
    ),
)

---

## Step 9: Apply Response-Only Training

This is an important optimization: we configure the trainer to **only compute loss on the assistant's responses**, not on the system/user prompts.

### Why Train Only on Responses?

During training, the model processes the entire conversation and tries to predict each next token. Without this optimization, the model would try to learn to predict the system prompt and user messages too — which is wasteful because those are always provided as input during inference. We only want the model to learn how to generate good responses.

By masking the loss on the instruction parts (system + user), we:
1. **Improve training efficiency**: The model focuses its learning capacity on what matters
2. **Get a more meaningful loss**: The loss metric reflects only how well the model generates responses, not how well it predicts the fixed prompts
3. **Improve output quality**: All learning signal goes toward improving responses

The `instruction_part` and `response_part` parameters tell the trainer which special tokens mark the boundary between "input" (don't train on) and "output" (train on).

**Important:** The GPT-OSS chat template formats simple assistant messages with `<|start|>assistant<|message|>` (without the `<|channel|>final` part that appears in multi-channel conversations like the reference notebook's dataset). We must match the exact tokens our formatted data actually contains.
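
The masking idea can be shown on a toy sequence. This is a sketch of the concept, not Unsloth's implementation: it works on string pieces rather than token IDs, handles a single assistant turn, and uses hypothetical helper names. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # labels with this value do not contribute to the loss


def find_subsequence(seq: list, sub: list) -> int:
    """Return the index where sub first appears in seq, or -1 if absent."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_prompt(tokens: list, response_marker: list) -> list:
    """Mask everything up to and including the response marker."""
    start = find_subsequence(tokens, response_marker)
    if start < 0:
        raise ValueError("response marker not found")
    boundary = start + len(response_marker)
    return [IGNORE_INDEX] * boundary + tokens[boundary:]


tokens = [
    "<|start|>", "user", "<|message|>", "Explain", "<|end|>",
    "<|start|>", "assistant", "<|message|>", "Order", "and", "chaos", "<|end|>",
]
marker = ["<|start|>", "assistant", "<|message|>"]

labels = mask_prompt(tokens, marker)
print(labels)
```

If the marker does not match the formatted data exactly, nothing after it can be located, which is why the marker must be taken from the text the template actually produces.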

In [None]:
from unsloth.chat_templates import train_on_responses_only

# Auto-detect the correct response part token from our formatted data.
# The GPT-OSS template uses different formats depending on whether the
# conversation has channel info (analysis/final) or not.
# Our simple assistant messages use: <|start|>assistant<|message|>
# The reference notebook's multi-channel data uses: <|start|>assistant<|channel|>final<|message|>
sample_text = dataset[0]['text']
if "<|start|>assistant<|channel|>final<|message|>" in sample_text:
    response_part = "<|start|>assistant<|channel|>final<|message|>"
elif "<|start|>assistant<|message|>" in sample_text:
    response_part = "<|start|>assistant<|message|>"
else:
    raise ValueError("Could not find assistant response marker in formatted text!")

print(f"Detected response_part: {response_part}")

gpt_oss_kwargs = dict(
    instruction_part = "<|start|>user<|message|>",
    response_part = response_part,
)

trainer = train_on_responses_only(
    trainer,
    **gpt_oss_kwargs,
)

print("Response-only training configured.")
print("The model will only learn from assistant responses, not from prompts.")

In [None]:
# Verify that masking is working correctly.
# The first print shows the full tokenized input.
# The second print shows only the parts the model will train on
# (everything else is replaced with padding tokens, shown as spaces).

dataset_size = len(trainer.train_dataset)
print(f"Training dataset size after tokenization: {dataset_size}")

if dataset_size == 0:
    print("\nERROR: Training dataset is empty after tokenization!")
    print("This usually means all examples exceeded max_seq_length and were dropped.")
    print("Try reducing chunk_size or increasing max_seq_length.")
else:
    sample_idx = 0
    print(f"\n{'=' * 80}")
    print("FULL TOKENIZED INPUT (what the model sees):")
    print("=" * 80)
    full_text = tokenizer.decode(trainer.train_dataset[sample_idx]["input_ids"])
    print(full_text[:500] + "...\n")

    print("=" * 80)
    print("MASKED LABELS (what the model trains on - spaces are masked out):")
    print("=" * 80)
    labels = trainer.train_dataset[sample_idx]["labels"]
    masked_text = tokenizer.decode(
        [tokenizer.pad_token_id if x == -100 else x for x in labels]
    ).replace(tokenizer.pad_token, " ")
    print(masked_text[:500] + "...")

---

## Step 10: Check Memory Before Training

Let's see how much GPU memory the model is using before training starts. This helps us understand how much headroom we have and whether we might encounter out-of-memory errors.

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU: {gpu_stats.name}")
print(f"Total GPU Memory: {max_memory} GB")
print(f"Memory reserved before training: {start_gpu_memory} GB")
print(f"Available for training: {max_memory - start_gpu_memory:.3f} GB")

---

## Step 11: Train the Model!

This is the main training step. The trainer will:

1. Iterate through the dataset in batches
2. For each batch, feed the text through the model
3. Compute the loss (how wrong the model's predictions were) on the assistant responses only
4. Compute gradients (which direction to adjust weights)
5. Accumulate gradients over 4 steps
6. Update the LoRA adapter weights
7. Log the loss and learning rate

**What to watch for during training:**
- **Loss should decrease** over time, indicating the model is learning
- **Loss shouldn't drop to near 0**, which would indicate overfitting (memorization)
- As a rough guide, a final loss somewhere between 0.5 and 2.0 is plausible for this type of fine-tuning; the exact value depends on the data and on which tokens are masked
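
One way to read the loss values is to convert them to perplexity. The training loss is a mean cross-entropy per token in natural-log units, so perplexity is simply `exp(loss)`. The sketch below prints the conversion for a few sample loss values, which are illustrative and not results from this notebook.

```python
import math


def perplexity(loss: float) -> float:
    """Convert mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss)


for loss in (0.5, 1.0, 2.0):
    print(f"loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")
```

A perplexity close to 1 means the model predicts nearly every next token with certainty, which on a small dataset usually signals memorization rather than learning.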

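Training time depends on the dataset size and number of epochs. On an RTX 4090, a rough expectation is 1-2 seconds per training step, but this varies with sequence length and settings, so rely on the progress bar's own estimate once training starts.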

In [None]:
# Start training!
# This will take a while depending on your dataset size.
# You'll see a progress bar with loss values.
trainer_stats = trainer.train()

In [None]:
# Show training statistics and memory usage
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_percentage = round(used_memory_for_training / max_memory * 100, 3)

runtime_seconds = trainer_stats.metrics['train_runtime']
runtime_minutes = round(runtime_seconds / 60, 2)
train_loss = trainer_stats.metrics.get('train_loss', 'N/A')

print("=" * 60)
print("TRAINING COMPLETE")
print("=" * 60)
print(f"\nTime: {runtime_seconds:.1f} seconds ({runtime_minutes} minutes)")
print(f"Average training loss: {train_loss}")  # 'train_loss' is averaged over the whole run
print(f"\nMemory Usage:")
print(f"  Peak reserved memory: {used_memory} GB")
print(f"  Memory used for training: {used_memory_for_training} GB")
print(f"  Peak memory % of total: {used_percentage}%")
print(f"  Training memory % of total: {training_percentage}%")

---

## Step 12: Test the Fine-Tuned Model (Inference)

Let's test our fine-tuned model! We'll ask it questions related to the topics in Jordan Peterson's books to see if it has learned his writing style and ideas.

The GPT-OSS model supports a `reasoning_effort` parameter that controls how much the model "thinks" before responding:
- **low**: Fast, less detailed responses
- **medium**: Balanced
- **high**: Most detailed, more reasoning tokens used

In [None]:
from transformers import TextStreamer

# Put model in inference mode (disables dropout, optimizes for generation)
FastLanguageModel.for_inference(model)

def ask_model(question: str, reasoning_effort: str = "medium", max_tokens: int = 512):
    """
    Ask the fine-tuned model a question and stream the response.
    
    Args:
        question: The question to ask
        reasoning_effort: 'low', 'medium', or 'high'
        max_tokens: Maximum number of tokens to generate
    """
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        reasoning_effort=reasoning_effort,
    ).to("cuda")
    
    print(f"\n{'='*60}")
    print(f"Question: {question}")
    print(f"Reasoning Effort: {reasoning_effort}")
    print(f"{'='*60}\n")
    
    # TextStreamer prints tokens as they're generated (like ChatGPT's typing effect)
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        streamer=streamer,
        do_sample=True,         # Enable sampling so temperature and top_p take effect
        temperature=0.7,        # Lower = more deterministic, higher = more varied
        top_p=0.9,              # Nucleus sampling: smallest token set with cumulative probability >= 0.9
        repetition_penalty=1.1, # Slightly penalize repetition
    )
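
To illustrate what `temperature` and `top_p` do to the next-token distribution, here is a small sketch in plain Python over a made-up 4-token vocabulary. It mirrors the general idea of temperature scaling followed by nucleus filtering. It is not the library's implementation, and `softmax_with_temperature` and `top_p_filter` are hypothetical helper names.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}  # renormalized


# Made-up logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=0.7)
print(top_p_filter(probs, top_p=0.9))
```

A lower temperature sharpens the distribution before filtering, so fewer tokens are needed to reach the `top_p` threshold.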

In [None]:
# Test 1: A topic central to "12 Rules for Life"
ask_model("What is the importance of taking responsibility in one's life?")

In [None]:
# Test 2: A topic from "Maps of Meaning"
ask_model("How do myths and stories help us understand the nature of reality?")

In [None]:
# Test 3: A topic from "Beyond Order"
ask_model("What does it mean to pursue what is meaningful rather than what is expedient?")

In [None]:
# Test 4: A topic from "We Who Wrestle with God"
ask_model(
    "What is the relationship between suffering and meaning in human existence?",
    reasoning_effort="high",
    max_tokens=768,
)

---

## Step 13: Save the Fine-Tuned Model

We can save our fine-tuned model in different ways:

### LoRA Adapters Only (Recommended for Storage)
Saves just the trained LoRA adapter weights (~20-50MB). To use the model later, you load the base model and then apply these adapters on top. This is the most storage-efficient option.
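
The small size follows from how LoRA is parameterized: a layer with input width `d_in` and output width `d_out` gains `r * (d_in + d_out)` trainable parameters at rank `r`. The sketch below works through that arithmetic with hypothetical layer shapes. They are not the actual GPT-OSS dimensions, and float32 storage is an assumption.

```python
def lora_param_count(layer_shapes, rank):
    """Count LoRA parameters: each (d_in, d_out) layer adds rank * (d_in + d_out)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)


def size_mb(num_params, bytes_per_param=4):
    """Approximate on-disk size in MB, assuming float32 storage by default."""
    return num_params * bytes_per_param / 1024 / 1024


# Hypothetical shapes: 4 square projections of width 2048 in each of 24 layers.
shapes = [(2048, 2048)] * 4 * 24
params = lora_param_count(shapes, rank=8)
print(f"{params:,} parameters, about {size_mb(params):.1f} MB in float32")
```

Doubling the rank doubles the adapter size, which is worth keeping in mind when experimenting with larger `r` values.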

### Merged Model (For Deployment)
Merges the LoRA adapters back into the base model weights and saves the full model. This is larger but simpler to deploy since you only need one set of files.
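
Conceptually, merging folds the adapter back into each weight matrix using the standard LoRA formulation `W_merged = W + (alpha / r) * B @ A`. The toy example below shows this with tiny made-up matrices in plain Python. The real merge also has to handle the quantized base weights, which this sketch ignores.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with rank-1 adapters (made-up numbers).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (rank, d_in)
lora_b = [[0.5], [0.25]]   # shape (d_out, rank)
print(merge_lora(weight, lora_a, lora_b, alpha=2, rank=1))
```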

In [None]:
# Save as LoRA adapters (small, ~20-50MB)
LORA_OUTPUT_DIR = "./outputs/gpt_oss_20b_jordan_peterson_lora"

model.save_pretrained(LORA_OUTPUT_DIR)
tokenizer.save_pretrained(LORA_OUTPUT_DIR)

print(f"LoRA adapters saved to: {LORA_OUTPUT_DIR}")

# Show the saved files and their sizes
import os
total_size = 0
for f in sorted(os.listdir(LORA_OUTPUT_DIR)):
    fpath = os.path.join(LORA_OUTPUT_DIR, f)
    if os.path.isfile(fpath):
        size = os.path.getsize(fpath)
        total_size += size
        print(f"  {f}: {size / 1024 / 1024:.2f} MB")
print(f"\nTotal size: {total_size / 1024 / 1024:.2f} MB")

In [None]:
# Optional: Save as merged 16-bit model (larger, ~40GB, but simpler to deploy)
# Uncomment the lines below if you want to save the full merged model.
# WARNING: This requires significant disk space!

# MERGED_OUTPUT_DIR = "./outputs/gpt_oss_20b_jordan_peterson_merged_16bit"
# model.save_pretrained_merged(MERGED_OUTPUT_DIR, tokenizer, save_method="merged_16bit")
# print(f"Merged 16-bit model saved to: {MERGED_OUTPUT_DIR}")
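
Before uncommenting the merged save, it may be worth checking free disk space first. The following is a minimal sketch using only the standard library. `has_free_space` is a hypothetical helper, `"."` stands in for wherever the output directory lives, and the 40 GB figure is the rough estimate quoted above.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem containing `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= required_gb


# "." is a placeholder for the output location; 40 GB is a rough estimate.
if has_free_space(".", required_gb=40):
    print("Enough free space for the merged model.")
else:
    print("Not enough free space; consider saving LoRA adapters only.")
```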

---

## Step 14: How to Load the Fine-Tuned Model Later

Once saved, you can load the fine-tuned model in a new session without re-training. Here's how:

In [None]:
# To load the fine-tuned model in a new session, run this cell.
# Set the condition to True when you want to load a previously saved model.

LOAD_SAVED_MODEL = False  # Change to True to load from saved LoRA adapters

if LOAD_SAVED_MODEL:
    from unsloth import FastLanguageModel
    import torch
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "./outputs/gpt_oss_20b_jordan_peterson_lora",
        max_seq_length = 2048,  # Should match the value used during training
        dtype = None,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model)
    print("Fine-tuned model loaded successfully!")
else:
    print("Using the model from the current training session.")
    print("Set LOAD_SAVED_MODEL = True to load from saved adapters instead.")
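
A quick sanity check before loading can catch a wrong path early. This sketch assumes the adapter directory contains `adapter_config.json` and `adapter_model.safetensors`, the file names PEFT typically writes. Older versions may write `adapter_model.bin` instead, so compare against the file listing printed in Step 13. `missing_adapter_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Assumed file names; verify against the files actually saved in Step 13.
EXPECTED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_dir):
    """Return the expected adapter files that are absent from adapter_dir."""
    root = Path(adapter_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstrate on a temporary directory instead of the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    print(missing_adapter_files(tmp))   # nothing saved yet
    for name in EXPECTED_FILES:
        (Path(tmp) / name).touch()
    print(missing_adapter_files(tmp))   # all expected files present
```

In the notebook, the same function could be called with the LoRA output directory before setting `LOAD_SAVED_MODEL = True`.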

---

## Summary

In this notebook, we:

1. **Extracted text** from 4 Jordan Peterson books using PyMuPDF
2. **Created a training dataset** by chunking the text into ~350-word passages and formatting them as conversations
3. **Loaded the GPT-OSS 20B model** in 4-bit quantization using Unsloth for 2x faster training
4. **Added LoRA adapters** to train only ~0.02% of the model's parameters
5. **Applied response-only training** so the model only learns from the assistant responses
6. **Trained the model** using SFTTrainer with optimized settings
7. **Tested the model** with questions related to Peterson's ideas
8. **Saved the fine-tuned model** as LoRA adapters for future use

### Key Takeaways

- **LoRA** makes it possible to fine-tune a 20B parameter model on a single consumer GPU
- **4-bit quantization** compresses the model to fit in ~12GB of VRAM
- **Unsloth** provides 2x speedup and significant VRAM savings over standard training
- **Response-only training** focuses the loss on the assistant responses instead of the prompt tokens
- The fine-tuned model can pick up Jordan Peterson's writing style and discuss his ideas, which the test prompts in Step 12 let you check

### Next Steps

- **Increase epochs**: Try `num_train_epochs = 2` or `3` for deeper learning (watch for overfitting)
- **Adjust LoRA rank**: Try `r = 32` or `r = 64` for more capacity
- **Increase chunk size**: Try larger chunks if token counts have headroom within `max_seq_length`
- **Try GRPO**: Use reinforcement learning to further align the model's outputs
- **Push to HuggingFace Hub**: Share your fine-tuned model with the community