# Fine-Tuning a Transformer

**From pre-trained model to aligned assistant through post-training**

## What is Post-Training?

**Post-training** (often loosely called fine-tuning or alignment) is the process of taking a pre-trained language model and teaching it to be helpful, harmless, and honest. Pre-training teaches a model to predict the next token; post-training teaches it to:

- **Follow instructions** — Respond appropriately to user requests
- **Align with human preferences** — Generate responses humans actually prefer
- **Refuse harmful requests** — Decline to help with dangerous or unethical tasks
- **Be truthful** — Acknowledge uncertainty and avoid making things up

This is what transforms a base model (which just completes text) into an assistant (which helps users).

## The Post-Training Pipeline

Modern AI assistants like GPT-4, Claude, and Llama go through a multi-stage training process:

```
┌─────────────────────────────────────────────────────────────────────┐
│                        PRE-TRAINING                                  │
│  Train on massive text corpus to learn language patterns            │
│  Result: Base model that can complete text                          │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│              SUPERVISED FINE-TUNING (SFT)                           │
│  Train on (instruction, response) pairs                             │
│  Result: Model that follows instructions                            │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│              PREFERENCE ALIGNMENT                                    │
│  RLHF: Train reward model, then optimize with PPO                   │
│  DPO: Directly optimize on preference pairs                         │
│  Result: Model aligned with human preferences                       │
└─────────────────────────────────────────────────────────────────────┘
```

## Learning Path

In this section, we'll implement the complete post-training pipeline:

1. **Why Post-Training Matters** — Understand the gap between pre-trained and aligned models
2. **Supervised Fine-Tuning (SFT)** — Train models to follow instructions
3. **Reward Modeling** — Learn to predict human preferences
4. **RLHF with PPO** — Optimize models using reinforcement learning
5. **Direct Preference Optimization (DPO)** — A simpler alternative to RLHF
6. **Advanced Topics** — Memory optimization, hyperparameters, evaluation

```python
# Check our environment
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM

print(f"PyTorch version: {torch.__version__}")
# Note: ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API,
# so "CUDA available: True" can appear on a Radeon card.
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
```

```
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX
```


## The Three Pillars of Post-Training

| Stage | Input Data | Objective | Output |
|-------|-----------|-----------|--------|
| **SFT** | (instruction, response) pairs | Maximize P(response \| instruction) | Instruction-following model |
| **Reward Model** | (prompt, chosen, rejected) triples | Rank chosen > rejected | Preference predictor |
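| **RLHF/DPO** | Prompts + reward model (RLHF) or preference pairs (DPO) | Maximize expected reward (RLHF) or the chosen-over-rejected margin (DPO), staying close to the SFT model | Aligned model |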
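
As a rough illustration of the "Input Data" column, the records for each stage might look like the following. The field names (`instruction`, `response`, `prompt`, `chosen`, `rejected`) and the example text are assumptions made for this sketch, not a fixed schema; real datasets use varying formats.

```python
# Hypothetical example records; field names are illustrative, not a standard schema.

# SFT: one demonstration of the desired response.
sft_example = {
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}

# Reward modeling and DPO: one prompt with a preferred and a dispreferred response.
preference_example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "What is the capital of Germany?",
}

# RLHF with PPO: only a prompt is stored. The policy generates the response
# during training and the reward model scores it.
ppo_example = {"prompt": "What is the capital of France?"}

for name, record in [
    ("SFT", sft_example),
    ("Preference", preference_example),
    ("PPO", ppo_example),
]:
    print(f"{name}: fields = {sorted(record)}")
```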

## Key Concepts

**Supervised Fine-Tuning (SFT):**
Train the model to generate good responses by showing it examples. Simple and effective, but limited by the quality and diversity of demonstrations.
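
The SFT objective can be sketched in plain Python as the average negative log-likelihood over response tokens. This is a minimal sketch under stated assumptions: the name `sft_loss`, the per-token log-probabilities, and the mask values are made up for illustration, and masking out the instruction tokens is one common choice rather than the only one. In real training these would be tensors computed from model logits.

```python
def sft_loss(token_logprobs, response_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log P(token | previous tokens) per position (hypothetical values).
    response_mask:  1 for response tokens, 0 for instruction (prompt) tokens.
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must have the same length")
    n_response = sum(response_mask)
    if n_response == 0:
        raise ValueError("mask selects no response tokens")
    # Instruction tokens are multiplied by 0, so they do not contribute to the loss.
    total = -sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    return total / n_response


# Illustrative numbers: 3 instruction tokens followed by 2 response tokens.
logprobs = [-2.0, -1.5, -0.5, -0.25, -0.75]
mask = [0, 0, 0, 1, 1]
loss = sft_loss(logprobs, mask)
print(f"SFT loss: {loss}")  # (0.25 + 0.75) / 2 = 0.5
```

Minimizing this quantity is the same as maximizing P(response | instruction) under the model.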

**Reinforcement Learning from Human Feedback (RLHF):**
First train a reward model to predict which responses humans prefer, then use reinforcement learning (PPO) to optimize the language model to maximize this reward.
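
The two steps can be sketched with scalar values in plain Python. This is an illustration under assumptions, not a full RLHF implementation: the pairwise logistic loss is one common choice for training the reward model, the KL-style penalty is one common way to keep the policy near a reference model during the RL step, and the function names, the `beta` value, and all numbers are hypothetical.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)), split by sign so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(reward, policy_logp, ref_logp, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference model.

    policy_logp / ref_logp: log-probability of the sampled response under the
    current policy and under the frozen reference model (hypothetical values).
    """
    return reward - beta * (policy_logp - ref_logp)


# Step 1: the reward model is penalized less when it ranks the pair correctly.
correct = pairwise_ranking_loss(2.0, 0.0)
undecided = pairwise_ranking_loss(0.0, 0.0)
wrong = pairwise_ranking_loss(0.0, 2.0)
print(f"correct={correct:.4f} undecided={undecided:.4f} wrong={wrong:.4f}")

# Step 2: the RL step sees the reward minus the drift penalty.
shaped = kl_penalized_reward(reward=1.0, policy_logp=-8.0, ref_logp=-10.0)
print(f"shaped reward: {shaped:.4f}")
```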

**Direct Preference Optimization (DPO):**
DPO skips the separate reward model and the reinforcement learning loop. It reformulates the RLHF objective as a classification-style loss on preference pairs, which typically makes training simpler and more stable.
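
For a single preference pair, the DPO loss can be sketched with scalar log-probabilities in plain Python. The function names, the `beta` default, and the numbers below are assumptions for illustration; in practice the log-probabilities are summed over response tokens and computed in batches from the policy and a frozen reference model.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Binary classification: the chosen response should have the larger log-ratio.
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: the loss is log(2), about 0.693.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

# Policy has shifted toward the rejected response: the loss rises.
worse = dpo_loss(-12.0, -10.0, -10.0, -12.0)

print(f"at_start={at_start:.4f} improved={improved:.4f} worse={worse:.4f}")
```

No reward model appears anywhere in the computation: the log-ratio between the policy and the reference plays the role of an implicit reward.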

```python
# Let's see the difference between a base model and an instruction-tuned model
prompt = "What is the capital of France?"

# Base model completion (simulated)
base_completion = """What is the capital of France? The capital of Germany? 
The capital of Italy? These are questions that many students..."""

# Instruction-tuned model response (simulated)
instruct_response = """The capital of France is Paris. It's located in the 
north-central part of the country along the Seine River."""

print("Base Model (just completes text):")
print(f"  Input: {prompt}")
print(f"  Output: {base_completion[:80]}...")
print()
print("Instruction-Tuned Model (answers questions):")
print(f"  Input: {prompt}")
print(f"  Output: {instruct_response}")
```

```
Base Model (just completes text):
  Input: What is the capital of France?
  Output: What is the capital of France? The capital of Germany? 
The capital of Italy? Th...

Instruction-Tuned Model (answers questions):
  Input: What is the capital of France?
  Output: The capital of France is Paris. It's located in the 
north-central part of the country along the Seine River.
```


## Let's Begin!

In the following notebooks, we'll implement each component of the post-training pipeline with executable code you can run and modify.

Ready? Let's start by understanding why post-training matters.