# Fine-tuning GPT-2 with Aggressive Weight Decay Scheduling

This notebook demonstrates how to integrate ScheduleAnything with HuggingFace Trainer
to schedule arbitrary optimizer hyperparameters during training. While most training
frameworks only support learning rate scheduling, ScheduleAnything enables scheduling
of any optimizer parameter—in this case, weight decay.

## What We're Demonstrating

We'll train GPT-2 on CNN/DailyMail articles while applying a custom weight decay schedule:
- **Warmup phase**: Weight decay increases from 0 to base value
- **Training phase**: Weight decay decreases from base value to 10% of base value
- **Simultaneously**: Learning rate follows its own cosine annealing schedule
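
To make the shape of this schedule concrete, here is a minimal pure-Python sketch of a warmup-then-cosine multiplier. It assumes linear warmup followed by cosine decay; the function name `warmup_cosine_multiplier` is hypothetical, and the exact formula ScheduleAnything uses may differ.

```python
import math

def warmup_cosine_multiplier(step, warmup_steps=100, total_steps=1000,
                             warmup_to=1.0, anneal_to=0.1):
    """Illustrative multiplier: linear warmup, then cosine decay (assumed form)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to warmup_to
        return warmup_to * step / warmup_steps
    # Fraction of the annealing phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return anneal_to + (warmup_to - anneal_to) * cosine

base_wd = 0.01
for step in (0, 50, 100, 550, 1000):
    # Effective weight decay is the base value times the multiplier
    print(step, base_wd * warmup_cosine_multiplier(step))
```

The multiplier starts at 0, reaches 1.0 at the end of warmup, and settles at 0.1 by the final step, which matches the two phases listed above.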

## Why This Matters

The ability to schedule arbitrary optimizer hyperparameters opens new possibilities for
training dynamics. More importantly, this integration works with existing frameworks—
HuggingFace Trainer treats our custom schedulers like any other scheduler, requiring
no modifications to the training loop itself.

This is a pedagogical example. The specific schedule chosen here is less important than
understanding how to bind custom schedules to any optimizer hyperparameter and integrate
them into production training pipelines.

## Setup and Imports

Only `datasets` is version-constrained here (`<4.0.0`); `transformers`, `torch`, and `torch-schedule-anything` install at whatever release is current. Deep learning libraries evolve rapidly, so if you need this example to keep working as written years from now, pin all four packages to a combination you have tested.

In [1]:
# Setup
%pip install -q transformers "datasets<4.0.0" torch torch-schedule-anything

# Imports
import torch
import os
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
from torch.optim import AdamW
import torch_schedule_anything as tsa

# Type hints
from transformers import PreTrainedTokenizer, PreTrainedModel
from torch.optim import Optimizer
from torch_schedule_anything import SynchronousSchedule
from datasets import Dataset


## Configuration

Configuration is mostly boilerplate, but a few points are worth highlighting.

Notice that schedule targets (LR_WARMUP_TARGET, LR_FINAL_TARGET, etc.) are expressed as multipliers rather than absolute values. This follows the convention of PyTorch's `LambdaLR`: the schedule is a function of the step that returns a factor, and the scheduler multiplies that factor by a base value. For example, the current learning rate is computed as `BASE_LR * schedule(timestep)`.

This design makes schedules reusable and composable. The same cosine annealing function can be applied to learning rate, weight decay, or any other hyperparameter by simply changing the base value and the schedule_target parameter.

In our configuration:
- Warmup to 1.0 means "reach 100% of base value"
- Anneal to 0.1 means "decrease to 10% of base value"

This allows us to independently configure schedules for different hyperparameters while keeping the same conceptual framework.
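
The multiplier convention can be checked with plain PyTorch. The sketch below uses `torch.optim.lr_scheduler.LambdaLR` on a single dummy parameter; the step function `multiplier` is a toy schedule chosen only for illustration, not the one used in this notebook.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 5e-5
param = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([param], lr=BASE_LR)

# The lambda returns a multiplier, not an absolute learning rate
def multiplier(step):
    return 1.0 if step < 2 else 0.1

scheduler = LambdaLR(optimizer, lr_lambda=multiplier)

observed = []
for _ in range(4):
    observed.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # step the optimizer before the scheduler
    scheduler.step()

# Each entry is BASE_LR times the multiplier for that step
print(observed)
```

Changing `BASE_LR` rescales every value in `observed` without touching the schedule function, which is what makes the same schedule reusable across hyperparameters.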

In [2]:
# Configuration
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
LOGGING_STEPS = 50

# Model config
MODEL_NAME = "gpt2"
MAX_LENGTH = 512

# Dataset config
DATASET_NAME = "cnn_dailymail"
DATASET_VERSION = "3.0.0"
DATASET_TEXT_FIELD = "article"

# Training config
BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 4
MAX_STEPS = 1000
MAX_EPOCHS = 3
WARMUP_STEPS = 100
EVAL_STEPS = 200
MAX_EVAL_SAMPLES = 1000

# Optimizer config
BASE_LR = 5e-5
BASE_WD = 0.01

# Learning rate schedule
LR_WARMUP_TARGET = 1.0
LR_FINAL_TARGET = 0.1

# Weight decay schedule
WD_WARMUP_TARGET = 1.0
WD_FINAL_TARGET = 0.1

# Precision
BF16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

# Output
OUTPUT_DIR = "./gpt2-finetuned-cnn-dailymail"

## Boilerplate Factories

The following factory functions handle standard HuggingFace setup for tokenizer, model,
dataset, and data collator. These don't require ScheduleAnything integration and follow
typical patterns you'd see in any HuggingFace training script.

In [3]:
# Factories
def make_tokenizer() -> PreTrainedTokenizer:
    """Load tokenizer and set pad token."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

def make_model() -> PreTrainedModel:
    """Load pretrained GPT-2 model."""
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    return model.to(DEVICE)

def make_dataset(tokenizer: PreTrainedTokenizer) -> tuple[Dataset, Dataset]:
    """Load cnn_dailymail dataset and tokenize."""
    dataset = load_dataset(DATASET_NAME, DATASET_VERSION)
    num_proc = os.cpu_count()

    def process_articles(examples):
        return tokenizer(
            examples[DATASET_TEXT_FIELD],
            truncation=True,
            max_length=MAX_LENGTH
        )

    train_dataset = dataset["train"].shuffle(seed=42)
    train_dataset = train_dataset.map(
        process_articles,
        batched=True,
        num_proc=num_proc,
        remove_columns=dataset["train"].column_names
    )

    eval_dataset = dataset["validation"].shuffle(seed=42)
    eval_dataset = eval_dataset.select(range(MAX_EVAL_SAMPLES))
    eval_dataset = eval_dataset.map(
        process_articles,
        batched=True,
        num_proc=num_proc,
        remove_columns=dataset["validation"].column_names
    )

    return train_dataset, eval_dataset

def make_data_collator(tokenizer: PreTrainedTokenizer) -> DataCollatorForLanguageModeling:
    """Create data collator for causal language modeling."""
    return DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )

## Optimizer and Schedule Factories

### Parameter Groups and Weight Decay

We separate model parameters into two groups: those that should experience weight
decay and those that shouldn't. This is a common production pattern based on the
observation that applying weight decay to all parameters can harm training.

The standard practice:
- Apply weight decay to weight matrices (parameters with two or more dimensions)
- Skip weight decay for biases and normalization parameters (1D parameters), and for embeddings

This separation serves two purposes in this example:
1. It matches real-world training configurations for better pedagogical value
2. It demonstrates ScheduleAnything's flexibility—the library can schedule weight_decay
   even when different parameter groups have different base weight decay values

When the weight decay scheduler runs, it scales the weight_decay value for the
'default' group. The 'no_wd' group stays at 0.0, because any multiplier applied to a
base value of 0.0 is still 0.0.
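
The grouping rule can be tried on a small stand-in model before applying it to GPT-2. In this sketch, `TinyLM` is a hypothetical toy module, and embeddings are detected by layer type rather than by parameter name, since name-based checks depend on how a given model names its layers.

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in model, used only to illustrate the grouping."""
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(10, 4)
        self.proj = nn.Linear(4, 4)
        self.norm = nn.LayerNorm(4)

model = TinyLM()

# Collect parameters owned by embedding layers by type, not by name
embedding_ids = {
    id(p)
    for module in model.modules()
    if isinstance(module, nn.Embedding)
    for p in module.parameters()
}

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Embeddings and all 1D parameters (biases, norm scales) skip weight decay
    if id(param) in embedding_ids or param.dim() == 1:
        no_decay.append(name)
    else:
        decay.append(name)

print("decay:", decay)
print("no decay:", no_decay)
```

Only the linear layer's weight matrix lands in the decay group; the embedding table, the bias, and both LayerNorm parameters are excluded.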

### ScheduleAnything Integration: The Core Concept

This is where ScheduleAnything binds to the optimizer to enable scheduling of
arbitrary hyperparameters. There are several key design elements at work:

**The schedule_target parameter**: This specifies which optimizer hyperparameter
to schedule. It can be 'lr' for learning rate, 'weight_decay' for weight decay,
or any other hyperparameter the optimizer supports. This is what enables scheduling beyond just learning rate.

**Binding to the optimizer**: Each scheduler must be created with a reference to
the optimizer it will modify. This binding happens at creation time and allows the scheduler to directly manipulate the optimizer's parameter groups during training.

**SynchronousSchedule wrapper**: When you need to schedule multiple hyperparameters, SynchronousSchedule coordinates their updates. It ensures both schedulers step together, maintaining synchronized training dynamics.

**Transparent to downstream code**: The returned SynchronousSchedule object
implements the standard PyTorch scheduler interface. HuggingFace Trainer (or any
other training framework) treats it like any other scheduler, calling .step() at
appropriate times without knowing ScheduleAnything is involved.

This design philosophy—bind early, appear standard—is what makes ScheduleAnything practical for real-world use despite adding significant new capabilities. You only pay for the complexity you need, not plumbing costs.
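
To show what "binding to the optimizer" can look like mechanically, here is a hand-rolled sketch in plain PyTorch. `WeightDecayMultiplier` is a hypothetical helper written for this illustration; it is not ScheduleAnything's implementation or API, only one way the same idea could be expressed.

```python
import torch
from torch.optim import AdamW

class WeightDecayMultiplier:
    """Hypothetical helper: scales each group's weight_decay by fn(step)."""

    def __init__(self, optimizer, fn):
        self.optimizer = optimizer   # bound at creation time
        self.fn = fn
        self.step_count = 0
        # Remember each group's starting value so scaling never compounds
        self.base_values = [g["weight_decay"] for g in optimizer.param_groups]
        self._apply()

    def _apply(self):
        scale = self.fn(self.step_count)
        for group, base in zip(self.optimizer.param_groups, self.base_values):
            group["weight_decay"] = base * scale

    def step(self):
        self.step_count += 1
        self._apply()

w = torch.nn.Parameter(torch.zeros(2, 2))
b = torch.nn.Parameter(torch.zeros(2))
optimizer = AdamW(
    [{"params": [w], "weight_decay": 0.01},
     {"params": [b], "weight_decay": 0.0}],
    lr=5e-5,
)

# Toy schedule: full weight decay at step 0, half from step 1 onward
scheduler = WeightDecayMultiplier(optimizer, lambda step: 0.5 if step >= 1 else 1.0)
scheduler.step()
print([g["weight_decay"] for g in optimizer.param_groups])
```

Because the helper writes directly into `optimizer.param_groups`, the optimizer picks up the new value on its next `step()` call, and the group whose base value is 0.0 stays at 0.0 regardless of the multiplier.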

In [4]:
# Optimizer and Scheduler
def make_optimizer(model: PreTrainedModel) -> Optimizer:
    """Create AdamW optimizer with parameter groups."""
    decay_params = []
    no_decay_params = []

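    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # GPT-2 names its embedding weights 'wte' and 'wpe', so a plain
        # 'embedding' substring check would miss them.
        is_embedding = any(key in name for key in ('embedding', 'wte', 'wpe'))
        if is_embedding or param.dim() == 1:
            no_decay_params.append(param)
        else:
            decay_params.append(param)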

    return AdamW([
        {'params': decay_params, 'weight_decay': BASE_WD, 'name': 'default'},
        {'params': no_decay_params, 'weight_decay': 0.0, 'name': 'no_wd'}
    ], lr=BASE_LR)

def make_scheduler(optimizer: Optimizer) -> SynchronousSchedule:
    """Create synchronized schedules for learning rate and weight decay."""
    lr_scheduler = tsa.cosine_annealing_with_warmup(
        optimizer,
        warmup_to_value=LR_WARMUP_TARGET,
        anneal_to_value=LR_FINAL_TARGET,
        num_warmup_steps=WARMUP_STEPS,
        num_training_steps=MAX_STEPS,
        schedule_target='lr'
    )

    wd_scheduler = tsa.cosine_annealing_with_warmup(
        optimizer,
        warmup_to_value=WD_WARMUP_TARGET,
        anneal_to_value=WD_FINAL_TARGET,
        num_warmup_steps=WARMUP_STEPS,
        num_training_steps=MAX_STEPS,
        schedule_target='weight_decay'
    )

    return tsa.SynchronousSchedule([lr_scheduler, wd_scheduler])


## Trainer Creation

### The Binding Requirement

Notice that this function creates the optimizer and scheduler internally, rather than letting the trainer make them. This is necessary due to the way ScheduleAnything works.

The schedulers must bind to the optimizer at creation time. This binding establishes the connection that allows schedulers to modify optimizer hyperparameters during training. Once bound, the scheduler and optimizer must be passed together to the Trainer as a tuple via the 'optimizers' parameter.

An alternative approach would be to subclass Trainer and override its optimizer and scheduler creation methods. Either approach works; this factory pattern is simpler for a pedagogical example.

After this setup, everything is standard: Trainer treats our custom schedulers like any other schedulers, calling .step() at the appropriate times during training. The training loop requires no modifications.

In [5]:
# Trainer
def make_trainer(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    train_dataset: Dataset,
    eval_dataset: Dataset,
    data_collator: DataCollatorForLanguageModeling
) -> Trainer:
    """Create Trainer with custom optimizer and scheduler."""
    optimizer = make_optimizer(model)
    scheduler = make_scheduler(optimizer)

    args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        # max_steps takes precedence over num_train_epochs when it is positive
        num_train_epochs=MAX_EPOCHS,
        max_steps=MAX_STEPS,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        eval_strategy="steps",
        eval_steps=EVAL_STEPS,
        logging_steps=LOGGING_STEPS,
        save_strategy="no",
        bf16=BF16,
        report_to="none"
    )

    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,  # recent transformers releases expect processing_class= instead
        data_collator=data_collator,
        optimizers=(optimizer, scheduler)
    )

## Main Execution

With all the setup complete, training consists of creating the necessary components, passing them to the trainer, and invoking train(). The ScheduleAnything integration happens transparently—the training loop doesn't need to know that weight decay is being scheduled alongside learning rate. From the perspective of the Trainer, these are just schedulers that get stepped at the appropriate times.

The flip side is that HuggingFace Trainer is not well suited to extensions that need to change the training loop itself, at least not without significantly more work. A better person than I may take up that torch if they so choose, but those cases generally end up using something like PyTorch Lightning anyhow.

In [None]:
# Main
def main():
    print("Creating tokenizer and model...")
    tokenizer = make_tokenizer()
    model = make_model()

    print("Loading dataset...")
    train_dataset, eval_dataset = make_dataset(tokenizer)
    data_collator = make_data_collator(tokenizer)

    print("Setting up trainer...")
    trainer = make_trainer(
        model,
        tokenizer,
        train_dataset,
        eval_dataset,
        data_collator
    )

    print("Evaluating baseline performance...")
    import math
    baseline_results = trainer.evaluate()
    baseline_perplexity = math.exp(baseline_results['eval_loss'])
    print(f"Baseline Perplexity: {baseline_perplexity:.2f}\n")

    print(f"Training for {MAX_STEPS} steps")
    print(f"Device: {DEVICE}")
    print(f"BF16: {BF16}\n")

    trainer.train()

    print("\nTraining complete!")

    print("\nEvaluating final performance...")
    eval_results = trainer.evaluate()
    final_perplexity = math.exp(eval_results['eval_loss'])
    print(f"Final Perplexity: {final_perplexity:.2f}")
    print(f"Improvement: {baseline_perplexity - final_perplexity:.2f} ({((baseline_perplexity - final_perplexity) / baseline_perplexity * 100):.1f}% reduction)")

if __name__ == '__main__':
    main()

Creating tokenizer and model...



Loading dataset...

