# Scheduling Logical Batch Size in PyTorch

This example demonstrates extending existing optimizers with ScheduleAnything by implementing a flavor of logical batch size through scheduling.

Pay attention to the decoupling of concerns this can bring about.

## Environment Setup and Imports

We use magic commands to ensure the environment is set up. Then we run all the needed imports. Note the canonical ScheduleAnything import pattern, `torch_schedule_anything` imported as `tsa`:

```
import torch_schedule_anything as tsa
```

In [1]:
# Setup
%pip install -q transformers datasets torch-schedule-anything torch

# Imports
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
import torch_schedule_anything as tsa

# Type hints
from torch_schedule_anything import SynchronousSchedule
from transformers import PreTrainedTokenizer, PreTrainedModel
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LRScheduler

## Configuration

For easy experimentation, we place the majority of the hyperparameters right here, though we do hardwire the dataset. For the most part, we stick to some fairly boring configurations that should be familiar boilerplate to anyone in NLP.

Training duration and details are specified in terms of number of batches, the learning rate is set to a value that is known to train, and the schedules are functional.

### Schedule Overview

Scheduling with the builtins in this library generally works by specifying a number of warmup steps (in this case batches), a number of training steps, and some parameters for the warmup and final targets.

Always keep in mind that torch schedules are applied as `value(t) = base_hyperparameter * lambda(t)`, meaning the final value is the base value times a multiplier.
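The multiplier rule can be checked directly with PyTorch's own `LambdaLR`. This is a minimal sketch with a throwaway parameter and a linear warmup multiplier chosen for illustration; it does not use ScheduleAnything, and the linear warmup shape is an assumption based on the logged values at the end of this notebook.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 0.00001
WARMUP_STEPS = 500

# A single throwaway parameter is enough to build an optimizer.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = SGD([param], lr=BASE_LR)

# lambda(t): linear ramp from 0 to 1 over the warmup, then constant.
scheduler = LambdaLR(optimizer, lr_lambda=lambda t: min(t / WARMUP_STEPS, 1.0))

for _ in range(5):
    optimizer.step()   # no gradients yet, so this is a no-op
    scheduler.step()

# value(5) = BASE_LR * lambda(5) = 1e-5 * 5/500
lr_after_5 = scheduler.get_last_lr()[0]
print(f"{lr_after_5:.4e}")
```

After five steps the learning rate is the base value times `5/500`, which is the same order of magnitude as the first logged learning rate in the training output.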

The warmup target tells you what lambda will be when warmup finishes, while the final target tells you what it will be at the end of training. Largely, the various builtin curves say how we get there. In this case, the learning rate warms up and then stays constant, while the logical batch size follows a quadratic curve.
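To make the roles of the warmup target and the final target concrete, here is a hand-written multiplier curve. The function `warmup_then_quadratic` is hypothetical and is not the library's implementation: it assumes a linear warmup from zero and a simple quadratic interpolation afterwards, so the exact curve produced by `tsa` may differ.

```python
def warmup_then_quadratic(step: int,
                          warmup_steps: int,
                          total_steps: int,
                          warmup_to: float,
                          anneal_to: float) -> float:
    """Hypothetical multiplier: linear warmup, then quadratic interpolation."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the warmup target.
        return warmup_to * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    progress = min(progress, 1.0)
    # Move from the warmup target to the final target along a quadratic.
    return warmup_to + (anneal_to - warmup_to) * progress ** 2


# With a base value of 1.0, the multiplier is the logical batch size target.
for step in (5, 500, 10000):
    print(step, warmup_then_quadratic(step, 500, 10000, 16, 128))
```

At the end of warmup the multiplier equals the warmup target, and at the last training step it equals the final target; only the path between them depends on the curve.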

### Schedule Config

We are going to schedule logical batch size. This is largely inspired by Smith's work, but it does not use his exact algorithm, as this is simply a demonstration.
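A logical batch larger than the physical batch is commonly realized through gradient accumulation. The sketch below uses a toy linear model and random data (both stand-ins, not this notebook's model or dataset) to check that accumulating scaled micro-batch gradients reproduces the gradient of one large batch.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
inputs = torch.randn(16, 4)
targets = torch.randn(16, 1)
loss_fn = torch.nn.MSELoss()  # mean reduction over the batch

# Reference: one backward pass over the full logical batch of 16 samples.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulated: two physical batches of 8, each loss divided by the
# number of micro-batches so the summed gradients form an average.
model.zero_grad()
micro_batches = 2
for x, y in zip(inputs.chunk(micro_batches), targets.chunk(micro_batches)):
    (loss_fn(model(x), y) / micro_batches).backward()
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print(grads_match)
```

The division by the number of micro-batches is what keeps the accumulated gradient on the same scale as a single large batch, which is why the training loop later scales the loss before calling `backward()`.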

### Tuning and Purpose

This notebook exists primarily to demonstrate the technology, not to present a well-tuned example. It has not been tuned beyond verifying convergence, so do not treat these settings as optimal.

In [2]:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' # Device
LOGGING_RATE = 5 # How frequently to print to console

# Model/Pipeline config
MODEL_NAME = "distilbert-base-uncased" # Model
MAX_LENGTH = 256 # Maximum number of tokens in sample
BATCH_SIZE = 8 # Samples per physical batch
TOTAL_BATCHES = 10000 # All batches used over training
WARMUP_BATCHES = 500 # Number of batches used for warmup

# Base learning rate and weight decay.
#
# The learning rate warms up and then stays constant.
BASE_LR = 0.00001
BASE_WD = 0.01

# The logical batch size schedule:
# a quadratic schedule from the starting size to the ending size
STARTING_BATCH_SIZE = 16
ENDING_BATCH_SIZE = 128

## Standard Boilerplate

Largely standard boilerplate here: we make a model, an AdamW optimizer, and a pipeline that loads IMDB and tokenizes it.

In [3]:
def make_model()->PreTrainedModel:
    """Load pretrained model with classification head."""
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=2
    )
    return model.to(DEVICE)

def make_dataloader()->DataLoader:
    """Load and tokenize IMDB dataset, return DataLoader."""
    dataset = load_dataset("imdb", split="train")  # Full IMDB training split
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def tokenize(examples):
        result = tokenizer(
            examples["text"],
            truncation=True,
            max_length=MAX_LENGTH,
            padding="max_length"
        )
        result["labels"] = examples["label"]
        return result

    dataset = dataset.map(tokenize, batched=True)
    dataset = dataset.shuffle(seed=42)
    dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

    return DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

def make_optimizer(model: PreTrainedModel)->Optimizer:
    """Create optimizer with base hyperparameter values that schedules multiply against."""
    return AdamW(
        model.parameters(),
        lr=BASE_LR,
        weight_decay=BASE_WD
    )

## Schedule Factory, and the Novelty

This is where ScheduleAnything comes in. We're going to bind a new field to the optimizer, bind two schedules, and define a helper that tells the training loop when it is time to step.

**The pattern:**
1. Add optimizer fields as needed
2. Create a schedule for each hyperparameter you want to control
3. Use `schedule_target` to specify which hyperparameter each schedule controls
4. Wrap them in `SynchronousSchedule` to keep them coordinated
5. Define utilities with tsa, in the same place, that read your extra hyperparameters and can be invoked from the training loop.

**Crucially**, this means downstream code can access the hyperparameters through well-abstracted utilities, maintaining separation of concerns.
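The idea behind step 1 can be illustrated in plain PyTorch, where each entry of `optimizer.param_groups` is an ordinary dictionary that tolerates extra keys. This is only a sketch of the concept: `tsa.extend_optimizer` may do more than this, and the helpers `set_target` and `read_target` are hypothetical names introduced here.

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(4, 2)
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

FIELD = "logical_batch_size_target"

# Step 1 of the pattern: add a new field to every param group.
for group in optimizer.param_groups:
    group.setdefault(FIELD, 1.0)

def set_target(optimizer: torch.optim.Optimizer, value: float) -> None:
    """Hypothetical helper: what a schedule would do on each step."""
    for group in optimizer.param_groups:
        group[FIELD] = value

def read_target(optimizer: torch.optim.Optimizer) -> float:
    """Hypothetical helper: what the training loop would call."""
    return max(group[FIELD] for group in optimizer.param_groups)

set_target(optimizer, 16.0)

# The optimizer ignores keys it does not know about when stepping.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

current_target = read_target(optimizer)
print(current_target)
```

Because the training loop only ever calls the reader helper, it never needs to know how the value was stored or scheduled.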

In [4]:
def make_schedule(optimizer: Optimizer)->SynchronousSchedule:
    """
    Create coordinated schedules for learning rate and logical batch size.

    Returns a SynchronousSchedule that steps both schedules together.
    """
    # Learning rate: warmup, then constant
    lr_scheduler = tsa.constant_with_warmup(
        optimizer,
        warmup_to_value=1.0, # Base learning rate already encoded
        num_warmup_steps=WARMUP_BATCHES,
        schedule_target='lr'
    )

    # Create the logical batch feature in the first place,
    # then bind the schedule

    tsa.extend_optimizer(optimizer,
                         name="logical_batch_size_target",
                         default_value=1.0)

    batch_size_scheduler = tsa.quadratic_schedule_with_warmup(
        optimizer,
        warmup_to_value=STARTING_BATCH_SIZE,
        anneal_to_value=ENDING_BATCH_SIZE,
        num_warmup_steps=WARMUP_BATCHES,
        num_training_steps=TOTAL_BATCHES,
        schedule_target='logical_batch_size_target'
    )

    # Coordinate them to step together
    return tsa.SynchronousSchedule([lr_scheduler, batch_size_scheduler])



In [5]:
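import math

def get_accum_threshold(optimizer: Optimizer)->int:
    """
    Read the scheduled logical batch size target from the optimizer and
    convert it into a whole number of physical batches to accumulate.
    """
    thresholds = []
    for value, _, _ in tsa.get_param_groups_regrouped_by_key(optimizer, 'logical_batch_size_target'):
        thresholds.append(value)
    # Round up so the loss scaling in the train loop matches the number
    # of batches actually accumulated before each optimizer step.
    return max(math.ceil(max(thresholds)/BATCH_SIZE), 1)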

## Train Loop
A standard PyTorch training loop as used in NLP, with the schedules stepped once per batch. The logging changes are abstracted into a helper.
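The loop accumulates physical batches until their count reaches the threshold derived from the scheduled target, so the effective batch size is the target rounded up to a whole multiple of `BATCH_SIZE`. The helper `effective_batch_size` below is a hypothetical illustration of that arithmetic and is not part of the notebook or of `tsa`.

```python
import math

BATCH_SIZE = 8  # samples per physical batch, as in the configuration

def effective_batch_size(target: float) -> int:
    """Hypothetical helper: samples consumed per optimizer step."""
    # Number of physical batches needed to reach or exceed the target,
    # with at least one batch per optimizer step.
    accum_steps = max(math.ceil(target / BATCH_SIZE), 1)
    return accum_steps * BATCH_SIZE

for target in (0.16, 16, 20, 128):
    print(target, effective_batch_size(target))
```

During early warmup the target is far below one physical batch, so every batch triggers an optimizer step, which is consistent with the `Last Batch Size: 8` entries in the logged output.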

In [6]:
def report_progress(schedule: SynchronousSchedule,
                    batch_idx: int,
                    loss: torch.Tensor,
                    last_step_size: int):
    """Print the current loss, learning rate, and batch size information."""
    last_lr = schedule.get_last_lr()[0]
    last_batch_target = schedule.get_last_schedule('logical_batch_size_target')[0]
    msg = (f"Batch {batch_idx+1:4d}/{TOTAL_BATCHES}"
          f" | Loss: {loss.item():.4f}"
          f" | LR: {last_lr:.4e}"
          f" | Target Batch Size: {last_batch_target:.4f}"
          f" | Last Batch Size: {last_step_size}"
          )
    print(msg)


In [7]:
def train(model: PreTrainedModel,
          dataloader: DataLoader,
          optimizer: Optimizer,
          schedule: SynchronousSchedule,
          ):
    """Train for TOTAL_BATCHES batches."""
    model.train()
    data_iter = iter(dataloader)
    accum_threshold = 1
    accum_steps = 0
    last_batch_size = 0

    for batch_idx in range(TOTAL_BATCHES):
        # Get next batch
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(dataloader)
            batch = next(data_iter)

        # Move to device
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        # Forward and backward pass.
        # Scale the loss so the accumulated gradients average over the logical batch.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss = loss/max(int(accum_threshold), 1)
        loss.backward()
        accum_steps += 1

        # Optimizer steps when I hit or exceed my target
        if accum_steps >= accum_threshold:
            optimizer.step()
            optimizer.zero_grad()
            accum_threshold = get_accum_threshold(optimizer)
            last_batch_size = accum_steps * BATCH_SIZE
            accum_steps = 0

        # Step schedules
        schedule.step()

        # Log progress
        if (batch_idx + 1) % LOGGING_RATE == 0:
            assert len(schedule.get_last_lr()) == 1, "update logging system when adding param groups"
            report_progress(schedule, batch_idx, loss, last_batch_size)


## Putting It All Together

Create the components and train.


In [None]:

def main():
    print("Setting up model and data...")
    model = make_model()
    dataloader = make_dataloader()

    print("Creating optimizer and schedules...")
    optimizer = make_optimizer(model)
    schedule = make_schedule(optimizer)

    #print(f"Scheduling: {schedule.schedule_names}")
    print(f"Training for {TOTAL_BATCHES} batches with {WARMUP_BATCHES} warmup")
    print(f"Device: {DEVICE}\n")

    train(model, dataloader, optimizer, schedule)

    print("\nTraining complete!")

if __name__ == '__main__':
    main()

Setting up model and data...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Creating optimizer and schedules...
Training for 10000 batches with 500 warmup
Device: cuda

Batch    5/10000 | Loss: 0.7512 | LR: 1.0000e-07 | Target Batch Size: 0.1600 | Last Batch Size: 8
Batch   10/10000 | Loss: 0.7227 | LR: 2.0000e-07 | Target Batch Size: 0.3200 | Last Batch Size: 8
Batch   15/10000 | Loss: 0.6625 | LR: 3.0000e-07 | Target Batch Size: 0.4800 | Last Batch Size: 8
Batch   20/10000 | Loss: 0.6895 | LR: 4.0000e-07 | Target Batch Size: 0.6400 | Last Batch Size: 8
Batch   25/10000 | Loss: 0.7312 | LR: 5.0000e-07 | Target Batch Size: 0.8000 | Last Batch Size: 8
Batch   30/10000 | Loss: 0.6735 | LR: 6.0000e-07 | Target Batch Size: 0.9600 | Last Batch Size: 8
Batch   35/10000 | Loss: 0.7113 | LR: 7.0000e-07 | Target Batch Size: 1.1200 | Last Batch Size: 8
Batch   40/10000 | Loss: 0.7223 | LR: 8.0000e-07 | Target Batch Size: 1.2800 | Last Batch Size: 8
Batch   45/10000 | Loss: 0.6992 | LR: 9.0000e-07 | Target Batch Size: 1.4400 | Last Batch Size: 8
Batch   50/10000 | Loss: 