# Train a GPT model from scratch

Here we are going to train a GPT model from scratch (babyGPT) using characters as tokens. This code can also be used to fine-tune GPT-2 models. 

This week's code is based on [Andrej Karpathy's nanoGPT](https://github.com/karpathy/nanoGPT), a simple and lightweight implementation of a large language model. It was chosen as the most accessible codebase in which you can inspect and understand the code behind transformers, and easily train and customise your own simple GPT models.

The code here has been modified slightly to make it simpler, more Pythonic, and easier for you to adapt for your own project, but the core technical implementation of the LLM in `nanoGPT/model.py` is largely unchanged. While the model is training, **spend some time looking at the model implementation code** and try to see which bits of code correspond to the components of a transformer model discussed in the lecture.

Before running this notebook you will need to run `00-prepare-dataset.ipynb` (which you can easily customise for your own text dataset). After running this notebook you can use your trained model to generate text in `02-generate-with-nanoGPT.ipynb`.

Unlike previous weeks, the model and training hyperparameters are specified in a separate YAML config file. YAML is a handy format for writing human-readable configs that are read in as Python dictionaries (see `configs/` and `nanoGPT/config.py` for more details). This makes it easier to switch back and forth between models and training params.

In [None]:
import os
import math
import time
import yaml
import torch
import pickle
import numpy as np

from yaml.loader import SafeLoader
from contextlib import nullcontext

from nanoGPT.model import GPT
from nanoGPT.config import ModelConfig, TrainingConfig

### Setup parameters 

In [None]:
# -----------------------------------------------------------------------------
# I/O
eval_only = False # if True, script exits right after the first eval
always_save_checkpoint = True # if True, always save a checkpoint after each eval (overridden to False further down this cell)

# DDP settings
backend = 'nccl' # 'nccl', 'gloo', etc.
device = 'cpu' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = 'float16' # 'float16' switches on the GradScaler that is created further down
compile = False # use PyTorch 2.0 to compile the model to be faster
gradient_accumulation_steps = 1 # used to simulate larger batch sizes

# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such
# if fine-tuning a gpt model you want to make these bigger
eval_interval = 50 # keep frequent because we'll overfit
eval_iters = 50
log_interval = 10 # don't print too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

torch.manual_seed(1337)
ctx = nullcontext()

### Set training config and dataset

In the `configs` folder there are a number of YAML files with different configs for the model and training of our GPT models. You can look in there if you want to see how these are set up. By default this notebook uses `baby_gpt_config.yaml`, which uses characters as tokens so that we can train a small, simple GPT model for educational purposes. For good-quality results, though, you will want to fine-tune a pretrained GPT-2 model using the second config set below.

This code has been designed for you to train these on whatever custom text dataset you want, using the code in `00-prepare-dataset.ipynb`. Just use the *chars* or *tokens* dataset depending on whether you want to train a babyGPT or GPT-2 model. 

##### Train a babyGPT character model from scratch:

This configuration trains a babyGPT model from scratch at the character level (similar to CharRNN from last term). It is designed to be small enough to train from scratch on a laptop CPU.
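To make "characters as tokens" concrete, here is a minimal sketch of character-level encoding and decoding. It is an illustration only: the names `build_vocab`, `encode` and `decode` are hypothetical, and the actual preparation code in `00-prepare-dataset.ipynb` may work differently.

```python
def build_vocab(text):
    """Map each distinct character to an integer id (sorted so the ids are deterministic)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # character -> id
    itos = {i: ch for ch, i in stoi.items()}      # id -> character
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of token ids, one per character."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)


sample = "to be or not to be"  # made-up text, stands in for a real dataset
stoi, itos = build_vocab(sample)
ids = encode(sample, stoi)
print(len(stoi), ids[:5])
print(decode(ids, itos))
```

With a character vocabulary the vocab size is tiny (one id per distinct character), which is part of why the model can be small enough to train on a CPU.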

In [None]:
config_path = 'configs/baby_gpt_config.yaml'
dataset_dir = '../data/class-datasets/text-datasets/shakespeare_chars'
ckpt_dir = 'ckpt/shakespeare-char'

##### Fine-tune a GPT-2 model on a custom dataset (GPU recommended):

This configuration automatically loads a pretrained GPT-2 model from Hugging Face Transformers, which you can fine-tune on your custom tokenised dataset.

In [None]:
# config_path = 'configs/gpt2_config.yaml'
# dataset_dir = '../data/class-datasets/text-datasets/shakespeare_tokens'
# ckpt_dir = 'ckpt/shakespeare-gpt'

### Load model and training hyperparameters

Here we load our model and training hyperparameters into instances of our `ModelConfig` and `TrainingConfig` classes, which we call `m` and `t`. This differs from the original nanoGPT code and was built to make the configuration handling more robust and Pythonic.
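As a rough picture of what a `from_dict` constructor can look like, here is a sketch using a standard-library dataclass. The class name `SketchModelConfig`, its fields and its defaults are all assumptions made for this example, and ignoring unknown keys is a design choice of the sketch; see `nanoGPT/config.py` for the real classes.

```python
from dataclasses import dataclass, fields


@dataclass
class SketchModelConfig:
    # hypothetical fields and defaults, chosen only for this example
    block_size: int = 64
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 128
    dropout: float = 0.0

    @classmethod
    def from_dict(cls, d):
        """Build a config from a dict, ignoring keys that are not dataclass fields."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in known})


# stand-in for one parsed YAML document
parsed = {"model_config": {"block_size": 32, "n_layer": 2, "unused_key": 1}}
cfg = SketchModelConfig.from_dict(parsed["model_config"])
print(cfg)
```

Values missing from the dict fall back to the class defaults, so a config file only needs to list the parameters it wants to change.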

In [None]:
os.makedirs(ckpt_dir, exist_ok=True)

with open(config_path, 'r') as f:
    m_dict, t_dict = list(yaml.load_all(f, Loader=SafeLoader))

m = ModelConfig.from_dict(m_dict['model_config'])
t = TrainingConfig.from_dict(t_dict['training_config'])
print(m.__dict__)
print(t.__dict__)
tokens_per_iter = gradient_accumulation_steps * t.batch_size * m.block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

### Data loader

Here we load our data from the pre-computed `.bin` binaries, which store the list of token indices for our text sequence.
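The key idea in the batch function is that the target sequence is the input sequence shifted along by one token, so at every position the model learns to predict the next token. A minimal sketch in plain Python (lists instead of NumPy arrays and tensors; `sample_pair` and the token values are hypothetical):

```python
import random


def sample_pair(data, block_size, rng):
    """Pick a random window x and its one-step-shifted target y."""
    i = rng.randrange(len(data) - block_size)  # start index in [0, len(data) - block_size)
    x = data[i : i + block_size]               # inputs
    y = data[i + 1 : i + 1 + block_size]       # targets: the same window, one token later
    return x, y


token_ids = list(range(100, 120))  # stand-in for the contents of train.bin
rng = random.Random(0)
x, y = sample_pair(token_ids, block_size=8, rng=rng)
print(x)
print(y)
```

A full batch is simply several of these pairs, each drawn at an independent random offset, stacked together.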

In [None]:
train_data = np.memmap(os.path.join(dataset_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(dataset_dir, 'val.bin'), dtype=np.uint16, mode='r')

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - m.block_size, (t.batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+m.block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+m.block_size]).astype(np.int64)) for i in ix])
    if device.startswith('cuda'): # also matches 'cuda:0', 'cuda:1', etc.
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y


### Initialise model

Here we initialise our model. It gets a bit complicated as we want to override the params in our ModelConfig `m` if there is a conflict with the params listed in a saved checkpoint that we load in.
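The sketch below illustrates the general idea of letting a checkpoint's saved arguments take precedence over the requested ones, and reporting where they disagree. The helper `resolve_model_args` and the example values are hypothetical. Note that the resume branch in this notebook goes further and rebuilds `m` entirely from the checkpoint's saved arguments.

```python
def resolve_model_args(requested, saved):
    """Return args where saved checkpoint values win, plus a list of conflicts."""
    resolved = dict(requested)
    conflicts = []
    for key, saved_value in saved.items():
        if key in requested and requested[key] != saved_value:
            # record (name, requested value, value actually used)
            conflicts.append((key, requested[key], saved_value))
        resolved[key] = saved_value
    return resolved, conflicts


requested = {"n_layer": 6, "n_head": 6, "block_size": 128}  # e.g. values from a YAML config
saved = {"n_layer": 4, "n_head": 6, "vocab_size": 65}       # e.g. checkpoint['model_args']
resolved, conflicts = resolve_model_args(requested, saved)
print(resolved)
print(conflicts)
```

The checkpoint has to win for architecture parameters, because the saved weights only fit a model with the same shapes.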

In [None]:
# init these up here, can override if t.init_from='resume' (i.e. from a checkpoint)
iter_num = 0
best_val_loss = 1e9

# attempt to derive vocab_size from the dataset
meta_path = os.path.join(dataset_dir, 'meta.pkl')
meta_vocab_size = None
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']
    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")
    
# model init
if t.init_from == 'scratch':
    # init a new model from scratch
    print("Initializing a new model from scratch")
    # determine the vocab size we'll use for from-scratch training
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    m.vocab_size = meta_vocab_size if meta_vocab_size is not None else 50304
    model = GPT(m)
elif t.init_from == 'resume':
    print(f"Resuming training from {ckpt_dir}")
    # resume training from a checkpoint.
    ckpt_path = os.path.join(ckpt_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
    # rebuild the model config from the checkpoint's saved args: the architecture
    # must match the saved weights, otherwise we can't load them and resume training
    m = ModelConfig.from_dict(checkpoint_model_args)
    # create the model
    model = GPT(m)
    state_dict = checkpoint['model']
    # strip the '_orig_mod.' prefix that torch.compile adds to parameter names,
    # so that a checkpoint saved from a compiled model loads into an uncompiled one
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']
elif t.init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {t.init_from}")
    # initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=m.dropout)
    model = GPT.from_pretrained(t.init_from, override_args)
    # read off the created config params, so we can store them into checkpoint correctly
    m = ModelConfig.from_dict(model.config.__dict__)
# crop down the model block size if desired, using model surgery
if m.block_size < model.config.block_size:
    model.crop_block_size(m.block_size)
model.to(device)


### Set up core objects

Here we set up our core objects, such as the gradient scaler and the optimiser.

In [None]:
# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(t.weight_decay, t.learning_rate, (t.beta1, t.beta2), device)
if t.init_from == 'resume':
    optimizer.load_state_dict(checkpoint['optimizer'])
checkpoint = None # free up memory

# compile the model
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0

### Util functions

Some utility functions to help us estimate the loss and get the learning rate (if we set it to gradually step down over training).
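To get a feel for the warmup-then-cosine-decay schedule, here is a standalone version of the same calculation that takes plain numbers rather than a `TrainingConfig`. The function name `cosine_lr` and the settings are hypothetical and are not taken from any file in `configs/`.

```python
import math


def cosine_lr(it, learning_rate, min_lr, warmup_iters, lr_decay_iters):
    """Learning rate at iteration `it`: linear warmup, cosine decay, then a floor."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters      # 1) linear warmup
    if it > lr_decay_iters:
        return min_lr                                 # 2) floor after decay ends
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 down to 0
    return min_lr + coeff * (learning_rate - min_lr)  # 3) cosine decay


# hypothetical settings, for illustration only
settings = dict(learning_rate=1e-3, min_lr=1e-4, warmup_iters=100, lr_decay_iters=1000)
for it in (0, 50, 100, 550, 1000, 2000):
    print(it, f"{cosine_lr(it, **settings):.6f}")
```

With these settings the rate climbs linearly to its peak at iteration 100, passes the midpoint between peak and minimum halfway through the decay (iteration 550), and stays at `min_lr` from iteration 1000 onwards.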

In [None]:
# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# learning rate decay scheduler (cosine with warmup)
def get_lr(it, t):
    # 1) linear warmup for warmup_iters steps
    if it < t.warmup_iters:
        return t.learning_rate * it / t.warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > t.lr_decay_iters:
        return t.min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - t.warmup_iters) / (t.lr_decay_iters - t.warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return t.min_lr + coeff * (t.learning_rate - t.min_lr)

### Training loop

Here is our training loop. The code is a bit more complex than we have seen before, largely because it contains quite a bit of boilerplate that makes training more efficient on NVIDIA GPUs.
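One piece of that boilerplate is gradient accumulation: each micro-batch loss is divided by `gradient_accumulation_steps` before the backward pass, so that the summed gradients match what a single larger batch would give. The sketch below checks this with plain Python rather than PyTorch, using a one-parameter linear model with a hand-derived gradient; the data values are made up and the micro-batches are assumed to be equal in size.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]  # made-up inputs
ys = [2.0, 4.0, 6.5, 8.5]  # made-up targets
w = 0.5

# gradient from one pass over the full batch
full_grad = grad_mse(w, xs, ys)

# same data split into equal micro-batches, gradients accumulated
accumulation_steps = 2
micro_size = len(xs) // accumulation_steps
accumulated = 0.0
for step in range(accumulation_steps):
    chunk = slice(step * micro_size, (step + 1) * micro_size)
    # dividing by the number of steps mirrors `loss / gradient_accumulation_steps`
    accumulated += grad_mse(w, xs[chunk], ys[chunk]) / accumulation_steps

print(full_grad, accumulated)
```

This is what lets a small GPU (or a CPU) simulate a larger batch size: only one micro-batch needs to fit in memory at a time.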

In [None]:
# training loop
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process
running_mfu = -1.0
while True:

    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num, t) if t.decay_lr else t.learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': m.__dict__,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': model.config.__dict__,
                    'dataset': dataset_dir
                }
                print(f"saving checkpoint to {ckpt_dir}")
                torch.save(checkpoint, os.path.join(ckpt_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if t.grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), t.grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = model.estimate_mfu(t.batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > t.max_iters:
        break

## Tasks

**Task 1:** Run the code here to train a babyGPT model, using characters as tokens, on the Shakespeare dataset or on a custom dataset of your choice. While training is happening, do Task 2:

**Task 2:** Go to `nanoGPT/model.py` and look through the code. Which lines of code are responsible for:
 - Token embedding
 - Positional embedding
 - Self attention
 - Mask for self attention
 - Layer normalisation
 - The fully connected layers
 - The complete transformer block
 - Softmax probabilities of outputs

**Task 3:** Once your model has trained, use it to generate some outputs in `02-generate-with-nanoGPT.ipynb`. How does your model perform? If you want to train it for longer, you can resume training from a saved checkpoint by setting `init_from` to `'resume'` in the training config.

### Bonus tasks

**Task A:** Change the [config](#set-training-config-and-dataset) for this code to fine-tune a pretrained GPT-2 model on the tokenised Shakespeare dataset (or another dataset of your choice) and run it. **Disclaimer:** you will probably need an NVIDIA GPU and a lot of time to do this well.

**Task B:** Look at the template for a config in `configs`. Can you make your own custom GPT architecture by setting new parameters in the config? For example, you could train a babyGPT model from scratch using the GPT-2 tokeniser instead of characters.