## Week 5 workshop

This week, we'll train a GPT-style model on text by Shakespeare and get the model to speak like Shakespeare.

The model we're using is a from-scratch implementation of a model similar to Llama 2, written by [Clément Labrugere](https://github.com/clabrugere). His original code is on GitHub here: https://github.com/clabrugere/scratch-llm. A copy of this repository as of September 2025 is included here in the `scratch-llm` folder.

First we import the required dependencies:

In [None]:
import sys
from pathlib import Path
import torch
from torch.utils.data import DataLoader
import polars as pl
from plotnine import ggplot, aes, geom_line, labs, theme_minimal

sys.path.append("scratch-llm")
from model.llm import LLM
from model.tokenizer import Tokenizer, train_tokenizer

from helpers.dataset import NextTokenPredictionDataset
from helpers.trainer import train
from helpers.config import LLMConfig, TrainingConfig, get_device

Next, we set up the configuration for the model and for model training.

In [None]:
llm_config = LLMConfig(
    vocab_size = 4096,
    seq_len = 128,
    dim_emb = 256,
    num_layers = 4,
    num_heads = 8,
    emb_dropout = 0.0,
    ffn_dim_hidden = 4 * 256,
    ffn_bias = False
)
train_config = TrainingConfig(
    retrain_tokenizer = False,
    device = get_device(),
    batch_size = 64,
    learning_rate = 3e-4,
    weight_decay = 1e-5,
    max_epochs = 3,
    log_frequency = 10
)
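
A few quantities follow directly from these settings. The sketch below copies the values from the configuration above and derives them with plain arithmetic. It assumes the usual multi-head attention convention of splitting the embedding evenly across heads, which the configuration itself does not state.

```python
# Values copied from the LLMConfig / TrainingConfig settings above.
dim_emb = 256
num_heads = 8
seq_len = 128
batch_size = 64
ffn_dim_hidden = 4 * dim_emb

# Assumption: the embedding is split evenly across attention heads,
# so dim_emb must be divisible by num_heads.
assert dim_emb % num_heads == 0
head_dim = dim_emb // num_heads

# Each training step sees batch_size sequences of seq_len tokens.
tokens_per_batch = batch_size * seq_len

print(f"Per-head dimension: {head_dim}")
print(f"FFN hidden width:   {ffn_dim_hidden}")
print(f"Tokens per batch:   {tokens_per_batch:,d}")
```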

Before we can train the model, we need to train the tokenizer.

In [None]:
input_file = "data/tinyshakespeare.txt"
output_file = Path(input_file).with_suffix(".model")

if not output_file.exists() or train_config.retrain_tokenizer:
    train_tokenizer(input_file, llm_config.vocab_size)

tokenizer = Tokenizer(str(output_file))

The following cell shows the tokenizer in action, splitting a sentence into subword pieces.

In [None]:
sentence = "Before we proceed any further, hear me speak."
print(tokenizer.sp.EncodeAsPieces(sentence))
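
SentencePiece marks the start of each word with the character `▁` (U+2581), which is why tokens such as `▁basilisk` appear later in this notebook. As a minimal sketch of how pieces map back to text, the function below joins a list of pieces and turns the marker back into spaces. The pieces are hypothetical and chosen for illustration; the trained tokenizer may split the sentence differently.

```python
WORD_START = "\u2581"  # the "▁" marker SentencePiece puts at the start of a word


def pieces_to_text(pieces):
    """Join subword pieces and turn the word-start marker back into spaces."""
    return "".join(pieces).replace(WORD_START, " ").lstrip(" ")


# Hypothetical pieces, for illustration only.
example_pieces = ["▁Be", "fore", "▁we", "▁pro", "ceed"]
print(pieces_to_text(example_pieces))
```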

## Model setup and training

Generate the model instance using the configuration options previously chosen.

In [None]:
model = LLM(
    vocab_size = tokenizer.vocab_size,
    seq_len = llm_config.seq_len,
    dim_emb = llm_config.dim_emb,
    num_layers = llm_config.num_layers,
    attn_num_heads = llm_config.num_heads,
    emb_dropout = llm_config.emb_dropout,
    ffn_hidden_dim = llm_config.ffn_dim_hidden,
    ffn_bias = llm_config.ffn_bias
)

params_size = sum(p.nelement() * p.element_size() for p in model.parameters())
buffer_size = sum(p.nelement() * p.element_size() for p in model.buffers())
size = (params_size + buffer_size) / 1024**2

print(f"Total model parameters: {sum(p.numel() for p in model.parameters()):,d}")
print(f"Model size: {size:.3f}MB\n")
#print(model)
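
The size calculation multiplies the number of elements in each tensor by the bytes per element and divides by 1024², so the figure printed as "MB" is strictly in mebibytes. The sketch below repeats the arithmetic in plain Python. The tensor shapes are hypothetical stand-ins, not the real model's parameter list, and float32 storage (4 bytes per element) is assumed.

```python
from math import prod

# Hypothetical tensor shapes, for illustration only.
shapes = [(4096, 256), (256,)]
bytes_per_element = 4  # assumption: float32 storage

# Number of elements is the product of each shape's dimensions.
num_params = sum(prod(shape) for shape in shapes)
size_mib = num_params * bytes_per_element / 1024**2

print(f"Parameters: {num_params:,d}")
print(f"Size: {size_mib:.3f} MiB")
```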

Set up the training data and the corresponding data loader. If you look carefully at the training labels, you will notice that they are the training inputs shifted by exactly one position. That's because we're training the model on next-token prediction.

In [None]:
# training data
ds_train = NextTokenPredictionDataset(input_file, llm_config.seq_len, tokenizer)

# data loader
dl_train = DataLoader(ds_train, batch_size = train_config.batch_size, shuffle = True)

# make pytorch print more numbers in the array instead of abbreviating with ...
torch.set_printoptions(edgeitems = 5)

# print inputs and labels for the first training iteration
for inputs, labels in dl_train:
    print(f"Tensor shapes\n  input: {inputs.shape}\n output: {labels.shape}\n")
    print(f"Inputs:\n{inputs}\n")
    print(f"Labels:\n{labels}\n")
    
    break
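
The one-position shift can be reproduced on a toy token list. This is a minimal sketch of the idea, not the code of `NextTokenPredictionDataset`. The function `next_token_pair` and the token ids are hypothetical.

```python
def next_token_pair(token_ids, start, seq_len):
    """Return (inputs, labels) where labels are inputs shifted by one token."""
    window = token_ids[start : start + seq_len + 1]
    return window[:-1], window[1:]


# Hypothetical token ids, for illustration only.
token_ids = [11, 22, 33, 44, 55, 66]
inputs, labels = next_token_pair(token_ids, start=0, seq_len=4)

print(f"inputs: {inputs}")
print(f"labels: {labels}")
```

At every position, the label is the token that follows the input token at that position, so a single sequence provides `seq_len` prediction targets at once.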

Next we train the model, but only if you set `train_llm = True` in the cell below. Training may take a couple of hours, depending on your computer.

In [None]:
train_llm = False

if train_llm:
    loss_history = train(
        model,
        dl_train,
        train_config.device,
        lr = train_config.learning_rate,
        max_epochs = train_config.max_epochs,
        weight_decay = train_config.weight_decay,
        log_every = train_config.log_frequency
    )
    # save the model
    torch.save(model.state_dict(), "data/tinyshakespeare_llm.pt")

    # save the history from the training run
    df = pl.DataFrame({
        'index': range(len(loss_history['train_loss'])),
        'train_loss': loss_history['train_loss']
    })
    df.write_csv('data/loss_history.csv')

Plot the history of the training loss.

In [None]:
# Load the DataFrame back from the CSV file
loss_history = pl.read_csv('data/loss_history.csv')

plot = (
    ggplot(loss_history, aes(x = 'index', y = 'train_loss')) +
        geom_line() +
        labs(
            x = 'Step',
            y = 'Training loss'
        ) +
        theme_minimal()
    )

plot
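
When reading the curve, it helps to have a reference point. Assuming `train` reports the mean cross-entropy in nats, as PyTorch's cross-entropy loss does (the notebook does not show the loss function), a model that guesses uniformly over the vocabulary has a loss of ln(vocab_size). The sketch below computes that baseline and the corresponding perplexity. The `perplexity` helper is introduced here for illustration.

```python
import math

vocab_size = 4096  # from llm_config above

# Cross-entropy of a uniform guess over the whole vocabulary, in nats.
uniform_loss = math.log(vocab_size)


def perplexity(loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(loss)


print(f"Uniform-guess loss: {uniform_loss:.3f}")
print(f"Uniform-guess perplexity: {perplexity(uniform_loss):.0f}")
```

A training loss well below this baseline indicates that the model has learned something about which tokens tend to follow which.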

## Using the model

To use the model, we don't have to retrain. We can just load the model that we saved previously and start exploring.

In [None]:
# move the model to the configured device (GPU or CPU)
model.to(train_config.device)

# load the saved model weights, mapping them onto the configured device
model.load_state_dict(torch.load(
    "data/tinyshakespeare_llm.pt",
    map_location = train_config.device,
    weights_only = True
))

# put the model into evaluation mode
model.eval()

Let's start with an empty prompt to generate random text.

In [None]:
prompt = torch.full((1, llm_config.seq_len), tokenizer.eos_id, dtype = torch.int32)
print(f"The prompt input:\n{prompt}\n")
prompt = prompt.to(train_config.device)
out = model.generate(prompt, max_seq_len = 64, top_p = 1)
print(f"The output in token form:\n{out}")
print(f"The output decoded:\n{tokenizer.decode(out.tolist())}")
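
The `top_p` argument refers to nucleus (top-p) sampling. The sketch below shows the general technique in plain Python: keep the most probable tokens until their cumulative probability reaches `top_p`, renormalise, and sample from that set. It is not the implementation used by `model.generate`. The function names and the probabilities are hypothetical.

```python
import random


def top_p_filter(probs, top_p):
    """Return {token_id: renormalised probability} for the top-p nucleus."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_top_p(probs, top_p, rng=random):
    """Draw one token id from the renormalised nucleus."""
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


# Hypothetical next-token distribution over a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.7))       # only the most likely tokens survive
print(len(top_p_filter(probs, 1.0)))  # top_p = 1 keeps every token
print(sample_top_p(probs, 0.7))
```

With `top_p = 1`, as used in this notebook, nothing is filtered out and sampling draws from the full distribution.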

We can also generate from a starting prompt.

In [None]:
prompt = tokenizer.encode(
    "KING HENRY VI:",
    beg_of_string = True,
    pad_seq = True,
    seq_len = llm_config.seq_len
)

# convert the prompt into a tensor the model can work with
inputs = torch.tensor(prompt, dtype=torch.int32).unsqueeze(0).to(train_config.device)

# print the input tensor (moved to the CPU for display)
print(f"The prompt input:\n{inputs.cpu()}\n")

# generate output
out = model.generate(inputs, max_seq_len=64, top_p=1)
print(f"The output in token form:\n{out}")
print(f"The output decoded:\n{tokenizer.decode(out.tolist())}")

## Exploring model parameters

Structure of the model parameters:

- **0** :: weight matrix for **token embeddings**

<!-- -->

- **1** :: **RMSNorm** parameter vector
- **2** :: Q, K, V matrices (concatenated) for **MultiHeadAttention**
- **3** :: weight matrix for projout part of **MultiHeadAttention**
- **4** :: **RMSNorm** parameter vector
- **5** :: initial weight matrix for **FeedForward (SwiGLU)** part
- **6** :: **SwiGLU** weight matrices (concatenated)
- **7** :: **SwiGLU** bias vector
- **8** :: final weight matrix for **FeedForward (SwiGLU)** part

<!-- -->

- **9-16** :: as 1-8, but for the second TransformerBlock
- **17-24** :: as 1-8, but for the third TransformerBlock
- **25-32** :: as 1-8, but for the fourth TransformerBlock

<!-- -->

- **33** :: **RMSNorm** parameter vector
- **34** :: final **projection_head** bias vector

**NOTE**: there is no weight matrix for the final projection head because it is "weight-tied" to the token embeddings weight matrix (0 above).
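
Weight tying means the same matrix is used twice: its rows are looked up to embed input tokens, and the final hidden state is multiplied against those same rows to produce one logit per token. The toy sketch below illustrates the idea in plain Python. The matrix, bias, and function names are hypothetical, and the real model does this with tensors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy embedding matrix: 3 tokens, 2 dimensions.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_bias = [0.0, 0.0, 0.5]  # the projection head keeps its own bias vector


def embed(token_id):
    """Input side: look up a row of the shared matrix."""
    return embeddings[token_id]


def project(hidden):
    """Output side: score the hidden state against every row of the same matrix."""
    return [dot(hidden, row) + b for row, b in zip(embeddings, head_bias)]


logits = project([2.0, 3.0])
print(logits)
```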

In [None]:
parList = list(model.parameters())
print(len(parList))  # 35 parameter tensors
parShapes = [list(el.shape) for el in parList]
parShapes
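
The index layout listed above is regular enough to compute. The sketch below maps a parameter index to its role, assuming exactly the ordering given in the list (eight tensors per TransformerBlock, four blocks). The helper `describe_parameter` is hypothetical and not part of the repository.

```python
# Roles of the eight tensors inside each TransformerBlock, in listed order.
BLOCK_ROLES = [
    "RMSNorm parameter vector",
    "Q, K, V matrices (concatenated)",
    "attention output projection matrix",
    "RMSNorm parameter vector",
    "initial FeedForward weight matrix",
    "SwiGLU weight matrices (concatenated)",
    "SwiGLU bias vector",
    "final FeedForward weight matrix",
]


def describe_parameter(index, num_layers=4):
    """Describe the parameter tensor at a given index of the parameter list."""
    last_block_index = len(BLOCK_ROLES) * num_layers
    if index == 0:
        return "token embeddings"
    if 1 <= index <= last_block_index:
        block, role = divmod(index - 1, len(BLOCK_ROLES))
        return f"block {block + 1}: {BLOCK_ROLES[role]}"
    if index == last_block_index + 1:
        return "final RMSNorm parameter vector"
    if index == last_block_index + 2:
        return "projection head bias vector"
    raise IndexError(f"no parameter with index {index}")


for i in (0, 2, 9, 32, 33, 34):
    print(i, "::", describe_parameter(i))
```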

Extracting token embeddings.

In [None]:
# extract list of all tokens (one per row of the embedding matrix)
tokens = [tokenizer.sp.id_to_piece(i) for i in range(tokenizer.vocab_size)]

print(tokenizer.sp.piece_to_id("▁perforce"))
print(tokenizer.sp.piece_to_id("▁basilisk"))

In [None]:
# Convert tensor to numpy array and create DataFrame
embedding_data = parList[0].cpu().detach().numpy()
print(embedding_data.shape)

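# look up the id of ▁basilisk (4077 with the provided tokenizer) instead of hard-coding it
basilisk_id = tokenizer.sp.piece_to_id("▁basilisk")
print(embedding_data[basilisk_id, :])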

# Create DataFrame with token column plus embedding dimensions
embeddings = pl.DataFrame({
    'token': tokens,
    **{f'dim_{i}': embedding_data[:, i] for i in range(embedding_data.shape[1])}
})

target_tokens = ["▁basilisk", "▁perforce", "▁castle"]
embeddings.filter(pl.col('token').is_in(target_tokens))
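
One common way to compare embedding vectors like these is cosine similarity, which measures the angle between two vectors regardless of their length. The sketch below implements it in plain Python on hypothetical three-dimensional vectors. Applying it to the real 256-dimensional rows would mean passing in two rows of `embedding_data`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```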