<a href="https://colab.research.google.com/github/vidh2000/Build-my-ChatGPT/blob/main/Karpathy_GPT_lecture_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is created for the ML4Physics @ Ljubljana school 2025. It follows Andrej Karpathy's GPT lecture [Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY), and includes snippets borrowed from Ryan Killian's [notebooks](https://github.com/ryankillian/karpathy-lectures-notebooks) following Karpathy's lecture.

We will be using PyTorch. If you have never used PyTorch before, their [tutorials](https://docs.pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html#define-the-class) can be helpful. If you feel completely lost, use the hints in the notebook or just ask us to help!

# Dataset

In [1]:
# Download the Tiny Shakespeare text dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-06-30 09:03:18--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-06-30 09:03:19 (102 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
# Check what's in the dataset
with open("input.txt", "r", encoding="utf-8") as f:
  text = f.read()
chars = sorted(list(set(text)))  # Create a list of characters
vocab_size = len(chars)
print("Length of dataset in characters: ", len(text))
print("Vocab size: ", vocab_size)
print("Characters in the dataset: ", "".join(chars))
print("")
print(text[:250])

Length of dataset in characters:  1115394
Vocab size:  65
Characters in the dataset:  
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



# Simple tokenizaton, encoding and decoding

In [3]:
# Tokenization - simple lookup table
char_to_int = {ch:i for i,ch in enumerate(chars)}  # Map charachter to integer
int_to_char = {i:ch for i,ch in enumerate(chars)}  # Map integer to character

In [4]:
# Encoding and decoding
encode = lambda s: [ char_to_int[c] for c in s]
decode = lambda l: "".join([ int_to_char[i] for i in l])

print(f"Encoded string: {encode('hi there')}")
print(f"Decoded encoding: {decode(encode('hi there'))}")

Encoded string: [46, 47, 1, 58, 46, 43, 56, 43]
Decoded encoding: hi there


In [5]:
# Encode the entire text and store as a torch tensor
import torch
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [6]:
# How does the story end? Decode the last 401 characters of the text

last_401_tokens = data[-401:]  # PyTorch tensor
decoded_text = decode(last_401_tokens.tolist())
print(decoded_text)
# For hint: scroll --->                                                                                                                                                                   use .tolist() to turn the tensor into a format that the decoder can read

SEBASTIAN:
What, art thou waking?

ANTONIO:
Do you not hear me speak?

SEBASTIAN:
I do; and surely
It is a sleepy language and thou speak'st
Out of thy sleep. What is it thou didst say?
This is a strange repose, to be asleep
With eyes wide open; standing, speaking, moving,
And yet so fast asleep.

ANTONIO:
Noble Sebastian,
Thou let'st thy fortune sleep--die, rather; wink'st
Whiles thou art waking.



# Simple dataloader

In [7]:
# Divide into train/val
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
print((len(train_data), len(val_data)))

(1003854, 111540)


In [8]:
# Define a maximum context length
context_length = 128

In [9]:
# The target is always the next element in the sequence
print(train_data[:10])
input = train_data[:context_length]
target = train_data[1: context_length+1]
for t in range(context_length):
    x = input[:t+1]
    y = target[t]
    print(f"When input is {x}, the target is {y} ")

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])
When input is tensor([18]), the target is 47 
When input is tensor([18, 47]), the target is 56 
When input is tensor([18, 47, 56]), the target is 57 
When input is tensor([18, 47, 56, 57]), the target is 58 
When input is tensor([18, 47, 56, 57, 58]), the target is 1 
When input is tensor([18, 47, 56, 57, 58,  1]), the target is 15 
When input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47 
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58 
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]), the target is 47 
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47]), the target is 64 
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64]), the target is 43 
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43]), the target is 52 
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52]), the target is 10 
When input is tensor([18, 47, 56, 57, 58, 

In [10]:
# Define a batch

torch.manual_seed(0) # or pick your favorite number
batch_size = 4

def get_batch(split, device='cpu'):
    data = train_data if split == "train" else val_data
    batch_size = 4

    # Random start indices: each must allow a full context window
    max_start = len(data) - context_length
    ix = torch.randint(0, max_start, (batch_size,))
    # Pick batch size (4) random indices between 0 and the length of the data minus the maximum context length
    ix = torch.randint(len(data) - context_length, (batch_size,))
    inputs = torch.stack([data[i: i+context_length] for i in ix])
    targets = torch.stack([data[i+1: i+1+context_length] for i in ix])
    # For hint: scroll --->                                                                                                                                                                Target should be the same as inputs, with an offset of 1

    return inputs.to(device), targets.to(device)

In [11]:
# Check that your targets make sense in relation to the inputs
x, y = get_batch('train')
print('inputs:')
print(x)
print('targets:')
print(y)

inputs:
tensor([[53, 56, 42, 57,  6,  1, 51, 39, 63,  1, 52, 39, 51, 43,  1, 58, 46, 43,
          1, 58, 47, 51, 43, 11,  0, 13, 52, 42,  1, 47, 52,  1, 58, 46, 43,  1,
         42, 59, 49, 43,  5, 57,  1, 40, 43, 46, 39, 50, 44,  1, 21,  5, 50, 50,
          1, 45, 47, 60, 43,  1, 51, 63,  1, 60, 53, 47, 41, 43,  6,  0, 35, 46,
         47, 41, 46,  6,  1, 21,  1, 54, 56, 43, 57, 59, 51, 43,  6,  1, 46, 43,
          5, 50, 50,  1, 58, 39, 49, 43,  1, 47, 52,  1, 45, 43, 52, 58, 50, 43,
          1, 54, 39, 56, 58,  8,  0,  0, 14, 21, 31, 20, 27, 28,  1, 27, 18,  1,
         17, 24],
        [43, 11,  1, 44, 53, 56,  1, 21,  1, 42, 47, 42,  1, 49, 47, 50, 50,  1,
         23, 47, 52, 45,  1, 20, 43, 52, 56, 63,  6,  0, 14, 59, 58,  1,  5, 58,
         61, 39, 57,  1, 58, 46, 63,  1, 40, 43, 39, 59, 58, 63,  1, 58, 46, 39,
         58,  1, 54, 56, 53, 60, 53, 49, 43, 42,  1, 51, 43,  8,  0, 26, 39, 63,
          6,  1, 52, 53, 61,  1, 42, 47, 57, 54, 39, 58, 41, 46, 11,  1,  5, 58,
  

# Model

In [12]:
import torch.nn as nn
from torch.nn import functional as F

This is the simplest possible model. It contains no connections between tokens, they are not aware of one another. We are only getting back which token typically follows token x.

In [13]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, inputs, targets=None):
        # Shape: BTC (batch size, max context length, vocab size)
        #print("Before\n",inputs)
        logits = self.token_embedding_table(inputs)
        #print("After\n",logits)

        if targets is None: # so we can re-use the function for generation (see below)
            loss = None
        else:
            # Reshape to fit Pytorch's cross entropy implementation
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)

            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, inputs, max_new_tokens):
        # inputs is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            logits, _ = self(inputs) # call forward() to get predictions (B, T, C)
            logits = logits[:,-1,:] # focus on last element (since we are predicting what comes after the last element), shape becomes (B, C)

            # Turn the logits into probabilitites
            probs = torch.softmax(logits, dim=-1)
            # For hint: scroll --->                                                                                                                                                                                       To get probabilities, we use softmax on the last dimension

            # Sample the probabilities to get the predicted index
            idx_next = torch.multinomial(probs, num_samples=1)

            # For hint: scroll --->                                                                                                                                                                                       Use torch.multinomial to sample the probability distribution

            inputs = torch.cat((inputs, idx_next), dim=1) # concatenate the sampled element, shape is (B, T+1)

        return inputs



In [14]:
model = BigramLanguageModel(vocab_size)
inputs, targets = get_batch('train')
logits, loss = model(inputs, targets)

print(logits.shape) # Shape B*T, C
print(loss.item())

torch.Size([512, 65])
4.563185691833496


In [15]:
# Generate and decode 500 tokens, starting from a start index
start_index = torch.zeros((1,1), dtype=torch.long)
generated = model.generate(start_index, max_new_tokens=500)
generated_text = decode(generated[0].tolist())  # remove batch dim and decode
print(generated_text)
# your code here...
# For hint: scroll --->                                                                                                                                                                   Examine the output of model.generate. Remember how you decoded the output above.



MRL'Dj-Gae
pVEBYbUbEVPlPoJiz,
iHgtSTdF'UULLbe.vAHHofsxyYbJ&3iSZ,fLxmqtOBbCl3HgzxNhpkyBiS3tR-Ow-uShxe,G,&
K.!:hQOxZLdbir$KiZ!TygzFocdQ-EuZwxXgzOH3Hs:xDqTMsbHFGULx&kmY GvI., T&Gd&&nZk-GdcE$z,CZrroDxSRmAK?
muMrBTHg
WnETu;w$Pte,NwIkHg? rqNrxyvsnRCAzmyiDh:!&HgVeqGDDnfbsTXxlfrv :Oo!TIqvBxPK:tdTAe !Pv!ahiRU!KxIXSg?R!IZcYtWSZ!'aMHJJdpjAaoONA;w$-PoFM;&-OvyNnpRH!a 
;UtE'fSSndVULJEh3QOdf!Wqdi$Ko&DWxcQo$wWHYPNm$N
Qq3M?DNCsekEFqiO3zFof$wQaTNBTEmZag'bQlx;H'ArrqIeI?;FTYy
JJXJdjOG,k:xed
IZ:O!-
BxVe&?PANOOoBkm$N


# Train the model

The output above is from a randomly initialized model. Let's train it, and see if it learns anything!

In [16]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [17]:
# Let's train!
batch_size=64

for steps in range(10000):
    inputs, targets = get_batch('train')  # get batch of shape (B, T)

    logits, loss = model(inputs, targets)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if steps % 1000 == 0:
        print(f"Step {steps}, Loss: {loss.item()}")
print(f"Final loss = {loss}")

Step 0, Loss: 4.5482096672058105
Step 1000, Loss: 3.6333296298980713
Step 2000, Loss: 3.0479347705841064
Step 3000, Loss: 2.65885066986084
Step 4000, Loss: 2.5712459087371826
Step 5000, Loss: 2.5849523544311523
Step 6000, Loss: 2.4878904819488525
Step 7000, Loss: 2.4828920364379883
Step 8000, Loss: 2.4297704696655273
Step 9000, Loss: 2.4429640769958496
Final loss = 2.555318832397461


In [18]:
# Generate
print(decode(model.generate(inputs=start_index, max_new_tokens=500)[0].tolist()))


th?
HA co gonind wind to my KIIUThlalopor, it l warechowhisthe, I wicare on fr ssovey, K:
festhemo his k-
Taisqur; her-

WIOPoincout feso wis ul ck
San:
RDid athat ond ws sst fuino, ghe? llangey, in
nset sthenyo cer.
ANTur sare Noreraldsay he, heso, totifour, she ther ghinde,NESe t?
To th s 's tl ig ING weth s, tard ik, indigthain t deewo ARDoulde:
Codochetr s I bl!
RYe die omor trr kias arerorr bof, bara be mint thererthevedelal ay mengoomur we,
CEO CEThe ts. mit lind geshy he.

BR: Ho.
TESithi


Did the output look more reasonable?

# Let's get a bit more serious with the training

For the rest of this notebook, we would like access to a GPU. If your runtime in Google Colab is on a CPU, you can request a GPU by clicking the little arrow in the top right corner next to the RAM and disk usage graphs, and select "Change runtime type". Note that this will restart your runtime, so you need to reload the data etc again. It is possible that Google will say no to your request for a GPU, as they have limited resources. It is possible to run the following also on a CPU, but get a GPU if you can.

In [19]:
# Do we have access to a GPU?
if torch.cuda.is_available():
    print("We have access to the following GPU(s):")
    for i in range(torch.cuda.device_count()):
        print(torch.cuda.get_device_properties(i).name)
else:
    print("No access to GPU")

We have access to the following GPU(s):
Tesla T4


In [20]:
### You only need to run this if you had to restart your runtime at this point
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(0) # or pick your favorite number
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
char_to_int = {ch:i for i,ch in enumerate(chars)}
int_to_char = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [ char_to_int[c] for c in s]
decode = lambda l: "".join([ int_to_char[i] for i in l])
data = torch.tensor(encode(text), dtype = torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
def get_batch(split, device='cpu'):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - context_length, (batch_size,))
    inputs = torch.stack([data[i: i+context_length] for i in ix])
    targets = torch.stack([data[i+1: i+context_length+1] for i in ix])
    return inputs.to(device), targets.to(device)

--2025-06-30 09:03:42--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2025-06-30 09:03:42 (189 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [21]:
# hyperparameters
batch_size = 32
context_length = 128
max_iters = 10000
eval_interval = 300
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200

# A function to estimate the loss, which we can call for evaluation
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for data_split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(data_split, device=device)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[data_split] = losses.mean()
    model.train()
    return out

model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', device=device)

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


step 0: train loss 4.5310, val loss 4.5589
step 300: train loss 4.1313, val loss 4.1630
step 600: train loss 3.7911, val loss 3.8262
step 900: train loss 3.5119, val loss 3.5478
step 1200: train loss 3.2797, val loss 3.3141
step 1500: train loss 3.0947, val loss 3.1262
step 1800: train loss 2.9503, val loss 2.9822
step 2100: train loss 2.8327, val loss 2.8660
step 2400: train loss 2.7458, val loss 2.7756
step 2700: train loss 2.6790, val loss 2.7093
step 3000: train loss 2.6271, val loss 2.6574
step 3300: train loss 2.5911, val loss 2.6167
step 3600: train loss 2.5621, val loss 2.5891
step 3900: train loss 2.5410, val loss 2.5648
step 4200: train loss 2.5256, val loss 2.5498
step 4500: train loss 2.5111, val loss 2.5362
step 4800: train loss 2.5018, val loss 2.5296
step 5100: train loss 2.4956, val loss 2.5186
step 5400: train loss 2.4881, val loss 2.5123
step 5700: train loss 2.4832, val loss 2.5067
step 6000: train loss 2.4807, val loss 2.5063
step 6300: train loss 2.4741, val loss 2

# Self-attention, first tests

So far, our model has only had a simple embedding table to work with. Let's introduce some attention.

In [22]:
torch.manual_seed(0)

# The input dimension is (B, T, C), which stands for (batch, time, channels). In
# this context, this will be equivalent to (batch size, sequence length,
# embedding dimension).

# Generate some random input
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

The formula for attention is:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V$

Here, $Q$, $K$ and $V$ will be the output of linear layers (whose input is $x$), and the scaling factor $d_k$ is the dimension of the keys.

In [23]:
class SimpleAttentionModel(nn.Module):

    def __init__(self):
        super().__init__()
        # First, create an embedding table from vocab_size to n_embd
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # For hint: scroll --->                                                                                                                                                                                            self.token_embedding_table = nn.Embedding(vocab_size, n_embd)


        # For language, the position matters. Create an embedding table from context_length to n_embd
        self.position_embedding_table = nn.Embedding(context_length, n_embd)
        # For hint: scroll --->                                                                                                                                                                                            self.position_embedding_table = nn.Embedding(context_length, n_embd)


        # Set up the Head
        self.sa_head = Head(n_embd)

        # Set up a projection layer from the embedding dimension back to the vocab_size
        self.proj = nn.Linear(n_embd, vocab_size)
        # For hint: scroll --->                                                                                                                                                                                            This is just a simple linear layer: self.proj = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.sa_head(x)
        logits = self.proj(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -context_length:]

            logits, _ = self(idx_cond)  # (B, T, C)
            logits = logits[:, -1, :] # (B, C)
            probs = F.softmax(logits, dim=-1) # (B, C)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


In [24]:
# Let's see a single Head perform self-attention
head_size = 32
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, hs)
q = query(x) # (B, T, hs)
v = value(x) # (B, T, hs)


# Compute attention scores ("affinities") with scaled dot-product
wei = q @ k.transpose(-2, -1) / head_size**0.5  # (B, T, T)

# Optionally: apply causal mask if this is for autoregressive transformer
# wei = wei.masked_fill(tril == 0, float('-inf'))

# Softmax over the keys to get attention weights
wei = F.softmax(wei, dim=-1)  # (B, T, T)

# Weighted sum of the values
attention = wei @ v  # (B, T, head_size)

print(attention.shape)  # Should be (B, T, hs)

torch.Size([4, 8, 32])


# Train a model with attention

We now add an explicit embedding dimension, `n_embed`. We also introduce a triangular matrix, the causal mask, which will stop the model from cheating by looking ahead at what comes next.

In [25]:
# hyperparameters
batch_size = 32
context_length = 8
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
head_size = 32
n_embd = 32

In [26]:
class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # For hint: scroll --->                                                                                                                                                                                       Just as we did before, same setup for Q, K and V: nn.Linear(n_embd, head_size, bias=False)


        # New! We introduce a causal mask, that will stop the model from cheating
        # by looking ahead at what comes next.
        self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, hs)
        q = self.query(x) # (B, T, hs)
        v = self.value(x) # (B, T, hs)

        weights = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)

        # Apply the causal mask
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)

        weights = F.softmax(weights, dim=-1) # (B, T, T)
        out = weights @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out


In [27]:
model = SimpleAttentionModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [28]:
# Train

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', device=device)

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.2400, val loss 4.2331
step 500: train loss 2.6863, val loss 2.6920
step 1000: train loss 2.5159, val loss 2.5317
step 1500: train loss 2.4719, val loss 2.4637
step 2000: train loss 2.4503, val loss 2.4413
step 2500: train loss 2.4270, val loss 2.4254
step 3000: train loss 2.4040, val loss 2.4155
step 3500: train loss 2.3967, val loss 2.4162
step 4000: train loss 2.3935, val loss 2.4216
step 4500: train loss 2.3871, val loss 2.3882


In [29]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


Pur youbn eald Cinoimeabs ut yom chithisur alinen. ncomef br'd ht
D yor reak' ht ster oul the You lotoo wim thet hean bur st thor ther an oile, ans fenver ses ave Nikn tath llld am
rid ld
BE
VI:
Ase wh vet I'leck yofr.

WAdvende do omef beld to
Towan, st pradeed lle, fal boer oure mm m'schis bif ous tansol daccanet lour tals ven.

BRO.

ARIVIINm I whey yomainans nd pamel,
I st Ker pounse? pleces ot omy ot bithe hy eghesch le g; tr oron he beer al ady alf onge grein ml gre vergo od thang befre Ef


# Multihead attention

The purpose of multihead attention is to allow different heads to attend differently to the input. Here we are using sequential heads for illustrative purposes. To see an implmentation that utilizes parallel processing (and thus making the most of the GPU), see the [GPT model](https://github.com/uhh-pd-ml/omnijet_alpha/blob/d47046b2bb7f47cea5ee69f616b842fa402758e0/gabbro/models/gpt_model.py#L13) of OmniJet-alpha.

Note that `num_heads` times `head_size` must equal `n_embd`.

In [30]:
# hyperparameters
batch_size = 32
context_length = 8
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
num_heads = 4
n_embd = 32

In [31]:
# Initialize the model and send to the correct device
# your code here...
# For hint: scroll --->                                                                                                                                                                                           Same as before, but use the new model


# create a PyTorch optimizer
# your code here...
# For hint: scroll --->                                                                                                                                                                                           Same as before
model = SimpleMHA_Model()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


NameError: name 'SimpleMHA_Model' is not defined

In [None]:
# Train
# your code here...
# For hint: scroll --->                                                                                                                                                                                           Same as before

# Train

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', device=device)

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


In [None]:
# generate from the model
# your code here...
# For hint: scroll --->                                                                                                                                                                                           Same as before
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

# Adding FF network, residual connections, layer norm and dropout

To make our model more powerful, let's add some more stuff to make it more similar to what's in the [Attention is all you need](https://arxiv.org/abs/1706.03762) paper. You can experiment with the dropout value, and where you want to use it.

In [35]:
# Let's try multihead attention

class Head(nn.Module):
    # your code here...
    # For hint: scroll --->                                                                                                                                                                                           Same as before

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, hs)
        q = self.query(x) # (B, T, hs)
        v = self.value(x) # (B, T, hs)

        weights = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)

        # Apply the causal mask
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)

        weights = F.softmax(weights, dim=-1) # (B, T, T)
        out = weights @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        # Heads in sequence
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # The output of the heads are concatenated
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out

class SimpleMHA_Model(nn.Module):

    def __init__(self):
        super().__init__()
        # Implement token embedding table and position embedding table as earlier
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_length, n_embd)


        # Note that num_heads x head_size = n_embd
        self.sa_heads = MultiHeadAttention(4, n_embd//4) # 4 heads of 8 dimensional self-attention

        # Implement the projection layer as earlier
        self.proj = nn.Linear(n_embd, vocab_size)


    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.sa_heads (x)
        logits = self.proj(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -context_length:]

            logits, _ = self(idx_cond)  # (B, T, C)
            logits = logits[:, -1, :] # (B, C)
            probs = F.softmax(logits, dim=-1) # (B, C)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx



In [36]:
# hyperparameters
batch_size = 32
context_length = 8
max_iters = 5000
eval_interval = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_head = 4
n_embd = 32
dropout = 0.0

                                                                                                                                                                                           # In forward, after softmax: weights = self.dropout(weights)
class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)


        self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, hs)
        q = self.query(x) # (B, T, hs)
        v = self.value(x) # (B, T, hs)

        weights = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)

        # Apply the causal mask
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)

        weights = F.softmax(weights, dim=-1) # (B, T, T)
        weights = self.dropout(weights)
        out = weights @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out


class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        # Heads in sequence
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

        # INItialised dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # The output of the heads are concatenated
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        ### ADded dropout
        out = self.dropout(out)
        return out


class FeedFoward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.FF = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),  # experiment with using/not using dropout here
        )

    def forward(self, x):
        return self.FF(x)

class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class Full_Model(nn.Module):

    def __init__(self):
        super().__init__()

        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_length, n_embd)

        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )

        self.proj = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.blocks(x)
        logits = self.proj(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -context_length:]

            logits, _ = self(idx_cond)  # (B, T, C)
            logits = logits[:, -1, :] # (B, C)
            probs = F.softmax(logits, dim=-1) # (B, C)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx



In [37]:
# Initialize the model and send to the correct device
# your code here...


# create a PyTorch optimizer
# your code here...

# Initialize the model and send to the correct device
# your code here...
# For hint: scroll --->                                                                                                                                                                                           Same as before, but use the new model


# create a PyTorch optimizer
# your code here...
# For hint: scroll --->                                                                                                                                                                                           Same as before
model = Full_Model()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)



In [38]:
# Train
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', device=device)

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: train loss 4.5580, val loss 4.5610
step 500: train loss 2.4104, val loss 2.4101
step 1000: train loss 2.2708, val loss 2.2930
step 1500: train loss 2.1876, val loss 2.2168
step 2000: train loss 2.1430, val loss 2.1763
step 2500: train loss 2.1211, val loss 2.1566
step 3000: train loss 2.0791, val loss 2.1389
step 3500: train loss 2.0550, val loss 2.1229
step 4000: train loss 2.0235, val loss 2.1101
step 4500: train loss 2.0257, val loss 2.0992


In [39]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


Themwal your lial him! I buet wellse it
And truade be to no forted:
Tome; fartell, I:
And sorrtooadle of me, selemele your frew that will younds!
sel, cand the corst dod but shope wee,
And thim: what I seecontitame ap of of earsimves oaken? Cumbern,
Nor sutee 'on ste: in trees.
No, Hausparlatt peam,
II shall appwion the to Ko just muss:
The demess!
Sted Rithts lovend. 'Et did nat traced his nawn
Wan
That hean to lat'ly end oblo, wheasife, seve lanta he is we a my heed.

USTIELILAM:
WeC
I marre s


# Scaling up

Foundation models become powerful because of scale. This is not yet a foundation model, just a small transformer, but we still want to check what happens if we scale it up! Note that this will likely take way too long to finish during the school (during my test run on a T4 GPU it took 1.5h to finish 1500 steps). If you have time at home, and perhaps access to more powerful GPUs via your university/institute, play around with the different options and see what happens.

In [40]:
# hyperparameters
batch_size = 64  # Larger batch size
context_length = 256  # Longer context length
max_iters = 5000
eval_interval = 10  # <-- Since this model takes a long time to train, I put a short interval here so you won't have to wait forever to see the output
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384  # Larger embedding dimension
n_head = 6  # More heads
n_blocks = 3  # More transformer blocks
dropout = 0.2

# A function to estimate the loss, which we can call for evaluation
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for data_split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(data_split, device=device)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[data_split] = losses.mean()
    model.train()
    return out

In [41]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, hs)
        q = self.query(x) # (B, T, hs)
        v = self.value(x) # (B, T, hs)

        weights = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, T)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)  # <- Dropout after softmax
        out = weights @ v  # (B, T, hs)
        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)  # ← add this
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))  # ← and use it here
        return out


class FeedFoward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.FF = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout)  # <- Dropout at the end of FFN
        )

    def forward(self, x):
        return self.FF(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))  # residual connection + attention
        x = x + self.ffwd(self.ln2(x))  # residual connection + FFN
        return x


class Full_Model(nn.Module):

    def __init__(self):
        super().__init__()

        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(context_length, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_blocks)]) # many blocks!
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.proj = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)
        x = tok_emb + pos_emb # (B, T, C)
        x = self.ln_f(self.blocks(x))
        logits = self.proj(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -context_length:]

            logits, _ = self(idx_cond)  # (B, T, C)
            logits = logits[:, -1, :] # (B, C)
            probs = F.softmax(logits, dim=-1) # (B, C)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx



In [42]:
# Initialize the model and send to the correct device
model = Full_Model()
m = model.to(device)
print(f"Model structure:\n{model}\n")
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Model structure:
Full_Model(
  (token_embedding_table): Embedding(65, 384)
  (position_embedding_table): Embedding(256, 384)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-5): 6 x Head(
            (key): Linear(in_features=384, out_features=64, bias=False)
            (query): Linear(in_features=384, out_features=64, bias=False)
            (value): Linear(in_features=384, out_features=64, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (proj): Linear(in_features=384, out_features=384, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (ffwd): FeedFoward(
        (FF): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Linear(in_features=1536, out_features=384, bias=True)
          (3): Dropout(p=0.2, inplace=False)
        )
      )
      (ln1): LayerNorm((384,), eps=1e-05, elementwi

In [None]:
# Train
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', device=device)

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: train loss 4.3187, val loss 4.3076
step 10: train loss 3.0200, val loss 3.0515
step 20: train loss 2.8026, val loss 2.8194
step 30: train loss 2.6769, val loss 2.6926
step 40: train loss 2.5999, val loss 2.6043
step 50: train loss 2.5553, val loss 2.5581
step 60: train loss 2.5319, val loss 2.5370
step 70: train loss 2.5165, val loss 2.5239
step 80: train loss 2.5016, val loss 2.5104
step 90: train loss 2.4916, val loss 2.4996
step 100: train loss 2.4848, val loss 2.4946
step 110: train loss 2.4783, val loss 2.4902
step 120: train loss 2.4722, val loss 2.4865
step 130: train loss 2.4649, val loss 2.4848
step 140: train loss 2.4581, val loss 2.4782
step 150: train loss 2.4523, val loss 2.4739
step 160: train loss 2.4467, val loss 2.4653
step 170: train loss 2.4406, val loss 2.4650
step 180: train loss 2.4321, val loss 2.4585
step 190: train loss 2.4301, val loss 2.4491
step 200: train loss 2.4193, val loss 2.4416
step 210: train loss 2.4153, val loss 2.4393
step 220: train loss 

In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

I'm done, now what?

Congrats! Now you hopefully have a better understanding of what goes on behind the scenes in a GPT model. If you want to continue working with language models, check out Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT/tree/master). If you want to try a particle physics implementation, check out [OmniJet-alpha](https://github.com/uhh-pd-ml/omnijet_alpha). Have fun!