# Introduction

This workbook is based on Andrej Karpathy's YouTube tutorial titled ["Let's build GPT: from scratch, in code, spelled out"](https://www.youtube.com/watch?v=kCc8FmEb1nY).


## Purpose of this Workbook
This notebook serves as a personal educational tool to reinforce learning and practice building a transformer from scratch.

## Suggested use
Delete everything but the list of instructions below, and good luck!!




# List of instructions

In [None]:
# 1. Imports & config
# 2. Download dataset
# 3. Vocabulary
# 4. Tokenizer (encode and decode)
# 5. Train and Test Splits
# 6. Dataloader *
# 7. Model: Embedding Layer and Output Linear Transformation
# 8. Generate Function
# 9. Evaluation Loop
# 10. Training Loop
# 11. Model: Positional Embeddings
# 12. Model: Single Attention Head
# 13. Model: Multi-head Attention
# 14. Model: Multihead Attention Projection Layer
# 15. Model: MLP
# 16: Model: Transformer block
# 17: Model: Skip connections
# 18: Model: Layer Normalization
# 19: Model: Dropout

# Extras
# 1. Self Attention from first principles

# Code

In [1]:
# 1. Imports & config

import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [2]:
# 2. Download dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-04-18 15:50:42--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.2’


2024-04-18 15:50:42 (32.2 MB/s) - ‘input.txt.2’ saved [1115394/1115394]



In [3]:
# 3. Vocabulary

with open('./input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print(chars)

65
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [4]:
# 4. Tokenizer (encode and decode)

itos = {i:ch for i, ch in enumerate(chars)}
stoi = {ch:i for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

print(encode('hello world!'))
print(decode(encode('hello world!')))

# def decode: lambda

[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42, 2]
hello world!


In [5]:
# 5. Train and Test Splits

data = torch.tensor(encode(text), dtype=torch.long)

n = int(len(data)*0.9)
train_data = data[:n]
val_data = data[n:]

print(train_data.shape)
print(val_data.shape)

torch.Size([1003854])
torch.Size([111540])


In [6]:
# 6. Dataloader

batch_size = 4
block_size = 8

def get_batch(split):

    data = train_data if split=='train' else val_data

    ix = torch.randint(len(data)-block_size, (batch_size,))

    x = torch.stack([data[i: i+ block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1: i+ block_size+1] for i in ix], dim=0)

    x = x.to(device)
    y = y.to(device)

    return x, y

print(get_batch('train'))




(tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]]), tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]]))


In [13]:
# 7. Model: Embedding Layer and Output Linear Transformation
# 8. Generate Function

n_embed = 32

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        logits = self.lm_head(tok_emb) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

model = GPT()
model = model.to(device)

print(model)




GPT(
  (embed_tokens): Embedding(65, 32)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)


In [8]:
# 9. Evaluation Loop
# 10. Training Loop

eval_iters = 20
eval_interval = 500

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out

learning_rate=1e-3
training_iters=1000


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()





 training loss: 4.396236419677734, eval loss 4.392512321472168
 training loss: 3.7682690620422363, eval loss 3.8132052421569824


In [9]:
context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))


nGFrSwIrsSp! BeMn!yf,Mtb l!spYfNiebNw  tsxBonlrp!,


In [15]:
# 11. Model: Positional Embeddings

n_embed = 32

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------


class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = tok_emb

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)
 training loss: 4.284838676452637, eval loss 4.372692108154297
 training loss: 2.9556362628936768, eval loss 2.9893593788146973
 training loss: 2.6617894172668457, eval loss 2.75193190574646
 training loss: 2.535144090652466, eval loss 2.739504337310791
 training loss: 2.560425043106079, eval loss 2.6136131286621094
 training loss: 2.4922423362731934, eval loss 2.6709134578704834

Mje p st; mabind d pr, bafotauth ny l baru,
Ae hes


In [18]:
# 12. Model: Single Attention Head

n_embed = 32
head_size = 32

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out




class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.attn = Head(head_size)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.attn(x)

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (attn): Head(
    (query): Linear(in_features=32, out_features=32, bias=False)
    (key): Linear(in_features=32, out_features=32, bias=False)
    (value): Linear(in_features=32, out_features=32, bias=False)
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

---------------
 training loss: 4.191686153411865, eval loss 4.202771186828613
 training loss: 3.0024924278259277, eval loss 2.986266851425171
 training loss: 2.6651840209960938, eval loss 2.7015931606292725
 training loss: 2.7183470726013184, eval loss 2.6373283863067627
 training loss: 2.623220443725586, eval loss 2.5721213817596436
 training loss: 2.5711042881011963, eval loss 2.5833258628845215

---------------

Hacounesre d;
Bar taud tpilptit omig ty II tiveang

---------------

Total Parameters: 7553
Trainable Parameters: 7553


In [25]:
# 13. Model: Multi-head Attention
# 14. Model: Multihead Attention Projection Layer

n_embed = 32
head_size = 8
n_head = 4

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for h in range(n_head)])
        self.o_proj = nn.Linear(n_embed, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x




class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.attn = MultiHeadAttention(n_head, n_embed//n_head)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.attn(x)

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (attn): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
    (o_proj): Linear(in_features=32, out_features=32, bias=False)
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

---------------
 training loss: 4.203196048736572, eval loss 4.1950364112854
 training loss: 2.936199426651001, eval loss 2.9405248165130615
 training loss: 2.6581954956054688, eval loss 2.734909772872925
 training loss: 2.5741524696350098, eval loss 2.6650893688201904
 training loss: 2.6502718925476074, eval loss 2.5795183181762695
 training loss: 2.439868211746216, eval loss 2.5682873725891113

---------------

rfler the icot hathinlt non trit yoce chirese thos

---------------



In [27]:
# 15. Model: MLP

n_embed = 32
head_size = 8
n_head = 4

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for h in range(n_head)])
        self.o_proj = nn.Linear(n_embed, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x

class MLP(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.out_proj = nn.Linear(n_embed, 4 * n_embed)
        self.in_proj = nn.Linear(4 * n_embed, n_embed)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.out_proj(x)
        x = self.act(x)
        x = self.in_proj(x)
        return x

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.attn = MultiHeadAttention(n_head, n_embed//n_head)
        self.mlp = MLP(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.attn(x)
        x = self.mlp(x)

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (attn): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
    (o_proj): Linear(in_features=32, out_features=32, bias=False)
  )
  (mlp): MLP(
    (out_proj): Linear(in_features=32, out_features=128, bias=True)
    (in_proj): Linear(in_features=128, out_features=32, bias=True)
    (act): ReLU()
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

---------------
 training loss: 4.203790187835693, eval loss 4.198258399963379
 training loss: 2.8171744346618652, eval loss 2.7808096408843994
 training loss: 2.673691987991333, eval loss 2.642836093902588
 training loss: 2.6495823860168457, eval loss 2.461919069290161
 training loss: 2.421330213546753, eval loss

In [32]:
# 16: Model: Transformer block

n_embed = 32
head_size = 8
n_head = 4
n_layers = 2

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3


#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.o_proj = nn.Linear(n_head * head_size, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x

class MLP(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.up_proj = nn.Linear(n_embed, 4 * n_embed)
        self.down_proj = nn.Linear(4 * n_embed, n_embed)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.up_proj(x)
        x = self.act(x)
        x = self.down_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.attn = MultiHeadAttention(n_head, head_size)
        self.mlp = MLP(n_embed)

    def forward(self, x):
        x = self.attn(x)
        x = self.mlp(x)
        return x

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.layers = nn.Sequential(*[Block(n_embed, n_head) for _ in range(n_layers) ])
        # self.layers = nn.Sequential(
        #     Block(n_embed, n_head=4),
        #     Block(n_embed, n_head=4)
        # )
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.layers(x)
        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (layers): Sequential(
    (0): Block(
      (attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
        (o_proj): Linear(in_features=32, out_features=32, bias=False)
      )
      (mlp): MLP(
        (up_proj): Linear(in_features=32, out_features=128, bias=True)
        (down_proj): Linear(in_features=128, out_features=32, bias=True)
        (act): ReLU()
      )
    )
    (1): Block(
      (attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear

In [35]:
# 17: Model: Skip connections
# 18: Model: Layer Normalization
# 19: Model: Dropout

n_embed = 32
head_size = 8
n_head = 4
n_layers = 2

dropout=0.1

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3


#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)


    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.o_proj = nn.Linear(n_head * head_size, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x

class MLP(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.up_proj = nn.Linear(n_embed, 4 * n_embed)
        self.down_proj = nn.Linear(4 * n_embed, n_embed)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.up_proj(x)
        x = self.act(x)
        x = self.down_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.attn = MultiHeadAttention(n_head, head_size)
        self.mlp = MLP(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.dropout(self.attn(self.ln1(x)))
        x = x + self.dropout(self.mlp(self.ln2(x)))
        return x

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.layers = nn.Sequential(*[Block(n_embed, n_head) for _ in range(n_layers) ])
        self.lm_head = nn.Linear(n_embed, vocab_size)
        self.ln_f = nn.LayerNorm(n_embed)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.layers(x)
        x = self.ln_f(x)
        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (layers): Sequential(
    (0): Block(
      (attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (o_proj): Linear(in_features=32, out_features=32, bias=False)
      )
      (mlp): MLP(
        (up_proj): Linear(in_features=32, out_features=128, bias=True)
        (down_proj): Linear(in_features=128, out_features=32, bias=True)
        (act): ReLU()
      )
      (ln1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (1): Block(
      (attn): MultiHeadAtte

# Notes

## improvements for get_batch()

Here's a revised version of your code with improvements in variable naming, inline comments for clarity, and some additional refinements to make it more understandable and efficient:

```python
import torch

batch_size = 4
sequence_length = 8  # Renamed for clarity

def get_batch(data_split):
    """
    Fetches a batch of data sequences for training or validation.

    Args:
    data_split (str): Either 'train' or 'val' to specify which dataset to use.

    Returns:
    tuple: Two tensors (input sequences and target sequences) moved to the appropriate device.
    """
    # Select the appropriate dataset based on the data split
    dataset = train_data if data_split == 'train' else val_data

    # Randomly choose starting indices for the sequences
    start_indices = torch.randint(len(dataset) - sequence_length, (batch_size,))

    # Extract input sequences using the starting indices
    input_sequences = torch.stack([dataset[start: start + sequence_length] for start in start_indices], dim=0)

    # Extract target sequences which are the next characters following the input sequences
    target_sequences = torch.stack([dataset[start + 1: start + sequence_length + 1] for start in start_indices], dim=0)

    # Move the input and target sequences to the designated computing device
    input_sequences = input_sequences.to(device)
    target_sequences = target_sequences.to(device)

    return input_sequences, target_sequences
```

### Improvements made:

1. **Variable Names**: Improved variable names like `batch_size` and `sequence_length` to enhance code readability.
2. **Function Parameters**: Changed `split` to `data_split` to clearly indicate that it specifies which part of the dataset to use.
3. **Documentation**: Added a docstring to the function explaining what it does, its parameters, and what it returns.
4. **Usage of Start Indices**: Renamed `ix` to `start_indices` for clarity, indicating that these are starting points for the sequences.
5. **Comments**: Added inline comments to explain important steps in the function.

Make sure to define `train_data`, `val_data`, and `device` outside this function to avoid errors when it's called. These should be set according to your dataset and PyTorch device configuration.

## The `.backward()` function in PyTorch

The `.backward()` function in PyTorch is a fundamental method for performing backpropagation, which is critical for training neural networks. It automates the calculation of gradients for all tensors in the network that have `requires_grad` set to `True`. This enables the model to update its parameters according to the gradient descent algorithm during training.

### Basic Usage

In PyTorch, each tensor has a `.grad` attribute that holds the gradients computed during backpropagation. When you call `.backward()` on a tensor, PyTorch computes the gradients of that tensor with respect to all tensors that have `requires_grad=True`.

For a typical loss tensor computed by comparing the model’s predictions to the true values, calling `loss.backward()` will calculate the gradients of the loss with respect to all model parameters. This is because the loss is the endpoint of the computation graph, and backpropagation needs to trace gradients from this endpoint back to the inputs.

### How It Works

- **Computational Graph**: PyTorch builds a dynamic computational graph as your code executes. This graph contains all the operations performed on tensors, and it is used to compute gradients during backpropagation.
- **Gradient Accumulation**: PyTorch accumulates gradients every time `.backward()` is called. This means that if `.backward()` is called multiple times without resetting the gradients, the gradients from each call will add up in the `.grad` attributes. This is particularly useful for scenarios like accumulating gradients over multiple batches of data.
- **Zeroing Gradients**: Because gradients accumulate, you usually need to manually set the gradients to zero before each optimization step using `optimizer.zero_grad()` or `tensor.grad.zero_()`.

### Parameters

- **gradient**: This is an optional parameter that can be passed to `.backward()`. It allows for weighting the gradient during the computation. For example, if the tensor on which `.backward()` is called is not a scalar (i.e., it has more than one element), you must pass a `gradient` argument specifying the tensor of weights.
- **retain_graph**: By default, PyTorch frees the computational graph after backpropagation is done (to save memory). However, if you need to do several backward passes on the same graph, you must set `retain_graph=True`.

### Example Code

Here’s a simple example illustrating the use of `.backward()`:

```python
import torch

# Create tensors.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

# Perform an operation.
z = x * y

# Sum to get a scalar output.
s = z.sum()

# Backpropagate.
s.backward()

# Print gradients.
print(x.grad)  # Output will be the values of y because ds/dx = y.
print(y.grad)  # Output will be the values of x because ds/dy = x.
```

In this example, `s.backward()` computes the gradients of `s` with respect to `x` and `y`, which are stored in `x.grad` and `y.grad`, respectively.

Understanding `.backward()` and its role in gradient computation and backpropagation is crucial for effectively training models using PyTorch.

In [91]:
import torch

# Create tensors.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

# Perform an operation.
z = x * y

# Sum to get a scalar output.
s = z.sum()

# Backpropagate.
s.backward()

# Print gradients.
print(x.grad)  # Output will be the values of y because ds/dx = y.
print(y.grad)  # Output will be the values of x because ds/dy = x.


tensor([4., 5., 6.])
tensor([1., 2., 3.])
