# Module 2 Project 1: TRANSFORMERS

Implement a basic version of a Transformer-based language model

## STEP 1: IMPORTS AND HYPERPARAMETERS
- Import necessary libraries
- Set up our hyperparameters
- `embed_dim` must be divisible by `num_heads`; here 32 / 4 gives each attention head a size of 8
- `chunk_size` is our context length
- 4 layers of the network, and 4 attention heads per layer
- `batch_size` is the size of our batch dimension
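
The relationship between `embed_dim` and `num_heads` can be checked up front. This is a minimal sketch that assumes the values used in this step; the `head_size` name mirrors what the Building Block computes later.

```python
# Values mirror the hyperparameters chosen in this step.
embed_dim = 32
num_heads = 4

# Each head gets an equal slice of the embedding, so the division must be exact.
assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"

head_size = embed_dim // num_heads
print(head_size)  # 8
```

If the assertion fails, the concatenated head outputs would no longer add up to `embed_dim` features.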

In [None]:
import requests
import re
import os
import torch
import torch.nn as nn
from torch.nn import functional as F

## Hyperparameters
chunk_size = 128
batch_size = 16
embed_dim = 32
num_layers = 4
num_heads = 4

learning_rate = 3e-4
dropout = 0.2

eval_iters = 200
eval_interval = 200
epochs = 5000

max_new_tokens = 400



## STEP 2: DATASET COLLECTION
- We chose the Winnie-the-Pooh eBook from Project Gutenberg for sample data
- Select our start and end tags of the text we want, and extract it into `result`
- Create a simple vocabulary by getting all possible characters in the text
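
The extraction and vocabulary steps can be tried on a small in-memory string before touching the network. This is a sketch only: `extract_between`, the sample text, and the shortened tags are hypothetical stand-ins, and the helper raises an error when a tag is missing instead of silently slicing from index -1.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between the two tags, or raise if either tag is missing."""
    start_idx = text.find(start_tag)
    end_idx = text.find(end_tag)
    if start_idx == -1 or end_idx == -1:
        raise ValueError("start or end tag not found")
    # Slice from the end of the start tag so the tag itself is excluded.
    return text[start_idx + len(start_tag):end_idx]


# Hypothetical sample standing in for the downloaded eBook.
sample = "header *** START *** hello pooh *** END *** footer"
body = extract_between(sample, "*** START ***", "*** END ***")

# The character-level vocabulary is the sorted set of distinct characters.
chars = sorted(set(body))
vocab_size = len(chars)
print(repr(body))   # ' hello pooh '
print(vocab_size)   # 6
```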

In [None]:
resp = requests.get("https://www.gutenberg.org/cache/epub/67098/pg67098.txt")

# Tags for text filtering
start = "*** START OF THE PROJECT GUTENBERG EBOOK WINNIE-THE-POOH ***"
end = "*** END OF THE PROJECT GUTENBERG EBOOK WINNIE-THE-POOH ***"

result = resp.text[resp.text.find(start):resp.text.find(end)]

print(result[:1000])

chars = sorted(list(set(result)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

## STEP 3: BASIC TOKENIZATION
- Here we will be doing the most basic form of tokenization - character level
- No BPE algorithm here (yet)
- Just `encode` and `decode` functions for use with our text data like classic tokenizers
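
A quick round trip shows what the two functions do. This is a sketch on a toy string (the text `"pooh bear"` is an assumption chosen for illustration, not the real dataset).

```python
text = "pooh bear"  # toy corpus for illustration only

# Vocabulary: every distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hobo")
print(ids)          # [4, 5, 2, 5]
print(decode(ids))  # hobo
```

Any character that is not in the vocabulary raises a `KeyError`, which is one limitation of this simple scheme.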

In [None]:
stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

## STEP 3B: TRAIN AND VALIDATION SPLIT
- Split the data between train and validation sets (90% train, 10% val)
- Print an example 'chunk' in the training data
- We will define a `get_batch` function that draws a random batch of `batch_size` chunks, each `chunk_size` tokens long
- We also get the set of 'next' tokens for a given chunk, i.e. the same chunk shifted over by 1
- Now we go through a batch, and all chunks within the batch, and print the `context` and `target`
- Here, `context` is the sequence of characters seen so far and `target` is the next character to be predicted
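
The shift-by-one relationship between inputs and targets is easiest to see on a tiny sequence. This sketch uses plain Python lists and made-up sizes (10 tokens, `chunk_size = 4`) instead of the real tensors.

```python
# Toy "token ids" standing in for the encoded text.
data = list(range(10))

# 90% train, 10% validation.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

chunk_size = 4  # small value for illustration only
x = train_data[:chunk_size]          # inputs:  [0, 1, 2, 3]
y = train_data[1:chunk_size + 1]     # targets: [1, 2, 3, 4]

# One chunk yields chunk_size training examples of growing context length.
pairs = [(x[:t + 1], y[t]) for t in range(chunk_size)]
for context, target in pairs:
    print(context, "->", target)
```

Each chunk therefore teaches the model to predict from a context of length 1 up to `chunk_size`.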

In [None]:
data = torch.tensor(encode(result), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

x = train_data[:chunk_size]
y = train_data[1:chunk_size+1]

for t in range(chunk_size):
    context = x[:t+1]
    target = y[t]
    print(str(context) + " " + str(target))

#torch.manual_seed(1337)

def get_batch(split):
    data = train_data if split == "train" else val_data
    index = torch.randint(len(data) - chunk_size, (batch_size,))
    x = torch.stack([data[i:i+chunk_size] for i in index])
    y = torch.stack([data[i+1:i+chunk_size+1] for i in index])
    return x, y

x_batch, y_batch = get_batch("train")

print("Inputs")
print(x_batch.shape)
print(x_batch)
print("Targets")
print(y_batch.shape)
print(y_batch)

for b in range(batch_size):
    for t in range(chunk_size):
        context = x_batch[b, :t+1]
        target = y_batch[b, t]
        print(str(context.tolist()) + " " + str(target.tolist()))

## STEP 4: ATTENTION
- Here, we will create our simple self-attention head, much like we did in the BERT implementation
- We assign linear transformations to produce our key, query, and value vectors, and register a lower-triangular mask buffer (`tril`) that hides all entries in our attention weight matrix that correspond to 'future' positions (when predicting the next token we may only use the current token and the tokens before it; seeing future tokens would give us the answer before calculating it)
- Our weight matrix is formed by multiplying Q and K.T and scaling the result, which gives us a (T, T) matrix of scores for each sequence that details the relevance of each token in the chunk to every other token
- Once this is done, we set the masked entries to `-inf` and apply a softmax across each row, so every row becomes a set of weights in [0, 1] that sum to 1 and the masked entries become 0
- Finally, we multiply this normalized weight matrix by our value vectors (a linear projection of the token embeddings) to get, for each token, a weighted average of the values it is allowed to attend to
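
The masking and normalization steps above can be sketched in isolation. The sizes `B`, `T`, and `head_size` below are arbitrary assumptions, and random tensors stand in for the outputs of the key, query, and value projections.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, head_size = 2, 5, 8  # illustrative sizes, not the notebook's values

# Random stand-ins for the projected queries, keys, and values.
q = torch.randn(B, T, head_size)
k = torch.randn(B, T, head_size)
v = torch.randn(B, T, head_size)

# Scaled dot-product scores, shape (B, T, T).
scores = q @ k.transpose(-2, -1) * head_size ** -0.5

# Lower-triangular mask: position t may look at positions 0..t only.
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float("-inf"))

# Softmax over the last dimension so each query's weights sum to 1.
weights = F.softmax(scores, dim=-1)
out = weights @ v  # (B, T, head_size)

print(weights[0])  # upper triangle is exactly zero
```

Because the first position can only attend to itself, its output equals its own value vector, which makes a handy sanity check.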

In [None]:
class AttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_size, bias=False)
        self.query = nn.Linear(embed_dim, head_size, bias=False)
        self.value = nn.Linear(embed_dim, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(chunk_size, chunk_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

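        # Scaled dot-product scores, shape (B, T, T); scale by the key dimension (head_size)
        weights = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5
        # Hide future positions so each token only attends to itself and earlier tokens
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        # Normalize over the key dimension so each query's weights sum to 1
        weights = F.softmax(weights, dim=-1)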
        weights = self.dropout(weights)

        v = self.value(x)
        result = weights @ v

        return result

## STEP 5: MULTI HEADED ATTENTION
- Using what we did above, we can combine our Attention Heads into a `nn.ModuleList` layer, the size of which is `num_heads`
- Afterwards, we add a linear projection layer and dropout, which mix the concatenated head outputs back into a single `embed_dim`-sized vector per token
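
The shape bookkeeping behind the concatenation can be sketched without any real attention. Random tensors stand in for the per-head outputs here, and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, T, embed_dim, num_heads = 2, 5, 32, 4  # illustrative sizes
head_size = embed_dim // num_heads

# Stand-ins for the output of each attention head: (B, T, head_size).
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

# Concatenating along the feature dimension restores embed_dim features.
concat = torch.cat(head_outputs, dim=-1)

# The projection mixes information across heads without changing the shape.
projection = nn.Linear(embed_dim, embed_dim)
out = projection(concat)

print(concat.shape, out.shape)
```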

In [None]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([AttentionHead(head_size) for _ in range(num_heads)])
        self.projection = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        result = torch.cat([h(x) for h in self.heads], dim=-1)
        result = self.dropout(self.projection(result))
        return result

## STEP 6: FEED FORWARD NETWORK
- The last piece we really need to get this all running is the Feed Forward network
- This is just a sequence of a linear transformation, ReLU activation, and another linear transformation, with the usual dropout applied at the end
- The `embed_dim` parameter is scaled up by 4x in the intermediate layer as in the original Transformer architecture paper
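
To get a feel for the cost of the 4x expansion, the parameter count can be worked out by hand and compared against PyTorch. This sketch assumes `embed_dim = 32` as in the hyperparameters and leaves out dropout, which has no parameters.

```python
import torch.nn as nn

embed_dim = 32  # matches the hyperparameter section

net = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.ReLU(),
    nn.Linear(4 * embed_dim, embed_dim),
)

# Count the parameters PyTorch actually allocated.
n_params = sum(p.numel() for p in net.parameters())

# Hand count: weights plus biases for each of the two linear layers.
expected = (embed_dim * 4 * embed_dim + 4 * embed_dim) + (4 * embed_dim * embed_dim + embed_dim)

print(n_params, expected)  # 8352 8352
```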

In [None]:
class FeedForward(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4*embed_dim),
            nn.ReLU(),
            nn.Linear(4*embed_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

## STEP 7: BUILDING BLOCKS
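- Now we can put everything together in our layers, called Building Blocks
- We calculate the `head_size` based on hyperparameters (`embed_dim // num_heads`)
- Each layer applies the following steps in order (a 'pre-norm' arrangement, where LayerNorm comes before each sub-layer):
    - LayerNorm (normalization over each token's feature vector)
    - Multi Headed Attention, whose output is added back to its input (residual connection)
    - Another LayerNorm
    - Feed Forward network, whose output is again added back to its input
- We will then return the output, which has the same shape as the input (`embed_dim` features per token)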
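
The "normalize, apply the sub-layer, add the result back" pattern used in the cell below can be sketched as a small wrapper. `PreNormResidual` is a hypothetical helper for illustration, not part of the notebook, and a plain linear layer stands in for attention or the feed-forward network.

```python
import torch
import torch.nn as nn


class PreNormResidual(nn.Module):
    """Hypothetical wrapper: x + sublayer(LayerNorm(x))."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        # The residual path carries x through unchanged.
        return x + self.sublayer(self.norm(x))


torch.manual_seed(0)
embed_dim = 32
x = torch.randn(2, 5, embed_dim)

# A sub-layer initialized to zero makes the block an identity function,
# which shows that the residual path passes the input straight through.
zero_layer = nn.Linear(embed_dim, embed_dim)
nn.init.zeros_(zero_layer.weight)
nn.init.zeros_(zero_layer.bias)

identity_out = PreNormResidual(embed_dim, zero_layer)(x)
block_out = PreNormResidual(embed_dim, nn.Linear(embed_dim, embed_dim))(x)
print(block_out.shape)
```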

In [None]:
class BuildingBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_size = embed_dim // num_heads
        self.self_attention = MultiHeadedAttention(num_heads, head_size)
        self.feed_forward = FeedForward(embed_dim)
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.layer_norm2 = nn.LayerNorm(embed_dim)


    def forward(self, x):
        x = x + self.self_attention(self.layer_norm1(x))
        x = x + self.feed_forward(self.layer_norm2(x))
        return x

## STEP 8: SIMPLE TRANSFORMER
- Using our Building Blocks above, we have the components to build our Simple Transformer language model
- We first create a token embedding table and a position embedding table
- The token embedding input is of size `vocab_size`, and the position embedding input is of size `chunk_size`
- We build a sequential layer of our Building Blocks according to `num_layers`
- We add a final LayerNorm to normalize the output, before one final linear transformation layer as the LM head (the output size here is `vocab_size`, giving one score per possible 'next' character at every position)
- In the forward pass, we first embed the input in our token embedding, and we then embed our positions along the 'time' dimension of our input (basically just the index of each token in the chunk)
- After adding them together to get an embedding of both our tokens and their positions, we pass the result to our Building Block layers
- Once the output is returned, we do our layer normalization, and the final linear transformation to size `vocab_size` to get our `logits`, which we can use for predicting the next token
- Alongside the model, we define an `estimate_loss` function that averages the loss over a number of batches for better logging during training (not strictly necessary, just nice to have)
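
Two shape details in the forward pass are worth seeing on their own: the position embedding broadcasts across the batch, and the logits are flattened before `cross_entropy`. This is a sketch with small made-up sizes and without the Building Blocks in between.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, chunk_size, embed_dim = 10, 8, 16  # illustrative sizes
B, T = 2, 5

token_table = nn.Embedding(vocab_size, embed_dim)
position_table = nn.Embedding(chunk_size, embed_dim)
lm_head = nn.Linear(embed_dim, vocab_size)

index = torch.randint(vocab_size, (B, T))
targets = torch.randint(vocab_size, (B, T))

# (B, T, E) + (T, E): the position embedding is broadcast over the batch.
positions = torch.arange(T, device=index.device)
x = token_table(index) + position_table(positions)

logits = lm_head(x)  # (B, T, vocab_size)

# cross_entropy expects (N, C) logits and (N,) targets, so flatten B and T.
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))
print(x.shape, logits.shape, loss.item())
```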

In [None]:
class SimpleTransformer(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding_table = nn.Embedding(chunk_size, embed_dim)
        self.blocks = nn.Sequential(*[BuildingBlock(embed_dim, num_heads=num_heads) for _ in range(num_layers)])
        self.layer_norm_f = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, index, targets=None):

        B, T = index.shape

        token_embedding = self.token_embedding_table(index)
        # Build the positions on the same device as the input instead of hard-coding 'cpu'
        position_embedding = self.position_embedding_table(torch.arange(T, device=index.device))
        x = token_embedding + position_embedding
        x = self.blocks(x)
        x = self.layer_norm_f(x)
        logits = self.lm_head(x)

        # Generation step if we have no target
        if targets is None:
            loss = None

        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

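    @torch.no_grad()  # sampling does not need gradients
    def generate(self, index, max_new_tokens):
        for _ in range(max_new_tokens):
            # Crop to the last chunk_size tokens (the position table has no more entries)
            context = index[:, -chunk_size:]
            logits, loss = self(context)
            # Keep only the prediction for the final position
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            # Sample the next token and append it to the running sequence
            next_index = torch.multinomial(probs, num_samples=1)
            index = torch.cat((index, next_index), dim=1)

        return index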

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

## STEP 9: BASIC GENERATION
- Now that our Simple Transformer model is created, we can try text generation on an untrained model
- We first demonstrate an example of doing a 'forward' pass with the model, and observing the loss and logits returned, to ensure everything is functioning correctly
- Generation works by calling the model's `generate` function with a starting token (index 0 in this case); at each step the model returns logits, which are softmaxed and sampled with `torch.multinomial` to pick the 'next' token, until `max_new_tokens` have been generated
- The resulting token indices are decoded once all tokens have been generated. This is how we generate text with a language model!
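
One way to judge whether the initial loss looks reasonable: a model that spreads probability evenly over the vocabulary has a cross-entropy of ln(`vocab_size`), so an untrained model should typically land near that value. The sketch below assumes a hypothetical vocabulary of 80 characters; the real size is whatever Step 2 printed.

```python
import math

import torch
import torch.nn.functional as F

vocab_size = 80  # hypothetical; use the value printed in Step 2

# All-zero logits correspond to a uniform distribution over the vocabulary.
logits = torch.zeros(16, vocab_size)
targets = torch.randint(vocab_size, (16,))

loss = F.cross_entropy(logits, targets)
baseline = math.log(vocab_size)

print(loss.item(), baseline)  # both roughly 4.38
```

A starting loss far above this baseline usually points at a shape or initialization problem.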

In [None]:
model = SimpleTransformer(embed_dim)
logits, loss = model(x_batch, y_batch)
print(logits.shape)
print(loss)

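model.eval()  # turn off dropout while sampling
print(decode(model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
model.train()  # back to training mode for the next step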

## STEP 10: MODEL TRAINING
- Now it is time to set up our training loop for our model
- We use AdamW as our optimizer here (a variant of Adam, the optimizer used in the original paper, with decoupled weight decay)
- Every `eval_interval` steps, we estimate the train and validation loss by averaging over `eval_iters` random batches and display it - just nice to have
- We get a random batch from our data, and run it through our Simple Transformer model
- We use `cross_entropy` to calculate the loss, backpropagate, and step the optimizer
- After we have reached our desired number of training steps (the `epochs` hyperparameter; each step processes one random batch, not a full pass over the data), we output the final loss and a sample generation
- We will have successfully trained a Simple Transformer model
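
The zero-grad, backward, step sequence is the same for any model, so it can be tried on a toy problem first. This is a sketch that fits a single linear layer to `y = 3x + 1`; the toy data, `toy_model`, and the learning rate are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy regression data: y = 3x + 1.
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 3 * x + 1

toy_model = nn.Linear(1, 1)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-1)

losses = []
for step in range(200):
    pred = toy_model(x)
    loss = F.mse_loss(pred, y)

    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()                        # compute new gradients
    optimizer.step()                       # update the parameters

    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss should fall steadily; if it does not, the usual suspects are a missing `zero_grad` call or a learning rate that is far too large.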

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(epochs):

    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    x_batch, y_batch = get_batch('train')

    logits, loss = model(x_batch, y_batch)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())
model.eval()  # turn off dropout while sampling
print(decode(model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=max_new_tokens)[0].tolist()))