# Build Your Own Transformer-Based LLM From Scratch 

This Jupyter notebook contains a miniature implementation of the transformer architecture from the famous ["Attention Is All You Need" paper](https://arxiv.org/pdf/1706.03762), with layers/components built from scratch using PyTorch. This architecture is what started the snowball of high-performance LLMs in recent years, including the GPT models!

### Install PyTorch

In [3]:
!pip install torch==2.3.1



### Import PyTorch & Required Tools

In [4]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# set random seed for reproducible results 
torch.manual_seed(1337)

<torch._C.Generator at 0x113f8f170>

### Download and observe the training data
We'll be using a small Shakespeare dataset. Note the destination file (`input_data`) that the data is being downloaded into!

In [5]:
!curl https://raw.githubusercontent.com/karpathy/ng-video-lecture/refs/heads/master/input.txt -o input_data


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1089k  100 1089k    0     0  4247k      0 --:--:-- --:--:-- --:--:-- 4254k


### Build the Tokenizer 

In [6]:
# get all unique characters

# create a character-level tokenizer for the input text
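
The cell above is left empty. As a minimal sketch of what a character-level tokenizer could look like, the block below uses a short inline sample in place of the downloaded file so that it runs on its own. The names `chars`, `stoi`, `itos`, and `encoder` are illustrative assumptions, not the notebook's own; `decorder` keeps the spelling used by the generation cells later in this notebook.

```python
# Stand-in sample (assumption); in the notebook this would be the downloaded text.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# get all unique characters, sorted so that the ids are deterministic
chars = sorted(set(text))
vocab_size = len(chars)

# map each character to an integer id and back
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encoder(s):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in s]


def decorder(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


ids = encoder("hear me")
print(ids)
print(decorder(ids))  # round-trips back to "hear me"
```

On the full Shakespeare file, the same approach would give the vocabulary size that the hyperparameter cell later hard-codes as `vocab_size = 65`.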


### Encode text and prepare test-train split

In [7]:
# encode input dataset and place it in tensor 

# train validation split 
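
This cell is also left empty. A minimal sketch of the two steps named in the comments might look like the following. The inline `text`, the `stoi` mapping, and the 90/10 split ratio are all assumptions made so the block runs on its own; the notebook does not state which ratio it uses.

```python
import torch

# Stand-in text and character-to-id mapping (assumptions, see the note above).
text = "To be, or not to be, that is the question."
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}

# encode the whole dataset and place it in a 1-D tensor of token ids
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

# train/validation split; the 90/10 ratio is an assumed value
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(data.shape, train_data.shape, val_data.shape)
```

Keeping the validation slice at the end of the text, rather than sampling it at random, means the model is evaluated on passages it has never seen during training.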

## Let's Start Building the Model


### Feed Forward Block
A feed forward block is a simple neural network layer that consists of two linear transformations, one after the other, with a ReLU between them. The first linear layer expands each token's vector to 4 times `n_embed`, and the second projects it back down to `n_embed`. The ReLU (Rectified Linear Unit) is a function that returns its input if the input is positive and 0 otherwise. It is commonly used as an activation function in neural networks.
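
As a quick illustration of the ReLU behavior described above, here is a small sketch on a hand-picked tensor (the values are arbitrary):

```python
import torch
from torch.nn import functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

# negative entries are clamped to 0; non-negative entries pass through unchanged
y = F.relu(x)
print(y)

# the same result written out by hand
assert torch.equal(y, torch.clamp(x, min=0))
```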

In [47]:
class FeedForward(nn.Module):
    """ two linear layers with a ReLU in between, followed by dropout """

    def __init__(self, n_embed, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed * 4), # expand to 4x the embedding width
            nn.ReLU(), 
            nn.Linear(n_embed * 4, n_embed), # project back down to the embedding width
            nn.Dropout(dropout)
        )

    def forward(self, x):
        output = self.net(x)
        return output

### Attention Head & Multi-Head Attention Block
**Attention Head**: An attention head is a sub-component of a multi-head attention block. It takes a query vector and a set of key-value pairs, and outputs a weighted sum of the values, where the weights come from the similarity between the query and the keys. In self-attention, which is what we use here, the queries, keys, and values are all computed from the same input sequence.

**Multi-Head Attention Block**: A multi-head attention block is a neural network layer that runs multiple attention heads in parallel, each with its own learned query, key, and value projections, and combines their outputs. Inside a transformer block, the output of the multi-head attention block is then passed through a feed forward block.

In [48]:
class Head(nn.Module):

    """ single head of self-attention """

    def __init__(self, n_embed, head_size, block_size, dropout):
        super().__init__()

    def forward(self, x):

        return weighted_aggregation

class MultiHeadAttention(nn.Module):

    """ multiple heads of self attention in parallel """
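
The class bodies above are left empty. The following is one possible way to fill them in, sketched under assumptions rather than taken from the notebook: scaled dot-product attention with a lower-triangular (causal) mask, heads concatenated and passed through an output projection. The attribute names (`key`, `query`, `value`, `tril`, `proj`) and the `MultiHeadAttention` constructor signature are assumptions.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class Head(nn.Module):
    """ single head of causal self-attention (sketch) """

    def __init__(self, n_embed, head_size, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape  # batch, time, channels
        k, q, v = self.key(x), self.query(x), self.value(x)  # each (B, T, head_size)
        # scaled dot-product similarity between every query and every key
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(weights, dim=-1)  # each row sums to 1
        weights = self.dropout(weights)
        weighted_aggregation = weights @ v  # (B, T, head_size)
        return weighted_aggregation


class MultiHeadAttention(nn.Module):
    """ multiple heads of self attention in parallel (sketch) """

    def __init__(self, n_embed, n_heads, block_size, dropout):
        super().__init__()
        head_size = n_embed // n_heads
        self.heads = nn.ModuleList(
            [Head(n_embed, head_size, block_size, dropout) for _ in range(n_heads)]
        )
        # project the concatenated heads back to n_embed; using n_heads * head_size
        # as the input width also covers an n_embed that n_heads does not divide
        self.proj = nn.Linear(n_heads * head_size, n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))


# shape check with the notebook's hyperparameters (n_embed=142, n_heads=4)
mha = MultiHeadAttention(n_embed=142, n_heads=4, block_size=128, dropout=0.2).eval()
x = torch.randn(2, 16, 142)
print(mha(x).shape)  # torch.Size([2, 16, 142])
```

With `n_embed = 142` and `n_heads = 4`, integer division gives a head size of 35, so the concatenated heads are 140 wide; sizing the projection as `n_heads * head_size` is what lets the sketch map that back to 142.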


### Transformer Block

A transformer block is a neural network layer that combines a multi-head attention block with a feed forward block. The attention block lets each token gather information from the other tokens it is allowed to see (communication), and the feed forward block then processes each token's vector independently (computation). The output has the same shape as the input, so several transformer blocks can be stacked one after another.

In [49]:
class TransformerBlock(nn.Module):

    """ transformer block: communication followed by computation """

    def __init__(self, n_embed, n_heads, block_size, dropout):
        super().__init__()

    def forward(self, x):
        return x
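
The block above is a stub that returns its input unchanged. The sketch below shows one common way to wire such a block: layer normalization before each sub-layer and a residual connection around it. These wiring choices are assumptions, not something the notebook confirms. To keep the example self-contained, PyTorch's built-in `nn.MultiheadAttention` stands in for the from-scratch attention block and an inline `nn.Sequential` stands in for `FeedForward`. Note that the built-in requires `n_embed` to be divisible by `n_heads`, so the demo uses 32 and 4 rather than the notebook's 142 and 4.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """ transformer block: communication followed by computation (sketch) """

    def __init__(self, n_embed, n_heads, block_size, dropout):
        super().__init__()
        # stand-in for the from-scratch multi-head attention block
        self.attn = nn.MultiheadAttention(n_embed, n_heads, dropout=dropout, batch_first=True)
        # stand-in for the FeedForward block defined earlier
        self.ffwd = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed), nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed), nn.Dropout(dropout),
        )
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        # True marks the future positions a token is NOT allowed to attend to
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x):
        T = x.shape[1]
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:T, :T], need_weights=False)
        x = x + attn_out                # communication, added to the residual stream
        x = x + self.ffwd(self.ln2(x))  # computation, added to the residual stream
        return x


block = TransformerBlock(n_embed=32, n_heads=4, block_size=8, dropout=0.0).eval()
x = torch.randn(2, 8, 32)
out = block(x)
print(out.shape)  # same shape as the input: torch.Size([2, 8, 32])
```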


### Putting it all together!
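A bigram language model is a type of language model that predicts the next token from only the **previous token**; a bigram is a pair of consecutive tokens, hence the prefix (Bi-). Our `BigramLanguageModel` class keeps that name, but because we will build it using the blocks we previously defined, it can look back over up to `block_size` previous characters when predicting the next one.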

In [50]:
# Model Hyperparameters 
block_size = 128 # maximum context length for prediction
n_embed = 142 # number of embedding dimensions 
n_heads = 4
n_layers = 6
dropout = 0.2
vocab_size = 65


class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
       
    def forward(self, idx, targets=None):
    
        return logits, loss
    

    def generate(self, idx, max_new_tokens):
     
        return idx
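
The `forward` and `generate` methods above are stubs. To illustrate the contract they imply (`forward` returns `(logits, loss)`, `generate` appends sampled tokens to `idx`), here is a minimal sketch using a hypothetical `TinyLanguageModel` that has only embeddings and an output head. The class name, attribute names, and the small hyperparameter values are assumptions; in the notebook's model, the stacked transformer blocks would sit between the embeddings and the output head.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

block_size, n_embed, vocab_size = 8, 16, 65  # small illustrative values


class TinyLanguageModel(nn.Module):
    """ hypothetical stand-in: embeddings + output head, no transformer blocks """

    def __init__(self):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embed)
        self.position_embedding = nn.Embedding(block_size, n_embed)
        # the stacked transformer blocks would go between the embeddings and this head
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_embedding(idx) + self.position_embedding(pos)  # (B, T, n_embed)
        logits = self.lm_head(x)                                      # (B, T, vocab_size)
        loss = None
        if targets is not None:
            # cross_entropy expects (N, C) logits and (N,) targets
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]              # crop to the maximum context length
            logits, _ = self(idx_cond)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)      # append the sampled token
        return idx


m = TinyLanguageModel()
out = m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20)
print(out.shape)  # torch.Size([1, 21]): the start token plus 20 sampled tokens
```

Cropping `idx` to the last `block_size` tokens matters because the position embedding table only has `block_size` rows; without the crop, generation would fail once the sequence grows past that length.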

#### Let's see how the model performs prior to training

In [58]:
m = BigramLanguageModel()
print(decorder(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


tgRVxeeoJKVchAPRB,$pX!B;wIsUJqKMD.fIYwqlwT.zZh-.!A wyoFMPC&ifW.wXWwAAj3H;ItqA3WusQPmbMsqQpIWkyUT'k.R


The result is a string of essentially random tokens from our vocab. This is expected, since the model's parameters are randomly initialized and it has not been trained yet.
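
A rough sanity check for an untrained model: if it guessed uniformly over the vocabulary, its cross-entropy loss would be the negative log of `1 / vocab_size`. The sketch below works that out for the `vocab_size = 65` set in the hyperparameter cell. A randomly initialized model is only approximately uniform, so its actual starting loss will typically sit somewhat above this value.

```python
import math

vocab_size = 65  # value from the hyperparameter cell

# cross-entropy of a uniform guess over the vocabulary: -ln(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")  # 4.1744
```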

## Let's train the model!

In [41]:
# define training hyperparameters
batch_size = 4 # number of independent sequences processed in parallel
max_iterations = 5000
learning_rate = 3e-4
eval_interval = 1000
eval_iters = 200


# initialize the model we created
model = BigramLanguageModel()

# define a function that generates batches of data for training or validation. 
def get_batch(split):
    return x, y

# define a function to calculate the mean loss of the model on the training and validation data
@torch.no_grad()  # disable gradient tracking during evaluation; saves memory and compute
def estimate_loss():
   
xb, yb = get_batch('train')


# get pytorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# training loop
for step in range(max_iterations):

    # evaluate loss on training and validation data every eval_interval iterations
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"Training loss at step {step}: {losses['train']:.4f} \nValidation loss at step {step}: {losses['val']:.4f}")

    # sample a batch
    xb, yb = get_batch('train')

    # evaluate loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

Training loss at step 0: 4.3404 
Validation loss at step 0: 4.3501
Training loss at step 1000: 2.4414 
Validation loss at step 1000: 2.4643
Training loss at step 2000: 2.2943 
Validation loss at step 2000: 2.3042
Training loss at step 3000: 2.1522 
Validation loss at step 3000: 2.2031
Training loss at step 4000: 2.0412 
Validation loss at step 4000: 2.1139
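
The training cell above calls `get_batch` and `estimate_loss`, but their bodies are left empty. The sketch below shows one way they could be written. Everything needed to run it on its own is an assumption: random token ids stand in for the encoded Shakespeare text, the 90/10 split is an assumed ratio, and the hypothetical `StandInModel` (a plain embedding lookup table) is used only because it exposes the same `(logits, loss)` interface as the notebook's model.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
block_size, batch_size, eval_iters, vocab_size = 8, 4, 20, 65

# stand-in data: random token ids instead of the encoded Shakespeare text
data = torch.randint(0, vocab_size, (1000,))
n = int(0.9 * len(data))  # assumed 90/10 split
train_data, val_data = data[:n], data[n:]


class StandInModel(nn.Module):
    """ hypothetical model exposing the same (logits, loss) interface """

    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        return logits, loss


model = StandInModel()


def get_batch(split):
    source = train_data if split == 'train' else val_data
    # random start offsets, leaving room for block_size inputs plus one target
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])  # inputs shifted by one
    return x, y


@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()  # turn off dropout while evaluating
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()  # back to training mode
    return out


losses = estimate_loss()
print(f"train {losses['train']:.4f}, val {losses['val']:.4f}")
```

Averaging over `eval_iters` batches gives a much less noisy estimate than the loss of a single batch, which is why the training loop reports these means rather than the per-step loss.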


### Sample generation!

Let's see how the model performs after training.

In [59]:
print(decorder(model.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=500)[0].tolist()))



Whaly prase, agesif! Ines:
You wall, knoway your prave lomy me.

LOENCENTER:
Tholk 'This an delfs!

ANGHAH:
Thes, as INould haple dam the kinking theing
You shiche shall to his intesatool's lordcy gonge!

MERCUans, Yorkize, What othr lous I Hey,'d were lancalt it
If he reguan, larnd a ongenow: not Lord giefuld:
On nir sigisent
En Ravent on my pruc des,
Foor bage'd booth of ther of berpoble.
Dey! fay!
To cousis, and tur hee she spirthe hat dank
Noble of infur the pludmy dease nedt sugee gring;
An



Given our hardware and time restrictions, it's not great, but the output already has the shape of a play: speaker names in capitals followed by lines of dialogue. Increasing the model's hyperparameters (especially `n_embed`, the size of each token's embedding vector) will likely improve the results, but training will take longer because there are more model weights to compute and optimize.