# Tiny GPT
Build a small GPT model using PyTorch (on macOS)

## Check PyTorch instance

_Notes:_

**MPS (Metal Performance Shaders)**
* Metal Performance Shaders is an Apple framework of highly optimized GPU shaders for image processing, linear algebra, and neural networks on top of Metal.
* PyTorch, diffusers, and other ML libraries expose an mps device to offload tensor operations to the GPU via Metal/MPS on Apple silicon.


In [1]:
import torch

print(f"Torch version: {torch.__version__} and MPS availability is: {torch.backends.mps.is_available()}")

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
torch.set_default_device(device)

print(f"Default device set to: {torch.get_default_device()}")
print(f"Is MPS built: {torch.backends.mps.is_built()}")

Torch version: 2.9.1 and MPS availability is: True
Default device set to: mps:0
Is MPS built: True


## Mathematical Explanation

An LLM operates only on numbers, from which it calculates a probability for every candidate next token. In the simplest case, the model selects the token with the highest probability as the next word.
$$
\begin{aligned}
&P(w_1, w_2, w_3, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, w_2, \ldots, w_{t-1}) \\
&\text{where } w_i \text{ represents a token}
\end{aligned}
$$
For example, suppose:
* we have the vocabulary (built from the training set): [`very`, `tea`, `hot`, `is`, `the`]
* the first 2 words fed to the model are: [`the`, `tea`]
* the model uses the _chain rule of probability_ above to predict the most likely next word.
    * It evaluates `P('is'|'the','tea')`, read as "the probability of `is` given that the previous words are `the` and `tea`". Similarly, it evaluates `P('hot'|'the','tea')`, and so on for every word in the vocabulary. The word with the highest probability wins.
    * For a sequence of 4 tokens, the overall probability is calculated as
$$
\begin{aligned}
P(w_1,w_2,w_3,w_4) = P(w_1) \times P(w_2|w_1) \times P(w_3|w_1, w_2) \times P(w_4|w_1, w_2, w_3)
\end{aligned}
$$
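
The chain rule above can be checked with a few lines of Python. The conditional probabilities in this sketch are invented purely for illustration (they do not come from any trained model); the code multiplies them to obtain the probability of the sequence `the tea is hot`.

```python
import math

# Hypothetical conditional probabilities for the sequence "the tea is hot".
# These numbers are made up for illustration, not taken from a model.
conditionals = [
    ("the", 0.20),  # P(the)
    ("tea", 0.10),  # P(tea | the)
    ("is", 0.50),   # P(is | the, tea)
    ("hot", 0.40),  # P(hot | the, tea, is)
]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(p for _, p in conditionals)

# The same quantity computed as a sum of logs, which avoids numerical
# underflow when sequences get long.
log_joint = sum(math.log(p) for _, p in conditionals)

print(f"P(the, tea, is, hot) = {joint:.4f}")
print(f"exp(sum of logs)     = {math.exp(log_joint):.4f}")
```

Both lines print the same value, since multiplying probabilities and adding their logarithms are equivalent.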

## Training Data

In [2]:
corpus = [
    "hello friends how are you",
    "the tea is very hot",
    "my name is Sushovan",
    "the roads of Kolkata are busy",
    "it is raining in Mumbai",
    "the train is late again",
    "i love eating samosas and drinking tea",
    "holi is my favorite festival",
    "diwali brings lights and sweets",
    "india won the cricket match"
]

## Tokenize Data for the model
_Note_: Instead of using an established subword tokenizer (e.g. `BPE`, Byte Pair Encoding, used in GPT, or `SentencePiece`, used in Llama), we build a simple word-level tokenizer to understand the concepts.

### Put end marker and concatenate the corpus

In [3]:
text = " ".join([s + " <END>" for s in corpus])
print(text)

hello friends how are you <END> the tea is very hot <END> my name is Sushovan <END> the roads of Kolkata are busy <END> it is raining in Mumbai <END> the train is late again <END> i love eating samosas and drinking tea <END> holi is my favorite festival <END> diwali brings lights and sweets <END> india won the cricket match <END>


### Extract the words to build the vocabulary

In [4]:
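# Note: set() has no guaranteed order, so the indices below can differ between runs;
# use sorted(set(text.split())) if you need a reproducible vocabulary.
words = list(set(text.split()))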
vocab_size = len(words)
print(f"Vocabulary: {words}")
print(f"Vocabulary Size: {vocab_size}")

Vocabulary: ['samosas', 'i', 'lights', 'friends', '<END>', 'my', 'you', 'Sushovan', 'of', 'train', 'and', 'drinking', 'eating', 'holi', 'love', 'sweets', 'very', 'roads', 'busy', 'it', 'again', 'name', 'raining', 'festival', 'brings', 'hot', 'late', 'hello', 'tea', 'the', 'are', 'cricket', 'Mumbai', 'diwali', 'won', 'match', 'is', 'in', 'favorite', 'india', 'Kolkata', 'how']
Vocabulary Size: 42


### Build the word index

In [5]:
word2idx = {w: idx for idx, w in enumerate(words)}
print(f"Words to Index:  {word2idx}")

idx2word = {idx: w for w, idx in word2idx.items()}
print(f"Index to Words:  {idx2word}")

Words to Index:  {'samosas': 0, 'i': 1, 'lights': 2, 'friends': 3, '<END>': 4, 'my': 5, 'you': 6, 'Sushovan': 7, 'of': 8, 'train': 9, 'and': 10, 'drinking': 11, 'eating': 12, 'holi': 13, 'love': 14, 'sweets': 15, 'very': 16, 'roads': 17, 'busy': 18, 'it': 19, 'again': 20, 'name': 21, 'raining': 22, 'festival': 23, 'brings': 24, 'hot': 25, 'late': 26, 'hello': 27, 'tea': 28, 'the': 29, 'are': 30, 'cricket': 31, 'Mumbai': 32, 'diwali': 33, 'won': 34, 'match': 35, 'is': 36, 'in': 37, 'favorite': 38, 'india': 39, 'Kolkata': 40, 'how': 41}
Index to Words:  {0: 'samosas', 1: 'i', 2: 'lights', 3: 'friends', 4: '<END>', 5: 'my', 6: 'you', 7: 'Sushovan', 8: 'of', 9: 'train', 10: 'and', 11: 'drinking', 12: 'eating', 13: 'holi', 14: 'love', 15: 'sweets', 16: 'very', 17: 'roads', 18: 'busy', 19: 'it', 20: 'again', 21: 'name', 22: 'raining', 23: 'festival', 24: 'brings', 25: 'hot', 26: 'late', 27: 'hello', 28: 'tea', 29: 'the', 30: 'are', 31: 'cricket', 32: 'Mumbai', 33: 'diwali', 34: 'won', 35: 'match', 36: 'is', 37: 'in', 38: 'favorite', 39: 'india', 40: 'Kolkata', 41: 'how'}
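
The two dictionaries together form a complete word-level tokenizer. The sketch below wraps them in `encode` and `decode` helpers; both function names are hypothetical (they do not appear in the notebook), the corpus is shortened to two sentences, and the vocabulary is sorted so that the indices are reproducible, which differs from the unsorted `set` used above.

```python
# A small stand-in corpus; the notebook's full corpus works the same way.
corpus = ["the tea is very hot", "my name is Sushovan"]
text = " ".join(s + " <END>" for s in corpus)

# sorted() makes the index assignment reproducible across runs
# (a plain set has no guaranteed order).
words = sorted(set(text.split()))
word2idx = {w: i for i, w in enumerate(words)}
idx2word = {i: w for w, i in word2idx.items()}


def encode(sentence):
    """Map a whitespace-separated sentence to a list of token ids."""
    tokens = sentence.split()
    unknown = [w for w in tokens if w not in word2idx]
    if unknown:
        # A word-level vocabulary cannot represent unseen words.
        raise KeyError(f"words not in vocabulary: {unknown}")
    return [word2idx[w] for w in tokens]


def decode(ids):
    """Map a list of token ids back to a sentence."""
    return " ".join(idx2word[i] for i in ids)


ids = encode("the tea is hot")
print(ids)
print(decode(ids))
```

Decoding the encoded ids returns the original sentence. A word that never appeared in the corpus raises a `KeyError`, which is the main limitation of a word-level vocabulary compared with subword tokenizers.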

### Convert to tensor
Replace each word in the text with its corresponding index and feed the result into a tensor.

In [6]:
data = torch.tensor([word2idx[w] for w in text.split()], dtype=torch.long)
print(f"Tensor Data: {data}")
print(f"Data Shape: {data.shape}")

Tensor Data: tensor([27,  3, 41, 30,  6,  4, 29, 28, 36, 16, 25,  4,  5, 21, 36,  7,  4, 29,
        17,  8, 40, 30, 18,  4, 19, 36, 22, 37, 32,  4, 29,  9, 36, 26, 20,  4,
         1, 14, 12,  0, 10, 11, 28,  4, 13, 36,  5, 38, 23,  4, 33, 24,  2, 10,
        15,  4, 39, 34, 29, 31, 35,  4], device='mps:0')
Data Shape: torch.Size([62])


## Define Parameters

`block_size`: also known as the context length. It is the maximum number of previous tokens the LLM looks at when predicting the next token.

In [7]:
block_size = 6
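
To make the context window concrete, the sketch below takes the first eight token ids of the encoded data shown earlier (`hello friends how are you <END> the tea`) and builds one training example in plain Python lists. The variable `start` is a hypothetical name for the window's starting position.

```python
block_size = 6

# Token ids for "hello friends how are you <END> the tea", copied from the
# start of the tensor printed earlier.
data = [27, 3, 41, 30, 6, 4, 29, 28]

start = 0
x = data[start:start + block_size]          # model input
y = data[start + 1:start + block_size + 1]  # targets: the input shifted by one

# Position t of the target is the token that follows the first t + 1 inputs,
# so a single window provides block_size prediction tasks.
for t in range(block_size):
    print(f"context {x[:t + 1]} -> target {y[t]}")
```

Each window therefore trains the model on every context length from 1 up to `block_size` at once.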

`embedding_dim`: Embedding Dimension

Each token is represented inside the model as a vector of a fixed size (`embedding_dim`). The values of each vector are initialized randomly and then adjusted during training. The LLM uses these vectors to predict the next word.

In [8]:
embedding_dim = 32
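
An embedding layer is essentially a lookup table with one row per token. The sketch below imitates that table with plain Python lists rather than `nn.Embedding`; the name `embedding_table` and the fixed seed are assumptions made for illustration only.

```python
import random

vocab_size = 42
embedding_dim = 32

random.seed(0)  # fixed seed so the sketch is repeatable

# One row per token id; each row holds embedding_dim random numbers drawn
# from a standard normal distribution. In the notebook this role is played
# by nn.Embedding, whose rows are updated during training.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [29, 28]  # "the", "tea" in the notebook's word index
vectors = [embedding_table[i] for i in token_ids]  # lookup is plain row indexing

print(len(vectors), len(vectors[0]))
```

The lookup turns a sequence of 2 token ids into 2 vectors of 32 numbers each, which is the shape the rest of the model works with.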

In [9]:
n_heads = 2  # number of attention heads in each multi-head attention layer

In [10]:
n_layers = 2  # number of transformer blocks to use

In [11]:
lr = 1e-3  # learning rate

In [12]:
epochs = 1500  # number of training iterations (one random batch per iteration)

## Define batch function

`batch_size`: the number of sequences (windows of `block_size` tokens, which may span sentence boundaries) the model processes in one training step

In [13]:
def get_batch(batch_size=16):
    # Sample batch_size random start positions; each one yields a window of block_size tokens.
    ix = torch.randint(len(data) - block_size, size=(batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return ix, x, y


rand_int, input_batch, output_bath = get_batch()
print(f"Data: {data}")
print(f"Batch Index: {rand_int}")
print(f"input_batch: {input_batch}")
print(f"output_bath: {output_bath}")

Data: tensor([27,  3, 41, 30,  6,  4, 29, 28, 36, 16, 25,  4,  5, 21, 36,  7,  4, 29,
        17,  8, 40, 30, 18,  4, 19, 36, 22, 37, 32,  4, 29,  9, 36, 26, 20,  4,
         1, 14, 12,  0, 10, 11, 28,  4, 13, 36,  5, 38, 23,  4, 33, 24,  2, 10,
        15,  4, 39, 34, 29, 31, 35,  4], device='mps:0')
Batch Index: tensor([38, 49, 52,  2, 20, 29,  4, 26, 29, 47, 35,  6, 28, 13, 38, 47],
       device='mps:0')
input_batch: tensor([[12,  0, 10, 11, 28,  4],
        [ 4, 33, 24,  2, 10, 15],
        [ 2, 10, 15,  4, 39, 34],
        [41, 30,  6,  4, 29, 28],
        [40, 30, 18,  4, 19, 36],
        [ 4, 29,  9, 36, 26, 20],
        [ 6,  4, 29, 28, 36, 16],
        [22, 37, 32,  4, 29,  9],
        [ 4, 29,  9, 36, 26, 20],
        [38, 23,  4, 33, 24,  2],
        [ 4,  1, 14, 12,  0, 10],
        [29, 28, 36, 16, 25,  4],
        [32,  4, 29,  9, 36, 26],
        [21, 36,  7,  4, 29, 17],
        [12,  0, 10, 11, 28,  4],
        [38, 23,  4, 33, 24,  2]], device='mps:0')
output_b

In [14]:
import numpy as np

print(f"Decoded data: \n{[idx2word[i.item()] for i in data]}")
print(f"input stack words: \n{np.array([[idx2word[i.item()] for i in b] for b in input_batch])}")
print(f"output stack words: \n{np.array([[idx2word[i.item()] for i in b] for b in output_bath])}")

Decoded data: 
['hello', 'friends', 'how', 'are', 'you', '<END>', 'the', 'tea', 'is', 'very', 'hot', '<END>', 'my', 'name', 'is', 'Sushovan', '<END>', 'the', 'roads', 'of', 'Kolkata', 'are', 'busy', '<END>', 'it', 'is', 'raining', 'in', 'Mumbai', '<END>', 'the', 'train', 'is', 'late', 'again', '<END>', 'i', 'love', 'eating', 'samosas', 'and', 'drinking', 'tea', '<END>', 'holi', 'is', 'my', 'favorite', 'festival', '<END>', 'diwali', 'brings', 'lights', 'and', 'sweets', '<END>', 'india', 'won', 'the', 'cricket', 'match', '<END>']
input stack words: 
[['eating' 'samosas' 'and' 'drinking' 'tea' '<END>']
 ['<END>' 'diwali' 'brings' 'lights' 'and' 'sweets']
 ['lights' 'and' 'sweets' '<END>' 'india' 'won']
 ['how' 'are' 'you' '<END>' 'the' 'tea']
 ['Kolkata' 'are' 'busy' '<END>' 'it' 'is']
 ['<END>' 'the' 'train' 'is' 'late' 'again']
 ['you' '<END>' 'the' 'tea' 'is' 'very']
 ['raining' 'in' 'Mumbai' '<END>' 'the' 'train']
 ['<END>' 'the' 'train' 'is' 'late' 'again']
 ['favorite' 'festival' '<

## Build the model

In [15]:
import torch.nn as nn
import torch.nn.functional as F
from transformer.transformer_utils import Block


class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.positional_embeddings = nn.Embedding(block_size, embedding_dim)
        self.blocks = nn.Sequential(*[Block(embedding_dim, block_size, n_heads) for _ in range(n_layers)])

        self.ln_f = nn.LayerNorm(embedding_dim)
        self.head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        token_embeddings = self.token_embeddings(idx)
        position_embeddings = self.positional_embeddings(torch.arange(T, device=idx.device))

        x = token_embeddings + position_embeddings
        x = self.blocks(x)
        x = self.ln_f(x)

        logits = self.head(x)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

        return logits, loss

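    def generate(self, idx, max_tokens=10):
        for _ in range(max_tokens):
            # Keep only the last block_size tokens: the positional embedding covers no more.
            idx_cond = idx[:, -block_size:]
            logits, _ = self(idx_cond)
            # Use the logits at the final position to predict the next token.
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            # Sample from the distribution (not argmax), so output can vary between runs.
            next_idx = torch.multinomial(probs, 1)
            idx = torch.cat((idx, next_idx), dim=1)
        return idx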
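
The `Block` class is imported from `transformer.transformer_utils`, whose source is not shown in this notebook. The sketch below is only a guess at what such a block might contain, matching the constructor call `Block(embedding_dim, block_size, n_heads)` used above. The pre-norm layout, the fused `qkv` projection, the 4x feed-forward width, and the GELU activation are all assumptions, not the author's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """Hypothetical stand-in for transformer.transformer_utils.Block."""

    def __init__(self, embedding_dim, block_size, n_heads):
        super().__init__()
        assert embedding_dim % n_heads == 0, "embedding_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(embedding_dim)
        self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim)  # queries, keys, values
        self.proj = nn.Linear(embedding_dim, embedding_dim)
        self.ln2 = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.GELU(),
            nn.Linear(4 * embedding_dim, embedding_dim),
        )
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        head_size = C // self.n_heads
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, head_size) so each head attends independently.
        q, k, v = (t.view(B, T, self.n_heads, head_size).transpose(1, 2) for t in (q, k, v))
        scores = (q @ k.transpose(-2, -1)) / head_size ** 0.5
        # Hide future positions before the softmax.
        scores = scores.masked_fill(~self.mask[:T, :T], float("-inf"))
        attn = F.softmax(scores, dim=-1) @ v
        attn = attn.transpose(1, 2).contiguous().view(B, T, C)
        x = x + self.proj(attn)        # residual connection around attention
        x = x + self.mlp(self.ln2(x))  # residual connection around the feed-forward layer
        return x


block = Block(embedding_dim=32, block_size=6, n_heads=2)
out = block(torch.randn(4, 6, 32))
print(out.shape)
```

With `embedding_dim = 32` and `n_heads = 2`, each head works on 16 dimensions. The block returns a tensor of the same shape as its input, which is what allows `nn.Sequential` to stack `n_layers` of them.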


## Train the model

In [16]:
model = TinyGPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

for step in range(epochs):
    _, xb, yb = get_batch()
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0 or step == epochs - 1:
        print(f"Step: {step}, Loss: {loss.item():.4f}")

Step: 0, Loss: 3.8710
Step: 100, Loss: 1.1697
Step: 200, Loss: 0.2769
Step: 300, Loss: 0.1886
Step: 400, Loss: 0.1711
Step: 500, Loss: 0.2098
Step: 600, Loss: 0.1316
Step: 700, Loss: 0.1889
Step: 800, Loss: 0.1220
Step: 900, Loss: 0.0810
Step: 1000, Loss: 0.0703
Step: 1100, Loss: 0.0854
Step: 1200, Loss: 0.1881
Step: 1300, Loss: 0.0728
Step: 1400, Loss: 0.0763
Step: 1499, Loss: 0.0673
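
One way to read these numbers is to compare them with the loss of a model that guesses uniformly over the 42-word vocabulary, which is ln(42), about 3.74. The sketch below computes that baseline and converts the first and last logged losses to perplexity; the two loss values are copied from the log above.

```python
import math

vocab_size = 42

# Cross-entropy of a uniform prediction: -ln(1 / vocab_size) = ln(vocab_size).
uniform_loss = math.log(vocab_size)
print(f"Loss of a uniform guess: {uniform_loss:.4f}")

# Perplexity = exp(loss): roughly the number of equally likely choices
# the model is hesitating between at each step.
for step, loss in [(0, 3.8710), (1499, 0.0673)]:
    print(f"step {step}: loss {loss:.4f} -> perplexity {math.exp(loss):.2f}")
```

The step 0 loss sits close to the uniform baseline, as expected for an untrained model. The final loss corresponds to a perplexity only slightly above 1, which is consistent with the model having largely memorized this tiny corpus.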


## Execute the model

In [17]:
def execute_model(context):
    word_indexed = [word2idx[word] for word in context.split()]
    context = torch.tensor([word_indexed], dtype=torch.long)
    out = model.generate(context)

    print("\nGenerated Text:\n")
    print(" ".join(idx2word[int(i)] for i in out[0]))

In [18]:
execute_model("hello")


Generated Text:

hello friends how are you <END> the tea is very hot


In [19]:
execute_model("my name")


Generated Text:

my name is Sushovan <END> the roads of Kolkata are busy <END>
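
Because `generate` samples the next token with `torch.multinomial`, the generated text can differ between runs. The sketch below contrasts sampling with greedy decoding in plain Python; the candidate words and their logits are invented for illustration, and `random.Random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next words; the values are made up.
candidates = ["is", "hot", "very"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# Greedy decoding: always take the most probable token (deterministic).
greedy = candidates[probs.index(max(probs))]

# Sampling: draw in proportion to the probabilities, so less likely
# tokens are still chosen some of the time.
rng = random.Random(0)
sampled = rng.choices(candidates, weights=probs, k=1)[0]

print([round(p, 3) for p in probs])
print("greedy:", greedy, "| sampled:", sampled)
```

Greedy decoding would make the notebook's output repeatable, at the cost of always producing the same continuation for a given prompt.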
