## Workbook Overview

This workbook guides you through coding a Transformer-based GPT model from scratch. It is based on Andrej Karpathy's "Zero to Hero" YouTube tutorial titled ["Let's build GPT: from scratch, in code, spelled out"](https://www.youtube.com/watch?v=kCc8FmEb1nY).

### Who Is This For?

- Those familiar with Andrej Karpathy’s "Zero to Hero" series and looking to code GPT from scratch.

### How to Use This Workbook

- The workbook is organized into three main sections: **Coding Instructions**, **Coding Exercises**, and **Code Solutions**.
- **Coding Instructions**: Step-by-step guidance on building the key components of a GPT model.
- **Coding Exercises**: Implement the code as you progress through the instructions.
- **Code Solutions**: Refer to the full code solutions for validation or help if you get stuck.
- Use the Colab Table of Contents for seamless navigation.

### Purpose

- Gain a thorough understanding of how Transformer-based language models like GPT are constructed.





# List of instructions

In [None]:
# Let's Build GPT (Step-by-Step Transformer Language Model)

## Part 1: Setup and Data Preparation

# 1. Imports & Configurations
#    - Import necessary libraries (`torch`, `torch.nn`, `torch.nn.functional`, etc.).
#    - Set device configuration (CPU/GPU).
#    - Set random seed for reproducibility.

# 2. Download Dataset
#    - Download and load the "tinyshakespeare" dataset.
#    - https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#    - Read the text data from the file.

# 3. Vocabulary Creation
#    - Extract unique characters from the dataset.
#    - Determine the vocabulary size.

# 4. Tokenizer
#    - Create mappings for character-to-index (`stoi`) and index-to-character (`itos`).
#    - Implement `encode()` and `decode()` functions for tokenization.

# 5. Train and Test Splits
#    - Convert the entire dataset into token indices.
#    - Split the data into training and validation sets (90/10 split).

# 6. Dataloader
#    - Define a function `get_batch()` to generate batches of data for training and evaluation.
#    - Ensure that each batch contains sequences of fixed block size.

## Part 2: Building the Initial GPT Model

# 7. Model: Embedding Layer and Output Linear Transformation
#    - Implement the GPT class with an embedding layer and a linear layer for output.
#    - Forward pass should handle both token embeddings and linear transformation to logits.

# 8. Generate Function
#    - Implement the `generate()` function to generate text using the trained model.
#    - The function should iterate over a specified number of new tokens, updating the input sequence each time.

## Part 3: Training and Evaluation

# 9. Evaluation Loop
#    - Implement the `estimate_loss()` function to evaluate the model on training and validation data.
#    - Ensure that the model is in evaluation mode during this process.

# 10. Training Loop
#    - Set up the training loop with AdamW optimizer.
#    - Include periodic evaluation using `estimate_loss()` and print training/validation losses.
#    - Generate sample from the model


## Part 4: Enhancing the GPT Model

# 11. Model: Positional Embeddings
#    - Add positional embeddings to the model to capture the order of tokens.
#    - Update the forward pass to combine token embeddings with positional embeddings.

# 12. Model: Single Attention Head
#    - Implement a single attention head within the model to capture token relationships.
#    - Incorporate this attention mechanism into the forward pass.

# 13. Model: Multi-Head Attention
#    - Expand the model to include multiple attention heads.
#    - Implement a projection layer to combine the outputs of the multiple heads.

## Part 5: Building the Transformer Block

# 14. Model: MLP
#    - Add a multi-layer perceptron (MLP) to the model.
#    - The MLP should consist of a projection up, ReLU activation, and a projection down.

# 15. Model: Transformer Block
#    - Combine multi-head attention and MLP into a single Transformer block.
#    - Use this block in the model to stack multiple layers.

## Part 6: Final Enhancements

# 16. Model: Skip Connections, normalization and dropout
#    - Implement skip connections (residual connections) around the attention and MLP layers.
#    - Add layer normalization before applying the skip connections to stabilize training.
#    - Include dropout layers in both the attention and MLP layers to prevent overfitting.

## Part 7: Final Training and Evaluation

# 17. Final Evaluation and Text Generation
#    - Train the final GPT model with the full architecture.
#    - Evaluate the final model on the validation set.
#    - Use the trained model to generate new text samples.

# Coding Exercises

## Part 1: Setup and Data Preparation

### 1. Imports & Configurations

In [None]:
# 1. Imports & Configurations
#    - Import necessary libraries (`torch`, `torch.nn`, `torch.nn.functional`, etc.).
#    - Set device configuration (CPU/GPU).
#    - Set random seed for reproducibility.

# Follow the instructions and code the solution.

### 2. Download Dataset

In [None]:
# 2. Download Dataset
#    - Download and load the "tinyshakespeare" dataset.
#    - https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#    - Read the text data from the file.

# Follow the instructions and code the solution.

### 3. Vocabulary Creation

In [None]:
# 3. Vocabulary Creation
#    - Extract unique characters from the dataset.
#    - Determine the vocabulary size.

# Follow the instructions and code the solution.

### 4. Tokenizer

In [None]:
# 4. Tokenizer
#    - Create mappings for character-to-index (`stoi`) and index-to-character (`itos`).
#    - Implement `encode()` and `decode()` functions for tokenization.

# Follow the instructions and code the solution.

### 5. Train and Test Splits

In [None]:
# 5. Train and Test Splits
#    - Convert the entire dataset into token indices.
#    - Split the data into training and validation sets (90/10 split).

# Follow the instructions and code the solution.

### 6. Dataloader

In [None]:
# 6. Dataloader
#    - Define a function `get_batch()` to generate batches of data for training and evaluation.
#    - Ensure that each batch contains sequences of fixed block size.

# Follow the instructions and code the solution.

## Part 2: Building the Initial GPT Model

### 7. Model: Embedding Layer and Output Linear Transformation

In [None]:
# 7. Model: Embedding Layer and Output Linear Transformation
#    - Implement the GPT class with an embedding layer and a linear layer for output.
#    - Forward pass should handle both token embeddings and linear transformation to logits.

# Follow the instructions and code the solution.

### 8. Generate Function

In [None]:
# 8. Generate Function
#    - Implement the `generate()` function to generate text using the trained model.
#    - The function should iterate over a specified number of new tokens, updating the input sequence each time.

# Follow the instructions and code the solution.

## Part 3: Training and Evaluation

### 9. Evaluation Loop

In [None]:
# 9. Evaluation Loop
#    - Implement the `estimate_loss()` function to evaluate the model on training and validation data.
#    - Ensure that the model is in evaluation mode during this process.

# Follow the instructions and code the solution.

### 10. Training Loop

In [None]:
# 10. Training Loop
#    - Set up the training loop with AdamW optimizer.
#    - Include periodic evaluation using `estimate_loss()` and print training/validation losses.
#    - Generate sample from the model

# Follow the instructions and code the solution.

## Part 4: Enhancing the GPT Model

### 11. Model: Positional Embeddings

In [None]:
# 11. Model: Positional Embeddings
#    - Add positional embeddings to the model to capture the order of tokens.
#    - Update the forward pass to combine token embeddings with positional embeddings.


# Follow the instructions and code the solution.

### 12. Model: Single Attention Head

In [None]:
# 12. Model: Single Attention Head
#    - Implement a single attention head within the model to capture token relationships.
#    - Incorporate this attention mechanism into the forward pass.

# Follow the instructions and code the solution.

### 13. Model: Multi-Head Attention

In [None]:
# 13. Model: Multi-Head Attention
#    - Expand the model to include multiple attention heads.
#    - Implement a projection layer to combine the outputs of the multiple heads.

# Follow the instructions and code the solution.

## Part 5: Building the Transformer Block

### 14. Model: MLP

In [None]:
# 14. Model: MLP
#    - Add a multi-layer perceptron (MLP) to the model.
#    - The MLP should consist of a projection up, ReLU activation, and a projection down.

# Follow the instructions and code the solution.

### 15. Model: Transformer Block

In [None]:
# 15. Model: Transformer Block
#    - Combine multi-head attention and MLP into a single Transformer block.
#    - Use this block in the model to stack multiple layers.

# Follow the instructions and code the solution.

## Part 6: Final Enhancements

### 16: Model: Skip Connections, normalization and dropout

In [None]:
# 16. Model: Skip Connections, normalization and dropout
#    - Implement skip connections (residual connections) around the attention and MLP layers.
#    - Add layer normalization before applying the skip connections to stabilize training.
#    - Include dropout layers in both the attention and MLP layers to prevent overfitting.

# Follow the instructions and code the solution.

## Part 7: Final Training and Evaluation

### 17. Final Evaluation and Text Generation

In [None]:
# 17. Final Evaluation and Text Generation
#    - Train the final GPT model with the full architecture.
#    - Evaluate the final model on the validation set.
#    - Use the trained model to generate new text samples.

# Follow the instructions and code the solution.

# Code Solutions

## Part 1: Setup and Data Preparation

### 1. Imports & Configurations

In [1]:
# 1. Imports & Configurations
#    - Import necessary libraries (`torch`, `torch.nn`, `torch.nn.functional`, etc.).
#    - Set device configuration (CPU/GPU).
#    - Set random seed for reproducibility.

In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

### 2. Download Dataset

In [3]:
# 2. Download Dataset
#    - Download and load the "tinyshakespeare" dataset.
#    - https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#    - Read the text data from the file.

In [4]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-08-16 17:33:57--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-08-16 17:33:58 (50.2 MB/s) - ‘input.txt’ saved [1115394/1115394]



### 3. Vocabulary Creation

In [5]:
# 3. Vocabulary Creation
#    - Extract unique characters from the dataset.
#    - Determine the vocabulary size.

In [6]:
with open('./input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print(chars)

65
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


### 4. Tokenizer

In [7]:
# 4. Tokenizer
#    - Create mappings for character-to-index (`stoi`) and index-to-character (`itos`).
#    - Implement `encode()` and `decode()` functions for tokenization.

In [8]:
itos = {i:ch for i, ch in enumerate(chars)}
stoi = {ch:i for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

print(encode('hello world!'))
print(decode(encode('hello world!')))

[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42, 2]
hello world!


### 5. Train and Test Splits

In [9]:
# 5. Train and Test Splits
#    - Convert the entire dataset into token indices.
#    - Split the data into training and validation sets (90/10 split).

In [10]:
data = torch.tensor(encode(text), dtype=torch.long)

n = int(len(data)*0.9)
train_data = data[:n]
val_data = data[n:]

print(train_data.shape)
print(val_data.shape)

torch.Size([1003854])
torch.Size([111540])


### 6. Dataloader

In [11]:
# 6. Dataloader
#    - Define a function `get_batch()` to generate batches of data for training and evaluation.
#    - Ensure that each batch contains sequences of fixed block size.

In [12]:
batch_size = 4
block_size = 8

def get_batch(split):

    data = train_data if split=='train' else val_data

    ix = torch.randint(len(data)-block_size, (batch_size,))

    x = torch.stack([data[i: i+ block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1: i+ block_size+1] for i in ix], dim=0)

    x = x.to(device)
    y = y.to(device)

    return x, y

print(get_batch('train'))

(tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]]), tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]]))


## Part 2: Building the Initial GPT Model

### 7. Model: Embedding Layer and Output Linear Transformation

In [13]:
# 7. Model: Embedding Layer and Output Linear Transformation
#    - Implement the GPT class with an embedding layer and a linear layer for output.
#    - Forward pass should handle both token embeddings and linear transformation to logits.
# See next cell

### 8. Generate Function

In [14]:
# 8. Generate Function
#    - Implement the `generate()` function to generate text using the trained model.
#    - The function should iterate over a specified number of new tokens, updating the input sequence each time.

n_embed = 32

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        logits = self.lm_head(tok_emb) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

model = GPT()
model = model.to(device)

print(model)

GPT(
  (embed_tokens): Embedding(65, 32)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)


## Part 3: Training and Evaluation

### 9. Evaluation Loop

In [15]:
# 9. Evaluation Loop
#    - Implement the `estimate_loss()` function to evaluate the model on training and validation data.
#    - Ensure that the model is in evaluation mode during this process.

In [16]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out

### 10. Training Loop

In [17]:
# 10. Training Loop
#    - Set up the training loop with AdamW optimizer.
#    - Include periodic evaluation using `estimate_loss()` and print training/validation losses.
#    - Generate sample from the model

In [18]:
eval_iters = 20
eval_interval = 500

learning_rate=1e-3
training_iters=1000


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

 training loss: 4.3456525802612305, eval loss 4.282791614532471
 training loss: 2.9852001667022705, eval loss 3.005040407180786

ZKbt,
LNurernd,u wit d werce s abld be beatWos rdy


## Part 4: Enhancing the GPT Model

### 11. Model: Positional Embeddings

In [19]:
# 11. Model: Positional Embeddings
#    - Add positional embeddings to the model to capture the order of tokens.
#    - Update the forward pass to combine token embeddings with positional embeddings.

In [20]:
n_embed = 32

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------


class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = tok_emb

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)
 training loss: 4.314088821411133, eval loss 4.298123359680176
 training loss: 2.969817638397217, eval loss 2.9305644035339355
 training loss: 2.705784320831299, eval loss 2.6208765506744385
 training loss: 2.6057913303375244, eval loss 2.665055990219116
 training loss: 2.5697872638702393, eval loss 2.577955722808838
 training loss: 2.5728728771209717, eval loss 2.515190362930298

MEathe my myo woucls bed ISIprill-n id co y. teve:


### 12. Model: Single Attention Head

In [21]:
# 12. Model: Single Attention Head
#    - Implement a single attention head within the model to capture token relationships.
#    - Incorporate this attention mechanism into the forward pass.

In [22]:
n_embed = 32
head_size = 32

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out




class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.attn = Head(head_size)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.attn(x)

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (attn): Head(
    (query): Linear(in_features=32, out_features=32, bias=False)
    (key): Linear(in_features=32, out_features=32, bias=False)
    (value): Linear(in_features=32, out_features=32, bias=False)
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

---------------
 training loss: 4.192816734313965, eval loss 4.181145668029785
 training loss: 2.9647483825683594, eval loss 2.9163105487823486
 training loss: 2.6664605140686035, eval loss 2.651362419128418
 training loss: 2.5779342651367188, eval loss 2.711329698562622
 training loss: 2.624051570892334, eval loss 2.5673742294311523
 training loss: 2.556857109069824, eval loss 2.511453866958618

---------------

FEt, my me tha be lapowoo ove:
Mbeastheiin two the

---------------

Total Parameters: 7553
Trainable Parameters: 7553


### 13. Model: Multi-Head Attention

In [23]:
# 13. Model: Multi-Head Attention
#    - Expand the model to include multiple attention heads.
#    - Implement a projection layer to combine the outputs of the multiple heads.

In [24]:
n_embed = 32
head_size = 8
n_head = 4

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for h in range(n_head)])
        self.o_proj = nn.Linear(n_embed, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x




class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.attn = MultiHeadAttention(n_head, n_embed//n_head)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.attn(x)

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (attn): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
    (o_proj): Linear(in_features=32, out_features=32, bias=False)
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

---------------
 training loss: 4.190523624420166, eval loss 4.188727378845215
 training loss: 2.922368288040161, eval loss 2.8527870178222656
 training loss: 2.746164083480835, eval loss 2.686760187149048
 training loss: 2.524226427078247, eval loss 2.669847011566162
 training loss: 2.559569835662842, eval loss 2.566415786743164
 training loss: 2.5005269050598145, eval loss 2.5069375038146973

---------------

H:
LCBET'AS:
Id
R Korchoc?
zCO fin gom mie dot por

---------------

To

## Part 5: Building the Transformer Block

### 14. Model: MLP

In [25]:
# 14. Model: MLP
#    - Add a multi-layer perceptron (MLP) to the model.
#    - The MLP should consist of a projection up, ReLU activation, and a projection down.

In [26]:
n_embed = 32
head_size = 8
n_head = 4

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3

#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for h in range(n_head)])
        self.o_proj = nn.Linear(n_embed, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x

class MLP(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.out_proj = nn.Linear(n_embed, 4 * n_embed)
        self.in_proj = nn.Linear(4 * n_embed, n_embed)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.out_proj(x)
        x = self.act(x)
        x = self.in_proj(x)
        return x

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.attn = MultiHeadAttention(n_head, n_embed//n_head)
        self.mlp = MLP(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.attn(x)
        x = self.mlp(x)

        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (attn): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
    (o_proj): Linear(in_features=32, out_features=32, bias=False)
  )
  (mlp): MLP(
    (out_proj): Linear(in_features=32, out_features=128, bias=True)
    (in_proj): Linear(in_features=128, out_features=32, bias=True)
    (act): ReLU()
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

---------------
 training loss: 4.18421745300293, eval loss 4.177970886230469
 training loss: 2.92226505279541, eval loss 2.875664472579956
 training loss: 2.658184289932251, eval loss 2.57914400100708
 training loss: 2.5772807598114014, eval loss 2.631568431854248
 training loss: 2.468953847885132, eval loss 2.41

### 15. Model: Transformer Block

In [27]:
# 15. Model: Transformer Block
#    - Combine multi-head attention and MLP into a single Transformer block.
#    - Use this block in the model to stack multiple layers.

In [28]:
n_embed = 32
head_size = 8
n_head = 4
n_layers = 2

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3


#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.o_proj = nn.Linear(n_head * head_size, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x

class MLP(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.up_proj = nn.Linear(n_embed, 4 * n_embed)
        self.down_proj = nn.Linear(4 * n_embed, n_embed)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.up_proj(x)
        x = self.act(x)
        x = self.down_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.attn = MultiHeadAttention(n_head, head_size)
        self.mlp = MLP(n_embed)

    def forward(self, x):
        x = self.attn(x)
        x = self.mlp(x)
        return x

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.layers = nn.Sequential(*[Block(n_embed, n_head) for _ in range(n_layers) ])
        # self.layers = nn.Sequential(
        #     Block(n_embed, n_head=4),
        #     Block(n_embed, n_head=4)
        # )
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.layers(x)
        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (layers): Sequential(
    (0): Block(
      (attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
        (o_proj): Linear(in_features=32, out_features=32, bias=False)
      )
      (mlp): MLP(
        (up_proj): Linear(in_features=32, out_features=128, bias=True)
        (down_proj): Linear(in_features=128, out_features=32, bias=True)
        (act): ReLU()
      )
    )
    (1): Block(
      (attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear

## Part 6: Final Enhancements

### 16. Model: Skip connections, normalization and dropout

In [29]:
# 16. Model: Skip Connections
#    - Implement skip connections (residual connections) around the attention and MLP layers.
#    - Add layer normalization before applying the skip connections to stabilize training.
#    - Include dropout layers in both the attention and MLP layers to prevent overfitting.

In [30]:
n_embed = 32
head_size = 8
n_head = 4
n_layers = 2

dropout=0.1

eval_iters = 20
eval_interval = 500
training_iters=3000
learning_rate=1e-3


#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)


    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.o_proj = nn.Linear(n_head * head_size, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x

class MLP(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.up_proj = nn.Linear(n_embed, 4 * n_embed)
        self.down_proj = nn.Linear(4 * n_embed, n_embed)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.up_proj(x)
        x = self.act(x)
        x = self.down_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.attn = MultiHeadAttention(n_head, head_size)
        self.mlp = MLP(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.dropout(self.attn(self.ln1(x)))
        x = x + self.dropout(self.mlp(self.ln2(x)))
        return x

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.layers = nn.Sequential(*[Block(n_embed, n_head) for _ in range(n_layers) ])
        self.lm_head = nn.Linear(n_embed, vocab_size)
        self.ln_f = nn.LayerNorm(n_embed)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.layers(x)
        x = self.ln_f(x)
        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

print(model)

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#----------------------------------------------

print('\n---------------')

context = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(context)[0].tolist()))

print('\n---------------')

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

GPT(
  (embed_tokens): Embedding(65, 32)
  (pos_embedding_table): Embedding(8, 32)
  (layers): Sequential(
    (0): Block(
      (attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (o_proj): Linear(in_features=32, out_features=32, bias=False)
      )
      (mlp): MLP(
        (up_proj): Linear(in_features=32, out_features=128, bias=True)
        (down_proj): Linear(in_features=128, out_features=32, bias=True)
        (act): ReLU()
      )
      (ln1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (1): Block(
      (attn): MultiHeadAtte

## Part 7: Final Training and Evaluation

### 17. Final Evaluation and Text Generation

In [31]:
# 17. Final Evaluation and Text Generation
#    - Train the final GPT model with the full architecture.
#    - Evaluate the final model on the validation set.
#    - Use the trained model to generate new text samples.

In [34]:
n_embed = 32
head_size = 8
n_head = 4
n_layers = 4

dropout=0.1

eval_iters = 20
eval_interval = 1000
training_iters=10000
learning_rate=1e-3


#----------------------------------------------

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)


    def forward(self, x):

        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        wei = q @ k.transpose(-2, -1)
        wei = wei * C**-0.5
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        out = wei @ v

        return out

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.o_proj = nn.Linear(n_head * head_size, n_embed, bias=False)

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        x = self.o_proj(x)
        return x

class MLP(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.up_proj = nn.Linear(n_embed, 4 * n_embed)
        self.down_proj = nn.Linear(4 * n_embed, n_embed)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.up_proj(x)
        x = self.act(x)
        x = self.down_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.attn = MultiHeadAttention(n_head, head_size)
        self.mlp = MLP(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.dropout(self.attn(self.ln1(x)))
        x = x + self.dropout(self.mlp(self.ln2(x)))
        return x

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, n_embed)
        self.pos_embedding_table = nn.Embedding(block_size, n_embed)
        self.layers = nn.Sequential(*[Block(n_embed, n_head) for _ in range(n_layers) ])
        self.lm_head = nn.Linear(n_embed, vocab_size)
        self.ln_f = nn.LayerNorm(n_embed)

    def forward(self, x, targets=None): # x shape (B, T)
        B, T = x.shape

        tok_emb = self.embed_tokens(x) # shape (B, T, n_embed)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb;
        x = self.layers(x)
        x = self.ln_f(x)
        logits = self.lm_head(x) # shape (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B*T)

            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=50):
        # idx shape (B, T)

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :] # shape(B, T, vocab_size)
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)

        return idx

#----------------------------------------------

model = GPT()
model = model.to(device)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

print('\n---------------')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _ , loss = model(x, y)
            losses[k] = loss
        out[split] = losses.mean()
    model.train()
    return out


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_iters):

    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f" training loss: {losses['train']}, eval loss {losses['val']}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



Total Parameters: 54849
Trainable Parameters: 54849

---------------
 training loss: 4.370202541351318, eval loss 4.357987880706787
 training loss: 2.5559301376342773, eval loss 2.5146987438201904
 training loss: 2.4333717823028564, eval loss 2.3698134422302246
 training loss: 2.2964673042297363, eval loss 2.388180732727051
 training loss: 2.3420510292053223, eval loss 2.3240602016448975
 training loss: 2.209441900253296, eval loss 2.3096933364868164
 training loss: 2.253512144088745, eval loss 2.3528244495391846
 training loss: 2.2819905281066895, eval loss 2.295457124710083
 training loss: 2.216463327407837, eval loss 2.1905064582824707
 training loss: 2.2447941303253174, eval loss 2.06135892868042


In [37]:
eval_iters = 1000
final_losses = estimate_loss()
print(f"\nFinal training loss: {final_losses['train']}")
print(f"Final validation loss: {final_losses['val']}")


Final training loss: 2.1516566276550293
Final validation loss: 2.198732614517212


In [35]:
# Generate some text to see the model in action
context = torch.tensor([[0]], dtype=torch.long, device=device)
generated_text = decode(model.generate(context, max_new_tokens=5000)[0].tolist())

print("\nGenerated Text:")
print(generated_text)


Generated Text:

What; my nos to me for toes silg mesar, lord thou b'd 'ceece,
And and berew'sermars of chiss us in hom thant bles mome of ipare!

Endosir,
Frelt his loideng of whall I wedch tus tohere. Vareing, coeadem, your Of sodsoankes,
Wom it:
Way my is noebe?

AUfor Bastaren of to mine
Fe dratelf ofer slaight of whoin The hingh:
Tot whenk mingr had he et it, if hingaiunst, es somenpe; boren, I he lat o' wits thim now ith fe the I have saygrean, h yere thou at will has furing herds;
Ting; He hart notu
Whirs whill etheeteont.
 and he pupckel artrow to suers, yey
ANd I y OUMARDEN:
Youth and spoange I mere sis on on heeavaod,
That curnestingen ? Wothem I, you marcoobut the a ste hen so whall but; gode;
Roventinlen supee laeepe Leveacul Ask walut Vens upard:
Wik, hion And't hades noun:
y her le, it that in your Yourece, on my would a demenast;
Senou here wilth trage, jestic? hay se me for res: ain, ucoO:
Lor fat
ith have him; town metay gor, dord; beartis liawit,
I comy, fut thee:
Au