## Workbook Overview

This workbook guides you through coding a Transformer-based GPT model from scratch. It is based on Andrej Karpathy's "Zero to Hero" YouTube tutorial titled ["Let's build GPT: from scratch, in code, spelled out"](https://www.youtube.com/watch?v=kCc8FmEb1nY).

### Who Is This For?

- Those familiar with Andrej Karpathy’s "Zero to Hero" series and looking to code GPT from scratch.

### How to Use This Workbook

1. **Create a Copy**: To begin, you'll need your own copy of this Colab workbook.
   - In Colab, go to **File** > **Save a copy in Drive**.
   - Rename the file if you'd like.
   - You are now ready to start coding.

2. **Workbook Structure**:
   - The workbook is divided into three sections:
     - **Coding Instructions**: Detailed steps for building key GPT model components.
     - **Coding Exercises**: Implement what you've learned as you follow along.
     - **Code Solutions**: Check your work or get unstuck by reviewing complete solutions.
   - Use the Table of Contents on the left sidebar for easy navigation.



### Purpose

- This workbook aims to help you understand the construction of Transformer-based language models like GPT, from the ground up.





# Coding Instructions

In [None]:

# Let's Build GPT (Step-by-Step Transformer Language Model)

## Part 1: Setup and Data Preparation

# 1. Imports & Configurations
#    - Import necessary libraries (`torch`, `torch.nn`, `torch.nn.functional`, etc.).
#    - Set device configuration (CPU/GPU).
#    - Set random seed for reproducibility.

# 2. Download Dataset
#    - Download and load the "tinyshakespeare" dataset.
#    - https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#    - Read the text data from the file.

# 3. Vocabulary Creation
#    - Extract unique characters from the dataset.
#    - Determine the vocabulary size.

# 4. Tokenizer
#    - Create mappings for character-to-index (`stoi`) and index-to-character (`itos`).
#    - Implement `encode()` and `decode()` functions for tokenization.

# 5. Train and Test Splits
#    - Convert the entire dataset into token indices.
#    - Split the data into training and validation sets (90/10 split).

# 6. Dataloader
#    - Define a function `get_batch()` to generate batches of data for training and evaluation.
#    - Ensure that each batch contains sequences of fixed block size.

## Part 2: Building the Initial GPT Model

# 7. Model: Embedding Layer, Generate Function, Training Loop
#    - Create a dataclass `GPTConfig` to store model hyperparameters.
#    - Implement the GPT class with an embedding layer, producing logits for output.
#    - Implement the `generate()` function to generate text using the trained model.
#    - Instantiate a model.
#    - Print the number of trainable parameters.
#    - Pass one minibatch through the model.
#    - Generate a sample from the model (output should be garbled/random initially).

# 8. Model: Training Loop
#    - Set up the training loop with AdamW optimizer.
#    - block_size: 8, batch_size: 32, training_steps: 3000, learning_rate: 1e-2
#    - Generate a sample from the model.
#    - The output should show more structure after training.

# 9. Evaluation Loop
#    - Implement the `estimate_loss()` function to evaluate the model on training and validation data.
#    - Ensure that the model is in evaluation mode during this process.
#    - Set up the training loop as before.
#    - Include periodic evaluation using `estimate_loss()` and print 10 training/validation losses.
#    - Generate a sample from the model.

## Part 3: Enhancing the GPT Model

# 11. Model: Positional Embeddings
#    - Add positional embeddings to the model.
#    - Train, evaluate and generate a sample from the model.

# 12. Model: Single Attention Head
#    - Implement a single attention head.
#    - Train, evaluate and generate a sample from the model.

# 13. Model: Multi-Head Attention
#    - Expand the model to include multiple attention heads.
#    - Implement a projection layer to combine the outputs of the multiple heads.
#    - Train, evaluate and generate a sample from the model.

## Part 4: Building the Transformer Block

# 14. Model: MLP
#    - Add a multi-layer perceptron (MLP) to the model.
#    - The MLP should consist of a projection up, ReLU activation, and a projection down.

# 15. Model: Transformer Block
#    - Combine multi-head attention and MLP into a single Transformer block.

## Part 5: Final Enhancements

# 16. Model: Skip Connections, Normalization and Dropout
#    - Implement skip connections (residual connections) around the attention and MLP layers.
#    - Add layer normalization before applying the skip connections to stabilize training.
#    - Include dropout layers in both the attention and MLP layers to prevent overfitting.
#    - Implement two or more Blocks.

# 17. Model: MultiHeadAttention [Alternate Implementation] (Optional)
#    - Implement MultiHeadAttention by utilizing a single class.
#    - Refer to the following resource for guidance:
#      https://github.com/rasbt/LLMs-from-scratch/tree/main/ch03/02_bonus_efficient-multihead-attention

## Part 6: Final Training and Evaluation

# 18. Final Evaluation and Text Generation
#    - Train the final GPT model with the complete architecture on a GPU for faster training.
#    - Model hyperparameters:
#        - block_size: 256
#        - n_embd: 128
#        - n_head: 6
#        - n_layer: 6
#        - head_size: 16
#    - Training hyperparameters:
#        - batch_size: 128
#        - max_iters: 5000
#        - learning_rate: 1e-3
#        - eval_interval: 500
#        - eval_iters: 100
#    - Evaluate the final model on the validation set at regular intervals.
#    - Use the trained model to generate new text samples and assess its performance.

# Coding Exercises

## Part 1: Setup and Data Preparation

### 1. Imports & Configurations

In [None]:
# 1. Imports & Configurations
#    - Import necessary libraries (`torch`, `torch.nn`, `torch.nn.functional`, etc.).
#    - Set device configuration (CPU/GPU).
#    - Set random seed for reproducibility.

# Follow the instructions and code the solution.

device: cpu


### 2. Download Dataset

In [None]:
# 2. Download Dataset
#    - Download and load the "tinyshakespeare" dataset.
#    - https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#    - Read the text data from the file.

# Follow the instructions and code the solution.

--2024-08-23 12:18:48--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-08-23 12:18:48 (16.5 MB/s) - ‘input.txt’ saved [1115394/1115394]



### 3. Vocabulary Creation

In [None]:
# 3. Vocabulary Creation
#    - Extract unique characters from the dataset.
#    - Determine the vocabulary size.

# Follow the instructions and code the solution.

### 4. Tokenizer

In [None]:
# 4. Tokenizer
#    - Create mappings for character-to-index (`stoi`) and index-to-character (`itos`).
#    - Implement `encode()` and `decode()` functions for tokenization.

# Follow the instructions and code the solution.

'hello world!'

### 5. Train and Test Splits

In [None]:
# 5. Train and Test Splits
#    - Convert the entire dataset into token indices.
#    - Split the data into training and validation sets (90/10 split).

# Follow the instructions and code the solution.

1003854 111540
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])
tensor([12,  0,  0, 19, 30, 17, 25, 21, 27, 10])


### 6. Dataloader

In [None]:
# 6. Dataloader
#    - Define a function `get_batch()` to generate batches of data for training and evaluation.
#    - Ensure that each batch contains sequences of fixed block size.

# Follow the instructions and code the solution.

(tensor([[24, 43, 58,  5, 57,  1, 46, 43],
         [44, 53, 56,  1, 58, 46, 39, 58],
         [52, 58,  1, 58, 46, 39, 58,  1],
         [25, 17, 27, 10,  0, 21,  1, 54]]),
 tensor([[43, 58,  5, 57,  1, 46, 43, 39],
         [53, 56,  1, 58, 46, 39, 58,  1],
         [58,  1, 58, 46, 39, 58,  1, 46],
         [17, 27, 10,  0, 21,  1, 54, 39]]))

## Part 2: Building the Initial GPT Model

### 7. Model: Embedding Layer, Generate Function, Training Loop

In [None]:
# Code Consolidation
# - Starting from this cell, we'll include all the code from 'Part 1: Setup and Data Preparation'
#   at the beginning of each subsequent code cell. This approach ensures that we have a complete
#   and up-to-date version of the entire codebase as we incrementally build the Transformer model.
# - In addition to the data preparation code, we'll also include the model hyperparameters
#   and training configurations in each cell.
# - Going forward, always copy the complete code from the previous cell into the next one,
#   and then add new features or enhancements. This method keeps the development process clear,
#   and makes it easier to track progress and make modifications.
# - This is the first cell where we begin consolidating code, so ensure that you bring over all
#   necessary components from the previous steps as we continue to build on the Transformer.

# 7. Model: Embedding Layer, Generate Function, Training Loop
#    - Create a dataclass `GPTConfig` to store model hyperparameters.
#    - Implement the GPT class with an embedding layer, producing logits for output.
#    - Implement the `generate()` function to generate text using the trained model.
#    - Instantiate a model.
#    - Print the number of trainable parameters.
#    - Pass one minibatch through the model.
#    - Generate a sample from the model (output should be garbled/random initially).

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 4225
Trainable Parameters: 4225

P-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3!dcb


### 8. Model: Training Loop

In [None]:
# 8. Model: Training Loop
#    - Set up the training loop with AdamW optimizer.
#    - block_size: 8, batch_size: 32, training_steps: 3000, learning_rate: 1e-2
#    - Generate a sample from the model.
#    - The output should show more structure after training.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 4225
Trainable Parameters: 4225
loss: 2.5201265811920166

thest, thavor 'spe-d th.

O d g oow.
Gosul d lllarclldd Bush auach nd t hethind he s wntatesi: slou 


### 9. Evaluation Loop

In [None]:
# 9. Evaluation Loop
#    - Implement the `estimate_loss()` function to evaluate the model on training and validation data.
#    - Ensure that the model is in evaluation mode during this process.
#    - Set up the training loop as before.
#    - Include periodic evaluation using `estimate_loss()` and print 10 training/validation losses.
#    - Generate a sample from the model.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 4225
Trainable Parameters: 4225
Training loss: 4.733225345611572, Validation loss: 4.725981712341309
Training loss: 2.7876458168029785, Validation loss: 2.815702438354492
Training loss: 2.5510098934173584, Validation loss: 2.5644922256469727
Training loss: 2.491964340209961, Validation loss: 2.507617950439453
Training loss: 2.4817771911621094, Validation loss: 2.51529860496521
Training loss: 2.471799850463867, Validation loss: 2.4981491565704346
Training loss: 2.487903594970703, Validation loss: 2.495173454284668
Training loss: 2.4741978645324707, Validation loss: 2.504298210144043
Training loss: 2.4576830863952637, Validation loss: 2.494094133377075
Training loss: 2.448154926300049, Validation loss: 2.496605157852173

Antat p OLLal kengapGHAnd mainateal LO dourd Rimou t t seise:
PWhomo.
Torilik f s wo pe kear thinewie; y, ndunk smeaithe osmullly
Titouinch dgectonc eafithoutitay
INomp hyetherour:
ME:

Cors pres'se g


## Part 3: Enhancing the GPT Model

### 11. Model: Positional Embeddings

In [None]:
# 11. Model: Positional Embeddings
#    - Add positional embeddings to the model.
#    - Train, evaluate and generate a sample from the model.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 4481
Trainable Parameters: 4481
Training loss: 4.474524974822998, Validation loss: 4.480035305023193
Training loss: 2.522061347961426, Validation loss: 2.5282061100006104
Training loss: 2.5352325439453125, Validation loss: 2.5343899726867676
Training loss: 2.4970154762268066, Validation loss: 2.5091142654418945
Training loss: 2.4937195777893066, Validation loss: 2.5417206287384033
Training loss: 2.511420965194702, Validation loss: 2.5166428089141846
Training loss: 2.5183358192443848, Validation loss: 2.521740198135376
Training loss: 2.511129140853882, Validation loss: 2.526968002319336
Training loss: 2.483370304107666, Validation loss: 2.5209450721740723
Training loss: 2.472799062728882, Validation loss: 2.5132687091827393

Wim anshlar, d, kn:
Who howg.
Theabjor ge?
ANonyof thenthard;
zerig he,
fondr byo maty ind:
NG tt t chy me on Whorssur on
L: ce liceyo cen,
Sst rend, seremy he.
' ar hofr our br.
ANEOXFiave
BEnour CH:


### 12. Model: Single Attention Head

In [None]:
# 12. Model: Single Attention Head
#    - Implement a single attention head.
#    - Train, evaluate and generate a sample from the model.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 7553
Trainable Parameters: 7553
Training loss: 4.196639537811279, Validation loss: 4.200407981872559
Training loss: 2.4888217449188232, Validation loss: 2.533363103866577
Training loss: 2.463995933532715, Validation loss: 2.487450361251831
Training loss: 2.433098077774048, Validation loss: 2.476203203201294
Training loss: 2.4688870906829834, Validation loss: 2.4878599643707275
Training loss: 2.41792631149292, Validation loss: 2.457453727722168
Training loss: 2.422886610031128, Validation loss: 2.4499640464782715
Training loss: 2.4246163368225098, Validation loss: 2.436007499694824
Training loss: 2.389042615890503, Validation loss: 2.4294393062591553
Training loss: 2.425231695175171, Validation loss: 2.405696153640747


Wha thanceith belo ISIZETe can
CALEROGSOG:
The! highs a ckitr, an isprinen kst I bast anto le wilt conto;
Waf doreay rssand sag cyor thapcols ar re kimer hererd hadi heealer ulcove.

The yst;
G ESS:



### 13. Model: Multi-Head Attention

In [None]:
# 13. Model: Multi-Head Attention
#    - Expand the model to include multiple attention heads.
#    - Implement a projection layer to combine the outputs of the multiple heads.
#    - Train, evaluate and generate a sample from the model.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 8609
Trainable Parameters: 8609
Training loss: 4.203001022338867, Validation loss: 4.203583240509033
Training loss: 2.423476219177246, Validation loss: 2.4322385787963867
Training loss: 2.3412065505981445, Validation loss: 2.3748598098754883
Training loss: 2.332437038421631, Validation loss: 2.366330862045288
Training loss: 2.2979369163513184, Validation loss: 2.339280605316162
Training loss: 2.2743730545043945, Validation loss: 2.3021461963653564
Training loss: 2.2756242752075195, Validation loss: 2.305638313293457
Training loss: 2.258408546447754, Validation loss: 2.2880289554595947
Training loss: 2.218405246734619, Validation loss: 2.2747461795806885
Training loss: 2.219954490661621, Validation loss: 2.2759642601013184

Thing gock,
As dry orgoord
YICORIA:
And prit, eea If sacdek? 
mact all prer nold?
What, chat?

Whill
Onll,
Then;
To grarn bew; an as to milesen we par dights womy
Whal, sth,
Drens, Efage, whatt yane w


## Part 4: Building the Transformer Block

### 14. Model: MLP

In [None]:
# 14. Model: MLP
#    - Add a multi-layer perceptron (MLP) to the model.
#    - The MLP should consist of a projection up, ReLU activation, and a projection down.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 16961
Trainable Parameters: 16961
Training loss: 4.153048992156982, Validation loss: 4.156349182128906
Training loss: 2.447960615158081, Validation loss: 2.4581236839294434
Training loss: 2.33398699760437, Validation loss: 2.3294832706451416
Training loss: 2.2790815830230713, Validation loss: 2.3204283714294434
Training loss: 2.2687125205993652, Validation loss: 2.312082290649414
Training loss: 2.2480599880218506, Validation loss: 2.2877395153045654
Training loss: 2.2398452758789062, Validation loss: 2.2740907669067383
Training loss: 2.216015577316284, Validation loss: 2.2861952781677246
Training loss: 2.192251205444336, Validation loss: 2.265817642211914
Training loss: 2.1534619331359863, Validation loss: 2.255052328109741

You to goodamby earsabluth say an heak no I torlans nothen ack, witin owouir to mee
To for you an in there deart, vour tormesten as Row
Must caill sher't shiciveme? dold gues,
Histle,
Ray nosur gaid t


### 15. Model: Transformer Block

In [None]:
# 15. Model: Transformer Block
#    - Combine multi-head attention and MLP into a single Transformer block.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 16961
Trainable Parameters: 16961
Training loss: 4.153048992156982, Validation loss: 4.156349182128906
Training loss: 2.447960615158081, Validation loss: 2.4581236839294434
Training loss: 2.33398699760437, Validation loss: 2.3294832706451416
Training loss: 2.2790815830230713, Validation loss: 2.3204283714294434
Training loss: 2.2687125205993652, Validation loss: 2.312082290649414
Training loss: 2.2480599880218506, Validation loss: 2.2877395153045654
Training loss: 2.2398452758789062, Validation loss: 2.2740907669067383
Training loss: 2.216015577316284, Validation loss: 2.2861952781677246
Training loss: 2.192251205444336, Validation loss: 2.265817642211914
Training loss: 2.1534619331359863, Validation loss: 2.255052328109741

You to goodamby earsabluth say an heak no I torlans nothen ack, witin owouir to mee
To for you an in there deart, vour tormesten as Row
Must caill sher't shiciveme? dold gues,
Histle,
Ray nosur gaid t
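
The parameter totals printed in steps 11 to 15 can be reproduced by hand as a check on your own implementation. The sketch below is one decomposition that matches them. The layer shapes are assumptions inferred from the printed totals (an embedding width of 32, key/query/value projections without bias, an MLP that expands to four times the embedding width), and `linear` is a hypothetical helper, so treat it as a sanity check, not a description of the reference solution.

```python
# Values printed or set elsewhere in this workbook.
vocab_size, block_size, n_embd = 65, 8, 32

def linear(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer (hypothetical helper)."""
    return n_in * n_out + (n_out if bias else 0)

tok_emb = vocab_size * n_embd          # token embedding table
pos_emb = block_size * n_embd          # positional embedding table
lm_head = linear(n_embd, vocab_size)   # final projection to logits

step_11 = tok_emb + pos_emb + lm_head
# Assumption: key, query and value projections without bias, total head width n_embd.
step_12 = step_11 + 3 * linear(n_embd, n_embd, bias=False)
# Assumption: output projection n_embd -> n_embd with bias.
step_13 = step_12 + linear(n_embd, n_embd)
# Assumption: MLP expands to 4 * n_embd and projects back, both layers with bias.
step_14 = step_13 + linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)

print(step_11, step_12, step_13, step_14)  # 4481 7553 8609 16961
```

Step 15 prints the same total as step 14, which is consistent with a Transformer block that only regroups existing layers without adding parameters.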


## Part 5: Final Enhancements

### 16. Model: Skip Connections, Normalization and Dropout

In [None]:
# 16. Model: Skip Connections, Normalization and Dropout
#    - Implement skip connections (residual connections) around the attention and MLP layers.
#    - Add layer normalization before applying the skip connections to stabilize training.
#    - Include dropout layers in both the attention and MLP layers to prevent overfitting.
#    - Implement two or more Blocks.

# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 29761
Trainable Parameters: 29761
Training loss: 4.3595194816589355, Validation loss: 4.3730082511901855
Training loss: 2.3575327396392822, Validation loss: 2.383897304534912
Training loss: 2.2578649520874023, Validation loss: 2.2878003120422363
Training loss: 2.2061879634857178, Validation loss: 2.1913561820983887
Training loss: 2.166557550430298, Validation loss: 2.198211193084717
Training loss: 2.1467902660369873, Validation loss: 2.212841272354126
Training loss: 2.1245851516723633, Validation loss: 2.179342031478882
Training loss: 2.102412223815918, Validation loss: 2.154122829437256
Training loss: 2.104166269302368, Validation loss: 2.1471915245056152
Training loss: 2.071061134338379, Validation loss: 2.143803119659424

What thir, i Eswet pored
Wend paserteld, thour in mearne 'Tur.
What, curperfed,
And I by good hron,
So--fiar it stake maull how, Heram welf ouse be farrall ler hem,
O'll, ime Vawadauche this him no sw


### 17. Model: MultiHeadAttention [Alternate Implementation] (Optional)

In [None]:
# 17. Model: MultiHeadAttention [Alternate Implementation] (Optional)
#    - Implement MultiHeadAttention by utilizing a single class.
#    - Refer to the following resource for guidance:
#      https://github.com/rasbt/LLMs-from-scratch/tree/main/ch03/02_bonus_efficient-multihead-attention


# Follow the instructions and code the solution.

vocab size: 65
Total Parameters: 29761
Trainable Parameters: 29761
Training loss: 4.336919784545898, Validation loss: 4.352819919586182
Training loss: 2.3357858657836914, Validation loss: 2.3310577869415283
Training loss: 2.254905939102173, Validation loss: 2.2979040145874023
Training loss: 2.1935226917266846, Validation loss: 2.2520251274108887
Training loss: 2.171903610229492, Validation loss: 2.208240270614624
Training loss: 2.1279537677764893, Validation loss: 2.1669797897338867
Training loss: 2.1280019283294678, Validation loss: 2.155714511871338
Training loss: 2.0721373558044434, Validation loss: 2.1721951961517334
Training loss: 2.0832924842834473, Validation loss: 2.1373047828674316
Training loss: 2.0340380668640137, Validation loss: 2.1342978477478027

Fith andy mured'd Of ther,, wtuh, do deeadect? Wellove knablyw a sy seed.
Your sumate dod peay your as Julaet, I
To mece of all Fon con houldst; mhy his quend thour he wrowd hawell ight go maity your 


## Part 6: Final Training and Evaluation

### 18. Final Evaluation and Text Generation

In [None]:
# 18. Final Evaluation and Text Generation
#    - Train the final GPT model with the complete architecture on a GPU for faster training.
#    - Model hyperparameters:
#        - block_size: 256
#        - n_embd: 128
#        - n_head: 6
#        - n_layer: 6
#        - head_size: 16
#    - Training hyperparameters:
#        - batch_size: 128
#        - max_iters: 5000
#        - learning_rate: 1e-3
#        - eval_interval: 500
#        - eval_iters: 100
#    - Evaluate the final model on the validation set at regular intervals.
#    - Use the trained model to generate new text samples and assess its performance.

# Follow the instructions and code the solution.

# Code Solutions

## Part 1: Setup and Data Preparation

### 1. Imports & Configurations

In [None]:
# 1. Imports & Configurations
#    - Import necessary libraries (`torch`, `torch.nn`, `torch.nn.functional`, etc.).
#    - Set device configuration (CPU/GPU).
#    - Set random seed for reproducibility.

import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

print(f"device: {device}")

device: cuda


### 2. Download Dataset

In [None]:
# 2. Download Dataset
#    - Download and load the "tinyshakespeare" dataset.
#    - https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#    - Read the text data from the file.

! wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()


--2024-08-23 13:48:01--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.3’


2024-08-23 13:48:02 (5.89 MB/s) - ‘input.txt.3’ saved [1115394/1115394]



### 3. Vocabulary Creation

In [None]:
# 3. Vocabulary Creation
#    - Extract unique characters from the dataset.
#    - Determine the vocabulary size.

chars = sorted(list(set(text)))
vocab_size = len(chars)

### 4. Tokenizer

In [None]:
# 4. Tokenizer
#    - Create mappings for character-to-index (`stoi`) and index-to-character (`itos`).
#    - Implement `encode()` and `decode()` functions for tokenization.

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

decode(encode('hello world!'))

'hello world!'
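
A round-trip test on a small string is a quick way to sanity-check a character-level tokenizer. The sketch below uses a toy corpus (an assumption, standing in for the Shakespeare text) so that it runs without `input.txt`. The `toy_` names are hypothetical and mirror `stoi`, `itos`, `encode` and `decode` above.

```python
# Toy corpus standing in for the dataset (assumption: any string works here).
toy_text = "hello world!"

# Same construction as above: the sorted unique characters define the vocabulary.
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer index."""
    return [toy_stoi[ch] for ch in s]

def toy_decode(ids):
    """Map integer indices back to a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(ids)               # [4, 3, 5, 5, 6]
print(toy_decode(ids))   # hello
```

A character that never appears in the corpus has no entry in the mapping, so encoding it raises a `KeyError`. With this design the vocabulary has to be built from the full dataset before anything is encoded.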

### 5. Train and Test Splits

In [None]:
# 5. Train and Test Splits
#    - Convert the entire dataset into token indices.
#    - Split the data into training and validation sets (90/10 split).

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))
print(train_data[:10])
print(val_data[:10])

1003854 111540
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])
tensor([12,  0,  0, 19, 30, 17, 25, 21, 27, 10])
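
The two lengths printed above follow directly from the 90/10 rule. As a small worked check, assuming the dataset length of 1115394 reported in the download log:

```python
total_tokens = 1115394             # dataset length, taken from the download log above
n_train = int(0.9 * total_tokens)  # int() truncates, so the training part is rounded down
n_val = total_tokens - n_train     # the validation part receives the remainder

print(n_train, n_val)  # 1003854 111540
```

Because `int()` truncates, the two parts always add up to the full dataset and no token is dropped at the boundary.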


### 6. Dataloader

In [None]:
# 6. Dataloader
#    - Define a function `get_batch()` to generate batches of data for training and evaluation.
#    - Ensure that each batch contains sequences of fixed block size.

block_size = 8
batch_size = 4

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

x, y = get_batch('train')
x, y

(tensor([[24, 43, 58,  5, 57,  1, 46, 43],
         [44, 53, 56,  1, 58, 46, 39, 58],
         [52, 58,  1, 58, 46, 39, 58,  1],
         [25, 17, 27, 10,  0, 21,  1, 54]], device='cuda:0'),
 tensor([[43, 58,  5, 57,  1, 46, 43, 39],
         [53, 56,  1, 58, 46, 39, 58,  1],
         [58,  1, 58, 46, 39, 58,  1, 46],
         [17, 27, 10,  0, 21,  1, 54, 39]], device='cuda:0'))
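
In the output above, every row of `y` is the matching row of `x` shifted one position to the left, so that position `t` of `y` holds the token that follows position `t` of `x`. The sketch below reproduces that relationship with plain Python lists instead of tensors. The `toy_` names and the integer stream are assumptions made for illustration.

```python
import random

random.seed(0)  # arbitrary seed, only to make this sketch repeatable

toy_data = list(range(100, 120))  # stand-in token stream (assumption)
toy_block_size = 8
toy_batch_size = 4

def toy_get_batch(data, block_size, batch_size):
    # Valid start offsets leave room for the target window, which ends one step later.
    starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    xs = [data[i:i + block_size] for i in starts]
    ys = [data[i + 1:i + block_size + 1] for i in starts]
    return xs, ys

xs, ys = toy_get_batch(toy_data, toy_block_size, toy_batch_size)
for x_row, y_row in zip(xs, ys):
    # Each target row equals the input row shifted left by one position.
    assert x_row[1:] == y_row[:-1]
    print(x_row, "->", y_row)
```

Like `torch.randint`, `random.randrange` excludes its upper bound, so the largest start offset still leaves `block_size + 1` tokens for the input and target windows.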

## Part 2: Building the Initial GPT Model

### 7. Model: Embedding Layer, Generate Function, Training Loop

In [None]:
# Code Consolidation
# - Starting from this cell, we'll include all the code from 'Part 1: Setup and Data Preparation'
#   at the beginning of each subsequent code cell. This approach ensures that we have a complete
#   and up-to-date version of the entire codebase as we incrementally build the Transformer model.
# - In addition to the data preparation code, we'll also include the model hyperparameters
#   and training configurations in each cell.
# - Going forward, always copy the complete code from the previous cell into the next one,
#   and then add new features or enhancements. This method keeps the development process clear,
#   and makes it easier to track progress and make modifications.
# - This is the first cell where we begin consolidating code, so ensure that you bring over all
#   necessary components from the previous steps as we continue to build on the Transformer.

# 7. Model: Embedding Layer, Generate Function, Training Loop
#    - Create a dataclass `GPTConfig` to store model hyperparameters.
#    - Implement the GPT class with an embedding layer, producing logits for output.
#    - Implement the `generate()` function to generate text using the trained model.
#    - Instantiate a model.
#    - Print the number of trainable parameters.
#    - Pass one minibatch through the model.
#    - Generate a sample from the model (output should be garbled/random initially).

import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 4

#-------------------


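# Re-read the dataset downloaded in step 2 so this cell is self-contained.
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))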
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size


# -----------------------

class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.vocab_size)

    def forward(self, x, targets=None):

        logits = self.token_embeddings(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=100):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance, not the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

xb, yb = get_batch('train')

logits, loss = model(xb, yb)

starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))


vocab size: 65
Total Parameters: 4225
Trainable Parameters: 4225

pYCXxfRkRZd
wc'wfNfT;OLlTEeC K
jxqPToTb?bXAUG:C-SGJO-33SM:C?YI3a
hs:LVXJFhXeNuwqhObxZ.tSVrddXlaSZaNe
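
Two numbers help when reading this output: the parameter count and the loss a model with no knowledge would produce. The sketch below is plain arithmetic and assumes only the vocabulary size of 65 printed above.

```python
import math

vocab_size = 65

# The vocab_size x vocab_size embedding table is this model's only parameter tensor.
n_params = vocab_size * vocab_size

# Cross-entropy of a prediction that is uniform over the vocabulary.
uniform_loss = math.log(vocab_size)

print(n_params)                # 4225
print(round(uniform_loss, 4))  # 4.1744
```

An untrained model can start above this reference value, because `nn.Embedding` weights are initialized from a standard normal distribution and the initial logits are therefore not uniform. The first loss printed in step 9 of the exercises (about 4.73) is an example.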


### 8. Model: Training Loop

In [None]:
# 8. Model: Training Loop
#    - Set up the training loop with AdamW optimizer.
#    - block_size: 8, batch_size: 32, training_steps: 3000, learning_rate: 1e-2
#    - Generate a sample from the model.
#    - The output should show more structure after training.

import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

#-------------------


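# Re-read the dataset downloaded in step 2 so this cell is self-contained.
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))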
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size


# -----------------------

class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.vocab_size)

    def forward(self, x, targets=None):

        logits = self.token_embeddings(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=100):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance rather than the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()

print(f"loss: {loss}")

starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 4225
Trainable Parameters: 4225
loss: 2.5201268196105957



CEThik brid owindakis s, ble

Hiset bube d e.
S:
O:3 my d?
LUCous:
Wanthar u qur, vet?
F dXENDoate
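
One way to read the final loss of about 2.52 is to compare it with the loss of a model that guesses uniformly over the 65-character vocabulary. The following is a minimal sketch of that baseline using only the standard library; the variable names `uniform_loss` and `trained_loss` are illustrative and not part of the workbook.

```python
import math

vocab_size = 65  # vocabulary size printed by the cell above

# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = math.log(vocab_size)

trained_loss = 2.52  # approximate final loss printed by the cell above

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"improvement over baseline: {uniform_loss - trained_loss:.3f} nats")
```

An untrained model can start above this baseline, because randomly initialized logits are not uniform and are confidently wrong on some tokens.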


### 9. Evaluation Loop

In [None]:
# 9. Evaluation Loop
#    - Implement the `estimate_loss()` function to evaluate the model on training and validation data.
#    - Ensure that the model is in evaluation mode during this process.
#    - Set up the training loop as before.
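#    - Evaluate periodically with `estimate_loss()`: every `eval_interval` = 300 steps, averaged over `eval_steps` = 50 batches per split.
#      Over 3,000 training steps this prints 10 training/validation loss pairs.
#    - Generate a sample from the model.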

import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size


# -----------------------

class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.vocab_size)

    def forward(self, x, targets=None):

        logits = self.token_embeddings(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance rather than the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 4225
Trainable Parameters: 4225
Training loss: 4.7332258224487305, Validation loss: 4.725982189178467
Training loss: 2.7876455783843994, Validation loss: 2.8157029151916504
Training loss: 2.5510098934173584, Validation loss: 2.5644922256469727
Training loss: 2.491964340209961, Validation loss: 2.507617950439453
Training loss: 2.4817771911621094, Validation loss: 2.515298843383789
Training loss: 2.471799850463867, Validation loss: 2.4981491565704346
Training loss: 2.487903594970703, Validation loss: 2.495173454284668
Training loss: 2.47419810295105, Validation loss: 2.504297971725464
Training loss: 2.4576833248138428, Validation loss: 2.494094133377075
Training loss: 2.448154926300049, Validation loss: 2.496605157852173



CExthy brid owindakis by ble

Hisen bobe t e.
S:
O:3 my d?
LUCous:
Wanthar usqur, vet?
F dXENDoate awice my.

Hastacom oroup
Yowhthetof is h ble mil ndill, ath iree s, hein lat Heridrovets, anend l 
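
The `estimate_loss()` function above reads `model`, `get_batch`, and `eval_steps` from global scope. As a sketch of the same idea with explicit arguments, the version below can be run on its own. `TinyBigram`, `toy_get_batch`, and `estimate_loss_explicit` are hypothetical names introduced for illustration, and the random batches stand in for the real dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyBigram(nn.Module):
    """Hypothetical stand-in model with the same (logits, loss) interface."""

    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, targets=None):
        logits = self.table(x)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss


def toy_get_batch(split):
    # Random token ids; a real get_batch would slice train_data or val_data.
    x = torch.randint(0, 5, (4, 8))
    y = torch.randint(0, 5, (4, 8))
    return x, y


@torch.no_grad()  # no autograd graph is built during evaluation
def estimate_loss_explicit(model, get_batch, eval_steps):
    was_training = model.training
    model.eval()  # matters once layers such as dropout are added
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[i] = loss.item()
        out[split] = losses.mean().item()  # plain Python float
    model.train(was_training)  # restore the previous mode
    return out


toy_model = TinyBigram(vocab_size=5)
result = estimate_loss_explicit(toy_model, toy_get_batch, eval_steps=10)
print(result)
```

Averaging over several batches smooths out the batch-to-batch noise that a single training-step loss would show.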


## Part 3: Enhancing the GPT Model

### 11. Model: Positional Embeddings

In [None]:
# 11. Model: Positional Embeddings
#    - Add positional embeddings to the model.
#    - Train, evaluate, and generate a sample from the model.


import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size


# -----------------------

class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))  # (T, n_embd)
        emb = tok_emb + pos_emb  # broadcasts to (B, T, n_embd)
        logits = self.lm_head(emb)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance rather than the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 4481
Trainable Parameters: 4481
Training loss: 4.474525451660156, Validation loss: 4.480035305023193
Training loss: 2.522061347961426, Validation loss: 2.5282061100006104
Training loss: 2.5352325439453125, Validation loss: 2.5343902111053467
Training loss: 2.4970157146453857, Validation loss: 2.5091142654418945
Training loss: 2.4937198162078857, Validation loss: 2.5417208671569824
Training loss: 2.511420965194702, Validation loss: 2.5166428089141846
Training loss: 2.5183358192443848, Validation loss: 2.521740198135376
Training loss: 2.511129140853882, Validation loss: 2.526968002319336
Training loss: 2.483370065689087, Validation loss: 2.5209455490112305
Training loss: 2.472799062728882, Validation loss: 2.5132687091827393



CExthy brid owindakis s, ble

Hirenk obe d e.
S:
O:
IS:
Falatanss:
Wanthar usqur he.
War dilasoaten wice my.
Whandarom oroug
Yowns
MERf inth ble mil ndilincath iree sengcin latisttid ovets, and Win 
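
The sum `tok_emb + pos_emb` in the model above relies on broadcasting: the token embeddings have shape (B, T, n_embd) and the positional embeddings have shape (T, n_embd). The following standalone sketch shows the shapes involved. The batch size of 4 and the variable names are illustrative choices, not part of the workbook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, block_size, n_embd = 65, 8, 32  # values used in the cell above
B, T = 4, 8                                 # illustrative batch of 4 sequences

token_embeddings = nn.Embedding(vocab_size, n_embd)
positional_embeddings = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (B, T))
tok_emb = token_embeddings(idx)                   # (B, T, n_embd)
pos_emb = positional_embeddings(torch.arange(T))  # (T, n_embd)
emb = tok_emb + pos_emb                           # broadcast over the batch dimension

print(tok_emb.shape, pos_emb.shape, emb.shape)

# The same position vectors are added to every sequence in the batch.
same_offset = torch.allclose(emb[0] - tok_emb[0], emb[1] - tok_emb[1], atol=1e-5)
print(same_offset)
```

Because the position table has only `block_size` rows, `generate()` crops the context to the last `block_size` tokens before calling the model.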


### 12. Model: Single Attention Head

In [None]:
# 12. Model: Single Attention Head
#    - Implement a single attention head.
#    - Train, evaluate, and generate a sample from the model.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size
    head_size:int = 32


# -----------------------

class Head(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.head_size = config.head_size
        self.query = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.key = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.value = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(config.block_size, config.block_size)))

    def forward(self, x):
        B, T, C = x.shape

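        q = self.query(x)  # (B, T, head_size)
        k = self.key(x)    # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)

        # Scaled dot-product scores between every pair of positions: (B, T, T)
        attention_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        # Causal mask: position t may only attend to positions <= t
        attention_scores = torch.masked_fill(attention_scores, self.tril[:T, :T]==0, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)  # each row sums to 1
        out = attention_weights @ v  # (B, T, head_size)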

        return out



class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.attention = Head(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # broadcasts to (B, T, n_embd)
        x = self.attention(x)  # (B, T, head_size)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance rather than the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 7553
Trainable Parameters: 7553
Training loss: 4.196639537811279, Validation loss: 4.2004075050354
Training loss: 2.4888217449188232, Validation loss: 2.5333635807037354
Training loss: 2.4639956951141357, Validation loss: 2.487450361251831
Training loss: 2.433098077774048, Validation loss: 2.476203203201294
Training loss: 2.4688873291015625, Validation loss: 2.4878597259521484
Training loss: 2.417926549911499, Validation loss: 2.457453727722168
Training loss: 2.422887086868286, Validation loss: 2.4499638080596924
Training loss: 2.4246163368225098, Validation loss: 2.436007499694824
Training loss: 2.3890421390533447, Validation loss: 2.4294393062591553
Training loss: 2.425231456756592, Validation loss: 2.405696392059326

Whent ikind
Ocowr, hyo lay bth
Y: ant bobe ale.
S:
O-'ts thalild
hy ar hthar uwearthe.
War dthay ate awicromy.

HAEROYom onou waowns, tof itie botharl ndill, aes iree sen cie latiHet lrovets, and th p
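
The causal mask in the `Head` class above can be inspected in isolation. The sketch below uses random queries and keys with illustrative sizes (T = 4, head_size = 8) and prints the resulting attention weights; it is a small demonstration of the masking step, not a replacement for the class.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, head_size = 4, 8  # illustrative sizes
q = torch.randn(1, T, head_size)
k = torch.randn(1, T, head_size)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_size)  # (1, T, T)
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))       # hide future positions
weights = F.softmax(scores, dim=-1)

print(weights[0])
```

Every entry above the diagonal is zero, each row sums to 1, and the first row puts all of its weight on the first position because that is the only position it can see.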


### 13. Model: Multi-Head Attention

In [None]:
# 13. Model: Multi-Head Attention
#    - Expand the model to include multiple attention heads.
#    - Implement a projection layer to combine the outputs of the multiple heads.
#    - Train, evaluate, and generate a sample from the model.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size
    head_size:int = 8
    n_heads:int = 4


# -----------------------

class Head(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.head_size = config.head_size
        self.query = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.key = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.value = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(config.block_size, config.block_size)))

    def forward(self, x):
        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        attention_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        attention_scores = torch.masked_fill(attention_scores, self.tril[:T, :T]==0, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)
        out = attention_weights @ v

        return out

class MultiHeadAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.heads = nn.ModuleList([Head(config) for _ in range(config.n_heads)])
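        # Input width is the concatenated head outputs (equal to n_embd with the defaults above)
        self.proj_o = nn.Linear(config.n_heads * config.head_size, config.n_embd)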

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj_o(out)
        return out



class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.attention = MultiHeadAttention(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # broadcasts to (B, T, n_embd)
        x = self.attention(x)  # (B, T, n_embd)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance rather than the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 8609
Trainable Parameters: 8609
Training loss: 4.203001499176025, Validation loss: 4.203583240509033
Training loss: 2.423476457595825, Validation loss: 2.4322385787963867
Training loss: 2.3412063121795654, Validation loss: 2.374859571456909
Training loss: 2.3324356079101562, Validation loss: 2.366330146789551
Training loss: 2.2979342937469482, Validation loss: 2.339296817779541
Training loss: 2.274425745010376, Validation loss: 2.30216646194458
Training loss: 2.2752697467803955, Validation loss: 2.3007593154907227
Training loss: 2.2543351650238037, Validation loss: 2.289290428161621
Training loss: 2.2191426753997803, Validation loss: 2.277108907699585
Training loss: 2.222029685974121, Validation loss: 2.274542808532715

WAll be Rer
wcowfach O la, bt madisen bobe to tarshr-' my calieanss:
Whit If us hat vet?

MEO:
Do,
Buswice my.

Hand, I zo mus
Yowns, to wit he me my wnd,
Whates is ens, I in latiselidrev the he quing
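
In the multi-head model above, each head returns a tensor of shape (B, T, head_size), and the heads are concatenated along the last dimension. With `n_heads = 4` and `head_size = 8`, the concatenated width is 32, which matches `n_embd`, so the projection layer maps 32 features to 32 features. The sketch below illustrates only the concatenation step; the random tensors are stand-ins for real head outputs, and the batch size of 2 is an arbitrary choice.

```python
import torch

torch.manual_seed(0)

B, T = 2, 8
n_heads, head_size = 4, 8  # values from GPTconfig in the cell above

# Stand-ins for the per-head outputs; real heads would compute attention.
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_heads)]

combined = torch.cat(head_outputs, dim=-1)  # (B, T, n_heads * head_size)
print(combined.shape)

# Each head occupies its own contiguous slice of the last dimension.
second_head_slice = combined[..., head_size:2 * head_size]
print(torch.equal(second_head_slice, head_outputs[1]))
```

Concatenation alone keeps the heads separate; the projection layer that follows is what lets the model mix information across them.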


## Part 4: Building the Transformer Block

### 14. Model: MLP

In [None]:
# 14. Model: MLP
#    - Add a multi-layer perceptron (MLP) to the model.
#    - The MLP should consist of an up-projection (n_embd -> 4 * n_embd), a ReLU activation, and a down-projection (4 * n_embd -> n_embd).

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size
    head_size:int = 8
    n_heads:int = 4


# -----------------------

class Head(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.head_size = config.head_size
        self.query = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.key = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.value = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(config.block_size, config.block_size)))

    def forward(self, x):
        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        attention_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        attention_scores = torch.masked_fill(attention_scores, self.tril[:T, :T]==0, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)
        out = attention_weights @ v

        return out

class MultiHeadAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.heads = nn.ModuleList([Head(config) for _ in range(config.n_heads)])
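        # Input width is the concatenated head outputs (equal to n_embd with the defaults above)
        self.proj_o = nn.Linear(config.n_heads * config.head_size, config.n_embd)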

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj_o(out)
        return out

class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.up = nn.Linear(config.n_embd, config.n_embd * 4)
        self.relu = nn.ReLU()
        self.down = nn.Linear(config.n_embd * 4, config.n_embd)

    def forward(self, x):
        x = self.up(x)
        x = self.relu(x)
        x = self.down(x)
        return x


class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.attention = MultiHeadAttention(config)
        self.mlp = MLP(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
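        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # broadcasts to (B, T, n_embd)
        x = self.attention(x)  # (B, T, n_embd)
        x = self.mlp(x)  # (B, T, n_embd)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None: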
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance rather than the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 16961
Trainable Parameters: 16961
Training loss: 4.153048992156982, Validation loss: 4.156348705291748
Training loss: 2.4448862075805664, Validation loss: 2.453115940093994
Training loss: 2.3416080474853516, Validation loss: 2.3415679931640625
Training loss: 2.2846858501434326, Validation loss: 2.320159912109375
Training loss: 2.2464277744293213, Validation loss: 2.2876460552215576
Training loss: 2.2493317127227783, Validation loss: 2.2965967655181885
Training loss: 2.240849018096924, Validation loss: 2.2827556133270264
Training loss: 2.2084546089172363, Validation loss: 2.274306535720825
Training loss: 2.1917595863342285, Validation loss: 2.2594780921936035
Training loss: 2.1379611492156982, Validation loss: 2.2440052032470703

When befor you will to lay be mad and bobe do: and Or I lechaitangy: manth fou que hert?
Wedtlah anes wice my thand a will mus
You sproof is hy me mill dill, aeg is ens, hand lat Herid on to and I me 
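
The parameter totals printed in steps 8 through 14 (4225, 4481, 7553, 8609, 16961) can be checked by hand from the layer shapes. The sketch below does that arithmetic in plain Python. The helper `linear_params` and the variable names are hypothetical, introduced only for this check.

```python
vocab_size, block_size, n_embd = 65, 8, 32
head_size, n_heads = 8, 4


def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Step 8: a single vocab_size x vocab_size embedding table
bigram = vocab_size * vocab_size

# Step 11: token table + position table + lm_head
embeddings = vocab_size * n_embd + block_size * n_embd
lm_head = linear_params(n_embd, vocab_size)
with_positions = embeddings + lm_head

# Step 12: one head with head_size = 32 (query, key, value, no bias)
single_head = with_positions + 3 * linear_params(n_embd, 32, bias=False)

# Step 13: four heads with head_size = 8, plus the output projection
heads = n_heads * 3 * linear_params(n_embd, head_size, bias=False)
proj = linear_params(n_heads * head_size, n_embd)
multi_head = with_positions + heads + proj

# Step 14: MLP with a 4x hidden width
mlp = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
with_mlp = multi_head + mlp

print(bigram, with_positions, single_head, multi_head, with_mlp)
# prints: 4225 4481 7553 8609 16961
```

The `tril` mask does not appear in these totals because it is registered as a buffer, not as a parameter.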


### 15. Model: Transformer Block

In [None]:
# 15. Model: Transformer Block
#    - Combine multi-head attention and MLP into a single Transformer block.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size
    head_size:int = 8
    n_heads:int = 4


# -----------------------

class Head(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.head_size = config.head_size
        self.query = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.key = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.value = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(config.block_size, config.block_size)))

    def forward(self, x):
        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        attention_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        attention_scores = torch.masked_fill(attention_scores, self.tril[:T, :T]==0, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)
        out = attention_weights @ v

        return out

class MultiHeadAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.heads = nn.ModuleList([Head(config) for _ in range(config.n_heads)])
        self.proj_o = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj_o(out)
        return out

class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.up = nn.Linear(config.n_embd, config.n_embd * 4)
        self.relu = nn.ReLU()
        self.down = nn.Linear(config.n_embd * 4, config.n_embd)

    def forward(self, x):
        x = self.up(x)
        x = self.relu(x)
        x = self.down(x)
        return x

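class Block(nn.Module):
    """One Transformer block: multi-head self-attention followed by an MLP."""

    def __init__(self, config):
        super().__init__()
        self.attention = MultiHeadAttention(config)
        self.mlp = MLP(config)

    def forward(self, x):
        # x: (B, T, n_embd); both sub-layers preserve this shape.
        x = self.attention(x)  # tokens exchange information across positions
        x = self.mlp(x)        # each position is then transformed independently
        return x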



class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.blocks = nn.Sequential(
            Block(config)
        )
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance, not the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 16961
Trainable Parameters: 16961
Training loss: 4.153048992156982, Validation loss: 4.156348705291748
Training loss: 2.4448862075805664, Validation loss: 2.453115940093994
Training loss: 2.3416080474853516, Validation loss: 2.3415679931640625
Training loss: 2.2846858501434326, Validation loss: 2.320159912109375
Training loss: 2.2464277744293213, Validation loss: 2.2876460552215576
Training loss: 2.2493317127227783, Validation loss: 2.2965967655181885
Training loss: 2.240849018096924, Validation loss: 2.2827556133270264
Training loss: 2.2084546089172363, Validation loss: 2.274306535720825
Training loss: 2.1917595863342285, Validation loss: 2.2594780921936035
Training loss: 2.1379611492156982, Validation loss: 2.2440052032470703

When befor you will to lay be mad and bobe do: and Or I lechaitangy: manth fou que hert?
Wedtlah anes wice my thand a will mus
You sproof is hy me mill dill, aeg is ens, hand lat Herid on to and I me 
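
The `Total Parameters: 16961` figure printed above can be reproduced by hand from the layer shapes in step 15. The sketch below is only a sanity check: `count_step15_params` is a hypothetical helper that is not part of the workbook, and its defaults are the values used in the cell above (`vocab_size=65`, `block_size=8`, `n_embd=32`, `head_size=8`, `n_heads=4`). The `tril` mask is registered as a buffer, so it does not count as a parameter.

```python
# Hypothetical helper (not part of the workbook): counts parameters
# from the layer shapes used in step 15.
def count_step15_params(vocab_size=65, block_size=8, n_embd=32, head_size=8, n_heads=4):
    def linear(n_in, n_out, bias=True):
        # nn.Linear stores an (n_out, n_in) weight plus an optional n_out bias.
        return n_in * n_out + (n_out if bias else 0)

    return {
        "token_embeddings": vocab_size * n_embd,
        "positional_embeddings": block_size * n_embd,
        # query, key and value projections (no bias) for every head
        "attention_heads": n_heads * 3 * linear(n_embd, head_size, bias=False),
        "proj_o": linear(n_embd, n_embd),
        "mlp_up": linear(n_embd, 4 * n_embd),
        "mlp_down": linear(4 * n_embd, n_embd),
        "lm_head": linear(n_embd, vocab_size),
    }


counts = count_step15_params()
for name, n in counts.items():
    print(f"{name:>22}: {n}")

total = sum(counts.values())
print(f"{'total':>22}: {total}")  # 16961, matching the cell output above
```

The MLP accounts for roughly half of the parameters even in this tiny model, because its hidden layer is four times wider than `n_embd`.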


## Part 5: Final Enhancements

### 16. Model: Skip Connections, normalization and dropout

In [None]:
# 16. Model: Skip Connections, normalization and dropout
#    - Implement skip connections (residual connections) around the attention and MLP layers.
#    - Apply layer normalization to the input of each sub-layer (pre-norm), inside the residual branch, to stabilize training.
#    - Apply a final layer normalization before the language-model head.
#    - Add dropout to the attention weights and to the output of each sub-layer to reduce overfitting.
#    - Stack two or more Blocks.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size
    head_size:int = 8
    n_heads:int = 4
    n_layers:int = 2
    dropout:float = 0.1


# -----------------------

class Head(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.head_size = config.head_size
        self.query = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.key = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.value = nn.Linear(config.n_embd, config.head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(config.block_size, config.block_size)))
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        B, T, C = x.shape

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        attention_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        attention_scores = torch.masked_fill(attention_scores, self.tril[:T, :T]==0, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        out = attention_weights @ v

        return out

class MultiHeadAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.heads = nn.ModuleList([Head(config) for _ in range(config.n_heads)])
        self.proj_o = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj_o(out)
        return out

class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.up = nn.Linear(config.n_embd, config.n_embd * 4)
        self.relu = nn.ReLU()
        self.down = nn.Linear(config.n_embd * 4, config.n_embd)

    def forward(self, x):
        x = self.up(x)
        x = self.relu(x)
        x = self.down(x)
        return x

class Block(nn.Module):
    """Pre-norm Transformer block with residual connections and dropout."""

    def __init__(self, config):
        super().__init__()
        self.attention = MultiHeadAttention(config)
        self.mlp = MLP(config)
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        # Pre-norm residual: normalize the branch input, then add the branch output back to x.
        x = x + self.dropout(self.attention(self.ln1(x)))
        x = x + self.dropout(self.mlp(self.ln2(x)))
        return x



class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.layers = nn.Sequential(*[Block(config) for _ in range(config.n_layers)])
        self.ln = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))
        x = tok_emb + pos_emb
        x = self.layers(x)
        x = self.ln(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance, not the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 29761
Trainable Parameters: 29761
Training loss: 4.3595194816589355, Validation loss: 4.3730082511901855
Training loss: 2.3513779640197754, Validation loss: 2.3574705123901367
Training loss: 2.256624698638916, Validation loss: 2.2850542068481445
Training loss: 2.2053983211517334, Validation loss: 2.242753267288208
Training loss: 2.1784188747406006, Validation loss: 2.2001447677612305
Training loss: 2.1738219261169434, Validation loss: 2.2290825843811035
Training loss: 2.134580135345459, Validation loss: 2.204617738723755
Training loss: 2.105912446975708, Validation loss: 2.1689751148223877
Training loss: 2.07553768157959, Validation loss: 2.161000967025757
Training loss: 2.078049421310425, Validation loss: 2.1510331630706787

LUCIONLRDICHARD IOLONTEL:
My, ther he lome.
Fare brout rhibt:
Fhorssty noks my how will of My batood oust goody
Costown.

DUKINCE:
But miclay ten.

ORKKE LARLUCIO:
Sward,
Mut they liveing;
He lery you


### 17. Model: MultiHeadAttention [Alternate Implementation] (Optional)

In [None]:
# 17. Model: MultiHeadAttention [Alternate Implementation] (Optional)
#    - Implement MultiHeadAttention as a single class that computes all heads with one batched matrix multiplication.
#    - Refer to the following resource for guidance:
#      https://github.com/rasbt/LLMs-from-scratch/tree/main/ch03/02_bonus_efficient-multihead-attention


import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 8
batch_size = 32

training_steps = 3000
learning_rate = 1e-2

eval_interval = 300
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 32
    vocab_size:int = vocab_size
    block_size:int = block_size
    head_size:int = 8
    n_heads:int = 4
    n_layers:int = 2
    dropout:float = 0.1


# -----------------------

class MultiHeadAttention(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.head_size = config.head_size
        self.n_heads = config.n_heads
        self.query = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.key = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.value = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(config.block_size, config.block_size)))
        self.o_proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        B, T, C = x.shape

        q = self.query(x).view(B, T, self.n_heads, self.head_size)
        k = self.key(x).view(B, T, self.n_heads, self.head_size)
        v = self.value(x).view(B, T, self.n_heads, self.head_size)

        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        # (B, n_heads, T, head_size)

        attention_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        # (B, n_heads, T, T)
        attention_scores = attention_scores.masked_fill(self.tril[:T, :T]==0, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)
        out = attention_weights @ v
        # (B, n_heads, T, head_size)
        out = out.transpose(1,2)
        # (B, T, n_heads, head_size)
        out = out.contiguous().view(B, T, C)

        out = self.o_proj(out)

        return out



class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.up = nn.Linear(config.n_embd, config.n_embd * 4)
        self.relu = nn.ReLU()
        self.down = nn.Linear(config.n_embd * 4, config.n_embd)

    def forward(self, x):
        x = self.up(x)
        x = self.relu(x)
        x = self.down(x)
        return x

class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.attention = MultiHeadAttention(config)
        self.mlp = MLP(config)
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = x + self.dropout(self.attention(self.ln1(x)))
        x = x + self.dropout(self.mlp(self.ln2(x)))
        return x



class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.layers = nn.Sequential(*[Block(config) for _ in range(config.n_layers)])
        self.ln = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))
        x = tok_emb + pos_emb
        x = self.layers(x)
        x = self.ln(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance, not the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 29761
Trainable Parameters: 29761
Training loss: 4.336919784545898, Validation loss: 4.352819919586182
Training loss: 2.361799955368042, Validation loss: 2.358400583267212
Training loss: 2.2427780628204346, Validation loss: 2.2684075832366943
Training loss: 2.1912455558776855, Validation loss: 2.236848831176758
Training loss: 2.1693694591522217, Validation loss: 2.197221279144287
Training loss: 2.1507184505462646, Validation loss: 2.2159225940704346
Training loss: 2.1148903369903564, Validation loss: 2.1888115406036377
Training loss: 2.085590362548828, Validation loss: 2.1606204509735107
Training loss: 2.0637192726135254, Validation loss: 2.156837224960327
Training loss: 2.0486533641815186, Validation loss: 2.1241254806518555

He blive, the thou! file have poun Froy tarrast chank and the stoond sorome I but thensees
h arefe horce,
Dhee, thangountly my nothis troe gleBe the duy: as to, bale rechath are,
Lintar betwle if-f, y
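
The parameter count is identical in steps 16 and 17 (29761) because the single-class version only reorganizes the same projections. The sketch below illustrates this for the query projection, under the assumption that `n_heads * head_size == n_embd`: after `view` and `transpose`, head `h` of the fused projection matches a separate per-head projection whose weight is rows `h*head_size:(h+1)*head_size` of the fused weight. The sizes and variable names here are illustrative, not taken from the workbook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, n_embd, n_heads, head_size = 2, 8, 32, 4, 8  # n_heads * head_size == n_embd

x = torch.randn(B, T, n_embd)
query = nn.Linear(n_embd, n_embd, bias=False)  # one fused projection for all heads

# Fused path (step 17 style): project once, then split the last dimension into heads.
q_fused = query(x).view(B, T, n_heads, head_size).transpose(1, 2)  # (B, n_heads, T, head_size)

# Per-head path (step 16 style): head h has its own (head_size, n_embd) weight matrix.
# Rows h*head_size:(h+1)*head_size of the fused weight play that role here.
q_per_head = torch.stack(
    [x @ query.weight[h * head_size:(h + 1) * head_size].T for h in range(n_heads)],
    dim=1,
)  # (B, n_heads, T, head_size)

same = torch.allclose(q_fused, q_per_head, atol=1e-5)
print(q_fused.shape, same)
```

The fused layout replaces a Python loop over heads with a single batched matrix multiplication, which is typically faster on a GPU.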


## Part 6: Final Training and Evaluation

### 18. Final Evaluation and Text Generation

In [None]:
# 18. Final Evaluation and Text Generation
#    - Train the final GPT model with the complete architecture on a GPU for improved performance.
#    - Model hyperparameters (n_heads * head_size must equal n_embd):
#        - block_size: 256
#        - n_embd: 128
#        - n_heads: 8
#        - n_layers: 4
#        - head_size: 16
#        - dropout: 0.1
#    - Training hyperparameters:
#        - batch_size: 128
#        - training_steps: 5000
#        - learning_rate: 1e-3
#        - eval_interval: 500
#        - eval_steps: 50
#    - Evaluate the final model on the validation set at regular intervals.
#    - Use the trained model to generate new text samples and assess its performance.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(1337)

#-------------------

block_size = 256
batch_size = 128

training_steps = 5000
learning_rate = 1e-3

eval_interval = 500
eval_steps = 50

#-------------------


chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"vocab size: {vocab_size}")

stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

n = int(0.9*len(data))

train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix], dim=0)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix], dim=0)
    x = x.to(device)
    y = y.to(device)
    return x,y

# -----------------------

from dataclasses import dataclass

@dataclass
class GPTconfig:
    n_embd:int = 128
    vocab_size:int = vocab_size
    block_size:int = block_size
    head_size:int = 16
    n_heads:int = 8
    n_layers:int = 4
    dropout:float = 0.1


# -----------------------

class MultiHeadAttention(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.head_size = config.head_size
        self.n_heads = config.n_heads
        self.query = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.key = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.value = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(config.block_size, config.block_size)))
        self.o_proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        B, T, C = x.shape

        q = self.query(x).view(B, T, self.n_heads, self.head_size)
        k = self.key(x).view(B, T, self.n_heads, self.head_size)
        v = self.value(x).view(B, T, self.n_heads, self.head_size)

        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        # (B, n_heads, T, head_size)

        attention_scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)
        # (B, n_heads, T, T)
        attention_scores = attention_scores.masked_fill(self.tril[:T, :T]==0, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)
        out = attention_weights @ v
        # (B, n_heads, T, head_size)
        out = out.transpose(1,2)
        # (B, T, n_heads, head_size)
        out = out.contiguous().view(B, T, C)

        out = self.o_proj(out)

        return out



class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.up = nn.Linear(config.n_embd, config.n_embd * 4)
        self.relu = nn.ReLU()
        self.down = nn.Linear(config.n_embd * 4, config.n_embd)

    def forward(self, x):
        x = self.up(x)
        x = self.relu(x)
        x = self.down(x)
        return x

class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.attention = MultiHeadAttention(config)
        self.mlp = MLP(config)
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = x + self.dropout(self.attention(self.ln1(x)))
        x = x + self.dropout(self.mlp(self.ln2(x)))
        return x



class GPT(nn.Module):

    def __init__(self, config:GPTconfig):
        super().__init__()
        self.config = config
        self.token_embeddings = nn.Embedding(config.vocab_size, config.n_embd)
        self.positional_embeddings = nn.Embedding(config.block_size, config.n_embd)
        self.layers = nn.Sequential(*[Block(config) for _ in range(config.n_layers)])
        self.ln = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x, targets=None):

        B, T = x.shape

        tok_emb = self.token_embeddings(x)
        pos_emb = self.positional_embeddings(torch.arange(T, device=x.device))
        x = tok_emb + pos_emb
        x = self.layers(x)
        x = self.ln(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits_flat = logits.view(B * T, C)
            targets_flat = targets.view(B * T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens=200):

        for _ in range(max_new_tokens):
            idx_clipped = idx[:, -self.config.block_size:]
            logits, loss = self(idx_clipped)
            logits = logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=-1)
        return idx

# -------------------------

model = GPT(GPTconfig()).to(device)  # pass a config instance, not the class itself

total_parameters = sum(p.numel() for p in model.parameters())
trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_parameters}")
print(f"Trainable Parameters: {trainable_parameters}")

# -------------------------

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.ones(eval_steps)
        for i in range(eval_steps):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[i] = loss
        out[split] = losses.mean()
    model.train()
    return out


# -------------------------

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for i in range(training_steps):

    if (i % eval_interval == 0):
        losses = estimate_loss()
        print(f"Training loss: {losses['train']}, Validation loss: {losses['val']}")

    xb, yb = get_batch('train')

    _, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    optimizer.step()


starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text)[0].tolist()))

vocab size: 65
Total Parameters: 841281
Trainable Parameters: 841281
Training loss: 4.32258415222168, Validation loss: 4.3253092765808105
Training loss: 1.9829299449920654, Validation loss: 2.0749642848968506
Training loss: 1.6285208463668823, Validation loss: 1.7984155416488647
Training loss: 1.4782105684280396, Validation loss: 1.6699893474578857
Training loss: 1.4023021459579468, Validation loss: 1.6121479272842407
Training loss: 1.3509314060211182, Validation loss: 1.584789752960205
Training loss: 1.3177154064178467, Validation loss: 1.553526759147644
Training loss: 1.283528447151184, Validation loss: 1.5408179759979248
Training loss: 1.2645879983901978, Validation loss: 1.5332915782928467
Training loss: 1.243046522140503, Validation loss: 1.5287652015686035

Then remies from thy hope shall be clock,
That we enjoy, nurse Marcius thy father's,
For Alike, our protection how of lend
As as fingly? why, I pray even up his
New from that flour him and miserous
Th
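
One way to read the loss values above is to compare them with a uniform guess over the 65-character vocabulary and to convert them into perplexity. The short calculation below is a sketch using only the numbers printed in this section; the variable names are illustrative.

```python
import math

vocab_size = 65  # printed by the cells above

# Cross-entropy (in nats) of a model that assigns equal probability to every character.
uniform_loss = math.log(vocab_size)
print(f"uniform baseline: {uniform_loss:.3f}")  # 4.174

# exp(loss) is the perplexity: the effective number of equally likely next characters.
final_val_loss = 1.5287652015686035  # last validation loss reported in step 18
perplexity = math.exp(final_val_loss)
print(f"validation perplexity: {perplexity:.2f}")
```

The first evaluation in each run (roughly 4.15 to 4.36) sits close to the uniform baseline, as expected for an untrained model, while the final validation loss corresponds to choosing among roughly four to five equally likely characters at each step.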


In [None]:
starting_text = torch.tensor([[0]], dtype=torch.long, device=device)

print(decode(model.generate(starting_text, max_new_tokens=5000)[0].tolist()))



First Most Cominius' to our plate.

JULIET:
O you let the riving of you.

ROMEO:
I will't even it word by the baitip of soul,
Which it would be solemness, and think upon,
A poloybrouse Clarence! Dedarest none:
You woo must on the little of you?

Both:
Compare adument And that most perfeige.

Second Citizen:
What news? cousin, I had, what a produce smate,
Harking, ladier, what's the fault, who lord mays?

MENENIUS:
What cannot! told what's young axard together?

CORIOLANUS::
Lest your joyful forty didst-beshored
To jewel than readine your town: she was here,
Would come of makes, if he hoppy and firns
And kness'd two dies, and what thy words;
What would then father thanks of have stoud hath
Found of bids, or brother no pitience.
Or I,neverien not to chose;
But thy contracte?

Lover:
O, if he is news,
As yea; for these way: nurse may deserves the heaven my
With such corse-havir beats, nay bearthing down,
Figmend them as lid change affawel.

Clifford:
How none laws to his great air brook