## The transformer architecture for yung ting

<div>
<img src="https://heidloff.net/assets/img/2023/02/transformers.png" width="800"/>
</div>

- #### how does a generative pre-trained transformer differ from a transformer?
    - a GPT is decoder-only: it has no encoder and no encoder-decoder multi-head attention (the attention layer that the encoder output plugs into)
    - so each block is basically :
        - i) masked multi-head attention
        - ii) add and normalise
        - iii) feed forward
        - iv) add and normalise
    - and after the last block :
        - v) linear transformation
        - vi) softmax

# Converting the Romeo and Juliet model to a GPT model for yung ting

## 1) import necessary modules and set the hyperparameters

- same as before, but we define the hyperparameters up here as per best practice
- what each hyperparameter is for :
    - `device` : device we are using, i.e. `cuda` (NVIDIA GPU), `mps` (Metal acceleration on Mac) or `cpu`
    - `block_size` : context length, i.e. how many tokens are in each sequence (you can think of this as the size of each brick and `batch_size` as how many bricks are stacked)
    - `batch_size` : number of sequences in each batch (how many bricks are stacked)
    - `max_iterations` : number of iterations for the training loop (adjust as needed)
    - `learning_rate` : learning rate for our model (too low or too high is bad, experiment to find what works)
    - `evaluation_iterations` : number of batches averaged per split inside `estimate_loss`; the training loop in step 7 also uses it as the number of iterations to wait between evaluations
    - `evaluation_interval` : defined in the cell below but not used anywhere in this notebook
    - `n_embd` : embedding dimension
    - `n_head` : number of attention heads in each transformer block
    - `n_layer` : number of transformer blocks stacked
    - `dropout` : dropout rate (probability of zeroing an activation during training)
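
To make the difference between `block_size` and `batch_size` concrete, here is a minimal plain-Python sketch. It is not the notebook's `get_batch`; the toy corpus and the small sizes are assumptions chosen so the output is easy to read. Each row is one context of `block_size` tokens, and a batch holds `batch_size` such rows.

```python
import random

# Toy values for illustration only; the notebook uses block_size = 64 and batch_size = 128.
block_size = 4   # tokens per sequence (context length)
batch_size = 3   # sequences per batch

data = list(range(20))  # stand-in for the encoded corpus (assumption)

random.seed(0)
# Pick batch_size random start offsets that leave room for block_size + 1 tokens.
starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]

x = [data[i:i + block_size] for i in starts]          # inputs
y = [data[i + 1:i + block_size + 1] for i in starts]  # targets: inputs shifted by one

for row_x, row_y in zip(x, y):
    print(row_x, "->", row_y)
```

Every target row is the input row shifted one position to the left, which is what lets the model learn to predict the next token at each position.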

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import mmap
import random
import pickle
import argparse

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
print(device)

block_size = 64
batch_size = 128
max_iterations = 5000
learning_rate = 3e-4
evaluation_iterations = 100
evaluation_interval = 200
n_embd = 384
n_head = 8
n_layer = 4
dropout = 0.2

cuda


## 2) read the text file with data and make a sorted set of characters to get the vocab_size 

- same as before (if needed refer to the bigram model file)

In [2]:
with open('data.txt', 'r', encoding='utf-8') as file:
    text = file.read()  
    
chars = sorted(set(text))
vocab_size = len(chars)
print(vocab_size)

71


## 3) make a character-level tokenizer and encode the text corpus

- same as before (if needed refer to the bigram model file)
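
As a quick reminder of what the character-level tokenizer does, here is a small round-trip sketch. The short `text` string is an assumed stand-in for `data.txt`, and `encode` / `decode` are written as plain functions rather than the lambdas used in the cell below.

```python
text = "hello world"  # stand-in corpus (assumption, not the real data.txt)

chars = sorted(set(text))  # the vocabulary: every distinct character, sorted
string_to_integer = {ch: i for i, ch in enumerate(chars)}
integer_to_string = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character to its integer id."""
    return [string_to_integer[c] for c in s]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(integer_to_string[i] for i in ids)

ids = encode("hello")
print(len(chars), ids, decode(ids))
```

A character that never appears in the corpus has no id, so encoding it raises a `KeyError`; prompts should only use characters from the training text.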

In [3]:
string_to_integer = { ch:i for i,ch in enumerate(chars) }
integer_to_string = { i:ch for i,ch in enumerate(chars) }
encoder = lambda s: [string_to_integer[c] for c in s]
decoder = lambda l: ''.join([integer_to_string[i] for i in l])

data = torch.tensor(encoder(text), dtype=torch.long)

## 4) Create training and Validation splits and define the get_batch function

- same as before (if needed refer to the bigram model file)

In [4]:
split_size = int(0.8*len(data))
training_data = data[:split_size]
validation_data = data[split_size:]

def get_batch(split):
    data = training_data if split == 'train' else validation_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y


## 5) Define the estimate loss function

- same as before (if needed refer to the bigram model file)
- the only difference is that we now use the `evaluation_iterations` hyperparameter properly; the value was hard-coded in the old bigram model file

In [5]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(evaluation_iterations)
        for k in range(evaluation_iterations):
            inputs, targets = get_batch(split)
            logits, loss = model(inputs, targets)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

## 6) Define the GPT Model class, Transformer Block class, Feed Forward Class, Multiple Head attention class and Head class and initialise a model and load it into our device 

- #### i) Changes to the GPTLanguageModel class initialisation (it was called the BigramModel class in the bigram model file):
    - `self.token_embedding_table = nn.Embedding(vocab_size, n_embd)` : we change the table from `vocab_size x vocab_size` to `vocab_size x n_embd`, so each token now maps to an `n_embd`-dimensional vector instead of directly to logits (with `n_embd = 384` and a vocabulary of 71 characters this makes the table a lot larger)
    - `self.position_embedding_table = nn.Embedding(block_size, n_embd)` : we add a positional embedding as per the architecture
        - the position embedding table provides positional information for each token in the sequence
    - `self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])` : we stack multiple transformer blocks on top of each other to give our model the required depth
        - we create a transformer block from the embedding dimension and the number of heads
        - we repeat the step above `n_layer` times
        - so if `n_head` is 5, `n_embd` is 100 and `n_layer` is 6, we create `Block(100, n_head=5)` 6 times
        - finally we chain them all with `nn.Sequential`
    - `self.ln_f = nn.LayerNorm(n_embd)` : we create a final layer normalisation over the embedding dimension (`n_embd`, the length of one row of the embedding table)
        - this is for stabilizing the outputs
    - `self.lm_head = nn.Linear(n_embd, vocab_size)` : linear transformation that projects the final embeddings to vocabulary logits (if you do not remember what logits are, refer to the bigram file)
    - `self.apply(self._init_weights)` : use the `_init_weights` function defined below to initialise the weights
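
To put numbers on "a lot larger", the sketch below counts the entries in each table. It only does arithmetic on values that appear in this notebook (`vocab_size` printed as 71, `n_embd = 384`, `block_size = 64`); the variable names for the counts are made up for illustration.

```python
# Values printed or set earlier in this notebook.
vocab_size = 71
n_embd = 384
block_size = 64

bigram_table = vocab_size * vocab_size      # old (vocab_size x vocab_size) table
token_table = vocab_size * n_embd           # new token embedding table
position_table = block_size * n_embd        # position embedding table
lm_head = n_embd * vocab_size + vocab_size  # weights plus bias of nn.Linear(n_embd, vocab_size)

print(bigram_table, token_table, position_table, lm_head)
```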

- #### ii) defining the _init_weights function for the GPT model class 
    - the `_init_weights` function accepts a module from the model (`self.apply` calls it once for every submodule)
        - `if isinstance(module, nn.Linear)` : checks if the passed-in module is an instance of `nn.Linear`, if it is :
            - `torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)` : initialises the weights from a normal distribution with a mean of 0 and a standard deviation of 0.02
            - `if module.bias is not None` : checks if the layer has a bias, if it does :
                - `torch.nn.init.zeros_(module.bias)` : initialises the bias to 0
        - `elif isinstance(module, nn.Embedding)` : otherwise checks if the passed-in module is an instance of `nn.Embedding`, if it is :
            - `torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)` : initialises the weights from a normal distribution with a mean of 0 and a standard deviation of 0.02
                - the only difference is that we do not check for a bias here, as `nn.Embedding` has no bias

- #### iii) updating the forward pass function to follow the GPT architecture 
    - we accept `index` and optionally `targets`, same as before
    - `B, T = index.shape` : our index tensor has a shape of (B, T) (batch size, sequence length); we unpack it first because we need `T` for the positions and `B * T` for the reshaping further down
    - `tok_emb = self.token_embedding_table(index)` : we retrieve the token embeddings for the given indices, shape (B, T, C)
    - `pos_emb = self.position_embedding_table(torch.arange(T, device=device))` : we create a range of positions `0 .. T-1` and retrieve their embeddings, shape (T, C)
    - `x = tok_emb + pos_emb` : we add the token and positional embeddings together (broadcast over the batch) to add positional information to the token embeddings
    - `x = self.blocks(x)` : we pass the embeddings through the transformer blocks
    - `x = self.ln_f(x)` : we apply the final layer normalisation to stabilise the outputs
    - `logits = self.lm_head(x)` : we project the normalized embeddings to vocabulary logits
    - `if targets is None` : we check if the optional targets were not given (i.e. we are not training) :
        - `loss = None` : we just set the loss to None as loss calculation is not needed
    - else we :
        - below is repeated from the bigram file
        - `logits = logits.view(B * T, -1)` : reshape our logits to (B*T, vocab_size) for loss computation
        - `targets = targets.view(B * T)` : reshape our targets to (B*T) for loss computation
        - we do the above reshaping because `F.cross_entropy` expects the class dimension second, i.e. (N, C) logits with (N) targets
        - `loss = F.cross_entropy(logits, targets)` : we calculate the scalar loss value using the cross entropy function
    - `return logits, loss` : finally we return the logits and the loss (can be None)
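
The flattening and the loss can be checked by hand with a plain-Python sketch. This is not `F.cross_entropy` itself; the `cross_entropy` helper below is a hand-written mean negative log-likelihood, and the all-zero logits are an assumed stand-in for an untrained model that treats every character as equally likely.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood; logits is a list of rows, targets a list of class ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
    return total / len(targets)

B, T, vocab_size = 2, 3, 71
# Untrained-model stand-in: every logit equal (assumption).
logits_btv = [[[0.0] * vocab_size for _ in range(T)] for _ in range(B)]
targets_bt = [[5, 17, 70], [0, 1, 2]]

# Flatten (B, T, vocab_size) -> (B*T, vocab_size) and (B, T) -> (B*T,)
flat_logits = [row for seq in logits_btv for row in seq]
flat_targets = [t for seq in targets_bt for t in seq]

loss = cross_entropy(flat_logits, flat_targets)
print(round(loss, 4), round(math.log(vocab_size), 4))
```

With uniform predictions the loss equals `ln(vocab_size)`, roughly 4.26 for 71 characters, which is in the same range as the 4.38 logged at iteration 0 in step 7.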

- #### iv) updating the generate tokens function to follow the GPT architecture
    - we accept `index` and `max_new_tokens` as parameters, same as before
    - `for _ in range(max_new_tokens)` : we have a loop that runs once for each new token we want to generate
        - `index_cond = index[:, -block_size:]` : we crop the input indices to the last `block_size` tokens to adhere to the model's maximum context length
        - `logits, _ = self.forward(index_cond)` : we pass the current context into the forward pass to get the logits for the current context
        - `logits = logits[:, -1, :]` : we focus on the logits of the last time step (the most recent token), which is why we index with -1
        - `probs = F.softmax(logits, dim=-1)` : we apply a softmax to convert the logits to probabilities
        - `index_next = torch.multinomial(probs, num_samples=1)` : we sample the next token index from the probability distribution
        - `index = torch.cat((index, index_next), dim=1)` : we concatenate the newly generated token to the provided indices
    - `return index` : after the loop we return the index with the newly generated tokens concatenated
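
The loop above can be sketched in plain Python for a single sequence. `toy_logits` is a hypothetical stand-in for the model's forward pass (it simply favours "last token + 1"), and the tiny `block_size` and `vocab_size` are assumptions; only the crop, softmax, sample and append steps mirror the notebook.

```python
import math
import random

block_size = 4   # toy context length (assumption)
vocab_size = 5   # toy vocabulary size (assumption)

def toy_logits(context):
    """Hypothetical stand-in for the model: favours (last token + 1) % vocab_size."""
    assert len(context) <= block_size  # the real model cannot see more than block_size tokens
    nxt = (context[-1] + 1) % vocab_size
    return [3.0 if i == nxt else 0.0 for i in range(vocab_size)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def generate(index, max_new_tokens, rng):
    index = list(index)
    for _ in range(max_new_tokens):
        index_cond = index[-block_size:]         # crop to the last block_size tokens
        probs = softmax(toy_logits(index_cond))  # logits for the last position -> probabilities
        index_next = rng.choices(range(vocab_size), weights=probs, k=1)[0]  # sample one token
        index.append(index_next)                 # append it to the running sequence
    return index

out = generate([0, 1], max_new_tokens=10, rng=random.Random(0))
print(out)
```

The returned sequence keeps the full prompt plus every generated token; only the context fed to the model is cropped.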

- #### v) Define the Head class 
    - This class is a single head of the multi-head attention
    - we accept one parameter, `head_size`, for initialising
        - `self.key = nn.Linear(n_embd, head_size, bias=False)` : we create a linear layer to project the input embedding to a key
        - `self.query = nn.Linear(n_embd, head_size, bias=False)` : we create a linear layer to project the input embedding to a query
        - `self.value = nn.Linear(n_embd, head_size, bias=False)` : we create a linear layer to project the input embedding to a value
        - `self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))` : we create a lower triangular matrix for causal masking to prevent attention to future tokens (so a token can only see the present and the past, it cannot cheat by looking at the future); it is registered as a buffer so it moves to the device with the model but is not trained
        - `self.dropout = nn.Dropout(dropout)` : we create a dropout layer for regularization
            - the dropout layer randomly sets some of its inputs to 0 during training, based on our hyperparameter `dropout`
                - `dropout` is the probability of each element being zeroed
    
    - we define the forward pass function for the self-attention head, we accept the input `x` that is a tensor (the embedding):
        - `B, T, C = x.shape` : we unpack the input shape for usage later on
        - `k = self.key(x)` : we project the input embedding to the key, shape (B, T, head_size)
        - `q = self.query(x)` : we project the input embedding to the query, shape (B, T, head_size)
        - `wei = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)` : we compute the scaled attention scores by :
            - transposing k to shape (B, head_size, T) for batch matrix multiplication
            - doing a matrix multiplication of queries and keys to get the attention scores, shape (B, T, T)
            - scaling by `head_size ** -0.5` (i.e. dividing by the square root of the head size) so the scores do not grow with the head size before the softmax
        - `wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))` : we apply the causal mask so each position can only attend to itself and to earlier positions
        - `wei = F.softmax(wei, dim=-1)` : we do a softmax to get the attention weights (the masked `-inf` scores become 0)
        - `wei = self.dropout(wei)` : we apply the dropout layer to the attention weights for regularization
        - `v = self.value(x)` : we project the input embeddings to values
        - `out = wei @ v` : we do a matrix multiplication of our attention weights with the values, so each position's output is a weighted sum of the values of the positions it may attend to
            - `shape` : (B, T, head_size)
        - `return out` : we return the attention-weighted values
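
The masking and weighting steps are easier to see on a tiny example. The sketch below is plain Python for one sequence with no batch dimension and no dropout; the `q`, `k`, `v` values are made up, and `causal_attention` is an illustrative helper rather than the notebook's `Head` class.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(q, k, v):
    """q, k, v: lists of T vectors of length head_size. Returns (weights, out)."""
    T, head_size = len(q), len(q[0])
    scale = head_size ** -0.5
    weights = []
    for t in range(T):
        scores = []
        for s in range(T):
            if s > t:
                scores.append(float("-inf"))  # future position: masked out
            else:
                scores.append(scale * sum(a * b for a, b in zip(q[t], k[s])))
        weights.append(softmax(scores))
    # Each output row is a weighted sum of the value vectors.
    out = [[sum(weights[t][s] * v[s][d] for s in range(T)) for d in range(len(v[0]))]
           for t in range(T)]
    return weights, out

# Toy inputs (assumption): T = 3 positions, head_size = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

weights, out = causal_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

The first position can only attend to itself, so its weight row is `[1, 0, 0]` and its output is just the first value vector; the printed weight matrix is lower triangular with rows summing to 1.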

- #### vi) we define the multi-head attention class 
    - we accept `num_heads` and `head_size` as parameters for the initialisation
        - `self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])` : we create `num_heads` heads, each with a dimensionality of `head_size`, and add them to a module list
            - e.g. if `head_size` is 7 and `num_heads` is 8 :
                - we do `Head(7)`, add it to the module list and repeat that 7 more times (8 times in total)
        - `self.proj = nn.Linear(head_size * num_heads, n_embd)` : we create a linear layer to project the concatenated head outputs back to the embedding dimension
        - `self.dropout = nn.Dropout(dropout)` : we create a dropout layer for regularization

    - we define the forward pass for the multi-head attention class, we accept an input `x` which is a tensor (the embedding) : 
        - `out = torch.cat([h(x) for h in self.heads], dim=-1)` : we iterate over the heads in our list and concatenate all their outputs
            - `for h in self.heads` : we go over each head in our list and for each head :
                - `h(x)` : we pass the input tensor x of shape (B, T, C) through the forward pass of that individual head
            - `torch.cat(...)` : we concatenate the outputs of the individual heads along the last dimension into one tensor
                - that tensor has the shape (B, T, head_size * num_heads)
        - `out = self.dropout(self.proj(out))` : we project and apply dropout
            - `self.proj(out)` : we project the concatenated head outputs back to the embedding dimension
            - `self.dropout(...)` : we apply the dropout layer to regularize the results
        - `return out` : we return the projected and regularized outputs
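
The shape bookkeeping for the concatenation can be checked without PyTorch. The sketch below uses the notebook's `n_embd = 384` and `n_head = 8`; the fake head outputs (each head just emits its own index) and the short sequence length are assumptions for illustration.

```python
n_embd, n_head = 384, 8        # values from the hyperparameter cell
head_size = n_embd // n_head   # same formula the Block class uses
T = 3                          # toy sequence length (assumption)

# Fake per-head outputs of shape (T, head_size); head h emits the constant h.
head_outputs = [[[float(h)] * head_size for _ in range(T)] for h in range(n_head)]

# Concatenate along the last dimension, like torch.cat(..., dim=-1) for one sequence.
concat = [sum((head_outputs[h][t] for h in range(n_head)), []) for t in range(T)]

print(head_size, len(concat), len(concat[0]))
```

Because 384 divides evenly by 8, the concatenated width `head_size * num_heads` is exactly `n_embd`, so `self.proj` maps 384 features back to 384 features.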

- #### vii) we define the feed forward class 
    - we accept the input `n_embd` (embedding dimension) for the initialisation :
        - `self.net = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),)` : we define a sequential network made of two linear layers with a ReLU activation in between, followed by dropout for regularization 
            - `nn.Linear(n_embd, 4 * n_embd)` : Expand embedding dimension
            - `nn.ReLU()` : Apply ReLU activation
            - `nn.Linear(4 * n_embd, n_embd)` : Project back to original embedding dimension
            - `nn.Dropout(dropout)` : apply dropout for regularization 
    - we define the forward pass for the feed forward :
        - `return self.net(x)` : we just apply our sequential network to the input and return the output
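
A minimal plain-Python sketch of the same expand, ReLU, project pattern, assuming a toy `n_embd` of 2 and leaving dropout out (as in evaluation mode). The helpers `linear`, `relu` and `make_linear` are hypothetical; the weights are drawn with a standard deviation of 0.02 and zero biases to mirror `_init_weights`.

```python
import random

def linear(x, W, b):
    """y = W x + b for one vector x; W is a list of output rows."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_linear(n_in, n_out, rng):
    W = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

rng = random.Random(0)
n_embd = 2  # toy embedding dimension (assumption)
W1, b1 = make_linear(n_embd, 4 * n_embd, rng)  # expand to 4 * n_embd
W2, b2 = make_linear(4 * n_embd, n_embd, rng)  # project back to n_embd

def feed_forward(x):
    return linear(relu(linear(x, W1, b1)), W2, b2)

# One sequence of T = 3 token vectors; positions 0 and 2 hold the same vector.
sequence = [[1.0, -1.0], [0.5, 2.0], [1.0, -1.0]]
out = [feed_forward(tok) for tok in sequence]
print(out)
```

The network is applied to every position independently, so the two identical input vectors produce identical outputs; mixing information between positions is left to the attention layer.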

- #### viii) we define the transformer block class 
    - the transformer block represents one block of the transformer: communication (self-attention) followed by computation (feed forward)
    - we accept `n_embd` (embedding dimension) and `n_head` (number of heads) as parameters for initialisation 
        - `head_size = n_embd // n_head` : we find the head size based on embedding dimension and number of heads
        - `self.sa = MultiHeadAttention(n_head, head_size)` : we create a self attention module that is just a multi head attention module from the class we defined
        - `self.ffwd = FeedForward(n_embd)` : we create our feed forward network based on the class we defined above
        - `self.ln1 = nn.LayerNorm(n_embd)` : Layer normalization for after attention
        - `self.ln2 = nn.LayerNorm(n_embd)` : Layer normalization for after feedforward
    - we define the forward pass for the transformer block 
        - `y = self.sa(x)` : we apply multi-head self-attention
        - `x = self.ln1(x + y)` : we add residual connection and apply layer normalization after the self attention
        - `y = self.ffwd(x)` : we apply the feedforward network
        - `x = self.ln2(x + y)` : we add residual connection and apply layer normalization after the feed forward
        - `return x` : we return the tensor after performing the operations
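
The "add and normalise" step can be sketched for a single token vector in plain Python. This assumes layer normalisation with its scale left at 1 and its shift at 0 and an `eps` of 1e-5; `layer_norm` and `add_and_norm` are illustrative helpers, not the notebook's code.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalisation, like ln1(x + y)."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]    # block input for one token (toy values)
y = [0.5, -0.5, 0.5, -0.5]  # stand-in for the attention or feed-forward output

out = add_and_norm(x, y)
mean = sum(out) / len(out)
var = sum((v - mean) ** 2 for v in out) / len(out)
print([round(v, 4) for v in out], round(mean, 6), round(var, 4))
```

The residual sum keeps the original signal flowing through the block, and the normalisation brings each token vector back to a comparable scale before the next sublayer.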
        
- #### ix) initialise the model 
    - `model = GPTLanguageModel(vocab_size)` : we create a model using the `GPTLanguageModel` class we have defined
    - `m = model.to(device)` : we move the model to our device (cpu or cuda or mps)

In [6]:
class Head(nn.Module):
    """One head of self-attention."""
    
    def __init__(self, head_size):
        """
        Initializes the self-attention head.
        
        Args:
            head_size (int): The dimensionality of each attention head.
        """
        super().__init__()
        # Linear layers to project input embeddings to key, query, and value vectors
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        
        # Create a lower triangular matrix for causal masking to prevent attention to future tokens
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        
        # Dropout layer for regularization
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        """
        Performs the forward pass for the self-attention head.
        
        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C), where
                              B = Batch size,
                              T = Sequence length,
                              C = Embedding dimension.
        
        Returns:
            torch.Tensor: Output tensor after applying self-attention, shape (B, T, head_size).
        """
        B, T, C = x.shape  # Unpack the input shape
        
        # Project input embeddings to keys and queries
        k = self.key(x)    # Shape: (B, T, head_size)
        q = self.query(x)  # Shape: (B, T, head_size)
        
        # Compute attention scores by taking the dot product of queries and keys
        # Transpose k to shape (B, head_size, T) for batch matrix multiplication
        wei = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)  # Shape: (B, T, T)
        
        # Apply causal masking to ensure each position can only attend to previous positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # Shape: (B, T, T)
        
        # Apply softmax to obtain attention weights
        wei = F.softmax(wei, dim=-1)  # Shape: (B, T, T)
        
        # Apply dropout to the attention weights for regularization
        wei = self.dropout(wei)
        
        # Project input embeddings to values
        v = self.value(x)  # Shape: (B, T, head_size)
        
        # Perform weighted aggregation of the values based on attention weights
        out = wei @ v  # Shape: (B, T, head_size)
        
        return out

class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel."""
    
    def __init__(self, num_heads, head_size):
        """
        Initializes the multi-head self-attention module.
        
        Args:
            num_heads (int): Number of attention heads.
            head_size (int): Dimensionality of each attention head.
        """
        super().__init__()
        # Create a list of Head modules
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        
        # Linear layer to project concatenated head outputs back to embedding dimension
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        
        # Dropout layer for regularization
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        """
        Performs the forward pass for multi-head self-attention.
        
        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).
        
        Returns:
            torch.Tensor: Output tensor after multi-head attention, shape (B, T, n_embd).
        """
        # Concatenate outputs from all attention heads along the embedding dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # Shape: (B, T, head_size * num_heads)
        
        # Project the concatenated outputs back to the original embedding dimension
        out = self.dropout(self.proj(out))  # Shape: (B, T, n_embd)
        
        return out

class FeedForward(nn.Module):
    """Two linear layers with a ReLU non-linearity in between, followed by dropout."""
    
    def __init__(self, n_embd):
        """
        Initializes the feedforward network.
        
        Args:
            n_embd (int): Embedding dimension.
        """
        super().__init__()
        # Define a sequential network comprising linear layers and ReLU activation
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # Expand embedding dimension
            nn.ReLU(),                      # Apply ReLU activation
            nn.Linear(4 * n_embd, n_embd),  # Project back to original embedding dimension
            nn.Dropout(dropout),            # Apply dropout for regularization
        )
    
    def forward(self, x):
        """
        Performs the forward pass for the feedforward network.
        
        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).
        
        Returns:
            torch.Tensor: Output tensor after feedforward processing, shape (B, T, C).
        """
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication followed by computation."""
    
    def __init__(self, n_embd, n_head):
        """
        Initializes the Transformer block.
        
        Args:
            n_embd (int): Embedding dimension.
            n_head (int): Number of attention heads.
        """
        super().__init__()
        head_size = n_embd // n_head  # Determine head size based on embedding dimension and number of heads
        self.sa = MultiHeadAttention(n_head, head_size)  # Multi-head self-attention module
        self.ffwd = FeedForward(n_embd)                  # Feedforward network
        self.ln1 = nn.LayerNorm(n_embd)                  # Layer normalization after attention
        self.ln2 = nn.LayerNorm(n_embd)                  # Layer normalization after feedforward
    
    def forward(self, x):
        """
        Performs the forward pass for the Transformer block.
        
        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).
        
        Returns:
            torch.Tensor: Output tensor after processing, shape (B, T, C).
        """
        # Apply multi-head self-attention
        y = self.sa(x)  # Shape: (B, T, C)
        
        # Add residual connection and apply layer normalization
        x = self.ln1(x + y)  # Shape: (B, T, C)
        
        # Apply feedforward network
        y = self.ffwd(x)  # Shape: (B, T, C)
        
        # Add residual connection and apply layer normalization
        x = self.ln2(x + y)  # Shape: (B, T, C)
        
        return x

class GPTLanguageModel(nn.Module):
    """GPT Language Model implementing Transformer architecture."""
    
    def __init__(self, vocab_size):
        """
        Initializes the GPT language model.
        
        Args:
            vocab_size (int): Size of the vocabulary (number of unique tokens).
        """
        super().__init__()
        # Token embedding table maps each token index to an embedding vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        
        # Position embedding table provides positional information for each token in the sequence
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        
        # Stack multiple Transformer blocks to build the model's depth
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        
        # Final layer normalization for stabilizing the output
        self.ln_f = nn.LayerNorm(n_embd)
        
        # Language modeling head projects the final embeddings to vocabulary logits
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
        # Initialize weights using the defined method
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """
        Initializes weights of the model's layers.
        
        Args:
            module (nn.Module): A module within the model.
        """
        if isinstance(module, nn.Linear):
            # Initialize linear layers with normal distribution (mean=0, std=0.02)
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                # Initialize biases to zero
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            # Initialize embedding layers with normal distribution (mean=0, std=0.02)
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, index, targets=None):
        """
        Performs the forward pass for the GPT language model.
        
        Args:
            index (torch.Tensor): Input tensor of token indices, shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices, shape (B, T).
        
        Returns:
            Tuple[torch.Tensor, Optional[torch.Tensor]]:
                - logits: Predicted logits for each token, shape (B, T, vocab_size).
                - loss: Cross-entropy loss (if targets are provided), else None.
        """
        B, T = index.shape  # Unpack batch size and sequence length
        
        # Retrieve token embeddings for the input indices
        tok_emb = self.token_embedding_table(index)  # Shape: (B, T, C)
        
        # Create a range of positions and retrieve their embeddings
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # Shape: (T, C)
        
        # Add token and position embeddings to incorporate positional information
        x = tok_emb + pos_emb  # Shape: (B, T, C)
        
        # Pass the embeddings through the stack of Transformer blocks
        x = self.blocks(x)  # Shape: (B, T, C)
        
        # Apply final layer normalization
        x = self.ln_f(x)  # Shape: (B, T, C)
        
        # Project the normalized embeddings to vocabulary logits
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)
        
        # If targets are provided, compute the cross-entropy loss
        if targets is None:
            loss = None
        else:
            # Reshape logits and targets for loss computation
            logits = logits.view(B * T, -1)    # Shape: (B*T, vocab_size)
            targets = targets.view(B * T)      # Shape: (B*T)
            loss = F.cross_entropy(logits, targets)  # Scalar loss value
        
        return logits, loss
    
    def generate(self, index, max_new_tokens):
        """
        Generates new tokens based on the input context.
        
        Args:
            index (torch.Tensor): Input tensor of token indices, shape (B, T).
            max_new_tokens (int): Number of new tokens to generate.
        
        Returns:
            torch.Tensor: Generated token indices, shape (B, T + max_new_tokens).
        """
        for _ in range(max_new_tokens):
            # Crop the input indices to the last block_size tokens to adhere to the model's maximum context length
            index_cond = index[:, -block_size:]
            
            # Perform a forward pass to get the logits for the current context
            logits, _ = self.forward(index_cond)
            
            # Focus on the logits of the last time step (the most recent token)
            logits = logits[:, -1, :]  # Shape: (B, vocab_size)
            
            # Apply softmax to convert logits to probabilities
            probs = F.softmax(logits, dim=-1)  # Shape: (B, vocab_size)
            
            # Sample the next token index from the probability distribution
            index_next = torch.multinomial(probs, num_samples=1)  # Shape: (B, 1)
            
            # Append the sampled token to the sequence
            index = torch.cat((index, index_next), dim=1)  # Shape: (B, T + 1)
        
        return index

# Instantiate the GPT language model with the specified vocabulary size
model = GPTLanguageModel(vocab_size)

# Move the model to the specified device (CPU or GPU)
m = model.to(device)

## 7) create an AdamW optimiser and define the training loop

- #### i) we create the optimizer we are gonna use 
    - we are using the AdamW optimizer, which is the Adam optimizer with decoupled weight decay
    - `optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)` : we use `AdamW` from `torch.optim` and create the optimizer with our model parameters and our `learning_rate` hyperparameter

- #### ii) define the training loop
    - `for iter in range(max_iterations)` : we iterate `max_iterations` times; `max_iterations` is our hyperparameter that says how many iterations the training loop runs for, and in each iteration : 
        - `if iter % evaluation_iterations == 0` : we check if we have hit our evaluation interval (every 100 iterations here, as set by the `evaluation_iterations` hyperparameter) 
            - `losses = estimate_loss()` : we calculate our losses using the `estimate_loss` function we have defined above
            - `print(f"Iteration {iter}, training loss {losses['train']:.2f}, validation loss {losses['val']:.2f}")` : we format and print the loss values for visualisation
        - `inputs, targets = get_batch('train')` : we get our inputs and targets using the `get_batch` function that we have defined
        - `logits, loss = model.forward(inputs, targets)` : we use our model's forward pass with targets since it is for training
        - `optimizer.zero_grad(set_to_none=True)` : we reset the gradients left over from the previous iteration, setting them to None instead of 0s
        - `loss.backward()` : we backpropagate to compute the gradient of the loss with respect to every model parameter
        - `optimizer.step()` : we update the model parameters using the gradients we just calculated
    - `print(loss.item())` : we print the loss of the final training batch
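
To show what "Adam with decoupled weight decay" means, here is a plain-Python sketch of one AdamW update for a single scalar parameter, driven by a loop shaped like the one above. The toy loss `(w - 3)^2`, the learning rate of 0.1, and the default values for `betas`, `eps` and `weight_decay` are all assumptions for illustration; `adamw_step` is a hypothetical helper, not the `torch.optim.AdamW` implementation.

```python
import math

def adamw_step(p, grad, state, lr, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter p."""
    b1, b2 = betas
    state["t"] += 1
    p = p * (1.0 - lr * weight_decay)                        # decoupled weight decay
    state["m"] = b1 * state["m"] + (1.0 - b1) * grad         # running mean of gradients
    state["v"] = b2 * state["v"] + (1.0 - b2) * grad * grad  # running mean of squared gradients
    m_hat = state["m"] / (1.0 - b1 ** state["t"])            # bias correction
    v_hat = state["v"] / (1.0 - b2 ** state["t"])
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)

w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
losses = []
for step in range(300):
    loss = (w - 3.0) ** 2                   # forward pass on the toy objective
    grad = 2.0 * (w - 3.0)                  # what loss.backward() would compute for w
    w = adamw_step(w, grad, state, lr=0.1)  # the role of optimizer.step()
    losses.append(loss)

print(round(losses[0], 4), round(losses[-1], 4), round(w, 3))
```

The weight decay shrinks the parameter directly instead of being added to the gradient, which is the difference from plain Adam with L2 regularisation.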

In [7]:

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iterations):
    if iter % evaluation_iterations == 0:
        losses = estimate_loss()
        print(f"Iteration {iter}, training loss {losses['train']:.2f}, validation loss {losses['val']:.2f}")
    inputs, targets = get_batch('train')
    logits, loss = model.forward(inputs, targets)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

Iteration 0, training loss 4.38, validation loss 4.39
Iteration 100, training loss 2.34, validation loss 2.40
Iteration 200, training loss 1.97, validation loss 2.08
Iteration 300, training loss 1.77, validation loss 1.89
Iteration 400, training loss 1.63, validation loss 1.79
Iteration 500, training loss 1.54, validation loss 1.73
Iteration 600, training loss 1.47, validation loss 1.69
Iteration 700, training loss 1.41, validation loss 1.66
Iteration 800, training loss 1.35, validation loss 1.64
Iteration 900, training loss 1.30, validation loss 1.65
Iteration 1000, training loss 1.25, validation loss 1.65
Iteration 1100, training loss 1.20, validation loss 1.65
Iteration 1200, training loss 1.14, validation loss 1.64
Iteration 1300, training loss 1.09, validation loss 1.66
Iteration 1400, training loss 1.05, validation loss 1.67
Iteration 1500, training loss 1.00, validation loss 1.69
Iteration 1600, training loss 0.95, validation loss 1.72
Iteration 1700, training loss 0.90, validat

## just some testing 

In [10]:
test_prompt = 'hi i am yung ting'
context = torch.tensor(encoder(test_prompt), dtype=torch.long, device=device)
# context = torch.zeros((1,1), dtype=torch.long, device=device)
generated_tokens = m.generate(context.unsqueeze(0), max_new_tokens=100)
generated_chars = decoder(generated_tokens[0].tolist())
print(generated_chars)

hi i am yung tings bers in cold make him list,
Enor kinsman is old, what passes’d it shall.
In tell you go to thee ni
