In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import mmap
import random
import pickle
import argparse

parser = argparse.ArgumentParser(description='This is a demonstration program')

# Here we add an argument to the parser, specifying the expected type, a help message, etc.
# parser.add_argument('-batch_size', type=int, required=True, help='Please provide a batch_size')

# args = parser.parse_args()

# Now we can use the argument value in our program.
# print(f'batch size: {args.batch_size}')

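# Prefer Apple's Metal backend, then CUDA, and fall back to the CPU.
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"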

# batch_size = args.batch_size # to use the batch_size cmd arg -> python file_name.py -batch_size 32
batch_size = 32
block_size = 128
max_iters  = 200
learning_rate =  2e-5
eval_iters = 100
n_embd = 384
n_head = 4
n_layer = 4 # number of stacked Transformer (decoder) blocks
dropout = 0.2

print(device)

mps


In [5]:
chars = ""
with open("vocab.txt", 'r', encoding='utf-8') as f:
# with open("wizard_of_oz.txt", 'r', encoding='utf-8') as f:
        text = f.read()
        chars = sorted(list(set(text)))
        
vocab_size = len(chars)
# print(vocab_size)
# 5520

'\x00\n\n\n \n!\n"\n#\n$\n%\n&\n\'\n(\n)\n*\n+\n,\n-\n.\n/\n0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n:\n;\n<\n=\n>\n?\n@\nA\nB\nC\nD\nE\nF\nG\nH\nI\nJ\nK\nL\nM\nN\nO\n'

In [3]:
string_to_int = { ch:i for i,ch in enumerate(chars) }
int_to_string = { i:ch for i,ch in enumerate(chars) }

encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_string[i] for i in l])
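
The character-level tokenizer above can be checked with a round trip. This is a minimal sketch that assumes a toy vocabulary built from a short string rather than from `vocab.txt`; the names mirror the cell above, but the sample text is made up for illustration.

```python
# Toy vocabulary; the notebook builds `chars` from vocab.txt instead.
chars = sorted(set("hello world"))
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Map each character to its integer id."""
    return [string_to_int[c] for c in s]


def decode(ids):
    """Map integer ids back to a string."""
    return "".join(int_to_string[i] for i in ids)


ids = encode("hello")
round_trip = decode(ids)
print(ids, round_trip)
```

A character that is not in the vocabulary raises `KeyError` in `encode`, so the vocabulary file has to cover every character in the training and validation splits.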

In [4]:
# memory map for using small snippets of text from a single file of any size

"""
## 🔍 What This Function Does

The function `get_random_chunk(split)`:

- Loads data from a pre-created file (`train_split.txt` or `val_split.txt`).
- Uses `mmap` to **efficiently read a random chunk** of the file without loading the whole thing into memory.
- Randomly picks a position, reads a block of data, decodes it, encodes it, and returns it as a PyTorch tensor.

---

## 🚀 Benefits Over Traditional Train/Test Splitting

### 1. **Memory Efficiency (via `mmap`)**
- Traditional splitting loads the **entire dataset into memory** to split it.
- This function **streams only a small chunk** using memory-mapped I/O, which is perfect for **large datasets** (think gigabytes or more).
- Ideal for training language models or handling token streams that don’t fit in RAM.

### 2. **Dynamic Sampling**
- Each call gives you a **random chunk**, meaning:
  - No need to pre-shuffle your dataset.
  - Each training epoch sees **different random samples**.
- The extra variety between iterations may help reduce overfitting, although it is resampling of existing text rather than data augmentation in the strict sense.

### 3. **Pre-Split Files**
- The function **assumes data is already split** into train and validation files.
- That’s cleaner and avoids bugs where train/test sets leak into each other, especially in streaming contexts.

### 4. **Scalability**
- Works well in **distributed settings** or online learning.
- Can be used as part of a data loader that feeds small random chunks to your model continuously.

---

## 🤔 When Traditional Splitting Might Be Better

- **Small datasets** that easily fit in memory.
- When you want a **fixed, repeatable train/test split** for reproducibility.
- When you're doing classic ML (not deep learning on token sequences or large corpora).

---

## 🧠 Summary

| Feature | `get_random_chunk()` | Traditional Splitting (`train_test_split`, slicing) |
|--------|------------------------|-------------------------------|
| Memory efficient | ✅ Yes (uses mmap) | ❌ No (loads whole dataset) |
| Random sampling | ✅ Every call | ❌ Fixed once |
| Good for large corpora | ✅ | ❌ |
| Reproducible split | ❌ (unless seeded carefully) | ✅ |
| Simplicity | ❌ More complex | ✅ Easier to understand |

---

If you're working with text or token data for training models like transformers, this chunked/random approach is often a **smart, scalable solution**.

"""

def get_random_chunk(split):
    filename = "train_split.txt" if split == 'train' else "val_split.txt"
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Determine the file size and a random position to start reading
            file_size = len(mm)
            start_pos = random.randint(0, (file_size) - block_size*batch_size)

            # Seek to the random position and read the block of text
            mm.seek(start_pos)
            block = mm.read(block_size*batch_size-1)

            # Decode the block to a string, ignoring any invalid byte sequences
            decoded_block = block.decode('utf-8', errors='ignore').replace('\r', '')
            
            # Encode the characters as integer token ids
            data = torch.tensor(encode(decoded_block), dtype=torch.long)
            
    return data


def get_batch(split):
    data = get_random_chunk(split)
    ix   = torch.randint(len(data) - block_size, (batch_size,))
    x    = torch.stack([data[i:i+block_size] for i in ix])
    y    = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
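
The two functions above can be tried without the real split files. The sketch below is an illustration under stated assumptions: it writes a small temporary file as a stand-in for `train_split.txt`, uses a hypothetical helper `read_random_chunk`, and keeps the data as plain strings instead of tensors so that it runs without PyTorch. It shows the two ideas that matter: only a small window of the file is read through `mmap`, and each target sequence is the input sequence shifted by one position.

```python
import mmap
import os
import random
import tempfile

block_size = 8   # tokens per training example (toy value)
batch_size = 4   # examples per batch (toy value)

# Hypothetical stand-in for train_split.txt.
text = "the quick brown fox jumps over the lazy dog. " * 20
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name


def read_random_chunk(path, n_bytes):
    """Read n_bytes from a random offset without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = random.randint(0, len(mm) - n_bytes)
            raw = mm[start:start + n_bytes]
    return raw.decode("utf-8", errors="ignore").replace("\r", "")


chunk = read_random_chunk(path, block_size * batch_size)
os.remove(path)

# Targets are the inputs shifted one position to the right.
starts = [random.randint(0, len(chunk) - block_size - 1) for _ in range(batch_size)]
x = [chunk[i:i + block_size] for i in starts]
y = [chunk[i + 1:i + block_size + 1] for i in starts]

for a, b in zip(x, y):
    print(repr(a), "->", repr(b))
```

With multi-byte UTF-8 text, `errors="ignore"` can drop a partial character at either end of the window, so the decoded chunk may be slightly shorter than the number of bytes requested.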

In [5]:
""" This class represents a single attention head in a Transformer-like architecture, 
    which is a critical component of models like GPT or BERT, 
    using scaled dot-product attention. 
    Let's break it down in the context of a Large Language Model (LLM) architecture:
    
The term "scaled dot-product attention" is a crucial concept in attention mechanisms, especially in models like Transformers. 
The reason we use scaled dot-product attention specifically comes down to the need for numerical stability and efficient learning in the context of neural networks.

Single attention head:
- In Transformer models, attention weighs the importance of the different tokens (words or subwords)
  in a sequence. A single attention head is one individual instance of that mechanism inside a
  multi-head attention system, and it learns one particular way of weighting the tokens.

- A single attention head performs the following steps:

    1. Projects the input tokens into query, key, and value representations.
    2. Computes attention scores by comparing each query against the keys in the sequence (here, with scaled dot products).
    3. Applies the attention weights (the softmax of the scores) to the values, so the output is a weighted sum of the values.

- Why 'single'?
    - A Transformer runs multiple heads (often 8, 12, or more) in parallel.
      Each head can learn to attend to different relationships or aspects of the input sequence.
      For example, one head might focus on syntax (word order), another on semantics (meaning of words), and another on long-range dependencies between words.
    - A single attention head is just one of these parallel mechanisms: one "perspective" on the data,
      where the model attends to the input from a specific viewpoint.
    - After all heads have computed their outputs,
      the results are concatenated and passed through a final linear transformation that combines
      them into a single output representation.
      Using multiple heads lets the model capture a richer set of relationships than one head alone.
"""


class Head(nn.Module):
    """ 
        This `Head` class represents a single attention head in a self-attention mechanism,
        which is a core component of Transformer models (including architectures like GPT).
        It processes the input and outputs a weighted aggregation of the values,
        with weights based on the similarity between the queries and keys.
        It is one part of a multi-head attention setup, where multiple such heads work in parallel to capture different aspects of the input relationships.
    """

    def __init__(self, head_size):
        super().__init__()
        """
            - The attention mechanism starts by creating learnable linear projections (through nn.Linear) of the input embeddings into 
            key, query, and value vectors.
            - Each of these vectors has a size of head_size (typically smaller than the embedding dimension n_embd in multi-head attention), 
            which allows each attention head to focus on different aspects of the input.
            - These vectors are used to calculate attention weights and aggregate information based on those weights.
            - In the context of self-attention, 
            the model learns how much attention to pay to each token in the sequence relative to all other tokens. 
            This is done by computing three vectors for each token:
            
                - key: used to compute the attention weights; represents what a token offers to the tokens that attend to it
                - query: compared against the keys to compute the attention scores (the similarity
                between query and key); represents the token that is asking the question (i.e., which other tokens it should attend to)
                - value: holds the information that is aggregated using the attention weights;
                represents the actual content that is passed along when the token is attended to

            The output of each of these transformations has dimension `head_size`, which is
            typically smaller than `n_embd` in a multi-head attention setup. This lets the model learn
            different attention patterns with smaller, more specialized heads.

            - register_buffer('tril', ...)
            This line registers a lower triangular matrix (`tril`) used to mask out future tokens. This matters in autoregressive models like GPT, where
            information from future tokens must not leak into the current position. In `forward`, the attention scores for positions
            after the current one are set to `-inf`, so each token attends only to itself and to earlier tokens.
            Because it is a buffer, the mask is not a trainable parameter, but it moves with the module across devices and is saved in the state dict.

            - dropout
            :A dropout layer is used for regularization. During training, some of the attention weights(`wei`) 
            are randomly dropped to prevent overfitting, promoting better generalization.
            
        """     
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) # Buffer, not a parameter: the causal mask is built once here and reused on every forward pass

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels) 
        """
        ## 1. input tensor(x)
        ## B for batch size(the number of sequences in a batch), 
        ## T for sequence length (number of tokens in the input), 
        ## C for embedding size (usually n_embd): the dimension of each token's embedding

        ## 2. linear projections
        ## the input x is passed through three linear layers to get the key (k), query (q), and value (v),
        ## each having a shape of (B, T, head_size)

        ## 3. attention score calculation:
        ## Dot Product Attention: The query vector is compared against the key vector 
        to calculate the attention score, which is done by 
        computing the dot product of the query and key matrices (using q @ k.transpose(-2, -1)).
        ## The result of this operation gives a score for how much attention should be paid to each token in the sequence.
        ## Scaling: The result is multiplied by the inverse square root of the head size
        (k.shape[-1]**-0.5) so the scores do not grow with the head size; large scores would saturate the softmax and could destabilize training.
        ## Masking: The triangular mask (self.tril) is applied to the attention scores, setting future token scores to -inf to prevent attention to future tokens during training (important for autoregressive behavior).
            - Masking is used to control which tokens in a sequence a given token can attend to. 
            The core goal is to ensure that the model does not have access to future information 
            when predicting or generating tokens during training.
            - Autoregressive Behaviour: 
            In autoregressive models, like GPT, the model generates tokens one by one, 
            and each generated token is conditioned on the previous tokens. 
            During training, this behavior is mimicked by ensuring that a token can only "see" or 
            attend to tokens that come before it in the sequence (including the token itself), 
            but not to tokens that come after it. This is crucial because, 
            in the actual generation process, the model will generate tokens sequentially, 
            so it can't "peek ahead" to future tokens.
            - When the mask is applied, the `-inf` values in the attention score matrix ensure that 
            the attention probability for future tokens becomes zero after applying the softmax operation.
            Specifically, the softmax function transforms all the attention scores (after adding the mask) into a probability distribution. 
            The `-inf` values turn into zeros, meaning that no attention is paid to those tokens.

        ## Softmax: The attention scores are passed through a softmax function to normalize them into probabilities (between 0 and 1) across the sequence length, which indicates how much attention each token should receive from others.
        ## Dropout: Dropout is applied to the attention weights to regularize the learning.
        """
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs) Head Size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        """
        ## Aggregation:
        The final step is to perform the weighted aggregation of the values, 
        using the attention weights to weight the value vectors (v), 
        resulting in the final output tensor (out), which represents the attended values based on the computed attention scores.
        """
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
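
The steps in `Head.forward` (scaled dot products, causal mask, softmax, weighted sum of values) can be traced by hand on a tiny example. The sketch below is written in plain Python so it runs without PyTorch; the function name `causal_attention` and the toy vectors are made up for illustration. It computes the softmax only over the visible positions, which gives the same weights as filling the masked scores with `-inf` first. Dropout and the learned projections are left out.

```python
import math


def causal_attention(q, k, v):
    """q, k, v: lists of T vectors. Returns (outputs, attention weights)."""
    T, head_size = len(q), len(q[0])
    out, weights = [], []
    for t in range(T):
        # Scaled dot products against keys 0..t only; later keys are masked out.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(head_size)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps] + [0.0] * (T - t - 1)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(w[s] * v[s][d] for s in range(T)) for d in range(len(v[0]))])
    return out, weights


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(q, k, v)

for row in weights:
    print([round(w, 3) for w in row])
```

The printed weights form a lower triangular matrix: the first token can only attend to itself, and each later row exposes one more position.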

# Causal mask as a staircase: each row (time step) exposes one more token.
# Illustrative attention weights after softmax (each row sums to 1):
# [1.0, 0.0, 0.0]
# [0.6, 0.4, 0.0]
# [0.3, 0.3, 0.4]

"""
Purpose of the Attention Head in LLMs:

In the context of a Transformer-based architecture, 
multi-head attention allows the model to simultaneously focus on multiple different parts of the input sequence. 
Each head processes the input independently with a different set of learnable projections 
(keys, queries, and values) and captures different types of relationships (e.g., syntactic, semantic) between tokens. The outputs of all heads are then concatenated and linearly transformed to form the final representation.

"""


In [6]:
"""
Transformer-based model:
This MultiHeadAttention class implements the multi-head self-attention mechanism in parallel. 
It consists of multiple individual attention heads that each compute their own attention outputs from the input sequence. 
The outputs of all attention heads are then concatenated and projected into a final embedding space. 
Dropout is applied to the final output to help prevent overfitting.
This mechanism allows the model to attend to different parts of the sequence simultaneously, 
capturing multiple relationships in the input data and improving the model's ability to understand complex dependencies in language tasks.

"""

class MultiHeadAttention(nn.Module):

    """
    num_heads: The number of attention heads in the multi-head attention mechanism.
    head_size: The size of each attention head, 
    i.e., the dimension of the key, query, and value vectors for each head.
    
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        
        # This line creates a list of individual attention heads.  Each `head` object is an instance of a single attention head
        # `ModuleList` ensures that each head is registered as a submodule of the `MultiHeadAttention` module 
        # `Head` class is assumed to handle the specific logic for calculating attention for each individual head
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # After the individual attention heads process the input, their results are concatenated.
        # The `proj` layer is a linear transformation that takes the concatenated output and projects it into the final embedding dimension, `n_embd`
        # This ensures that the final output of multi-head attention has the same dimension as the input, allowing it to be passed to the next layer of the model
        self.proj  = nn.Linear(head_size * num_heads, n_embd)
        # `Dropout` is applied to the output of the multi-head attention mechanism.  This helps regularize the model and prevent overfitting during training by
        # randomly setting some of the output values to zero with a certain probability, defined by the `dropout` rate
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # `forward` method defines how the multi-head attention mechanism processes the input `x`

        """
        input tensor `x` has the shape of (B,T,F)
        B is the batch size (number of sequences in the batch);
        T is the sequence length (number of tokens in each sequence);
        F is the embedding dimension (n_embd).
        """

        # iterates over all attention heads (self.heads) and applies each one to the input x
        # each head processes the input and produces an output tensor of shape (B, T, head_size)
        # after processing, the outputs are concatenated along the last dimension(dim=-1).  if there are `num_heads` heads, 
        # the resulting tensor has shape (B,T, num_heads * head_size)
        out = torch.cat([h(x) for h in self.heads], dim=-1) # Last dimension, # (B, T, F) -> (B, T, [h1, h1, h1, h1, h2, h2, h2, h2, h3, h3, h3, h3])
        # the concatenated output (`out`) is passed through a linear projection (`self.proj`) to transform it into the final embedding dimension n_embd.
        # This projection ensures that the output of multi-head attention has the same shape as the input (B, T, n_embd)
        # self.proj maps num_heads * head_size input features to n_embd output features (PyTorch stores its weight as (n_embd, num_heads * head_size)), so the result has shape (B, T, n_embd)
        out = self.dropout(self.proj(out))
        # The method returns the final output tensor after the projection and dropout. 
        # This tensor has shape (B, T, n_embd) and represents the attended version of the input sequence, 
        # which will be used in the subsequent layers of the model (such as feedforward layers, etc.).
        return out
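
The shape bookkeeping in `MultiHeadAttention` can be illustrated without tensors. The sketch below assumes the hyperparameters from the first cell (`n_embd = 384`, `n_head = 4`) and uses made-up per-head outputs, where head `h` simply emits the constant `h`, to show that concatenating along the feature axis restores the embedding width.

```python
n_embd, n_head = 384, 4
head_size = n_embd // n_head  # 96 features per head

T = 3  # sequence length (toy value)

# Hypothetical per-head outputs: each head returns T vectors of length head_size.
head_outputs = [[[float(h)] * head_size for _ in range(T)] for h in range(n_head)]

# Concatenate along the feature axis, like torch.cat(..., dim=-1).
concat = [sum((head_outputs[h][t] for h in range(n_head)), []) for t in range(T)]

print(len(concat), len(concat[0]))  # T rows, n_embd features per row
```

Because `head_size` comes from integer division, `n_embd` should be divisible by `n_head`; otherwise the concatenated width would be smaller than `n_embd`.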

In [7]:
""" 
a simple fully connected(dense) neural network layer that consists of two linear transformations, 
a non-linearity(ReLU activation) and dropout for regularization: 
this type of feedforward network is used after the multi-head attention mechanism in the Transformer architecture

Purpose: This class implements a position-wise feedforward network, 
which is applied to each position (token) in the sequence independently. 
After the self-attention mechanism, this feedforward network allows the model to 
further process and transform the representations of each token.

"""

class FeedFoward(nn.Module):
    
    """
        nn.Sequential is a container module in PyTorch that allows stacking layers in a sequential order. 
        The layers inside this container will be applied one after another when you pass input through the network.
    """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), # The reason for increasing the size of the representation (by a factor of 4) is to allow the network to have more capacity to learn complex relationships. This is a common design choice in Transformer models, often referred to as the intermediate layer size.
            nn.ReLU(), # ReLU introduces non-linearity into the model by setting all negative values in the output to zero while keeping positive values unchanged. This helps the model learn more complex patterns in the data.
            nn.Linear(4 * n_embd, n_embd), # The output of this layer will have the same dimension as the input, which is necessary to ensure that the output can be added back to the residual connection in the Transformer architecture.
            nn.Dropout(dropout),
        )

    """
    Input x: The input x is expected to be a tensor with shape (B, T, n_embd), where:
    
        B is the batch size (number of sequences),
        T is the sequence length (number of tokens in each sequence),
        n_embd is the embedding dimension.
    
    Processing:
    
        The input x is passed through the feedforward network (self.net), which applies the sequential operations defined in the __init__ method.
        This involves passing the input through the first linear layer, applying the ReLU activation, passing it through the second linear layer, and applying dropout to the final output.
    
    Output:
    
        The output is a tensor with the same shape as the input (B, T, n_embd) 
        because the second linear layer transforms the input back into the original embedding dimension n_embd.
    
    """
    def forward(self, x):
        return self.net(x)
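
"Position-wise" means the same two-layer network is applied to every token vector independently. The sketch below is a plain-Python illustration with made-up constant weights and toy sizes (`n_embd = 2`, hidden width `4 * n_embd`); dropout is omitted, as it would be in evaluation mode. The helper names are hypothetical.

```python
def linear(x, w, b):
    """x: vector of length n_in; w: n_out rows of n_in weights; b: n_out biases."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, xi) for xi in x]


def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden width, apply ReLU, project back to n_embd."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Toy sizes and made-up constant weights.
n_embd = 2
w1 = [[0.5 * (i - 3), 0.25 * (i - 4)] for i in range(4 * n_embd)]
b1 = [0.0] * (4 * n_embd)
w2 = [[0.1 * (j + 1)] * (4 * n_embd) for j in range(n_embd)]
b2 = [0.0] * n_embd

# Three token vectors; the first and third are identical.
tokens = [[1.0, 2.0], [-1.0, 0.5], [1.0, 2.0]]
outputs = [feed_forward(t, w1, b1, w2, b2) for t in tokens]
print(outputs)
```

Identical token vectors produce identical outputs regardless of their position, which is why mixing information between tokens is left entirely to the attention layer.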
        

In [8]:
"""
The Block class in this code represents a Transformer block, 
which is the fundamental building block in Transformer architectures (like GPT, BERT, etc.). 
The block performs two key operations: communication (via self-attention) and computation (via a feedforward network). 
It also includes residual connections and layer normalization to stabilize the learning process and help with gradient flow.

Transformer block: communication followed by computation

Purpose: This class implements a single Transformer block, which consists of two main parts:

    Communication: This is done through multi-head self-attention (represented by the MultiHeadAttention class).
    Computation: This is done through a feedforward network (represented by the `FeedFoward` class defined above).

  The block also incorporates residual connections and layer normalization to improve training efficiency.
  
"""

class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        # This initializes the multi-head attention layer (self.sa), 
        # which performs the self-attention mechanism on the input. 
        # The attention mechanism splits the input into n_head heads, each of size head_size, 
        # and computes attention over the sequence.
        self.sa = MultiHeadAttention(n_head, head_size)
        # This initializes the feedforward network (self.ffwd), which processes the output of the attention layer. 
        # It consists of two linear layers with a ReLU activation in between, 
        # and it helps the model learn complex representations.
        self.ffwd = FeedFoward(n_embd)
        # These are Layer Normalization layers applied at different stages of the block. 
        ## Layer normalization normalizes each token's vector across its n_embd features and helps stabilize the learning process.
        ## This block is post-norm: ln1 is applied after the attention output has been added back to the residual,
        ## and ln2 is applied after the feedforward output has been added back.
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # The input x is passed through the multi-head attention layer self.sa. 
        # This produces an output y, which represents the attended (or context-aware) version of the input sequence. 
        # The attention mechanism allows each token to "attend" to other tokens in the sequence, capturing dependencies between them.
        y = self.sa(x)
        x = self.ln1(x + y)
        y = self.ffwd(x)
        x = self.ln2(x + y)
        return x
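
The forward pass above adds each sublayer's output back to its input and then normalizes the sum. The sketch below reproduces that add-then-normalize step for a single token vector in plain Python. It assumes the default behaviour of `nn.LayerNorm` at initialization (biased variance, `eps = 1e-5`, scale 1 and shift 0); the `sublayer` function is a hypothetical stand-in for attention or the feedforward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector across its features."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def sublayer(x):
    """Hypothetical stand-in for self-attention or the feedforward network."""
    return [0.1 * xi for xi in x]


x = [1.0, 2.0, 3.0, 4.0]
y = sublayer(x)
residual = [a + b for a, b in zip(x, y)]   # x + y, the residual connection
normed = layer_norm(residual)              # ln(x + y), as in Block.forward

mean = sum(normed) / len(normed)
var = sum((v - mean) ** 2 for v in normed) / len(normed)
print(normed, mean, var)
```

After normalization the features of the token have a mean close to 0 and a variance close to 1, whatever the scale of the sublayer output was.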

"""
Why These Operations?

    Residual Connections: Adding the original input (x) back to the output of each major block (self-attention and feedforward) helps preserve information and mitigates the vanishing gradient problem during backpropagation.
    : Mitigates the Vanishing Gradient Problem:

        The vanishing gradient problem occurs when the gradients (which are used to update the model's parameters during training) become very small as they are propagated backward through many layers. This makes it extremely hard to train deep networks because the updates to the parameters become tiny and ineffective, preventing the model from learning.
    
        Residual connections help mitigate this problem by providing a direct shortcut for the gradient to flow back through. The input x is added to the output, so during backpropagation, the gradients can either flow directly through the residual connection or through the transformations of the layer, making it much less likely that they will vanish.
            Essentially, even if the transformations inside the layers aren't learning useful representations (i.e., their gradients are small), the residual connections provide an alternative pathway for the gradient to propagate backward effectively.


    Layer Normalization: Layer normalization ensures that the activations are on a consistent scale across layers, which speeds up convergence and improves training stability.

    Self-Attention + Feedforward Network: The combination of these two operations enables the model to first capture long-range dependencies between tokens (via self-attention) and then refine the token representations (via the feedforward network).
    
"""

GPTLanguageModel:
The `GPTLanguageModel` class in the context of a large language model (LLM) architecture, such as GPT (Generative Pretrained Transformer), serves as a **transformer-based language model**. This model is designed to predict the next token in a sequence, generate coherent text, and learn to understand and produce human-like language. Here’s an explanation of its purpose and how it fits within the architecture of LLMs:

### 1. **Core Functionality: Language Modeling**
At its core, the `GPTLanguageModel` class is a **causal (autoregressive) language model**. This means:
- It predicts the next word (or token) in a sequence based on the previous tokens. For example, given the input "The cat sat on the", the model might predict "mat" as the next token.
- The model does not see the future tokens during training or generation. This is why the transformer architecture (using causal self-attention) only attends to previous tokens, not future ones.

### 2. **Embedding Layers**
```python
self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
self.position_embedding_table = nn.Embedding(block_size, n_embd)
```
- **Token embeddings**: Each token (a word or part of a word) in the input sequence is mapped to a dense vector of size `n_embd`. This allows the model to work with continuous representations of discrete tokens.
- **Position embeddings**: Since transformers do not inherently handle the order of tokens, position embeddings are added to provide positional context to each token in the sequence. This enables the model to understand the order of tokens, which is essential for generating coherent text.
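
To make the role of the two tables concrete, here is a minimal sketch in plain Python. The table sizes and random values are toy assumptions, and `embed` is a hypothetical helper; in the model the lookups are done by `nn.Embedding` and the sum is computed on tensors.

```python
import random

random.seed(0)
vocab_size, block_size, n_embd = 5, 4, 3  # toy sizes

# One learned vector per token id and one per position (random here).
token_table = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(vocab_size)]
position_table = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(block_size)]


def embed(token_ids):
    """Sum the token embedding and the position embedding for each position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


x = embed([2, 0, 2])  # token 2 appears at positions 0 and 2
print(x[0])
print(x[2])
```

The same token id produces different vectors at positions 0 and 2, which is how the model can tell otherwise identical tokens apart. Because the position table has only `block_size` rows, inputs longer than `block_size` cannot be embedded.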

### 3. **Transformer Blocks**
```python
self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
```
- **Transformer blocks**: This is the heart of the GPT model. A transformer block consists of two main components:
  - **Self-attention**: The model looks at all previous tokens in the sequence and learns relationships between them. In GPT, causal (masked) self-attention ensures that the model can only look at the past, not the future, when predicting the next token.
  - **Feed-forward network**: After attention, the model applies a feed-forward neural network to process the information and refine the token embeddings.

The model's layers (`n_layer`) control the depth of the transformer network, and `n_head` refers to the number of attention heads used in each block, which allows the model to capture diverse patterns in the data.

### 4. **Final Layer and Prediction**
```python
self.ln_f = nn.LayerNorm(n_embd)
self.lm_head = nn.Linear(n_embd, vocab_size)
```
- **Layer normalization (`ln_f`)**: This normalization layer is applied to the final output of the transformer blocks, stabilizing training and improving convergence.
- **Language model head**: After passing through the transformer layers, the model output is transformed into logits over the entire vocabulary (one value per token, `vocab_size` in total). These logits are unnormalized scores for each token being the next token in the sequence; applying softmax turns them into probabilities.

### 5. **Loss Calculation**
```python
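logits = self.lm_head(x)  # (B, T, vocab_size)
if targets is None:
    loss = None
else:
    # Flatten batch and time so that each position is one classification example
    B, T, C = logits.shape
    loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))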
```
- During training, the model computes the loss using **cross-entropy loss**, comparing the predicted logits to the true target tokens. This is the standard loss function used for language models, and it encourages the model to predict the correct next token.
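
For a single position, cross-entropy is the negative log of the softmax probability assigned to the correct token. The sketch below computes it in plain Python for a toy vocabulary of four tokens; the function name and the logit values are made up for illustration. `F.cross_entropy` applies the same formula to every position and, by default, averages the results.

```python
import math


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 4
uniform_loss = cross_entropy([0.0] * vocab_size, target=2)
confident_loss = cross_entropy([0.0, 0.0, 10.0, 0.0], target=2)
wrong_loss = cross_entropy([0.0, 0.0, 10.0, 0.0], target=0)

print(uniform_loss, math.log(vocab_size))
print(confident_loss, wrong_loss)
```

A model that assigns equal scores to every token has a loss of `log(vocab_size)`, which is a useful reference point: an untrained model should start near that value, and training should push the loss below it.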

### 6. **Text Generation**
```python
def generate(self, index, max_new_tokens):
    for _ in range(max_new_tokens):
        index_cond = index[:, -block_size:]  # crop the context to the last block_size tokens
        logits, loss = self.forward(index_cond)
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        index_next = torch.multinomial(probs, num_samples=1)
        index = torch.cat((index, index_next), dim=1)
    return index
```
- **Text generation**: The `generate` method demonstrates how the model can be used for text generation. Starting with a given sequence (context), the model crops the context to the last `block_size` tokens, predicts a distribution over the next token, samples from it, appends the sampled token to the sequence, and repeats until the desired number of new tokens has been generated.
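The loop can be tried end to end without training anything. In the sketch below, `TinyLM` is a hypothetical stand-in (a single embedding table that maps each token to next-token logits), not the notebook's `GPTLanguageModel`, and the sizes are assumed. The generation loop itself follows the same steps as the method above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical stand-in model: looks up next-token logits for each token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index):
        return self.table(index)                         # (B, T, vocab_size)

@torch.no_grad()
def generate(model, index, max_new_tokens, block_size):
    for _ in range(max_new_tokens):
        index_cond = index[:, -block_size:]              # keep at most block_size tokens
        logits = model(index_cond)[:, -1, :]             # logits for the last position, (B, C)
        probs = F.softmax(logits, dim=-1)
        index_next = torch.multinomial(probs, num_samples=1)   # sample one token, (B, 1)
        index = torch.cat((index, index_next), dim=1)    # sequence grows by one token
    return index

model = TinyLM(vocab_size=10)
context = torch.zeros((1, 1), dtype=torch.long)          # start from token id 0
out = generate(model, context, max_new_tokens=20, block_size=8)
print(out.shape)                                         # torch.Size([1, 21])
```

The returned tensor contains the original context followed by the sampled tokens, which is why its length is the context length plus `max_new_tokens`.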

### Purpose of `GPTLanguageModel` in LLMs:
In the context of large language models (LLMs), such as GPT, this class serves several crucial purposes:

1. **Autoregressive Text Generation**: It is built to predict the next token based on the given sequence of tokens, which is fundamental to tasks like text generation, translation, summarization, and more.

2. **Learning Language Patterns**: It trains on large corpora of text to learn the statistical relationships between words or subwords, helping it generate coherent and contextually appropriate text.

3. **Scalability and Depth**: The use of multiple transformer layers (`n_layer`) and attention heads (`n_head`) allows the model to scale and capture complex patterns in large datasets, making it capable of understanding and generating long and intricate sequences.

4. **Pretrained Model**: Like GPT models, this class would typically be pretrained on a massive corpus (unsupervised learning) and fine-tuned for specific tasks. The architecture's flexibility allows it to be adapted to various NLP tasks beyond simple text generation.

In summary, the `GPTLanguageModel` class represents a **transformer-based language model** with key components such as token and position embeddings, multiple transformer layers for contextual understanding, and a head for generating next-token predictions. Its purpose in LLMs is to model and generate natural language, and it is a core building block for powerful models like GPT.

In [9]:
"""
   - Embeds tokens and their positions with learned position embeddings, instead of the trigonometric (sinusoidal) positional encoding.
   - Processes the embeddings through multiple transformer layers.
   - Generates logits that correspond to the probability distribution over the vocabulary for each token in the sequence.
   - Can compute the loss when targets are provided.
   - Can generate new sequences of text given an initial input.
"""

class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # This is a lookup table for embeddings of tokens in the vocabulary.
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
            # This layer provides embeddings for the positions of tokens in the sequence. 
            # Since the transformer does not inherently handle sequential information 
            # (like recurrence in RNNs), position embeddings are added to each token’s embedding 
            # to give the model a sense of token order.
        self.position_embedding_table = nn.Embedding(block_size, n_embd) # block_size is sequence length
            # This creates a list of n_layer transformer blocks. 
            # Each Block is a multi-head self-attention and feed-forward neural network layer, 
            # designed to process the input sequence in parallel. The number of attention heads is n_head, 
            # and the size of the embedding is n_embd.
        # The * unpacks the list, so each Block is passed to nn.Sequential as its own positional argument.
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
             # A fully connected layer that projects the final output embedding of each token into the vocabulary space. This gives us a vector of logits corresponding to each possible token in the vocabulary, which will be used for predicting the next token.
        self.lm_head = nn.Linear(n_embd, vocab_size) # lm_head:  language model head
            
        self.apply(self._init_weights) # weight initialization

    def _init_weights(self, module): # draw weights from a normal distribution with mean 0 and std 0.02
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None: # if there is a bias term, initialize it to zero
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, index, targets=None):
        """
        This function defines the forward pass of the model:
            index: A tensor of token indices (shape [B, T], where B is the batch size and T is the sequence length).
            targets: Optional ground truth tokens for computing the loss. If not provided, the model only returns predictions.
            
        """
        B, T = index.shape
        
        
        # index and targets are both (B,T) tensors of integers
        # https://pytorch.org/docs/stable/notes/broadcasting.html
        # https://www.geeksforgeeks.org/understanding-broadcasting-in-pytorch/
        tok_emb = self.token_embedding_table(index) # resulting in embedding size of (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
                # If targets are provided, the loss is computed using F.cross_entropy. The logits and targets are reshaped into a 2D tensor for compatibility with the loss function.


        return logits, loss
    
    def generate(self, index, max_new_tokens): # this function generates new text
        # index is (B, T) array of indices in the current context
        # The method generates tokens one at a time. 
        # It crops the input sequence to the last block_size tokens (index_cond) to limit the context 
        # to the maximum allowed by the model.
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            index_cond = index[:, -block_size:]
            # get the predictions
            logits, loss = self.forward(index_cond)
            # focus only on the last time step
            # The current sequence is passed through the model to get the logits.
            # Only the predictions for the last token are kept (i.e., the next token prediction).
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            # The logits are converted to probabilities using the softmax function.
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            index_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            index = torch.cat((index, index_next), dim=1) # (B, T+1)
            # The generated token is appended to the input sequence, 
            # and the process continues until the desired number of new tokens is generated
        return index

model = GPTLanguageModel(vocab_size)
# print('loading model parameters...')
# with open('model-01.pkl', 'rb') as f:
#     model = pickle.load(f)
# print('loaded successfully!')
m = model.to(device)

In [10]:
"""
The block is used to estimate the loss on both the training and validation datasets. 
This is typically done during the training process of a machine learning model to track how well the model is 
performing on these datasets without affecting the actual training. 

Purpose: Decorator disables gradient calculation within the function. 
    It is important because during the evaluation (e.g., loss estimation), we don't need gradients 
    (as we're not performing backpropagation or updating model parameters). Disabling gradients reduces memory usage and speeds up computation.
Effect: When applied to the estimate_loss function, 
it ensures that the loss computation does not track gradients, which is more efficient for validation or inference.

"""

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()

    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        # loss calculation loop
        for k in range(eval_iters):
            X, Y = get_batch(split)
            # This is an ordinary forward pass. No weights change here because no optimizer
            # step is taken and @torch.no_grad() disables gradient tracking;
            # model.eval() only switches layers such as Dropout to evaluation behavior.
            logits, loss = model(X, Y) # model forward pass
            losses[k] = loss.item()
        # computes the mean loss for the current split ('train' or 'val').
        out[split] = losses.mean() # mean must be called; storing the bound method breaks the :.3f formatting in the training loop

    # Restore training mode after the loss estimation so that layers such as Dropout are active again.
    model.train()
    return out

In [11]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    print(iter)
    if iter % eval_iters == 0:
        losses = estimate_loss()
        print(f"step: {iter}, train loss: {losses['train']:.3f}, val loss: {losses['val']:.3f}")

    # sample a batch of data
    # xb: A batch of input data (features).
    # yb: The corresponding target values (labels) for each input in the batch.
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)

    # This clears the gradients of all model parameters before computing the gradients 
    # for the current batch. Without this, gradients would accumulate across iterations, which is not desired.
    optimizer.zero_grad(set_to_none=True)
    loss.backward() # This computes the backpropagation of the loss to calculate the gradients of the model parameters. 
    optimizer.step()
print(loss.item())

with open('model-01.pkl', 'wb') as f:
    pickle.dump(model, f)
print('model saved')

0


FileNotFoundError: [Errno 2] No such file or directory: 'train_split.txt'

An **autoregressive language model** generates text one token at a time, predicting each token from the tokens that precede it. Tokens it has already generated become part of the context for the next prediction.

### Key Characteristics of an Autoregressive Language Model:
1. **Generates tokens one by one**: 
   - At each time step, the model predicts the next token in a sequence based on all the tokens that have been generated or observed up to that point.
   - For example, given the prompt "The cat is", the model might predict the next token as "on", so the sequence becomes "The cat is on", and then it continues generating the next token from this updated sequence.

2. **Conditional probability**:
   - An autoregressive model learns to estimate the probability distribution of the next token, conditioned on the tokens that came before it. Formally, the model learns to maximize the probability of the next token \( p(t_{i+1} \mid t_1, t_2, \dots, t_i) \), where \( t_1, t_2, \dots, t_i \) are the tokens before the \( (i+1) \)-th token.

3. **Training process**:
   - During training, the model is provided with a sequence of tokens and learns to predict each token based on the preceding tokens.
   - For example, if the input sequence is "The cat is on the mat", the model learns to predict:
     - "The" → "cat"
     - "The cat" → "is"
     - "The cat is" → "on"
     - "The cat is on" → "the"
     - and so on, until the entire sequence is predicted.

4. **Autoregressive in nature**:
   - The term "autoregressive" refers to the fact that the model generates the next output based on its own previous outputs. At each step, the model "autoregressively" generates the next token in the sequence, conditioned on the tokens it has already produced.
   - For example, after generating the token "on" in the example above, it uses the sequence "The cat is on" to predict the next token.
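The training pairs listed under item 3 come from shifting the sequence by one position. The sketch below uses whole words as tokens purely for readability, which is an assumption of this example; the notebook itself works with characters.

```python
tokens = "The cat is on the mat".split()

# Inputs are every token but the last; targets are the same sequence shifted left by one.
inputs, targets = tokens[:-1], tokens[1:]

# Each prefix of the inputs is paired with the token that follows it.
pairs = [(" ".join(inputs[: i + 1]), targets[i]) for i in range(len(inputs))]

for context, next_token in pairs:
    print(f"{context!r} -> {next_token!r}")
```

A single sequence of six tokens therefore yields five training examples, one per position.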

### How It Works:
1. **Given a sequence of tokens**: 
   - The model starts with an initial context, which could be a prompt or an empty sequence, and begins generating the first token.
   
2. **Autoregressive Generation**:
   - After generating the first token, it is appended to the input context, and the model uses this new context to predict the next token. This process repeats until the desired sequence length is reached, or a special end-of-sequence token is generated.

### Example:
Let’s say the model is trained to predict words in the sequence "The cat sat on the mat". During generation:
- Step 1: Given "The", the model predicts "cat".
- Step 2: Now, given "The cat", the model predicts "sat".
- Step 3: Given "The cat sat", it predicts "on", and so on.

The process is **autoregressive** because at each step, the model generates the next word based on all the previous words it has generated.
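Written as a formula, this step-by-step conditioning is the chain rule of probability. For the first three words of the example:

\[
p(\text{The}, \text{cat}, \text{sat}) = p(\text{The}) \, p(\text{cat} \mid \text{The}) \, p(\text{sat} \mid \text{The}, \text{cat})
\]

and for a general sequence of \( n \) tokens:

\[
p(t_1, t_2, \dots, t_n) = \prod_{i=1}^{n} p(t_i \mid t_1, \dots, t_{i-1})
\]

Each factor on the right is one next-token prediction, so a model that predicts the next token well also assigns a high probability to the whole sequence.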

### Why is it called "autoregressive"?
- The name "autoregressive" comes from the fact that the model predicts future tokens using **itself** — meaning its own previous predictions — as the context. The prediction for the next token at each time step depends on the tokens it generated in the previous steps.

### Autoregressive vs. Other Models:
- **Autoregressive models** (like GPT, or RNNs) generate sequences step-by-step, always conditioned on previous tokens.
- **Non-autoregressive models** predict all output positions at once, in parallel, instead of one token at a time. Attention is not what separates the two families: GPT uses attention and is still autoregressive. Encoder models such as BERT, which see the whole sequence in both directions, are likewise not autoregressive.

### Application:
Autoregressive models are commonly used in:
- **Text generation** (e.g., generating a paragraph of text or a story).
- **Language modeling** (e.g., predicting the likelihood of a sequence of words).
- **Machine Translation** (e.g., generating a translated sentence word by word).
  
Autoregressive models like GPT (Generative Pretrained Transformer) are trained to predict the next token (word, subword, or character) based on the previous ones, which is why they are particularly powerful for tasks like text generation.

### Summary:
An **autoregressive language model** generates the next token in a sequence based on all the tokens that have been generated before it. It learns the conditional probability distribution of the next token, given the previous tokens. This type of model is widely used for tasks like text generation, where the goal is to produce a coherent sequence of words, sentences, or even entire paragraphs one token at a time.


In building Large Language Models (LLMs), a Transformer is a type of deep learning model architecture that has revolutionized natural language processing (NLP) and other tasks, such as image recognition and translation. It was introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017.
Transformers are specifically designed to handle sequences of data (like sentences) and are known for their efficiency and parallelization capabilities, making them particularly effective for large-scale models like GPT and BERT.

### Key Components of Transformers:
1. Attention Mechanism: The core innovation of the Transformer is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when making predictions, regardless of their positions. This means that each word can "attend" to every other word in the sequence when forming its representation. This is different from traditional models like RNNs or LSTMs, which process words one by one in order and struggle with long-range dependencies.
2. Positional Encoding: Since the Transformer doesn't process data sequentially, it needs a way to understand the order of words in a sentence. This is achieved through positional encodings, which are added to the input embeddings to provide information about the position of each word in the sequence.
3. Multi-Head Attention: This allows the model to focus on different parts of the input simultaneously. Instead of having just one attention mechanism, it uses multiple "heads," which each focus on different aspects of the input, and then combines them to create a more comprehensive understanding.
4. Encoder-Decoder Architecture (for some tasks):
    * Encoder: Takes an input sequence and processes it into a contextually rich representation.
    * Decoder: Uses that representation to generate an output sequence (e.g., in translation tasks).

   For tasks like language modeling, models like GPT (which is based on the Transformer) use only the decoder, while BERT uses only the encoder for tasks like classification.
5. Feedforward Neural Networks: After each attention mechanism, the Transformer applies position-wise fully connected feedforward neural networks (a standard neural network layer). This allows for additional transformations to the data, helping the model learn more complex patterns.
6. Layer Normalization and Residual Connections: These help stabilize training by normalizing the inputs to each layer and allowing gradients to flow more easily through the network.

### Why Transformers Are Popular in LLMs:
* Scalability: The architecture allows for training on massive datasets and enables the creation of large models (like GPT-3, with 175 billion parameters).
* Parallelization: Unlike RNNs, which process data sequentially, Transformers process data in parallel, making them much faster to train.
* Long-range dependencies: The self-attention mechanism helps capture long-range dependencies in text (for example, understanding the relationship between words at the beginning and end of a sentence).
In essence, Transformers are the backbone of modern LLMs like GPT-3, GPT-4, and BERT, powering their ability to understand and generate human-like text.
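
The positional encodings listed among the key components were defined in "Attention is All You Need" with fixed sine and cosine functions, whereas the notebook's model learns its position embeddings. The following is a minimal sketch of the sinusoidal variant for comparison; the function name `sinusoidal_positions` and the sizes are assumptions of this example, and it expects an even `d_model`.

```python
import math
import torch

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encodings: sine on even feature indices, cosine on odd ones."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (T, 1)
    # One frequency per sine/cosine pair: 1 / 10000^(2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                         # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=16)
print(pe.shape)   # torch.Size([8, 16])
```

Like the learned table, the result has one row per position and is added to the token embeddings; the difference is that these values are computed once and never trained.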
