In [4]:
import torch
import torch.nn as nn
import tiktoken

# SYD'S GPT-345M Model Configuration

This notebook cell defines the configuration dictionary for a GPT-style language model with approximately 345 million parameters. The configuration parameters are as follows:

- **vocab_size**: The number of distinct tokens in the tokenizer's vocabulary. It is set to 50,257 to match the GPT-2 byte-pair-encoding vocabulary.
- **context_length**: The maximum number of tokens the model can consider in a single input sequence. Here, it is set to 1,024 tokens, which determines the model's memory window for generating or analyzing text.
- **embedding_dim**: The size of the vector used to represent each token. A higher embedding dimension allows the model to capture more nuanced relationships between tokens. This configuration uses 1,024 dimensions.
- **num_heads**: The number of attention heads in the multi-head self-attention mechanism. More heads let the model attend to different parts of the input simultaneously. This model uses 16 heads; in a standard multi-head implementation each head would then operate on 1,024 / 16 = 64 dimensions.
- **num_layers**: The number of transformer blocks (layers) stacked in the model. More layers generally increase the model's capacity to learn complex patterns. Here, 24 layers are used.
- **dropout**: The dropout rate applied during training to reduce overfitting. A value of 0.1 means each activation passing through a dropout layer is zeroed with probability 0.1 during training; dropout is disabled at evaluation time.
- **qkv_bias**: A boolean indicating whether to include a bias term in the query, key, and value projections of the attention mechanism. Setting this to `False` means no bias is added.

This configuration is typical for a medium-sized GPT-2 model and can be used as a starting point for training or fine-tuning a transformer-based language model.
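
As a rough sanity check on the "345M" label, the parameter count can be estimated directly from the configuration values. The sketch below makes several assumptions that the notebook does not state: GPT-2 style blocks with a 4x feed-forward expansion, bias terms in the attention output projection and feed-forward layers, two layer norms per block, and an output head that shares its weights with the token embedding.

```python
# Rough parameter estimate from the configuration values.
# The block layout (4x feed-forward, tied output head) is an assumption,
# not something defined in this notebook.
config = {
    "vocab_size": 50257,
    "context_length": 1024,
    "embedding_dim": 1024,
    "num_heads": 16,
    "num_layers": 24,
    "qkv_bias": False,
}

d = config["embedding_dim"]
assert d % config["num_heads"] == 0, "embedding_dim must divide evenly across heads"
head_dim = d // config["num_heads"]

# Token and position embedding tables
embeddings = config["vocab_size"] * d + config["context_length"] * d

# One transformer block under the assumed layout
qkv = 3 * d * d + (3 * d if config["qkv_bias"] else 0)
attn_out = d * d + d
feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
layer_norms = 2 * (2 * d)  # scale and shift for two norms
per_block = qkv + attn_out + feed_forward + layer_norms

final_norm = 2 * d
total = embeddings + config["num_layers"] * per_block + final_norm

print(f"head_dim = {head_dim}")
print(f"parameters (tied output head) = {total:,}")
```

Under these assumptions the estimate is about 355 million parameters, close to the nominal 345M. An untied output projection would add another `vocab_size * embedding_dim` weights on top of that.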

In [9]:
SYDSGPT_CONFIG_345M = {
    "vocab_size" : 50257,
    "context_length" : 1024,
    "embedding_dim" : 1024,
    "num_heads" : 16,
    "num_layers" : 24,
    "dropout" : 0.1,
    "qkv_bias" : False
}

# PlaceholderGPT Model Class Overview

The following code cell defines a simplified, illustrative version of a GPT-style transformer model using PyTorch. This implementation is intended for educational purposes and demonstrates the core architectural components of a transformer-based language model:

## Main Components

- **PlaceholderGPT (nn.Module)**: The main model class. It initializes the following layers:
  - **Token Embedding**: Maps each token in the input sequence to a high-dimensional vector.
  - **Position Embedding**: Adds positional information to each token, allowing the model to distinguish between tokens at different positions.
  - **Dropout Layer**: Regularizes the model by randomly dropping units during training to prevent overfitting.
  - **Transformer Blocks**: A stack of placeholder transformer blocks (not fully implemented here) that would normally perform self-attention and feed-forward operations.
  - **Final Layer Normalization**: Normalizes the output of the transformer blocks.
  - **Output Projection**: Projects the final hidden states to the vocabulary size, producing logits for each token.

- **forward(input)**: The forward pass of the model. It processes the input sequence through embeddings, dropout, transformer blocks, layer normalization, and output projection to produce logits for each token position.
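
The way token and position embeddings combine in the forward pass can be sketched with tiny, hypothetical sizes (the vocabulary size, context length, and embedding dimension below are chosen only to keep the shapes readable). The position embeddings have no batch dimension, so they are broadcast across every sequence in the batch.

```python
import torch
import torch.nn as nn

# Tiny, hypothetical sizes so the shapes are easy to read
vocab_size, context_length, embedding_dim = 100, 8, 4
token_embedding = nn.Embedding(vocab_size, embedding_dim)
position_embedding = nn.Embedding(context_length, embedding_dim)

token_ids = torch.tensor([[5, 17, 42], [7, 7, 99]])  # (batch_size=2, seq_length=3)
seq_length = token_ids.shape[1]

tok = token_embedding(token_ids)                    # (2, 3, 4)
pos = position_embedding(torch.arange(seq_length))  # (3, 4)
x = tok + pos                                       # broadcasts to (2, 3, 4)

print(tok.shape, pos.shape, x.shape)
```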

## Supporting Classes

- **PlaceholderTransformerBlock (nn.Module)**: Represents a single transformer block. In a full implementation, this would include multi-head self-attention and feed-forward layers. Here, it simply returns the input unchanged.

- **PlaceholderLayerNorm (nn.Module)**: A placeholder for layer normalization. In a real model, this would normalize the input tensor across the embedding dimension. Here, it accepts `embedding_dim` and `eps` but ignores them and returns the input unchanged.

### What does `eps = 1e-5` mean?

In the context of neural networks and layer normalization, `eps` (epsilon) is a small constant added to the denominator when normalizing activations. This prevents division by zero and improves numerical stability.

- **Value:** `1e-5` means epsilon is set to $0.00001$.
- **Purpose:** When calculating the normalized output, the formula typically looks like:
  $$
  \text{normalized} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
  $$
  where $\mu$ is the mean and $\sigma^2$ is the variance of the input $x$.
- **Why needed:** If the variance is very close to zero, adding epsilon ensures the denominator is never zero, avoiding undefined results and improving stability during training.

This is a standard practice in deep learning layers such as batch normalization and layer normalization.
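
A minimal sketch of the failure case that `eps` guards against: a row whose values are all identical has zero variance, so normalizing without epsilon divides zero by zero. The input tensor here is made up for illustration.

```python
import torch

x = torch.tensor([[2.0, 2.0, 2.0, 2.0]])  # constant row, so the variance is exactly 0
eps = 1e-5

mean = x.mean(dim=-1, keepdim=True)
variance = x.var(dim=-1, keepdim=True, unbiased=False)

without_eps = (x - mean) / torch.sqrt(variance)     # 0 / 0 gives nan
with_eps = (x - mean) / torch.sqrt(variance + eps)  # 0 / sqrt(eps) gives 0

print(without_eps)
print(with_eps)
```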

## Notes

- This code is a skeleton and does not implement the full transformer logic (e.g., attention mechanisms, feed-forward networks, or actual layer normalization).
- The model uses configuration parameters such as `vocab_size`, `embedding_dim`, `context_length`, `num_layers`, and `dropout` from the configuration dictionary defined earlier in the notebook.
- The purpose of this cell is to illustrate the structure and flow of a transformer-based language model in PyTorch, serving as a starting point for further development or experimentation.

In [11]:
class PlaceholderGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
        self.position_embedding = nn.Embedding(config["context_length"], config["embedding_dim"])
        self.dropout = nn.Dropout(config["dropout"])
        self.transformer_blocks = nn.Sequential(*[PlaceholderTransformerBlock(config) for _ in range(config["num_layers"])])
        self.final_layer_norm = PlaceholderLayerNorm(config["embedding_dim"])
        self.output_projection = nn.Linear(config["embedding_dim"], config["vocab_size"], bias = False)

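    def forward(self, input):
        # input: token IDs of shape (batch_size, seq_length)
        batch_size, seq_length = input.shape
        token_embeddings = self.token_embedding(input)  # (batch_size, seq_length, embedding_dim)
        # One position index per token, created on the same device as the input
        position_embeddings = self.position_embedding(torch.arange(seq_length, device = input.device))
        x = token_embeddings + position_embeddings  # positions broadcast across the batch
        x = self.dropout(x)
        x = self.transformer_blocks(x)  # placeholder blocks: identity for now
        x = self.final_layer_norm(x)  # placeholder norm: identity for now
        logits = self.output_projection(x)  # (batch_size, seq_length, vocab_size)
        return logits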
    
class PlaceholderTransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        
    def forward(self, x):
        return x
    
class PlaceholderLayerNorm(nn.Module):
    def __init__(self, embedding_dim, eps = 1e-5):
        # Arguments mirror a real layer norm's interface but are unused in this placeholder
        super().__init__()

    def forward(self, x):
        return x



# Example: Tokenization and Model Forward Pass

This code cell demonstrates how to tokenize input text, prepare a batch for the model, and run a forward pass using the `PlaceholderGPT` model defined earlier in the notebook.

## Steps Explained

1. **Tokenization**
   - Uses the `tiktoken` library to obtain the GPT-2 tokenizer.
   - Two example sentences (`ex1` and `ex2`) are encoded into lists of token IDs.

2. **Batch Preparation**
   - The encoded token lists are converted to PyTorch tensors and added to a batch list.
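   - The batch is stacked into a single tensor with shape `(batch_size, sequence_length)`. `torch.stack` requires equally sized tensors, which works here because both sentences encode to four tokens.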
   - The batch tensor is printed to show its contents.

3. **Model Initialization and Forward Pass**
   - Sets a manual random seed for reproducibility.
   - Instantiates the `PlaceholderGPT` model with the configuration dictionary.
   - Runs the batch through the model to obtain logits (unnormalized predictions for each token position).

4. **Output**
   - Prints the shape of the logits tensor, which should be `(batch_size, sequence_length, vocab_size)`.
   - Prints the actual logits tensor values.

## Notes
- The model's weights are randomly initialized and its transformer blocks are placeholders, so the logits are not meaningful predictions; they only demonstrate the expected output structure.
- The code illustrates the typical workflow for preparing text data and running it through a transformer-based language model in PyTorch.
- Calling `torch.manual_seed` before creating the model makes the random weight initialization and dropout masks, and therefore the printed logits, repeatable across runs.
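
To make "unnormalized predictions" concrete, the sketch below turns logits into probabilities with a softmax over the vocabulary axis and picks the most likely next token at the last position. It uses random stand-in logits with a hypothetical vocabulary of 10 tokens rather than the model output from this notebook.

```python
import torch

torch.manual_seed(0)
# Stand-in logits with a small, hypothetical vocabulary of 10 tokens
logits = torch.randn(2, 4, 10)  # (batch_size, sequence_length, vocab_size)

probs = torch.softmax(logits, dim=-1)            # normalize over the vocabulary axis
next_token_ids = probs[:, -1, :].argmax(dim=-1)  # greedy pick at the last position

print(probs.sum(dim=-1))     # every position sums to 1
print(next_token_ids.shape)  # one token ID per sequence in the batch
```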

In [17]:
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
ex1 = "Hello how are you"
ex2 = "What are you doing"
batch.append(torch.tensor(tokenizer.encode(ex1)))
batch.append(torch.tensor(tokenizer.encode(ex2)))
batch = torch.stack(batch, dim = 0)
print(batch)

torch.manual_seed(246)
test_model = PlaceholderGPT(SYDSGPT_CONFIG_345M)
logits = test_model(batch)
print(f"Logits Shape: {logits.shape}")
print(f"Logits: \n {logits}")


tensor([[15496,   703,   389,   345],
        [ 2061,   389,   345,  1804]])
Logits Shape: torch.Size([2, 4, 50257])
Logits: 
 tensor([[[ 0.2638, -0.1249, -1.9838,  ..., -0.0289,  0.6842, -0.1247],
         [-0.0896,  1.0701,  0.0236,  ...,  0.4069, -0.5886, -0.4666],
         [ 0.6501,  0.0914, -1.0634,  ...,  1.0771,  0.1660, -0.0224],
         [ 0.5803,  0.3458, -0.4553,  ...,  0.7233, -1.3835, -0.2096]],

        [[ 1.6676,  0.5273, -0.3480,  ..., -0.0478, -0.2177,  0.0361],
         [ 0.6377, -0.6156, -1.3517,  ...,  1.0306, -1.0850, -1.3458],
         [ 0.1607,  0.2084,  0.6130,  ...,  0.6961, -0.9398,  0.6000],
         [-0.1086, -0.2255, -0.4622,  ...,  0.8606,  0.1524, -1.0448]]],
       grad_fn=<UnsafeViewBackward0>)

# Example: Manual Layer Normalization in PyTorch

This code cell demonstrates how to manually normalize the output of a neural network layer in PyTorch, illustrating the concept behind layer normalization.

## Steps Explained

1. **Batch and Layer Setup**
   - Sets a manual random seed for reproducibility.
   - Creates a random batch tensor `ex_batch` of shape `(2, 6)`.
   - Defines a simple neural network layer (`nn_layer`) consisting of a linear transformation followed by a ReLU activation.
   - Passes the batch through the layer to obtain `output`.
   - Prints the shape and values of the output tensor.

2. **Mean and Variance Calculation**
   - Computes the mean and variance of the output tensor along the last dimension (features) for each sample in the batch.
   - Prints the mean and variance values.

3. **Manual Normalization**
   - Normalizes the output tensor by subtracting the mean and dividing by the square root of the variance (without epsilon for simplicity).
   - Prints the normalized output tensor.
   - Calculates and prints the mean and variance of the normalized output to verify that the mean is close to zero and the variance is close to one for each sample.

## Notes
- This example shows the core idea behind layer normalization, which is commonly used in deep learning models to stabilize and accelerate training.
- In practice, a small epsilon value is added to the denominator to prevent division by zero and improve numerical stability.
- The normalization is performed per sample (row) in the batch, across the feature dimension.
- This manual approach helps illustrate what built-in PyTorch layers like `nn.LayerNorm` do internally.
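
One detail worth checking when comparing a manual version against `nn.LayerNorm`: `Tensor.var` uses the unbiased estimator (dividing by n - 1) by default, whereas `nn.LayerNorm` uses the biased estimator (dividing by n). A small sketch of the difference on a made-up input:

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])

var_unbiased = x.var(dim=-1, keepdim=True)                # divides by n - 1 (PyTorch default)
var_biased = x.var(dim=-1, keepdim=True, unbiased=False)  # divides by n

print(var_unbiased)  # sum of squared deviations is 5.0, so 5 / 3
print(var_biased)    # 5 / 4
```

The gap is noticeable for small feature dimensions like this one and shrinks as the dimension grows.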

In [22]:
torch.manual_seed(246)
ex_batch = torch.randn(2,6)
nn_layer = nn.Sequential(nn.Linear(6,8), nn.ReLU())
output = nn_layer(ex_batch)
print(f"Output Shape: {output.shape}")
print(f"Output: \n {output}")

mean = output.mean(dim = -1, keepdim = True)
variance = output.var(dim = -1, keepdim = True)
print(f"Mean: \n {mean}")
print(f"Variance: \n {variance}")

normalized_output = (output - mean) / torch.sqrt(variance)
mean_after_norm = normalized_output.mean(dim = -1, keepdim = True)
variance_after_norm = normalized_output.var(dim = -1, keepdim = True)
print(f"Normalized Output: \n {normalized_output}")
print(f"Mean After Norm: \n {mean_after_norm}")
print(f"Variance After Norm: \n {variance_after_norm}")

Output Shape: torch.Size([2, 8])
Output: 
 tensor([[0.0000, 1.4358, 0.8909, 0.0000, 0.3553, 1.4952, 0.0000, 0.0000],
        [1.4236, 0.0000, 0.0000, 2.0742, 1.7022, 0.1547, 0.0589, 0.0000]],
       grad_fn=<ReluBackward0>)
Mean: 
 tensor([[0.5221],
        [0.6767]], grad_fn=<MeanBackward1>)
Variance: 
 tensor([[0.4337],
        [0.7986]], grad_fn=<VarBackward0>)
Normalized Output: 
 tensor([[-0.7929,  1.3874,  0.5599, -0.7929, -0.2533,  1.4775, -0.7929, -0.7929],
        [ 0.8357, -0.7572, -0.7572,  1.5638,  1.1475, -0.5841, -0.6913, -0.7572]],
       grad_fn=<DivBackward0>)
Mean After Norm: 
 tensor([[0.0000e+00],
        [2.2352e-08]], grad_fn=<MeanBackward1>)
Variance After Norm: 
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)


# Improved Custom LayerNorm Class in PyTorch

This code cell presents an improved custom implementation of the Layer Normalization operation, a key technique for stabilizing and accelerating training in deep learning models, especially transformers.

## How This Implementation Works

- **Class Definition:**
  - `LayerNorm` inherits from `nn.Module` for seamless integration with PyTorch models.
  - The constructor accepts `embedding_dim` (the size of the last dimension to normalize) and `eps` (a small constant for numerical stability).

- **Parameters:**
  - `scale`: A learnable scaling parameter (initialized to ones) that allows the model to adjust the normalized output's scale.
  - `shift`: A learnable shifting parameter (initialized to zeros) that allows the model to adjust the normalized output's mean.

- **Forward Pass:**
  - Calculates the mean and variance of the input tensor `x` along the last dimension for each sample.
  - Normalizes the input: $\text{normalized}_x = \frac{x - \text{mean}}{\sqrt{\text{variance} + \epsilon}}$
  - Applies the learnable scale (`scale`) and shift (`shift`) to the normalized tensor.
  - Returns the final output.

## Key Points
- Layer normalization is performed per sample, across the feature dimension, making it suitable for variable-length sequences and transformer models.
- The use of `eps` ensures numerical stability by preventing division by zero.
- Learnable parameters (`scale` and `shift`) allow the model to recover the original distribution if needed.
- This implementation closely mirrors PyTorch's built-in `nn.LayerNorm`, but is written from scratch for educational clarity and flexibility.

## Usage
This custom `LayerNorm` class can be used in place of PyTorch's built-in layer normalization in neural network modules, especially when you want to understand, customize, or experiment with normalization behavior.
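
The effect of the scale and shift parameters can be sketched with PyTorch's built-in functional form, `torch.nn.functional.layer_norm`, used here as a stand-in for the custom class. The input and the parameter values (scale of 2, shift of 1) are hypothetical and chosen only to make the effect visible.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
scale = torch.full((6,), 2.0)  # hypothetical learned scale
shift = torch.full((6,), 1.0)  # hypothetical learned shift

# Built-in functional layer norm, standing in for the custom class
y = F.layer_norm(x, normalized_shape=(6,), weight=scale, bias=shift, eps=1e-5)

print(y.mean(dim=-1))                 # close to the shift value, 1
print(y.var(dim=-1, unbiased=False))  # close to the scale squared, 4
```

With a uniform scale and shift, the output mean moves to the shift value and the variance is multiplied by the square of the scale, which is how the layer can move away from zero mean and unit variance when training favors it.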

In [25]:
class LayerNorm(nn.Module):
    def __init__(self, embedding_dim, eps = 1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(embedding_dim))
        self.shift = nn.Parameter(torch.zeros(embedding_dim))

    def forward(self, x):
        mean = x.mean(dim = -1, keepdim = True)
        variance = x.var(dim = -1, keepdim = True, unbiased = False)
        normalized_x = (x - mean) / torch.sqrt(variance + self.eps)
        return self.scale * normalized_x + self.shift

# Using the Custom LayerNorm Class

This code cell demonstrates how to use the custom `LayerNorm` class defined above to normalize a batch of data in PyTorch.

## Steps Explained

1. **LayerNorm Instantiation**
   - Creates an instance of the custom `LayerNorm` class with `embedding_dim = 6`, matching the feature dimension of the input batch (`ex_batch`).

2. **Applying Layer Normalization**
   - Passes the batch tensor `ex_batch` through the `LayerNorm` instance to obtain `normalized_output`.
   - Prints the normalized output tensor.

3. **Verifying Normalization**
   - Calculates and prints the mean and variance of the normalized output along the last dimension for each sample in the batch.
   - The mean should be close to zero and the variance close to one. This holds because the learnable parameters are still at their initial values (`scale` = 1, `shift` = 0), so they leave the normalized values unchanged.

## Notes
- This example shows how to use a custom normalization layer in practice, similar to how you would use PyTorch's built-in `nn.LayerNorm`.
- Layer normalization is applied independently to each sample (row) in the batch, across the feature dimension.
- The learnable parameters (`scale` and `shift`) allow the model to adapt the normalized output during training.
- Layer normalization is a standard component of transformer models, where it helps keep training stable.
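
As a sanity check, the custom class can be compared against `nn.LayerNorm` on the same input. The class is repeated below only so the snippet runs on its own; the random input uses a fresh tensor rather than relying on `ex_batch` from earlier cells.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # Same logic as the custom class in this notebook, repeated so the check is self-contained
    def __init__(self, embedding_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(embedding_dim))
        self.shift = nn.Parameter(torch.zeros(embedding_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        variance = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(variance + self.eps) + self.shift

torch.manual_seed(246)
x = torch.randn(2, 6)

custom_out = LayerNorm(embedding_dim=6)(x)
builtin_out = nn.LayerNorm(6, eps=1e-5)(x)

max_diff = (custom_out - builtin_out).abs().max().item()
print(f"Max absolute difference: {max_diff:.2e}")
```

The two outputs should agree up to floating-point rounding, since both use the biased variance and the same `eps`.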

In [27]:
layer_norm = LayerNorm(embedding_dim = 6)
normalized_output = layer_norm(ex_batch)
print(f"Layer Norm Output: \n {normalized_output}")
mean_after_norm = normalized_output.mean(dim = -1, keepdim = True)
variance_after_norm = normalized_output.var(dim = -1, keepdim = True, unbiased = False)
print(f"Mean After Layer Norm: \n {mean_after_norm}")
print(f"Variance After Layer Norm: \n {variance_after_norm}")

Layer Norm Output: 
 tensor([[-1.3450,  1.1168, -0.1748, -0.5987, -0.5124,  1.5140],
        [-1.2331,  0.8340,  0.8506,  1.1080, -0.2246, -1.3349]],
       grad_fn=<AddBackward0>)
Mean After Layer Norm: 
 tensor([[-3.9736e-08],
        [-1.9868e-08]], grad_fn=<MeanBackward1>)
Variance After Layer Norm: 
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
