# **Appendix A: Converting GPT to LLaMA2 🚀🐑**

Welcome to the first appendix notebook of our journey through the world of language models! 🎉 After spending significant time mastering the intricacies of GPT architecture, it's time to venture into uncharted territory—or rather, into the terrain of another renowned model family: **LLaMA**, the famous open-source creation by Meta. 🦙✨

This notebook is all about **transforming our understanding of GPT into the foundation for LLaMA2**. Why? Because building on what you already know makes everything more intuitive—and let's be honest, a bit more fun! 😄 

While GPT2 and LLaMA2 share a lot of the same DNA (Transformers FTW!), there are a few **interesting tweaks** in the architecture that set LLaMA2 apart. Think of this as a chance to compare notes, explore optimizations, and learn new tricks—all while grounding ourselves in concepts we're already familiar with.

By the end of this notebook, you'll have a clear understanding of how to:

1. Adapt GPT2's core ideas into LLaMA2's structure.  
2. Dive into architectural modifications like **RMSNorm**, **rotary embeddings**, and Meta's flavor of feed-forward layers.  
3. Appreciate the thoughtful engineering behind LLaMA2's open-source power.

So, let's get started and see how the GPT torch passes on to its cousin from the Meta family! Ready? Let’s code. 🧑‍💻

<p align="center">
    <img src="images/llama.png" alt="LLaMA Image" />
</p>

## NOTE: This notebook has minimal comments; it's just a code snippet. I rely on your previous experience with the project. Think of this as an exercise and not a tutorial. Have FUN!!!


In [1]:
# Importing PyTorch, a deep learning framework for tensor operations and building neural networks
import torch 
import torch.nn as nn  # Importing the `nn` module for building neural network layers
import torch.nn.functional as F  # Importing the functional API for PyTorch, which provides functions for activation layers, loss, etc.

# Importing the Hugging Face Hub library for downloading and interacting with pre-trained models
import huggingface_hub 

# Importing SentencePiece, a library for tokenization, used for subword units like Byte Pair Encoding (BPE) and Unigram
import sentencepiece


### **Summary of Main Architectural Differences Between GPT and LLaMA**

| **Aspect**               | **GPT**                                      | **LLaMA**                                   |
|---------------------------|----------------------------------------------|---------------------------------------------|
| **Activation Function**   | GELU (Gaussian Error Linear Unit)           | SiLU (Sigmoid Linear Unit)                 |
| **Normalization**         | LayerNorm                                   | RMSNorm (Root Mean Square Normalization)   |
| **Feedforward Network**   | Two linear layers with a GELU in between    | SwiGLU: three bias-free linear layers with a SiLU gate |
| **Attention Mechanism**   | Standard scaled dot-product attention       | Scaled dot-product attention with rotary embeddings applied to queries and keys |
| **Positional Encoding**   | Learned absolute positional embeddings      | Rotary positional embeddings               |

### **Key Takeaways**
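- RMSNorm drops LayerNorm's mean subtraction and shift parameter, so it is cheaper to compute, while rotary embeddings encode position by rotating queries and keys inside attention instead of adding a learned position vector to the input.
- GPT relies on more traditional design choices, such as LayerNorm and learned absolute positional embeddings, which have proven effective but cost slightly more compute (LayerNorm) and tie the model to a fixed table of positions.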

In [4]:
#####################################
# from notebooks/3.GPT.ipynb
#####################################

# class LayerNorm(nn.Module):
#     def __init__(self, emb_dim):
#         super().__init__()
#         self.eps = 1e-5
#         self.scale = nn.Parameter(torch.ones(emb_dim))
#         self.shift = nn.Parameter(torch.zeros(emb_dim))

#     def forward(self, x):
#         mean = x.mean(dim=-1, keepdim=True)
#         var = x.var(dim=-1, keepdim=True, unbiased=False)
#         norm_x = (x - mean) / torch.sqrt(var + self.eps)
#         return self.scale * norm_x + self.shift

class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.emb_dim = emb_dim


    def forward(self, x):
        mean = x.pow(2).mean(dim=-1, keepdim=True)
        x_norm = x * torch.rsqrt(mean + self.eps)
        output = (self.scale * x_norm).to(dtype=x.dtype) 
        return output
        



In [5]:
torch.manual_seed(123)

batch = torch.randn(2, 3, 4)

our_rms = RMSNorm(emb_dim=batch.shape[-1])
# torch.nn.RMSNorm is only available in recent PyTorch releases (2.4 and later)
torch_rms = torch.nn.RMSNorm(batch.shape[-1], eps=1e-5)

assert torch.allclose(our_rms(batch), torch_rms(batch))
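
To make the difference from LayerNorm concrete, here is a minimal sketch that normalizes the same tensor both ways, without the learnable `scale`/`shift`. The variable names (`rms_out`, `ln_out`, and so on) are illustrative and not part of the notebook's classes; the input is shifted on purpose so its mean is clearly non-zero.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 4) + 3.0  # shifted so each row has a clearly non-zero mean
eps = 1e-5

# RMSNorm: divide by the root mean square, no centering
rms_out = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

# LayerNorm without learnable parameters: center first, then divide by the std
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
ln_out = (x - mean) / torch.sqrt(var + eps)

rms_of_rms_out = rms_out.pow(2).mean(dim=-1).sqrt()  # close to 1 per row
mean_of_rms_out = rms_out.mean(dim=-1)               # not centered
mean_of_ln_out = ln_out.mean(dim=-1)                 # close to 0 per row

print("RMS after RMSNorm:   ", rms_of_rms_out)
print("Mean after RMSNorm:  ", mean_of_rms_out)
print("Mean after LayerNorm:", mean_of_ln_out)
```

RMSNorm only rescales each row to unit root mean square, so the row mean survives; LayerNorm removes it.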

In [6]:
#####################################
# from notebooks/3.GPT.ipynb
#####################################

# class GELU(nn.Module):
#     def __init__(self):
#         super().__init__()

#     def forward(self, x):
#         return 0.5 * x * (1 + torch.tanh(
#             torch.sqrt(torch.tensor(2.0 / torch.pi)) *
#             (x + 0.044715 * torch.pow(x, 3))
#         ))

class SiLU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x * torch.sigmoid(x)

In [7]:
our_silu = SiLU()
torch_silu = torch.nn.SiLU()

assert torch.allclose(our_silu(batch), torch_silu(batch))
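
For a feel of how SiLU differs from the tanh-approximated GELU used in the GPT notebook, the sketch below evaluates both at a few points with plain Python. The helper names `silu` and `gelu_tanh` are hypothetical and exist only for this comparison.

```python
import math


def silu(x):
    """SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu_tanh(x):
    """Tanh approximation of GELU, same formula as the commented-out GELU class."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={value:+.1f}  silu={silu(value):+.4f}  gelu={gelu_tanh(value):+.4f}")
```

Both functions pass through zero and dip slightly below zero for negative inputs; for positive inputs SiLU stays a little below GELU.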

In [72]:
#####################################
# from notebooks/3.GPT.ipynb
#####################################
# class FeedForward(nn.Module):
#     def __init__(self, cfg):
#         super().__init__()
#         self.layers = nn.Sequential(
#             nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
#             GELU(),
#             nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
#         )

#     def forward(self, x):
#         return self.layers(x)

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)
        self.silu = SiLU()

    def forward(self, x):
        x1 = self.fc1(x)  # gate branch
        x2 = self.fc2(x)  # up branch; reads the same input as fc1
        x = self.silu(x1) * x2
        return self.fc3(x)
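
In a SwiGLU layer, the gate and up projections both read the same input, and their elementwise product is projected back down. Here is a minimal functional sketch with toy sizes; the weight names `w_gate`, `w_up`, and `w_down` are illustrative stand-ins for `fc1`, `fc2`, and `fc3`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim, hidden_dim = 8, 16      # toy sizes, not the LLaMA 2 values
x = torch.randn(2, 3, emb_dim)   # (batch, tokens, emb_dim)

w_gate = torch.randn(hidden_dim, emb_dim)  # plays the role of fc1
w_up = torch.randn(hidden_dim, emb_dim)    # plays the role of fc2
w_down = torch.randn(emb_dim, hidden_dim)  # plays the role of fc3

gate = F.silu(F.linear(x, w_gate))  # (2, 3, hidden_dim)
up = F.linear(x, w_up)              # (2, 3, hidden_dim), computed from x, not from gate
out = F.linear(gate * up, w_down)   # (2, 3, emb_dim)

print(gate.shape, up.shape, out.shape)
```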


In [73]:
def precompute_theta_pos_frequencies(head_dim: int, seq_len: int, device: str, theta: float = 10000.0):
    # As written in the paragraph 3.2.2 of the paper
    # >> In order to generalize our results in 2D to any xi ∈ Rd where **d is even**, [...]
    assert head_dim % 2 == 0, "Dimension must be divisible by 2"
    # Build the theta parameter
    # According to the formula theta_i = 10000^(-2(i-1)/dim) for i = [1, 2, ... dim/2]
    # Shape: (Head_Dim / 2)
    theta_numerator = torch.arange(0, head_dim, 2).float()
    # Shape: (Head_Dim / 2)
    theta = 1.0 / (theta ** (theta_numerator / head_dim)).to(device) # (Dim / 2)
    # Construct the positions (the "m" parameter)
    # Shape: (Seq_Len)
    m = torch.arange(seq_len, device=device)
    # Multiply each theta by each position using the outer product.
    # Shape: (Seq_Len) outer_product* (Head_Dim / 2) -> (Seq_Len, Head_Dim / 2)
    freqs = torch.outer(m, theta).float()
    # We can compute complex numbers in the polar form c = R * exp(i * m * theta), where R = 1 as follows:
    # (Seq_Len, Head_Dim / 2) -> (Seq_Len, Head_Dim / 2)
    freqs_complex = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_complex

def apply_rotary_embeddings(x: torch.Tensor, freqs_complex: torch.Tensor, device: str):
    # Separate the last dimension pairs of two values, representing the real and imaginary parts of the complex number
    # Two consecutive values will become a single complex number
    # (B, Seq_Len, H, Head_Dim) -> (B, Seq_Len, H, Head_Dim/2)
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # Reshape the freqs_complex tensor to match the shape of the x_complex tensor. So we need to add the batch dimension and the head dimension
    # (Seq_Len, Head_Dim/2) --> (1, Seq_Len, 1, Head_Dim/2)
    freqs_complex = freqs_complex.unsqueeze(0).unsqueeze(2)
    # Multiply each complex number in the x_complex tensor by the corresponding complex number in the freqs_complex tensor
    # Which results in the rotation of the complex number as shown in the Figure 1 of the paper
    # (B, Seq_Len, H, Head_Dim/2) * (1, Seq_Len, 1, Head_Dim/2) = (B, Seq_Len, H, Head_Dim/2)
    x_rotated = x_complex * freqs_complex
    # Convert the complex number back to the real number
    # (B, Seq_Len, H, Head_Dim/2) -> (B, Seq_Len, H, Head_Dim/2, 2)
    x_out = torch.view_as_real(x_rotated)
    # (B, Seq_Len, H, Head_Dim/2, 2) -> (B, Seq_Len, H, Head_Dim)
    x_out = x_out.reshape(*x.shape)
    return x_out.type_as(x).to(device)
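
The reason rotary embeddings are applied to queries and keys is that the resulting attention score depends only on the distance between two positions. The sketch below checks this with plain Python complex numbers for a toy `head_dim` of 4. The helpers `rotate` and `score` are hypothetical and only mirror the pairing and frequency formula used above.

```python
import cmath


def rotate(pairs, pos, theta=10000.0):
    """Rotate each (even, odd) pair, viewed as one complex number, by pos * theta_i."""
    head_dim = 2 * len(pairs)
    return [
        z * cmath.exp(1j * pos * theta ** (-2 * i / head_dim))
        for i, z in enumerate(pairs)
    ]


def score(q, k):
    """Real dot product of two vectors stored as complex pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


q = [complex(0.3, -1.2), complex(0.7, 0.1)]   # toy query, head_dim = 4
k = [complex(-0.5, 0.4), complex(1.1, -0.9)]  # toy key, head_dim = 4

near_start = score(rotate(q, 5), rotate(k, 2))       # positions 5 and 2, offset 3
far_along = score(rotate(q, 105), rotate(k, 102))    # positions 105 and 102, offset 3

print(near_start, far_along)
```

The two scores agree up to floating-point error because the rotation angles cancel except for the offset, and the rotation itself leaves vector lengths unchanged.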


In [74]:
# Example input
batch_size = 4
seq_len = 10
head_dim = 16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Generate random input tensor (batch_size, seq_len, num_heads, head_dim)
x = torch.randn(batch_size, seq_len, 8, head_dim, device=device)

# Precompute frequency complex numbers
freqs_complex = precompute_theta_pos_frequencies(head_dim, seq_len, device)

# Apply rotary position embeddings
x_rot = apply_rotary_embeddings(x, freqs_complex, device)

print("Input shape:", x.shape)
print("Rotated shape:", x_rot.shape)   
print(x_rot[:5]) 


Input shape: torch.Size([4, 10, 8, 16])
Rotated shape: torch.Size([4, 10, 8, 16])
tensor([[[[ 4.7546e-01, -1.8017e-01,  1.5203e-01,  ...,  1.3537e+00,
            2.8199e-01, -2.5730e-01],
          [ 4.4417e-01,  5.0913e-02,  1.3995e+00,  ..., -6.5992e-01,
           -7.2023e-01, -6.0064e-01],
          [-5.6531e-02, -1.1139e+00,  3.6032e-01,  ...,  3.5922e-01,
            4.8927e-01, -2.7623e-01],
          ...,
          [ 1.2838e+00, -6.0748e-01, -6.2717e-01,  ..., -2.1262e-01,
            1.5759e-02, -6.4197e-02],
          [-9.8002e-01,  1.2238e+00, -2.2663e-01,  ..., -1.1541e+00,
            2.6489e-02, -4.8589e-01],
          [ 1.0867e+00, -2.3381e+00, -4.2838e-01,  ...,  3.9752e-02,
           -4.3039e-01, -5.5511e-02]],

         [[ 1.1230e+00, -3.1704e-01, -7.7769e-01,  ...,  1.2925e+00,
           -3.4155e-01, -1.5815e-02],
          [-1.3986e+00, -5.2680e-01, -1.5462e+00,  ...,  1.1274e+00,
            1.4862e+00, -2.7242e-01],
          [-2.8332e-01,  5.6965e-01,  ...

In [75]:
import torch
import torch.nn as nn
import math

# Define the MultiHeadAttention class
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, seq_len, num_heads):
        super(MultiHeadAttention, self).__init__()
        
        self.d_in = d_in  # Input dimension
        self.d_out = d_out  # Output dimension
        self.seq_len = seq_len  # Sequence length
        self.num_heads = num_heads  # Number of attention heads
        self.head_dim = d_out // num_heads  # Dimension of each head
        
        assert self.head_dim * num_heads == d_out, "Output dimension must be divisible by number of heads"
        
        # Linear layers for query, key, and value (LLaMA uses no bias terms)
        self.query = nn.Linear(d_in, d_out, bias=False)
        self.key = nn.Linear(d_in, d_out, bias=False)
        self.value = nn.Linear(d_in, d_out, bias=False)
        
        # Output projection
        self.fc_out = nn.Linear(d_out, d_out, bias=False)
        
        # Softmax for attention scores
        self.softmax = nn.Softmax(dim=-1)
        
        # Rotary embeddings are computed in forward() from the actual sequence
        # length, using the helper functions defined earlier in this notebook
        
    def forward(self, x):
        B, T, D_in = x.shape  # Batch size, sequence length, input dimension
        
        assert T <= self.seq_len, "Input sequence length must not exceed the configured sequence length"
        assert D_in == self.d_in, "Input dimension must match the configured input dimension"
        
        # Linear projections, split into heads: (B, T, num_heads, head_dim)
        Q = self.query(x).view(B, T, self.num_heads, self.head_dim)
        K = self.key(x).view(B, T, self.num_heads, self.head_dim)
        V = self.value(x).view(B, T, self.num_heads, self.head_dim)
        
        # Rotate queries and keys by their positions (values are left untouched)
        freqs_complex = precompute_theta_pos_frequencies(self.head_dim, T, x.device)
        Q = apply_rotary_embeddings(Q, freqs_complex, x.device)
        K = apply_rotary_embeddings(K, freqs_complex, x.device)
        
        # Move the head dimension forward: (B, num_heads, T, head_dim)
        Q, K, V = Q.transpose(1, 2), K.transpose(1, 2), V.transpose(1, 2)
        
        # Scaled dot-product attention with a causal mask, so each token
        # attends only to itself and to earlier positions
        energy = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, num_heads, T, T)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attention = self.softmax(energy.masked_fill(causal_mask, float("-inf")))
        
        # Apply attention weights to value
        out = torch.matmul(attention, V)  # (B, num_heads, T, head_dim)
        
        # Combine heads and pass through final linear layer
        out = out.transpose(1, 2).contiguous().view(B, T, self.d_out)  # (B, T, d_out)
        out = self.fc_out(out)  # (B, T, d_out)
        
        return out




In [76]:
# Set up the model parameters
batch_size = 32
seq_len = 10  # Ensure seq_len matches the configured sequence length
d_in = 64  # Input dimension
d_out = 128  # Output dimension
num_heads = 4  # Number of attention heads

# Select the device: CPU or GPU (if available)
device = 'cpu'  # You can use 'cuda' if you have a compatible GPU and CUDA is enabled

# Initialize the MultiHeadAttention model and move it to the selected device
mha = MultiHeadAttention(d_in, d_out, seq_len=seq_len, num_heads=num_heads).to(device)

# Generate a random input tensor with shape (batch_size, seq_len, d_in)
x = torch.randn(batch_size, seq_len, d_in, device=device)

# Forward pass through the model
output = mha(x)

# Print the output shape
print("Output shape:", output.shape)

Output shape: torch.Size([32, 10, 128])
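
A decoder-only model such as LLaMA 2 needs a causal mask so that a token cannot attend to later positions. Here is a minimal sketch of that masking step on its own, using uniform toy scores; the names `scores`, `mask`, and `weights` are illustrative.

```python
import torch

T = 4
scores = torch.zeros(T, T)  # pretend every query-key score is identical

# True above the diagonal marks the "future" positions to hide
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Masked entries become -inf, so softmax assigns them zero weight
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(weights)
```

Row `i` spreads its attention evenly over positions `0..i` and puts zero weight on everything after it.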


In [77]:
class Transformer(nn.Module):
    def __init__(self, cfg):
        super().__init__()
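        # Cast the attention weights to the configured dtype so they match
        # the embeddings and the feed-forward layers
        self.attn = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            seq_len=cfg["context_length"],
            num_heads=cfg["num_heads"],
        ).to(dtype=cfg["dtype"])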
        
        self.ff = FeedForward(cfg)
        ##########################################
        # self.norm1 = LayerNorm(cfg["emb_dim"])
        # self.norm2 = LayerNorm(cfg["emb_dim"])
        self.norm1 = RMSNorm(cfg["emb_dim"])
        self.norm2 = RMSNorm(cfg["emb_dim"])
        ########################################

        # self.drop_resid = nn.Dropout(cfg["drop_rate"])

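    def forward(self, x):
        # Attention sub-block: pre-normalization plus residual connection
        shortcut = x
        x = self.norm1(x)
        x = self.attn(x)
        x = x + shortcut
        # x = self.drop_resid(x)

        # Feed-forward sub-block: refresh the shortcut so the attention
        # output stays on the residual stream
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        # x = self.drop_resid(x)
        x = x + shortcut
        return x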



In [78]:
# class GPTModel(nn.Module):
class Llama2Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
        # self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        # self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[Transformer(cfg) for _ in range(cfg["n_layers"])])

        ################################### NEW ###################################
        # self.final_norm = LayerNorm(cfg["emb_dim"])
        self.final_norm = RMSNorm(cfg["emb_dim"])
        ###########################################################################
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

    def forward(self, in_idx):
        # batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        # pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds  # + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        # x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

In [79]:
LLAMA2_CONFIG_7B = {
    "vocab_size": 32000,     # Vocabulary size
    "context_length": 4096,  # Context length
    "emb_dim": 4096,         # Embedding dimension
    "num_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 11008,     # NEW: Size of the intermediate dimension in FeedForward
    "dtype": torch.bfloat16  # NEW: Lower-precision dtype to reduce memory usage
}
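
The config values can be sanity-checked with a bit of arithmetic. The sketch below counts parameters under the assumption that all linear layers are bias-free and that the input embedding and output head are not tied, which is how the LLaMA 2 7B checkpoint is laid out. The dictionary `cfg` repeats only the numeric entries from the config above.

```python
cfg = {"vocab_size": 32000, "emb_dim": 4096, "n_layers": 32, "hidden_dim": 11008}
d, h = cfg["emb_dim"], cfg["hidden_dim"]

embedding = cfg["vocab_size"] * d  # token embedding table
attention = 4 * d * d              # query, key, value, and output projections
feed_forward = 3 * d * h           # fc1, fc2, fc3
norms = 2 * d                      # two RMSNorm scale vectors per block
per_block = attention + feed_forward + norms

# embedding + all blocks + final RMSNorm + output head
total = embedding + cfg["n_layers"] * per_block + d + cfg["vocab_size"] * d

# A GPT-style feed-forward block with 4x expansion, for comparison
gpt_style_ffn = 2 * d * (4 * d)

print(f"Total parameters:      {total:,}")
print(f"bfloat16 weights:      {total * 2 / 1e9:.2f} GB")
print(f"SwiGLU feed-forward:   {feed_forward:,}")
print(f"GPT-style 4x FFN:      {gpt_style_ffn:,}")
```

The count also shows why `hidden_dim` is 11008 rather than `4 * emb_dim`: with three matrices instead of two, the smaller hidden size keeps the feed-forward block close to the parameter budget of a GPT-style block.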

In [None]:
model = Llama2Model(LLAMA2_CONFIG_7B)


In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device)

Note that Meta AI requires you to accept the Llama 2 license terms before you can download the files. To do this, create a Hugging Face Hub account and visit the meta-llama/Llama-2-7b repository to accept the terms.

In [None]:
from huggingface_hub import login
import json

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    access_token = config["HF_ACCESS_TOKEN"]

login(token=access_token)

In [None]:
from huggingface_hub import hf_hub_download

tokenizer_file = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b",
    filename="tokenizer.model",
    local_dir="Llama-2-7b"
)

In [None]:
import sentencepiece as spm


class LlamaTokenizer:
    def __init__(self, tokenizer_file):
        sp = spm.SentencePieceProcessor()
        sp.load(tokenizer_file)
        self.tokenizer = sp

    def encode(self, text):
        return self.tokenizer.encode_as_ids(text)

    def decode(self, ids):
        # The input is a list of token IDs, so decode IDs rather than pieces
        return self.tokenizer.decode_ids(ids)


tokenizer = LlamaTokenizer(tokenizer_file)

In [None]:
from UTILS.finetune_utils import generate, text_to_token_ids, token_ids_to_text
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves", tokenizer).to(device),
    max_new_tokens=30,
    context_size=LLAMA2_CONFIG_7B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

### **Load pretrained weights**

In [None]:
weights_file = hf_hub_download(
   repo_id="meta-llama/Llama-2-7b",
   filename="consolidated.00.pth",
   local_dir="Llama-2-7b"
)

In [None]:
weights = torch.load(weights_file, weights_only=True)


In [None]:
def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")

    if isinstance(right, torch.Tensor):
        return torch.nn.Parameter(right.clone().detach())
    else:
        return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_llama(model, param_config, params):
    model.tok_emb.weight = assign(model.tok_emb.weight, params["tok_embeddings.weight"])

    for l in range(param_config["n_layers"]):

        # Load attention weights (attribute names follow the MultiHeadAttention
        # and Transformer classes defined in this notebook)
        model.trf_blocks[l].attn.query.weight = assign(
            model.trf_blocks[l].attn.query.weight,
            params[f"layers.{l}.attention.wq.weight"]
        )
        model.trf_blocks[l].attn.key.weight = assign(
            model.trf_blocks[l].attn.key.weight,
            params[f"layers.{l}.attention.wk.weight"]
        )
        model.trf_blocks[l].attn.value.weight = assign(
            model.trf_blocks[l].attn.value.weight,
            params[f"layers.{l}.attention.wv.weight"]
        )
        model.trf_blocks[l].attn.fc_out.weight = assign(
            model.trf_blocks[l].attn.fc_out.weight,
            params[f"layers.{l}.attention.wo.weight"]
        )
        # RMSNorm stores its learnable gain as `scale`
        model.trf_blocks[l].norm1.scale = assign(
            model.trf_blocks[l].norm1.scale,
            params[f"layers.{l}.attention_norm.weight"]
        )

        # Load FeedForward weights
        model.trf_blocks[l].ff.fc1.weight = assign(
            model.trf_blocks[l].ff.fc1.weight,
            params[f"layers.{l}.feed_forward.w1.weight"]
        )
        # In Meta's checkpoint, w3 is the second input projection and w2 is the
        # output projection, so they map to fc2 and fc3 respectively
        model.trf_blocks[l].ff.fc2.weight = assign(
            model.trf_blocks[l].ff.fc2.weight,
            params[f"layers.{l}.feed_forward.w3.weight"]
        )
        model.trf_blocks[l].ff.fc3.weight = assign(
            model.trf_blocks[l].ff.fc3.weight,
            params[f"layers.{l}.feed_forward.w2.weight"]
        )
        model.trf_blocks[l].norm2.scale = assign(
            model.trf_blocks[l].norm2.scale,
            params[f"layers.{l}.ffn_norm.weight"]
        )

    # Load output layer weights
    model.final_norm.scale = assign(model.final_norm.scale, params["norm.weight"])
    model.out_head.weight = assign(model.out_head.weight, params["output.weight"])


load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device)
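
The shape check inside `assign` is what catches mapping mistakes, such as swapping the two feed-forward matrices that have transposed shapes. Below is a small sketch with toy sizes; `copy_checked` is a hypothetical stand-in that repeats only the tensor branch of `assign`, so the block runs on its own.

```python
import torch


def copy_checked(left, right):
    """Return `right` as a Parameter, but only if its shape matches `left`."""
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(right.clone().detach())


torch.manual_seed(0)
fc3 = torch.nn.Linear(16, 8, bias=False)  # toy down projection, weight shape (8, 16)
down_weight = torch.randn(8, 16)          # correct shape
up_weight = torch.randn(16, 8)            # shape of an up projection, wrong here

fc3.weight = copy_checked(fc3.weight, down_weight)  # accepted

try:
    copy_checked(fc3.weight, up_weight)
    caught = False
except ValueError as err:
    caught = True
    print(err)
```

Because the check only compares shapes, it cannot detect a swap between two matrices of identical shape, such as the query and key projections.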

In [None]:
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,
    context_size=LLAMA2_CONFIG_7B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))