# Transformer Teardown: Llama 3.1

> Trace an Inference Through Each Layer of the SOTA Llama 3.1 Foundation Models

In [the last Transformer Teardown](https://stickshift.github.io/2024/09/04/transformer-teardown.html), we dissected a DistilBERT text classification pipeline, tracing a single inference through the entire stack from raw data to final prediction. We learned about the main stages of a Transformer pipeline as well as fundamental Transformer concepts such as token embeddings and Multi-Head Self Attention. Studying BERT-based text classification models is a fantastic way to see the basic Transformer machinery in action. But BERT was published in 2018! It would be another 4 years before ChatGPT launched and Generative AI exploded onto the scene.

In this Transformer Teardown, we're going to fast forward to present day. We'll use the same teardown process to unpack the state-of-the-art [Llama 3.1](https://llama.meta.com/) open source foundation models released by Meta in July. We'll walk through each step of a text generation pipeline one cell at a time, tracing an inference from raw text to the first output token. We'll illustrate the main ideas from the latest Transformer literature with minimal, straightforward, working Python code, giving you a close-up view of the core mechanisms driving the Generative AI revolution.

# Setup

In [1]:
from functools import partial
import json
import math
import os
from pathlib import Path
from sys import stdout
from textwrap import dedent
import warnings

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from pandas import Series
from pydantic import BaseModel, validate_call
from pytest import approx
from tqdm import tqdm

import torch
from torch import nn, Tensor
from torch.nn.functional import relu, silu, softmax

from llama_models.llama3.reference_impl.model import RMSNorm

import stickshift as ss
from stickshift import default_arg, take
from stickshift.torch import device as torch_device

In [2]:
# Ignore all warnings
warnings.filterwarnings("ignore")

# Configure gpu
device = torch_device()

In [3]:
%%html
<style>
figure > img {
    display:block;
    margin-left: auto !important;
    margin-right: auto !important;
}
figcaption {
    text-align: center;
}
blockquote {
    margin-top: 2.0rem !important;
    margin-bottom: 2.0rem !important;
    margin-left: 0 !important;
    margin-right: 0 !important;
    padding: 1.0rem !important;
    background-color: rgba(0,0,0,0.05) !important;
    border: 1px solid rgba(0,0,0,0.1) !important;
    font-style: italic !important;
}
blockquote p {
    margin: 0 !important;
    padding: 0 !important;
}
</style>

In [4]:
%pprint

Pretty printing has been turned OFF


# Llama Foundation Models

[Llama](https://llama.meta.com/) is a family of general purpose, state-of-the-art open source foundation models from Meta. According to the 3.1 technical report, the latest models can "answer questions in at least 8 languages, write high quality code, solve complex reasoning problems, and use tools in a zero-shot way." (Dubey et al. 2024) The Llama 3.1 release includes 8B, 70B, and 405B sizes. While you need a multi-GPU cluster to run the 70B and 405B sizes, the 8B model is small enough to experiment with on a laptop. Not only did Meta release the pre-trained model checkpoints for all 3 sizes, they also published a fantastically detailed, [70 page technical report](https://arxiv.org/abs/2407.21783v2) as well as a complete [reference implementation](https://github.com/meta-llama/llama-models). Together, Llama 3.1 represents both a tremendous contribution to the AI community as well as an incredible learning opportunity to study the inner workings of a modern frontier model.

Over the course of this post, we'll implement a complete text generation pipeline using only the research literature, pre-trained weights from the `Meta-Llama3.1-8B-Instruct` checkpoint, and Meta's reference implementation as a guide. After we load the 8B checkpoint, we'll review the stages of an end-to-end, text generation pipeline. In the sections that follow, we'll walk through a detailed teardown of each stage—tracing an inference from raw data to the first output token. In the last section, we'll put all the pieces together into a complete generative Transformer capable of producing long form content.

Let the teardown begin!

# Model Checkpoint

We'll start by loading the configuration and pre-trained weights for the `Meta-Llama3.1-8B-Instruct` checkpoint. The "instruct" versions of the Llama models include the raw pre-training and substantial post-training to support user and assistant interactions and complex tool-calling scenarios. The weights for all Llama checkpoints can be downloaded directly from [Meta](https://llama.meta.com/), [Hugging Face](https://huggingface.co/meta-llama), and [Kaggle](https://www.kaggle.com/organizations/metaresearch/models).

In [5]:
# Import custom utilities
from stickshift.models import llama

# Load model config
config = llama.config("Meta-Llama3.1-8B-Instruct")

# Load pre-trained model parameters
checkpoint = torch.load(
    config.checkpoint_path / "consolidated.00.pth", 
    weights_only=True, 
    map_location=device,
)

config.model_dump()

{'checkpoint_path': PosixPath('/Users/andrewyoung/.llama/checkpoints/Meta-Llama3.1-8B-Instruct'), 'vocab_size': 128256, 'd_model': 4096, 'd_head': 128, 'd_ffn': 14336, 'n_layers': 32, 'n_heads': 32, 'n_kv_heads': 8, 'n_kv_groups': 4, 'rms_norm_eps': 1e-05, 'rope_theta': 500000.0, 'max_seq_len': 8192, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9, 'max_output_tokens': 500}

We'll reference a number of the settings in `config` throughout the teardown. For now, a few interesting ones to note are `d_model`, `d_ffn`, `n_layers`, and `n_heads`. These represent the primary differences between the 8B, 70B, and 405B sizes.
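As a rough sanity check on these settings, the sketch below estimates the parameter count implied by the config. It relies on assumptions that `config` does not state: bias-free projections, a gated FFN with three weight matrices, two normalization vectors per layer plus a final one, and an output projection that is not tied to the embeddings table.

```python
# Values copied from the config output above
vocab_size, d_model, d_head, d_ffn = 128256, 4096, 128, 14336
n_layers, n_heads, n_kv_heads = 32, 32, 8

# Attention: query and output projections are d_model x (n_heads * d_head);
# key and value projections are d_model x (n_kv_heads * d_head)
attention = 2 * d_model * n_heads * d_head + 2 * d_model * n_kv_heads * d_head

# FFN: gate, input, and output projections (assumed bias-free)
ffn = 3 * d_model * d_ffn

# Two normalization weight vectors per layer
norms = 2 * d_model

per_layer = attention + ffn + norms

# Embeddings table, all layers, final norm, and an untied output projection (assumptions)
total = vocab_size * d_model + n_layers * per_layer + d_model + vocab_size * d_model

print(f"{total:,}")  # 8,030,261,248
```

Under these assumptions the total lands at roughly 8 billion, which is consistent with the "8B" name. The per-layer term dominates, which is why `d_model`, `d_ffn`, and `n_layers` matter so much across sizes.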

In [6]:
def load_pretrained_state(layer):    
    # Load pre-trained state
    llama.load_state(
        normalize_attention, "normalize_attention", 
        normalize_ffn, "normalize_ffn", 
        w_q, "w_q", 
        w_k, "w_k", 
        w_v, "w_v", 
        attention_outputs, "attention_outputs",
        ffn_gates, "ffn_gates",
        ffn_inputs, "ffn_inputs",
        ffn_outputs, "ffn_outputs",
        checkpoint=checkpoint,
        layer=layer,
    ) 

# Text Generation Pipeline

In [the last teardown](https://stickshift.github.io/2024/09/04/transformer-teardown.html), we looked at a text classification Transformer. This time we're going to dissect a *text generation* Transformer. Instead of simply applying a label to the input text, the Head stage will be responsible for *generating* new content. But don't worry! It's not as complicated as it sounds.

Figure X illustrates the stages in a text generation pipeline. It's very similar to the text classification pipeline we looked at last time. The Tokenize stage splits raw text into a sequence of tokens. The Embeddings stage maps the sequence of tokens to a sequence of embedding vectors. The Context Layers augment the embeddings with contextual signals drawn from the surrounding tokens, transforming individual token embeddings into contextualized "semantic embeddings". Finally, the Head stage converts the semantic embeddings into predictions. The key difference is, instead of predicting a label for the raw text, text generation Transformers *predict the next token*.

<figure>
<img src="transformer-pipeline.svg" width="940">
<figcaption>Figure X: Text Generation Pipeline</figcaption>
</figure>

But one token is just the beginning! The magical powers of Generative AI are manifested by simply running the token predictions in a loop. The predicted token in each iteration is appended to the end of the input sequence, and the process repeats. Over and over again.
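The loop described above can be sketched in a few lines. This is a minimal illustration, not the pipeline we build later: `generate`, `predict_next_token`, and `toy_predict` are hypothetical names, and the toy predictor simply counts upward so the loop can run without a model.

```python
def generate(token_ids, predict_next_token, stop_token, max_output_tokens):
    """Append predicted tokens until a stop token or the output limit is reached."""
    token_ids = list(token_ids)
    output = []
    for _ in range(max_output_tokens):
        next_token = predict_next_token(token_ids)  # one full pass through the pipeline
        if next_token == stop_token:
            break
        output.append(next_token)
        token_ids.append(next_token)  # the prediction becomes part of the next input
    return output


# Toy stand-in for the model (hypothetical): counts up from the last token, then stops
def toy_predict(token_ids):
    return token_ids[-1] + 1 if token_ids[-1] < 5 else -1


print(generate([1, 2], toy_predict, stop_token=-1, max_output_tokens=10))  # [3, 4, 5]
```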

# Raw Text

Before we can tear anything down, we need a prompt. Since our goal is to trace an inference from raw text to the first output token, we want to start with a prompt that's specific enough to generate a consistent, one-word answer. If we do everything right, the first output token we predict should be "Boston".

In [7]:
# Prompt
prompt = "<|start_header_id|>user<|end_header_id|>\n\n"
prompt += "What is the capital of Massachusetts? Answer in one word."
prompt += "<|eot_id|>"
prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"

You can see `prompt` includes a number of special tokens. These would usually be injected by a framework like Hugging Face's `transformers`. We need to manually inject them because we're working with the model directly. You can find more information on the Llama 3.1 prompt syntax in the [Llama Prompting Guide](https://www.llama.com/docs/how-to-guides/prompting).
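To make the pattern in `prompt` explicit, here is a small sketch of a helper that renders a list of messages using the same special tokens. `format_prompt` is a hypothetical function written for illustration; it only covers the header and end-of-turn tokens used in this post, not the full prompt syntax.

```python
def format_prompt(messages):
    """Render (role, content) pairs, then cue the assistant to respond."""
    prompt = ""
    for role, content in messages:
        prompt += f"<|start_header_id|>{role}<|end_header_id|>\n\n"
        prompt += f"{content}<|eot_id|>"
    # Open an assistant turn so the next predicted token starts the answer
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = format_prompt(
    [("user", "What is the capital of Massachusetts? Answer in one word.")]
)
print(example)
```

For a single user message, the helper produces the same string as the cell above.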

# Tokenize

The Tokenize stage splits raw text into a sequence of tokens using a fixed vocabulary. Llama uses a vocabulary of 128k tokens built on top of OpenAI's tiktoken tokenizer. We'll dig into the gory details in the later stages, but here we'll simply use the off-the-shelf Tokenizer from Meta's llama-models reference implementation.

In [8]:
from llama_models.llama3.api.tokenizer import Tokenizer

# Load tokenizer model from checkpoint
tokenizer = Tokenizer(str(config.checkpoint_path / "tokenizer.model"))

In [9]:
# Split raw text into tokens
token_ids = tokenizer.encode(prompt, bos=True, eos=False, allowed_special="all")
token_ids

[128000, 128006, 882, 128007, 271, 3923, 374, 279, 6864, 315, 22108, 30, 22559, 304, 832, 3492, 13, 128009, 128006, 78191, 128007, 271]

In [10]:
len(token_ids)

22

We see `tokenizer.encode` split our prompt into 22 token ids. These ids represent the index of each token in Llama's 128k token vocabulary. We can always reverse the process with `tokenizer.decode`. If you look closely at the cell output below, you'll notice the tokenizer injected another special token `(128000, '<|begin_of_text|>')` to mark the beginning of the sequence.

In [11]:
# Decode token ids back into raw text
tokenizer.decode(token_ids)

'<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of Massachusetts? Answer in one word.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

In [12]:
# Load token_ids into a tensor
x = torch.tensor(token_ids, device=device)

x.shape

torch.Size([22])

# Embeddings

Embeddings are a key component of the Transformer architecture. They're also abstract mathematical structures that can be difficult to wrap your head around. To illustrate the crucial role embeddings play, let's use a quick metaphor.

> If a Transformer was a brain, then embeddings would be the electrical signals carrying information through the brain.

Continuing with the metaphor, the Embeddings stage of the pipeline would be your sensory organs where light rays and air vibrations are translated into electrical impulses. Token embeddings would be the fresh sensory percepts. Semantic embeddings would be the abstract thoughts at the top of the cortical stack. The idea of percepts traveling up the cortical stack is a perfect analogy for token embeddings traveling through the Transformer layers.

Implementing Llama's Embeddings stage is relatively straightforward. We'll use a lookup table with a unique embedding for each of the 128k tokens in the vocabulary. Each embedding is a vector with $d_{model}$ elements that were randomly generated and then learned during training. Given a sequence of token ids, the lookup table returns their embeddings as row vectors stacked in an $n \times d_{model}$ tensor. 

<figure>
<img src="embeddings.svg" width="940">
<figcaption>Figure X: Learned Token Embeddings</figcaption>
</figure>

In [13]:
# Initialize embeddings lookup table
embeddings = nn.Embedding(
    num_embeddings=config.vocab_size, 
    embedding_dim=config.d_model,
    device=device,
)

# Load pre-trained state
llama.load_state(embeddings, "embeddings", checkpoint=checkpoint)

In [14]:
# Map token ids to embeddings
x = embeddings(x)

x.shape

torch.Size([22, 4096])

We can see from `x.shape` that we successfully mapped the 22 token ids to 22 token embeddings stacked in an $n \times d_{model}$ tensor.
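If the lookup table still feels abstract, the toy sketch below shows that an embedding lookup is just row indexing into a weight matrix. The sizes (10 tokens, 4 dimensions) and the names `toy_embeddings` and `toy_ids` are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy table: 10-token vocabulary, 4-dimensional embeddings (illustrative sizes)
toy_embeddings = nn.Embedding(num_embeddings=10, embedding_dim=4)

toy_ids = torch.tensor([7, 2, 7])

# Looking up ids is equivalent to indexing rows of the weight matrix
looked_up = toy_embeddings(toy_ids)
indexed = toy_embeddings.weight[toy_ids]

print(looked_up.shape)                           # torch.Size([3, 4])
print(torch.equal(looked_up, indexed))           # True
print(torch.equal(looked_up[0], looked_up[2]))   # True: same id, same embedding
```

Note that the same token id always maps to the same vector, no matter where it appears in the sequence. Adding context is the job of the layers that follow.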

In [15]:
# Show sample
x

tensor([[ 2.6512e-04, -4.9973e-04, -5.8365e-04,  ...,  3.8147e-03,
          6.3419e-05,  1.1902e-03],
        [-1.6499e-04, -2.4319e-04,  1.6403e-04,  ..., -1.5163e-04,
          3.5095e-04,  7.3242e-04],
        [ 3.5095e-03,  7.2021e-03,  5.3406e-05,  ..., -7.2479e-04,
         -1.0620e-02,  8.2779e-04],
        ...,
        [-9.7656e-03, -3.4637e-03,  1.8616e-03,  ..., -7.1411e-03,
         -4.3030e-03,  8.6060e-03],
        [-4.6158e-04, -3.9291e-04, -6.5863e-06,  ..., -6.2561e-04,
         -5.0354e-04,  6.6757e-04],
        [-2.8687e-03,  3.8910e-03, -1.7357e-04,  ...,  8.0872e-04,
          5.0354e-04,  2.3041e-03]], device='mps:0',
       grad_fn=<EmbeddingBackward0>)

Before we move on, a quick note on terminology. If you've used LLM APIs or frameworks like OpenAI or LangChain, you're likely familiar with the term "embedding model". An *embedding model* is really a combination of a tokenizer and embeddings table. These are often bundled together to give you everything you need to convert raw text into embedding vectors and can be used for a number of things independent of the LLM.

Now that we've converted our raw text into token embeddings, it's time to start transforming!

# Context Layers

Context layers are where the Transformer magic happens. Collectively, the Context Layers are responsible for transforming a sequence of token embeddings into a sequence of semantic embeddings. The mechanism works by passing the embeddings through multiple layers of attention and feedforward blocks. The attention blocks focus on relationships between embeddings, augmenting each one with a weighted combination of the surrounding embeddings. The feedforward blocks capitalize on the extra context, transforming each augmented embedding with the non-linear magic of a fully-connected multilayer perceptron. By stacking multiple layers together, Transformers repeat the pattern of attention and transformation, gradually converting representations of individual words into representations of abstract semantic concepts.

<figure>
<img src="transformer-layers.svg" width="940">
<figcaption>Figure X: Context Layers</figcaption>
</figure>

Figure X illustrates the flow of information through a single layer. Embeddings are first passed to the Attention block. The attention outputs are added to the attention inputs before being passed to the FFN block. Similarly, the FFN outputs are added to the FFN inputs before being passed to the next layer. Adding the inputs and outputs of each block is known as "residual learning" and is critical for providing a stable path for gradient flow during training (He et al. 2015).
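The residual flow can be sketched structurally as follows. This is only an outline of the data path: `layer`, `attention`, and `ffn` are placeholder names, and the stand-in blocks below are hypothetical callables rather than real Attention or FFN implementations.

```python
import torch


def layer(x, attention, ffn):
    """One context layer: each block's output is added back to its input."""
    x = x + attention(x)  # residual connection around the Attention block
    x = x + ffn(x)        # residual connection around the FFN block
    return x


# Hypothetical stand-in block that contributes nothing
zero_block = lambda x: torch.zeros_like(x)

x_toy = torch.randn(3, 4)

# With blocks that output zeros, the layer reduces to the identity
print(torch.equal(layer(x_toy, zero_block, zero_block), x_toy))  # True
```

The identity case hints at why residuals help: even if a block contributes little, the embeddings and their gradients still have a direct path through the layer.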

## Decoder-Only Model Architecture

Like most of today's generative models, Llama uses a "decoder-only" model architecture. Instead of using the fully connected self attention we saw in [the DistilBERT teardown](https://stickshift.github.io/2024/09/04/transformer-teardown.html), the context layers in Llama use *masked self attention*. The "decoder-only" term comes from the "Attention is All You Need" paper, where Vaswani et al. described layers of self attention as "encoder layers" and layers of masked self attention as "decoder layers". While Vaswani et al.'s *Vanilla* Transformer architecture processed inputs and outputs with encoder layers and decoder layers respectively, later researchers showed that by adding more compute you could achieve the same goals using a single stack of decoder layers. For a fascinating discussion of how decoders became the dominant architecture, I highly recommend watching [Hyung Won Chung's guest lecture at Stanford on the Future of AI](https://youtu.be/orDKvo8h71o?si=J2sxhYtL9LCd6IRk) from April of this year.

## Attention

Attention is the signature component in the Transformer architecture. In the 7 years since Vaswani et al. published "Attention is All You Need", researchers have experimented with numerous attention variations of all shapes and sizes. Before we jump into the code, we'll quickly review the fundamental concepts behind attention followed by details on the specific approach chosen by the Llama authors.

### What is Attention?

Given our input embeddings are stacked in an $n \times d_{model}$ tensor $\mathbf{X}$, the goal of attention is to map each embedding $\set{\mathbf{x}_i \mid \mathbf{x}_i \in \mathbf{X}}$ to an attention representation $\mathbf{a}_i$ that includes relevant contextual signals drawn from the rest of the embeddings $\set{\mathbf{x}_j \mid \mathbf{x}_j \in \mathbf{X}, i \neq j}$. 

For example, let's imagine we've mapped the sentence `I love New York` to the sequence of token embeddings $\mathbf{x} = [E_{I}, E_{love}, E_{New}, E_{York}]$. The embedding $\mathbf{x}_2$ represents the word `New` *in isolation*. The word `New` can mean a lot of things; many of which have nothing to do with this sentence. Our goal would be to generate an attention representation $\mathbf{a}_2$ containing signals from the other embeddings $\set{E_{I}, E_{love}, E_{York}}$ that would help us create a *better* version of $\mathbf{x}_2$:

$$
\mathbf{x}_{2*} = \mathbf{x}_{2} + \mathbf{a}_{2}
$$

Let's assume each embedding contributes "something" to $\mathbf{a}_i$. Even though we can't quantify "something" yet, we can write $\mathbf{a}_i$ as a sum over an unknown function $f_A$ of the two embeddings $\mathbf{x}_i$, $\mathbf{x}_j$:

$$
\mathbf{a}_i = \sum_{\mathbf{x}_j \in \mathbf{X}} f_A(\mathbf{x}_i, \mathbf{x}_j)
$$

All of the attention variations in the Transformer literature—e.g. Self Attention, Multi-Head Self Attention, Linear Attention, Grouped Query Attention—are different approaches to implement $f_A$. In practice, the authors of a new Transformer model start with Vaswani et al.'s attention definition and then select from the large, a la carte menu of improvements that have been published since, resulting in their own unique variation of attention.

### Attention in Llama 3.1

Defining attention in Llama 3.1 requires a bit of backtracking. Most of the details can be found in Llama 1 (Touvron et al. 2023) with a few changes in Llama 2 (Touvron et al. 2023) and only minor adjustments in Llama 3 (Dubey et al. 2024).

Starting with the standard Masked Self Attention definition from Vaswani et al. (2017), Llama adopts the following improvements that factor into the attention calculation:

* Pre-normalization with RMSNorm: improves training stability and inference speed.
* Grouped Query Attention (GQA) with 8 key/value heads: improves inference speed and reduces memory overhead of key-value caches.
* Rotary Position Embedding (RoPE) with $\Theta = 500,000$: relative position encoding improves performance on longer context windows.

We'll start by describing the standard Masked Self Attention and then describe how these improvements modify the final attention calculation in Llama.

### Masked Self Attention

Given $n$ input embeddings of length $d_{model}$ stacked in an $n \times d_{model}$ tensor $\mathbf{X}$, the standard masked self attention algorithm from Vaswani et al. can be expressed using the following equation:

$$
\begin{equation}
\mathbf{A} = softmax\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_K}} + \mathbf{M}\right)\mathbf{V} \\
\end{equation}
$$

where

* $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are all linear projections of $\mathbf{X}$.
* $\mathbf{Q}$ is an $n \times d_K$ tensor of *queries* that represent selection criteria for surrounding embeddings that would add valuable context to the current representation.
* $\mathbf{K}$ is an $n \times d_K$ tensor of *keys* that represent characteristics that satisfy the selection criteria in the queries.
* $\mathbf{V}$ is an $n \times d_{model}$ tensor of *values* that represent the contextual signals to transfer from one embedding to another.
* The $\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_K}}$ term is an $n \times n$ tensor of raw attention scores. After masking and $softmax$, these become the "attention weights" that define how much of each embedding's values to include from $\mathbf{V}$.
* $\mathbf{M}$ is an $n \times n$ attention mask that prevents earlier embeddings from attending to later ones by adding a $-\infty$ bias to the later embeddings' scores.
* $\mathbf{A}$ is an $n \times d_{model}$ tensor of attention representations.

The idea is the $\mathbf{Q}\mathbf{K}^T$ term measures the *angular distance* between each query vector $\mathbf{q}_i$ and key vector $\mathbf{k}_j$ by taking their dot product. For vectors of similar magnitude, the smaller the angle, the larger the dot product, and the better the match between $\mathbf{q}_i$ and $\mathbf{k}_j$. The result of the $\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_K}}$ term is an $n \times n$ tensor of raw scores where row $i$ represents query $\mathbf{q}_i$ and column $j$ represents how well key $\mathbf{k}_j$ matches $\mathbf{q}_i$.

The rationale for the mask term $\mathbf{M}$ requires a backstory. In the original Transformer architecture, decoder layers played a more limited role. All of the input tokens for our prompt "What is the capital of Massachusetts?" would be processed by the encoder while the output tokens would be handled by the decoder. Since all of the tokens in our prompt are available when we start, the encoder's Attention blocks can connect them all together using fully connected self attention. But this isn't true for the decoder. Since the model only generates one token at a time, the output tokens generated at the beginning aren't allowed to attend to tokens generated later. Put another way, the words I say now can't be influenced by the words I say later because they haven't happened yet. This idea is referred to as "causal attention" and is implemented using the mask term $\mathbf{M}$.

As described earlier, decoder-only architectures that process both input and output tokens with the same layers have taken over as the dominant standard for generative Transformers. *This can be really confusing at first!* Why shouldn't we allow $E_{capital}$ to attend to $E_{Massachusetts}$? Because decoder-only models have repeatedly demonstrated simpler single-stack architectures scale better. Masking the input tokens is an intentional trade off.

### Grouped Query Attention (GQA)

The standard Masked Self Attention definition from Vaswani et al. (2017) is quadratic in the sequence length $n$ for both computational complexity and memory overhead.

### Rotary Position Embedding (RoPE)

In [None]:
### Masked Self Attention

Given $n$ input embeddings of length $d$ stacked in an $n \times d$ tensor $\mathbf{X}$, the masked self attention function used by Llama can be expressed using the following equation:

$$
\begin{equation}
\mathbf{A} = softmax\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}} + \mathbf{M}\right)\mathbf{V} \\
\end{equation}
$$

where

* $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are all linear projections of $\mathbf{X}$.
* $\mathbf{Q}$ is an $n \times d$ tensor of *queries* that represent selection criteria for surrounding embeddings that would add valuable context to the current representation.
* $\mathbf{K}$ is an $n \times d$ tensor of *keys* that represent characteristics that satisfy the selection criteria in the queries.
* $\mathbf{V}$ is an $n \times d$ tensor of *values* that represent the contextual signals to transfer from one embedding to another.
* The $\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}$ term is an $n \times n$ tensor of raw attention scores. After masking and $softmax$, these become the "attention weights" that define how much of each embedding's values to include from $\mathbf{V}$.
* $\mathbf{M}$ is an $n \times n$ attention mask that prevents earlier embeddings from attending to later ones by adding a $-\infty$ bias to the later embeddings' scores.
* $\mathbf{A}$ is an $n \times d$ tensor of attention representations.

The idea is the $\mathbf{Q}\mathbf{K}^T$ term calculates the distance between each query $\mathbf{q}_i$ and key $\mathbf{k}_j$ where *distance* is measured by the angle between the vectors. The smaller the angle, the closer the vectors, and the better the match between $\mathbf{q}_i$ and $\mathbf{k}_j$. The result of the $\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}$ term is an $n \times n$ tensor of scores between each query and key. Next, the attention mask $\mathbf{M}$ is added to prevent later embeddings from influencing earlier embeddings by adding a $-\infty$ bias to their scores. Finally, these are all passed to $softmax$ to normalize the scores across the keys. The result is an $n \times n$ tensor where each row $i$ represents query $\mathbf{q}_i$ and each column $j$ represents how well key $\mathbf{k}_j$ matches $\mathbf{q}_i$. These scores are then applied to the values in $\mathbf{V}$.

In [None]:
### Position Encoding with RoPE

The relevance of one embedding to another is heavily influenced by the distance between them. We saw last time that BERT-based models encode the absolute token positions directly in the token embeddings. More recent models including Llama have adopted *relative* position encoding schemes that have been shown to perform better on much longer sequences.

Llama uses an approach known as Rotary Position Embedding (RoPE) from Su et al. (2021). Instead of baking the positions into the token embeddings at the beginning, RoPE integrates the embedding positions into the attention calculation. We saw in the last section that the attention calculation relies on the angular distance between queries and keys. The main idea behind RoPE is to convert the distance between two embeddings in the sequence into the angular distance between their vectors.

<figure>
<img src="rope.svg" width="940">
<figcaption>Figure X: RoPE Concept in 2D</figcaption>
</figure>

Figure X illustrates the intuition behind RoPE in 2 dimensions. On the left, we have a sequence of 2-dimensional vectors $\begin{bmatrix}\mathbf{x}_0 & \mathbf{x}_1 & \mathbf{x}_2\end{bmatrix}^T$ with their 2-dimensional geometric interpretation plotted on the right. The matrix in the center represents a rotational transformation. Given an embedding's position $m$, the idea behind RoPE is to rotate the embedding vector by an angle of $m \theta$. Following this idea, $\mathbf{x}_0$ would stay the same, $\mathbf{x}_1$ would be rotated by $\theta$, and $\mathbf{x}_2$ would be rotated by $2 \theta$.
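A short 2D sketch shows why rotating by position encodes *relative* distance. The vectors `q` and `k`, the step `theta`, and the helper names are toy choices for illustration only.

```python
import math


def rotate(vec, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    x, y = vec
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


theta = 0.1                     # illustrative rotation step
q, k = (1.0, 2.0), (0.5, -1.5)  # toy query and key

# Same offset (3 positions apart) at two different absolute positions
near = dot(rotate(q, 5 * theta), rotate(k, 2 * theta))
far = dot(rotate(q, 105 * theta), rotate(k, 102 * theta))

print(math.isclose(near, far))  # True: the score depends only on the offset
```

Rotating both vectors by the same extra angle leaves the angle between them unchanged, so the dot product only sees how far apart the two positions are.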

While Figure X illustrates the idea behind RoPE in 2D, implementing it for $d$-dimensional vectors is a little more complicated. In practice, RoPE splits each embedding $\mathbf{x}$ into pairs $\set{(x_0, x_1), (x_2, x_3), \dots, (x_{d-2}, x_{d-1})}$, and then applies a 2D rotation to each pair, just like Figure X, using the rotation matrix $\mathbf{R}_{\Theta,m}^d$.

Given a hyperparameter $\Theta$,

$$
\begin{align}
\mathbf{R}_{\Theta,m}^d &= 
\begin{bmatrix}
cos(m \theta_0) & -sin(m \theta_0) & 0 & 0 & \dots & 0 & 0 \\
sin(m \theta_0) & cos(m \theta_0) & 0 & 0 & \dots & 0 & 0 \\
0 & 0 & cos(m \theta_1) & -sin(m \theta_1) & \dots & 0 & 0 \\
0 & 0 & sin(m \theta_1) & cos(m \theta_1) & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & 0 & 0 & cos(m \theta_{d/2-1}) & -sin(m \theta_{d/2-1}) \\
0 & 0 & \dots & 0 & 0 & sin(m \theta_{d/2-1}) & cos(m \theta_{d/2-1}) \\
\end{bmatrix} \\
\text{where } \theta_i &= \frac{1}{\Theta^{2i/d}}, \quad i \in \{0, 1, \dots, d/2-1\} \text{ is the pair index, and } d = d_{head}
\end{align}
$$

Luckily, Su et al. (2021) give us a more compact approach to apply $\mathbf{R}_{\Theta,m}^d$ in practice. To implement RoPE, we'll calculate the $cos$ and $sin$ terms in the following equation up front and then share them across all the layers.

$$
\mathbf{R}_{\Theta,m}^d \mathbf{x} = 
\begin{bmatrix}
x_0 \\
x_1 \\
x_2 \\
x_3 \\
\vdots \\
x_{d-2} \\
x_{d-1} \\
\end{bmatrix} \odot
\begin{bmatrix}
cos(m \theta_0) \\
cos(m \theta_0) \\
cos(m \theta_1) \\
cos(m \theta_1) \\
\vdots \\
cos(m \theta_{d/2-1}) \\
cos(m \theta_{d/2-1}) \\
\end{bmatrix}
+
\begin{bmatrix}
-x_1 \\
x_0 \\
-x_3 \\
x_2 \\
\vdots \\
-x_{d-1} \\
x_{d-2} \\
\end{bmatrix} \odot
\begin{bmatrix}
sin(m \theta_0) \\
sin(m \theta_0) \\
sin(m \theta_1) \\
sin(m \theta_1) \\
\vdots \\
sin(m \theta_{d/2-1}) \\
sin(m \theta_{d/2-1}) \\
\end{bmatrix} = 
\begin{bmatrix}
x_0 cos(m \theta_0) - x_1 sin(m \theta_0)\\
x_0 sin(m \theta_0) + x_1 cos(m \theta_0)\\
x_2 cos(m \theta_1) - x_3 sin(m \theta_1)\\
x_2 sin(m \theta_1) + x_3 cos(m \theta_1)\\
\vdots \\
x_{d-2} cos(m \theta_{d/2-1}) - x_{d-1} sin(m \theta_{d/2-1})\\
x_{d-2} sin(m \theta_{d/2-1}) + x_{d-1} cos(m \theta_{d/2-1})\\
\end{bmatrix}
$$
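As a quick check, the sketch below builds the full block-diagonal matrix for a toy dimension and confirms that the compact element-wise form gives the same result. The values `d = 8` and `m = 3` are arbitrary, and the variable names are assumptions of this sketch.

```python
import torch

torch.manual_seed(0)

d, m, base = 8, 3, 500000.0  # toy dimension, position, and RoPE base

# theta_i = base^(-2i/d) for each pair i = 0 .. d/2 - 1
thetas = 1.0 / base ** (2 * torch.arange(d // 2) / d)
angles = m * thetas

# Explicit block-diagonal rotation matrix
R = torch.zeros(d, d)
for i, a in enumerate(angles):
    c, s = torch.cos(a), torch.sin(a)
    R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c, -s
    R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = s, c

x = torch.randn(d)

# Compact form: repeat each cos/sin so both elements of a pair share one angle
cos = torch.cos(angles).repeat_interleave(2)
sin = torch.sin(angles).repeat_interleave(2)

# Swap each pair and negate the first element: (-x1, x0, -x3, x2, ...)
x_swapped = torch.stack((-x[1::2], x[0::2]), dim=-1).flatten()

compact = x * cos + x_swapped * sin

print(torch.allclose(R @ x, compact, atol=1e-6))  # True
```

The compact form avoids materializing a $d \times d$ matrix that is mostly zeros. This sketch follows the interleaved pairing used in the equation.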

In [27]:
# Hyperparams
base = config.rope_theta
d = config.d_head

In [28]:
# Compute theta_i = 1 / base^(2i/d) from i = 0 to d/2-1
thetas = 1.0 / base**(2 * torch.arange(d // 2, device=device) / d)

thetas.shape

torch.Size([64])

In [16]:
def rope(n):
    """Calculate RoPE rotation matrices to be shared across all layers."""
    
    # Hyperparams
    base = config.rope_theta
    d = config.d_head
    
    # Compute theta_i = 1 / base^(2i/d) from i = 0 to d/2-1
    thetas = 1.0 / base**(2 * torch.arange(d // 2, device=device) / d)
    
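    # Compute m * theta_i for each position m in 0..n-1
    frequencies = torch.stack([m*thetas for m in range(n)])
    
    # Concatenate two copies along the last dim so each row has d entries,
    # laid out as [theta_0..theta_{d/2-1}, theta_0..theta_{d/2-1}]
    frequencies = torch.cat((frequencies, frequencies), dim=-1)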
    
    # Apply cos, sin
    cos = torch.cos(frequencies)
    sin = torch.sin(frequencies)
    
    # Sanity check
    assert cos.shape[0] == n and cos.shape[1] == config.d_head
    assert sin.shape[0] == n and sin.shape[1] == config.d_head

    return cos, sin

In [17]:
# Compute RoPE rotation matrices
rope_cos, rope_sin = rope(len(x))

rope_cos.shape, rope_sin.shape

(torch.Size([22, 128]), torch.Size([22, 128]))

### Attention Workflow

Before we get to the attention code, let's rewrite the attention equation one more time with all of the elements we need to calculate.

$$
\begin{align}
\mathbf{A} &= softmax\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}} + \mathbf{M}\right)\mathbf{V} \\
\end{align}
$$

expands to

$$
\begin{align}
\mathbf{A} &= softmax\left(\frac{(\mathbf{R}_{\Theta}^d\mathbf{W}_Q\mathbf{X})(\mathbf{R}_{\Theta}^d\mathbf{W}_K\mathbf{X})^T}{\sqrt{d}} + \mathbf{M}\right)\mathbf{W}_V\mathbf{X} \\
\end{align}
$$

where

* $\mathbf{R}_{\Theta}^d$ is the RoPE rotation tensor.
* $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$ are the linear projections for queries, keys, and values respectively.

The flowchart below enumerates the steps we'll walk through to calculate $\mathbf{A}$. While it may look like there are a lot of steps, the goal is to break the calculation down into small enough pieces that each one is only a few lines of code.

<figure>
<img src="attention-flow.svg" width="700">
<figcaption>Figure X: Attention Workflow</figcaption>
</figure>

### Normalize Inputs

In the original Transformer architecture, each Attention and FFN sub-layer normalized its outputs. In contrast, Llama normalizes the inputs to each sub-layer to improve training stability. Llama also replaces the standard LayerNorm algorithm with RMSNorm, which Zhang and Sennrich (2019) designed to be less computationally expensive and to scale better.

In [18]:
# Configure attention normalization
normalize_attention = RMSNorm(config.d_model, config.rms_norm_eps).to(device)

# Load pre-trained weights
llama.load_state(normalize_attention, "normalize_attention", checkpoint=checkpoint)

In [19]:
# Preserve residuals
residual = x

# Normalize attention inputs
x = normalize_attention(x)

x.shape

torch.Size([22, 4096])
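
For intuition, RMSNorm itself fits in a few lines. The `rms_norm` helper below is a hypothetical stand-in, not the `RMSNorm` class imported from `llama_models`. It assumes the standard formulation from Zhang and Sennrich (2019), and the `eps` default and toy shapes are illustrative assumptions.

```python
import torch


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Hypothetical sketch of RMSNorm: scale each row by its root mean square."""
    # Mean of squares over the feature dimension (no mean subtraction, unlike LayerNorm)
    mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
    # Divide by the RMS, then apply the learned per-feature gain
    return x * torch.rsqrt(mean_sq + eps) * weight


torch.manual_seed(0)
x_demo = torch.randn(3, 8) * 5.0   # toy activations: 3 tokens x 8 features
gain = torch.ones(8)               # a gain of 1 leaves the normalized values unscaled

y_demo = rms_norm(x_demo, gain)

# After normalization every row has a root mean square of roughly 1
row_rms = y_demo.pow(2).mean(dim=-1).sqrt()
print(row_rms)
```

Skipping the mean subtraction and bias that LayerNorm uses is where the savings come from: only one reduction over the features is needed per token.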

### Project Queries, Keys, Values

Next, we'll configure and then apply the linear projections $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$ that map input embeddings $\mathbf{X}$ to query, key, and value subspaces.

In [20]:
# Configure query, key, value projections
w_q = nn.Linear(
    in_features=config.d_model,
    out_features=config.n_heads * config.d_head,
    bias=False,
    device=device,
)
w_k = nn.Linear(
    in_features=config.d_model,
    out_features=config.n_kv_heads * config.d_head,
    bias=False,
    device=device,
)
w_v = nn.Linear(
    in_features=config.d_model,
    out_features=config.n_kv_heads * config.d_head,
    bias=False,
    device=device,
)

# Load pre-trained weights
llama.load_state(w_q, "w_q", w_k, "w_k", w_v, "w_v", checkpoint=checkpoint)

In [21]:
# Project embeddings to query, key, value spaces
q = w_q(x)
k = w_k(x)
v = w_v(x)

q.shape, k.shape, v.shape

(torch.Size([22, 4096]), torch.Size([22, 1024]), torch.Size([22, 1024]))

One of the challenges with scaling Transformers is the quadratic complexity of comparing every query to every key in $\mathbf{Q}\mathbf{K}^T$. If you look closely at the output above, you'll notice the dimension of $\mathbf{Q}$ is 4 times that of $\mathbf{K}$ and $\mathbf{V}$. This is because Llama uses a technique called Grouped Query Attention (GQA) from Ainslie et al. (2023). Instead of each attention head learning a separate set of queries, keys, and values, the attention heads are arranged in groups that share a set of keys and values across the group (Touvron, Martin, et al. 2023). Here, each group of 4 query heads shares 1 key/value head, so the key and value projections, and the memory needed to hold their outputs, are 4 times smaller than they would be with one key/value head per query head.
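
To put rough numbers on this, here is a back-of-the-envelope sketch using the sizes visible in the outputs above. The "one key/value head per query head" baseline is a hypothetical comparison, not something the notebook builds.

```python
# Sizes read off the shapes printed above
d_model = 4096
d_head = 128
n_heads = 32      # query heads
n_kv_heads = 8    # key/value heads

# Query heads per shared key/value head
n_kv_groups = n_heads // n_kv_heads

# Weights in the key and value projections (no bias)
kv_weights_baseline = 2 * d_model * n_heads * d_head   # hypothetical: one K/V head per query head
kv_weights_gqa = 2 * d_model * n_kv_heads * d_head     # grouped K/V heads

# Key and value activations per token, per layer
kv_per_token_baseline = 2 * n_heads * d_head
kv_per_token_gqa = 2 * n_kv_heads * d_head

print(n_kv_groups)                                 # 4
print(kv_weights_baseline // kv_weights_gqa)       # 4
print(kv_per_token_baseline, kv_per_token_gqa)     # 8192 2048
```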

### Split Attention Heads

Next, we'll take the full scale queries, keys, and values, and split them into attention heads. The resulting tensors will have shape $(h, n, d_{head})$ where $h$ is the number of heads. Note there are 32 query heads but only 8 key/value heads.

In [25]:
def split_heads(x, n_heads):
    return x.view(-1, n_heads, config.d_head).transpose(-3, -2)

In [26]:
# Split attention heads
q = split_heads(q, config.n_heads)
k = split_heads(k, config.n_kv_heads)
v = split_heads(v, config.n_kv_heads)

q.shape, k.shape, v.shape

(torch.Size([32, 22, 128]), torch.Size([8, 22, 128]), torch.Size([8, 22, 128]))

### Encode Positions

Now that we've split queries, keys, and values into attention heads, it's time to rotate the queries and keys according to the embedding positions using the RoPE rotation matrix $\mathbf{R}_{\Theta}^d$.

$$
\mathbf{R}_{\Theta,m}^d \mathbf{x} = 
\begin{bmatrix}
x_0 \\
x_1 \\
x_2 \\
x_3 \\
\vdots \\
x_{d-2} \\
x_{d-1} \\
\end{bmatrix}
\otimes
\begin{bmatrix}
\cos(m \theta_0) \\
\cos(m \theta_0) \\
\cos(m \theta_1) \\
\cos(m \theta_1) \\
\vdots \\
\cos(m \theta_{d/2-1}) \\
\cos(m \theta_{d/2-1}) \\
\end{bmatrix}
+
\begin{bmatrix}
-x_1 \\
x_0 \\
-x_3 \\
x_2 \\
\vdots \\
-x_{d-1} \\
x_{d-2} \\
\end{bmatrix}
\otimes
\begin{bmatrix}
\sin(m \theta_0) \\
\sin(m \theta_0) \\
\sin(m \theta_1) \\
\sin(m \theta_1) \\
\vdots \\
\sin(m \theta_{d/2-1}) \\
\sin(m \theta_{d/2-1}) \\
\end{bmatrix}
$$

where $\otimes$ denotes element-wise multiplication.

In [None]:
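def rotate_half(x):
    """Swap the two halves of the last dim and negate the second: (x1, x2) -> (-x2, x1).

    Pairs dim i with dim i + d/2, matching the layout of rope_cos and rope_sin. This is
    the half-split counterpart of the interleaved (-x1, x0, ...) column in the equation.
    """
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)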

In [None]:
# Encode positions by rotating queries and keys
q = (q * rope_cos) + (rotate_half(q) * rope_sin)
k = (k * rope_cos) + (rotate_half(k) * rope_sin)

q.shape, k.shape, v.shape
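
The reason this rotation encodes position is that the dot product between a rotated query and a rotated key depends only on how far apart they are. The sketch below checks that on toy vectors. The helper names, the toy dimension, and the `base` of 10000 are assumptions for illustration and are not taken from the Llama config.

```python
import torch


def rope_tables(n: int, d: int, base: float = 10000.0):
    """Toy cos/sin tables in the same half-split layout as rope() above."""
    thetas = 1.0 / base ** (2 * torch.arange(d // 2) / d)
    angles = torch.arange(n).unsqueeze(1) * thetas    # (n, d/2)
    angles = torch.cat((angles, angles), dim=-1)      # (n, d)
    return torch.cos(angles), torch.sin(angles)


def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def rotate(x, cos, sin, position):
    """Rotate a single vector as if it sat at the given position."""
    return x * cos[position] + rotate_half(x) * sin[position]


torch.manual_seed(0)
d = 8
cos, sin = rope_tables(n=16, d=d)
q_vec, k_vec = torch.randn(d), torch.randn(d)

# Same offset (3 positions apart) at two different absolute positions
score_a = rotate(q_vec, cos, sin, 5) @ rotate(k_vec, cos, sin, 2)
score_b = rotate(q_vec, cos, sin, 13) @ rotate(k_vec, cos, sin, 10)

# Rotation leaves the vector length unchanged
norm_before = q_vec.norm()
norm_after = rotate(q_vec, cos, sin, 7).norm()

print(score_a.item(), score_b.item())
print(norm_before.item(), norm_after.item())
```

The two scores agree up to floating point error, and so do the two norms, which is why RoPE can be applied to queries and keys without a separate positional embedding added to the inputs.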

### Expand Key / Value Groups

In [None]:
# Expand key/value groups
k = k.repeat_interleave(config.n_kv_groups, dim=0)
v = v.repeat_interleave(config.n_kv_groups, dim=0)

q.shape, k.shape, v.shape

In [None]:
# Sanity check
assert q.shape == k.shape == v.shape
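
It's worth seeing why `repeat_interleave` is used here rather than `repeat`. The toy sketch below labels each key/value head with its index and expands it both ways. The sizes are made up for illustration.

```python
import torch

n_kv_heads, n_kv_groups = 2, 3   # toy sizes: 2 key/value heads serve 6 query heads

# Label each key/value head with its own index so we can see where it lands
k_demo = torch.arange(n_kv_heads).view(n_kv_heads, 1, 1).expand(n_kv_heads, 4, 5)

# repeat_interleave keeps copies of the same head adjacent
interleaved = k_demo.repeat_interleave(n_kv_groups, dim=0)[:, 0, 0].tolist()

# repeat tiles the whole block of heads instead
tiled = k_demo.repeat(n_kv_groups, 1, 1)[:, 0, 0].tolist()

print(interleaved)  # [0, 0, 0, 1, 1, 1]
print(tiled)        # [0, 1, 0, 1, 0, 1]
```

With the interleaved layout, query head `i` lines up with key/value head `i // n_kv_groups`, so consecutive query heads form a group.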

### Calculate Attention

In [None]:
# Compute attention mask M
n = len(x)
mask = torch.ones(n, n, dtype=torch.bool, device=device).tril(diagonal=0)
m = torch.zeros(n, n, device=device).masked_fill_(mask.logical_not(), float("-inf"))

m
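
To see what the mask does once it's added to the scores, here is a small sketch on a 4 token toy sequence. The raw scores are all set to zero, which is an assumption made purely so the effect of the mask is easy to read.

```python
import torch
from torch.nn.functional import softmax

n = 4

# Lower triangle (including the diagonal) marks the positions each token may see
allowed = torch.ones(n, n, dtype=torch.bool).tril(diagonal=0)
m_demo = torch.zeros(n, n).masked_fill(~allowed, float("-inf"))

# With identical raw scores, each row spreads its weight evenly over the visible tokens
weights = softmax(torch.zeros(n, n) + m_demo, dim=-1)

print(m_demo)
print(weights)
```

Each `-inf` entry becomes a weight of exactly 0 after the softmax, so token `i` only mixes in values from tokens `0` through `i`.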

In [None]:
# Compute attention for all heads in parallel
a = softmax(q @ k.transpose(-2, -1) / np.sqrt(config.d_head) + m, dim=-1) @ v

a.shape
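
As a cross-check, the explicit formula can be compared against PyTorch's fused attention function. This sketch assumes PyTorch 2.0 or later, where `torch.nn.functional.scaled_dot_product_attention` is available, and uses random toy tensors rather than the notebook's `q`, `k`, and `v`.

```python
import math

import torch
from torch.nn.functional import scaled_dot_product_attention, softmax

torch.manual_seed(0)
h, n, d_head = 4, 6, 8   # toy sizes
q_demo, k_demo, v_demo = (torch.randn(h, n, d_head) for _ in range(3))

# Explicit version, mirroring the cell above
allowed = torch.ones(n, n, dtype=torch.bool).tril(diagonal=0)
m_demo = torch.zeros(n, n).masked_fill(~allowed, float("-inf"))
scores = q_demo @ k_demo.transpose(-2, -1) / math.sqrt(d_head) + m_demo
a_manual = softmax(scores, dim=-1) @ v_demo

# Fused version: add a batch dim, let is_causal=True apply the same lower-triangular mask
a_fused = scaled_dot_product_attention(
    q_demo[None], k_demo[None], v_demo[None], is_causal=True
)[0]

print(torch.allclose(a_manual, a_fused, atol=1e-5))
```

The explicit version is kept in this teardown because every term of the equation stays visible. The fused call is the kind of thing you would reach for once the mechanics are clear.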

### Recombine Attention Heads

In [None]:
def combine_heads(x):
    return x.transpose(-3, -2).contiguous().view(-1, int(config.n_heads * config.d_head))

In [None]:
# Combine attention heads
a = combine_heads(a)

a.shape
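
Combining heads is the exact inverse of splitting them. The sketch below checks the round trip on a toy tensor, using variants of `split_heads` and `combine_heads` that take the sizes as arguments instead of reading `config`. Those signatures are an assumption made to keep the example self-contained.

```python
import torch

n, n_heads, d_head = 5, 4, 3   # toy sizes


def split_heads(x, n_heads, d_head):
    # (n, n_heads * d_head) -> (n_heads, n, d_head)
    return x.view(-1, n_heads, d_head).transpose(-3, -2)


def combine_heads(x, n_heads, d_head):
    # (n_heads, n, d_head) -> (n, n_heads * d_head)
    return x.transpose(-3, -2).contiguous().view(-1, n_heads * d_head)


x_demo = torch.arange(n * n_heads * d_head, dtype=torch.float32).view(n, n_heads * d_head)

heads = split_heads(x_demo, n_heads, d_head)
restored = combine_heads(heads, n_heads, d_head)

print(heads.shape)                     # torch.Size([4, 5, 3])
print(torch.equal(restored, x_demo))   # True

# Head 1 of token 0 is the second d_head-sized slice of that token's row
print(torch.equal(heads[1, 0], x_demo[0, d_head : 2 * d_head]))  # True
```

The `contiguous()` call matters: after the transpose the memory layout no longer matches the logical order, and `view` needs a contiguous tensor to reinterpret the shape.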

### Project Outputs

In [None]:
# Configure attention output projection
attention_outputs = nn.Linear(
    in_features=config.d_model, 
    out_features=config.d_model,
    bias=False,
    device=device,
)

# Load pre-trained weights
llama.load_state(attention_outputs, "attention_outputs", checkpoint=checkpoint)

In [None]:
# Project attention embeddings back to model space
a = attention_outputs(a)

a.shape

### Combine Outputs with Residuals

In [None]:
# Combine attention embeddings with residuals
x = residual + a

x.shape

## Feedforward Networks

<figure>
<img src="ffn-flow.svg" width="940">
<figcaption>Figure X: FFN Flow</figcaption>
</figure>

### Normalize Inputs

In [None]:
# Configure FFN normalization
normalize_ffn = RMSNorm(config.d_model, config.rms_norm_eps).to(device)

# Load pre-trained state
llama.load_state(normalize_ffn, "normalize_ffn", checkpoint=checkpoint)

In [None]:
# Preserve residuals
residual = x

# Normalize FFN inputs
x = normalize_ffn(x)

x.shape

### Transform

In [None]:
# Configure SwiGLU FFN
ffn_gates = nn.Linear(
    in_features=config.d_model,
    out_features=config.d_ffn,
    bias=False,
    device=device,
)
ffn_inputs = nn.Linear(
    in_features=config.d_model,
    out_features=config.d_ffn,
    bias=False,
    device=device,
)

# Load pre-trained weights
llama.load_state(ffn_gates, "ffn_gates", ffn_inputs, "ffn_inputs", checkpoint=checkpoint)

In [None]:
# Apply transform
f = silu(ffn_gates(x)) * ffn_inputs(x)

f.shape
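
The transform above is the gated SwiGLU variant described by Shazeer (2020): one projection is passed through SiLU and used to gate the other. Here is a minimal sketch with random toy layers, not the pre-trained weights, that spells out what `silu` computes. The layer names and sizes are illustrative assumptions.

```python
import torch
from torch.nn.functional import silu

torch.manual_seed(0)
d_model, d_ffn = 8, 16   # toy sizes

# Toy stand-ins for the gate and input projections (random, not pre-trained)
gates = torch.nn.Linear(d_model, d_ffn, bias=False)
inputs = torch.nn.Linear(d_model, d_ffn, bias=False)

x_demo = torch.randn(3, d_model)

with torch.no_grad():
    g = gates(x_demo)
    # SiLU (Swish with beta = 1) is the input scaled by its own sigmoid
    silu_by_hand = g * torch.sigmoid(g)
    # The activated gate multiplies the second projection element-wise
    f_demo = silu(g) * inputs(x_demo)

print(torch.allclose(silu(g), silu_by_hand, atol=1e-6))
print(f_demo.shape)   # torch.Size([3, 16])
```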

### Project Outputs

In [None]:
# Configure FFN output projection
ffn_outputs = nn.Linear(
    in_features=config.d_ffn,
    out_features=config.d_model,
    bias=False,
    device=device,
)

# Load pre-trained weights
llama.load_state(ffn_outputs, "ffn_outputs", checkpoint=checkpoint)

In [None]:
# Project FFN embeddings back to model space
f = ffn_outputs(f)

f.shape

### Combine Outputs with Residuals

In [None]:
# Combine FFN embeddings with residuals
x = residual + f

x.shape

## Stacking the Layers Together

In [None]:
def context_layers(x):
    # Compute RoPE rotation matrices
    rope_cos, rope_sin = rope(len(x))

    # Apply layer logic in a loop
    for layer in range(config.n_layers):
    
        # Load pre-trained state for layer
        load_pretrained_state(layer)
    
        #
        # Attention
        #
    
        # Normalize attention inputs
        residual = x
        x = normalize_attention(x)
        
        # Project embeddings to query, key, value spaces
        q = w_q(x)
        k = w_k(x)
        v = w_v(x)
        
        # Split attention heads
        q = split_heads(q, config.n_heads)
        k = split_heads(k, config.n_kv_heads)
        v = split_heads(v, config.n_kv_heads)
    
        # Encode positions by rotating queries and keys
        q = (q * rope_cos) + (rotate_half(q) * rope_sin)
        k = (k * rope_cos) + (rotate_half(k) * rope_sin)
        
        # Expand key/value groups
        k = k.repeat_interleave(config.n_kv_groups, dim=0)
        v = v.repeat_interleave(config.n_kv_groups, dim=0)
    
        # Compute masked attention bias M
        n = len(x)
        mask = torch.ones(n, n, dtype=torch.bool, device=device).tril(diagonal=0)
        m = torch.zeros(n, n, device=device).masked_fill_(mask.logical_not(), float("-inf"))
        
        # Compute attention for all heads in parallel
        a = softmax(q @ k.transpose(-2, -1) / np.sqrt(config.d_head) + m, dim=-1) @ v
    
        # Combine attention heads
        a = combine_heads(a)
        
        # Project attention embeddings back to model space
        a = attention_outputs(a)
        
        # Combine attention embeddings with residuals
        x = residual + a
        
        #
        # FFN
        #
    
        # Normalize FFN inputs
        residual = x
        x = normalize_ffn(x)
    
        # Apply transform
        f = silu(ffn_gates(x)) * ffn_inputs(x)
    
        # Project FFN embeddings back to model space
        f = ffn_outputs(f)
        
        # Combine FFN embeddings with residuals
        x = residual + f

    return x

In [None]:
# Start over from initial tokens
x = torch.tensor(token_ids, device=device)

# Initial embeddings
x = embeddings(x)

# Contextualized embeddings
x = context_layers(x)

x.shape

In [None]:
x

# Head

<figure>
<img src="head-flow.svg" width="700">
<figcaption>Figure X: Head Flow</figcaption>
</figure>

### Normalize Inputs

In [None]:
# Configure head normalization
normalize_head = RMSNorm(config.d_model, config.rms_norm_eps).to(device)

# Load pre-trained weights
llama.load_state(normalize_head, "normalize_head", checkpoint=checkpoint)

In [None]:
# Normalize head inputs
x = normalize_head(x)

x.shape

### Project Outputs

In [None]:
# Configure output projection
head_outputs = nn.Linear(
    in_features=config.d_model,
    out_features=config.vocab_size,
    bias=False,
    device=device,
)

# Load pre-trained weights
llama.load_state(head_outputs, "head_outputs", checkpoint=checkpoint)

In [None]:
# Use last embedding to represent the entire sequence
x = x[-1]

# Project outputs to token space
x = head_outputs(x)

x.shape

## Top Token

In [None]:
# Select top scoring token
token_id = x.argmax().item()

# Decode token
token = tokenizer.decode([token_id]).strip()

token

In [None]:
# Verify answer
assert token == "Boston"

## Sample Tokens

### Temperature

In [None]:
# Hyperparameters
temperature = config.temperature
temperature

In [None]:
# Apply temperature
x = x / temperature
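
To get a feel for what dividing by the temperature does, the sketch below applies three temperatures to the same toy logits. The logits and temperature values are made up for illustration and are not the ones in `config`.

```python
import torch
from torch.nn.functional import softmax

logits = torch.tensor([2.0, 1.0, 0.0])   # toy scores for three tokens

# Lower temperatures sharpen the distribution, higher ones flatten it
for t in (0.5, 1.0, 2.0):
    print(t, softmax(logits / t, dim=-1).tolist())

cold = softmax(logits / 0.5, dim=-1)
neutral = softmax(logits / 1.0, dim=-1)
hot = softmax(logits / 2.0, dim=-1)
```

The ranking of tokens never changes, only how much probability the top token takes from the rest. A temperature of exactly 0 would divide by zero, so greedy decoding needs its own argmax path like the one in the Top Token section.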

### Ranking

In [None]:
# Convert logits to probabilities
probs = softmax(x, dim=-1)

# Sort probabilities in descending order
probs, indices = probs.sort(descending=True)

### Top K

In [None]:
# Hyperparameters
top_k = config.top_k
top_k

In [None]:
# Retain top k tokens
probs = probs[:top_k]
print(f"Retained {len(probs)} of {len(x)}")

### Top P

In [None]:
# Hyperparameters
top_p = config.top_p
top_p

In [None]:
# Find cutoff where cumulative probability exceeds top_p
cumulative_mask = probs.cumsum(dim=-1) > top_p
threshold_index = torch.argmax(cumulative_mask).item()

# Only apply threshold if top_p was exceeded
if cumulative_mask.any():
    probs = probs[:threshold_index+1]

print(f"Retained {len(probs)} of {len(x)}")
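
Here is the same Top K and Top P logic on a toy distribution small enough to follow by hand. The probabilities and thresholds are illustrative assumptions, not the values in `config`.

```python
import torch

probs_demo = torch.tensor([0.50, 0.20, 0.15, 0.10, 0.05])   # already sorted, descending
top_k, top_p = 4, 0.8                                       # illustrative values

# Top K: keep the 4 most likely tokens
kept = probs_demo[:top_k]

# Top P: keep tokens up to and including the first one that pushes the running total past top_p
cumulative_mask = kept.cumsum(dim=-1) > top_p
if cumulative_mask.any():
    # argmax returns the first index where the mask is set
    kept = kept[: torch.argmax(cumulative_mask.int()).item() + 1]

print(kept.tolist())
```

The running totals are 0.50, 0.70, 0.85, and 0.95, so the third token is the first to cross 0.8 and three tokens survive. Top K caps the pool at a fixed size, while Top P adapts the size to how confident the model is.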

### Random Selection

In [None]:
# Print remaining token pool
for i, prob in enumerate(probs):
    candidate_id = indices[i].item()
    print(f"token id {candidate_id}, token '{tokenizer.decode([candidate_id])}', score {prob:0.3f}")

In [None]:
# Sample from remaining tokens weighted by probability
sampled_index = torch.multinomial(probs, 1)

# Map sampled index back to the original token id
token_id = indices[sampled_index].item()

# Decode token
token = tokenizer.decode([token_id]).strip()

token
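
One detail worth noting: after Top K and Top P, the remaining probabilities no longer sum to 1, yet they are passed straight to `torch.multinomial`. The sketch below, with made-up weights and a fixed seed, shows that the function treats its input as relative weights.

```python
import torch

torch.manual_seed(0)

# Weights left over after Top K / Top P no longer sum to 1
weights = torch.tensor([0.50, 0.20, 0.15])

# torch.multinomial uses the values as relative weights, so no renormalization is needed
draws = torch.multinomial(weights, num_samples=10_000, replacement=True)
frequencies = torch.bincount(draws, minlength=len(weights)) / len(draws)

print(frequencies.tolist())
print((weights / weights.sum()).tolist())
```

The observed frequencies land close to the renormalized weights, which is the distribution the sampling step above effectively draws from.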

## Complete Head Stage

In [None]:
def head(x):
    # Normalize head inputs
    x = normalize_head(x)
    
    # Use last embedding to represent the entire sequence
    x = x[-1]
    
    # Project outputs to token space
    x = head_outputs(x)

    #
    # Temperature
    #
    
    # Apply temperature
    x = x / config.temperature

    #
    # Ranking
    #
    
    # Convert logits to probabilities
    probs = softmax(x, dim=-1)
    
    # Sort probabilities in descending order
    probs, indices = probs.sort(descending=True)

    #
    # Top K
    #
    
    # Retain top k tokens
    probs = probs[:config.top_k]

    #
    # Top P
    #
    
    # Find cutoff where cumulative probability exceeds top_p
    cumulative_mask = probs.cumsum(dim=-1) > config.top_p
    threshold_index = torch.argmax(cumulative_mask).item()
    
    # Only apply threshold if top_p was exceeded
    if cumulative_mask.any():
        probs = probs[:threshold_index+1]

    #
    # Random Selection
    #
    
    # Sample from remaining tokens weighted by probability
    sampled_index = torch.multinomial(probs, 1)
    
    # Map sampled index back to the original token id
    token_id = indices[sampled_index]

    return token_id.item()

# Generator

In [None]:
class Message(BaseModel):
    role: str
    content: str

In [None]:
@validate_call
def prepare_messages(messages: list[Message]):
    # Initialize prompt
    prompt = ""
    
    # Format each message
    for message in messages:
        prompt += f"<|start_header_id|>{message.role}<|end_header_id|>\n\n"
        prompt += message.content
        prompt += "<|eot_id|>"

    # Finish with the assistant role to prime the model's response
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"

    return prompt
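
To make the prompt format concrete, here is a dependency-free mirror of `prepare_messages` that works on plain dicts. The `format_prompt` name and the system message are hypothetical, added only to show how multiple turns are laid out.

```python
def format_prompt(messages):
    """Dependency-free mirror of prepare_messages() for plain dicts."""
    prompt = ""
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += message["content"]
        prompt += "<|eot_id|>"
    # Open an assistant turn so the model continues from there
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


prompt = format_prompt([
    {"role": "system", "content": "Answer in one word."},
    {"role": "user", "content": "What is the capital of Massachusetts?"},
])
print(prompt)
```

Each turn is wrapped in a role header and closed with `<|eot_id|>`, and the prompt ends with an open assistant header. The beginning-of-text token is not part of this string because the tokenizer call below adds it with `bos=True`.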

In [None]:
@validate_call
def generate(messages: list[Message]):
    # Format message prompt
    prompt = prepare_messages(messages)
    
    # Split raw text into tokens
    token_ids = tokenizer.encode(prompt, bos=True, eos=False, allowed_special="all")
    
    # Generate output until we get a stop token or we exceed max_output_tokens.
    for _ in range(config.max_output_tokens):
        
        # Start over from initial tokens
        x = torch.tensor(token_ids, device=device)
        
        # Initial embeddings
        x = embeddings(x)
        
        # Contextualized embeddings
        x = context_layers(x)
        
        # Head
        token_id = head(x)
        
        # Check stopping criteria
        if token_id in tokenizer.stop_tokens:
            break
    
        # Print token
        token = tokenizer.decode([token_id])
        stdout.write(token)
        
        # Append to end of sequence
        token_ids.append(token_id)

In [None]:
generate([
    {
        "role": "user",
        "content": "What is capital of Massachusetts?",
    },
])

# Recap

# References

Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv. <https://doi.org/10.48550/arXiv.2305.13245>.

Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. 2000. “A Neural Probabilistic Language Model.” In Advances in Neural Information Processing Systems. Vol. 13. MIT Press. <https://proceedings.neurips.cc/paper_files/paper/2000/hash/728f206c2a01bf572b5940d7d9a8fa4c-Abstract.html>.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv. <https://doi.org/10.48550/arXiv.2005.14165>.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv.Org. May 24, 2019. <https://arxiv.org/abs/1810.04805v2>.

Dubey, Abhimanyu, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, et al. 2024. “The Llama 3 Herd of Models.” arXiv.Org. July 31, 2024. <https://arxiv.org/abs/2407.21783v2>.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” arXiv. <https://doi.org/10.48550/arXiv.1512.03385>.

Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. “Improving Language Understanding by Generative Pre-Training.”

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” arXiv. <https://doi.org/10.48550/arXiv.1508.07909>.

Shazeer, Noam. 2020. “GLU Variants Improve Transformer.” arXiv. <https://doi.org/10.48550/arXiv.2002.05202>.

Stanford Online, dir. 2024. Stanford CS25: V4 I Hyung Won Chung of OpenAI. <https://www.youtube.com/watch?v=orDKvo8h71o>.

Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv.Org. April 20, 2021. <https://arxiv.org/abs/2104.09864v5>.

Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” arXiv. <https://doi.org/10.48550/arXiv.2302.13971>.

Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv.Org. July 18, 2023. <https://arxiv.org/abs/2307.09288v2>.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv. <https://doi.org/10.48550/arXiv.1706.03762>.

Zhang, Biao, and Rico Sennrich. 2019. “Root Mean Square Layer Normalization.” arXiv. <https://doi.org/10.48550/arXiv.1910.07467>.