**LLM Workshop 2024 by Sebastian Raschka**

This code is based on *Build a Large Language Model (From Scratch)*, [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)

# 3) Coding an LLM architecture

In [1]:
# Requirements from: https://github.com/rasbt/LLM-workshop-2024/blob/main/requirements.txt
requirements = """
# torch >= 2.0.1
tiktoken >= 0.5.1
# matplotlib >= 3.7.1
# numpy >= 1.24.3
# tensorflow >= 2.15.0
# tqdm >= 4.66.1
# numpy >= 1.25, < 2.0
# pandas >= 2.2.1
# psutil >= 5.9.5
# litgpt[all] >= 0.4.1
"""

with open("requirements.txt", mode="wt") as f:
    f.write(requirements)

%pip install -r requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import importlib.util  # the util submodule must be imported explicitly

importlib.util.find_spec("tiktoken")

ModuleSpec(name='tiktoken', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7f41e3129e70>, origin='/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/tiktoken/__init__.py', submodule_search_locations=['/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/tiktoken'])

In [3]:
from importlib.metadata import version


print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.2.1+cu121
tiktoken version: 0.7.0


Download the supplementary Python module from Sebastian Raschka's workshop material and save it as `llm_architecture.py`

In [4]:
import requests

url = "https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/03_architecture/supplementary.py"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early instead of writing an error page to disk

with open("llm_architecture.py", "wt", encoding="utf-8") as f:
    f.write(response.text)

- In this notebook, we implement a GPT-like LLM architecture; the next notebook will focus on training this LLM

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/01.png?raw=1" width="1000px">

<br>
<br>
<br>
<br>

# 3.1 Coding an LLM architecture

- Models like GPT, Gemma, Phi, Mistral, and Llama generate text sequentially, one token at a time, and are based on the decoder part of the original transformer architecture
- Therefore, these LLMs are often referred to as "decoder-like" LLMs
- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code
- We'll see that many elements are repeated in an LLM's architecture

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/02.png?raw=1" width="700px">

- In the previous notebook, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they neatly fit on the screen
- In this notebook, we consider embedding and model sizes akin to a small GPT-2 model
- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)



<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/03.png?raw=1" width="1200px">

- The next notebook will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters
- Models like Llama and others are very similar to this model, since they are all based on the same core concepts

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/04.png?raw=1" width="1200px">

- Configuration details for the 124 million parameter GPT-2 model (GPT-2 "small") include:

In [5]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,  # Embedding dimension
    "n_heads": 12,  # Number of attention heads
    "n_layers": 12,  # Number of layers
    "drop_rate": 0.0,  # Dropout rate
    "qkv_bias": False,  # Query-Key-Value bias
}
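
As a rough cross-check on the "124M" label, the parameter count can be tallied directly from the configuration. The sketch below is plain Python and assumes the standard GPT-2 block layout: query/key/value projections without bias when `qkv_bias` is `False`, an output projection with bias, a feed-forward network that expands to `4 * emb_dim`, and two layer norms with scale and shift per block. The helper name `count_gpt_params` is hypothetical, and the numbers only hold if the `TransformerBlock` in `llm_architecture.py` follows that layout.

```python
def count_gpt_params(cfg, tie_output_head=False):
    """Tally trainable parameters for a GPT-2-style model (assumed layout)."""
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d

    qkv_bias = 3 * d if cfg["qkv_bias"] else 0
    attention = 3 * d * d + qkv_bias       # query, key, value projections
    attention += d * d + d                 # output projection (with bias)
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand, then contract
    layer_norms = 2 * (2 * d)              # two norms, each with scale and shift
    block = attention + feed_forward + layer_norms

    final_norm = 2 * d
    # With weight tying, the output layer reuses the token-embedding matrix
    out_head = 0 if tie_output_head else d * cfg["vocab_size"]
    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.0,
    "qkv_bias": False,
}

total = count_gpt_params(cfg)
tied = count_gpt_params(cfg, tie_output_head=True)
print(f"Separate output head: {total:,}")
print(f"Tied output head:     {tied:,}")
```

Under these assumptions, the 124M figure corresponds to sharing the token-embedding matrix with the output layer, while a model with a separate output projection comes out at roughly 163M parameters.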

<br>
<br>
<br>
<br>



# 3.2 Coding the GPT model

- We are almost there: now let's plug the transformer block into the architecture we coded at the very beginning of this notebook so that we obtain a usable GPT architecture
- Note that the transformer block is repeated multiple times; in the case of the smallest 124M GPT-2 model, we repeat it 12 times:

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/07.png?raw=1" width="800px">

- The corresponding code implementation, where `cfg["n_layers"] = 12`:

In [6]:
import torch  # needed for torch.arange in forward()
import torch.nn as nn
from llm_architecture import TransformerBlock, LayerNorm


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
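
To make the data flow in `forward` easier to follow, the following plain-Python sketch traces only the tensor shapes through each stage. It does not run any neural-network code; `trace_shapes` is a hypothetical helper introduced for illustration, and it assumes that each transformer block returns a tensor of the same shape as its input.

```python
def trace_shapes(batch_size, seq_len, cfg):
    """Return (stage, shape) pairs mirroring the stages of GPTModel.forward."""
    if seq_len > cfg["context_length"]:
        raise ValueError("seq_len exceeds the number of positional embeddings")
    emb = cfg["emb_dim"]
    return [
        ("in_idx", (batch_size, seq_len)),
        ("tok_embeds", (batch_size, seq_len, emb)),
        ("pos_embeds", (seq_len, emb)),  # broadcast over the batch dimension
        ("tok_embeds + pos_embeds", (batch_size, seq_len, emb)),
        ("after trf_blocks", (batch_size, seq_len, emb)),  # assumed shape-preserving
        ("after final_norm", (batch_size, seq_len, emb)),
        ("logits", (batch_size, seq_len, cfg["vocab_size"])),
    ]


cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}
shapes = trace_shapes(batch_size=2, seq_len=4, cfg=cfg)
for stage, shape in shapes:
    print(f"{stage:<26}{shape}")
```

The final entry says that the model produces one score per vocabulary entry for every position of every sequence in the batch.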

- Using the configuration of the 124M parameter model, we can now instantiate this GPT model with random initial weights as follows:

In [7]:
import torch
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [8]:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

Input batch:
 tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])

Output shape: torch.Size([2, 4, 50257])
tensor([[[ 6.4165e-02,  2.0443e-01, -1.6945e-01,  ...,  1.7887e-01,
           2.1921e-01, -5.8153e-01],
         [ 3.7736e-01, -4.2545e-01, -6.5874e-01,  ..., -2.5050e-01,
           4.6553e-01, -2.5760e-01],
         [ 8.8996e-01, -1.3770e-01,  1.4748e-01,  ...,  1.7770e-01,
          -1.2015e-01, -1.8902e-01],
         [-9.7276e-01,  9.7338e-02, -2.5419e-01,  ...,  1.1035e+00,
           3.7639e-01, -5.9006e-01]],

        [[ 6.4165e-02,  2.0443e-01, -1.6945e-01,  ...,  1.7887e-01,
           2.1921e-01, -5.8153e-01],
         [ 1.3433e-01, -2.1289e-01, -2.7021e-02,  ...,  8.1153e-01,
          -4.7410e-02,  3.1186e-01],
         [ 8.9996e-01,  9.5396e-01, -1.7896e-01,  ...,  8.3053e-01,
           2.7657e-01, -2.4577e-02],
         [-9.3013e-05,  1.9390e-01,  5.1217e-01,  ...,  1.1915e+00,
          -1.6431e-01,  3.7046e-02]]], grad_fn=<UnsafeViewBackward0>)


In [9]:
for token_id in tokenizer.encode(txt1) + tokenizer.encode(txt2):
    print(token_id)

6109
3626
6100
345
6109
1110
6622
257


In [10]:
print("Currently, the untrained model does not produce meaningful outputs:\n")

print(f"|{'input':^15}|{'output':^15}|")
print("|" + "-" * 15 + "|" + "-" * 15 + "|")
output_ids = torch.argmax(out, dim=2).view(-1, 1).tolist()
for i, input_txt in enumerate(tokenizer.encode(txt1) + tokenizer.encode(txt2)):
    print(
        f"| {tokenizer.decode([input_txt]):<14}| {tokenizer.decode(output_ids[i]):<14}|"
    )
print("|" + "-" * 15 + "|" + "-" * 15 + "|")

Currently, the untrained model does not produce meaningful outputs:

|     input     |    output     |
|---------------|---------------|
| Every         |  Orche        |
|  effort       |  compan       |
|  moves        | Friday        |
|  you          |  Ae           |
| Every         |  Orche        |
|  day          |  Dre          |
|  holds        |  Valent       |
|  a            | ftime         |
|---------------|---------------|


- We will train this model in the next notebook

# 3.4 Generating text

- LLMs like the GPT model we implemented above generate text one token (roughly, a word or word piece) at a time

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/08.png?raw=1" width="600px">

- The following `generate_text_simple` function implements greedy decoding, which is a simple and fast method to generate text
- In greedy decoding, at each step, the model chooses the word (or token) with the highest probability as its next output (the highest logit corresponds to the highest probability, so we technically wouldn't even have to compute the softmax function explicitly)
- The figure below depicts how the GPT model, given an input context, generates the next word token

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/09.png?raw=1" width="900px">
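
The remark that the softmax is not strictly needed for greedy decoding can be checked on a small example. The sketch below is plain Python with made-up logits; `softmax` and `argmax` are helpers written for this illustration, not part of the notebook's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.2, -0.7, 3.1, 0.4]  # made-up scores for a 4-token vocabulary
probas = softmax(logits)

greedy_from_logits = argmax(logits)
greedy_from_probas = argmax(probas)
print(greedy_from_logits, greedy_from_probas)
```

Because the softmax is strictly increasing in each logit, the ranking of the entries is preserved, so both calls select the same index.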

In [46]:
def sample_temperature(
    a: torch.Tensor = torch.tensor([0.1, 0.2, 0.4], dtype=torch.bfloat16),
    temperature: float = 0.9,
    seed: int | None = None
):
    """Print the softmax of `a` with and without temperature scaling."""
    if seed is not None:  # `if seed:` would silently ignore seed=0
        # Store the current state to restore it later
        random_state = torch.random.get_rng_state()
        torch.manual_seed(seed)
    print("Temperature:", temperature)
    print("Tensor:", a)
    print("Softmax:", torch.softmax(a, dim=-1))
    print("w/ temp:", torch.softmax((a / temperature), dim=-1))
    # Restore the random number generator
    if seed is not None:
        torch.random.set_rng_state(random_state)
    return None

sample_temperature(temperature=1.0)
sample_temperature(temperature=0.9)
sample_temperature(temperature=0.5)
sample_temperature(temperature=0.1)

Temperature: 1.0
Tensor: tensor([0.1001, 0.2002, 0.4004], dtype=torch.bfloat16)
Softmax: tensor([0.2891, 0.3203, 0.3906], dtype=torch.bfloat16)
w/ temp: tensor([0.2891, 0.3203, 0.3906], dtype=torch.bfloat16)
Temperature: 0.9
Tensor: tensor([0.1001, 0.2002, 0.4004], dtype=torch.bfloat16)
Softmax: tensor([0.2891, 0.3203, 0.3906], dtype=torch.bfloat16)
w/ temp: tensor([0.2852, 0.3184, 0.3965], dtype=torch.bfloat16)
Temperature: 0.5
Tensor: tensor([0.1001, 0.2002, 0.4004], dtype=torch.bfloat16)
Softmax: tensor([0.2891, 0.3203, 0.3906], dtype=torch.bfloat16)
w/ temp: tensor([0.2471, 0.3027, 0.4512], dtype=torch.bfloat16)
Temperature: 0.1
Tensor: tensor([0.1001, 0.2002, 0.4004], dtype=torch.bfloat16)
Softmax: tensor([0.2891, 0.3203, 0.3906], dtype=torch.bfloat16)
w/ temp: tensor([0.0420, 0.1143, 0.8438], dtype=torch.bfloat16)
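
The function above prints probabilities but does not draw any samples. To see what temperature does to the samples themselves, the plain-Python sketch below draws 1,000 tokens per temperature from the same three logits and counts how often each index is picked. The helpers `softmax_with_temperature` and `sample_counts` are hypothetical names, and the sketch uses Python's `random` module rather than PyTorch, so the exact counts are not comparable to `torch.multinomial` output.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_counts(logits, temperature, n_samples=1000, seed=123):
    """Draw n_samples indices and count how often each one occurs."""
    rng = random.Random(seed)  # local generator; the global RNG state is untouched
    probs = softmax_with_temperature(logits, temperature)
    draws = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    return Counter(draws)


logits = [0.1, 0.2, 0.4]
counts_by_temperature = {t: sample_counts(logits, t) for t in (1.0, 0.5, 0.1)}
for t, counts in counts_by_temperature.items():
    print(t, [counts[i] for i in range(len(logits))])
```

As the temperature drops, the index with the largest logit should dominate the counts, which is the sampling counterpart of the sharpening visible in the printed probabilities.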


In [47]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

In [117]:
def generate_text_with_temperature(
    model,
    idx,
    max_new_tokens: int,
    context_size: int,
    top_k: int | None = None,
    top_p: float | None = None,
    temperature: float = 0.9,
    seed: int | None = None,
) -> torch.Tensor:
    """
    Generate text using the provided model and temperature values.

    Parameters
    ----------
    model: PyTorch model
        Transformer model that outputs logits for a given input.
    idx: torch.Tensor of shape: (batch_size, n_tokens)
        Input tensor of token indices.
    max_new_tokens: int
        The maximum number of new tokens to generate.
    context_size: int
        The model's context size
    top_k: int, default = None
        If provided, filter the logits based on the top_k highest values
        prior to sampling.
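    top_p: float in the range [0.0, 1.0], default = None
        If provided, keep only the logits at or above the top_p quantile
        of each row prior to sampling (e.g., 0.95 keeps roughly the
        largest 5% of the logits).
    temperature: float in the range [0.0, 1.0], default = 0.9
        Divides the logits before the softmax. Values below 1.0 sample
        from the scaled distribution; lower values make the output more
        deterministic. In this implementation, values >= 1.0 switch to
        greedy search (argmax).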
    seed: int, default = None
        If provided, makes the output deterministic.

    Returns
    -------
    output_indices: a tensor of output indices

    References
    ---------
    - [LabML: Sampling with Temperature](https://nn.labml.ai/sampling/temperature.html)
    - [torch.distributions.Categorical](https://pytorch.org/docs/stable/distributions.html#categorical)
    - [torch.multinomial](https://pytorch.org/docs/stable/generated/torch.multinomial.html#torch.multinomial), which is equivalent to torch...Categorical
    """
    if seed is not None:  # `if seed:` would silently ignore seed=0
        # Store the current state to restore it later
        random_state = torch.random.get_rng_state()
        torch.manual_seed(seed)
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply top-k and top-p filtering
        if top_k:
            top_k_idx = torch.argsort(input=logits, dim=-1, descending=True)[:, :top_k]
            filter_idx = top_k_idx

        if top_p:
            # Per-row quantile threshold; keepdim=True gives shape (batch, 1)
            # so the comparison broadcasts correctly for any batch size
            top_p_vals = torch.quantile(input=logits, q=top_p, dim=-1, keepdim=True)
            top_p_idx = torch.argwhere(logits >= top_p_vals)
            # Reshape to match the input: batch, num_top_p_indices
            # (assumes every row keeps the same number of entries)
            top_p_idx = top_p_idx[:, 1].view(logits.shape[0], -1)
            filter_idx = top_p_idx

        # Filter using the most restrictive criteria
        if top_k and top_p:
            if top_p_idx.shape[1] < top_k:
                # Use top_p since it is more restrictive
                filter_idx = top_p_idx
            else:
                # Use top_k since it is more restrictive
                filter_idx = top_k_idx

        # Set to -inf for the softmax
        if top_k or top_p:
            print("Logits index shape", logits.shape)
            print("Filter index shape", filter_idx.shape)
            # Copy each row's kept logits into a tensor filled with -inf;
            # gather/scatter index per row, so this also works for batch > 1
            logits_copy = torch.full_like(logits, float("-inf"))
            logits_copy.scatter_(
                dim=-1, index=filter_idx, src=logits.gather(dim=-1, index=filter_idx)
            )
            logits = logits_copy

        # Apply softmax to get probabilities
        if temperature == 0.0:
            temperature = 1e-5  # avoid dividing the logits by zero
        probs = torch.softmax(logits / temperature, dim=-1)  # (batch, vocab_size)

        # Sample using temperature:
        if temperature < 1.0:
            # Adapted from: https://huggingface.co/transformers/v3.4.0/_modules/transformers/generation_utils.html
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            # temperature >= 1.0: greedy choice of the most probable token
            idx_next = torch.argmax(probs, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    # Restore the random number generator
    if seed is not None:
        torch.random.set_rng_state(random_state)

    return idx
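
The masking step in the function above (set every logit outside the kept indices to `-inf`, then apply the softmax) can be illustrated on a single row of made-up logits. This is a plain-Python sketch; `top_k_filter` and `softmax` are hypothetical helpers, not functions from the notebook or from PyTorch.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits; replace all others with -inf."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, 1.0, -1.0, 3.0]  # made-up scores
filtered = top_k_filter(logits, k=2)
probs = softmax(filtered)
print(filtered)
print(probs)
```

After masking, the probability mass is redistributed over the surviving entries only, so sampling can never return a token outside the top k.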


In [98]:
a = torch.rand((4, 100))

top_k = 10
top_p = 0.95

if top_k:
    top_k_idx = torch.argsort(input=a, dim=-1, descending=True)[:, :top_k]

if top_p:
    top_p_vals = torch.quantile(input=a, q=top_p, dim=-1).view(-1, 1)
    top_p_idx = torch.argwhere(a >= top_p_vals)
    # Reshape to match the input: batch, num_top_p_indices
    top_p_idx = top_p_idx[:, 1].view(a.shape[0], -1)

if top_k and top_p:
    if top_p_idx.shape[1] < top_k:  # compare kept entries per row, not the batch size
        # Use top_p since it is more restrictive
        pass
    else:
        # Use top_k since it is more restrictive
        pass

In [94]:
# Two equivalent ways to select the values at or above the top_p quantile threshold
torch.all(
    a[a >= top_p_vals] == a[
        torch.argwhere(a >= top_p_vals)[:, 0],
        torch.argwhere(a >= top_p_vals)[:, 1]
    ]
)

tensor(True)

In [103]:
# Notice that the top_p indices match the first five top_k indices in each row
# (top_p lists them in index order, top_k in descending order of value)
print(top_p_idx)
print(top_k_idx)

tensor([[36, 42, 50, 51, 89],
        [17, 18, 34, 42, 70],
        [24, 48, 63, 76, 98],
        [36, 39, 76, 96, 97]])
tensor([[36, 89, 50, 51, 42, 33, 62, 78, 15, 28],
        [42, 34, 70, 17, 18, 95, 75, 41, 13, 33],
        [48, 24, 76, 63, 98, 15, 77, 61, 12,  3],
        [96, 36, 76, 97, 39, 79, 57, 83, 98, 75]])
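
For comparison, top-p is often defined in terms of cumulative probability ("nucleus" sampling): keep the smallest set of highest-probability tokens whose probabilities sum to at least `top_p`. That is a different rule from the quantile-of-values filter used above, so the two generally keep different numbers of tokens. A minimal plain-Python sketch of the cumulative variant follows, using the hypothetical helper `nucleus_indices` and made-up probabilities.

```python
def nucleus_indices(probs, top_p):
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


probs = [0.05, 0.5, 0.25, 0.125, 0.075]  # made-up distribution that sums to 1
kept = nucleus_indices(probs, top_p=0.8)
print(kept)
```

With this rule, the number of kept tokens adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.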


- The `generate_text_simple` function above implements an iterative process that creates one token at a time

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/03_architecture/figures/10.png?raw=1" width="800px">
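
The loop structure can also be exercised without a trained network. In the plain-Python sketch below, `toy_model` is a hypothetical stand-in that returns made-up logits favoring the token after the last one in the context, and `generate_greedy` handles a single sequence rather than a batch; only the crop, predict, argmax, append cycle mirrors `generate_text_simple`.

```python
def toy_model(context, vocab_size=6):
    """Hypothetical stand-in for an LLM: favors (last token + 1) % vocab_size."""
    favored = (context[-1] + 1) % vocab_size
    return [1.0 if token == favored else 0.0 for token in range(vocab_size)]


def generate_greedy(model, idx, max_new_tokens, context_size):
    """Greedy decoding for one sequence of token IDs (no batch dimension)."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-context_size:]  # crop to the supported context size
        logits = model(context)        # scores for the next token
        next_token = max(range(len(logits)), key=lambda t: logits[t])  # argmax
        idx.append(next_token)         # extend the running sequence
    return idx


generated = generate_greedy(toy_model, [3, 4], max_new_tokens=4, context_size=2)
print(generated)
```

Each iteration feeds the sequence produced so far back into the model, which is why generation cost grows with the number of new tokens.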

# Exercise: Generate some text

1. Use the `tokenizer.encode` method to prepare some input text
2. Then, convert this text into a PyTorch tensor via `torch.tensor`
3. Add a batch dimension via `.unsqueeze(0)`
4. Use the `generate_text_simple` function to have the GPT generate some text based on your prepared input text
5. The output from step 4 will be token IDs; convert them back into text via the `tokenizer.decode` method

In [49]:
text = "I like to look for rainbows"
tokens = tokenizer.encode(text)  # convert text to tokens
tokens = torch.tensor(tokens).unsqueeze(0)  # convert to tensor and add batch dimension
model.eval()  # disable dropout
output = generate_text_simple(
    model=model, idx=tokens, max_new_tokens=5, context_size=1024
)
for one_token in output.squeeze():
    print(f"{one_token}".ljust(10, "_"), tokenizer.decode([one_token]), sep="")

40________I
588_______ like
284_______ to
804_______ look
329_______ for
6290______ rain
25435_____bows
28800_____ comprises
24739_____ GH
7267______ argue
49240_____ decaying
26594_____fighting


In [56]:
text = "I like to look for rainbows"
tokens = tokenizer.encode(text)  # convert text to tokens
tokens = torch.tensor(tokens).unsqueeze(0)  # convert to tensor and add batch dimension
model.eval()  # disable dropout
output = generate_text_with_temperature(
    model=model,
    idx=tokens,
    max_new_tokens=5,
    context_size=1024,
    top_k = None,
    top_p = None,
    temperature = 1.0,  # reproduce the results above
    seed = None,
)
for one_token in output.squeeze():
    print(f"{one_token}".ljust(10, "_"), tokenizer.decode([one_token]), sep="")

40________I
588_______ like
284_______ to
804_______ look
329_______ for
6290______ rain
25435_____bows
28800_____ comprises
24739_____ GH
7267______ argue
49240_____ decaying
26594_____fighting


In [107]:
text = "I like to look for rainbows"
tokens = tokenizer.encode(text)  # convert text to tokens
tokens = torch.tensor(tokens).unsqueeze(0)  # convert to tensor and add batch dimension
model.eval()  # disable dropout
output = generate_text_with_temperature(
    model=model,
    idx=tokens,
    max_new_tokens=5,
    context_size=1024,
    top_k = None,
    top_p = None,
    temperature = 0.9,  # add sample variability
    seed = None,
)
for one_token in output.squeeze():
    print(f"{one_token}".ljust(10, "_"), tokenizer.decode([one_token]), sep="")

40________I
588_______ like
284_______ to
804_______ look
329_______ for
6290______ rain
25435_____bows
13076_____ bust
4258______ climate
1904______use
1824______ustom
34955_____ fishermen


In [119]:
text = "I like to look for rainbows"
tokens = tokenizer.encode(text)  # convert text to tokens
tokens = torch.tensor(tokens).unsqueeze(0)  # convert to tensor and add batch dimension
model.eval()  # disable dropout
output = generate_text_with_temperature(
    model=model,
    idx=tokens,
    max_new_tokens=5,
    context_size=1024,
    top_k = 10,
    top_p = 0.95,
    temperature = 0.9,  # add sample variability
    seed = None,
)
for one_token in output.squeeze():
    print(f"{one_token}".ljust(10, "_"), tokenizer.decode([one_token]), sep="")

Logits index shape torch.Size([1, 50257])
Filter index shape torch.Size([1, 10])
Logits index shape torch.Size([1, 50257])
Filter index shape torch.Size([1, 10])
Logits index shape torch.Size([1, 50257])
Filter index shape torch.Size([1, 10])
Logits index shape torch.Size([1, 50257])
Filter index shape torch.Size([1, 10])
Logits index shape torch.Size([1, 50257])
Filter index shape torch.Size([1, 10])
40________I
588_______ like
284_______ to
804_______ look
329_______ for
6290______ rain
25435_____bows
10712_____ tissue
32769_____ guardians
38679_____ocl
4194______ agg
16081_____ vib
