* "vocab_size" refers to a vocabulary of 50,257 tokens, as used by the BPE tokenizer from chapter 2.

* "context_length" denotes the maximum number of input tokens the model can handle, a limit set by the positional embeddings discussed in chapter 2.

* "emb_dim" represents the embedding size, transforming each token into a 768-dimensional vector.

* "n_heads" indicates the count of attention heads in the multi-head attention mechanism, as implemented in chapter 3.

* "n_layers" specifies the number of transformer blocks in the model, which will be elaborated on in upcoming sections.

* "drop_rate" indicates the intensity of the dropout mechanism (0.1 implies a 10% drop of hidden units) to prevent overfitting, as covered in chapter 3.

* "qkv_bias" determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key, and value computations. We will initially disable this, following the norms of modern LLMs, but will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model.

In [9]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
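
A few quantities follow directly from these settings. The sketch below repeats the configuration values so that it runs on its own; the names `cfg`, `head_dim`, `tok_emb_params`, and `pos_emb_params` are chosen here for illustration. It assumes the common multi-head attention design in which the embedding dimension is split evenly across the heads.

```python
# Same values as GPT_CONFIG_124M, repeated so the snippet is self-contained.
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

# Assumption: the embedding is split evenly across attention heads,
# so emb_dim must be divisible by n_heads.
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# Sizes of the two embedding tables (rows x columns).
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

print(f"head_dim: {head_dim}")                                     # 64
print(f"token embedding parameters: {tok_emb_params:,}")           # 38,597,376
print(f"positional embedding parameters: {pos_emb_params:,}")      # 786,432
```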

In [10]:
import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Token and positional embedding tables
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Placeholder transformer blocks, one per layer
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        # Placeholder layer normalization
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        # Output head projects each hidden vector to vocabulary-sized logits
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)  # (batch_size, seq_len, emb_dim)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))  # (seq_len, emb_dim)
        x = tok_embeds + pos_embeds  # Positions broadcast across the batch
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)  # (batch_size, seq_len, vocab_size)
        return logits


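class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Placeholder: no layers are defined yet

    def forward(self, x):
        # Returns the input unchanged
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters only mimic the LayerNorm interface

    def forward(self, x):
        # Returns the input unchanged
        return x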
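
To see how the token and positional embeddings combine in `forward`, the following sketch traces the tensor shapes with deliberately small, made-up sizes (the values of `vocab_size`, `context_length`, and `emb_dim` here are illustrative assumptions, not the GPT-2 settings). It uses only standard PyTorch calls.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Toy sizes (assumptions for illustration, much smaller than GPT_CONFIG_124M).
vocab_size, context_length, emb_dim = 10, 6, 8

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])         # (batch_size=2, seq_len=4)
batch_size, seq_len = in_idx.shape

tok_embeds = tok_emb(in_idx)                  # (2, 4, 8)
pos_embeds = pos_emb(torch.arange(seq_len))   # (4, 8)

# Broadcasting adds the same positional vectors to every sample in the batch.
x = tok_embeds + pos_embeds                   # (2, 4, 8)

print(tok_embeds.shape, pos_embeds.shape, x.shape)
```

Because the dummy transformer blocks and the dummy layer norm return their input unchanged, only the embeddings, dropout, and the output head affect the values in `DummyGPTModel` at this stage.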

## Text Tokenization

In [11]:
import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")

batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(tokenizer.encode(txt1))
batch.append(tokenizer.encode(txt2))

batch = torch.tensor(batch)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


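The model returns one output vector per input token, so the number of output positions matches the number of input tokens. Because each output position is a prediction for the next token, the predicted sequence is shifted by one position relative to the input, and the first input token has no counterpart in the output.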
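
One way to read this, assuming the standard next-token prediction setup, is that the targets are the inputs shifted by one position. The sketch below uses the token IDs for "Every effort moves you" from the batch above; the names `inputs` and `targets` are chosen here for illustration.

```python
# Token IDs for "Every effort moves you", copied from the batch above.
token_ids = [6109, 3626, 6100, 345]

# Assumption: position i is used to predict the token at position i + 1.
inputs = token_ids[:-1]
targets = token_ids[1:]

# Inputs and targets have the same length, but the first token
# never appears among the targets.
for i in range(len(inputs)):
    print(f"context {inputs[:i + 1]} -> target {targets[i]}")
```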

In [12]:
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print(f"Shape: {logits.shape}")
logits

Shape: torch.Size([2, 4, 50257])


tensor([[[-0.8403,  0.1084, -1.5787,  ...,  0.7093, -0.6889, -0.4551],
         [-0.4453, -0.8470,  0.6933,  ..., -0.1697,  0.6784, -0.3407],
         [-1.6366,  0.4718,  0.4861,  ..., -0.7956,  1.1207,  1.8033],
         [-0.8734,  0.0432,  0.4491,  ..., -0.7978, -0.6560,  1.9442]],

        [[-1.3049,  0.2769, -1.3049,  ...,  0.6904, -1.0820, -0.0545],
         [-0.4211,  0.6843, -1.1777,  ...,  0.8300,  1.0757, -0.6688],
         [-0.9818,  0.2407, -0.7427,  ..., -0.3881, -0.0753,  0.1227],
         [-0.6742,  0.5832,  0.8282,  ..., -0.5311, -0.3442,  1.0316]]],
       grad_fn=<UnsafeViewBackward0>)

The output tensor has two rows corresponding to the two text samples. Each text sample consists of 4 tokens, and the output for each token position is a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary.

Each output vector has 50,257 dimensions because each dimension corresponds to a unique token in the vocabulary. At the end of this chapter, when we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words.
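
As a preview of that conversion, here is a minimal sketch that assumes greedy selection (taking the most likely token at each position). It uses a random tensor as a stand-in for the model output, so the resulting IDs are meaningless; the postprocessing code developed later in the chapter may differ.

```python
import torch

torch.manual_seed(123)

# Random stand-in for the model output: (batch_size, num_tokens, vocab_size).
logits = torch.randn(2, 4, 50257)

# Softmax turns each 50,257-dimensional vector into a probability distribution.
probas = torch.softmax(logits, dim=-1)

# The index of the largest entry is the predicted token ID for that position.
token_ids = torch.argmax(probas, dim=-1)

print(token_ids.shape)  # torch.Size([2, 4])
```

Each row of `token_ids` could then be converted to a Python list and passed to `tokenizer.decode` to obtain text.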