## Generative Pretrained Transformer (GPT) Model
Now that we understand the attention mechanism, one of the core components of LLMs, we can place it alongside the other building blocks and assemble them into our own GPT model. Up to this point, we have kept the embedding dimensionality small to make things easier to follow. We will now scale everything up to the size of the smallest GPT-2 model (124 million parameters).

#### *Language Models are Unsupervised Multitask Learners (Radford et al., 2019)*
This paper first introduced GPT-2, the largest model of which achieved, at the time, state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting. It represented a huge step towards building language models that could be accurately characterized as 'competent generalists' rather than 'narrow experts': systems that could perform tasks (sentiment analysis, translation, entity extraction, etc.) without the need to create and label a separate training set for each one.

Language modeling is usually framed as unsupervised estimation of a probability distribution over token sequences. Given a corpus of sequences:

$$
\{\,x^{(j)} = (s_1^{(j)}, s_2^{(j)}, \dots, s_{n_j}^{(j)})\}_{j=1}^N.
$$

We maximize the log-likelihood
$$
\mathcal{L} = \sum_{j=1}^N \log p\bigl(x^{(j)}\bigr),
$$

where
$$
p(x) = \prod_{i=1}^{n} p\bigl(s_i \mid s_{<i}\bigr).
$$
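As a minimal sketch of this factorization, the snippet below scores toy sequences by summing per-token conditional log-probabilities. The vocabulary and the rule inside `toy_conditional` are made up for illustration (a real model would learn these conditionals); only the chain-rule bookkeeping mirrors the formulas above.

```python
import math

# Hypothetical toy vocabulary, for illustration only.
VOCAB = ["a", "b", "c"]

def toy_conditional(prefix):
    """Return p(s_i | s_<i) as a dict over VOCAB.

    Assumption: a hand-written rule that makes repeating the previous
    token twice as likely as any other token.
    """
    weights = {tok: 1.0 for tok in VOCAB}
    if prefix:
        weights[prefix[-1]] = 2.0
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def sequence_log_prob(seq):
    """log p(x) = sum_i log p(s_i | s_<i), following the chain rule."""
    log_p = 0.0
    for i, tok in enumerate(seq):
        log_p += math.log(toy_conditional(seq[:i])[tok])
    return log_p

corpus = [["a", "a", "b"], ["c", "b"]]

# Corpus log-likelihood: L = sum_j log p(x^(j))
log_likelihood = sum(sequence_log_prob(x) for x in corpus)
print(log_likelihood)
```

Working in log space turns the product of conditionals into a sum, which is also what keeps the computation numerically stable for long sequences.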

Architectures such as the Transformer, with its self-attention, parametrize each conditional $p(s_i \mid s_{<i})$ directly and have dramatically increased the expressivity of these models. Learning to perform a single task can be framed as estimating a distribution $p(\text{output}\mid\text{input})$. A general solver, however, must also condition on which task to perform: $p(\text{output}\mid\text{input}, \text{task})$. Up to this point, task conditioning in multitask settings had been implemented at the architectural level (task-specific encoders and decoders) or at the algorithmic level (meta-learning loops). The paper's hypothesis was that **unsupervised multitask learning via pure language modeling was possible.**

> When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. [...] high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.
>
>--<cite>Language Models are Unsupervised Multitask Learners, Radford et al., 2019</cite>


In [1]:
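from dataclasses import dataclass

@dataclass
class GPTConfig124:
    vocab_size: int = 50257      # size of the GPT-2 BPE vocabulary
    context_length: int = 1024   # maximum number of tokens per sequence
    emb_dim: int = 768           # embedding (hidden) dimensionality
    n_heads: int = 12            # attention heads per block
    n_layers: int = 12           # number of transformer blocks
    dropout: float = 0.1         # dropout probability
    qkv_bias: bool = False       # whether the Q/K/V projections use a bias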


In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT2Dummy(nn.Module):
    """
    A *do nothing* GPT2 scaffold.
    We will progressively swap nn.Identity for real implementations.
    """
    def __init__(self, cfg: GPTConfig124):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.emb_dim)
        self.pos_emb = nn.Embedding(cfg.context_length, cfg.emb_dim)
        self.drop_emb = nn.Dropout(cfg.dropout)
        self.blocks = nn.ModuleList(
            [nn.Identity() for _ in range(cfg.n_layers)]
        )
        self.norm = nn.Identity()
        self.lm_head = nn.Linear(cfg.emb_dim, cfg.vocab_size, bias=False)

    def forward(self, in_idx: torch.Tensor):
        """
        in_idx: (batch_size, seq_len) tensor of token indices.
        :return: logits: (batch_size, seq_len, vocab_size) tensor of logits (unnormalized scores).
        """
        B, T = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(T, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        for block in self.blocks:
            x = block(x)

        x = self.norm(x)
        logits = self.lm_head(x)
        return logits


- `self.tok_emb` and `self.pos_emb` turn discrete tokens and positions into continuous vectors of size `emb_dim`.
- `self.blocks`, collected in an `nn.ModuleList`, is where the attention and multilayer perceptron layers would normally live. Each block is currently a no-op; it simply returns its input.
- `self.norm` is also a placeholder that does nothing. GPT-2 applies a LayerNorm after the stack of blocks.
- `self.lm_head`, an `nn.Linear` with `bias=False`, projects the final hidden states back to vocabulary size so we can compute logits for next-token prediction.
- The forward pass therefore embeds tokens and positions -> adds them -> applies dropout -> passes through the blocks -> layer norm -> linear head. It mirrors the high-level flow of a real GPT-2, just without any real computation inside the blocks.
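Since `nn.Identity` has no parameters, the scaffold's trainable weights come only from the two embedding tables and the output head. The following back-of-the-envelope count is plain arithmetic on the config values above (no PyTorch needed); the variable names are illustrative.

```python
# Values copied from GPTConfig124 above.
vocab_size = 50257
context_length = 1024
emb_dim = 768

tok_emb_params = vocab_size * emb_dim       # token embedding table
pos_emb_params = context_length * emb_dim   # positional embedding table
lm_head_params = emb_dim * vocab_size       # output projection (bias=False)

total = tok_emb_params + pos_emb_params + lm_head_params

print(f"tok_emb: {tok_emb_params:,}")   # tok_emb: 38,597,376
print(f"pos_emb: {pos_emb_params:,}")   # pos_emb: 786,432
print(f"lm_head: {lm_head_params:,}")   # lm_head: 38,597,376
print(f"total:   {total:,}")            # total:   77,981,184
```

The token embedding and the output head have the same number of entries, and together they dominate the scaffold's parameter count.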

In [3]:
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')
batch = []

text = """A man told me once that all the bad people
Were needed. Maybe not all, but your fingernails
You need; they are really claws, and we know
Claws. The sharks--what about them?
They make other fish swim faster. The hard-faced men
In black coats who chase you for hours
In dreams--that's the only way to get you
To the shore. Sometimes those hard women
Who abandon you get you to say, "You."
A lazy part of us is like a tumbleweed.
It doesn't move on its own. It takes sometimes
A lot of Depression to get tumbleweeds moving.
Then they blow across three or four States.
This man told me that things work together.
Bad handwriting sometimes leads to new ideas;
And a careless God--who refuses to let you
Eat from the Tree of Knowledge--can lead
To books, and eventually to us. We write
Poems with lies in them, but they help a little."""

tokens = tokenizer.encode(text)
tokens.append(tokenizer.eot_token)

print(tokens[:5])

B, T = 2, 4
data = torch.tensor(tokens[:8+1])

x = data[:-1].view(B,T) #input tensor
y = data[1:].view(B,T) #target tensor for next token prediction

print(x)
print(y)


[32, 582, 1297, 502, 1752]
tensor([[  32,  582, 1297,  502],
        [1752,  326,  477,  262]])
tensor([[ 582, 1297,  502, 1752],
        [ 326,  477,  262, 2089]])
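The one-position shift between `x` and `y` can be sketched without PyTorch. The helper below, `make_inputs_and_targets`, is a hypothetical name introduced for illustration; it rebuilds the two printed tensors as nested lists from the first nine token ids shown in the output above.

```python
def make_inputs_and_targets(tokens, batch_size, seq_len):
    """Split a flat token list into next-token-prediction inputs and targets.

    Inputs are tokens[0 : B*T]; targets are the same window shifted by one,
    so targets[b][t] is the token that follows inputs[b][t].
    """
    n = batch_size * seq_len
    if len(tokens) < n + 1:
        raise ValueError("need at least batch_size * seq_len + 1 tokens")
    window = tokens[: n + 1]
    flat_inputs, flat_targets = window[:-1], window[1:]

    def reshape(flat):
        # Cut the flat list into batch_size rows of seq_len tokens each.
        return [flat[r * seq_len:(r + 1) * seq_len] for r in range(batch_size)]

    return reshape(flat_inputs), reshape(flat_targets)

# First nine token ids from the printed output above.
first_tokens = [32, 582, 1297, 502, 1752, 326, 477, 262, 2089]
inputs, targets = make_inputs_and_targets(first_tokens, batch_size=2, seq_len=4)
print(inputs)   # [[32, 582, 1297, 502], [1752, 326, 477, 262]]
print(targets)  # [[582, 1297, 502, 1752], [326, 477, 262, 2089]]
```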


|             ![Full GPT-2 Architecture](images/full_architecture.png)             |
|:--------------------------------------------------------------------------------:|
| *GPT-2 Architecture* (GPT-2 Detailed Model Architecture, **Henry Wu** on Medium) |

In [4]:
torch.manual_seed(123)
model = GPT2Dummy(GPTConfig124())  # pass a config instance rather than the class itself
logits = model(x)
print(f'Output shape:\n {logits.shape}')
print(f'Logits:\n{logits}')

Output shape:
 torch.Size([2, 4, 50257])
Logits:
tensor([[[ 0.3944, -0.0408, -0.2424,  ..., -0.2212,  0.0131,  1.3445],
         [-0.6989,  0.2696, -0.5769,  ..., -0.2170,  0.0857, -0.0557],
         [ 0.8880,  0.1022,  0.5163,  ...,  0.7871,  0.9948,  0.7927],
         [ 0.0979,  0.5000, -0.9337,  ...,  1.7114,  0.3513, -0.3061]],

        [[-0.1601, -0.8575, -0.5662,  ...,  0.3182, -1.2915, -0.2427],
         [-1.3554,  1.2148, -0.4383,  ..., -0.5768,  0.1907,  0.6106],
         [-0.5857,  0.6561,  0.0237,  ...,  1.3546, -0.2586, -1.0205],
         [ 0.1957, -0.2056, -0.4146,  ...,  2.6419, -0.4892, -0.1127]]],
       grad_fn=<UnsafeViewBackward0>)


The output tensor has two "blocks", one for each of the two text samples in the batch. Each sample consists of 4 tokens, and each token is mapped to a 50257-dimensional vector holding one logit per token in the vocabulary. In other words, there are two blocks, each block has 4 rows, and each row has 50257 columns.
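To make the shape concrete, here is a small sketch that turns logits of this shape into next-token probabilities and a greedy prediction. It uses random logits as a stand-in for the model output (an assumption, so that the snippet runs on its own), and the variable names are illustrative.

```python
import torch

torch.manual_seed(0)
B, T, V = 2, 4, 50257            # batch size, tokens per sample, vocabulary size
logits = torch.randn(B, T, V)    # stand-in for model(x)

# Softmax over the last dimension gives one distribution per (sample, position).
probs = torch.softmax(logits, dim=-1)

# The row at the last position scores the token that would follow the sequence.
next_token_probs = probs[:, -1, :]                  # shape (B, V)
next_token_ids = next_token_probs.argmax(dim=-1)    # shape (B,), greedy choice

print(probs.shape)            # torch.Size([2, 4, 50257])
print(next_token_ids.shape)   # torch.Size([2])
```

With the untrained scaffold the predicted ids are meaningless; the point is only how the `(batch, token, vocabulary)` layout is consumed.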

### Layer Normalization

Vanishing or exploding gradients are a common problem when training deep neural networks. They make training unstable and make it difficult for the network to find a set of parameters that robustly minimizes the loss function. In the original Transformer architecture (*Attention is All You Need*, Vaswani et al., 2017, a paper already introduced in our attention exploration), layer normalization was applied after each residual connection, that is, between consecutive residual blocks.

> A *residual block* augments a sub-layer (e.g. self-attention or feed-forward network) with a shortcut connection that simply adds the block's input to its output. If $S(\bullet)$ denotes the sub-layer transformation on input $x$, the block yields $y = x + S(x)$. This connection encourages the network to learn perturbations around the identity function, making very deep architectures trainable. In the original architecture, each attention or feed-forward sub-layer is followed by a residual addition and then layer normalization.
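A residual block with the ordering described above (sub-layer, residual addition, then layer normalization) might be sketched as follows. `PostNormResidual` is a hypothetical wrapper name, and the small feed-forward network standing in for $S$ is an assumption made for the demo.

```python
import torch
import torch.nn as nn

class PostNormResidual(nn.Module):
    """Hypothetical wrapper computing LayerNorm(x + S(x))."""

    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut connection first, normalization second.
        return self.norm(x + self.sublayer(x))

torch.manual_seed(0)
d_model = 8

# Stand-in sub-layer S: a small feed-forward network (assumption for the demo).
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)
block = PostNormResidual(ffn, d_model)

x = torch.randn(2, 4, d_model)   # (batch, tokens, d_model)
y = block(x)
print(y.shape)                   # torch.Size([2, 4, 8])
```

Because the shortcut adds `x` back, a sub-layer whose output is close to zero leaves the block close to a normalized identity, which is the "perturbation around the identity" intuition.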

|              ![Transformer Model Architecture](images/transformerarch.png)              |
|:---------------------------------------------------------------------------------------:|
| *Original Transformer Architecture* (*Attention is All You Need, Vaswani et al., 2017*) |

*Layer normalization* is an evolution of *batch normalization*, which first appeared as a method to speed up the training of deep neural networks, which could otherwise take many days. Batch normalization standardizes the summed inputs to each unit using their mean and standard deviation computed over a mini-batch of training examples. **Batch normalization led to faster convergence and shorter training times, and the randomness of the batch statistics also served as a regularizer during training, reducing variance.**

Batch normalization is also designed to reduce undesirable *covariate shift*, which arises because the gradients with respect to the weights in one layer depend heavily on the outputs of the previous layers.

However, as already mentioned, batch normalization requires averaging the summed-input statistics over the batch. **This is efficient and straightforward in feed-forward networks with fixed depth, where the length of inputs does not vary.** In problems where the summed inputs vary with the length of the input sequence, each distinct time step might require different statistics. Therefore, while batch normalization is applied over the batch dimension independently for each feature index, layer normalization is applied over the $d_{model}$ dimension only (in this case, our 768-dimensional embedding vectors), independently for each $(\text{batch}, \text{token})$ pair.

Layer normalization standardizes the activations of a neural network layer and then applies an element-wise affine transformation. Given an input $v \in \mathbb{R}^d$, we compute:

$$
\mu = \frac{1}{d}\sum_{k=1}^{d}v_k, \quad \sigma = \sqrt{\frac{1}{d}\sum_{k=1}^{d}(v_k - \mu)^2}
$$

and then

$$
\text{LayerNorm}(v) = \gamma\frac{v - \mu}{\sigma} + \beta
$$

where the scale $\gamma$ and bias $\beta$ are learnable parameters that restore representational flexibility. Concretely, recall that in a transformer we are working with activations of shape
$$
\text{(batchSize, numOfTokens, dModel)}
$$

**For batch normalization**, take a tensor of shape $(N, T, d)$. For each feature $k \in \{1, \dots, d\}$, we compute:
$$
\mu_k = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}v_{n,t,k}, \quad \sigma_k^2 = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}(v_{n,t,k} - \mu_k)^2
$$
then
$$
\text{BatchNorm}(v_{n,t,k}) = \gamma_k\frac{v_{n,t,k} - \mu_k}{\sigma_k} + \beta_k
$$

Batch normalization therefore mixes statistics across all examples (and token positions), so:
1. each example's normalization depends on the rest of the mini-batch,
2. it requires reasonably large batch sizes,
3. it cannot be used in a pure online (batch size 1) regime without tricks.
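The cross-example dependency in point 1 can be checked directly. The sketch below implements the per-feature statistics from the batch normalization formulas above by hand (with $\gamma = 1$, $\beta = 0$, and no running averages, which are simplifying assumptions); `batch_norm_manual` is an illustrative name, not a library function.

```python
import torch

torch.manual_seed(0)
N, T, d = 3, 4, 5
v = torch.randn(N, T, d)

def batch_norm_manual(v: torch.Tensor) -> torch.Tensor:
    """Normalize each feature k with statistics pooled over batch and tokens."""
    mu = v.mean(dim=(0, 1), keepdim=True)                  # shape (1, 1, d)
    var = ((v - mu) ** 2).mean(dim=(0, 1), keepdim=True)   # biased variance
    return (v - mu) / torch.sqrt(var)

out = batch_norm_manual(v)

# Replace only example 1 with different values and normalize again.
v2 = v.clone()
v2[1] = 10.0 * torch.randn(T, d)
out2 = batch_norm_manual(v2)

# Example 0 was not touched, yet its normalized output has changed,
# because the pooled mean and variance shifted.
changed = not torch.allclose(out[0], out2[0])
print(changed)   # True
```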

**For layer normalization**, for an activation vector $v_{n,t} \in \mathbb{R}^d$, where $n$ indexes the batch and $t$ indexes the sequence position, we compute:

$$
\mu_{n,t} = \frac{1}{d}\sum_{k=1}^{d}v_{n,t,k}, \quad \sigma_{n,t}^2 = \frac{1}{d}\sum_{k=1}^{d}(v_{n,t,k} - \mu_{n,t})^2
$$

then

$$
\text{LayerNorm}(v_{n,t}) = \gamma\frac{v_{n,t} - \mu_{n,t}}{\sigma_{n,t}} + \beta
$$

Now there is no mixing across different examples or token positions:
1. Each vector is normalized 'in place'. Per-token normalization keeps each position's representation self-contained, which works naturally with residual blocks and attention.
2. The absence of cross-sample dependencies means that no information leaks between examples, and no extra complexity (e.g. tracking running means) is needed when moving between training and inference. The *covariate shift* problem is reduced further by fixing the mean and the variance of the summed inputs **within each layer**.
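The per-token statistics can be sketched the same way. The function below is a hand-written version of the layer normalization formulas (with $\gamma = 1$, $\beta = 0$); `layer_norm_manual` is an illustrative name. A small `eps` is added under the square root for numerical stability, which the formulas above omit, so that the result can be compared with PyTorch's `F.layer_norm` using the same `eps`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, T, d = 3, 4, 5
v = torch.randn(N, T, d)

def layer_norm_manual(v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize every (n, t) vector using its own mean and variance."""
    mu = v.mean(dim=-1, keepdim=True)                  # shape (N, T, 1)
    var = ((v - mu) ** 2).mean(dim=-1, keepdim=True)   # biased variance
    return (v - mu) / torch.sqrt(var + eps)

out = layer_norm_manual(v)
reference = F.layer_norm(v, normalized_shape=(d,), eps=1e-5)
print(torch.allclose(out, reference, atol=1e-5))   # True

# Replace only example 1 with different values and normalize again.
v2 = v.clone()
v2[1] = 10.0 * torch.randn(T, d)
out2 = layer_norm_manual(v2)

# Example 0 is unaffected: its statistics never looked at example 1.
print(torch.allclose(out[0], out2[0]))             # True
```

The same behavior holds with a batch of size 1, which is why no running statistics are needed at inference time.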



