## Generative Pretrained Transformer (GPT) Model
Now that we understand the attention mechanism, one of the core components of LLMs, we can place it alongside the other building blocks and assemble them into our own GPT model. So far we have kept the embedding dimensionality small to make the ideas easier to follow. We will now scale everything up to match the smallest GPT-2 model (124 million parameters).

#### *Language Models are Unsupervised Multitask Learners (Radford et al., 2019)*
This paper introduced GPT-2, whose largest model achieved what were, at the time, state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting. It was a major step toward language models that could be accurately characterized as 'competent generalists' rather than 'narrow experts': systems that can perform tasks (sentiment analysis, translation, entity extraction, etc.) without a separately created and labeled training set for each one.

Language modeling is usually framed as unsupervised estimation of a probability distribution over token sequences. Given a corpus of $N$ sequences

$$
\{\,x^{(j)} = (s_1^{(j)}, s_2^{(j)}, \dots, s_{n_j}^{(j)})\,\}_{j=1}^N,
$$

we maximize the log-likelihood
$$
\mathcal{L} = \sum_{j=1}^N \log p\bigl(x^{(j)}\bigr),
$$

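where, for a single sequence $x = (s_1, \dots, s_n)$, the joint probability factorizes by the chain rule into next-token conditionals: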
$$
p(x) = \prod_{i=1}^{n} p\bigl(s_i \mid s_{<i}\bigr).
$$
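
To make the factorization concrete, here is a minimal sketch that scores a tiny corpus by summing the logs of next-token conditionals. The `BIGRAM` table and its probabilities are made up for illustration, and the toy model conditions only on the previous token, which is a simplification of the full prefix $s_{<i}$; the helper names are hypothetical.

```python
import math

# Hypothetical toy model: probability of the next token given only the previous one.
# "<s>" stands for the start of a sequence. All numbers are invented for illustration.
BIGRAM = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.25, "dog": 0.75},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}


def cond_prob(token, prefix):
    """p(s_i | s_<i), here approximated by looking at the last prefix token only."""
    prev = prefix[-1] if prefix else "<s>"
    return BIGRAM[prev][token]


def sequence_log_prob(tokens):
    """log p(x) = sum_i log p(s_i | s_<i), i.e. the log of the chain-rule product."""
    return sum(math.log(cond_prob(tok, tokens[:i])) for i, tok in enumerate(tokens))


def corpus_log_likelihood(corpus):
    """L = sum_j log p(x^(j)) over all sequences in the corpus."""
    return sum(sequence_log_prob(x) for x in corpus)


corpus = [("the", "cat", "sleeps"), ("a", "dog", "sleeps")]
log_likelihood = corpus_log_likelihood(corpus)
print(f"corpus log-likelihood: {log_likelihood:.4f}")
```

Training a language model amounts to adjusting the conditionals (here a fixed table, in GPT a neural network) so that this sum is as large as possible on the training corpus.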

Recent architectures such as the Transformer, with its self-attention, parametrize each conditional $p(s_i \mid s_{<i})$ and have dramatically increased the expressivity of these models. Learning to perform a single task can be framed as estimating a distribution $p(\text{output}\mid\text{input})$. A general system, however, must also condition on which task to perform: $p(\text{output}\mid\text{input, task})$. Until then, task conditioning in multitask settings had typically been implemented at the architectural level (task-specific encoders and decoders) or at the algorithmic level (for example, meta-learning loops). The paper's hypothesis was that **unsupervised multitask learning via pure language modeling was possible**: because the task, the input, and the output can all be written as sequences of symbols, a language model can in principle pick up the task from text alone.
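
As a rough illustration of conditioning on the task through text, the sketch below flattens a (task, input, output) triple into one string that an ordinary language model could be trained on. The helper name, the newline separator, and the example strings are assumptions made for this illustration, not a format prescribed by the paper.

```python
def as_language_modeling_example(task, inputs, output, sep="\n"):
    """Serialize a supervised example as a single text sequence.

    task:   natural-language description of the task
    inputs: list of input strings (one or more)
    output: target string
    sep:    hypothetical separator; any consistent textual delimiter would do
    """
    return sep.join([task, *inputs, output])


# A translation example and a question-answering example, expressed as plain text.
translation = as_language_modeling_example(
    "translate to french", ["sea otter"], "loutre de mer"
)
qa = as_language_modeling_example(
    "answer the question",
    ["GPT-2 was introduced in 2019.", "When was GPT-2 introduced?"],
    "2019",
)
print(translation)
print(qa)
```

Once everything is a single sequence, $p(\text{output}\mid\text{input, task})$ is just another next-token prediction problem: the model continues the text after the task description and the input.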

> When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. [...] high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.
>
>--<cite>Language Models are Unsupervised Multitask Learners, Radford et al., 2019</cite>


In [None]:
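from dataclasses import dataclass

@dataclass
class GPTConfig124:
    vocab_size: int = 50257      # size of the GPT-2 BPE vocabulary
    context_length: int = 1024   # maximum number of tokens per input sequence
    emb_dim: int = 768           # embedding (hidden) dimensionality
    n_heads: int = 12            # attention heads per transformer block
    n_layers: int = 12           # number of transformer blocks
    dropout: float = 0.1         # dropout probability
    qkv_bias: bool = False       # whether the query/key/value projections use a bias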
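
As a rough sanity check on the "124 million parameters" figure, the following sketch estimates the parameter count implied by this configuration. It relies on assumptions that this section has not established yet: each block holds a multi-head attention layer with an output projection, a two-layer MLP with a 4x hidden expansion, and two LayerNorms, and the output head shares its weights with the token embedding. `estimate_gpt2_params` is a hypothetical helper, not part of any library.

```python
def estimate_gpt2_params(
    vocab_size=50257,
    context_length=1024,
    emb_dim=768,
    n_layers=12,
    qkv_bias=False,
    mlp_ratio=4,        # assumption: MLP hidden size is 4 * emb_dim
    tie_lm_head=True,   # assumption: output head reuses the token embedding matrix
):
    """Back-of-the-envelope parameter count for a GPT-2-style model."""
    tok_emb = vocab_size * emb_dim
    pos_emb = context_length * emb_dim

    # Attention: fused Q, K, V projections plus the output projection (with bias).
    qkv = 3 * emb_dim * emb_dim + (3 * emb_dim if qkv_bias else 0)
    attn_out = emb_dim * emb_dim + emb_dim

    # MLP: expand to the hidden size, then project back (both layers with bias).
    hidden = mlp_ratio * emb_dim
    mlp = (emb_dim * hidden + hidden) + (hidden * emb_dim + emb_dim)

    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * (2 * emb_dim)

    block = qkv + attn_out + mlp + norms
    final_norm = 2 * emb_dim
    lm_head = 0 if tie_lm_head else vocab_size * emb_dim

    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head


total = estimate_gpt2_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes out at roughly 124.4 million parameters. An output projection that does not share weights with the token embedding would add another `vocab_size * emb_dim` parameters, about 38.6 million.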


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT2Dummy(nn.Module):
    """
    A *do nothing* GPT2 scaffold.
    We will progressively swap nn.Identity for real implementations.
    """
    def __init__(self, cfg: GPTConfig124):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.emb_dim)
        self.pos_emb = nn.Embedding(cfg.context_length, cfg.emb_dim)
        self.drop_emb = nn.Dropout(cfg.dropout)
        self.blocks = nn.ModuleList(
            [nn.Identity() for _ in range(cfg.n_layers)]
        )
        self.norm = nn.Identity()
        self.lm_head = nn.Linear(cfg.emb_dim, cfg.vocab_size, bias=False)

    def forward(self, in_idx: torch.Tensor):
        """
        :param in_idx: (batch_size, seq_len) tensor of token indices.
        :return: logits: (batch_size, seq_len, vocab_size) tensor of logits (unnormalized scores).
        """
        _, T = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)  # (batch_size, seq_len, emb_dim)
        # Positions 0..T-1; the (seq_len, emb_dim) result broadcasts over the batch.
        pos_embeds = self.pos_emb(torch.arange(T, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        for block in self.blocks:
            x = block(x)

        x = self.norm(x)
        logits = self.lm_head(x)
        return logits


- `self.tok_emb` and `self.pos_emb` turn discrete token IDs and positions into continuous vectors of size `emb_dim`.
- `self.blocks`, collected in an `nn.ModuleList`, is where the attention and multilayer perceptron layers would normally live. Each block is currently a no-op; it simply returns its input.
- `self.norm` is also a placeholder that does nothing. GPT-2 applies a LayerNorm after the stack of blocks.
- `self.lm_head`, an `nn.Linear` with `bias=False`, projects the final hidden states back to vocabulary size so we can compute logits for the next-token predictions.
- The forward pass therefore embeds tokens and positions -> adds them -> applies dropout -> passes through the blocks -> final norm -> linear head. It mirrors the high-level flow of a real GPT-2, just without any real computation inside the blocks or the norm.
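
The flow described in the list above can be traced shape by shape with a minimal standalone sketch. The sizes here are deliberately tiny, hypothetical values rather than the GPT-2 configuration, dropout is omitted for brevity, and a single `nn.Identity` stands in for the whole stack of blocks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes so the example runs instantly.
vocab_size, context_length, emb_dim = 100, 16, 8

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)
block = nn.Identity()  # placeholder for the transformer blocks
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)

in_idx = torch.randint(0, vocab_size, (2, 5))         # (batch_size=2, seq_len=5)

tok_embeds = tok_emb(in_idx)                          # (2, 5, 8)
pos_embeds = pos_emb(torch.arange(in_idx.shape[1]))   # (5, 8)
x = tok_embeds + pos_embeds                           # broadcasts to (2, 5, 8)
x = block(x)                                          # unchanged by the placeholder
logits = lm_head(x)                                   # (2, 5, 100)

# The logits at the last position score every vocabulary entry as the next token.
next_token = logits[:, -1, :].argmax(dim=-1)          # (2,)

print(tok_embeds.shape, pos_embeds.shape, logits.shape, next_token.shape)
```

Because nothing has been trained and the blocks do nothing, the predicted tokens are meaningless; the point is only that one row of logits over the vocabulary comes out for every input position.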