## Generative Pretrained Transformer (GPT) Model
Now that we understand the attention mechanism, one of the core components of LLMs, we can place it alongside the other building blocks and assemble them into our own GPT model. Up to this point, we have kept the embedding dimensionality small to make the concepts easier to follow. We will now scale everything up to match the smallest GPT-2 model (124 million parameters).

#### *Language Models are Unsupervised Multitask Learners (Radford et al., 2019)*
This paper introduced GPT-2, whose largest model achieved what were then state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting. It was a major step towards language models that can be described as 'competent generalists' rather than 'narrow experts': systems that perform tasks (sentiment analysis, translation, entity extraction, etc.) without a separately created and labeled training set for each one.

Language modeling is usually framed as unsupervised estimation of a probability distribution over token sequences. Given a corpus of $N$ sequences:

$$
\{\,x^{(j)} = (s_1^{(j)}, s_2^{(j)}, \dots, s_{n_j}^{(j)})\}_{j=1}^N.
$$

We maximize the log-likelihood
$$
\mathcal{L} = \sum_{j=1}^N \log p\bigl(x^{(j)}\bigr),
$$

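where each sequence probability factorizes into a product of next-token conditionals (dropping the index $j$ and writing $n$ for the sequence length):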
$$
p(x) = \prod_{i=1}^{n} p\bigl(s_i \mid s_{<i}\bigr).
$$
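
To make the factorization concrete, here is a minimal sketch in plain Python that evaluates $\log p(x)$ as a sum of next-token log-probabilities and then sums over a toy corpus. The names `sequence_log_prob`, `uniform_logits`, and `repeat_logits` are made up for this illustration, and the two "models" are hand-written stand-ins for a trained network, not anything from the paper.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(tokens, next_token_logits):
    """log p(x) = sum_i log p(s_i | s_<i), following the chain rule."""
    log_p = 0.0
    for i, tok in enumerate(tokens):
        probs = softmax(next_token_logits(tokens[:i]))  # condition on the prefix s_<i
        log_p += math.log(probs[tok])
    return log_p

# Hypothetical toy model 1: ignores the prefix, uniform over 4 tokens.
def uniform_logits(prefix, vocab_size=4):
    return [0.0] * vocab_size

# Hypothetical toy model 2: prefers repeating the previous token.
def repeat_logits(prefix, vocab_size=4):
    logits = [0.0] * vocab_size
    if prefix:
        logits[prefix[-1]] = 2.0
    return logits

corpus = [[0, 1, 2], [3, 3]]

# Corpus log-likelihood: the sum over sequences of log p(x^(j)).
total = sum(sequence_log_prob(x, uniform_logits) for x in corpus)
print(total)
```

Under the uniform model every token contributes $\log(1/4)$, so the corpus log-likelihood is simply five times that value. The repeat-favoring model assigns a higher log-probability to `[3, 3]` than to `[3, 2]`, which shows how each conditional depends on the prefix.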

Architectures such as the Transformer, with its self-attention mechanism, parametrize each conditional $p(s_i \mid s_{<i})$ with far greater expressivity than earlier models. Learning to perform a single task can be framed as estimating a distribution $p(\text{output}\mid\text{input})$. A general system, however, must also condition on which task to perform, that is, model $p(\text{output}\mid\text{input}, \text{task})$. Before GPT-2, task conditioning in multitask settings was typically implemented at the architectural level (task-specific encoders and decoders) or at the algorithmic level (for example, meta-learning loops). Because the task, the input, and the output can all be written as text, a language model can in principle learn this conditional from raw sequences alone. The paper's hypothesis was that **unsupervised multitask learning via pure language modeling was possible.**

> When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. [...] high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.
>
>--<cite>Language Models are Unsupervised Multitask Learners, Radford et al., 2019</cite>
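
One way to picture task conditioning through text is to flatten the task description, the input, and the output into a single sequence, so that $p(\text{output}\mid\text{input}, \text{task})$ becomes ordinary next-token prediction. The sketch below is only an illustration: the helper `to_sequence` and the exact string layout are assumptions made here, not the format used in the paper.

```python
def to_sequence(task, inputs, output=None):
    """Flatten (task, inputs..., output) into one space-joined string.

    Hypothetical helper: the layout is an assumption for illustration only.
    """
    parts = [task, *inputs]
    if output is not None:
        parts.append(output)
    return " ".join(parts)

# A training example contains the answer; a language model would simply
# see it as one more sequence of text.
train_example = to_sequence("translate to french:", ["cheese"], "fromage")

# At inference time the same layout without the output acts as a prompt,
# and the model's continuation plays the role of the output.
prompt = to_sequence("translate to french:", ["cheese"])

print(train_example)
print(prompt)
```

The prompt is a prefix of the training example, so predicting the continuation of the prompt is the same problem as modeling the full sequence. No task-specific encoder or decoder is involved.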


```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # number of tokens in the vocabulary
    "context_length": 1024,  # maximum number of input tokens (size of the positional embedding table)
    "emb_dim": 768,          # each token is mapped to a 768-dimensional vector
    "n_heads": 12,           # number of attention heads per transformer block
    "n_layers": 12,          # number of transformer blocks in the model
    "dropout": 0.1,          # dropout rate
    "qkv_bias": False,       # whether the query, key, and value projections use a bias term
}
```
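
As a sanity check on the "124 million" figure, the parameter count can be estimated directly from these settings. The sketch below assumes a standard GPT-2-style layout that the configuration itself does not spell out: learned positional embeddings, a feed-forward layer with a 4x expansion, two layer norms per block plus a final one (each with scale and shift), an output projection with bias in the attention module, and an output head without bias that can share weights with the token embedding. The function `count_parameters` is a hypothetical helper, and the dictionary is repeated as `cfg` so the snippet runs on its own.

```python
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "dropout": 0.1,
    "qkv_bias": False,
}

def count_parameters(cfg, tie_weights=True):
    """Estimate trainable parameters under the assumptions stated above."""
    d = cfg["emb_dim"]
    assert d % cfg["n_heads"] == 0, "emb_dim must be divisible by n_heads"

    tok_emb = cfg["vocab_size"] * d       # token embedding table
    pos_emb = cfg["context_length"] * d   # learned positional embeddings

    # Multi-head attention: Q, K, V projections plus the output projection.
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d
    # Feed-forward network with an assumed 4x hidden expansion (weights + biases).
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d
    block = qkv + attn_out + ffn + norms

    final_norm = 2 * d
    # With weight tying, the output head reuses the token embedding matrix.
    out_head = 0 if tie_weights else d * cfg["vocab_size"]

    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head

print(f"{count_parameters(cfg):,}")                     # 124,412,160
print(f"{count_parameters(cfg, tie_weights=False):,}")  # 163,009,536
```

Under these assumptions the count only lands near 124 million when the output head shares its weights with the token embedding. With a separate output head, the same configuration gives roughly 163 million parameters. Neither `n_heads` nor `dropout` changes the total: the heads split the 768 dimensions into 12 slices of 64, and dropout has no trainable weights.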