<a href="https://colab.research.google.com/github/royam0820/Notebooks/blob/master/amr_gpt_dev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT.

ChatGPT is interesting. It generates text sequentially in response to a prompt, and it does so slightly differently every time.<br>It also seems to accept almost any prompt.<br>

**ChatGPT is a probabilistic system, a language model**.<br>
**It continues a sequence started by our prompt by modeling a continuing sequence of words.**

How does this work? What kind of model is applied under the hood?<br>
[Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) proposed the Transformer model architecture.<br>
A transformer-based language model is a type of neural network architecture used for natural language processing tasks such as language translation, text summarization, and language generation. The key innovation of the transformer architecture is the **attention mechanism**, which allows the model to weigh the importance of different parts of the input when making predictions.

Transformers have since taken over much of the field of AI.

## Objective
**We will train a transformer-based, character-level language model** on [Tiny-Shakespeare](https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt) (all of Shakespeare in a single file).

Given a chunk of text from [Tiny Shakespeare](https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt), the transformer will predict which character comes next.
GPT-style models were state of the art in language modeling as of 2022.

In [None]:
!pip install tiktoken



In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import tiktoken

# The Dataset

Let's first look at the contents of the dataset:

In [None]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset - file input.txt
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-03-12 15:57:02--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2024-03-12 15:57:02 (15.1 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



NB: our small dataset contains Shakespeare text stored in a file called `input.txt` of about 1 MB, which is roughly 1 million characters. We will use this file to model how these characters follow each other.

In [None]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [None]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [None]:
# finding the unique characters that occur in this text
chars = sorted(list(set(text)))   # Get all unique characters in the text
vocab_size = len(chars)           # Length of the vocabulary (this includes the space character)
print(''.join(chars))             # joins all the characters in chars back into a single string
print('vocabulary size:', vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocabulary size: 65


NB: **The vocabulary size is a total of 65 characters for the text variable, space and newline included.**

`chars = sorted(list(set(text)))` This line performs several operations on `text`:
- `set(text)`: Converts the string text into a set of its unique characters. A set is a collection that automatically removes duplicates, so after this operation, each character from text will appear only once.
- `list(set(text))`: Converts the set of unique characters into a list. Strictly speaking this step is optional, because `sorted` accepts any iterable (including a set) and always returns a list.
- `sorted(list(set(text)))`: Sorts the unique characters by Unicode code point, which places the newline and the space first and uppercase letters before lowercase ones. This gives the characters a consistent order. The result is assigned to the variable `chars`.

`print(''.join(chars))`: This line joins all the characters in chars back into a single string (with no spaces between them) and prints it. This shows what unique characters are present in the text, in sorted order.
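To make these steps concrete, here is a minimal sketch on a short made-up string. The names `sample` and `toy_chars` are illustrative and not part of the notebook. It uses the same expression as above and prints the code points, so the resulting order is easy to check.

```python
# Toy string standing in for `text` (hypothetical example value)
sample = "To be,\nor not"

# Same expression as in the notebook: deduplicate, convert to a list, sort
toy_chars = sorted(list(set(sample)))

print(toy_chars)
print([ord(c) for c in toy_chars])   # code points increase from left to right
```

The newline (code point 10) and the space (32) sort ahead of the comma and the letters, and the uppercase `T` sorts before every lowercase letter.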

## Tokenization Process - using the encoder and decoder

In [None]:
# create a mapping from characters to integers and vice-versa
# building a look-up table via a dictionary
stoi = { ch:i for i,ch in enumerate(chars) } # Character to index mapping
itos = { i:ch for i,ch in enumerate(chars) } # Index to character mapping

encode = lambda s: [stoi[c] for c in s]           # encode a string to a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # Decode a list of integers to a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


NB: this is a **very simple encoding/decoding procedure; in practice, people use subword tokenizers**. Google, for example, uses [sentencepiece](https://github.com/google/sentencepiece) for text tokenization and detokenization. SentencePiece implements **subword units** (e.g., **byte-pair encoding (BPE)**), meaning you encode neither entire words nor individual characters, but pieces in between. OpenAI has its own library called [tiktoken](https://github.com/openai/tiktoken), a **fast BPE tokeniser** for use with OpenAI's models. It also helps developers estimate API usage costs by counting the tokens in a text, and it can load the encoding that matches a given model.

## Digression - Tiktoken

Different systems use different approaches to encoding/decoding.<br>
For example, OpenAI uses byte-pair encoding (BPE) with their GPT-2 model.<br>
BPE is a subword tokenization technique. It is a bit more complex than what we will do here, but it is shown briefly below nonetheless:

In [None]:
enc = tiktoken.get_encoding('gpt2')

msg = "hii there"
token_list = enc.encode(msg)
print(token_list) # BPE returns fewer tokens than the character encoding
print(enc.decode(enc.encode("hii there")))

print(enc.n_vocab) # total amount of tokens in the vocabulary

[71, 4178, 612]
hii there
50257


NB: Tiktoken shows that there is a trade-off between the length of the encoded sequence and the size of the vocabulary.<br>
**We can have short sequences of tokens with very large vocabulary, or we can just as well have long sequences of tokens with a small vocabulary**.

This BPE approach is used widely for NLP tasks nowadays.
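To give a rough feeling for how BPE trades vocabulary size against sequence length, here is a minimal sketch of the core merge step in plain Python. It is a simplified illustration, not how tiktoken is implemented: real tokenizers work on bytes and learn their merges from a large corpus. The helper names `most_frequent_pair` and `merge_pair` are made up for this example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

original = "hii there hii"
tokens = list(original)               # start from single characters
for _ in range(3):                    # three merge rounds (arbitrary choice)
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)

print(tokens)
print(len(original), "characters ->", len(tokens), "tokens")
```

Each merge adds one entry to the vocabulary and shortens the sequence, which is the trade-off described above.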

## Tokenizing the Entire Dataset

In [None]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(f'Total size: {data.shape} elements of type {data.dtype}')
print(data[:1000]) # the 1000 characters we looked at earlier

Total size: torch.Size([1115394]) elements of type torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 5

NB: this output is the character-by-character tokenization of the first 1000 characters of the text. For instance, `0` is the newline character and `1` is a space.

## Splitting the Dataset

The entire Tiny-Shakespeare text is now represented as a sequence of integers.
We can start separating the data into training and validation sets.

The **split** between a **training and a validation dataset** is a fundamental concept in machine learning and data science, aimed at creating robust and generalizable models. Here's an overview of what it means and why it's important:

**Training Dataset**: This is the subset of the data that we use **to train our machine learning model**. The model learns to make predictions or decisions based on this data. The training process involves adjusting the model's parameters to minimize the error between the predicted outputs and the actual outcomes in the training dataset.

**Validation Dataset**: This subset of the data is used **to evaluate the model's performance during the training phase**. It acts as a proxy for test data, helping to tune the hyperparameters (settings of the model that are fixed before the training process begins, like learning rate or the depth of a decision tree) and to prevent overfitting. Overfitting occurs when a model learns the training data too well, including its noise and outliers, making it perform poorly on unseen data because it has essentially memorized the training dataset rather than learned the underlying patterns.

The primary purposes of creating a split between training and validation datasets include:

**Model Evaluation**: The validation dataset provides a reliable estimate of the performance of the model on new, unseen data. This helps in evaluating how well the model has learned from the training dataset and how it generalizes to data it hasn't seen before.

**Hyperparameter Tuning**: The validation set is crucial for tuning the model's hyperparameters. By evaluating the model's performance on the validation set, one can adjust the hyperparameters to find the best combination that maximizes the model's performance.

**Preventing Overfitting**: Regularly checking the model's performance on the validation set during training can signal if the model is starting to memorize the training data rather than learning general patterns. If the model's performance on the training set improves while its performance on the validation set worsens, it's likely overfitting.

A typical **workflow** involves iteratively training the model on the training dataset, assessing its performance on the validation dataset, and adjusting the model or its hyperparameters based on this assessment. After finalizing the model, its performance is then tested on a separate, untouched dataset known as the test dataset to evaluate its real-world applicability.

The **split ratio** between the training and validation datasets can vary depending on the total size of the dataset and the specific problem or domain, but common splits include **70/30, 80/20**, or using techniques like k-fold cross-validation for more efficient use of the data.
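As a small worked example, the sketch below splits a toy sequence into a leading training part and a trailing validation part, then repeats the arithmetic for the dataset length reported earlier (1115394 characters). The function name `contiguous_split` and the 90/10 default are assumptions made for illustration. The split is contiguous rather than shuffled, so the text stays in reading order.

```python
def contiguous_split(sequence, train_fraction=0.9):
    """Split a sequence into a leading training part and a trailing validation part."""
    n = int(train_fraction * len(sequence))
    return sequence[:n], sequence[n:]

toy_tokens = list(range(20))                  # stand-in for the encoded text
toy_train, toy_val = contiguous_split(toy_tokens)
print(len(toy_train), len(toy_val))           # 18 2

# Same arithmetic for the real dataset length
total = 1115394
n_train = int(0.9 * total)
print(n_train, total - n_train)               # 1003854 111540
```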

In [None]:
# splitting the data between train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest 10% will be val
train_data = data[:n]
val_data = data[n:]

In [None]:
# amr
print("length training set:", len(train_data))
print("length validation set:", len(val_data))

length training set: 1003854
length validation set: 111540


### Block Size

We will now sample **random chunks from the training set** and train on them one chunk at a time. Feeding the whole training set to the network in one shot would be computationally too expensive. **The length of a chunk is called the block size**.

Block size in NLP tasks refers to the number of tokens (words or characters) that a model can process in a single input.

For models like Transformers, the **block size is critical** because it determines the **sequence length that the model can handle at once**, affecting both the computational resources required and the model's ability to capture long-distance dependencies in the data.

Let's prepare the model. We will never feed our model the entire sequence of tokens as prompt at once.<br>
Instead, we will feed it **a randomly drawn but consecutive sequence of tokens**.<br>
The model will then predict the next token in the sequence from this prompt.<br>

> We call these consecutive, size-limited input sequences of tokens **blocks**.<br>
> Size-limited means that blocks can have a length of up to `block_size`.

In [None]:
block_size = 8              # Upper limit on the length of the text sequences
train_data[:block_size+1]   # First 9 characters (8 + 1 for the target)

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

NB: these are the first 9 characters of the training set. When we plug this sequence into a Transformer, we actually train it to make a prediction at every position simultaneously. **This sequence of 9 characters contains 8 individual examples**, processed as follows:
- in the context of [`18`], → `47` comes next
- in the context of [`18, 47`], → `56` comes next
- in the context of [`18, 47, 56`], → `57` comes next
- ... and so on ...

Let's spell it out in code:

In [None]:
# Predicting the next token based on a given context

# the first block of tokens
x = train_data[:block_size] # x inputs to the Transformer, e.g. [1, 2, 3, 4, 5, 6, 7, 8]

# individual tokens shifted by one (also including the very last token now)
y = train_data[1:block_size+1] # y is the target, e.g. [2, 3, 4, 5, 6, 7, 8, 9]

# iterating over the block size of 8
for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f"when input is {context} the target is {target}")


when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


NB: processing the data this way gets the Transformer used to seeing **contexts** of every length, from a single token all the way up to the block size. The code is explained below:

**Data preparation**:
- `x` is defined as the first block_size tokens from `train_data`, serving as the initial context or input sequence for the model.
- `y` is essentially `x` offset by one position: `y[t]` is the token that follows `x[t]` in the data, which is the token the model should predict after seeing `x[:t+1]`. The last token in `y` lies just beyond the first `block_size` tokens.

**Iterative Prediction Task Setup**:
- The loop `for t in range(block_size)`: iterates through each position in the block_size, creating increasingly larger contexts and their corresponding targets.
- `context = x[:t+1]` gradually increases the context window by including one more token from x in each iteration. Initially, the context contains just the first token, and by the end of the loop, it includes all block_size tokens.
- `target = y[t]` identifies the next token that the model should predict based on the given context. It's the token immediately following the last token in the current context.

### Batch Size
Extracting a batch of sequences and the corresponding next elements as targets from these datasets.

**The batch size is the number of training examples processed before the model's internal parameters are updated**. For example, if you have a dataset of 1000 sentences and you choose a batch size of 100, the dataset will be divided into 10 batches. Each batch of 100 sentences will be passed through the network in sequence, with each pass followed by an update to the model's weights.

**Impact on Training:**

- **Memory Usage**: A larger batch size requires more memory, as more data needs to be loaded and processed simultaneously. This can be a limiting factor depending on the hardware being used for training.
- **Convergence**: The choice of batch size can affect how quickly and smoothly the model converges to a solution. **Smaller batches** often lead to faster convergence but can result in a more erratic learning process. **Larger batches** provide more stable and accurate estimates of the gradient, but they might make the learning process slower and potentially get stuck in local minima.
- **Generalization**: Some studies suggest that smaller batch sizes may lead to better generalization in the trained model. This is thought to be because the noise introduced by the smaller subsets helps to regularize the model.

**Types of Batch Size**:

- **Mini-Batch Gradient Descent**: This is the most common training method, where the batch size is a compromise between the extremes of 1 example per batch (stochastic gradient descent) and the entire dataset per batch (batch gradient descent). It balances the need for computational efficiency with the benefits of stochastic updates.
- **Choosing Batch Size**: The optimal batch size is often determined experimentally, as it can depend on the specific task, the model architecture, and the hardware capabilities. Researchers and practitioners might start with a value that fits their system’s memory constraints and adjust based on training speed and model performance outcomes.

In summary, batch size in NLP tasks (and machine learning more broadly) is a critical hyperparameter that influences the efficiency, convergence speed, and generalization performance of the training process.
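The 1000-sentence example above can be checked with a few lines of plain Python. This is only a sketch of how a dataset divides into mini-batches. The helper `iterate_minibatches` and the placeholder sentences are made up for illustration, and no model is involved.

```python
import math

def iterate_minibatches(examples, batch_size):
    """Yield consecutive slices of at most `batch_size` examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

sentences = [f"sentence {i}" for i in range(1000)]   # placeholder data
batches = list(iterate_minibatches(sentences, batch_size=100))

print(len(batches))                  # 10 batches, so 10 weight updates per pass
print(math.ceil(1000 / 128))         # 8 batches if batch_size were 128 (the last one is smaller)
```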

## Dataloader

Every time we feed inputs to the transformer, we pass **a batch of multiple chunks of text stacked up in a single tensor**. This is done for efficiency, as GPUs are very good at processing data in parallel. The 1-dimensional chunks are stacked to form an 8×4 tensor, that is, a batch size of 8 and a sequence length (also called block size or context) of 4.

In [None]:
# Generating a batch of data of sequence x and target y

torch.manual_seed(1337) # set the random number generator seed to a fixed value, i.e. 1337; important for reproducibility
batch_size = 8  # number of sequences in a batch / processed in parallel
block_size = 4  # maximum sequence length serving as a context/prompt

def get_batch(split):
    # Generate a batch of inputs/prompts x and respective targets y
    # batches are always of shape (batch_size, block_size)
    data = train_data if split == 'train' else val_data

    # Tensor of shape (batch_size) with random sequence start indices between 0 and len(data) - block_size
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # Accumulate and add each sequence of this batch to form a tensor (tensor shape: batch size, block size)
    x = torch.stack([data[i:i+block_size] for i in ix])

    # Same as x but shifted by one token
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # targets
    return x, y # x is (batch_size, block_size) = (8, 4), y is (8, 4) too

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)

print('targets:')
print(yb.shape)
print(yb)

print('------')

for b in range(batch_size):      # batch dimension, number of sequences in the batch (batch_size)
    for t in range(block_size):  # time dimension, number of tokens in the sequence  (block_size)
        context = xb[b, :t+1]    # context means prompt, taking the first t+1 tokens from the b-th sequence in the batch
        target = yb[b, t]        # we take the t-th token from the b-th sequence in the batch for the target (the token we want to predict)
        print(f"when input is {context.tolist()} the target is {target}")


inputs:
torch.Size([8, 4])
tensor([[ 1, 60, 39, 47],
        [46, 43, 39, 60],
        [ 1, 46, 43, 56],
        [61, 47, 50, 50],
        [43,  1, 39, 52],
        [53, 58, 46,  1],
        [53,  1, 40, 43],
        [ 1, 56, 43, 45]])
targets:
torch.Size([8, 4])
tensor([[60, 39, 47, 50],
        [43, 39, 60, 43],
        [46, 43, 56, 43],
        [47, 50, 50,  1],
        [ 1, 39, 52,  1],
        [58, 46,  1, 40],
        [ 1, 40, 43,  1],
        [56, 43, 45, 39]])
------
when input is [1] the target is 60
when input is [1, 60] the target is 39
when input is [1, 60, 39] the target is 47
when input is [1, 60, 39, 47] the target is 50
when input is [46] the target is 43
when input is [46, 43] the target is 39
when input is [46, 43, 39] the target is 60
when input is [46, 43, 39, 60] the target is 43
when input is [1] the target is 46
when input is [1, 46] the target is 43
when input is [1, 46, 43] the target is 56
when input is [1, 46, 43, 56] the target is 43
when input is [61] the t

In [None]:
# this is our batch of inputs to feed to the transformer (tensor shape: (B,T) = batch size 8, block size 4)
print(xb)

tensor([[ 1, 60, 39, 47],
        [46, 43, 39, 60],
        [ 1, 46, 43, 56],
        [61, 47, 50, 50],
        [43,  1, 39, 52],
        [53, 58, 46,  1],
        [53,  1, 40, 43],
        [ 1, 56, 43, 45]])


In [None]:
# amr
# checking one row from the batch input
print(xb[0])

tensor([ 1, 60, 39, 47])


> NB: function `get_batch` code explained below:

`ix = torch.randint(len(data) - block_size, (batch_size,))`
This line generates a **tensor of random integers** `ix` using PyTorch's randint function. The integers are in the range [0, len(data) - block_size), which means **starting from 0 up to the length of the selected data minus block_size**. The size of the tensor is determined by batch_size, which means it will contain batch_size random integers. **These integers are used as the starting indices for the batches of data to be extracted**.

`x = torch.stack([data[i:i+block_size] for i in ix])`
This line uses a list comprehension to iterate over each starting index in ix and slices the data tensor from that index i to i + block_size, creating a sequence of data. `torch.stack` then combines these sequences into a new tensor x, where each sequence is a separate element in the batch. **This tensor x represents the input data for the model**.

`y = torch.stack([data[i+1:i+block_size+1] for i in ix])`
This line is similar to the previous one but creates the target tensor y. For each starting index i in ix, it slices the data tensor from i + 1 to i + block_size + 1. **This represents the target or "next" elements corresponding to the inputs in x**.
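The same sampling logic can be sketched without tensors, which makes the index arithmetic easy to inspect. This is a plain-Python analogue, not the notebook's `get_batch`: the names `get_batch_py`, `toy_tokens` and `rng` are hypothetical, and nested lists stand in for the stacked tensors.

```python
import random

def get_batch_py(tokens, batch_size, block_size, rng):
    """Plain-Python analogue of get_batch: returns nested lists instead of tensors."""
    # Start indices lie in [0, len(tokens) - block_size), with an exclusive upper bound
    starts = [rng.randrange(len(tokens) - block_size) for _ in range(batch_size)]
    x = [tokens[i:i + block_size] for i in starts]
    y = [tokens[i + 1:i + block_size + 1] for i in starts]   # same windows, one step later
    return x, y

rng = random.Random(0)                       # fixed seed for repeatability
toy_tokens = list(range(100, 150))           # 50 made-up token ids
x_toy, y_toy = get_batch_py(toy_tokens, batch_size=3, block_size=4, rng=rng)

for row_x, row_y in zip(x_toy, y_toy):
    print(row_x, "->", row_y)
```

Because the largest start index is `len(tokens) - block_size - 1`, the target window always fits inside the data.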

In [None]:
# amr testing the code above
# generating a tensor of random start indices (tensor shape: batch size)
ix = torch.randint(len(train_data) - block_size, (batch_size,))
print(ix)
print(ix.shape)

# stacking the sequences (tensor shape: batch size, block size)
x = torch.stack([train_data[i:i+block_size] for i in ix])
print(x)
print(x.shape)


tensor([400784, 110140, 944762, 354070, 724297, 412236, 176790, 256488])
torch.Size([8])
tensor([[56, 57, 46,  1],
        [46, 53, 59,  1],
        [53, 51, 44, 53],
        [43, 56,  5, 57],
        [58, 10,  1, 58],
        [47, 52, 45,  1],
        [51, 54, 50, 39],
        [ 1, 53, 44,  1]])
torch.Size([8, 4])


## Embedding Layer

An input batch consists of tensors `xb` and `yb`.<br>
Both `xb` and `yb` are of size $batch\_size \times block\_size$.

The batch is used as basis for 'sub-batching'.

Because `yb` is just `xb` shifted by one token, we can use `yb` to train<br>
on multiple examples *within a batch's partial sequences*, each with a different context size.

These 'sub-batches' are called `context` and `target`. They are the pairs that we will feed into the model.

For now, we can start focusing on the model itself and feed `xb` and later on `yb` into it.<br>

We'll start building a bigram model.



## Implementing a simple bi-gram model
A bi-gram model, in NLP, is a type of statistical language model that predicts the probability of a word given the preceding word. **It's called a bi-gram because it considers a "gram" (or token) in the context of one preceding gram, thus forming pairs or "bi"-grams**.

**Example**: If your corpus had the sentence "*The quick brown fox jumps*", the bi-grams would be: "The quick", "quick brown", "brown fox", and "fox jumps". A bi-gram model would use these to calculate the likelihood of "brown" following "quick", "fox" following "brown", and so forth.

In summary, a bi-gram model is a simple language model that can predict the next unit in a sequence based on the preceding one and is often used in applications requiring a balance between contextual relevance and computational simplicity.
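The example sentence can be turned into bi-grams and next-word probabilities with a short count-based sketch. It illustrates the statistical idea only; the neural model built below learns a table of logits instead of counting. The names `follow_counts` and `next_word_probs` are made up for this example.

```python
from collections import Counter, defaultdict

words = "The quick brown fox jumps".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)

# Count how often each word follows a given word, then normalize to probabilities
follow_counts = defaultdict(Counter)
for prev, nxt in bigrams:
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimated probability of each next word given the preceding word."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(next_word_probs("quick"))      # {'brown': 1.0}
```

With a single sentence every probability is 1.0; on a larger corpus the same counts would spread over several possible next words.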


In [None]:
# This is a model that predicts the next token based on the previous token:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # Embedding the vocabulary
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)   # 65 embeddings (vocabulary), embedding size: 65-dim vectors

    def forward(self, idx, targets=None):       # if targets not provided, the method computes no loss
        # idx and targets are both (B,T) tensor of integers (batch_size, block_size)
        # Embed the input indices, shape is now (batch_size, block_size, vocab_size) (B, T, C)
        logits = self.token_embedding_table(idx)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape        # B = batch_size, T = block_size, C = vocab_size
            logits = logits.view(B*T, C)  # Reshape logits to (B*T, C) (cross_entropy expects the classes in the second dimension)
            # This is the first time we actively use the targets:
            targets = targets.view(B*T)   # Reshape targets to (B*T) (targets contains the next token's index for each position in the batch)
            loss = F.cross_entropy(logits, targets)  # Cross-entropy loss averaged over all tokens in the batch (targets pick out the correct next token at each position)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# Instantiate the model
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)              # Forward pass; yb is used as the targets for the loss
print("logits shape:", logits.shape)  # (B*T, vocab_size), since forward() flattens the logits when targets are given
print("loss:", loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


logits shape: torch.Size([32, 65])
loss: tensor(4.7724, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


NB: the generated text is garbage because the model is still randomly initialized and the tokens are not yet interconnected.

**Every integer of our tokenized text is now represented by an embedding vector of size `vocab_size`**.<br>
We do this by using **an embedding layer**. **This layer is effectively a lookup table** that maps<br>
each possible character index (there are `vocab_size` of them) to its own vector of size `vocab_size`.

**The `logits` are the outputs of the model.**<br>
We just treat the embedded tokens of the input batch as the logits.<br>
This `logits` tensor holds all the embedded identities of the tokens in the input batch -> ($batch\_size \times block\_size \times vocab\_size$).
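The lookup and the resulting shapes can be sketched with nested lists. The tiny 3-token table and the batch below are made-up numbers for illustration; in the notebook the table is the `nn.Embedding(vocab_size, vocab_size)` weight matrix.

```python
# Toy sizes: vocabulary of 3 tokens, so the table is 3 x 3 (hypothetical numbers)
table = [
    [0.1, 0.2, 0.3],    # row read off when the current token is 0
    [1.0, 0.0, -1.0],   # row for token 1
    [0.5, 0.5, 0.5],    # row for token 2
]

batch = [[0, 2], [1, 1]]              # (B, T) = (2, 2) token indices

# Embedding lookup: each index is replaced by its row, giving shape (B, T, C)
logits_btc = [[table[tok] for tok in seq] for seq in batch]

# Flatten the batch and time dimensions to (B*T, C), as done before computing the loss
logits_flat = [row for seq in logits_btc for row in seq]

print(len(logits_btc), len(logits_btc[0]), len(logits_btc[0][0]))   # 2 2 3
print(len(logits_flat), len(logits_flat[0]))                        # 4 3
```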

> We are *not yet* interconnecting the tokens with any sort of model/logic.<br>
We are not yet training or predicting *anything*.

This is about to change.

> NB: Code Explanation for the above forward pass and text generation:

**Forward Pass**: `forward(self, idx, targets=None)`

**Parameters**:

- `idx`: A tensor of shape `(B, T)` containing indices of tokens, where `B` is the batch size and `T` is the sequence length.
- `targets`: A tensor of shape `(B, T)` containing the indices of the target tokens. If targets is not provided, the method computes no loss.

**Process**:

- The method retrieves the logits for each token in `idx` using the token_embedding_table.
- If `targets` is provided, it reshapes the logits to `(B*T, C)` and the targets to `(B*T)`, then computes the cross-entropy loss between the logits and the targets. This loss measures how well the model predicts the next token.
- Returns: The logits and the computed loss (if `targets` is provided; otherwise, `loss` is `None`).
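To get a feeling for the size of the loss, here is a hand-written cross-entropy for a single prediction. It is a sketch of the formula only: the local helper `cross_entropy` is not PyTorch's `F.cross_entropy`, and the uniform logits are an assumed reference case. A model that spreads its probability evenly over all 65 characters would score $\ln 65 \approx 4.17$, so the loss of about 4.77 reported above is in the range one would expect from random initial logits.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)                                   # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

vocab = 65
uniform_logits = [0.0] * vocab                        # every character equally likely
print(cross_entropy(uniform_logits, target=7))        # equals ln(65), about 4.174
print(math.log(vocab))
```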

**Text Generation**: `generate(self, idx, max_new_tokens)`

**Parameters**:

- `idx`: A tensor of shape `(B, T)` representing the initial context for generation.
- `max_new_tokens`: The maximum number of new tokens to generate.

**Process**:

- The method iteratively generates one token at a time based on the current context (idx).
- It updates the context by appending the newly generated token and repeats the process until `max_new_tokens` have been generated.
- For each new token, it computes the logits for the last token in the current context, applies softmax to convert these logits into probabilities, and samples a new token index from this probability distribution.
- Returns: The tensor `idx` containing the original context plus the newly generated tokens.
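The generation loop can be sketched in plain Python as follows. This is an illustrative analogue under simplifying assumptions: a batch of one, a random 5-token table, and `random.choices` standing in for `torch.multinomial`. The names `generate_py` and `toy_table` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate_py(table, context, max_new_tokens, rng):
    """Sample tokens one at a time; table[i] holds the logits for the token after i."""
    context = list(context)
    for _ in range(max_new_tokens):
        logits = table[context[-1]]                   # bigram: only the last token matters
        probs = softmax(logits)
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        context.append(next_token)                    # the context grows by one token per step
    return context

rng = random.Random(1337)
toy_table = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(5)]   # untrained 5-token model
generated = generate_py(toy_table, context=[0], max_new_tokens=10, rng=rng)
print(generated)
```

Sampling from the distribution, rather than always taking the most likely token, is what makes the output differ from run to run when no seed is fixed.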

**Model Instantiation and Usage**

- An instance of `BigramLanguageModel` is created with a specified vocab_size.
- The model is then used to compute logits and loss for a given batch of inputs `(xb, yb)` and to generate text starting from an initial context.

**Function Calls and Outputs**

- `logits.shape` and `loss`: Prints the shape of the logits tensor and the value of the loss computed during the forward pass.
- `decode(...)`: The `decode` function maps token indices back to their characters, so this line generates 100 new tokens starting from an initial context of a single zero index (the newline character) and prints the decoded text.

This model is a simplistic representation of language modeling and is more illustrative than practical, especially since the embedding table is used in an unconventional way to directly produce logits for next-token prediction.

In [None]:
# amr
# looking at the model
m

BigramLanguageModel(
  (token_embedding_table): Embedding(65, 65)
)

In [None]:
# amr
# Access the embeddings weight
embeddings_weight = m.token_embedding_table.weight.data

print("Shape:", embeddings_weight.shape)
print("Embeddings weights:", embeddings_weight)

Shape: torch.Size([65, 65])
Embeddings weights: tensor([[ 0.1808, -0.0700, -0.3596,  ...,  1.6097, -0.4032, -0.8345],
        [ 0.5978, -0.0514, -0.0646,  ..., -1.4649, -2.0555,  1.8275],
        [ 1.3035, -0.4501,  1.3471,  ...,  0.1910, -0.3425,  1.7955],
        ...,
        [ 0.4222, -1.8111, -1.0118,  ...,  0.5462,  0.2788,  0.7280],
        [-0.8109,  0.2410, -0.1139,  ...,  1.4509,  0.1836,  0.3064],
        [-1.4322, -0.2810, -2.2789,  ..., -0.5551,  1.0666,  0.5364]])


In [None]:
# amr
# Token index to check
token_index = 5  # Replace with the index of the token you want to check

# Get the embeddings for the token at the specified index
token_embeddings = embeddings_weight[token_index]

print("Embeddings for token at index", token_index, ":", token_embeddings)

Embeddings for token at index 5 : tensor([-0.1338,  0.3899, -0.2884, -1.4651,  0.0101, -0.3004, -1.5733,  0.0148,
        -0.0447, -0.5367, -0.5223, -0.2181, -2.1608,  0.7865,  0.6854, -1.2576,
         0.6094, -2.0551, -0.4431, -0.6499, -0.6870,  0.2567, -1.2669,  0.2645,
        -0.6445,  1.0834, -0.7995,  0.2922,  1.3143,  1.2607, -0.3505, -2.0660,
         1.0575, -1.0572,  0.9911, -0.0797,  1.0751,  0.2381,  0.5757,  1.6685,
         0.5976, -1.8736,  1.2910, -0.3753, -1.8943,  0.5557,  0.8567, -0.8461,
         0.5015, -0.9656, -0.7255,  0.0990,  0.5928, -0.0422, -0.9566,  1.4424,
         0.4341, -0.4292,  0.3666,  0.1275, -0.0560,  0.8315, -0.5512,  1.0477,
         1.6187])


## Setting up the optimizer

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

## Setting up a Loss Function

In [None]:
# basic PyTorch training loop
batch_size = 32
for steps in range(1000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss: the model processes the batch xb and compares its
    # predictions (logits) against the targets yb
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True) # reset gradients; set_to_none=True frees them instead of zero-filling, which can be faster
    loss.backward()                       # compute the gradients of the loss w.r.t. the parameters
    optimizer.step()                      # update the parameters using the gradients and the optimizer's rule

print(loss.item())


1.6548352241516113


In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


LORD RIZEREL:
Neave fling to: yet spit this wrong way.

ISABELA:
Share is thou? There thou bettague!

Nurself.

PENE:
Why mine arsought; I I day with moss!

CORIOLANUS:
O, comple you wither titted to
prayison of shall prisone in evass!-
Nopper'd compatter! May, say, what cames Lanward't, so say lord.

FROMEUSS:
Heneme, what fear in the frame but say
Yorshy of that, if now love; if in a passsians thy new,
I in that may didspoken to amble and lespere straid's sake ling mind houme of of spirit,
My 


NB: The output is much better! It reads more like Shakespeare's writing.

Given that we have the identities of the next character through `yb`, how well does the model predict them through the `logits`? **The `loss` is the measurement of prediction quality**.

We want the index within `yb` to be the same as the most likely/active index within `logits`.<br>
The loss is measured as the average of this across all the tokens in the input batch.

We know the `vocab_size` is $65$.<br>
We can calculate what the loss should be if we were to predict the next token totally randomly:

$$-\ln\left(\frac{1}{65}\right) = \ln 65 \approx 4.1744$$

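**Before training, our calculated loss is higher/worse than this, because the untrained model does not predict uniformly at random.<br>**
**Its initial predictions are not perfectly spread out across the `vocab_size`.<br>**
**They are somewhat peaked rather than fully diffuse, so the model is confidently wrong on some tokens.<br>**
**A perfectly uniform distribution across the `vocab_size` would give exactly the value above. After training, the loss printed above is well below this baseline.**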
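The baseline can be checked numerically. The sketch below assumes all-zero logits as a way to obtain an exactly uniform prediction; the batch of 10 random targets is arbitrary.

```python
import math
import torch
from torch.nn import functional as F

vocab_size = 65
baseline = -math.log(1.0 / vocab_size)       # loss of a uniform guess

# all-zero logits give a uniform softmax, so cross-entropy equals the baseline
logits = torch.zeros(10, vocab_size)
targets = torch.randint(0, vocab_size, (10,))
uniform_loss = F.cross_entropy(logits, targets).item()

print(round(baseline, 4), round(uniform_loss, 4))
```

Any loss above this value means the model is doing worse than a uniform guess on average; any loss below it means the model has learned something about the data.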

![](https://images.squarespace-cdn.com/content/56316c94e4b098620a45e78a/1457973972468-D5XJVA1ABFXSD0AH9RZC/?content-type=image%2Fpng)
<br>Source: [Shiken](https://shiken.ai/chemistry/entropy)

The `loss` is to be minimized.<br>
We will need the model to make predictions of individual next tokens.<br>

Let's extend the current model with a `generate` method that takes the current context and samples the next token, as many times as we want:

In [None]:
# amr - test section
# token 0 decodes to a newline; note that torch.zeros defaults to float, so generate() is called with dtype=torch.long
idx = torch.zeros(1,1); idx

tensor([[0.]])

> NB: the text-generation code is explained below.

Generation setup

- `m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)`: This call generates 500 new tokens starting from the initial context `idx`. **Here, `idx` is a tensor of zeros with shape (1, 1), so generation starts from the single token 0, which decodes to a newline character**. At each step the model samples the next token from the softmax distribution over its logits; it does not pick greedily.

Token list conversion

- `[0].tolist()`: `generate` returns a batch of sequences of shape `(B, T)`. `[0]` selects the first (and here only) sequence, and `.tolist()` converts it to a Python list of token indices that `decode` can consume.
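The indexing and decoding steps can be tried on a toy vocabulary. This sketch assumes a made-up text and a hand-written `generated` tensor in place of real model output; only the mechanics match the notebook.

```python
import torch

text = "hello\nworld"                          # toy corpus (assumed, not Tiny Shakespeare)
chars = sorted(set(text))                      # '\n' sorts first, so it gets index 0
itos = {i: ch for i, ch in enumerate(chars)}
decode = lambda l: ''.join(itos[i] for i in l)

generated = torch.tensor([[0, 3, 2]])          # pretend output of generate(), shape (B=1, T=3)
tokens = generated[0].tolist()                 # first sequence as a plain Python list
print(tokens, repr(decode(tokens)))
```

As in the notebook's vocabulary, the newline character sorts first here, so token 0 decodes to a line break.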

## The mathematical trick in self-attention
- matrix multiplication
- applying a mask
- attention mechanism

In [161]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
# This sets the seed for the random number generator for reproducibility
torch.manual_seed(42)

# tensor a (3x3) - triangular matrix
a = torch.tril(torch.ones(3, 3))
print("Matrix a orig = ")
print(a)
print('--')

# normalize each row of a by dividing it by the sum of its elements:
# torch.sum(a, 1, ...) sums across the columns (dimension 1), giving one sum per row, and
# keepdim=True keeps that dimension so broadcasting works correctly during the division.
a = a / torch.sum(a, 1, keepdim=True)

# tensor b (3 x 2) of random integers from 0 to 9, drawn from a uniform distribution
# .float() converts the tensor from an integer type to floating point so it can be multiplied with a
b = torch.randint(0,10,(3,2)).float()

# PyTorch matrix multiplication
c = a @ b

# printing matrices
print('Matrix a normalized =')
print(a)
print('--')
print('Matrix b =')
print(b)
print('--')
print('Matrix c = matmul a@b')
print(c)

Matrix a orig = 
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--
Matrix a normalized =
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
Matrix b =
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
Matrix c = matmul a@b
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


NB: summary of the matrices:
- Matrix a: A lower triangular matrix with normalized rows.
- Matrix b: A 3x2 matrix of random integers between 0 and 9, stored as floats.
- Matrix c: The result of the matrix multiplication of a with b, a 3x2 matrix.


> **Matrix c**: the first row of c is identical to the first row of b because the first row of a is [1, 0, 0], meaning it multiplies the first row of b by 1 and the other rows by 0.
The second row of c is the average of the first and second rows of b because the second row of a is [0.5, 0.5, 0].
The third row of c is the average of all three rows of b because the third row of a is [0.3333, 0.3333, 0.3333].
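The row-by-row reading above can be verified directly. The sketch below hard-codes the values of b printed in the output and compares the matrix product against a running mean computed with `cumsum`; the names `counts` and `running_mean` are introduced here for illustration.

```python
import torch

a = torch.tril(torch.ones(3, 3))
a = a / a.sum(1, keepdim=True)                    # rows: [1,0,0], [.5,.5,0], [1/3,1/3,1/3]
b = torch.tensor([[2., 7.], [6., 4.], [6., 5.]])  # the values of b printed above
c = a @ b

# running mean computed directly: cumulative sum divided by the number of rows seen so far
counts = torch.arange(1, 4).unsqueeze(1)          # [[1], [2], [3]]
running_mean = b.cumsum(dim=0) / counts
print(torch.allclose(c, running_mean))
```

Multiplying by a row-normalized lower triangular matrix is therefore the same operation as averaging each row with all rows before it.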

In [162]:
# consider the following toy example:

torch.manual_seed(1337) # set the seed for reproducibility
B,T,C = 4,8,2           # batch, time, channels
x = torch.randn(B,T,C)  # random numbers for a tensor of shape B,T,C
print(x)
print(f"tensor x shape \n", {x.shape})

tensor([[[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]],

        [[ 1.3488, -0.1396],
         [ 0.2858,  0.9651],
         [-2.0371,  0.4931],
         [ 1.4870,  0.5910],
         [ 0.1260, -1.5627],
         [-1.1601, -0.3348],
         [ 0.4478, -0.8016],
         [ 1.5236,  2.5086]],

        [[-0.6631, -0.2513],
         [ 1.0101,  0.1215],
         [ 0.1584,  1.1340],
         [-1.1539, -0.2984],
         [-0.5075, -0.9239],
         [ 0.5467, -1.4948],
         [-1.2057,  0.5718],
         [-0.5974, -0.6937]],

        [[ 1.6455, -0.8030],
         [ 1.3514, -0.2759],
         [-1.5108,  2.1048],
         [ 2.7630, -1.7465],
         [ 1.4516, -1.5103],
         [ 0.8212, -0.2115],
         [ 0.7789,  1.5333],
         [ 1.6097, -0.4032]]])
tensor x shape 
 {torch.Size([4, 8, 2])}


NB: x can represent an input tensor to a neural network where you have a batch of 4 sequences, each sequence of length 8, and each element of the sequence has 2 channels (features).

So, here we have 8 tokens per sequence, each of which is a vector of size 2.
They are not talking to each other / are not related to each other in any way.

We'd like to couple them so that e.g. the 3rd token can only communicate with the tokens in the 2nd and 1st location, but not with a future token in the 4th location.

Information has to be able to flow, but exclusively in one direction.

The simplest way to do this is to average the preceding tokens together with the current token. This, in essence, summarizes the current token in the context of its history.

For every $t$-th token, we'd like to get the average of all the vectors of previous tokens and the current one ($t$) as well:

In [163]:
# We want xbow[b, t] = mean_{i <= t} x[b, i]
xbow = torch.zeros((B, T, C))          # tensor of zeros of shape (B, T, C) ("bag of words": an average over tokens)
for b in range(B):                     # for each sequence in the batch
    for t in range(T):                 # for each position in the sequence
        xprev = x[b, :t+1]             # all tokens up to and including position t: shape (t+1, C)
        xbow[b, t] = xprev.mean(dim=0) # mean over those tokens: shape (C,)

print('Batch [0]:\n', x[0], "\n")     # First batch of 8 tokens, each of size 2
print('Running Averages:\n', xbow[0]) # Running averages of the first batch of 8 tokens, each of size 2

Batch [0]:
 tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]]) 

Running Averages:
 tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


NB: Due to the loops, this is relatively inefficient.<br>**The trick is that we can build a running average like this using<br>
much faster matrix multiplication:**

In [164]:
# version 2: it uses a weighted sum approach, leveraging matrix multiplication for efficiency.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x                         # (T, T) @ (B, T, C): wei is broadcast to (B, T, T) ----> (B, T, C)
print('Batch [0]:\n', x[0], "\n")       # First batch of 8 tokens, each of size 2
print('Running Averages:\n', xbow2[0])  # Running averages of the first batch of 8 tokens, each of size 2

# comparing the bag of words (xbow, xbow2)
# Set a higher absolute tolerance
atol_value = 1e-6  # For example, you can adjust this value as needed

# Use the torch.allclose function with the custom atol
close = torch.allclose(xbow, xbow2, atol=atol_value)
print(close)

Batch [0]:
 tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]]) 

Running Averages:
 tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
True


In [165]:
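# version 3: use softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))                         # affinities start at 0: every token is equally interesting
wei = wei.masked_fill(tril == 0, float('-inf'))  # future positions get -inf so they cannot be attended to
wei = F.softmax(wei, dim=-1)                     # exp(-inf) = 0, and the equal zeros normalize to a uniform average
xbow3 = wei @ x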

print('Batch [0]:\n', x[0], "\n")       # First batch of 8 tokens, each of size 2
print('Running Averages:\n', xbow3[0])  # Running averages of the first batch of 8 tokens, each of size 2

# comparing the bag of words (xbow, xbow3)
# Set a higher absolute tolerance
atol_value = 1e-6  # For example, you can adjust this value as needed

# Use the torch.allclose function with the custom atol
close = torch.allclose(xbow, xbow3, atol=atol_value)
print(close)

Batch [0]:
 tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]]) 

Running Averages:
 tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
True


In [166]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [167]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- The examples across the batch dimension are of course processed completely independently and never "talk" to each other.
- To get an "encoder" attention block, just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, `wei` has unit variance too, and softmax stays diffuse and does not saturate too much. Illustration below.
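As a quick numerical check of the scaling claim, the sketch below compares the variance of the raw dot products with the scaled ones. The batch and sequence sizes are assumptions, chosen larger than in the notebook so the variance estimate is stable.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 64, 32, 16                 # assumed sizes, larger than above for a stable estimate
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

unscaled = q @ k.transpose(-2, -1)           # each entry sums head_size products, so variance is about head_size
scaled = unscaled * head_size**-0.5          # multiplying by 1/sqrt(head_size) brings the variance back to about 1

print(unscaled.var().item(), scaled.var().item())
```

Without the scaling, the spread of the scores grows with `head_size`, which pushes softmax toward a one-hot output.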

In [168]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [169]:
k.var()

tensor(1.0449)

In [170]:
q.var()

tensor(1.0700)

In [171]:
wei.var()

tensor(1.0918)

In [172]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [173]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

In [174]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # mean over the features of each example (dim 1)
    xvar = x.var(1, keepdim=True) # variance over the features of each example
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to zero mean, unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [175]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [176]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))

In [177]:
# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>



### Full finished code, for reference

You may want to refer directly to the git repo instead though.

In [178]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        # note: C here is n_embd, so the scale is n_embd**-0.5 rather than head_size**-0.5
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# the full model: despite the class name, this is now a small Transformer rather than a plain bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings are n_embd-dimensional; lm_head maps the final features to logits
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5091, val loss 2.5060
step 300: train loss 2.4199, val loss 2.4337
step 400: train loss 2.3500, val loss 2.3563
step 500: train loss 2.2961, val loss 2.3126
step 600: train loss 2.2408, val loss 2.2501
step 700: train loss 2.2053, val loss 2.2187
step 800: train loss 2.1636, val loss 2.1870
step 900: train loss 2.1226, val loss 2.1483
step 1000: train loss 2.1017, val loss 2.1283
step 1100: train loss 2.0683, val loss 2.1174
step 1200: train loss 2.0376, val loss 2.0798
step 1300: train loss 2.0256, val loss 2.0645
step 1400: train loss 1.9919, val loss 2.0362
step 1500: train loss 1.9696, val loss 2.0304
step 1600: train loss 1.9625, val loss 2.0470
step 1700: train loss 1.9402, val loss 2.0119
step 1800: train loss 1.9085, val loss 1.9957
step 1900: train loss 1.9080, val loss 1.9869
step 2000: train loss 1.8834, val loss 1.9941
step 2100: train loss 1.

# FULL CODE EXPLANATION


## Hyperparameters:

```
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)
```


- `batch_size = 16`: The number of sequences processed in parallel during training. Batch size affects both learning dynamics and computational efficiency.
- `block_size = 32`: The maximum sequence length the model will consider for predictions. In the context of Transformers, this is often referred to as the model's "context size" or "sequence length," limiting how much prior context the model can use.
- `max_iters = 5000`: The number of training iterations (optimizer updates) to perform. This caps the length of training and therefore its compute cost.
- `eval_interval = 100`: The interval (in iterations) at which `estimate_loss` is called to report the average training and validation loss. This helps in monitoring the training progress.
- `learning_rate = 1e-3`: The step size used for updating the model's weights during optimization. The learning rate is a critical hyperparameter that affects training stability and convergence.
- `device = 'cuda' if torch.cuda.is_available() else 'cpu'`: Determines the computing device for training. If CUDA (NVIDIA's GPU computing platform) is available, training will be accelerated using the GPU; otherwise, it falls back to the CPU.
- `eval_iters = 200`: The number of batches that `estimate_loss` averages over, per split, each time it runs. More batches give a less noisy loss estimate at the cost of extra computation.
- `n_embd = 64`: The size of the embeddings used for representing tokens and positions. It is also the width of every Transformer block, and each attention head has size `n_embd // n_head`.
- `n_head = 4`: The number of attention heads in each Transformer block. Multi-head attention allows the model to focus on different parts of the input sequence simultaneously.
- `n_layer = 4`: The number of layers (or depth) of the Transformer model. More layers can increase the model's capacity but also its computational cost and the risk of overfitting.
- `dropout = 0.0`: The dropout rate used for regularization during training. Dropout randomly zeroes some of the elements of the input tensor with the given probability, helping prevent overfitting. A value of 0.0 indicates no dropout is applied.
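To see how `batch_size` and `block_size` shape the training data, here is a small sketch of the batching logic. It assumes a toy `data` tensor of consecutive integers in place of the encoded text.

```python
import torch

torch.manual_seed(1337)
batch_size, block_size = 16, 32
data = torch.arange(1000)                     # toy stand-in for the encoded text

# pick batch_size random start offsets, then cut block_size-long windows
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([data[i:i+block_size] for i in ix])      # inputs:  (batch_size, block_size)
y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets: inputs shifted by one position

print(x.shape, y.shape)
```

Each row of `y` is the corresponding row of `x` shifted left by one, so every position has a next-token target.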

Setting a Manual Seed:

`torch.manual_seed(1337)`: Sets the seed for generating random numbers in PyTorch. This ensures that the model's initialization and any other random operations are reproducible across runs for debugging and comparison purposes.
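The effect of the seed is easy to demonstrate. In this minimal sketch the variable names are illustrative only.

```python
import torch

torch.manual_seed(1337)
first = torch.randn(3)

torch.manual_seed(1337)      # reseeding restarts the random number generator
second = torch.randn(3)

third = torch.randn(3)       # without reseeding, the random stream simply continues

print(torch.equal(first, second), torch.equal(first, third))
```

Note that the seed fixes the sequence of random numbers, so results stay reproducible only as long as the same operations are run in the same order.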

These parameters together define the structure and training behavior of a Transformer-based model. Adjusting these hyperparameters can significantly affect model performance, training speed, and resource requirements.

## Function estimate loss
The code defines a function `estimate_loss` that estimates the average loss of a model on the training and validation datasets without updating the model's weights. It's designed to be run during or after training to monitor the model's performance.


```
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```





Let's break down what it does:

`@torch.no_grad()` Decorator:

This decorator is applied to the estimate_loss function to disable gradient computation within the function. Disabling gradients saves memory and computations, making evaluation faster since backpropagation (needed for training) is not performed.
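A short sketch of what the decorator changes, using a hypothetical parameter `w` and two toy functions:

```python
import torch

w = torch.ones(3, requires_grad=True)   # a toy parameter

def tracked():
    return (w * 2).sum()                # autograd records this computation

@torch.no_grad()
def untracked():
    return (w * 2).sum()                # same computation, but no graph is built

print(tracked().requires_grad, untracked().requires_grad)
```

The values are identical in both cases; only the bookkeeping needed for `backward()` is skipped.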

Function Definition `def estimate_loss()`:

Defines the function `estimate_loss` that, when called, estimates and returns the average loss for both the training and validation splits.

Initialization of the Output Dictionary:

`out = {}` initializes an empty dictionary to store the average loss for each data split ('train' and 'val').

Model Evaluation Mode:

`model.eval()` sets the model to evaluation mode. This affects layers such as dropout and batch normalization, which behave differently during training and evaluation; in this model only the dropout layers are affected.
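The difference between the two modes can be seen on a single dropout layer. This sketch assumes `p=0.5` purely so the effect is visible; the notebook itself sets `dropout = 0.0`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)      # p=0.5 chosen for visibility (assumption)
x = torch.ones(1000)

drop.train()
train_out = drop(x)           # about half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)            # dropout is the identity in evaluation mode

print(train_out.unique().tolist(), torch.equal(eval_out, x))
```

This is why `estimate_loss` switches to evaluation mode first: the reported loss should not depend on random dropout masks.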

Iterating Over Data Splits:

The loop `for split in ['train', 'val']:` iterates over the two splits: training and validation.

Loss Calculation for Each Split:

Initializes a tensor `losses = torch.zeros(eval_iters)` to store the loss value of each iteration while evaluating the specified split.

Iterates `eval_iters` times, each time retrieving a batch of data `X`, `Y` through `get_batch(split)`, the data-loading function from the full code, which returns a batch of inputs `X` and next-token targets `Y` for the specified split.

`logits, loss = model(X, Y)` calculates the model's output logits and the corresponding loss `loss` for the given batch. The model is expected to return both the raw predictions (`logits`) and the calculated loss value when provided with inputs and targets.

`losses[k] = loss.item()` stores the loss value for the current iteration by converting the loss tensor to a Python scalar using `.item()`.

Storing the Mean Loss:

After iterating through `eval_iters` batches, the mean loss for the split is calculated as `losses.mean()` and stored in the `out` dictionary with the split name as the key.

Switching Back to Training Mode:

`model.train()` sets the model back to training mode, re-enabling the training-specific behaviors like dropout and batch normalization updates, which were disabled during evaluation.

Return Statement:

Finally, the function returns the `out` dictionary containing the average loss for both the training and validation splits.

This function is useful for monitoring model performance without affecting its training state, providing insights into how well the model is learning and generalizing to unseen data.
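To make the roles of `model.eval()` and `@torch.no_grad()` concrete, here is a minimal sketch with a stand-in model. The tiny `toy` network and the helper `forward_no_grad` are illustrative assumptions and are not part of the notebook's code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model: a linear layer followed by dropout
toy = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

@torch.no_grad()  # no autograd graph is recorded inside this function
def forward_no_grad(model, inp):
    return model(inp)

toy.eval()                         # dropout becomes the identity
a = forward_no_grad(toy, x)
b = forward_no_grad(toy, x)
same_in_eval = torch.equal(a, b)   # deterministic in eval mode
tracks_grad = a.requires_grad      # False: computed under no_grad

toy.train()                        # restore training behaviour
is_training = toy.training

print(same_in_eval, tracks_grad, is_training)
```

In eval mode two forward passes on the same input agree, and the outputs carry no gradient history, which is the behaviour `estimate_loss` relies on.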

## Class Head

This code defines a class `Head`, which represents a single head of self-attention within a Transformer architecture. The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence differently for each element in the sequence. This class is implemented as a subclass of nn.Module, which is the base class for all neural network modules in PyTorch.



```
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        # note: C is n_embd here; the Transformer paper scales by the key dimension (head_size) instead
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

```




Let's break down its components:

Initialization Method `__init__(self, head_size)`:

Parameters:
- `head_size`: The size of the head, which determines the dimensionality of the key, query, and value vectors.

Attributes:
- `self.key`, `self.query`, `self.value`: These are linear layers (without bias) that transform the input tensor `x` into key, query, and value representations, respectively. Each transformation projects the input embeddings (`n_embd`) into a lower-dimensional space (`head_size`).
- `self.tril`: A lower triangular matrix registered as a buffer, so it is stored with the module but not trained. This matrix is used for masking to ensure that each position in the sequence can only attend to itself and to preceding positions, enforcing causality in the attention mechanism. The use of `torch.tril(torch.ones(block_size, block_size))` creates a square matrix where each element on or below the main diagonal is 1, and all others are 0.
- `self.dropout`: A dropout layer applied to the attention weights to prevent overfitting by randomly setting elements of the attention matrix to zero during training.

Forward Method `forward(self, x)`:

Input:
`x`: The input tensor with shape (`B, T, C`), where `B` is the batch size, `T` is the sequence length, and `C` is the number of features (the embedding dimension `n_embd`).

Process:
The input `x` is first transformed into key (`k`), query (`q`), and value (`v`) representations using the linear transformations defined in `__init__`.

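Attention scores are computed by taking the dot product of queries and keys `(q @ k.transpose(-2,-1))` and then scaling by `C**-0.5`. Note that `C` here is the embedding dimension `n_embd` taken from `x.shape`, whereas the Transformer paper scales by the inverse square root of the key dimension (here `head_size`). The scaling keeps the scores from growing with the dimensionality, so the softmax does not saturate and gradients stay stable.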

The resulting attention scores are masked with `-inf` where the lower triangular matrix `(self.tril)` is zero, ensuring causality. This masking before the softmax operation effectively removes these positions from consideration by making their weights zero after softmax is applied.

A softmax function is applied to the masked attention scores along the last dimension to obtain the final attention weights, which are then passed through a dropout layer.

The attention weights are used to perform a weighted aggregation of the value vectors, producing the output of the self-attention head.

Output:

The method returns the output tensor `out`, which is the result of applying self-attention to the input. This tensor has shape (`B, T, head_size`), since the value projection maps each token to `head_size` features, and contains the aggregated information based on the computed attention weights.

This class encapsulates the functionality of a single attention head, focusing on computing weighted sums of value vectors based on the similarity between queries and keys, while respecting the sequential nature of the input through masking.
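The masking step can be checked on a tiny example. The following is a minimal sketch, assuming a toy sequence length `T = 4` and random scores standing in for `q @ k.transpose(-2,-1)`; it is not the notebook's `Head` class itself.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4                                   # toy sequence length (assumption)
scores = torch.randn(T, T)              # stand-in for the q @ k^T scores
tril = torch.tril(torch.ones(T, T))     # 1 on and below the diagonal

masked = scores.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(masked, dim=-1)         # each row is a probability distribution

row_sums = wei.sum(dim=-1)                        # every row sums to 1
future_weight = wei.triu(diagonal=1).abs().sum()  # total weight on future positions
print(wei)
```

Every row still sums to 1, but all weight above the diagonal is exactly zero, so position `t` only aggregates values from positions `0..t`.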

## MultiHead Attention


This code defines a class `MultiHeadAttention`, which represents a multi-head self-attention mechanism within a Transformer model. Multi-head attention allows the model to simultaneously attend to information from different representation subspaces at different positions.


```
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

```
Here's a detailed breakdown:


Class Definition and Initialization Method `__init__(self, num_heads, head_size)`:

Parameters:

- `num_heads`: The number of parallel attention heads.
- `head_size`: The size of each attention head.

Attributes:
- `self.heads`: A nn.ModuleList containing instances of the Head class (defined previously). Each Head instance represents a single self-attention head. The list size is determined by `num_heads`, allowing for parallel computation of different attention "views."
- `self.proj`: A linear layer applied to the concatenated output of all attention heads. The concatenation has `num_heads * head_size` features, which equals `n_embd` when `head_size = n_embd // num_heads` (as in `Block`), so the layer maps `n_embd` to `n_embd`. The projection mixes the information from the individual heads and keeps the output at the same dimensionality as the input, which is required for residual connections and further processing.
- `self.dropout`: A dropout layer applied after the projection to prevent overfitting by randomly zeroing out elements of the output tensor during training.

Forward Method `forward(self, x)`:

Input:

`x`: The input tensor with shape (`B, T, C`), where `B` is the batch size, `T` is the sequence length, and `C` is the number of features (embedding dimension).

Process:

The input `x` is processed by each attention head in `self.heads`, resulting in `num_heads` output tensors.
These outputs are concatenated along the last dimension `(dim=-1)`. Since each head potentially transforms the input into a different subspace (`head_size`), concatenating these allows the model to combine diverse information from different subspaces.

The concatenated output is then projected back to the original embedding dimension (`n_embd`) using `self.proj`.
Dropout is applied to the projected output as a regularization measure.

Output:

The method returns the final output tensor, which has undergone multi-head self-attention and been projected back to the original embedding dimensionality. This output can be used for further processing or as part of a larger model, like a Transformer block.

This class effectively combines information from multiple perspectives (`heads`) on the input sequence, enhancing the model's ability to capture various dependencies and features within the data. Multi-head attention is a key component of Transformer architectures, contributing to their effectiveness in handling complex sequence-based tasks.
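The shape bookkeeping of the concatenation can be sketched as follows. The toy sizes and the plain linear layers standing in for attention heads are assumptions made for illustration; only the concatenate-then-project pattern mirrors `MultiHeadAttention`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, n_embd, num_heads = 2, 5, 32, 4       # toy sizes (assumptions)
head_size = n_embd // num_heads             # 8, computed the same way as in Block

# Stand-ins for attention heads: each maps n_embd -> head_size features
heads = nn.ModuleList([nn.Linear(n_embd, head_size, bias=False) for _ in range(num_heads)])
proj = nn.Linear(n_embd, n_embd)

x = torch.randn(B, T, n_embd)
cat = torch.cat([h(x) for h in heads], dim=-1)  # (B, T, num_heads * head_size)
out = proj(cat)                                  # (B, T, n_embd)
print(cat.shape, out.shape)
```

Because `num_heads * head_size` equals `n_embd` here, the output has the same shape as the input and can be added to it in a residual connection.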

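## Block(nn.Module)

The `Block` class is one Transformer block: multi-head self-attention (communication) followed by a feed-forward network (computation), each applied to a layer-normalized input and added back through a residual connection.

```
class Block(nn.Module):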
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
```


## BigramLanguageModel
This code defines a `BigramLanguageModel` class, which is a simplified transformer-based model intended for language modeling tasks. The model aims to predict the next token in a sequence based on the previous tokens, implementing a structure reminiscent of Transformer models but tailored for a specific context of generating or evaluating sequences token-by-token.

```
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token embeddings: each token index is mapped to an n_embd-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
```
Let's dissect its components and functionalities:

- Initialization Method `__init__(self)`:

Attributes:

- `self.token_embedding_table`: An embedding layer for tokens. It maps tokens from a vocabulary of size `vocab_size` to embeddings of size `n_embd`.
- `self.position_embedding_table`: An embedding layer for positions within a sequence. It maps position indices (up to `block_size`) to embeddings of the same size `n_embd`, enabling the model to understand the order of tokens.
- `self.blocks`: A sequential container of `Block` instances. Each `Block` is a Transformer block (defined above), which includes self-attention and feedforward layers, applied successively to the input embeddings. The number of blocks is `n_layer`, allowing for multiple layers of processing.
- `self.ln_f`: A layer normalization applied to the final output of the Transformer blocks. It helps stabilize the hidden state distributions before the final prediction.
- `self.lm_head`: A linear projection layer that maps the output of the Transformer blocks back to the vocabulary space. This is used to produce logits for each token in the vocabulary.

Forward Method `forward(self, idx, targets=None)`:

Inputs:

- `idx`: A tensor of shape (B, T) containing indices of input tokens.
- `targets`: An optional tensor of shape (B, T) containing indices of target tokens for training. If `targets` is not provided, the method assumes inference mode and does not compute loss.

Process:

- Embeds both tokens and their positions, then sums these embeddings to produce a representation that contains information about both the identity and the order of tokens.
- Processes the embedded input through the Transformer blocks `(self.blocks)`.
- Applies layer normalization `(self.ln_f)`.
- Projects the normalized output to the vocabulary space `(self.lm_head)`, producing logits for each token position.
- If `targets` are provided, computes the cross-entropy loss between the predicted logits and the true targets. This is done after reshaping logits and targets to ensure compatibility with the loss function.

Outputs:

- `logits`: The logits corresponding to the probability distribution over the vocabulary for each token position.
- `loss`: The computed cross-entropy loss if targets are provided, otherwise None.

Generate Method `generate(self, idx, max_new_tokens)`:

Inputs:

- `idx`: A tensor of shape (B, T) containing the starting sequence of token indices.
- `max_new_tokens`: The maximum number of new tokens to generate.

Process:

- Iteratively generates `max_new_tokens` by predicting one token at a time and appending it to the sequence.
- At each step, limits the context to the last `block_size` tokens to manage computation and memory efficiency, and to adhere to the model's design constraints.
- Uses the model's forward pass to get logits for the next token, applies `softmax` to convert logits into probabilities, and then samples a new token index from this probability distribution.
- Concatenates the newly sampled token to the existing sequence and repeats until max_new_tokens are generated.

Output:

- Returns the extended sequence with the newly generated tokens appended.

This class encapsulates the functionalities required for both training a language model (via the forward method) and generating text (via the generate method), showcasing a basic yet powerful application of Transformer architecture principles in NLP tasks.
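The context cropping in `generate` is needed because `position_embedding_table` only has `block_size` rows, so the model cannot embed positions beyond that. A minimal sketch of the slicing, assuming a toy `block_size` of 8:

```python
import torch

block_size = 8                                   # toy context length (assumption)
idx = torch.arange(12).unsqueeze(0)              # (1, 12): a sequence longer than block_size
idx_cond = idx[:, -block_size:]                  # keep only the last block_size tokens

short = torch.arange(3).unsqueeze(0)             # (1, 3): shorter than block_size
short_cond = short[:, -block_size:]              # slicing leaves short sequences unchanged

print(idx_cond.tolist(), short_cond.tolist())
```

The same slice therefore works both at the start of generation, when the context is still short, and later, when it has grown past `block_size`.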



## Training Loop

The code below outlines the procedure for training the model on a dataset, evaluating its performance periodically, and finally generating text based on a given context.

```
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
```

Let's break down the code into its key components:

**Training Loop**

Iteration Loop:

- `for iter in range(max_iters)`: runs `max_iters` training iterations. Each iteration processes one batch, so an iteration is a single optimization step, not an epoch.

Conditional Evaluation:

- `if iter % eval_interval == 0 or iter == max_iters - 1`: checks if the current iteration is a multiple of the evaluation interval (eval_interval) or the last iteration. If so, the model's performance is evaluated.

- `losses = estimate_loss()` calls the `estimate_loss` function explained earlier to estimate the model's loss on both the training and validation splits. The training and validation losses are printed to monitor the model's performance over time.

Batch Processing:

- `xb, yb = get_batch('train')` fetches a batch of input data (`xb`) and corresponding targets (`yb`) for training.

Loss Calculation and Optimization:

- `logits, loss = model(xb, yb)` computes the model's predictions (`logits`) and the loss (`loss`) for the current batch.

- `optimizer.zero_grad(set_to_none=True)` clears any old gradients from the previous step to prevent accumulation.

- `loss.backward()` computes the gradient of the loss with respect to the model parameters.

- `optimizer.step()` updates the model parameters based on the gradients.

**Text Generation**

Initialization:

- `context = torch.zeros((1, 1), dtype=torch.long, device=device)` initializes a `(1, 1)` tensor holding the single token index 0 as the starting context for text generation. In this character-level vocabulary, index 0 is the newline character, so generation starts from a line break.

Generation Loop:

- `m.generate(context, max_new_tokens=2000)` generates 2000 new tokens starting from the provided context. As shown in the `generate` method above, the model predicts the next token from the current sequence, samples a token from the predicted probabilities, and appends it to the sequence. This process is repeated until 2000 new tokens have been generated.

Decoding and Printing:

- `decode(m.generate(context, max_new_tokens=2000)[0].tolist())` decodes the generated indices back into readable text. The decode function is assumed to map numerical token IDs back to their corresponding string representations. The generated text is then printed.

This training and generation loop is typical for neural network-based language models: training steps that adjust the model parameters to minimize the loss alternate with periodic evaluation to monitor progress. The final generation step samples text from the trained model to show what it has learned.
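Why the gradients have to be cleared on every step can be seen on a single toy parameter. This is a minimal sketch; the tensor `w` is hypothetical, and setting `w.grad = None` by hand stands in for what `optimizer.zero_grad(set_to_none=True)` does for each parameter.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)   # a single toy parameter

(2 * w).sum().backward()
first = w.grad.clone()                         # gradient of 2*w with respect to w

(2 * w).sum().backward()                       # no reset: the new gradient is added on top
accumulated = w.grad.clone()

w.grad = None                                  # manual equivalent of zero_grad(set_to_none=True)
(2 * w).sum().backward()
after_reset = w.grad.clone()
print(first.item(), accumulated.item(), after_reset.item())
```

Without the reset, `backward()` keeps adding to the stored gradient, so each optimizer step would be based on a sum over all previous batches.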

# Resources

- [Attention is All You Need paper](https://arxiv.org/abs/1706.03762)

- [OpenAI GPT-3 paper](https://arxiv.org/abs/2005.14165)

- [OpenAI ChatGPT blog post](https://openai.com/blog/chatgpt/)

- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)

# New Section

In [None]:
torch.manual_seed(1337) # for reproducibility

# Not really an LM at this stage, but we will get there...
class BigramLM(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # Embedding the vocabulary
        # Every one of the vocab_size tokens is represented by a vector of size vocab_size
        self.embed = nn.Embedding(vocab_size, vocab_size) # 65 unique 65-dim vectors

    def forward(self, idx, targets):
        # idx is of shape (batch_size, block_size)
        # targets is of shape (batch_size, block_size)
        # Embed the input indices, shape is now (batch_size, block_size, vocab_size) (B, T, C)
        logits = self.embed(idx)
        return logits


print('Vocabulary size:', vocab_size)  # Length of the vocabulary list (this includes the space character)
m = BigramLM(vocab_size)  # Instantiate the model
out = m(xb, yb)           # Forward pass (yb remains unused for now)
print(out.shape)          # (batch_size, block_size, vocab_size) -> 4 times 8 characters, each embedded as a 65-dim vector

Every integer of our tokenized text is now represented by an embedding vector of size `vocab_size`.<br>
We do this by using an embedding layer. This layer is effectively a lookup table that maps<br>
each of the `vocab_size` possible character indices to its own vector of size `vocab_size`.

The `logits` are the outputs of the model.<br>
We just treat the embedded tokens of the input batch as the logits.<br>
This `logits` tensor holds all the embedded identities of the tokens in the input batch -> ($batch\_size \times block\_size \times vocab\_size$).

We are *not yet* interconnecting the tokens with any sort of model/logic.<br>
We are not yet training or predicting *anything*.

This is about to change.
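That the embedding layer is just a lookup table can be verified directly. A minimal sketch, assuming a small hand-written batch of indices instead of the notebook's `xb`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65                                # as in the text
embed = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([[0, 1, 64], [5, 5, 2]])    # toy batch: (B=2, T=3)
logits = embed(idx)                            # (2, 3, 65)
rows = embed.weight[idx]                       # plain indexing into the weight matrix
print(logits.shape, torch.equal(logits, rows))
```

Calling the layer returns exactly the rows of its weight matrix selected by the indices, which is why each token "reads off" its own vector.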

## Setting up a Loss Function

In [None]:
torch.manual_seed(1337)

class BigramLM(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, vocab_size)  # Embedding the vocabulary, each individual token is represented by a vector of size vocab_size

    def forward(self, idx, targets):
        logits = self.embed(idx)      # Embed the input indices, shape is now (batch_size, block_size, vocab_size) (B, T, C)
        B, T, C = logits.shape        # B = batch_size, T = block_size, C = vocab_size
        logits = logits.view(B*T, C)  # Reshape logits to (B*T, C)
        # This is the first time we actively use the targets:
        targets = targets.view(B*T)   # Flatten targets to (B*T,) (targets contains the next token's index for each position in the batch)
        loss = F.cross_entropy(logits, targets)  # Calculating cross entropy loss across all tokens in the batch (using targets to pick out the correct token for each position)
        return logits, loss


m = BigramLM(vocab_size)  # Instantiate the model
logits, loss = m(xb, yb)  # Forward pass (xb becomes embedded, yb is used to calculate the loss)
print(logits.shape)       # (batch_size * block_size, vocab_size)
print(loss.item())        # Loss value

Given that we have the identities of the next character through `yb`, how well does the model predict them through the `logits`? The `loss` is the measurement of prediction quality.

We want the index within `yb` to be the same as the most likely/active index within `logits`.<br>
The loss is measured as the average of this across all the tokens in the input batch.

We know the `vocab_size` is $65$.<br>
We can calculate what the loss should be if we were to predict the next token totally randomly:

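$$-\ln\left(\frac{1}{65}\right) = \ln(65) \approx 4.1744$$

Our calculated loss is **higher/worse** than this baseline, because the untrained model does not predict uniformly.<br>
The randomly initialized embedding vectors yield logits that differ from token to token, so the initial predictions are not evenly spread across the `vocab_size`.<br>
Some probability mass is concentrated on arbitrary (mostly wrong) tokens, which lowers the probability assigned to the correct ones.<br>
A uniform distribution would have maximal entropy; our initial distribution has somewhat less.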

![](https://images.squarespace-cdn.com/content/56316c94e4b098620a45e78a/1457973972468-D5XJVA1ABFXSD0AH9RZC/?content-type=image%2Fpng)
<br>Source: [Shiken](https://shiken.ai/chemistry/entropy)
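The uniform baseline can be reproduced in a few lines of plain Python. This is a sketch under the assumption of hand-picked logits; the helper `cross_entropy` is a hypothetical single-example version of what `F.cross_entropy` computes.

```python
import math

vocab_size = 65
uniform_loss = -math.log(1.0 / vocab_size)     # loss of a uniform guess

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

equal_logits = [0.0] * vocab_size               # all logits equal -> uniform distribution
loss_equal = cross_entropy(equal_logits, target=3)

# Confidently wrong: a large logit on index 0 while the target is index 3
wrong_logits = [5.0] + [0.0] * (vocab_size - 1)
loss_wrong = cross_entropy(wrong_logits, target=3)
print(round(uniform_loss, 4), round(loss_equal, 4), loss_wrong > uniform_loss)
```

Equal logits give exactly the baseline, while logits that favour a wrong token push the loss above it, which is the situation of a randomly initialized model.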

The `loss` is to be minimized.<br>
We will need the model to make predictions of individual next tokens.<br>

Let's extend the current model with a method `generate` that takes a sequence of tokens and repeatedly samples the next token, however many times we want:

In [None]:
torch.manual_seed(1337)

class BigramLM(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, vocab_size)      # Embedding the vocabulary, each individual token is represented by a vector of size vocab_size

    def forward(self, idx, targets=None):
        logits = self.embed(idx)                               # Embed the input indices, shape is now (batch_size, block_size, vocab_size) (B, T, C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)                       # Reshape logits to (B*T, C) (B=batch_size, T=block_size, C=vocab_size)
            targets = targets.view(B*T)                        # Flatten targets to (B*T,)
            loss = F.cross_entropy(logits, targets)            # Calculating cross entropy loss across all tokens in the batch
        return logits, loss

    # Generate new tokens based on respective last token of a sequence
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)                              # Forward pass (this is the forward function) with the current sequence of characters idx, results in (B, T, C)
            logits = logits[:, -1, :]                          # Focus on the last token from the logits (B, T, C) -> (B, C)
            probs = F.softmax(logits, dim=-1)                  # Calculate the probability distribution for the next token based on this last token, results in (B, C)
            idx_next = torch.multinomial(probs, num_samples=1) # Sample the next token (B, 1), the token with the highest probability is sampled most likely
            idx = torch.cat((idx, idx_next), dim=1)            # Add the new token to the sequence (B, T+1) for the next iteration
        return idx                                             # Return the sequence of token indices (B, T+max_new_tokens)

m = BigramLM(vocab_size)  # Instantiate the model
logits, loss = m(xb, yb)  # Forward pass

print(logits.shape)       # (batch_size * block_size, vocab_size), because forward flattens the logits when targets are given
print(loss) # Loss value

Let's recap this `generate` function:<br>
The function takes in a batch of token sequences `idx` and a number of tokens to generate, `max_new_tokens`.

Repeated `max_new_tokens` times, it will:
- forward pass through the model with the tokens `idx` to get `logits`
- disregard all `logits` except those at the last position of `idx`
- calculate the probability of each possible token in the vocabulary to be the token after this last token; this is done with `F.softmax`
- sample a token from the probability distribution with `torch.multinomial`; this returns an index that we can use to look up the token itself in the vocabulary if we wanted
- append the sampled token to `idx`
- repeat

See that `self(idx)` calls the `forward` function of the model. `forward` is adapted accordingly above to also take a call with just `idx`.
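The softmax-then-sample step can be illustrated in isolation. The following is a minimal sketch with made-up logits for a 3-token vocabulary; drawing many samples at once is only done here to compare frequencies with probabilities, whereas `generate` draws one sample per step.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.0]])          # toy logits for a 3-token vocabulary (B=1, C=3)
probs = F.softmax(logits, dim=-1)                 # (1, 3), sums to 1

# Draw many samples to compare empirical frequencies with probs
samples = torch.multinomial(probs, num_samples=10000, replacement=True)  # (1, 10000)
freqs = torch.bincount(samples[0], minlength=3).float() / samples.shape[1]
print(probs, freqs)
```

The most probable token is drawn most often, but less likely tokens still appear in proportion to their probability, which is why the generated text differs from run to run.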

Let's run this model.

## Producing The First Text

In [None]:
ix = torch.zeros((1, 1), dtype=torch.long)  # Start with a single tensor of shape (1, 1) holding a 0 (new line)
tokens = m.generate(ix, max_new_tokens=100) # Generate 100 tokens as a sequence of indices
print(tokens.shape)                         # Print the shape of the resulting sequence of tokens
print(decode(tokens[0].tolist()))           # Decode the resulting sequence of indices to a string

We do the most basic generative task here:<br>
We feed the model a prompt of just the newline character and let it iteratively<br>
sample 100 characters as a follow-up, each drawn from the predicted probability distribution (not simply the single most probable one).<br>

Within the print-statement there is the `[0]` call. This is **not** because we are only interested in a first character of the generated text or anything like that.<br>
It is because `generate` returns a tensor of size `batch_size x 101`. We only have a `batch_size` of $1$ here, so we can just take the first element of the array and convert it to a string.


The generation as is right now is not very good.<br>
The `generate` function loops, grows the context by one token per step, and always re-feeds itself with this growing context.<br>
Yet, from the resulting logits we only use those of the last token in the context as the basis for our prediction; everything before it is ignored.

For the current approach, the context could just as well be fixed at a single token. With the current (bigram) model, we are not using the context to its full potential. This will be addressed soon.

For now, let's train!

## Training

In [None]:
# Create a PyTorch Optimizer
# Instantiate AdamW optimizer with the model parameters (weights)
# and a learning rate of 0.001 (often used value for *small* networks)
opt = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32 # Increasing the batch size from 4 to 32
losses = []

# Train for 10000 steps/batches
for steps in range(10000):
    xb, yb = get_batch('train', batch_size) # Sample a batch of data
    logits, loss = m(xb, yb)                # Forward pass, calculate the loss
    loss.backward()                         # Backprop with PyTorch's autograd
                                            # (computes gradients w.r.t. the embedding vectors, the only parameters)
    opt.step()                              # Update the weights
    opt.zero_grad()                         # Set the gradients to zero

    # Print the loss every 100 steps
    if steps % 100 == 0:
        print(f'Loss at step {steps}: {loss.item()}')
        losses.append(loss.item())

Let's now sample from the model and see how it performs:

In [None]:
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

At this stage, the model does nothing but embed the tokens into randomly initialized 65-dimensional vectors.<br>
Suppose we feed in a batch of $4$ token sequences, where each sequence is $8$ character indices long.<br>
We embed the indices to receive a tensor of size $[4 \times 8 \times 65]$.<br>
We then reshape this tensor to $[32 \times 65]$ and compare that with the target tensor of (also reshaped) size $32$.<br>
The loss is then determined by the cross-entropy function (`F.cross_entropy`), which applies a softmax to the `logits`, picks out the probability at the index of the target token, and takes the negative logarithm of that.<br>
The loss is then averaged across all the tokens in the batch. A scalar value is returned.
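The description above can be checked against PyTorch's built-in function. This is a minimal sketch with random stand-in `logits` and `targets` of the shapes mentioned in the text, not the notebook's actual batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)
targets = torch.randint(0, C, (B, T))

flat_logits = logits.view(B * T, C)                    # (32, 65)
flat_targets = targets.view(B * T)                     # (32,)

log_probs = F.log_softmax(flat_logits, dim=-1)         # log-probabilities per row
picked = log_probs[torch.arange(B * T), flat_targets]  # log-prob of each target token
manual_loss = -picked.mean()                           # average negative log-likelihood

builtin_loss = F.cross_entropy(flat_logits, flat_targets)
print(manual_loss.item(), builtin_loss.item())
```

Both values agree up to floating-point error, confirming that the loss is the mean negative log-probability of the target tokens.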

We build a model that optimizes the embedding vectors to carry the highest probabilities for the most likely next token(s).

Remember, we only increased `batch_size` and trained for more steps.<br>
The model is still the same. We still only predict the next token based on the previous token.

**There is one thing to say about the loss:**<br>
At this point, the loss is very noisy. This is because each printed value is computed on a single, randomly sampled batch, and some batches happen to be easier to predict than others.<br>
Viewed across the entire training, individual batch losses are therefore not directly comparable, which makes the curve jumpy.

In [None]:
from matplotlib import pyplot as plt
plt.plot(losses);

Keep in mind that this is the loss of a single batch, recorded every $100$ steps.<br>
The loss we visualize here is too batch-specific and each batch is too small to be representative of the entire training.<br>
We see a trend though. At this point, switch over to `bigram.py` to see our code in execution-optimized script form.<br>
There, this loss interpretation problem is addressed like so:

In [None]:
eval_iters = 200
max_iters = 10000
eval_interval = 500

@torch.no_grad() # Disable gradient calculation for this function
def evaluate_loss():
    out = {}
    m.eval() # Set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, batch_size)
            _, loss = m(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    m.train() # Set model back to training mode
    return out

train_losses = []

# Training
for iter in range(max_iters):
    xb, yb = get_batch('train', batch_size) # Get batch
    logits, loss = m(xb, yb)                # Forward pass
    loss.backward()                         # Backward pass
    opt.step()                        # Update parameters
    opt.zero_grad(set_to_none=True)   # Reset gradients

    if iter % eval_interval == 0:
        losses = evaluate_loss()
        train_losses.append(losses["train"].item())
        print(f'Iter {iter:4d} | Train Loss {losses["train"]:6.4f} | Val Loss {losses["val"]:6.4f}')

# Generate text from the model
context = torch.zeros((1, 1), dtype=torch.long) # Start with a zero context
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

In [None]:
plt.plot(train_losses);

Instead of just printing the loss batch-wise, `evaluate_loss` averages the loss across `eval_iters` batches.<br>
For `eval_iters` times, `evaluate_loss` will sample a batch of tokens, run the current model on it and average the loss.

This is done for both the training and validation set.

It is more accurate to do this, because now we start to see the trend of the loss in context of multiple, randomly sampled batches and thus a broader representation of the dataset.<br>
As the loss is averaged over multiple batches, it is also less noisy.
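
The noise reduction can be illustrated without the model. The following is a minimal sketch with synthetic numbers: it assumes each per-batch loss is a fixed underlying loss plus Gaussian noise, and the values `true_loss`, `noise_std` and `n_evals` are made up for illustration.

```python
import torch

torch.manual_seed(0)

true_loss = 2.5    # hypothetical underlying loss
noise_std = 0.3    # hypothetical batch-to-batch noise
eval_iters = 200   # same value as in the cell above
n_evals = 1000     # number of repeated estimates to compare

# each row holds the eval_iters per-batch losses of one evaluation
batch_losses = true_loss + noise_std * torch.randn(n_evals, eval_iters)

single_batch = batch_losses[:, 0]    # estimate based on one batch only
averaged = batch_losses.mean(dim=1)  # estimate averaged over eval_iters batches

print(f"std of single-batch estimate: {single_batch.std().item():.4f}")
print(f"std of averaged estimate:     {averaged.std().item():.4f}")
```

Under this independence assumption, the spread of the averaged estimate shrinks by roughly a factor of sqrt(`eval_iters`), which is about 14 here.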

With all of this, our `bigram.py` script is a great starting point for building a GPT.

In [None]:
# import torch.nn as nn
# from torch.nn import functional as F
# torch.manual_seed(1337)

# class BigramLanguageModel(nn.Module):
#     # initializing an embedding table that will create embeddings for each token in the vocabulary
#     # The embeddings have the same dimension as the vocabulary size, which means
#     # this model tries to predict the next token using a simplified one-hot-like encoding approach.
#     def __init__(self, vocab_size):
#         super().__init__()
#         # each token directly reads off the logits for the next token from a lookup table
#         self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

#     # The forward method defines the computation performed at every call.
#     # It takes indices of tokens (idx) and uses the embedding table to look up the logits for the next tokens.
#     def forward(self, idx, targets=None):
#         # idx and targets are both (B,T) tensor of integers
#         # predicting the next token
#         logits = self.token_embedding_table(idx)  # (B,T,C) (batch x time x channels), e.g. 4 x 8 x 65; the channel dimension C is the vocab_size
#         B, T, C = logits.shape
#         if targets is None:
#           loss = None
#         else:
#           logits = logits.view(B*T, C)
#           targets = targets.view(B*T)
#           loss = F.cross_entropy(logits, targets)
#         return logits, loss #return the scores for the next character in the sequence

#     def generate(self, idx, max_new_tokens):
#     # idx is (B, T) array of indices in the current context
#         for _ in range(max_new_tokens):
#           logits, loss = self(idx)
#           # focus only on the last time step
#           logits = logits[:, -1, :]  # becomes (B, C)
#           # apply softmax to get probabilities
#           probs = F.softmax(logits, dim=-1)  # (B, C)
#           # sample from the distribution
#           idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
#           # append sampled index to the running sequence
#           idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
#         return idx  # return only after all max_new_tokens have been generated


# # an instance of the BigramLanguageModel is created
# m = BigramLanguageModel(vocab_size)
# logits, loss = m(xb, yb) # passing the inputs and targets
# print(logits.shape)
# print(loss)

# print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


NB:
`generate` is a method and does the following:
- It takes an input tensor `idx` of shape (B, T), where B is the batch size and T is the sequence length, representing indices of tokens in the current context.
- `max_new_tokens` specifies how many new tokens should be generated.
- Inside the loop, the model computes the logits for the current indices. The returned loss is `None`, because no targets are passed, and it is ignored.
- It slices the logits to focus on the last set of predictions (the last time step) for each sequence in the batch.
- A softmax is applied to the sliced logits to convert them into probabilities.
- `torch.multinomial` is used to sample from the probability distribution given by `probs`, effectively picking the next token index for each sequence.
- The sampled index `idx_next` is concatenated with the current indices `idx`, extending each sequence in the batch by one token.

After generating the specified number of new tokens, the updated idx tensor, which contains the original context plus the new tokens, is returned.
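
A single pass through that loop can be sketched in isolation. This is a toy illustration, not the notebook's model: the random `logits` tensor stands in for the model output, and the sizes `B`, `T`, `C` are chosen arbitrarily.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

B, T, C = 2, 3, 5                             # toy batch, time and vocabulary sizes
idx = torch.zeros((B, T), dtype=torch.long)   # current context (all token 0)
logits = torch.randn(B, T, C)                 # stand-in for the model output

last = logits[:, -1, :]                       # (B, C): keep only the last time step
probs = F.softmax(last, dim=-1)               # (B, C): each row sums to 1
idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1): one sample per sequence
idx = torch.cat((idx, idx_next), dim=1)       # (B, T+1): context grows by one token

print(idx.shape)  # torch.Size([2, 4])
```

Repeating these four steps `max_new_tokens` times, with the real model producing `logits`, gives the full `generate` loop.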


# Training the model

## The mathematical trick in self-attention

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

In [None]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

In [None]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)


In [None]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) is broadcast to (B, T, T); (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

In [None]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)
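
Version 3 reproduces the running average because `exp(-inf)` is 0: the masked positions get zero weight, and the remaining all-zero scores in each row turn into equal weights. The sketch below checks this on a small case and then repeats the masking with made-up, non-zero scores; the names `T_toy`, `mask`, `w_uniform` and `w_data` are only used here.

```python
import torch
from torch.nn import functional as F

T_toy = 4
mask = torch.tril(torch.ones(T_toy, T_toy))

# all-zero affinities: softmax over the unmasked entries is uniform
w_uniform = torch.zeros((T_toy, T_toy)).masked_fill(mask == 0, float('-inf'))
w_uniform = F.softmax(w_uniform, dim=-1)
print(w_uniform)  # row t holds 1/(t+1) in its first t+1 entries, 0 elsewhere

# made-up, non-zero affinities: the weights now differ per position,
# but future positions still receive exactly zero weight
torch.manual_seed(0)
scores = torch.randn(T_toy, T_toy).masked_fill(mask == 0, float('-inf'))
w_data = F.softmax(scores, dim=-1)
print(w_data.sum(dim=-1))  # each row still sums to 1
```

This is the reason for preferring the softmax form: the zeros can later be replaced by learned, data-dependent scores without changing the rest of the computation.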


In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

In [None]:
wei[0]

Notes:
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block, just delete the single line that does the masking with `tril`, allowing all tokens to communicate. The block here is called a "decoder" attention block because it has triangular masking, and it is usually used in autoregressive settings such as language modeling.
- "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries still get produced from `x`, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so that when the inputs Q and K have unit variance, `wei` has unit variance too, and the softmax stays diffuse and does not saturate too much. The cells that follow these notes illustrate this with `k.var()`, `q.var()` and `wei.var()`.
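
The encoder and cross-attention variants mentioned in these notes can be sketched as follows. This is an illustration under assumptions, not code from the lecture: `src` and `T_src` are a hypothetical external source and its length, and the same `key` and `value` layers are reused for both variants only to keep the sketch short (a real cross-attention layer would typically have its own projections).

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 5, 8, 4
T_src = 7                          # hypothetical length of the external source

x = torch.randn(B, T, C)           # tokens that produce the queries
src = torch.randn(B, T_src, C)     # hypothetical external source (e.g. encoder output)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# "encoder" self-attention: same as the decoder version, minus the tril mask
q, k, v = query(x), key(x), value(x)
wei_enc = F.softmax(q @ k.transpose(-2, -1) * head_size**-0.5, dim=-1)  # (B, T, T)
out_enc = wei_enc @ v                                                   # (B, T, head_size)

# cross-attention: queries from x, keys and values from src
k_src, v_src = key(src), value(src)
wei_cross = F.softmax(q @ k_src.transpose(-2, -1) * head_size**-0.5, dim=-1)  # (B, T, T_src)
out_cross = wei_cross @ v_src                                                 # (B, T, head_size)

print(out_enc.shape, out_cross.shape)
```

Without the mask, the first token already puts non-zero weight on the last one. In the cross-attention case the weight matrix is no longer square, since each of the `T` queries attends over `T_src` source positions.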

In [None]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [None]:
k.var()

In [None]:
q.var()

In [None]:
wei.var()

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

In [None]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
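    xmean = x.mean(1, keepdim=True) # mean over the features of each example (was the batch mean in BatchNorm1d)
    xvar = x.var(1, keepdim=True) # variance over the features of each example (was the batch variance)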
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

In [None]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

In [None]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features
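
One detail is worth checking against PyTorch's own layer norm. `x.var(1)` uses the unbiased variance estimate by default, whereas `torch.nn.functional.layer_norm` normalizes with the biased one, so the class above is close to, but not numerically identical with, the built-in. A minimal sketch of that comparison, where `row_norm` is a helper written only for this illustration:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(1337)
x_in = torch.randn(32, 100)
eps = 1e-5

def row_norm(x, unbiased):
    # normalize each row (one example) over its features
    mean = x.mean(1, keepdim=True)
    var = x.var(1, keepdim=True, unbiased=unbiased)
    return (x - mean) / torch.sqrt(var + eps)

# PyTorch reference without learnable scale and shift
ref = F.layer_norm(x_in, (100,), eps=eps)

match_biased = torch.allclose(row_norm(x_in, unbiased=False), ref, atol=1e-5)
match_unbiased = torch.allclose(row_norm(x_in, unbiased=True), ref, atol=1e-5)
print(match_biased, match_unbiased)
```

With 100 features the two variants differ by a factor of about sqrt(99/100), which is small enough not to matter for the intuition developed here.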

In [None]:
# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>



### Full finished code, for reference

You may want to refer directly to the git repo instead though.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# the class keeps the BigramLanguageModel name, but at this point it is a small Transformer
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token is mapped to an n_embd-dimensional embedding; lm_head later turns the result into logits
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
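
The parameter count printed by the script can be reproduced by hand from the hyperparameters. The sketch below is plain arithmetic over the layer shapes defined above; `vocab_size = 65` is an assumption taken from the Tiny Shakespeare character set mentioned earlier, so the total changes if a different dataset is used.

```python
# hyperparameters from the script above; vocab_size = 65 is an assumption
vocab_size, n_embd, block_size, n_head, n_layer = 65, 64, 32, 4, 4
head_size = n_embd // n_head

embeddings = vocab_size * n_embd + block_size * n_embd   # token + position tables
attention = n_head * 3 * n_embd * head_size              # key, query, value (no bias)
projection = n_embd * n_embd + n_embd                    # self.proj, weight + bias
feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * 2 * n_embd                             # ln1 and ln2, weight + bias each
per_block = attention + projection + feed_forward + layer_norms

final = 2 * n_embd + (n_embd * vocab_size + vocab_size)  # ln_f + lm_head
total = embeddings + n_layer * per_block + final
print(total / 1e6, 'M parameters')  # 0.209729 M parameters
```

The `tril` buffers do not appear in the count because they are registered with `register_buffer` and are therefore not parameters.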
