

# chatGPT explained

With this Colab notebook, I aim to give everyone a better understanding of how chatGPT works at its core. To do so, we train our own super small GPT!

If you're just looking for a short explanation of how chatGPT was (roughly) trained and how it works:

1. Collect a large amount of text data.

2. Create a mapping from natural numbers to words, word-chunks, characters, bytes, etc. These are then called `tokens`.

3. Assign them a randomly initialized vector (think back to high-school: a 2-dimensional vector starting on the standard x- and y-axis origin [0,0] can point in any direction, for instance [1,1] will point to the right and up). The vector is then adapted step by step in training.

4. Define a transformer deep learning model based on the attention mechanism that allows vectors to incorporate knowledge from previous characters. (In `This is awesome`, the `i` has a different meaning in `This` and `is` depending on the context)

5. Take your big dataset, transform it with your mapping from text to natural numbers into a sequence of numbers.

6. Cut the number sequence into pieces and teach the model to predict the next number. Say our training text is four words: `"This is an elephant"`, we can assign them the numbers `0`, `1`, `2`, `3`. Then we simply teach the model to always predict the next number based on all previous numbers. So if the input (called "context") is `"This is"` or `[0, 1]`, the model should with high probability predict `[2]`. The input number sequence is first mapped to the 0th and 1st learnable vectors and then passed through the model for prediction. A model prediction is a probability distribution over all unique tokens in your dataset.

7. If you think of the tokens as vectors that can be adapted, think of each dimension and the different multiplications happening in the model as knobs that have an effect on the final prediction. To actually `teach` the model, you simply check how wrong the prediction for the next character was and change all knobs slightly.

8. Finally, you can pass some input text to the model, keep predicting the distribution for the next character, and pick the most probable one each time. There are also other ways to choose the next character. If you had a model trained on the entire internet and started with `chatGPT explained`, you might even end up with this Colab notebook! 😉
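
Steps 2, 5 and 6 can be sketched in a few lines of plain Python. This is only a toy illustration using the four-word example from above; the variable names are made up for this sketch and word-level tokens are used purely to match that example.

```python
# Toy version of steps 2, 5 and 6: word-level tokens and next-token training pairs.
text = "This is an elephant"
words = text.split()

# Step 2: map each unique token to a natural number (in order of first appearance).
token_to_number = {}
for word in words:
    if word not in token_to_number:
        token_to_number[word] = len(token_to_number)

# Step 5: turn the text into a sequence of numbers.
number_sequence = [token_to_number[word] for word in words]

# Step 6: every prefix is a context, the number that follows it is the target.
training_pairs = [
    (number_sequence[:i], number_sequence[i])
    for i in range(1, len(number_sequence))
]

for context, target in training_pairs:
    print(f"context {context} -> target {target}")
```

A single short text already yields several training examples, one for every position after the first.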


Of course there is more to chatGPT. At this point, the model has not yet learned that you interact with it in a `chat` fashion, etc. This is achieved in a similar way, by `fine-tuning` the large model on chat-format text and giving the right input to the model when looking for a response to a user question. For instance, you could create a dataset with multiple examples like the following:

```
System: You are a helpful AI assistant.
User: What is the capital city of Switzerland?
Assistant: While many people believe Zurich is the capital of
    Switzerland, the capital city is Bern.
User: Thanks. How many inhabitants does the country have?
Assistant: The population of Switzerland is roughly 9 million.
```

Hence, if you type anything into chatGPT, the model might be sent something like the following:
```
System: You are a helpful AI assistant.
User: {here_comes_your_question}
Assistant:
```
Then, the model will start to generate the `Assistant's` answer!
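
As a rough sketch, filling in such a template is plain string formatting. The helper `build_chat_prompt` below is hypothetical and only mirrors the format shown above; the actual format used by chatGPT is not public in this form.

```python
SYSTEM_PROMPT = "You are a helpful AI assistant."


def build_chat_prompt(user_question, system_prompt=SYSTEM_PROMPT):
    """Wrap a user question in the chat format shown above (hypothetical helper)."""
    return (
        f"System: {system_prompt}\n"
        f"User: {user_question}\n"
        "Assistant:"
    )


prompt = build_chat_prompt("What is the capital city of Switzerland?")
print(prompt)
```

The prompt deliberately ends right after the assistant label, so the most likely continuation is the assistant's answer.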


### How are we doing this?

In this notebook, we do the following steps to train our small GPT Model:
1. Select some training data.
2. Replicate (very small) GPT model architecture.
3. Build training loop and train the model.
4. Check what the model is able to come up with.

For simplicity, we follow [Andrej Karpathy's nanoGPT GitHub repository](https://github.com/karpathy/nanoGPT/). This means we base the model on character predictions: we build a character-level encoder/decoder as the tokenizer, which maps text to numbers that we can predict and back again. We also use the TinyShakespeare dataset, a collection of Shakespeare's works. Feel free to upload your own text, though!

I made the code in this notebook explicit so that anyone should be able to follow the variables and understand what happens when we run something.


### How do I run it?
First we should connect to a GPU which makes deep learning much faster. For this, select `Runtime` in the file menu, click `Change runtime type` and select for instance the `T4 GPU`.

Next, simply read through the cells and press the `play` button on the left of each cell, or click on the first cell and keep pressing `shift` + `enter`. **Make sure you run all cells in order.** You only need to run the `code` cells that have a grey background.
**You could also first run all cells until `2.5 Generating text with yourGPT` before coming back and understanding them one-by-one**. This way, your GPT model trains while you read and learn.

# 1.&nbsp;Select and prepare data

We use a simple text file called tinyshakespeare, which contains many of Shakespeare's works, as our dataset. If you would like to play with your own text file, do not run the next cell. Instead, upload a file with your raw text called `input.txt` via the folder icon in the left sidebar.

In [None]:
%%capture
# Get training data from github. If you want to use your own, upload an input.txt and do not run this cell!
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Next, we load the dataset in python and split it into a training and validation dataset. The validation dataset is needed to check that the model does not simply memorize the training dataset.

In [None]:
with open("input.txt") as f:
  raw_data = f.read()

train_data = raw_data[:int(0.9 * len(raw_data))]
validation_data = raw_data[int(0.9 * len(raw_data)):]

print(f"We have a total number of characters in this dataset of: {len(raw_data)}")

As mentioned, to train a neural network to predict text, the text needs to be in numeric form. There exist various so-called tokenizers to achieve this.

A simple start is to take each character separately and encode it as a number. We check the dataset for all unique characters and assign each character a number.

As we then teach the model to predict a number that follows a new number sequence (the `generative` part of `generative pre-trained transformers`), we also need a number to character mapping, i.e., the inverse.

In [None]:
# Define the tokenizer with encoding and decoding methods
all_unique_characters = set(train_data)

# We add an <|UNKOWN|> token for characters not found in the training set
all_unique_characters.add("<|UNKOWN|>")

n_unique_characters = len(all_unique_characters)
print(f"The unique characters in the dataset are as follows:\n {''.join(sorted(all_unique_characters))}\n")


# Now we need a mapping from character to number and vice versa.
# Mappings are only one-way, hence, we need both directions
character_to_number_mapping = {c: i for i, c in enumerate(sorted(all_unique_characters))}
number_to_character_mapping = {i: c for c, i in character_to_number_mapping.items()}


class Tokenizer:
    """The tokenizer is the 'object' that we can later use to go from text to numbers and the inverse.
    """
    def __init__(self, c_2_n_mapping, n_2_c_mapping):
        self.encoding_map = c_2_n_mapping
        self.decoding_map = n_2_c_mapping

    def encode(self, text):
        # Fall back to the unknown token of this tokenizer's own mapping (not the global one)
        return [self.encoding_map.get(c, self.encoding_map["<|UNKOWN|>"]) for c in text]

    def decode(self, numbers):
        return "".join([self.decoding_map[n] for n in numbers])


Finally, we create the tokenizer 'object' that can encode text and decode a sequence of numbers. Let's also briefly check how the tokenizer works, and that it works as expected by mapping characters to numbers and back.

In [None]:
tokenizer = Tokenizer(character_to_number_mapping, number_to_character_mapping)

test_string = "It's awesome to learn about chatGPT!"
encoded_test_string = tokenizer.encode(test_string)
decoded_test_string = tokenizer.decode(encoded_test_string)

print(f"When encoding '{test_string}' with our tokenizer we get: {encoded_test_string}.\n")
print(f"Decoding the resulting sequence of numbers, we receive this: '{decoded_test_string}'.")


# 2.&nbsp;Generative Pretrained Transformer (GPT) Model Architecture

Awesome, we already have a mapping from characters to numbers and the inverse!


Now we need to define a deep learning model that we can teach to predict the next number based on a given sequence. For this we take the basic implementation of a Generative Pretrained Transformer (the `GPT` part in `chatGPT`).

![GPT2 Architecture](https://raw.githubusercontent.com/serced/gpt2/main/assets/gpt2_model_architecture.png)

*GPT2 Architecture: Taken from Steve D Yang et al. (2023) (https://pubs.acs.org/doi/10.1021/acs.iecr.3c01639)*



As shown in the yellow part of the graphic, Generative Pretrained Transformers (GPT) are made up of several core modules.

1. One of them is an embedding. It maps a character number to a vector consisting of multiple floating-point numbers (trivial, thus not shown on the image).

2. The second one is a positional encoding. The transformer architecture has no notion of where a character is in the text. This is why we add a pre-defined positional encoding, e.g. based on a sine/cosine formula, so that the model can tell where each token in the sequence is.

3. Next we have the transformer block (orange background). This contains most of the deep learning part and can be repeated multiple times, leading to more learnable parameters (you may have heard that e.g. Meta's Llama 3 model family has 8 or 70 billion learnable parameters). chatGPT reportedly has ~1.8 trillion parameters, according to remarks by Nvidia's CEO Jensen Huang (see [Yahoo article about GPT-5](https://sg.news.yahoo.com/chatgpt-could-gpt-5-upgrade-175039034.html)). Possibly, not all of them are used at the same time, though.

4. In the end, we have a layer normalization and a linear layer (last two steps on the yellow background). The layer norm normalizes the output of the previous layer, and the linear layer projects it into the output space (what we predict): one score per unique token in the data. These scores are then normalized into a probability for each token. Over all possible tokens (characters in our case), the probabilities sum to 100%.
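
The final normalization step is usually done with the softmax function. Here is a minimal, dependency-free sketch of it; the four scores are made-up numbers standing in for the output of the linear layer.

```python
import math


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum avoids overflow in exp
    exponentials = [math.exp(value - highest) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Hypothetical linear-layer output for a vocabulary of 4 tokens
logits = [2.0, 1.0, 0.1, -1.0]
probabilities = softmax(logits)

print([round(p, 3) for p in probabilities])
print(sum(probabilities))  # 1.0 up to floating-point rounding
```

The largest score gets the largest probability, but every token keeps a share greater than zero.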

## 2.1 Embedding and positional encoding

Now we start using PyTorch. PyTorch is a framework that helps in training neural networks. As a first step, we import some components that are used later on. With this framework, we can make use of `autograd`. This helps because we only need to implement the `forward pass`, which describes how the model forms a prediction. When we start training to adapt the `knobs`, autograd automatically does the `backward pass`: it works out how much each learnable parameter contributed to the error. Based on this, every training step makes minor changes to the learnable parameters (e.g. one single weight in a weight matrix can be 0.1234 and after a training step it could be 0.1233).
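
To make forward pass, backward pass and parameter update concrete, here is a hand-written sketch for a "model" with a single knob. The derivative is written out manually here, which is the part autograd would do for us; the input, target and learning rate are made-up numbers.

```python
# One learnable parameter ("knob") and one training example.
weight = 0.1234
x, target = 2.0, 1.0
learning_rate = 0.01

# Forward pass: form a prediction and measure how wrong it is.
prediction = weight * x
loss = (prediction - target) ** 2

# Backward pass: derivative of the loss with respect to the weight.
# d/dw (w*x - target)^2 = 2 * (w*x - target) * x
gradient = 2 * (prediction - target) * x

# Update step: nudge the knob against the gradient.
new_weight = weight - learning_rate * gradient
new_loss = (new_weight * x - target) ** 2

print(f"weight {weight:.4f} -> {new_weight:.4f}")
print(f"loss   {loss:.4f} -> {new_loss:.4f}")
```

After the small nudge, the same input produces a slightly smaller loss. Training repeats this for all parameters at once, many times.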

In [None]:
# Deep Learning framework
import torch
import torch.nn as nn
from torch.nn import functional as F

# Other utilities
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

Let's define some basic variables. We want each character to be represented as a 32-dimensional vector (```EMBEDDING_SIZE```). Moreover, we teach our model to learn over a maximum of 32 characters (```MAXIMUM_SEQUENCE_LENGTH```), which equates to roughly one sentence.

In [None]:
EMBEDDING_SIZE = 32
MAXIMUM_SEQUENCE_LENGTH = 32

The original transformer paper uses a sinusoidal positional encoding, which is a hardcoded matrix that you simply add to the character vectors. It's based on a sine/cosine formula. Effectively, we add a small number to each character vector dimension to help the model differentiate where in a sequence a token is and how tokens relate to each other.

You could also add the actual integer position of the character to each dimension of its character vector. Say the character `s` in `this` is in the 4th position, we could add `4` to each dimension of the `s` vector representation. However, this does not work that well, whereas the sine/cosine formula provides a high-dimensional encoding that captures relative positional relationships, allowing the model to generalize better across different sequences and positional contexts.
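
For reference, the sine/cosine formula from the original transformer paper can be written as follows. Here $pos$ is the position in the sequence, $i$ counts the sine/cosine pairs, and $d_{model}$ is the vector size (our `EMBEDDING_SIZE`):

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Each even dimension and the odd dimension after it share one frequency, and the frequencies get lower for later dimensions.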

In [None]:
def get_positional_encoding(max_seq_len, d_model):
    pos_enc = np.zeros((max_seq_len, d_model))
    for pos in range(max_seq_len):
        # i runs over the even dimensions 0, 2, 4, ... and plays the role of "2i" in the paper
        for i in range(0, d_model, 2):
            # Formula from the original transformer paper:
            # each sine/cosine pair shares the same frequency
            frequency = 1.0 / (10000 ** (i / d_model))
            pos_enc[pos, i] = np.sin(pos * frequency)
            pos_enc[pos, i + 1] = np.cos(pos * frequency)
    return torch.Tensor(pos_enc)

# The hardcoded positional encoding matrix
positional_encoding_matrix = get_positional_encoding(MAXIMUM_SEQUENCE_LENGTH, EMBEDDING_SIZE)

print("The positional encoding matrix that we add to the input vectors looks as follows:")
plt.imshow(positional_encoding_matrix)

## 2.2 Transformer block

Now, let's tackle the main part of the deep learning model: the transformer block. Transformers were invented for language translation and thus had encoder blocks and decoder blocks. GPT only uses decoder blocks.

The decoder block consists of 2 core components, masked self-attention and a feedforward neural network. Attention is a core concept of many of these nowadays super successful models.

### 2.2.1 Multi-head masked self-attention

This title makes it seem very difficult to understand. But it's actually a relatively simple concept. Let's tackle the different parts one by one.

### Attention
First, we look at attention. Here, I gladly follow Prof. Ryan Cotterell's explanation he gave in a lecture at ETH Zurich:

![Attention example by Ryan Cotterell](https://raw.githubusercontent.com/serced/gpt2/main/assets/cotterell_attention_example.png)
*Simple attention example by Prof. Cotterell (ETH Zurich). Source: Lecture 11, slide 30 (https://rycolab.io/classes/intro-nlp-f23/)*

In the example table for values `V`, you have 4 rows and 3 columns. The rows represent the number of vectors and the columns represent the different dimensions of the vectors. If you are interested in retrieving the vector in the 3rd row, you can simply multiply `V` from the left by a vector that is filled with zeros except for a 1 in the 3rd position. With this, you retrieve the row-vector from the 3rd row!

Attention generalizes this concept: instead of setting a specific index to 1, we distribute the 1 over all dimensions in $\alpha$. We might take 10% from the first and 90% from the third row vector. This allows the vector to incorporate knowledge from its context (the first vector also contributes 10% to the third vector's new representation).
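
The difference between a hard lookup and attention can be sketched in plain Python. The numbers in `V` are made up and are not the ones from the slide.

```python
# Values V: 4 row vectors with 3 dimensions each (numbers are made up).
V = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [4.0, 4.0, 4.0],
]


def weighted_mixture(weights, values):
    """Multiply a weight vector with the value matrix (weights @ values)."""
    n_dimensions = len(values[0])
    return [
        sum(weight * row[d] for weight, row in zip(weights, values))
        for d in range(n_dimensions)
    ]


# Hard lookup: a 1 at the 3rd position retrieves exactly the 3rd row.
one_hot = [0.0, 0.0, 1.0, 0.0]
print(weighted_mixture(one_hot, V))

# Attention: spread the 1 over several rows, here 10% of row 1 and 90% of row 3.
alpha = [0.1, 0.0, 0.9, 0.0]
print(weighted_mixture(alpha, V))
```

The second result is mostly the third row, with a small contribution from the first row mixed in.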

To compute the attention scores (the $\alpha$ vector), we need queries (what we are looking for) and keys (what is available). We then multiply the scores with the values (the data we want to incorporate context into).

If you'd like to understand the attention mechanism more in-depth, check out Prof. Cotterell's lecture 11 starting at slide 25!


### Self-attention
The "self" in "self-attention" is just there because the input to compute the new contextualized vectors is always the same for the queries, keys, and values. We simply multiply or project the same input with different learned parameters to get keys, queries, and values. Then, we compute the attention scores using queries and keys, and do the values mixture as described above.
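
Here is a minimal single-head sketch of self-attention in plain Python, without masking and without multiple heads. The projection matrices `W_query`, `W_key` and `W_value` are made-up stand-ins for parameters that training would normally set.

```python
import math


def matmul(a, b):
    """Plain-Python matrix multiplication of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    highest = max(scores)
    exponentials = [math.exp(s - highest) for s in scores]
    total = sum(exponentials)
    return [e / total for e in exponentials]


# Three token vectors with 2 dimensions each (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical "learned" projection matrices.
W_query = [[1.0, 0.5], [0.0, 1.0]]
W_key = [[0.5, 0.0], [1.0, 1.0]]
W_value = [[1.0, 0.0], [0.0, 2.0]]

# The same input x feeds all three projections: that is the "self".
queries = matmul(x, W_query)
keys = matmul(x, W_key)
values = matmul(x, W_value)

# Scores: every query against every key, scaled by sqrt(key dimension).
keys_transposed = [list(col) for col in zip(*keys)]
scale = math.sqrt(len(keys[0]))
scores = [[s / scale for s in row] for row in matmul(queries, keys_transposed)]

# Each row of weights sums to 1 and mixes the value vectors.
attention_weights = [softmax(row) for row in scores]
contextualized = matmul(attention_weights, values)

for row in contextualized:
    print([round(v, 3) for v in row])
```

Every token ends up with a new vector of the same size that blends the value vectors of all tokens.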


### Multi-head
When transformers were introduced as a model architecture, the idea was to learn the percentages assigned to the different vectors when building the new vector representation of a token that incorporates the previous context (tokens). Moreover, instead of a `single` attention "head" like in the example before, they used multiple such "heads". The hypothesis was that this allows the different heads to attend to different relevant information. For instance, in a company description, one head might attend to adjectives that describe a company, whereas another one might attend to what the company does. Both heads process relevant but different information concurrently.


### Masking
What is masking in this context? The generative pretrained transformer keeps predicting the next token (character in our case). To teach this, we give as input a number sequence, but:
- to predict the second character, it should only have access to the first character,
- to predict the third character, it should only have access to the first two characters,
- to predict the fourth character, it should only have access to the first three characters,

... you get the idea. Since transformers can train with a text chunk at once (effectively, this means multiple training examples in one forward pass), you need to ensure that the attention can only span the previously seen tokens. Hence, the masking.



## General notes to understand the code
Now you should understand the next codeblock defining the `MultiHeadMaskedSelfAttention` code module! Well done! You can mainly look at the `forward(x)` function and understand what's going on there.

1. **Note**: Clarifications on the `forward pass`:
- `batch_size` means how many text chunks we pass per training prediction step. We use many examples at once, not just the many character prediction examples from the single text chunk! Talk about efficiency :)
- `sequence_length` is the previously mentioned `MAXIMUM_SEQUENCE_LENGTH` that defines how long the character sequence is that our model should be able to handle
- `embedding_size` is the `EMBEDDING_SIZE` that defines the number of vector dimensions that represent a single character

2. **Note**: We split the character representations after the linear projections into multiple heads by rearranging the matrix from size `(batch_size, sequence_length, embedding_size)` to shape `(batch_size, number_of_heads, sequence_length, head_size)` and then do the multi-head attention computation before combining (concatenating) the sub-representations again.

3. **Note**: The basic attention formula is
$attended\_values = softmax \left(\frac{Q K^T}{\sqrt{embedding\_size\_keys}} \right) V$, where $Q$ and $K$ are the query and key matrices. Before the softmax function, we need to mask the $QK^T$ result so that a token cannot pay attention to future tokens of the same sequence. To do this, we build a matrix that contains $-\infty$ strictly above the diagonal and $0$ everywhere else (called `autoregressive_attention_mask` in the code). It looks as follows:

        [[0, -inf, -inf, -inf],
         [0,    0, -inf, -inf],
         [0,    0,    0, -inf],
         [0,    0,    0,    0]]

   When we add it to our score matrix and pass the result through the softmax function, this leads to a probability of $0$ wherever $-\infty$ is.

4. **Note**: `F.softmax` normalizes its input such that the values along the given dimension add up to 1. The attention scores need to add up to 1 to represent how much of each value vector should be incorporated into the new contextualized vector representation.
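
Notes 3 and 4 can be checked by hand with a small plain-Python sketch. The raw scores are made up, and the softmax is written out manually instead of using `F.softmax`.

```python
import math

NEGATIVE_INFINITY = float("-inf")


def softmax(scores):
    highest = max(scores)
    exponentials = [math.exp(s - highest) for s in scores]  # exp(-inf) is 0.0
    total = sum(exponentials)
    return [e / total for e in exponentials]


sequence_length = 4
# Made-up raw scores (the scaled Q K^T result) for a sequence of 4 tokens.
scores = [[0.5, 1.0, -0.3, 2.0] for _ in range(sequence_length)]

# 0 on and below the diagonal, -inf above it (the future positions).
mask = [
    [0.0 if key_pos <= query_pos else NEGATIVE_INFINITY for key_pos in range(sequence_length)]
    for query_pos in range(sequence_length)
]

masked_scores = [
    [score + m for score, m in zip(score_row, mask_row)]
    for score_row, mask_row in zip(scores, mask)
]
attention_weights = [softmax(row) for row in masked_scores]

for row in attention_weights:
    print([round(weight, 3) for weight in row])
```

Each printed row still sums to 1, but all weights for future positions are exactly 0, so the first token can only attend to itself.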

In [None]:
class MultiHeadMaskedSelfAttention(nn.Module):
    """
    When we create a MultiHeadMaskedSelfAttention module,
    we need to pass the arguments in the `__init__`. These are:

    embedding_size: the vector dimensions
    number_of_heads: the number of attention heads, i.e., the "multi" number
    bias and dropout can be ignored for now
    """
    def __init__(self, embedding_size, number_of_heads, bias, dropout=0.1):
        super().__init__()
        self.embedding_size = embedding_size
        self.bias = bias
        self.number_of_heads = number_of_heads
        assert (
            embedding_size % number_of_heads == 0
        ), "Embedding dimension has to be divisible by the number of heads"

        # Linear transformations that project the input into three different spaces.
        # The `bias` argument decides whether the projections also learn an offset.
        self.queries_projection = nn.Linear(embedding_size, embedding_size, bias=bias)
        self.keys_projection = nn.Linear(embedding_size, embedding_size, bias=bias)
        self.values_projection = nn.Linear(embedding_size, embedding_size, bias=bias)

        self.dropout = nn.Dropout(dropout)
        self.output_projection = nn.Linear(embedding_size, embedding_size, bias=bias)

    def forward(self, x):
        # See note 1
        # Input (x) shape is (batch_size, sequence_length, embedding_size)
        batch_size, sequence_length, embedding_size = x.size()
        queries = self.queries_projection(x)
        keys = self.keys_projection(x)
        values = self.values_projection(x)

        # See note 2
        head_size = embedding_size // self.number_of_heads
        queries = queries.view(
            batch_size, sequence_length, self.number_of_heads, head_size
        ).transpose(1, 2)
        keys = keys.view(
            batch_size, sequence_length, self.number_of_heads, head_size
        ).transpose(1, 2)
        values = values.view(
            batch_size, sequence_length, self.number_of_heads, head_size
        ).transpose(1, 2)

        # See note 3
        mask = torch.triu(
            torch.ones((sequence_length, sequence_length), device=x.device),
            diagonal=1,
        )  # diagonal=1 keeps only the entries strictly above the main diagonal (the future tokens)
        autoregressive_attention_mask = mask.masked_fill(mask == 1, float("-inf"))

        attention_scores = torch.matmul(queries, keys.transpose(-1, -2)) * (
            1.0 / np.sqrt(keys.size(-1))
        )
        masked_attention_scores = attention_scores + autoregressive_attention_mask

        # Compute self-attention scores. For F.softmax see note 4
        masked_attention_scores = F.softmax(masked_attention_scores, dim=-1)
        masked_attention_scores = self.dropout(masked_attention_scores)

        # Shape: (batch_size, number_of_heads, sequence_length, head_size)
        attended_values = torch.matmul(masked_attention_scores, values)

        # Concatenate the heads
        # Shape: (batch_size, sequence_length, embedding_size)
        # .transpose(1, 2) changes the shape order to (batch_size, sequence_length, number_of_heads, head_size)
        attended_values = (
            attended_values.transpose(1, 2)
            .contiguous()
            .view(batch_size, sequence_length, embedding_size)
        )

        # Pass through a linear layer to get the final output
        output = self.output_projection(attended_values)

        return output


### 2.2.2 Multi-layer perceptron (MLP)

The second core module in the transformer decoder block is the MLP. It's made up of a learned linear projection, followed by a non-linearity and another learned linear projection. In deep-learning, non-linear activation functions are required such that neural networks can learn non-linear decision boundaries beyond what linear operations can. Without them, if one were to stack multiple linear projections, this would just end up in another linear projection and could be represented in one step. `gelu` stands for "Gaussian Error Linear Unit" which is just the non-linearity function (`activation function`) that was used in GPT2.
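
Both claims can be checked with one-dimensional "layers" in plain Python. The weights are made up, and `gelu` below is the exact formula based on the error function, written out by hand rather than taken from PyTorch.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal cumulative distribution at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    return weight * x + bias


# Two stacked 1-dimensional linear layers (made-up weights) ...
def two_linear_layers(x):
    return linear(linear(x, 2.0, 1.0), -3.0, 0.5)


# ... are exactly one linear layer: -3 * (2x + 1) + 0.5 = -6x - 2.5
def single_linear_layer(x):
    return linear(x, -6.0, -2.5)


# With GELU in between, the composition is no longer a straight line.
def tiny_mlp(x):
    return linear(gelu(linear(x, 2.0, 1.0)), -3.0, 0.5)


for x in [-2.0, 0.0, 2.0]:
    print(x, two_linear_layers(x), single_linear_layer(x), round(tiny_mlp(x), 4))
```

The first two columns of results always agree, while the third one bends away from the straight line.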

In [None]:
class MultiLayerPerceptron(nn.Module):
    """
    input_size: the dimensions of the input
    hidden_size: how wide the neural network layer should be
    output_size: the dimensions you want the output to have
    """
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.input_linear_layer = nn.Linear(input_size, hidden_size)
        self.output_linear_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.input_linear_layer(x)
        x = F.gelu(x)
        x = self.output_linear_layer(x)

        return x

Great, now we can almost write the code for the transformer decoder block!
We are missing the layer normalization, dropout, and residual connections.

### 2.2.3 Layer normalization, dropout, and residual connections

**Layer normalization** adjusts the data inside a neural network's layer by making sure all the inputs to the next layer have a similar scale, helping the network learn faster and more effectively. We use pytorch's provided `nn.LayerNorm` module.

**Dropout** is a regularization technique which is only applied during training. It randomly sets e.g. 20% of the values flowing through a layer to zero, which makes the whole network more robust and prevents overfitting (when the network perfectly remembers the training data, i.e., acts like a lookup table). Again, we can use pytorch's provided `nn.Dropout` module.

On the architecture picture, we also see arrows that go from in between the modules to a `+` sign. These are so-called **residual connections**: they simply take a module's input and add it back onto the module's output.
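
The three ideas can be sketched on a single vector in plain Python. This is a simplified illustration: the layer norm below leaves out the learnable scale and shift that `nn.LayerNorm` has, and the input numbers are made up.

```python
import math
import random


def layer_norm(vector, epsilon=1e-5):
    """Shift to mean 0 and scale to variance 1 (without learnable scale and shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]


def dropout(vector, drop_probability, rng):
    """Zero each entry with the given probability and rescale the survivors."""
    keep_probability = 1.0 - drop_probability
    return [v / keep_probability if rng.random() < keep_probability else 0.0 for v in vector]


rng = random.Random(0)  # fixed seed so the example is repeatable
x = [10.0, 20.0, 30.0, 40.0]

normalized = layer_norm(x)
dropped = dropout(normalized, drop_probability=0.2, rng=rng)
# Residual connection: add the original input back onto the processed version.
output = [original + processed for original, processed in zip(x, dropped)]

print(normalized)
print(dropped)
print(output)
```

The surviving values are divided by the keep probability so that the expected size of the signal stays the same during training.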

In [None]:
class TransformerDecoderBlock(nn.Module):
    """
    Take a look at the architecture image and you will notice that we now define
    all required parts for the transformer block in our `__init__` function and
    then use them in our `forward pass` on the input x.


    embedding_size: the vector dimensions per character
    n_attention_heads: the number of attention heads
    """
    def __init__(self, embedding_size, n_attention_heads, dropout=0.1):
        super().__init__()
        self.input_layer_normalization = nn.LayerNorm(embedding_size)

        # Here we use our own MultiHeadMaskedSelfAttention class
        self.masked_attention = MultiHeadMaskedSelfAttention(
            embedding_size,
            n_attention_heads,
            bias=True,
            dropout=dropout
        )
        self.post_attention_dropout = nn.Dropout(dropout)

        self.pre_mlp_layer_normalization = nn.LayerNorm(embedding_size)
        # Here we use our own MultiLayerPerceptron class
        self.multi_layer_perceptron = MultiLayerPerceptron(
            input_size=embedding_size,
            hidden_size=embedding_size * 4,
            output_size=embedding_size,
        )
        self.post_mlp_dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x shape is (batch_size, sequence_length, embedding_dimension)
        # First residual connection wrapper
        residual = x.clone()
        x = self.input_layer_normalization(x)
        x = self.masked_attention(x)
        x = self.post_attention_dropout(x)
        x = x + residual

        # Second residual connection wrapper
        residual = x.clone()
        x = self.pre_mlp_layer_normalization(x)
        x = self.multi_layer_perceptron(x)
        x = self.post_mlp_dropout(x)
        x = x + residual

        return x

Awesome, we have the decoder transformer block together! Now we only need to define the full neural network. I also added a `generate` function to use it when the model is trained to predict the next character from any given input!
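
The `generate` function has two settings, `temperature` and `top_k`, that change how the next character is picked. The plain-Python sketch below shows their effect on a made-up set of scores; `next_token_distribution` is a hypothetical helper for illustration, not part of the model code.

```python
import math


def softmax(logits):
    highest = max(logits)
    exponentials = [math.exp(value - highest) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Temperature-scale the logits, optionally keep only the top_k, then normalize."""
    scaled = [value / temperature for value in logits]
    if top_k is not None:
        # Everything below the k-th largest score is ruled out (ties are all kept).
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [value if value >= threshold else float("-inf") for value in scaled]
    return softmax(scaled)


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for 4 characters

baseline = next_token_distribution(logits)
sharper = next_token_distribution(logits, temperature=0.5)
flatter = next_token_distribution(logits, temperature=2.0)
top_two = next_token_distribution(logits, top_k=2)

for name, distribution in [("baseline", baseline), ("sharper", sharper),
                           ("flatter", flatter), ("top_two", top_two)]:
    print(name, [round(p, 3) for p in distribution])
```

A temperature below 1 makes the most likely character even more likely, a temperature above 1 makes the choice more random, and top-k removes unlikely characters from the draw entirely.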

In [None]:
class GenerativePretrainedTransformer(nn.Module):
    """
    embedding_dimension: the number of vector dimensions that represent a single character
    maximum_sequence_length: how many characters the model can process at most
    n_transformer_blocks: the number of transformer decoder blocks used
    n_attention_heads: the number of attention heads in each decoder block
    vocabulary_size: the number of unique characters in our dataset
    dropout: the share of values randomly set to zero during training (regularization)
    """
    def __init__(
        self,
        embedding_dimension,
        maximum_sequence_length,
        n_transformer_blocks,
        n_attention_heads,
        vocabulary_size,
        dropout=0.1,
    ):
        super().__init__()
        # A matrix mapping characters to vectors and the static positional encoding matrix
        self.character_to_embedding_map = nn.Embedding(
            vocabulary_size, embedding_dimension
        )
        self.positional_encoding = get_positional_encoding(
            maximum_sequence_length, embedding_dimension
        ).to(DEVICE)

        # Define the transformer decoder blocks used
        self.transformer_decoder_blocks = nn.ModuleList([
            TransformerDecoderBlock(
                embedding_dimension, n_attention_heads, dropout=dropout
            )
            for _ in range(n_transformer_blocks)
        ])

        # The normalization layer and the layer used for predicting which character comes next
        self.layer_normalization = nn.LayerNorm(embedding_dimension)
        self.character_prediction_layer = nn.Linear(
            embedding_dimension, vocabulary_size, bias=False
        )


    def forward(self, batch_of_character_sequences):
        # Input shape: (batch_size, sequence_length)
        x = self.character_to_embedding_map(batch_of_character_sequences)
        # Create the position indices on the same device as the input
        character_positions_in_sequence = torch.arange(x.size(1), device=x.device)

        x = x + self.positional_encoding[character_positions_in_sequence]

        # Pass through all Transformer Decoder Blocks
        # x shape is (batch_size, sequence_length, embedding_dimension)
        for decoder_block in self.transformer_decoder_blocks:
            x = decoder_block(x)

        x = self.layer_normalization(x)
        # These are called logits because you need to normalize them with F.softmax
        # when actually generating text (see the `generate` function for this)
        output_logits = self.character_prediction_layer(x)
        return output_logits

    @torch.no_grad()
    def generate(self, input_text_indices, max_length=300, temperature=1.0, top_k=None):
        """
        The generate function is used after training and we can use it to
        predict the next characters from any input sequence.
        """
        input_text_indices = input_text_indices.unsqueeze(0)
        full_sequence = input_text_indices.clone()
        current_sequence_length = input_text_indices.size(1)

        for _ in range(max_length):
            # Check that the context is not too long, otherwise cut it
            if current_sequence_length > MAXIMUM_SEQUENCE_LENGTH:
                input_text_indices = input_text_indices[:, -MAXIMUM_SEQUENCE_LENGTH:]

            # Pass through the model
            logits = self.forward(input_text_indices)
            # We only need the logits for the last character
            next_character_logits = logits[0, -1, :] / temperature
            # Apply top-k sampling if needed
            if top_k is not None:
                # torch.topk needs k; never ask for more entries than the vocabulary has
                k = min(top_k, next_character_logits.size(-1))
                top_k_logits, _ = torch.topk(next_character_logits, k)
                # Set all logits below the k-th largest one to -infinity,
                # effectively setting their probability to 0 after the softmax
                next_character_logits[next_character_logits < top_k_logits[-1]] = float("-inf")

            # Transform the model prediction into a probability distribution and sample from it
            next_character_probabilities = F.softmax(next_character_logits, dim=-1)
            next_character = torch.multinomial(next_character_probabilities, 1)

            # Append to the sequence
            input_text_indices = torch.cat([input_text_indices, next_character.unsqueeze(0)], dim=1)
            full_sequence = torch.cat([full_sequence, next_character.unsqueeze(0)], dim=1)
            current_sequence_length += 1

        return full_sequence


## 2.3&nbsp;Defining hyperparameters and creating the model

Now we simply need to define the hyperparameters of our model. These are things like:
- how long the sequences are that it should be able to process,
- what the vector dimensions for each character and the hidden representations are,
- how many transformer decoder blocks we want, etc.

and depend on various things like the amount of available computing power, data, etc.
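
To get a feeling for how these choices drive model size, the number of learnable parameters can be estimated by hand. The helper `count_parameters` below is a hypothetical sketch that follows the layers defined in this notebook (biases in all attention and MLP layers, no bias in the prediction layer, fixed positional encoding). The vocabulary size of 66 is an assumption; the real value is `n_unique_characters`.

```python
def count_parameters(embedding_size, n_blocks, vocabulary_size):
    """Count learnable parameters of the architecture described in this notebook."""
    d = embedding_size
    # Attention: query, key, value and output projections, each d x d weights + d biases
    attention = 4 * (d * d + d)
    # MLP: d -> 4d and 4d -> d linear layers, both with biases
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * d)
    per_block = attention + mlp + layer_norms

    embedding = vocabulary_size * d
    final_layer_norm = 2 * d
    prediction_layer = d * vocabulary_size  # no bias
    return embedding + n_blocks * per_block + final_layer_norm + prediction_layer


# vocabulary_size=66 is an assumption (65 characters plus the unknown token).
small = count_parameters(embedding_size=128, n_blocks=4, vocabulary_size=66)
wider = count_parameters(embedding_size=256, n_blocks=4, vocabulary_size=66)
print(small, wider)
```

Note that the number of attention heads and the maximum sequence length do not change the count here, while doubling the vector size roughly quadruples it.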

In [None]:
EMBEDDING_SIZE = 128                # the vector dimensions
N_TRANSFORMER_DECODER_BLOCKS = 4    # how many decoder blocks
N_ATTENTION_HEADS = 4               # how many attention heads per decoder block
MAXIMUM_SEQUENCE_LENGTH = 128       # how many characters the model should be able to handle
DROPOUT = 0.25                      # regularization parameter to prevent overfitting

BATCH_SIZE = 64                     # how many sequences we pass in at once for training
MAX_ITERATIONS = 2000               # how many gradient descent steps we want to do at most
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'     # Whether to use the GPU or CPU

compile = False


torch.device(DEVICE)

Let's create yourGPT!

In [None]:
our_own_gpt = GenerativePretrainedTransformer(
  embedding_dimension=EMBEDDING_SIZE,
  n_attention_heads=N_ATTENTION_HEADS,
  maximum_sequence_length=MAXIMUM_SEQUENCE_LENGTH,
  n_transformer_blocks=N_TRANSFORMER_DECODER_BLOCKS,
  vocabulary_size=n_unique_characters,
).to(DEVICE)

if compile:
    our_own_gpt = torch.compile(our_own_gpt)

## 2.4&nbsp;Defining the training loop

Finally, to train our model we need to define the training loop. Here I am making use of further utilities provided by PyTorch that make it easier to handle training data, etc. These are less important for understanding the mechanics of chatGPT.

By now, you should understand that we pass in multiple character sequences that are represented as numbers, and the model learns to predict the next character from all previous characters, for all positions in all sequences at once! This is why we can train transformers efficiently: we can make use of matrices and graphics processing units (GPUs) to process many training samples in parallel.

Missing pieces for training are:
1. A loss function or `criterion` that defines the difference between what our model predicted and what is actually true.

2. An optimizer that manipulates the `knobs` based on how much each knob influenced the prediction and thereby the loss.

3. A training dataset to train the model and a validation dataset to check whether the model is overfitting (meaning it's learning the training data by heart). Both are based on the TinyShakespeare dataset, which we encode as a number sequence.
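
To make the first point concrete, here is a minimal sketch of what a cross-entropy loss computes for a single prediction, written in plain Python. The vocabulary size of 65 and the example scores are assumptions for illustration; `nn.CrossEntropyLoss` does the same thing on tensors and, by default, averages over all predictions.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exponentials = [math.exp(logit - largest) for logit in logits]
    total = sum(exponentials)
    return [e / total for e in exponentials]


def cross_entropy(logits, target_index):
    """Negative log of the probability the model assigned to the correct token."""
    return -math.log(softmax(logits)[target_index])


vocabulary_size = 65  # hypothetical number of unique characters

# An untrained model that finds every character equally likely
untrained_loss = cross_entropy([0.0] * vocabulary_size, target_index=3)
# A model that is confident and right has a loss close to 0
confident_loss = cross_entropy([0.0, 0.0, 0.0, 10.0], target_index=3)
# A model that is confident and wrong has a large loss
wrong_loss = cross_entropy([10.0, 0.0, 0.0, 0.0], target_index=3)

print(f"{untrained_loss:.2f} {confident_loss:.4f} {wrong_loss:.2f}")
```

A useful sanity check follows from this: with uniform predictions the loss equals the natural logarithm of the vocabulary size, so a training loss well below that value means the model has learned something.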

In [None]:
# Define the optimizer and the loss function
optimizer = torch.optim.AdamW(our_own_gpt.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()


# First we need to encode the training and validation data
# Tensors generalize 1-dimensional vectors and 2-dimensional matrices to any number of dimensions
encoded_train_data = torch.tensor(tokenizer.encode(train_data), dtype=torch.long)
encoded_validation_data = torch.tensor(tokenizer.encode(validation_data), dtype=torch.long)


class TextDataset(torch.utils.data.Dataset):
    def __init__(self, data, sequence_length):
        self.data = data.to(DEVICE)
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.data) - self.sequence_length

    def __getitem__(self, idx):
        """
        This function defines what are training inputs (x) and the targets (y).
        For instance if our data is [0, 2, 30, 21], the input x is [0, 2, 30],
        and the target y is [2, 30, 21]
        """
        x = self.data[idx : idx + self.sequence_length]
        y = self.data[idx + 1 : idx + self.sequence_length + 1]
        return x, y

train_dataset = TextDataset(encoded_train_data, MAXIMUM_SEQUENCE_LENGTH)
validation_dataset = TextDataset(encoded_validation_data, MAXIMUM_SEQUENCE_LENGTH)


# We can then define a DataLoader. This is a utility class that groups multiple sequences
# into a batch and lets us choose whether to draw the sequences in random order or not
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
)
validation_loader = torch.utils.data.DataLoader(
    dataset=validation_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
)
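
The input/target shift used in `__getitem__` can be traced by hand with plain Python lists. This is a small sketch with a hypothetical helper `make_training_pair`; it uses the same slicing as the dataset class and then spells out the individual prediction tasks contained in one pair.

```python
def make_training_pair(data, start, sequence_length):
    """Return the input x and the target y, which is x shifted by one position."""
    x = data[start : start + sequence_length]
    y = data[start + 1 : start + sequence_length + 1]
    return x, y


data = [0, 2, 30, 21]
x, y = make_training_pair(data, start=0, sequence_length=3)

# Every position is its own small task: predict y[i] from x[0], ..., x[i]
tasks = [(x[: i + 1], y[i]) for i in range(len(x))]
for context, target in tasks:
    print(f"context {context} -> target {target}")
```

One pair of length 3 therefore yields three training examples, which the model processes in a single forward pass.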


Finally, we get to the training!

In [None]:
import time
start = time.time()
# Adding simple way to reload best model based on validation loss
best_validation_loss = float("inf")
# Adding a gradient scaler for mixed-precision training (it keeps small float16 gradients from underflowing to zero)
scaler = torch.cuda.amp.GradScaler(enabled=DEVICE=='cuda')
iteration = 0
validation_interval = 100
log_interval = 20


# Now we can start the training loop
while iteration < MAX_ITERATIONS:
    our_own_gpt.train()
    for x, y in train_loader:
        # stop if we hit the max iteration count
        if iteration >= MAX_ITERATIONS:
            break

        # Zero the gradients
        optimizer.zero_grad()

        # Speedup trick: run the forward pass in mixed precision (float16)
        with torch.cuda.amp.autocast(dtype=torch.float16):
            # Forward pass
            logits = our_own_gpt(x)

            # Compute the loss
            loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))


        # Backward pass to update gradients of all parameters based on the loss
        # signaling how they should be adjusted to minimize the loss
        scaler.scale(loss).backward()
        # Update the weights/parameters accordingly
        scaler.step(optimizer)
        scaler.update()


        if iteration % log_interval == 0 and not iteration % validation_interval == 0:
            print(f"Iteration: {iteration} - Training loss: {loss.item()}")

        if iteration % validation_interval == 0 or iteration == MAX_ITERATIONS - 1:

            # Every `validation_interval` iterations (and on the last iteration)
            # we check how our model performs on the validation set
            our_own_gpt.eval()
            with torch.no_grad():
                validation_loss = 0
                for x, y in validation_loader:
                    logits = our_own_gpt(x)
                    validation_loss += criterion(logits.view(-1, logits.size(-1)), y.view(-1)).item()
                validation_loss /= len(validation_loader)
            # Switch back to training mode so that dropout is active again
            our_own_gpt.train()

            # If the model performs better, we save the learned parameters/weights
            if validation_loss < best_validation_loss:
                best_validation_loss = validation_loss
                torch.save(our_own_gpt.state_dict(), 'best_model.pth')


            print(f"Iteration: {iteration} - Training loss: {loss.item()} - Validation loss: {validation_loss}")

        iteration += 1


now = time.time()
print(f"Total training time: {now - start:.1f} seconds")

# Reload best model weights/parameters
state_dict = torch.load("best_model.pth")
our_own_gpt.load_state_dict(state_dict)
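
Stripped of all utilities, the training loop repeats three steps: predict, measure the error, and nudge the knobs. Here is a minimal sketch with a single knob in plain Python. Note the simplifications: it uses a squared error instead of the cross-entropy loss and plain gradient descent instead of AdamW, and the function name and numbers are made up for illustration.

```python
def train_single_knob(x, y_true, learning_rate=0.01, steps=100):
    """Learn one knob so that knob * x gets close to y_true."""
    knob = 0.0  # a real model starts from random values
    for _ in range(steps):
        prediction = knob * x                      # forward pass
        loss = (prediction - y_true) ** 2          # how wrong are we?
        gradient = 2 * (prediction - y_true) * x   # how the loss changes with the knob
        knob -= learning_rate * gradient           # small step against the gradient
    return knob


# The ideal knob for x=3 and y_true=6 is 2
learned_knob = train_single_knob(x=3.0, y_true=6.0)
print(f"{learned_knob:.4f}")
```

In the real model, `loss.backward()` computes such a gradient for every parameter at once and the optimizer applies the update.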

How big is our model? Well, compared to chatGPT it's extremely small. The model from OpenAI reportedly has about 1.8 trillion parameters (an unconfirmed estimate, as OpenAI has not published the number), meaning 1,800 billion parameters, or 1,800,000 million parameters. Ours in comparison has:

In [None]:
total_params = sum(p.numel() for p in our_own_gpt.parameters())
print(f"This model has {total_params/1e3} thousand parameters. Thus, it is incredibly small compared to chatGPT.")
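
As a rough cross-check of the printed count, the number of parameters can be estimated from the hyperparameters alone. The sketch below assumes a standard GPT-style layout: four attention projection matrices per block, a feed-forward layer that expands by a factor of 4, learned position embeddings, and a separate output layer. It also assumes a vocabulary of 65 characters and ignores biases and layer normalization. The model in this notebook may differ in these details, so expect the numbers to be close rather than identical.

```python
def estimate_gpt_parameters(embedding_size, n_blocks, vocabulary_size, maximum_sequence_length):
    """Rule-of-thumb parameter count for a GPT-style decoder (assumed layout)."""
    attention = 4 * embedding_size ** 2       # query, key, value and output projections
    feed_forward = 8 * embedding_size ** 2    # d -> 4d -> d (assumed expansion factor 4)
    blocks = n_blocks * (attention + feed_forward)

    token_embeddings = vocabulary_size * embedding_size
    position_embeddings = maximum_sequence_length * embedding_size
    output_layer = embedding_size * vocabulary_size
    return blocks + token_embeddings + position_embeddings + output_layer


small_estimate = estimate_gpt_parameters(
    embedding_size=128, n_blocks=4, vocabulary_size=65, maximum_sequence_length=128
)
large_estimate = estimate_gpt_parameters(
    embedding_size=384, n_blocks=6, vocabulary_size=65, maximum_sequence_length=256
)
print(f"{small_estimate / 1e3:.0f} thousand, {large_estimate / 1e6:.2f} million")
```

The decoder blocks dominate the count, and they grow with the square of the embedding size. That is why widening the model makes it much bigger than adding a few more blocks.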

## 2.5&nbsp;Generating text with yourGPT
And now, let's see if our model is actually capable of generating some Shakespeare-ish looking text:

In [None]:
# Play around with the input string to see what you can generate!
input_string = "Oh there goeth my"

encoded_input = tokenizer.encode(input_string)
input_text_indices = torch.tensor(encoded_input, dtype=torch.long).to(DEVICE)

output_character_indices = our_own_gpt.generate(input_text_indices, max_length=100)

decoded_output = tokenizer.decode(output_character_indices.squeeze(0).cpu().tolist())
print(decoded_output)

# 3.&nbsp;Using a larger trained model

I also trained a larger model besides what we defined in this notebook. It is not much better as it's not perfectly tuned, though it's fun to see Shakespeare-looking gibberish :)

Let's load it up to play around. Can you get better text out of this one than the tiny one you trained?

In [None]:
# @title Loading the larger trained model from remote. Run this cell, ignore the code.
%%capture
!pip install wandb
import wandb
wandb.login(anonymous="must")

api = wandb.Api()
artifact = api.artifact('serced/myGPT/large_model:latest', type='model')
large_model_checkpoint_path = artifact.download()

# When I trained the larger model, I also used torch.compile, which prefixes
# every parameter name with '_orig_mod.'.
# This is a function to undo this on the saved model state
def remove_compile_prefix(state_dict):
    new_state_dict = {}
    for k, v in state_dict.items():
        if k.startswith('_orig_mod.'):
            new_key = k[len('_orig_mod.'):]
            new_state_dict[new_key] = v
        else:
            new_state_dict[k] = v
    return new_state_dict

Next, we need to recreate the larger model architecture as I trained it. For this, we need to use the same hyperparameters and create a new `our_own_gpt_large` object with them.

In [None]:
EMBEDDING_SIZE = 384                # the vector dimensions
N_TRANSFORMER_DECODER_BLOCKS = 6    # how many decoder blocks
N_ATTENTION_HEADS = 6               # how many attention heads per decoder block
MAXIMUM_SEQUENCE_LENGTH = 256       # how many characters the model should be able to handle

torch.device(DEVICE)


# Create a new larger model
our_own_gpt_large = GenerativePretrainedTransformer(
  embedding_dimension=EMBEDDING_SIZE,
  n_attention_heads=N_ATTENTION_HEADS,
  maximum_sequence_length=MAXIMUM_SEQUENCE_LENGTH,
  n_transformer_blocks=N_TRANSFORMER_DECODER_BLOCKS,
  vocabulary_size=n_unique_characters,
).to(DEVICE)

In [None]:
# Load the learned parameters from the fetched remote model
state_dict_larger_model = torch.load(large_model_checkpoint_path + '/best_model.pth', map_location=torch.device(DEVICE))
state_dict_larger_model = remove_compile_prefix(state_dict_larger_model)
our_own_gpt_large.load_state_dict(state_dict_larger_model)

Let's check how big this larger trained model is.

In [None]:
total_params_large = sum(p.numel() for p in our_own_gpt_large.parameters())
print(f"This model has {total_params_large/1e6:.2f} million parameters. It's already {total_params_large/total_params:.1f} times bigger than your model. \nHowever, it is still incredibly small compared to chatGPT.")

In [None]:
# Play around with the input string to see what you can generate!
input_string = "Oh there goeth my"

encoded_input = tokenizer.encode(input_string)
input_text_indices = torch.tensor(encoded_input).to(DEVICE)

output_character_indices = our_own_gpt_large.generate(input_text_indices, max_length=500, temperature=0.8)
decoded_output = tokenizer.decode(output_character_indices.squeeze(0).cpu().tolist())
print(decoded_output)

## Using GPT-2 from OpenAI

Let's check a larger trained model which we can easily access from the internet. We are using OpenAI's GPT-2. To do this, we simply need to get the tokenizer that maps text to numbers and back, as well as the model. It's very little code!

In [None]:
%%capture
# install the huggingface transformers library that gives access to millions of trained models
!pip install transformers

In [None]:
%%capture
# Import the tokenizer and language model scaffolds
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer that maps text to numbers and the inverse
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load the trained model
model = AutoModelForCausalLM.from_pretrained("gpt2")

Let's check this model's size.

In [None]:
print(f"""This model has {model.num_parameters()/1e6:.2f} million parameters.\nIt's already {model.num_parameters()/total_params:.1f} times bigger than your model. \nHowever, it is still incredibly small compared to chatGPT.\nRemember, chatGPT reportedly has about 1.8 trillion parameters (though, likely not all of them are used in one prediction step).""")

In [None]:
# Example input text
input_text = "Are you conscious?"

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text. The tokens are more than just single characters, hence, we need fewer
# of them for the same text length
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=40,
    temperature=1,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the generated tokens
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)


# Additional information

## How does chatGPT deal with image data and audio data?

These powerful models all rely on the transformer architecture, specifically, the decoder blocks which we implemented from scratch. That means we need a way to turn images into a series of tokens. Similarly, audio data can be interpreted as tokens as well.

For images, we cut the image into small squares (patches), flatten each square into a vector, and then apply a linear projection to get its vector representation. As an example, say we have an image of size $(8, 8)$. We can cut this up into 4 squares of size $(4, 4)$ and flatten each of these squares into a vector of size $4 * 4 = 16$. Starting at the top-left, we have thus transformed the image of size $(8, 8)$ into a series of 4 vectors of size $16$, where each vector is again a token, just like our character representation (the `nn.Embedding` of the character). This results in the image being represented as a sequence of four tokens: $(4, 16)$. If you want to understand this better, I recommend having a look at this post from [Dennis Turp called "A Visual Guide to Vision Transformers"](https://blog.mdturp.ch/posts/2024-04-05-visual_guide_to_vision_transformer.html).
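
The cutting and flattening step can be sketched in a few lines of plain Python. The example assumes a hypothetical image of 8 by 8 pixels stored as a list of rows and square patches of 4 by 4 pixels, and the function name `image_to_patch_tokens` is made up for illustration. The learned linear projection that a vision transformer applies to each flattened patch is left out.

```python
def image_to_patch_tokens(image, patch_size):
    """Cut a 2D image into square patches and flatten each patch into a vector."""
    height, width = len(image), len(image[0])
    tokens = []
    # Walk over the patches from the top-left, row by row
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][column]
                for row in range(top, top + patch_size)
                for column in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens


# A toy 8x8 "image" whose pixel values are simply 0, 1, ..., 63
image = [[row * 8 + column for column in range(8)] for row in range(8)]
tokens = image_to_patch_tokens(image, patch_size=4)
print(len(tokens), len(tokens[0]))
```

Each of the resulting vectors plays the same role as one character embedding in our text model.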

Audio data can also be represented as images (spectrograms), which you then process just like images.

Thus, in the end, you can tokenize text, images, and audio, and pass all of them through large transformer models.

## I would like to learn more

Check out these links which previously served as resources for me and are great to learn more about LLMs.

- [nanoGPT by Andrej Karpathy (ex-OpenAI)](https://github.com/karpathy/nanoGPT/). Awesome code repository with which you can train your own GPT models, though, for beginners there are already a lot of little speedup tricks, etc. which make it harder to understand. It has a lot of comments though, hence, it is still possible to follow it. He also created a great, well-structured [video](https://www.youtube.com/watch?v=kCc8FmEb1nY) available on YouTube.
- [The annotated transformer by Sasha Rush et al.](https://nlp.seas.harvard.edu/annotated-transformer/). This is an excellent blog post explaining the original transformer architecture when it was introduced.
- [The annotated GPT-2 from Aman Arora](https://amaarora.github.io/posts/2020-02-18-annotatedGPT2.html). Great blogpost with the same principle on GPT-2.
- [Prof. Ryan Cotterell's Natural Language Processing course slides](https://rycolab.io/classes/intro-nlp-f23/). Great slides to understand the history of NLP and the attention mechanism/transformer (see Lecture 11).
- [Vision Transformer from Scratch from Brian Pulfer](https://medium.com/@brianpulfer/vision-transformers-from-scratch-pytorch-a-step-by-step-guide-96c3313c2e0c). A great blogpost that implements a vision transformer from scratch. First, briefly scroll through the mentioned ["Visual Guide to Vision Transformers"](https://blog.mdturp.ch/posts/2024-04-05-visual_guide_to_vision_transformer.html) to get a rough idea of what the final architecture looks like and then read Brian's post.

If you would like to learn more or are generally interested, reach out on [LinkedIn](https://www.linkedin.com/in/severinhusmann/). I'd be happy to chat.