# Transformer Architecture Explained

## 1. Introduction
Transformer architecture initially proposed in the paper ["Attention is All you Need"](https://arxiv.org/pdf/1706.03762.pdf) consists of the following major components:
1. Tokenizer convert text to token and tokens to mapped to embeddings.
2. Positional encoding inject input word-position information
3. Self-attention layer contextually encodes the input sequence information. 
4. Feed Forward layer which operate bit like a static key-value memory. 
5. Cross-attention decodes output sequence of different inputs and modalities.
- Note the key components above is summarised by Vaclav Kosar in his [blog](https://vaclavkosar.com/ml/transformers-self-attention-mechanism-simplified)

While for many people who does not have any Computer Science or ML backgrounds the above terms may sound confusing and very abstract. Please do not worry about this, we will go through the technical details of Transformers and explain each terms one by one. Please Enjoy this notebook.

<p align="center">
<img src="images/fig1.jpg" alt="fishy" class="bg-primary" width="300px" />
</p>


In [1]:
# Packages
import torch

### 1.1. Background - How did get here?
The phrase "The Turing Test" is most properly used to refer to a proposal made by the father of Machine Intelligence Alan Turing in 1950. He proposed a certain kind of game which he explained to be "The Imitation Game", this test will test if the machine is capable of understand picture and sequence of words just like human can. The test proposes that:
1. A machine, a person and an interrogator is to sit in the same room. 
2. The Interrogator does not which of the other two is the Machine and which is Person.
3. The goal of the Machine is to trick the interoogator into thinking it is the Person.
- More detailed explanation about the Turing Test can be found [here](https://plato.stanford.edu/entries/turing-test/).

Hence to create a machine that is capable of bypassing "The Turing Test", researches begin to tackle the core problems of Natural Language Processing and design difference language models that is capable of understanding the symbols and sequence of "language".

### 1.2. N-gram Language Models
Predicting is difficult - especially about the future, as the old quip goes. But how about predicting something that seems much easier? For example predicting the next few words somone is going to say? For Example:

- "Good afternoon, have you had ...."
- It is possible that the speaker want to ask whether the other person had "lunch yet" or "a meeting with James" hence we can assign a probability distribution to each of the possible outcome.
- The ability to predict the next word would require the model to have some understanding of what the speaker/writer is trying to say. Hence allowing the n-gram model to be used for tasks such as Machine Translation and Grammar Correction.

$$
P(c_1, c_2 ... c_N) = P(c_1)p(c_2|c_1)p(c_3|c_1c_2)...P(c_N|c_1c_2...c_{N-1})
$$
$$
P(the|\texttt{its water is so transparent that}) = \frac{C(\texttt{its water is so transparent that the})}{C(\texttt{its water is so transparent that})}
$$

- With a large enough corpus (text corpus is a dataset of languge sources of many forms), the above probabilistic equation can be estimated. 
- More on [N-gram model](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

## 1.3. Sequence to Sequence Neutal Netorks.
Using the previous established language model theory. An architecture known as the sequence-to-sequence (s2s) neural netowks(NNs) is established. s2s NNs are used translate a sequence of symbols from one input sequence to an output sequence. The neural netowk usually used an encoder-decoder architrcture
<p align="center" style="background-color: white;">
<img src="images/fig2.webp" alt="fishy" class="bg-primary" width="300px" />
<figcaption align="center" >
    <a href="https://culurciello.medium.com/sequence-to-sequence-neural-networks-3d27e72290fe#:~:text=These%20neural%20network%20usually%20use,Yi%20which%20is%20auto%2Dregressive.">Encoder Decoder Architecture - Eugenio Culurciello</a>
  </figcaption>
</p>
The encoder takes an input sequence x_i and encode a long sequence of symbols into a sequence vector z_i. The decoder takes as input the sequence vector from the encoder and produces the output sequence y_i. The s2s decoder is generally auto-regressive (uses the previous output of the function to predict future outputs).

### 1.3.1 Recurrent Neural Networks.
<p align="center" style="background-color: white;">
<img src="images/fig3.webp" alt="fishy" class="bg-primary" width="300px" />
<figcaption align="center" >
    <a href="https://culurciello.medium.com/sequence-to-sequence-neural-networks-3d27e72290fe#:~:text=These%20neural%20network%20usually%20use,Yi%20which%20is%20auto%2Dregressive.">RNN Architecture - Eugenio Culurciello</a>
  </figcaption>
</p>
Previously (before GPT taking over) Recurrent Neural Networks (or RNN) is almost the industry standard for natural language processing tasks. RNN and more complex LSTM (which resolved some of the vanishing gradients problem of RNN) built the foundation for many of the old voice assistants such as Siri, Cortana and Alexa. However RNN is not resource friendly what so ever, with the lack of focus on the important part of the sentence (attention) it is difficult to train an RNN model that have the similar level of performance as Transformer model. Sadly, the notebook will not have time to go into RNN, so if you are interested in learning more please head to the following resources:

1. [Stanford RNN Cheatsheet](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)

2.  <a href="https://www.ibm.com/topics/recurrent-neural-networks#:~:text=A%20recurrent%20neural%20network%20(RNN,data%20or%20time%20series%20data.">RNN IBM</a>



## 2. Attention


### 2.1. Self - Attention
The concept of "attention" is originally designed to improve RNN (but in the end technically killed her) so that RNN will be less expensive in the handling of longer sequences of sentences. However in the famous paper ["Attention is All You Need"](https://arxiv.org/pdf/1706.03762.pdf), it is established that with the attention mechanism, a higher quality and more parrallelizable model known as transformers can be created. 
<p align="center" style="background-color: white;">
<img src="images/fig4.png" alt="fishy" class="bg-primary" width="300px" />
<figcaption align="center" >
    <a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">Attention Mechanism</a>
  </figcaption>
</p>

#### 2.1.1 Transformer Embeddings and Tokenization
<b> 1. Input is tokenized, the tokens are then embedded. </b>

<b> All the code below are taken from Prof Sebastian Raschka's [blog ](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html) </b>

In [2]:
# 1. Define a Sentence
sentence = 'Life is cookie, eat cookie first'

# 2. Iterate through the Sentence to create a dictionary
dc = {s:i for i, s in enumerate(sorted(sentence.replace(',', '').split()))}
# enumerate: iterate over a list, tuple or dictionary and return a tuple containing the index of each element. 
print(dc)

{'Life': 0, 'cookie': 2, 'eat': 3, 'first': 4, 'is': 5}


- Tokenization converts a text into a list of integers. 
- With the above code we can see that each of the unique word in the sentence is assigned to a unique index.

In [3]:
# 3. We can now conver the sentence from its string form into the tokenalized form. 
sentence_int = torch.tensor([dc[s] for s in sentence.replace(',', '').split()])
print(sentence_int)

tensor([0, 5, 2, 3, 2, 4])


<b> 2. With this list of tokenised string, we can embed the list of words into a matrix, each of the word can be represented using an n-dimensional vector.<b> 


In [4]:
torch.manual_seed(42)
embed = torch.nn.Embedding(6,16)
embedded_sentence = embed(sentence_int).detach()
# detach(): returns a new tensor, detached from the current graph. <- so the tensor would no longer require gradient descent.
# The difference between detach() and no_grad() is that when a tensor is set to detach() it will be detached from the computational graph forever and will not be able to be optimized even after requires_grad flag is set to true.
print(embedded_sentence.shape)
print(embedded_sentence)


torch.Size([6, 16])
tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
         -0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624],
        [ 0.0109, -0.3387, -1.3407, -0.5854,  0.5362,  0.5246,  1.1412,  0.0516,
          0.7440, -0.4816, -1.0495,  0.6039, -1.7223, -0.8278,  1.3347,  0.4835],
        [-1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3057, -0.7746,
         -1.5576,  0.9956, -0.8798, -0.6011, -1.2742,  2.1228, -1.2347, -0.4879],
        [-0.9138, -0.6581,  0.0780,  0.5258, -0.4880,  1.1914, -0.8140, -0.7360,
         -1.4032,  0.0360, -0.0635,  0.6756, -0.0978,  1.8446, -1.1845,  1.3835],
        [-1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3057, -0.7746,
         -1.5576,  0.9956, -0.8798, -0.6011, -1.2742,  2.1228, -1.2347, -0.4879],
        [ 1.4451,  0.8564,  2.2181,  0.5232,  0.3466, -0.1973, -1.0546,  1.2780,
         -0.1722,  0.5238,  0.0566,  0.4263,  0.5750, -0.6417, -2.2064, -0.7508]])


- <b>Token Embedding</b>: Map Tokens to their representations. There are many ways of embedding tokens (representing words into a vector form), the different ways we choose to embed our established tokens extracted from our corpus, will determine the meaning or value we attribute to each words. 
- One of the most common way to embed text is through the Word2vec method, Word2vec is a two-layer neural net that processes text by "vectorizing" words. 
- Word2vec has been used around since 2013, it has also been shown to be effective in creating recommendation engines and making sense of sequential data in commercial tasks. 
- While Word2vec is not a deep neural network, it turns text into a numerical form that deep NNs can understand. 
    - Word2vec group the vectors of similar words together into a vectorspace. That is, it preceives "similarities" mathematically, by creating multi-dimensional vectors representations of language features.
    - The computed vectors are shown to possess the components of that make up the meaning of the word, one very famous example is the:
        <p align="center" style="background-color: white;">
        <img src="images/fig5.png" alt="fishy" class="bg-primary" width="300px" />
        <figcaption align="center" >
            <a href="https://jalammar.github.io/illustrated-word2vec/">Word2Vec</a>
          </figcaption>
        </p>
    - To compute the closeness of words to each other for given features. Word2vec uses the cosine similarity (Think of a unit circle, highest cosine similarity occurs at 1, which is measured at 0 degrees) to associate words. 
    - Word2vec method can be used to train a language model in two ways:
        1. <b>CBOW:</b> use context to predict a target word. 
        2. <b>Skip-gram:</b> use the word to predict target context.
For More Details, check out this [blog](https://jalammar.github.io/illustrated-word2vec/).

    

<b> 3. Defining the Weight Matrices </b>

- self-attention mechanism is also known as the scaled dot-product attention. Which is integrated into the transformer architrcture. 
- self-attention mechanism utilizes three weight matrices, referred to as $W_q, W_k, W_v$, which are adjusted as model parameters during the training process. These three weight matrices operates similar to fc layers, say we have weight of $W$ and embedded inputs of $x$.
    $$
    q = W_q*x
    $$
    $$
    k = W_k*x
    $$
    $$
    v = W_v*x
    $$
<p align="center" style="background-color: white;">
        <img src="images/fig6.png" alt="fishy" class="bg-primary" width="300px" />
        <figcaption align="center" >
            <a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">query, key, value</a>
          </figcaption>
</p>


In [5]:
# Extract the embeded vector dimension from the embeded sentence matrix
d = embedded_sentence.shape[1]
# Define the Attention Weight Matrix size
d_q, d_k, d_v = 24, 24, 28
# Define the Weight Matrix
W_query = torch.nn.Parameter(torch.rand(d_q, d))
W_key = torch.nn.Parameter(torch.rand(d_k, d))
W_value = torch.nn.Parameter(torch.rand(d_v, d))
# Multiply the Weight Matrix with our input
query = W_query.matmul(embedded_sentence.T).T
keys = W_key.matmul(embedded_sentence.T).T
values = W_value.matmul(embedded_sentence.T).T

print("query.shape:", query.shape)
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)


query.shape: torch.Size([6, 24])
keys.shape: torch.Size([6, 24])
values.shape: torch.Size([6, 28])


- People who are familiar with data storage (sql), will definitely be familiar with the three terms "query", "key" and "value". Where Query is the word of interest we want to find in the database. When we multiply query and key together and pass them into the softmax function, we will find the key which best describe the query. If we are to then multiply the result described above with the value, it will correspondingly represent the value we wanted.

<b>3. Computing the Attention Scores</b>
$$
Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
$$

- The general process of attention mechanism has been described above, the idea of __self-attention__ is to find a layered attention for itself to implement more depth in the computational process. Additionally the extra term $\frac{1}{\sqrt{d_k}}$ is used to scale the unnormalized attention weights ($QK^T$). By scaling by $d_k$ ensures that the Euclidean length of the weight vector will be approximately in the same magnitude.

<p align="center" style="background-color: white;">
        <img src="images/fig7.png" alt="fishy" class="bg-primary" width="500px" />
        <figcaption align="center" >
            <a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">self attention pipeline</a>
          </figcaption>
</p>

In [6]:
import torch.nn.functional as F
# 1. Extract a single word from the embdeded sentence
x_2 = embedded_sentence[1]

# 2. Compute the query, key and value
query_2 = W_query.matmul(x_2)
key_2 = W_key.matmul(x_2)
value_2 = W_value.matmul(x_2)

# 3. Compute the normalized attention weights
omega_2 = query_2.matmul(keys.T)
attention_weights_2 = F.softmax(omega_2 / d_k**0.5, dim=0)

# 4. Compute the context vector (how each word relate to the other words)
context_vector_2 = attention_weights_2.matmul(values)

print(context_vector_2.shape)
print(context_vector_2)


torch.Size([28])
tensor([-2.5894, -1.3496, -1.4880, -2.4979, -0.6985, -2.7224, -3.4506, -1.1927,
        -1.8831, -1.4491, -3.3964, -3.1386, -2.3252, -1.2042,  0.2298, -0.4006,
        -3.1605, -1.5531, -1.3839, -1.5905, -2.8909, -0.7388, -4.8752, -2.2597,
        -1.0470, -1.8028, -3.6478, -0.9484], grad_fn=<SqueezeBackward4>)


### 2.2. Multi-Head Attention
<p align="center" style="background-color: white;">
        <img src="images/fig8.png" alt="fishy" class="bg-primary" width="500px" />
        <img src="images/fig9.png" alt="fishy" class="bg-primary" width="500px" />
        <figcaption align="centre" >
            <a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">single attention(left) vs multi-head attention(right)</a>
          </figcaption>
</p>

In [7]:
# Assume we have 3 attention heads
h = 3
multihead_W_query = torch.nn.Parameter(torch.rand(h, d_q, d))
multihead_W_key = torch.nn.Parameter(torch.rand(h, d_k, d))
multihead_W_value = torch.nn.Parameter(torch.rand(h, d_v, d))

# Do the matrix multiplication similar to single attention
multihead_query_2 = multihead_W_query.matmul(x_2)
multihead_key_2 = multihead_W_key.matmul(x_2)
multihead_value_2 = multihead_W_value.matmul(x_2)

# Repeat the weight matrix 
stacked_inputs = embedded_sentence.T.repeat(3, 1, 1)
multihead_keys = torch.bmm(multihead_W_key, stacked_inputs)
multihead_values = torch.bmm(multihead_W_value, stacked_inputs)
print("multihead_keys.shape:", multihead_keys.shape)
print("multihead_values.shape:", multihead_values.shape)

multihead_keys.shape: torch.Size([3, 24, 6])
multihead_values.shape: torch.Size([3, 28, 6])


### 2.3. Cross Attention

<p align="center" style="background-color: white;">
        <img src="images/fig10.png" alt="fishy" class="bg-primary" width="500px" />
        <figcaption align="centre" >
            <a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">cross-attention summary</a>
          </figcaption>
</p>

- The multi-head attention is somewhat un-interesting (it is very similar to the channel concept of CNN), what is truly interesting is the cross-attention. In cross-attention, we mix or combine two different input sequences. In the case of the original transformer architecture below, that's the sequence returned by the encoder module on the left and the input sequence being processed by the decorder part on the right. Cross-attention is useful when we go from an input sentence to an output sentence in the context of language translation. (stable diffusion also used corss attention extensively).

<p align="center" style="background-color: white;">
        <img src="images/fig11.png" alt="fishy" class="bg-primary" width="500px" />
        <figcaption align="centre" >
            <a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">Original Transformer</a>
          </figcaption>
</p>

### 2.4. The Mathematical Trick in self-attention
- Tokens should only be able to look back the token at the 6th location should only be able to talk to tokens say at 5th, 3rd location. They should not be able to see the tokens at the 8th or 9th location.

In [8]:
torch.manual_seed(1337)
B,T,C = 4, 8, 2
X = torch.randn(B,T,C)
print(X.shape) 

torch.Size([4, 8, 2])


In [9]:
xbow = torch.zeros((B,T,C))
# For each batch <- which will be independent from one another
for b in range(B):
    # We go through each batch and for each T
    for t in range(T):
        xprev = X[b,:t+1]
        # We compute the average
        xbow[b,t] = torch.mean(xprev, 0)

In [10]:
print("Original Embedding")
print(X[0])
print("New Embedding")
print(xbow[0])

Original Embedding
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
New Embedding
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


- Comparing the original embedding to the average embedding we have create. We can see that the two tensors both have the same first row and for each of the proceding elements as we pass through time, we can see that the future elements are the average of all the previous elements. Computing attention this way is very similar to a horrible RNN, a way to improve this is through the usage of a "mathematical trick".

In [11]:
torch.manual_seed(42)
a = torch.ones(3,3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print(c)

tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


<p align="center" style="background-color: white;">
        <img src="images/fig12.png" alt="fishy" class="bg-primary" width="500px" />
        <figcaption align="centre" >
            <a href="https://github.com/kenjihiranabe/The-Art-of-Linear-Algebra/blob/main/The-Art-of-Linear-Algebra.pdf">The Art of Linear Algebra</a>
          </figcaption>
</p>

- Using this trick we no longer need large amount of excessive for loops like what we have showed above, we just need matrix multiplication.

In [12]:
# Now using the lower triangle function in the torch and normalising the rows, we can:
a2 = torch.tril(torch.ones(3,3))/torch.sum(a, 1)
c2 = a2 @ b
print(c2)

tensor([[0.6667, 2.3333],
        [2.6667, 3.6667],
        [4.6667, 5.3333]])


## 3 Transformer

__Now__, we will follow the Youtube Tutorial Provided by Mr Andrej Karpathy and create/train a tiny GPT model from scretch, if you find my words and instructions are unclear (which is quite likly tbh), please visit the [original tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5869s).

### 3.1. Our Data
- We have extracted our data from the github repository [tiny shakespare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt).

In [13]:
with open('input.txt', 'r', encoding = 'utf-8') as f:
    text = f.read()

print("Length of the dataset in characters:", len(text))

Length of the dataset in characters: 1115393


In [14]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [15]:
# Extract all the unique characters from the data
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


- Unlike before, where we tokenized the text based on words, here we would like to tokenize based on characters (alphebet).
- In practice, you probably don't need to write the encoding function from scretch, you could just use packages such as GPT2's [tiktoken](https://github.com/openai/tiktoken). 
- __Aside:__ A lambda function is a small anonymous function, it can take any number of arguments, but can only have one expression.
    - x = lambda a:a+10 <- the left side of the ":" is the input and the right side is the output.
    - lambda s: [stoi[c] for c in s] <- this will take in s (which is a string) and iterate through each of the characters and convert them into encoded numbers.


In [16]:
# This component of code will create a mapping of characters to number in the form of a dictionary (just like what we did)
stoi = { ch:i for i,ch in enumerate(chars) } #character -> number
itos = { i:ch for i,ch in enumerate(chars) } #number -> character

# create an encoder function that encodes words based on characters. 
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [17]:
# Encode everything in the training dataset.
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115393]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [18]:
# Create a train val data split.
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

- As we have mentioned previously, the model is a sequential model, therefore it is essential for us to define the block size of the sequential data which we will be use to train the model.
- The model will predict the next words based on previously known information. e.g. predict 47 based on 18, predict 56 based on 47 and 18 etc.

In [19]:
block_size = 8
train_data[:block_size+1]

# Some code to illustrate the training process.
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


- Just like normal neural-network to imporve the processing capabilities (python can use multi-processing owo) and model accuracy, we will also be implementing mini batches. 

In [20]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
n_embed = 32


def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # Random offset into the training set.
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[53, 59,  6,  1, 58, 56, 47, 40],
        [49, 43, 43, 54,  1, 47, 58,  1],
        [13, 52, 45, 43, 50, 53,  8,  0],
        [ 1, 39,  1, 46, 53, 59, 57, 43]])
targets:
torch.Size([4, 8])
tensor([[59,  6,  1, 58, 56, 47, 40, 59],
        [43, 43, 54,  1, 47, 58,  1, 58],
        [52, 45, 43, 50, 53,  8,  0, 26],
        [39,  1, 46, 53, 59, 57, 43,  0]])
----
when input is [53] the target: 59
when input is [53, 59] the target: 6
when input is [53, 59, 6] the target: 1
when input is [53, 59, 6, 1] the target: 58
when input is [53, 59, 6, 1, 58] the target: 56
when input is [53, 59, 6, 1, 58, 56] the target: 47
when input is [53, 59, 6, 1, 58, 56, 47] the target: 40
when input is [53, 59, 6, 1, 58, 56, 47, 40] the target: 59
when input is [49] the target: 43
when input is [49, 43] the target: 43
when input is [49, 43, 43] the target: 54
when input is [49, 43, 43, 54] the target: 1
when input is [49, 43, 43, 54, 1] the target: 47
when input is [49, 43, 

- Now, it has been established that we have a batched dataset, we will begin feeding this batached dataset into the neural network.
- We will explain a bit more about the idea of word embedding here: 
    - nn.Embedding(vocab_size, embedding_dim) <- the vocab size represent the size of the vocabulary in this case how much unique characters are we going to capture in a model. 
        - if we have say nn.Embedding(6, 7) and a 24 words as input. Then the output matrix will be [24, 7] tensor, the model however will only be allow for 6 unique words. 
### 3.2. Bigram Model

In [21]:
import torch.nn as nn 
from torch.nn import functional as f 
torch.manual_seed(1337)

# This is the sample code for the cration of a bigram model
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embed):
        super().__init__()
        # We have demonstrated the usage of nn.Embedding in the previous sections. 
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        # We create a custom embedding layer
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
       
        
            # - we have previously defined vocab_size to be the number of unique characters from out input.txt.
            # - embedding showns how each character relate to one and another. 
    
    # The Forward pass function is the essential component of the neural network
    def forward(self, idx, targets = None):
        B, T = idx.shape
        pos_idx = []

        # The code provided does not written correctly, I made some modification here (it is very inefficient)
        for i in range(T):
            if (i<8):
                pos_idx.append(i)
            else:
                pos_idx.append(i%8)
        pos_idx=torch.tensor(pos_idx)
        # idx and targets are both (B,T) tensor of integer
        token_emb = self.token_embedding_table(idx)  # (B,T,C1)
        pos_emb = self.position_embedding_table(pos_idx) # We add some positional embedding. <- for bigram model the positional embedding is not particularly useful
        
        #print(T)
        #print(token_emb.shape)
        #print(pos_emb.shape)
        x = token_emb + pos_emb
        
        logits = self.lm_head(x)
         # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits = logits.view(B*T, C)
            targets = targets.view(B*T)

            # We calculate the cross entroppy loss 
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
    # Generate function can be used to validate our model
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
           
            logits = logits[:, -1, :]
            # apply softmax to get probabilities
            #print(logits.shape)
            
            probs = F.softmax(logits, dim = -1) # (B, C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            #print(idx.shape)
        return idx
    
# We have defined the vocab size in previous section
m = BigramLanguageModel(vocab_size=vocab_size, n_embed=n_embed)
# xb and yb as shown in the above code chunk is the input and the target.
logits, loss = m(xb, yb)
print(logits.shape)
print(logits)
print(loss)

torch.Size([32, 65])
tensor([[-0.0372,  0.5101, -0.3296,  ...,  0.4318, -0.6124,  0.2609],
        [ 0.5865, -0.1595, -0.0942,  ...,  0.3435,  1.2050, -0.8776],
        [-0.0270,  0.9009,  0.6174,  ...,  0.5172,  0.5898, -0.1269],
        ...,
        [ 0.3226, -0.4082, -0.2581,  ...,  0.1849,  1.4394,  0.7642],
        [ 0.4565, -0.8794,  0.3228,  ..., -0.2943, -0.0605,  1.5607],
        [-0.1998, -0.1509,  0.0761,  ..., -0.7685,  1.0741,  0.2044]],
       grad_fn=<ViewBackward0>)
tensor(4.3152, grad_fn=<NllLossBackward0>)


In [22]:
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(context, max_new_tokens=100)[0].tolist()))


&Ks.O&HmjJQOOSiekNGqWFMPy'haCchX.w.qBYMD
E.lID!Ddn
?$xC?x?wH;,xG&k&tWuyUMI.JhRsXUyg-NwSimQaltzavenDI


In [23]:
# Train the Bigram Model
optimizer = torch.optim.AdamW(m.parameters(), lr = 1e-3)
batch_size = 32
for steps in range(1000): # increase number of steps for good results... 
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.429715633392334
4.524538993835449
4.400267124176025
4.470452785491943
4.367921829223633
4.387684345245361
4.3218488693237305
4.357346534729004
4.3777337074279785
4.338712692260742
4.239202976226807
4.310429096221924
4.291584014892578
4.215610980987549
4.142261981964111
4.225683212280273
4.204963684082031
4.227346420288086
4.187575817108154
4.192535877227783
4.215929985046387
4.169702529907227
4.087116718292236
4.243812084197998
4.0292253494262695
4.091665744781494
4.085590362548828
4.082399845123291
4.0830817222595215
4.054449081420898
4.008861541748047
4.01439905166626
3.9793031215667725
3.9738619327545166
3.9716131687164307
3.8572936058044434
3.9882524013519287
3.962925910949707
3.9457595348358154
3.8942606449127197
3.978365182876587
3.910951852798462
4.031850814819336
3.8615918159484863
3.80857253074646
3.8087329864501953
3.867537260055542
3.827179193496704
3.8883941173553467
3.858880043029785
3.820722818374634
3.791415214538574
3.759516954421997
3.7909579277038574
3.6848134994506

2.829002618789673
2.83974552154541
2.8202767372131348
2.96455979347229
2.9795970916748047
2.815258026123047
2.7606472969055176
2.8205089569091797
3.0862538814544678
2.7197582721710205
2.9499659538269043
3.025893211364746
2.7619950771331787
2.814711332321167
2.979884147644043
2.9425296783447266
2.769573211669922
2.824856996536255
2.7914462089538574
3.0634653568267822
2.784821033477783
2.815312623977661
2.791013240814209
2.8447909355163574
2.9342293739318848
2.822354316711426
2.893488883972168
2.8470354080200195
2.82509708404541
2.8932230472564697
2.994302988052368
2.9407198429107666
2.9579432010650635
2.90751051902771
2.847487211227417
2.7600510120391846
2.7437832355499268
2.835254430770874
3.0383858680725098
2.708141803741455
2.792759895324707
2.894479751586914
3.1124768257141113
2.8530986309051514
2.813288927078247
2.9875264167785645
2.778353452682495
2.6375060081481934
2.8899102210998535
2.644167184829712
2.784364938735962
2.8192663192749023
2.8054146766662598
2.798884391784668
2.860

- Now, we can see that our generated output possess much more characteristics of a real text written by a human, however since the model is only a bigram model (predicting the next word based on the previous word), the model will not be as accurate as we would have hoped.

In [24]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


WAIkpre,
APoE't te meNGa?oYoung ENoy B.

meat ik, w wiyare, arfy IN.
USad taen? qouthor c
ORe I3:
Thir.
y
T s s s mare:
 p rAnll'veante's
Bwow ous secol be nd utek, a mdsengag owenshonn he mechinlns ayestawit d  b t s llllern orud am.
I I ongf t.
M whog adive tiuFphe f se t aend wurerord thhamtunof athie cor CENGon hicr herererlrhar'xhe
Th al yomay lseet yole
I'n pe t othithen bE wELEy a ke!ay

MIVe sth hatonon
Tond, g s thend, buneomyEQesoritoumsitDKEEuris antrese DENdivantewo;oeWe chu bhidowin


### 3.3. Transformer Model
- Using what we have learnt previously, we implement self-attention here.
- From the matrix below, we can see that the sum of each row will add up to 1.
- The softmax of negative infinity is equal to 0.
- Higher dot product will mean higher affinity -> neural network have higher attenuation.

In [25]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
# Transpose the last two dimensions. 
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # The mask feature will always be present in the decoder block to allow the words to talk to each other.
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x
print("The Dimension of the value tensor is: "+str(v.shape))
#print(v[0])
print("The Dimension of the output tensor is: "+str(out.shape))
#print(out[0])
print("The Dimension of the weight tensor is: "+str(wei.shape))
print(wei[0])



The Dimension of the value tensor is: torch.Size([4, 8, 16])
The Dimension of the output tensor is: torch.Size([4, 8, 16])
The Dimension of the weight tensor is: torch.Size([4, 8, 8])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)


- Attention is a communication mechnism. Can be seen as nodes in a directed graph relating to each other in terms of "closeness" and aggregating information with a weighted sum from all nodes that point to them. With these weights the attention mechnism is able to easily represent any form of directed graphs however are not capable of representing positional data unless otherwisely instructed.
- People also tried establishing links between [GNNs and Transformers](https://graphdeeplearning.github.io/post/transformers-are-gnns/) (However the Transformers use self-attention which is different to the standard-attention mechanism for GAT, however the two architecture are built on similar level of understanding).

In [26]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
# We can now scale the variance using the method proposed in the original Attention is all you need thesis to reforge the variance to approximately 1.
wei = q @ k.transpose(-2, -1) * head_size**-0.5
print(wei.var())

tensor(1.0918)


#### 3.3.1 Masked multi-head attention
Why do we need Masked multi-head attention for the decoders in our transformer framework?
- Because Transformer just like the RNN is a autoregressive function. This means when generating a given response, it will predicts the future word with respect to what it predicted previously, if we set the time dimension to 8. This means it should only be able to maintain an attention-span of 8 units and will not be able to see the future words.
- A better explanation from [stackexchange](https://stats.stackexchange.com/questions/634864/why-do-we-mask-input-tokens-for-the-decoder-in-a-transformer):
  - _The sequence we feed to the decoder always has to be the same length. So when we feed the decoder a sequence the sequence must be "full-length". In this kind of transformer the generation of tokens, happens by the decoder always predicting the next token, based on the already generated tokens. However, because we have to use as input a complete sequence and not only what has been generated so far, the Decoder could "look" at the words that are right of the already generated tokens. In order to avoid this everything right f the token that is to be predicted will be masked._

<p align="center" style="background-color: white;">
        <img src="images/fig13.gif" alt="fishy" class="bg-primary" width="500px" />
        <figcaption align="centre" >
            <a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a>
          </figcaption>
</p>

#### 3.3.2 Layer Normalization

- Layer normalization ensures all the activations in each layer to have a mean of zero and a unit (1) variance. It is proposed in the paper "[Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf)".

$$
h_i = \frac{g}{\sigma}(h_i - \mu)
$$

- g: the gain variable (can be set to 1).
- $\mu$: the mean
- $\sigma$: the standard deviation

- Now assume we have the following fc layer described by the equation:

$$
x' = f[{w_1}^{T}x+b_1]
$$

$$
h = \gamma_{1}[\frac{x'-\mu_{1}}{\sigma_{1}}]+\beta_{1}
$$

- where $\gamma_{1}$ and $\beta_{1}$ are trainable parameters, which can be trained and changed in order to optimize our loss function.
- If we are to get an output of 2 words with 3 dimensions, we are going to average the mean with respect to each words across the three dimensions to compute the mean.

In [27]:

class LayerNorm1d:
    def __init__(self, dim, eps=1e-5, momentum = 0.1):
        self.eps  = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)
        xvar = x.var(1, keepdim=True)
        xhat = (x - xmean)/torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        return self.out
    
    def parameters(self):
        return [self.gamma, self.beta]

_aside:_ When the Keep Dimension is equal to true. this will ensure the structure of the matrix is not destroyed.

In [28]:
a = torch.randn(4, 4)
amean_kt = a.mean(1, keepdim=True)
amean_kf = a.mean(1, keepdim=False)

print(amean_kt)
print(amean_kf)

tensor([[-0.6591],
        [-0.8219],
        [-0.1061],
        [-0.5706]])
tensor([-0.6591, -0.8219, -0.1061, -0.5706])


#### 3.3.3 Implementation

In [29]:
# Define Variables
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

In [30]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [31]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [32]:
# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)

In [33]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [34]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


step 0: train loss 4.2984, val loss 4.2951
step 100: train loss 2.6478, val loss 2.6505
step 200: train loss 2.5055, val loss 2.5142
step 300: train loss 2.4207, val loss 2.4302
step 400: train loss 2.3732, val loss 2.3725
step 500: train loss 2.3250, val loss 2.3322
step 600: train loss 2.2722, val loss 2.2747
step 700: train loss 2.2222, val loss 2.2441
step 800: train loss 2.1830, val loss 2.2153
step 900: train loss 2.1352, val loss 2.1736
step 1000: train loss 2.1051, val loss 2.1410
step 1100: train loss 2.0858, val loss 2.1184
step 1200: train loss 2.0413, val loss 2.0896
step 1300: train loss 2.0162, val loss 2.0792
step 1400: train loss 2.0007, val loss 2.0570
step 1500: train loss 1.9818, val loss 2.0421
step 1600: train loss 1.9540, val loss 2.0391
step 1700: train loss 1.9284, val loss 2.0243
step 1800: train loss 1.9173, val loss 2.0066
step 1900: train loss 1.8877, val loss 1.9950
step 2000: train loss 1.8812, val loss 1.9750
step 2100: train loss 1.8671, val loss 1.9681


In [35]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


Bise lanture to the birds to cheep
For legream'd you not me, so the dead must have
And the countle upon that pard I't have leck of I wencome amble
Bus regon her sidons it his duke!
Boting fastem ofther his could pleason of my quickness; my lord,
As hence uponce! and hone man unter some softed,
Hence doney meance a rong approus;
Sich.

DUKE OF OF York, knownst an my plent you do hellow
'Cour deap tour libuther'd, none deam longs and this wounderiting,
And roher denille formound most; quuke man of you love: let.
You hast founds to go me, this trumber
make and pless and exver of child.

Find:
I way for the my atcrad: for thy landry,
That what dewouch his shall not foreht dother were ingeen,
This shance set proud, it must, and you round,
And with arm fortump----it it should plrip out father waptantisfence
To cruny father relliked upon time win in Done
Weemp of bence you.- Souy conturn'd humbling, your hide
As hillough procling not of the were red;
Were to ther forth'd, thing wears
Which p