
# Making a Simple Bigram Model with the Text of Romeo and Juliet


In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
print(torch.__version__)
print('cuda' if torch.cuda.is_available() else 'cpu')
# check whether Apple's MPS backend is available, e.g. when running on a Mac
if torch.backends.mps.is_available():
    print('mps available')
    mps_device = torch.device("mps")
    print(mps_device)
    x = torch.ones(1, device=mps_device)
    print(x)
else:
    print("MPS device not found.")

2.4.0+cu124
cuda
MPS device not found.



## 1) read the text file with data and store it in a variable

The contents of the book are stored in the file `data.txt`; the whole file is read and copied to a variable named `text`.
The length and the first 300 characters are printed to check that the read worked before moving on to the second part.

In [2]:
with open('data.txt', 'r', encoding='utf-8') as file:
    text = file.read()  
print(len(text))
print(text[:300])

142466
﻿THE TRAGEDY OF ROMEO AND JULIET

by William Shakespeare




Contents

THE PROLOGUE.

ACT I
Scene I. A public place.
Scene II. A Street.
Scene III. Room in Capulet’s House.
Scene IV. A Street.
Scene V. A Hall in Capulet’s House.

ACT II
CHORUS.
Scene I. An open place adjoining Capulet’s Garden.
Scen



## 2) make a sorted set of all the characters in the dataset

The unique characters in the data are collected with `set()`, the result is passed to `sorted()`, and the sorted list is stored in the `chars` variable. The result is printed for testing and visualisation purposes. This gives us a vocabulary to work with.

In [3]:
chars = sorted(set(text))
print(chars)
print(len(chars))
vocab_size = len(chars)

['\n', ' ', '!', '&', ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'æ', '—', '‘', '’', '“', '”', '\ufeff']
71



### why do the above?

This gives us the vocabulary that a tokenizer works with. A tokenizer has an encoder and a decoder: the encoder converts each of the 71 characters to a number, and the decoder converts the numbers back to characters. We will be making a character-level tokenizer below.


## 3) make a basic character-level tokenizer


To make the character-level tokenizer, two dictionaries are built first. `string_to_integer = { ch:i for i,ch in enumerate(chars) }` enumerates the characters in the list and assigns each one a unique number, giving the first dictionary; the second dictionary, `integer_to_string`, is the reverse mapping from integer to character.

**e.g. :** if `chars = ['a', 'b', 'c']` then these will be the values of the two dictionaries:

- `string_to_integer = {'a': 0, 'b': 1, 'c': 2}`
- `integer_to_string = {0: 'a', 1: 'b', 2: 'c'}`

The encoder lambda function takes a string `s` and converts it to a list of integers. For each character `c` in `s`, it looks up the integer value in the `string_to_integer` dictionary. The result is a list of integers, one per character.

The decoder lambda function takes a list of integers `l` and converts it back to a string. For each integer `i` in `l`, it looks up the corresponding character in the `integer_to_string` dictionary, then joins these characters into a single string.

**another e.g. :** with the same three-character `chars`:

- `encoder('abc')` returns `[0, 1, 2]`
- `decoder([0, 1, 2])` returns `'abc'`

Combined, these two functions give us a basic character-level tokenizer.


In [4]:
string_to_integer = { ch:i for i,ch in enumerate(chars) }
integer_to_string = { i:ch for i,ch in enumerate(chars) }
encoder = lambda s: [string_to_integer[c] for c in s]
decoder = lambda l: ''.join([integer_to_string[i] for i in l])

encoded_yungting = encoder('yungting')
print(encoded_yungting)
decoded_yungting = decoder(encoded_yungting)
print(decoded_yungting)

[62, 58, 51, 44, 57, 46, 51, 44]
yungting
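
One property worth noting is that the encoder can only handle characters that are in the vocabulary. The vocabulary printed earlier has no `'X'` and no digits, for example, so a dictionary lookup for those characters would raise a `KeyError`. The sketch below uses a toy vocabulary rather than `data.txt`, and `can_encode` is a hypothetical helper that is not part of the notebook; it only illustrates the round trip and the vocabulary check.

```python
# Toy vocabulary standing in for `chars`; the notebook builds the real one from data.txt.
chars = sorted(set("hello world"))
string_to_integer = {ch: i for i, ch in enumerate(chars)}
integer_to_string = {i: ch for i, ch in enumerate(chars)}

def encoder(s):
    # same logic as the notebook's lambda, written as a function
    return [string_to_integer[c] for c in s]

def decoder(ids):
    return ''.join(integer_to_string[i] for i in ids)

def can_encode(s):
    """Hypothetical helper: True if every character of s is in the vocabulary."""
    return all(c in string_to_integer for c in s)

# decoding an encoded string gives the original string back
round_trip = decoder(encoder("hello"))
print(round_trip)

print(can_encode("low"))   # every character appears in "hello world"
print(can_encode("xyz"))   # none of these characters is in the toy vocabulary
```

Checking with something like `can_encode` before calling the encoder is one way to avoid a `KeyError` on prompts that contain unseen characters.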



### <ins>More info on tokenizers</ins>

### Types of tokenizers

#### 1) Character-level tokenizer

- Small vocabulary but a large number of tokens to work with. For example, the Romeo and Juliet data we are using has 142466 characters with a vocabulary of 71 unique characters, so there are many tokens to encode and decode.


#### 2) Word-level tokenizer

- Very large vocabulary but a smaller number of tokens to work with. The vocabulary can get very big, especially if there are multiple languages. There are fewer tokens to encode and decode, but the dictionary is very large.

#### 3) Subword tokenizer

- Somewhere in between a character-level tokenizer and a word-level tokenizer.
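
The trade-off between sequence length and vocabulary can be seen on a small scale with plain Python. This is only a sketch: the sample sentence is made up, and splitting on whitespace is a simplistic stand-in for a real word-level tokenizer. On a sample this short the vocabulary sizes are not representative (a character vocabulary stays small as the text grows, while a word vocabulary keeps growing); the point here is mainly the difference in sequence length.

```python
sample = "the cat sat on the mat and the cat ate"  # made-up sample text

# character-level: every character is a token
char_tokens = list(sample)
# word-level (simplified): every whitespace-separated word is a token
word_tokens = sample.split()

char_vocab = sorted(set(char_tokens))
word_vocab = sorted(set(word_tokens))

print("character-level:", len(char_tokens), "tokens,", len(char_vocab), "vocabulary entries")
print("word-level:     ", len(word_tokens), "tokens,", len(word_vocab), "vocabulary entries")
```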


##### it is very important to be efficient with data when working with language models
         

## 4) creating and working with tensors 

#### 1) What are tensors?
- Tensors are the main data structure we will be working with; PyTorch does most of the math on them for us.

In [5]:
x = torch.empty(3,4)
print(type(x))
print(x)

<class 'torch.Tensor'>
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])


- as seen above, x is of type Tensor and has 3 rows and 4 columns; `torch.empty` allocates the memory without initialising it
- a 1-dimensional tensor is usually called a vector and is referred to as such
- a 2-dimensional tensor is usually called a matrix and is referred to as such
- tensors with 3 or more dimensions are usually just referred to as tensors
- as per the definitions above we can refer to x as a matrix, as it is 2-dimensional
- x is populated with 32-bit floats by default; the zeros above are not guaranteed, and if you see random values they are just whatever was last in that memory
- we use tensors rather than plain arrays because PyTorch can work with them efficiently, especially when reshaping, changing dimensionality, multiplying, taking dot products, and so on

#### 2) Working with tensors and the encoder we made 


- we will use our encoder to encode the entire 'Romeo and Juliet' text into numbers, then use `torch.tensor` to make a tensor from the numbers the encoder returns; while making the tensor we set the data type explicitly so that it is sure to use `torch.long` (64-bit integers)
- the first 120 tokens will be printed out as a test

In [6]:
data = torch.tensor(encoder(text), dtype=torch.long)
print(data[:120])

tensor([70, 29, 17, 14,  1, 29, 27, 10, 16, 14, 13, 33,  1, 24, 15,  1, 27, 24,
        22, 14, 24,  1, 10, 23, 13,  1, 19, 30, 21, 18, 14, 29,  0,  0, 39, 62,
         1, 32, 46, 49, 49, 46, 38, 50,  1, 28, 45, 38, 48, 42, 56, 53, 42, 38,
        55, 42,  0,  0,  0,  0,  0, 12, 52, 51, 57, 42, 51, 57, 56,  0,  0, 29,
        17, 14,  1, 25, 27, 24, 21, 24, 16, 30, 14,  6,  0,  0, 10, 12, 29,  1,
        18,  0, 28, 40, 42, 51, 42,  1, 18,  6,  1, 10,  1, 53, 58, 39, 49, 46,
        40,  1, 53, 49, 38, 40, 42,  6,  0, 28, 40, 42])


## 5) Creating training and Validation splits

-  #### i) why do we need to split our data into training and validation splits ?
     - If we train the model on 100% of the text, after enough iterations it can simply memorise the text and reproduce an exact copy of the training text every time. That is not the purpose of training: we want a language model that completes the text we provide in a manner and style similar to the training data. Holding back part of the text lets us check how the model does on data it has not seen.
     - To achieve this we might split our text so that 80% is training data and the remaining 20% is validation data (the code below uses a 70/30 split).
     - e.g. : let's say we are training a model on data that goes something like this:
         - Yung Ting, after infiltrating a maximum security facility in Atlantis, has decided to make a copy of a top secret recipe that also contains the manufacturing process of the viral atlanade drink with the unique...Yung Ting has achieved her goal and Atlantis has been added to her ever-growing list of victims.
         - Let's say the above text is 100 lines with an equal number of tokens per line for the sake of the example. We would use the first 80 lines as training data and the last 20 lines as validation data, making a very simple training and validation split.
-  #### ii) how do bigrams come into play in this simple and basic model ?
    - It is called a bigram model because it predicts using bigrams only, i.e. pairs of adjacent tokens. For example, when the model predicts the word 'yungting' it goes like this (very simple character-level example):
        - start -> y
        - y -> u
        - u -> n
        - n -> g
        - g -> t
        - t -> i
        - i -> n
        - n -> g
    - The model works in bigrams and only takes the preceding character into account when predicting the next (for now).
-  #### iii) how do we train a bigram model to achieve the goals above, and how are we going to use an artificial neural network to achieve it ?
    - We use blocks of a fixed length, the block size, and make **predictions** and **targets** out of them.
    - e.g. block_size = 7
        - context . . . [ 19 , 6 , 25 , 4 , 59 , 20 , 18 ] 21 . . . (inputs for the predictions)
        - context . . . 19 [ 6 , 25 , 4 , 59 , 20 , 18 , 21 ] . . . (targets)
        - [] is a tensor in the lines above
        - in Python the first tensor can be `[:block_size]`, taking the first 7 tokens, and the target can be `[1:block_size + 1]`, an offset of 1
        - we measure how far the predictions are from the targets and optimise the model to reduce that distance
        - all right, now we can do what we just described in code below, starting with the training and validation splits 

In [7]:
# 70% of the tokens are used for training, the remaining 30% for validation
s = int(0.7*len(data))
training_data = data[:s]
validation_data = data[s:]

block_size = 7

# a holds the inputs, b holds the same tokens shifted by one (the targets)
a = training_data[:block_size]
b = training_data[1:block_size + 1]

for t in range(block_size):
    context = a[:t+1]
    target = b[t]
    print("(in characters) when context is", decoder(context.tolist()), "target value is : ",  decoder([target.item()]) )
    print("(In tensors) when context is", context, "target value is : ", target )

(in characters) when context is ﻿ target value is :  T
(In tensors) when context is tensor([70]) target value is :  tensor(29)
(in characters) when context is ﻿T target value is :  H
(In tensors) when context is tensor([70, 29]) target value is :  tensor(17)
(in characters) when context is ﻿TH target value is :  E
(In tensors) when context is tensor([70, 29, 17]) target value is :  tensor(14)
(in characters) when context is ﻿THE target value is :   
(In tensors) when context is tensor([70, 29, 17, 14]) target value is :  tensor(1)
(in characters) when context is ﻿THE  target value is :  T
(In tensors) when context is tensor([70, 29, 17, 14,  1]) target value is :  tensor(29)
(in characters) when context is ﻿THE T target value is :  R
(In tensors) when context is tensor([70, 29, 17, 14,  1, 29]) target value is :  tensor(27)
(in characters) when context is ﻿THE TR target value is :  A
(In tensors) when context is tensor([70, 29, 17, 14,  1, 29, 27]) target value is :  tensor(10)
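
The bigram idea from 5 ii) can also be illustrated without a neural network, by simply counting which character follows which. The sketch below is not what the notebook trains; it is a count-based stand-in on a made-up sample string, and `follow_counts` and `next_char_probs` are names chosen for illustration. It shows that a bigram model only ever needs the current character to produce a distribution over the next one.

```python
from collections import Counter, defaultdict

sample = "yungting yung"  # made-up sample built from the toy word used above

# For every character, count which characters follow it.
follow_counts = defaultdict(Counter)
for current_char, next_char in zip(sample, sample[1:]):
    follow_counts[current_char][next_char] += 1

def next_char_probs(ch):
    """Turn the follow-up counts for ch into probabilities."""
    counts = follow_counts[ch]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}

print(next_char_probs('n'))  # 'n' is always followed by 'g' in this sample
print(next_char_probs('g'))  # 'g' is followed by 't' once and by a space once
```

The neural network version built later learns a table with the same role, except that it stores trainable scores instead of raw counts.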


## 6) Defining the get_batch function
- #### i) what are batches ? 
    -  What we have done so far is sequential, which does not scale and will be slow. Instead we can use the GPU and keep many of its cores busy in parallel by using batches. Using our example above, we would take multiple blocks and push them to the GPU together to be processed in parallel.
    -  Without batches performance will be poor.
- #### ii) what are batch sizes ?   
    -  The batch size is how many blocks we process in parallel, and the block size is the length of each individual block.
- #### iii) how do we use batches?
    - Using batches is very simple with PyTorch and is shown below **(a GPU is used if available; otherwise the code falls back to MPS or the CPU)**

In [8]:
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
print(device)
batch_size = 6

def get_batch(split):
    data = training_data if split == 'train' else validation_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

x, y = get_batch('train')
print('inputs: ')
print(x)
print('targets: ')
print(y)

cuda
inputs: 
tensor([[50, 38, 51,  6,  0,  0, 29],
        [67, 56,  1, 49, 38, 41, 62],
        [ 1, 41, 58, 49, 49,  1, 42],
        [ 6,  0, 32, 45, 38, 57,  1],
        [46, 52, 58, 56,  5, 47, 58],
        [ 1, 57, 45, 42,  1, 44, 42]], device='cuda:0')
targets: 
tensor([[38, 51,  6,  0,  0, 29, 33],
        [56,  1, 49, 38, 41, 62,  1],
        [41, 58, 49, 49,  1, 42, 38],
        [ 0, 32, 45, 38, 57,  1, 41],
        [52, 58, 56,  5, 47, 58, 46],
        [57, 45, 42,  1, 44, 42, 51]], device='cuda:0')



- #### iv) more on the get_batch function defined above
  - the conditional expression decides whether to create a batch from the training data or the validation data
    - training data, as mentioned before, is the "known" content to train with, and validation data is "unknown" content we will try to predict
  - `ix` holds `batch_size` random integers between 0 and `len(data) - block_size` (exclusive); these are random start positions in the text
  - we use `torch.stack` to stack the blocks into the batches x and y; y is offset from x by 1
  - ```x,y = x.to(device), y.to(device)``` moves the tensors to the selected device, here the GPU (as shown by `device='cuda:0'` in the output)
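
The offset between inputs and targets is easier to see with plain lists than with random token ids. The following is a list-based sketch of the same sampling logic, not the notebook's function: `get_batch_lists` is a hypothetical name, the data is just a range of integers, and a seeded `random.Random` stands in for `torch.randint`.

```python
import random

def get_batch_lists(data, block_size, batch_size, rng):
    """List-based stand-in for get_batch: returns (inputs, targets)."""
    # random start positions, leaving room for block_size + 1 tokens
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

data = list(range(100, 130))  # made-up "tokens" whose order is easy to read
x, y = get_batch_lists(data, block_size=7, batch_size=3, rng=random.Random(0))

for inputs, targets in zip(x, y):
    # each target row is the input row shifted one position to the right
    print(inputs, "->", targets)
```

Because the data is consecutive integers, every target row visibly starts one element later than its input row.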

## 7) initializing a neural network

- #### i) what is gradient descent ?

    - Gradient descent minimises the loss by adjusting the model parameters iteratively.
    - How do we know which way reduces the loss? We take the derivative (gradient) of the loss at the current point; it points in the direction in which the loss increases fastest, so we move the parameters a small step in the opposite direction and repeat, descending along the gradient.
    - Gradient descent is an optimizer.
      
- #### ii) how is loss calculated ?

    - The loss is the negative log of the probability the model assigns to the correct next character. For example, with 80 possible characters and a uniform guess of 1/80 each (less than a 2% chance), the loss is -ln(1/80) ≈ 4.382. With our 71-character vocabulary the same uniform guess gives -ln(1/71) ≈ 4.263.

- #### iii) what are optimizers ?

    - There are many optimizers; we will use AdamW. Adam combines the idea of momentum with a moving average of both the gradient and its square to adapt the learning rate of each parameter. AdamW adds weight decay to the Adam algorithm, which helps the model generalise.
    - Weight decay shrinks every weight slightly towards zero at each update, so weights only stay large if the gradients keep pushing them there.
    - If some weights grow very large in magnitude, weight decay pulls them back, which acts as regularisation and helps prevent overfitting.

- #### iv) what is Learning Rate ?

    -  A hyperparameter that determines the size of the steps taken during optimization.
    -  Too high and it can overshoot minima.
    -  Too low and convergence is slow.
    -  We want a small learning rate, but not too small.

- #### v) what is an Embedding Table ?

    - A lookup table with one row per token (character). In this bigram model, the row for a token holds the scores (logits) for every token that could come next.
    - In general an embedding table has size `vocab_size x embedding_dim`; here it is a matrix of size ```vocab_size x vocab_size```.
    - After a softmax, each row gives a probability distribution over which character comes next given the current character.
    - visualisation :
    <div>
    <img src="https://miro.medium.com/v2/resize:fit:1400/1*RUaIk8LE1vty-J0blNE_eA.png" width="400"/>
    </div>

      
- #### vi) why a custom forward pass ?

    - Writing the forward pass ourselves is good practice for specific use cases: we know exactly what is going on behind the scenes in the model, which transformations are applied and how the results are stored, all of which helps when debugging.
    - We can see how the input is transformed step by step, layer by layer, into the output.
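
The effect of the learning rate described in i) and iv) can be tried on a one-parameter problem. This is a minimal sketch under simple assumptions: the function f(w) = (w - 3)^2 and the three learning rates are chosen only for illustration, and plain gradient descent is used rather than AdamW.

```python
def gradient_descent(learning_rate, steps=25, start=0.0):
    """Minimise f(w) = (w - 3)^2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)             # derivative of f at the current w
        w = w - learning_rate * grad   # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"lr={lr}: w after 25 steps = {w:.4f}, distance from minimum = {abs(w - 3):.4f}")
```

With the smallest rate the parameter barely moves in 25 steps, with the middle rate it ends up close to 3, and with the largest rate each step overshoots by more than the previous one, so the distance from the minimum grows.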

In [9]:
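class BigramModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # row i of the table holds the next-token logits for token i
        self.token_embedding_table = nn.Embedding(vocab_size , vocab_size)
        print(self.token_embedding_table)

    def forward_pass(self, index, targets=None):
        # index: (B, T) token ids -> logits: (B, T, C) with C = vocab_size
        logits = self.token_embedding_table(index)
        if targets is None:
            # inference: nothing to compare against, so no loss
            loss = None
        else:
            # F.cross_entropy expects (N, C) logits and (N) targets
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
        
    def generate_tokens(self, index, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self.forward_pass(index, targets=None)
            # keep only the logits of the last token in each sequence: (B, C)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            # sample one next token per sequence: (B, 1)
            index_next = torch.multinomial(probs, num_samples=1)
            # append it to the running sequence: (B, T + 1)
            index = torch.cat((index, index_next), dim=1)
        return index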

## more on initialising our neural network
- #### what are logits ?
    - Logits are the raw, unnormalised scores the model outputs for each possible next token; they become probabilities once they are normalised.
    - As a simplified illustration of normalising, take the scores [2,4,6]:
        - total = 2 + 4 + 6 = 12
        -  2 / 12 = 0.167 (16.7% chance)
        -  4 / 12 = 0.333 (33.3% chance) 
        -  6 / 12 = 0.5 (50% chance)
        - so the normalised scores are [0.167, 0.333, 0.5]
        - let's say the first represents ab, the second ac and the third ad; then ad is the most likely to come next, with a 50% chance
        - the result is a probability distribution
    - In the model the normalisation is done by softmax, which exponentiates each logit before dividing by the total, so [2,4,6] becomes roughly [0.016, 0.117, 0.867].
- #### what does self.token_embedding_table(index) do?
    - For each token in `index`, it retrieves that token's row of the embedding table (the logits for the next token).
- #### what is the purpose of checking targets ?
    - We check whether `targets` is None because the loss can only be calculated if targets are given. If `targets` is None the model is being used for inference (generating text) and no loss is calculated.
- #### why do we reshape our tensors ?
    - We unpack and reshape our tensors because of the input shapes expected by the cross entropy function.
    - What the cross entropy function expects for its input:
        - one dimension (C), the number of classes only, or
        - two dimensions (N, C), batch size and number of classes, or
        - (N, C, d1, d2, .... dk) with K >= 1 in the case of K-dimensional loss
    - We multiply B*T (T is the sequence length, i.e. the number of tokens in the sequence) to get N, and together with C this gives the two-dimensional input (N, C).
    - The cross entropy function expects (N) for the target (or up to dk in the case of K-dimensional loss), so we again use B*T to get N.
- #### how does generate_tokens work ?
    - The parameter `index` is the starting context, a (B, T) tensor of token indices; `max_new_tokens` is the number of new tokens to generate.
    - The new tokens are generated sequentially, one at a time, until `max_new_tokens` is reached.
    - In each iteration:
        - ```logits, loss = self.forward_pass(index, targets=None)``` Computes the logits for the current sequence (targets = None as we are generating tokens, not training the model)
        - ```logits = logits[:, -1, :]``` Selects the logits corresponding to the last token in each sequence in the batch, since in a bigram model the next token prediction depends only on the current (last) token, as illustrated before (shape becomes (B, C))
        - ```probs = F.softmax(logits, dim=-1)``` Applies the softmax function to convert logits into probabilities (shape is still (B, C))
        - ```index_next = torch.multinomial(probs, num_samples=1)``` Samples the next token index from the probability distribution
            - Randomly selects indices according to the provided probabilities. This stochastic sampling introduces variability in the generated sequences (shape becomes (B, 1), as it is just the next index, so the sequence length is 1)
        - ```index = torch.cat((index, index_next), dim=1)``` Concatenates the newly sampled token to the existing sequence along the sequence length dimension (shape becomes (B, T + 1): we add the next index, so the sequence length increases by 1)
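
The softmax and sampling steps can be reproduced in plain Python to see what they do to a small set of scores. This is a sketch, not PyTorch's implementation: the `softmax` function is written by hand, `random.Random(0).choices` stands in for `torch.multinomial`, and the scores [2, 4, 6] are the same made-up values used in the normalisation example above.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # subtracting the largest score keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 4.0, 6.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.016, 0.117, 0.867]

# Sample token indices according to the probabilities (seeded for repeatability).
rng = random.Random(0)
samples = rng.choices(range(len(probs)), weights=probs, k=1000)
print("share of samples that picked index 2:", samples.count(2) / 1000)
```

Sampling rather than always taking the highest-probability index is what lets the model produce different text on each run.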

## 7) testing the generate function
- We test the generate function by generating characters from the untrained model, starting from a tensor containing the single token 0 (the newline character). Generating 500 new tokens gives 501 tokens in total.
- We use our decoder to change the tokens back to characters for visualisation.

In [10]:
model = BigramModel(vocab_size)
m = model.to(device)
print(model)
print(device)
context = torch.zeros((1,1), dtype=torch.long, device=device)
print(context)
print(context.shape)
print("Context device:", context.device)
print("Model device:", next(model.parameters()).device)
generated_tokens = m.generate_tokens(context, max_new_tokens=500)
print(generated_tokens)
print(generated_tokens.shape)
generated_chars = decoder(generated_tokens[0].tolist())
print(generated_chars)

Embedding(71, 71)
BigramModel(
  (token_embedding_table): Embedding(71, 71)
)
cuda
tensor([[0]], device='cuda:0')
torch.Size([1, 1])
Context device: cuda:0
Model device: cuda:0
tensor([[ 0, 22, 14, 64, 17, 67, 56, 16, 60, 18, 43, 17,  5, 59, 69, 65, 62, 41,
         46, 48, 33,  1, 41, 12, 24, 43, 10, 25, 28, 67, 33, 44, 35,  5, 14,  8,
         48, 41, 57, 54, 65, 31, 27, 64, 33, 54, 38, 59, 60, 70, 55,  7, 61, 23,
         11, 28, 34, 15, 52, 45, 11, 10, 12, 47, 42, 67,  9, 17, 69, 59,  0, 63,
         15, 61, 59,  6, 64, 36, 68,  3, 68, 14, 26, 66, 22, 41, 64,  0, 24, 43,
         44, 10, 37, 36, 55, 19, 22, 70, 48, 59,  0, 58,  8, 16, 18, 27, 44, 62,
         54, 32, 11, 23, 17, 69, 43, 24,  5, 16,  6, 58, 62,  8, 39, 41, 40,  3,
         16, 18, 51, 13, 27, 56,  0,  2, 64, 47, 48, 59,  9, 51, 30,  1,  3, 43,
         61,  1, 50, 62, 28, 29, 11, 29, 27, 55,  6,  3, 42, 42, 16, 49, 53, 58,
         44, 40, 46, 23,  4,  9, 21, 26, 46, 61, 45, 18, 35, 63, 66, 10, 57, 31,
         48, 

## 8) create a training loop
- #### i) create the adamW optimizer using pytorch
    - We create an AdamW optimizer from our model parameters and the learning rate we defined (the learning rate determines the size of the steps taken during optimization).
        - ##### what are the main kinds of optimisers?
            - Mean Squared Error (MSE): strictly a loss function rather than an optimizer; it is listed here because it is a common quantity for an optimizer to minimise. MSE is used in regression problems, where the goal is to predict a continuous output. It measures the average squared difference between the predicted and actual values.
            - Gradient Descent (GD): an optimization algorithm used to minimize the loss function of a machine learning model. The loss function measures how well the model predicts the target variable from the input features. The idea of GD is to iteratively adjust the model parameters in the direction of steepest descent of the loss function.
            - Momentum: an extension of stochastic gradient descent (SGD) that adds a "momentum" term to the parameter updates. This term smooths out the updates and lets the optimizer keep moving in the right direction even if the gradient changes direction or varies in magnitude. Momentum is particularly useful for training deep neural networks.
            - RMSprop: an optimization algorithm that uses a moving average of the squared gradient to adapt the learning rate of each parameter. This helps to avoid oscillations in the parameter updates and can improve convergence in some cases.
            - Adam: a popular optimization algorithm that combines the ideas of momentum and RMSprop. It uses a moving average of both the gradient and its squared value to adapt the learning rate of each parameter. Adam is often used as a default optimizer for deep learning models.
            - AdamW: a modification of the Adam optimizer that adds weight decay to the parameter updates. This helps to regularize the model and can improve generalization performance.

    - The learning rate is something we need to experiment with to find a good value. If it is too small, training is slow, and we would not want to wait months for a simple bigram model; if it is too large, the updates can overshoot.
    - The optimizer updates the model's parameters based on the computed gradients to minimize the loss function.
- #### ii) Implement the Training Loop
    - Iteratively trains the model by processing batches of data, computing the loss, performing backpropagation, and updating the model parameters
    - ```for iter in range(max_iterations):``` loop runs from 0 to max_iterations - 1
    - In each iteration:
        - ```if iter % 100 == 0:``` every 100 iterations, `estimate_loss()` is called and the averaged training and validation losses are printed
        - ```inputs, targets = get_batch('train')``` Retrieves a batch of input data (inputs) and corresponding target data (targets) from the training set
        - ```logits, loss = model.forward_pass(inputs, targets)``` Performs a forward pass through the model, computing the logits (raw predictions) and the loss
        - ```optimizer.zero_grad(set_to_none=True)``` Resets the gradients of all optimized parameters to None (instead of 0, due to set_to_none=True)
        - ```loss.backward()``` Performs backpropagation, calculating the gradient of the loss with respect to each model parameter (each parameter's .grad attribute is populated with that gradient)
        - ```optimizer.step()``` Updates the model's parameters using the gradients stored in the .grad attributes so as to reduce the loss. Because we are using AdamW, the step also applies weight decay, a form of regularization that penalizes large weights to help prevent overfitting.
    - ```print(loss.item())``` prints the loss of the final batch
- #### iii) define the estimate loss function
    - We put the model in evaluation mode with `model.eval()`, average the loss over 100 random batches for each of the training and validation splits, and then switch back to training mode with `model.train()`. The `@torch.no_grad()` decorator turns off gradient tracking, since no backpropagation is needed while estimating.

In [11]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(100)
        for k in range(100):
            inputs, targets = get_batch(split)
            logits, loss = model.forward_pass(inputs, targets)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [21]:
max_iterations = 100000
learning_rate = 3e-3
# 3e-3 is 0.003

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iterations):
    if iter % 100 == 0:
        losses = estimate_loss()
        print(f"Iteration {iter}, training loss {losses['train']:.2f}, validation loss {losses['val']:.2f}")
    inputs, targets = get_batch('train')
    logits, loss = model.forward_pass(inputs, targets)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


print(loss.item())

Iteration 0, training loss 2.40, validation loss 2.46
Iteration 100, training loss 2.44, validation loss 2.46
Iteration 200, training loss 2.44, validation loss 2.50
Iteration 300, training loss 2.43, validation loss 2.45
Iteration 400, training loss 2.40, validation loss 2.44
Iteration 500, training loss 2.40, validation loss 2.47
Iteration 600, training loss 2.40, validation loss 2.48
Iteration 700, training loss 2.39, validation loss 2.45
Iteration 800, training loss 2.40, validation loss 2.43
Iteration 900, training loss 2.40, validation loss 2.45
Iteration 1000, training loss 2.42, validation loss 2.49
Iteration 1100, training loss 2.42, validation loss 2.45
Iteration 1200, training loss 2.40, validation loss 2.43
Iteration 1300, training loss 2.44, validation loss 2.50
Iteration 1400, training loss 2.45, validation loss 2.50
Iteration 1500, training loss 2.47, validation loss 2.49
Iteration 1600, training loss 2.44, validation loss 2.49
Iteration 1700, training loss 2.41, validat
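
To put the logged numbers in context, they can be compared with the loss of a uniform guess over the vocabulary, using the -ln(p) formula from section 7. The sketch below only does that arithmetic; the vocabulary size of 71 and the loss of 2.40 are taken from the outputs above, and perplexity (the exponential of the loss) is an extra measure that the notebook itself does not compute.

```python
import math

vocab_size = 71  # vocabulary size printed earlier in the notebook

# loss of a model that assigns every character the same probability
uniform_loss = -math.log(1 / vocab_size)
print(f"uniform-guess loss: {uniform_loss:.3f}")

reported_loss = 2.40  # training loss from the log above
# perplexity: roughly how many equally likely characters the model is choosing between
perplexity = math.exp(reported_loss)
print(f"perplexity at loss {reported_loss}: {perplexity:.2f}")
```

A loss of about 2.40 is well below the uniform-guess value, which means the model has learned which characters tend to follow which. The log already starts near 2.40 at iteration 0, which an untrained model would not be expected to show; one possible explanation, given the `In [21]` execution count, is that the cell was re-run on a model that had already been trained.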

## 9) test generation after the very basic training loop

In [30]:
test_input = encoder('Yung Ting')
print(test_input)
test_i_tensor = torch.tensor([test_input],dtype=torch.long, device=device)
print(test_i_tensor)
print(test_i_tensor.shape)
context = torch.zeros((1,1), dtype=torch.long, device=device)
print(context.shape)
generated_tokens = m.generate_tokens(test_i_tensor, max_new_tokens=1000)
generated_chars = decoder(generated_tokens[0].tolist())
print(generated_chars)

[33, 58, 51, 44, 1, 29, 46, 51, 44]
tensor([[33, 58, 51, 44,  1, 29, 46, 51, 44]], device='cuda:0')
torch.Size([1, 9])
torch.Size([1, 1])
Yung Ting on.
Se?— joos, arthevile.
whe owod y h crepeas, thte, dese knees’sank, fumy f
That drye tof ard theybe eayshind me at ilitr asu lat ck.
FRAnemycke wie, ore;


JULE. w hre ons by, hy wonny dlofuspe, h p cou VO.
RENVer d TApereeragevel usorark my I jun I ’ the theloce burcknol st ayois y soo [_ENCULare enofuntu hie ind nten h?

Buns y aik.
Jurer PUTh? sond hy me; ho w wasgre penouthins the swhinm I’ balouiel, thicont’s mp, histhodin! tthe soustshadourer sore sishethemy Mashovome’sceow mothorernghoqu t ase._E.
[_]
Sino h t thetha s satetun f aglinircond’ll’d nde, met II beaiomenare, tht!
Sho he iestoo cos, shago werunge, S.
RENULEORO. forsind?
TI’e.
LTIEngrd tondanten.

He hane neapll?
Ju’sa or LI gerstem avey,
Hoelat ist I by kero s
ANCatheron?’dily s bl

Norealeghe wh hurind, iu t lthattiga s sinck my eesk h ghten cer.
Sullman sor chate
MEO.