 ## Intro
 
We all know LLMs are word generators. When you ask ChatGPT "hello, can you help me write a blog post about xyz", behind the curtain it generates the next word (more specifically, the next token). The whole sequence, including that new token, is then fed back to the LLM, which generates the next one, and the process repeats again and again. But have you ever wondered how exactly it works?
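
The loop described above can be sketched in a few lines. This is only a toy illustration: `predict_next_token` and its lookup table are hypothetical stand-ins for a real model, not part of any library.

```python
# Toy autoregressive loop. `predict_next_token` is a hypothetical stand-in
# for a real language model; it only looks up a hard-coded table.
NEXT_TOKEN = {"hello": ",", ",": "how", "how": "are", "are": "you", "you": "?"}

def predict_next_token(tokens):
    # A real model would look at the whole sequence; this toy only uses the last token.
    return NEXT_TOKEN.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # predict one token
        if next_token == "<end>":
            break
        tokens.append(next_token)                # feed the longer sequence back in
    return tokens

generated = generate(["hello"])
print(generated)
```

The important part is the shape of the loop: predict one token, append it, and repeat with the longer sequence.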
 
In this blog post, we will dig deeper into the building blocks of Large Language Models and how they come together. No hard math or coding experience is needed. All you need is Attention and some familiarity with 10th-grade math, i.e. what a matrix is, multiplication and simple equations. Basic Python coding would be a plus but is not necessary.
 
 ### Let's build a TLM (Tiny Language Model)
 
Before we get started, I would like to set expectations right. Tiny LM is not going to be a state-of-the-art model that you can ask how to cure cancer and have it spit out the exact procedure. On second thought, even ChatGPT can't do that yet, but you get the point.

Tiny LM, as the name suggests, is going to be a very small model that predicts the next character based on the data we train it on. So it won't even generate real English words. We will get there in the next blog post, but before that we need to understand how to build a model that generates text which looks like English but is gibberish.

We will use a dataset called "Tiny Open Domain Books", which contains text from four books:

- Alice in Wonderland - Lewis Carroll
- Dracula - Bram Stoker
- The Wonderful Wizard of Oz - L. Frank Baum
- The Count of Monte Cristo - Alexandre Dumas & Auguste Maquet

This file is around 3MB and can be downloaded from here: [TinyBook](https://huggingface.co/datasets/Blackroot/Tiny-Open-Domain-Books)


#### Read Text File

Let's load the text file and see what it contains. It has ~245K characters.

In [1]:
tinyBook = open('../../data/tinystories.txt', 'r').read()  # Read the tiny book data 

In [2]:
len(tinyBook) # Total length of text

244539

In [3]:
tinyBook[:1000] # First 1000 characters

"Dorothy lived in the midst of the great Kansas prairies, with\nUncle Henry, who was a farmer, and Aunt Em, who was the farmer's\nwife. Their house was small, for the lumber to build it had to be\ncarried by wagon many miles. There were four walls, a floor and a\nroof, which made one room; and this room contained a rusty looking\ncookstove, a cupboard for the dishes, a table, three or four\nchairs, and the beds. Uncle Henry and Aunt Em had a big bed in one\ncorner, and Dorothy a little bed in another corner. There was no\ngarret at all, and no cellar—except a small hole dug in the ground,\ncalled a cyclone cellar, where the family could go in case one of\nthose great whirlwinds arose, mighty enough to crush any building\nin its path. It was reached by a trap door in the middle of the\nfloor, from which a ladder led down into the small, dark hole.\nWhen Dorothy stood in the doorway and looked around, she could\nsee nothing but the great gray prairie on every side. Not a tree\nnor a hous

Now let's see how many unique characters are in this text.

In [4]:
uniqueCharacters = sorted(set(tinyBook))  # a string is already a sequence of characters
print(len(uniqueCharacters))
print(uniqueCharacters)

86
['\n', ' ', '!', '"', '&', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', '[', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xa0', '—', '‘', '’', '“', '”', '…']


#### Encode - Decode

We have 86 unique characters. However, we need to convert these to numbers before they can be fed to any model. That's how it works with any LLM: it never sees the text you type directly. The text gets converted to numbers, the model spits out numbers, and those numbers are converted back to text. No magic or AI consciousness.

This process of converting text to numbers is called tokenization. The easiest tokenization method is to just use the position of each character in our sorted list. We start counting at 1, so '\n' will be 1, ' ' will be 2, '!' will be 3 and so on.

The following code creates two mappings: `character_to_index`, which is basically {character: index}, and `index_to_character`, which is the reverse, {index: character}.

In [5]:
character_to_index = {s:i+1 for i,s in enumerate(uniqueCharacters)}
index_to_character = {i:s for s,i in character_to_index.items()}
print(index_to_character)

{1: '\n', 2: ' ', 3: '!', 4: '"', 5: '&', 6: "'", 7: '(', 8: ')', 9: '*', 10: ',', 11: '-', 12: '.', 13: '0', 14: '1', 15: '2', 16: '3', 17: '4', 18: '5', 19: '6', 20: '7', 21: '8', 22: '9', 23: ':', 24: ';', 25: '?', 26: 'A', 27: 'B', 28: 'C', 29: 'D', 30: 'E', 31: 'F', 32: 'G', 33: 'H', 34: 'I', 35: 'J', 36: 'K', 37: 'L', 38: 'M', 39: 'N', 40: 'O', 41: 'P', 42: 'Q', 43: 'R', 44: 'S', 45: 'T', 46: 'U', 47: 'V', 48: 'W', 49: 'Y', 50: 'Z', 51: '[', 52: ']', 53: '`', 54: 'a', 55: 'b', 56: 'c', 57: 'd', 58: 'e', 59: 'f', 60: 'g', 61: 'h', 62: 'i', 63: 'j', 64: 'k', 65: 'l', 66: 'm', 67: 'n', 68: 'o', 69: 'p', 70: 'q', 71: 'r', 72: 's', 73: 't', 74: 'u', 75: 'v', 76: 'w', 77: 'x', 78: 'y', 79: 'z', 80: '\xa0', 81: '—', 82: '‘', 83: '’', 84: '“', 85: '”', 86: '…'}


In [6]:
encode = lambda s: [character_to_index[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([index_to_character[i] for i in l]) # decoder: take a list of integers, output a string
print('Dorothy -->', encode("Dorothy"))
print('[29, 68, 71, 68, 73, 61, 78] -->', decode([29, 68, 71, 68, 73, 61, 78]))

Dorothy --> [29, 68, 71, 68, 73, 61, 78]
[29, 68, 71, 68, 73, 61, 78] --> Dorothy


`encode` is a function that takes a text and returns a list with the index of each character in it. Similarly, `decode` takes a list of numbers and returns the text.
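
A quick way to convince yourself that the two functions are inverses is a round trip. The sketch below rebuilds the mappings from the single word `Dorothy` instead of the whole book (an assumption made to keep it self-contained), so its indices differ from the ones printed above. It also shows that a character missing from the vocabulary cannot be encoded.

```python
text = "Dorothy"
vocab = sorted(set(text))                                  # ['D', 'h', 'o', 'r', 't', 'y']
char_to_idx = {ch: i + 1 for i, ch in enumerate(vocab)}    # indices start at 1, as above
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def encode(s):
    return [char_to_idx[c] for c in s]

def decode(ids):
    return "".join(idx_to_char[i] for i in ids)

ids = encode(text)
round_trip = decode(ids)
print(ids, "->", round_trip)

# A character that never appeared in the vocabulary has no index
try:
    encode("z")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print("can encode 'z':", unknown_ok)
```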


#### Context Window
Next we define how many characters the model sees before generating the next one. This is known as the block size or context window. For simplicity, let's say block_size is 3, which means that if we input `Dor` the next letter we expect is `o`, for `Doro` we expect `t`, and so on.

When `Doro` is the input, the model will only consider `oro` to generate `t`, and that's how it works with LLMs too. If, hypothetically, you could insert more text than the context window allows in ChatGPT, it would only consider as much text as fits in its context window and completely ignore anything before that. That's why ChatGPT does not allow such long text to be entered and gives an error:

`The message you submitted was too long, please submit something shorter.`

On the other hand, we are allowed to enter text that is shorter than the context window. Input `D` should return `o`, `Do` should return `r`, and so on.

block_size = 3
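
Cropping an input to the context window can be sketched with a plain string slice. The helper name `crop_to_context` is made up for this illustration.

```python
block_size = 3

def crop_to_context(text, block_size):
    # Keep only the last `block_size` characters; anything earlier is dropped.
    return text[-block_size:]

long_input = crop_to_context("Doro", block_size)
short_input = crop_to_context("D", block_size)
print(long_input)   # oro
print(short_input)  # D
```

Inputs shorter than the window pass through unchanged, which matches the `D` and `Do` cases above.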

#### Building Training Data

Let's think this through: what is training data? We want to tell the model that when the input is `Dor`, we expect `o`. In other words, when the input is `[29, 68, 71]`, the expected output is `[68]`.

Even a simple word like `Dorothy` (`[29, 68, 71, 68, 73, 61, 78]`) yields multiple training examples:

- `[29]` -> `[68]`           ---------------- ('D' -> 'o')
- `[29, 68]` -> `[71]`       ----------- ('Do' -> 'r')
- `[29, 68, 71]` -> `[68]`   ------ ('Dor' -> 'o')
- `[68, 71, 68]` -> `[73]`   ------ ('oro' -> 't')
- `[71, 68, 73]` -> `[61]`   ------ ('rot' -> 'h')
- `[68, 73, 61]` -> `[78]`   ------ ('oth' -> 'y')
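
The list above can be generated mechanically. Here is a minimal sketch; the helper `make_examples` is a name invented for this illustration and is not used later in the post.

```python
def make_examples(token_ids, block_size):
    examples = []
    for i in range(1, len(token_ids)):
        # Context: up to `block_size` tokens ending just before position i
        context = token_ids[max(0, i - block_size):i]
        target = token_ids[i]
        examples.append((context, target))
    return examples

dorothy = [29, 68, 71, 68, 73, 61, 78]
examples = make_examples(dorothy, block_size=3)
for context, target in examples:
    print(context, "->", [target])
```

A word of 7 characters gives 6 examples, and the context never grows beyond 3 characters.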

We don't feed all the data at once during training, because training happens over multiple steps and each step is costly resource-wise. It also turns out that training on the whole data at each step and training in batches give similar results. So we will pick a random batch of training data for each step. The batch size will be 8, which means 8 sets of examples like the ones we see above.


Note: for simplicity, we will not split the data into multiple sets such as train and validation. It is an important step, however.
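
For reference, a split could look like the sketch below. The 90/10 ratio is an assumption, and `data` here is a plain list standing in for the encoded text; the `data` tensor built in the next cells can be sliced in the same way.

```python
# Stand-in for the encoded text (assumption: any sequence of token ids works here)
data = list(range(100))

split_at = int(0.9 * len(data))   # 90% train, 10% validation (an assumed ratio)
train_data = data[:split_at]      # the model learns from this part
val_data = data[split_at:]        # held out to check how well it generalizes
print(len(train_data), len(val_data))
```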


In [30]:
import random
import torch
import torch.nn.functional as F

In [7]:
batch_size = 8
block_size = 3
data = torch.tensor(encode(tinyBook), dtype=torch.long)
print(data.dtype, data.shape)

torch.int64 torch.Size([244539])


In [8]:
# this function selects training batch randomly 
def get_batch():
    ix = torch.randint(len(data)-block_size , (batch_size,))     # Generate random numbers  
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x,y

Let's inspect a batch. 

In [9]:
x, y = get_batch()
x

tensor([[62, 73, 58],
        [62, 67, 72],
        [68, 71,  2],
        [68,  2, 68],
        [73, 61, 68],
        [54,  2, 76],
        [44, 68, 74],
        [67, 73, 62]])

In [10]:
y

tensor([[73, 58,  2],
        [67, 72, 73],
        [71,  2, 54],
        [ 2, 68, 67],
        [61, 68, 74],
        [ 2, 76, 58],
        [68, 74, 69],
        [73, 62, 68]])

We are packing a lot of information here. If we focus only on the first row, we are telling the model: when the input is `62`, the expected output is `73`; when the input is `62, 73`, the expected output is `58`; and when the input is `62, 73, 58`, the expected output is `2`.
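
To make this concrete, the sketch below unpacks the first row of `x` and `y` from the outputs above into its individual (context, target) pairs. Plain lists are used here instead of tensors to keep it self-contained.

```python
x_row = [62, 73, 58]   # first row of x above
y_row = [73, 58, 2]    # first row of y above

# Position t of y is the target for the first t+1 entries of x
pairs = [(x_row[: t + 1], y_row[t]) for t in range(len(x_row))]
for context, target in pairs:
    print(context, "->", target)
```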

#### Embedding

Let's take it up a notch. You must have heard the term `parameters` in connection with LLMs. For example, LLAMA3-70B has 70 billion parameters. Parameters are like variables that are set to certain numbers during training. A common analogy is to think of parameters as dials on a machine that need to be set to certain values in order to get the desired output.

Embeddings are one set of these trainable parameters. Each character we have can be stretched into a vector to add more dimensions. Wait, what? What does that even mean?

In other words, we assign each character a list of random numbers. For example, the character 'A', which has index `26` in our mapping, could be assigned `[0.4, 0.3, 0.2, 0.1, 0.1]`. An embedding can be thought of as giving multiple features to a character. So here `A` has five features, and later the model will figure out which values of these features make sense for the data. The number of features, or embedding dimension, is a hyper-parameter. Like block_size or batch_size, it is a number we choose and fix manually.

We are setting the embedding dimension to 5 and our vocab size is 86, so we generate a matrix of 86 rows and 5 columns. Each row contains the embedding for one character. So if we want to see the embedding for character `A`, we just look up row 26 (`embedding_matrix[26]`). One caveat: because our mapping starts at 1, the indices run from 1 to 86, while an 86-row matrix only has rows 0 to 85, so index 86 ('…') has no row of its own. A matrix with 87 rows, or a mapping that starts at 0, would avoid an out-of-range lookup.



In [11]:
embedding_dimension = 5
vocab_size = 86
embedding_matrix = torch.randn((vocab_size, embedding_dimension))
print('Shape of Embedding Matrix: ', embedding_matrix.shape)

Shape of Embedding Matrix:  torch.Size([86, 5])


For a batch `x` we want to select its embeddings, and that can be done by simply indexing embedding_matrix with `x`. This is thanks to PyTorch, a Python framework that makes all these complex matrix operations a breeze. Each of the 8 × 3 indices in `x` is replaced by its 5-number embedding, so our input is now three-dimensional.

In [12]:
embeddings = embedding_matrix[x]
print(embeddings.shape)

torch.Size([8, 3, 5])
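
To see that this indexing is nothing more than a row lookup, here is a small sketch. It uses a fresh random matrix and only two rows borrowed from the batch above (both assumptions made to keep it short), and checks that one entry of the result equals the corresponding row of the matrix.

```python
import torch

torch.manual_seed(0)                      # only to make the sketch repeatable
embedding_matrix = torch.randn((86, 5))   # same shape as above
x = torch.tensor([[62, 73, 58],
                  [62, 67, 72]])          # two rows borrowed from the batch above

embeddings = embedding_matrix[x]          # one 5-number row per index in x
print(embeddings.shape)                   # torch.Size([2, 3, 5])

# The vector at batch row 0, position 1 is simply row 73 of the matrix
same = torch.equal(embeddings[0, 1], embedding_matrix[73])
print(same)
```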


#### Neuron

So far we have prepared the input. The next step is adding a neural layer to the mix. A neuron is mathematically described as `Wx + b`, where `W` represents the weights, `x` is the input and `b` is the bias.

We won't use just one neuron. We would like to use a large number of neurons, let's say 100. These hundred neurons collectively form a layer. We can stack as many of these layers as we like. For simplicity, we will use two neural layers.
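
Here is a minimal sketch of `Wx + b` in plain Python, first for a single neuron and then for a tiny layer of two neurons. The numbers are made up for illustration; the real layer below uses PyTorch matrices instead of lists.

```python
def neuron(weights, inputs, bias):
    # Weighted sum of the inputs plus a bias: w1*x1 + w2*x2 + ... + b
    return sum(w * x for w, x in zip(weights, inputs)) + bias

inputs = [1.0, 2.0, 3.0]
out = neuron([0.5, -1.0, 2.0], inputs, bias=0.5)   # 0.5 - 2.0 + 6.0 + 0.5
print(out)  # 5.0

# A layer is several neurons looking at the same inputs,
# each with its own weights and bias
layer_weights = [[0.5, -1.0, 2.0], [1.0, 1.0, 1.0]]
layer_biases = [0.5, 0.0]
layer_out = [neuron(w, inputs, b) for w, b in zip(layer_weights, layer_biases)]
print(layer_out)  # [5.0, 6.0]
```

Stacking the weight lists of all neurons gives exactly the weight matrix `W` of a layer, which is why the code below can compute the whole layer with one matrix multiplication.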




Weights and biases are another set of parameters that are trained. In other words, they start as random numbers and we perform a set of calculations to modify (tune) these numbers so that the final output of the neural network gets closer to the actual output.

The first layer has 100 neurons. Its output passes through an activation function (here `tanh`) and then on to the second layer. Each example is flattened into 3 × 5 = 15 numbers (3 characters with 5 features each), which is why `W1` has 15 rows.





In [20]:
W1 = torch.rand(15, 100)
b1 = torch.rand(100)
first_layer_pre_activation = embeddings.view(-1, 15) @ W1 + b1

In [22]:
first_layer_output = torch.tanh(first_layer_pre_activation)
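
It helps to follow the shapes through these two cells. The sketch below repeats the same steps on a random stand-in for `embeddings` (an assumption, so the actual numbers differ from the notebook) and prints the shape at each stage.

```python
import torch

batch_size, block_size, embedding_dimension, hidden = 8, 3, 5, 100
embeddings = torch.randn(batch_size, block_size, embedding_dimension)  # [8, 3, 5]

# Flatten the 3 characters x 5 features of each example into one row of 15 numbers
flat = embeddings.view(-1, block_size * embedding_dimension)           # [8, 15]

W1 = torch.rand(block_size * embedding_dimension, hidden)              # [15, 100]
b1 = torch.rand(hidden)                                                # [100], broadcast over the 8 rows
pre_activation = flat @ W1 + b1                                        # [8, 100]
output = torch.tanh(pre_activation)                                    # values squashed into [-1, 1]
print(flat.shape, pre_activation.shape, output.shape)
```

Each of the 8 examples ends up with 100 numbers, one per neuron in the first layer.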

In [26]:
W2 = torch.rand(100, vocab_size)
b2 = torch.rand(vocab_size)

In [27]:
logits = first_layer_output @ W2 + b2

In [28]:
logits.shape

torch.Size([8, 86])

In [31]:
loss = F.cross_entropy(logits,y)

RuntimeError: 0D or 1D target tensor expected, multi-target not supported
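
The error comes from a shape mismatch: `logits` holds one row of 86 scores per example (`[8, 86]`), while `y` holds three targets per example (`[8, 3]`). With class indices as targets, `F.cross_entropy` expects exactly one index per row. One possible way around it, which is an assumption here and not necessarily the route this post takes, is to keep only the last column of `y`, i.e. the character that follows the full three-character context. The tensors below are random stand-ins.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                           # only to make the sketch repeatable
vocab_size = 86
logits = torch.randn(8, vocab_size)            # stand-in for the network output
y = torch.randint(1, vocab_size, (8, 3))       # stand-in targets, same shape as y above

targets = y[:, -1]                             # shape [8]: the character after the full context
loss = F.cross_entropy(logits, targets)        # a single number (0-dimensional tensor)
print(targets.shape, loss.item())
```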

In [34]:
probs = F.softmax(logits, dim=1)

In [37]:
probs[0]

tensor([6.0203e-04, 1.9227e-02, 2.4898e-03, 3.4971e-02, 3.1267e-05, 5.3999e-04,
        1.4671e-03, 1.0701e-03, 2.1355e-04, 2.1248e-02, 1.0419e-02, 7.6492e-06,
        1.5678e-03, 1.2366e-03, 4.6987e-04, 2.3232e-04, 9.1721e-03, 1.6285e-02,
        5.4651e-03, 8.8872e-05, 2.8982e-03, 3.1337e-04, 1.3703e-02, 1.7932e-03,
        4.8126e-05, 7.3775e-05, 1.7926e-03, 1.3406e-04, 3.0437e-02, 1.7552e-03,
        7.7340e-04, 1.4197e-05, 2.4941e-04, 7.9453e-03, 1.2898e-05, 6.0568e-05,
        2.0386e-04, 5.9326e-05, 1.3861e-02, 5.2173e-04, 4.4592e-02, 6.7000e-03,
        1.3556e-03, 9.7609e-04, 1.6482e-02, 1.0915e-04, 5.2550e-04, 5.5785e-03,
        5.6149e-03, 8.5327e-05, 5.0739e-04, 2.0296e-06, 2.9259e-05, 2.5029e-05,
        6.7468e-02, 2.5599e-02, 5.8128e-05, 5.2848e-03, 8.9373e-03, 6.3425e-02,
        3.9879e-03, 2.8140e-02, 5.0439e-03, 2.0298e-04, 2.6321e-05, 2.2958e-06,
        1.1130e-02, 4.2381e-03, 3.2411e-04, 6.6507e-02, 3.4426e-04, 1.2019e-05,
        1.1563e-05, 1.2994e-03, 1.4782e-