# LLM from scratch Series

Stage 1: Data Preparation and Sampling, Attention Mechanism, LLM Architecture

BUILDING AN LLM

Stage 2: Training Loop, Model Evaluation, Loading Pretrained Weights

FOUNDATION Model

Stage 3: Finetuning 

Classifier OR Personal Assistant 

## Stage 1: Data Prep & Sampling - Tokenization
How do you prepare input text for training LLMs?

Step 1: Split the text into individual word and subword tokens

Step 2: Convert the tokens into token IDs

Step 3: Encode the token IDs into vector representations
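
To make the three steps concrete, here is a minimal sketch that runs all of them on one toy sentence using only the standard library. The random lookup table and `embedding_dim = 3` are assumptions for illustration; a real model learns its embedding vectors during training.

```python
import random
import re

text = "Hello, world. This, is a text"

# Step 1: split into word and punctuation tokens, dropping whitespace
tokens = [t for t in re.split(r'([,.]|\s)', text) if t.strip()]

# Step 2: map each unique token to an integer ID (alphabetical order)
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

# Step 3: look up one vector per token ID in a randomly initialised table
# (toy stand-in for a trained embedding layer)
rng = random.Random(0)
embedding_dim = 3
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(len(vocab))]
vectors = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)
print(len(vectors), "vectors of length", embedding_dim)
```

Repeated tokens (the two commas here) get the same ID and therefore the same vector.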


## GPT Tokenization Process

<img src="tokenization.png" alt="GPT Tokenization Flow" width="800"/>


# Tokenization

## Step 1: Creating tokens

In [2]:
with open("the-verdict.txt", "r", encoding ="utf-8") as f:
    raw_text= f.read()
print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [5]:
import re
text= "Hello, world. This, is a text"
result= re.split(r'(\s)',text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'text']


In [6]:
import re
text= "Hello, world. This, is a text"
result= re.split(r'([,.]|\s)',text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'text']


In [8]:
result= [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'text']


In [21]:
#simple tokenization scheme
import re
text= "Hello, world. This, is-- a text?"
result=re.split(r'([,.:;?_!"()\']|--|\s)', text)

result= [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', '--', 'a', 'text', '?']


In [25]:

result= re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
result= [item.strip() for item in result if item.strip()]
print(len(result))

4690


In [26]:
result[:30]

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius',
 '--',
 'though',
 'a',
 'good',
 'fellow',
 'enough',
 '--',
 'so',
 'it',
 'was',
 'no',
 'great',
 'surprise',
 'to',
 'me',
 'to',
 'hear',
 'that',
 ',',
 'in']

## Step 2: Creating Token IDs

Create a vocabulary by sorting the unique tokens alphabetically and assigning a unique integer ID to each one


In [27]:
all_words= sorted(set(result))
print(len(all_words))

1130


In [28]:
vocab = {token: integer for integer, token in enumerate(all_words)}
# Later, the LLM's output will be numbers, so we also need an inverse vocabulary
# that converts token IDs back to the corresponding text tokens

### Let's make a tokenizer class in Python with an encode method (text to token IDs) and a decode method (token IDs back to text)

In [49]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # inverse vocabulary: token ID -> token string
        self.int_to_str = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # drop empty strings and whitespace-only items
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # remove the space before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

In [50]:
tokens= SimpleTokenizer(vocab=vocab)

In [51]:
id=tokens.encode("This is ")

In [52]:
tokens.decode(id)

'This is'
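
A quick round-trip check, as a minimal sketch: the functions below repeat the same split and lookup logic on a tiny made-up corpus (the corpus, `toy_vocab`, and the function names are hypothetical stand-ins, not part of the notebook). It also shows the limitation of this first tokenizer: a word missing from the vocabulary raises a `KeyError`.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text):
    # same splitting rule as the tokenizer class above
    return [t.strip() for t in re.split(PATTERN, text) if t.strip()]

# Hypothetical toy corpus standing in for the-verdict.txt
corpus = "I had always thought Jack rather a cheap genius."
toy_vocab = {tok: i for i, tok in enumerate(sorted(set(split_tokens(corpus))))}
inverse_vocab = {i: tok for tok, i in toy_vocab.items()}

def encode(text):
    return [toy_vocab[t] for t in split_tokens(text)]

def decode(ids):
    joined = " ".join(inverse_vocab[i] for i in ids)
    return re.sub(r'\s+([,.:;?_!"()\'])', r'\1', joined)

# Text built only from known tokens survives the round trip
round_trip = decode(encode(corpus))
print(round_trip)

# A word outside the vocabulary cannot be encoded
try:
    encode("Hello, Jack.")
    unknown_word = None
except KeyError as err:
    unknown_word = err.args[0]
print("not in vocabulary:", unknown_word)
```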

## ADDING SPECIAL CONTEXT TOKENS
We will modify the Python class to handle unknown tokens.

In the previous section, we implemented a simple tokenizer and applied it to a passage
from the training set. 

In this section, we will modify this tokenizer to handle unknown
words.


In particular, we will extend the vocabulary and the tokenizer from the
previous section, SimpleTokenizer, into a new class, SimpleTokenizerV2, that
supports two new tokens, <|unk|> and <|endoftext|>.

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary. 

Furthermore, we add an <|endoftext|> token between
unrelated texts. 

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source

</div>



In [59]:
# Add the end-of-text and unknown tokens to the vocabulary
all_tokens = sorted(set(result))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}

In [60]:
len(vocab)

1132

In [61]:
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


#### New Tokenizer class to handle new tokens

In [62]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # inverse vocabulary: token ID -> token string
        self.int_to_str = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # drop empty strings and whitespace-only items
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # map anything outside the vocabulary to the <|unk|> token
        preprocessed = [item if item in self.str_to_int else '<|unk|>' for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # remove the space before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

In [67]:
tokenizer= SimpleTokenizerV2(vocab)
text1="Hello, do you like tea?" 
text2="In the sunlit terraces of the places."
text= " <|endoftext|> ".join((text1,text2))

In [68]:
text

'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the places.'

In [69]:
tokenizer.encode(text=text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [70]:
tokenizer.decode(tokenizer.encode(text=text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

### Tokenization algorithms
- Word based
- Sub-word based
- Character based

## Byte Pair Encoding - Subword encoding

- Rule 1: Do not split frequently used words into smaller subwords
- Rule 2: Split the rare words into smaller, meaningful subwords

Example: "boy" should not be split, "boys" should be split into "boy" and "s"
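
The two rules fall out of how byte pair encoding learns its merges: it repeatedly fuses the most frequent adjacent pair of symbols, so frequent words end up as single tokens while rare words stay split. Below is a toy sketch of that merge loop. The function name, the word counts, and the tie-breaking rule are assumptions for illustration, and this is not how tiktoken is implemented internally.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Ties are broken by the pair itself so the result is deterministic
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        merged_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + count
        words = merged_words
    return merges, words

# Made-up word frequencies: "boy" is common, "boys" is rare
merges, segmented = learn_bpe_merges({"boy": 10, "boys": 2, "toy": 5}, num_merges=2)
print(merges)
print(segmented)
```

After two merges on these counts, "boy" is a single symbol while "boys" is represented as "boy" plus "s", which matches the example above.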

In [71]:
!pip install tiktoken



In [77]:
from importlib.metadata import version
import tiktoken
# print("tiktoken version:", version("tiktoken"))

In [73]:
tokenizer= tiktoken.get_encoding("gpt2")

In [75]:
text= (" Hello, do you like tea? <|endoftext|> In the sunlit terraces" "of someunknownPlace")
integers= tokenizer.encode(text,allowed_special={"<|endoftext|>"})
print(integers)

[18435, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271]


In [76]:
tokenizer.decode(integers)

' Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace'

## Create Input-Target pairs
- The last step before we create vector embeddings is to create input-target pairs

In [79]:
with open("the-verdict.txt","r",encoding="utf-8") as f:
    raw_text= f.read()

enc_text= tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [89]:
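# enc_sample is not defined in the cells shown here; the outputs below are
# consistent with skipping the first 50 tokens, so that is assumed
enc_sample = enc_text[50:]

context_window = 4
x = enc_sample[:context_window]
y = enc_sample[1:context_window + 1]
print(f"x: {x}")
print(f"y:      {y}")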

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [92]:
for i in range(1,context_window+1):
    context= enc_sample[:i]
    desired= enc_sample[i]
    print(context,"--->",desired)

[290] ---> 4920
[290, 4920] ---> 2241
[290, 4920, 2241] ---> 287
[290, 4920, 2241, 287] ---> 257


In [94]:
for i in range(1,context_window+1):
    context= enc_sample[:i]
    desired= enc_sample[i]
    print(tokenizer.decode(context),"--->",tokenizer.decode([desired]))

 and --->  established
 and established --->  himself
 and established himself --->  in
 and established himself in --->  a


## IMPLEMENTING A DATA LOADER
- Our goal is to create a data loader that produces two tensors: an input tensor containing the tokens the LLM sees, and a target tensor containing the tokens the LLM should predict
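
Before moving to tensors, the windowing idea can be sketched with plain Python lists. The helper name `sliding_windows` and the toy IDs are hypothetical. Each target is its input shifted by one token, and `stride` controls how far the window moves between samples.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs
pairs = sliding_windows(toy_ids, max_length=4, stride=2)
for inputs, targets in pairs:
    print(inputs, "-->", targets)
```

A stride smaller than `max_length` makes consecutive windows overlap; a stride equal to `max_length` gives non-overlapping windows.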

#### Building A Dataset Class


In [101]:
#Building A Dataset Class

from torch.utils.data import Dataset,DataLoader
import torch
class GPTDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length,stride):

        self.input_ids=[]
        self.target_ids=[]

        #tokenize the entire text
        tokens_id= tokenizer.encode(text, allowed_special={"<|endoftext|>"})

        for i in range(0, len(tokens_id)- max_length,stride):
            input_chunk=tokens_id[i:i+max_length]
            # targets are the inputs shifted by one token, so both chunks hold max_length tokens
            target_chunk=tokens_id[i+1: i+max_length+1]

            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


In [102]:
def create_dataloaderv1(text,batch_size=4, max_length=256,
                        stride=128,shuffle=True,drop_last=True,
                        num_worker=0):
    tokenizer= tiktoken.get_encoding("gpt2")

    #Create the dataset
    dataset= GPTDatasetV1(text,tokenizer,max_length,stride)

    #Create the dataloader
    dataloader= DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_worker
    )
    return dataloader

In [110]:
data=create_dataloaderv1(raw_text,batch_size=2, max_length=6,
                        stride=128,shuffle=True,drop_last=True,
                        num_worker=0)

In [111]:
for i, n in data:
    print(i,n)


tensor([[22645,    11,   465, 10904,  4252,  6236],
        [  503,  4291,   262,  4252, 18250,  8812]]) tensor([[   11,   465, 10904,  4252,  6236,   429],
        [ 4291,   262,  4252, 18250,  8812,   558]])
tensor([[  26,  475,  314, 2936,  683, 1969],
        [ 286,  616, 4286,  705, 1014,  510]]) tensor([[ 475,  314, 2936,  683, 1969, 2157],
        [ 616, 4286,  705, 1014,  510,   26]])
tensor([[18560,   438,  7091,   750,   523,   765],
        [ 1021,   757,   438, 10919,   257,   410]]) tensor([[  438,  7091,   750,   523,   765,   683],
        [  757,   438, 10919,   257,   410,  5040]])
tensor([[   11, 17728,   257,  8500,  4417,   284],
        [ 2612,  4369,    11,   523,   326,   612]]) tensor([[17728,   257,  8500,  4417,   284,   670],
        [ 4369,    11,   523,   326,   612,   550]])
tensor([[  673,   373, 10032,   286,   852, 13055],
        [   13,   198,   198,     1, 19242,   339]]) tensor([[  373, 10032,   

# Token Embeddings/Vector Embeddings

## Core Concepts

### Token Embeddings
- Dense vector representations that map discrete tokens into continuous vector space
- Each token (word/subword) becomes a fixed-length vector of floating-point numbers
- Dimension depends on model size, for example 768 in the smallest GPT-2 model and 12,288 in the largest GPT-3 model

### Why Embeddings Matter
- Transform symbolic text into numerical format for neural networks
- Capture semantic relationships between tokens
- Enable mathematical operations on language
- Foundation for contextual representations

## Architecture Components

### 1. Vocabulary System
- Token dictionary mapping words to unique IDs
- Handles unknown tokens (UNK)
- Special tokens (PAD, BOS, EOS)
- Vocabulary size impacts model complexity

### 2. Embedding Layer
- Learned weight matrix mapping token IDs to vectors
- Initialized randomly, refined during training
- Can be shared (tied) with the output layer, as in GPT-2, or across encoder and decoder in encoder-decoder models
- Enables parameter sharing and efficient learning

### 3. Positional Encoding
- Adds position information to token embeddings
- Can be learned or fixed sinusoidal encodings
- Essential for capturing token order in sequences
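
For the fixed variant, here is a minimal sketch of sinusoidal positional encodings in plain Python. The function name and the toy embedding values are assumptions for illustration; GPT-2 itself uses learned position embeddings rather than this fixed scheme.

```python
import math

def sinusoidal_positions(num_positions, dim):
    """Fixed encodings: even index 2k uses sin(pos / 10000^(2k/dim)), odd index 2k+1 uses cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):  # i is the even index 2k
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            if i + 1 < dim:
                row.append(math.cos(angle))
        table.append(row)
    return table

# Toy token embeddings for a 2-token sequence with 4 dimensions
token_embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
positions = sinusoidal_positions(num_positions=2, dim=4)

# Position information is added element-wise to the token embeddings
model_inputs = [[t + p for t, p in zip(tok, pos)]
                for tok, pos in zip(token_embeddings, positions)]
print(positions)
print(model_inputs)
```

Because the encoding depends only on the position, the same token receives a different input vector at each place in the sequence.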

## Training Process

### 1. Initialization
- Random initialization of embedding weights
- Xavier/Glorot or other suitable initialization methods
- Embedding dimension is a hyperparameter chosen with the model size, independently of vocabulary size

### 2. Learning
- Updated via backpropagation
- Learns from contextual predictions
- Optimized to capture semantic relationships

### 3. Output
- Final embeddings reflect learned language patterns
- Similar words cluster in vector space
- Enables meaningful vector arithmetic

## Key Considerations

### Dimensionality
- Higher dimensions capture more nuanced relationships
- Trade-off between expressiveness and computation
- Must balance with model size constraints
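
To put a number on the computation side of this trade-off, the embedding table holds `vocab_size * embedding_dim` parameters. A small sketch, assuming the 50,257-token GPT-2 vocabulary and a few illustrative dimensions (the helper name is hypothetical):

```python
def embedding_parameters(vocab_size, embedding_dim):
    # one row of embedding_dim weights per token in the vocabulary
    return vocab_size * embedding_dim

gpt2_vocab_size = 50257  # size of the GPT-2 BPE vocabulary
for dim in (256, 768, 1024):
    print(dim, embedding_parameters(gpt2_vocab_size, dim))
```

At 768 dimensions the table alone has 38,597,376 parameters, so doubling the dimension doubles this part of the model.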

### Context Window
- Size of context affects learning quality
- Longer contexts capture broader relationships
- Impacts computational requirements

### Normalization
- L2 normalization is sometimes applied, for example before comparing embeddings by cosine similarity
- Helps stabilize training
- Prevents embedding magnitude issues

This overview covers the foundational concepts for implementing embeddings in language models from scratch. Each component builds upon the previous, creating a complete system for neural language understanding.

## Test with Token embeddings

### Import trained word2vec model

In [2]:
import gensim.downloader as api
model= api.load("word2vec-google-news-300")



In [3]:
word_vector=model
print(word_vector['laptop'])
print(word_vector['laptop'].shape)

[ 2.64892578e-02 -1.64062500e-01 -7.01904297e-03  2.79296875e-01
  8.88671875e-02  1.89453125e-01 -2.29492188e-02  1.32812500e-01
  4.08203125e-01 -2.70996094e-02 -1.28906250e-01 -9.17968750e-02
  4.10156250e-02 -1.75781250e-02  1.21582031e-01  1.49414062e-01
  3.97949219e-02  4.41894531e-02 -3.63769531e-02 -1.58203125e-01
  2.72216797e-02 -1.97753906e-02 -3.51562500e-02 -2.62451172e-02
 -3.73535156e-02  3.61328125e-02 -8.05664062e-02  2.51953125e-01
  2.69531250e-01  5.43594360e-05 -3.19824219e-02  1.43432617e-02
 -2.31445312e-01  1.52343750e-01 -2.83203125e-01 -4.39453125e-01
 -3.06640625e-01  2.70996094e-02 -1.63085938e-01 -1.03515625e-01
 -2.17773438e-01  3.20312500e-01  5.05371094e-02  1.81640625e-01
  1.62109375e-01 -1.55639648e-02 -3.97949219e-02  1.91406250e-01
  9.47265625e-02 -2.08984375e-01 -2.06054688e-01 -3.27148438e-02
  1.21093750e-01 -6.68334961e-03 -2.96020508e-03  9.08203125e-02
  1.44195557e-03  2.09960938e-02  2.89306641e-02  1.75781250e-01
  1.63085938e-01  1.30859

## Similar words

### King + Woman - Man

In [5]:
# Example of using most_similar
# Note: this query uses the plural 'women'; the classic analogy is king + woman - man
print(word_vector.most_similar(positive=['king','women'], negative=['man'],topn=10))

[('queen', 0.4827326238155365), ('queens', 0.466781347990036), ('kumaris', 0.4653734564781189), ('kings', 0.4558638632297516), ('womens', 0.422832190990448), ('princes', 0.4176960587501526), ('Al_Anqari', 0.41725507378578186), ('concubines', 0.40110787749290466), ('monarch', 0.39624831080436707), ('monarchy', 0.39430150389671326)]


### Let us check the similarity between a few pairs of words

In [6]:
# Example of calculating similarity
print(word_vector.similarity('women','man'))
print(word_vector.similarity('king','queen'))
print(word_vector.similarity('icecream','diabetes'))
print(word_vector.similarity('sugar','honey'))
print(word_vector.similarity('uncle','nephew'))


0.2883053
0.6510956
0.03228098
0.5061663
0.84866095
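
The scores above are cosine similarities between the two word vectors. As a minimal sketch of what that measures, here it is in plain Python on made-up 3-dimensional vectors (the real word2vec vectors have 300 dimensions):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, different length
c = [-3.0, 0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b))  # same direction: close to 1
print(cosine_similarity(a, c))  # orthogonal: close to 0

# After L2 normalization, the plain dot product equals the cosine similarity
dot_normalized = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
print(dot_normalized)
```

Only the direction of the vectors matters, not their length, which is why a score near 1 means "similar" and a score near 0 means "unrelated".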


### Most similar words

In [7]:
print(word_vector.most_similar('tower',topn=5))

[('towers', 0.8531749844551086), ('skyscraper', 0.6417425870895386), ('Tower', 0.639177143573761), ('spire', 0.5946877598762512), ('responded_Understood_Atlasjet', 0.5931612849235535)]


## How token embeddings are created for large language models

### Creating Token Embeddings

In [9]:
import torch
inputs_ids= torch.tensor([2,3,5,1])

In [11]:

vocab_size= 6
output_dim=3
torch.manual_seed(123)

embedding_layer= torch.nn.Embedding(vocab_size,output_dim)

In [13]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
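
Each row of this weight matrix is the vector for one token ID, so embedding the input IDs is a row lookup; in PyTorch, `embedding_layer(inputs_ids)` performs that lookup. The sketch below reproduces the idea in plain Python with the weights copied from the printout above (rounded to four decimals), and shows that it matches multiplying one-hot vectors by the matrix. The helper names are hypothetical.

```python
# Weights copied from the printed embedding_layer.weight (rounded values)
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]

# An embedding lookup selects one row of the weight matrix per token ID
looked_up = [weights[i] for i in input_ids]

def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]

def vec_mat(vec, mat):
    # row vector times matrix
    return [sum(vec[r] * mat[r][c] for r in range(len(mat)))
            for c in range(len(mat[0]))]

# The same result via one-hot vectors multiplied by the weight matrix
via_matmul = [vec_mat(one_hot(i, len(weights)), weights) for i in input_ids]

for row in looked_up:
    print(row)
```

The lookup is simply a cheaper way to get the same result as the one-hot matrix product, which is why the embedding layer can be trained with backpropagation like any other linear layer.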
