# LLM from scratch Series

Stage 1: Data Preparation and Sampling, Attention Mechanism, LLM Architecture

Building an LLM

Stage 2: Training Loop, Model Evaluation, Loading Pretrained Weights

Foundation Model

Stage 3: Fine-tuning

Classifier or Personal Assistant

## Stage 1: Data Prep & Sampling: Tokenization
How do you prepare input text for training LLMs?

Step 1: Split the text into individual word and subword tokens

Step 2: Convert the tokens into token IDs

Step 3: Encode the token IDs into vector representations
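
The three steps can be sketched end to end on a toy sentence. This is a minimal illustration, not the pipeline built later in the notebook: the sample text, the tiny `embedding_dim`, and the random `embedding_table` are assumptions, and the random values only stand in for embeddings that a real model would learn.

```python
import random
import re

text = "Hello, world. Hello again."

# Step 1: split the text into word and punctuation tokens
tokens = [t for t in re.split(r'([,.]|\s)', text) if t.strip()]

# Step 2: map each unique token to an integer ID
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

# Step 3: look up one vector per token ID
# (random values are placeholders for learned embeddings)
random.seed(0)
embedding_dim = 3  # hypothetical, tiny on purpose
embedding_table = [[random.random() for _ in range(embedding_dim)] for _ in vocab]
vectors = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)
print(len(vectors), len(vectors[0]))
```

Both occurrences of "Hello" share one ID and therefore one vector, which is the point of going through a vocabulary.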


## GPT Tokenization Process

<img src="tokenization.png" alt="GPT Tokenization Flow" width="800"/>


# Tokenization

## Step 1: Creating tokens

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [5]:
import re
text= "Hello, world. This, is a text"
result= re.split(r'(\s)',text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'text']


In [6]:
import re
text= "Hello, world. This, is a text"
result= re.split(r'([,.]|\s)',text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'text']


In [8]:
result= [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'text']


In [21]:
#simple tokenization scheme
import re
text= "Hello, world. This, is-- a text?"
result=re.split(r'([,.:;?_!"()\']|--|\s)', text)

result= [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', '--', 'a', 'text', '?']


In [25]:

result= re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
result= [item.strip() for item in result if item.strip()]
print(len(result))

4690


In [26]:
result[:30]

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius',
 '--',
 'though',
 'a',
 'good',
 'fellow',
 'enough',
 '--',
 'so',
 'it',
 'was',
 'no',
 'great',
 'surprise',
 'to',
 'me',
 'to',
 'hear',
 'that',
 ',',
 'in']

## Step 2: Creating token IDs

Create a vocabulary by sorting the unique tokens alphabetically and assigning a unique integer ID to each one.


In [27]:
all_words= sorted(set(result))
print(len(all_words))

1130


In [28]:
vocab = {token: integer for integer, token in enumerate(all_words)}
# Later, the output of the LLM will be numbers, so we also need an inverse version of the
# vocabulary that converts token IDs back to the corresponding text tokens
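
The inverse mapping mentioned in the comment can be sketched on a toy vocabulary. The names `toy_vocab` and `toy_inverse` are hypothetical and used only for illustration; the real vocabulary is the one built from the-verdict.txt.

```python
# Toy vocabulary (hypothetical); the real one is built from the-verdict.txt
toy_vocab = {"a": 0, "cat": 1, "sat": 2}

# Swap keys and values so that IDs map back to tokens
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

ids = [toy_vocab[t] for t in ["a", "cat", "sat"]]
print(ids)
print([toy_inverse[i] for i in ids])
```

This only works as a round trip because every token has a unique ID, so no two keys collide when the dictionary is inverted.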

### Let's make a tokenizer class in Python with an encode method (text to token IDs) and a decode method (token IDs to text)

In [49]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # Inverse mapping: token ID -> token string
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty and whitespace-only items
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

In [50]:
tokens= SimpleTokenizer(vocab=vocab)

In [51]:
ids = tokens.encode("This is ")

In [52]:
tokens.decode(ids)

'This is'

## ADDING SPECIAL CONTEXT TOKENS

In the previous section, we implemented a simple tokenizer and applied it to a passage
from the training set.

In this section, we will modify this tokenizer to handle unknown
words.

In particular, we will extend the vocabulary and the SimpleTokenizer class from the
previous section into a new class, SimpleTokenizerV2, that supports two new tokens, <|unk|> and
<|endoftext|>.
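
To see why a fallback for unknown words is needed, the sketch below encodes a sentence with a strict vocabulary lookup, the same lookup the first tokenizer uses. The vocabulary `toy_vocab` and the helper `encode_strict` are hypothetical stand-ins, not part of the notebook.

```python
import re

# Tiny hypothetical vocabulary that does not contain "Hello"
toy_vocab = {",": 0, "do": 1, "like": 2, "tea": 3, "you": 4, "?": 5}

def encode_strict(text, vocab):
    """Encode with a plain dictionary lookup: every token must be in the vocabulary."""
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    parts = [p.strip() for p in parts if p.strip()]
    return [vocab[p] for p in parts]

try:
    encode_strict("Hello, do you like tea?", toy_vocab)
    missing = None
except KeyError as err:
    # The lookup fails on the first token that is not in the vocabulary
    missing = err.args[0]

print("Missing token:", missing)
```

Mapping such tokens to a dedicated `<|unk|>` entry turns this hard failure into a recoverable, if lossy, encoding.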

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary. 

Furthermore, we add an <|endoftext|> token between
unrelated texts.

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source.

</div>
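
As a rough sketch of the idea described above, several independent texts can be joined into one training string with the boundary token between them. The `documents` list is a hypothetical stand-in for real text sources.

```python
# Hypothetical stand-ins for independent text sources
documents = [
    "First document about tea.",
    "Second document about terraces.",
    "Third document about painting.",
]

# Place the boundary token between consecutive documents
END_OF_TEXT = "<|endoftext|>"
training_text = f" {END_OF_TEXT} ".join(documents)

print(training_text)
print(training_text.count(END_OF_TEXT))
```

With three documents there are two boundaries, so the token appears twice.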



In [59]:
# Add the end-of-text and unknown tokens to the vocabulary
all_tokens = sorted(set(result))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}

In [60]:
len(vocab)

1132

In [61]:
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


#### New Tokenizer class to handle new tokens

In [62]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # Inverse mapping: token ID -> token string
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty and whitespace-only items
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens that are not in the vocabulary with <|unk|>
        preprocessed = [item if item in self.str_to_int else '<|unk|>' for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

In [67]:
tokenizer= SimpleTokenizerV2(vocab)
text1="Hello, do you like tea?" 
text2="In the sunlit terraces of the places."
text= " <|endoftext|> ".join((text1,text2))

In [68]:
text

'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the places.'

In [69]:
tokenizer.encode(text=text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [70]:
tokenizer.decode(tokenizer.encode(text=text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

### Tokenization algorithms
- Word based
- Sub-word based
- Character based
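
The two extremes in this list can be compared on a short string; sub-word tokenization sits between them. This is a minimal sketch with a made-up sample sentence.

```python
import re

text = "The boys play"

# Word based: one token per word (short sequences, but the vocabulary grows
# with every new word and unseen words cannot be represented)
word_tokens = [t for t in re.split(r'\s', text) if t]

# Character based: one token per character (tiny vocabulary, long sequences)
char_tokens = list(text)

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens[:8])
```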

## Byte Pair Encoding: Sub-word Encoding

- Rule 1: Do not split frequently used words into smaller subwords
- Rule 2: Split the rare words into smaller, meaningful subwords

Example: "boy" should not be split, "boys" should be split into "boy" and "s"
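
The merge idea behind these rules can be sketched on a toy corpus. This is a simplified, character-level illustration and not the GPT-2 implementation, which works on bytes and applies a pre-tokenization step first. The word counts and the helper names `most_frequent_pair` and `merge_pair` are assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word counts; each word starts as a tuple of characters
words = {tuple("boy"): 10, tuple("boys"): 3, tuple("toy"): 4}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After two merges the frequent word "boy" has become a single symbol, while the rarer "boys" is represented as "boy" plus "s", which matches the example above.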

In [71]:
!pip install tiktoken



In [77]:
import importlib.metadata
import tiktoken
# print('tiktoken version:', importlib.metadata.version("tiktoken"))

In [73]:
tokenizer= tiktoken.get_encoding("gpt2")

In [75]:
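# Note: the two adjacent string literals are concatenated without a space,
# so the text contains "terracesof" (visible in the decoded output below)
text = (" Hello, do you like tea? <|endoftext|> In the sunlit terraces" "of someunknownPlace")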
integers= tokenizer.encode(text,allowed_special={"<|endoftext|>"})
print(integers)

[18435, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271]


In [76]:
tokenizer.decode(integers)

' Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace'

## Create Input-Target pairs
- The last step before we create vector embeddings is to create input-target pairs

In [79]:
with open("the-verdict.txt","r",encoding="utf-8") as f:
    raw_text= f.read()

enc_text= tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [89]:
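# Assumption: work on a sample that skips the first 50 tokens of enc_text
enc_sample = enc_text[50:]
context_window = 4
x = enc_sample[:context_window]
y = enc_sample[1:context_window + 1]
print(f"x: {x}")
print(f"y:      {y}")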

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [92]:
for i in range(1,context_window+1):
    context= enc_sample[:i]
    desired= enc_sample[i]
    print(context,"--->",desired)

[290] ---> 4920
[290, 4920] ---> 2241
[290, 4920, 2241] ---> 287
[290, 4920, 2241, 287] ---> 257
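
The same shifting-by-one idea can be wrapped in a small sliding-window helper. This is a sketch under stated assumptions: the function name `make_pairs` and its `context_size` and `stride` parameters are hypothetical, and the token IDs are the five values printed above.

```python
def make_pairs(token_ids, context_size, stride):
    """Return (input, target) windows; each target is its input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - context_size, stride):
        x = token_ids[start:start + context_size]
        y = token_ids[start + 1:start + context_size + 1]
        pairs.append((x, y))
    return pairs

# The five token IDs that appear in the outputs above
sample = [290, 4920, 2241, 287, 257]

overlapping = make_pairs(sample, context_size=2, stride=1)
strided = make_pairs(sample, context_size=2, stride=2)

for x, y in overlapping:
    print(x, "--->", y)
print(len(strided))
```

A stride of 1 gives heavily overlapping windows, while a larger stride covers the text with fewer, less redundant pairs.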
