# **TOKENIZATION**
# What is Tokenization? 
# Tokenization is a fundamental process in Natural Language Processing (NLP) that involves breaking down a stream of text into smaller units called tokens.
(Source: GeeksforGeeks)


*Data (Book) : The Wealth of Nations by Adam Smith*

**Loading & Reading the Data**

In [51]:
with open('The-Wealth-of-Nations.txt', encoding='utf-8') as file:
    raw_text = file.read()

In [52]:
# Number of characters in the text
print(len(raw_text))

2248172


The text has more than 22 lakh (about 2.2 million) characters, many of which are repeated.

**We use regular expressions to split the text on patterns. The book contains punctuation, and we keep each punctuation mark as its own token because punctuation can change the meaning of a text.**

In [53]:
# Importing the regular expression library i.e re
import re

In [54]:
# splitting the text data using the patterns using the re library
preprocessed = re.split(r'([,.!:;?_"()\']|--|\s)', raw_text)

In [55]:
preprocessed 

['An',
 ' ',
 'Inquiry',
 ' ',
 'into',
 ' ',
 'the',
 ' ',
 'Nature',
 ' ',
 'and',
 ' ',
 'Causes',
 ' ',
 'of',
 ' ',
 'the',
 ' ',
 'Wealth',
 ' ',
 'of',
 ' ',
 'Nations',
 ' ',
 'by',
 ' ',
 'Adam',
 ' ',
 'Smith',
 ' ',
 'is',
 ' ',
 'a',
 ' ',
 'publication',
 ' ',
 'of',
 ' ',
 'The',
 '\n',
 'Electronic',
 ' ',
 'Classics',
 ' ',
 'Series',
 '.',
 '',
 ' ',
 'This',
 ' ',
 'Portable',
 ' ',
 'Document',
 ' ',
 'file',
 ' ',
 'is',
 ' ',
 'furnished',
 ' ',
 'free',
 ' ',
 'and',
 ' ',
 'without',
 ' ',
 'any',
 ' ',
 'charge',
 ' ',
 'of',
 ' ',
 'any',
 '\n',
 'kind',
 '.',
 '',
 ' ',
 'Any',
 ' ',
 'person',
 ' ',
 'using',
 ' ',
 'this',
 ' ',
 'document',
 ' ',
 'file',
 ',',
 '',
 ' ',
 'for',
 ' ',
 'any',
 ' ',
 'purpose',
 ',',
 '',
 ' ',
 'and',
 ' ',
 'in',
 ' ',
 'any',
 ' ',
 'way',
 ' ',
 'does',
 ' ',
 'so',
 ' ',
 'at',
 ' ',
 'his',
 ' ',
 'or',
 ' ',
 'her',
 ' ',
 'own',
 '\n',
 'risk',
 '.',
 '',
 ' ',
 'Neither',
 ' ',
 'the',
 ' ',
 'Pennsylvania',
 ' ',


From the above output, we can see that it contains white spaces too!

We want to remove the white spaces

In [56]:
preprocessed = [item for item in preprocessed if item.split()]

The tokens (here, words and punctuation marks) have now been created.

In [57]:
preprocessed[:20]

['An',
 'Inquiry',
 'into',
 'the',
 'Nature',
 'and',
 'Causes',
 'of',
 'the',
 'Wealth',
 'of',
 'Nations',
 'by',
 'Adam',
 'Smith',
 'is',
 'a',
 'publication',
 'of',
 'The']

Thus, we successfully removed the white spaces from the list.
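To make the effect of the capturing group easier to see, here is a minimal sketch on a short made-up sentence. The sentence and the variable names are assumptions for illustration; only the pattern is the one used above.

```python
import re

# Illustrative sentence (an assumption, not a line from the book)
sample_text = "Hello, world. Is this-- a test?"

# Same pattern as above: the capturing group keeps the delimiters in the result
pieces = re.split(r'([,.!:;?_"()\']|--|\s)', sample_text)

# Drop empty strings and whitespace-only items
tokens = [p for p in pieces if p.strip()]
print(tokens)
```

Without the parentheses around the pattern, `re.split` would discard the punctuation instead of returning it as separate items.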

# Vocabulary: In the context of Large Language Models (LLMs), the vocabulary is the set of all tokens (words, subwords, or characters) that the model knows and can understand.

We create a vocabulary from the given data so that the model, when it is trained and used, knows which tokens are available.

A vocabulary is a dictionary that assigns an ID to every token.

Thus, the words in the vocabulary should be unique and not repeated.


**Creating vocabulary**

In [58]:
# First: get all unique words sorted alphabetically
unique_words = sorted(set(preprocessed))

In [59]:
vocab_size = len(unique_words)
vocab_size

13984

So, from a text of more than 22 lakh characters we get only 13,984 unique tokens.

**Now, we create a *Vocabulary* and assign unique *TokenIDs* to the tokens (here, words)**

In [60]:
vocab = {token:integer for integer, token in enumerate(unique_words)}

*enumerate()* yields (index, token) pairs. In the dictionary comprehension we swap them, so the token becomes the key and the integer becomes its token ID.

The *enumerate()* function loops through all elements and returns both the index and the value.

A Vocabulary is a dictionary which provides mapping from token to token ID
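As a small illustration, the same steps can be run on a toy token list. The list and the names `toy_vocab` and `toy_inverse` are assumptions for this sketch; the real vocabulary is built from the book.

```python
# Toy token list (illustrative; the real list comes from the book)
toy_tokens = ["the", "wealth", "of", "nations", "of", "the", ","]

# Unique tokens, sorted so that the IDs are reproducible
toy_unique = sorted(set(toy_tokens))

# token -> ID, and the inverse mapping ID -> token
toy_vocab = {token: idx for idx, token in enumerate(toy_unique)}
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

print(toy_vocab)
print(toy_inverse[toy_vocab["wealth"]])
```

The inverse mapping is what makes decoding possible later: every ID points back to exactly one token.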

In [61]:
# displaying some part of the vocabulary
for i,item in enumerate(vocab.items()):
    if(i<50):
        print(item)

('!', 0)
('#', 1)
('(', 2)
(')', 3)
(',', 4)
('-', 5)
('.', 6)
('/', 7)
('//www', 8)
('0', 9)
('000', 10)
('001', 11)
('017', 12)
('023', 13)
('027', 14)
('029', 15)
('041', 16)
('055', 17)
('068', 18)
('075', 19)
('076', 20)
('083', 21)
('086', 22)
('092', 23)
('0d', 24)
('0¹/³', 25)
('0¼', 26)
('0½', 27)
('0¾', 28)
('0¾d', 29)
('1', 30)
('1-4th', 31)
('1/', 32)
('1/12d', 33)
('1/2', 34)
('1/3', 35)
('1/3d', 36)
('1/6', 37)
('10', 38)
('10/32', 39)
('100', 40)
('1000', 41)
('101', 42)
('102', 43)
('103', 44)
('104', 45)
('105', 46)
('106', 47)
('107', 48)
('108', 49)


# This process is called *Encoding*

In simple terms, 

**Encoding** means turning text (like words or sentences) into numbers (token IDs) that a computer or model can understand.

**Decoding** means turning those numbers (token IDs) back into the original text.

# 

We will now implement simple tokenizer classes to encode and decode text. After that, we will move on to *Types of Tokenizations* and *Byte-Pair Encoding*.

# Tokenizer Class 1:

In [62]:
class Tokenizer1:
    def __init__(self, vocab): #init -> automatically called when an instance of a class is created
        self.str_to_int = vocab 
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) #tokenized the text

        preprocessed = [item.strip() for item in preprocessed if item.strip()] #removing white spaces

        #creating token ids 
        ids = [self.str_to_int[s] for s in preprocessed]

        return ids

    # decoding the ids to get string

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids]) #reversing the method used for vocabulary 
        text = re.sub(r'\s+([,.?!"()\'])',r'\1', text) #replaced the punctuations

        return text

Testing the class by creating an instance of the Tokenizer1 object

In [63]:
tokenizer = Tokenizer1(vocab)

In [64]:
text = "countries which enjoy the highest degree of industry and improvement; what is the work of one man, in a rude state of society,"

In [65]:
# ENCODING
ids = tokenizer.encode(text)
print(ids)

[5214, 13635, 6396, 12692, 7749, 5511, 9669, 8157, 3520, 8010, 1291, 13612, 8411, 12692, 13740, 9669, 9706, 8941, 4, 8025, 3092, 11471, 12182, 9669, 12010, 4]


In [66]:
# DECODING THE IDs
decode = tokenizer.decode(ids)
print(decode)

countries which enjoy the highest degree of industry and improvement ; what is the work of one man, in a rude state of society,


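Thus, our Tokenizer1 class is working correctly. The only difference from the original text is the space before `;`, because the pattern used in decode() does not include `;` or `:`.


But the problem with this approach is that when an unknown word (a word outside the vocabulary) is passed, it throws an error.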

In [67]:
'''
# we can confirm by running this cell

text = "Apple mobiles are good"
# ENCODING
ids = tokenizer.encode(text)
print(ids)
# DECODING THE IDs
decode = tokenizer.decode(ids)
print(decode)
'''

'\n# we can confirm by running this cell\n\ntext = "Apple mobiles are good"\n# ENCODING\nids = tokenizer.encode(text)\nprint(ids)\n# DECODING THE IDs\ndecode = tokenizer.decode(ids)\nprint(decode)\n'
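The failure can also be reproduced without the book, using a toy vocabulary. This is only a sketch: the vocabulary, its IDs, and the helper `encode_strict` are hypothetical and stand in for the direct dictionary lookup done in Tokenizer1.

```python
# Toy vocabulary (hypothetical IDs, for illustration only)
toy_vocab = {"are": 0, "good": 1, "mobiles": 2}

def encode_strict(words, vocab):
    """Look up each word directly; raises KeyError for unknown words."""
    return [vocab[w] for w in words]

try:
    encode_strict(["Apple", "mobiles", "are", "good"], toy_vocab)
    failed_on = None
except KeyError as err:
    # err.args[0] holds the key that was not found
    failed_on = err.args[0]

print(f"Unknown word: {failed_on}")
```

The lookup stops at the first word that is missing from the dictionary, so a single out-of-vocabulary word is enough to make the whole encoding fail.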

To overcome this problem, we use **Special Context Tokens**

**Special Context Tokens** handle cases that ordinary tokens cannot: the <|unk|> token stands in for unknown words, and the <|endoftext|> token marks the boundary between texts when the data is taken from multiple sources.

**We create another Tokenizer class to incorporate these tokens**

We will update the vocabulary by adding the *Special Context Tokens*

In [68]:
all_tokens = sorted(set(preprocessed))
#adding the new tokens
all_tokens.extend(["<|endoftext|>","<|unk|>"])

#giving token IDs to the tokens & creating a vocabulary 
vocab = {token:integer for integer,token in enumerate(all_tokens)}

Thus, the special context tokens have been added to the vocabulary.

We can verify it by-

In [69]:
len(vocab)

13986

As seen, prior to adding the special context tokens the vocab size was 13984.

In [70]:
for i, item in enumerate(list(vocab.items())[-3:]):
    print(item)

('”', 13983)
('<|endoftext|>', 13984)
('<|unk|>', 13985)


Thus, the last two tokens are special context tokens.

We successfully added the *Special Context Tokens*

Now we create the Tokenizer class, which encodes each token depending on whether it is known or unknown.

# Tokenizer Class 2:

In [71]:
class Tokenizer2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        # check each token against the vocabulary; tokens that are not in it become <|unk|>
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        #checking if there are unknown words and assigning special context tokens
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        #if the item is in the vocabulary -> keep the item, else replace it with <|unk|>

        ids = [self.str_to_int[s] for s in preprocessed]

        return ids

    def decode(self, ids): #code remains the same
        text = " ".join([self.int_to_str[i] for i in ids])

        text = re.sub(r'\s+([,.?!"()\'])',r'\1', text)

        return text

Testing the Tokenizer2 class by creating an instance of it

In [72]:
tokenizer = Tokenizer2(vocab)

**Testing for <|unk|> token**

In [73]:
text = """Thirdly, and lastly, everybody must be thoughtful(itisunknown) how much labour
is facilitated and abridged by the application of proper machinery."""

In [74]:
ids = tokenizer.encode(text)
print(f'Encoded => {ids}')

Encoded => [2893, 4, 3520, 8594, 4, 6565, 9383, 3957, 13985, 2, 13985, 3, 7846, 9349, 8543, 8411, 6789, 3520, 3152, 4278, 12692, 3613, 9669, 10668, 8895, 6]


The ID 13985 is the <|unk|> token. It appears twice because neither `thoughtful` nor `itisunknown` is in the vocabulary.

We can decode it as-

In [75]:
original_text = tokenizer.decode(ids)
print(f'Decoded => {original_text}')

Decoded => Thirdly, and lastly, everybody must be <|unk|>( <|unk|>) how much labour is facilitated and abridged by the application of proper machinery.


**Testing for <|endoftext|> token**

In [76]:
text_1 = """Thirdly, and lastly, everybody must be sensible how much labour
is facilitated and abridged by the application of proper machinery."""
text_2 = """It is unnecessary to give any example."""

In [77]:
text = "<|endoftext|>".join((text_1, text_2))

print(text)

Thirdly, and lastly, everybody must be sensible how much labour
is facilitated and abridged by the application of proper machinery.<|endoftext|>It is unnecessary to give any example.


We took 2 text sources and joined them using <|endoftext|>.

In models like GPT, <|endoftext|> tokens are used because the data is taken from many different sources. An <|endoftext|> token indicates that the text before it and the text after it come from different sources.

In [78]:
ids = tokenizer.encode(text)
print(f'Encoded => {ids}')

Encoded => [2893, 4, 3520, 8594, 4, 6565, 9383, 3957, 11699, 7846, 9349, 8543, 8411, 6789, 3520, 3152, 4278, 12692, 3613, 9669, 10668, 8895, 6, 13985, 8411, 13226, 12837, 7411, 3571, 6595, 6]


In [79]:
original_text = tokenizer.decode(ids)
print(f'Decoded => {original_text}')

Decoded => Thirdly, and lastly, everybody must be sensible how much labour is facilitated and abridged by the application of proper machinery. <|unk|> is unnecessary to give any example.


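Note that 13985 is the <|unk|> ID, not <|endoftext|> (which is 13984). Because the two texts were joined without any whitespace, the splitter produced the single token `<|endoftext|>It`, which is not in the vocabulary, so it was mapped to <|unk|> and the word `It` was lost. Joining with `" <|endoftext|> "` (spaces on both sides) keeps the marker as its own token.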
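The splitting behaviour behind this result can be checked with the pattern alone, without the vocabulary. The sketch below uses the same pattern as the tokenizer classes; the helper name `split_tokens` and the shortened example strings are assumptions for illustration.

```python
import re

# Same pattern as in Tokenizer1 and Tokenizer2
pattern = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text):
    """Split with the tokenizer pattern and drop empty or whitespace-only items."""
    return [t.strip() for t in re.split(pattern, text) if t.strip()]

# No whitespace around the marker: it fuses with the next word
glued = split_tokens("machinery.<|endoftext|>It is")

# Whitespace around the marker: it comes out as its own token
spaced = split_tokens("machinery. <|endoftext|> It is")

print(glued)
print(spaced)
```

Only the second form yields a token that matches the `<|endoftext|>` entry in the vocabulary, so only that form would be encoded as 13984.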

**Testing without any special tokens**

In [80]:
text = """Thirdly, and lastly, everybody must be sensible how much labour
is facilitated and abridged by the application of proper machinery."""

In [81]:
ids = tokenizer.encode(text)
print(f'Encoded => {ids}')

Encoded => [2893, 4, 3520, 8594, 4, 6565, 9383, 3957, 11699, 7846, 9349, 8543, 8411, 6789, 3520, 3152, 4278, 12692, 3613, 9669, 10668, 8895, 6]


In [82]:
original_text = tokenizer.decode(ids)
print(f'Decoded => {original_text}')

Decoded => Thirdly, and lastly, everybody must be sensible how much labour is facilitated and abridged by the application of proper machinery.


# 

# Additional Tokens

# 1) BOS (Beginning of Sequence) - signals to the LLM where a piece of content begins
# 2) EOS (End of Sequence) - marks where a text ends; useful when concatenating several texts
# 3) PAD (Padding) - fills shorter texts so that all texts in a batch have the same length
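A minimal sketch of how these three tokens might be applied to a batch of already-encoded texts. The special-token IDs (0, 1, 2), the helper name `add_special_and_pad`, and the input IDs are all hypothetical; real models define their own IDs and conventions.

```python
# Hypothetical special-token IDs (assumed for illustration)
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2

def add_special_and_pad(sequences, bos=BOS_ID, eos=EOS_ID, pad=PAD_ID):
    """Wrap each sequence in BOS/EOS, then right-pad to the longest length."""
    wrapped = [[bos] + seq + [eos] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad] * (max_len - len(seq)) for seq in wrapped]

# Two encoded texts of different lengths (made-up token IDs)
batch = add_special_and_pad([[11, 12, 13], [21]])
print(batch)
```

After padding, every row has the same length, which is what allows the batch to be stored as one rectangular tensor.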

# **Types of Tokenizations:** 

**1. Word-based - every word in a text is a token but it has OOV (out of vocabulary) problems**

**2. Character-based - individual characters (letters) are considered as tokens but the meaning of a word is lost**

**3. *Sub Word-based***

it follows 2 rules:

**Rule1 - don't split frequently used words into sub-words** 

**Rule2 - split rare words into smaller meaningful sub-words**

**(this helps the model learn that different words share a common root, e.g. token, tokens, tokenizing, and also helps it relate different words that share the same suffix)**
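The three approaches can be compared on a two-word example. This is only a sketch: the word-based and character-based splits are computed, but the subword split is hand-picked for illustration, since a real subword tokenizer learns its splits from data.

```python
sentence = "tokenizing tokens"

# Word-based: one token per whitespace-separated word
word_tokens = sentence.split()

# Character-based: one token per character (spaces dropped here for brevity)
char_tokens = [ch for ch in sentence if not ch.isspace()]

# Hand-picked subword split (an assumption, not the output of a trained tokenizer)
subword_tokens = ["token", "izing", "token", "s"]

print(len(word_tokens), len(char_tokens), len(subword_tokens))
print(sorted(set(subword_tokens)))
```

In the subword version the shared root `token` appears as the same piece in both words, which neither of the other two splits makes explicit.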

# Byte-Pair Encoding(BPE)

> Models like **GPT** use this type of encoding. It is a form of *Sub-Word-based Tokenization*.

> In this, the most common words/characters are represented as a single token, while rare words are broken down into 2 or more tokens.

> In BPE, the data is scanned in sequence and the most frequent pair of adjacent symbols (the byte pair) is replaced by a new symbol that does NOT occur in the data. This step is repeated until the desired number of merges is reached.
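The merge step described above can be sketched in a few lines. This is a toy version of the classic BPE training loop under simplifying assumptions (raw UTF-8 bytes as the starting symbols, a fixed number of merges, made-up input string); it is not how tiktoken is implemented, and the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair, or None if there is no pair."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

# Start from raw bytes, so the initial symbols are in the range 0..255
ids = list("aaabdaaabac".encode("utf-8"))
merges = {}
for step in range(3):          # three merges, chosen arbitrarily for the demo
    pair = most_frequent_pair(ids)
    new_id = 256 + step        # first symbol IDs that cannot occur in byte data
    ids = merge_pair(ids, pair, new_id)
    merges[pair] = new_id

print(ids)
print(merges)
```

Each merge shortens the sequence and adds one entry to the merge table; a trained tokenizer replays that table in order to encode new text.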

# Implementing Byte-Pair Encoding

> We use `tiktoken`, a BPE tokenizer library from OpenAI.

In [83]:
# install tiktoken if not installed
# pip install tiktoken

In [84]:
import tiktoken

Instantiating BPE tokenizer

In [85]:
bpe_tokenizer = tiktoken.get_encoding("gpt2")

We use the GPT-2 encoding (its BPE vocabulary and merge rules) to tokenize our data.

In [86]:
text = ("But though this equality of treatment should not be productive")

Counting number of tokens this text has-

In [87]:
print("Token count:", len(bpe_tokenizer.encode(text)))

Token count: 10


In [88]:
ids = bpe_tokenizer.encode(text)
print(f'Token IDs: {ids}')

Token IDs: [1537, 996, 428, 10537, 286, 3513, 815, 407, 307, 12973]


In [89]:
original_text = bpe_tokenizer.decode(ids)
print(f'Original Text: {original_text}')

Original Text: But though this equality of treatment should not be productive


Thus, we successfully used and tested tiktoken's tokenizer.

# 

A final step before creating Vector Embeddings is creating Input-Target pairs. An LLM is an ***Auto-Regressive*** model, i.e. the output of one iteration becomes part of the input of the next iteration. Training is self-supervised: the targets come from the text itself, so no manual labels are needed.

*The main task of an LLM is 'Next Word Prediction'.* Other abilities that LLMs show, such as answering questions, are called *Emergent Behaviour*, since the model wasn't explicitly trained to perform these tasks.

The model is never shown the entire text at once. It is given an input whose length is the context size, and, in line with its purpose (*Next Word Prediction*), it should predict the token at position context+1. The model cannot access any elements past the target. The predicted token is then appended to the previous input, the model predicts the next token, and the process repeats.
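A minimal sketch of this idea, using made-up token IDs in place of a real tokenized text. At each step the context grows by one token and the target is the token that immediately follows it.

```python
# Hypothetical token IDs standing in for a short tokenized text
token_ids = [10, 20, 30, 40, 50]
context_size = 4

pairs = []
for i in range(1, context_size + 1):
    context = token_ids[:i]   # what the model is allowed to see
    target = token_ids[i]     # the next token it should predict
    pairs.append((context, target))
    print(f"{context} -> {target}")
```

A single window of `context_size` tokens therefore contains `context_size` prediction tasks, one per position.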


**Creating Input-Target pairs using Data Loaders**

*Dataloader* -> PyTorch Dataloader is a utility class designed to simplify loading and iterating over datasets while training deep learning models. (Source: GeeksforGeeks)

*Sliding Window* ->  Sliding window problems are computational problems in which a fixed/variable-size window is moved through a data structure, typically an array or string. (Source: GeeksforGeeks)

**Step 1: Importing the data & tokenizing the text using BPE Tokenizer**

In [90]:
# importing tiktoken and creating an instance of the BPE tokenizer object
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

In [91]:
with open('The-Wealth-of-Nations.txt','r', encoding='utf-8') as f:
    raw_text = f.read()

#Tokenizing
enc_text = tokenizer.encode(raw_text)

In [92]:
# length of the text
print(len(enc_text))

527097


In [93]:
print(enc_text[:20])

[2025, 39138, 656, 262, 10362, 290, 46865, 286, 262, 35151, 286, 7973, 416, 7244, 4176, 318, 257, 9207, 286, 383]


**Step 2: Simple demonstration of Input-Target Pairs**

Creating 2 variables x & y where x=input & y=target

The target is the input shifted one position to the right, so the target for each position is the token that follows it.

In [94]:
sample = enc_text[:100]
len(sample)

100

In [95]:
# context size for the input
context_size = 4
# creating input and output variables
xs, ys = [], []
for i in range(len(sample)-context_size):
    x = sample[i:i+context_size]          # input window
    xs.append(x)
    y = sample[i+1:i+context_size+1]      # same window shifted by one token
    ys.append(y)
    print(f'Input -> {x} Output -> {y}')

Input -> [2025, 39138, 656, 262] Output -> [39138, 656, 262, 10362]
Input -> [39138, 656, 262, 10362] Output -> [656, 262, 10362, 290]
Input -> [656, 262, 10362, 290] Output -> [262, 10362, 290, 46865]
Input -> [262, 10362, 290, 46865] Output -> [10362, 290, 46865, 286]
Input -> [10362, 290, 46865, 286] Output -> [290, 46865, 286, 262]
Input -> [290, 46865, 286, 262] Output -> [46865, 286, 262, 35151]
Input -> [46865, 286, 262, 35151] Output -> [286, 262, 35151, 286]
Input -> [286, 262, 35151, 286] Output -> [262, 35151, 286, 7973]
Input -> [262, 35151, 286, 7973] Output -> [35151, 286, 7973, 416]
Input -> [35151, 286, 7973, 416] Output -> [286, 7973, 416, 7244]
Input -> [286, 7973, 416, 7244] Output -> [7973, 416, 7244, 4176]
Input -> [7973, 416, 7244, 4176] Output -> [416, 7244, 4176, 318]
Input -> [416, 7244, 4176, 318] Output -> [7244, 4176, 318, 257]
Input -> [7244, 4176, 318, 257] Output -> [4176, 318, 257, 9207]
Input -> [4176, 318, 257, 9207] Output -> [318, 257, 9207, 286]
Inp

In [96]:
# decoding the same windows back to text (printing only, so xs and ys keep holding token IDs)
for i in range(len(sample)-context_size):
    x = tokenizer.decode(sample[i:i+context_size])
    y = tokenizer.decode(sample[i+1:i+context_size+1])
    print(f'Input -> {x} Output -> {y}')

Input -> An Inquiry into the Output ->  Inquiry into the Nature
Input ->  Inquiry into the Nature Output ->  into the Nature and
Input ->  into the Nature and Output ->  the Nature and Causes
Input ->  the Nature and Causes Output ->  Nature and Causes of
Input ->  Nature and Causes of Output ->  and Causes of the
Input ->  and Causes of the Output ->  Causes of the Wealth
Input ->  Causes of the Wealth Output ->  of the Wealth of
Input ->  of the Wealth of Output ->  the Wealth of Nations
Input ->  the Wealth of Nations Output ->  Wealth of Nations by
Input ->  Wealth of Nations by Output ->  of Nations by Adam
Input ->  of Nations by Adam Output ->  Nations by Adam Smith
Input ->  Nations by Adam Smith Output ->  by Adam Smith is
Input ->  by Adam Smith is Output ->  Adam Smith is a
Input ->  Adam Smith is a Output ->  Smith is a publication
Input ->  Smith is a publication Output ->  is a publication of
Input ->  is a publication of Output ->  a publication of The
Input ->  a public

**Step 3: Implementing Dataloader**

*Dataloader* -> loading & processing data - functionalities for batching, shuffling, and processing data. In PyTorch, DataLoader is a utility that helps you efficiently load your dataset in mini-batches, shuffle it, and use multiple workers to speed up training.

First, we tokenize the text and create input-target chunks.

GPTDataset defines how individual input-target pairs are fetched from the dataset.

In [97]:
import torch
from torch.utils.data import Dataset, DataLoader

In [114]:
class GPTDataset(Dataset):
    #for training a GPT-like model
    def __init__(self, raw_text, tokenizer,maxlength, stride):
        #maxlength here, is like the context_size used in creating simple input-target pairs
        #stride -> controls how much the sliding window moves when new input-target pair is created
        self.input_ids=[]
        self.target_ids=[]

        #tokenizing the text
        token_ids = tokenizer.encode(raw_text, allowed_special={"<|endoftext|>"})
            #BPE tokenizer used
        #using sliding window to chunk input and target
        for i in range(0,len(token_ids) - maxlength, stride): #step value = stride
            input_chunk = token_ids[i:i+maxlength]
            target_chunk = token_ids[i+1:i+maxlength+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    # function for returning the number of input-target pairs
    def __len__(self):
        return len(self.input_ids)
    # function to get the input-target pair at an index
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

Implementing DataLoader

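A larger stride reduces the overlap between consecutive windows. With stride equal to maxlength there is no overlap at all, while the default values below (stride=128, maxlength=256) make neighbouring windows share half of their tokens.

A small stride produces heavily overlapping samples, which can lead to overfitting on next-word prediction.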
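The effect of the stride can be checked with plain index arithmetic. The sketch below mirrors the `range(0, len(token_ids) - maxlength, stride)` loop from GPTDataset; the helper name `window_starts` and the small numbers are assumptions for illustration.

```python
def window_starts(num_tokens, maxlength, stride):
    """Start indices of the sliding windows, mirroring the loop in GPTDataset."""
    return list(range(0, num_tokens - maxlength, stride))

num_tokens, maxlength = 20, 8   # small illustrative numbers

for stride in (1, 4, 8):
    starts = window_starts(num_tokens, maxlength, stride)
    overlap = max(maxlength - stride, 0)   # tokens shared by neighbouring windows
    print(f"stride={stride}: {len(starts)} windows, overlap={overlap} tokens, starts={starts}")
```

A smaller stride gives more training samples from the same text, at the cost of each token appearing in many nearly identical windows.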

In [115]:
def data_loader(raw_text, batch_size=4, maxlength=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    # batch_size= How many samples the model processes at once before updating weights
    #drop_last = drops the last incomplete batch
    #num_workers=How many subprocesses to use for loading data in parallel

    #creating a tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    #created dataset
    dataset=GPTDataset(raw_text, tokenizer, maxlength, stride)

    #creating dataloader
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last,num_workers=num_workers)

    return dataloader

**Testing the dataloader**

In [116]:
# creating a dataloader

dataloader = data_loader(raw_text, batch_size=4, maxlength=8, stride=4, shuffle=False)

#iterating over the data
data_iter = iter(dataloader)
batch_n = next(data_iter)

print(batch_n)

[tensor([[ 2025, 39138,   656,   262, 10362,   290, 46865,   286],
        [10362,   290, 46865,   286,   262, 35151,   286,  7973],
        [  262, 35151,   286,  7973,   416,  7244,  4176,   318],
        [  416,  7244,  4176,   318,   257,  9207,   286,   383]]), tensor([[39138,   656,   262, 10362,   290, 46865,   286,   262],
        [  290, 46865,   286,   262, 35151,   286,  7973,   416],
        [35151,   286,  7973,   416,  7244,  4176,   318,   257],
        [ 7244,  4176,   318,   257,  9207,   286,   383,   198]])]


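The output is one batch: a list of two tensors, the inputs followed by the targets. Each tensor has *4 rows* (batch_size=4) of *8 tokens* (maxlength=8, which plays the role of context_size). Each target row is its input row shifted by one token, and because stride=4, each row starts 4 tokens after the previous one.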