# **TOKENIZATION**
# What is Tokenization? 
# Tokenization is a fundamental process in Natural Language Processing (NLP) that involves breaking down a stream of text into smaller units called tokens.
(Source: GeeksforGeeks)


*Data (Book) : The Wealth of Nations by Adam Smith*

**Loading & Reading the Data**

In [149]:
with open('The-Wealth-of-Nations.txt', encoding='utf-8') as file:
    raw_text = file.read()

In [150]:
# Number of characters in the text
print(len(raw_text))

2248172


The text has more than 22 lakh (about 2.2 million) characters, but most of them are repeats; the number of distinct characters is far smaller.

**We use regular expressions to split the text on patterns. The book contains punctuation, and we keep the punctuation marks as separate tokens because punctuation can change the meaning of a sentence.**

In [151]:
# Importing the regular expression library i.e re
import re

In [152]:
# splitting the text data using the patterns using the re library
preprocessed = re.split(r'([,.!:;?_"()\']|--|\s)', raw_text)

In [153]:
preprocessed 

['An',
 ' ',
 'Inquiry',
 ' ',
 'into',
 ' ',
 'the',
 ' ',
 'Nature',
 ' ',
 'and',
 ' ',
 'Causes',
 ' ',
 'of',
 ' ',
 'the',
 ' ',
 'Wealth',
 ' ',
 'of',
 ' ',
 'Nations',
 ' ',
 'by',
 ' ',
 'Adam',
 ' ',
 'Smith',
 ' ',
 'is',
 ' ',
 'a',
 ' ',
 'publication',
 ' ',
 'of',
 ' ',
 'The',
 '\n',
 'Electronic',
 ' ',
 'Classics',
 ' ',
 'Series',
 '.',
 '',
 ' ',
 'This',
 ' ',
 'Portable',
 ' ',
 'Document',
 ' ',
 'file',
 ' ',
 'is',
 ' ',
 'furnished',
 ' ',
 'free',
 ' ',
 'and',
 ' ',
 'without',
 ' ',
 'any',
 ' ',
 'charge',
 ' ',
 'of',
 ' ',
 'any',
 '\n',
 'kind',
 '.',
 '',
 ' ',
 'Any',
 ' ',
 'person',
 ' ',
 'using',
 ' ',
 'this',
 ' ',
 'document',
 ' ',
 'file',
 ',',
 '',
 ' ',
 'for',
 ' ',
 'any',
 ' ',
 'purpose',
 ',',
 '',
 ' ',
 'and',
 ' ',
 'in',
 ' ',
 'any',
 ' ',
 'way',
 ' ',
 'does',
 ' ',
 'so',
 ' ',
 'at',
 ' ',
 'his',
 ' ',
 'or',
 ' ',
 'her',
 ' ',
 'own',
 '\n',
 'risk',
 '.',
 '',
 ' ',
 'Neither',
 ' ',
 'the',
 ' ',
 'Pennsylvania',
 ' ',


The output above also contains whitespace items (`' '`, `'\n'`) and empty strings (`''`).

We remove them and keep only the actual tokens.

In [154]:
preprocessed = [item for item in preprocessed if item.split()]

The tokens (words and punctuation marks) have now been created.

In [155]:
preprocessed[:20]

['An',
 'Inquiry',
 'into',
 'the',
 'Nature',
 'and',
 'Causes',
 'of',
 'the',
 'Wealth',
 'of',
 'Nations',
 'by',
 'Adam',
 'Smith',
 'is',
 'a',
 'publication',
 'of',
 'The']

Thus, we successfully removed the white spaces from the list.
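To see both steps on a small scale, here is a minimal sketch that applies the same split pattern and the same whitespace filter to a short made-up sentence. The `sample` string and the variable names are illustrative assumptions, not taken from the book.

```python
import re

# Hypothetical sample sentence, chosen only to exercise the pattern.
sample = "Hello, world. Is this-- a test?"

# The capturing group keeps the delimiters (punctuation, "--", whitespace)
# in the result instead of discarding them.
parts = re.split(r'([,.!:;?_"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items.
tokens = [part for part in parts if part.strip()]

print(parts)   # still contains '' and ' ' entries
print(tokens)  # only words and punctuation marks remain
```

Because the delimiters sit inside a capturing group, `re.split` returns them as list items, which is why the punctuation survives while the whitespace can be filtered out afterwards.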

# Vocabulary: In the context of Large Language Models (LLMs), the vocabulary is the set of all tokens (words, subwords, or characters) that the model knows and can understand.

We create a vocabulary from the given data so that the model, when trained and used, knows which tokens are available to it.

A vocabulary is a dictionary that maps each token to a unique ID.

Thus, the tokens in the vocabulary must be unique, with no repeats.


**Creating vocabulary**

In [156]:
# First: get all unique words sorted alphabetically
unique_words = sorted(set(preprocessed))

In [157]:
vocab_size = len(unique_words)
vocab_size

13984

So, from a text of more than 22 lakh characters, we get only 13,984 unique tokens (words and punctuation marks).

**Now, we create a *Vocabulary* and assign unique *Token IDs* to the tokens (here, words)**

In [158]:
vocab = {token:integer for integer, token in enumerate(unique_words)}

In the dictionary comprehension, we swap the order of integer and token so that each token becomes a key and its token ID (the integer) becomes the value.

The *enumerate()* function loops through all elements and yields each element's index together with its value.

A vocabulary is a dictionary that provides a mapping from token to token ID.
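As a small illustration of the same mapping, the sketch below builds a vocabulary from a handful of made-up tokens. The `toy_tokens` list is an assumption for demonstration only.

```python
# Hypothetical token list with repeats ("the", "of", "wealth" appear twice).
toy_tokens = ["the", "wealth", "of", "nations", ",", "the", "nature", "of", "wealth"]

# set() removes duplicates; sorted() fixes a reproducible order.
toy_unique = sorted(set(toy_tokens))

# enumerate() yields (index, value) pairs; we store them as token -> ID.
toy_vocab = {token: token_id for token_id, token in enumerate(toy_unique)}

print(toy_vocab)
# {',': 0, 'nations': 1, 'nature': 2, 'of': 3, 'the': 4, 'wealth': 5}
```

Nine tokens collapse to six vocabulary entries, and punctuation sorts before letters, which matches the ordering seen in the real vocabulary.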

In [159]:
# displaying some part of the vocabulary
for i,item in enumerate(vocab.items()):
    if(i<50):
        print(item)

('!', 0)
('#', 1)
('(', 2)
(')', 3)
(',', 4)
('-', 5)
('.', 6)
('/', 7)
('//www', 8)
('0', 9)
('000', 10)
('001', 11)
('017', 12)
('023', 13)
('027', 14)
('029', 15)
('041', 16)
('055', 17)
('068', 18)
('075', 19)
('076', 20)
('083', 21)
('086', 22)
('092', 23)
('0d', 24)
('0¹/³', 25)
('0¼', 26)
('0½', 27)
('0¾', 28)
('0¾d', 29)
('1', 30)
('1-4th', 31)
('1/', 32)
('1/12d', 33)
('1/2', 34)
('1/3', 35)
('1/3d', 36)
('1/6', 37)
('10', 38)
('10/32', 39)
('100', 40)
('1000', 41)
('101', 42)
('102', 43)
('103', 44)
('104', 45)
('105', 46)
('106', 47)
('107', 48)
('108', 49)


# This process is called *Encoding*

In simple terms, 

**Encoding** means turning text (like words or sentences) into numbers (token IDs) that a computer or model can understand.

**Decoding** means turning those numbers (token IDs) back into the original text.
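A minimal round trip may make the two definitions concrete. The vocabulary below is a hand-written toy (an assumption for illustration), not the one built from the book.

```python
# Toy vocabulary (token -> ID) and its inverse (ID -> token).
toy_vocab = {",": 0, "nations": 1, "of": 2, "the": 3, "wealth": 4}
toy_inverse = {token_id: token for token, token_id in toy_vocab.items()}

tokens = ["the", "wealth", "of", "nations"]

# Encoding: look up each token's ID.
ids = [toy_vocab[token] for token in tokens]

# Decoding: look up each ID's token and join the tokens with spaces.
decoded = " ".join(toy_inverse[token_id] for token_id in ids)

print(ids)      # [3, 4, 2, 1]
print(decoded)  # the wealth of nations
```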


We will now implement simple tokenizer classes to encode and decode text. Then we will move on to *Types of Tokenizations* and *Byte-Pair Encoding*.

# Tokenizer Class 1:

In [160]:
class Tokenizer1:
    def __init__(self, vocab): #init -> automatically called when an instance of a class is created
        self.str_to_int = vocab 
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) #tokenized the text

        preprocessed = [item.strip() for item in preprocessed if item.strip()] #removing white spaces

        #creating token ids 
        ids = [self.str_to_int[s] for s in preprocessed]

        return ids

    # decoding the ids to get string

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids]) #map each ID back to its token and join the tokens with spaces
        text = re.sub(r'\s+([,.?!"()\'])',r'\1', text) #remove the space in front of punctuation marks

        return text

Testing the class by creating an instance of the Tokenizer1 object

In [161]:
tokenizer = Tokenizer1(vocab)

In [162]:
text = "countries which enjoy the highest degree of industry and improvement; what is the work of one man, in a rude state of society,"

In [163]:
# ENCODING
ids = tokenizer.encode(text)
print(ids)

[5214, 13635, 6396, 12692, 7749, 5511, 9669, 8157, 3520, 8010, 1291, 13612, 8411, 12692, 13740, 9669, 9706, 8941, 4, 8025, 3092, 11471, 12182, 9669, 12010, 4]


In [164]:
# DECODING THE IDs
decode = tokenizer.decode(ids)
print(decode)

countries which enjoy the highest degree of industry and improvement ; what is the work of one man, in a rude state of society,


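Thus, our Tokenizer1 class is working correctly. The only difference from the input is the extra space before `;`, because the pattern in `decode` does not include `;` (or `:`).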


The problem with this approach is that when an unknown word (a word outside the vocabulary) is passed, it throws an error.
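The failure is an ordinary dictionary lookup error. The sketch below reproduces it with a two-word toy vocabulary (an assumption for illustration) so that it runs without the book data.

```python
# Toy vocabulary that does not contain the word "mobiles".
toy_vocab = {"are": 0, "good": 1}

missing = None
try:
    ids = [toy_vocab[word] for word in ["mobiles", "are", "good"]]
except KeyError as err:
    # The KeyError carries the key that was not found.
    missing = err.args[0]
    print(f"Unknown token: {missing!r}")
```

The lookup stops at the first missing key, so no IDs are returned at all for the sentence.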

In [165]:
# Uncomment and run this cell to confirm: encode() raises a KeyError
# as soon as it meets a token that is not in the vocabulary.
#
# text = "Apple mobiles are good"
# ids = tokenizer.encode(text)
# print(ids)
# decode = tokenizer.decode(ids)
# print(decode)

To overcome this problem, we use **Special Context Tokens**

**Special Context Tokens** handle such cases: the <|unk|> token stands in for unknown words, and the <|endoftext|> token marks the boundary between texts when data is taken from multiple sources.

**We create another Tokenizer class to incorporate these tokens**

We will update the vocabulary by adding the *Special Context Tokens*

In [166]:
all_tokens = sorted(set(preprocessed))
#adding the new tokens
all_tokens.extend(["<|endoftext|>","<|unk|>"])

#giving token IDs to the tokens & creating a vocabulary 
vocab = {token:integer for integer,token in enumerate(all_tokens)}

Thus, the special context tokens have been added to the vocabulary.

We can verify it by-

In [167]:
len(vocab)

13986

As seen, prior to adding the special context tokens the vocab size was 13984.

In [168]:
for item in list(vocab.items())[-3:]:
    print(item)

('”', 13983)
('<|endoftext|>', 13984)
('<|unk|>', 13985)


Thus, the last two tokens are special context tokens.

We successfully added the *Special Context Tokens*

Now we create a Tokenizer class that encodes each token according to whether or not it is in the vocabulary.

# Tokenizer Class 2:

In [169]:
class Tokenizer2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        #check whether each token is in the vocabulary; if it is not, replace it with <|unk|>
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        #checking if there are unknown words and assigning special context tokens
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        #if the item is in the vocabulary -> keep it, otherwise use <|unk|>

        ids = [self.str_to_int[s] for s in preprocessed]

        return ids

    def decode(self, ids): #code remains the same
        text = " ".join([self.int_to_str[i] for i in ids])

        text = re.sub(r'\s+([,.?!"()\'])',r'\1', text)

        return text

Testing the Tokenizer2 class by creating an instance of it

In [170]:
tokenizer = Tokenizer2(vocab)

**Testing for <|unk|> token**

In [171]:
text = """Thirdly, and lastly, everybody must be thoughtful(itisunknown) how much labour
is facilitated and abridged by the application of proper machinery."""

In [172]:
ids = tokenizer.encode(text)
print(f'Encoded => {ids}')

Encoded => [2893, 4, 3520, 8594, 4, 6565, 9383, 3957, 13985, 2, 13985, 3, 7846, 9349, 8543, 8411, 6789, 3520, 3152, 4278, 12692, 3613, 9669, 10668, 8895, 6]


The ID 13985 is the <|unk|> token. Here both "thoughtful" and "itisunknown" are outside the vocabulary, so each is encoded as 13985.

We can decode it as-

In [173]:
original_text = tokenizer.decode(ids)
print(f'Decoded => {original_text}')

Decoded => Thirdly, and lastly, everybody must be <|unk|>( <|unk|>) how much labour is facilitated and abridged by the application of proper machinery.


**Testing for <|endoftext|> token**

In [174]:
text_1 = """Thirdly, and lastly, everybody must be sensible how much labour
is facilitated and abridged by the application of proper machinery."""
text_2 = """It is unnecessary to give any example."""

In [175]:
text = "<|endoftext|>".join((text_1, text_2))

print(text)

Thirdly, and lastly, everybody must be sensible how much labour
is facilitated and abridged by the application of proper machinery.<|endoftext|>It is unnecessary to give any example.


We took two texts and joined them using <|endoftext|>.

Models like GPT use <|endoftext|> tokens because their training data comes from many different sources. The <|endoftext|> token indicates where one source ends and the next begins.

In [176]:
ids = tokenizer.encode(text)
print(f'Encoded => {ids}')

Encoded => [2893, 4, 3520, 8594, 4, 6565, 9383, 3957, 11699, 7846, 9349, 8543, 8411, 6789, 3520, 3152, 4278, 12692, 3613, 9669, 10668, 8895, 6, 13985, 8411, 13226, 12837, 7411, 3571, 6595, 6]


In [177]:
original_text = tokenizer.decode(ids)
print(f'Decoded => {original_text}')

Decoded => Thirdly, and lastly, everybody must be sensible how much labour is facilitated and abridged by the application of proper machinery. <|unk|> is unnecessary to give any example.


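Note that token ID 13985 is the <|unk|> token, not <|endoftext|> (which has ID 13984). Because the two texts were joined without surrounding spaces, the split produced the single token `<|endoftext|>It`, which is not in the vocabulary, so it was encoded as <|unk|> and the word "It" was lost. Joining with `" <|endoftext|> "` (a space on each side) keeps <|endoftext|> as a separate token.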
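The decoded output above shows `<|unk|>` where the separator was inserted, and the word "It" is missing. A minimal sketch of why, using the same split pattern as Tokenizer2 on a toy vocabulary (the vocabulary and the standalone `encode` helper are assumptions for illustration):

```python
import re

# Toy vocabulary including both special tokens.
toy_vocab = {".": 0, "It": 1, "is": 2, "machinery": 3,
             "<|endoftext|>": 4, "<|unk|>": 5}


def encode(text):
    """Split like Tokenizer2.encode and map unknown tokens to <|unk|>."""
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [part.strip() for part in parts if part.strip()]
    return [toy_vocab.get(token, toy_vocab["<|unk|>"]) for token in tokens]


# Without spaces, "<|endoftext|>It" stays glued together as one unknown token.
fused = encode("machinery.<|endoftext|>It is")

# With spaces around the separator, it is split off and found in the vocabulary.
spaced = encode("machinery. <|endoftext|> It is")

print(fused)   # [3, 0, 5, 2]
print(spaced)  # [3, 0, 4, 1, 2]
```

The pattern only splits on punctuation, `--`, and whitespace, so the separator needs whitespace around it to become its own token.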

**Testing without any special tokens**

In [178]:
text = """Thirdly, and lastly, everybody must be sensible how much labour
is facilitated and abridged by the application of proper machinery."""

In [179]:
ids = tokenizer.encode(text)
print(f'Encoded => {ids}')

Encoded => [2893, 4, 3520, 8594, 4, 6565, 9383, 3957, 11699, 7846, 9349, 8543, 8411, 6789, 3520, 3152, 4278, 12692, 3613, 9669, 10668, 8895, 6]


In [180]:
original_text = tokenizer.decode(ids)
print(f'Decoded => {original_text}')

Decoded => Thirdly, and lastly, everybody must be sensible how much labour is facilitated and abridged by the application of proper machinery.



# Additional Tokens

# 1) BOS (Beginning of Sequence) - signals to the LLM where a piece of content begins
# 2) EOS (End of Sequence) - marks where a text ends; useful when concatenating multiple texts
# 3) PAD (Padding) - ensures that all texts in a batch have the same length
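A minimal sketch of how these three tokens might be applied to a batch of already-encoded sequences. The ID values `PAD_ID`, `BOS_ID`, and `EOS_ID` and the sample sequences are hypothetical; real models define their own.

```python
# Hypothetical special-token IDs, chosen only for this example.
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2

# Three encoded texts of different lengths.
sequences = [[5, 6, 7], [8, 9], [10]]

# Mark the beginning and end of every sequence.
wrapped = [[BOS_ID] + seq + [EOS_ID] for seq in sequences]

# Pad on the right so that all sequences match the longest one.
max_len = max(len(seq) for seq in wrapped)
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in wrapped]

for row in padded:
    print(row)
# [1, 5, 6, 7, 2]
# [1, 8, 9, 2, 0]
# [1, 10, 2, 0, 0]
```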

# **Types of Tokenizations:** 

**1. Word-based - every word in a text is a token but it has OOV (out of vocabulary) problems**

**2. Character-based - individual characters (letters) are considered as tokens but the meaning of a word is lost**

**3. *Sub Word-based***

It follows two rules:

**Rule 1 - don't split frequently used words into sub-words**

**Rule 2 - split rare words into smaller, meaningful sub-words**

**(This helps the model learn that different words share a common root, e.g. token, tokens, tokenizing, and helps it differentiate between different words that share the same suffix.)**
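The trade-off between the three types can be seen by counting tokens on a tiny example. In the sketch below, the word-level and character-level splits are computed, while the sub-word split is written by hand purely for illustration (it is an assumption, not the output of a trained tokenizer).

```python
text = "tokenizing tokens"

# Word-based: one token per whitespace-separated word.
word_tokens = text.split()

# Character-based: one token per letter (spaces dropped for simplicity).
char_tokens = list(text.replace(" ", ""))

# Sub-word-based: hand-written split that reuses the shared root "token".
subword_tokens = ["token", "izing", "token", "s"]

for name, tokens in [("word", word_tokens),
                     ("char", char_tokens),
                     ("subword", subword_tokens)]:
    print(f"{name:8} length={len(tokens):2}  vocabulary={len(set(tokens))}")
```

Word tokens give the shortest sequence but treat "tokenizing" and "tokens" as unrelated entries; character tokens need only nine symbols but produce a sequence of length 16; the sub-word split sits in between and exposes the common root.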

# Byte-Pair Encoding(BPE)

> Models like **GPT** use this type of encoding. It is a form of *sub-word-based tokenization*.

> In this, the most common words/characters are represented as a single token, while rare words are broken down into 2 or more tokens.

> In BPE, the data is scanned in sequence, and the most frequent pair of adjacent bytes (the byte pair) is replaced by a new byte (symbol) that does NOT occur in the data. This step is repeated on the result.
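The merge step described above can be sketched on a short string. This is a simplified illustration under stated assumptions: it works on characters rather than bytes, uses the placeholder symbols `Z`, `Y`, `X` (hypothetical, chosen because they do not occur in the data), and is not how tiktoken is implemented internally.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return (pair, count) for the most frequent adjacent pair, or None."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    if not pair_counts:
        return None
    return pair_counts.most_common(1)[0]


def merge_pair(symbols, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair`, left to right."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


symbols = list("aaabdaaabac")
merges = {}  # new symbol -> the pair it replaces

for new_symbol in ["Z", "Y", "X"]:
    best = most_frequent_pair(symbols)
    if best is None or best[1] < 2:
        break  # no pair repeats, so further merging gains nothing
    pair, _count = best
    merges[new_symbol] = pair
    symbols = merge_pair(symbols, pair, new_symbol)
    print(new_symbol, "=", pair, "->", "".join(symbols))
```

Three merges shrink the eleven-character string to `XdXac`. The `merges` table records how to expand each new symbol again, which is what makes the encoding reversible.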

# Implementing Byte-Pair Encoding

> We use the tiktoken library, a BPE tokenizer library from OpenAI.

In [181]:
# install tiktoken if not installed
# pip install tiktoken

In [182]:
import tiktoken

Instantiating BPE tokenizer

In [183]:
bpe_tokenizer = tiktoken.get_encoding("gpt2")

We use the GPT-2 vocabulary and merge rules to tokenize our data.

In [184]:
text = ("But though this equality of treatment should not be productive")

Counting the number of tokens this text has:

In [185]:
print("Token count:", len(bpe_tokenizer.encode(text)))

In [186]:
ids = bpe_tokenizer.encode(text)
print(f'Token IDs: {ids}')

In [187]:
original_text = bpe_tokenizer.decode(ids)
print(f'Original Text: {original_text}')

Original Text: But though this equality of treatment should not be productive


Note that we call `bpe_tokenizer` here, not `tokenizer`: `tokenizer` is still the word-level Tokenizer2 instance, whose IDs come from our own 13,986-token vocabulary. The IDs returned by `bpe_tokenizer` come from the GPT-2 vocabulary instead, and decoding them reproduces the input text exactly.

Thus, we successfully implemented and tested tiktoken's tokenizer.