
# Abstract
Tokenization is the process of converting a sequence of text into individual units, commonly known as tokens. In an NLP context, tokens can represent words, subwords, or even characters.
The primary goal is to convert raw text into a format that computational models can analyze more easily.

## Role in Large Language Models (LLMs):
1. **Training Phase:** Data Preprocessing, Sequence Alignment
2. **Inference Phase:** Query Understanding, Output Generation

## Types of Tokenization

1. Word Tokenization
2. Subword Tokenization
3. Character Tokenization
4. Morphological Tokenization
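
Character tokenization, the third type in the list, is the easiest to sketch: every character becomes a token. The snippet below is a minimal illustration, not part of the original notebook; the function name `character_tokenization` is an assumption chosen to match the naming used in the later cells.

```python
def character_tokenization(text):
    # Every character, including spaces and punctuation, becomes one token
    return list(text)

text = 'Hi there!'
tokens = character_tokenization(text)
print('Tokens: ', tokens)
print('Distinct tokens: ', len(set(tokens)))
```

The set of distinct tokens stays very small, but the token sequence is as long as the text itself, which is the trade-off that subword methods try to balance.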

### 1. Word Tokenization:

Word tokenization is one of the earliest and simplest forms of text segmentation. It generally involves splitting a sequence of text into individual words.

Two common approaches:
1. Whitespace Tokenization
2. Rule-Based Tokenization

#### 1.1. Whitespace Tokenization
The most basic form of word tokenization is whitespace tokenization, which splits text on whitespace. Punctuation stays attached to the neighboring word.

In [1]:
# Whitespace tokenization

def whitespace_tokenization(text):
    # str.split() with no arguments splits on any run of whitespace
    return text.split()

text = 'Hello, this is an example text to demonstrate whitespace tokenization'

tokens = whitespace_tokenization(text)
print('Tokens: ', tokens)

Tokens:  ['Hello,', 'this', 'is', 'an', 'example', 'text', 'to', 'demonstrate', 'whitespace', 'tokenization']
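
One detail worth seeing in isolation is the difference between `split()` and `split(' ')`. The sketch below uses a made-up sample string with a tab, a newline, and doubled spaces; it only relies on the documented behavior of `str.split`.

```python
sample = 'Hello,\tworld!  Two  spaces\nand a newline'

# No argument: any run of whitespace (spaces, tabs, newlines) is one separator
by_whitespace = sample.split()

# Explicit ' ': only single spaces separate, so doubled spaces yield empty strings
by_single_space = sample.split(' ')

print(by_whitespace)
print(by_single_space)
```

Calling `split()` without an argument, as `whitespace_tokenization` does, is therefore usually the safer default for raw text.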


#### 1.2. Rule-Based Tokenization
This approach uses a set of predefined rules and regex patterns to identify tokens. Example:

In [2]:
# Rule-based tokenization

import re


def rule_base_tokenization(text):
    # Match a word with an optional apostrophe suffix (e.g. "Isn't"),
    # or any single character that is neither a word character nor whitespace
    pattern = r"\w+(?:'\w+)?|[^\w\s]"
    
    tokens = re.findall(pattern, text)
    
    return tokens

text = '''Hello, this is an example text to demonstrate rule-based tokenization! Isn't it great?'''

tokens = rule_base_tokenization(text)
print('Tokens: ', tokens)

Tokens:  ['Hello', ',', 'this', 'is', 'an', 'example', 'text', 'to', 'demonstrate', 'rule', '-', 'based', 'tokenization', '!', "Isn't", 'it', 'great', '?']
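
To make the pattern less opaque, the sketch below runs its two alternatives separately and then together. The sample sentences and the names `word_part` and `punct_part` are made up for illustration.

```python
import re

word_part = r"\w+(?:'\w+)?"   # a word, optionally followed by an apostrophe suffix
punct_part = r"[^\w\s]"       # one character that is neither a word character nor whitespace
pattern = word_part + "|" + punct_part

sample = "Don't panic: it's rule-based!"

words_only = re.findall(word_part, sample)
punct_only = re.findall(punct_part, sample)
combined = re.findall(pattern, sample)

print(words_only)
print(punct_only)
print(combined)

# A limitation of this simple rule set: decimal numbers are split at the dot
decimal = re.findall(pattern, "Pi is 3.14")
print(decimal)
```

Because the word alternative comes first, apostrophes inside words are consumed before the punctuation alternative can match them on their own.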


##### Comparison between whitespace and rule-based tokenization

In [3]:
# use the same text for both tokenizers
text = '''Hello, this is an example text to demonstrate rule-based tokenization! Isn't it great?'''

tokens = whitespace_tokenization(text)
print('Tokens by Whitespace Tokenization:\n', tokens)

tokens = rule_base_tokenization(text)
print('Tokens by Rule-base Tokenization:\n', tokens)

Tokens by Whitespace Tokenization:
 ['Hello,', 'this', 'is', 'an', 'example', 'text', 'to', 'demonstrate', 'rule-based', 'tokenization!', "Isn't", 'it', 'great?']
Tokens by Rule-base Tokenization:
 ['Hello', ',', 'this', 'is', 'an', 'example', 'text', 'to', 'demonstrate', 'rule', '-', 'based', 'tokenization', '!', "Isn't", 'it', 'great', '?']


### 2. Subword Tokenization

Subword tokenization techniques operate at a level between words and characters, aiming to capture meaningful linguistic units smaller than a word but larger than a character.

Three common algorithms:

1. Byte Pair Encoding (BPE)
2. WordPiece 
3. SentencePiece
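
WordPiece is not implemented in this notebook, so the following is only a rough sketch of the greedy longest-match-first lookup that BERT-style WordPiece tokenizers apply once a vocabulary exists. The vocabulary `toy_vocab`, the function name, and the `[UNK]` fallback are assumptions made for illustration; how the vocabulary is learned is not shown.

```python
def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {'token', '##ization', '##ize', 'un', '##related'}  # hypothetical vocabulary

print(wordpiece_tokenize('tokenization', toy_vocab))
print(wordpiece_tokenize('xyz', toy_vocab))
```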

#### 2.1. Byte Pair Encoding (BPE)

BPE works by iteratively merging the most frequently occurring pair of adjacent characters or character sequences. The following is a simplified illustration of how BPE builds its tokens:
* **Initialization:** Start with individual characters or symbols as the basic tokens
* **Frequency Count:** Count all adjacent pairs of tokens in the dataset
* **Merge:** Identify the most frequent pair of tokens and merge them into a single new token
* **Repeat:** Repeat the frequency count and merge steps until a specified number of merges has been reached or the vocabulary reaches the desired size

Example:

Suppose the data to be encoded is
> aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:
> ZabdZabac

> Z=aa

Then the process is repeated with byte pair "ab", replacing it with "Y":
> ZYdZYac

> Y=ab

> Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":
> XdXac

> X=ZY

> Y=ab

> Z=aa

*This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.*
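
The worked example can be replayed in a few lines. This is a minimal sketch: the replacement table is hard-coded from the example rather than learned, and the names `compress` and `decompress` are assumptions made for illustration.

```python
def compress(data, table):
    # Apply each (symbol, pair) replacement in the order it was created
    for symbol, pair in table:
        data = data.replace(pair, symbol)
    return data

def decompress(data, table):
    # Undo the replacements, newest first
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data

table = [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]

packed = compress('aaabdaaabac', table)
restored = decompress(packed, table)

print(packed)
print(restored)
```

Undoing the newest replacement first matters because `X` expands into `Z` and `Y`, which must then be expanded themselves.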

**N.B.:** To decompress the data, simply perform the replacements in reverse order.

Below is a basic Python implementation of BPE:

In [4]:
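import re
from collections import defaultdict, Counter

def get_vocab(text):
    """Split each word into space-separated symbols and count word frequencies"""
    
    # 'this' becomes 't h i s' so that get_stats can see adjacent symbol pairs
    vocab = Counter(' '.join(word) for word in text.split())
    
    # convert vocabulary to the format {'w o r d': count}
    
    return dict(vocab)

print(get_vocab('this is an example. I am an engineer'))
    

{'t h i s': 1, 'i s': 1, 'a n': 2, 'e x a m p l e .': 1, 'I': 1, 'a m': 1, 'e n g i n e e r': 1}

In [5]:

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency"""
    
    pairs = defaultdict(int)

    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

vocab = get_vocab('this is an example. I am an engineer.')
print(dict(get_stats(vocab)))

{('t', 'h'): 1, ('h', 'i'): 1, ('i', 's'): 2, ('a', 'n'): 2, ('e', 'x'): 1, ('x', 'a'): 1, ('a', 'm'): 2, ('m', 'p'): 1, ('p', 'l'): 1, ('l', 'e'): 1, ('e', '.'): 1, ('e', 'n'): 1, ('n', 'g'): 1, ('g', 'i'): 1, ('i', 'n'): 1, ('n', 'e'): 1, ('e', 'e'): 1, ('e', 'r'): 1, ('r', '.'): 1}

In [6]:
def merge_vocab(pair, vocab):
    """Merge every occurrence of the symbol pair into a single symbol"""
    new_vocab = {}
    # match the pair only on whole-symbol boundaries
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    for word in vocab:
        new_word = pattern.sub(lambda m: replacement, word)
        new_vocab[new_word] = vocab[word]
        
    return new_vocab



In [7]:
def bpe_tokenize(text, number_merges):
    vocab = get_vocab(text)
    for i in range(number_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        # ties are resolved in favor of the pair seen first
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        
    tokens = set()
    for word in vocab:
        tokens.update(word.split())
        
    return tokens

text = 'this is an example. I am an engineer. this is test'

tokens = bpe_tokenize(text, 10)
print(sorted(tokens))

['.', 'I', 'am', 'an', 'e', 'example', 'g', 'i', 'is', 'n', 'r', 's', 't', 'this']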

##### Byte-level BPE for LLMs

In [8]:
def encoding(text):
    # UTF-8 encode the text; each raw byte becomes an integer ID in 0..255
    return list(text.encode('utf-8'))

text = 'this is an example. I am an engineer. this is test'
tokens = encoding(text)
print(tokens)
print('\nlen of text', len(text))
print('len of token', len(tokens))



[116, 104, 105, 115, 32, 105, 115, 32, 97, 110, 32, 101, 120, 97, 109, 112, 108, 101, 46, 32, 73, 32, 97, 109, 32, 97, 110, 32, 101, 110, 103, 105, 110, 101, 101, 114, 46, 32, 116, 104, 105, 115, 32, 105, 115, 32, 116, 101, 115, 116]

len of text 50
len of token 50
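
The two lengths match here only because the sample text is pure ASCII. A short sketch with a made-up non-ASCII string shows that characters and bytes can differ:

```python
text = 'caf\u00e9'  # 'café', written with an escape so the source stays ASCII

byte_ids = list(text.encode('utf-8'))

print(byte_ids)
print('len of text', len(text))
print('len of token', len(byte_ids))
```

The accented letter takes two bytes in UTF-8 (195 and 169), so four characters become five byte-level tokens.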


In [9]:
def get_stats(ids):
    """Count how often each adjacent pair of IDs occurs (replaces the word-level get_stats above)"""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)

print(stats)

{(116, 104): 2, (104, 105): 2, (105, 115): 4, (115, 32): 4, (32, 105): 2, (32, 97): 3, (97, 110): 2, (110, 32): 2, (32, 101): 2, (101, 120): 1, (120, 97): 1, (97, 109): 2, (109, 112): 1, (112, 108): 1, (108, 101): 1, (101, 46): 1, (46, 32): 2, (32, 73): 1, (73, 32): 1, (109, 32): 1, (101, 110): 1, (110, 103): 1, (103, 105): 1, (105, 110): 1, (110, 101): 1, (101, 101): 1, (101, 114): 1, (114, 46): 1, (32, 116): 2, (116, 101): 1, (101, 115): 1, (115, 116): 1}


In [10]:
top_pair = max(stats, key=stats.get)
top_pair

(105, 115)

In [11]:
chr(105), chr(115)

('i', 's')

In [12]:
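def merge(ids, pair, idx):
    """Replace every non-overlapping occurrence of pair in ids with the new ID idx"""
    new_ids = []
    
    i = 0
    while i < len(ids):
        # a match needs two IDs, so never look past the last element
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids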

merge([5, 6, 6, 7, 9, 1], (6, 7), 99)

[5, 6, 99, 9, 1]

In [13]:
tokens2 = merge(tokens, top_pair, 256)
print(tokens2)

print('\nlen of token2:', len(tokens2))


[116, 104, 256, 32, 256, 32, 97, 110, 32, 101, 120, 97, 109, 112, 108, 101, 46, 32, 73, 32, 97, 109, 32, 97, 110, 32, 101, 110, 103, 105, 110, 101, 101, 114, 46, 32, 116, 104, 256, 32, 256, 32, 116, 101, 115, 116]

len of token2: 46


# Build the merge table

In [14]:
# target vocabulary size; the first 256 IDs are the raw byte values
vocab_size = 5000
number_merges = vocab_size - 256

with open('/kaggle/input/romeo-and-juliet-tokenization/romeo-and-juliet_tokenization.txt', 'r', encoding='utf-8') as file:
    text = file.read()
    
tokens = encoding(text)
ids = list(tokens)  # work on a copy so that tokens keeps the original bytes

merges = {}  # (id, id) -> new token ID

for i in range(number_merges):
    stats = get_stats(ids)
    if not stats:
        break
    
    pair = max(stats, key=stats.get)
    # stop once no pair occurs more than once
    if stats[pair] == 1:
        break
    idx = 256 + i
    #print(f"Merging: {pair} in a new token {idx}")
    ids = merge(ids, pair, idx)
    merges[pair] = idx
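
The same loop can be tried without the Kaggle dataset. The sketch below wraps it in a hypothetical `train_bpe` function (a name introduced here, not used elsewhere in the notebook), repeats the two helpers so it runs on its own, and trains on the short string from the compression example.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    new_ids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def train_bpe(ids, num_merges):
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        if stats[pair] == 1:
            break  # nothing left that repeats
        idx = 256 + i
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return ids, merges

raw = list('aaabdaaabac'.encode('utf-8'))
toy_ids, toy_merges = train_bpe(raw, 10)

print(toy_ids)
print(toy_merges)
print(len(raw), '->', len(toy_ids))
```

Training stops after three merges even though ten were requested, because no remaining pair occurs more than once.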

# Evaluation

In [15]:
print('token len', len(tokens))
print('ids len', len(ids))
print('compression ratio: ', len(tokens)/len(ids))

token len 141695
ids len 31546
compression ratio:  4.4916946681037215


# Save the merges for download

In [16]:
with open('merges.txt', 'w') as file:
    file.write(str(merges))

# Load the merges

In [17]:
import ast

with open('merges.txt', 'r') as file:
    data_string = file.read()
    merges = ast.literal_eval(data_string)
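
Writing `str(merges)` and reading it back with `ast.literal_eval` works for this dictionary. As a possible alternative, not what the notebook itself does, the merges could be stored as JSON; since JSON has no tuple keys, the sketch below saves each merge as a `[first, second, new_id]` triple. The toy merge table and the temporary file path are assumptions for illustration.

```python
import json
import os
import tempfile

example_merges = {(105, 115): 256, (256, 32): 257}  # hypothetical toy merge table

# One [first, second, new_id] triple per merge, in training order
records = [[p0, p1, idx] for (p0, p1), idx in example_merges.items()]

path = os.path.join(tempfile.mkdtemp(), 'merges.json')
with open(path, 'w', encoding='utf-8') as file:
    json.dump(records, file)

with open(path, 'r', encoding='utf-8') as file:
    loaded = {(p0, p1): idx for p0, p1, idx in json.load(file)}

print(loaded == example_merges)
```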


# Encoding

In [18]:
def encode(text):
    tokens = list(text.encode('utf-8'))
    
    while len(tokens)>=2:
        stats = get_stats(tokens)
        # pick the pair that was merged earliest during training
        pair = min(stats, key=lambda p: merges.get(p, float('inf')))
        
        # none of the remaining pairs was ever merged
        if pair not in merges:
            break
        
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)
        
        
    return tokens

encode('a')

[97]
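
A single character never triggers a merge, so a slightly longer input shows the loop better. The sketch below is self-contained: it repeats the helpers, uses a hypothetical three-entry merge table instead of the trained one, and records the token list after every merge. The name `encode_verbose` is an assumption.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    new_ids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

# hypothetical merge table: 'aa' -> 256, then 256 + 'a' -> 257, then 257 + 'b' -> 258
toy_merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}

def encode_verbose(text, merges):
    tokens = list(text.encode('utf-8'))
    steps = [tokens]
    while len(tokens) >= 2:
        stats = get_stats(tokens)
        # apply merges in the order they were learned (lowest new ID first)
        pair = min(stats, key=lambda p: merges.get(p, float('inf')))
        if pair not in merges:
            break
        tokens = merge(tokens, pair, merges[pair])
        steps.append(tokens)
    return steps

steps = encode_verbose('aaab', toy_merges)
for step in steps:
    print(step)
```

Applying the lowest-numbered merge first reproduces the training order, which is what lets later merges such as `(256, 97)` find the tokens they depend on.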

# Save the vocabulary for download

In [19]:
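# IDs 0..255 map to the single raw bytes
vocab = {idx: bytes([idx]) for idx in range(256)}

# merges is iterated in insertion order, so both parts of a merged token already exist
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]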
    
with open('vocab.txt', 'w') as file:
    file.write(str(vocab))


# Load the vocabulary

In [20]:
import ast

with open('vocab.txt', 'r') as file:
    data_string = file.read()
    vocab = ast.literal_eval(data_string)




# Decoding

In [21]:

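def decode(ids):
    # concatenate the bytes of every token, then decode them as UTF-8;
    # errors='replace' turns invalid byte sequences into U+FFFD instead of raising
    tokens = b''.join(vocab[idx] for idx in ids)
    text = tokens.decode('utf-8', errors='replace')
    
    return text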


print(decode([97]))

a
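
The `errors='replace'` argument matters because not every sequence of token IDs forms valid UTF-8. The sketch below uses only the 256 byte-level entries of the vocabulary (no merges) and a hypothetical helper `decode_bytes` to show the difference between a complete and a truncated two-byte character.

```python
byte_vocab = {idx: bytes([idx]) for idx in range(256)}

def decode_bytes(ids):
    raw = b''.join(byte_vocab[idx] for idx in ids)
    return raw.decode('utf-8', errors='replace')

complete = decode_bytes([195, 169])   # both bytes of the two-byte character U+00E9
truncated = decode_bytes([195])       # only the first byte: invalid on its own

print(repr(complete))
print(repr(truncated))
```

With the default strict error handling the truncated input would raise `UnicodeDecodeError`; with `'replace'` it becomes the replacement character U+FFFD.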


In [22]:
decode(encode('Hello World!'))

'Hello World!'