### What are Tokenizers

- Converts raw text [***human readable***] to a numerical form [***processable by a neural network***]

- A tokenizer performs the following tasks

    * build a vocabulary, if one does not exist
    * assign token IDs [indices] to the tokens of a given text corpus (encode)
    * map token IDs back to tokens (decode)

- A preprocessing step in transformer model training & inference
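
As a rough illustration of the three tasks listed above, the following minimal sketch builds a word-level vocabulary from a toy sentence, encodes the sentence to token IDs, and decodes the IDs back to text. The sentence and the names `vocab` and `id_to_token` are made up for this example and do not appear elsewhere in the notebook.

```python
# Toy corpus (an assumption for illustration only)
corpus = "the cat sat on the mat"
tokens = corpus.split()

# Task 1: build the vocabulary (token -> ID), sorted for a stable ordering
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Task 2: encode tokens as IDs
ids = [vocab[tok] for tok in tokens]

# Task 3: map IDs back to tokens
id_to_token = {idx: tok for tok, idx in vocab.items()}
decoded = " ".join(id_to_token[i] for i in ids)

print(vocab)
print(ids)
print(decoded)
```

A real tokenizer also has to decide what to do with tokens that are missing from the vocabulary, which this sketch ignores.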

In [16]:
import collections as cots
import datasets
import numpy as np
import pandas as pd
import re
from tiny_shakespeare import TinyShakespeare


In [4]:
shakespeare = TinyShakespeare()
shakespeare.download_and_prepare()
dataset = shakespeare.as_dataset()
print(dataset)

for split in dataset:
    print(f"{split}: {len(dataset[split]['text'][0])} characters\n")
    print(dataset[split].features)

train = dataset["train"]["text"][0]
val = dataset["validation"]["text"][0]
test = dataset["test"]["text"][0]

print(train[:50])

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})
train: 1003854 characters

{'text': Value('string')}
validation: 55770 characters

{'text': Value('string')}
test: 55770 characters

{'text': Value('string')}
First Citizen:
Before we proceed any further, hear


# Algorithms
### 1. Bag of Words: Word Based
* Whitespace/Regex

In [5]:
# Split on punctuation and whitespace
reg_vocab = re.split(r"[ \t\n\r\f\v!\"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]+", train)

# Remove empty tokens (if any)
tokens = [t for t in reg_vocab if t]
print(f"Vocabulary size: {len(set(tokens))}")
print(f"First 10 words: {tokens[:10]}")

Vocabulary size: 12559
First 10 words: ['First', 'Citizen', 'Before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak']
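
The split above discards punctuation entirely. If punctuation should survive as tokens of their own, one possible variant (a sketch, not the approach used in this notebook) is `re.findall` with a pattern that matches either a run of word characters or a single non-space symbol. The sample string is taken from the opening of the training text.

```python
import re

sample = "First Citizen:\nBefore we proceed any further, hear me speak."

# \w+      -> a run of letters, digits, or underscores (one word)
# [^\w\s]  -> any single character that is neither a word character nor whitespace
tokens_with_punct = re.findall(r"\w+|[^\w\s]", sample)
print(tokens_with_punct)
```

Keeping punctuation makes the token sequence closer to reversible, at the cost of a slightly larger vocabulary.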


### 2. Character Based
* Byte Array

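To convert any text dataset into byte-level tokens (integers 0-255, one per UTF-8 byte) in Python, follow the steps below. For pure ASCII text, each character maps to exactly one byte, so byte tokens and character tokens coincide.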

In [6]:
# Convert text to a byte array.
# The text is encoded as UTF-8: each ASCII character becomes one byte,
# while other characters become two to four bytes.
text = "Hello World" # We can use above 'train' set as well
b_arr = bytearray(text, "utf-8")
b_arr

bytearray(b'Hello World')

In [7]:
# Converting the byte array to a Python list yields the integer value of each byte
list(b_arr)

[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

#### Example: With Train Set

In [8]:
train_char_tokens = list(bytearray(train, "utf-8"))
print(f"size: {len(train_char_tokens)}")
train_char_tokens[:10]  # Display first 10 tokens

size: 1003854


[70, 105, 114, 115, 116, 32, 67, 105, 116, 105]
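
The token count above equals the character count because the training text is plain ASCII. The sketch below, using a made-up string with one non-ASCII character, shows that this is not true in general and that the byte IDs can be turned back into text with `bytes(...).decode("utf-8")`.

```python
text = "café"                          # 'é' is not ASCII
byte_ids = list(text.encode("utf-8"))  # same values as list(bytearray(text, "utf-8"))

print(len(text), len(byte_ids))        # character count vs. byte count
print(byte_ids)

# Decode: rebuild a bytes object from the IDs and interpret it as UTF-8
restored = bytes(byte_ids).decode("utf-8")
print(restored)
```

Here `é` is encoded as the two bytes 195 and 169, so four characters become five tokens.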

### 3. Sub Word Based

#### Byte Pair Encoding

Pseudocode:
* Task-A:
    1. split char_corpus (***text data used to prepare the vocabulary***) into characters
    2. add each unique char to vocabulary_table as an (index, char) pair
    3. repeat ***K*** times (K is a hyperparameter):
        * pair up consecutive tokens
        * count the frequency of each pair, using a (pair, freq) dict
        * pick the most frequent pair
        * update the char_corpus by replacing every occurrence of that pair with the merged token
        * add the merged token to vocabulary_table
* Task-B: 
   - ***encode*** the char_corpus using vocabulary_table -> token IDs (indices)
* Task-C: 
   - if needed, ***decode*** in the same way to recover the byte pairs of a given text example; vocabulary_table provides the mapping
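
To make one iteration of the Task-A loop concrete, here is a minimal sketch on a toy string. It uses `collections.Counter` over pairs of adjacent tokens, which is one common way to count pairs and not necessarily how the implementation further down does it. The toy string and variable names are assumptions for illustration.

```python
from collections import Counter

tokens = list("aaabdaaabac")  # toy corpus split into characters

# Count every pair of adjacent tokens
pair_counts = Counter(zip(tokens, tokens[1:]))

# Pick the most frequent pair
best_pair, best_count = pair_counts.most_common(1)[0]

# Replace each occurrence of that pair with a single merged token
merged = []
i = 0
while i < len(tokens):
    if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best_pair:
        merged.append(tokens[i] + tokens[i + 1])
        i += 2  # skip both members of the pair
    else:
        merged.append(tokens[i])
        i += 1

print(best_pair, best_count)
print(merged)
```

Repeating this step K times, each time on the output of the previous step, gives the full training loop.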

### Sources:

* [pseudocode](https://sebastianraschka.com/blog/2025/bpe-from-scratch.html)
* [get_into_implementation_details](https://www.youtube.com/watch?v=20xtCxAAkFw)

* [tiktoken@OpenAI - App](https://tiktokenizer.vercel.app/?model=gpt2)
* [tiktoken@OpenAI - Github Repo](https://github.com/openai/tiktoken)

In [9]:
# import heapq
# import itertools


In [10]:
def update_char_corpus(num_chars, most_repeated, vocabulary_table):
    """Replace each adjacent token pair whose concatenation equals `most_repeated`."""
    tokens = np.asarray(num_chars).tolist()          # work on a plain Python list copy
    new_token = list(vocabulary_table.values())[-1]  # newest vocabulary entry, i.e. the merged pair

    merged = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] + tokens[i + 1] == most_repeated:
            merged.append(new_token)
            i += 2  # skip both members of the merged pair
        else:
            merged.append(tokens[i])
            i += 1

    return merged

In [11]:
def byte_code_encoding(text_string, num_steps=100):
    # Step 1: split the input text into single-character tokens
    char_corpus = np.array(list(text_string))

    # Step 2: index the unique characters in the vocabulary table
    unique_chars = np.unique(char_corpus)
    vocabulary_table = {i: char for i, char in enumerate(unique_chars.tolist())}

    for _ in range(num_steps):
        if len(char_corpus) < 2:  # no pairs left to merge
            break

        # Steps 3-4: concatenate each pair of consecutive tokens
        char_pairs_merged = np.char.add(char_corpus[:-1], char_corpus[1:])

        # Step 5: count the frequency of each merged pair
        merged, counts = np.unique(char_pairs_merged, return_counts=True)

        # Steps 6-7: pick the most frequent pair (ties go to the first in sorted order)
        most_repeated = str(merged[np.argmax(counts)])

        # Step 8: add the most repeated pair to the vocabulary table under the next free ID
        vocabulary_table[len(vocabulary_table)] = most_repeated

        # Step 9: apply the merge to the character corpus
        char_corpus = np.array(update_char_corpus(char_corpus, most_repeated, vocabulary_table))

    return vocabulary_table, char_corpus

In [12]:
# Encode

In [13]:
# Decode
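
The Encode and Decode cells are left empty in this notebook. One way Task-B and Task-C could be sketched, assuming a vocabulary table of the same shape as above (ID -> token string, base characters first, merged tokens afterwards in the order they were learned), is to replay the merges in order and then look up IDs. The function names, the `num_base_tokens` parameter, and the toy vocabulary are hypothetical and chosen only for this example.

```python
def encode(text, vocabulary_table, num_base_tokens):
    """Replay the learned merges in order, then map tokens to IDs."""
    tokens = list(text)
    # Entries after the base characters are merged tokens, oldest first
    for token_id in range(num_base_tokens, len(vocabulary_table)):
        merged_token = vocabulary_table[token_id]
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] + tokens[i + 1] == merged_token:
                out.append(merged_token)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    token_to_id = {tok: idx for idx, tok in vocabulary_table.items()}
    return [token_to_id[tok] for tok in tokens]  # KeyError for unseen characters


def decode(token_ids, vocabulary_table):
    """Look each ID up in the vocabulary table and join the pieces."""
    return "".join(vocabulary_table[i] for i in token_ids)


# Toy vocabulary (an assumption): 4 base characters followed by 2 merges
toy_vocab = {0: " ", 1: "a", 2: "b", 3: "c", 4: "ab", 5: "abc"}
toy_ids = encode("abc ab", toy_vocab, num_base_tokens=4)
print(toy_ids)
print(decode(toy_ids, toy_vocab))
```

This sketch raises a `KeyError` for characters that were never seen during training; byte-level tokenizers avoid that by starting from all 256 byte values.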

In [20]:
from itertools import islice

# test_string = "Hello World longer text to test the tokenizer functionality"
test_string=train[:50000]
vocabulary_table, char_corpus = byte_code_encoding(test_string, num_steps=500)
print("Character Corpus:", char_corpus[:10])
print("Vocabulary Table:", dict(islice(vocabulary_table.items(), 10)))

Character Corpus: ['First ' 'Citizen:\n' 'B' 'e' 'for' 'e ' 'we ' 'pro' 'ce' 'ed ']
Vocabulary Table: {0: '\n', 1: ' ', 2: '!', 3: "'", 4: ',', 5: '-', 6: '.', 7: ':', 8: ';', 9: '?'}
