To train large language models, we first need to prepare the input data. The text is split into individual word or subword tokens, which are then encoded into vector representations for the LLM.

Our goal is to convert individual words into tokens, and then convert those tokens into vector representations (embeddings) for LLM training.


## Tokenization

Input Text --> Tokenized Text --> Token IDs --> Embedding Vectors

First we develop a simple tokenizer that converts words into tokens. Later on, we will use more sophisticated tokenizers, such as Byte-Pair Encoding (BPE) from the tiktoken library.
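The four stages of the pipeline can be sketched end to end on a toy sentence. This is only an illustration under stated assumptions: `toy_vocab`, `embedding_dim`, and the random `embedding_table` are hypothetical names introduced here, and the random vectors stand in for embeddings that a real model would learn during training.

```python
import random
import re

text = "Hi, I am new to machine learning."

# Stage 1: input text -> tokenized text (split on whitespace, commas, periods)
tokens = [t for t in re.split(r'([,.]|\s)', text) if t.strip()]

# Stage 2: tokenized text -> token IDs via a sorted toy vocabulary
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [toy_vocab[t] for t in tokens]

# Stage 3: token IDs -> embedding vectors
# (assumption: random placeholders; real embeddings are learned)
embedding_dim = 3  # hypothetical, tiny on purpose
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in toy_vocab
]
embeddings = [embedding_table[i] for i in token_ids]

print(tokens)     # ['Hi', ',', 'I', 'am', 'new', 'to', 'machine', 'learning', '.']
print(token_ids)  # [2, 0, 3, 4, 7, 8, 6, 5, 1]
print(len(embeddings), len(embeddings[0]))  # 9 3
```

Each token ends up as one row of the embedding table, so the sentence becomes a list of nine three-dimensional vectors.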



### Input Text to Tokenized Text

In [6]:
import re
text = "Hi, I am new to machine learning."
result = re.split(r'(\s)', text)

print(result)

['Hi,', ' ', 'I', ' ', 'am', ' ', 'new', ' ', 'to', ' ', 'machine', ' ', 'learning.']


This splits the sentence on whitespace characters, but some words are still attached to punctuation characters.

In [7]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hi', ',', '', ' ', 'I', ' ', 'am', ' ', 'new', ' ', 'to', ' ', 'machine', ' ', 'learning', '.', '']
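The empty strings in this output come from adjacent separators: when a comma is directly followed by a space, `re.split` returns the empty text between the two matches. The parentheses in the pattern also matter, because a capturing group makes `re.split` keep the separators in the result. A small sketch on a shortened sentence (the variable names are chosen here for illustration) shows both effects:

```python
import re

text = "Hi, I am"

# Without a capturing group, the separators are discarded
no_group = re.split(r'[,.]|\s', text)

# With a capturing group, the separators are kept in the result
with_group = re.split(r'([,.]|\s)', text)

print(no_group)    # ['Hi', '', 'I', 'am']
print(with_group)  # ['Hi', ',', '', ' ', 'I', ' ', 'am']
```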


The list still contains whitespace characters and empty strings. We can safely remove them as shown below.

In [8]:
result = [item for item in result if item.strip()]
print(result)

['Hi', ',', 'I', 'am', 'new', 'to', 'machine', 'learning', '.']


> Whether to remove whitespace depends on the requirements of the application. For example, a model trained on Python code is sensitive to indentation and spacing, so whitespace should be kept in that case.
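To illustrate the note above, here is a minimal sketch of a tokenizer variant that keeps whitespace. The function name `whitespace_preserving_tokenizer`, its split pattern, and the sample `code` string are assumptions made for this example, not part of the tokenizer developed in this section.

```python
import re


def whitespace_preserving_tokenizer(text):
    # Split on a few punctuation characters and on whitespace, keeping separators
    parts = re.split(r'([,.:()]|\s)', text)
    # Drop only the empty strings, so spaces and newlines stay as tokens
    return [item for item in parts if item != ""]


code = "if x:\n    return 1"
tokens = whitespace_preserving_tokenizer(code)

print(tokens)
# Because nothing is discarded, joining the tokens restores the input exactly
print("".join(tokens) == code)  # True
```

The trade-off is a longer token sequence: every space and newline becomes its own token.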

In [14]:
# Include a few more punctuation types and wrap the logic in a simple tokenizer function


def simple_tokenizer(text):
    # Split on punctuation, double dashes, and whitespace, keeping the separators
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Drop empty strings and whitespace-only items
    result = [item.strip() for item in result if item.strip()]
    return result


In [15]:
text = "Hi, I am new to machine learning."

print(simple_tokenizer(text))

['Hi', ',', 'I', 'am', 'new', 'to', 'machine', 'learning', '.']


Now that our simple tokenizer works, let's apply it to our sample dataset.

In [12]:
with open("../the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [16]:
preprocessed = simple_tokenizer(raw_text)

print(len(preprocessed))

4690


In [18]:
print(preprocessed[:20])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']


We have just finished the step from input text to tokenized text. In the next step, we convert the tokenized text into integers.

### Tokenized Text to Token IDs

First we build a vocabulary by removing duplicate tokens and sorting the remaining ones alphabetically. The unique tokens are then aggregated into a vocabulary in which each token is mapped to a unique integer value.

In [19]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


After determining the vocabulary size, we create the vocabulary.

In [20]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [24]:
# let's look at the first 52 entries (IDs 0 to 51) of the vocabulary we created above

for i,item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)


> We also need the inverse mapping, from token IDs back to tokens, so that we can convert the output of the LLM from token IDs back into text.

Now that our vocabulary is ready, let's build a simple tokenizer class with two methods: `encode`, which converts text to token IDs, and `decode`, which converts token IDs back to text.
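The inverse mapping can be built by swapping keys and values of the vocabulary dictionary. The sketch below uses a hypothetical three-entry `toy_vocab` rather than the full vocabulary, and it relies on every token having a unique ID, which holds when the IDs come from `enumerate`.

```python
# Hypothetical toy vocabulary: token -> token ID
toy_vocab = {"!": 0, "Hi": 1, "am": 2}

# Inverse vocabulary: token ID -> token
inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

print(inverse_vocab)     # {0: '!', 1: 'Hi', 2: 'am'}
print(inverse_vocab[1])  # Hi
```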

In [26]:
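class SimpleTokenizer:
    def __init__(self, vocab) -> None:
        self.str_to_int = vocab  # token -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # token ID -> token

    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace, keeping the separators
        result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only items
        preprocessed = [item.strip() for item in result if item.strip()]
        # Look up each token's ID (raises KeyError for tokens not in the vocabulary)
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        # Map IDs back to tokens and join them with single spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the whitespace that the join placed before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text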

In [27]:
tokenizer = SimpleTokenizer(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
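Encoding works here because every token in the sentence occurs in the training text. `encode` looks each token up directly in the dictionary, so a token that is missing from the vocabulary raises a `KeyError`. One common workaround is to reserve a placeholder entry for unknown tokens. The sketch below is a hedged variant, not the class used in this section: the class name `SimpleTokenizerWithUnk`, the `<|unk|>` token string, and the toy vocabulary are all assumptions made for illustration.

```python
import re


class SimpleTokenizerWithUnk:
    """Hypothetical variant that maps unknown tokens to a placeholder ID."""

    def __init__(self, vocab, unk_token="<|unk|>"):
        self.unk_token = unk_token
        self.str_to_int = dict(vocab)  # copy so the caller's dict is untouched
        # Assumption: existing IDs are 0..len(vocab)-1, so the next free ID
        # is len(vocab)
        if unk_token not in self.str_to_int:
            self.str_to_int[unk_token] = len(self.str_to_int)
        self.int_to_str = {i: s for s, i in self.str_to_int.items()}

    def encode(self, text):
        parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [p.strip() for p in parts if p.strip()]
        unk_id = self.str_to_int[self.unk_token]
        # dict.get falls back to the placeholder ID instead of raising KeyError
        return [self.str_to_int.get(t, unk_id) for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)


toy_vocab = {
    token: i for i, token in enumerate(sorted({"Hi", ",", "I", "am", "new", "."}))
}
unk_tokenizer = SimpleTokenizerWithUnk(toy_vocab)

unk_ids = unk_tokenizer.encode("Hi, I am Bob.")
print(unk_ids)                        # [2, 0, 3, 4, 6, 1]
print(unk_tokenizer.decode(unk_ids))  # Hi, I am <|unk|>.
```

The placeholder keeps encoding from failing, but the original word is lost. This is one reason subword tokenizers such as BPE are preferred in practice.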


In [28]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.

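The decoded text is close to the original but not identical: a space remains after the opening quotation mark and after the apostrophe in `It's`. This follows from how `decode` works. It joins all tokens with spaces and then removes only the whitespace that appears before a punctuation character, never the whitespace after one. The sketch below replays those two steps on a short, made-up token list.

```python
import re

# Hypothetical token list, similar in shape to the example sentence
tokens = ['"', 'It', "'", 's', 'fine', ',', '"', 'she', 'said', '.']

# Step 1: join every token with a single space
joined = " ".join(tokens)

# Step 2: remove whitespace that precedes punctuation
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)   # " It ' s fine , " she said .
print(cleaned)  # " It' s fine," she said.
```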