To train large language models, we first need to prepare the input data: the text is split into individual word or subword tokens, which are then encoded into vector representations for the LLM.

Our goal is to convert text into tokens, and then those tokens into vector representations (embeddings) for LLM training.


## Tokenization

Input Text --> Tokenized Text --> Token IDs --> Embedding Vectors

First we develop a simple tokenizer that splits text into tokens; later we will use a more sophisticated tokenizer, Byte-Pair Encoding (BPE) from the tiktoken library.
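As a rough orientation before we build each stage properly, the pipeline above can be sketched end to end on a toy sentence. This is only an illustration under assumptions that are not part of this notebook: a plain whitespace split stands in for the tokenizer, the embedding table is filled with random numbers (in a real model these values are learned during training), and `embedding_dim = 3` is an arbitrary choice.

```python
import random
import re

toy_text = "the cat sat on the mat"

# 1. Input text -> tokenized text (naive whitespace split, for illustration only)
toy_tokens = re.split(r'\s+', toy_text.strip())

# 2. Tokenized text -> token IDs via a sorted vocabulary of unique tokens
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}
toy_ids = [toy_vocab[token] for token in toy_tokens]

# 3. Token IDs -> embedding vectors via a lookup table.
# Random values are an assumption here; a trained model learns them.
random.seed(0)
embedding_dim = 3
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in toy_vocab
]
toy_embeddings = [embedding_table[i] for i in toy_ids]

print(toy_tokens)                                   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(toy_ids)                                      # [4, 0, 3, 2, 4, 1]
print(len(toy_embeddings), len(toy_embeddings[0]))  # 6 3
```

Both occurrences of `the` map to the same ID and therefore to the same row of the table, which is the key property of an embedding lookup.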



### Input Text to Tokenized Text

In [1]:
import re
text = "Hi, I am new to machine learning."
result = re.split(r'(\s)', text)

print(result)

['Hi,', ' ', 'I', ' ', 'am', ' ', 'new', ' ', 'to', ' ', 'machine', ' ', 'learning.']


This splits the sentence on whitespace characters, but some words are still attached to punctuation characters (for example `'Hi,'` and `'learning.'`).

In [2]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hi', ',', '', ' ', 'I', ' ', 'am', ' ', 'new', ' ', 'to', ' ', 'machine', ' ', 'learning', '.', '']


The list still contains whitespace and empty-string entries; we can safely remove them as shown below.

In [3]:
result = [item for item in result if item.strip()]
print(result)

['Hi', ',', 'I', 'am', 'new', 'to', 'machine', 'learning', '.']


> Whether to remove whitespace depends on the application. For models that are sensitive to the exact structure of the text, for example Python code, where indentation and spacing matter, it can be better to keep whitespace as tokens.
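To make the note above concrete, here is a minimal sketch of a whitespace-preserving variant. The function name `tokenize_keep_whitespace` is hypothetical and not part of this notebook; the only change compared with the filtering step above is that we drop empty strings but keep spaces and newlines.

```python
import re

def tokenize_keep_whitespace(text):
    # Split on punctuation, "--", or a single whitespace character,
    # keeping the delimiters thanks to the capturing group.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Drop only the empty strings that re.split produces;
    # spaces, tabs and newlines stay in the token list.
    return [part for part in parts if part != ""]

code = "if x:\n    return y"
code_tokens = tokenize_keep_whitespace(code)
print(code_tokens)
# Because nothing is discarded, joining the tokens restores the input exactly.
print("".join(code_tokens) == code)  # True
```

With this variant the four spaces of indentation survive as four separate `' '` tokens, at the cost of a longer token sequence.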

In [4]:
# Include a few more punctuation types and wrap the logic in a simple tokenizer function


def simple_tokenizer(text):
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    result = [item.strip() for item in result if item.strip()]
    return result


In [5]:
text = "Hi, I am new to machine learning."

print(simple_tokenizer(text))

['Hi', ',', 'I', 'am', 'new', 'to', 'machine', 'learning', '.']


Now that our simple tokenizer works, let's apply it to our sample dataset.

In [6]:
with open("../the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [7]:
preprocessed = simple_tokenizer(raw_text)

print(len(preprocessed))

4690


In [8]:
print(preprocessed[:20])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']


This completes the input text to tokenized text step. In the next step we convert the tokens into integers.

### Tokenized Text to Token IDs

First we build a vocabulary by removing duplicate tokens and sorting the unique tokens alphabetically. Each unique token is then mapped to a unique integer value.

In [9]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


After determining the vocabulary size, we create the vocabulary.

In [10]:
vocab = {token:integer for integer,token in enumerate(all_words)}
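The same construction can be checked on a tiny hand-written token list (the list and the name `small_vocab` are made up for illustration). Note that `sorted` orders strings by Unicode code point, so punctuation comes before uppercase letters, which come before lowercase letters.

```python
small_tokens = ["the", "cat", ",", "the", "Dog", "!", "cat"]

# set() removes duplicates; sorted() orders by Unicode code point.
small_unique = sorted(set(small_tokens))

# Map each unique token to its position in the sorted list.
small_vocab = {token: idx for idx, token in enumerate(small_unique)}

print(small_vocab)
# {'!': 0, ',': 1, 'Dog': 2, 'cat': 3, 'the': 4}
```

This ordering is why the punctuation tokens receive the smallest IDs in the full vocabulary as well.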

In [11]:
# Let's look at the first entries of the vocabulary created above (IDs 0 through 51)

for i,item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)


> We also need the inverse mapping, from token IDs back to tokens, so that we can convert the output of an LLM from token IDs back to tokens and then concatenate them into natural text.
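A minimal sketch of such an inverse mapping, using a small made-up vocabulary (`mini_vocab` and `mini_inverse` are illustrative names, not part of this notebook). Since every token has a unique ID, swapping keys and values loses no information.

```python
mini_vocab = {"!": 0, ",": 1, "cat": 2, "the": 3}

# Swap keys and values: token -> ID becomes ID -> token.
mini_inverse = {idx: token for token, idx in mini_vocab.items()}

mini_ids = [3, 2, 1, 3, 2, 0]
print(" ".join(mini_inverse[i] for i in mini_ids))
# the cat , the cat !
```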

Now that our vocabulary is ready, let's build a simple tokenizer class with two methods: `encode`, which converts text to token IDs, and `decode`, which converts token IDs back to text.

In [12]:
class SimpleTokenizer:
    def __init__(self,vocab) -> None:
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self,text):
        result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in result if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self,ids):
        text = " ".join([self.int_to_str[i] for i in ids]) 
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)  
        return text
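To see what the two lines in `decode` do, the sketch below applies them to a hand-written token list instead of real token IDs (the list is made up for illustration). The join inserts a space between every pair of tokens, and the substitution then removes only the whitespace *before* the listed punctuation marks, so a space after an opening quote or an apostrophe remains.

```python
import re

demo_tokens = ['"', 'It', "'", 's', 'the', 'last']

# Step 1: join all tokens with single spaces.
joined = " ".join(demo_tokens)
print(repr(joined))   # '" It \' s the last'

# Step 2: remove whitespace that precedes a punctuation mark.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(repr(cleaned))  # '" It\' s the last'
```

As a consequence, decoding an encoded text gives a result that is close to the original but not guaranteed to be character-for-character identical.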

In [13]:
tokenizer = SimpleTokenizer(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [14]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


What happens when `encode` is applied to text containing words that are not in the training set?

In [15]:
text = "Hello, do you like tea or coffee?"

In [16]:
print(tokenizer.encode(text))

KeyError: 'Hello'

The word "Hello" is not contained in the vocabulary built from the 'the-verdict.txt' file, so we need a mechanism to handle unknown words.

Now we will implement an improved version of our previous `SimpleTokenizer` class. It replaces unknown words (words that are not part of the training data) with an `<|unk|>` token, and it uses an `<|endoftext|>` token to separate two unrelated text sources.
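The fallback idea can be sketched in isolation on a small made-up vocabulary (`tiny_vocab` is an illustrative name). This sketch uses `dict.get` with a default value, which is one possible way to express the lookup; it is equivalent in effect to checking membership first.

```python
tiny_vocab = {"do": 0, "like": 1, "tea": 2, "you": 3, "<|unk|>": 4}
unk_id = tiny_vocab["<|unk|>"]

tiny_tokens = ["Hello", "do", "you", "like", "tea"]

# dict.get returns unk_id for any token missing from the vocabulary,
# so the lookup never raises KeyError.
tiny_ids = [tiny_vocab.get(token, unk_id) for token in tiny_tokens]
print(tiny_ids)  # [4, 0, 3, 1, 2]
```

The price of this robustness is information loss: every unknown word collapses to the same ID, so the original word cannot be recovered when decoding.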



In [17]:
# let's modify the vocab
all_words.extend(["<|endoftext|>","<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_words)}
print(len(vocab.items()))

1132


In [18]:
# Check the last 5 entries of the updated vocabulary

for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [19]:
class SimpleTokenizerV2:
    def __init__(self,vocab) -> None:
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self,text):
        result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in result if item.strip()]
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self,ids):
        text = " ".join([self.int_to_str[i] for i in ids]) 
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)  
        return text

In [21]:
text1 = "Hello, do you like tea or coffee?"
text2 = "In this universe there are billions of stars!"

text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea or coffee? <|endoftext|> In this universe there are billions of stars!


In [22]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 734, 1131, 10, 1130, 55, 999, 1131, 992, 169, 1131, 722, 1131, 0]


In [23]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea or <|unk|>? <|endoftext|> In this <|unk|> there are <|unk|> of <|unk|>!


- `[PAD]` is used when training in batches: to ensure that all texts in a batch have the same length, the `[PAD]` token is appended to the shorter texts.

- GPT uses only the `<|endoftext|>` token, both to mark the end of a text (and the boundary between unrelated documents) and for padding. GPT models do not use `<|unk|>` for unknown words; instead they use byte pair encoding, which breaks words into subword units.
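Padding a batch can be sketched as follows. The helper `pad_batch` is hypothetical and not part of this notebook; it reuses ID 1130 (`<|endoftext|>` in the vocabulary built above) as the padding ID, in line with the GPT convention just described. Returning a mask alongside the padded sequences is an assumption based on common practice, so that padded positions can be ignored later.

```python
def pad_batch(batch, pad_id):
    # Pad every sequence on the right up to the length of the longest one.
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks a padding position.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask

pad_id = 1130  # ID of <|endoftext|> in the vocabulary built above
batch = [[1131, 5, 355], [55, 999, 1131, 992, 169]]

padded, mask = pad_batch(batch, pad_id)
print(padded)  # [[1131, 5, 355, 1130, 1130], [55, 999, 1131, 992, 169]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```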
