## 2.2 Tokenizing text

In [1]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [2]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
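The parentheses in the pattern matter here: `re.split` keeps the matched separators in the result only when the pattern contains a capturing group. The following minimal sketch compares both variants on the same sample sentence (the variable names `without_group` and `with_group` are ours):

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, every matched separator is kept as its own item.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)

# Because nothing is dropped, joining the pieces restores the input exactly.
print("".join(with_group) == text)
```

Keeping the separators is what later lets us decide explicitly which of them (whitespace, punctuation) should become tokens.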


Let's modify the regular expression so that it splits on whitespace (`\s`) as well as on commas and periods (`[,.]`):

In [3]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


As we can see, this creates empty strings. Let's remove them:

In [4]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


Let's extend the pattern a bit further so that it also handles other types of punctuation, such as question marks, quotation marks, and double dashes:

In [5]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
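Since the same split-and-filter steps are reused several times below, they can be wrapped in a small helper. This is only a convenience sketch; the names `TOKEN_PATTERN` and `tokenize` are ours and do not appear in the book's code:

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = TOKEN_PATTERN.split(text)
    # item.strip() is empty for both empty strings and pure whitespace,
    # so this filter removes the two kinds of unwanted pieces at once.
    return [piece.strip() for piece in pieces if piece.strip()]

print(tokenize("Hello, world. Is this-- a test?"))
print(tokenize('"It\'s"'))
```

Note that a single hyphen is not a separator in this pattern; only the double dash `--` is split off as its own token.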


This works well, and we are now ready to apply this tokenization to the raw text:

In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


The print statement above outputs 4690, which is the number of tokens in this text **(excluding whitespace)**.

Let's print the first 30 tokens for a quick visual check:

In [7]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 Converting tokens into token IDs

In [8]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)


1130


In [9]:
all_words[:12]

['!', '"', "'", '(', ')', ',', '--', '.', ':', ';', '?', 'A']

In [10]:
vocab = {token:integer for integer,token in enumerate(all_words)}


In [11]:
len(vocab)

1130
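The same construction is easier to inspect on a handful of tokens. The sketch below uses a made-up token list and toy names (`toy_vocab`, `toy_inverse`) so that it does not overwrite the notebook's `vocab`:

```python
# Hypothetical toy token list; "the" appears twice on purpose.
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# set() removes duplicates; sorted() makes the ID assignment deterministic.
unique_tokens = sorted(set(tokens))
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}

# Inverse mapping, needed later to turn IDs back into tokens.
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

ids = [toy_vocab[token] for token in tokens]
print(toy_vocab)
print(ids)
print([toy_inverse[i] for i in ids])
```

Both occurrences of "the" receive the same ID, and punctuation sorts before letters, which matches the ordering seen in the full vocabulary.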

In [12]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)


Let's now put it all together into a tokenizer class:

In [13]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the whitespace in front of the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text



- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers can then be embedded (later) and used as input for the LLM

In [14]:
tokenizer = SimpleTokenizerV1(vocab)

# All characters between the triple quotes are part of the text.
text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


- We can decode the integers back into text


In [15]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [16]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
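Note that the decoded string is not identical to the input: it reads `It' s` instead of `It's` and has a space after the opening quotation mark. The reason is that `decode` joins all tokens with spaces and then only removes whitespace that comes before a punctuation mark, never after one. A minimal sketch on a shortened, hand-written token list (the variable names are ours):

```python
import re

# Tokens as the encoder would split a shortened version of the sentence.
tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', '"']

joined = " ".join(tokens)

# Same substitution as in SimpleTokenizerV1.decode: drop whitespace that
# precedes one of the listed punctuation characters.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

The space before the apostrophe and before the closing quotation mark disappears, but the space after the apostrophe and after the opening quotation mark stays, so this simple tokenizer does not round-trip text exactly.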

So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing
text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set:

In [17]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

The problem is that the word "Hello" does not occur in the short story **The Verdict**. Hence, it is not contained in the vocabulary. This highlights the need for large and diverse training sets to extend the vocabulary when working on LLMs.
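One way to see such failures coming is to list the tokens of a text that are missing from the vocabulary before encoding it. The sketch below assumes a tiny stand-in vocabulary; `toy_vocab`, `tokenize`, and `find_unknown_tokens` are illustrative names, not part of the book's code:

```python
import re

def tokenize(text):
    """Same split-and-filter steps as in the tokenizer's encode method."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Hypothetical stand-in for the real 1130-entry vocabulary.
toy_vocab = {
    token: idx
    for idx, token in enumerate(sorted({",", "?", "do", "like", "tea", "you"}))
}

def find_unknown_tokens(text, vocab):
    """Return the tokens of text that have no entry in vocab."""
    return [token for token in tokenize(text) if token not in vocab]

print(find_unknown_tokens("Hello, do you like tea?", toy_vocab))
```

Any token reported here would raise a `KeyError` in a dictionary lookup such as `self.str_to_int[s]`.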


The next section introduces **additional special tokens** that can be used to provide further context for an LLM during training.

## 2.4 Adding special context tokens

- The above produces an error because the word "Hello" is not contained in the vocabulary
- To deal with such cases, we can add special tokens like "<|unk|>" to the vocabulary to represent unknown words
- Since we are already extending the vocabulary, let's add another token called "<|endoftext|>", which is used in GPT-2 training to denote the end of a text (it is also placed between concatenated texts, for example when the training dataset consists of multiple articles, books, etc.)

In [18]:
len(preprocessed)

4690

In [19]:
# adding special tokens: unk and endoftext
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [20]:
len(vocab.items())

1132

In [21]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)


('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


- Based on the code output above, we can confirm that the two new special tokens were successfully added to the vocabulary. Next, we adapt `SimpleTokenizerV1` into a new class, `SimpleTokenizerV2`:

In [22]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int     # A
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)     #B
        return text
    
#A Replace unknown words with <|unk|> tokens
#B Remove the whitespace in front of the specified punctuation marks

In [23]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [24]:
tokenizer = SimpleTokenizerV2(vocab)

In [25]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

- As we can see, the token IDs contain 1130 for the `<|endoftext|>` separator token as well as two 1131 tokens, which stand for unknown words.

In [26]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

> Note:
So far, we have discussed tokenization as an essential step in processing text as input to
LLMs. Depending on the LLM, some researchers also consider additional special tokens such
as the following:
- [BOS] (beginning of sequence): This token marks the start of a text. It
signifies to the LLM where a piece of content begins.

- [EOS] (end of sequence): This token is positioned at the end of a text,
and is especially useful when concatenating multiple unrelated texts,
similar to <|endoftext|>. For instance, when combining two different
Wikipedia articles or books, the [EOS] token indicates where one article
ends and the next one begins.

- [PAD] (padding): When training LLMs with batch sizes larger than one,
the batch might contain texts of varying lengths. To ensure all texts have
the same length, the shorter texts are extended or "padded" using the
[PAD] token, up to the length of the longest text in the batch.


> Note:
The tokenizer used for GPT models does not need any of the tokens mentioned
above; it uses only an `<|endoftext|>` token for simplicity. The `<|endoftext|>` token is
analogous to the `[EOS]` token mentioned above. `<|endoftext|>` is also used for padding.
However, as we'll explore in subsequent chapters, when training on batched inputs
we typically use a `mask`, meaning we don't attend to padded tokens. Thus, the specific
token chosen for padding becomes inconsequential.
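The padding and masking ideas from the two notes above can be sketched in plain Python. This is an illustration under stated assumptions rather than how any particular training loop does it: `pad_batch` and `unpad` are hypothetical helpers, and reusing 50256 (the `<|endoftext|>` ID of the GPT-2 tokenizer) as the padding ID is just one possible choice:

```python
PAD_ID = 50256  # assumption: reuse the <|endoftext|> ID for padding

def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padded position to be ignored.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

def unpad(padded, mask):
    """Keep only the positions that the mask marks as real tokens."""
    return [
        [token for token, keep in zip(row, row_mask) if keep]
        for row, row_mask in zip(padded, mask)
    ]

batch = [[15496, 11, 466], [554, 262], [13]]
padded, mask = pad_batch(batch, PAD_ID)
print(padded)
print(mask)

# The mask alone decides what counts, so a different padding ID
# leads to the same real tokens.
print(unpad(*pad_batch(batch, 0)) == unpad(padded, mask))
```

Because everything downstream consults the mask, the padded positions never contribute, which is why the concrete padding ID does not matter.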

Moreover, the tokenizer used for GPT models also **doesn't use an `<|unk|>`** token for out-of-vocabulary words. Instead, GPT models use a **`byte pair encoding tokenizer`**, which breaks words down into subword units. We discuss it in the next section.

## 2.5 Byte pair encoding

- GPT-2 used byte pair encoding (BPE) as its tokenizer

- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words

- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges

- The original BPE tokenizer can be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py

- In this chapter, we are using the BPE tokenizer from OpenAI's open-source tiktoken library, which implements its core algorithms in Rust to improve computational performance

- The book's author provides a notebook in `./bytepair_encoder` that compares these two implementations side by side (tiktoken was about 5x faster on the sample text)

In [27]:
# pip install tiktoken



In [28]:
import importlib
import importlib.metadata
import tiktoken

print("tiktoken version: ", importlib.metadata.version("tiktoken"))

tiktoken version:  0.7.0


In [29]:
tokenizer = tiktoken.get_encoding("gpt2")

In [34]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
)

In [35]:
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [36]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


We can make two noteworthy observations based on the token IDs and decoded text
above. First, the `<|endoftext|>` token is assigned a relatively large token ID, namely
`50256`. In fact, the BPE tokenizer that was used to train models such as GPT-2, GPT-3,
and the original model used in ChatGPT has a total vocabulary size of `50,257`, with
`<|endoftext|>` being assigned the largest token ID.

Second, the BPE tokenizer encodes and decodes unknown words such as "someunknownPlace" correctly, because it breaks unknown words down into subwords and individual characters:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp" width="300px">

> EXERCISE 2.1 BYTE PAIR ENCODING OF UNKNOWN WORDS
- Try the BPE tokenizer from the tiktoken library on the unknown words "Akwirw ier"
and print the individual token IDs. Then, call the decode function on each of the
resulting integers in this list to reproduce the mapping shown in Figure 2.11. Lastly,
call the decode method on the token IDs to check whether it can reconstruct the
original input, "Akwirw ier".


In [38]:
text_sample = "Akwirw ier"
example_integers = tokenizer.encode(text_sample)
print(example_integers)

[33901, 86, 343, 86, 220, 959]


In [39]:
example_strings = tokenizer.decode(example_integers)
print(example_strings)

Akwirw ier


Exercise 2.1 is implemented in the `exercise_solutions.ipynb` file.

A detailed discussion and implementation of BPE is beyond the scope of this book, but in
short, it builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words. For example, BPE starts by adding all individual single characters to its vocabulary ("a", "b", ...). In the next stage, it merges character combinations that frequently occur together into subwords. For example, "d" and "e" may be merged into the subword "de", which is common in many English words like "define", "depend", "made", and "hidden". The merges are determined by a frequency cutoff.
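To make the merging idea concrete, here is a toy sketch of the training loop. It is not the GPT-2 implementation (which works on bytes and uses a far larger corpus): the corpus counts are made up, the helper names `count_pairs` and `merge_pair` are ours, and it simply picks the most frequent pair at each step and stops after a fixed number of merges as a stand-in for the frequency-based criterion described above:

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of pair with one concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> number of occurrences.
corpus = {"define": 3, "depend": 2, "made": 2, "hidden": 1}

# Start from single characters, as described in the text.
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # the number of merges is an arbitrary choice here
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

In this toy corpus the pair ("d", "e") occurs in all four words, so it is the first merge, mirroring the "de" example above. Later merges then build on already merged symbols such as "de".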


## 2.6 Data sampling with a sliding window