# Tokenization

## The Idea

The job of the tokenizer is to split the input text into individual units, called **tokens**, for further processing by the LLM. A token is simply a string: tokens can be characters, subwords, words, or (theoretically) even larger units.

We will use the TinyShakespeare dataset as the example for this section:

In [4]:
import requests

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
res = requests.get(url, timeout=30)
res.raise_for_status()  # fail early if the download did not succeed
content = res.content.decode("utf-8")

Let's check the size of this dataset in characters:

In [2]:
len(content)

1115394

Let's also have a brief look at some sample text from this dataset:

In [3]:
content[:100]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

## Byte Pair Encoding

There are many different strategies to split a text into tokens.

One very simple way would be to use character tokenization: simply split the text into its individual characters.
The problem with this is that the model would have to learn everything from the ground up (including what a word is), which seems unnecessary.

A better way is to use word tokenization: just split the text into words.
However, this treats every word as an indivisible unit, so related words such as "speak" and "speaking" get unrelated tokens, and it leads to a very large vocabulary (i.e. the number of possible tokens).
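
To make the trade-off concrete, here is a minimal sketch that applies both strategies to the opening lines of the dataset. The regular expression used for word splitting is an illustrative assumption, not a standard word tokenizer.

```python
import re

sample = "First Citizen:\nBefore we proceed any further, hear me speak."

# Character tokenization: every character is its own token.
char_tokens = list(sample)

# Naive word tokenization: runs of word characters, or single punctuation marks.
# (Illustrative assumption; whitespace is simply dropped here.)
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

print("characters:", len(char_tokens), "tokens,", len(set(char_tokens)), "distinct")
print("words:     ", len(word_tokens), "tokens,", len(set(word_tokens)), "distinct")
print(word_tokens[:6])
```

Character tokenization yields long sequences over a tiny vocabulary, while word tokenization yields short sequences whose vocabulary keeps growing with the size of the corpus.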

Current approaches have therefore settled somewhere in between, on **subword tokenization**.

One very popular scheme is **Byte Pair Encoding (BPE)**, which builds a vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words.
Put simply, the vocabulary starts with all individual characters (or bytes, in the case of GPT-2).
Then the most frequent pair of adjacent tokens is merged into a new token, and this step is repeated until the vocabulary reaches the desired size.
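
The merge loop behind BPE can be sketched in a few lines. This is a simplified toy version under stated assumptions: it starts from characters rather than bytes, it skips the pre-splitting step that production tokenizers apply, and the function names (`most_frequent_pair`, `merge_pair`, `train_toy_bpe`) are hypothetical helpers, not part of any library.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most frequent pair of adjacent tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges on `text`, starting from single characters."""
    tokens = list(text)
    vocab = set(tokens)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
        vocab.add(pair[0] + pair[1])
    return tokens, merges, vocab


tokens, merges, vocab = train_toy_bpe("low lower lowest", num_merges=3)
print(merges)
print(tokens)
```

On this toy corpus the first two merges build up the shared stem `low`, which illustrates how frequent character sequences end up as single tokens.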

The exact details of BPE are not important right now; just keep in mind that BPE performs subword tokenization.

We will use the `tiktoken` library to show an example:

In [1]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

We can use the `encode` method to tokenize a string:

In [5]:
encoded = tokenizer.encode(content, allowed_special={"<|endoftext|>"})

In [6]:
encoded[:10]

[5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11]

Note that the tokenizer doesn't return a list of strings. Instead, it returns a list of integers, where every integer is an ID representing a certain token from the tokenizer vocabulary.
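
The relationship between tokens and IDs can be illustrated with a tiny lookup table. The vocabulary below is a hypothetical four-entry toy, so its IDs do not match the GPT-2 IDs shown above; the real GPT-2 vocabulary has roughly 50,000 entries.

```python
# Hypothetical toy vocabulary: the position of a token in the list is its ID.
toy_vocab = ["First", " Citizen", ":", "\n"]
token_to_id = {token: idx for idx, token in enumerate(toy_vocab)}
id_to_token = {idx: token for token, idx in token_to_id.items()}

# Encoding maps each token string to its integer ID ...
toy_tokens = ["First", " Citizen", ":", "\n"]
toy_ids = [token_to_id[token] for token in toy_tokens]

# ... and decoding maps the IDs back and concatenates the strings.
toy_decoded = "".join(id_to_token[idx] for idx in toy_ids)

print(toy_ids)
print(repr(toy_decoded))
```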

We can view the tokens themselves like this:

In [7]:
for encoded_id in encoded[:10]:
    print(repr(tokenizer.decode([encoded_id])))

'First'
' Citizen'
':'
'\n'
'Before'
' we'
' proceed'
' any'
' further'
','


We can also decode multiple tokens like this:

In [9]:
tokenizer.decode(encoded[:10])

'First Citizen:\nBefore we proceed any further,'

The BPE tokenizer can handle any unknown word:

In [11]:
unknown_text = "asdasdaf"
tokenizer.decode(tokenizer.encode(unknown_text))

'asdasdaf'

This is because BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters:

In [12]:
for encoded_id in tokenizer.encode(unknown_text):
    print(repr(tokenizer.decode([encoded_id])))

'as'
'd'
'as'
'd'
'af'
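
The reason a fallback always exists for the GPT-2 tokenizer is that its BPE operates on bytes: the base vocabulary covers all 256 possible byte values, so any UTF-8 string can be represented even if no merge applies. The sketch below only demonstrates that byte-level round trip in plain Python; the printed byte values are not GPT-2 token IDs.

```python
unknown_text = "asdasdaf"

# Every string can be turned into a sequence of byte values in the range 0..255.
raw_bytes = list(unknown_text.encode("utf-8"))
print(raw_bytes)

# Converting the bytes back recovers the original text exactly.
restored = bytes(raw_bytes).decode("utf-8")
print(restored)
```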


## Tokenizers in `transformers`

The `transformers` library also allows us to work with tokenizers:

In [13]:
from transformers import GPT2Tokenizer

text = "This is a sentence"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoded_input = tokenizer(text, return_tensors="pt")

print(encoded_input)

{'input_ids': tensor([[1212,  318,  257, 6827]]), 'attention_mask': tensor([[1, 1, 1, 1]])}


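The `input_ids` value contains the token IDs. The `attention_mask` value marks which positions hold real tokens (1) as opposed to padding (0); here every position is a real token.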