# Tokenization

## Example Dataset

We will use TinyShakespeare as our example dataset for this section:

In [1]:
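import requests

# Download the TinyShakespeare corpus as a single string.
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
res = requests.get(url)
res.raise_for_status()  # fail early if the download did not succeed
content = res.content.decode("utf-8")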

Let's have a brief look at the size of this dataset:

In [6]:
len(content)

1115394

Let's also have a brief look at some sample text from this dataset:

In [7]:
content[:100]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

## Byte Pair Encoding

There are many different strategies to split a text into tokens.

One very simple way would be character tokenization: simply split the text into its individual characters.
The problem with this is that the model would have to learn everything from the ground up (including what a word is), which seems unnecessary.

A better way is word tokenization: just split the text into words.
However, this gives every word its own token regardless of how frequent it is, which leads to a very large vocabulary (i.e. a large number of possible tokens).
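To make the difference between the two schemes concrete, here is a minimal sketch on a short sample string. The whitespace split is only a stand-in for word tokenization (punctuation stays attached to the words), and the variable names are chosen for this illustration.

```python
# Illustrative sample; any text would do.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Character tokenization: every character is its own token.
char_tokens = list(text)

# Naive word tokenization: split on whitespace (punctuation stays attached).
word_tokens = text.split()

# Sequence length and number of distinct tokens for each scheme.
print(len(char_tokens), len(set(char_tokens)))
print(len(word_tokens), len(set(word_tokens)))
```

Character tokens give long sequences over a small vocabulary, while word tokens give short sequences whose vocabulary grows with every new word in the text.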

Current approaches have therefore settled somewhere in between, on **subword tokenization**.

One very popular scheme is **Byte Pair Encoding (BPE)**.

BPE builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words.

Put simply, first all individual characters are added to the vocabulary.
Then the most frequent pair of adjacent tokens is merged into a new token, and this step is repeated, so that longer and longer subwords are added over time.
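The merging procedure can be sketched in a few lines. This is a simplified, character-level toy version and not the GPT-2 implementation, which works on bytes and pre-splits the text before merging. The function name `train_bpe` and the sample string are made up for this illustration.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges with a simplified, character-level BPE."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent tokens.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair is repeated any more
        merges.append((left, right))
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = train_bpe("low lower lowest", num_merges=3)
print(merges)
print(tokens)
```

Each merge adds one new token to the vocabulary, so the number of merges directly controls the final vocabulary size.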

The exact details of BPE are not extremely important right now; just keep in mind that BPE performs subword tokenization:

In [10]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

In [14]:
encoded = tokenizer.encode(content, allowed_special={"<|endoftext|>"})

In [19]:
encoded[:10]

[5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11]

In [20]:
tokenizer.decode(encoded[:10])

'First Citizen:\nBefore we proceed any further,'

In [26]:
for encoded_id in encoded[:10]:
    print(repr(tokenizer.decode([encoded_id])))

'First'
' Citizen'
':'
'\n'
'Before'
' we'
' proceed'
' any'
' further'
','


The BPE tokenizer can handle any unknown word:

In [22]:
unknown_text = "asdasdaf"
tokenizer.decode(tokenizer.encode(unknown_text))

'asdasdaf'

This is because BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.

In [27]:
for encoded_id in tokenizer.encode(unknown_text):
    print(repr(tokenizer.decode([encoded_id])))

'as'
'd'
'as'
'd'
'af'
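Why an unseen word can still be encoded is easiest to see with a toy encoder that starts from single characters and replays a list of merges in order. The merge list below is hypothetical and chosen only so that the result mirrors the split shown above; it is not taken from the GPT-2 vocabulary, and `apply_merges` is a name invented for this sketch.

```python
def apply_merges(word, merges):
    """Tokenize `word` by replaying learned merges in order (toy BPE encoder)."""
    tokens = list(word)  # fallback: every character is a valid token
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge list, made up for this illustration (not GPT-2's real merges).
toy_merges = [("a", "s"), ("a", "f")]

print(apply_merges("asdasdaf", toy_merges))
print(apply_merges("xyz", toy_merges))
```

A word for which no merge applies simply stays split into its single characters, so encoding never fails.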


## Tokenizers in `huggingface`

## Sliding Window

In [28]:
len(encoded)

338025

In [29]:
context_size = 6
x = encoded[:context_size]
y = encoded[1:context_size+1]

In [30]:
x

[5962, 22307, 25, 198, 8421, 356]

In [31]:
y

[22307, 25, 198, 8421, 356, 5120]

In [32]:
for i in range(1, context_size+1):
    context = encoded[:i]
    target = encoded[i]
    print(context, target)

[5962] 22307
[5962, 22307] 25
[5962, 22307, 25] 198
[5962, 22307, 25, 198] 8421
[5962, 22307, 25, 198, 8421] 356
[5962, 22307, 25, 198, 8421, 356] 5120


In [36]:
for i in range(1, context_size+1):
    context = encoded[:i]
    target = encoded[i]
    print(repr(tokenizer.decode(context)), repr(tokenizer.decode([target])))

'First' ' Citizen'
'First Citizen' ':'
'First Citizen:' '\n'
'First Citizen:\n' 'Before'
'First Citizen:\nBefore' ' we'
'First Citizen:\nBefore we' ' proceed'
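The cells above always start at the first token. The same idea can be generalized to a window that slides over the whole sequence, where the target is the input shifted by one token. The following is a minimal sketch in plain Python; the function name `sliding_windows` and its `stride` parameter are assumptions made for this illustration.

```python
def sliding_windows(token_ids, context_size, stride=1):
    """Yield (input, target) pairs; the target is the input shifted by one token."""
    for start in range(0, len(token_ids) - context_size, stride):
        x = token_ids[start:start + context_size]
        y = token_ids[start + 1:start + context_size + 1]
        yield x, y


# First ten GPT-2 token ids of the dataset, copied from the output above.
sample_ids = [5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11]

for x, y in sliding_windows(sample_ids, context_size=4, stride=2):
    print(x, y)
```

A stride smaller than the context size produces overlapping windows, while a stride equal to the context size produces windows that do not overlap.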


## Efficient Data Loaders in PyTorch

In [65]:
from torch.utils.data import Dataset
import torch

class ShakespeareDataset(Dataset):
    """Sliding-window dataset of (context, target) token-id tensors."""

    def __init__(self, text, max_context_length, stride):
        self.tokenizer = tiktoken.get_encoding("gpt2")
        self.context_tensors = []
        self.target_tensors = []

        token_ids = self.tokenizer.encode(text)

        # Slide a window of max_context_length tokens over the text, moving
        # `stride` tokens per step; the target is the context shifted by one.
        for i in range(0, len(token_ids) - max_context_length, stride):
            context = token_ids[i:i+max_context_length]
            target = token_ids[i+1:i+max_context_length+1]
            context_tensor = torch.tensor(context)
            target_tensor = torch.tensor(target)
            self.context_tensors.append(context_tensor)
            self.target_tensors.append(target_tensor)

    def __len__(self):
        # Number of (context, target) pairs.
        return len(self.context_tensors)

    def __getitem__(self, idx):
        return self.context_tensors[idx], self.target_tensors[idx]
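How many samples the dataset contains follows directly from the `range(0, len(token_ids) - max_context_length, stride)` loop in `__init__`. The helper below is a small sketch that computes this count without building any tensors; the name `num_samples` is made up for this illustration, and 338025 is the token count reported earlier for the encoded text.

```python
def num_samples(num_tokens, max_context_length, stride):
    """Number of windows produced by range(0, num_tokens - max_context_length, stride)."""
    usable = num_tokens - max_context_length
    # Integer ceiling division, clamped at zero for texts shorter than one window.
    return max(0, (usable + stride - 1) // stride)


for stride in (1, 32, 256):
    n = num_samples(338025, max_context_length=256, stride=stride)
    # Cross-check against the loop bounds used by the dataset.
    assert n == len(range(0, 338025 - 256, stride))
    print(stride, n)
```

Since every sample is stored as its own pair of tensors, a small stride multiplies both the number of samples and the memory the dataset needs.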

In [66]:
from torch.utils.data import DataLoader

# max_context_length is usually at least 256
def create_dataloader(text, batch_size=32, max_context_length=256, stride=32, shuffle=True, drop_last=True):
    dataset = ShakespeareDataset(text, max_context_length, stride)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    return dataloader

In [67]:
dataloader = create_dataloader(content, batch_size=1, stride=1, max_context_length=4, shuffle=False)

In [68]:
dataloader_iter = iter(dataloader)

In [69]:
first_batch = next(dataloader_iter)

In [70]:
first_batch

[tensor([[ 5962, 22307,    25,   198]]),
 tensor([[22307,    25,   198,  8421]])]

In [71]:
second_batch = next(dataloader_iter)

In [72]:
second_batch

[tensor([[22307,    25,   198,  8421]]), tensor([[  25,  198, 8421,  356]])]

In [73]:
tokenizer.decode(first_batch[0].tolist()[0])

'First Citizen:\n'

In [74]:
tokenizer.decode(first_batch[1].tolist()[0])

' Citizen:\nBefore'

In [75]:
tokenizer.decode(second_batch[0].tolist()[0])

' Citizen:\nBefore'

In [76]:
tokenizer.decode(second_batch[1].tolist()[0])

':\nBefore we'