# Tokenization

Splitting a block of text into meaningful subunits is an essential first step in processing it. Text can be split into individual characters, whole words, or something in between. The very basic approach below splits text on spaces. It already shows a shortcoming: the final word, 'dog.', keeps its punctuation attached.

In [1]:
'The quick brown fox jumps over the lazy dog.'.split(' ')

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
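
One way around the attached punctuation, as a rough sketch using Python's built-in `re` module rather than anything from the `transformers` library, is to pull out runs of word characters and single punctuation marks separately. The pattern here is our own illustration and handles only simple cases (it would split "don't" into three pieces, for example).

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```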

With Transformer models, we use subword tokenization and split the text with a prebuilt tokenizer. The tokenizer has been trained on a large amount of text, from which it has learned which words are common enough to keep whole and which are rarer and can be split into parts (that often look like syllables).
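
To give a feel for how a tokenizer can learn which words to keep whole, here is a toy sketch of byte pair encoding (BPE), the family of algorithm behind the GPT-2 tokenizers. It is not the `transformers` implementation: it works on characters rather than bytes, and the function name `learn_bpe_merges` and the two-word corpus are made up for illustration. The idea is to repeatedly merge the most frequent adjacent pair of symbols, so frequent words end up as single tokens while rare words stay in pieces.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Toy BPE: learn merge rules from a {word: count} dictionary."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Hypothetical corpus: "the" is common, "thaw" is rare.
merges, segmented = learn_bpe_merges({"the": 10, "thaw": 1}, num_merges=2)
print(merges)     # [('t', 'h'), ('th', 'e')]
print(segmented)  # {('the',): 10, ('th', 'a', 'w'): 1}
```

After two merges the common word is a single token, while the rare word is still split into three parts.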

First, let's load the tokenizer for a common Transformer model, `distilgpt2`, with the code below. The `distilgpt2` model is a smaller model based on `gpt2`, a predecessor of the language model that underpins ChatGPT.

> To use the code below, you need to install the `transformers` library. 

> To get rid of the warning, install `torch` or `tensorflow` and `ipywidgets`.

In [2]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

The tokenizer has a function `tokenizer.tokenize` that splits up text.

In [3]:
tokenizer.tokenize("The quick fox jumps over the dog.")

['The', 'Ġquick', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġdog', '.']

In [4]:
tokenizer.tokenize("I visited Glasgow.")

['I', 'Ġvisited', 'ĠGlasgow', '.']

You should see four tokens, some of which start with an odd character, 'Ġ'. The 'Ġ' stands for a preceding space, so it marks a token that starts a new word. Try tokenizing "volcano" below with `tokenizer.tokenize`. It should be split into two subword tokens.

In [5]:
tokenizer.tokenize("volcano")

['vol', 'cano']
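
Because 'Ġ' marks where a new word starts (it stands in for the space before the word), a list of tokens can be turned back into text by joining the pieces and swapping each 'Ġ' for a space. The helper `tokens_to_text` below is our own simplified sketch for plain English text like these examples; the library's own `tokenizer.convert_tokens_to_string` handles the general case.

```python
def tokens_to_text(tokens):
    # Join the pieces, then turn each 'Ġ' marker back into a space.
    return "".join(tokens).replace("Ġ", " ")

print(tokens_to_text(['I', 'Ġvisited', 'ĠGlasgow', '.']))  # I visited Glasgow.
print(tokens_to_text(['vol', 'cano']))                     # volcano
```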

As well as splitting the text into tokens/subtokens, we need the tokens mapped to numbers, because Transformer models take token indices as input. For example, the token index for the word 'Glasgow' is shown below (note the leading 'Ġ' in the key).

In [6]:
tokenizer.vocab['ĠGlasgow']

23995

`tokenizer.vocab` is a big dictionary mapping subword tokens to their indices. Let's see how big the `distilgpt2` tokenizer's vocabulary is:

In [7]:
len(tokenizer.vocab)

50257

We could manually map the tokenized output to the token indices, but the tokenizer can do it for us using `tokenizer.encode`.

In [8]:
tokenizer.encode("I visited Glasgow.")

[40, 8672, 23995, 13]
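
For comparison, the manual mapping would look something like the sketch below. To keep it runnable without loading the tokenizer, it uses a hypothetical four-entry slice of the vocabulary (`toy_vocab`), with the indices copied from the outputs above; with the real tokenizer you would look tokens up in `tokenizer.vocab` instead.

```python
# Hypothetical four-entry slice of the vocabulary (indices from the outputs above).
toy_vocab = {'I': 40, 'Ġvisited': 8672, 'ĠGlasgow': 23995, '.': 13}

tokens = ['I', 'Ġvisited', 'ĠGlasgow', '.']

# Look each token up in the dictionary to get its index.
token_ids = [toy_vocab[token] for token in tokens]
print(token_ids)  # [40, 8672, 23995, 13]
```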

You can use the `tokenizer.decode` function to convert from a list of indices back to text.

In [9]:
sentences = [[40, 8672, 23995, 13],[464, 7850, 46922, 4539, 832, 23995, 13]]
for sentence in sentences:
    print(tokenizer.decode(sentence))

I visited Glasgow.
The river Clyde runs through Glasgow.


The tokenizer has a lot of parameters to give extra control. For instance, you sometimes need to truncate very long strings (as there is a limit on the length of input to Transformer models). Use the `tokenizer.encode` function to tokenize "Kelvingrove is a beautiful park in Glasgow." and also trim it to only 5 tokens using `truncation=True` and `max_length=5`.

In [10]:
tokenizer.encode("Kelvingrove is a beautiful park in Glasgow.", truncation=True, max_length=5)

[42, 417, 1075, 305, 303]
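
For a tokenizer like this one, which does not appear to add any special tokens around the text, truncating to `max_length` amounts to keeping the first `max_length` indices. A minimal sketch of that idea, assuming tokens are dropped from the right (the `truncate` helper is our own, not part of the library):

```python
def truncate(token_ids, max_length):
    # Keep the first max_length indices and drop the rest.
    return token_ids[:max_length]

# Indices for "The river Clyde runs through Glasgow." from the earlier cell.
ids = [464, 7850, 46922, 4539, 832, 23995, 13]
short_ids = truncate(ids, 5)
print(short_ids)  # [464, 7850, 46922, 4539, 832]
```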

The most common way to use a tokenizer is shown below: calling it directly returns output in a format ready to pass into a Transformer model. The `return_tensors='pt'` argument tells it to return PyTorch tensors, a data structure used for deep learning.

The output contains the `input_ids`, which are the token indices, and an `attention_mask`, which can be used to tell a Transformer to ignore certain tokens. Masking is needed when padding is used to bring shorter sequences up to the length of longer ones. That's not the case here, so the attention mask values are all one.

In [11]:
tokenizer('Kelvingrove is a park in Glasgow.', return_tensors='pt')

{'input_ids': tensor([[   42,   417,  1075,   305,   303,   318,   257,  3952,   287, 23995,
            13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
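
To see what the attention mask looks like when padding is involved, here is a plain-Python sketch that pads two sequences of different lengths by hand. The `pad_batch` helper is our own illustration, it pads on the right, and the padding index 50256 is an assumption (GPT-2 style tokenizers have no dedicated padding token, and a common workaround is to reuse the end-of-text token, whose index is 50256).

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build the masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

PAD_ID = 50256  # assumption: reuse the end-of-text token index for padding

batch = pad_batch(
    [[40, 8672, 23995, 13], [464, 7850, 46922, 4539, 832, 23995, 13]],
    PAD_ID,
)
print(batch["input_ids"][0])       # [40, 8672, 23995, 13, 50256, 50256, 50256]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 1]
```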

Each tokenizer is very specific to the text it was trained on. For instance, below is a tokenizer that was trained on Spanish text.

In [12]:
spanish_tokenizer = AutoTokenizer.from_pretrained('datificate/gpt2-small-spanish')

If we give it an English sentence from earlier, it tokenizes it very differently and splits common English words into multiple parts.

In [13]:
spanish_tokenizer.tokenize('The river Clyde runs through Glasgow.')

['The', 'Ġri', 'ver', 'ĠClyde', 'Ġr', 'uns', 'Ġth', 'rough', 'ĠGlasgow', '.']

However, it will tokenize Spanish effectively:

In [14]:
spanish_tokenizer.tokenize('Que te vaya bien')

['Que', 'Ġte', 'Ġvaya', 'Ġbien']
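
One rough way to put a number on this difference is to count tokens per whitespace-separated word. The sketch below uses only the token lists printed above, so it runs without loading either tokenizer; the `tokens_per_word` helper is our own, and the measure is only indicative (punctuation counts as a token but not as a word).

```python
def tokens_per_word(sentence, tokens):
    # Average number of tokens used for each whitespace-separated word.
    return len(tokens) / len(sentence.split())

english = 'The river Clyde runs through Glasgow.'
english_tokens = ['The', 'Ġri', 'ver', 'ĠClyde', 'Ġr', 'uns', 'Ġth', 'rough', 'ĠGlasgow', '.']

spanish = 'Que te vaya bien'
spanish_tokens = ['Que', 'Ġte', 'Ġvaya', 'Ġbien']

english_ratio = tokens_per_word(english, english_tokens)
spanish_ratio = tokens_per_word(spanish, spanish_tokens)
print(round(english_ratio, 2))  # 1.67
print(round(spanish_ratio, 2))  # 1.0
```

A ratio close to one means most words survive as single tokens, which is what we see for the Spanish sentence.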

## Sub-tokenization examples

In [15]:
words = ["unhappiness","generalization","understand","caregiver","understandable",
         "counterintuitive","uncharacteristic","misunderstanding","disestablishmentarianism","antidisestablishmentarianism"]

In [16]:
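from collections import defaultdict

# Group each word with its subword tokens, keyed by how many tokens it splits into.
sub_tokens_by_length = defaultdict(list)
for word in words:
    sub_tokens = tokenizer.tokenize(word)
    sub_tokens_by_length[len(sub_tokens)].append((word, sub_tokens))

# Sort by token count so the words with the fewest tokens are printed first.
sub_tokens_by_length = dict(sorted(sub_tokens_by_length.items()))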

In [17]:
for key, value in sub_tokens_by_length.items():
    print(f"{key} Subword Tokens:")
    for (word, sub_tokens) in value:
        print(f"{word} -> \t{sub_tokens}")

2 Subword Tokens:
generalization -> 	['general', 'ization']
understand -> 	['under', 'stand']
counterintuitive -> 	['counter', 'intuitive']
3 Subword Tokens:
unhappiness -> 	['un', 'h', 'appiness']
caregiver -> 	['care', 'g', 'iver']
understandable -> 	['under', 'stand', 'able']
misunderstanding -> 	['mis', 'under', 'standing']
4 Subword Tokens:
uncharacteristic -> 	['unch', 'ar', 'acter', 'istic']
disestablishmentarianism -> 	['dis', 'establishment', 'arian', 'ism']
5 Subword Tokens:
antidisestablishmentarianism -> 	['ant', 'idis', 'establishment', 'arian', 'ism']
