In [1]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Currently, only participants in SupportVectors workshops are allowed to study the notebooks, for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## Pretrained tokenizers



In this lab, we will delve into tokenizers. A tokenizer transforms a piece of text into tokens, which form the input to subsequent stages, such as a transformer that uses the tokens for a classification task.



In [2]:
example = "The tokenizer does tokenization. It does this to have fun with tokens."

In [3]:
from transformers import AutoTokenizer
import pandas as pd
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


In [5]:
tokens = tokenizer(example)
print(tokens)

{'input_ids': [101, 1996, 19204, 17629, 2515, 19204, 3989, 1012, 2009, 2515, 2023, 2000, 2031, 4569, 2007, 19204, 2015, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [6]:
pd.DataFrame(dict(tokens))

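| | input_ids | token_type_ids | attention_mask |
|---|---|---|---|
| 0 | 101 | 0 | 1 |
| 1 | 1996 | 0 | 1 |
| 2 | 19204 | 0 | 1 |
| 3 | 17629 | 0 | 1 |
| 4 | 2515 | 0 | 1 |
| 5 | 19204 | 0 | 1 |
| 6 | 3989 | 0 | 1 |
| 7 | 1012 | 0 | 1 |
| 8 | 2009 | 0 | 1 |
| 9 | 2515 | 0 | 1 |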


There are some interesting observations. First, the `example` string has only 12 words and two punctuation marks (the end-of-sentence periods). But the tokenizer has broken it into 19 tokens! Can you guess why?

Let us investigate it by reversing the tokenization. 



In [7]:
recovered_tokens = tokenizer.convert_ids_to_tokens(tokens.input_ids)
print(recovered_tokens)

['[CLS]', 'the', 'token', '##izer', 'does', 'token', '##ization', '.', 'it', 'does', 'this', 'to', 'have', 'fun', 'with', 'token', '##s', '.', '[SEP]']


Intuitively, we would have expected the tokens to be the words and punctuation marks. In other words, we would have expected something like:

`['The', 'tokenizer', 'does', 'tokenization', '.',  'It', 'does', 'this', 'to', 'have', 'fun', 'with', 'tokens', '.']`

The tokenizer has converted everything to lower case! It turns out that the `bert-base-uncased` tokenizer does that as a pre-processing step.
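For comparison, the word-and-punctuation split that we intuitively expected can be sketched with a regular expression. This only illustrates the naive approach and is not what `bert-base-uncased` does; the pattern and the name `naive_tokens` are our own choices.

```python
import re

example = "The tokenizer does tokenization. It does this to have fun with tokens."

# Naive tokenization: runs of word characters, or single punctuation marks.
naive_tokens = re.findall(r"\w+|[^\w\s]", example)
print(naive_tokens)
print(len(naive_tokens))  # 12 words + 2 periods

# Lower-casing, the pre-processing step that the uncased tokenizer applies.
lowered = [token.lower() for token in naive_tokens]
print(lowered)
```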

#### Special tokens

Next, we observe that there is a `[CLS]` token at the beginning and a `[SEP]` token at the end. `[CLS]` is a special token that the `BERT` tokenizer always prepends to the input sequence (here, the sentence). It also appends `[SEP]` to each segment, to separate it from the next segment.

Recall that we covered this while carefully studying the BERT architecture in a previous session.

`BERT` accepts up to two segments, but we are passing only one. Each token is tagged with a segment identifier, which the `tokens.token_type_ids` field captures. Since the whole sentence forms a single segment, all the tokens belong to the first (and only) segment, so every entry of `tokens.token_type_ids` is `0`.
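To make the segment identifiers concrete, here is a minimal pure-Python sketch of how a BERT-style input is laid out for one or two segments. The helper `build_bert_input` is hypothetical (it is not part of `transformers`); it works on already-tokenized pieces and only mimics the layout `[CLS] A [SEP] B [SEP]`.

```python
def build_bert_input(segment_a, segment_b=None):
    """Lay out tokens and segment ids the way BERT-style inputs are arranged."""
    tokens = ["[CLS]"] + segment_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)        # first segment -> 0
    if segment_b is not None:
        second = segment_b + ["[SEP]"]
        tokens += second
        token_type_ids += [1] * len(second)   # second segment -> 1
    return tokens, token_type_ids

# One segment: every segment id is 0.
single = build_bert_input(["to", "learn", "is", "to", "live", "."])
# Two segments: the second one is tagged with 1.
pair = build_bert_input(["to", "learn"], ["to", "live"])
print(single)
print(pair)
```

With the real tokenizer, passing two strings, as in `tokenizer(first, second)`, should produce the same pattern of `0`s and `1`s in `token_type_ids`.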

### Word segments

The English language has a rather vast vocabulary. If we were to work with such a large vocabulary, say one million words, we would have a problem:

Recall that a categorical variable has to be one-hot encoded into a vector whose dimensionality equals the number of distinct categories of the variable. For us, the categorical variable is the token: if it takes values in a large vocabulary, it gets one-hot encoded into a very large vector.

Very large vectors are inefficient to train; they need far more computational power, and also more data.
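A toy sketch of the dimensionality argument: with one-hot encoding, each token becomes a vector as long as the vocabulary. The tiny vocabulary and the `one_hot` helper below are made up for illustration.

```python
def one_hot(token, vocabulary):
    """Return a one-hot vector with a single 1 at the token's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(token)] = 1
    return vector

toy_vocabulary = ["[UNK]", "token", "##izer", "##ization", "##s"]
encoded = one_hot("##izer", toy_vocabulary)
print(encoded)  # [0, 0, 1, 0, 0]

# The vector length always equals the vocabulary size.
for size in (len(toy_vocabulary), 30_522, 1_000_000):
    print(f"vocabulary of {size:,} tokens -> one-hot vector of length {size:,}")
```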

So most tokenizers use a much smaller vocabulary, typically a few tens of thousands of tokens (the `bert-base-uncased` vocabulary has 30,522). But these tokens do not always correspond to words. Look carefully at the `example` text:

>The tokenizer does tokenization. It does this to have fun with tokens.

Do we see something interesting? Some of the words are composites: they are made of `token` and some other word-piece. For example, `tokenizer` is `token` + `##izer`. The suffix word-pieces start with the special characters `##`, signifying that they continue the previous token. We also realize that lots of words in the vocabulary end with `izer`, and that suffixing many words with an `s` makes them the plural of the original word.


Some examples are:

>Finalizer
Finalizers
Visualizer
Visualizers
Equalizer
Equalizers
Humanizer
Humanizers
Harmonizer
Harmonizers
Patronizer
Patronizers
Vaporizer
Vaporizers


We have deliberately chosen words that end with `izer`, but that we would not ordinarily think of as composites of a shorter word and `izer`.

Let us see what happens if we pass these words through the tokenizer, and then inspect the results.

In [8]:
izers = """
Finalizer
Finalizers
Visualizer
Visualizers
Equalizer
Equalizers
Humanizer
Humanizers
Harmonizer
Harmonizers
Patronizer
Patronizers
Vaporizer
Vaporizers"""
izer_tokens = tokenizer(izers)
izer_recovered_tokens = tokenizer.convert_ids_to_tokens(izer_tokens.input_ids)
print(izer_recovered_tokens)

['[CLS]', 'final', '##izer', 'final', '##izer', '##s', 'visual', '##izer', 'visual', '##izer', '##s', 'equal', '##izer', 'equal', '##izer', '##s', 'human', '##izer', 'human', '##izer', '##s', 'harmon', '##izer', 'harmon', '##izer', '##s', 'patron', '##izer', 'patron', '##izer', '##s', 'vapor', '##izer', 'vapor', '##izer', '##s', '[SEP]']


Notice how a word like `Visualizers` gets broken into word-segments: `visual`, `##izer` and `##s`.


This gives us an interesting insight: **one way to shrink the vocabulary is to use these word-segments, rather than whole words, as its building blocks.** As an added benefit, the number of unknown words is reduced, since they can now be split into smaller word-segments that are part of the vocabulary.

There are agglutinative languages such as Sanskrit, Finnish, Turkish, and Japanese, which tend to join words together, so that a word can be arbitrarily long, composed of many sub-words. In all such cases, a smaller vocabulary of subwords comes in very handy.
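The splitting seen above can be sketched as a greedy longest-match-first search over a vocabulary, which is how WordPiece tokenization is commonly described at inference time. The vocabulary below is a tiny hand-picked toy (the real `bert-base-uncased` vocabulary is far larger), and `wordpiece_split` is a hypothetical helper, not the library's implementation.

```python
def wordpiece_split(word, vocabulary, unknown="[UNK]"):
    """Greedily split a lower-cased word into the longest matching pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry ##
            if candidate in vocabulary:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown]                  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocabulary = {"visual", "human", "token", "##izer", "##ization", "##s"}
for word in ["visualizers", "humanizer", "tokenization", "qwerty"]:
    print(word, "->", wordpiece_split(word, toy_vocabulary))
```

Because continuation pieces such as `##izer` and `##s` are shared across many words, a handful of vocabulary entries covers all the singular and plural forms.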

There are several such approaches to tokenizing text. Some popular ones are:

* **WordPiece**, as shown above, which BERT tokenizers use
* **SentencePiece** or **Unigram**, which multilingual models often use
* **Byte-level BPE**, which GPT-2 uses



The `bert-base-uncased` tokenizer follows the WordPiece approach. Other tokenizers vary in the details of how they tokenize, but they all follow essentially similar ideas.

#### Attention mask

Now, consider what happens if we give the tokenizer a batch of inputs.

In [9]:

err = "To err is human, to forgive divine"
learn = "To learn is to live."
sentences = [ err, learn]

tokens = tokenizer(sentences, padding=True)
pd.DataFrame(dict(tokens))

| | input_ids | token_type_ids | attention_mask |
|---|---|---|---|
| 0 | [101, 2000, 9413, 2099, 2003, 2529, 1010, 2000... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] |
| 1 | [101, 2000, 4553, 2003, 2000, 2444, 1012, 102,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0] |


Notice that the first sentence is longer than the second sentence by three tokens. So the attention mask of the second sentence is padded with three zeros. If we do not give `padding=True`, then the attention mask will be a shorter list of `1`s.

This explains the purpose of the attention mask -- it tells the downstream models which tokens to ignore or not focus attention on, since they are simply padding tokens.

We can see this more explicitly below. Notice the presence of the special token `[PAD]`.

In [10]:

for ids in tokens.input_ids:
    recovered = tokenizer.convert_ids_to_tokens(ids)
    print(recovered)

['[CLS]', 'to', 'er', '##r', 'is', 'human', ',', 'to', 'forgive', 'divine', '[SEP]']
['[CLS]', 'to', 'learn', 'is', 'to', 'live', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
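The padding logic can be mimicked in a few lines of plain Python. This sketch works on the token strings printed above rather than on ids, and `pad_batch` is a hypothetical helper, not the `transformers` implementation.

```python
def pad_batch(batch, pad_token="[PAD]"):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(sequence) for sequence in batch)
    padded, masks = [], []
    for sequence in batch:
        missing = longest - len(sequence)
        padded.append(sequence + [pad_token] * missing)
        masks.append([1] * len(sequence) + [0] * missing)  # 0 marks padding
    return padded, masks

err_tokens = ["[CLS]", "to", "er", "##r", "is", "human", ",", "to", "forgive", "divine", "[SEP]"]
learn_tokens = ["[CLS]", "to", "learn", "is", "to", "live", ".", "[SEP]"]

padded, masks = pad_batch([err_tokens, learn_tokens])
for tokens_row, mask_row in zip(padded, masks):
    print(tokens_row)
    print(mask_row)
```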


#### Unknown words

As we mentioned, the English vocabulary is vast. But tokenizers tend to work with a much smaller vocabulary, and they map any word they do not recognize to a special token, `[UNK]`. With the `WordPiece` tokenizer in `BERT`, these show up far less often, since unknown words are usually broken down into known sub-word pieces. Let's see this below, where `brillig` is a nonsense word from the poem Jabberwocky.

In [11]:
jabberwocky = 'It was brillig and the slithy toves'
tokens = tokenizer(jabberwocky)
tokens

{'input_ids': [101, 2009, 2001, 7987, 8591, 8004, 1998, 1996, 18036, 10536, 2000, 6961, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
recovered_tokens = tokenizer.convert_ids_to_tokens(tokens.input_ids)
print(recovered_tokens)

['[CLS]', 'it', 'was', 'br', '##ill', '##ig', 'and', 'the', 'slit', '##hy', 'to', '##ves', '[SEP]']


### Decoding

Decoding is the restoration of the text back from the token identifiers. Let us look into this, with the previous example.

In [13]:
tokenizer.decode(tokens.input_ids)

'[CLS] it was brillig and the slithy toves [SEP]'

Observe, however, that the words are in lower case. Also, there is the presence of the `[CLS]` and `[SEP]` tokens.

We could have avoided the special tokens with a slightly different API: the `tokenize()` method of the `tokenizer`. It does not add `[CLS]` and `[SEP]`.

In [14]:
tokens = tokenizer.tokenize(jabberwocky)
print(tokens)

['it', 'was', 'br', '##ill', '##ig', 'and', 'the', 'slit', '##hy', 'to', '##ves']


To explicitly get the token ids, we need to call:
    

In [15]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[2009, 2001, 7987, 8591, 8004, 1998, 1996, 18036, 10536, 2000, 6961]
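Conceptually, `convert_tokens_to_ids` is a vocabulary lookup. The sketch below uses a toy dictionary containing only the token/id pairs that appear in the outputs above; `to_ids` and `to_tokens` are hypothetical helpers, and treating `100` as the id of `[UNK]` is an assumption about the `bert-base-uncased` vocabulary.

```python
# Token/id pairs taken from the outputs above.
toy_vocab = {
    "it": 2009, "was": 2001, "br": 7987, "##ill": 8591, "##ig": 8004,
    "and": 1998, "the": 1996, "slit": 18036, "##hy": 10536,
    "to": 2000, "##ves": 6961,
}
UNK_ID = 100  # assumed id of [UNK]; not shown in the outputs above

id_to_token = {idx: token for token, idx in toy_vocab.items()}

def to_ids(tokens):
    """Look each token up, falling back to the unknown id."""
    return [toy_vocab.get(token, UNK_ID) for token in tokens]

def to_tokens(ids):
    """Reverse lookup from ids to tokens."""
    return [id_to_token.get(idx, "[UNK]") for idx in ids]

pieces = ["it", "was", "br", "##ill", "##ig"]
ids = to_ids(pieces)
print(ids)
print(to_tokens(ids))
print(to_ids(["jabberwock"]))  # not in the toy vocabulary
```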

And finally, we can decode it back the usual way; note the absence of the special tokens in the result, as expected.

In [16]:
tokenizer.decode(token_ids)

'it was brillig and the slithy toves'
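Decoding glues the word-pieces back together: pieces that start with `##` attach to the previous piece. Here is a minimal sketch of that joining step; `join_wordpieces` is a hypothetical name, and the real `decode` additionally handles special tokens and spacing clean-up.

```python
def join_wordpieces(pieces):
    """Join WordPiece tokens, attaching ## continuations to the previous piece."""
    return " ".join(pieces).replace(" ##", "").strip()

pieces = ["it", "was", "br", "##ill", "##ig", "and", "the", "slit", "##hy", "to", "##ves"]
restored = join_wordpieces(pieces)
print(restored)  # it was brillig and the slithy toves
```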

### Explicitly calling a tokenizer class

An alternative is to use the explicit class `BertTokenizer` to load the relevant pretrained tokenizer checkpoint:


In [17]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

### SentencePiece tokenizer

Let us now see how a `SentencePiece` tokenizer would tokenize. We will use XLM-R (`xlm-roberta-base`) for this below. Notice the small variations in the output format. The segment start is marked by `<s>` and the end by `</s>`. Also, instead of marking continuation pieces with `##`, this tokenizer marks the start of each word with `▁`.

In [18]:
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
tokens = tokenizer(example)
tokens

{'input_ids': [0, 581, 47, 1098, 52825, 14602, 47, 1098, 47691, 5, 1650, 14602, 903, 47, 765, 7477, 678, 47, 84694, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [19]:
print(tokenizer.convert_ids_to_tokens(tokens.input_ids))

['<s>', '▁The', '▁to', 'ken', 'izer', '▁does', '▁to', 'ken', 'ization', '.', '▁It', '▁does', '▁this', '▁to', '▁have', '▁fun', '▁with', '▁to', 'kens', '.', '</s>']
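Since the pieces above carry a `▁` at the start of each word and preserve casing, the original text can be rebuilt by plain concatenation. Here is a minimal sketch; `join_sentencepieces` is a hypothetical helper, not the library's decoder.

```python
def join_sentencepieces(pieces, specials=("<s>", "</s>")):
    """Concatenate pieces, turning the word-start marker back into spaces."""
    kept = [piece for piece in pieces if piece not in specials]
    return "".join(kept).replace("▁", " ").strip()

pieces = ['<s>', '▁The', '▁to', 'ken', 'izer', '▁does', '▁to', 'ken', 'ization', '.',
          '▁It', '▁does', '▁this', '▁to', '▁have', '▁fun', '▁with', '▁to', 'kens', '.', '</s>']
restored = join_sentencepieces(pieces)
print(restored)
```

Note that here `tokenizer` was split as `▁to` + `ken` + `izer`, whereas the WordPiece vocabulary gave `token` + `##izer`: the pieces depend on the vocabulary each tokenizer learned.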


### Anatomy of a tokenizer

A tokenizer internally is a **pipeline** of four logical operations; these are depicted in the figure below.

<img src="images/tokenizer-pipeline.png" />