## Tokenizer

#### Word-based tokenizer

With a word-based tokenizer, a sentence is split into words, for example on spaces or on punctuation.

For example, "I'm going to the cinema" becomes 5 tokens when split on spaces.
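The two splitting strategies can be compared with a small sketch in plain Python. The regular expression used for the punctuation split is only one possible choice, picked here for illustration, and is not what any particular library does.

```python
import re

sentence = "I'm going to the cinema"

# Split on whitespace only: the contraction "I'm" stays a single token.
space_tokens = sentence.split()

# Split on punctuation as well: runs of word characters, or single symbols.
punct_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(space_tokens)  # ["I'm", 'going', 'to', 'the', 'cinema']
print(punct_tokens)  # ['I', "'", 'm', 'going', 'to', 'the', 'cinema']
```

Splitting on punctuation gives 7 tokens instead of 5, because the apostrophe breaks "I'm" into three pieces.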

Word-based tokenization has some limitations. For example, "house" and "houses" each get their own numeric representation, even though the words are closely related: one is simply the plural of the other. Another limitation is that if you want the model to cover every word of a language, the vocabulary becomes very large and heavy.
One way to limit this is to keep only the most common words, for example the top 10k. Any word in a sentence that does not belong to the vocabulary (a so-called out-of-vocabulary word) is then replaced by an UNKNOWN tag. This becomes a problem when, for example, several words in the same sentence are out of vocabulary, since they all end up as the same tag.
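A minimal sketch of a capped vocabulary with an unknown tag is shown here. The toy corpus, the `[UNK]` spelling of the tag, and the cap of 4 words (standing in for 10k) are all assumptions made for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical); a real vocabulary is built from far more text.
corpus = [
    "the house is big",
    "the houses are big",
    "the cinema is near the house",
]

MAX_VOCAB_SIZE = 4  # stands in for the 10k limit
UNK = "[UNK]"

# Count every word, then keep only the most frequent ones.
counts = Counter(word for line in corpus for word in line.split())
vocab = {UNK: 0}  # ID 0 is reserved for out-of-vocabulary words
for word, _ in counts.most_common(MAX_VOCAB_SIZE):
    vocab[word] = len(vocab)

def encode(sentence):
    """Map each word to its ID, falling back to the UNK ID."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

sentence = "the houses near the cinema are big"
mapped = [word if word in vocab else UNK for word in sentence.split()]
print(mapped)  # ['the', '[UNK]', '[UNK]', 'the', '[UNK]', '[UNK]', 'big']
```

Four different words collapse into the same tag here, so the model can no longer tell them apart.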

#### Character-based tokenizer

A sentence can also be split into characters. This drastically reduces the vocabulary size: for English, a full word-based vocabulary would be ~170k words, whereas a character-based vocabulary would have 256 entries, including special characters and punctuation.
With character-based tokenization, context can be lost, because a single letter does not mean much on its own. How much this matters depends on the language: in Chinese, a single character can carry a lot of meaning, while in a language written in the Latin alphabet it does not.

Another issue with character-based tokenization is that the number of tokens becomes large: a sentence that gives maybe 5 tokens with word-based tokenization can give 40-50 tokens when split into characters.
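The trade-off between vocabulary size and sequence length can be seen with a short sketch in plain Python. Treating every character, including spaces, as a token is an assumption of this example.

```python
sentence = "I'm going to the cinema"

word_tokens = sentence.split()  # word-based: split on spaces
char_tokens = list(sentence)    # character-based: one token per character

# The character vocabulary is just the set of distinct characters seen.
char_vocab = sorted(set(char_tokens))

print(len(word_tokens))  # 5
print(len(char_tokens))  # 23
print(len(char_vocab) <= len(char_tokens))  # True
```

The same sentence is more than four times longer as a character sequence, while the vocabulary needed to cover it stays very small.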


#### Subword tokenization

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. WordPiece, used by the BERT tokenizer below, is one type of subword tokenization.
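To make the idea concrete, here is a simplified sketch of how a word can be split with a greedy longest-match-first strategy, in the style of WordPiece. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical; a real tokenizer learns its vocabulary from data and also handles casing, punctuation and other details left out here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"how", "does", "work", "token", "##ization", "##s"}

print(wordpiece_tokenize("work", toy_vocab))          # ['work']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("tokens", toy_vocab))        # ['token', '##s']
```

A word that is in the vocabulary stays whole, while a word that is not is decomposed into pieces that are.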

In [1]:
from transformers import BertTokenizer, AutoTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


In [2]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')


In [3]:
tokenizer("Hi there my name is Jarvis, how may I help you?")

{'input_ids': [101, 8790, 1175, 1139, 1271, 1110, 23255, 117, 1293, 1336, 146, 1494, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [4]:
# Save the tokenizer files to a local folder

tokenizer.save_pretrained("folder_of_your_choice")

('folder_of_your_choice\\tokenizer_config.json',
 'folder_of_your_choice\\special_tokens_map.json',
 'folder_of_your_choice\\vocab.txt',
 'folder_of_your_choice\\added_tokens.json',
 'folder_of_your_choice\\tokenizer.json')

In [5]:
sentence = "How does tokenization work?"
tokens = tokenizer.tokenize(sentence)
print(tokens)

['How', 'does', 'token', '##ization', 'work', '?']


In [7]:
# Convert the tokens to their vocabulary IDs
to_ids = tokenizer.convert_tokens_to_ids(tokens)
to_ids

[1731, 1674, 22559, 2734, 1250, 136]

In [10]:
# Decode the IDs back into text

decoded = tokenizer.decode(to_ids)
decoded

'How does tokenization work?'

In [11]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

In [12]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [13]:
tokens

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

In [None]:
"""
    Here you can see that the shorter sentence was padded with 0s, and that its attention mask is 0 at the padded positions.
"""
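The padding behaviour can be reproduced by hand with a small sketch in plain Python. The helper `pad_batch` is hypothetical and only mimics what `padding=True` produced in the output above; the pad ID of 0 is taken from that output and is not a general rule for every tokenizer.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Unpadded IDs of the two sentences, copied from the output above.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102],
]

padded_ids, masks = pad_batch(batch)
print(padded_ids[1])
print(masks[1])
```

The second sequence is extended from 6 to 16 IDs, and its mask has six 1s followed by ten 0s.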