### Tokenizers
Tokenizers aim to find the most meaningful numerical representation of raw text.<br>
Common tokenizer types: word-based, character-based, and subword-based.

### Word Based

Each word gets its own id, so the vocabulary can become very large, and words with unrelated meanings may end up with adjacent ids.<br>
Out-of-vocabulary words lose information because they are all represented by the same unknown token.<br>
Hence we want to reduce the number of out-of-vocabulary words.

In [2]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
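To make the out-of-vocabulary problem concrete, here is a minimal sketch of a word-level vocabulary. The two-sentence corpus, the `[UNK]` token name, and the `encode` helper are illustrative assumptions, not part of any library.

```python
# Toy corpus used to build the vocabulary (illustrative assumption).
corpus = ["Jim Henson was a puppeteer", "Jim was a director"]

# Give each distinct word an integer id, reserving id 0 for unknown words.
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(text):
    """Map each whitespace-separated word to its id, or to the [UNK] id if unseen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

known = encode("Jim was a puppeteer")
unknown = encode("Kermit was a frog")
print(known)    # [1, 3, 4, 5]
print(unknown)  # [0, 3, 4, 0]
```

"Kermit" and "frog" both collapse to id 0, so the model can no longer tell them apart.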


### Character Based

Character-based tokenization solves the issue of unknown words, as all words are made up of the same small set of characters.<br>
However, sequences get very long since each character is a token itself, and each individual token may be less meaningful.<br>
This tokenization may work better for ideographic languages (e.g. Chinese), where each character carries more meaning.
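The trade-off can be seen on the same sentence used in the word-based example. This is a rough sketch: the id assignment (sorted distinct characters) is an arbitrary choice made for illustration.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)  # every character, including spaces, is a token

# Character vocabulary: each distinct character gets an id.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens))  # number of word-level tokens
print(len(char_tokens))  # number of character-level tokens
print(len(char_vocab))   # size of the character vocabulary
```

The character sequence is several times longer than the word sequence, while the vocabulary stays small.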

### Subword Tokenization
- Frequently used words should not be split into smaller subwords
- Rare words should be decomposed into meaningful subwords

"dogs" becomes "dog" + "s" <br>
"tokenization" becomes "token" + "ization" <br>

Common subword tokenization algorithms: WordPiece, Unigram, Byte-Pair Encoding
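The splits above can be sketched with a greedy longest-match lookup, in the style of WordPiece at inference time. The tiny vocabulary and the `subword_tokenize` helper are hypothetical. Real vocabularies are learned from a corpus, and that training step is not shown here. Non-initial pieces are marked with a `##` prefix.

```python
# Hypothetical toy vocabulary; real ones are learned from data.
vocab = {"dog", "token", "the", "##s", "##ization", "##ize"}

def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(subword_tokenize("dogs", vocab))          # ['dog', '##s']
print(subword_tokenize("tokenization", vocab))  # ['token', '##ization']
print(subword_tokenize("the", vocab))           # ['the']
```

A frequent word such as "the" stays whole because it is in the vocabulary, while rarer words fall back to smaller known pieces.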


### Loading and saving tokenizers

In [4]:
# Uses from_pretrained() and save_pretrained() as with models
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# We can also use AutoTokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [7]:
tokenizer("Let's try to tokenize")

{'input_ids': [101, 2421, 112, 188, 2222, 1106, 22559, 3708, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
## Saving tokenizer
tokenizer.save_pretrained("my-bert-tokenizer")

('my-bert-tokenizer/tokenizer_config.json',
 'my-bert-tokenizer/special_tokens_map.json',
 'my-bert-tokenizer/vocab.txt',
 'my-bert-tokenizer/added_tokens.json',
 'my-bert-tokenizer/tokenizer.json')

In [6]:
# We can look at the tokens using the tokenize method
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize("Let's try to tokenize!")
print(tokens)

['Let', "'", 's', 'try', 'to', 'token', '##ize', '!']


In [8]:
# Looking at the tokens as ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2421, 112, 188, 2222, 1106, 22559, 3708, 106]


Comparing `print(input_ids)` with the `input_ids` returned by `tokenizer("Let's try to tokenize")`, we notice an extra id at the beginning (101) and at the end (102) of the latter (the 106 in our list is the `!`, which the earlier call did not include) <br>
This is because calling `tokenizer` directly inserts the special tokens `[CLS]` and `[SEP]`. We can use `prepare_for_model` to add the same special tokens to our ids <br>
Different tokenizers use different special tokens (e.g. RoBERTa uses \<s\> and \</s\> instead)

In [9]:
# Adding CLS and SEP to input_ids
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs["input_ids"])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[101, 2421, 112, 188, 2222, 1106, 22559, 3708, 106, 102]


In [10]:
# Then decode the final ids back into text to see the full sentence
print(tokenizer.decode(final_inputs["input_ids"]))

[CLS] Let's try to tokenize! [SEP]
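For a single BERT sentence, the special-token step amounts to wrapping the ids with the `[CLS]` and `[SEP]` ids. The sketch below reproduces this by hand, using the ids from the outputs above. The `add_special_tokens` helper is a hypothetical name, not a `transformers` function, and it ignores everything else `prepare_for_model` handles (sentence pairs, truncation, padding, attention masks).

```python
# Ids taken from the bert-base-cased outputs above: 101 is [CLS], 102 is [SEP].
CLS_ID, SEP_ID = 101, 102

def add_special_tokens(ids):
    """Wrap a single sequence of ids as [CLS] ... [SEP] (hypothetical helper)."""
    return [CLS_ID] + list(ids) + [SEP_ID]

input_ids = [2421, 112, 188, 2222, 1106, 22559, 3708, 106]
final_ids = add_special_tokens(input_ids)
print(final_ids)
# [101, 2421, 112, 188, 2222, 1106, 22559, 3708, 106, 102]
```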
