# Hugging Face Tokenizer
* https://www.youtube.com/watch?v=Cz2nvfK28eI&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi&index=54
* https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_11_3_tokenizers.ipynb

Tokenization is the task of chopping text up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Consider how a program might break up the following sentences into words.
* This is a test.
* OK, but what about this?
* Is U.S.A the same as USA.?
* What is the best data-set to use?
* I think I will do this-no wait; I will do that.
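
To make the difficulty concrete, here is a minimal sketch of a naive rule-based splitter. The function name `naive_tokenize` and the regular expression are illustrative assumptions, not part of any library; the rule simply keeps runs of word characters together and emits every punctuation mark as its own token.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ grabs letters/digits/underscores; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sentences = [
    "This is a test.",
    "OK, but what about this?",
    "Is U.S.A the same as USA.?",
    "What is the best data-set to use?",
    "I think I will do this-no wait; I will do that.",
]

for sentence in sentences:
    print(naive_tokenize(sentence))
```

With this rule, "U.S.A" falls apart into single letters and periods, and "data-set" becomes three tokens, which is why a simple rule of this kind is rarely sufficient.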

Hugging Face includes tokenizers that can break these sentences into words and subwords. Because English, and some other languages, are made up of common word parts, we tokenize subwords. For example, a gerund, such as "sleeping", may be tokenized into "sleep" and "##ing".
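
The subword idea can be sketched with a toy greedy longest-match-first splitter, in the spirit of the WordPiece scheme used by BERT-style tokenizers. The function `wordpiece` and the tiny `toy_vocab` below are hypothetical and exist only for illustration; a real tokenizer has a vocabulary of tens of thousands of entries and may split words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (an assumption for this sketch).
toy_vocab = {"sleep", "token", "text", "##ing", "##izing"}

print(wordpiece("sleeping", toy_vocab))
print(wordpiece("tokenizing", toy_vocab))
print(wordpiece("text", toy_vocab))
```

The design choice here is that a word only needs its pieces to be in the vocabulary, so a modest vocabulary can still cover many inflected forms.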

In [1]:
!pip install transformers
!pip install transformers[sentencepiece]



First, we create a Hugging Face tokenizer. There are several different tokenizers available from the Hugging Face hub. For this example, we will make use of the following tokenizer.
* distilbert-base-uncased

This tokenizer is based on BERT and assumes case-insensitive English text.

In [2]:
from transformers import AutoTokenizer
model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model)

We can now tokenize a sample sentence.

In [3]:
encoded = tokenizer('Tokenizing text is easy.')
print(encoded)

{'input_ids': [101, 19204, 6026, 3793, 2003, 3733, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}


The result of this tokenization contains two elements:
* input_ids - The individual subword indexes; each index uniquely identifies a subword.
* attention_mask - Which values in **input_ids** are meaningful and not padding. This sentence had no padding, so all elements have an attention mask of "1". Later, we will request the output to be of a fixed length, introducing padding, which always has an attention mask of "0". Though each tokenizer can be implemented differently, the attention mask of a tokenizer is generally either "0" or "1".

Due to subwords and special tokens, the number of tokens may not match the number of words in the source string. We can see the meaning of the individual tokens by converting these IDs back to strings.

In [4]:
tokenizer.convert_ids_to_tokens(encoded.input_ids)

['[CLS]', 'token', '##izing', 'text', 'is', 'easy', '.', '[SEP]']

As you can see, there are two special tokens placed at the beginning and end of each sequence. We will soon see how we can include or exclude these special tokens. These special tokens can vary per tokenizer; however, [CLS] begins a sequence for this tokenizer, and [SEP] ends a sequence. You will also see that the gerund "tokenizing" is broken into "token" and "##izing".

For this tokenizer, the special tokens occur between 100 and 103. Other BERT-style tokenizers on the Hugging Face hub often use a similar range, though special-token IDs vary by tokenizer. For this tokenizer, the value zero (0) represents padding. We can display all special tokens with this command.

In [5]:
tokenizer.convert_ids_to_tokens([0, 100, 101, 102, 103])

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']

This tokenizer supports these common tokens:
* [CLS] - Sequence beginning.
* [SEP] - Sequence end.
* [PAD] - Padding.
* [UNK] - Unknown token.
* [MASK] - Mask out tokens for a neural network to predict. Not used in this book.
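
As a rough illustration of excluding special tokens, the sketch below filters them out of a list of IDs by hand. The ID-to-token mapping is copied from the output above; the helper name `strip_special` is a hypothetical one for this example. In practice the library can do this for you, for example via the `skip_special_tokens` argument of `tokenizer.decode`.

```python
# Special-token IDs for distilbert-base-uncased, as printed above.
SPECIAL_TOKENS = {0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]", 103: "[MASK]"}

def strip_special(ids, special=SPECIAL_TOKENS):
    """Return only the IDs that are not special tokens (hypothetical helper)."""
    return [i for i in ids if i not in special]

# IDs for 'Tokenizing text is easy.' taken from the earlier output.
ids = [101, 19204, 6026, 3793, 2003, 3733, 1012, 102]
print(strip_special(ids))
```

Note that this simple filter also drops [UNK], which you may prefer to keep, since it stands in for real input text.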

It is also possible to tokenize lists of sequences. We can pad and truncate sequences to achieve a standard length by tokenizing many sequences at once.

In [7]:
text = [
    "This movie was great!",
    "I hated this movie, waste of time!",
    "Epic?"
]

encoded = tokenizer(text, padding=True, add_special_tokens=True)

print("**Input IDs**")
for a in encoded['input_ids']:
    print(a)

print("**Attention Masks**")
for a in encoded['attention_mask']:
    print(a)

**Input IDs**
[101, 2023, 3185, 2001, 2307, 999, 102, 0, 0, 0, 0]
[101, 1045, 6283, 2023, 3185, 1010, 5949, 1997, 2051, 999, 102]
[101, 8680, 1029, 102, 0, 0, 0, 0, 0, 0, 0]
**Attention Masks**
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


Notice the **input_ids** for the three movie review text sequences. Each of these sequences begins with 101, and we pad with zeros. Just before the padding, each group of IDs ends with 102. The attention masks also have zeros for each of the padding entries.
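
The relationship between padding and the attention mask can be reproduced by hand. The following is a minimal sketch, not the library's implementation: `pad_batch` is a hypothetical helper, and the unpadded ID lists are the sequences from the output above with their trailing zeros removed.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build 0/1 masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

unpadded = [
    [101, 2023, 3185, 2001, 2307, 999, 102],
    [101, 1045, 6283, 2023, 3185, 1010, 5949, 1997, 2051, 999, 102],
    [101, 8680, 1029, 102],
]

input_ids, attention_mask = pad_batch(unpadded)
for row in input_ids:
    print(row)
for row in attention_mask:
    print(row)
```

Every row comes out with the length of the longest sequence (11 here), which is what allows the batch to be stacked into a single rectangular tensor.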

We passed two parameters to the tokenizer to control the tokenization process. Useful parameters include:
* add_special_tokens (defaults to True) - Whether or not to encode the sequences with the special tokens relative to their model.
* padding (defaults to False) - Activates and controls padding.
* max_length (optional) - Controls the maximum length to use by one of the truncation/padding parameters.
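
To show how a maximum length interacts with truncation and padding, here is a hand-written sketch for a single sequence. It assumes the [CLS], [SEP], and [PAD] IDs shown earlier for this tokenizer, and `encode_fixed_length` is a hypothetical helper rather than a library function. With the real tokenizer, a comparable result would likely come from a call such as `tokenizer(text, padding="max_length", truncation=True, max_length=8)`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs shown earlier for this tokenizer

def encode_fixed_length(content_ids, max_length):
    """Truncate or pad one sequence so the result has exactly max_length IDs."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Reserve two positions for the special tokens, then cut the content.
    kept = content_ids[: max_length - 2]
    ids = [CLS_ID] + kept + [SEP_ID]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad

# Content IDs (without special tokens) taken from the movie review output above.
long_review = [1045, 6283, 2023, 3185, 1010, 5949, 1997, 2051, 999]
short_review = [8680, 1029]

print(encode_fixed_length(long_review, 8))
print(encode_fixed_length(short_review, 8))
```

The long review loses its last three content tokens, while the short one gains four padding entries with an attention mask of 0.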