<a href="https://colab.research.google.com/github/Hugging-Face-Sphere/Fall-2022/blob/main/2-finetuning-transformers/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


## 1. Word tokenization

Take a shot at tokenizing the text below by splitting it up by **word**. You should only need to call one standard Python method.

In [2]:
original_text = "Jim Henson was a puppeteer"

tokenized_text = "" # ... tokenize it here!

print(tokenized_text)  # Should print ['Jim', 'Henson', 'was', 'a', 'puppeteer']
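
If you get stuck, here is one possible approach (a sketch, not the only valid answer): `str.split()` called with no arguments splits a string on runs of whitespace.

```python
original_text = "Jim Henson was a puppeteer"

# With no arguments, str.split() splits on any run of whitespace
tokenized_text = original_text.split()

print(tokenized_text)  # ['Jim', 'Henson', 'was', 'a', 'puppeteer']
```

Note that this simple approach leaves punctuation attached to neighboring words, which is one reason real tokenizers do more than split on spaces.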




## 2. Character tokenization

Do the same thing, but split the string up by **character**.

In [3]:
original_text = "Jim Henson was a puppeteer"

tokenized_text = "" # ... tokenize it here!

print(tokenized_text)
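
Again, one possible approach if you want to check your answer: passing a string to `list()` yields its individual characters. This is only a sketch of character-level splitting.

```python
original_text = "Jim Henson was a puppeteer"

# list() iterates over the string one character at a time
tokenized_text = list(original_text)

print(tokenized_text[:5])   # ['J', 'i', 'm', ' ', 'H']
print(len(tokenized_text))  # 26
```

Spaces become tokens too, so the sequence is much longer than the five-word version.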




## 3. Subword tokenization

You can load any Transformer model's tokenizer by calling `from_pretrained` on its architecture-specific tokenizer class (`BertTokenizer` for BERT, for example).

In [4]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")


...but it's WAY more convenient to just use the `AutoTokenizer` class, which looks at the model's `config.json` to figure out which tokenizer class to load for you.

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


We can use the tokenizer by just calling it on the string we want to tokenize. Run the cell! What do you get out of it? Try passing two strings to the tokenizer, each as a separate argument. What do you think `token_type_ids` is? (See https://huggingface.co/docs/transformers/glossary#token-type-ids for an explanation!)

In [6]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
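
To make the three output fields more concrete, here is a rough pure-Python sketch of how a BERT-style sentence pair is laid out. `build_pair_inputs` is a hypothetical helper written for this illustration, not part of the Transformers API, and the inputs are pre-split toy words rather than real tokenizer output.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Hypothetical helper: lay out a BERT-style pair as [CLS] A [SEP] B [SEP]."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    second = tokens_b + ["[SEP]"]
    tokens = first + second
    # Segment 0 covers [CLS], the first sequence, and its [SEP];
    # segment 1 covers the second sequence and the final [SEP].
    token_type_ids = [0] * len(first) + [1] * len(second)
    # Nothing is padded here, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_inputs(
    ["Hello", "world"], ["How", "are", "you"]
)
print(tokens)          # ['[CLS]', 'Hello', 'world', '[SEP]', 'How', 'are', 'you', '[SEP]']
print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1]
```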

To illustrate the encoding process, it's helpful to look at the steps of the tokenizer. First, let's run `.tokenize()` on a sequence. What do you get out of it?

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
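
The `##` prefix marks a piece that continues the previous word. As a rough illustration of where such pieces come from, here is a simplified greedy longest-match-first splitter in plain Python. It is a sketch under assumptions: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary contains only the seven pieces shown above, and the real tokenizer also normalizes text and splits punctuation before this step.

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]"):
    """Simplified WordPiece-style splitting: greedy longest match per word."""
    tokens = []
    for word in text.split():
        start = 0
        pieces = []
        while start < len(word):
            end = len(word)
            match = None
            # Try the longest remaining substring first, then shrink it
            while start < end:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece  # continuation of the current word
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:
                # No piece fits: the whole word becomes the unknown token
                pieces = [unk_token]
                break
            pieces.append(match)
            start = end
        tokens.extend(pieces)
    return tokens


# Toy vocabulary holding only the pieces from the output above
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

tokens = wordpiece_tokenize("Using a Transformer network is simple", toy_vocab)
print(tokens)  # ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

print(wordpiece_tokenize("a puppeteer", toy_vocab))  # ['a', '[UNK]']
```

With a realistic vocabulary of tens of thousands of pieces, most words can be covered by some combination of pieces, so the unknown token is rare.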


Then, we can call `.convert_tokens_to_ids()` on the `tokens`.

In [8]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
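
Compare this list with the `input_ids` returned earlier by calling the tokenizer directly: that one also started with 101 and ended with 102. Those are the IDs of BERT's special `[CLS]` and `[SEP]` tokens, which `convert_tokens_to_ids` does not add. The sketch below mimics both steps with a plain dictionary; `toy_vocab` is a hypothetical stand-in holding only the token-to-ID pairs visible in this notebook's outputs.

```python
# Token-to-ID pairs copied from the outputs shown in this notebook
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

# Step 1: a vocabulary lookup per token
ids = [toy_vocab[token] for token in tokens]
print(ids)  # [7993, 170, 13809, 23763, 2443, 1110, 3014]

# Step 2: wrap the sequence in the special tokens the model expects
with_special = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
print(with_special)  # [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]
```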


We can also go backwards, from the IDs to the original sentence. Mess around a bit with the original sequence: is the encoding-decoding process lossless?

In [9]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

Using a Transformer network is simple
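
As a hint for the question above, here is a simplified sketch of the last part of decoding: gluing `##` pieces back onto the preceding word. `join_wordpieces` is a hypothetical helper, not what `tokenizer.decode` literally runs, and the whitespace experiment uses `str.split()` as a stand-in for the real tokenizer.

```python
def join_wordpieces(tokens):
    """Hypothetical helper: merge '##' continuation pieces into whole words."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue onto the previous word
        else:
            words.append(token)
    return " ".join(words)


tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
print(join_wordpieces(tokens))  # Using a Transformer network is simple

# Tokens carry no record of the original spacing, so extra spaces are lost
original = "Using   a    Transformer"
round_trip = join_wordpieces(original.split())
print(round_trip)               # Using a Transformer
print(round_trip == original)   # False
```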


Lastly, the tokenizer takes a number of optional parameters. For example, you can tell it to truncate or pad the inputs, and to return the encodings as PyTorch tensors, NumPy arrays, or TensorFlow tensors. We'll be using PyTorch models, so we pass `return_tensors="pt"`.

In [10]:
tokenizer(["Hello world, what's up!"], return_tensors="pt")

{'input_ids': tensor([[ 101, 8667, 1362,  117, 1184,  112,  188, 1146,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
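
Padding and truncation matter as soon as a batch holds sequences of different lengths, because a tensor must be rectangular. The plain-Python sketch below shows the idea. `pad_and_truncate` is a hypothetical helper, not a Transformers function, and it assumes a padding ID of 0 (BERT's `[PAD]` token). Unlike the real tokenizer, its naive slicing does not preserve the final `[SEP]` when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Hypothetical helper: make every sequence exactly max_length long."""
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:max_length]            # naive truncation from the right
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# ID sequences copied from the outputs shown earlier in this notebook
batch = [
    [101, 8667, 1362, 117, 1184, 112, 188, 1146, 106, 102],
    [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102],
]

padded = pad_and_truncate(batch, max_length=10)
print(padded["input_ids"][1])       # [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

truncated = pad_and_truncate(batch, max_length=4)
print(truncated["input_ids"][0])    # [101, 8667, 1362, 117]
```

Once every row has the same length, the nested lists can be converted to a single tensor, which is what `return_tensors="pt"` relies on for batched input.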