<a href="https://colab.research.google.com/github/qmeng222/transformers-for-NLP/blob/main/Models_%26_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers # install the Hugging Face Transformers library

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
...

In [2]:
from transformers import AutoTokenizer # import `AutoTokenizer` class to automatically load the appropriate tokenizer for a specific pre-trained model

In [5]:
checkpoint = "bert-base-uncased" # specify the pre-trained model name
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # load the appropriate tokenizer for the specified pre-trained model

In [8]:
tokenizer # the tokenizer object

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [9]:
# call the tokenizer object as a function with an input text:
tokenizer("hello world")

{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

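👆The result is a dictionary-like object with three fields, not just a list of tokens: `input_ids` holds the token IDs, with the special tokens `[CLS]` (101) and `[SEP]` (102) added at the start and end; `token_type_ids` marks which segment each token belongs to (all 0 for a single sentence); and `attention_mask` marks which positions the model should attend to (all 1 here because there is no padding).

If you're preparing data for model input, call the `tokenizer` object directly: it returns every field the model expects in one step.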

# Methods for the tokenizer object:

In [13]:
# tokenize an input text using the `tokenize` method of the `tokenizer`:
tokens = tokenizer.tokenize("hello world")
tokens

['hello', 'world']

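👆The result is a list of tokens. Each element is a word or subword piece from the model's vocabulary. Unlike calling the tokenizer directly, `tokenize` does not add the special tokens `[CLS]` and `[SEP]`.

If you specifically need the list of token strings (for example, to inspect how a text is split), then `tokenizer.tokenize` is suitable.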
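BERT tokenizers split each word with the WordPiece scheme: a greedy longest-match-first search against the vocabulary, with `##` marking a piece that continues a word. The sketch below is a toy version of that idea, not the library's implementation. The function name `wordpiece_split` and the five-entry `toy_vocab` are hypothetical; the real `bert-base-uncased` vocabulary has 30522 entries and may split these words differently.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# hypothetical toy vocabulary, for illustration only
toy_vocab = {"hello", "world", "token", "##izer", "##s"}

tokens = [p for w in "hello world tokenizers".split()
          for p in wordpiece_split(w, toy_vocab)]
print(tokens)  # ['hello', 'world', 'token', '##izer', '##s']
```

Words found whole in the vocabulary, such as `hello` and `world`, come back as single tokens, which matches the output above.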

In [15]:
# take a list of tokens & convert each token into its corresponding integer identifier (id):
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[7592, 2088]

In [19]:
# take a list of integer identifiers (ids) & convert them into a list of tokens:
tokenizer.convert_ids_to_tokens(ids)

['hello', 'world']

👆`convert_ids_to_tokens` is useful when you want to work with the individual tokens rather than a full decoded string.

In [20]:
# take a list of integer identifiers (ids) & convert them into a single string:
tokenizer.decode(ids)

'hello world'

👆`decode` is useful for converting a model's output, which is typically a sequence of integer IDs, back into human-readable text.
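The three conversion methods above are lookups in the same vocabulary table, run in opposite directions. Here is a minimal sketch with a plain dictionary, using only the two ID/token pairs that appear in the outputs above. The `toy_`-prefixed helper names are hypothetical, and this is not how the library stores its vocabulary internally.

```python
# ID/token pairs copied from the notebook outputs above
token_to_id = {"hello": 7592, "world": 2088}
id_to_token = {i: t for t, i in token_to_id.items()}

def toy_convert_tokens_to_ids(tokens):
    # token string -> integer id
    return [token_to_id[t] for t in tokens]

def toy_convert_ids_to_tokens(ids):
    # integer id -> token string
    return [id_to_token[i] for i in ids]

def toy_decode(ids):
    # join tokens with spaces; a "##" continuation piece is glued to the previous token
    return " ".join(toy_convert_ids_to_tokens(ids)).replace(" ##", "")

ids = toy_convert_tokens_to_ids(["hello", "world"])
print(ids)                             # [7592, 2088]
print(toy_convert_ids_to_tokens(ids))  # ['hello', 'world']
print(toy_decode(ids))                 # hello world
```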

In [22]:
# calling the tokenizer directly tokenizes, converts to ids, and adds the special tokens [CLS] (101) and [SEP] (102):
model_inputs = tokenizer("hello world")
model_inputs

{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
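Comparing this output with the `convert_tokens_to_ids` result shows what the direct call adds for a single sentence: `[CLS]` (id 101) in front, `[SEP]` (id 102) at the end, and two parallel lists of the same length. The sketch below rebuilds the dictionary by hand from those values. The helper name `build_single_sentence_inputs` is hypothetical, and it covers only the single-sentence, no-padding case shown here.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids listed in the tokenizer repr above

def build_single_sentence_inputs(ids):
    # wrap the plain ids with [CLS] ... [SEP]
    input_ids = [CLS_ID] + list(ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # one segment only
        "attention_mask": [1] * len(input_ids),  # no padding, attend everywhere
    }

model_inputs_by_hand = build_single_sentence_inputs([7592, 2088])
print(model_inputs_by_hand)
# {'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```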

# Multiple text inputs as a list:

In [24]:
data = [
  "I like cats.",
  "Do you like cats too?",
]
tokenizer(data)

{'input_ids': [[101, 1045, 2066, 8870, 1012, 102], [101, 2079, 2017, 2066, 8870, 2205, 1029, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
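The two encoded sentences have different lengths (6 and 8 ids), so they cannot be stacked into one rectangular tensor as they are. The tokenizer can pad a batch for you when called with `padding=True`. The sketch below does the same thing by hand to show the idea, assuming right-side padding with `[PAD]` (id 0), as reported in the tokenizer repr above. The helper name `pad_batch` is hypothetical.

```python
PAD_ID = 0  # id of [PAD] in the tokenizer repr above

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    # pad every sequence on the right up to the longest one in the batch
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# input_ids copied from the batch output above
batch = [
    [101, 1045, 2066, 8870, 1012, 102],
    [101, 2079, 2017, 2066, 8870, 2205, 1029, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"][0])       # [101, 1045, 2066, 8870, 1012, 102, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 1, 1, 1, 0, 0]
```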