In [1]:
from transformers import AutoTokenizer



# More Properties of Hugging Face's Tokenizers

Hugging Face's tokenizers have many features that make it simple to try out and switch between different models. Here we'll briefly discuss some properties that can be useful.

We'll load a few different models:

* `bert-base-cased` ([doc](https://huggingface.co/docs/transformers/model_doc/bert))
* `xlm-roberta-base` ([doc](https://huggingface.co/docs/transformers/model_doc/xlm-roberta))
* `google/pegasus-xsum` ([doc](https://huggingface.co/docs/transformers/model_doc/pegasus))
* `allenai/longformer-base-4096` ([doc](https://huggingface.co/docs/transformers/model_doc/longformer))

In [2]:
model_names = (
    'bert-base-cased',
    'xlm-roberta-base',
    'google/pegasus-xsum',
    'allenai/longformer-base-4096',
)

model_tokenizers = {
    model_name: AutoTokenizer.from_pretrained(model_name)
    for model_name in model_names
}



#### `model_max_length`

Most models accept only a limited number of tokens per input, so the output of a tokenizer for a very long sequence may be longer than its model can handle. This limit isn't always relevant, but you can look it up with `.model_max_length`.

In [3]:
for model_name, temp_tokenizer in model_tokenizers.items():
    max_length = temp_tokenizer.model_max_length
    print(f'{model_name}\n\tmax length: {max_length}')
    print('\n')

bert-base-cased
	max length: 512


xlm-roberta-base
	max length: 512


google/pegasus-xsum
	max length: 512


allenai/longformer-base-4096
	max length: 4096
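
One common way to work within such a limit is to split a long sequence of token IDs into pieces that each fit. The following is only a minimal sketch in plain Python: `chunk_token_ids` and its `reserved` parameter are hypothetical names introduced for illustration, and the IDs are stand-in integers rather than real tokenizer output.

```python
from typing import List, Sequence


def chunk_token_ids(
    token_ids: Sequence[int], max_length: int, reserved: int = 0
) -> List[List[int]]:
    """Split token IDs into consecutive chunks that fit a length limit.

    `reserved` leaves room in each chunk for special tokens that
    would be added later (hypothetical parameter for this sketch).
    """
    chunk_size = max_length - reserved
    if chunk_size <= 0:
        raise ValueError("max_length must be larger than reserved")
    return [
        list(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]


# Stand-in for the IDs of a long document; real IDs would come from a tokenizer.
fake_ids = list(range(1200))
chunks = chunk_token_ids(fake_ids, max_length=512, reserved=2)
print([len(chunk) for chunk in chunks])  # [510, 510, 180]
```

With a real tokenizer, `max_length` would presumably be `tokenizer.model_max_length`, and `reserved` could be taken from `tokenizer.num_special_tokens_to_add()`. If you only need the start of the text, passing `truncation=True` when calling the tokenizer is the simpler route.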




#### Special Tokens

We've already mentioned special tokens like the "unknown" token. Different models mark their special tokens differently, and not every model defines every special token, since the set depends on the task the model was trained for.

In [4]:
for model_name, temp_tokenizer in model_tokenizers.items():
    special_tokens = temp_tokenizer.all_special_tokens
    print(f'{model_name}\n\tspecial tokens: {special_tokens}')
    print('\n')

bert-base-cased
	special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']


xlm-roberta-base
	special tokens: ['<s>', '</s>', '<unk>', '<pad>', '<mask>']


google/pegasus-xsum
	special tokens: ['</s>', '<unk>', '<pad>', '<mask_2>', '<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>', '<unk_63>', '<unk_64>', '<

You can also access a specific token through its attribute to see its representation.

In [5]:
model_tokenizers['bert-base-cased'].unk_token

'[UNK]'
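
Individual lookups like `unk_token` can also be gathered into one dictionary. The sketch below assumes the tokenizer exposes a `special_tokens_map` property that lists only the special tokens it has set; `describe_special_tokens` and `SPECIAL_TOKEN_NAMES` are hypothetical names for this example, and `fake_bert` is a stand-in object (not a real tokenizer) that mirrors the BERT tokens shown earlier.

```python
from types import SimpleNamespace
from typing import Dict, Optional

# Attribute names to report on (an illustrative selection).
SPECIAL_TOKEN_NAMES = (
    "unk_token", "bos_token", "eos_token",
    "mask_token", "sep_token", "cls_token", "pad_token",
)


def describe_special_tokens(tokenizer) -> Dict[str, Optional[str]]:
    """Map each special-token name to its string, or None if unset."""
    defined = tokenizer.special_tokens_map  # only the tokens that are set
    return {name: defined.get(name) for name in SPECIAL_TOKEN_NAMES}


# Stand-in with the same special tokens as the bert-base-cased output above.
fake_bert = SimpleNamespace(
    special_tokens_map={
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
    }
)

for name, token in describe_special_tokens(fake_bert).items():
    print(f"{name}: {token!r}")
```

Passing one of the tokenizers in `model_tokenizers` in place of `fake_bert` should work the same way, as long as that tokenizer provides `special_tokens_map`.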

In [6]:
for model_name, temp_tokenizer in model_tokenizers.items():
    print(f'{model_name}')
    print(f'\tUnknown: \n\t\t{temp_tokenizer.unk_token=}')
    print(f'\tBeginning of Sequence: \n\t\t{temp_tokenizer.bos_token=}')
    print(f'\tEnd of Sequence: \n\t\t{temp_tokenizer.eos_token=}')
    print(f'\tMask: \n\t\t{temp_tokenizer.mask_token=}')
    print(f'\tSentence Separator: \n\t\t{temp_tokenizer.sep_token=}')
    print(f'\tClass of Input: \n\t\t{temp_tokenizer.cls_token=}')
    print('\n')

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.


bert-base-cased
	Unknown: 
		temp_tokenizer.unk_token='[UNK]'
	Beginning of Sequence: 
		temp_tokenizer.bos_token=None
	End of Sequence: 
		temp_tokenizer.eos_token=None
	Mask: 
		temp_tokenizer.mask_token='[MASK]'
	Sentence Separator: 
		temp_tokenizer.sep_token='[SEP]'
	Class of Input: 
		temp_tokenizer.cls_token='[CLS]'


xlm-roberta-base
	Unknown: 
		temp_tokenizer.unk_token='<unk>'
	Beginning of Sequence: 
		temp_tokenizer.bos_token='<s>'
	End of Sequence: 
		temp_tokenizer.eos_token='</s>'
	Mask: 
		temp_tokenizer.mask_token='<mask>'
	Sentence Separator: 
		temp_tokenizer.sep_token='</s>'
	Class of Input: 
		temp_tokenizer.cls_token='<s>'


google/pegasus-xsum
	Unknown: 
		temp_tokenizer.unk_token='<unk>'
	Beginning of Sequence: 
		temp_tokenizer.bos_token=None
	End of Sequence: 
		temp_tokenizer.eos_token='</s>'
	Mask: 
		temp_tokenizer.mask_token='<mask_2>'
	Sentence Separator: 
		temp_tokenizer.sep_token=None
	Class of Input: 
		temp_tokenizer.cls_token=None


allenai/longform