In [2]:
from transformers import AutoTokenizer

# Properties of Hugging Face's Tokenizers

Hugging Face tokenizers come with a number of features that make it simple to try out and switch between different models. Here we'll briefly cover a few properties that are often useful.

We'll load the tokenizers for a few different models:
* bert-base-cased
* xlm-roberta-base
* google/pegasus-xsum
* allenai/longformer-base-4096

In [6]:
# A set, so iteration order (and the order of the printed output) is arbitrary.
model_names = {
    "bert-base-cased",
    "xlm-roberta-base",
    "google/pegasus-xsum",
    "allenai/longformer-base-4096",
}

model_tokenizers = {
    model_name: AutoTokenizer.from_pretrained(model_name)
    for model_name in model_names
}

## `model_max_length`

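Most models can only accept a limited number of tokens, so their tokenizers may not be set up to encode a very long sequence. This limit isn't always relevant, but you can read it from `.model_max_length`. A very large value, like the one printed for `allenai/longformer-base-4096` below, typically means that no limit was set in the tokenizer's configuration, not that the model accepts that many tokens.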

In [7]:
for model_name, temp_tokenizer in model_tokenizers.items():
  max_length = temp_tokenizer.model_max_length
  print(f"{model_name}\n\tmax length: {max_length}")
  print('\n')

bert-base-cased
	max length: 512


allenai/longformer-base-4096
	max length: 1000000000000000019884624838656


xlm-roberta-base
	max length: 512


google/pegasus-xsum
	max length: 512
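
The very large number printed for `allenai/longformer-base-4096` is most likely a placeholder that appears when the tokenizer's configuration does not set a limit, not a usable length. A minimal sketch of guarding against it is shown here; `effective_max_length`, `UNSET_THRESHOLD`, and the fallback of 4096 (taken from the model name) are assumptions for illustration, not part of the library.

```python
# Hypothetical helper, not part of the transformers API.
# Assumption: any value above this threshold is a placeholder, not a real limit.
UNSET_THRESHOLD = int(1e20)


def effective_max_length(model_max_length, fallback):
    """Return a usable token limit, replacing placeholder values with `fallback`."""
    if model_max_length is None or model_max_length > UNSET_THRESHOLD:
        return fallback
    return model_max_length


# Values copied from the output above.
reported = {
    "bert-base-cased": 512,
    "allenai/longformer-base-4096": 1000000000000000019884624838656,
}

# Assumption: 4096 is a sensible fallback, going by the Longformer model name.
limits = {
    name: effective_max_length(value, fallback=4096)
    for name, value in reported.items()
}
print(limits)
```

With real tokenizers, the same helper could be applied to `temp_tokenizer.model_max_length` inside the loop above.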




## Special Tokens

Different tokenizers will have different special tokens defined. They might have tokens representing:
* Unknown token
* Beginning of sequence token
* Separator token
* Token used for padding
* Classifier token
* Token used for masking values

Additionally, there may be multiple variants of a special token. For example, some tokenizers define several unknown tokens (e.g. `<unk>` and `<unk_2>`).
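
To see how many variants a tokenizer defines for each kind of special token, the tokens can be grouped by base name. This is a minimal sketch in plain Python; `group_special_tokens` is a hypothetical helper, and the sample list is the first few tokens printed for `google/pegasus-xsum` further down.

```python
import re
from collections import defaultdict


def group_special_tokens(tokens):
    """Group tokens such as '<unk>' and '<unk_2>' under a shared base name."""
    groups = defaultdict(list)
    for token in tokens:
        # Drop the surrounding brackets, then any trailing "_<number>" suffix.
        base = re.sub(r"_\d+$", "", token.strip("<>[]"))
        groups[base.lower()].append(token)
    return dict(groups)


# Sample taken from the start of the google/pegasus-xsum output.
sample = ["</s>", "<unk>", "<pad>", "<mask_2>", "<mask_1>", "<unk_2>", "<unk_3>"]
groups = group_special_tokens(sample)

for base, variants in groups.items():
    print(f"{base}: {variants}")
```

With a loaded tokenizer, `temp_tokenizer.all_special_tokens` could be passed in place of `sample`.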

In [8]:
for model_name, temp_tokenizer in model_tokenizers.items():
  special_tokens = temp_tokenizer.all_special_tokens
  print(f"{model_name}\n\tspecial tokens: {special_tokens}")
  print('\n')

bert-base-cased
	special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']


allenai/longformer-base-4096
	special tokens: ['<s>', '</s>', '<unk>', '<pad>', '<mask>']


xlm-roberta-base
	special tokens: ['<s>', '</s>', '<unk>', '<pad>', '<mask>']


google/pegasus-xsum
	special tokens: ['</s>', '<unk>', '<pad>', '<mask_2>', '<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_

You can also access a specific special token directly through its attribute to see how it is represented.

In [10]:
model_tokenizers['bert-base-cased'].unk_token

'[UNK]'
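
Not every tokenizer defines every role, and an undefined role comes back as `None`. A small sketch of collecting only the defined ones follows; `defined_special_tokens` and `ROLES` are hypothetical names, and a `SimpleNamespace` stands in for a real tokenizer so the snippet runs without downloading anything. Its values mirror what `bert-base-cased` reports in this notebook.

```python
from types import SimpleNamespace

# Role prefixes of the attributes inspected in this notebook (plus padding).
ROLES = ("unk", "bos", "eos", "mask", "sep", "cls", "pad")


def defined_special_tokens(tokenizer):
    """Map each role to its token, skipping roles that are missing or None."""
    found = {}
    for role in ROLES:
        token = getattr(tokenizer, f"{role}_token", None)
        if token is not None:
            found[role] = token
    return found


# Stand-in object (an assumption for illustration) with bert-base-cased values.
bert_like = SimpleNamespace(
    unk_token="[UNK]",
    bos_token=None,
    eos_token=None,
    mask_token="[MASK]",
    sep_token="[SEP]",
    cls_token="[CLS]",
    pad_token="[PAD]",
)

defined = defined_special_tokens(bert_like)
print(defined)
```

With a real tokenizer, `model_tokenizers['bert-base-cased']` could be passed in place of the stand-in.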

In [11]:
for model_name, temp_tokenizer in model_tokenizers.items():
  print(f"{model_name}")
  print(f'\tUnknown: \n\t\t{temp_tokenizer.unk_token=}')
  print(f'\tBeginning of Sequence: \n\t\t{temp_tokenizer.bos_token=}')
  print(f'\tEnd of Sequence: \n\t\t{temp_tokenizer.eos_token=}')
  print(f'\tMask: \n\t\t{temp_tokenizer.mask_token=}')
  print(f'\tSequence Separator: \n\t\t{temp_tokenizer.sep_token=}')
  print(f'\tClass of Input: \n\t\t{temp_tokenizer.cls_token=}')
  print('\n')

bert-base-cased
	Unknown: 
		temp_tokenizer.unk_token='[UNK]'
	Beginning of Sequence: 
		temp_tokenizer.bos_token=None
	End of Sequence: 
		temp_tokenizer.eos_token=None
	Mask: 
		temp_tokenizer.mask_token='[MASK]'
	Sequence Separator: 
		temp_tokenizer.sep_token='[SEP]'
	Class of Input: 
		temp_tokenizer.cls_token='[CLS]'


allenai/longformer-base-4096
	Unknown: 
		temp_tokenizer.unk_token='<unk>'
	Beginning of Sequence: 
		temp_tokenizer.bos_token='<s>'
	End of Sequence: 
		temp_tokenizer.eos_token='</s>'
	Mask: 
		temp_tokenizer.mask_token='<mask>'
	Sequence Separator: 
		temp_tokenizer.sep_token='</s>'
	Class of Input: 
		temp_tokenizer.cls_token='<s>'


xlm-roberta-base
	Unknown: 
		temp_tokenizer.unk_token='<unk>'
	Beginning of Sequence: 
		temp_tokenizer.bos_token='<s>'
	End of Sequence: 
		temp_tokenizer.eos_token='</s>'
	Mask: 
		temp_tokenizer.mask_token='<mask>'
	Sequence Separator: 
		temp_tokenizer.sep_token='</s>'
	Class of Input: 
		temp_tokenizer.cls_token='<s>'


googl