# E5 Tokenizer

E5 (v1) is based on BERT. This notebook explores some of its tokenizer's behavior.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small")
print(type(tokenizer))


<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>


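The E5 tokenizer is the same kind of tokenizer you get from `bert-base-uncased`: both load as `BertTokenizerFast`. The constants below name a few token IDs of interest.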

In [3]:
CLS = 101
QUERY = 23032
PASSAGE = 6019
COLON = 1024
SEP = 102
PAD = 0

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(bert_tokenizer))

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
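Matching classes do not by themselves show that the two vocabularies agree. A minimal sketch of a stricter check, assuming we compare the token-to-ID dictionaries returned by `get_vocab()`; the helper name `vocab_diff` and the toy dictionaries are illustrative and not part of the original notebook:

```python
def vocab_diff(vocab_a, vocab_b):
    """Return tokens whose IDs differ, or that exist in only one vocabulary."""
    diff = {}
    for token in vocab_a.keys() | vocab_b.keys():
        id_a = vocab_a.get(token)
        id_b = vocab_b.get(token)
        if id_a != id_b:
            diff[token] = (id_a, id_b)
    return diff


# Toy vocabularies standing in for tokenizer.get_vocab() and bert_tokenizer.get_vocab().
toy_e5 = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, ":": 1024}
toy_bert = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, ":": 1024}

print(vocab_diff(toy_e5, toy_bert))  # an empty dict means the vocabularies match
```

With the real objects loaded, `vocab_diff(tokenizer.get_vocab(), bert_tokenizer.get_vocab())` should return an empty dictionary if the two vocabularies are identical.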


What happens if we try to tokenize more than 512 tokens, say 513? Three of them get truncated: one for the token over the 512 limit, one to make room for the leading [CLS] token, and one to make room for the final [SEP] token that marks end-of-sequence.

In [4]:
from pprint import pprint

text = ":" * 513
batch_dict = tokenizer(text, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
assert len(batch_dict["input_ids"][0]) == 512

reconstituted = tokenizer.decode(batch_dict["input_ids"][0])

print("First ten tokens: ", reconstituted.split()[:10])
print("Last ten tokens: ", reconstituted.split()[-10:])
print("number of tokens from the original 513 colons:", reconstituted.count(":"))





First ten tokens:  ['[CLS]', ':', ':', ':', ':', ':', ':', ':', ':', ':']
Last ten tokens:  [':', ':', ':', ':', ':', ':', ':', ':', ':', '[SEP]']
number of tokens from the original 513 colons: 510
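The arithmetic behind that count can be sketched in plain Python. This is an illustration only, assuming `padding="max_length"` and `truncation=True` as in the cell above, and two special tokens ([CLS] and [SEP]) per single-sequence input; the function name `truncation_budget` is hypothetical:

```python
def truncation_budget(n_tokens, max_length=512, n_special=2):
    """Split max_length between content tokens, special tokens, and padding.

    n_special counts [CLS] and [SEP]. Returns (kept, dropped, pads).
    """
    capacity = max_length - n_special   # room left for content tokens
    kept = min(n_tokens, capacity)      # content tokens that survive
    dropped = n_tokens - kept           # content tokens cut by truncation
    pads = capacity - kept              # [PAD] tokens appended to reach max_length
    return kept, dropped, pads


for n in (513, 510, 509, 3):
    kept, dropped, pads = truncation_budget(n)
    print(f"{n:3} tokens -> kept {kept}, dropped {dropped}, pads {pads}")
```

Under these assumptions, 513 content tokens keep 510 and drop 3, while a three-token input leaves 507 positions for padding.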


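Unfortunately, the padded output alone cannot tell us whether a string was truncated or was exactly 510 tokens long: both cases fill all 512 positions with [CLS], 510 tokens, and [SEP]. However, if it was 509 tokens or fewer, we'll see [CLS], the tokens, [SEP], and at least one [PAD]. So the presence of [PAD] guarantees we did not truncate.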

In [5]:
text = ":" * 509
batch_dict = tokenizer(text, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
reconstituted = tokenizer.decode(batch_dict["input_ids"][0])

print("Last ten tokens: notice the pad token", reconstituted.split()[-10:])
print("number of tokens from the original 509 colons:", reconstituted.count(":"))

assert tokenizer.pad_token_id == batch_dict["input_ids"][0][-1]


Last ten tokens: notice the pad token [':', ':', ':', ':', ':', ':', ':', ':', '[SEP]', '[PAD]']
number of tokens from the original 509 colons: 509
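The padding check can be written as a small helper over a row of token IDs. This is a sketch, assuming padding is appended on the right (as in the outputs above) and using the token IDs defined earlier in this notebook; the function name `padding_rules_out_truncation` is hypothetical:

```python
CLS, COLON, SEP, PAD = 101, 1024, 102, 0  # token IDs defined earlier in this notebook


def padding_rules_out_truncation(input_ids, pad_id=PAD):
    """True if the row ends in padding, which means nothing was cut off.

    False means the row is full: the input was either truncated or fit exactly.
    """
    return len(input_ids) > 0 and input_ids[-1] == pad_id


full_row = [CLS] + [COLON] * 510 + [SEP]            # truncated input, or exactly 510 tokens
padded_row = [CLS] + [COLON] * 509 + [SEP] + [PAD]  # 509 tokens plus one pad

print(padding_rules_out_truncation(full_row))
print(padding_rules_out_truncation(padded_row))
```

A `False` result is ambiguous rather than proof of truncation. One option for resolving it is to tokenize the same text again without truncation and compare the resulting length with `max_length`.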


Let's look at about the smallest E5 input we might imagine: the `query:` prefix followed by a single token.

In [6]:
text = "query: a"
batch_dict = tokenizer(text, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
reconstituted = tokenizer.decode(batch_dict["input_ids"][0])

print("This should be [cls], query, :, a, [sep], [pad]x507")
print(reconstituted.split()[:10])



This should be [cls], query, :, a, [sep], [pad]x507
['[CLS]', 'query', ':', 'a', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']


In [7]:
for key in batch_dict.keys():
    print(key, batch_dict[key].dtype)

input_ids torch.int64
token_type_ids torch.int64
attention_mask torch.int64


Now let's build a table of every single token. 

In [8]:
# Get all tokens from the tokenizer's vocabulary
vocab = tokenizer.get_vocab()

# Sort the tokens by their token ID (values of the dictionary)
sorted_vocab = sorted(vocab.items(), key=lambda item: item[1])

# Set the number of columns for display (e.g., 5 columns)
num_columns = 5

# Print in multi-column format
for idx, (token, token_id) in enumerate(sorted_vocab):
    print(f"{token_id:5}: {token:15}", end="\t")
    # Print a newline after every 'num_columns' tokens
    if (idx + 1) % num_columns == 0:
        print()

    0: [PAD]          	    1: [unused0]      	    2: [unused1]      	    3: [unused2]      	    4: [unused3]      	
    5: [unused4]      	    6: [unused5]      	    7: [unused6]      	    8: [unused7]      	    9: [unused8]      	
   10: [unused9]      	   11: [unused10]     	   12: [unused11]     	   13: [unused12]     	   14: [unused13]     	
   15: [unused14]     	   16: [unused15]     	   17: [unused16]     	   18: [unused17]     	   19: [unused18]     	
   20: [unused19]     	   21: [unused20]     	   22: [unused21]     	   23: [unused22]     	   24: [unused23]     	
   25: [unused24]     	   26: [unused25]     	   27: [unused26]     	   28: [unused27]     	   29: [unused28]     	
   30: [unused29]     	   31: [unused30]     	   32: [unused31]     	   33: [unused32]     	   34: [unused33]     	
   35: [unused34]     	   36: [unused35]     	   37: [unused36]     	   38: [unused37]     	   39: [unused38]     	
   40: [unused39]     	   41: [unused40]     	   42: [unused41]     	   