#### There are three main categories of tokenizers:
<div>
<img src="image/tokenizer1.png" width=800/>
</div>

#### Word-based
- Each word has a specific ID
<div>
<img src="image/tokenizer2.png" width=800/>
</div>

- Very similar words have entirely different meanings, and the vocabulary can end up very large, resulting in heavy models
<div>
<img src="image/tokenizer3.png" width=200/>
</div>

- Out-of-vocabulary words result in a loss of information, because they are all mapped to the same representation (the unknown token)
<div>
<img src="image/tokenizer4.png" width=800/>
</div>
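
The two drawbacks above can be illustrated with a minimal sketch of a word-based tokenizer. The vocabulary and the `encode_words` helper are made up for illustration and do not come from any real model.

```python
# Toy word-based tokenizer: every known word gets its own ID.
# The vocabulary below is a hypothetical example, not a real model's vocabulary.
vocab = {"[UNK]": 0, "let's": 1, "do": 2, "tokenization": 3, "dog": 4, "dogs": 5}

def encode_words(text):
    """Split on whitespace and map each word to its ID, or to [UNK] if unseen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(encode_words("Let's do tokenization"))  # [1, 2, 3]
print(encode_words("dog dogs"))               # [4, 5]  similar words, unrelated IDs
print(encode_words("cats bark"))              # [0, 0]  both unknown words collapse to [UNK]
```

Covering every inflected form ("dog", "dogs", ...) needs one vocabulary entry each, which is why the vocabulary grows so quickly.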

#### Character-based
- Fewer out-of-vocabulary words, but very long sequences and less meaningful tokens
<div>
<img src="image/tokenizer5.png" width=800/>
</div>
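
As a rough sketch of this trade-off, the snippet below compares a whitespace split (standing in for a word-based tokenizer) with a character split of the same sentence. The variable names are illustrative only.

```python
text = "Let's try to tokenize!"

word_tokens = text.split()  # whitespace split stands in for a word-based tokenizer
char_tokens = list(text)    # one token per character, spaces included

# The character vocabulary is tiny, but the sequence becomes much longer.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens), len(char_tokens), len(char_vocab))  # 4 22 14
```

A single character such as `t` carries far less meaning than a word, which is the "less meaningful tokens" drawback.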

#### Subword-based
- Finds a middle ground between word-based and character-based tokenization
<div>
<img src="image/tokenizer6.png" width=800/>
</div>
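
One way to picture subword tokenization is a greedy longest-match split in the style of WordPiece. This is a simplified sketch with a hypothetical vocabulary, not the exact algorithm used by any particular library.

```python
# Hypothetical subword vocabulary; "##" marks a piece that continues a word.
vocab = {"token", "##ize", "##ization", "##s", "try", "[UNK]"}

def tokenize_word(word):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

print(tokenize_word("tokenize"))      # ['token', '##ize']
print(tokenize_word("tokenization"))  # ['token', '##ization']
print(tokenize_word("tokens"))        # ['token', '##s']
```

Frequent words stay whole, while rarer words are built from a small set of reusable pieces.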

#### The tokenization pipeline: from input text to a list of numbers:
<div>
<img src="image/tokenizer7.png" width=800/>
</div>

#### The first step of the pipeline is to split the text into tokens using the ***tokenize*** method; different models may follow different tokenization conventions

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Let's try to tokenize!")
print(tokens)

tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.tokenize("Let's try to tokenize!")
print(tokens)

['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize', '!']


#### Next, ***convert_tokens_to_ids*** maps the tokens to their vocabulary IDs; lastly, ***prepare_for_model*** adds the special tokens the model expects

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Let's try to tokenize!")
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]
{'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
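
To make the last step concrete, here is a minimal pure-Python sketch of what it does for a single BERT-style sequence. `add_special_tokens` is a hypothetical helper, not the library's implementation; the IDs 101 and 102 are the `[CLS]` and `[SEP]` IDs that appear in the output above.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the output above

def add_special_tokens(token_ids):
    """Wrap a single sequence as [CLS] ... [SEP] and build the companion fields."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # one segment only
        "attention_mask": [1] * len(input_ids),  # no padding, attend to everything
    }

ids = [2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]
print(add_special_tokens(ids))
```

For this input the sketch reproduces the dictionary printed by `prepare_for_model` above.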


#### The ***decode*** method allows us to check how the final output of the tokenizer translates back into text

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer("Let's try to tokenize!")
print(inputs)
print(tokenizer.decode(inputs["input_ids"]))

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("Let's try to tokenize!")
print(inputs)
print(tokenizer.decode(inputs["input_ids"]))

{'input_ids': [101, 2421, 112, 188, 2222, 1106, 22559, 3708, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] Let's try to tokenize! [SEP]
{'input_ids': [0, 7939, 18, 860, 7, 19233, 2072, 328, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
<s>Let's try to tokenize!</s>
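
Part of decoding is gluing subword pieces back into words. The sketch below shows only that step for WordPiece-style tokens, using the `bert-base-uncased` tokens printed earlier. `merge_wordpieces` is a hypothetical helper; the library's `decode` also maps IDs back to tokens first and may tidy spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Join '##' continuation pieces onto the previous token, then space-join."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append to the last word
        else:
            words.append(token)
    return " ".join(words)

tokens = ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
print(merge_wordpieces(tokens))  # let ' s try to tokenize !
```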


## Batching inputs together

#### Sentences we want to group inside a batch will often have different lengths

In [23]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
    "I have been waiting for a HuggingFace course my whole life",
    "I hate this"
]
tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.prepare_for_model(tokenizer.convert_tokens_to_ids(token))["input_ids"] for token in tokens]
print(f"ids 1: {ids[0]} \nids 2: {ids[1]}")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


ids 1: [101, 1045, 2031, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102] 
ids 2: [101, 1045, 5223, 2023, 102]


#### We usually pad the shorter sentences to the length of the longest one

In [25]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(f"The pad token is {tokenizer.pad_token} and the pad token ids is {tokenizer.pad_token_id}")

The pad token is [PAD] and the pad token ids is 0
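
Padding by hand can be sketched in a few lines. `pad_batch` is a hypothetical helper; it uses the pad token ID 0 reported above and the two ID lists from the previous cell, and it also builds the attention mask (1 for real tokens, 0 for padding).

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 1045, 2031, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 102],
    [101, 1045, 5223, 2023, 102],
]
padded_ids, mask = pad_batch(batch)
print(padded_ids[1])  # [101, 1045, 5223, 2023, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(mask[1])        # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```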




#### But just passing the padded batch through a transformer model will not give the right results, as the attention layers use the padding tokens in the context they look at for each token in the sentence
<div>
<img src="image/tokenizer8.png" width=800>
</div>

In [35]:
from transformers import AutoModelForSequenceClassification
import torch

ids1 = torch.tensor([ids[0]])
ids2 = torch.tensor([ids[1]])
all_ids = tokenizer(sentences, padding=True, return_tensors="pt")["input_ids"]

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
print(f"ids1: {ids1} \nids2: {ids2} \nall_ids: {all_ids}")
print(model(ids1).logits)
print(model(ids2).logits)
print(model(all_ids).logits)

ids1: tensor([[  101,  1045,  2031,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,   102]]) 
ids2: tensor([[ 101, 1045, 5223, 2023,  102]]) 
all_ids: tensor([[  101,  1045,  2031,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,   102],
        [  101,  1045,  5223,  2023,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0]])
tensor([[-1.3961,  1.4491]], grad_fn=<AddmmBackward0>)
tensor([[ 4.3705, -3.5111]], grad_fn=<AddmmBackward0>)
tensor([[-1.3961,  1.4491],
        [ 3.0511, -2.6237]], grad_fn=<AddmmBackward0>)


#### To tell the attention layers to ignore the padding tokens, we need to pass them an attention mask.
#### When called with ***padding=True***, the tokenizer directly prepares the inputs with padding and the proper attention mask.
#### The IDs of the input sentences and the corresponding attention mask can be obtained by indexing the tokenizer output with ***input_ids*** and ***attention_mask***
<div>
<img src="image/tokenizer9.png" width=800>
</div>

In [39]:
attention_mask = tokenizer(sentences, padding=True, return_tensors="pt")["attention_mask"]
output = model(all_ids, attention_mask=attention_mask)
print(output.logits)

tensor([[-1.3961,  1.4491],
        [ 4.3705, -3.5111]], grad_fn=<AddmmBackward0>)
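
Why the mask restores the unpadded result can be seen on a single row of attention scores. The sketch below is an assumption-laden toy: the scores are invented, and setting masked scores to negative infinity before the softmax is one common way to implement masking, not necessarily how this model does it internally.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(scores, mask):
    """Give masked positions a score of -inf so they receive zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    return softmax(masked)

scores = [2.0, 1.0, 0.5]       # hypothetical scores for three real tokens
padded = scores + [0.0, 0.0]   # the same row with two padding positions added
mask = [1, 1, 1, 0, 0]

unpadded_weights = softmax(scores)
leaky_weights = softmax(padded)                # padding steals attention weight
masked_weights = masked_softmax(padded, mask)  # padding gets exactly zero weight

print(unpadded_weights)
print(leaky_weights[:3])
print(masked_weights[:3])
```

With the mask, the weights on the real tokens match the unpadded case, which is consistent with the second row of logits above returning to the single-sentence values.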


## Training a new tokenizer
#### A tokenizer can be retrained on a new corpus by calling the ***train_new_from_iterator*** method of a fast tokenizer loaded with ***AutoTokenizer***
1. Gather a corpus of texts
2. Choose a tokenizer architecture
3. Train the tokenizer on the corpus
4. Save the results

#### 1. Gather a corpus of texts

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python", trust_remote_code=True)
print(raw_datasets)
def get_training_corpus(raw_datasets):
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx: start_idx+1000]
        yield samples["whole_func_string"]
training_corpus = get_training_corpus(raw_datasets)
print(type(training_corpus))



DatasetDict({
    train: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 412178
    })
    test: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 22176
    })
    validation: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 23107
    })
})
<class 'generator'>
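
The batching pattern used by `get_training_corpus` can be tried on a plain list without downloading anything. `batch_iterator` and the fake corpus below are illustrative stand-ins. The sketch also shows that a generator is exhausted after one pass, so a fresh one has to be created before iterating over the corpus again.

```python
def batch_iterator(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

corpus = [f"def f{i}(): pass" for i in range(2500)]  # stand-in for the real dataset
batches = batch_iterator(corpus, 1000)

sizes = [len(batch) for batch in batches]
print(sizes)     # [1000, 1000, 500]

leftover = list(batches)
print(leftover)  # []  the generator has already been consumed
```

Yielding batches lazily keeps only one slice in memory at a time instead of the whole corpus.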


#### 2. Choose a tokenizer architecture

In [4]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")



#### 3. Train the tokenizer on the corpus

In [6]:
from pathlib import Path
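# Train a tokenizer with the same architecture as old_tokenizer on the new corpus
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, vocab_size=52000)
# Save the tokenizer files to a folder (skipped if the folder already exists)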
if not Path("code-search-net-tokenizer").exists():
    new_tokenizer.save_pretrained("code-search-net-tokenizer")
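
GPT-2's tokenizer is based on byte-level BPE, so retraining it amounts to learning new merge rules from the corpus. The toy below sketches the core idea at character level: count adjacent symbol pairs, merge the most frequent pair, and repeat. `learn_bpe_merges` is a hypothetical helper and a heavy simplification, not what the library runs internally.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["idx", "idx", "idx", "id"], num_merges=2)
print(merges)        # [('i', 'd'), ('id', 'x')]
print(dict(corpus))  # {('idx',): 3, ('id',): 1}
```

A string that is frequent in the training corpus, such as `idx` in Python code, ends up as a single token after enough merges.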










#### Compare the old and the new tokenizer on a code sample

In [3]:
example = """
def get_training_corpus(raw_datasets):
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx: start_idx+1000]
        yield samples["whole_func_string"]
"""

print(f"old tokenizer:{old_tokenizer.tokenize(example)}\nnew tokenizer:{new_tokenizer.tokenize(example)}")

old tokenizer:['Ċ', 'def', 'Ġget', '_', 'training', '_', 'cor', 'p', 'us', '(', 'raw', '_', 'dat', 'as', 'ets', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġdataset', 'Ġ=', 'Ġraw', '_', 'dat', 'as', 'ets', '["', 'train', '"]', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġfor', 'Ġstart', '_', 'id', 'x', 'Ġin', 'Ġrange', '(', '0', ',', 'Ġlen', '(', 'dat', 'as', 'et', '),', 'Ġ1000', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġsamples', 'Ġ=', 'Ġdataset', '[', 'start', '_', 'id', 'x', ':', 'Ġstart', '_', 'id', 'x', '+', '1000', ']', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġ', 'Ġyield', 'Ġsamples', '["', 'wh', 'ole', '_', 'func', '_', 'string', '"]', 'Ċ']
new tokenizer:['Ċ', 'def', 'Ġget', '_', 'training', '_', 'corpus', '(', 'raw', '_', 'datasets', '):', 'ĊĠĠĠ', 'Ġdataset', 'Ġ=', 'Ġraw', '_', 'datasets', '["', 'train', '"]', 'ĊĠĠĠ', 'Ġfor', 'Ġstart', '_', 'idx', 'Ġin', 'Ġrange', '(', '0', ',', 'Ġlen', '(', 'dataset', '),', 'Ġ1000', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġsamples', 'Ġ=', 'Ġdataset', '[', 'start', '_', 'idx', ':', 'Ġstart', '_', 'idx',