# BERT's Tokenizer

Tokenizers break text into smaller units called tokens, typically words or subwords, and map each token to an integer ID. These IDs are what a model actually consumes, which makes tokens the basic building blocks for NLP tasks such as text classification, sentiment analysis, and machine translation.

Here, we will look at the BERT tokenizer as implemented by [HuggingFace](https://huggingface.co). Although the tokenizer is not part of the model itself, it is a crucial component of any language model, in this case BERT (Bidirectional Encoder Representations from Transformers). BERT uses [WordPiece](https://huggingface.co/docs/transformers/en/tokenizer_summary#wordpiece) tokenization, which keeps common words whole and splits rarer words into subword units drawn from a fixed vocabulary; pieces that continue a word are marked with a `##` prefix. The goal of this notebook is to see how that works in practice and why it matters in modern NLP frameworks.

 * [[huggingface.co] Summary of the tokenizers](https://huggingface.co/docs/transformers/en/tokenizer_summary)
 * [[youtube] **Let's build the GPT Tokenizer** by Andrej Karpathy](https://www.youtube.com/watch?v=zduSFxRajkE)
 * [[huggingface.co] Byte-Pair Encoding tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/5)
 * [[tiktokenizer.vercel.app] Tiktokenizer app](https://tiktokenizer.vercel.app/)
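
To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first segmentation, the lookup strategy usually described for WordPiece when tokenizing a single word. The function name `wordpiece_tokenize` and the toy vocabulary are invented for illustration and are not the library's implementation; the real tokenizer also normalizes the text and splits on whitespace and punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for this example only (not the real BERT vocabulary).
toy_vocab = {"let", "token", "##ize", "##s", "some", "##thing"}

print(wordpiece_tokenize("tokenize", toy_vocab))   # ['token', '##ize']
print(wordpiece_tokenize("tokens", toy_vocab))     # ['token', '##s']
print(wordpiece_tokenize("something", toy_vocab))  # ['some', '##thing']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Because the longest match wins, a word that is in the vocabulary as a whole is never split, and only words missing from the vocabulary are broken into pieces.
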

In [1]:
from transformers import AutoTokenizer

In [2]:
model_checkpoint = 'bert-base-uncased'

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [4]:
tokenizer.vocab_size

30522

In [5]:
tokenizer.vocab

{'abel': 16768,
 '##par': 19362,
 'indeed': 5262,
 'fbi': 8495,
 'sprang': 15627,
 'fiesta': 24050,
 '##sm': 6491,
 'dumping': 23642,
 'received': 2363,
 'ecclesiastical': 12301,
 'verses': 11086,
 '##zal': 16739,
 '##⇌': 30118,
 'ste': 26261,
 '[unused924]': 929,
 '##mies': 28397,
 '202': 16798,
 'generator': 13103,
 'hanson': 17179,
 'crossbow': 28692,
 'placing': 6885,
 'carolina': 3792,
 'senate': 4001,
 'flexed': 24244,
 'barlow': 17803,
 'ʎ': 1135,
 'founded': 2631,
 'laced': 17958,
 'gleamed': 25224,
 'robot': 8957,
 'wary': 15705,
 'dane': 14569,
 'demanded': 6303,
 '##oga': 18170,
 'arrogance': 24416,
 'surveillance': 9867,
 'amar': 23204,
 'process': 2832,
 'editing': 9260,
 '##date': 13701,
 'zion': 19999,
 '[unused208]': 213,
 '##roller': 26611,
 'unidentified': 20293,
 '↑': 1584,
 'pitcher': 8070,
 'including': 2164,
 'romans': 10900,
 'dyke': 22212,
 'violating': 20084,
 'martyrs': 18945,
 'compromised': 20419,
 'multiple': 3674,
 '1847': 9176,
 '##igen': 29206,
 'pbs': 1
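
The listing above is in arbitrary order and mixes several kinds of entries. As a rough sketch, the snippet below takes a handful of entries copied from that listing, sorts them by ID, and labels each one with a simple heuristic. The helper `entry_kind` and its category names are assumptions made for this example, not part of the `transformers` API.

```python
# A few (token, id) pairs copied from the vocabulary listing above.
sample_vocab = {
    "abel": 16768,
    "##par": 19362,
    "##sm": 6491,
    "[unused924]": 929,
    "1847": 9176,
    "↑": 1584,
}


def entry_kind(token):
    """Heuristic label for a vocabulary entry (illustrative only)."""
    if token.startswith("##"):
        return "continuation"   # piece that continues a word
    if len(token) > 2 and token.startswith("[") and token.endswith("]"):
        return "reserved"       # bracketed entries such as [unused924]
    return "word-initial"       # whole word, number, or symbol


# Sort by ID so the entries appear in vocabulary order.
by_id = sorted(sample_vocab.items(), key=lambda item: item[1])
for token, token_id in by_id:
    print(f"{token_id:>6}  {token:<12} {entry_kind(token)}")
```
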

In [6]:
encoding = tokenizer.encode("let's tokenize something?", return_tensors="pt")
encoding

tensor([[  101,  2292,  1005,  1055, 19204,  4697,  2242,  1029,   102]])

In [7]:
tokens = tokenizer.convert_ids_to_tokens(encoding.flatten())
tokens

['[CLS]', 'let', "'", 's', 'token', '##ize', 'something', '?', '[SEP]']
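
The output shows two things: `encode` wrapped the sentence in the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102), and `tokenize` was split into `token` and `##ize`. As a minimal sketch of how the `##` marker lets the pieces be glued back into words, the snippet below works on the IDs and tokens copied from the two cells above. The helper `merge_wordpieces` is hypothetical; in practice `tokenizer.decode` does this job.

```python
# IDs and tokens copied from the outputs of cells [6] and [7].
ids = [101, 2292, 1005, 1055, 19204, 4697, 2242, 1029, 102]
tokens = ['[CLS]', 'let', "'", 's', 'token', '##ize', 'something', '?', '[SEP]']


def merge_wordpieces(tokens, special=("[CLS]", "[SEP]")):
    """Drop special tokens and glue ##-prefixed pieces onto the previous piece."""
    words = []
    for token in tokens:
        if token in special:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]   # strip the ## marker and append
        else:
            words.append(token)
    return words


token_to_id = dict(zip(tokens, ids))
print(token_to_id["[CLS]"], token_to_id["[SEP]"])  # 101 102
print(merge_wordpieces(tokens))
# ['let', "'", 's', 'tokenize', 'something', '?']
```

Note that the apostrophe and the question mark remain separate tokens: punctuation is split off before WordPiece is applied, so merging `##` pieces restores words but not the original spacing.
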