## Tokenization explained 

# Each tokenizer works differently but the underlying mechanism remains the same. Here’s an example using the BERT tokenizer, 
# which is a WordPiece tokenizer:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"
# The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.

tokenized_sequence = tokenizer.tokenize(sequence)
# The tokens are either words or subwords. Here, for instance, “VRAM” wasn’t in the model vocabulary,
# so it has been split into “V”, “RA” and “M”. To indicate that those tokens are not separate words but parts of the same word,
# a double-hash prefix is added to “RA” and “M”:

print(tokenized_sequence)
# ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
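
# How a split like the one above can arise: a minimal sketch of greedy longest-match-first
# matching, the usual way WordPiece splitting is described. Everything named toy_* is
# HYPOTHETICAL and only for illustration: toy_vocab is a made-up handful of pieces, not the
# real bert-base-cased vocabulary, and the library's own implementation handles more cases
# (normalization, punctuation, length limits).

def toy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"V", "##RA", "##M", "R", "##T", "##X", "24", "##GB"}
print(toy_wordpiece("VRAM", toy_vocab))
# ['V', '##RA', '##M']
print(toy_wordpiece("24GB", toy_vocab))
# ['24', '##GB']
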
# These tokens can then be converted into IDs that the model understands. This can be done by feeding the sentence directly
# to the tokenizer. The Rust implementation of 🤗 Tokenizers backs the “fast” tokenizer classes such as BertTokenizerFast;
# whether BertTokenizer itself uses it depends on the installed version of transformers.

inputs = tokenizer(sequence)
# The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. 
# The token indices are under the key input_ids:

encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
# Note that the tokenizer automatically adds “special tokens” (if the associated model relies on them). Here they are the
# IDs 101 and 102 at the start and end of the list, which mark the beginning and end of the sequence for BERT.
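
# A minimal sketch of what adding special tokens amounts to for a single BERT-style sequence:
# the token IDs are wrapped in a start marker and an end marker. The IDs are copied from the
# printed output above; toy_add_special_tokens is a HYPOTHETICAL helper for illustration,
# not a library function.

TOY_CLS_ID = 101  # first ID in the printed output above
TOY_SEP_ID = 102  # last ID in the printed output above

def toy_add_special_tokens(token_ids):
    """Return a new list with the start and end markers around token_ids."""
    return [TOY_CLS_ID] + list(token_ids) + [TOY_SEP_ID]

toy_ids = [138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107]
print(toy_add_special_tokens(toy_ids))
# [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
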

# If we decode the previous sequence of IDs, we get the original sentence back, wrapped in the [CLS] and [SEP] special tokens:

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# [CLS] A Titan RTX has 24GB of VRAM [SEP]
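
# A minimal sketch of how subword tokens can be merged back into text, assuming the "##"
# continuation convention described earlier. toy_join_wordpieces is a HYPOTHETICAL helper;
# the library's decode also handles details this sketch ignores, such as cleaning up
# spacing around punctuation.

def toy_join_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding token, then join with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the prefix and extend the previous word
        else:
            words.append(token)
    return " ".join(words)

toy_tokens = ["[CLS]", "A", "Titan", "R", "##T", "##X", "has", "24", "##GB",
              "of", "V", "##RA", "##M", "[SEP]"]
print(toy_join_wordpieces(toy_tokens))
# [CLS] A Titan RTX has 24GB of VRAM [SEP]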

# https://huggingface.co/docs/transformers/glossary