# 🤗 Tokenization with Hugging Face AutoTokenizer

## Overview

This notebook demonstrates how text is tokenized using different
pre-trained Transformer tokenizers.

The examples show how sentences are:
- Converted into tokens
- Mapped to token IDs
- Decoded back into tokens
- Interpreted using model-specific special tokens

---

## What This Code Does

1. **Load a Pre-trained Tokenizer (BERT)**  
   Loads the `bert-base-uncased` tokenizer using `AutoTokenizer`.

2. **Encode a Sentence**  
   Calls the tokenizer directly on the input sentence. The result contains
   the token IDs (`input_ids`) together with the `attention_mask` and, for
   BERT, the `token_type_ids`.

3. **Tokenize Text**  
   Splits the sentence into individual tokens according to the model's
   tokenization rules.

4. **Convert Tokens to Token IDs**  
   Maps each token to its corresponding numerical ID from the tokenizer's
   vocabulary.

5. **Decode Token IDs**  
   Converts token IDs back into readable tokens.

6. **Inspect Special Tokens (BERT)**  
   Decodes special token IDs such as:
   - `[CLS]` → marks the start of a sequence
   - `[SEP]` → marks the end or separation of sequences

7. **Repeat the Process with a Different Model (XLNet)**  
   Loads the `xlnet-base-cased` tokenizer and applies the same steps to show
   how tokenization and special tokens differ between models.

8. **Inspect Special Tokens (XLNet)**  
   Decodes model-specific special tokens used by XLNet to represent
   sequence boundaries and structure.
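
The steps above can be sketched without downloading any model. The snippet below is only an illustration: `TOY_VOCAB` is a made-up vocabulary (not the real `bert-base-uncased` one), and `wordpiece`, `tokenize`, `encode`, and `ids_to_tokens` are hypothetical helper names. It assumes a greedy longest-match-first subword split in the style of WordPiece, which is a simplification of what the real tokenizer does.

```python
# Toy vocabulary: hypothetical, NOT the real bert-base-uncased vocabulary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "token": 4, "##izer": 5, "##s": 6, "are": 7, "fun": 8,
}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


def encode(text, vocab=TOY_VOCAB):
    """Tokenize, wrap in [CLS] ... [SEP], and map every token to its ID."""
    tokens = ["[CLS]"] + tokenize(text, vocab) + ["[SEP]"]
    return [vocab[t] for t in tokens]


def ids_to_tokens(ids):
    """Map IDs back to their token strings."""
    return [ID_TO_TOKEN[i] for i in ids]


tokens = tokenize("Tokenizers are fun")
ids = encode("Tokenizers are fun")
print(tokens)              # ['token', '##izer', '##s', 'are', 'fun']
print(ids)                 # [2, 4, 5, 6, 7, 8, 3]
print(ids_to_tokens(ids))  # ['[CLS]', 'token', '##izer', '##s', 'are', 'fun', '[SEP]']
```

The real tokenizers also handle punctuation, casing rules, and much larger vocabularies, so their output for the same sentence will differ.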

---

## Output

- Tokenized representations of the input sentence
- Numerical token IDs used by each model
- Decoded tokens and special tokens specific to each tokenizer


In [5]:
from transformers import AutoTokenizer

In [6]:
model = "bert-base-uncased"

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model)

In [8]:
sentence = "I'm so excited to learn about Transformers library!"

In [9]:
# Calling the tokenizer returns more than the IDs: input_ids, token_type_ids and attention_mask.
encoding = tokenizer(sentence)
print(encoding)

In [10]:
tokens = tokenizer.tokenize(sentence)
print(tokens)

In [11]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

In [12]:
decoded_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(decoded_tokens)

In [None]:
# 101 is the ID of [CLS], the special token the BERT tokenizer adds to mark the start of a sequence.
tokenizer.decode(101)

In [None]:
# 102 is the ID of [SEP], the special token the BERT tokenizer adds to mark the end of a sequence.
tokenizer.decode(102)

In [17]:
# Repeat the same steps with a different model: XLNet
model2 = "xlnet-base-cased"

In [18]:
tokenizer2 = AutoTokenizer.from_pretrained(model2)

In [24]:
# As with BERT, the call returns input_ids plus token_type_ids and attention_mask.
encoding = tokenizer2(sentence)
print(encoding)

In [25]:
tokens = tokenizer2.tokenize(sentence)
print(tokens)

In [26]:
token_ids = tokenizer2.convert_tokens_to_ids(tokens)
print(token_ids)

In [None]:
# 4 is the ID of <sep>, XLNet's separator token.
tokenizer2.decode(4)

### Special Tokens

Special tokens are placeholders or markers that help the model perform various tasks. They:
- help the model understand structure and context
- guide model behaviour
- ensure the output of tokenization is in a format the model comprehends

Two common examples are the classification (`cls`) and separator (`sep`) tokens.
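
To make the role of `cls` and `sep` concrete, here is a small sketch of where the two model families place them. `add_special_tokens` is a hypothetical helper written for this illustration, and the token lists are made up. The layouts follow the usual conventions: BERT puts `[CLS]` first and `[SEP]` after each sequence, while XLNet puts `<sep>` after each sequence and `<cls>` at the very end.

```python
def add_special_tokens(tokens_a, style, tokens_b=None):
    """Wrap already-tokenized text in a model family's special tokens.

    Hypothetical helper for illustration only; the real tokenizers do this
    internally when they are called on a sentence or a sentence pair.
    """
    if style == "bert":
        out = ["[CLS]"] + tokens_a + ["[SEP]"]
        if tokens_b is not None:
            out += tokens_b + ["[SEP]"]
        return out
    if style == "xlnet":
        out = tokens_a + ["<sep>"]
        if tokens_b is not None:
            out += tokens_b + ["<sep>"]
        return out + ["<cls>"]
    raise ValueError(f"unknown style: {style!r}")


pieces = ["hello", "world"]  # made-up tokens, not real tokenizer output
bert_single = add_special_tokens(pieces, "bert")
xlnet_single = add_special_tokens(pieces, "xlnet")
bert_pair = add_special_tokens(["how", "are", "you"], "bert", ["fine"])

print(bert_single)   # ['[CLS]', 'hello', 'world', '[SEP]']
print(xlnet_single)  # ['hello', 'world', '<sep>', '<cls>']
print(bert_pair)     # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', '[SEP]']
```

This is why the encoded output of the two tokenizers differs at the edges even when the sentence is the same: the special token IDs appear at the start and end for BERT, and only at the end for XLNet.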

In [None]:
# 3 is the ID of <cls>, XLNet's classification token.
tokenizer2.decode(3)