# Tokenization

Welcome to your first hands-on lab for Natural Language Processing (NLP)! 
Unlike images, which are naturally numerical arrays, text is a sequence of symbols that machines don't inherently understand. Before you can perform tasks like sentiment analysis or translation, you must first convert your text into a format a model can process.

**Tokenization** is an important first step in any NLP workflow, converting raw text into meaningful units called **tokens**.
These tokens are the building blocks that models such as BERT use to generate word embeddings: dense vector representations that capture semantic meaning.

This lab provides a practical look into this fundamental process. 
You will explore tokenization by comparing a manual, from-scratch approach with the use of a modern, pre-trained tool.

Specifically, you'll learn to:
* Build a simple tokenizer from scratch to understand the core mechanics, creating a vocabulary where each unique word maps to a numerical ID.
* Use a powerful, pre-trained BERT tokenizer from the popular Hugging Face library to see how professionals handle this task efficiently.
* Understand why matching tokenizers to models is critical and use `AutoTokenizer` as a best practice for ensuring compatibility.
* Observe how this advanced tool automatically handles challenges like out-of-vocabulary (OOV) words by breaking them into **subword tokens**.

## Imports

In [None]:
import torch
from transformers import BertTokenizerFast, AutoTokenizer

import helper_utils

## Manual Tokenization: Building a Vocabulary

* Define a list of sample `sentences`.
* Implement a `tokenize` function that converts the input `text` to lowercase and splits it into individual words (tokens) on whitespace.
* Implement a `build_vocab` function that takes a list of `sentences`, tokenizes them, and creates a `vocab` (vocabulary).
    * Each unique word encountered is added to the vocabulary and assigned a unique numerical ID.

In [None]:
sentences = [
    'I love my dog',
    'I love my cat'
]

# Tokenization function
def tokenize(text):
    # Lowercase the text and split by whitespace
    return text.lower().split()

# Build the vocabulary
def build_vocab(sentences):
    vocab = {}
    # Iterate through each sentence.
    for sentence in sentences:
        # Tokenize the current sentence
        tokens = tokenize(sentence)
        # Iterate through each token in the sentence
        for token in tokens:
            # If the token is not already in the vocabulary
            if token not in vocab:
                # Add the token to the vocabulary and assign it a unique integer ID
                # IDs start from 1; 0 can be reserved for padding.
                vocab[token] = len(vocab) + 1
    return vocab

# Create the vocabulary index
vocab = build_vocab(sentences)

print("Vocabulary Index:", vocab)
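
Once the vocabulary exists, each sentence can be turned into a list of IDs. The sketch below is one way to do that, not part of the lab's own code. It hard-codes the vocabulary that `build_vocab` produces for the two sample sentences and adds two assumptions: an extra `UNK_ID` for words that are not in the vocabulary, and right-padding with `0`, the ID that `build_vocab` leaves free. The helper names `encode` and `pad_batch` are hypothetical.

```python
# Vocabulary produced by build_vocab for the two sample sentences.
vocab = {"i": 1, "love": 2, "my": 3, "dog": 4, "cat": 5}

PAD_ID = 0               # reserved for padding, as noted in build_vocab
UNK_ID = len(vocab) + 1  # assumption: one extra ID for unseen words


def encode(text, vocab, unk_id=UNK_ID):
    """Lowercase, split on whitespace, and map each word to its ID."""
    return [vocab.get(token, unk_id) for token in text.lower().split()]


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every ID list to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]


encoded = [encode(s, vocab) for s in ["I love my dog", "I love my red cat"]]
padded = pad_batch(encoded)
print(padded)  # [[1, 2, 3, 4, 0], [1, 2, 3, 6, 5]]
```

The word `red` was never seen while building the vocabulary, so it collapses to the single unknown ID `6` and its meaning is lost. This limitation is what subword tokenization, covered later in the lab, is designed to reduce.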

## Using a Pre-trained BERT Tokenizer

* Initialize the `BertTokenizerFast` by loading the tokenizer of the pre-trained [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) checkpoint directly from Hugging Face.

**Note**: In this notebook environment, the tokenizer files have been saved and are loaded locally:

```python
local_tokenizer_path = "./bert_tokenizer_local"
tokenizer = BertTokenizerFast.from_pretrained(local_tokenizer_path)
```

If you were to run this notebook elsewhere, you would initialize it as:
     
```python
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
```

* Use the initialized `tokenizer` to process the `sentences`, creating `encoded_inputs`.
    * `padding=True` ensures all output sequences have the same length.
    * `truncation=True` cuts sequences that are longer than the model's maximum input length.
    * `return_tensors='pt'` specifies that the output should be PyTorch tensors.
* Convert the `input_ids` (numerical representations) from `encoded_inputs` back into their string token representations for easier inspection.
    * These may include special tokens like `[CLS]` and `[SEP]`.
* Retrieve the entire vocabulary (word-to-ID mapping) used by the BERT tokenizer using `tokenizer.get_vocab()`.
* Print the `input_ids` (token IDs) generated by the tokenizer for the sentences.
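
To make the effect of `truncation` and `padding` concrete, here is a plain-Python sketch of what happens to a batch of ID lists. It is an illustration under stated assumptions, not the library's implementation: the function name `prepare_batch` and the word IDs are hypothetical, and the explicit `max_length` stands in for the model's maximum input length. The special-token IDs `101`, `102`, and `0` are the ones `bert-base-uncased` uses for `[CLS]`, `[SEP]`, and `[PAD]`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special-token IDs


def prepare_batch(batch_ids, max_length):
    """Mimic truncation, special tokens, and padding on lists of word IDs."""
    prepared = []
    for ids in batch_ids:
        # Truncation: keep room for [CLS] and [SEP] within max_length.
        ids = ids[: max_length - 2]
        prepared.append([CLS_ID] + ids + [SEP_ID])

    # Padding: extend every sequence to the longest one in the batch.
    longest = max(len(ids) for ids in prepared)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in prepared]
    # Attention mask: 1 for real tokens, 0 for padding positions.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in prepared
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Placeholder word IDs (hypothetical, not real BERT vocabulary entries).
batch = [[11, 12, 13, 14, 15], [11, 12, 13]]
out = prepare_batch(batch, max_length=6)
print(out["input_ids"])       # [[101, 11, 12, 13, 14, 102], [101, 11, 12, 13, 102, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0]]
```

The first sequence loses its last word to truncation, and the second gains one padding ID so that both rows have the same length and can be stacked into a single tensor.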

In [None]:
sentences = [
    'I love my dog',
    'I love my cat'
]

# Define the local directory where the tokenizer is saved
local_tokenizer_path = "./bert_tokenizer_local"

# Initialize the tokenizer from the local directory
tokenizer = BertTokenizerFast.from_pretrained(local_tokenizer_path)

# Tokenize the sentences and encode them
encoded_inputs = tokenizer(sentences, padding=True, 
                           truncation=True, return_tensors='pt')

# To see the tokens for each input (helpful for understanding the output)
tokens = [tokenizer.convert_ids_to_tokens(ids)
          for ids in encoded_inputs["input_ids"]]

# Get the model's vocabulary (mapping from tokens to IDs)
word_index = tokenizer.get_vocab()

# Print the human-readable `tokens` for each sentence
print("Tokens:", tokens)

print("\nToken IDs:", encoded_inputs['input_ids'])

# Print unique tokens from your sentences mapped to their unique IDs 
helper_utils.print_unique_token_id_mappings(tokens, encoded_inputs['input_ids'])

**Remark on Model Compatibility**
It is worth emphasizing that in NLP, tokenizers are not one-size-fits-all tools. Each tokenizer is specifically designed to work with a particular model.
The `bert-base-uncased` tokenizer, for example, is designed to format text in the exact way the BERT model was trained to understand it. 
This includes its specific vocabulary, rules for splitting words, and the use of special tokens like `[CLS]` and `[SEP]`.

*Using the tokenizer that matches your model ensures the input format is exactly what the model expects.* 
Mismatching a model and tokenizer can lead to poor performance or errors.
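
A toy example shows why a mismatch matters. The two vocabularies below are hypothetical and far smaller than any real one, but they illustrate the core problem: the same IDs mean different things under different vocabularies, so a model fed IDs from the wrong tokenizer effectively reads a different sentence.

```python
# Two hypothetical vocabularies that assign different IDs to the same words.
vocab_a = {"[UNK]": 0, "i": 1, "love": 2, "my": 3, "dog": 4}
vocab_b = {"[UNK]": 0, "dog": 1, "my": 2, "i": 3, "hate": 4}


def encode(text, vocab):
    """Map each lowercase, whitespace-separated word to its ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to tokens using the given vocabulary."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token.get(idx, "[UNK]") for idx in ids]


ids = encode("I love my dog", vocab_a)
print(ids)                   # [1, 2, 3, 4]
print(decode(ids, vocab_a))  # ['i', 'love', 'my', 'dog']
print(decode(ids, vocab_b))  # ['dog', 'my', 'i', 'hate']
```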

## Using `AutoTokenizer`

While using a specific class like `BertTokenizerFast` works perfectly, the Hugging Face `transformers` library offers a convenient and robust solution: `AutoTokenizer`.

The `AutoTokenizer` class is a smart wrapper that automatically detects and loads the correct tokenizer class for any given model checkpoint. 
Instead of you needing to remember whether a model requires `BertTokenizerFast`, `GPT2Tokenizer`, or another specific class, `AutoTokenizer.from_pretrained()` handles it for you.

This simplifies your code and, more importantly, prevents potential mismatches between your model and its tokenizer. 

* Initialize the `AutoTokenizer` by loading the same pre-trained `bert-base-uncased` tokenizer.
* Use the `AutoTokenizer` to process the `sentences`, creating `encoded_inputs`.
    * The same parameters (`padding`, `truncation`, `return_tensors`) are used to ensure consistent output formatting.
* Convert the `input_ids` from `encoded_inputs` back into their string token representations for inspection.
* Print the `input_ids` generated by the `AutoTokenizer` for the sentences.

In [None]:
# Define the local directory where the tokenizer is saved
local_tokenizer_path = "./bert_tokenizer_local"

# Initialize the tokenizer using the AutoTokenizer class
# This automatically loads the correct tokenizer (BertTokenizerFast in this case)
tokenizer = AutoTokenizer.from_pretrained(local_tokenizer_path)

In [None]:
sentences = [
    'I love my dog',
    'I love my cat'
]

# Tokenize the sentences and encode them
encoded_inputs = tokenizer(sentences, padding=True, 
                           truncation=True, return_tensors='pt')

# To see the tokens for each input (helpful for understanding the output)
tokens = [tokenizer.convert_ids_to_tokens(ids)
          for ids in encoded_inputs["input_ids"]]

# Get the model's vocabulary (mapping from tokens to IDs)
word_index = tokenizer.get_vocab() 

# Print the human-readable `tokens` for each sentence
print("Tokens:", tokens)

print("\nToken IDs:", encoded_inputs['input_ids'])

# Print unique tokens from your sentences mapped to their unique IDs 
helper_utils.print_unique_token_id_mappings(tokens, encoded_inputs['input_ids'])

## (Optional) Try It With Your Own Sentences

You've seen how the pre-trained BERT tokenizer processed the example sentences. Now, it's your turn to experiment! Use the code cell below to input your own sentences and observe how they are tokenized.

Test sentences of different lengths. For example:
```python
sentences = [
    'I love my red dog',
    'I love my cat'
]
```

In [None]:
### Add your sentence(s) here
sentences = [
    "",
    # "",
    # "",
]

The next code cell is all set to take these sentences and process them using the tokenizer. It will then print out how your sentences have been converted into 'Tokens' and their corresponding 'Token IDs'.

**Before you run it, here's something to look out for:** If you've included any words that are particularly unique or specific (like names of local people, specific places, or less common nouns), pay close attention to how these words appear in the 'Tokens' list after you run the cell. You might notice they are handled in a distinct way.

An explanation for this behavior, especially concerning these kinds of words, is provided in the section below the output, under the heading **"Out-of-Vocabulary" (OOV) Words**.

In [None]:
# Tokenize the sentences and encode them
encoded_inputs = tokenizer(sentences, padding=True, 
                           truncation=True, return_tensors='pt')

# To see the tokens for each input (helpful for understanding the output)
tokens = [tokenizer.convert_ids_to_tokens(ids)
          for ids in encoded_inputs["input_ids"]]

# Get the model's vocabulary (mapping from tokens to IDs)
word_index = tokenizer.get_vocab()

# Print the human-readable `tokens` for each sentence
print("Tokens:", tokens)

print("\nToken IDs:", encoded_inputs['input_ids'])

# Print unique tokens from your sentences mapped to their unique IDs 
helper_utils.print_unique_token_id_mappings(tokens, encoded_inputs['input_ids'])

### "Out-of-Vocabulary" (OOV) Words

You might have seen some words in your sentences (especially unique names or local terms) break into smaller pieces when tokenized. This is expected.

* **What are OOV words?** Words not in the tokenizer's built-in dictionary (e.g., many proper names).
* **How are they handled?** The tokenizer splits OOV words into smaller, known sub-word parts.
* **What does "##" mean?** A sub-word starting with `##` (like `##bs`) attaches to the previous piece to form the original word. It's not a new word itself.

**Example**:
If a name like `"Mubsi"` is OOV, it might become `['mu', '##bs', '##i']`. This means "mu" + "bs" + "i" are combined to represent "Mubsi".

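**Why does this happen?** This "subword tokenization" lets the tokenizer represent almost any word, even a rare or new one, from pieces it already knows. Only text containing characters that are absent from the vocabulary falls back to the special `[UNK]` token.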
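
The splitting rule behind this behavior can be sketched in a few lines. BERT's tokenizer uses WordPiece, which scans a word left to right and repeatedly takes the longest piece found in the vocabulary. The function below is a simplified sketch of that greedy matching for a single lowercase word. The name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also normalizes text and splits punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no known piece fits at this position
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, much smaller than the real BERT vocabulary.
toy_vocab = {"mu", "##bs", "##i", "token", "##ization", "##s"}
print(wordpiece_tokenize("mubsi", toy_vocab))         # ['mu', '##bs', '##i']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

The last call shows the fallback case: when no piece in the vocabulary matches, the whole word becomes `[UNK]`. With a vocabulary that contains individual characters as pieces, as real subword vocabularies typically do, this fallback is rare.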

To see this in action, use the `tokenizer` on the `oov_words` and check the output tokens.

In [None]:
# A list of words that are likely "Out-of-Vocabulary" (OOV)
oov_words = ["Tokenization", "HuggingFace", "unintelligible"]

print("--- Subword Tokenization Example ---")

# Iterate through the words and show how they are tokenized
for word in oov_words:
    # The .tokenize() method is a direct way to see the subword breakdown
    subwords = tokenizer.tokenize(word)
    
    # Print the results
    print(f"Original word: '{word}'")
    print(f"Subword tokens: {subwords}\n")

## Conclusion

Congratulations on completing the lab! You have successfully transformed raw text into structured, numerical tensors that a deep learning model can understand.

You started by building a vocabulary manually, tokenizing sentences, and assigning a unique ID to each word. This foundational exercise highlights the core challenge: every unique word needs a numerical representation, and your vocabulary can quickly become massive and difficult to manage.

Then, you saw the modern approach: using a pre-trained tokenizer. With just a few lines of code, it handles the entire preprocessing pipeline, from splitting words and adding special tokens like `[CLS]` and `[SEP]` to padding and truncation. You also saw how **subword tokenization** addresses the out-of-vocabulary problem by building rare or new words from known pieces, so very few words end up truly "unknown" to the model.

The key takeaway is that a tokenizer and its model are tightly coupled. The BERT tokenizer formats text exactly as the BERT model saw it during training, which the model needs in order to perform well. `AutoTokenizer` simplifies this: given a checkpoint name or path, it loads the tokenizer class that the checkpoint was saved with, so loading the tokenizer and the model from the same checkpoint keeps them matched.

Now that you can convert text into model-ready tensors, you are prepared to move on to the next stage: using these tensors to build and train NLP models.